In this chapter, we review several basic features of the multivariate normal distribution. Section 3.2 considers general properties of the multivariate normal density and Section 3.3 considers the sampling distribution of the maximum likelihood estimators (MLEs) of the mean vector and variance–covariance matrix based on iid (independent, identically distributed) random sampling of a multivariate normal distribution. Section 3.4 reviews the standard conjugate distributions for Bayesian inference with the multivariate normal distribution and Section 3.5 considers various generalizations and robustifications of the normal model. The properties of the multivariate normal distribution are well known and available in many places; our primary sources are the texts by Johnson and Wichern (1998) and Morrison (2005). A classic and comprehensive treatment is given by Anderson’s (2003) text.
In item response theory (IRT), the multivariate normal and its generalizations are most often used as the underlying-variables distribution for a data-augmentation version of the normal-ogive model (Bartholomew and Knott, 1999; Fox, 2010), as a population distribution for the proficiency parameters θ_{p}, for persons p = 1, ..., P, and as a prior distribution for other model parameters (e.g., difficulty parameters b_{i}, log-discrimination parameters log a_{i}, etc.) whose domain is the entire real line or Euclidean space. Especially in the latter two cases, the multivariate normal distribution serves to link standard IRT modeling with hierarchical linear models (HLMs) and other structures that incorporate various group structures and other dependence on covariates, to better model item responses in terms of the contexts in which they are situated (e.g., Fox, 2003, 2005a,b, 2010).
Because of the wide variety of applications of the normal distribution in IRT, in this chapter we depart somewhat from the notation used in the rest of the book. We use X = (X_{1}, X_{2}, ..., X_{K})^{T} to represent a generic K-dimensional random vector and X to represent a generic univariate random variable; their observed values will be denoted x and x, respectively. Also, because many different densities will be discussed, we use the generic notation f() to denote a density. The particular role or nature of f() will be clear from context.
The univariate normal density with mean μ and variance σ^{2} for a random variable X is as follows:

f(x) = (1/√(2πσ^{2})) exp{−(x − μ)^{2}/(2σ^{2})}, −∞ < x < ∞
As is well known, E[X] = μ and Var(X) = E[(X − μ)^{2}] = σ^{2}. We often write X ∼ N(μ, σ^{2}) to convey that the random variable X has this density. The quantity in the exponent of the normal density thus measures the square of the distance between realizations of the random variable, X, and the mean μ, scaled in units of the standard deviation σ.
The multivariate normal density, for X ∈ ℜ^{K}, is as follows:

f(x) = (2π)^{−K/2} |Σ|^{−1/2} exp{−(1/2)(x − μ)^{T} Σ^{−1}(x − μ)}
where again E[X] = μ = (μ_{1}, ..., μ_{K})^{T} is the mean vector and Var(X) = E[(X − μ)(X − μ)^{T}] = Σ is the K × K symmetric nonnegative-definite variance–covariance matrix. We often write X ∼ N_{K}(μ, Σ), or just X ∼ N(μ, Σ) if the dimension K is clear from context, to indicate that X follows the multivariate normal density. The quantity (x − μ)^{T} Σ^{−1}(x − μ) in the exponent is the squared distance from x to μ, again scaled by the variance–covariance matrix Σ. In other contexts, this is called the Mahalanobis distance between x and μ (Morrison, 2005).
The following theorem gives some properties of multivariate normal random variables.
If X ∼ N_{K}(μ, Σ), then (a) every linear combination a^{T}X ∼ N(a^{T}μ, a^{T}Σa) for fixed a ∈ ℜ^{K}; (b) every subvector of X is again multivariate normal, with mean vector and variance–covariance matrix given by the corresponding subvector of μ and submatrix of Σ; (c) subvectors of X with zero cross-covariance are independent; and (d) the conditional distribution of one subvector of X given another is again multivariate normal.
A useful geometric property of the multivariate normal distribution is that it is log quadratic: log f(x) is a simple linear function of the symmetric nonnegative definite quadratic form (x − μ)^{T} Σ^{−1}(x − μ). This means that the level sets {x : f(x) = k}, or equivalently (after taking logs and omitting irrelevant additive constants) the level sets {x : (x − μ)^{T} Σ^{−1}(x − μ) = c^{2}}, will be ellipsoids. Figure 3.1 depicts bivariate normal density plots and the same densities using contour plots for two sets of variables with equal variance; variables in subplot (a) are uncorrelated and variables in subplot (b) are highly correlated. The contours are exactly the level sets {x : (x − μ)^{T} Σ^{−1}(x − μ) = c^{2}} for various values of c^{2}.
Finding the principal axes of the ellipsoids is straightforward with Lagrange multipliers. For example, to find the first principal axis, we want to find the point x at maximum distance from μ (maximizing the squared distance (x − μ)^{T}(x − μ)) that is still on the contour (satisfying the constraint (x − μ)^{T} Σ^{−1}(x − μ) = c^{2}). Differentiating the Lagrange-multiplier objective function

h(x, λ) = (x − μ)^{T}(x − μ) − λ[(x − μ)^{T} Σ^{−1}(x − μ) − c^{2}]

with respect to x and setting these derivatives equal to zero leads to the eigenvalue/eigenvector problem

Σ(x − μ) = λ(x − μ)
It is now a matter of calculation to observe that the eigenvalues can be ordered so that

λ_{1} ≥ λ_{2} ≥ ... ≥ λ_{K} ≥ 0

with corresponding mutually orthogonal eigenvectors y_{k} = (x_{k} − μ) lying along the principal axes of the ellipsoid, with half-lengths c√λ_{k}. Let A be the K × K matrix with columns a_{k} = y_{k}/∥y_{k}∥. Then A^{T}A = I, the identity matrix, and A^{T}ΣA = diag(λ_{1}, ..., λ_{K}), the K × K diagonal matrix with diagonal elements λ_{1}, ..., λ_{K}. Now consider the random vector W = A^{T}(X − μ). It is easy to verify that W ∼ N(0, diag(λ_{1}, ..., λ_{K})). The components W_{k} of W are the (population) principal components of X.
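This construction can be checked numerically. The sketch below (Python with NumPy; the particular μ, Σ, and sample size are arbitrary illustrative choices) verifies that A is orthogonal, that A^{T}ΣA is diagonal, and that the principal components W = A^{T}(X − μ) are uncorrelated with variances λ_{k}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mean vector and covariance matrix (arbitrary choices)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])

# Eigendecomposition: columns of A are unit eigenvectors a_k lying along
# the principal axes of the ellipsoidal density contours
lam, A = np.linalg.eigh(Sigma)          # eigh: for symmetric matrices
order = np.argsort(lam)[::-1]           # sort so lambda_1 >= lambda_2 >= ...
lam, A = lam[order], A[:, order]

# A is orthogonal (A^T A = I) and diagonalizes Sigma (A^T Sigma A = diag(lam))
assert np.allclose(A.T @ A, np.eye(2))
assert np.allclose(A.T @ Sigma @ A, np.diag(lam))

# Principal components: W = A^T (X - mu) has covariance approximately diag(lam)
X = rng.multivariate_normal(mu, Sigma, size=200_000)
W = (X - mu) @ A
print(np.cov(W.T).round(2))
```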
Figure 3.1 Bivariate density plots. Plots in (a) are the bivariate density and contour plots for two uncorrelated variables (σ_{12} = 0) and plots in (b) are the bivariate density and contour plots for two perfectly correlated variables (σ_{12} = 1).
Let a set of N K × 1 vectors X_{1}, X_{2}, ..., X_{N} represent an iid random sample from a multivariate normal population with mean vector μ and covariance matrix Σ. Since densities of independent random variables multiply, it is easy to observe that the joint density will be as follows:

f(x_{1}, ..., x_{N}) = (2π)^{−NK/2} |Σ|^{−N/2} exp{−(1/2) Σ_{n=1}^{N} (x_{n} − μ)^{T} Σ^{−1}(x_{n} − μ)}
The log of this density can be written (apart from some additive constants) as follows:

L(μ, Σ) = (N/2) log |Σ^{−1}| − (1/2) Σ_{n=1}^{N} (x_{n} − x̄)^{T} Σ^{−1}(x_{n} − x̄) − (N/2)(x̄ − μ)^{T} Σ^{−1}(x̄ − μ)
The value of μ that maximizes L(μ, Σ) for any Σ is μ̂ = x̄, since that choice makes the third term in this loglikelihood equal to zero; hence μ̂ = x̄ is the MLE. Furthermore, the first two terms of L(μ, Σ) may be rewritten as (N/2) log |Σ^{−1}| − (1/2) tr(AΣ^{−1}), where A = Σ_{n=1}^{N} (x_{n} − x̄)(x_{n} − x̄)^{T}, and from this and a little calculus, the MLE for Σ may be deduced. Summarizing, if X_{1}, ..., X_{N} ∼ N_{K}(μ, Σ) iid, then the MLE for μ is as follows:

μ̂ = X̄ = (1/N) Σ_{n=1}^{N} X_{n}

The MLE for Σ is

Σ̂ = (1/N) Σ_{n=1}^{N} (X_{n} − X̄)(X_{n} − X̄)^{T}
Note that X̄ and Σ̂ are sufficient statistics; they contain all of the information about μ and Σ in the data matrix X. Furthermore, note that Σ̂ is a biased estimator of Σ; the unbiased estimator S = (1/(N − 1)) Σ_{n=1}^{N} (x_{n} − x̄)(x_{n} − x̄)^{T} is often used instead.

The sampling distributions of X̄ and S are easily generalized from the univariate case. In the univariate case, X̄ and S are independent, X̄ ∼ N(μ, σ^{2}/N), and (N − 1) · S/σ^{2} ∼ χ^{2}_{N−1}, a χ-squared distribution with N − 1 degrees of freedom. Recall that Σ_{n=1}^{N} (X_{n} − μ)^{2}/σ^{2} = Z_{1}^{2} + Z_{2}^{2} + ⋯ + Z_{N}^{2} ∼ χ^{2}_{N} by definition, because it is a sum of squares of independent standard normals; intuitively, we lose one degree of freedom for (N − 1) · S/σ^{2} = Σ_{n=1}^{N} (X_{n} − X̄)^{2}/σ^{2} because we are estimating the mean μ with X̄ in calculating S.

In the multivariate case, where we have the random sample X_{1}, X_{2}, ..., X_{N}, the following three theorems apply:
If X_{1}, ..., X_{N} ∼ N_{K}(μ, Σ) are iid, then (a) X̄ ∼ N_{K}(μ, (1/N)Σ); (b) (N − 1) · S = Σ_{n=1}^{N} (X_{n} − X̄)(X_{n} − X̄)^{T} ∼ W_{N−1}(Σ), a Wishart distribution with N − 1 degrees of freedom; and (c) X̄ and S are independent.
By definition, Σ_{n=1}^{N} (X_{n} − μ)(X_{n} − μ)^{T} = Z_{1}Z_{1}^{T} + Z_{2}Z_{2}^{T} + ⋯ + Z_{N}Z_{N}^{T} ∼ W_{N}(Σ), since it is the sum of outer products of N independent N(0, Σ) random vectors. Once again, we lose one degree of freedom for (N − 1) · S = Σ_{n=1}^{N} (x_{n} − x̄)(x_{n} − x̄)^{T}, since we are estimating the mean μ with X̄. The Wishart density for a positive definite random K × K matrix A with D > K degrees of freedom and parameter Σ is as follows:

f(A) = |A|^{(D−K−1)/2} exp{−(1/2) tr(Σ^{−1}A)} / (2^{DK/2} |Σ|^{D/2} Γ_{K}(D/2))

where Γ_{K}() is the multivariate gamma function.

Recall that if the likelihood for data X is (any function proportional to) the conditional density of X given the (possibly multidimensional) parameter η, f(x|η), then a conjugate family of prior distributions for f(x|η) is a parametric family of densities f(η; τ) for η, with (possibly multidimensional) hyperparameter τ, such that for any member of the conjugate family, the posterior distribution f(η|x; τ) = f(x|η) f(η; τ)/∫ f(x|h) f(h; τ) dh can be rewritten as f(η; τ^{∗}), a member of the same parametric family as f(η; τ), where τ^{∗} = τ^{∗}(x, τ) is some function of x and τ. Since only the form of f(η; τ^{∗}) as a function of η matters, in verifying that f(η; τ^{∗}) and f(η; τ) belong to the same parametric family, it is usual to ignore multiplicative constants that do not depend on η.
For example, if X_{1}, ..., X_{N} are iid N(μ, σ^{2}), then the likelihood is proportional to

exp{−(N/2σ^{2})(x̄ − μ)^{2}}

as a function of μ, as would be expected, since x̄ is sufficient for μ. If we assume σ^{2} is known and place a normal prior f(μ; μ_{0}, τ_{0}^{2}) = (1/(√(2π)τ_{0})) exp{−(1/2τ_{0}^{2})(μ − μ_{0})^{2}} on μ, then the posterior density for μ will be

f(μ|x_{1}, ..., x_{N}) = (1/(√(2π)τ_{N})) exp{−(1/2τ_{N}^{2})(μ − μ_{N})^{2}}

after completing the square, collecting terms, and identifying the normalizing constant, where

μ_{N} = τ_{N}^{2}(μ_{0}/τ_{0}^{2} + N x̄/σ^{2}) and 1/τ_{N}^{2} = 1/τ_{0}^{2} + N/σ^{2}
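As a numerical sketch of this known-variance update (NumPy; the data, σ^{2}, and prior hyperparameter values below are arbitrary illustrative choices):

```python
import numpy as np

def posterior_normal_known_var(x, sigma2, mu0, tau0_2):
    """Posterior N(mu_N, tau_N^2) for mu, given iid N(mu, sigma2) data
    with sigma2 known and a N(mu0, tau0_2) prior on mu."""
    N, xbar = len(x), np.mean(x)
    tau_N2 = 1.0 / (1.0 / tau0_2 + N / sigma2)        # posterior variance
    mu_N = tau_N2 * (mu0 / tau0_2 + N * xbar / sigma2)
    return mu_N, tau_N2

rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.5, size=50)                     # sigma^2 = 2.25 treated as known

mu_N, tau_N2 = posterior_normal_known_var(x, 2.25, mu0=0.0, tau0_2=4.0)

# Equivalent form via the reliability coefficient rho_N = tau0^2/(tau0^2 + sigma^2/N)
rho_N = 4.0 / (4.0 + 2.25 / 50)
assert np.isclose(mu_N, rho_N * x.mean() + (1 - rho_N) * 0.0)
print(mu_N, tau_N2)
```

The assertion at the end checks that the precision-weighted form of μ_{N} agrees with the reliability-weighted compromise between x̄ and μ_{0}.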
In this case, the posterior mean is μ_{N} = ρ_{N} x̄ + (1 − ρ_{N})μ_{0}, where ρ_{N} = τ_{0}^{2}/(τ_{0}^{2} + σ^{2}/N) is the classical reliability coefficient. Thus, if the prior distribution is μ ∼ N(μ_{0}, τ_{0}^{2}), then the posterior distribution will be μ|x_{1}, ..., x_{N} ∼ N(μ_{N}, τ_{N}^{2}); this shows that the normal distribution is the conjugate prior for a normal mean μ when the variance σ^{2} is known.

Further calculation (as shown in Gelman et al., 2004) shows that when both the mean μ and the variance σ^{2} are unknown, and the data X_{1}, ..., X_{N} are an iid sample from N(μ, σ^{2}), the joint distribution
μ|σ^{2} ∼ N(μ_{0}, σ^{2}/κ_{0}), σ^{2} ∼ Inv-χ^{2}(ν_{0}, σ_{0}^{2})

is the conjugate prior, where the notation “σ^{2} ∼ Inv-χ^{2}(ν_{0}, σ_{0}^{2})” means that ν_{0}σ_{0}^{2}/σ^{2} ∼ χ^{2}_{ν_{0}}. Here, κ_{0} and ν_{0} are hyperparameters that function as “prior sample sizes”: the larger κ_{0} and ν_{0}, the greater the influence of the prior on the posterior. In this case, the joint posterior distribution for μ, σ^{2} is

μ|σ^{2}, x_{1}, ..., x_{N} ∼ N(μ_{N}, σ^{2}/κ_{N}), σ^{2}|x_{1}, ..., x_{N} ∼ Inv-χ^{2}(ν_{N}, σ_{N}^{2})

where

μ_{N} = (κ_{0}μ_{0} + N x̄)/(κ_{0} + N), κ_{N} = κ_{0} + N, ν_{N} = ν_{0} + N,
ν_{N}σ_{N}^{2} = ν_{0}σ_{0}^{2} + (N − 1)S^{2} + (κ_{0}N/(κ_{0} + N))(x̄ − μ_{0})^{2}
with S^{2} = (1/(N − 1)) Σ_{n=1}^{N} (x_{n} − x̄)^{2}.
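Following Gelman et al. (2004), the full conjugate update reduces to four hyperparameter recursions; the sketch below (NumPy, with arbitrary illustrative prior values) is one way they might be implemented.

```python
import numpy as np

def posterior_normal_inv_chi2(x, mu0, kappa0, nu0, sigma0_2):
    """Conjugate update for iid N(mu, sigma^2) data under the
    normal-inverse-chi^2 prior with hyperparameters (mu0, kappa0, nu0, sigma0_2)."""
    N, xbar = len(x), np.mean(x)
    S2 = np.var(x, ddof=1)                   # unbiased sample variance S^2
    kappa_N = kappa0 + N
    nu_N = nu0 + N
    mu_N = (kappa0 * mu0 + N * xbar) / kappa_N
    sigma_N2 = (nu0 * sigma0_2 + (N - 1) * S2
                + kappa0 * N / kappa_N * (xbar - mu0) ** 2) / nu_N
    return mu_N, kappa_N, nu_N, sigma_N2

rng = np.random.default_rng(3)
x = rng.normal(3.0, 1.5, size=100)
mu_N, kappa_N, nu_N, sigma_N2 = posterior_normal_inv_chi2(
    x, mu0=0.0, kappa0=1.0, nu0=1.0, sigma0_2=1.0)
print(mu_N, kappa_N, nu_N, sigma_N2)
```

With κ_{0} = 1 and N = 100, the posterior mean μ_{N} = 100x̄/101 sits almost on top of the sample mean, as the "prior sample size" interpretation suggests.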
Although this is the conjugate family when μ and σ^{2} are both unknown, the forced dependence between μ and σ^{2} in the prior is often awkward in applications. Common alternatives are as follows:
If one wishes for a noninformative prior for σ^{2} (analogous to the flat prior choice for μ in (b) above), a choice that preserves conjugacy would be to take the degrees of freedom ν_{0} = 0 in the Inv-χ^{2}(ν_{0}, σ_{0}^{2}) prior for σ^{2}. This leads to a prior f(σ^{2}) ∝ 1/σ^{2}. Another noninformative choice is the Jeffreys prior for σ^{2} (proportional to the square root of the Fisher information for the parameter), which in this case leads to the prior f(σ) ∝ 1/σ. The Jeffreys prior for log σ^{2} (or equivalently log σ) is simply f(log σ^{2}) ∝ 1. All of these choices are improper priors, and care must be taken that the posterior turns out to be proper. One common way to avoid this issue is to force the prior to be proper, say, by taking σ^{2} ∼ Unif(0, M) for some suitably large number M.

The conjugate prior distribution for a multivariate normal distribution with parameters μ and Σ has a form similar to that of the univariate case, but with multivariate normal and inverse-Wishart densities replacing the univariate normal and inverse-χ-squared densities. In particular, assuming sampling x_{1}, ..., x_{N} iid from N(μ, Σ), the joint prior for μ and Σ is of the following form:

μ|Σ ∼ N(μ_{0}, Σ/κ_{0}), Σ ∼ Inv-Wishart(ν_{0}, Σ_{0}^{−1})
where the notation “Σ ∼ Inv-Wishart(ν_{0}, Σ_{0}^{−1})” means that Σ^{−1} ∼ W_{ν_{0}}(Σ_{0}^{−1}), the usual Wishart distribution with ν_{0} degrees of freedom and parameter matrix Σ_{0}^{−1}. Again, κ_{0} and ν_{0} function as prior sample sizes. Then, the joint posterior distribution will be as follows:

μ|Σ, x_{1}, ..., x_{N} ∼ N(μ_{N}, Σ/κ_{N}), Σ|x_{1}, ..., x_{N} ∼ Inv-Wishart(ν_{N}, Σ_{N}^{−1})

where

μ_{N} = (κ_{0}μ_{0} + N x̄)/(κ_{0} + N), κ_{N} = κ_{0} + N, ν_{N} = ν_{0} + N,
Σ_{N} = Σ_{0} + (N − 1)S + (κ_{0}N/(κ_{0} + N))(x̄ − μ_{0})(x̄ − μ_{0})^{T}
with S = (1/(N − 1)) Σ_{n=1}^{N} (x_{n} − x̄)(x_{n} − x̄)^{T} (Gelman et al., 2004). Once again, the joint conjugate prior forces some prior dependence between μ and Σ that may be awkward in practice, and once again the common fixes are as follows:

The details, generally analogous to the univariate normal case, are worked out in many places; in particular, see Gelman et al. (2004). Gelman et al. (2004) suggest another way to reduce the prior dependence between μ and Σ. Sun and Berger (2007) provide an extensive discussion of several “default” objective/noninformative prior choices for the multivariate normal.
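The multivariate conjugate update can be sketched in the same way as the univariate one; the function below (NumPy, with arbitrary illustrative hyperparameters) computes the posterior hyperparameters (μ_{N}, κ_{N}, ν_{N}, Σ_{N}) following Gelman et al. (2004).

```python
import numpy as np

def posterior_normal_inv_wishart(X, mu0, kappa0, nu0, Sigma0):
    """Conjugate update for iid N_K(mu, Sigma) rows of X under the
    normal-inverse-Wishart prior with hyperparameters (mu0, kappa0, nu0, Sigma0)."""
    N, K = X.shape
    xbar = X.mean(axis=0)
    A = (X - xbar).T @ (X - xbar)            # = (N - 1) S, the centered cross-products
    kappa_N = kappa0 + N
    nu_N = nu0 + N
    mu_N = (kappa0 * mu0 + N * xbar) / kappa_N
    d = (xbar - mu0).reshape(-1, 1)
    Sigma_N = Sigma0 + A + (kappa0 * N / kappa_N) * (d @ d.T)
    return mu_N, kappa_N, nu_N, Sigma_N

rng = np.random.default_rng(4)
X = rng.multivariate_normal([1.0, -1.0], [[1.0, 0.3], [0.3, 2.0]], size=200)
mu_N, kappa_N, nu_N, Sigma_N = posterior_normal_inv_wishart(
    X, mu0=np.zeros(2), kappa0=1.0, nu0=4.0, Sigma0=np.eye(2))
print(mu_N, kappa_N, nu_N)
```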
Assessing multivariate normality in higher dimensions is challenging. A well-known theorem (Johnson and Wichern, 1998; see also Anderson, 2003) states that X is a multivariate normal vector if and only if each linear combination a^{T}X of its components is univariate normal, but this is seldom feasible to show in practice. Instead, one often checks only the one- and two-dimensional margins of X; for example, examining a Q–Q plot for each of the K components of X and, in addition, examining bivariate scatterplots of each possible pair of variables to determine whether the data points yield an elliptical appearance. (See Johnson and Wichern (1998) for more information on techniques for assessing multivariate normality.)
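One simple numeric companion to the Q–Q plot is the correlation between the sample order statistics and the corresponding standard normal quantiles, computed per margin; values near 1 are consistent with marginal normality. A sketch (Python standard library plus NumPy; the comparison below is informal, not a calibrated test):

```python
import numpy as np
from statistics import NormalDist

def qq_correlation(x):
    """Correlation between sample order statistics and standard normal
    quantiles; values near 1 are consistent with a normal margin."""
    n = len(x)
    probs = (np.arange(1, n + 1) - 0.5) / n           # plotting positions
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.corrcoef(np.sort(x), theo)[0, 1]

rng = np.random.default_rng(8)
print(qq_correlation(rng.normal(size=1000)))          # close to 1
print(qq_correlation(rng.exponential(size=1000)))     # noticeably smaller
```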
There is no doubt that true multivariate normality is a rare property for a multivariate dataset. Although the latent ability variable in IRT is often assumed to be normally distributed, the data may not conform well to this assumption at all (Casabianca, 2011; Casabianca et al., 2010; Moran and Dresher, 2007; Woods and Lin, 2009; Woods and Thissen, 2006). In these and other cases, robust alternatives to the multivariate normal distribution can be considered.
An important class of generalizations of the multivariate normal distribution is the family of multivariate elliptical distributions, so named because each of its level sets defines an ellipsoid (Fang and Zhang, 1990; Branco and Dey, 2002). The K-dimensional random variable X has an elliptical distribution if and only if its characteristic function (Billingsley, 1995; Lukacs, 1970) is of the form

φ_{X}(t) = e^{it^{T}μ} ψ(t^{T}Σt)
where as usual μ is a K-dimensional vector and Σ is a symmetric nonnegative definite K × K matrix, and t ∈ ℜ^{K}. When a density f(x) exists for X, it has the form

f(x) = c_{K} |Σ|^{−1/2} g((x − μ)^{T} Σ^{−1}(x − μ))

for a suitable normalizing constant c_{K},
where g() is itself a univariate density; g() is called the generator density for f(). The density defines a location/scale family with location parameter μ and scale parameter Σ. The parameter μ is the median of f() in all cases, and if E[X] exists, E[X] = μ. If Var(X) exists, it is proportional to Σ, with the proportionality constant determined by ψ. A little calculation shows that if X is elliptically distributed with location μ and scale Σ, then W = AX + b is again elliptically distributed, with location Aμ + b and scale AΣA^{T}.
Elliptical distributions are used as a tool for generalizing normal-theory structural equation modeling (e.g., Shapiro and Browne, 1987; Schumacker and Cheevatanarak, 2000). The special case of the K-dimensional multivariate-t distribution on ν degrees of freedom, with density

f(x) = [Γ((ν + K)/2) / (Γ(ν/2)(νπ)^{K/2} |Σ|^{1/2})] [1 + (1/ν)(x − μ)^{T} Σ^{−1}(x − μ)]^{−(ν+K)/2}
has been used as the error distribution in robust Bayesian and non-Bayesian linear regression modeling since at least Zellner (1976); more recently, robust regression with general elliptical error distributions and the particular case of scale mixtures of normals (of which the univariate-t and multivariate-t are also examples) has also been studied (e.g., Fernandez and Steel, 2000).
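The scale-mixture representation mentioned above makes the multivariate-t easy to simulate: if Z ∼ N_{K}(0, Σ) and W ∼ χ^{2}_{ν} independently, then μ + Z/√(W/ν) is multivariate-t on ν degrees of freedom. A sketch (NumPy; the particular μ, Σ, ν, and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
nu, n = 7, 500_000                        # illustrative df and sample size

Z = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
W = rng.chisquare(nu, size=n)
X = mu + Z / np.sqrt(W / nu)[:, None]     # multivariate-t draws

# For nu > 2, Var(X) = nu / (nu - 2) * Sigma
print(np.cov(X.T).round(2))
print(nu / (nu - 2) * Sigma)
```

The sample covariance should be close to νΣ/(ν − 2), the known variance of the multivariate-t when ν > 2, while the tails are visibly heavier than the normal's.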
A further generalization is the family of skew-elliptical distributions (e.g., Branco and Dey, 2001, 2002). When it exists, the density of a skew-elliptical distribution is of the form

f(x) = 2 f_{g1}(x) F_{g2}(λ^{T}(x − μ))
where f _{ g1}() is the density of a multivariate elliptical distribution, and F _{ g2}() is the cumulative distribution function of a (possibly different) univariate elliptical distribution with location parameter 0 and scale parameter 1. The vector parameter λ is a skewness parameter; when λ = 0, the skew-elliptical density reduces to a symmetric elliptical density.
When the generator densities g _{1}(x) = g _{2}(x) = ϕ(x), the standard normal density, we obtain the special case of the skew-normal distributions. It has been observed (Moran and Dresher, 2007) that the empirical distribution of the latent proficiency variable in IRT models applied to large-scale educational surveys sometimes exhibits some nontrivial skewing, which if unmodeled can cause bias in estimating item parameters and features of the proficiency distribution. Skew-normal and related distributions have been proposed (Xu and Jia, 2011) as a way of accounting for this skewing in the modeling of such data, with as few extra parameters as possible.
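In the univariate case, the skew-normal density is 2ϕ(x)Φ(λx), and draws can be generated by the standard conditioning representation; the sketch below (NumPy, with an arbitrary λ) uses this representation and checks the known moment E[X] = δ√(2/π), where δ = λ/√(1 + λ^{2}).

```python
import numpy as np

rng = np.random.default_rng(6)
lam = 3.0                                    # arbitrary skewness parameter
delta = lam / np.sqrt(1 + lam**2)

# Conditioning representation: with (U0, U1) standard bivariate normal and
# corr(U0, U1) = delta, X = U1 if U0 > 0 (else -U1) has density 2 phi(x) Phi(lam x)
cov = np.array([[1.0, delta],
                [delta, 1.0]])
U = rng.multivariate_normal([0.0, 0.0], cov, size=500_000)
X = np.where(U[:, 0] > 0, U[:, 1], -U[:, 1])

# Known moment of the skew-normal: E[X] = delta * sqrt(2 / pi)
print(X.mean(), delta * np.sqrt(2 / np.pi))
```

For λ > 0 the distribution is skewed to the right, which is why the mean is pulled above the location parameter 0.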
Two additional classes of transformed normal distributions used in IRT are the lognormal and logit-normal distributions. A random variable X has a lognormal distribution when its logarithm is normally distributed. The density of the lognormal distribution is of the following form:

f(x) = (1/(xσ√(2π))) exp{−(log x − μ)^{2}/(2σ^{2})}, x > 0
where μ and σ^{2} are the mean and variance of the variable’s natural log. This distribution is an alternative to Gamma and Weibull distributions for nonnegative continuous random variables and is used in psychometrics to model examinee response times (e.g., van der Linden, 2006) and latent processes in decision making (e.g., Rouder et al., 2014). It is also used as a prior distribution in Bayesian estimation of nonnegative parameters, for example the discrimination parameter in two- and three-parameter logistic IRT models (van der Linden, 2006).
Similarly, a random variable X has a logit-normal distribution when its logit is normally distributed. The density of the logit-normal distribution is of the following form:

f(x) = (1/(σ√(2π))) (1/(x(1 − x))) exp{−(logit(x) − μ)^{2}/(2σ^{2})}, 0 < x < 1
Here, μ and σ^{2} are the mean and variance of the variable’s logit, and x is a proportion, bounded by 0 and 1. A multivariate generalization of the logit-normal distribution (Aitchison, 1985) has been used in latent Dirichlet allocation models for text classification ( Blei and Lafferty, 2007) and in mixed membership models for strategy choice in cognitive diagnosis (Galyardt, 2012).
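Both transformed-normal families are trivial to simulate by transforming normal draws; the sketch below (NumPy; μ, σ, and the sample size are arbitrary illustrative values) generates logit-normal variates and confirms that they lie in (0, 1) and that their logits recover the normal-scale parameters.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma = 0.5, 1.2                      # mean and sd on the logit scale

z = rng.normal(mu, sigma, size=400_000)
x = 1.0 / (1.0 + np.exp(-z))              # inverse logit: logit-normal draws

assert x.min() > 0 and x.max() < 1        # bounded in (0, 1)
logit_x = np.log(x / (1 - x))             # logit recovers the normal scale
print(logit_x.mean(), logit_x.std())      # approximately mu and sigma
```

The same two lines with np.exp(z) in place of the inverse logit would produce lognormal draws whose logs recover μ and σ.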
This work was supported by a postdoctoral fellowship at Carnegie Mellon University and Rand Corporation, through Grant #R305B1000012 from the Institute of Education Sciences, U.S. Department of Education.