STAT 802: Multivariate Analysis

Course outline:

Basic structure of typical multivariate data set:

Case by variables: data in matrix. Each row is a case, each column is a variable.

Example: Fisher's iris data: a few rows of the 150 by 5 matrix:

Case #   Variety      Sepal Length   Sepal Width   Petal Length   Petal Width
1        Setosa       5.1            3.5           1.4            0.2
2        Setosa       4.9            3.0           1.4            0.2
...      ...          ...            ...           ...            ...
51       Versicolor   7.0            3.2           4.7            1.4
...      ...          ...            ...           ...            ...
Usual model: rows of data matrix are independent random variables.

Vector valued random variable: function $ {\bf X}:\Omega\mapsto \mathbb {R}^p$ such that, writing $ {\bf X}=(X_1,\ldots,X_p)^T$,

$\displaystyle P(X_1 \le x_1, \ldots , X_p \le x_p)
$

is defined for any constants $ (x_1,\ldots,x_p)$.

Cumulative Distribution Function (CDF) of $ {\bf X}$: function $ F_{\bf X}$ on $ \mathbb {R}^p$ defined by

$\displaystyle F_{\bf X}(x_1,\ldots, x_p) =
P(X_1 \le x_1, \ldots , X_p \le x_p) \,.
$

Defn: Distribution of rv $ {\bf X}$ is absolutely continuous if there is a function $ f$ such that

$\displaystyle P({\bf X}\in A) = \int_A f(x) dx$ (1)

for any (Borel) set $ A$. This is a $ p$ dimensional integral in general. Equivalently

$\displaystyle F(x_1,\ldots,x_p) =
\int_{-\infty}^{x_1}\cdots
\int_{-\infty}^{x_p} f(y_1,\ldots,y_p) \, dy_p \cdots dy_1 \,.
$

Defn: Any $ f$ satisfying (1) is a density of $ {\bf X}$.

For most $ x$, $ F$ is differentiable at $ x$ and

$\displaystyle \frac{\partial^pF(x) }{\partial x_1\cdots \partial x_p} =f(x) \,.
$

Building Multivariate Models

Basic tactic: specify density of

$\displaystyle {\bf X}=(X_1,\ldots, X_p)^T.$

Tools: marginal densities, conditional densities, independence, transformation.

Marginalization: Simplest multivariate problem

$\displaystyle {\bf X}=(X_1,\ldots,X_p), \qquad Y=X_1$

(or in general $ Y$ is any $ X_j$).

Theorem 1   If $ {\bf X}$ has density $ f(x_1,\ldots,x_p)$ and $ q<p$ then $ {\bf Y}=(X_1,\ldots,X_q)$ has density

$\displaystyle f_{\bf Y}(x_1,\ldots,x_q)
= \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty
f(x_1,\ldots,x_p) \, dx_{q+1} \cdots dx_p \,.
$

$ f_{X_1,\ldots,X_q}$ is the marginal density of $ X_1,\ldots,X_q$ and $ f_{\bf X}$ the joint density of $ {\bf X}$, but both are just densities; ``marginal'' simply distinguishes it from the joint density of $ {\bf X}$.

Independence, conditional distributions

Def'n: Events $ A$ and $ B$ are independent if

$\displaystyle P(AB) = P(A)P(B) \,.
$

(Notation: $ AB$ is the event that both $ A$ and $ B$ happen, also written $ A\cap B$.)


Def'n: $ A_i$, $ i=1,\ldots,p$ are independent if

$\displaystyle P(A_{i_1} \cdots A_{i_r}) = \prod_{j=1}^r P(A_{i_j})
$

for any $ 1 \le i_1 < \cdots < i_r \le p$.

Def'n: $ {\bf X}$ and $ {\bf Y}$ are independent if

$\displaystyle P({\bf X}\in A; {\bf Y}\in B) = P({\bf X}\in A)P({\bf Y}\in B)
$

for all $ A$ and $ B$.

Def'n: Rvs $ {\bf X}_1,\ldots,{\bf X}_p$ independent:

$\displaystyle P({\bf X}_1 \in A_1, \cdots , {\bf X}_p \in A_p ) = \prod P({\bf X}_i \in A_i)
$

for any $ A_1,\ldots,A_p$.

Theorem:

  1. If $ {\bf X}$ and $ {\bf Y}$ are independent with joint density $ f_{{\bf X},{\bf Y}}(x,y)$ then $ {\bf X}$ and $ {\bf Y}$ have densities $ f_{\bf X}$ and $ f_{\bf Y}$, and

    $\displaystyle f_{{\bf X},{\bf Y}}(x,y) = f_{\bf X}(x) f_{\bf Y}(y) \,.
$


  2. If $ {\bf X}$ and $ {\bf Y}$ independent with marginal densities $ f_{\bf X}$ and $ f_{\bf Y}$ then $ ({\bf X},{\bf Y})$ has joint density

    $\displaystyle f_{{\bf X},{\bf Y}}(x,y) = f_{\bf X}(x) f_{\bf Y}(y) \,.
$


  3. If $ ({\bf X},{\bf Y})$ has density $ f(x,y)$ and there exist $ g(x)$ and $ h(y)$ st $ f(x,y) = g(x) h(y)
$ for (almost) all $ (x,y)$ then $ {\bf X}$ and $ {\bf Y}$ are independent with densities given by

    $\displaystyle f_{\bf X}(x) = g(x)/\int_{-\infty}^\infty g(u) du
$

    $\displaystyle f_{\bf Y}(y) = h(y)/\int_{-\infty}^\infty h(u) du \,.
$

Theorem: If $ {\bf X}_1,\ldots,{\bf X}_p$ are independent and $ {\bf Y}_i =g_i({\bf X}_i)$ then $ {\bf Y}_1,\ldots,{\bf Y}_p$ are independent. Moreover, $ ({\bf X}_1,\ldots,{\bf X}_q)$ and $ ({\bf X}_{q+1},\ldots,{\bf X}_{p})$ are independent.

Conditional densities

Conditional density of $ {\bf Y}$ given $ {\bf X}=x$:

$\displaystyle f_{{\bf Y}\vert{\bf X}}(y\vert x) = f_{{\bf X},{\bf Y}}(x,y)/f_{\bf X}(x) \, ;
$

in words ``conditional = joint/marginal''.

Change of Variables

Suppose $ {\bf Y}=g({\bf X}) \in \mathbb {R}^p$ with $ {\bf X}\in \mathbb {R}^p$ having density $ f_{\bf X}$. Assume $ g$ is a one to one (``injective'') map, i.e., $ g(x_1) = g(x_2)$ if and only if $ x_1 = x_2$. Find $ f_{\bf Y}$:

Step 1: Solve for $ x$ in terms of $ y$: $ x=g^{-1}(y)$.

Step 2: Use basic equation:

$\displaystyle f_{\bf Y}(y) dy =f_{\bf X}(x) dx
$

and rewrite it in the form

$\displaystyle f_{\bf Y}(y) = f_{\bf X}(g^{-1}(y)) \frac{dx}{dy}
$

Interpretation of derivative $ \frac{dx}{dy}$ when $ p>1$:

$\displaystyle \frac{dx}{dy} = \left\vert \mbox{det}\left(\frac{\partial x_i}{\partial y_j}\right)\right\vert
$

which is the so called Jacobian.

Equivalent formula inverts the matrix:

$\displaystyle f_{\bf Y}(y) = \frac{f_{\bf X}(g^{-1}(y))}{ \left\vert\frac{dy}{dx}\right\vert} \,.
$

This notation means

$\displaystyle \left\vert\frac{dy}{dx}\right\vert =
\left\vert \mbox{det} \left[\begin{array}{ccc}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_p}
\\
\vdots & & \vdots
\\
\frac{\partial y_p}{\partial x_1} & \cdots &
\frac{\partial y_p}{\partial x_p}
\end{array} \right]\right\vert
$

but with $ x$ replaced by the corresponding value of $ y$, that is, replace $ x$ by $ g^{-1}(y)$.

Example: The density

$\displaystyle f_{\bf X}(x_1,x_2) = \frac{1}{2\pi} \exp\left\{ -\frac{x_1^2+x_2^2}{2}\right\}
$

is the standard bivariate normal density. Let $ {\bf Y}=(Y_1,Y_2)$ where $ Y_1=\sqrt{X_1^2+X_2^2}$ and $ 0 \le Y_2< 2\pi$ is angle from the positive $ x$ axis to the ray from the origin to the point $ (X_1,X_2)$. I.e., $ {\bf Y}$ is $ {\bf X}$ in polar co-ordinates.

Solve for $ x$ in terms of $ y$:

$\displaystyle X_1 = Y_1 \cos(Y_2) \qquad X_2 = Y_1 \sin(Y_2)
$

so that

$\displaystyle g(x_1,x_2) = (g_1(x_1,x_2),g_2(x_1,x_2))
= (\sqrt{x_1^2 + x_2^2}, \mbox{argument}(x_1,x_2))
$

$\displaystyle g^{-1}(y_1,y_2) = (g^{-1}_1(y_1,y_2),g^{-1}_2(y_1,y_2))
= (y_1\cos(y_2), y_1\sin(y_2))
$

$\displaystyle \left\vert\frac{dx}{dy}\right\vert
= \left\vert \mbox{det}\left( \begin{array}{cc}
\cos(y_2) & -y_1\sin(y_2)
\\
\sin(y_2) & y_1 \cos(y_2)
\end{array}\right) \right\vert
= y_1 \,.
$

It follows that
$\displaystyle f_{\bf Y}(y_1,y_2) = \frac{1}{2\pi}\exp\left\{-\frac{y_1^2}{2}\right\}y_1 \,
1(0 \le y_1 < \infty)\, 1(0 \le y_2 < 2\pi ) \,.
$

Next: marginal densities of $ Y_1$, $ Y_2$?

Factor $ f_{\bf Y}$ as $ f_{\bf Y}(y_1,y_2) = h_1(y_1)h_2(y_2)$ where

$\displaystyle h_1(y_1) = y_1e^{-y_1^2/2} 1(0 \le y_1 < \infty)
$

and

$\displaystyle h_2(y_2) = 1(0 \le y_2 < 2\pi )/ (2\pi) \,.
$

Then

$\displaystyle f_{Y_1}(y_1)$ $\displaystyle =$ $\displaystyle \int_{-\infty}^\infty h_1(y_1)h_2(y_2) \, dy_2$  
  $\displaystyle =$ $\displaystyle h_1(y_1) \int_{-\infty}^\infty h_2(y_2) \, dy_2$  

so marginal density of $ Y_1$ is a multiple of $ h_1$. Multiplier makes $ \int f_{Y_1} =1$ but in this case

$\displaystyle \int_{-\infty}^\infty h_2(y_2) \, dy_2 = \int_0^{2\pi} (2\pi)^{-1} dy_2 = 1
$

so that

$\displaystyle f_{Y_1}(y_1) = y_1e^{-y_1^2/2} 1(0 \le y_1 < \infty) \,.
$

(Special Weibull or Rayleigh distribution.) Similarly

$\displaystyle f_{Y_2}(y_2) = 1(0 \le y_2 < 2\pi )/ (2\pi)
$

which is the Uniform$ (0,2\pi)$ density. Exercise: $ W=Y_1^2/2$ has standard exponential distribution. Recall: by definition $ U=Y_1^2$ has a $ \chi^2$ distribution on 2 degrees of freedom. Exercise: find $ \chi^2_2$ density.

Remark: easy to check $ \int_0^\infty ye^{-y^2/2} dy = 1$.

Thus: have proved original bivariate normal density integrates to 1.

Put $ I=\int_{-\infty}^\infty e^{-x^2/2} dx$. Get

$\displaystyle I^2$ $\displaystyle = \int_{-\infty}^\infty e^{-x^2/2} dx \int_{-\infty}^\infty e^{-y^2/2} dy$    
  $\displaystyle = \int_{-\infty}^\infty \int_{-\infty}^\infty e^{-(x^2+y^2)/2} dydx$    
  $\displaystyle = 2\pi.$    

So $ I=\sqrt{2\pi}$.
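
A quick numerical check of this argument (a sketch in Python, assuming numpy and scipy are available; only standard library calls are used, the constants are illustrative):

    import numpy as np
    from scipy import integrate

    # I = integral of exp(-x^2/2) over the real line; should equal sqrt(2*pi)
    I, _ = integrate.quad(lambda x: np.exp(-x**2 / 2), -np.inf, np.inf)
    print(I, np.sqrt(2 * np.pi))           # both approximately 2.5066

    # the Rayleigh density y*exp(-y^2/2) on (0, infinity) integrates to 1
    area, _ = integrate.quad(lambda y: y * np.exp(-y**2 / 2), 0, np.inf)
    print(area)                            # approximately 1.0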

Linear Algebra Review

Notation:

Defn: The transpose, $ A^T$, of an $ m \times n$ matrix $ A$ is the $ n \times m$ matrix whose entries are given by

$\displaystyle (A^T)_{ij} = A_{ji} \,.
$

We have

$\displaystyle (A+B)^T = A^T+B^T
$

and

$\displaystyle (AB)^T = B^T A^T \, .
$

Defn: rank of matrix $ A$, rank$ (A)$: # of linearly independent columns of $ A$. We have

rank$\displaystyle (A)$ $\displaystyle =$   dim$\displaystyle ($column space of $ A$$\displaystyle )$    
  $\displaystyle =$   dim$\displaystyle ($row space of $ A$$\displaystyle )$    
  $\displaystyle =$   rank$\displaystyle (A^T)$    

If $ A$ is $ m \times n$ then rank$ (A) \le \min(m,n)$.

Matrix inverses

For now: all matrices square $ n \times n$.

If there is a matrix $ B$ such that $ BA=I_{n \times n}$ then we call $ B$ the inverse of $ A$. If $ B$ exists it is unique and $ AB=I$ and we write $ B=A^{-1}$. The matrix $ A$ has an inverse if and only if rank$ (A) = n$.

Inverses have the following properties:

$\displaystyle (AB)^{-1} = B^{-1}A^{-1}
$

(if one side exists then so does the other) and

$\displaystyle (A^T)^{-1} = (A^{-1})^T
$

Determinants

Again $ A$ is $ n \times n$. The determinant is a function on the set of $ n \times n$ matrices such that:

  1. det$ (I) = 1$.

  2. If $ A^\prime$ is the matrix $ A$ with two columns interchanged then

       det$\displaystyle (A^\prime) = -$   det$\displaystyle (A)\, .
$

    (So: two equal columns implies det$ (A)=0$.)

  3. det$ (A)$ is a linear function of each column of $ A$. If $ A = ( a_1, \ldots, a_n)$ with $ a_i$ denoting the $ i$th column of the matrix then

    det$\displaystyle (a_1,\ldots,$ $\displaystyle a_i+b_i,\ldots,a_n)$    
    $\displaystyle =$ det$\displaystyle (a_1,\ldots,a_i,\ldots,a_n)$    
      $\displaystyle +$   det$\displaystyle (a_1,\ldots,b_i,\ldots,a_n)$    

Here are some properties of the determinant:

  1. det$ (A^T) =$   det$ (A)$.

  2. det$ (AB) =$   det$ (A)$det$ (B)$.

  3. det$ (A^{-1}) = 1/$det$ (A)$.

  4. $ A$ is invertible if and only if det$ (A) \neq 0$ if and only if rank$ (A) = n$.

  5. Determinants can be computed (slowly) by expansion by minors.

Special Kinds of Matrices

  1. $ A$ is symmetric if $ A^T=A$.

  2. $ A$ is orthogonal if $ A^T=A^{-1}$ (or $ AA^T = A^TA=I$).

  3. $ A$ is idempotent if $ AA\equiv A^2 = A$.

  4. $ A$ is diagonal if $ i \neq j$ implies $ A_{ij}=0$.

Inner Products, orthogonal, orthonormal vectors

Defn: The inner product or dot product of $ x$ and $ y$ is

$\displaystyle <x,y> = x^Ty = \sum x_i y_i \,.
$

Defn: $ x$ and $ y$ are orthogonal if $ x^Ty = \sum x_i y_i
= 0$.

Defn: The norm (or length) of $ x$ is $ \vert\vert x\vert\vert = (x^Tx)^{1/2} = (\sum x_i^2)^{1/2}$

$ A$ is orthogonal if each column of $ A$ has length 1 and is orthogonal to each other column of $ A$.

Quadratic Forms

Suppose $ A$ is an $ n \times n$ matrix. The function

$\displaystyle g(x) = x^T A x
$

is called a quadratic form. Now

$\displaystyle g(x)$ $\displaystyle = \sum_{ij} A_{ij} x_i x_j$    
  $\displaystyle = \sum_i A_{ii} x_i^2 + \sum_{i < j} (A_{ij}+A_{ji}) x_ix_j$    

so that $ g(x)$ depends only on the total $ A_{ij}+A_{ji}$. In fact

$\displaystyle x^T Ax = x^T A^T x = x^T\left(\frac{A+A^T}{2} \right) x
$

Thus we will assume that $ A$ is symmetric.

Eigenvalues and eigenvectors

If $ A$ is $ n \times n$ and $ v\neq 0\in \mathbb {R}^n$ and $ \lambda\in \mathbb {R}$ such that

$\displaystyle Av=\lambda v
$

then $ \lambda$ is an eigenvalue (or characteristic or latent value) of $ A$ and $ v$ is a corresponding eigenvector. Since $ Av-\lambda v = (A-\lambda I)v=0$ the matrix $ A-\lambda I$ is singular.

Therefore det$ (A-\lambda I) =0$.

Conversely: if $ A-\lambda I$ singular then there is $ v \neq 0$ such that $ (A-\lambda I)v=0$.

Fact: det$ (A-\lambda I)$ is polynomial in $ \lambda$ of degree $ n$.

Each root is an eigenvalue.

For general $ A$ the roots may be repeated or complex valued.

Diagonalization

Matrix $ A$ is diagonalized by a non-singular matrix $ P$ if $ P^{-1} A P \equiv D$ is diagonal.

If so then $ AP=PD$ so each column of $ P$ is eigenvector of $ A$ with the $ i$th column having eigenvalue $ D_{ii}$.

Thus to be diagonalizable $ A$ must have $ n$ linearly independent eigenvectors.

Symmetric Matrices

If $ A$ is symmetric then

  1. Every eigenvalue of $ A$ is real (not complex).

  2. $ A$ is diagonalizable; columns of $ P$ may be taken unit length, mutually orthogonal: $ A$ is diagonalizable by an orthogonal matrix $ P$; in symbols $ P^TAP = D$.

  3. The diagonal entries of $ D$ are the eigenvalues of $ A$.

  4. If $ \lambda_1 \neq \lambda_2$ are two eigenvalues of $ A$ and $ v_1$ and $ v_2$ are corresponding eigenvectors then

    $\displaystyle v_1^T A v_2 = v_1^T \lambda_2 v_2 = \lambda_2 v_1^T v_2
$

    and

    $\displaystyle (v_1^T A v_2 )$ $\displaystyle = (v_1^T A v_2 )^T = v_2^T A^T v_1$    
      $\displaystyle = v_2^T A v_1 = v_2^T \lambda_1 v_1 = \lambda_1 v_2^T v_1$    

    Since $ (\lambda_1-\lambda_2)v_1^T v_2 =0$ and $ \lambda_1 \neq \lambda_2$ we see $ v_1^T v_2 = 0$. Eigenvectors corresponding to distinct eigenvalues are orthogonal.

Positive Definite Matrices

Defn: A symmetric matrix $ {\bf A}$ is non-negative definite if $ x^T {\bf A}x \ge 0$ for all $ x$. It is positive definite if in addition $ x^T {\bf A}x = 0$ implies $ x=0$.

$ {\bf A}$ is non-negative definite iff all its eigenvalues are non-negative.

$ {\bf A}$ is positive definite iff all eigenvalues positive.

A non-negative definite matrix has a symmetric non-negative definite square root. If

$\displaystyle {\bf P} {\bf D} {\bf P}^T = {\bf A}
$

for $ {\bf P}$ orthogonal and $ \bf D$ diagonal then

$\displaystyle {\bf A}^{1/2} = {\bf P} {\bf D}^{1/2} {\bf P}^T
$

is symmetric, non-negative definite and

$\displaystyle {\bf A}^{1/2}{\bf A}^{1/2}= {\bf A}
$

Here $ {\bf D}^{1/2}$ is diagonal with

$\displaystyle ({\bf D}^{1/2})_{ii} = ({\bf D}_{ii})^{1/2}.
$

Many other square roots possible. If $ {\bf A}{\bf A}^T = {\bf M}$ and $ \bf P$ is orthogonal and $ {\bf A}^*={\bf A}{\bf P}$ then $ {\bf A}^*({\bf A}^*)^T={\bf M}$.
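
Here is a small numerical illustration of the symmetric square root via the spectral decomposition (a sketch in Python with numpy; the 2 by 2 matrix is an arbitrary positive definite example, not from the notes):

    import numpy as np

    A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite example
    vals, P = np.linalg.eigh(A)              # spectral decomposition: A = P D P^T
    root = P @ np.diag(np.sqrt(vals)) @ P.T  # A^{1/2} = P D^{1/2} P^T
    print(np.allclose(root @ root, A))       # True: A^{1/2} A^{1/2} = A
    print(np.allclose(root, root.T))         # True: the square root is symmetric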

Orthogonal Projections

Suppose $ S$ is a vector subspace of $ \mathbb {R}^n$ and $ a_1,\ldots,a_m$ is a basis for $ S$. Given any $ x\in \mathbb {R}^n$ there is a unique $ y\in S$ which is closest to $ x$; $ y$ minimizes

$\displaystyle (x-y)^T(x-y)
$

over $ y\in S$. Any $ y$ in $ S$ is of the form

$\displaystyle y = c_1 a_1 + \cdots + c_m a_m = Ac
$

$ A$, $ n \times m$, columns $ a_1,\ldots,a_m$; $ c$ column with $ i$th entry $ c_i$. Define

$\displaystyle Q= A(A^TA)^{-1}A^T
$

($ A$ has rank $ m$ so $ A^TA$ is invertible.) Then

$\displaystyle (x-Ac)^T (x-Ac)$ $\displaystyle = (x-Qx+Qx-Ac)^T (x-Qx+Qx-Ac)$    
  $\displaystyle = (x-Qx)^T(x-Qx) + (Qx-Ac)^T(x-Qx)$    
  $\displaystyle \quad +(x-Qx)^T(Qx-Ac) + (Qx-Ac)^T(Qx-Ac)$    

Note that $ x-Qx = (I-Q)x$ and that

$\displaystyle QAc = A(A^TA)^{-1}A^TAc = Ac
$

so that

$\displaystyle Qx-Ac = Q(x-Ac)
$

Then

$\displaystyle (Qx-Ac)^T(x-Qx) = (x-Ac)^T Q^T (I-Q) x
$

Since $ Q^T=Q$ we see that

$\displaystyle Q^T(I-Q)$ $\displaystyle = Q(I-Q)$    
  $\displaystyle = Q-Q^2$    
  $\displaystyle = Q-A(A^TA)^{-1}A^TA(A^TA)^{-1}A^T$    
  $\displaystyle = Q-Q = 0$    

This shows that

$\displaystyle (x-Ac)^T(x-Ac) =$ $\displaystyle (x-Qx)^T(x-Qx)$    
$\displaystyle \quad$ $\displaystyle + (Qx-Ac)^T(Qx-Ac)$    

Choose $ Ac$ to minimize: minimize second term.

Achieved by making $ Qx=Ac$.

Since $ Qx=A(A^TA)^{-1}A^T x$ can take

$\displaystyle c=(A^TA)^{-1}A^T x.$

Summary: closest point $ y$ in $ S$ is

$\displaystyle y=Qx=A(A^TA)^{-1}A^T x
$

call $ y$ the orthogonal projection of $ x$ onto $ S$.

Notice that the matrix $ Q$ is idempotent:

$\displaystyle Q^2=Q
$

We call $ Qx$ the orthogonal projection of $ x$ on $ S$ because $ Qx$ is perpendicular to the residual $ x-Qx = (I-Q)x$.
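
The algebra above is easy to check numerically (a sketch in Python with numpy; the random 5 by 2 matrix standing in for the basis is illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 2))          # columns play the role of a_1, a_2 (rank 2)
    Q = A @ np.linalg.inv(A.T @ A) @ A.T     # Q = A (A^T A)^{-1} A^T
    x = rng.standard_normal(5)

    print(np.allclose(Q @ Q, Q))             # Q is idempotent
    print(np.allclose(Q.T, Q))               # Q is symmetric
    resid = x - Q @ x                        # residual (I - Q) x
    print(np.allclose(A.T @ resid, 0))       # residual is orthogonal to the subspace S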

Partitioned Matrices

Suppose $ A_{11}$ is a $ p \times r$ matrix, $ A_{12}$ is $ p \times s$, $ A_{21}$ is $ q \times r$ and $ A_{22}$ is $ q \times s$. Make a $ (p+q) \times (r+s)$ matrix by arranging the $ A_{ij}$ in a 2 by 2 pattern of blocks:

$\displaystyle A = \left[ \begin{array}{cc}
A_{11} & A_{12}
\\
A_{21} & A_{22}
\end{array}\right]
$

For instance if

$\displaystyle A_{11} = \left[ \begin{array}{cc} 1 & 0 \\  0 & 1 \end{array}\right]
$

$\displaystyle A_{12} = \left[ \begin{array}{c} 2 \\  3 \end{array}\right]
$

$\displaystyle A_{21} = \left[ \begin{array}{cc} 4 & 5\end{array}\right]
$

and

$\displaystyle A_{22} = \left[ 6 \right]
$

then

$\displaystyle A = \left[ \begin{array}{cc\vert c}
1 & 0 & 2
\\
0 & 1 & 3
\\
\hline
4 & 5 & 6
\end{array}\right]
$

Lines indicate partitioning.

We can work with partitioned matrices just like ordinary matrices, always making sure that in products we never change the order of multiplication.

$\displaystyle \left[ \begin{array}{cc} A_{11} & A_{12} \\ \\ A_{21} & A_{22} \end{array} \right]$ $\displaystyle + \left[ \begin{array}{cc} B_{11} & B_{12} \\ \\ B_{21} & B_{22} \end{array} \right]$    
$\displaystyle =$ $\displaystyle \left[ \begin{array}{cc} A_{11}+B_{11} & A_{12}+B_{12} \\ \\ A_{21}+B_{21} & A_{22}+B_{22} \end{array} \right]$    

and

$\displaystyle \left[ \begin{array}{cc}
A_{11} & A_{12}
\\
A_{21} & A_{22}
\end{array}\right]
\left[ \begin{array}{cc}
B_{11} & B_{12}
\\
B_{21} & B_{22}
\end{array}\right]
= \left[ \begin{array}{cc}
A_{11}B_{11}+A_{12}B_{21} & A_{11}B_{12}+ A_{12}B_{22}
\\
A_{21}B_{11}+A_{22}B_{21} & A_{21}B_{12}+ A_{22}B_{22}
\end{array}\right]
$

Note partitioning of $ A$ and $ B$ must match.

Addition: dimensions of $ A_{ij}$ and $ B_{ij}$ must be the same.

Multiplication: $ A_{12}$ must have as many columns as $ B_{21}$ has rows, etc.

In general: need $ A_{ij}B_{jk}$ to make sense for each $ i,j,k$.

Works with more than a 2 by 2 partitioning.

Defn: block diagonal matrix: partitioned matrix $ A$ for which $ A_{ij}=0$ if $ i \neq j$. If

$\displaystyle A = \left[ \begin{array}{cc}
A_{11} & 0
\\
0 & A_{22}
\end{array}\right]
$

then $ A$ is invertible iff each $ A_{ii}$ is invertible and then

$\displaystyle A^{-1} = \left[ \begin{array}{cc}
A_{11}^{-1} & 0
\\
0 & A_{22}^{-1}
\end{array}\right]
$

Moreover det$ (A) =$   det$ (A_{11})$   det$ (A_{22})$. Similar formulas work for larger matrices.

Partitioned inverses. Suppose $ {\bf A}$, $ {\bf C}$ are symmetric positive definite. Look for inverse of

$\displaystyle \left[\begin{array}{cc} {\bf A}& {\bf B}\\  {\bf B}^T & {\bf C}\end{array}\right]
$

of form

$\displaystyle \left[\begin{array}{cc} {\bf E}& {\bf F}\\  {\bf F}^T & {\bf G}\end{array}\right]
$

Multiply to get equations

$\displaystyle {\bf A}{\bf E}+{\bf B}{\bf F}^T$ $\displaystyle = {\bf I}$    
$\displaystyle {\bf A}{\bf F}+{\bf B}{\bf G}$ $\displaystyle = {\bf0}$    
$\displaystyle {\bf B}^T{\bf E}+{\bf C}{\bf F}^T$ $\displaystyle = {\bf0}$    
$\displaystyle {\bf B}^T{\bf F}+{\bf C}{\bf G}$ $\displaystyle = {\bf I}$    

Solve to get

$\displaystyle {\bf F}^T$ $\displaystyle =-{\bf C}^{-1} {\bf B}^T{\bf E}$    
$\displaystyle {\bf A}{\bf E}-{\bf B}{\bf C}^{-1}{\bf B}^T {\bf E}$ $\displaystyle = {\bf I}$    
$\displaystyle {\bf E}$ $\displaystyle = ({\bf A}-{\bf B}{\bf C}^{-1}{\bf B}^T)^{-1}$    
$\displaystyle {\bf G}$ $\displaystyle = ({\bf C}-{\bf B}^T{\bf A}^{-1}{\bf B})^{-1}$    
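
These block formulas can be verified numerically (a sketch in Python with numpy; the random positive definite matrix and the 3 + 2 partition are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.standard_normal((5, 5))
    M = M @ M.T + 5 * np.eye(5)              # symmetric positive definite, partitioned 3 + 2
    A, B, C = M[:3, :3], M[:3, 3:], M[3:, 3:]

    E = np.linalg.inv(A - B @ np.linalg.inv(C) @ B.T)
    G = np.linalg.inv(C - B.T @ np.linalg.inv(A) @ B)
    F = -np.linalg.inv(A) @ B @ G            # from the equation A F + B G = 0

    inv_blocks = np.block([[E, F], [F.T, G]])
    print(np.allclose(inv_blocks, np.linalg.inv(M)))   # True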

The Multivariate Normal Distribution

Defn: $ Z \in \mathbb {R}^1 \sim N(0,1)$ iff

$\displaystyle f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \,.
$

Defn: $ {\bf Z}\in \mathbb {R}^p \sim MVN_p(0,I)$ if and only if $ {\bf Z}=(Z_1,\ldots,Z_p)^T$ with the $ Z_i$ independent and each $ Z_i\sim N(0,1)$.

In this case according to our theorem

$\displaystyle f_{\bf Z}(z_1,\ldots,z_p)$ $\displaystyle = \prod \frac{1}{\sqrt{2\pi}} e^{-z_i^2/2}$    
  $\displaystyle = (2\pi)^{-p/2} \exp\{ -z^T z/2\} \, ;$    

superscript $ T$ denotes matrix transpose.

Defn: $ {\bf X}\in \mathbb {R}^p$ has a multivariate normal distribution if it has the same distribution as $ {\bf A}{\bf Z}+\boldsymbol{\mu}$ for some $ \boldsymbol{\mu}\in \mathbb {R}^p$, some $ p\times q$ matrix of constants $ {\bf A}$ and $ Z\sim MVN_q(0,I)$.

$ p=q$, $ {\bf A}$ singular: $ {\bf X}$ does not have a density.

$ {\bf A}$ invertible: derive multivariate normal density by change of variables:

$\displaystyle {\bf X}={\bf A}{\bf Z}+\boldsymbol{\mu} \Leftrightarrow {\bf Z}={\bf A}^{-1}({\bf X}-\boldsymbol{\mu})
$

$\displaystyle \frac{\partial {\bf X}}{\partial {\bf Z}} = {\bf A}\qquad \frac{\partial {\bf Z}}{\partial {\bf X}} =
{\bf A}^{-1} \,.
$

So

$\displaystyle f_{\bf X}(x)$ $\displaystyle = f_{\bf Z}({\bf A}^{-1}(x-\boldsymbol{\mu})) \vert \det({\bf A}^{-1})\vert$    
  $\displaystyle = \frac{ \exp\{-(x-\boldsymbol{\mu})^T ({\bf A}^{-1})^T {\bf A}^{-1} (x-\boldsymbol{\mu})/2\} }{(2\pi)^{p/2} \vert\det{{\bf A}}\vert } \,.$    

Now define $ \boldsymbol{\Sigma}={\bf A}{\bf A}^T$ and notice that

$\displaystyle \boldsymbol{\Sigma}^{-1} = ({\bf A}^T)^{-1} {\bf A}^{-1} = ({\bf A}^{-1})^T {\bf A}^{-1}
$

and

$\displaystyle \det \boldsymbol{\Sigma} = \det {\bf A}\det {\bf A}^T = (\det {\bf A})^2 \,.
$

Thus $ f_{\bf X}$ is

$\displaystyle \frac{
\exp\{ -(x-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}
(x-\boldsymbol{\mu}) /2 \}}{
(2\pi)^{p/2} (\det\boldsymbol{\Sigma})^{1/2} }
\, ;
$

the $ MVN(\boldsymbol{\mu},\boldsymbol{\Sigma})$ density. Note density is the same for all $ {\bf A}$ such that $ {\bf A}{\bf A}^T=\boldsymbol{\Sigma}$. This justifies the notation $ MVN(\boldsymbol{\mu},\boldsymbol{\Sigma})$.

For which $ \boldsymbol{\mu}$, $ \boldsymbol{\Sigma}$ is this a density?

Any $ \boldsymbol{\mu}$ but if $ x \in \mathbb {R}^p$ then

$\displaystyle x^T \boldsymbol{\Sigma} x$ $\displaystyle = x^T {\bf A}{\bf A}^T x$    
  $\displaystyle = ({\bf A}^T x)^T ({\bf A}^T x)$    
  $\displaystyle = \sum_1^p y_i^2 \ge 0$    

where $ y={\bf A}^T x$. Inequality strict except for $ y=0$ which is equivalent to $ x=0$. Thus $ \boldsymbol{\Sigma}$ is a positive definite symmetric matrix.

Conversely, if $ \boldsymbol{\Sigma}$ is a positive definite symmetric matrix then there is a square invertible matrix $ {\bf A}$ such that $ {\bf A}{\bf A}^T=\boldsymbol{\Sigma}$ so that there is a $ MVN(\boldsymbol{\mu},\boldsymbol{\Sigma})$ distribution. ($ {\bf A}$ can be found via the Cholesky decomposition, e.g.)
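
For instance, a Cholesky factor can be used to simulate from $ MVN(\boldsymbol\mu,\boldsymbol\Sigma)$ (a sketch in Python with numpy; the particular $ \boldsymbol\mu$ and $ \boldsymbol\Sigma$ below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])    # positive definite

    A = np.linalg.cholesky(Sigma)                 # lower triangular, A A^T = Sigma
    Z = rng.standard_normal((2, 100000))          # columns are iid MVN_2(0, I)
    X = (A @ Z).T + mu                            # rows are iid MVN(mu, Sigma)

    print(X.mean(axis=0))                         # close to mu
    print(np.cov(X, rowvar=False))                # close to Sigma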

When $ {\bf A}$ is singular $ {\bf X}$ will not have a density: $ \exists a$ such that $ P(a^T {\bf X}= a^T
\boldsymbol{\mu}) =1$; $ {\bf X}$ is confined to a hyperplane.

Still true: distribution of $ {\bf X}$ depends only on $ \boldsymbol{\Sigma}={\bf A}{\bf A}^T$: if $ {\bf A}{\bf A}^T = BB^T$ then $ {\bf A}{\bf Z}+\boldsymbol{\mu}$ and $ B{\bf Z}+\boldsymbol{\mu}$ have the same distribution.

Expectation, moments

Defn: If $ {\bf X}\in \mathbb {R}^p$ has density $ f$ then

$\displaystyle {\rm E}(g({\bf X})) = \int g(x)f(x)\, dx \,.
$

any $ g$ from $ \mathbb {R}^p$ to $ \mathbb {R}$.

FACT: if $ Y=g(X)$ for a smooth, strictly increasing $ g$ (mapping $ \mathbb {R}\to \mathbb {R}$) then

$\displaystyle {\rm E}(Y)$ $\displaystyle = \int y f_Y(y) \, dy$    
  $\displaystyle = \int g(x) f_Y(g(x)) g^\prime(x) \, dx$    
  $\displaystyle = {\rm E}(g(X))$    

by the change of variables formula for integration (the last step uses $ f_Y(g(x))g^\prime(x) = f_X(x)$). This is good because otherwise we might have two different values for $ {\rm E}(e^X)$.

Linearity: $ {\rm E}(aX+bY) = a{\rm E}(X)+b{\rm E}(Y)$ for real $ X$ and $ Y$.

Defn: The $ r^{\rm th}$ moment (about the origin) of a real rv $ X$ is $ \mu_r^\prime={\rm E}(X^r)$ (provided it exists). We generally use $ \mu$ for $ {\rm E}(X)$.

Defn: The $ r^{\rm th}$ central moment is

$\displaystyle \mu_r = {\rm E}[(X-\mu)^r]
$

We call $ \sigma^2 = \mu_2$ the variance.

Defn: For an $ \mathbb {R}^p$ valued random vector $ {\bf X}$

$\displaystyle \boldsymbol{\mu}_{\bf X}= {\rm E}({\bf X}) $

is the vector whose $ i^{\rm th}$ entry is $ {\rm E}(X_i)$ (provided all entries exist).

Fact: same idea used for random matrices.

Defn: The ( $ p \times p$) variance covariance matrix of $ {\bf X}$ is

$\displaystyle {\rm Var}({\bf X}) = {\rm E}\left[ ({\bf X}-\boldsymbol{\mu})({\bf X}-\boldsymbol{\mu})^T \right]
$

which exists provided each component $ X_i$ has a finite second moment.

Example moments: If $ Z\sim N(0,1)$ then

$\displaystyle {\rm E}(Z)$ $\displaystyle = \int_{-\infty}^\infty z e^{-z^2/2} dz /\sqrt{2\pi}$    
  $\displaystyle = \left.\frac{-e^{-z^2/2}}{\sqrt{2\pi}}\right\vert _{-\infty}^\infty$    
  $\displaystyle = 0$    

and (integrating by parts)

$\displaystyle {\rm E}(Z^r) =$ $\displaystyle \int_{-\infty}^\infty z^r e^{-z^2/2} dz /\sqrt{2\pi}$    
$\displaystyle =$ $\displaystyle \left.\frac{-z^{r-1}e^{-z^2/2}}{\sqrt{2\pi}}\right\vert _{-\infty}^\infty$    
  $\displaystyle + (r-1) \int_{-\infty}^\infty z^{r-2} e^{-z^2/2} dz /\sqrt{2\pi}$    

so that

$\displaystyle \mu_r = (r-1)\mu_{r-2}
$

for $ r \ge 2$. Remembering that $ \mu_1=0$ and

$\displaystyle \mu_0 = \int_{-\infty}^\infty z^0 e^{-z^2/2} dz /\sqrt{2\pi}=1
$

we find that

$\displaystyle \mu_r = \left\{ \begin{array}{ll}
0 & \mbox{$r$ odd}
\\
(r-1)(r-3)\cdots 1 & \mbox{$r$ even} \,.
\end{array}\right.
$
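
These moment values are easy to confirm numerically (a sketch in Python with numpy/scipy; the helper central_moment is just for illustration):

    import numpy as np
    from scipy import integrate, stats

    def central_moment(r):
        # mu_r = E(Z^r) for Z ~ N(0,1), by numerical integration
        val, _ = integrate.quad(lambda z: z**r * stats.norm.pdf(z), -np.inf, np.inf)
        return val

    print(round(central_moment(3), 6))   # 0, an odd moment
    print(round(central_moment(4), 6))   # 3 = 3*1
    print(round(central_moment(6), 6))   # 15 = 5*3*1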

If now $ X\sim N(\mu,\sigma^2)$, that is, $ X\sim \sigma Z + \mu$, then $ {\rm E}(X) = \sigma {\rm E}(Z) + \mu = \mu$ and

$\displaystyle \mu_r(X) = {\rm E}[(X-\mu)^r] = \sigma^r {\rm E}(Z^r)
$

In particular, we see that our choice of notation $ N(\mu,\sigma^2)$ for the distribution of $ \sigma Z + \mu$ is justified; $ \sigma^2$ is indeed the variance.

Similarly for $ {\bf X}\sim MVN(\boldsymbol{\mu},\boldsymbol{\Sigma})$ we have $ {\bf X}={\bf A}{\bf Z}+\mu$ with $ {\bf Z}\sim MVN(0,I)$ and

$\displaystyle {\rm E}({\bf X}) = \boldsymbol{\mu}
$

and

$\displaystyle {\rm Var}({\bf X})$ $\displaystyle = {\rm E}\left\{({\bf X}-\boldsymbol{\mu})({\bf X}-\boldsymbol{\mu})^T\right\}$    
  $\displaystyle = {\rm E}\left\{ {\bf A}{\bf Z}({\bf A}{\bf Z})^T\right\}$    
  $\displaystyle = {\bf A}{\rm E}({\bf Z}{\bf Z}^T) {\bf A}^T$    
  $\displaystyle = {\bf A}I{\bf A}^T = \boldsymbol{\Sigma} \,.$    

Note use of easy calculation: $ {\rm E}({\bf Z})=0$ and

$\displaystyle {\rm Var}({\bf Z}) = {\rm E}({\bf Z}{\bf Z}^T) =I \,.
$

Moments and independence

Theorem: If $ X_1,\ldots,X_p$ are independent and each $ X_i$ is integrable then $ X=X_1\cdots X_p$ is integrable and

$\displaystyle {\rm E}(X_1\cdots X_p) = {\rm E}(X_1) \cdots {\rm E}(X_p) \,.
$

Moment Generating Functions

Defn: The moment generating function of a real valued $ X$ is

$\displaystyle M_X(t) = {\rm E}(e^{tX})
$

defined for those real $ t$ for which the expected value is finite.

Defn: The moment generating function of $ {\bf X}\in \mathbb {R}^p$ is

$\displaystyle M_{\bf X}(u) = {\rm E}[e^{u^T{\bf X}}]
$

defined for those vectors $ u$ for which the expected value is finite.

Example: If $ Z\sim N(0,1)$ then

$\displaystyle M_Z(t)$ $\displaystyle = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{tz-z^2/2} dz$    
  $\displaystyle = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{-(z-t)^2/2+t^2/2} dz$    
  $\displaystyle = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{-u^2/2+t^2/2} du$    
  $\displaystyle = e^{t^2/2}$    

Theorem: ($ p=1$) If $ M$ is finite for all $ t$ in a neighbourhood of $ 0$ then

  1. Every moment of $ X$ is finite.


  2. $ M$ is $ C^\infty$ (in fact $ M$ is analytic).


  3. $ \mu_k^\prime = \frac{d^k}{dt^k} M_X(0)$.


Note: $ C^\infty$ means has continuous derivatives of all orders. Analytic means has convergent power series expansion in neighbourhood of each $ t\in(-\epsilon,\epsilon)$.

The proof, and many other facts about mgfs, rely on techniques of complex variables.

Characterization & MGFs

Theorem: Suppose $ {\bf X}$ and $ {\bf Y}$ are $ \mathbb {R}^p$ valued random vectors such that

$\displaystyle M_{\bf X}({\bf u}) = M_{\bf Y}({\bf u})
$

for $ {\bf u}$ in some open neighbourhood of $ {\bf0}$ in $ \mathbb {R}^p$. Then $ {\bf X}$ and $ {\bf Y}$ have the same distribution.

The proof relies on techniques of complex variables.

MGFs and Sums

If $ X_1,\ldots,X_p$ are independent and $ Y=\sum X_i$ then mgf of $ Y$ is product mgfs of individual $ X_i$:

$\displaystyle {\rm E}(e^{tY}) = \prod_i {\rm E}(e^{tX_i})
$

or $ M_Y = \prod M_{X_i}$. (Also for multivariate $ X_i$.)

Example: If $ Z_1,\ldots,Z_p$ are independent $ N(0,1)$ then

$\displaystyle {\rm E}(e^{\sum a_i Z_i})$ $\displaystyle = \prod_i {\rm E}(e^{a_i Z_i})$    
  $\displaystyle = \prod_i e^{a_i^2/2}$    
  $\displaystyle = \exp(\sum a_i^2/2)$    

Conclusion: If $ {\bf Z}\sim MVN_p(0,I)$ then

$\displaystyle M_Z(u) = \exp(\sum u_i^2/2) = \exp({\bf u}^T {\bf u}/2).
$

Example: If $ X\sim N(\mu,\sigma^2)$ then $ X=\sigma Z + \mu$ and

$\displaystyle M_X(t) = {\rm E}(e^{t(\sigma Z+\mu)}) = e^{t\mu} e^{\sigma^2t^2/2}.
$

Theorem: Suppose $ {\bf X}= {\bf A}{\bf Z}+{\boldsymbol{\mu}}$ and $ {\bf Y}= {\bf A}^* {\bf Z}^* + {\boldsymbol{\mu^*}}$ where $ {\bf Z}\sim MVN_p(0,I)$ and $ {\bf Z}^* \sim MVN_q(0,I)$. Then $ {\bf X}$ and $ {\bf Y}$ have the same distribution if and only if the following two conditions hold:

  1. $ \boldsymbol{\mu} = \boldsymbol{\mu}^*$.


  2. $ {\bf A}{\bf A}^T = {\bf A}^*({\bf A}^*)^T$.


Alternatively: if $ {\bf X}$, $ {\bf Y}$ each MVN then $ {\rm E}({\bf X})={\rm E}({\bf Y})$ and $ {\rm Var}({\bf X}) = {\rm Var}({\bf Y})$ imply that $ {\bf X}$ and $ {\bf Y}$ have the same distribution.

Proof: If 1 and 2 hold the mgf of $ {\bf X}$ is

$\displaystyle {\rm E}\left(e^{t^T{\bf X}}\right)$ $\displaystyle = {\rm E}\left(e^{t^T({\bf A}{\bf Z}+\boldsymbol\mu)}\right)$    
  $\displaystyle = e^{t^T\boldsymbol\mu}{\rm E}\left(e^{({\bf A}^Tt)^T {\bf Z}}\right)$    
  $\displaystyle = e^{t^T\boldsymbol\mu+({\bf A}^Tt)^T({\bf A}^Tt)/2}$    
  $\displaystyle = e^{t^T\boldsymbol\mu+t^T\boldsymbol\Sigma t/2}$    

Thus $ M_{\bf X}=M_{\bf Y}$. Conversely if $ {\bf X}$ and $ {\bf Y}$ have the same distribution then they have the same mean and variance.

Thus mgf is determined by $ \boldsymbol\mu$ and $ \boldsymbol\Sigma$.

Theorem: If $ {\bf X}\sim MVN_p(\mu,\boldsymbol{\Sigma})$ then there is $ {\bf A}$ a $ p \times p$ matrix such that $ {\bf X}$ has same distribution as $ {\bf A}{\bf Z}+\boldsymbol{\mu}$ for $ {\bf Z}\sim MVN_p(0,I)$.

We may assume that $ {\bf A}$ is symmetric and non-negative definite, or that $ {\bf A}$ is upper triangular, or that $ {\bf A}$ is lower triangular.

Proof: Pick any $ {\bf A}$ such that $ {\bf A}{\bf A}^T=\boldsymbol{\Sigma}$ such as $ {\bf P}{\bf D}^{1/2} {\bf P}^T$ from the spectral decomposition. Then $ {\bf A}{\bf Z}+\boldsymbol{\mu}\sim MVN_p(\boldsymbol{\mu},\boldsymbol{\Sigma})$.

From the symmetric square root can produce an upper triangular square root by the Gram Schmidt process: if $ {\bf A}$ has rows $ a_1^T,\ldots,a_p^T$ then let $ v_p$ be $ a_p/\sqrt{a_p^T a_p}$. Choose $ v_{p-1}$ proportional to $ a_{p-1} -bv_p$ where $ b = a_{p-1}^T v_p$ so that $ v_{p-1}$ has unit length. Continue in this way; you automatically get $ a_j^T v_k = 0$ if $ j < k$. If $ {\bf P}$ has columns $ v_1,\ldots,v_p$ then $ {\bf P}$ is orthogonal and $ {\bf A}{\bf P}$ is an upper triangular square root of $ \boldsymbol\Sigma$.

Variances, Covariances, Correlations

Defn: The covariance between $ {\bf X}$ and $ {\bf Y}$ is

$\displaystyle {\rm Cov}({\bf X},{\bf Y}) = {\rm E}\left\{
({\bf X}-\boldsymbol\mu_{\bf X})({\bf Y}-\boldsymbol\mu_{\bf Y})^T\right\}
$

This is a matrix.

Properties:

Properties of the $ MVN$ distribution

1: All margins are multivariate normal: if

$\displaystyle {\bf X}= \left[\begin{array}{c} {\bf X}_1\\  {\bf X}_2\end{array} \right]
$

$\displaystyle \mu = \left[\begin{array}{c} \mu_1\\  \mu_2\end{array} \right]
$

and

$\displaystyle \boldsymbol{\Sigma} = \left[\begin{array}{cc} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12}
\\
\boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{array} \right]
$

then $ {\bf X}\sim MVN(\boldsymbol{\mu},\boldsymbol{\Sigma}) \Rightarrow {\bf X}_1\sim MVN(\boldsymbol{\mu}_1,\boldsymbol{\Sigma}_{11})$.

2: $ {\bf M}{\bf X}+\boldsymbol{\nu} \sim MVN({\bf M}\boldsymbol{\mu}+\boldsymbol{\nu}, {\bf M} \boldsymbol{\Sigma} {\bf M}^T)$: affine transformation of MVN is normal.

3: If

$\displaystyle \boldsymbol{\Sigma}_{12} = {\rm Cov}({\bf X}_1,{\bf X}_2) = {\bf0}
$

then $ {\bf X}_1$ and $ {\bf X}_2$ are independent.

4: All conditionals are normal: the conditional distribution of $ {\bf X}_1$ given $ {\bf X}_2=x_2$ is $ MVN(\boldsymbol{\mu}_1+\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(x_2-\boldsymbol{\mu}_2),
\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})$.

Proof of (1): If $ {\bf X}={\bf A}{\bf Z}+\boldsymbol{\mu}$ then

$\displaystyle {\bf X}_1 = \left[ I \vert {\bf0}\right] {\bf X}
$

for $ I$ the identity matrix of correct dimension.

So

$\displaystyle {\bf X}_1 = \left( \left[ I \vert {\bf0}\right]{\bf A}\right) {\bf Z}+ \left[ I \vert {\bf0}\right]
\boldsymbol{\mu}
$

Compute mean and variance to check rest.

Proof of ( 2): If $ {\bf X}={\bf A}{\bf Z}+\boldsymbol{\mu}$ then

$\displaystyle {\bf M}{\bf X}+\boldsymbol{\nu}= {\bf M}{\bf A}{\bf Z}
+ \boldsymbol{\nu} +{\bf M}\boldsymbol{\mu}$

Proof of ( 3): If

$\displaystyle {\bf u} = \left[\begin{array}{c}{\bf u}_1 \\  {\bf u}_2\end{array}\right]
$

then, since $ \boldsymbol\Sigma_{12}={\bf0}$, the quadratic form $ {\bf u}^T\boldsymbol\Sigma{\bf u}$ splits as $ {\bf u}_1^T\boldsymbol\Sigma_{11}{\bf u}_1 + {\bf u}_2^T\boldsymbol\Sigma_{22}{\bf u}_2$ and the mgf factors:

$\displaystyle M_{\bf X}({\bf u}) = M_{{\bf X}_1}({\bf u}_1)M_{{\bf X}_2}({\bf u}_2)
$

Proof of ( 4): first case: assume $ \boldsymbol{\Sigma}_{22}$ has an inverse.

Define

$\displaystyle {\bf W}= {\bf X}_1 - \boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}{\bf X}_2
$

Then

$\displaystyle \left[\begin{array}{c}
{\bf W}\\  {\bf X}_2\end{array}\right]
=
\left[\begin{array}{cc}
{\bf I}& -\boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}
\\
{\bf0}& {\bf I}\end{array}\right]
\left[\begin{array}{c}
{\bf X}_1 \\  {\bf X}_2\end{array}\right]
$

Thus $ ({\bf W},{\bf X}_2)^T$ is $ MVN$ with mean $ (\boldsymbol\mu_1-\boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}\boldsymbol\mu_2,\boldsymbol\mu_2)^T$ and variance $ \boldsymbol\Sigma^*$ where

$\displaystyle \boldsymbol\Sigma^* = \left[\begin{array}{cc}
\boldsymbol\Sigma_{11}-\boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}\boldsymbol\Sigma_{21} & {\bf0} \\
{\bf0}& \boldsymbol\Sigma_{22}\end{array}\right]
$

Now the joint density of $ {\bf W}$ and $ {\bf X}_2$ factors

$\displaystyle f_{{\bf W},{\bf X}_2}(w,x_2) = f_{{\bf W}}(w)f_{{\bf X}_2}(x_2)
$

By change of variables joint density of $ {\bf X}$ is

$\displaystyle f_{{\bf X}_1,{\bf X}_2}(x_1,x_2) = cf_{{\bf W}}(x_1-{\bf M}x_2)f_{{\bf X}_2}(x_2)
$

where $ c=1$ is the constant Jacobian of the linear transformation from $ ({\bf W},{\bf X}_2)$ to $ ({\bf X}_1,{\bf X}_2)$ and

$\displaystyle {\bf M}= \boldsymbol\Sigma_{12}
\boldsymbol\Sigma_{22}^{-1}
$

Thus conditional density of $ {\bf X}_1$ given $ {\bf X}_2=x_2$ is

$\displaystyle \frac{f_{{\bf W}}(x_1-{\bf M}x_2)f_{{\bf X}_2}(x_2)}{f_{{\bf X}_2}(x_2)}
= f_{{\bf W}}(x_1-{\bf M}x_2)
$

As a function of $ x_1$ this density has the form of the advertised multivariate normal density.
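
A rough simulation check of the conditional mean and variance formula in the bivariate case (a sketch in Python with numpy; conditioning on $ X_2$ in a narrow window is only an approximation to conditioning on $ X_2 = x_2$, and the parameter values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(7)
    rho = 0.7
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal([0.0, 0.0], Sigma, size=200000)

    x2 = 1.0
    sel = np.abs(X[:, 1] - x2) < 0.02        # keep draws with X_2 near x2
    print(X[sel, 0].mean(), rho * x2)        # conditional mean mu_1 + S12 S22^{-1} (x2 - mu_2)
    print(X[sel, 0].var(), 1 - rho**2)       # conditional variance S11 - S12 S22^{-1} S21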

Specialization to bivariate case:

Write

$\displaystyle \boldsymbol\Sigma = \left[\begin{array}{cc} \sigma_1^2 & \rho\sigma_1\sigma_2
\\
\rho\sigma_1\sigma_2
& \sigma_2^2\end{array}\right]
$

where we define

$\displaystyle \rho = \frac{{\rm Cov}(X_1,X_2)}{\sqrt{{\rm Var}(X_1){\rm Var}(X_2)}}
$

Note that

$\displaystyle \sigma_i^2 = {\rm Var}(X_i)
$

Then

$\displaystyle W= X_1 - \rho\frac{\sigma_1}{\sigma_2} X_2
$

is independent of $ X_2$. The marginal distribution of $ W$ is $ N(\mu_1-\rho\sigma_1 \mu_2/\sigma_2,\tau^2)$ where

$\displaystyle \tau^2 =$ $\displaystyle {\rm Var}(X_1) - 2\rho\frac{\sigma_1}{\sigma_2}{\rm Cov}(X_1,X_2)$    
  $\displaystyle + \left(\rho\frac{\sigma_1}{\sigma_2}\right)^2 {\rm Var}(X_2)$    

This simplifies to

$\displaystyle \sigma_1^2(1-\rho^2)
$

Notice that it follows that

$\displaystyle -1 \le \rho \le 1
$

More generally: any $ X$ and $ Y$:

0 $\displaystyle \le {\rm Var}(X-\lambda Y)$    
  $\displaystyle = {\rm Var}(X) - 2 \lambda{\rm Cov}(X,Y) + \lambda^2 {\rm Var}(Y)$    

RHS is minimized at

$\displaystyle \lambda = \frac{{\rm Cov}(X,Y)}{{\rm Var}(Y)}
$

The minimum value is $ {\rm Var}(X)(1 - \rho_{XY}^2)$, so

$\displaystyle 0 \le {\rm Var}(X) (1 - \rho_{XY}^2)
$

where

$\displaystyle \rho_{XY} =\frac{{\rm Cov}(X,Y)}{\sqrt{{\rm Var}(X){\rm Var}(Y)}}
$

defines the correlation between $ X$ and $ Y$.

Multiple Correlation
Now suppose $ {\bf X}_2$ is scalar but $ {\bf X}_1$ is vector.

Defn: Multiple correlation between $ {\bf X}_1$ and $ X_2$

$\displaystyle R^2({\bf X}_1,X_2) = \max\vert\rho_{{\bf a}^T{\bf X}_1,X_2}\vert^2
$

over all $ {\bf a}\neq 0$.

Thus: maximize

$\displaystyle \frac{{\rm Cov}^2({\bf a}^T {\bf X}_1,X_2)}{{\rm Var}({\bf a}^T{\bf X}_1){\rm Var}(X_2)}
= \frac{\left({\bf a}^T \boldsymbol\Sigma_{12}\right)^2}{\left({\bf a}^T \boldsymbol\Sigma_{11} {\bf a}\right)
\boldsymbol\Sigma_{22}}
$

Put $ b=\boldsymbol\Sigma_{11}^{1/2}{\bf a}$. For $ \boldsymbol\Sigma_{11}$ invertible problem is equivalent to maximizing

$\displaystyle \frac{{\bf b}^T {\bf Q}{\bf b}}{{\bf b}^T {\bf b}}
$

where

$\displaystyle {\bf Q}= \boldsymbol\Sigma_{11}^{-1/2}\boldsymbol\Sigma_{12}\boldsymbol\Sigma_{21}\boldsymbol\Sigma_{11}^{-1/2}
$

Solution: find largest eigenvalue of $ {\bf Q}$.

Note

$\displaystyle {\bf Q}= {\bf v}{\bf v}^T
$

where

$\displaystyle {\bf v}=
\boldsymbol\Sigma_{11}^{-1/2}
\boldsymbol\Sigma_{12}
$

is a vector. Set

$\displaystyle {\bf v}{\bf v}^T {\bf x} = \lambda {\bf x}
$

and multiply by $ {\bf v}^T$ to get

$\displaystyle {\bf v}^T{\bf x} = 0$    or $\displaystyle \lambda = {\bf v}^T {\bf v}
$

If $ {\bf v}^T{\bf x}=0$ then we see $ \lambda=0$ so largest eigenvalue is $ {\bf v}^T{\bf v}$.

Summary: maximum squared correlation is

$\displaystyle R^2({\bf X}_1,X_2) = \frac{ {\bf v}^T {\bf v}}{\boldsymbol\Sigma_{22}}
= \frac{\boldsymbol\Sigma_{21}\boldsymbol\Sigma_{11}^{-1}
\boldsymbol\Sigma_{12}
}{\boldsymbol\Sigma_{22}}
$

Achieved when eigenvector is $ {\bf x}={\bf v}={\bf b}$ so

$\displaystyle {\bf a}=\boldsymbol\Sigma_{11}^{-1/2}\boldsymbol\Sigma_{11}^{-1/2}
\boldsymbol\Sigma_{12}=\boldsymbol\Sigma_{11}^{-1}\boldsymbol\Sigma_{12}
$

Notice: since $ R^2$ is a squared correlation between two scalars ($ {\bf a}^T {\bf X}_1$ and $ X_2$) we have

$\displaystyle 0 \le R^2 \le 1
$

Equals 1 iff $ X_2$ is a linear combination of the entries of $ {\bf X}_1$.
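
A numerical check that the closed form agrees with the squared correlation achieved by $ {\bf a}=\boldsymbol\Sigma_{11}^{-1}\boldsymbol\Sigma_{12}$ (a sketch in Python with numpy; the 3 by 3 covariance matrix is an arbitrary positive definite example with $ X_2$ as the last coordinate):

    import numpy as np

    Sigma = np.array([[2.0, 0.5, 0.8],
                      [0.5, 1.0, 0.3],
                      [0.8, 0.3, 1.5]])
    S11, S12, S22 = Sigma[:2, :2], Sigma[:2, 2], Sigma[2, 2]

    R2 = S12 @ np.linalg.inv(S11) @ S12 / S22      # Sigma_21 Sigma_11^{-1} Sigma_12 / Sigma_22
    a = np.linalg.inv(S11) @ S12                   # maximizing coefficient vector
    corr2 = (a @ S12)**2 / ((a @ S11 @ a) * S22)   # squared correlation of a^T X_1 with X_2
    print(R2, corr2)                               # the two values agree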

Correlation matrices, partial correlations:

Correlation between two scalars $ X$ and $ Y$ is

$\displaystyle \rho_{XY} =\frac{{\rm Cov}(X,Y)}{\sqrt{{\rm Var}(X){\rm Var}(Y)}}
$

If $ {\bf X}$ has variance $ \boldsymbol\Sigma$ then the correlation matrix of $ {\bf X}$ is $ {\bf R}_{\bf X}$ with entries

$\displaystyle R_{ij} = \frac{{\rm Cov}(X_i,X_j)}{\sqrt{{\rm Var}(X_i){\rm Var}(X_j)}}
= \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\Sigma_{jj}}}
$

If $ {\bf X}_1,{\bf X}_2$ are MVN with the usual partitioned variance covariance matrix then the conditional variance of $ {\bf X}_1$ given $ {\bf X}_2$ is

$\displaystyle \boldsymbol\Sigma_{11\cdot 2} = \boldsymbol\Sigma_{11} -
\boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}\boldsymbol\Sigma_{21}
$

From this define partial correlation matrix

$\displaystyle ({\bf R}_{11\cdot 2})_{ij} = \frac{(\boldsymbol\Sigma_{11\cdot 2})_{ij}
}{\sqrt{(\boldsymbol\Sigma_{11\cdot 2})_{ii}(\boldsymbol\Sigma_{11\cdot 2})_{jj}}}
$

Note: these are used even when $ {\bf X}_1,{\bf X}_2$ are NOT MVN.
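
Computing $ \boldsymbol\Sigma_{11\cdot 2}$ and the partial correlations from a given covariance matrix (a sketch in Python with numpy; the matrix and the 2 + 1 partition are illustrative):

    import numpy as np

    Sigma = np.array([[4.0, 1.2, 2.0],
                      [1.2, 3.0, 1.5],
                      [2.0, 1.5, 2.5]])           # positive definite example
    S11, S12, S22 = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]

    S11_2 = S11 - S12 @ np.linalg.inv(S22) @ S12.T     # conditional variance Sigma_{11.2}
    d = np.sqrt(np.diag(S11_2))
    R11_2 = S11_2 / np.outer(d, d)                     # partial correlation matrix
    print(R11_2)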

Likelihood Methods of Inference

Given data $ X$ with model $ \{f_\theta(x);\theta\in\Theta\}$:

Definition: The likelihood function is the map $ L$ with domain $ \Theta$ and values given by

$\displaystyle L(\theta) = f_\theta(X)
$

Key Point: think about how the density depends on $ \theta$ not about how it depends on $ X$.

Notice: $ X$, observed value of the data, has been plugged into the formula for density.

We use likelihood for most inference problems:

  1. Point estimation: we must compute an estimate $ \hat\theta =
\hat\theta(X)$ which lies in $ \Theta$. The maximum likelihood estimate (MLE) of $ \theta$ is the value $ \hat\theta $ which maximizes $ L(\theta)$ over $ \theta\in \Theta$ if such a $ \hat\theta $ exists.

  2. Point estimation of a function of $ \theta$: we must compute an estimate $ \hat\phi = \hat\phi(X)$ of $ \phi=g(\theta)$. We use $ \hat\phi=g(\hat\theta)$ where $ \hat\theta $ is the MLE of $ \theta$.

  3. Interval (or set) estimation. We must compute a set $ C=C(X)$ in $ \Theta$ which we think will contain $ \theta_0$. We will use

    $\displaystyle \{\theta\in\Theta: L(\theta) > c\}
$

    for a suitable $ c$.

  4. Hypothesis testing: decide whether or not $ \theta_0\in\Theta_0$ where $ \Theta_0 \subset \Theta$. We base our decision on the likelihood ratio

    $\displaystyle \frac{\sup\{L(\theta); \theta \in \Theta_0\}}{
\sup\{L(\theta); \theta \in \Theta\setminus\Theta_0\}}
$

Maximum Likelihood Estimation

To find MLE maximize $ L$.

Typical function maximization problem:

Set gradient of $ L$ equal to 0

Check root is maximum, not minimum or saddle point.

Often $ L$ is product of $ n$ terms (given $ n$ independent observations).

Much easier to work with logarithm of $ L$: log of product is sum and logarithm is monotone increasing.

Definition: The Log Likelihood function is

$\displaystyle \ell(\theta) = \log\{L(\theta)\} \,.
$

Samples from MVN Population

Simplest problem: collect replicate measurements $ {\bf X}_1,\ldots,{\bf X}_n$ from single population.

Model: $ {\bf X}_i$ are iid $ MVN_p(\boldsymbol\mu,
\boldsymbol\Sigma)$.

Parameters ($ \theta$): $ (\boldsymbol\mu, \boldsymbol\Sigma)$. Parameter space: $ \boldsymbol\mu\in \mathbb {R}^p$ and $ \boldsymbol\Sigma$ is some positive definite $ p \times p$ matrix.

Log likelihood is

$\displaystyle \ell(\boldsymbol\mu, \boldsymbol\Sigma) = -np\log(2\pi)/2 - n\log\det\boldsymbol\Sigma / 2
- \sum ({\bf X}_i -\boldsymbol\mu)^T\boldsymbol\Sigma^{-1}({\bf X}_i -\boldsymbol\mu)/2
$

Take derivatives.

$\displaystyle \frac{\partial\ell}{\partial\boldsymbol\mu}$ $\displaystyle = \boldsymbol\Sigma^{-1}\left\{\sum({\bf X}_i-\boldsymbol\mu)\right\}$    
  $\displaystyle = n\boldsymbol\Sigma^{-1}(\bar{{\bf X}}-\boldsymbol\mu)$    

where $ \bar{{\bf X}}= \sum {\bf X}_i/n$. Second derivative wrt $ \boldsymbol\mu$ is a matrix:

$\displaystyle -n\boldsymbol\Sigma^{-1}
$

Fact: if second derivative matrix is negative definite at critical point then critical point is a maximum.

Fact: if second derivative matrix is negative definite everywhere then function is concave; no more than 1 critical point.

Summary: $ \ell$ is maximized at

$\displaystyle \hat{\boldsymbol\mu} = \bar{{\bf X}}
$

(regardless of choice of $ \boldsymbol\Sigma$).

More difficult: differentiate $ \ell$ wrt $ \boldsymbol\Sigma$.

Somewhat simpler: set $ {\bf D}=\boldsymbol\Sigma^{-1}$

First derivative wrt $ {\bf D}$ is matrix with entries

$\displaystyle \frac{\partial\ell}{\partial{\bf D}_{ij}}
$

Warning: method used ignores symmetry of $ \boldsymbol\Sigma$.

Need: derivative of two functions:

$\displaystyle \frac{\partial\log\det{\bf A}}{\partial{\bf A}} = {\bf A}^{-1}
$

and

$\displaystyle \frac{\partial{\bf x}^T{\bf A}{\bf x}}{\partial{\bf A}} = {\bf x}{\bf x}^T
$

Fact: the $ (i,j)$ entry of $ {\bf A}^{-1}$ is

$\displaystyle (-1)^{i+j}\frac{\det({\bf A}^{(ji)})}{\det{{\bf A}}}
$

where $ {\bf A}^{(ij)}$ denotes the matrix obtained from $ {\bf A}$ by removing row $ i$ and column $ j$. (For the symmetric matrices used here $ {\bf A}^{(ji)}$ and $ {\bf A}^{(ij)}$ have the same determinant.)

Fact: $ \det({\bf A})= \sum_k (-1)^{i+k} A_{ik} \det({\bf A}^{(ik)})$; expansion by minors.

Conclusion

$\displaystyle \frac{\partial\log\det{\bf A}}{\partial A_{ij}} = ({\bf A}^{-1})_{ij}
$

and

$\displaystyle \frac{\partial\log\det{\bf A}^{-1}}{\partial A_{ij}} = -({\bf A}^{-1})_{ij}
$

Implication

$\displaystyle \frac{\partial\ell}{\partial{\bf D}} = n\boldsymbol\Sigma/2
-\sum_i({\bf X}_i - \boldsymbol\mu)({\bf X}_i - \boldsymbol\mu)^T/2
$

Set = 0 and find only critical point is

$\displaystyle \hat{\boldsymbol\Sigma} = \sum_i({\bf X}_i - \bar{{\bf X}})({\bf X}_i - \bar{{\bf X}})^T/n
$

Usual sample covariance matrix is

$\displaystyle {\bf S} = \sum_i({\bf X}_i - \bar{{\bf X}})({\bf X}_i - \bar{{\bf X}})^T/(n-1)
$
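
The two estimates differ only in the divisor, as a short computation shows (a sketch in Python with numpy; the simulated data set is illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    X = rng.multivariate_normal(mean=[0.0, 1.0, 2.0],
                                cov=np.diag([1.0, 2.0, 3.0]), size=n)   # n rows, p = 3 columns

    mu_hat = X.mean(axis=0)                          # MLE of mu (the sample mean vector)
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / n            # MLE of Sigma (divide by n)
    S = centered.T @ centered / (n - 1)              # usual sample covariance matrix
    print(np.allclose(S, np.cov(X, rowvar=False)))   # True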

Properties of MLEs:

1) $ \bar{{\bf X}}\sim MVN_p(\boldsymbol\mu,n^{-1}\boldsymbol\Sigma)$

2) $ {\rm E}({\bf S}) = \boldsymbol\Sigma$.

Distribution of $ {\bf S}$? Joint distribution of $ \bar{{\bf X}}$ and $ {\bf S}$?

Univariate Normal samples: Distribution Theory

Theorem: Suppose $ X_1,\ldots,X_n$ are independent $ N(\mu,\sigma^2)$ random variables. Then

  1. $ \bar X$ (sample mean) and $ s^2$ (sample variance) are independent.

  2. $ n^{1/2}(\bar{X} - \mu)/\sigma \sim N(0,1)$.

  3. $ (n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$.

  4. $ n^{1/2}(\bar{X} - \mu)/s \sim t_{n-1}$.

Proof: Let $ Z_i=(X_i-\mu)/\sigma$.

Then $ Z_1,\ldots,Z_n$ are independent $ N(0,1)$.

So $ Z=(Z_1,\ldots,Z_n)^T$ is multivariate standard normal.

Note that $ \bar{X} = \sigma\bar{Z}+\mu$ and $ s^2 = \sum(X_i-\bar{X})^2/(n-1) = \sigma^2 \sum(Z_i-\bar{Z})^2/(n-1)$. Thus

$\displaystyle \frac{n^{1/2}(\bar{X}-\mu)}{\sigma} = n^{1/2}\bar{Z}
$

$\displaystyle \frac{(n-1)s^2}{\sigma^2} = \sum(Z_i-\bar{Z})^2
$

and

$\displaystyle T=\frac{n^{1/2}(\bar{X} - \mu)}{s} = \frac{n^{1/2} \bar{Z}}{s_Z}
$

where $ (n-1)s_Z^2 = \sum(Z_i-\bar{Z})^2$.

So: reduced to $ \mu=0$ and $ \sigma=1$.

Step 1: Define

$\displaystyle Y=(\sqrt{n}\bar{Z}, Z_1-\bar{Z},\ldots,Z_{n}-\bar{Z})^T \,.
$

(So $ Y$ has dimension $ n+1$.) Now

$\displaystyle Y =\left[\begin{array}{cccc}
\frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \cdots & \frac{1}{\sqrt{n}}
\\
1-\frac{1}{n} & -\frac{1}{n} & \cdots & -\frac{1}{n}
\\
-\frac{1}{n} & 1-\frac{1}{n} & \cdots & -\frac{1}{n}
\\
\vdots & & & \vdots
\\
-\frac{1}{n} & -\frac{1}{n} & \cdots & 1-\frac{1}{n}
\end{array}\right]
\left[\begin{array}{c}
Z_1 \\
Z_2 \\
\vdots
\\
Z_n
\end{array}\right]
$

or letting $ {\bf M}$ denote the matrix

$\displaystyle Y={\bf M}Z \,.
$

It follows that $ Y\sim MVN(0,{\bf M}{\bf M}^T)$ so we need to compute $ {\bf M}{\bf M}^T$:

$\displaystyle {\bf M}{\bf M}^T$ $\displaystyle = \left[\begin{array}{c\vert cccc} 1 & 0 & 0 & \cdots & 0 \\ \hline
0 & 1-\frac{1}{n} & -\frac{1}{n} & \cdots & -\frac{1}{n} \\
0 & -\frac{1}{n} & 1-\frac{1}{n} & \cdots & -\frac{1}{n} \\
\vdots & \vdots & & \ddots & \vdots \\
0 & -\frac{1}{n} & -\frac{1}{n} & \cdots & 1-\frac{1}{n} \end{array} \right]$    
  $\displaystyle = \left[\begin{array}{c\vert c} 1 & 0 \\ \hline \\ 0 & {\bf Q} \end{array} \right] \,.$    

Put $ {\bf Y}_2=(Y_2,\ldots,Y_{n+1})$. Since

$\displaystyle {\rm Cov}(Y_1,{\bf Y}_2) = 0
$

conclude $ Y_1$ and $ {\bf Y}_2$ are independent and each is normal.

Thus $ \sqrt{n}\bar{Z}$ is independent of $ Z_1-\bar{Z},\ldots,Z_{n}-\bar{Z}$.

Since $ s_Z^2$ is a function of $ Z_1-\bar{Z},\ldots,Z_{n}-\bar{Z}$ we see that $ \sqrt{n}\bar{Z}$ and $ s_Z^2$ are independent.

Also, see $ \sqrt{n}\bar{Z}\sim N(0,1)$.

First 2 parts done.

Consider $ (n-1)s^2/\sigma^2 = {\bf Y}_2^T{\bf Y}_2 $. Note that $ {\bf Y}_2 \sim MVN(0,{\bf Q})
$.

Now: distribution of quadratic forms:

Suppose $ Z\sim MVN(0,{\bf I})$ and $ {\bf A}$ is symmetric. Put $ {\bf A}= {\bf P}{\bf D}{\bf P}^T$ for $ {\bf D}$ diagonal, $ {\bf P}$ orthogonal.

Then

$\displaystyle {\bf Z}^T{\bf A}{\bf Z}= ({\bf Z}^*)^T {\bf D}{\bf Z}^*
$

where

$\displaystyle {\bf Z}^* = {\bf P}^T{\bf Z}
$

But $ {\bf Z}^*\sim MVN(0,{\bf P}^T{\bf P}) = MVN(0,{\bf I})$, so $ {\bf Z}^*$ is standard multivariate normal.

So: $ {\bf Z}^T{\bf A}{\bf Z}$ has same distribution as

$\displaystyle \sum_i \lambda_i Z_i^2
$

where $ \lambda_1,\ldots,\lambda_n$ are eigenvalues of $ {\bf A}$.

Special case: if all $ \lambda_i$ are either 0 or 1 then $ {\bf Z}^T{\bf A}{\bf Z}$ has a chi-squared distribution with df = number of $ \lambda_i$ equal to 1.

When are eigenvalues all 1 or 0?

Answer: if and only if $ {\bf A}$ is idempotent.

1) If $ {\bf A}$ is idempotent and $ (\lambda, x)$ is an eigenpair then

$\displaystyle {\bf A}x = \lambda x
$

and

$\displaystyle {\bf A}x = {\bf A}{\bf A}x = \lambda {\bf A}x = \lambda^2 x
$

so

$\displaystyle (\lambda-\lambda^2) x =0
$

proving $ \lambda$ is 0 or 1.

2) Conversely if all eigenvalues of $ {\bf A}$ are 0 or 1 then $ {\bf D}$ has 1s and 0s on diagonal so

$\displaystyle {\bf D}^2 = {\bf D}
$

and

$\displaystyle {\bf A}{\bf A}= {\bf P}{\bf D}{\bf P}^T {\bf P}{\bf D}{\bf P}^T
= {\bf P}{\bf D}^2 {\bf P}^T = {\bf P}{\bf D}{\bf P}^T = {\bf A}
$

Next case: $ {\bf X}\sim MVN_p(0,\boldsymbol\Sigma)$. Then $ {\bf X}={\bf A}{\bf Z}$ with $ {\bf A}{\bf A}^T=\boldsymbol\Sigma$.

Since $ {\bf X}^T{\bf X}= {\bf Z}^T {\bf A}^T{\bf A}{\bf Z}$ it has the law

$\displaystyle \sum \lambda_i Z_i^2
$

$ \lambda_i$ are eigenvalues of $ {\bf A}^T{\bf A}$. But

$\displaystyle {\bf A}^T{\bf A}x = \lambda x
$

implies

$\displaystyle {\bf A}{\bf A}^T{\bf A}x= \boldsymbol\Sigma {\bf A}x = \lambda{\bf A}x
$

So eigenvalues are those of $ \boldsymbol\Sigma$ and $ {\bf X}^T{\bf X}$ is $ \chi^2_\nu$ iff $ \boldsymbol\Sigma$ is idempotent and $ {\rm trace}(\boldsymbol\Sigma)=\nu$.

Our case: $ \boldsymbol\Sigma={\bf Q}= {\bf I}- {\bf 1}{\bf 1}^T/n$. Check $ {\bf Q}^2 = {\bf Q}$. How many degrees of freedom: $ {\rm trace}({\bf D})$.

Defn: The trace of a square matrix $ {\bf A}$ is

$\displaystyle {\rm trace}({\bf A}) = \sum {\bf A}_{ii}
$

Property: $ {\rm trace}({\bf A}{\bf B})={\rm trace}({\bf B}{\bf A})$.

So:

$\displaystyle {\rm trace}({\bf A})$ $\displaystyle ={\rm trace}({\bf P}{\bf D}{\bf P}^T)$    
  $\displaystyle ={\rm trace}({\bf D}{\bf P}^T{\bf P}) = {\rm trace}({\bf D})$    

Conclusion: df for $ (n-1)s^2/\sigma^2$ is

$\displaystyle {\rm trace}( {\bf I}- {\bf 1}{\bf 1}^T/n) = n-1.
$
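
A quick simulation agrees with this degrees-of-freedom count (a sketch in Python with numpy/scipy; sample size and quantiles chosen arbitrarily):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    n, reps = 8, 50000
    Z = rng.standard_normal((reps, n))
    stat = (n - 1) * Z.var(axis=1, ddof=1)      # (n-1) s^2 / sigma^2 with sigma = 1

    qs = [0.25, 0.5, 0.75, 0.95]
    print(np.quantile(stat, qs))                # simulated quantiles
    print(stats.chi2.ppf(qs, df=n - 1))         # chi-squared(n-1) quantiles, close to the above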

Derivation of the $ \chi^2$ density:

Suppose $ Z_1,\ldots,Z_n$ independent $ N(0,1)$. Define $ \chi^2_n$ distribution to be that of $ U=Z_1^2 + \cdots + Z_n^2$. Define angles $ \theta_1,\ldots,\theta_{n-1}$ by

$\displaystyle Z_1$ $\displaystyle = U^{1/2} \cos\theta_1$    
$\displaystyle Z_2$ $\displaystyle = U^{1/2} \sin\theta_1\cos\theta_2$    
$\displaystyle \vdots$ $\displaystyle = \vdots$    
$\displaystyle Z_{n-1}$ $\displaystyle = U^{1/2} \sin\theta_1\cdots \sin\theta_{n-2}\cos\theta_{n-1}$    
$\displaystyle Z_n$ $\displaystyle = U^{1/2} \sin\theta_1\cdots \sin\theta_{n-1} \,.$    

(Spherical co-ordinates in $ n$ dimensions. The $ \theta$ values run from 0 to $ \pi$, except the last $ \theta$, which runs from 0 to $ 2\pi$.) Derivative formulas:

$\displaystyle \frac{\partial Z_i}{\partial U} = \frac{1}{2U} Z_i
$

and

$\displaystyle \frac{\partial Z_i}{\partial\theta_j} =
\left\{ \begin{array}{ll}
0 & j > i
\\
-Z_i\tan\theta_i & j=i
\\
Z_i\cot\theta_j & j < i \,.
\end{array}\right.
$

Fix $ n=3$ to clarify the formulas. Use shorthand $ R=\sqrt{U}$.

Matrix of partial derivatives is

$\displaystyle \left[\begin{array}{ccc}
\frac{\cos\theta_1}{2R}
&
-R \sin\theta_1
&
0
\\
\frac{\sin\theta_1\cos\theta_2}{2R}
&
R \cos\theta_1\cos\theta_2
&
-R \sin\theta_1\sin\theta_2
\\
\frac{\sin\theta_1\sin\theta_2}{2R}
&
R \cos\theta_1\sin\theta_2
&
R \sin\theta_1\cos\theta_2
\end{array}\right] \,.
$

Find determinant:

$\displaystyle U^{1/2}\sin(\theta_1)/2
$

(non-negative for all $ U$ and $ \theta_1$). General $ n$: every term in the first column contains a factor $ U^{-1/2}/2$ while every other entry has a factor $ U^{1/2}$.

FACT: multiplying a column in a matrix by $ c$ multiplies the determinant by $ c$.

SO: Jacobian of transformation is

$\displaystyle u^{(n-1)/2} \times u^{-1/2}/2 \times h(\theta_1,\ldots,\theta_{n-1})
$

for some function, $ h$, which depends only on the angles.

Thus joint density of $ U,\theta_1,\ldots \theta_{n-1}$ is

$\displaystyle (2\pi)^{-n/2} \exp(-u/2) u^{(n-2)/2}h(\theta_1, \cdots, \theta_{n-1}) / 2 \,.
$

To compute the density of $ U$ we must do an $ n-1$ dimensional multiple integral $ d\theta_{n-1}\cdots d\theta_1$.

Answer has the form

$\displaystyle cu^{(n-2)/2} \exp(-u/2)
$

for some $ c$.

Evaluate $ c$ by making

$\displaystyle \int f_U(u) du$ $\displaystyle = c \int_0^\infty u^{(n-2)/2} \exp(-u/2) du$    
  $\displaystyle =1.$    

Substitute $ y=u/2$, $ du=2dy$ to see that

$\displaystyle c 2^{n/2} \int_0^\infty y^{(n-2)/2}e^{-y} dy$ $\displaystyle = c 2^{n/2} \Gamma(n/2)$    
  $\displaystyle =1.$    

CONCLUSION: the $ \chi^2_n$ density is

$\displaystyle \frac{1}{2\Gamma(n/2)} \left(\frac{u}{2}\right)^{(n-2)/2} e^{-u/2} 1(u>0) \,.
$

Fourth part: consequence of first 3 parts and def'n of $ t_\nu$ distribution.

Defn: $ T\sim t_\nu$ if $ T$ has same distribution as

$\displaystyle Z/\sqrt{U/\nu}
$

for $ Z\sim N(0,1)$, $ U\sim\chi^2_\nu$ and $ Z,U$ independent.

Derive density of $ T$ in this definition:

$\displaystyle P(T \le t)$ $\displaystyle = P( Z \le t\sqrt{U/\nu})$    
  $\displaystyle = \int_0^\infty \int_{-\infty}^{t\sqrt{u/\nu}} f_Z(z)f_U(u) dz du$    

Differentiate wrt $ t$ by differentiating inner integral:

$\displaystyle \frac{\partial}{\partial t}\int_{at}^{bt} f(x)dx
=
bf(bt)-af(at)
$

by fundamental thm of calculus. Hence

$\displaystyle \frac{d}{dt} P(T \le t) =
\int_0^\infty \frac{f_U(u)}{
\sqrt{2\pi}} \left(\frac{u}{\nu}\right)^{1/2}
\exp\left(-\frac{t^2u}{2\nu}\right) du \,.
$

Plug in

$\displaystyle f_U(u)= \frac{1}{2\Gamma(\nu/2)}(u/2)^{(\nu-2)/2} e^{-u/2}
$

to get

$\displaystyle f_T(t) = \frac{\int_0^\infty (u/2)^{(\nu-1)/2}
e^{-u(1+t^2/\nu)/2}
du
}{2\sqrt{\pi\nu}\Gamma(\nu/2)} \,.
$

Substitute $ y=u(1+t^2/\nu)/2$, to get

$\displaystyle dy=(1+t^2/\nu)du/2$

$\displaystyle (u/2)^{(\nu-1)/2}= [y/(1+t^2/\nu)]^{(\nu-1)/2}$

leading to

$\displaystyle f_T(t) = \frac{(1+t^2/\nu)^{-(\nu+1)/2}
}{\sqrt{\pi\nu}\Gamma(\nu/2)}
\int_0^\infty y^{(\nu-1)/2} e^{-y} dy
$

or

$\displaystyle f_T(t)= \frac{\Gamma((\nu+1)/2)}{\sqrt{\pi\nu}\Gamma(\nu/2)}\frac{1}{(1+t^2/\nu)^{(\nu+1)/2}} \,.
$

Multivariate Normal samples: Distribution Theory

Theorem: Suppose $ {\bf X}_1,\ldots,{\bf X}_n$ are independent $ MVN_p(\boldsymbol\mu,\boldsymbol\Sigma)$ random vectors. Then

  1. $ \bar {\bf X}$ (sample mean) and $ {\bf S}$ (sample variance-covariance matrix) are independent.

  2. $ n^{1/2}(\bar{{\bf X}} - \boldsymbol\mu)\sim MVN_p(0,\boldsymbol\Sigma)$.

  3. $ (n-1){\bf S}\sim {\rm Wishart}_p(n-1,\boldsymbol\Sigma)$.

  4. $ T^2=n(\bar{{\bf X}} - \boldsymbol\mu)^T{\bf S}^{-1}(\bar{{\bf X}} - \boldsymbol\mu)$ is Hotelling's $ T^2$. $ (n-p)T^2/(p(n-1))$ has an $ F_{p,n-p}$ distribution.

Proof: Let $ {\bf X}_i={\bf A}{\bf Z}_i+\boldsymbol\mu$ where $ {\bf A}{\bf A}^T=\boldsymbol\Sigma$ and $ {\bf Z}_1,\ldots,{\bf Z}_n$ are independent $ MVN_p(0,{\bf I})$.

So $ {\bf Z}=({\bf Z}_1^T,\ldots,{\bf Z}_n^T)^T \sim MVN_{np}(0,{\bf I})$.

Note that $ \bar{{\bf X}} = {\bf A}\bar{{\bf Z}}+\boldsymbol\mu$ and

$\displaystyle (n-1){\bf S}$ $\displaystyle = \sum({\bf X}_i-\bar{{\bf X}})({\bf X}_i-\bar{{\bf X}})^T$    
  $\displaystyle = {\bf A}\sum({\bf Z}_i-\bar{{\bf Z}})({\bf Z}_i-\bar{{\bf Z}})^T{\bf A}^T$    

Thus

$\displaystyle n^{1/2}(\bar{X}-\mu) = {\bf A}n^{1/2}\bar{{\bf Z}}
$

and

$\displaystyle T^2 =\left(n^{1/2}\bar{{\bf Z}}\right)^T {\bf S}_{\bf Z}^{-1}\left(n^{1/2}\bar{{\bf Z}}\right)
$

where

$\displaystyle {\bf S}_Z= \sum({\bf Z}_i-\bar{{\bf Z}})({\bf Z}_i-\bar{{\bf Z}})^T/(n-1).$

Consequences. In 1, 2 and 4: can assume $ \boldsymbol\mu=0$ and $ \boldsymbol\Sigma={\bf I}$. In 3 can take $ \boldsymbol\mu=0$. Step 1: Do general $ \boldsymbol\Sigma$. Define

$\displaystyle {\bf Y}=(\sqrt{n}\bar{{\bf Z}}^T, {\bf Z}_1^T-\bar{{\bf Z}}^T,\ldots,
{\bf Z}_{n}^T-\bar{{\bf Z}}^T)^T \,.
$

(So $ {\bf Y}$ has dimension $ p( n+1)$.) Clearly $ {\bf Y}$ is $ MVN$ with mean 0.

Compute variance covariance matrix

$\displaystyle \left[\begin{array}{cc}
{\bf I}_{p\times p} & 0 \\
0 & {\bf Q}^*
\end{array}\right]
$

where $ {\bf Q}^*$ has a pattern: it is an $ n \times n$ array of $ p \times p$ blocks, with block $ (i,j)$ given by

$\displaystyle {\rm Cov}({\bf Z}_i- \bar{{\bf Z}},{\bf Z}_j- \bar{{\bf Z}})$ $\displaystyle = \begin{cases}-\boldsymbol\Sigma/n & i \neq j \\ (n-1)\boldsymbol\Sigma/n & i=j \end{cases}$    
  $\displaystyle = {\bf Q}_{ij}\boldsymbol\Sigma$    

Kronecker Products

Defn: If $ {\bf A}$ is $ p\times q$ and $ {\bf B}$ is $ r \times s$ then $ {\bf A}\bigotimes{\bf B}$ is the $ pr \times qs$ matrix with the pattern

$\displaystyle \left[\begin{array}{cccc}
{\bf A}_{11}{\bf B}& {\bf A}_{12}{\bf B}& \cdots & {\bf A}_{1q}{\bf B}\\
{\bf A}_{21}{\bf B}& {\bf A}_{22}{\bf B}& \cdots & {\bf A}_{2q}{\bf B}\\
\vdots & \vdots & & \vdots \\
{\bf A}_{p1}{\bf B}& {\bf A}_{p2} {\bf B}& \cdots & {\bf A}_{pq}{\bf B}
\end{array}\right]
$

So our variance covariance matrix is

$\displaystyle {\bf Q}^* = {\bf Q}\bigotimes \boldsymbol\Sigma
$

Conclusions so far:

1) $ \bar{{\bf X}}$ and $ {\bf S}$ are independent.

2) $ \sqrt{n}(\bar{{\bf X}} - \boldsymbol\mu ) \sim
MVN(0,\boldsymbol\Sigma)
$

Next: Wishart law.

Defn: The $ {\rm Wishart}_p(n,\boldsymbol\Sigma)$ distribution is the distribution of

$\displaystyle \sum_1^n {\bf Z}_i{\bf Z}_i^T
$

where $ {\bf Z}_1,\ldots,{\bf Z}_n$ are iid $ MVN_p(0,\boldsymbol\Sigma)$.
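
Simulating directly from this definition and checking the mean (a sketch in Python with numpy; it uses the fact that $ E({\bf Z}_i{\bf Z}_i^T)=\boldsymbol\Sigma$, so the Wishart mean is $ n\boldsymbol\Sigma$; the parameter values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    p, n, reps = 2, 5, 20000
    Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])
    A = np.linalg.cholesky(Sigma)

    draws = np.zeros((reps, p, p))
    for r in range(reps):
        Z = A @ rng.standard_normal((p, n))     # columns are iid MVN_p(0, Sigma)
        draws[r] = Z @ Z.T                      # one Wishart_p(n, Sigma) draw
    print(draws.mean(axis=0))                   # close to n * Sigma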

Properties of Wishart.

1) If $ {\bf A}{\bf A}^T=\boldsymbol\Sigma$ then

$\displaystyle {\rm Wishart}_p(n,\boldsymbol\Sigma) = {\bf A}\,{\rm Wishart}_p(n,{\bf I})\,{\bf A}^T
$

2) if $ {\bf W}_i, i=1,2$ independent $ {\rm Wishart}_p(n_i,\boldsymbol\Sigma)$ then

$\displaystyle {\bf W}_1+{\bf W}_2 \sim {\rm Wishart}_p(n_1+n_2,\boldsymbol\Sigma).
$

Proof of part 3: rewrite

$\displaystyle \sum ({\bf Z}_i-\bar{{\bf Z}})({\bf Z}_i-\bar{{\bf Z}})^T
$

in form

$\displaystyle \sum_{i=1}^{n-1} {\bf U}_i{\bf U}_i^T
$

for $ {\bf U}_i$ iid $ MVN_p(0,\boldsymbol\Sigma)$. Put $ {\bf Z}_1,\ldots,{\bf Z}_n$ as cols in matrix $ {\bf Z}$ which is $ p\times n$. Then check that

$\displaystyle {\bf Z}{\bf Q}{\bf Z}^T = \sum ({\bf Z}_i-\bar{{\bf Z}})({\bf Z}_i-\bar{{\bf Z}})^T
$

Write $ {\bf Q}= \sum {\bf v}_i {\bf v}_i^T$ for $ n-1$ orthogonal unit vectors $ {\bf v}_1,\ldots,{\bf v}_{n-1}$. Define

$\displaystyle {\bf U}_i = {\bf Z}{\bf v}_i
$

and compute covariances to check that the $ {\bf U}_i$ are iid $ MVN_p(0,\boldsymbol\Sigma)$. Then check that

$\displaystyle {\bf Z}{\bf Q}{\bf Z}^T = \sum{\bf U}_i{\bf U}_i^T
$
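A quick numerical check of the identity $ {\bf Z}{\bf Q}{\bf Z}^T = \sum({\bf Z}_i-\bar{{\bf Z}})({\bf Z}_i-\bar{{\bf Z}})^T$, in S/R syntax (names illustrative):

p <- 3; n <- 6
Z <- matrix(rnorm(p * n), p, n)           # columns are the Z_i
Q <- diag(n) - matrix(1, n, n)/n          # the centering matrix
Zbar <- apply(Z, 1, mean)
direct <- (Z - Zbar) %*% t(Z - Zbar)      # sum of centered outer products
via.Q <- Z %*% Q %*% t(Z)
max(abs(direct - via.Q))                  # numerically 0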

Proof of 4: suffices to have $ \boldsymbol\Sigma={\bf I}$.

Uses further props of Wishart distribution.

3: If $ {\bf W}\sim Wishart_p(n,\boldsymbol\Sigma)$ and $ {\bf a}\in \mathbb {R}^p$, $ {\bf a}\neq 0$, then

$\displaystyle \frac{{\bf a}^T {\bf W}{\bf a}}{{\bf a}^T\boldsymbol\Sigma {\bf a}} \sim \chi_n^2
$

4: If $ {\bf W}\sim Wishart_p(n,\boldsymbol\Sigma)$ and $ n \ge p$ then

$\displaystyle \frac{
{\bf a}^T\boldsymbol\Sigma^{-1} {\bf a}
}{
{\bf a}^T {\bf W}^{-1} {\bf a}
} \sim \chi_{n-p+1}^2
$

5: If $ {\bf W}\sim Wishart_p(n,\boldsymbol\Sigma)$ then

$\displaystyle {\rm trace}(\boldsymbol\Sigma^{-1}{\bf W}) \sim \chi_{np}^2
$

6: If $ {\bf W}\sim Wishart_{p+q}(n,\boldsymbol\Sigma)$ is partitioned into blocks $ {\bf W}_{11}$ ($ p\times p$), $ {\bf W}_{12}$, $ {\bf W}_{21}$ and $ {\bf W}_{22}$ ($ q\times q$) then

$\displaystyle {\bf W}_{11} - {\bf W}_{12}{\bf W}_{22}^{-1} {\bf W}_{21}
\sim Wishart_{p}(n-q,\boldsymbol\Sigma_{11.2})
$

where $ \boldsymbol\Sigma_{11.2}=\boldsymbol\Sigma_{11}-\boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}\boldsymbol\Sigma_{21}$.

One sample tests on mean vectors

Given data $ {\bf X}_1,\ldots,{\bf X}_n$ iid $ MVN(\boldsymbol\mu,\boldsymbol\Sigma)$ test

$\displaystyle H_o: \boldsymbol\mu=\boldsymbol\mu_o
$

by computing

$\displaystyle T^2 = n(\bar{{\bf X}} -\boldsymbol\mu_o)^T {{\bf S}}^{-1}(\bar{{\bf X}} -\boldsymbol\mu_o)
$

and getting $ P$-values from $ F$ distribution using theorem.
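A small S/R sketch of this recipe (hotelling.T2 is an illustrative name, not a built-in function); applied to the difference vectors w constructed in the session below it should reproduce the hand computation there ($ T^2\approx 13.64$, $ F\approx 6.14$, $ P\approx 0.021$):

hotelling.T2 <- function(x, mu0 = rep(0, ncol(x))) {
  x <- as.matrix(x)
  n <- nrow(x); p <- ncol(x)
  xbar <- apply(x, 2, mean)
  S <- var(x)
  Tsq <- n * sum((xbar - mu0) * solve(S, xbar - mu0))  # n (xbar-mu0)' S^{-1} (xbar-mu0)
  Fstat <- (n - p) * Tsq / (p * (n - 1))
  list(Tsq = Tsq, F = Fstat, p.value = 1 - pf(Fstat, p, n - p))
}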

Example: no realistic ones. This hypothesis is not intrinsically useful. However: other tests can sometimes be reduced to it.

Example: Eleven water samples split in half. One half of each to each of two labs. Measure biological oxygen demand (BOD) and suspended solids (SS). For sample $ i$ let $ X_{i1}$ be BOD for lab A, $ X_{i2}$ be SS for lab A, $ X_{i3}$ be BOD for lab B and $ X_{i4}$ be SS for lab B. Question: are labs measuring the same thing? Is there bias in one or the other?

Notation $ {\bf X}_i$ is vector of 4 measurements on sample $ i$.

Data:

               Lab A        Lab B
Sample        BOD   SS     BOD   SS
   1            6   27      25   15
   2            6   23      28   13
   3           18   64      36   22
   4            8   44      35   29
   5           11   30      15   31
   6           34   75      44   64
   7           28   26      42   30
   8           71  124      54   64
   9           43   54      34   56
  10           33   30      29   20
  11           20   14      39   21

Model: $ {\bf X}_1,\ldots,{\bf X}_{11}$ are iid $ MVN_4(\boldsymbol\mu,\boldsymbol\Sigma)$.

This is a multivariate problem because we cannot assume independence between any two measurements made on the same sample.

Potential sub-model: each measurement is

true measurement + lab bias + measurement error.

Model for the measurement error vector $ {\bf U}_i$: multivariate normal with mean 0 and diagonal covariance matrix $ \boldsymbol\Sigma_{\bf U}$.

Lab bias is unknown vector $ \boldsymbol\beta$.

True measurement should be same for both labs so has form

$\displaystyle [Y_{i1},Y_{i2},Y_{i1},Y_{i2}]
$

where $ Y_{i1},Y_{i2}$ are iid bivariate normal with unknown means $ \theta_1,\theta_2$ and unknown $ 2 \times 2$ variance covariance $ \boldsymbol\Sigma_{\bf Y}$.

This would give structured model

$\displaystyle {\bf X}_i = {\bf C}{\bf Y}_i+ \boldsymbol\beta + {\bf U}_i
$

where

$\displaystyle {\bf C}= \left[\begin{array}{cc} 1 & 0 \\  0 & 1 \\  1 & 0 \\  0 &
1\end{array}\right]
$

This model has variance covariance matrix

$\displaystyle \boldsymbol\Sigma_{\bf X}=
{\bf C}\boldsymbol\Sigma_{\bf Y}{\bf C}^T + \boldsymbol\Sigma_{\bf U}
$

Notice that this matrix has only 7 parameters: four for the diagonal entries in $ \boldsymbol\Sigma_{\bf U}$ and 3 for the entries in $ \boldsymbol\Sigma_{\bf Y}$.
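A short sketch of this structured covariance in S/R syntax; the numerical values of $ \boldsymbol\Sigma_{\bf Y}$ and $ \boldsymbol\Sigma_{\bf U}$ below are made up purely for illustration:

C <- rbind(diag(2), diag(2))                  # the 4 x 2 matrix C above
Sigma.Y <- matrix(c(4, 2, 2, 5), 2, 2)        # 3 free parameters
Sigma.U <- diag(c(1, 2, 1.5, 2.5))            # 4 free parameters
Sigma.X <- C %*% Sigma.Y %*% t(C) + Sigma.U   # 7 parameters in all
Sigma.X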

We skip this model and let $ \boldsymbol\Sigma_{\bf X}$ be unrestricted.

Question of interest:

$\displaystyle H_o: \mu_1=\mu_3$    and $\displaystyle \mu_2 = \mu_4
$

Reduction: partition $ {\bf X}_i$ as

$\displaystyle \left[\begin{array}{c} {\bf U}_i \\  {\bf V}_i \end{array}\right]
$

where $ {\bf U}_i$ and $ {\bf V}_i$ each have two components (re-using the letter $ {\bf U}$; these are not the measurement errors above).

Define $ {\bf W}_i={\bf U}_i-{\bf V}_i$. Then our model makes $ {\bf W}_i$ iid $ MVN_2(\boldsymbol\mu_{\bf W}, \boldsymbol\Sigma_{\bf W})$ with $ \boldsymbol\mu_{\bf W}=(\mu_1-\mu_3,\mu_2-\mu_4)^T$. Our hypothesis is

$\displaystyle H_o: \boldsymbol\mu_{\bf W}= 0
$

Carrying out our test in SPlus:

Working on CSS unix workstation:

Start SPlus then read in, print out data:


[61]ehlehl% mkdir .Data
[62]ehlehl% Splus
S-PLUS : Copyright (c) 1988, 1996 MathSoft, Inc.
S : Copyright AT&T.
Version 3.4 Release 1 for Sun SPARC, SunOS 5.3 : 1996 
Working data will be in .Data 
> # Read in and print out data
> eff <- read.table("effluent.dat",header=T)
> eff
   BODLabA SSLabA BODLabB SSLabB 
 1       6     27      25     15
 2       6     23      28     13
 3      18     64      36     22
 4       8     44      35     29
 5      11     30      15     31
 6      34     75      44     64
 7      28     26      42     30
 8      71    124      54     64
 9      43     54      34     56
10      33     30      29     20
11      20     14      39     21
Do some graphical preliminary analysis.

Look for non-normality, non-linearity, outliers.

Make plots on screen or saved in file.


> # Make pairwise scatterplots on screen using
> # motif graphics device and then in a postscript
> # file.
> motif()
> pairs(eff)
> postscript("pairs.ps",horizontal=F,
+   height=6,width=6)
> pairs(eff)
> dev.off()
Generated postscript file "pairs.ps".
 motif 
     2
[Figure: pairwise scatterplot matrix of the four variables (pairs.ps).]
Examine correlations

> cor(eff)
          BODLabA    SSLabA   BODLabB    SSLabB 
BODLabA 0.9999999 0.7807413 0.7228161 0.7886035
 SSLabA 0.7807413 1.0000000 0.6771183 0.7896656
BODLabB 0.7228161 0.6771183 1.0000001 0.6038079
 SSLabB 0.7886035 0.7896656 0.6038079 1.0000001
Notice high correlations.

Mostly caused by variation in true levels from sample to sample.

Get partial correlations.

Adjust for overall BOD and SS content of sample.


> aug <- cbind(eff,(eff[,1]+eff[,3])/2,
+                  (eff[,2]+eff[,4])/2)
> aug
   BODLabA SSLabA BODLabB SSLabB   X2   X3 
 1       6     27      25     15 15.5 21.0
 2       6     23      28     13 17.0 18.0
 3      18     64      36     22 27.0 43.0
 4       8     44      35     29 21.5 36.5
 5      11     30      15     31 13.0 30.5
 6      34     75      44     64 39.0 69.5
 7      28     26      42     30 35.0 28.0
 8      71    124      54     64 62.5 94.0
 9      43     54      34     56 38.5 55.0
10      33     30      29     20 31.0 25.0
11      20     14      39     21 29.5 17.5
> bigS <- var(aug)

Now compute partial correlations for first four variables given means of BOD and SS:


> S11 <- bigS[1:4,1:4]
> S12 <- bigS[1:4,5:6]
> S21 <- bigS[5:6,1:4]
> S22 <- bigS[5:6,5:6]
> S11dot2 <- S11 - S12 %*% solve(S22,S21)
> S11dot2
           BODLabA     SSLabA    BODLabB     SSLabB 
BODLabA  24.804665  -7.418491 -24.804665   7.418491
 SSLabA  -7.418491  59.142084   7.418491 -59.142084
BODLabB -24.804665   7.418491  24.804665  -7.418491
 SSLabB   7.418491 -59.142084  -7.418491  59.142084
> S11dot2SD <- diag(sqrt(diag(S11dot2)))
> S11dot2SD
         [,1]     [,2]     [,3]     [,4] 
[1,] 4.980428 0.000000 0.000000 0.000000
[2,] 0.000000 7.690389 0.000000 0.000000
[3,] 0.000000 0.000000 4.980428 0.000000
[4,] 0.000000 0.000000 0.000000 7.690389
> R11dot2 <- solve(S11dot2SD)%*% 
+           S11dot2%*%solve(S11dot2SD)
> R11dot2
          [,1]      [,2]      [,3]      [,4] 
[1,]  1.000000 -0.193687 -1.000000  0.193687
[2,] -0.193687  1.000000  0.193687 -1.000000
[3,] -1.000000  0.193687  1.000000 -0.193687
[4,]  0.193687 -1.000000 -0.193687  1.000000
Notice: the $ \pm 1$ entries are automatic (after adjusting for the sample averages, the two BOD residuals are exact negatives of each other, and similarly for SS); the remaining entries show little residual correlation between BOD and SS. Carry out Hotelling's $ T^2$ test of $ H_0: \boldsymbol\mu_{\bf W}=0$.

> w <- eff[,1:2]-eff[3:4]
> dimnames(w)<-list(NULL,c("BODdiff","SSdiff"))
> w
      BODdiff SSdiff 
 [1,]     -19     12
 [2,]     -22     10
  etc
 [8,]      17     60
  etc
> Sw <- var(w)
> cor(w)
          BODdiff    SSdiff 
BODdiff 1.0000001 0.3057682
 SSdiff 0.3057682 1.0000000
> mw <- apply(w,2,mean)
> mw
   BODdiff   SSdiff 
 -9.363636 13.27273
> Tsq <- 11*mw%*%solve(Sw,mw)
> Tsq
         [,1] 
[1,] 13.63931
> FfromTsq <- (11-2)*Tsq/(2*(11-1))
> FfromTsq
        [,1] 
[1,] 6.13769
> 1-pf(FfromTsq,2,9)
[1] 0.02082779
Conclusion: Pretty clear evidence of difference in mean level between labs. Which measurement causes the difference?

> TBOD <- sqrt(11)*mw[1]/sqrt(Sw[1,1])
> TBOD
   BODdiff 
 -2.200071
> 2*pt(TBOD,1)
   BODdiff 
 0.2715917
> 2*pt(TBOD,10)
    BODdiff 
 0.05243474
> TSS <- sqrt(11)*mw[2]/sqrt(Sw[2,2])
> TSS
  SSdiff 
 2.15153
> 2*pt(-TSS,10)
     SSdiff 
 0.05691733
> postscript("differences.ps",
+      horizontal=F,height=6,width=6)
> plot(w)
> abline(h=0)
> abline(v=0)
> dev.off()
Conclusion? Neither univariate test is significant at the 0.05 level on its own. That is not a problem: each statistic simply summarizes the evidence.

Problem: we have carried out several tests at level 0.05 on the same data. This raises the issue of simultaneous or multiple comparisons.

[Figure: scatterplot of the BOD and SS differences with reference lines at 0 (differences.ps).]
In general can test hypothesis $ H_o: {\bf C}\boldsymbol\mu=0$ by computing $ {\bf Y}_i = {\bf C}{\bf X}_i$ and then testing $ H_o: \boldsymbol\mu_{\bf Y}= 0$ using Hotelling's $ T^2$.

Simultaneous confidence intervals

Confidence interval for $ {\bf a}^T \boldsymbol\mu$:

$\displaystyle {\bf a}^T\bar{{\bf X}} \pm t_{n-1,\alpha/2}\sqrt{{\bf a}^T {\bf S}{\bf a}/n}
$

Based on $ t$ distribution.

Give coverage intervals for 6 parameters of interest: the 4 entries in $ \boldsymbol\mu$ together with $ \mu_1-\mu_3$ and $ \mu_2-\mu_4$:

$ \mu_1$: $ 25.27 \pm 2.23\times 19.68/\sqrt{11}$
$ \mu_2$: $ 46.45 \pm 2.23\times 31.84/\sqrt{11}$
$ \mu_3$: $ 34.64 \pm 2.23\times 10.45/\sqrt{11}$
$ \mu_4$: $ 33.18 \pm 2.23\times 19.07/\sqrt{11}$
$ \mu_1-\mu_3$: $ -9.36 \pm 2.23\times 14.12/\sqrt{11}$
$ \mu_2-\mu_4$: $ 13.27 \pm 2.23\times 20.46/\sqrt{11}$

Problem: each confidence interval has a 5% error rate. If we pick out the last interval (because it looks the most interesting), what is its error rate?

Solution: adjust 2.23, $ t$ multiplier to get

$\displaystyle P($all intervals cover truth$\displaystyle ) \ge 0.95
$

Rao or Scheffé type intervals

Based on inequality:

$\displaystyle \left\vert{\bf a}^T {\bf b} \right\vert^2
\le
{\bf a}^T {\bf M} {\bf a}{\bf b} ^T{\bf M}^{-1}{\bf b}
$

for any symmetric non-singular matrix $ {\bf M}$.

Proof by Cauchy-Schwarz: apply the inequality to the inner product of the vectors $ {\bf M}^{1/2} {\bf a}$ and $ {\bf M}^{-1/2} {\bf b}$.

Put $ {\bf b} = n^{1/2}(\bar{{\bf X}} - \boldsymbol\mu)
$ and $ {\bf M}= {\bf S}$ to get

$\displaystyle \left\vert n^{1/2}({\bf a}^T\bar{{\bf X}} - {\bf a}^T\boldsymbol\mu)\right\vert^2
\le {\bf a}^T{\bf S}{\bf a} T^2
$

This inequality is true for all $ \bf a$. Thus the event that there is any $ \bf a$ such that

$\displaystyle \frac{n({\bf a}^T\bar{{\bf X}} - {\bf a}^T\boldsymbol\mu)^2}{{\bf a}^T{\bf S}{\bf a}} > c
$

is a subset of the event

$\displaystyle T^2 > c
$

Adjust $ c$ to make the latter event have probability $ \alpha$ by taking

$\displaystyle c=\frac{p(n-1)}{n-p}F_{p,n-p,\alpha}.
$

Then the probability that every one of the uncountably many confidence intervals

$\displaystyle {\bf a}^T \bar{{\bf X}} \pm \sqrt{c} \sqrt{{\bf a}^T{\bf S}{\bf a}/n}
$

covers the corresponding true parameter value is at least $ 1-\alpha$.

In fact the probability of this happening is exactly equal to $ 1-\alpha$ because for each data set the supremum of

$\displaystyle \frac{n({\bf a}^T\bar{{\bf X}} - {\bf a}^T\boldsymbol\mu)^2}{{\bf a}^T{\bf S}{\bf a} }
$

over all $ \bf a$ is $ T^2$.

Our case

$\displaystyle \sqrt{c} = \sqrt{\frac{4(10)}{7} F_{4,7,0.05}} = 4.85
$

Coverage probability of a single interval using $ \sqrt{c}=4.85$? From the $ t$ distribution:

$\displaystyle 99.93\%
$

Probability that all 6 intervals cover, using $ \sqrt{c}=4.85$?

Use Bonferroni inequality:

$\displaystyle P(\cup A_i) \le \sum P(A_i)
$

Simultaneous coverage probability of the 6 intervals using $ \sqrt{c}=4.85$:

$\displaystyle \ge 1-6\times(1-0.9993) = 99.59\%
$

Usually we just use the Bonferroni multiplier

$\displaystyle t_{n-1,\alpha/12} =3.28
$

in place of $ \sqrt{c}$.

General Bonferroni strategy: if we want intervals for $ \theta_1, \ldots,\theta_k$, get the interval for $ \theta_i$ at level $ 1-\alpha/k$. The simultaneous coverage probability is then at least $ 1-\alpha$. Notice that Bonferroni is narrower in our example unless $ 0.0007k > 0.05$, giving $ k > 71$.
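A quick check of the multipliers quoted above, in S/R syntax, for $ p=4$ and $ n=11$:

p <- 4; n <- 11
sqrt(p * (n - 1)/(n - p) * qf(0.95, p, n - p))  # Scheffe multiplier, about 4.85
qt(1 - 0.05/12, n - 1)                          # Bonferroni multiplier for 6 intervals, about 3.28
2 * (1 - pt(4.85, n - 1))                       # two-sided error rate of one Scheffe interval, about 0.0007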

Motivations for $ T^2$:

1: Hypothesis $ H_o: \boldsymbol\mu=0$ is true iff all hypotheses $ H_{o{\bf a}}: {\bf a}^T\boldsymbol\mu=0$ are true. Natural test for $ H_{o{\bf a}}$ rejects if

$\displaystyle t({\bf a}) = \frac{n^{1/2}{\bf a}^T(\bar{{\bf X}}-\boldsymbol\mu)}{
\sqrt{{\bf a}^T {\bf S}{\bf a}}}
$

is large. So we reject for large values of the largest such statistic.

Fact:

$\displaystyle \sup_{\bf a} t^2({\bf a}) = T^2
$

Proof: like calculation of maximal correlation.
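For completeness, the same Cauchy-Schwarz inequality used for the Scheffé intervals gives the supremum directly:

$\displaystyle \sup_{\bf a} t^2({\bf a})
= \sup_{\bf a}\frac{n\left\{{\bf a}^T(\bar{{\bf X}}-\boldsymbol\mu)\right\}^2}{{\bf a}^T {\bf S}{\bf a}}
= n(\bar{{\bf X}}-\boldsymbol\mu)^T{\bf S}^{-1}(\bar{{\bf X}}-\boldsymbol\mu) = T^2
$

with the supremum attained at $ {\bf a}\propto {\bf S}^{-1}(\bar{{\bf X}}-\boldsymbol\mu)$.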

2: likelihood ratio method.

Compute

$\displaystyle \ell(\hat{\boldsymbol\mu},\hat{\boldsymbol\Sigma}) - \ell(
\hat{\boldsymbol\mu}_o,\hat{\boldsymbol\Sigma}_o)
$

where the subscript $ o$ indicates estimation assuming $ H_o$.

In our case to test $ H_o: \boldsymbol\mu=0$ find

$\displaystyle \hat{\boldsymbol\mu}_o = 0 \quad \hat{\boldsymbol\Sigma}_o = \sum
{\bf X}_i{\bf X}_i^T/n
$

and

$\displaystyle \ell(\hat{\boldsymbol\mu},\hat{\boldsymbol\Sigma}) - \ell(
\hat{\boldsymbol\mu}_o,\hat{\boldsymbol\Sigma}_o)
= n \log\{\det(\hat{\boldsymbol\Sigma}_o)/\det(\hat{\boldsymbol\Sigma})\}/2
$

Now write

$\displaystyle \sum {\bf X}_i{\bf X}_i^T = \sum({\bf X}_i-\bar{{\bf X}})({\bf X}_i-\bar{{\bf X}})^T +
n\bar{{\bf X}}\bar{{\bf X}}^T
$

Use formula:

$\displaystyle \det({\bf A}+{\bf v}{\bf v}^T) = \det({\bf A})(1+{\bf v}^T{\bf A}^{-1}{\bf v})
$

to get

$\displaystyle \det(n\hat{\boldsymbol\Sigma}_0) =\det(n\hat{\boldsymbol\Sigma})(
1+n\bar{{\bf X}}^T\left(n\hat{\boldsymbol\Sigma}\right)^{-1}\bar{{\bf X}})
$

so that the ratio of determinants is a monotone increasing function of $ T^2$.

Again conclude: likelihood ratio test rejects for $ T^2>c$ where $ c$ chosen to make level $ \alpha$.

3: compare estimates of $ \boldsymbol\Sigma$.

In univariate regression $ F$ tests to compare a restricted model with a full model have form

$\displaystyle \frac{ESS_\text{Restricted}-ESS_\text{Full}}{ESS_\text{Full}}\frac{df_\text{Error}}{df_\text{difference}}
$

This is a monotone function of

$\displaystyle \frac{\hat\sigma^2_\text{Restricted}}{\hat\sigma^2_\text{Full}}
$

where $ ESS$ denotes an error sum of squares and $ \hat\sigma^2$ an estimate of the residual variance, namely $ ESS/df$.

Here: substitute matrices.

Analogue of ESS for full model:

$\displaystyle {\bf E}$

Analogue of ESS for reduced model:

$\displaystyle {\bf E}+{\bf H}
$

(This defines $ {\bf H}$ to be the change in the Sum of Squares and Cross Products matrix.)

In 1 sample example:

$\displaystyle {\bf E}= \sum ({\bf X}_i-\bar{{\bf X}})({\bf X}_i-\bar{{\bf X}})^T
$

and

$\displaystyle {\bf E}+{\bf H}= \sum {\bf X}_i{\bf X}_i^T
$

Test of $ \boldsymbol\mu=0$ based on comparing

$\displaystyle {\bf H}=\sum {\bf X}_i{\bf X}_i^T - \sum ({\bf X}_i-\bar{{\bf X}})({\bf X}_i-\bar{{\bf X}})^T = n\bar{{\bf X}}\bar{{\bf X}}^T
$

to

$\displaystyle {\bf E}= \sum ({\bf X}_i-\bar{{\bf X}})({\bf X}_i-\bar{{\bf X}})^T= (n-1){\bf S}
$

To make the comparison: if the null is true then

$\displaystyle {\rm E}(n\bar{{\bf X}}\bar{{\bf X}}^T) = \boldsymbol\Sigma
$

and

$\displaystyle {\rm E}({\bf S}) = \boldsymbol\Sigma
$

so try tests based on closeness of

$\displaystyle {\bf S}^{-1}n\bar{{\bf X}}\bar{{\bf X}}^T
$

to identity matrix.

Measures of size based on eigenvalues of

$\displaystyle {\bf S}^{-1}n\bar{{\bf X}}\bar{{\bf X}}^T
$

which are same as eigenvalues of

$\displaystyle {\bf S}^{-1/2}n\bar{{\bf X}}\bar{{\bf X}}^T{\bf S}^{-1/2}
$

Suggested measures of the size of $ {\bf A}-{\bf I}$ are based on the eigenvalues of $ {\bf A}$.

For our matrix the eigenvalues are all 0 except for one, so the matrix is really not close to $ {\bf I}$; the only useful summary is the single non-zero eigenvalue.

Largest eigenvalue is

$\displaystyle T^2 = n\bar{{\bf X}}^T{\bf S}^{-1}\bar{{\bf X}}
$

But: see the two sample problem below for precise tests based on these suggestions.

Two sample problem
Given data $ {\bf X}_{ij}; j=1,\ldots,n_i;i=1,2$. Model $ {\bf X}_{ij} \sim MVN_p(\boldsymbol\mu_i,\boldsymbol\Sigma_i)$, independent.

Test $ H_o: \boldsymbol\mu_1 = \boldsymbol\mu_2$.

Case 1: for motivation only. $ \boldsymbol\Sigma_i$ known $ i=1,2$.

Natural test statistic: based on

$\displaystyle {\bf D}= \bar{{\bf X}}_1 -\bar{{\bf X}}_2
$

which has $ MVN(\boldsymbol\mu_{\bf D},\boldsymbol\Sigma_{\bf D})$ where

$\displaystyle \boldsymbol\mu_{\bf D}=
\boldsymbol\mu_1-\boldsymbol\mu_2
$

and

$\displaystyle \boldsymbol\Sigma_{\bf D}= n_1^{-1} \boldsymbol\Sigma_1+
n_2^{-1}\boldsymbol\Sigma_2
$

So

$\displaystyle {\bf D}^T \left({\boldsymbol\Sigma}_1/n_1+{\boldsymbol\Sigma}_2/n_2\right)^{-1} {\bf D}
$

has a $ \chi_p^2$ distribution if null true.

If $ \boldsymbol\Sigma_i$ not known must estimate. No universally agreed best procedure (even for $ p=1$ -- called Behrens-Fisher problem).

Usually: assume $ \boldsymbol\Sigma_1 = \boldsymbol\Sigma_2$.

If so: MLE of $ \boldsymbol\mu_i$ is $ \bar{{\bf X}}_i$ and of $ \boldsymbol\Sigma$ is

$\displaystyle \frac{\sum_{ij} ({\bf X}_{ij}-\bar{{\bf X}}_i)({\bf X}_{ij}-\bar{{\bf X}}_i)^T}{n_1+n_2}
$

Usual estimate of $ \boldsymbol\Sigma$ is

$\displaystyle {\bf S}_{\text{Pooled}} = \frac{\sum_{ij}
({\bf X}_{ij}-\bar{{\bf X}}_i)({\bf X}_{ij}-\bar{{\bf X}}_i)^T}{n_1+n_2-2}
$

which is unbiased.

Possible test developments:

1) By analogy with 1 sample:

$\displaystyle T^2 = \frac{n_1n_2}{n_1+n_2} {\bf D}^T {\bf S}^{-1}_{\text{Pooled}} {\bf D}
$

which has the distribution

$\displaystyle \frac{n_1+n_2-1-p}{p(n_1+n_2-2)} T^2 \sim F_{p,n_1+n_2-1-p}
$

2) Union-intersection: the test of $ {\bf a}^T({\boldsymbol\mu}_1-{\boldsymbol\mu}_2) = 0$ is based on (writing $ {\bf S}$ for $ {\bf S}_{\text{Pooled}}$)

$\displaystyle t_a = \sqrt{\frac{n_1n_2}{n_1+n_2}}\frac{{\bf a}^T {\bf D}}{\sqrt{{\bf a}^T {\bf S}{\bf a}}}
$

Maximize $ t^2$ over all $ \bf a$.

Get

$\displaystyle T^2 = \frac{n_1n_2}{n_1+n_2} {\bf D}^T {\bf S}^{-1}{\bf D}
$

3) Likelihood ratio: the MLE of $ {\boldsymbol\Sigma}$ for the unrestricted model is

$\displaystyle \frac{n_1+n_2-2}{n_1+n_2} {\bf S}
$

Under the hypothesis $ {\boldsymbol\mu}_1={\boldsymbol\mu}_2$ the mle of $ {\boldsymbol\Sigma}$ is

$\displaystyle \frac{\sum_{ij} ({\bf X}_{ij} -\bar{\bar{{\bf X}}})({\bf X}_{ij} -\bar{\bar{{\bf X}}})^T}{n_1+n_2}
$

where

$\displaystyle \bar{\bar{{\bf X}}} = \frac{ n_1\bar{{\bf X}}_1 + n_2\bar{{\bf X}}_2}{n_1+n_2}
$

This simplifies to

$\displaystyle \frac{{\bf E}+{\bf H}}{n_1+n_2}
$

The log-likelihood ratio is a multiple of

$\displaystyle \log\det\hat{\boldsymbol\Sigma}_{\text{Full}} - \log\det\hat{\boldsymbol\Sigma}_{\text{Restricted}}
$

which is a function of

$\displaystyle \log\left\{\det{\bf E}/\det({\bf E}+{\bf H})\right\}
$

or equivalently a function of Wilks' $ \Lambda$:

$\displaystyle \Lambda = \frac{\det{\bf E}}{\det({\bf E}+{\bf H})} = \frac{1}{\det({\bf H}{\bf E}^{-1}+{\bf I})}
$

Compute det: multiply together eigenvalues.

If $ \lambda_i$ are the eigenvalues of $ {\bf H}{\bf E}^{-1}$ then

$\displaystyle \Lambda = \frac{1}{\prod(1+\lambda_i)}
$

Two sample analysis in SAS on css network


    data long;
    infile 'tab57sh';
    input group a b c;
run;
proc print;
run;
proc glm;
    class group;
    model a b c = group;
    manova h=group / printh printe;
run;
Notes:

1) First 4 lines form DATA step:

a) creates data set named long by reading in 4 columns of data from file named tab57sh stored in same directory as I was in when I typed sas.

b) Calls variables group (=1 or 2 as label for the two groups) and a, b, c which are names for the 3 test scores for each subject.

2) Next two lines: print out data: result is (slightly edited)


    Obs    group     a     b     c
      1      1      19    20    18
      2      1      20    21    19
      3      1      19    22    22
     etc till
     11      2      15    17    15
     12      2      13    14    14
     13      2      14    16    13
3) Then use proc glm to do analysis:

a) class group declares that the variable group defines levels of a categorical variable.

b) model statement says to regress the variables a, b, c on variable group.

c) manova statement says to do both the 3 univariate regressions and a multivariate regression and to print out the $ {\bf H}$ and $ {\bf E}$ matrices, where $ {\bf H}$ is the matrix corresponding to the presence of the factor group in the model.

Output of MANOVA: First univariate results


    The GLM Procedure
 Class Level Information
        Class         Levels    Values
        group              2    1 2
 Number of observations    13
Dependent Variable: a
                Sum of
Source    DF   Squares  Mean Square F Value Pr > F
Model      1  54.276923  54.276923   19.38 0.0011
Error     11  30.800000   2.800000
Corrd Tot 12  85.076923
R-Square   Coeff Var      Root MSE        a Mean
0.637975    10.21275      1.673320      16.38462
Source    DF  Type ISS  Mean Square F Value Pr > F
group      1  54.276923  54.276923   19.38 0.0011
Source    DF  TypeIIISS Mean Square F Value Pr > F
group      1  54.276923  54.276923   19.38 0.0011

Dependent Variable: b
                Sum of
Source    DF   Squares  Mean Square F Value Pr > F
Model      1  70.892308  70.892308  34.20 0.0001
Error     11  22.800000   2.072727
Corrd Tot 12  93.692308

Dependent Variable: c
                Sum of
Source    DF   Squares Mean Square F Value Pr > F
Model      1   94.77692  94.77692   39.64 <.0001
Error     11   26.30000   2.39090
Corrd Tot 12  121.07692
The matrices $ {\bf E}$ and $ {\bf H}$.

  E = Error SSCP Matrix
   a     b     c
a 30.8  12.2  10.2
b 12.2  22.8   3.8
c 10.2   3.8  26.3
Partial Correlation Coefficients from 
    the Error SSCP Matrix / Prob > |r|
DF = 11     a              b              c
a    1.000000       0.460381       0.358383
                      0.1320         0.2527
b    0.460381       1.000000       0.155181
       0.1320                        0.6301
c    0.358383       0.155181       1.000000
       0.2527         0.6301
H = Type III SSCP Matrix for group
     a                 b                 c
a   54.276923077   62.030769231   71.723076923
b   62.030769231   70.892307692   81.969230769
c   71.723076923   81.969230769   94.776923077
The eigenvalues of $ {\bf E}^{-1}{\bf H}$.

Characteristic Roots and Vectors of: E Inverse * H
H = Type III SSCP Matrix for group
E = Error SSCP Matrix
Characteristic       Characteristic Vector V'EV=1
Root     Percent        a           b           c
5.816159 100.00  0.00403434  0.12874606 0.13332232
0.000000   0.00 -0.09464169 -0.10311602 0.16080216
0.000000   0.00 -0.19278508  0.16868694 0.00000000
MANOVA Test Criteria and Exact F Statistics 
 for the Hypothesis of No Overall group Effect
H = Type III SSCP Matrix for group
E = Error SSCP Matrix
S=1    M=0.5    N=3.5
Statistic            Value   F   NumDF DenDF Pr > F
Wilks' Lambda       0.1467 17.45    3     9  0.0004
Pillai's Trace      0.8533 17.45    3     9  0.0004
Hotelling-Lawley Tr 5.8162 17.45    3     9  0.0004
Roy's Greatest Root 5.8162 17.45    3     9  0.0004
Things to notice:

  1. The conclusion is clear. The mean vectors for the two groups are not the same.

  2. The four statistics have the following definitions in terms of eigenvalues of $ {\bf E}^{-1}{\bf H}$:

    Wilks' Lambda:

    $\displaystyle \frac{1}{\prod(1+\lambda_i)}=\frac{1}{6.816}
$

    Pillai's trace:

    $\displaystyle {\rm trace}({\bf H}({\bf H}+{\bf E})^{-1}) = \sum\frac{\lambda_i}{1+\lambda_i} = \frac{5.816}{6.816}
$

    Hotelling-Lawley trace:

    $\displaystyle {\rm trace}({\bf H}{\bf E}^{-1}) = \sum\lambda_i = 5.816
$

    Roy's greatest Root:

    $\displaystyle \max\{\lambda_i\} = 5.816
$
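These definitions can be checked numerically from the $ {\bf E}$ and $ {\bf H}$ matrices printed by SAS above; a hedged sketch in S/R syntax:

E <- matrix(c(30.8, 12.2, 10.2,
              12.2, 22.8,  3.8,
              10.2,  3.8, 26.3), 3, 3)
H <- matrix(c(54.276923, 62.030769, 71.723077,
              62.030769, 70.892308, 81.969231,
              71.723077, 81.969231, 94.776923), 3, 3)
lam <- Re(eigen(solve(E) %*% H)$values)   # eigenvalues of E^{-1} H: about 5.816, 0, 0
1/prod(1 + lam)                           # Wilks' Lambda, about 0.1467
sum(lam/(1 + lam))                        # Pillai's trace, about 0.8533
sum(lam)                                  # Hotelling-Lawley trace, about 5.816
max(lam)                                  # Roy's greatest root, about 5.816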

1 way layout
Also called $ m$ sample problem.

Data $ {\bf X}_{ij},j=1,\ldots,n_i;i=1,\ldots, m$.

Model $ {\bf X}_{ij}$ independent $ MVN_p({\boldsymbol\mu}_i,{\boldsymbol\Sigma})$.

First problem of interest: test

$\displaystyle H_o: {\boldsymbol\mu}_1= \cdots = {\boldsymbol\mu}_m
$

Based on $ {\bf E}$ and $ {\bf H}$. MLE of $ {\boldsymbol\mu}_i$ is $ \bar{{\bf X}}_i$.

$\displaystyle {\bf E}= \sum_{ij} ({\bf X}_{ij} - \bar{{\bf X}}_i)({\bf X}_{ij} - \bar{{\bf X}}_i)^T
$

Under $ H_o$ MLE of $ {\boldsymbol\mu}$, the common value of the $ {\boldsymbol\mu}_i$ is

$\displaystyle \bar{\bar{{\bf X}}} = \frac{\sum_{ij} {\bf X}_{ij}}{\sum_i n_i}
$

So

$\displaystyle {\bf E}+{\bf H}= \sum_{ij} ({\bf X}_{ij} - \bar{\bar{{\bf X}}})({\bf X}_{ij} - \bar{\bar{{\bf X}}})^T
$

This makes

$\displaystyle {\bf H}= \sum_{ij}(\bar{{\bf X}}_i - \bar{\bar{{\bf X}}})(\bar{{\bf X}}_i - \bar{\bar{{\bf X}}})^T
$

Notice can do sum over $ j$ to get factor of $ n_i$:

$\displaystyle {\bf H}= \sum_i n_i(\bar{{\bf X}}_i - \bar{\bar{{\bf X}}})(\bar{{\bf X}}_i - \bar{\bar{{\bf X}}})^T
$

Note: the rank of $ {\bf H}$ is the minimum of $ p$ and $ m-1$. The data:

1 19 20 18
1 20 21 19
1 19 22 22
1 18 19 21
1 16 18 20
1 17 22 19
1 20 19 20
1 15 19 19
2 12 14 12
2 15 15 17
2 15 17 15
2 13 14 14
2 14 16 13
3 15 14 17
3 13 14 15
3 12 15 15
3 12 13 13
4  8  9 10
4 10 10 12
4 11 10 10
4 11  7 12
Code

    data three;
    infile 'tab57for3sams';
    input group a b c;
run;
proc print;
run;
proc glm;
    class group;
    model a b c = group;
    manova h=group / printh printe;
run;
    data four;
    infile 'table5.7';
    input group a b c;
run;
proc print;
run;
proc glm;
    class group;
    model a b c = group;
    manova h=group / printh printe;
run;
Pieces of output: first set of code does first 3 groups.

So: $ {\bf H}$ has rank 2.


Characteristic Roots & Vectors of: E Inverse * H
 Characteristic     Characteristic Vector V'EV=1
      Root Percent     a        b        c
6.90568180  96.94   0.01115  0.14375 0.08795
0.21795125   3.06  -0.07763 -0.09587 0.16926
0.00000000   0.00  -0.18231  0.13542 0.02083
S=2    M=0    N=5
Statistic         Value   F   NumDF  Den DF Pr > F
Wilks'           0.1039  8.41   6      24   <.0001
Pillai's         1.0525  4.81   6      26   0.0020
Hotelling-Lawley 7.1236 13.79   6   14.353  <.0001
Roy's            6.9057 29.92   3      13   <.0001
NOTE: F Statistic for Roy's is an upper bound.
NOTE: F Statistic for Wilks'is exact.
Notice that two eigenvalues are not 0. Note that an exact distribution for Wilks' Lambda is available. Now the 4 group analysis:

      Root Percent      a       b       c
15.3752900  98.30   0.01128  0.13817 0.08126
0.2307260    1.48  -0.04456 -0.09323 0.15451
0.0356937    0.23  -0.17289  0.09020 0.04777
S=3    M=-0.5    N=6.5
Statistic          Value    F    NumDF Den DF Pr > F
Wilks'         0.04790913 10.12    9   36.657 <.0001
Pillai's       1.16086747  3.58    9       51 0.0016
Hot'ng-Lawley 15.64170973 25.02    9   20.608 <.0001
Roy's         15.37528995 87.13    3       17 <.0001
NOTE: F Statistic for Roy's is an upper bound.

Other Hypotheses
How do mean vectors differ? One possibility:

$\displaystyle \mu_{ik} - \mu_{jk} = c_{i} -c_{j}
$

for constants $ c_{i}$ and $ c_j$ which do not depend on $ k$. This is an additive model for the means.

Test ?

Define $ \alpha = \sum_{ij} \mu_{ij} / (pm)$, where $ i$ runs over the $ m$ groups and $ j$ over the $ p$ components. Then put

$\displaystyle \beta_i$ $\displaystyle = \sum_j\mu_{ij} /p -\alpha,$    
$\displaystyle \gamma_j$ $\displaystyle = \sum_i\mu_{ij}/m -\alpha$    
$\displaystyle \tau_{ij}$ $\displaystyle = \mu_{ij} -\beta_i-\gamma_j - \alpha$    

If the $ \tau_{ij}$ are all 0 then

$\displaystyle \mu_{ik} - \mu_{jk} = \beta_i-\beta_j
$

so we test the hypothesis that all $ \tau_{ij}$ are 0.

Univariate Two Way Anova

Data $ Y_{ijk}$

$ k=1,\ldots,n_{ij};j=1,\ldots,p;i=1,\ldots,m$.

Model: independent, $ Y_{ijk}\sim N(\mu_{ij},\sigma^2)$.

Note: this is the fixed effects model.

Usual approach: define grand mean, main effects, interactions:

$\displaystyle \mu$ $\displaystyle = \sum_{ijk} \mu_{ij} /\sum_{ij}n_{ij}$    
$\displaystyle \alpha_i$ $\displaystyle = \sum_{jk}\mu_{ij} /\sum_j n_{ij} - \mu$    
$\displaystyle \beta_j$ $\displaystyle = \sum_{ik}\mu_{ij} /\sum_i n_{ij} - \mu$    
$\displaystyle \gamma_{ij}$ $\displaystyle = \mu_{ij} -(\mu+\alpha_i +\beta_j)$    

Test for additivity (no interaction): $ \gamma_{ij} = 0$ for all $ i,j$.

Usual test based on ANOVA:

Stack observations $ Y_{ijk}$ into vector $ {\bf Y}$, say.

Estimate $ \mu$, $ \alpha_i$, etc by least squares.

Form vectors with entries $ \hat\mu$, $ \hat\alpha_i$ etc.

Write

$\displaystyle {\bf Y} = \hat{\boldsymbol\mu}+ \hat{\boldsymbol\alpha}
+ \hat{\boldsymbol\beta}
+ \hat{\boldsymbol\gamma}
+ \hat{\boldsymbol\epsilon}
$

This defines the vector of fitted residuals $ \hat{\boldsymbol\epsilon}$.

Fact: all vectors on RHS are independent and orthogonal. So:

$\displaystyle \vert\vert{\bf Y}\vert\vert^2 = \vert\vert \hat{\boldsymbol\mu}\vert\vert^2
+ \vert\vert\hat{\boldsymbol\alpha}\vert\vert^2
+ \vert\vert\hat{\boldsymbol\beta}\vert\vert^2
+ \vert\vert\hat{\boldsymbol\gamma}\vert\vert^2
+ \vert\vert\hat{\boldsymbol\epsilon}\vert\vert^2
$

This is the ANOVA table. Usually we define the corrected total sum of squares to be

$\displaystyle \vert\vert{\bf Y}\vert\vert^2 - \vert\vert \hat{\boldsymbol\mu}\vert\vert^2
$

Our problem is like this one BUT the errors are not modeled as independent.

In the analogy:

$ i$ labels group.

$ j$ labels the columns: i.e., $ j$ is a, b, c.

$ k$ runs from 1 to $ n_{ij}=n_i$.

But

\begin{displaymath}
{\rm Cov}(Y_{ijk},Y_{i^\prime j^\prime k^\prime})
=
\begin{cases}
\boldsymbol\Sigma_{jj^\prime} & i = i^\prime, k=k^\prime
\\
0 & \text{otherwise}
\end{cases}\end{displaymath}

Now do analysis in SAS.

Tell SAS that the variables A, B and C are repeated measurements of the same quantity.


proc glm;
    class group;
    model a b c = group;
    repeated scale;
run;
The results are as follows:

General Linear Models Procedure
Repeated Measures Analysis of Variance
Repeated Measures Level Information
Dependent Variable  A   B    C
Level of SCALE      1   2    3

Manova Test Criteria and Exact F 
   Statistics for the Hypothesis of no
   SCALE Effect
H = Type III SS&CP Matrix for SCALE   
               E = Error SS&CP Matrix
S=1    M=0    N=7
Statistic
                  Value     F   NumDF DenDF Pr > F
Wilks' Lambda    0.56373  6.1912   2    16  0.0102
Pillai's Trace   0.43627  6.1912   2    16  0.0102
Hotelling-Lawley 0.77390  6.1912   2    16  0.0102
Roy's            0.77390  6.1912   2    16  0.0102
Note: should look at interactions first.

Manova Test Criteria and F Approximations 
 for the Hypothesis of no SCALE*GROUP Effect
S=2    M=0    N=7
Statistic          Value   F   NumDF DenDF Pr > F
Wilks' Lambda    0.56333 1.7725  6    32  0.1364
Pillai's Trace   0.48726 1.8253  6    34  0.1234
Hotelling-Lawley 0.68534 1.7134  6    30  0.1522
Roy's            0.50885 2.8835  3    17  0.0662
NOTE: F Statistic for Roy's Greatest 
        Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

Only weak evidence of interaction. The repeated statement also produces univariate anovas. Results:

Repeated Measures Analysis of Variance
Tests of Hypotheses for Between Subjects Effects
Source DF Type III SS Mean Square  F    Pr > F
GROUP   3  743.900000  247.966667 70.93 0.0001
Error  17   59.433333    3.496078
Repeated Measures Analysis of Variance
Univariate Tests of Hypotheses for 
          Within Subject Effects
Source: SCALE                    Adj Pr > F
DF TypeIIISS MS      F   Pr > F G - G  H - F
 2  16.624   8.312  5.39 0.0093 0.0101 0.0093
Source: SCALE*GROUP
DF TypeIII   MS    F   Pr > F G - G  H - F
6   18.9619 3.160 2.05 0.0860 0.0889 0.0860
Source: Error(SCALE)
DF TypeIII SS     Mean Square
34  52.4667     1.54313725
Greenhouse-Geisser Epsilon = 0.9664
       Huynh-Feldt Epsilon = 1.2806
The Greenhouse-Geisser and Huynh-Feldt epsilons test whether $ {\boldsymbol\Sigma}$ has a certain structure.

Return to 2 way anova model. Express as:

$\displaystyle Y_{ijk} = \mu+\alpha_i+\beta_j+ \gamma_{ij}+\epsilon_{ijk}
$

For the fixed effects model the $ \epsilon_{ijk}$ are iid $ N(0,\sigma^2)$.

For the MANOVA model the vector of $ \epsilon_{ijk}$ is MVN with the same covariance structure as $ Y$ above.

Intermediate model. Put in subject effect.

Assume

$\displaystyle \epsilon_{ijk} = \delta_{ik} + u_{ijk}
$

where $ u_{ijk}$ iid $ N(0,\sigma^2)$ and $ \delta_{ik}$ are iid $ N(0,\tau^2)$. Then

\begin{multline*}
{\rm Cov}(Y_{ijk},Y_{i^\prime j^\prime k^\prime})
= \\
\begin{cases}
\tau^2+\sigma^2 & i=i^\prime, k=k^\prime, j = j^\prime
\\
\tau^2 & i=i^\prime, k=k^\prime, j\neq j^\prime
\\
0 & \text{otherwise}
\end{cases}\end{multline*}

This model is usually not fitted by maximum likelihood but by analyzing the behaviour of the ANOVA table under this model.

Essentially model says

$\displaystyle {\boldsymbol\Sigma}= \tau^2 {\bf 1}{\bf 1}^T + \sigma^2 {\bf I}
$
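A tiny S/R sketch of this compound symmetry pattern (the values of $ \tau^2$ and $ \sigma^2$ are illustrative only):

p <- 3; tau2 <- 2; sigma2 <- 1.5
Sigma <- tau2 * matrix(1, p, p) + sigma2 * diag(p)
Sigma        # diagonal entries tau2 + sigma2, off-diagonal entries tau2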

The Greenhouse-Geisser and Huynh-Feldt procedures test for a slightly more general pattern for $ {\boldsymbol\Sigma}$.

Do univariate anova: The data reordered:


1 1 1 19
1 1 2 20
1 1 3 18
2 1 1 20
2 1 2 21
2 1 3 19
 et cetera
2 4 2 10
2 4 3 12
3 4 1 11
3 4 2 10
3 4 3 10
4 4 1 11
4 4 2 7
4 4 3 12
The four columns are now subject number, group, scale (a, b or c) and the response. The SAS commands:

    data long;
    infile 'table5.7uni';
    input subject group scale score;
run;
proc print;
run;
proc glm;
    class group;
    class scale;
    class subject;
    model score =group subject(group) 
                 scale group*scale;
    random subject(group) ;
run;
Some of the output:

Dependent Variable: SCORE
               Sum of    Mean
Source   DF    Squares  Square F    Pr > F
Model    28   843.5333 30.126 19.52 0.0001
Error    34    52.4667  1.543
Total  62   896.0000
Root MSE      SCORE Mean
1.242231        15.33333


Source         DF  TypeISS    MS      F     Pr > F

GROUP           3 743.9000 247.9667 160.69  0.0001
SUBJECT(GROUP) 17  59.4333   3.4961   2.27  0.0208
SCALE           2  21.2381  10.6190   6.88  0.0031
GROUP*SCALE     6  18.9620   3.1603   2.05  0.0860

Source         DF TypeIIISS     MS    F     Pr > F
GROUP           3 743.9000  247.9667 160.69 0.0001
SUBJECT(GROUP) 17  59.4333    3.4961   2.27 0.0208
SCALE           2  16.6242    8.3121   5.39 0.0093
GROUP*SCALE     6  18.9619    3.1603   2.05 0.0860

Source         Type III Expected Mean Square

GROUP          Var(Error) + 3 Var(SUBJECT(GROUP)) 
                    + Q(GROUP,GROUP*SCALE)
SUBJECT(GROUP) Var(Error) + 3 Var(SUBJECT(GROUP))
SCALE          Var(Error) + Q(SCALE,GROUP*SCALE)
GROUP*SCALE    Var(Error) + Q(GROUP*SCALE)
Note: the Type I and Type III sums of squares printed above agree except for SCALE.

Notice hypothesis of no group by scale interaction is acceptable.

Under the assumption of no such group by scale interaction, the hypothesis of no group effect is tested by dividing the GROUP mean square by the SUBJECT(GROUP) mean square.

Value is 70.9 on 3,17 degrees of freedom.
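A quick check of this F value, in S/R syntax, using the mean squares printed above:

Fgroup <- 247.9667/3.4961
Fgroup                      # about 70.9
1 - pf(Fgroup, 3, 17)       # essentially 0 (SAS prints 0.0001)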

This is NOT the F value in the table above since the table above is for FIXED effects.

Notice that the sums of squares in this table match those produced in the repeated measures ANOVA. This is not accidental.



Richard Lockhart
2002-10-10