STAT 802: Multivariate Analysis

Course outline:

• Multivariate Distributions.

• The Multivariate Normal Distribution.

• The 1 sample problem.

• Paired comparisons.

• Repeated measures: 1 sample.

• One way MANOVA.

• Two way MANOVA.

• Profile Analysis.

• Multivariate Multiple Regression.

• Discriminant Analysis.

• Clustering.

• Principal Components.

• Factor analysis.

• Canonical Correlations.

Basic structure of typical multivariate data set:

Case by variables: data in matrix. Each row is a case, each column is a variable.

Example: Fisher's iris data: a few rows of the 150 by 5 matrix:

 Case #   Variety      Sepal Length   Sepal Width   Petal Length   Petal Width
    1     Setosa           5.1            3.5           1.4            0.2
    2     Setosa           4.9            3.0           1.4            0.2
    ⋮     ⋮                ⋮              ⋮             ⋮              ⋮
   51     Versicolor       7.0            3.2           4.7            1.4
    ⋮     ⋮                ⋮              ⋮             ⋮              ⋮
Usual model: rows of data matrix are independent random variables.

Vector valued random variable: a function X from Ω to R^p such that, writing X = (X_1, …, X_p)^T, the probability

P(X_1 ≤ x_1, …, X_p ≤ x_p)

is defined for any constants x_1, …, x_p.

Cumulative Distribution Function (CDF) of X: the function F_X on R^p defined by

F_X(x_1, …, x_p) = P(X_1 ≤ x_1, …, X_p ≤ x_p).

Defn: Distribution of rv X is absolutely continuous if there is a function f such that

P(X ∈ A) = ∫_A f(x) dx    (1)

for any (Borel) set A. This is a p dimensional integral in general. Equivalently

F_X(x_1, …, x_p) = ∫_{-∞}^{x_1} ⋯ ∫_{-∞}^{x_p} f(y_1, …, y_p) dy_p ⋯ dy_1.

Defn: Any f satisfying (1) is a density of X.

For most x, F is differentiable at x and

∂^p F(x)/(∂x_1 ⋯ ∂x_p) = f(x).

Building Multivariate Models

Basic tactic: specify density of

Tools: marginal densities, conditional densities, independence, transformation.

Marginalization: Simplest multivariate problem:

X = (X_1, …, X_p), Y = X_1

(or in general, Y is any subvector of X).

Theorem 1   If X = (X_1, …, X_p) has density f(x_1, …, x_p) and Y = (X_1, …, X_q) with q < p, then Y has density

f_Y(x_1, …, x_q) = ∫ ⋯ ∫ f(x_1, …, x_p) dx_{q+1} ⋯ dx_p.

f_Y is the marginal density of Y and f the joint density of X, but they are both just densities. "Marginal" just distinguishes f_Y from the joint density f.

Independence, conditional distributions

Def'n: Events A and B are independent if

P(AB) = P(A)P(B).

(Notation: AB is the event that both A and B happen, also written A ∩ B.)

Def'n: X, Y are independent if

P(X ∈ A; Y ∈ B) = P(X ∈ A)P(Y ∈ B)

for any A and B.

Def'n: Rvs X_1, …, X_p are independent if

P(X_1 ∈ A_1, …, X_p ∈ A_p) = Π_i P(X_i ∈ A_i)

for any A_1, …, A_p.

Theorem:

1. If X and Y are independent with joint density f_{X,Y}(x, y) then X and Y have densities f_X and f_Y, and

f_{X,Y}(x, y) = f_X(x)f_Y(y).

2. If X and Y are independent with marginal densities f_X and f_Y then (X, Y) has joint density

f_{X,Y}(x, y) = f_X(x)f_Y(y).

3. If (X, Y) has density f(x, y) and there exist g(x) and h(y) st f(x, y) = g(x)h(y) for (almost) all (x, y), then X and Y are independent with densities given by

f_X(x) = g(x)/∫ g(u) du and f_Y(y) = h(y)/∫ h(u) du.

Theorem: If X_1, …, X_p are independent and Y_i = g_i(X_i) then Y_1, …, Y_p are independent. Moreover, (X_1, …, X_q) and (X_{q+1}, …, X_p) are independent.

Conditional densities

Conditional density of Y given X = x:

f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x);

in words, "conditional = joint/marginal".

Change of Variables

Suppose Y = g(X) ∈ R^p with X ∈ R^p having density f_X. Assume g is a one to one ("injective") map, i.e., g(x) = g(x′) if and only if x = x′. Find f_Y:

Step 1: Solve for x in terms of y: x = g^{-1}(y).

Step 2: Use basic equation:

P(Y ∈ A) = P(X ∈ g^{-1}(A))

and rewrite it in the form

∫_A f_Y(y) dy = ∫_A f_X(g^{-1}(y)) |det(∂x/∂y)| dy.

Interpretation of derivative when p > 1: ∂x/∂y is the p × p matrix of partial derivatives with entries ∂x_i/∂y_j, and |det(∂x/∂y)| is the so called Jacobian.

Equivalent formula inverts the matrix:

f_Y(y) = f_X(x)/|det(∂y/∂x)|.

This notation means the derivative of g evaluated at x, but with x replaced by the corresponding value of y, that is, replace x by g^{-1}(y).

Example: The density

f(x_1, x_2) = (2π)^{-1} exp{-(x_1² + x_2²)/2}

is the standard bivariate normal density. Let Y = (R, Θ) where R = (X_1² + X_2²)^{1/2} and Θ is the angle from the positive x axis to the ray from the origin to the point (X_1, X_2). I.e., Y is X in polar co-ordinates.

Solve for x in terms of y:

x_1 = r cos θ, x_2 = r sin θ,

so that

r = (x_1² + x_2²)^{1/2}, θ = argument(x_1, x_2).

It follows that the Jacobian is r and

f_{R,Θ}(r, θ) = (2π)^{-1} r e^{-r²/2}, r > 0, 0 ≤ θ < 2π.

Next: marginal densities of R, Θ?

Factor f_{R,Θ} as f_{R,Θ}(r, θ) = g(r)h(θ) where

g(r) = r e^{-r²/2}, r > 0,

and

h(θ) = 1/(2π), 0 ≤ θ < 2π.

Then

f_R(r) = g(r) ∫_0^{2π} h(θ) dθ,

so the marginal density of R is a multiple of g. The multiplier makes ∫ f_R = 1, but in this case

∫_0^{2π} h(θ) dθ = 1,

so that

f_R(r) = r e^{-r²/2}, r > 0.

(Special Weibull or Rayleigh distribution.) Similarly

f_Θ(θ) = 1/(2π), 0 ≤ θ < 2π,

which is the Uniform(0, 2π) density. Exercise: W = R²/2 has standard exponential distribution. Recall: by definition U = R² has a χ² distribution on 2 degrees of freedom. Exercise: find the χ²_2 density.

Remark: easy to check ∫_0^∞ r e^{-r²/2} dr = 1.

Thus: have proved the original bivariate normal density integrates to 1.

Put u = x²/2 in ∫_{-∞}^∞ e^{-x²/2} dx = √(2π). Get

∫_0^∞ u^{-1/2} e^{-u} du = √π.

So Γ(1/2) = √π.
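Collecting the pieces, the standard polar-coordinates computation in this example can be summarized as follows (a reconstruction of the usual display, with the notation defined above):

```latex
\begin{aligned}
f_{X_1,X_2}(x_1,x_2) &= \frac{1}{2\pi}e^{-(x_1^2+x_2^2)/2},
 & x_1 &= r\cos\theta,\quad x_2 = r\sin\theta,\\
f_{R,\Theta}(r,\theta) &= \frac{1}{2\pi}\,r e^{-r^2/2},
 & f_R(r) &= re^{-r^2/2}\ (r>0),\quad f_\Theta(\theta)=\frac{1}{2\pi}\ (0\le\theta<2\pi),\\
\Gamma(1/2) &= \sqrt{\pi},
 & \int_{-\infty}^{\infty} e^{-x^2/2}\,dx &= \sqrt{2\pi}.
\end{aligned}
```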

Linear Algebra Review

Notation:

• Vectors are column vectors

• An n × p matrix A has n rows, p columns and entries A_{ij}.

• Matrix and vector addition are defined componentwise: (A + B)_{ij} = A_{ij} + B_{ij}.

• If A is n × p and B is p × q then AB is the n × q matrix with entries (AB)_{ij} = Σ_k A_{ik}B_{kj}.

• The matrix I, or sometimes I_{p×p}, which is a p × p matrix with I_{ii} = 1 for all i and I_{ij} = 0 for any pair i ≠ j, is called the identity matrix.

• The span of a set of vectors v_1, …, v_k is the set of all vectors of the form c_1v_1 + ⋯ + c_kv_k. It is a vector space. The column space of a matrix A, col(A), is the span of the set of columns of A. The row space is the span of the set of rows.

• A set of vectors v_1, …, v_k is linearly independent if c_1v_1 + ⋯ + c_kv_k = 0 implies c_i = 0 for all i. The dimension of a vector space is the cardinality of the largest possible set of linearly independent vectors.

Defn: The transpose, A^T, of an n × p matrix A is the p × n matrix whose entries are given by

(A^T)_{ij} = A_{ji},

so that A^T is p × n. We have

(A + B)^T = A^T + B^T

and

(AB)^T = B^TA^T.

Defn: rank of a matrix A, rank(A): # of linearly independent columns of A. We have

rank(A) = dim col(A) = dim row(A) = rank(A^T).

If A is n × p then rank(A) ≤ min(n, p).

Matrix inverses

For now: all matrices square, p × p.

If there is a matrix B such that AB = BA = I then we call B the inverse of A. If B exists it is unique and we write B = A^{-1}. The matrix A has an inverse if and only if rank(A) = p.

Inverses have the following properties:

(AB)^{-1} = B^{-1}A^{-1}

(if one side exists then so does the other) and

(A^T)^{-1} = (A^{-1})^T.

Determinants

Again A is p × p. The determinant is a function det on the set of p × p matrices such that:

1. det(I) = 1.

2. If Ã is the matrix A with two columns interchanged then

det(Ã) = −det(A).

(So: two equal columns implies det(A) = 0.)

3. det(A) is a linear function of each column of A. If A = [a_1, …, a_p] with a_i denoting the ith column of the matrix and a_1 = x + y then

det[x + y, a_2, …, a_p] = det[x, a_2, …, a_p] + det[y, a_2, …, a_p].

Here are some properties of the determinant:

1. det(A^T) = det(A).

2. det(AB) = det(A) det(B).

3. det(A^{-1}) = 1/det(A).

4. A is invertible if and only if det(A) ≠ 0 if and only if rank(A) = p.

5. Determinants can be computed (slowly) by expansion by minors.

Special Kinds of Matrices

1. A is symmetric if A^T = A.

2. A is orthogonal if A^TA = I (or AA^T = I).

3. A is idempotent if A² = A.

4. A is diagonal if i ≠ j implies A_{ij} = 0.

Inner Products, orthogonal, orthonormal vectors

Defn: The inner product or dot product of x and y is

⟨x, y⟩ = x^Ty = Σ_i x_iy_i.

Defn: x and y are orthogonal if x^Ty = 0.

Defn: The norm (or length) of x is

||x|| = (x^Tx)^{1/2} = (Σ_i x_i²)^{1/2}.

A is orthogonal if each column of A has length 1 and is orthogonal to each other column of A.

Suppose A is a p × p matrix. The function

g(x) = x^TAx = Σ_{i,j} x_iA_{ij}x_j

is called a quadratic form. Now

x^TAx = (x^TAx)^T = x^TA^Tx,

so that g depends only on the totals A_{ij} + A_{ji}. In fact

x^TAx = x^T{(A + A^T)/2}x,

and (A + A^T)/2 is symmetric. Thus we will assume that A is symmetric.

Eigenvalues and eigenvectors

If A is p × p and x ≠ 0 and λ are such that

Ax = λx

then λ is an eigenvalue (or characteristic or latent value) of A; x is a corresponding eigenvector. Since (A − λI)x = 0 with x ≠ 0, the matrix A − λI is singular.

Therefore det(A − λI) = 0.

Conversely: if A − λI is singular then there is x ≠ 0 such that (A − λI)x = 0.

Fact: det(A − λI) is a polynomial in λ of degree p.

Each root is an eigenvalue.

In general the roots could be multiple roots or complex valued.

Diagonalization

Matrix A is diagonalized by a non-singular matrix P if P^{-1}AP = D is diagonal.

If so then AP = PD, so each column of P is an eigenvector of A, with the ith column having eigenvalue D_{ii}.

Thus to be diagonalizable A must have p linearly independent eigenvectors.

Symmetric Matrices

If A is symmetric then

1. Every eigenvalue of A is real (not complex).

2. A is diagonalizable; the columns of P may be taken unit length, mutually orthogonal: A is diagonalizable by an orthogonal matrix P; in symbols P^TAP = D.

3. Diagonal entries in D are the eigenvalues of A.

4. If λ_1 ≠ λ_2 are two eigenvalues of A and x_1 and x_2 are corresponding eigenvectors then

x_2^TAx_1 = λ_1x_2^Tx_1

and

x_1^TAx_2 = λ_2x_1^Tx_2.

Since x_1^TAx_2 = (x_2^TAx_1)^T and λ_1 ≠ λ_2 we see x_1^Tx_2 = 0. Eigenvectors corresponding to distinct eigenvalues are orthogonal.

Positive Definite Matrices

Defn: A symmetric matrix A is non-negative definite if x^TAx ≥ 0 for all x. It is positive definite if in addition x^TAx = 0 implies x = 0.

A is non-negative definite iff all its eigenvalues are non-negative.

A is positive definite iff all eigenvalues positive.

A non-negative definite matrix A has a symmetric non-negative definite square root. If

A = PDP^T

for orthogonal P and diagonal D then

A^{1/2} = PD^{1/2}P^T

is symmetric, non-negative definite and A^{1/2}A^{1/2} = A.

Here D^{1/2} is diagonal with

(D^{1/2})_{ii} = (D_{ii})^{1/2}.

Many other square roots possible. If BB^T = A and O is orthogonal and C = BO then CC^T = BOO^TB^T = A.
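A numerical sketch of the symmetric square root (assuming NumPy is available; the matrix A here is illustrative, not from the course):

```python
import numpy as np

# An illustrative positive definite matrix
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# Spectral decomposition A = P D P^T (eigh returns orthonormal P for symmetric A)
evals, P = np.linalg.eigh(A)

# Symmetric square root: A^{1/2} = P D^{1/2} P^T
A_half = P @ np.diag(np.sqrt(evals)) @ P.T

# Check: A^{1/2} is symmetric and A^{1/2} A^{1/2} = A
print(np.allclose(A_half, A_half.T))      # True
print(np.allclose(A_half @ A_half, A))    # True
```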

Orthogonal Projections

Suppose V is a vector subspace of R^n, with basis x_1, …, x_p. Given any y ∈ R^n there is a unique ŷ ∈ V which is closest to y; ŷ minimizes

||y − v||²

over v ∈ V. Any v in V is of the form

v = c_1x_1 + ⋯ + c_px_p = Xc,

where X is the n × p matrix with columns x_1, …, x_p and c is the column vector with ith entry c_i. Define

P = X(X^TX)^{-1}X^T.

(X has rank p so X^TX is invertible.) Then

PX = X and P^T = P,

so that X^T(I − P) = 0.

Then

y − Xc = (y − Py) + (Py − Xc).

Since Py − Xc = X{(X^TX)^{-1}X^Ty − c} is in V and X^T(y − Py) = 0 we see that

||y − Xc||² = ||y − Py||² + ||Py − Xc||².

This shows that

||y − Xc||² ≥ ||y − Py||².

Choose c to minimize: minimize the second term.

Achieved by making Py − Xc = 0.

Since Py = X(X^TX)^{-1}X^Ty we can take

c = (X^TX)^{-1}X^Ty.

Summary: the closest point ŷ in V is

ŷ = X(X^TX)^{-1}X^Ty = Py;

call ŷ the orthogonal projection of y onto V.

Notice that the matrix P is idempotent:

P² = X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T = P.

We call ŷ the orthogonal projection of y on V because ŷ is perpendicular to the residual y − ŷ = (I − P)y.
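The projection algebra above can be checked numerically (a sketch assuming NumPy; X and y are illustrative):

```python
import numpy as np

# Columns of X span the subspace V (X assumed full column rank)
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

# Orthogonal projection matrix P = X (X^T X)^{-1} X^T
P = X @ np.linalg.inv(X.T @ X) @ X.T

y = np.array([1.0, 3.0, 2.0])
y_hat = P @ y                  # closest point in V to y

# P is idempotent; the residual is orthogonal to the columns of X
print(np.allclose(P @ P, P))                 # True
print(np.allclose(X.T @ (y - y_hat), 0.0))   # True
```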

Partitioned Matrices

Suppose A_{11} is a p × p matrix, A_{12} is p × q, A_{21} is q × p and A_{22} is q × q. Make a (p + q) × (p + q) matrix A by putting the A_{ij} in a 2 by 2 pattern:

For instance if

and

then

Lines indicate partitioning.

We can work with partitioned matrices just like ordinary matrices always making sure that in products we never change the order of multiplication of things.

and

Note partitioning of and must match.

Addition: dimensions of and must be the same.

Multiplication formula must have as many columns as has rows, etc.

In general: need to make sense for each .

Works with more than a 2 by 2 partitioning.

Defn: block diagonal matrix: partitioned matrix A for which A_{ij} = 0 if i ≠ j. If

A = diag(A_{11}, A_{22})

then A is invertible iff each A_{ii} is invertible and then

A^{-1} = diag(A_{11}^{-1}, A_{22}^{-1}).

Moreover det(A) = det(A_{11}) det(A_{22}). Similar formulas work for larger matrices.

Partitioned inverses. Suppose A_{11}, A_{22} are symmetric positive definite. Look for the inverse of

A = [A_{11} A_{12}; A_{21} A_{22}]

of the form

A^{-1} = [B_{11} B_{12}; B_{21} B_{22}].

Multiply to get the equations

A_{11}B_{11} + A_{12}B_{21} = I, A_{11}B_{12} + A_{12}B_{22} = 0,
A_{21}B_{11} + A_{22}B_{21} = 0, A_{21}B_{12} + A_{22}B_{22} = I.

Solve to get

B_{11} = (A_{11} − A_{12}A_{22}^{-1}A_{21})^{-1}, B_{12} = −B_{11}A_{12}A_{22}^{-1},
B_{21} = B_{12}^T, B_{22} = A_{22}^{-1} + A_{22}^{-1}A_{21}B_{11}A_{12}A_{22}^{-1}.
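These block formulas can be verified against a direct inverse (a sketch assuming NumPy; the matrix is illustrative):

```python
import numpy as np

# Illustrative symmetric positive definite matrix, partitioned 2 + 1
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]

A22inv = np.linalg.inv(A22)
# Schur complement of A22 gives the (1,1) block of the inverse
B11 = np.linalg.inv(A11 - A12 @ A22inv @ A21)
B12 = -B11 @ A12 @ A22inv
B22 = A22inv + A22inv @ A21 @ B11 @ A12 @ A22inv

B = np.block([[B11, B12], [B12.T, B22]])
print(np.allclose(B, np.linalg.inv(A)))   # True
```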

The Multivariate Normal Distribution

Defn: Z ~ N(0, 1) iff Z has density

f_Z(z) = (2π)^{-1/2} e^{-z²/2}.

Defn: Z ~ MVN_p(0, I) if and only if Z = (Z_1, …, Z_p)^T with the Z_i independent and each Z_i ~ N(0, 1).

In this case according to our theorem

f_Z(z) = Π_i (2π)^{-1/2}e^{-z_i²/2} = (2π)^{-p/2} exp(−z^Tz/2);

superscript T denotes matrix transpose.

Defn: X ∈ R^p has a multivariate normal distribution if it has the same distribution as AZ + μ for some μ, some p × q matrix of constants A and Z ~ MVN_q(0, I).

A singular: X does not have a density.

A invertible: derive the multivariate normal density by change of variables:

X = AZ + μ, Z = A^{-1}(X − μ), ∂x/∂z = A, ∂z/∂x = A^{-1}.

So

f_X(x) = f_Z(A^{-1}(x − μ))/|det A| = (2π)^{-p/2} exp{−(x − μ)^T(A^{-1})^TA^{-1}(x − μ)/2}/|det A|.

Now define Σ = AA^T and notice that

Σ^{-1} = (A^T)^{-1}A^{-1} = (A^{-1})^TA^{-1}

and

det Σ = det A det A^T = (det A)².

Thus f_X is

f_X(x) = (2π)^{-p/2}(det Σ)^{-1/2} exp{−(x − μ)^TΣ^{-1}(x − μ)/2};

the MVN_p(μ, Σ) density. Note the density is the same for all A such that AA^T = Σ. This justifies the notation MVN_p(μ, Σ).

For which μ, Σ is this a density?

Any μ, but if x ∈ R^p then

x^TΣx = x^TAA^Tx = ||A^Tx||² ≥ 0,

where the inequality is strict except for x such that A^Tx = 0, which is equivalent to Σx = 0. For invertible A this forces x = 0. Thus Σ is a positive definite symmetric matrix.

Conversely, if Σ is a positive definite symmetric matrix then there is a square invertible matrix A such that AA^T = Σ, so that there is a MVN_p(μ, Σ) distribution. (A can be found via the Cholesky decomposition, e.g.)

When A is singular X will not have a density: there is a ≠ 0 such that P(a^TX = a^Tμ) = 1; X is confined to a hyperplane.

Still true: the distribution of X depends only on Σ = AA^T: if AA^T = BB^T then AZ + μ and BZ + μ have the same distribution.

Expectation, moments

Defn: If X has density f then

E{g(X)} = ∫ g(x)f(x) dx,

for any g from R^p to R.

FACT: if Y = g(X) for a smooth g (mapping R^p to R^p) then

E(Y) = ∫ y f_Y(y) dy = ∫ g(x) f_X(x) dx

by the change of variables formula for integration. This is good because otherwise we might have two different values for E{g(X)}.

Linearity: E(aX + bY) = aE(X) + bE(Y) for real a and b.

Defn: The rth moment (about the origin) of a real rv X is μ_r′ = E(X^r) (provided it exists). We generally use μ for E(X).

Defn: The rth central moment is

μ_r = E{(X − μ)^r}.

We call σ² = μ_2 the variance.

Defn: For an R^p valued random vector X,

μ_X = E(X)

is the vector whose ith entry is E(X_i) (provided all entries exist).

Fact: the same idea is used for random matrices.

Defn: The (p × p) variance covariance matrix of X is

Var(X) = E{(X − μ)(X − μ)^T},

which exists provided each component X_i has a finite second moment.

Example moments: If then

and (integrating by parts)

so that

for . Remembering that and

we find that

If now X ~ N(μ, σ²), that is, X = σZ + μ, then E(X) = σE(Z) + μ = μ and

Var(X) = σ²Var(Z) = σ².

In particular, we see that our choice of notation for the N(μ, σ²) distribution is justified; σ² is indeed the variance.

Similarly for we have with and

and

Note use of easy calculation: and

Moments and independence

Theorem: If X_1, …, X_p are independent and each X_i is integrable then X = X_1 ⋯ X_p is integrable and

E(X_1 ⋯ X_p) = E(X_1) ⋯ E(X_p).

Moment Generating Functions

Defn: The moment generating function of a real valued X is

M_X(t) = E(e^{tX}),

defined for those real t for which the expected value is finite.

Defn: The moment generating function of X ∈ R^p is

M_X(u) = E(e^{u^TX}),

defined for those vectors u for which the expected value is finite.

Example: If Z ~ N(0, 1) then M_Z(t) = e^{t²/2}.

Theorem: If M_X(t) is finite for all t in a neighbourhood of 0 then

1. Every moment of X is finite.

2. M_X is C^∞ (in fact M_X is analytic).

3. μ_k′ = E(X^k) = (d^k/dt^k) M_X(t) evaluated at t = 0.

Note: C^∞ means M has continuous derivatives of all orders. Analytic means M has a convergent power series expansion in a neighbourhood of each point.

The proof, and many other facts about mgfs, rely on techniques of complex variables.

Characterization & MGFs

Theorem: Suppose X and Y are R^p valued random vectors such that

M_X(u) = M_Y(u)

for u in some open neighbourhood of 0 in R^p. Then X and Y have the same distribution.

The proof relies on techniques of complex variables.

MGFs and Sums

If X_1, …, X_p are independent and Y = Σ_i X_i then the mgf of Y is the product of the mgfs of the individual X_i:

M_Y(t) = Π_i M_{X_i}(t)

or log M_Y = Σ_i log M_{X_i}. (Also for multivariate X_i.)

Example: If Z_1, …, Z_p are independent N(0, 1) then

M_{a_1Z_1 + ⋯ + a_pZ_p}(t) = Π_i e^{a_i²t²/2} = exp(t² Σ_i a_i²/2).

Conclusion: If Z ~ MVN_p(0, I) then

M_Z(u) = exp(u^Tu/2).

Example: If X = AZ + μ then M_X(u) = e^{u^Tμ}M_Z(A^Tu) and

M_X(u) = exp(u^Tμ + u^TΣu/2).

Theorem: Suppose X = AZ + μ and Y = BZ* + ν where Z ~ MVN_q(0, I) and Z* ~ MVN_r(0, I). Then X and Y have the same distribution if and only if the following two conditions hold:

1. μ = ν.

2. AA^T = BB^T.

Alternatively: if X, Y are each MVN then E(X) = E(Y) and Var(X) = Var(Y) imply that X and Y have the same distribution.

Proof: If 1 and 2 hold the mgf of X is

M_X(u) = exp(u^Tμ + u^TAA^Tu/2) = exp(u^Tν + u^TBB^Tu/2) = M_Y(u).

Thus X and Y have the same distribution. Conversely if X and Y have the same distribution then they have the same mean and variance.

Thus the mgf is determined by μ and Σ.

Theorem: If X ~ MVN_p(μ, Σ) then there is a p × p matrix A such that X has the same distribution as AZ + μ for Z ~ MVN_p(0, I).

We may assume that A is symmetric and non-negative definite, or that A is upper triangular, or that A is lower triangular.

Proof: Pick any A such that AA^T = Σ, such as A = PD^{1/2}P^T from the spectral decomposition. Then AZ + μ is MVN with mean μ and variance Σ.

From the symmetric square root A one can produce an upper triangular square root by the Gram Schmidt process: if A has rows a_1^T, …, a_p^T then let v_p be a_p/||a_p||. Choose v_{p−1} proportional to a_{p−1} − (a_{p−1}^Tv_p)v_p, scaled so that v_{p−1} has unit length. Continue in this way; you automatically get a_i^Tv_j = 0 if j < i. If V has columns v_1, …, v_p then V is orthogonal and AV is an upper triangular square root of Σ.
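A numerical sketch of the triangular square root and the representation X = AZ + μ (assuming NumPy; μ and Σ are illustrative, and np.linalg.cholesky returns a lower triangular factor):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target parameters (illustrative)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Lower triangular square root via Cholesky: A A^T = Sigma
A = np.linalg.cholesky(Sigma)
print(np.allclose(A @ A.T, Sigma))   # True

# X = mu + A Z with Z standard normal gives X ~ MVN(mu, Sigma)
Z = rng.standard_normal((2, 100_000))
X = mu[:, None] + A @ Z
print(X.mean(axis=1))    # close to mu
print(np.cov(X))         # close to Sigma
```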

Variances, Covariances, Correlations

Defn: The covariance between X and Y is

Cov(X, Y) = E{(X − μ_X)(Y − μ_Y)^T}.

This is a p × q matrix when X ∈ R^p and Y ∈ R^q.

Properties:

• Cov(X, X) = Var(X).

• Cov is bilinear:

Cov(AX + a, Y) = A Cov(X, Y)

and

Cov(X, BY + b) = Cov(X, Y)B^T.

Properties of the MVN(μ, Σ) distribution

1: All margins are multivariate normal: if

X = (X_1^T, X_2^T)^T, μ = (μ_1^T, μ_2^T)^T

and

Σ = [Σ_{11} Σ_{12}; Σ_{21} Σ_{22}]

then X_1 ~ MVN(μ_1, Σ_{11}).

2: AX + a ~ MVN(Aμ + a, AΣA^T): an affine transformation of MVN is normal.

3: If

Σ_{12} = 0

then X_1 and X_2 are independent.

4: All conditionals are normal: the conditional distribution of X_1 given X_2 = x_2 is MVN(μ_1 + Σ_{12}Σ_{22}^{-1}(x_2 − μ_2), Σ_{11} − Σ_{12}Σ_{22}^{-1}Σ_{21}). Proof of (1): If X = AZ + μ then X_1 = BX where B = [I 0],

for I the identity matrix of correct dimension.

So X_1 = (BA)Z + Bμ is again of the form in the definition of MVN.

Compute mean and variance to check the rest.

Proof of (2): If X = AZ + μ then BX + a = (BA)Z + (Bμ + a), which is MVN(Bμ + a, BΣB^T).

Proof of (3): If

Σ_{12} = 0

then

M_X(u) = exp{u^Tμ + u^TΣu/2} = M_{X_1}(u_1)M_{X_2}(u_2),

since u^TΣu = u_1^TΣ_{11}u_1 + u_2^TΣ_{22}u_2; so X_1 and X_2 are independent.

Proof of (4): first case: assume Σ_{22} has an inverse.

Define

W = X_1 − Σ_{12}Σ_{22}^{-1}X_2.

Then

Cov(W, X_2) = Σ_{12} − Σ_{12}Σ_{22}^{-1}Σ_{22} = 0.

Thus W is MVN(μ_W, Σ_W), independent of X_2, where

μ_W = μ_1 − Σ_{12}Σ_{22}^{-1}μ_2 and Σ_W = Σ_{11} − Σ_{12}Σ_{22}^{-1}Σ_{21}.

Now the joint density of W and X_2 factors:

f_{W,X_2}(w, x_2) = f_W(w)f_{X_2}(x_2).

By change of variables the joint density of (X_1, X_2) is

f_{X_1,X_2}(x_1, x_2) = c f_W(x_1 − Σ_{12}Σ_{22}^{-1}x_2)f_{X_2}(x_2),

where c is the constant Jacobian of the linear transformation from (W, X_2) to (X_1, X_2) and c = 1.

Thus the conditional density of X_1 given X_2 = x_2 is

f_{X_1|X_2}(x_1|x_2) = f_W(x_1 − Σ_{12}Σ_{22}^{-1}x_2).

As a function of x_1 this density has the form of the advertised multivariate normal density.

Specialization to the bivariate case:

Write

Σ = [σ_{11} σ_{12}; σ_{12} σ_{22}] = [σ_1² ρσ_1σ_2; ρσ_1σ_2 σ_2²],

where we define

σ_i² = σ_{ii} and ρ = σ_{12}/(σ_1σ_2).

Note that ρ² ≤ 1.

Then

W = X_1 − (ρσ_1/σ_2)X_2

is independent of X_2. The marginal distribution of W is N(μ_W, σ_W²) where

μ_W = μ_1 − (ρσ_1/σ_2)μ_2 and σ_W² = σ_1² − (ρσ_1σ_2)²/σ_2².

This simplifies to

σ_W² = σ_1²(1 − ρ²).

Notice that since σ_W² ≥ 0 it follows that ρ² ≤ 1.

More generally: for any Y and X and constants a, b:

0 ≤ Var(Y − a − bX) = Var(Y) − 2b Cov(X, Y) + b² Var(X).

The RHS is minimized at

b = Cov(X, Y)/Var(X).

The minimum value is

Var(Y){1 − ρ_{XY}²},

where

ρ_{XY} = Cov(X, Y)/{Var(X) Var(Y)}^{1/2}

defines the correlation between X and Y.
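The bivariate results above can be collected in the standard form (included for reference; notation as defined above):

```latex
X_1 \mid X_2 = x_2 \;\sim\; N\!\left(\mu_1 + \rho\,\frac{\sigma_1}{\sigma_2}\,(x_2-\mu_2),\ \sigma_1^2(1-\rho^2)\right),
\qquad
\min_{a,b}\, E\{(Y-a-bX)^2\} \;=\; \operatorname{Var}(Y)\,(1-\rho_{XY}^2),
\quad
b \;=\; \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}.
```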

Multiple Correlation
Now suppose Y is scalar but X is a vector.

Defn: The multiple correlation between Y and X is

R = max corr(Y, a^TX)

over all a.

Thus: maximize

Cov²(Y, a^TX)/{Var(Y) Var(a^TX)}.

Put . For invertible problem is equivalent to maximizing

where

Solution: find largest eigenvalue of .

Note

where

is a vector. Set

and multiply by to get

or

If then we see so largest eigenvalue is .

Summary: writing σ_{YY} = Var(Y), σ_{XY} = Cov(X, Y) and Σ_{XX} = Var(X), the maximum squared correlation is

R² = σ_{XY}^TΣ_{XX}^{-1}σ_{XY}/σ_{YY}.

Achieved when a ∝ Σ_{XX}^{-1}σ_{XY}.

Notice: since R² is a squared correlation between two scalars (Y and a^TX) we have 0 ≤ R² ≤ 1.

It equals 1 iff Y is a linear combination of the components of X.

Correlation matrices, partial correlations:

Correlation between two scalars X and Y is

ρ_{XY} = Cov(X, Y)/{Var(X) Var(Y)}^{1/2}.

If X has variance Σ then the correlation matrix of X is R with entries

R_{ij} = Σ_{ij}/(Σ_{ii}Σ_{jj})^{1/2}.

If X_1, X_2 are MVN with the usual partitioned variance covariance matrix then the conditional variance of X_1 given X_2 is

Σ_{11·2} = Σ_{11} − Σ_{12}Σ_{22}^{-1}Σ_{21}.

From this define the partial correlation matrix R_{11·2} with entries

(R_{11·2})_{ij} = (Σ_{11·2})_{ij}/{(Σ_{11·2})_{ii}(Σ_{11·2})_{jj}}^{1/2}.

Note: these are used even when X_1, X_2 are NOT MVN.

Likelihood Methods of Inference

Given data X with model {f_θ(x); θ ∈ Θ}:

Definition: The likelihood function is the map L with domain Θ and values given by

L(θ) = f_θ(X).

Key Point: think about how the density depends on θ, not about how it depends on X.

Notice: X, the observed value of the data, has been plugged into the formula for the density.

We use likelihood for most inference problems:

1. Point estimation: we must compute an estimate θ̂ = θ̂(X) which lies in Θ. The maximum likelihood estimate (MLE) of θ is the value θ̂ which maximizes L(θ) over θ ∈ Θ, if such a θ̂ exists.

2. Point estimation of a function of θ: we must compute an estimate φ̂ = φ̂(X) of φ = g(θ). We use φ̂ = g(θ̂) where θ̂ is the MLE of θ.

3. Interval (or set) estimation. We must compute a set C = C(X) in Θ which we think will contain θ. We will use

C(X) = {θ: L(θ) > c}

for a suitable c.

4. Hypothesis testing: decide whether or not θ ∈ Θ_0 where Θ_0 ⊂ Θ. We base our decision on the likelihood ratio

sup{L(θ): θ ∈ Θ_0} / sup{L(θ): θ ∈ Θ}.
Maximum Likelihood Estimation

To find MLE maximize .

Typical function maximization problem:

Set gradient of equal to 0

Check root is maximum, not minimum or saddle point.

Often is product of terms (given independent observations).

Much easier to work with logarithm of : log of product is sum and logarithm is monotone increasing.

Definition: The Log Likelihood function is ℓ(θ) = log{L(θ)}.

Samples from MVN Population

Simplest problem: collect replicate measurements from single population.

Model: X_1, …, X_n are iid MVN_p(μ, Σ).

Parameters (θ): μ, Σ. Parameter space: μ ∈ R^p and Σ is some p × p positive definite matrix.

Log likelihood is

ℓ(μ, Σ) = −np log(2π)/2 − n log det(Σ)/2 − Σ_i (X_i − μ)^TΣ^{-1}(X_i − μ)/2.

Take derivatives:

∂ℓ/∂μ = Σ^{-1} Σ_i (X_i − μ) = nΣ^{-1}(X̄ − μ),

where X̄ = Σ_i X_i/n. The second derivative wrt μ is a matrix:

∂²ℓ/∂μ∂μ^T = −nΣ^{-1}.

Fact: if the second derivative matrix is negative definite at a critical point then the critical point is a maximum.

Fact: if the second derivative matrix is negative definite everywhere then the function is concave; no more than 1 critical point.

Summary: ℓ is maximized over μ at

μ̂ = X̄

(regardless of choice of Σ).

More difficult: differentiate ℓ wrt Σ.

Somewhat simpler: set Q = Σ^{-1} and write

ℓ = −np log(2π)/2 + n log det(Q)/2 − n tr(QΣ̂)/2, where Σ̂ = Σ_i (X_i − X̄)(X_i − X̄)^T/n.

The first derivative wrt Q is a matrix with entries ∂ℓ/∂Q_{uv}.

Warning: the method used ignores the symmetry of Q.

Need: the derivative of two functions, det(Q) and tr(QΣ̂):

∂ tr(QΣ̂)/∂Q_{uv} = Σ̂_{vu}

and

∂ det(Q)/∂Q_{uv} = (−1)^{u+v} det(Q_{(uv)}),

where Q_{(uv)} denotes the matrix obtained from Q by removing row u and column v.

Fact: det(Q) = Σ_v (−1)^{u+v} Q_{uv} det(Q_{(uv)}); expansion by minors.

Conclusion

∂ log det(Q)/∂Q_{uv} = (Q^{-1})_{vu}

and

∂ tr(QΣ̂)/∂Q_{uv} = Σ̂_{vu}.

Implication

∂ℓ/∂Q_{uv} = n{(Q^{-1})_{vu} − Σ̂_{vu}}/2.

Set this = 0 and find the only critical point is Q^{-1} = Σ̂, i.e.

Σ̂ = Σ_i (X_i − X̄)(X_i − X̄)^T/n.

The usual sample covariance matrix is

S = Σ_i (X_i − X̄)(X_i − X̄)^T/(n − 1).
Properties of MLEs:

1) E(X̄) = μ: X̄ is unbiased.

2) E(Σ̂) = (n − 1)Σ/n: Σ̂ is biased; E(S) = Σ, so S is unbiased.

Distribution of X̄? Joint distribution of X̄ and S?
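A small numerical check of the relation between Σ̂ and S (a sketch in Python/NumPy rather than the course's S-Plus; the data are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = rng.standard_normal((n, 3))          # rows are iid observations

xbar = X.mean(axis=0)                    # MLE of mu
centered = X - xbar
Sigma_hat = centered.T @ centered / n    # MLE of Sigma (divide by n)
S = centered.T @ centered / (n - 1)      # usual sample covariance matrix

print(np.allclose(Sigma_hat, (n - 1) / n * S))   # True
print(np.allclose(S, np.cov(X, rowvar=False)))   # True
```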

Univariate Normal samples: Distribution Theory

Theorem: Suppose X_1, …, X_n are independent N(μ, σ²) random variables. Then

1. X̄ (sample mean) and s² (sample variance) are independent.

2. n^{1/2}(X̄ − μ)/σ ~ N(0, 1).

3. (n − 1)s²/σ² ~ χ²_{n−1}.

4. n^{1/2}(X̄ − μ)/s ~ t_{n−1}.

Proof: Let Z_i = (X_i − μ)/σ.

Then Z_1, …, Z_n are independent N(0, 1).

So Z = (Z_1, …, Z_n)^T is multivariate standard normal.

Note that X̄ = σZ̄ + μ and s² = σ²Σ_i(Z_i − Z̄)²/(n − 1). Thus

n^{1/2}(X̄ − μ)/σ = n^{1/2}Z̄

and

(n − 1)s²/σ² = Σ_i (Z_i − Z̄)²,

where Z̄ = Σ_i Z_i/n.

So: reduced to μ = 0 and σ = 1.

Step 1: Define

Y = (n^{1/2}Z̄, Z_1 − Z̄, …, Z_n − Z̄)^T.

(So Y has dimension n + 1.) Now

Y = MZ,

where M denotes the corresponding (n + 1) × n matrix.

It follows that Y ~ MVN(0, MM^T), so we need to compute MM^T: the covariance between n^{1/2}Z̄ and each Z_i − Z̄ is

n^{1/2}(1/n − 1/n) = 0.

Put Y_1 = n^{1/2}Z̄ and Y_2 = (Z_1 − Z̄, …, Z_n − Z̄)^T. Since

Cov(Y_1, Y_2) = 0,

conclude Y_1 and Y_2 are independent and each is normal.

Thus n^{1/2}Z̄ is independent of (Z_1 − Z̄, …, Z_n − Z̄).

Since s² is a function of Y_2 we see that n^{1/2}Z̄ and s² are independent.

Also, see n^{1/2}Z̄ ~ N(0, 1).

First 2 parts done.

Consider part 3. Note that (n − 1)s²/σ² = Σ_i (Z_i − Z̄)² = Z^T(I − 11^T/n)Z.

Suppose Z ~ MVN_n(0, I) and A is symmetric. Put A = PDP^T for D diagonal, P orthogonal.

Then

Z^TAZ = (P^TZ)^TD(P^TZ) = W^TDW,

where

W = P^TZ.

But W is standard multivariate normal (Var(W) = P^TP = I).

So: Z^TAZ has the same distribution as

Σ_i λ_iW_i²,

where λ_1, …, λ_n are the eigenvalues of A.

Special case: if all λ_i are either 0 or 1 then Z^TAZ has a chi-squared distribution with df = number of λ_i equal to 1.

When are eigenvalues all 1 or 0?

Answer: if and only if A is idempotent.

1) If A is idempotent and (λ, x) is an eigenpair then

Ax = λx

and

A²x = λ²x,

so

λ² = λ,

proving λ is 0 or 1.

2) Conversely if all eigenvalues of A are 0 or 1 then D has 1s and 0s on the diagonal so

D² = D

and

A² = PDP^TPDP^T = PD²P^T = A.

Next case: X ~ MVN_n(0, Σ). Then X = AZ with AA^T = Σ.

Since X^TQX = Z^TA^TQAZ it has the law

Σ_i λ_iW_i²,

where the λ_i are eigenvalues of A^TQA. But

det(A^TQA − λI) = 0

implies

det(QAA^T − λI) = 0.

So the eigenvalues are those of QΣ, and X^TQX is χ²_ν iff QΣ is idempotent and ν = tr(QΣ).

Our case: the matrix of the quadratic form is I − 11^T/n, with Σ = I. Check (I − 11^T/n)² = I − 11^T/n. How many degrees of freedom: tr(I − 11^T/n).

Defn: The trace of a square matrix A is

tr(A) = Σ_i A_{ii}.

Property: tr(AB) = tr(BA).

So:

tr(11^T/n) = tr(1^T1)/n = n/n = 1.

Conclusion: df for (n − 1)s²/σ² is

tr(I − 11^T/n) = n − 1.
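The idempotence and trace computation can be checked directly (a sketch assuming NumPy; n is illustrative):

```python
import numpy as np

n = 10
one = np.ones((n, 1))
M = np.eye(n) - one @ one.T / n   # centering matrix I - 11^T/n

# M is symmetric idempotent, so its eigenvalues are 0 or 1 and the
# chi-squared degrees of freedom equal tr(M) = n - 1
print(np.allclose(M @ M, M))      # True
print(np.trace(M))                # 9.0 (= n - 1)
```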

Derivation of the χ² density:

Suppose Z_1, …, Z_n are independent N(0, 1). Define the χ²_n distribution to be that of S = Z_1² + ⋯ + Z_n². Define angles θ_1, …, θ_{n−1} by

Z_1 = R cos θ_1, Z_2 = R sin θ_1 cos θ_2, …, Z_n = R sin θ_1 ⋯ sin θ_{n−1}.

(Spherical co-ordinates in n dimensions. The values θ_i run from 0 to π except the last, which runs from 0 to 2π.)

In the matrix of partial derivatives of (Z_1, …, Z_n) wrt (r, θ_1, …, θ_{n−1}), every entry in the first column (the r column) is free of r, while every other entry has a factor r.

FACT: multiplying a column in a matrix by c multiplies the determinant by c.

SO: the Jacobian of the transformation is

r^{n−1}h(θ_1, …, θ_{n−1})

for some function, h, which depends only on the angles.

Thus the joint density of R, θ_1, …, θ_{n−1} is

(2π)^{-n/2} e^{-r²/2} r^{n−1} h(θ_1, …, θ_{n−1}).

To compute the density of R we must do an (n − 1) dimensional multiple integral over the angles, giving

f_R(r) = c r^{n−1} e^{-r²/2}

for some constant c.

Evaluate c by making ∫ f_R = 1.

Substitute s = r², ds = 2r dr, to see that S = R² has density

f_S(s) = (c/2) s^{n/2−1} e^{-s/2},

and the normalizing constant is a Γ integral.

CONCLUSION: the χ²_n density is

f_S(s) = s^{n/2−1} e^{-s/2}/{2^{n/2} Γ(n/2)}, s > 0.

Fourth part: consequence of first 3 parts and def'n of the t distribution.

Defn: T ~ t_ν if T has the same distribution as

Z/(U/ν)^{1/2}

for Z ~ N(0, 1), U ~ χ²_ν and Z, U independent.

Derive the density of T in this definition:

P(T ≤ t) = ∫_0^∞ f_U(u) ∫_{-∞}^{t(u/ν)^{1/2}} f_Z(z) dz du.

Differentiate wrt t by differentiating the inner integral:

∂/∂t ∫_{-∞}^{t(u/ν)^{1/2}} f_Z(z) dz = (u/ν)^{1/2} f_Z(t(u/ν)^{1/2})

by the fundamental thm of calculus. Hence

f_T(t) = ∫_0^∞ f_U(u)(u/ν)^{1/2} f_Z(t(u/ν)^{1/2}) du.

Plug in

f_Z(z) = (2π)^{-1/2} e^{-z²/2} and f_U(u) = u^{ν/2−1} e^{-u/2}/{2^{ν/2} Γ(ν/2)}

to get

f_T(t) = ∫_0^∞ u^{(ν+1)/2−1} exp{−u(1 + t²/ν)/2} du/{(2πν)^{1/2} 2^{ν/2} Γ(ν/2)}.

Substitute y = u(1 + t²/ν)/2, to get

f_T(t) = Γ((ν + 1)/2)/{(νπ)^{1/2} Γ(ν/2)} × (1 + t²/ν)^{-(ν+1)/2}.

Multivariate Normal samples: Distribution Theory

Theorem: Suppose are independent random variables. Then

1. (sample mean)and (sample variance-covariance matrix) are independent.

2. .

3. .

4. is Hotelling's . has an distribution.

Proof: Let where and are independent .

So .

Note that and

Thus

and

where

Consequences. In 1, 2 and 4: can assume and . In 3 can take . Step 1: Do general . Define

(So has dimension .) Clearly is with mean 0.

Compute the variance covariance matrix Var(Y),

which has a pattern. It is a patterned matrix with (i, j) block being c_{ij}Σ for scalars c_{ij}.

Kronecker Products

Defn: If A is m × n and B is p × q then A ⊗ B is the mp × nq matrix with the pattern [A_{ij}B] (an m × n array of p × q blocks).

So our variance covariance matrix is C ⊗ Σ for the patterned matrix C above.

Conclusions so far:

1) X̄ and (X_1 − X̄, …, X_n − X̄), and hence X̄ and S, are independent.

2) n^{1/2}(X̄ − μ) ~ MVN_p(0, Σ).

Next: Wishart law.

Defn: The Wishart(n, Σ) distribution is the distribution of

W = Σ_{i=1}^n Z_iZ_i^T,

where Z_1, …, Z_n are iid MVN_p(0, Σ).

Properties of Wishart.

1) If W ~ Wishart(n, Σ) then E(W) = nΣ.

2) If W_1 ~ Wishart(n_1, Σ) and W_2 ~ Wishart(n_2, Σ) are independent then W_1 + W_2 ~ Wishart(n_1 + n_2, Σ).

Proof of part 3: rewrite

(n − 1)S = Σ_i (X_i − X̄)(X_i − X̄)^T

in the form

Σ_{i=1}^{n−1} U_iU_i^T

for U_i iid MVN_p(0, Σ). Put the X_i^T as rows in a matrix X which is n × p. Then check that

(n − 1)S = X^T(I − 11^T/n)X.

Write I − 11^T/n = Σ_{i=1}^{n−1} v_iv_i^T for orthogonal unit vectors v_1, …, v_{n−1}, each orthogonal to 1. Define

U_i = X^Tv_i

and compute covariances to check that the U_i are iid MVN_p(0, Σ). Then check that

Σ_{i=1}^{n−1} U_iU_i^T = X^T(I − 11^T/n)X = (n − 1)S.

Proof of 4: suffices to have μ = 0, Σ = I.

Uses further props of the Wishart distribution.

3: If W ~ Wishart(n, Σ) and a is a vector then a^TWa/a^TΣa ~ χ²_n.

4: If W ~ Wishart(n, Σ) with n ≥ p and a is a vector then a^TΣ^{-1}a/a^TW^{-1}a ~ χ²_{n−p+1}.

5: If W ~ Wishart(n, Σ) with n > p + 1 then E(W^{-1}) = Σ^{-1}/(n − p − 1).

6: If W ~ Wishart(n, Σ) is partitioned into components then the diagonal block W_{11} ~ Wishart(n, Σ_{11}).

One sample tests on mean vectors

Given data X_1, …, X_n iid MVN_p(μ, Σ), test μ = μ_0

by computing

T² = n(X̄ − μ_0)^TS^{-1}(X̄ − μ_0)

and getting p-values from the F_{p,n−p} distribution of (n − p)T²/{(n − 1)p} using the theorem.

Example: no realistic ones. This hypothesis is not intrinsically useful. However: other tests can sometimes be reduced to it.

Example: Eleven water samples split in half. One half of each goes to each of two labs. Measure biological oxygen demand (BOD) and suspended solids (SS). For sample i let X_{i1} be BOD for lab A, X_{i2} be SS for lab A, X_{i3} be BOD for lab B and X_{i4} be SS for lab B. Question: are labs measuring the same thing? Is there bias in one or the other?

Notation: X_i is the vector of 4 measurements on sample i.

Data:

 Sample   Lab A BOD   Lab A SS   Lab B BOD   Lab B SS
      1           6         27          25         15
      2           6         23          28         13
      3          18         64          36         22
      4           8         44          35         29
      5          11         30          15         31
      6          34         75          44         64
      7          28         26          42         30
      8          71        124          54         64
      9          43         54          34         56
     10          33         30          29         20
     11          20         14          39         21

Model: X_1, …, X_11 are iid MVN_4(μ, Σ).

Multivariate problem because: not able to assume independence between any two measurements on same sample.

Potential sub-model: each measurement is

true mmnt + lab bias + mmnt error.

Model for measurement error vector is multivariate normal mean 0 and diagonal covariance matrix .

Lab bias is unknown vector .

True measurement should be same for both labs so has form

where are iid bivariate normal with unknown means and unknown variance covariance .

This would give structured model

where

This model has variance covariance matrix

Notice that this matrix has only 7 parameters: four for the diagonal entries in and 3 for the entries in .

We skip this model and let be unrestricted.

Question of interest:

μ_1 = μ_3

and

μ_2 = μ_4.

Reduction: partition X_i as

X_i = (X_{i,A}^T, X_{i,B}^T)^T,

where X_{i,A} and X_{i,B} each have two components (the Lab A and Lab B measurements).

Define W_i = X_{i,A} − X_{i,B}. Then our model makes W_1, …, W_11 iid MVN_2(μ_W, Σ_W). Our hypothesis is μ_W = 0.

Carrying out our test in SPlus:

Working on CSS unix workstation:

Start SPlus then read in, print out data:


[61]ehlehl% mkdir .Data
[62]ehlehl% Splus
S-PLUS : Copyright (c) 1988, 1996 MathSoft, Inc.
Version 3.4 Release 1 for Sun SPARC, SunOS 5.3 : 1996
Working data will be in .Data
> # Read in and print out data
> eff
BODLabA SSLabA BODLabB SSLabB
1       6     27      25     15
2       6     23      28     13
3      18     64      36     22
4       8     44      35     29
5      11     30      15     31
6      34     75      44     64
7      28     26      42     30
8      71    124      54     64
9      43     54      34     56
10      33     30      29     20
11      20     14      39     21

Do some graphical preliminary analysis.

Look for non-normality, non-linearity, outliers.

Make plots on screen or saved in file.


> # Make pairwise scatterplots on screen using
> # motif graphics device and then in a postscript
> # file.
> motif()
> pairs(eff)
> postscript("pairs.ps",horizontal=F,
+   height=6,width=6)
> pairs(eff)
> dev.off()
Generated postscript file "pairs.ps".
motif
2

Examine correlations

> cor(eff)
BODLabA    SSLabA   BODLabB    SSLabB
BODLabA 0.9999999 0.7807413 0.7228161 0.7886035
SSLabA 0.7807413 1.0000000 0.6771183 0.7896656
BODLabB 0.7228161 0.6771183 1.0000001 0.6038079
SSLabB 0.7886035 0.7896656 0.6038079 1.0000001

Notice high correlations.

Mostly caused by variation in true levels from sample to sample.

Get partial correlations.

Adjust for overall BOD and SS content of sample.


> aug <- cbind(eff,(eff[,1]+eff[,3])/2,
+                  (eff[,2]+eff[,4])/2)
> aug
BODLabA SSLabA BODLabB SSLabB   X2   X3
1       6     27      25     15 15.5 21.0
2       6     23      28     13 17.0 18.0
3      18     64      36     22 27.0 43.0
4       8     44      35     29 21.5 36.5
5      11     30      15     31 13.0 30.5
6      34     75      44     64 39.0 69.5
7      28     26      42     30 35.0 28.0
8      71    124      54     64 62.5 94.0
9      43     54      34     56 38.5 55.0
10      33     30      29     20 31.0 25.0
11      20     14      39     21 29.5 17.5
> bigS <- var(aug)


Now compute partial correlations for first four variables given means of BOD and SS:


> S11 <- bigS[1:4,1:4]
> S12 <- bigS[1:4,5:6]
> S21 <- bigS[5:6,1:4]
> S22 <- bigS[5:6,5:6]
> S11dot2 <- S11 - S12 %*% solve(S22,S21)
> S11dot2
BODLabA     SSLabA    BODLabB     SSLabB
BODLabA  24.804665  -7.418491 -24.804665   7.418491
SSLabA  -7.418491  59.142084   7.418491 -59.142084
BODLabB -24.804665   7.418491  24.804665  -7.418491
SSLabB   7.418491 -59.142084  -7.418491  59.142084
> S11dot2SD <- diag(sqrt(diag(S11dot2)))
> S11dot2SD
[,1]     [,2]     [,3]     [,4]
[1,] 4.980428 0.000000 0.000000 0.000000
[2,] 0.000000 7.690389 0.000000 0.000000
[3,] 0.000000 0.000000 4.980428 0.000000
[4,] 0.000000 0.000000 0.000000 7.690389
> R11dot2 <- solve(S11dot2SD)%*%
+           S11dot2%*%solve(S11dot2SD)
> R11dot2
[,1]      [,2]      [,3]      [,4]
[1,]  1.000000 -0.193687 -1.000000  0.193687
[2,] -0.193687  1.000000  0.193687 -1.000000
[3,] -1.000000  0.193687  1.000000 -0.193687
[4,]  0.193687 -1.000000 -0.193687  1.000000

Notice little residual correlation. Carry out Hotelling's T² test of μ_W = 0.

> w <- eff[,1:2]-eff[3:4]
> dimnames(w)<-list(NULL,c("BODdiff","SSdiff"))
> w
BODdiff SSdiff
[1,]     -19     12
[2,]     -22     10
etc
[8,]      17     60
etc
> Sw <- var(w)
> cor(w)
BODdiff    SSdiff
BODdiff 1.0000001 0.3057682
SSdiff 0.3057682 1.0000000
> mw <- apply(w,2,mean)
> mw
BODdiff   SSdiff
-9.363636 13.27273
> Tsq <- 11*mw%*%solve(Sw,mw)
> Tsq
[,1]
[1,] 13.63931
> FfromTsq <- (11-2)*Tsq/(2*(11-1))
> FfromTsq
[,1]
[1,] 6.13769
> 1-pf(FfromTsq,2,9)
[1] 0.02082779

Conclusion: Pretty clear evidence of difference in mean level between labs. Which measurement causes the difference?

> TBOD <- sqrt(11)*mw[1]/sqrt(Sw[1,1])
> TBOD
BODdiff
-2.200071
> 2*pt(TBOD,1)
BODdiff
0.2715917
> 2*pt(TBOD,10)
BODdiff
0.05243474
> TSS <- sqrt(11)*mw[2]/sqrt(Sw[2,2])
> TSS
SSdiff
2.15153
> 2*pt(-TSS,10)
SSdiff
0.05691733
> postscript("differences.ps",
+      horizontal=F,height=6,width=6)
> plot(w)
> abline(h=0)
> abline(v=0)
> dev.off()

Conclusion? Neither? Not a problem - summarizes evidence!

Problem: several tests at level 0.05 on same data. Simultaneous or Multiple comparisons.

In general can test the hypothesis μ = μ_0 by computing T² = n(X̄ − μ_0)^TS^{-1}(X̄ − μ_0) and then testing using Hotelling's T² distribution.
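The T² computation from the S-Plus session can be replicated as follows (a sketch in Python/NumPy/SciPy rather than S-Plus, using the effluent data tabulated above):

```python
import numpy as np
from scipy import stats

# The 11 effluent samples (Lab A BOD, Lab A SS, Lab B BOD, Lab B SS)
eff = np.array([[ 6,  27, 25, 15],
                [ 6,  23, 28, 13],
                [18,  64, 36, 22],
                [ 8,  44, 35, 29],
                [11,  30, 15, 31],
                [34,  75, 44, 64],
                [28,  26, 42, 30],
                [71, 124, 54, 64],
                [43,  54, 34, 56],
                [33,  30, 29, 20],
                [20,  14, 39, 21]], dtype=float)

w = eff[:, :2] - eff[:, 2:]             # Lab A minus Lab B differences
n, p = w.shape                          # n = 11, p = 2
mw = w.mean(axis=0)
Sw = np.cov(w, rowvar=False)

Tsq = n * mw @ np.linalg.solve(Sw, mw)  # Hotelling's T^2
F = (n - p) * Tsq / ((n - 1) * p)       # F_{p, n-p} under the null
pval = stats.f.sf(F, p, n - p)

print(round(Tsq, 5), round(F, 5), round(pval, 5))
# 13.63931 6.13769 0.02083, matching the S-Plus session
```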

Simultaneous confidence intervals

Confidence interval for a single component μ_{W,j}:

W̄_j ± 2.23 S_{jj}^{1/2}/11^{1/2}.

Based on the t_{10} distribution (2.23 is its 0.975 point).

Give 95% coverage intervals for the 6 parameters of interest: the 4 entries in μ and the two differences μ_1 − μ_3 and μ_2 − μ_4.

Problem: each confidence interval has a 5% error rate. Pick out the last interval (on the basis of looking most interesting) and ask about its error rate?

Solution: adjust the multiplier 2.23 to get

P(all intervals cover truth) ≥ 0.95.

Rao or Scheffé type intervals

Based on the inequality:

(a^Tx)² ≤ (a^TMa)(x^TM^{-1}x)

for any symmetric positive definite matrix M.

Proof by Cauchy Schwarz: apply (u^Tv)² ≤ (u^Tu)(v^Tv) to the inner product of the vectors u = M^{1/2}a and v = M^{-1/2}x.

Put x = W̄ − μ_W and M = S/n to get

{a^T(W̄ − μ_W)}² ≤ (a^TSa/n) × n(W̄ − μ_W)^TS^{-1}(W̄ − μ_W).

This inequality is true for all a. Thus the event that there is any a such that

|a^TW̄ − a^Tμ_W| > c(a^TSa/n)^{1/2}

is a subset of the event

T² > c².

Adjust c to make the latter event have probability 0.05 by taking

c² = (n − 1)p F_{p,n−p,0.05}/(n − p).

Then the probability that every one of the uncountably many confidence intervals

a^TW̄ ± c(a^TSa/n)^{1/2}

covers the corresponding true parameter value a^Tμ_W is at least 0.95.

In fact the probability of this happening is exactly 0.95, because for each data set the supremum of

{a^T(W̄ − μ_W)}²/(a^TSa/n)

over all a is T².

Our case

Coverage probability of single interval using ? From distribution:

Probability all 6 intervals would cover using ?

Use the Bonferroni inequality: P(∪A_i) ≤ Σ_i P(A_i), with A_i the event that interval i fails to cover.

Simultaneous coverage probability of 6 intervals, each at level 1 − α, is then at least 1 − 6α.

Usually just use α/6 in place of α for each interval.

General Bonferroni strategy. If we want k intervals, get the interval for parameter i at level 1 − α/k. Simultaneous coverage probability is at least 1 − α. Notice that the Bonferroni intervals are narrower in our example unless k is quite large.

Motivations for T²:

1: Hypothesis μ_W = 0 is true iff all the hypotheses a^Tμ_W = 0 are true. Natural test for a^Tμ_W = 0 rejects if

|a^TW̄|/(a^TSa/n)^{1/2}

is large. So take the largest such test statistic.

Fact:

sup_a n(a^TW̄)²/(a^TSa) = T².

Proof: like the calculation of the maximal correlation.

2: likelihood ratio method.

Compute

Λ = max_{μ = 0} L(μ, Σ)/max L(μ, Σ),

where the restricted maximum indicates estimation assuming μ_W = 0.

In our case to test μ_W = 0 find

μ̂_0 = 0 and Σ̂_0 = Σ_i W_iW_i^T/n,

and

μ̂ = W̄ and Σ̂ = Σ_i (W_i − W̄)(W_i − W̄)^T/n.

Now write

Σ_i W_iW_i^T = Σ_i (W_i − W̄)(W_i − W̄)^T + nW̄W̄^T.

Use the formula:

det(A + xx^T) = det(A)(1 + x^TA^{-1}x)

to get

det(Σ̂_0) = det(Σ̂){1 + W̄^TΣ̂^{-1}W̄},

so that the ratio of determinants is a monotone increasing function of T².

Again conclude: the likelihood ratio test rejects for T² > c, where c is chosen to make the level α.

3: compare estimates of Σ.

In univariate regression, tests to compare a restricted model with a full model have the form

F = {(ESS_R − ESS_F)/(df_R − df_F)}/{ESS_F/df_F}.

This is a monotone function of

ESS_R/ESS_F,

where ESS denotes an error sum of squares and ESS/df an estimate of the residual variance - σ².

Here: substitute matrices.

Analogue of ESS for the full model:

E = Σ_i (W_i − W̄)(W_i − W̄)^T.

Analogue of ESS for the reduced model: E + H, where

H = nW̄W̄^T.

(H is defined to be the change in the Sum of Squares and Cross Products matrix.)

In the 1 sample example:

E = Σ_i (W_i − W̄)(W_i − W̄)^T

and

H = nW̄W̄^T.

Test of μ_W = 0 based on comparing

E + H

to

E.

To make the comparison: if the null is true

E/(n − 1) ≈ Σ

and

(E + H)/n ≈ Σ,

so try tests based on closeness of

E^{-1}(E + H) = I + E^{-1}H

to the identity matrix.

Measures of size based on eigenvalues of

E^{-1}H,

which are the same as the eigenvalues of

E^{-1/2}HE^{-1/2}.

Suggested size measures for E^{-1}H:

• tr(E^{-1}H) (= sum of eigenvalues).

• det(E^{-1}H) (= product of eigenvalues).

• maximum eigenvalue of E^{-1}H.

For our matrix E^{-1}H: eigenvalues all 0 except for one. (So really: the product criterion is 0 even when the matrix is not close to 0.)

Largest eigenvalue is

nW̄^TE^{-1}W̄ = T²/(n − 1).

But: see the two sample problem for precise tests based on these suggestions.

Two sample problem
Given data X_1, …, X_n and Y_1, …, Y_m. Model: X_i ~ MVN_p(μ_1, Σ_1), Y_j ~ MVN_p(μ_2, Σ_2), all independent.

Test μ_1 = μ_2.

Case 1: for motivation only. Σ_1, Σ_2 known.

Natural test statistic: based on

X̄ − Ȳ,

which has X̄ − Ȳ ~ MVN_p(μ_1 − μ_2, Ξ) where

Ξ = Σ_1/n + Σ_2/m.

So

(X̄ − Ȳ)^TΞ^{-1}(X̄ − Ȳ)

has a χ²_p distribution if the null is true.

If Σ_1, Σ_2 are not known we must estimate. No universally agreed best procedure (even for p = 1 -- called the Behrens-Fisher problem).

Usually: assume Σ_1 = Σ_2 = Σ.

If so: the MLE of μ_1 is X̄, of μ_2 is Ȳ, and of Σ is

Σ̂ = {Σ_i (X_i − X̄)(X_i − X̄)^T + Σ_j (Y_j − Ȳ)(Y_j − Ȳ)^T}/(n + m).

Usual estimate of Σ is

Pooled S = {(n − 1)S_X + (m − 1)S_Y}/(n + m − 2),

which is unbiased.

Possible test developments:

1) By analogy with 1 sample:

T² = (X̄ − Ȳ)^T{(1/n + 1/m) Pooled S}^{-1}(X̄ − Ȳ),

for which

(n + m − p − 1)T²/{(n + m − 2)p}

has the F_{p, n+m−p−1} distribution.

2) Union-intersection: test of a^Tμ_1 = a^Tμ_2 based on the univariate two sample t statistic for the a^TX_i and a^TY_j.

Maximize t²(a) over all a.

Get T² again.

3) Likelihood ratio: the MLE of Σ for the unrestricted model is

Σ̂ = E/(n + m),

where E = Σ_i (X_i − X̄)(X_i − X̄)^T + Σ_j (Y_j − Ȳ)(Y_j − Ȳ)^T. Under the hypothesis the mle of Σ is

Σ̂_0 = (E + H)/(n + m),

where

H = {nm/(n + m)}(X̄ − Ȳ)(X̄ − Ȳ)^T.

This simplification uses the pooled mean μ̂_0 = (nX̄ + mȲ)/(n + m).

The log-likelihood ratio is a multiple of

log det(Σ̂_0) − log det(Σ̂),

which is a function of

det(E + H)/det(E),

or equivalently a function of Wilks' Λ:

Λ = det(E)/det(E + H).

Compute det: multiply together eigenvalues.

If λ_1, …, λ_p are the eigenvalues of E^{-1}H then

Λ = Π_i 1/(1 + λ_i).
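The two equivalent forms of Wilks' Λ can be checked numerically (a sketch assuming NumPy; the E and H matrices are illustrative, with H rank 1 as in a single-contrast test):

```python
import numpy as np

# Illustrative SSCP matrices (E positive definite, H non-negative definite)
E = np.array([[30.8, 12.2],
              [12.2, 22.8]])
h = np.array([[7.4], [8.4]])
H = h @ h.T                      # rank-1 hypothesis matrix

# Wilks' Lambda two ways: determinant ratio, and eigenvalues of E^{-1} H
Lam_det = np.linalg.det(E) / np.linalg.det(E + H)
lams = np.linalg.eigvals(np.linalg.solve(E, H)).real
Lam_eig = np.prod(1.0 / (1.0 + lams))

print(np.isclose(Lam_det, Lam_eig))   # True
```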

Two sample analysis in SAS on css network

• Type sas to start system.

• Several windows open. Go to Program Editor.

• Under file menu open file with sas code. Contents of sas2sample.sas

data long;
infile 'tab57sh';
input group a b c;
run;
proc print;
run;
proc glm;
class group;
model a b c = group;
manova h=group / printh printe;
run;

Notes:

1) First 4 lines form DATA step:

a) creates data set named long by reading in 4 columns of data from file named tab57sh stored in same directory as I was in when I typed sas.

b) Calls variables group (=1 or 2 as label for the two groups) and a, b, c which are names for the 3 test scores for each subject.

2) Next two lines: print out data: result is (slightly edited)


Obs    group     a     b     c
1      1      19    20    18
2      1      20    21    19
3      1      19    22    22
etc till
11      2      15    17    15
12      2      13    14    14
13      2      14    16    13

3) Then use proc glm to do analysis:

a) class group declares that the variable group defines levels of a categorical variable.

b) model statement says to regress the variables a, b, c on variable group.

c) manova statement says to do both the 3 univariate regressions and a multivariate regression and to print out the $H$ and $E$ matrices, where $H$ is the hypothesis SSCP matrix corresponding to the presence of the factor group in the model and $E$ is the error SSCP matrix.

Output of MANOVA: First univariate results


The GLM Procedure
Class Level Information
Class         Levels    Values
group              2    1 2
Number of observations    13
Dependent Variable: a
Sum of
Source    DF   Squares  Mean Square F Value Pr > F
Model      1  54.276923  54.276923   19.38 0.0011
Error     11  30.800000   2.800000
Corrd Tot 12  85.076923
R-Square   Coeff Var      Root MSE        a Mean
0.637975    10.21275      1.673320      16.38462
Source    DF  Type ISS  Mean Square F Value Pr > F
group      1  54.276923  54.276923   19.38 0.0011
Source    DF  TypeIIISS Mean Square F Value Pr > F
group      1  54.276923  54.276923   19.38 0.0011

Dependent Variable: b
Sum of
Source    DF   Squares  Mean Square F Value Pr > F
Model      1  70.892308  70.892308  34.20 0.0001
Error     11  22.800000   2.072727
Corrd Tot 12  93.692308

Dependent Variable: c
Sum of
Source    DF   Squares Mean Square F Value Pr > F
Model      1   94.77692  94.77692   39.64 <.0001
Error     11   26.30000   2.39090
Corrd Tot 12  121.07692

The matrices $E$ and $H$.

E = Error SSCP Matrix
a     b     c
a 30.8  12.2  10.2
b 12.2  22.8   3.8
c 10.2   3.8  26.3
Partial Correlation Coefficients from
the Error SSCP Matrix / Prob > |r|
DF = 11     a              b              c
a    1.000000       0.460381       0.358383
0.1320         0.2527
b    0.460381       1.000000       0.155181
0.1320                        0.6301
c    0.358383       0.155181       1.000000
0.2527         0.6301
H = Type III SSCP Matrix for group
a                 b                 c
a   54.276923077   62.030769231   71.723076923
b   62.030769231   70.892307692   81.969230769
c   71.723076923   81.969230769   94.776923077

The eigenvalues of $E^{-1}H$.

Characteristic Roots and Vectors of: E Inverse * H
H = Type III SSCP Matrix for group
E = Error SSCP Matrix
Characteristic       Characteristic Vector V'EV=1
Root     Percent        a           b           c
5.816159 100.00  0.00403434  0.12874606 0.13332232
0.000000   0.00 -0.09464169 -0.10311602 0.16080216
0.000000   0.00 -0.19278508  0.16868694 0.00000000
MANOVA Test Criteria and Exact F Statistics
for the Hypothesis of No Overall group Effect
H = Type III SSCP Matrix for group
E = Error SSCP Matrix
S=1    M=0.5    N=3.5
Statistic            Value   F   NumDF DenDF Pr > F
Wilks' Lambda       0.1467 17.45    3     9  0.0004
Pillai's Trace      0.8533 17.45    3     9  0.0004
Hotelling-Lawley Tr 5.8162 17.45    3     9  0.0004
Roy's Greatest Root 5.8162 17.45    3     9  0.0004

Things to notice:

1. The conclusion is clear. The mean vectors for the two groups are not the same.

2. The four statistics have the following definitions in terms of the eigenvalues $\lambda_i$ of $E^{-1}H$:

Wilks' Lambda: $\Lambda = \prod_i (1+\lambda_i)^{-1}$.

Pillai's trace: $\sum_i \lambda_i/(1+\lambda_i)$.

Hotelling-Lawley trace: $\sum_i \lambda_i$.

Roy's greatest root: $\lambda_{\max}$, the largest eigenvalue.
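These definitions can be checked numerically against the $E$ and $H$ matrices printed above. The following pure-Python sketch (not part of the original notes; the 3x3 determinant and inverse are hand-rolled to keep it self-contained) recomputes the statistics:

```python
# Recompute SAS's four MANOVA statistics from the E and H matrices
# printed above (two sample example).  Pure Python sketch; with two
# groups H has rank 1, so E^{-1}H has a single nonzero eigenvalue
# lambda and all four statistics are simple functions of it.

E = [[30.8, 12.2, 10.2],
     [12.2, 22.8,  3.8],
     [10.2,  3.8, 26.3]]
H = [[54.276923077, 62.030769231, 71.723076923],
     [62.030769231, 70.892307692, 81.969230769],
     [71.723076923, 81.969230769, 94.776923077]]

def det3(A):
    # determinant of a 3x3 matrix by cofactor expansion along row 0
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
          - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
          + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))

def inv3(A):
    # inverse of a 3x3 matrix via the adjugate (cyclic cofactor formula)
    d = det3(A)
    return [[(A[(j+1) % 3][(i+1) % 3] * A[(j+2) % 3][(i+2) % 3]
            - A[(j+1) % 3][(i+2) % 3] * A[(j+2) % 3][(i+1) % 3]) / d
             for j in range(3)] for i in range(3)]

Einv = inv3(E)
# Hotelling-Lawley trace = tr(E^{-1}H); here it is also the largest
# (only nonzero) eigenvalue, i.e. Roy's greatest root.
lam = sum(Einv[i][k] * H[k][i] for i in range(3) for k in range(3))

EH = [[E[i][j] + H[i][j] for j in range(3)] for i in range(3)]
wilks = det3(E) / det3(EH)       # = prod 1/(1 + lambda_i)
pillai = lam / (1 + lam)         # = sum lambda_i/(1 + lambda_i) here

print(round(lam, 4), round(wilks, 4), round(pillai, 4))
# compare with the SAS table: 5.8162, 0.1467, 0.8533
```

Note that Wilks' Lambda computed from determinants agrees with $1/(1+\lambda)$, as it must when $H$ has rank 1.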

1 way layout
Also called the $k$ sample problem.

Data $X_{ij}$, $i=1,\ldots,k$, $j=1,\ldots,n_i$.

Model: $X_{ij}$ independent, $X_{ij} \sim MVN_p(\mu_i, \Sigma)$.

First problem of interest: test $H_0: \mu_1 = \cdots = \mu_k$.

Based on $E$ and $H$. The MLE of $\mu_i$ is $\bar X_{i\cdot}$.

Under $H_0$ the MLE of $\mu$, the common value of the $\mu_i$, is the grand mean $\bar X_{\cdot\cdot}$.

So

$$E = \sum_i \sum_j (X_{ij} - \bar X_{i\cdot})(X_{ij} - \bar X_{i\cdot})^\top .$$

This makes

$$H = \sum_i \sum_j (\bar X_{i\cdot} - \bar X_{\cdot\cdot})(\bar X_{i\cdot} - \bar X_{\cdot\cdot})^\top .$$

Notice we can do the sum over $j$ to get a factor of $n_i$:

$$H = \sum_i n_i (\bar X_{i\cdot} - \bar X_{\cdot\cdot})(\bar X_{i\cdot} - \bar X_{\cdot\cdot})^\top .$$

Note: the rank of $H$ is the minimum of $p$ and $k-1$. The data:

1 19 20 18
1 20 21 19
1 19 22 22
1 18 19 21
1 16 18 20
1 17 22 19
1 20 19 20
1 15 19 19
2 12 14 12
2 15 15 17
2 15 17 15
2 13 14 14
2 14 16 13
3 15 14 17
3 13 14 15
3 12 15 15
3 12 13 13
4  8  9 10
4 10 10 12
4 11 10 10
4 11  7 12

Code

data three;
infile 'tab57for3sams';
input group a b c;
run;
proc print;
run;
proc glm;
class group;
model a b c = group;
manova h=group / printh printe;
run;
data four;
infile 'table5.7';
input group a b c;
run;
proc print;
run;
proc glm;
class group;
model a b c = group;
manova h=group / printh printe;
run;

Pieces of output: the first set of code does the first 3 groups.

So: $H$ has rank 2 ($= \min(p, k-1) = \min(3, 2)$).


Characteristic Roots & Vectors of: E Inverse * H
Characteristic     Characteristic Vector V'EV=1
Root Percent     a        b        c
6.90568180  96.94   0.01115  0.14375 0.08795
0.21795125   3.06  -0.07763 -0.09587 0.16926
0.00000000   0.00  -0.18231  0.13542 0.02083
S=2    M=0    N=5
Statistic         Value   F   NumDF  Den DF Pr > F
Wilks'           0.1039  8.41   6      24   <.0001
Pillai's         1.0525  4.81   6      26   0.0020
Hotelling-Lawley 7.1236 13.79   6   14.353  <.0001
Roy's            6.9057 29.92   3      13   <.0001
NOTE: F Statistic for Roy's is an upper bound.
NOTE: F Statistic for Wilks' is exact.

Notice two eigenvalues are not 0. Note that an exact distribution for Wilks' Lambda is available. Now 4 groups:

Root Percent      a       b       c
15.3752900  98.30   0.01128  0.13817 0.08126
0.2307260    1.48  -0.04456 -0.09323 0.15451
0.0356937    0.23  -0.17289  0.09020 0.04777
S=3    M=-0.5    N=6.5
Statistic          Value    F    NumDF Den DF Pr > F
Wilks'         0.04790913 10.12    9   36.657 <.0001
Pillai's       1.16086747  3.58    9       51 0.0016
Hot'ng-Lawley 15.64170973 25.02    9   20.608 <.0001
Roy's         15.37528995 87.13    3       17 <.0001
NOTE: F Statistic for Roy's is an upper bound.
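The Wilks' Lambda value above can be reproduced directly from the raw data, without eigenvalues, since $\Lambda = \det(E)/\det(E+H)$. A pure-Python sketch (not part of the original notes; the data are transcribed from the listing above):

```python
# Wilks' Lambda for the 4 group example computed directly from the raw
# data above: E and H as defined, then Lambda = det(E)/det(E+H).
# Pure Python sketch for checking the SAS value 0.04790913.
data = [
    (1, 19, 20, 18), (1, 20, 21, 19), (1, 19, 22, 22), (1, 18, 19, 21),
    (1, 16, 18, 20), (1, 17, 22, 19), (1, 20, 19, 20), (1, 15, 19, 19),
    (2, 12, 14, 12), (2, 15, 15, 17), (2, 15, 17, 15), (2, 13, 14, 14),
    (2, 14, 16, 13),
    (3, 15, 14, 17), (3, 13, 14, 15), (3, 12, 15, 15), (3, 12, 13, 13),
    (4,  8,  9, 10), (4, 10, 10, 12), (4, 11, 10, 10), (4, 11,  7, 12),
]
p = 3

def mean_of(rows):
    # componentwise mean of the p scores in the given rows
    return [sum(r[c + 1] for r in rows) / len(rows) for c in range(p)]

grand = mean_of(data)
E = [[0.0] * p for _ in range(p)]
H = [[0.0] * p for _ in range(p)]
for g in sorted({r[0] for r in data}):
    rows = [r for r in data if r[0] == g]
    gm = mean_of(rows)
    for a in range(p):
        for b in range(p):
            # within-group (error) SSCP
            E[a][b] += sum((r[a+1] - gm[a]) * (r[b+1] - gm[b]) for r in rows)
            # between-group (hypothesis) SSCP: n_i from summing over j
            H[a][b] += len(rows) * (gm[a] - grand[a]) * (gm[b] - grand[b])

def det3(A):
    # determinant of a 3x3 matrix by cofactor expansion along row 0
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
          - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
          + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))

EH = [[E[a][b] + H[a][b] for b in range(p)] for a in range(p)]
wilks = det3(E) / det3(EH)
print(round(wilks, 6))   # compare with SAS: 0.047909
```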


Other Hypotheses
How do mean vectors differ? One possibility:

$$\mu_{ij} = \alpha_i + \beta_j$$

for constants $\alpha_i$ (which do not depend on $j$) and $\beta_j$ (which do not depend on $i$); here $\mu_{ij}$ denotes the $j$th component of $\mu_i$. This is an additive model for the means.

Test $H_0: \mu_{ij} = \alpha_i + \beta_j$ for some constants $\alpha_i$, $\beta_j$?

Define

$$\bar\mu_{i\cdot} = \frac{1}{p}\sum_j \mu_{ij}, \qquad \bar\mu_{\cdot j} = \frac{1}{k}\sum_i \mu_{ij}, \qquad \bar\mu_{\cdot\cdot} = \frac{1}{kp}\sum_i\sum_j \mu_{ij} .$$

Then put

$$\gamma_{ij} = \mu_{ij} - \bar\mu_{i\cdot} - \bar\mu_{\cdot j} + \bar\mu_{\cdot\cdot} .$$

If the $\gamma_{ij}$ are all 0 then

$$\mu_{ij} = \bar\mu_{i\cdot} + \bar\mu_{\cdot j} - \bar\mu_{\cdot\cdot} ,$$

which is additive, so we test the hypothesis that all the $\gamma_{ij}$ are 0.
Univariate Two Way Anova

Data $Y_{ijk}$:

$i = 1, \ldots, I$; $j = 1, \ldots, J$; $k = 1, \ldots, K$.

Model: $Y_{ijk}$ independent, $Y_{ijk} \sim N(\mu_{ij}, \sigma^2)$.

Note: this is the fixed effects model.

Usual approach: define grand mean, main effects, interactions:

$$\mu = \bar\mu_{\cdot\cdot}, \quad \alpha_i = \bar\mu_{i\cdot} - \mu, \quad \beta_j = \bar\mu_{\cdot j} - \mu, \quad \gamma_{ij} = \mu_{ij} - \mu - \alpha_i - \beta_j .$$

Test additive effects: $H_0: \gamma_{ij} = 0$ for all $i$, $j$.

Usual test based on ANOVA:

Stack the observations into a vector $Y$, say.

Estimate $\mu$, $\alpha_i$, etc. by least squares.

Form the vectors $\hat\mu$, $\hat\alpha$, etc. with entries $\hat\mu$, $\hat\alpha_i$, etc.

Write

$$Y = \hat\mu + \hat\alpha + \hat\beta + \hat\gamma + \hat\epsilon .$$

This defines the vector of fitted residuals $\hat\epsilon$.

Fact: all the vectors on the RHS are independent and orthogonal. So:

$$\|Y\|^2 = \|\hat\mu\|^2 + \|\hat\alpha\|^2 + \|\hat\beta\|^2 + \|\hat\gamma\|^2 + \|\hat\epsilon\|^2 .$$

This is the ANOVA table. Usually we define the corrected total sum of squares to be

$$\sum_{ijk} (Y_{ijk} - \bar Y_{\cdot\cdot\cdot})^2 .$$
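The orthogonal decomposition can be verified numerically. This sketch (made-up numbers, not from the notes) computes the least squares estimates for a small balanced layout and checks that the sums of squares add up exactly:

```python
# Verify the orthogonal decomposition on a small balanced two way
# layout with made-up numbers: the corrected total sum of squares
# splits exactly into SS(A) + SS(B) + SS(AB) + SS(error).
I, J, K = 2, 3, 2
Y = [[[10, 12], [14, 13], [ 9, 11]],      # Y[i][j][k], hypothetical data
     [[16, 15], [20, 22], [13, 12]]]

grand = sum(Y[i][j][k] for i in range(I) for j in range(J)
            for k in range(K)) / (I * J * K)
a_hat = [sum(Y[i][j][k] for j in range(J) for k in range(K)) / (J * K) - grand
         for i in range(I)]
b_hat = [sum(Y[i][j][k] for i in range(I) for k in range(K)) / (I * K) - grand
         for j in range(J)]
cell = [[sum(Y[i][j]) / K for j in range(J)] for i in range(I)]
g_hat = [[cell[i][j] - grand - a_hat[i] - b_hat[j] for j in range(J)]
         for i in range(I)]

ssA = J * K * sum(a * a for a in a_hat)           # main effect of A
ssB = I * K * sum(b * b for b in b_hat)           # main effect of B
ssAB = K * sum(g_hat[i][j] ** 2 for i in range(I) for j in range(J))
ssE = sum((Y[i][j][k] - cell[i][j]) ** 2          # residual SS
          for i in range(I) for j in range(J) for k in range(K))
ssTot = sum((Y[i][j][k] - grand) ** 2             # corrected total SS
            for i in range(I) for j in range(J) for k in range(K))

print(abs(ssTot - (ssA + ssB + ssAB + ssE)) < 1e-9)   # prints True
```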

Our problem is like this one BUT the errors are not modeled as independent.

In the analogy:

$i$ labels the group.

$j$ labels the columns: i.e., $j$ tells you whether the score is a, b or c.

$k$ runs from 1 to $n_i$.

But the three scores for a single subject are correlated, so the errors are not independent.

Now do analysis in SAS.

Tell SAS that the variables A, B and C are repeated measurements of the same quantity.


proc glm;
class group;
model a b c = group;
repeated scale;
run;

The results are as follows:

General Linear Models Procedure
Repeated Measures Analysis of Variance
Repeated Measures Level Information
Dependent Variable  A   B    C
Level of SCALE      1   2    3

Manova Test Criteria and Exact F
Statistics for the Hypothesis of no
SCALE Effect
H = Type III SS&CP Matrix for SCALE
E = Error SS&CP Matrix
S=1    M=0    N=7
Statistic
Value     F   NumDF DenDF Pr > F
Wilks' Lambda    0.56373  6.1912   2    16  0.0102
Pillai's Trace   0.43627  6.1912   2    16  0.0102
Hotelling-Lawley 0.77390  6.1912   2    16  0.0102
Roy's            0.77390  6.1912   2    16  0.0102

Note: should look at interactions first.

Manova Test Criteria and F Approximations
for the Hypothesis of no SCALE*GROUP Effect
S=2    M=0    N=7
Statistic          Value   F   NumDF DenDF Pr > F
Wilks' Lambda    0.56333 1.7725  6    32  0.1364
Pillai's Trace   0.48726 1.8253  6    34  0.1234
Hotelling-Lawley 0.68534 1.7134  6    30  0.1522
Roy's            0.50885 2.8835  3    17  0.0662
NOTE: F Statistic for Roy's Greatest
Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.


Only weak evidence of interaction. The repeated statement also produces univariate anovas. Results:

Repeated Measures Analysis of Variance
Tests of Hypotheses for Between Subjects Effects
Source DF Type III SS Mean Square  F    Pr > F
GROUP   3  743.900000  247.966667 70.93 0.0001
Error  17   59.433333    3.496078
Repeated Measures Analysis of Variance
Univariate Tests of Hypotheses for
Within Subject Effects
Source: SCALE                    Adj Pr > F
DF TypeIIISS MS      F   Pr > F G - G  H - F
2  16.624   8.312  5.39 0.0093 0.0101 0.0093
Source: SCALE*GROUP
DF TypeIII   MS    F   Pr > F G - G  H - F
6   18.9619 3.160 2.05 0.0860 0.0889 0.0860
Source: Error(SCALE)
DF TypeIII SS     Mean Square
34  52.4667     1.54313725
Greenhouse-Geisser Epsilon = 0.9664
Huynh-Feldt Epsilon = 1.2806

Greenhouse-Geisser and Huynh-Feldt adjustments test to see if $\Sigma$ has a certain structure.

For the fixed effects model the errors are iid $N(0, \sigma^2)$.

For the MANOVA model the vector of errors for each subject is MVN but with a general covariance $\Sigma$, as in the MANOVA analyses above.

Intermediate model. Put in a subject effect.

Assume

$$Y_{ijk} = \mu_{ij} + \tau_{ik} + \epsilon_{ijk}$$

where the $\tau_{ik}$ are iid $N(0, \sigma_\tau^2)$ and the $\epsilon_{ijk}$ are iid $N(0, \sigma^2)$. Then

$$\mathrm{Cov}(Y_{ijk}, Y_{ij'k}) = \sigma_\tau^2 \text{ for } j \neq j', \qquad \mathrm{Var}(Y_{ijk}) = \sigma_\tau^2 + \sigma^2 .$$

This model is usually not fitted by maximum likelihood but by analyzing the behaviour of the ANOVA table under this model.

Essentially the model says

$$\Sigma = \sigma_\tau^2 \mathbf{1}\mathbf{1}^\top + \sigma^2 I$$

(compound symmetry). GG, HF test for a slightly more general pattern for $\Sigma$.
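The covariance structure implied by the subject-effect model is easy to write down explicitly. A small sketch (the variance components are made-up illustrative values, not estimates from the data):

```python
# The covariance matrix implied by the subject-effect model:
# Sigma = sigma_tau^2 * 11' + sigma^2 * I, i.e. compound symmetry.
# Variance components here are made-up illustrative values.
J = 3               # three repeated measurements per subject
sigma_tau2 = 2.0    # Var(tau), the subject effect
sigma2 = 0.5        # Var(epsilon), the measurement error

Sigma = [[sigma_tau2 + (sigma2 if a == b else 0.0) for b in range(J)]
         for a in range(J)]

# equal variances on the diagonal, equal covariances off it
print(Sigma)   # prints [[2.5, 2.0, 2.0], [2.0, 2.5, 2.0], [2.0, 2.0, 2.5]]
```

Every diagonal entry is $\sigma_\tau^2 + \sigma^2$ and every off-diagonal entry is $\sigma_\tau^2$, which is exactly the compound symmetry pattern the GG and HF adjustments relax.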

Do univariate anova: The data reordered:


1 1 1 19
1 1 2 20
1 1 3 18
2 1 1 20
2 1 2 21
2 1 3 19
et cetera
2 4 2 10
2 4 3 12
3 4 1 11
3 4 2 10
3 4 3 10
4 4 1 11
4 4 2 7
4 4 3 12

The four columns now give the subject number, group, scale (a, b or c) and the response. The sas commands:

data long;
infile 'table5.7uni';
input subject group scale score;
run;
proc print;
run;
proc glm;
class group;
class scale;
class subject;
model score =group subject(group)
scale group*scale;
random subject(group) ;
run;

Some of the output:

Dependent Variable: SCORE
Sum of    Mean
Source   DF    Squares  Square F    Pr > F
Model    28   843.5333 30.126 19.52 0.0001
Error    34    52.4667  1.543
Total  62   896.0000
Root MSE      SCORE Mean
1.242231        15.33333

Source         DF  TypeISS    MS      F     Pr > F

GROUP           3 743.9000 247.9667 160.69  0.0001
SUBJECT(GROUP) 17  59.4333   3.4961   2.27  0.0208
SCALE           2  21.2381  10.6190   6.88  0.0031
GROUP*SCALE     6  18.9620   3.1603   2.05  0.0860

Source         DF TypeIIISS     MS    F     Pr > F
GROUP           3 743.9000  247.9667 160.69 0.0001
SUBJECT(GROUP) 17  59.4333    3.4961   2.27 0.0208
SCALE           2  16.6242    8.3121   5.39 0.0093
GROUP*SCALE     6  18.9619    3.1603   2.05 0.0860

Source         Type III Expected Mean Square

GROUP          Var(Error) + 3 Var(SUBJECT(GROUP))
+ Q(GROUP,GROUP*SCALE)
SUBJECT(GROUP) Var(Error) + 3 Var(SUBJECT(GROUP))
SCALE          Var(Error) + Q(SCALE,GROUP*SCALE)
GROUP*SCALE    Var(Error) + Q(GROUP*SCALE)

Type I Sums of Squares:

• Sequential sums of squares.

• Each line gives the sum of squares for comparing the model containing the effects listed above it to the model with that one extra effect added.

• Depend on order terms listed in model.

Type III Sums of Squares:

• Roughly: each line compares the model containing all the effects to the model with that one effect deleted.

• In unbalanced designs be careful about the differences between Types II, III and IV.

Notice the hypothesis of no group by scale interaction is acceptable.

Under the assumption of no such group by scale interaction the hypothesis of no group effect is tested by dividing group ms by subject(group) ms.

Value is 70.9 on 3,17 degrees of freedom.

This is NOT the F value in the table above since the table above is for FIXED effects.

Notice that the sums of squares in this table match those produced in the repeated measures ANOVA. This is not accidental.
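The mixed-model F test just described can be reproduced from the raw data. This pure-Python sketch (not part of the original notes; data transcribed from the listing earlier in this section) computes the GROUP and SUBJECT(GROUP) sums of squares and their ratio of mean squares:

```python
# Reproduce the mixed-model F test for GROUP from the raw data above:
# F = (GROUP mean square) / (SUBJECT(GROUP) mean square) on 3 and 17
# degrees of freedom.  Pure Python sketch.
data = [
    (1, 19, 20, 18), (1, 20, 21, 19), (1, 19, 22, 22), (1, 18, 19, 21),
    (1, 16, 18, 20), (1, 17, 22, 19), (1, 20, 19, 20), (1, 15, 19, 19),
    (2, 12, 14, 12), (2, 15, 15, 17), (2, 15, 17, 15), (2, 13, 14, 14),
    (2, 14, 16, 13),
    (3, 15, 14, 17), (3, 13, 14, 15), (3, 12, 15, 15), (3, 12, 13, 13),
    (4,  8,  9, 10), (4, 10, 10, 12), (4, 11, 10, 10), (4, 11,  7, 12),
]
J = 3                                  # three scales a, b, c
groups = sorted({r[0] for r in data})
N = len(data)                          # 21 subjects
grand = sum(sum(r[1:]) for r in data) / (N * J)

ss_group = 0.0
ss_subj = 0.0
for g in groups:
    rows = [r for r in data if r[0] == g]
    gmean = sum(sum(r[1:]) for r in rows) / (len(rows) * J)
    ss_group += len(rows) * J * (gmean - grand) ** 2     # GROUP SS
    for r in rows:
        smean = sum(r[1:]) / J                           # subject mean
        ss_subj += J * (smean - gmean) ** 2              # SUBJECT(GROUP) SS

df_group, df_subj = len(groups) - 1, N - len(groups)     # 3 and 17
F = (ss_group / df_group) / (ss_subj / df_subj)
print(round(ss_group, 1), round(ss_subj, 1), round(F, 1))
# compare with the SAS output: 743.9, 59.4, 70.9
```

The sums of squares match the SAS between-subjects table, and the ratio of mean squares gives the F value of about 70.9 quoted above.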

Richard Lockhart
2002-10-10