# CPC - Common Principal Component Analysis Program

© 1994-7 Patrick Phillips, University of Texas at Arlington

## Background

This program implements most of the analyses presented in Bernhard Flury (1988), Common Principal Components and Related Multivariate Methods (Wiley, New York).

The purpose of this analysis is to compare the structure of two or more covariance matrices in a hierarchical fashion. The hierarchy is built on the realization that covariance matrices can share more complex relationships between one another than just being equal or not. For example, one matrix might be identical to another except that each element of the matrix is multiplied by a single constant. We would then say that the matrices are proportional to one another. A more precise definition of proportionality is that the matrices share identical eigenvectors (or principal components), but their eigenvalues differ by a proportional constant. This suggests that another relationship between matrices could be that they share principal components in common, but their eigenvalues different (the Common Principal Component or CPC model). Similarly, the matrices could share one, two, or up to p-2 (where we are working with p x p sized matrices) principal components in common out of the p total possible components. This is the Partial Principal Components (PCPC) model. The PCPC model stops at p-2 components because, since the principal components are defined to be orthogonal to one another, if you know p-1 of the components, you necessarily know the final one, leaving you with the full CPC model.

The hierarchical nature of this set of comparisons can be appreciated by realizing that if two matrices share two principal components in common, then they necessarily share one component. Similarly, if two matrices are proportional, then they must satisfy the CPC model, as well as all of the PCPC models. The hierarchy ends with matrix equality, which can only be true if all other elements of the hierarchy are also true.

The various models of the hierarchy can be tested in several ways. First, and perhaps most logically, they hierarchy can be built in a step-wise fashion starting with no relation between the matrices ("Unrelated structure"). From this we can go to CPC(1), then to CPC(2), etc, through CPC, Proportionality, and Equality. The likelihood that a particular model is valid is then tested against the next lowest model in the hierarchy. This leads to the decomposition of the log-likelihood ratio discussed in Flury's Chapter 7. I call this the "step-up approach" in Phillips and Arnold (in prep).

Another alternative is that any model hypothesis can be tested against any other model that is lower in the hierarchy than the first hypothesis. For example, Equality need not only be tested against Proportionality, it could also be tested against CPC (skipping Proportionality) or directly against Unrelated structure for a standard equal or not type test. The test of each level of the hierarchy against Unrelated structure is probably the most logical from a hypothesis testing point of view (which is how most biologist will probably approach these problems). Since each level of the hierarchy is inclusive of the lower levels, one can start at the bottom of the hierarchy and test each successive level against the hypothesis of Unrelated structure until a significant deviation is encountered. The move up the hierarchy should be stopped at this point. This method is called the "jump-up approach" in Phillips and Arnold (in prep). There are several caveats to this approach. First the significance tests at each level are not independent, as the tests are not orthogonal. Second, it is not currently clear how literally the p-values from these analyses should be taken, since they are based on the assumption of multivariate normality.

The final approach is the one advocated by Flury, in which the "best fitting" model is taken. In this "model building approach" significance is not the choice criterion, but rather how well a particular hierarchical model fits based on how much information is available to fit that model. The Akaike Information Criterion (AIC) can be used in this decision making, as it discussed below.

Pending planned simulation studies and power analyses, the best approach is not clear. Often the biological "reality" of the situation can be gleaned by looking at how well the matrices constructed using the constrained model (e.g., proportionality) match the actual matrices from the populations. I tend to use the jump-up approach because it provides the most straight-forward interpretation and way of understanding and interpreting the results.

## Input Format

The input for the program consists of a file containing a description of the covariance matrices and the matrices themselves. Two input formats are supported. The first input format is
```<number of matrices>
<number of traits>
<number of individuals measured for first matrix>
<elements of the first matrix>
<number of individuals measured for the second matrix>
<elements of the second matrix>
...
```
An example of this is the North American Marten bone data used by Flury. The data file for this example would look like:
```2
4
92
1.1544
0.9109 2.0381
1.0330 0.7056 1.2100
0.7993 1.4083 0.7958 2.0277
47
0.9617
0.2806 1.8475
0.9841 0.3129 1.2804
0.6775 1.2960 0.7923 1.7819
```
You can either input the entire matrix or the lower triangular part of the matrix. It is important that the last line ends with a return character.

The second input format is essentially the same, except that instead of specifying the number of matrices and traits, you give their names. In this way, the matrix and trait names will be used when generating the output. The above example in this format looks like:

```Males Females
HumLen HumWid FemLen FemWid
92
1.1544
0.9109 2.0381
1.0330 0.7056 1.2100
0.7993 1.4083 0.7958 2.0277
47
0.9617
0.2806 1.8475
0.9841 0.3129 1.2804
0.6775 1.2960 0.7923 1.7819
```
Obviously the number of labels has to correspond to the number of matrices and traits in that order, and they must be on separate lines. You may specify either the numbers or the names, but the format must be the same for both the matrix and trait designations. The output will look best if the trait-name labels are eight characters or less. You specify which format the data is in by answering a question during run-time.

This program is really only suited to analyzing phenotypic covariance matrices because only they satisfy the assumptions set forth in Flury (1988). Another program is available for comparing covariance matrices that arise in the analysis of quantitative genetic data (e.g., genetic covariance matrices).

## Running the Program

Before starting the program, it is convenient if the data file and the program are in the same directory. If not, then you can supply the full path name to the data file when asked (this can be difficult on a Macintosh). When running under DOS or the MacOS, the program will completely take over the computer (i.e., you won't be able to do anything else while the program is running). Simple analyses can take a few seconds, but the amount of time required grows at a rate which appears to be greater than the square of the number of traits. Also, the number of matrices that are compared will affect running time. As a gauge, the Marten example (bone.dat in the distribution) should take less than a minute, but an analysis of three matrices, each with twenty-two traits took a little over 24 hours on the equivalent of a Macintosh Quadra (33MHz 68040). Start small while testing the program before doing a full analysis if you are analyzing many traits.

After starting the program (double-clicking on a Macintosh or Windows or from the command line on DOS or UNIX), you need to specify the file name of the input file. You will then be asked for a file name for the results output (note that the program makes no attempt to detect whether a disk is full). You will then be asked if your data file has the matrix and trait names in it as discussed above. Next, you will be asked if you want to "Manually reorder columns for PCPC analysis". The particular Partial Common Principal Components model fit depends on which components you want included in the partial set. The program takes two approaches to the problem of specifying this set. First, if you answer no to the above question (hitting anything but 'y' or 'Y' will actually do), then the program will select the principal components in rank order according to the size of their eigenvalues (or the percent variance explained in principal component parlance). This is often a good place to start an analysis. The principal components shared between two matrices might not necessarily be those with the highest variance. To specify other possible comparisons, you must tell the program what order you want the principal components to be retained in the PCPC analysis. The first one specified will be used in the CPC(1) analysis, the first and second in the CPC(2), and so on. To help you identify which principal component is which, the program first calculates the CPC analysis, and presents the eigenvectors (principal components) from that analysis. You then tell the computer the order of the columns you would like specified by their numerical value (i.e., 4 2 3 1). The PCPC analysis will then proceed using this ordering. Finally, if you are analyzing more than a 2x2 matrix, you will be asked if you want to limit the PCPC analysis. Analyzing the partial components is the most time consuming part of the program, and there are p-2 of them to analyze. If p is large, this can make the analysis prohibitive. You can limit the analysis to only the first few elements of the PCPC part of the hierarchy. If the matrices are not similar at the first few PCPCs, then there is little point in test the higher parts of the hierarchy anyways.

As the program is running, it tries to keep you informed about its progress by printing which model is currently being analyzed.

## Output

Program output consists of systematically working through the hierarchy, testing all possible alternative models, starting at Equality and ending with CPC(1) (if there are more than two traits, otherwise the analysis ends with CPC). A standard principal components analysis is also printed at the end. For each model, the likelihood criterion is printed as well as the number of parameters associated with that model. Using this information, the model can be tested against any lower model using the relationship:

ChiSqr = (likelihood Upper Model) - (likelihood Lower Model) with
df = (# parameters Lower Model) - (#parameters upper Model)

In practice it should never be necessary for you to construct your own tests, since the program constructs all possible tests. Note that the significance testing of the chi square is approximate, and is based on the assumption of multivariate normality of the underlying populations (Flury, 1988). Resampling based tests of significance are under development.

After the likelihood criterion is printed, all possible tests against lower models are printed, giving the chi square, degrees of freedom, and p-values for each test. Avoid the tendency to go fishing for significant results in these tests, but instead use them as ready-made calculations after you have decided which particular tests are appropriate. Each test can be interpreted as the likelihood that the present model is true given that the lower model is also true. A low p-value tells you that the model is rejected. The converse is of course not necessarily true - beware Type II errors.

After these tests are printed, the maximum likelihood estimate of the matrix or matrices under the assumption that the model is correct are printed. First the eigenvalues and eigenvectors for each matrix is printed (a standard principal components analysis), and then each matrix, constrained by the model.

The final item printed is the Decomposition of the Chi Square table. In this table, the test results are organized in a hierarchical fashion, with each model being tested against the next lowest model in the hierarchy. There are several different statistical test presented in this table. First, the standard chi square, with degrees of freedom and p-values are given for a more traditional hypothesis testing approach. Flury (1988), however, argues that a model building approach might be more appropriate. Instead of just testing for the fit or lack of fit of various models, Flury argues that the overall best fitting model should be chosen. The best fitting model can be evaluated in two ways. First, the chi square value divided by the degrees of freedom conveys the amount of statistical lack of fit for the amount of information given for a particular model. The best model would therefore be the one that minimizes this ratio. A second related approach is to use the Akaike Information Criterion (AIC), which balances the size of the log-likelihood function with the number of parameters estimated. Again in this case, the model with the lowest AIC is the preferred one. Remember that different principal component extraction orders can yield substantially different results for the PCPC models.

## Author Contact

Please feel free to contact me about the usage and interpretation of this program.

Patrick Phillips
Biology Department
Box 19498
University of Texas at Arlington
Arlington, TX 76019-0498
(817) 272-2409
(817) 272-2855 (FAX)
pphillips@uta.edu

The latest version of the software is available at <http://wbar.uta.edu/software/cpc.htm> or <http://www.uta.edu/biology/phillips/software.html>