H2boot - Program for Bootstrap Estimates of Quantitative Genetic Data CPCrand - Randomization Test of the Common Principal Component Model (c) 1994-2001 Patrick C. Phillips, University Oregon ---------------------------------------------------------------------------------------- These programs apply statistical resampling approaches to quantitative genetic data. H2boot is used primarily for parameter estimation and significance testing of individual quantitative genetic parameters using a bootstrapping approach. CPCrand is used for comparing variance-covariance matrices derived from different populations using the hierarchical model devised by Bernhard Flury (1988). Although both programs are intended primarily for quantitative genetic data, straight phenotypic data can also be analyzed using these techniques. WARNING: These programs are provided for research purposes, but the application of resampling methods to quantitative genetic data have yet to be fully verified. The user must decide for his or her self whether or not the results produced by these programs provide useful information. BREEDING DESIGNS & INPUT FORMAT Nine different breeding designs (all fairly simple) are accommodated by the program. The input format for the data is straightforward. The program can analyze any number of traits (--shortly, right now there is a hardwired, but changable, limit of ten). The very first line in the file can be the names of the traits to be analyzed. If this is used as the first line, then the program will figure out how many traits there are. If it is not used as the first line, then the user must specify the number of traits using the menu given below. The basic format consists of the trait values for each individual listed on a line by itself. Different breeding designs require extra information on each line. If more than one population is to be analyzed, then the first item on each line must the name of the population. 1) Phenotypic Analysis: Simply put the trait values for each individual on a line by themselves. Variance and covariance components are estimated using the simple definitions of these statistics. 2) Mid-parent-offspring regression: Put the parental trait values followed by the offspring trait values for each parent-offspring pair on their own line. Variance components are estimated as Va = twice the regression covariance. The covariance between traits are estimated the average of the parent trait 1 on offspring trait 2 and parent trait 2 on offspring trait 1. The phenotypic variance components are estimated as Vp = twice the variance among mid-parent values. Environmental variance components are estimated as Ve = Vp - Va. 3) Weighted mid-parent-offspring regression: Each line consists of the parental values followed by the offspring values followed by a weighting value for each trait. This could be useful if the regression should be weighted by, say, the number of offspring in each family. Variance components are estimated as in (2) above. Note that because any extra information at the end of a line is ignored, the same dataset can be used to run a weighted or unweighted analysis. The weights on the end of the line are simply not read if the unweighted analysis is selected. 4) Parent-offspring regression: This regression assumes that the parental values are from a single parent. Each line is formatted as in (2) above. Variance components are estimated as Va = twice the regression covariance, Vp = the variance among parental values, and Ve = Vp - Va. 5) Weighted parent-offspring regression: As in (3) above with variance components as in (4). 6) One-way ANOVA among full-sib families: Each data line must represent a single individual from each family. The first element on each line (second if more than one population is being analyzed) should be the family name/identifier. This must be the same each member of the family, and all members of the same family must be listed consecutively in the file or they will be treated as more than one file (i.e., sort the data first by population [if appropriate], and then by family). After the family id, which can be text or numbers, the trait values for that individual are listed. Variance components are calculated as Vg = 2 x the among family variance estimated from an ANOVA, Ve = the with family variance - the among family variance, and Vp = Vg + Ve. 7) Parent-offspring regression w/full-sib ANOVA. This analysis is designed to provide the best estimate of the Vp, while still estimating Va from the parent-offspring regression. Va = twice the parent-offspring covariance as in (3), but Vp = within + among family variance components from a full sib ANOVA. Note that the full sib data is not used for the genetic estimates, just for the phenotypic variance. Ve = Vp - Vg. The data format is a little more complicated for this case, but is very similar to (6). The first line of every family is assumed to be the parental trait values. All other lines in the family are assumed to be offspring values. Each family must have a unique identifier and be contiguous in the file, as in (6). 8) One-way ANOVA among RI lines: Designed for the analysis of recombinant inbred lines generated via a cross between two inbred parental strains. Data format is as in (6), except that line IDs are used instead of family IDs. Variance components are estimated as Vg = 1/2 the among family variance, Ve = the within family variance, and Vp = Vg + Ve. 9) One-way ANOVA among clones: Data format is as in (6) and (8). Variance components are estimated as Vg = the among family variance, Ve = the within family variance, and Vp = Vg + Ve. Examples: Say you are analyzing two traits using an ANOVA among clones. The data could look like: trait1 trait2 clone1 45 34 clone1 53 46 clone2 29 31 clone2 25 28 For a parent-offspring comparison for brain and body size covariance between two populations, the data would be something like: brain body pop1 12 89 14 95 pop1 14 92 17 101 pop2 21 75 19 71 pop2 23 77 22 79 MATRIX COMPARISONS The general method of matrix comparisons used is described by Flury (1988), and to some degree in the documentation for the sister program in this package, cpc (readme.cpc). I will not cover the general ground again here. The important point for these programs is significance testing of the steps in the Flury hierarchy. The problem with testing quantitative genetic data is that the significance tests constructed by Flury rely on the likelihood statistic of the comparisons to be chi-square distributed, which in turn implies both multivariate normality and that the degrees of freedom under the null hypothesis is known. The former can be a problem for any data, but especially for genetic data where we usually have little information about the distribution of breeding values. The latter problem is uniquely severe in variance component estimation procedures because it is really not clear what the appropriate degrees of freedom for the comparison should be. For additive genetic variances and covariances, it seems as though it should be somewhere close to the number of families or sires in the breeding design. However, the number of individuals per family also contributes to the error structure of the covariance estimates. In practice, for a real life dataset (Phillips and Arnold, in prep), the distributions do appear to be chi-square distributed, but with the wrong number of degrees of freedom. The programs, h2boot and cpcrand, provide two possible approaches to the problem of significance testing in the Flury hierarchy. In general, for hypothesis testing, I prefer the method used by cpcrand, but both methods are available for use. For estimation of the pooled matrices, eigenvectors, etc, then h2boot must be used. Cpcrand uses a randomization approach to testing the hierarchy. Basically, during each iteration of the program, the families are randomized among the populations and the matrices for the "new" populations are compared. Under the null hypothesis of no differences between the populations, then randomly mixing the populations should have no effect on the comparisons. Therefore, the comparison statistic for the actual population can be compared to the distribution of randomized populations to estimate the probability of obtaining a statistic that large just by chance. The actual output of the program simply give this "p-value" for each level of the Flury hierarchy. You can also request that randomization distributions be output. The population means are subtracted off each trait value before the randomization procedure is started so that the among population variance does not confound the comparisons of within population covariance structure. It is conceivable that very odd shaped within population distributions could cause problems with this approach, but as of now that is untested. H2boot uses a bootstrapping approach to test the hierarchy (this approach was suggested to me by Bernhard Flury). The first step is to bootstrap within each population, resampling families within a population twice to create two different matrix estimates for the population. These two matrices are then compared to produce a test statistic. Doing this many times yields a distribution under which the null hypothesis is known to be true. I call this distribution the self-similarity distribution. There will be one of these for each population. This distribution is then compared to the distribution estimated by repeated bootstrapping the samples within each population and then comparing them to each other. If the self-similarity and the comparison distributions are the same, then we would say that we have no evidence to reject the null hypothesis of unrelated structure (or CPC[1] or proportionality, etc). The comparison of the distributions can be done formally using something like the Kolmogorov-Smirnov test or more informally by looking at the histograms of the distributions. Because the bootstrapping approach relies on the comparison of distributions it is not quite as clean as the randomization test described above, which yields a single p-value. Neither method has been verified to be correct (I plan on working on this), so caveat emptor. In practice, the randomization test appears to be more sensitive to misbehaved matrices (see below), whereas the bootstrapping approach seems to handle things better because one can observe the entire error distribution. PROBLEMS WITH MATRIX STRUCTURE The matrix comparison methods devised by Flury will not work on singular or non-positive definite matrices (i.e., matrices with zero or negative eigenvalues). This is usually not a problem for phenotypic matrices in which the matrices are guaranteed to be positive definite. Matrices constructed using variance component estimates are known to often have negative eigenvalues, so this can be a real problem for quantitative genetic matrices. This situation usually arises when some of the correlations are very high and/or some of the traits have very little genetic variance. These problems can sometimes be solved by eliminating these traits from the analysis. The option is to change the eigenvalues of the matrix by just enough to make the matrix positive definite. This is the bending method suggested by Hayes and Hill. There are two ways to approach this problem of non-positive definite matrices. First, bootstrap or randomization runs that result in non-positive definite matrices could be eliminated from the bootstrap sample. This is what happens if you select "do not bend" as an option. It is likely that this generates some bias in the significance testing, since it is probably the most extreme matrices that get thrown out. Alternatively, the matrices could be bent anytime that they misbehave. Unfortunately, bending appears to change the error distribution of the comparison statistic (it fairly easy to show that this must be the case). For this reason, bending every matrix separately destroys the utility of the randomization test, since it relies on a comparison with the actual value to the derived distribution. One can get results using the bootstrapping approach, but the error distributions start getting a little odd (often bimodal with bent and non-bent comparisons yielding different distributions). Since the self-test distribution is generated by the same procedure, however, the comparison of the distributions does seem to work. The validity of this is currently uncertain, however. The real problem for the randomization test comes when the inital matrices are not positive definite. This could preclude analysis. However, the cpcrand program provides an option to bend the initial matrix, and then all following matrices are bent to the same degree. Theoretically, this should yield the appropriate distributions, but again, this has yet to be verified. In summary then, bending in h2boot bends every non-positive definite matrix on its own, whereas bending in cpcrand only applies to the initial matrices, and then all matrices are bent according to those rules. As should be clear from the above discussion, the way around ill-behaved matrices is not clear. If possible try to run the analysis without any bending. If you have to bend the matrices, contact the author to see if any progress has been made on this issue. MISSING DATA The missing value designation is -9999. There should not be any "holes" in the data. Each element in the data matrix should either contain the appropriate trait value or -9999. Missing data is handled in two distinct ways. First, for the bootstrapping estimates of individual variance components, all available data is used to compute each component. For example, in a parent-offspring regression of three traits, if a parent is missing a value the first trait, then this family will not be used when estimating or bootstrapping the variance components of the first trait or any covariance components involving the first trait, but will be used in estimates of the other two traits. Thus different elements of the covariance matrix will potentially be estimated with different numbers of families. The actual number of families (or lines) used for each element is printed in the output. You should verify that the printed number agrees with what you consider to be the number of valid families. You can constrain the analysis to only be performed on families with complete data by indicating that via menu option. Note that for ANOVA style data (e.g., full-sibs) a group is considered to have missing data for a trait only if no members of that group have a valid value for that trait. Thus family size can vary from trait to trait. There is no option to constrain the analysis to only group members that have complete data (bootstrapping is always at the level of family or line). The above approach appears to be the appropriate way for treating missing data for parameter estimation. For matrix comparisons with h2boot and cpcrand, however, things are more problematic. The logical approach when comparing whole matrices during a bootstrap is to use the same set of families for variance component estimation on all characters. This works great if there is no missing data in any of the sampled families. What if there is? Two approaches suggest themselves. First, the bootstrap could be sampled from the entire dataset regardless of the missing data status of the families. Then the estimates for all traits could be taken from this sample, just leaving out families with missing data for the estimates for a particular combination of traits. The problem with this is that there is a chance that the sample will contain families with no valid data for a particular combination of traits, and the procedure in general probably leads to an unmeasured amount of bias in the comparisons. This may not be a real problem, but at the moment it is untested. This will probably be the method of treating missing data used by the program, but it is currently unimplemented. The current method uses an approach which is probably inferior (but we do not know). It samples each trait separately from the list of families with valid data and constructs a covariance matrix from these estimates. Thus, each element in the matrix is probably drawn from a different sample of families. The comparisons are then made based on these estimates. This is good in that only families with no missing data for a particular trait are sampled (thereby guaranteeing valid estimates), but bad in that it the impact of family structure is not fully accounted for during the bootstrapping or resampling procedure. My guess is that this approach is probably asymptotically correct, but probably not conservative enough in general (imagine if one family accounts for the majority of the total variance for all traits). Because of current uncertainties, I can not recommend making matrix comparisons with families with missing data at this time. At some point I hope to figure out how bad things are. If you only have a little missing data, then the first procedure outlined above, which I will implement at some point, will probably be OK. It remains to be seen whether the current method used is also valid. If you have a lot of missing data, then you may need pair down to a smaller data set that only includes families with full data. This is the method that I am currently using. RUNNING THE PROGRAM Each program is started by typing its name on the command line or double clicking on it were appropriate. Control of the program is made through the use of a simple text-based menu that provides all of the available options. To change a particular option, simply type the number of that option, hit return, and then answer the questions asked. When the program is run for the first time, it uses a default set of options. After you alter these, your entries are remembered and stored in the files "h2boot.par" for h2boot and "cpcrand.par" for cpcrand. The options in these files can be changed by running the programs and using the menus, or by directly editing the files. The first line in each file contains a batch mode option. This can only be changed by editing the file. If the batch mode is set to yes, then the program will run without presenting the options menu. This can be useful if you want to run the programs in the background on a UNIX system, for instance. Options (1) and (3) ask for the input data set name and output file, respectively. The input data set must be in the format specified above. Option (2) asks if the trait names are in on the first line of the file. If they are, then these names will be used in the output (the output looks best if the names are eight characters or less). The number of traits and the number of names must match. If the trait names are not in the file, then you must specify how many traits are in the file. Note that all extraneous information at the end of a line is ignored, so if you specify fewer traits than are in the file, then if you are using an ANOVA format then the trait values at the end of the line will be ignored. This allows you to analyze a subset of traits, provided they are in the right order. This will not work for parent-offspring data, since the full parental data comes before the offspring data. Option (4) is the number of bootstrap/randomization runs. This sets the number of times that the populations are resampled for the analysis. Making this a higher number should give better resolution of the entire error distribution, but takes longer to run. 100 is usually enough to get a good idea what is going on, 1000 should be fine most purposes, and 10,000 can be used for excessive precision. Computing time scales linearly with the number of resamples, so you might want to start small to see how long things take. On most modern personal computers (Pentium or PowerPC class), the analysis of most datasets can be accomplished fairly quickly (under an hour). If there are a large number of families and traits, and you are making matrix comparisons, then the time could potentially become much longer (the matrix comparisons take substantially longer than the parameter estimates). Option (5) sets the random number seed for the resampling sets. If you ever want to replicate a particular run, simply use the same random number seed. Otherwise, you should use a different number each time. The seed should be an integer between 1 and 32,000. Option (6) asks if you want to create output files for the resampling distributions. For h2boot, this will create a number of output files for the quantitative genetic parameter estimates. Right now, this consists of the bootstrap distributions for heritability (h2dist.out), the genetic variance-covariance matrix (gdist.out), and genetic correlations (grdist.out). Each line of the heritability file consists of the boostrap estimate of heritability for each trait for that particular run. The genetic variance-covariance and correlation files have the G matrices for each bootstrap run on each line. In order to put the matrix on a single line, the matrix is squished a little, so that each line has the genetic variances listed first, followed by the covariances {g11, g22, g33, g12, g13, g23}. If more than one population is being analyzed, then a separate file is generated for each population. This is designated by a number (0,1,...) that follows the order of the populations in the input file. If matrix comparisons are selected in h2boot, then the likelihood distribution files are automatically generated because a comparison of these distributions is the only way to conduct the comparisons using a bootstrapping approach (see above). These likelihood distribution files will only be created in cpcrand if option (6) is selected. These files are named likeliG.out, likeliP.out, and likeliE.out for comparisons of the genetic, phenotypic, and environmental covariance matrices respectively. Each line in the file contains the likelihood statistic for a given level in the hierarchy for each bootstrap or randomization replicate. The order of these likelihoods is Unrelated, Equality, Proportionality, CPC, CPC(q-2), CPC(q-3), ..., CPC(1). Any higher level in the hierarchy can be tested using a lower level as the null hypothesis by subtracting the likelihood statistic of the lower model from the higher one. For example, to test for equality of the matrices, the first column of the data would be subtracted from the second column, and then a histogram of this difference plotted as a histogram. See the file readme.cpc for more discussion about this. Using the randomization procedure of cpcrand, the initial estimate for the matrices (which is actually the first line of the output file) can be compared to entire distribution to see the probability of obtaining a value this large by chance. The program actually calculates this value for you, but you may be interested in looking at the entire distribution anyway. Under the bootstrapping approach in h2boot, you would also have to create the self-similarity distributions for each population. The data for these distributions are in the files, selfliG, selfliP, and selfliE. There will also be a number (0,1,...) after the capital letter to indicate which population that distribution came from. WARNING: the file names for these distributions are fixed, and the program does not check to see if a similar file already exists. Therefore, the distribution output from any existing runs will be overwritten if it is in the same directory. You may therefore want to conduct each separate analysis in its own directory or folder. Option (7) lets you specify the breeding design used for the analysis, as discussed at the beginning of this file. Option (8) allows you to restrict the analysis to families/lines that have no missing data, as discussed in the section on Missing Data. Option (9) in h2boot lets you specify if more than one population is present in the data (this is assumed to be the case for cpcrand). If that is true, the program will output separate estimates for each population. The matrices from each population can also be compared to one another, as indicated in option (10). If the comparisons are made, then there is a choice to allow "bending" of non-positive definite matrices (see Problems with Matrix Structure above; option [11] in h2boot and option [9] in cpcrand) and whether or not to automatically order the eigenvalues according to total variance when conducting the partial common principal components analysis (see readme.cpc for more discussion of this issue). If the last option is set to no, then the program will print a set of eigenvectors from the initial CPC analysis and ask you to determine the test order. OUTPUT The output for h2boot consists of the quantitative genetic estimates for each trait. If more than one population is being analyzed, then the results from each population are presented separately. The format for each estimate is essentially the same. First the estimate, which is the mean of the bootstrap distribution, is presented, followed by the standard error determined by the bootstrap distribution. Then some non-parametric p-values are presented. These represent the fraction of times that a particular bootstrap run overlapped some value. Zero is tested for every parameter, and this can be taken as a test of overall significance. In addition, heritabilies and genetic correlations are also tested against having values of 1, and the genetic correlations are also tested against -1. The tests versus zero are two-tailed tests (except in the case of heritability), while the tests versus 1 and -1 are one-tailed tests. The final item printed for each statistic is the number of families used for that particular estimate. Missing data might make this number vary from trait to trait. For genetic correlations, the number of successful bootstraps rather than the number of families is printed. Because genetic correlations are calculated in part from the square root of the variances, when these variances are negative (which can happen fairly regularly with variance component estimates) the genetic correlations are no defined. The approach taken in this program is to simply eliminate these runs from the bootstrap estimates. This may create a small amount of bias, but this remains to be investigated. For the matrix comparisons, the output for h2boot closely follows that of the program cpc, so user should consult the file readme.cpc for that. The output from cpcrand is fairly simple, just presenting the significance testing of the Flury hierarchy according to the "jump-up" procedure (Phillips and Arnold, in prep.; readme.cpc). Other tests are possible by direct examination of the randomization distributions as described above. AUTHOR CONTACT Please feel free to contact me about the usage and interpretation of this program. Patrick Phillips Biology Department University of Oregon Eugene, OR 97403-1210 (541) 346-0916 (541) 346-2364 (FAX) pphil@darkwing.uoregon.edu The latest versions of these programs are available at .