To test our inference procedure, we implemented a simple whole-genome pedigree simulator, which we briefly describe here. We simulate in reverse time, keeping track of only those parts of the pedigree along which genomic material has actually passed; effectively constructing the ancestral recombination graph (20) in its entirety. This is computationally taxing, but it is still feasible to generate all relationships between 100 sampled humans (with the actual chromosome numbers and lengths) going back 300 generations in an hour or so on a modern machine with only 8GB of RAM; or, on the same machine, 1000 sampled humans going back 150 generations, overnight. (Since memory use and time to process a generation scale linearly with the number of generations and the number of samples, a machine with more RAM could produce longer or larger simulations.) The demographic scenarios are restricted to (arbitrary) discrete population models with changing population sizes and migration rates. The code (Python and R) is freely available at http://github.org/petrelharp.
The process we want to simulate is as follows: we have sampled diploid individuals in the present day and, for some number of generations $t$ in the past, wish to know which of the sampled haplotypes inherited which portions of genome from the same ancestral haplotypes (i.e. are IBD by time $t$). We work with diploid individuals, always resolved into maternal and paternal haplotypes, and only work with the autosomes. We treat all chromosomes in a common coordinate system by laying them down end-to-end, with the chromosomal endpoints playing a special role. The algorithm iterates through previous generations, and works as follows to produce the state at $t+1$ generations ago from the state at $t$ generations ago. Each sampled haplotype can be divided into segments inheriting from distinct ancestral haplotypes from $t$ generations ago. For each of the sampled haplotypes, indexed by $k$, we record the sequence of genomic locations separating these segments as $0 = x_k(0) < x_k(1) < \cdots < x_k(m_k)$, and the identities of the corresponding ancestors from $t$ generations ago as $a_k(1), \ldots, a_k(m_k)$, where $m_k$ is the total number of segments the $k$th sampled haplotype is divided into $t$ generations ago. The first genomic location is always $x_k(0) = 0$, and for notational convenience, let $x_k(m_k) = G$ (the total genome length). The meaning of this is that if two samples $j$ and $k$ match on overlapping segments, i.e. $a_j(i) = a_k(i')$ for some $i$ and $i'$ with $\max(x_j(i-1), x_k(i'-1)) < \min(x_j(i), x_k(i'))$, then both have inherited that overlapping genomic segment from the same ancestral haplotype alive at time $t$, and are thus IBD on that segment from sometime in the past $t$ generations.
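For concreteness, the bookkeeping above can be sketched as follows (a minimal sketch; the function name and list-based representation are ours, not taken from the released code):

```python
def ibd_segments(x_j, a_j, x_k, a_k):
    """Pairwise IBD between two sampled haplotypes j and k.

    Each haplotype stores breakpoints x[0] = 0 < x[1] < ... < x[m] = G
    and ancestor labels a[0..m-1], segment i being [x[i-1], x[i]) with
    label a[i-1].  Returns (start, end, ancestor) for every overlap of
    segments carrying the same ancestral label.
    """
    out = []
    for i in range(1, len(x_j)):
        for ip in range(1, len(x_k)):
            if a_j[i - 1] == a_k[ip - 1]:
                lo = max(x_j[i - 1], x_k[ip - 1])
                hi = min(x_j[i], x_k[ip])
                if lo < hi:  # nonempty overlap => IBD on [lo, hi)
                    out.append((lo, hi, a_j[i - 1]))
    return out
```

For instance, haplotypes partitioned as $[0,3) \to A$, $[3,5) \to B$ and $[0,2) \to A$, $[2,5) \to B$ are IBD on $[0,2)$ and $[3,5)$.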
As parameters, we are given $N(p, t)$, the effective population size in subpopulation $p$ at time $t$ in the past.
To update from $t$ to $t+1$, we need to pick parents for each generation-$t$ ancestor, and choose the recombination breakpoints for the meiosis leading to each generation-$t$ haplotype – so, each unique value of $a_k(i)$ has a corresponding (diploid) parent and a set of recombination breakpoints. These steps are performed iteratively along each haplotype, checking whether parents and recombination breakpoints have already been chosen for each ancestor, and randomly generating them if not. Recombinations are generated as a Poisson process of unit rate along the genome (so lengths are in Morgans); to this set, each chromosomal endpoint is added independently with probability $1/2$ each. To choose a parent, stretches of genome between alternating recombination breakpoints are assigned to the two haplotypes of the parent. For instance, if the recombination breakpoints of a generation-$t$ haplotype labeled $a$ are at $r_1 < r_2 < \cdots$ (with $r_0 = 0$), and the maternal and paternal haplotypes of the parent of $a$ are labeled $a'$ and $a''$ respectively, any segment with label $a$ lying in some $[r_{2\ell}, r_{2\ell+1})$ would be changed to $a'$, while those lying in some $[r_{2\ell+1}, r_{2\ell+2})$ would be changed to $a''$, and new segments are added when breakpoints fall inside of an existing segment (i.e. when $x_k(i-1) < r_\ell < x_k(i)$ for some $i$ and $\ell$).
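A sketch of these two steps, under the conventions above (the function names, and the even/odd convention for which parental haplotype receives the first interval, are our illustrative choices):

```python
import random

def meiosis_breakpoints(G, chrom_ends, rng=random):
    """Breakpoints for one meiosis: a unit-rate Poisson process along
    [0, G) (positions in Morgans), plus each chromosomal endpoint added
    independently with probability 1/2 (independent assortment)."""
    bks = []
    pos = rng.expovariate(1.0)
    while pos < G:          # unit-rate Poisson process via exponential gaps
        bks.append(pos)
        pos += rng.expovariate(1.0)
    for e in chrom_ends:
        if rng.random() < 0.5:
            bks.append(e)
    return sorted(bks)

def parent_haplotype(point, bks, mat, pat):
    """Assign stretches between alternating breakpoints to the parent's
    two haplotypes: positions with an even number of breakpoints to
    their left inherit from `mat`, odd from `pat`."""
    n_left = sum(1 for b in bks if b <= point)
    return mat if n_left % 2 == 0 else pat
```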
The algorithm is run for a given number of generations, after which a final pass iterates along all sampled haplotypes in parallel, writing out any pairwise IBD blocks longer than a given threshold. An IBD block here is a contiguous piece of a single chromosome over which both sampled chromosomes are assigned the same ancestral haplotype.
We simulated from three simple demographic scenarios, plus a fourth, more complex one, with parameters chosen to roughly match the mean number of IBD blocks per pair longer than 2cM that we see in the data. The scenarios are as follows:
A. Constant effective population size – average 0.79 IBD blocks longer than 2cM per pair.
B. Exponential growth: a constant effective population size prior to 100 generations ago, with the population then growing exponentially toward its present-day size – average 0.51 IBD blocks longer than 2cM per pair.
C. Exponential growth as in B, except expanding only from 50 generations ago, and beginning from a different (constant) ancestral effective population size – average 1.11 IBD blocks longer than 2cM per pair.
D. A more complex scenario: constant size older than 60 generations ago; growing logistically between 60 and 30 generations ago; decreasing logistically between 30 and 15 generations ago; constant between 15 and 5 generations ago; and growing again until the present – average 1.09 IBD blocks longer than 2cM per pair. (Imagine a population that grows large and has a small group split off gradually, which then grows in the present day; this is not motivated by any specific history, but was chosen to test the methods when the true history is more “bumpy”.)
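Piecewise scenarios like D can be encoded as a size function $N(t)$ of time in generations ago; the sketch below uses purely hypothetical placeholder sizes (the values used in our simulations are not reproduced here), and the particular logistic form is our illustrative choice:

```python
import math

def logistic_interp(t, t_recent, t_old, N_old, N_recent):
    """Logistic interpolation in t (generations ago): roughly N_old at
    t_old, moving to roughly N_recent at t_recent (t_recent < t_old)."""
    s = (t_old - t) / (t_old - t_recent)   # 0 at the old end, 1 at the recent end
    w = 1.0 / (1.0 + math.exp(-10.0 * (s - 0.5)))
    return N_old + w * (N_recent - N_old)

def scenario_D(t, N_old=5e5, N_big=5e6, N_small=1e4, N_now=1e6):
    """Scenario D with hypothetical placeholder sizes: constant older
    than 60 generations ago; logistic growth 60-30; logistic decline
    30-15; constant 15-5; growth again to the present."""
    if t >= 60:
        return N_old
    if t >= 30:
        return logistic_interp(t, 30, 60, N_old, N_big)
    if t >= 15:
        return logistic_interp(t, 15, 30, N_big, N_small)
    if t >= 5:
        return N_small
    return logistic_interp(t, 0, 5, N_small, N_now)
```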
For computational convenience, we simulated only up to 300 generations ago, and retained only blocks longer than 0.5cM (though we often restricted analysis to those longer than 2cM); as in the paper, we merged any blocks separated by a gap that was shorter than at least one adjacent block and shorter than 5cM. Coalescent rates and block length distributions are shown in figure S2. Even though we have not modeled gap removal, the results still closely match theory, since very few blocks fell so close to each other.
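The gap-merging rule can be sketched as follows (our naming; positions in cM along a single chromosome, with blocks sorted and non-overlapping):

```python
def merge_gaps(blocks, max_gap_cM=5.0):
    """Merge adjacent (start, end) IBD blocks whenever the gap between
    them is shorter than 5cM and shorter than at least one of the two
    adjacent blocks."""
    merged = [list(blocks[0])]
    for s, e in blocks[1:]:
        ps, pe = merged[-1]
        gap = s - pe
        if gap < max_gap_cM and (gap < pe - ps or gap < e - s):
            merged[-1][1] = e   # extend the previous block over the gap
        else:
            merged.append([s, e])
    return [tuple(b) for b in merged]
```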
For each scenario, we applied the inversion procedure described in the text to the full, error-free set of blocks, as well as to various subsets and modifications of it. The inversion procedure for each was as follows. We chose a discretization of block lengths as described in the text, by starting with the percentiles of the observed length distribution and refining further so that the largest bin length was 1cM. We then computed the discretized kernel $L$ as described in the text, except with no error distribution or false positive rate, so that if the $i$th length bin is $[\ell_i, \ell_{i+1})$, then $L_{ij}$ is the expected number of blocks with lengths in this bin per unit of coalescence $j$ generations ago, summed over chromosomes of length $G$. For most simulations, we did not discretize time any further, but allowed $t$ to range from 1 up to 300 generations. We then used constrained optimization as implemented in R's optim function (L-BFGS-B method, 61) to maximize the penalized likelihoods described in the text, beginning at the solution to the natural approximating least-squares problem (using the R package quadprog, 72). For each case, we show the “maximum likelihood solution” (estimated by adding a small amount of smoothness penalty to ensure numerical uniqueness, allowing the algorithm to converge), and the “smoothest consistent solution” – the largest smoothing penalty such that the solution's log-likelihood has decreased by no more than 2 units.
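As an illustration of the discretization step, here is a sketch of building $L$ using the approximate per-chromosome density $(2n)^2 (G - z) e^{-2nz}$ for blocks of length $z$ (in Morgans) per unit coalescence $n$ generations ago; this form (which ignores blocks abutting chromosome ends) and the function names are our assumptions, not the released code:

```python
import math

def kernel_bin(n, lo, hi, G):
    """Closed-form integral of (2n)^2 (G - z) e^{-2nz} over [lo, hi],
    using the antiderivative F(z) = e^{-2nz} (2n(z - G) + 1)."""
    a = 2.0 * n
    F = lambda z: math.exp(-a * z) * (a * (z - G) + 1.0)
    return F(hi) - F(lo)

def build_L(bins, max_gen, chrom_lengths):
    """L[i][j]: expected number of blocks with length in
    [bins[i], bins[i+1]) per unit coalescence j+1 generations ago,
    summed over chromosomes (all lengths in Morgans)."""
    return [[sum(kernel_bin(j, bins[i], min(bins[i + 1], G), G)
                 for G in chrom_lengths if G > bins[i])
             for j in range(1, max_gen + 1)]
            for i in range(len(bins) - 1)]
```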
In each case, we also show the exact coalescent distribution, as well as the block length spectrum predicted by theory from the true coalescent distribution and each coalescent distribution found by penalized maximum likelihood.
Note that we could have incorporated false positives, missed blocks, or length misestimation into the simulations, and correspondingly modified the kernel to incorporate these rates, but this would only add additional layers of simulation code, and would not make the task of inference more difficult, since we account for these effects analytically. The sensitivity of the methods to misestimation of these rates is a concern, but this amounts to misestimation of the kernel $L$, which we investigate below.
In figure S3 we show the results of the inference procedure applied to the full set of blocks longer than 0.5cM; figure S4 is the same, except using only blocks longer than 2cM, and figure S5 shows the results for scenario D separately. Comparing these, we see that including the short blocks (0.5–2cM) does not significantly improve the resolution in recent times, but does allow better estimation of coalescent rates longer ago in time. Using blocks longer than 2cM gives us good resolution on the time scale we consider (the past 100 generations), and including those down to 0.5cM does not make the likelihood surface much less ridged (as expected from theory).
One counter-intuitive result we obtained was that the coalescent history can have a dramatic effect on the estimation of the ages of blocks given their lengths. In figure S6 we show the probability distribution of the ages of blocks of various lengths under the four scenarios, i.e. how many generations ago the ancestors lived from whom the samples inherited blocks of those lengths. These are counter-intuitive because a block inherited from $n$ generations ago has mean length $50/n$ cM, but the age distributions of blocks in practice show that the converse is not true – blocks $50/n$ cM long are usually much older than $n$ generations. This is computed simply as follows: the mean number of IBD blocks of length $z$ per unit of coalescence from $n$ generations ago (from paths of $2n$ meioses) is approximately $K(n, z) = G (2n)^2 e^{-2nz}$ (lengths in Morgans, ignoring chromosome-end effects); so if $c(n)$ is the coalescence rate $n$ generations ago, the probability that a block of length $z$ came from $n$ generations ago is $c(n) K(n, z) / \sum_m c(m) K(m, z)$.
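This computation can be sketched as follows, using the approximation $K(n, z) \approx G (2n)^2 e^{-2nz}$ (the constant $G$ cancels in the normalization; the function name is ours):

```python
import math

def block_age_distribution(z, coal):
    """P(a block of length z Morgans came from n generations ago), for
    a coalescent distribution coal[n-1] = c(n), using
    K(n, z) proportional to (2n)^2 e^{-2nz}."""
    w = [c * (2 * n) ** 2 * math.exp(-2 * n * z)
         for n, c in enumerate(coal, start=1)]
    tot = sum(w)
    return [x / tot for x in w]
```

For example, with a flat coalescent distribution over 300 generations, the modal age of a 2cM (0.02 Morgan) block is about 50 generations, even though blocks inherited from 50 generations ago have mean length only 1cM.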
We also used these simulations to evaluate our sensitivity to error. Of particular concern is error arising from misestimation of the false positive rate – we have seen that the false positive rate at short lengths can vary somewhat by population, by as much as 10% around 2cM. To evaluate the effect on inference, we added some number of “false positives” to the numbers of blocks observed in each length category, but applied the inference methods without accounting for these (so the kernel still assumes no false positives). The numbers of false positives added to each length bin are Poisson, with mean equal to the theoretical mean predicted for that bin multiplied by a factor that decreases with length (so there is an artificial inflation of short blocks). The results for three different false positive rates are shown in figure S7. From these, we see that if the IBD rate is only inflated by at most 10% – even if the effect extends out to 6 or 8cM – the effect on the inferred coalescent distribution is minor. It is also instructive to add an unrealistically high level of unaccounted-for false positives, as it is natural to suspect that an excess of short blocks will only increase the inferred coalescent rate at relatively older time periods. This is indeed the case – doubling the number of short blocks (about 2–4cM) only affects inferred coalescent rates beyond about 100 generations, because this is the period from which the bulk of 2–4cM blocks derive.
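The injection of unaccounted-for false positives can be sketched as follows (our naming; the decreasing length factor, $(2\,\mathrm{cM}/\ell)^2$ cut off at 8cM, is one hypothetical choice, and observed counts stand in for the theoretical bin means):

```python
import math
import random

def add_false_positives(counts, bin_mids_cM, rate_at_2cM=0.10, rng=random):
    """Add Poisson 'false positives' to per-bin block counts, with mean
    equal to the bin's count times a factor that decreases with length
    and vanishes beyond 8cM."""
    def poisson(lam):
        # Knuth's method; adequate for the modest means used here
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                return k
            k += 1
    out = []
    for c, mid in zip(counts, bin_mids_cM):
        f = rate_at_2cM * (2.0 / mid) ** 2 if mid < 8.0 else 0.0
        out.append(c + poisson(f * c))
    return out
```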