The recent genealogical history of human populations is a complex mosaic formed by individual migration, large-scale population movements, and other demographic events. Population genomics datasets can provide a window into this recent history, as rare traces of recent shared genetic ancestry are detectable due to long segments of shared genomic material. We make use of genomic data for 2257 Europeans (in the POPRES dataset) to conduct one of the first surveys of recent genealogical ancestry over the past three thousand years at a continental scale.
We detected 1.9 million shared long genomic segments, and used the lengths of these to infer the distribution of shared ancestors across time and geography. We find that a pair of modern Europeans living in neighboring populations share around 2–12 genetic common ancestors from the last 1500 years, and upwards of 100 genetic ancestors from the previous 1000 years. These numbers drop off exponentially with geographic distance, but since these genetic ancestors are a tiny fraction of common genealogical ancestors, individuals from opposite ends of Europe are still expected to share millions of common genealogical ancestors over the last 1000 years. There is also substantial regional variation in the number of shared genetic ancestors. For example, there are especially high numbers of common ancestors shared between many eastern populations which date roughly to the migration period (which includes the Slavic and Hunnic expansions into that region). Some of the lowest levels of common ancestry are seen in the Italian and Iberian peninsulas, which may indicate different effects of historical population expansions in these areas and/or more stably structured populations.
Population genomic datasets have considerable power to uncover recent demographic history, and will allow a much fuller picture of the close genealogical kinship of individuals across the world.
Few of us know our family histories more than a few generations back. Therefore, it is easy to overlook the fact that we are all distant cousins, related to one another via a vast network of relationships. Here we use genome-wide data from European individuals to investigate these relationships over the past three thousand years, by looking for long stretches of shared genome between pairs of individuals inherited from common genetic ancestors. We quantify this ubiquitous recent common ancestry, showing that for instance even pairs of individuals on opposite ends of Europe share hundreds of genetic common ancestors over this time period. Despite this degree of commonality, there are also striking regional differences. For instance, southeastern Europeans share large numbers of common ancestors which date to the era of the Slavic and Hunnic expansions around 1500 years ago, while most common ancestors that Italians share with other populations lived longer ago than 2500 years. The study of long stretches of shared genetic material holds the promise of rich information about many aspects of recent population history.
Even seemingly unrelated humans are distant cousins to each other, as all members of a species are related to each other through a vastly ramified family tree (their pedigree). We can see traces of these relationships in genetic data when individuals inherit shared genetic material from a common ancestor. Traditionally, population genetics has studied the distant bulk of these genetic relationships, which in humans typically date from hundreds of thousands of years ago (e.g. 6; 69). Such studies have provided deep insights into the origins of modern humans (e.g. 39), and into recent admixture between diverged populations (e.g. 44; 27).
Although most such genetic relationships among individuals are very old, some individuals are related on far shorter time scales. Indeed, given that each individual has $2^n$ ancestors from $n$ generations ago, theoretical considerations suggest that all humans are related genealogically to each other over surprisingly short time scales (8; 63). We are usually unaware of these close genealogical ties, as few of us have knowledge of family histories more than a few generations back, and these ancestors often do not contribute any genetic material to us (12). However, in large samples we can hope to identify genetic evidence of more recent relatedness, and so obtain insight into the population history of the past tens of generations. Here we investigate such patterns of recent relatedness in a large European dataset.
The past several thousand years are replete with events that may have had significant impact on modern European relatedness, such as the Neolithic expansion of farming, the Roman empire, or the more recent expansions of the Slavs and the Vikings.
Our current understanding of these events is deduced from archaeological, linguistic, cultural, historical, and genetic evidence, with widely varying degrees of certainty. However, the demographic and genealogical impact of these events is still uncertain (e.g. 18). Genetic data describing the breadth of genealogical relationships can therefore add another dimension to our understanding of these historical events.
Work from uniparentally inherited markers (mtDNA and Y chromosomes) has improved our understanding of human demographic history (e.g. 67). However, interpretation of these markers is difficult since they only record a single lineage of each individual (the maternal and paternal lineages, respectively), rather than the entire distribution of ancestors. Genome-wide genotyping and sequencing datasets have the potential to provide a much richer picture of human history, as we can learn simultaneously about the diversity of ancestors that contributed to each individual’s genome.
A number of genome-wide studies have begun to reveal quantitative insights into recent human history (47). Within Europe, the first two principal axes of variation of the matrix of genotypes are closely related to a rotation of latitude and longitude (43; 49; 37), as would be expected if patterns of ancestry are mostly shaped by local migration (48). Other work has revealed a slight decrease in diversity running from south-to-north in Europe, with the highest haplotype and allelic diversity in the Iberian peninsula (e.g. 37; 2; 46), and the lowest haplotype diversity in England and Ireland (50). Recently, progress has also been made using genotypes of ancient individuals to understand the prehistory of Europe (53; 65; 34). However, we currently have little sense of the time scale of the historical events underlying modern geographic patterns of relatedness, nor the degrees of genealogical relatedness they imply.
In this paper, we analyze those rare long chunks of genome that are shared between pairs of individuals due to inheritance from recent common ancestors, to obtain a detailed view of the geographic structure of recent relatedness. To determine the time scale of these relationships, we develop methodology that uses the lengths of shared genomic segments to infer the distribution of the ages of these recent common ancestors. We find that geographically distant Europeans share ubiquitous common ancestry even within the past 1000 years, and show that common ancestry from the past 3000 years is a result of both local migration and large-scale historical events. We find considerable structure below the country level in sharing of recent ancestry, lending further support to the idea that looking at runs of shared ancestry can identify very subtle population structure (e.g. 38).
Our method for inferring ages of common ancestors is conceptually similar to the work of Palamara et al. (51), who use total amount of long runs of shared genome to fit simple parametric models of recent history, as well as to Li and Durbin (39) and Harris and Nielsen (26), who use information from short runs of shared genome to infer demographic history over much longer time scales. Other conceptually similar work includes Pool and Nielsen (56) and Gravel (19), who used the length distribution of admixture tracts to fit parametric models of historical admixture. We rely less on discrete, idealized populations or parametric demographic models than these other works, and describe continuous geographic structure by obtaining average numbers of common ancestors shared by many populations across time in a relatively non-parametric fashion.
We can only hope to learn from genetic data about those common ancestors from whom two individuals have both inherited the same genomic region. If a pair of individuals have both inherited some genomic region from a common ancestor, that ancestor is called a “genetic common ancestor”, and the genomic region is shared “identical by descent” (IBD) by the two. Here we define an “IBD block” to be a contiguous segment of genome inherited (on at least one chromosome) from a shared common ancestor without intervening recombination (see figure 1A). A more usual definition of IBD restricts to those segments inherited from some prespecified set of “founder” individuals (e.g. 16; 12; 9), but we allow ancestors to be arbitrarily far back in time. Under our definition, everyone is IBD everywhere, but mostly on very short, old segments (57). We measure lengths of IBD segments in units of Morgans (M) or centiMorgans (cM), where 1 Morgan is defined to be the distance over which an average of one recombination (i.e. a crossover) occurs per meiosis. Segments of IBD are broken up over time by recombination, which implies that older shared ancestry tends to result in shorter shared IBD blocks.
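The inverse relationship between the age of a shared ancestor and the length of the block inherited from it can be made concrete with a rough calculation. The sketch below assumes a standard approximation that is not spelled out in the text above: a block shared through an ancestor $n$ generations back has passed through $2n$ meioses, crossovers fall along it as a Poisson process, and chromosome ends are ignored, so a typical block has exponentially distributed length with mean $1/(2n)$ Morgans. The function names are illustrative only.

```python
import math

def mean_block_length_cm(n_generations):
    """Mean IBD block length in cM for a common ancestor n generations back.

    Assumption: the two lineages together span 2n meioses, and block length
    is exponential with rate 2n per Morgan (chromosome ends ignored).
    """
    meioses = 2 * n_generations
    return 100.0 / meioses

def prob_block_longer_than(x_cm, n_generations):
    """P(block length > x_cm) under the same exponential assumption."""
    meioses = 2 * n_generations
    return math.exp(-meioses * x_cm / 100.0)

# Older ancestors give shorter blocks and a smaller chance of a long block.
for n in (10, 50, 100):
    print(n, mean_block_length_cm(n), prob_block_longer_than(2.0, n))
```

Under these assumptions the mean block length halves each time the age of the ancestor doubles, which is the sense in which older shared ancestry yields shorter blocks.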
Sufficiently long segments of IBD can be identified as long, contiguous regions over which the two individuals are identical (or nearly identical) at a set of Single Nucleotide Polymorphisms (SNPs) which segregate in the population. Formal, model-based methods to infer IBD are only computationally feasible for very recent ancestry (e.g. 4), but recently, fast heuristic algorithms have been developed that can be applied to thousands of samples typed on genotyping chips (e.g. 5; 22).
The relationship between numbers of long, shared segments of genome, numbers of genetic common ancestors, and numbers of genealogical common ancestors can be difficult to envision. Since everyone has exactly two biological parents, every individual has exactly $2^n$ paths of length $n$ meioses leading back through their pedigree, each such path ending in a grand$^{n-1}$parent. However, due to Mendelian segregation and limited recombination, genetic material will only be passed down along a small subset of these paths (12). As $n$ grows, these paths proliferate rapidly and so the genealogical paths of two individuals soon overlap significantly. (These points are illustrated in figure 1.) By observing the number of shared genomic blocks, we learn about the degree to which their genealogies overlap, or the number of common ancestors from which both individuals have inherited genetic material.
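The gap between genealogical and genetic ancestry can be illustrated with a back-of-the-envelope comparison. The sketch below is not taken from the paper: it assumes an autosomal genetic map of roughly 35 Morgans and 22 autosomes (round figures chosen for illustration), and uses the fact that each additional meiosis adds about one crossover per Morgan, so one haploid genome is split into about $22 + 35(n-1)$ segments among ancestors $n$ generations back. Since each segment comes from a single ancestor, twice this number is a rough ceiling on how many of the $2^n$ pedigree paths can carry genetic material.

```python
def pedigree_paths(n):
    """Number of pedigree paths of length n meioses: 2**n."""
    return 2 ** n

def approx_segment_ceiling(n, map_length_morgans=35.0, n_chromosomes=22):
    """Rough ceiling on the number of genetic ancestors n generations back.

    Assumptions (illustrative, not from the paper): each haploid genome is
    broken into about n_chromosomes + (n - 1) * map_length_morgans segments,
    and an individual carries two haploid genomes.
    """
    per_haploid = n_chromosomes + (n - 1) * map_length_morgans
    return 2 * per_haploid

for n in (5, 10, 20, 30):
    paths = pedigree_paths(n)
    ceiling = min(paths, approx_segment_ceiling(n))
    # The last column is the largest possible fraction of paths carrying DNA.
    print(n, paths, round(ceiling), ceiling / paths)
```

With these assumed values the ceiling already falls below the number of pedigree paths by ten generations back (674 versus 1024), and the fraction of genealogical ancestors who can be genetic ancestors shrinks rapidly after that.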
At least one parent of each genetic common ancestor of two individuals is also a genetic common ancestor, so the number of genetic common ancestors at each point back in time is strictly increasing. A more relevant quantity is the rate of appearance of most recent common genetic ancestors. This quantity can be much more intuitive, and is closely related to the coalescent rate (30), as we demonstrate later. For this reason, when we say “genetic common ancestor” or “rate of genetic common ancestry”, we are referring to only the most recent genetic common ancestors from which the individuals in question inherited their shared segments of genome.
We applied the fastIBD method, implemented in BEAGLE v3.3 (5), to the European subset of the POPRES dataset (dbGaP accession phs000145.v1.p1, 45), which includes language and country-of-origin data for several thousand Europeans genotyped at 500000 SNPs. Our simulations showed that we have good power to detect long IBD blocks (probability of detection 50% for blocks longer than 2cM, rising to 98% for blocks longer than 4cM), and a low false positive rate (discussed further in section 4.2 of the methods). We excluded from our analyses individuals who reported grandparents originating from non-European countries or more than one distinct country (and refer to the remainder as “Europeans”). After removing obvious outlier individuals and close relatives, we were left with 2257 individuals whom we grouped using reported country of origin and language into 40 populations, listed with sample sizes and average IBD levels in table 1. For geographic analyses, we located each population at the most populous city in the appropriate region. Pairs of individuals in this dataset were found to share a total of 1.9 million segments of IBD, an average of 0.74 per pair of individuals, or 831 per individual. The mean length of these blocks was 2.5cM, the median was 2.1cM, and the 25th and 75th percentiles were 1.5cM and 2.9cM, respectively. The majority of pairs sharing some IBD shared only a single block of IBD (94%). The total length of IBD blocks an individual shares with all others ranged between 30% and 250% (average 128%) of the length of the genome (greater than 100% is possible as individuals may share IBD blocks with more than one other at the same genomic location).
Group | Population | Code | n | self | other |
---|---|---|---|---|---|
E | Albania | AL | 9 | 14.5 | 1.7 |
E | Austria | AT | 14 | 1.3 | 0.9 |
E | Bosnia | BO | 9 | 4.1 | 1.6 |
E | Bulgaria | BG | 1 | – | 1.3 |
E | Croatia | HR | 9 | 2.8 | 1.6 |
E | Czech Republic | CZ | 9 | 2.1 | 1.3 |
E | Greece | EL | 5 | 1.8 | 0.9 |
E | Hungary | HU | 19 | 1.9 | 1.2 |
E | Kosovo | KO | 15 | 9.9 | 1.7 |
E | Montenegro | ME | 1 | – | 1.8 |
E | Macedonia | MA | 4 | 2.5 | 1.4 |
E | Poland | PL | 22 | 3.8 | 1.5 |
E | Romania | RO | 14 | 2.1 | 1.2 |
E | Russia | RU | 6 | 4.3 | 1.4 |
E | Slovenia | SI | 2 | 5.0 | 1.3 |
E | Serbia | RS | 11 | 2.7 | 1.5 |
E | Slovakia | SK | 1 | – | 0.7 |
E | Ukraine | UA | 1 | – | 1.5 |
E | Yugoslavia | YU | 10 | 3.4 | 1.5 |
N | Denmark | DK | 1 | – | 0.9 |
N | Finland | FI | 1 | – | 1.2 |
N | Latvia | LV | 1 | – | 1.6 |
N | Norway | NO | 2 | 2.0 | 0.8 |
N | Sweden | SE | 10 | 3.4 | 1.0 |
W | Belgium | BE | 37 | 1.1 | 0.6 |
W | England | EN | 22 | 1.3 | 0.7 |
W | France | FR | 86 | 0.7 | 0.5 |
W | Germany | DE | 71 | 1.1 | 0.9 |
W | Ireland | IE | 60 | 2.6 | 0.6 |
W | Netherlands | NL | 17 | 1.9 | 0.7 |
W | Scotland | SC | 5 | 2.2 | 0.7 |
W | Swiss French | CHf | 839 | 1.3 | 0.6 |
W | Swiss German | CHd | 103 | 1.6 | 0.6 |
W | Switzerland | CH | 17 | 1.1 | 0.5 |
W | United Kingdom | UK | 358 | 1.2 | 0.7 |
I | Italy | IT | 213 | 0.6 | 0.5 |
I | Portugal | PT | 115 | 1.9 | 0.5 |
I | Spain | ES | 130 | 1.5 | 0.4 |
TC | Cyprus | CY | 3 | 2.7 | 0.4 |
TC | Turkey | TR | 4 | 2.2 | 0.5 |
The observed genomic density of long IBD blocks (per cM) can be affected by recent selection (1) and by cis-acting recombination modifiers. We find that the local density of IBD blocks of all lengths is relatively constant across the genome, but in certain regions the length distribution is systematically perturbed (see supplemental figure S1), including around certain centromeres and the large inversion on chromosome 8 (17), also seen by Albrechtsen et al. (1). Somewhat surprisingly, the MHC does not show an unusual pattern of IBD, despite having shown up in other genomic scans for IBD (1; 23). However, there are a few other regions where differences in IBD rate are not predicted by differences in SNP density. Notably, there are two regions, on chromosomes 15 and 16, which are nearly as extreme in their deviations in IBD as the inversion on chromosome 8, and may also correspond to large inversions segregating in the sample. These only make up a small portion of the genome, and do not significantly affect our other analyses (and so are not removed); we leave further analysis for future work.
We should expect significant within-population variability, as modern countries are relatively recent constructions of diverse assemblages of languages and heritages. To assess the uniformity of ancestry within populations, we used a permutation test to measure, for each pair of populations $x$ and $y$, the uniformity with which relationships with $x$ are distributed across individuals from $y$. Most comparisons show statistically significant heterogeneity (supplemental figure S2), which is probably due to population substructure (as well as correlations introduced by the pedigree). A notable exception is that nearly all populations showed no significant heterogeneity of numbers of common ancestors with Italian samples, suggesting that most common ancestors shared with Italy lived longer ago than the time that structure within modern-day countries formed.
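One way such a uniformity test could be set up is sketched below. This is an assumption-laden illustration rather than the procedure actually used in the analysis, which this section does not specify: it takes as input a hypothetical list giving, for each individual in one population, the number of IBD blocks shared with a second population, uses the variance of those counts as the test statistic, and builds the null distribution by reassigning every block to a uniformly chosen individual (a Monte Carlo randomization in the spirit of a permutation test).

```python
import random
import statistics

def heterogeneity_pvalue(block_counts, n_resamples=2000, seed=0):
    """Monte Carlo p-value for uneven sharing across individuals.

    block_counts[i] is the number of IBD blocks individual i shares with
    the other population (hypothetical input format). Under the null, each
    block is equally likely to belong to any individual.
    """
    rng = random.Random(seed)
    n_individuals = len(block_counts)
    total_blocks = sum(block_counts)
    observed = statistics.pvariance(block_counts)

    at_least_as_extreme = 0
    for _resample in range(n_resamples):
        shuffled = [0] * n_individuals
        for _block in range(total_blocks):
            shuffled[rng.randrange(n_individuals)] += 1
        if statistics.pvariance(shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_resamples + 1)

# Hypothetical counts: one individual holds all of the sharing.
print(heterogeneity_pvalue([30, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
# Hypothetical counts: perfectly even sharing.
print(heterogeneity_pvalue([3] * 10))
```

A small p-value indicates that sharing is concentrated in fewer individuals than uniform assignment would produce. The sketch ignores the correlations between individuals introduced by the pedigree, which the text notes can also contribute to apparent heterogeneity.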
Two of the more striking examples of substructure are illustrated in figure 2. Here, we see that variation within countries can be reflective of continuous variation in ancestry that spans a broader geographic region, crossing geographic, political, and linguistic boundaries. Figure 2A shows the distinctly bimodal distribution of numbers of IBD blocks that each Italian shares with both French-speaking Swiss and the UK, and that these numbers are strongly correlated. Furthermore, the amount that Italians share with these two populations varies continuously from values typical for Turkey and Cyprus, to values typical for France and Switzerland. Interestingly, the Greek samples (EL) place near the middle of the Italian gradient. It is natural to guess that there is a north-south gradient of recency of common ancestry along the length of Italy, and that southern Italy has been historically more closely connected to the eastern Mediterranean.
In contrast, within samples from the UK and nearby regions we see negative correlation between numbers of blocks shared with Irish and numbers of blocks shared with Germans. From our data, we do not know if this substructure is also geographically arranged within the UK (our sample of which may include individuals from Northern Ireland). However, an obvious explanation of this pattern is that individuals within the UK differ in the number of recent ancestors shared with Irish, and that individuals with less Irish ancestry have a larger portion of their recent ancestry shared with Germans. This suggests that there is variation across the UK – perhaps a geographic gradient – in terms of the amount of Celtic versus Germanic ancestry.
The first two principal components of the matrix of genotypes, after suitable manipulations, can reproduce the geographic positions of European populations (e.g. 43; 49; 37). Therefore, it is natural to compare the structure we see within populations in terms of IBD sharing to the positions on the principal components map. (A PCA map of these populations, produced by EIGENSTRAT (58), is shown in supplemental figure S4.) It is not known what the geographic resolution of the principal components map is, but if relative positions within populations are meaningful, then comparison of IBD to PCA can stand in for comparison to geography. Indeed, as seen in supplemental figures S5 and S6, the substructure of figure 2 correlates well with the position on certain principal components, further suggesting that the structure is geographically meaningful. Conversely, since the substructure we see is highly statistically significant, this demonstrates that the scatter of positions within populations on the European PCA map is at least in part signal, rather than noise.
Individuals usually share the highest number of IBD blocks with others from the same population, with some exceptions. For example, individuals in the UK share more IBD blocks on average, and hence more close genetic ancestors, with individuals from Ireland than with other individuals from the UK (1.26 versus 1.09 blocks of at least 1cM per pair; Mann-Whitney test), and Germans share similarly more with Poles than with other Germans (1.24 versus 1.05), a pattern which could be due to recent asymmetric migration from a smaller population into a larger population. In figure 3 we depict the geography of rates of IBD sharing between populations, i.e. the average number of IBD blocks shared by a randomly chosen pair of individuals. Above, maps show the IBD rate relative to certain chosen populations, and below, all pairwise sharing rates are plotted against the geographic distance separating the populations. It is evident that geographic proximity is a major determinant of IBD sharing (and hence recent relatedness), with the rate of pairwise IBD decreasing relatively smoothly as the geographic separation of the pair of populations increases. Note that even populations represented by only a single sample are included, as these showed surprisingly consistent signal despite the small sample size.
Superimposed on this geographic decay there is striking regional variation in rates of IBD. To further explore this variation, we divided the populations into the five groups listed in table 1, using geographic location and correlations in the pattern of IBD sharing with other populations (shown in supplemental figure S7). These groupings are defined as: Europe “E”, lying to the east of Germany and Austria; Europe “N”, lying to the north of Germany and Poland; Europe “W”, to the west of Germany and Austria and including these; the Iberian and Italian peninsulas “I”; and Turkey/Cyprus “TC”. Although the general pattern of regional IBD variation is strong, none of these groups have sharp boundaries – for instance, Germany, Austria, and Slovakia are intermediate between E and W. Furthermore, we suspect that the Italian and Iberian peninsulas likely do not group together because of higher shared ancestry with each other, but rather because of similarly low rates of IBD with other European populations. The overall mean IBD rates between these regions are shown in table 2, and comparisons between different groupings are colored differently in figure 3G–I, showing that rates of IBD sharing between E populations and between N populations average a factor of about three higher than other comparisons at similar distances. Such a large difference in the rates of IBD sharing between regions cannot be explained by plausible differences in false positive rates or power between populations, since this pattern holds even at the longest length scales, where block identification is nearly perfect.
To better understand IBD within these groupings, we show in figures 3G–I how average numbers of IBD blocks shared, in three different length categories, depend on the geographic distance separating the two populations. Even without taking into account regional variation, mean numbers of shared IBD blocks decay exponentially with distance, and further structure is revealed by breaking out populations by the regional groupings described above. The exponential decays shown for each pair of groupings emphasize how the decay of IBD with distance becomes more rapid for longer blocks. This is expected under models where migration is mostly local, since as one looks further back in time, the distribution of each individual’s ancestors is less concentrated around the individual’s location (recall figure 1B). Therefore, the expected number of ancestors shared by a pair of individuals decreases as the geographic distance between the pair increases; and this decrease is faster for more recent ancestry.
This wider spread of older blocks can also explain why the decay of IBD with distance varies significantly by region even if dispersal rates have been relatively constant. For instance, the gradual decay of sharing with the Iberian and Italian peninsulas could occur because these blocks are inherited from much longer ago than blocks of similar lengths shared by individuals in other populations.
Conversely, there is a high level of sharing for “E–E” relationships over a broad range of distances. This is especially true for our shortest (oldest) blocks: individuals in our E grouping share on average more short blocks with individuals in distant E populations than do pairs of individuals in the same W population. We argue below that this is because modern individuals in these locations have a larger proportion of their ancestors in a relatively small population that subsequently expanded.
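The exponential decay of sharing with distance described above can be summarized by two numbers, an intercept and a decay rate. The sketch below shows one simple way to estimate them, by ordinary least squares on the log scale; this is an illustrative choice and not necessarily how the curves in figure 3 were fitted. The function name and the synthetic data are hypothetical.

```python
import math

def fit_exponential_decay(distances_km, mean_blocks):
    """Fit mean_blocks ~ a * exp(-b * distance) by least squares on logs.

    Returns (a, b). Requires at least two distinct distances and strictly
    positive block rates, since the fit is done on log(mean_blocks).
    """
    if len(distances_km) != len(mean_blocks) or len(distances_km) < 2:
        raise ValueError("need two or more paired observations")
    if any(v <= 0 for v in mean_blocks):
        raise ValueError("block rates must be positive to take logs")
    logs = [math.log(v) for v in mean_blocks]
    n = len(distances_km)
    mean_x = sum(distances_km) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances_km)
    if sxx == 0:
        raise ValueError("distances must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances_km, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free data generated from a = 2.0, b = 0.001 per km.
distances = [0.0, 500.0, 1000.0, 2000.0]
rates = [2.0 * math.exp(-0.001 * d) for d in distances]
print(fit_exponential_decay(distances, rates))
```

Fitting the same form separately to each block-length category would give one decay rate per category, and a larger rate for longer blocks corresponds to the faster decay noted in the text.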
IBD rate | E | I | N | TC | W |
---|---|---|---|---|---|
E | 2.57 | 0.44 | 0.99 | 0.62 | 0.53 |
I | 0.44 | 0.80 | 0.43 | 0.41 | 0.45 |
N | 0.99 | 0.43 | 2.62 | 0.33 | 0.86 |
TC | 0.62 | 0.41 | 0.33 | 1.43 | 0.25 |
W | 0.53 | 0.45 | 0.86 | 0.25 | 0.93 |
Having seen the continent-wide patterns of IBD in figure 3, it is natural to wonder if similar information is contained in single-site summaries of relatedness, such as mean Identity by State (IBS) values across European populations. The mean IBS between populations $x$ and $y$ is defined as the probability that two randomly chosen alleles from $x$ and $y$ are identical (“By State”), averaging over SNPs and individuals. In the analogous plot of IBS against geographic distance (supplemental figure S9), some of the patterns seen in figure 3 are present, and some are not. For instance, there is a continuous decay with geographic distance (linear, not exponential), and comparisons to the southern “I” group and to Cyprus/Turkey are even more well-separated below the others. On the other hand, the “E-E” comparisons do not show higher IBS than the bulk of the remaining comparisons.
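The IBS definition above translates directly into a short calculation. The sketch below assumes biallelic SNPs and takes per-population allele frequencies as input (a hypothetical input format); it reads the definition as drawing one allele from each population, so the match probability at a SNP is the chance that both alleles are the reference allele plus the chance that both are the alternate allele.

```python
def mean_ibs(freqs_x, freqs_y):
    """Mean identity by state between two populations over biallelic SNPs.

    freqs_x[i] and freqs_y[i] are the frequencies of the same reference
    allele at SNP i in each population. One allele is drawn from each
    population, so this applies to comparisons between distinct samples.
    """
    if len(freqs_x) != len(freqs_y) or not freqs_x:
        raise ValueError("need equal-length, non-empty frequency lists")
    total = 0.0
    for p, q in zip(freqs_x, freqs_y):
        # Both reference, or both alternate.
        total += p * q + (1.0 - p) * (1.0 - q)
    return total / len(freqs_x)

# Hypothetical frequencies at three SNPs.
print(mean_ibs([0.9, 0.5, 0.2], [0.8, 0.5, 0.4]))
```

Because this summary treats each SNP independently, it carries no information about how long a shared stretch is, which is one reason it need not show the same patterns as IBD block sharing.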
Each block of genome shared IBD by a pair of individuals represents genetic material inherited from one of their genetic common ancestors. Since the distribution of lengths of IBD blocks differs depending on the age of the ancestors – e.g. older blocks tend to be shorter – it is possible to use the distribution of lengths of IBD blocks to infer numbers of most recent pairwise genetic common ancestors back through time averaged across pairs of individuals. For this inference, we restricted to blocks longer than 2cM, where we had good power to detect true IBD blocks. We obtain dates in units of generations in the past, and for ease of discussion convert these to years ago (ya) by taking the mean human generation time to be 30 years (15).
There are two major difficulties to overcome, however. First, detection is noisy: we do not detect all IBD segments (especially shorter ones), and some of our IBD segments are false positives. This problem can be overcome by careful estimation and modeling of error, described in section 4.2. The second problem is more serious and unavoidable: as described in section 4.7, the inference problem is extremely “ill conditioned” (in the sense of 54), meaning in this case that there are many possible histories of shared ancestry that fit the data nearly equally well. For this reason, there is a fairly large, unavoidable limit to the temporal resolution, but we still obtain a good deal of useful information.
We deal with this uncertainty by describing the set of histories (i.e. historical numbers of common genetic ancestors) that are consistent with the data, summarized in two ways. First, it is useful to look at individual consistent histories, which gives a sense of recurrent patterns and possible historical signals. Figure 4 shows for several populations both the best-fitting history (in black) and the smoothest history that still fits the data (in red). We can make general statements if they hold across all (or most) consistent histories. Second, we can summarize the entire set of consistent histories by finding confidence intervals (bounds) for the total number of common ancestors aggregated in certain time periods. These are shown in figure 5, giving estimates (colored bands) and bounds (vertical lines) for the total numbers of genetic common ancestors in each of three time periods, roughly 0–500ya, 500–1500ya, and 1500–2500ya (“ya” denotes “years ago”). Supplemental figure S12 is a version of figure 5 with more populations, and supplemental figure S13 shows the same in coalescent units; plots analogous to figure 4 for all these histories are shown in supplemental figure S16. For a precise description of the problem and our methods, see section 4.7. We validated the method through simulation (details in supplemental document), and found that it performed well to the temporal resolution discussed here. We note that in simulations where the population size changes smoothly, the maximum likelihood solution is often overly peaky, whereas the smoothed solution can smear out the signal of rapid change in population size. In light of this, we encourage the reader to view the truth as lying somewhere between these two solutions, and not to overinterpret specific peaks in the maximum likelihood solution, which may occur due to numerical properties of the inference.
That said, there are a number of sharp peaks in common ancestry shared across many population comparisons older than 2000 years ago, which may potentially indicate demographic events in a shared ancestral population. A more thorough investigation of these older shared signals would potentially need a more model-based approach, so we restrict ourselves here to talking about the broad differences between the distribution of common shared ancestors between regions.
The time periods we use for these bounds are quite large, but this is unavoidable, because of a trade off between temporal resolution and uncertainty in numbers of common ancestors. Also note that the lower bounds on numbers of common ancestors during each time interval are often close to zero. This is because one can (roughly speaking) obtain a history with equally good fit by moving ancestors from that time interval into the neighboring ones, resulting in peaks on either side of the selected time interval (see figure S14), even though these do not generally reflect realistic histories. The reader should also bear in mind that we do not depict the dependence of uncertainty between intervals.
In figure 4 we show how the age and number of shared pairwise genetic common ancestors changes as we move away from the Balkans (left column) and the UK (right column), along with two examples of how the observed block length distribution is composed of ancestry from different depths. (The average number of shared pairwise genetic common ancestors from generation $n$ is the probability that the most recent common ancestor of a pair at a single site lived in generation $n$ (i.e. the coalescent rate) multiplied by the expected number of segments that recombination has broken a pair of individuals’ genomes into that many generations back, as shown in section 4.8.) More plots of this form are shown in supplemental figure S16, and coalescent rates between pairs of populations are shown in the (equivalent) supplemental figure S15.
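The product described in the parenthetical above (coalescent rate times expected number of segments) can be sketched numerically. Everything specific below is an assumption made for illustration and is not taken from section 4.8: the coalescent rate is that of a constant-size Wright-Fisher population, the autosomal map is taken to be about 35 Morgans over 22 chromosomes, and the segment count is the simple approximation of one segment per chromosome plus one per crossover over $2n$ meioses, for a single pair of haploid genomes.

```python
def coalescent_rate_constant_size(n, effective_size):
    """P(the pair's MRCA at one site lived exactly n generations ago).

    Assumes a constant-size Wright-Fisher population of `effective_size`
    diploid individuals (an illustrative model only).
    """
    p = 1.0 / (2.0 * effective_size)
    return p * (1.0 - p) ** (n - 1)

def expected_segments(n, map_length_morgans=35.0, n_chromosomes=22):
    """Approximate number of segments after 2n meioses of recombination.

    One segment per chromosome plus one per crossover, for one pair of
    haploid genomes; the map length is an assumed round figure.
    """
    return n_chromosomes + 2 * n * map_length_morgans

def mean_genetic_common_ancestors(n, effective_size):
    """Coalescent rate times expected number of segments at generation n."""
    return coalescent_rate_constant_size(n, effective_size) * expected_segments(n)

# Hypothetical effective size, used only to show the shape of the curve.
for n in (10, 50, 100, 500):
    print(n, mean_genetic_common_ancestors(n, effective_size=10000))
```

Because the segment count grows roughly linearly in $n$ while the coalescent rate changes slowly when the population is large, the expected number of genetic common ancestors per generation initially rises with age under this toy model.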
Most detectable recent common ancestors lived between 1500 and 2500 years ago, and only a small proportion of blocks longer than 2cM are inherited from longer ago than 4000 years. Obviously, there are a vast number of genetic common ancestors older than this, but the blocks inherited from such common ancestors are sufficiently unlikely to be longer than 2cM that we do not detect many. For the most part, blocks longer than 4cM come from 500–1500 years ago, and blocks longer than 10cM from the last 500 years.
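Why very old ancestors contribute few detectable blocks can be seen from a simple forward calculation. The sketch below is a simplification with assumptions stated in the code: each most recent genetic common ancestor contributes one block, block length is exponential with mean $100/(2n)$ cM for an ancestor $n$ generations back, and detection error and chromosome ends are ignored. The history passed in is hypothetical, and generations are converted to years using the 30-year generation time adopted in the text.

```python
import math

GENERATION_YEARS = 30  # mean generation time used in the text

def blocks_longer_than(x_cm, ancestors_by_generation):
    """Expected number of IBD blocks longer than x_cm, split by ancestor age.

    ancestors_by_generation maps n (generations ago) to the mean number of
    most recent genetic common ancestors a pair shares from generation n.
    Assumption: one block per ancestor, exponential length with mean
    100 / (2n) cM; detection error is ignored.
    Returns {years_ago: expected number of blocks longer than x_cm}.
    """
    contributions = {}
    for n, count in ancestors_by_generation.items():
        years_ago = n * GENERATION_YEARS
        contributions[years_ago] = count * math.exp(-2 * n * x_cm / 100.0)
    return contributions

# Hypothetical history, chosen only to illustrate the arithmetic.
history = {15: 0.5, 50: 10.0, 150: 200.0}
for threshold_cm in (2.0, 4.0, 10.0):
    print(threshold_cm, blocks_longer_than(threshold_cm, history))
```

Raising the length threshold suppresses the contribution of old ancestors much more strongly than that of recent ones, so longer blocks are dominated by more recent ancestry even when older ancestors are far more numerous. Recovering the history from observed block counts is the inverse of this calculation and is not attempted here.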
In most cases, only pairs within the same population are likely to share genetic common ancestors within the last 500 years. Exceptions are generally neighboring populations (e.g. UK and Ireland). During the period 500–1500ya, individuals typically share tens to hundreds of genetic common ancestors with others in the same or nearby populations, although some distant populations have very low rates. Longer ago than 1500ya, pairs of individuals from any part of Europe share hundreds of genetic ancestors in common, and some share significantly more.
We now examine some of the more striking patterns we see in more detail.
There is relatively little common ancestry shared between the Italian peninsula and other locations, and what there is seems to derive mostly from longer ago than 2500ya. An exception is that Italy and the neighboring Balkan populations share small but significant numbers of common ancestors in the last 1500 years, as seen in supplemental figures S16 or S17. The rate of genetic common ancestry between pairs of Italian individuals seems to have been fairly constant for the past 2500 years, which combined with significant structure within Italy suggests a constant exchange of migrants between coherent subpopulations.
Patterns for the Iberian peninsula are similar, with both Spain and Portugal showing very few common ancestors with other populations over the last 2500 years. However, the rate of IBD sharing within the peninsula is much higher than within Italy – during the last 1500 years the Iberian peninsula shares fewer than 2 genetic common ancestors with other populations, compared to roughly 30 per pair within the peninsula; Italians share on average only about 8 with each other during this period.
The higher rates of IBD between populations in the “E” grouping shown in figure 3 seem to derive mostly from ancestors living 1500–2500ya, but also show increased numbers from 500–1500ya, as shown in figure 5 and supplemental figure S17. For comparison, the IBD rate is high enough that even geographically distant individuals in these eastern populations share about as many common ancestors as do two Irish or two French-speaking Swiss.
By far the highest rate of IBD within any population is found between Albanian speakers – around 90 ancestors from 0–500ya, and around 600 ancestors from 500–1500ya (so high that we left them out of figure 5; see supplemental figure S12). Beyond 1500ya, the rates of IBD drop to levels typical for other populations in the eastern grouping.
There are clear differences in the number and timing of genetic common ancestors shared by individuals from different parts of Europe. These differences reflect the impact of major historical and demographic events, superimposed against a background of local migration and generally high genealogical relatedness across Europe. We now turn to discuss possible causes and implications of these results.
Genetic common ancestry within the last 2500 years across Europe has been shaped by diverse demographic and historical events. There are continental trends, such as a decrease of shared ancestry with distance; regional patterns, such as higher IBD in eastern and northern populations; and diverse outlying signals. We have furthermore quantified numbers of genetic common ancestors that populations share with each other back through time, albeit with an (unavoidably) coarse temporal resolution. These numbers are intriguing not only because of the differences between populations, which reflect historical events, but also because of the high degree of implied genealogical commonality between even geographically distant populations.
We have shown that typical pairs of individuals drawn from across Europe have a good chance of sharing long stretches of identity by descent, even when they are separated by thousands of kilometers. We can furthermore conclude that pairs of individuals across Europe are reasonably likely to share common genetic ancestors within the last 1000 years, and are certain to share many within the last 2500 years. From our numerical results, the average number of genetic common ancestors from the last 1000 years shared by individuals living at least 2000km apart is about 1/32 (and at least 1/80); between 1000–2000ya they share about one; and between 2000–3000ya they share more than ten. Since the chance is small that any genetic material has been transmitted along a particular genealogical path from ancestor to descendant more than 8 generations deep (12) – about .008 at 240ya, and at 480ya – this implies, conservatively, thousands of shared genealogical ancestors in only the last 1000 years even between pairs of individuals separated by large geographic distances. At first sight this result seems counterintuitive. However, as 1000 years is about 33 generations, and the number of ancestral lineages that far back, $2^{33} \approx 10^{10}$, is far larger than the size of the European population, so long as populations have mixed sufficiently, by 1000 years ago everyone (who left descendants) would be an ancestor of every present day European. Our results are therefore one of the first genomic demonstrations of the counter-intuitive but necessary fact that all Europeans are genealogically related over very short time periods, and lend substantial support to models predicting close and ubiquitous common ancestry of all modern humans (63).
The fact that most people alive today in Europe share nearly the same set of (European, and possibly world-wide) ancestors from only 1000 years ago seems to contradict the signals of long term, albeit subtle, population genetic structure within Europe (e.g. 49; 37). These two facts can be reconciled by the fact that even though the distribution of ancestors (as cartooned in figure 1B) has spread to cover the continent, there remain differences in degree of relatedness of modern individuals to these ancestral individuals. For example, someone in Spain may be related to an ancestor in the Iberian peninsula through perhaps 1000 different routes back through the pedigree, but to an ancestor in the Baltic region by only 10 different routes, so that the probability that this Spanish individual inherited genetic material from the Iberian ancestor is roughly 100 times higher. This allows the amount of genetic material shared by pairs of extant individuals to vary even if the set of ancestors is constant.
Other work has studied fine-scale differentiation between populations within Europe based on statistics such as $F_{ST}$, IBS (e.g. 37; 50), or PCA (49), that summarize in various ways single-marker correlations, averaged across loci. Like rates of IBD, these measures of differentiation can be thought of as weighted averages of past coalescent rates (40; 66; 64; 42), but take much of their information from much more distant times (tens of thousands of generations). As expected, we have seen both strong consistency between these measures and IBD (e.g. the decay with geographic distance), as well as distinct patterns (e.g. higher sharing in eastern Europe). These results highlight the fact that long segments of IBD contain information about much more recent events than do single-site summaries, information that can be leveraged to learn about the timing of these events.
A concern about our results is that the European individuals in the POPRES dataset were all sampled in either Lausanne or London. This might bias our results, for instance, if an immigrant community originated mostly from a particular small portion of their home population, thereby sharing a particularly high number of recent common ancestors with each other. We see remarkably little evidence that this is the case: there is a high degree of consistency in numbers of IBD blocks shared across samples from each population, and between neighboring populations. For instance, we could argue that the high degree of shared common ancestry among Albanian speakers was because most of these sampled originated from a small area rather than uniformly across Albania and Kosovo. However, this would not explain the high rate of IBD between Albanian speakers and neighboring populations. Even populations from which we only have one or two samples, which we at first assumed would be unusably noisy, provide generally reliable, consistent patterns, as evidenced by e.g. supplemental figure S3.
Conversely, it might be a concern that individuals sampled in Lausanne or London are more likely to have recent ancestors more widely dispersed than is typical for their population of origin. This is a possibility we cannot discard, and if true, would mean there is more structure within Europe than what we detect. However, by the incredibly rapid spread of ancestry, this is unlikely to have an effect over more than a few generations and so does not pose a serious concern about our results about the ubiquitous levels of common ancestry. Fine-scale geographic sampling of Europe as a whole is needed to address these issues, and these efforts are underway in a number of populations (e.g. 59; 33; 73; 75).
Finally, we have necessarily taken a narrow view of European ancestry, as we have restricted our sample to individuals who are not outliers with respect to genetic ancestry, and when possible to those having all four grandparents drawn from the same country. Clearly the ancestry of Europeans is far more diverse than that represented here, but such steps seemed necessary to make best initial use of this dataset.
We have shown that the problem of inferring the average distribution of genetic common ancestors back through time has a large degree of fundamental uncertainty. The data effectively leave a large number of degrees of freedom unspecified, so one must either describe the set of possible histories, as we do, and/or use prior information to restrict these degrees of freedom.
A related but far more intractable problem is to make a good guess of how long ago a certain shared genetic common ancestor lived, as personal genome services would like to do, for instance: if you and I share a 10cM block of genome IBD, when did our most recent common ancestor likely live? Since the mean length of an IBD block inherited from 5 generations ago is 10cM, we might expect the average age of the ancestor of a 10cM block to be around 5 generations. However, a direct calculation from our results says that the typical age of a 10cM block shared by two individuals from the UK is between 32 and 52 generations (depending on the inferred distribution used). This discrepancy results from the fact that you are a priori much more likely to share a common genetic ancestor further in the past, and this acts to skew our answers away from the naive expectation – even though it is unlikely that a 10cM block is inherited from a particular shared ancestor from 40 generations ago, there are a great number of such older shared ancestors. This also means that estimated ages must depend drastically on the populations’ shared histories: for instance, the age of such a block shared by someone from the UK with someone from Italy is even older, usually from around 60 generations ago. This may not apply to ancestors from the past very few (perhaps fewer than eight) generations, from whom we expect to inherit multiple long blocks – in this case we can hope to infer a specific genealogical relationship with reasonable certainty (e.g. 31; 28), although even then care must be taken to exclude the possibility that these multiple blocks have been inherited from distinct common ancestors.
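The pull toward older ancestors can be illustrated with a small calculation. This is a minimal sketch, not the inference procedure used for the results above: it assumes that each genetic common ancestor from n generations ago contributes one block whose length is exponentially distributed with mean 50/n cM (consistent with the 10cM mean at 5 generations), and it takes a hypothetical function `ancestors_at(n)` giving the expected number of shared genetic common ancestors from generation n.

```python
import math


def block_age_posterior(length_cm, ancestors_at, max_gen=500):
    """Posterior over the generation n of the ancestor behind one block.

    Assumptions (illustrative only): block lengths from generation n are
    exponential with mean 50/n cM, and ancestors_at(n) is a hypothetical
    profile of shared genetic common ancestors per generation.
    """
    weights = []
    for n in range(1, max_gen + 1):
        rate = n / 50.0  # per cM; a path of 2n meioses breaks 2n times per 100cM
        density = rate * math.exp(-rate * length_cm)
        weights.append(ancestors_at(n) * density)
    total = sum(weights)
    return [w / total for w in weights]


def posterior_mode(posterior):
    """Generation (1-based) with the highest posterior probability."""
    return max(range(len(posterior)), key=posterior.__getitem__) + 1


# Equal numbers of ancestors in every generation: the naive expectation.
flat = block_age_posterior(10.0, lambda n: 1.0)
# Numbers of ancestors growing steeply with age (an arbitrary choice).
growing = block_age_posterior(10.0, lambda n: n ** 3)
```

Under the flat profile the most likely age of a 10cM block is 5 generations, while the steeply growing profile moves it to 20 generations, even though the block length is the same.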
Although the sharing of a long genomic segment can be an intriguing sign of some recent shared ancestry, the ubiquity of shared genealogical ancestry only tens of generations ago across Europe (and likely the world, 63) makes such sharing unsurprising, and assignment to particular genealogical relationships impossible. What is informative about these chance sharing events from distant ancestors is that they provide a fine-scale view of an individual’s distribution of ancestors (e.g. figure 3), and that in aggregate they can provide an unprecedented view into even small-scale human demographic history.
Our results also offer a way to understand the geographic location of individuals of a given degree of relatedness. The values of figure 5 (and S12) can be interpreted as the distribution of distant cousins for any reference population – for instance, the set of bars for Poland (“PL”) in the top row shows that a randomly chosen distant cousin of a Polish individual with the common ancestor living in the past 500 years most likely lives in Poland but has a reasonable chance of living in the Balkan peninsula or Germany. Here “randomly chosen” means chosen randomly proportional to the paths through the pedigree – concretely, take a random walk back through the pedigree to an ancestor in the appropriate time period, and then take a random walk back down. If one starts in Poland, then the chance of arriving in, say, Romania is proportional to the average number of (genetic) common ancestors shared by a pair from Poland and Romania, which is exactly the number estimated in figure 5.
As we have shown, patterns of IBD provide ample but noisy geographic and temporal signals, which can then be connected to historical events. Rigorously making such connections is difficult, due to the complex recent history of Europe, controversy about the demographic significance of many events, and uncertainties in inferring the ages of common ancestors. Nonetheless, our results can be plausibly connected to several historical and demographic events.
One of the striking patterns we see is the relatively high level of sharing of IBD between pairs of individuals across eastern Europe, as high or higher than that observed within other, much smaller populations. This is consistent with these individuals having a comparatively large proportion of ancestry drawn from a relatively small population that expanded over a large geographic area. The “smooth” estimates of figure 4 (and more generally figures 5, and S17), suggest that this increase in ancestry stems from around 1000–2000 years ago, since during this time pairs of eastern individuals are expected to share a substantial number of common ancestors, while this is only true of pairs of non-eastern individuals if they are from the same population. For example, even individuals from widely separated eastern populations share about the same amount of IBD as do two Irish individuals (see supplemental figure S3), suggesting that this ancestral population may have been relatively small.
This evidence is consistent with the idea that these populations derive a substantial proportion of their ancestry from various groups that expanded during the “migration period” from the fourth through ninth centuries (11). This period begins with the Huns moving into eastern Europe towards the end of the fourth century, establishing an empire including modern-day Hungary and Romania; and continues in the fifth century as various Germanic groups moved into and ruled much of the western Roman empire. This was followed by the expansion of the Slavic populations into regions of low population density beginning in the sixth century, reaching their maximum by the 10th century (3). The eastern populations with high rates of IBD are highly coincident with the modern distribution of Slavic languages, so it is natural to speculate that much of the higher rates were due to this expansion. The inclusion of (non-Slavic speaking) Hungary and Romania in the group of eastern populations sharing high IBD could indicate the effect of other groups (e.g. the Huns) on ancestry in these regions, or could arise because some of the same group of people who elsewhere are known as Slavs adopted different local cultures in those regions. Greece and Albania are also part of this putative signal of expansion, which could be because the Slavs settled in part of these areas (with unknown demographic effect), or because of subsequent population exchange. However, additional work and methods would be needed to verify this hypothesis.
The highest levels of IBD sharing are found in the Albanian-speaking individuals (from Albania and Kosovo), an increase in common ancestry deriving from the last 1500 years. This suggests that a reasonable proportion of the ancestors of modern-day Albanian speakers (at least those represented in POPRES) are drawn from a relatively small, cohesive population that has persisted for at least the last 1500 years. These individuals share similar but slightly higher numbers of common ancestors with nearby populations than do individuals in other parts of Europe (see figure S3), implying that these Albanian speakers have not been a particularly isolated population so much as a small one. Furthermore, our Greek and Macedonian samples share much higher numbers of common ancestors with Albanian speakers than with other neighbors, possibly a result of historical migrations, or else perhaps smaller effects of the Slavic expansion in these populations. It is also interesting to note that the sampled Italians share nearly as much IBD with Albanian speakers as with each other. The Albanian language is an Indo-European language without other close relatives (25) that persisted through periods when neighboring languages were strongly influenced by Latin or Greek, suggesting an intriguing link between linguistic and genealogical history in this case.
On the other hand, we find that France and the Italian and Iberian peninsulas have the lowest rates of genetic common ancestry in the last 1500 years (other than Turkey and Cyprus), and are the regions of continental Europe thought to have been least affected by the Slavic and Hunnic migrations. These regions were, however, moved into by Germanic tribes (e.g. the Goths, Ostrogoths, and Vandals), which suggests that perhaps the Germanic migrations/invasions of these regions entailed a smaller degree of population replacement than the Slavic and/or Hunnic ones, or perhaps that the Germanic groups were less genealogically cohesive. This is consistent with the argument that the Slavs moved into relatively depopulated areas, while Gothic “migrations” may have been takeovers by small groups of extant populations (24; 36).
In addition to the very few genetic common ancestors that Italians share both with each other and with other Europeans, we have seen significant modern substructure within Italy (i.e. figure 2) that predates most of this common ancestry, and estimate that most of the common ancestry shared between Italy and other populations is older than about 2300 years (supplemental figure S16). Also recall that most populations show no substructure with regards to the number of blocks shared with Italians, implying that the common ancestors other populations share with Italy predate divisions within these other populations. This suggests significant old substructure and large population sizes within Italy, strong enough that different groups within Italy share as little recent common ancestry as other distinct, modern-day countries, substructure that was not homogenized during the migration period. These patterns could also reflect in part geographic isolation within Italy as well as a long history of settlement of Italy from diverse sources.
In contrast to Italy, the rate of sharing of IBD within the Iberian peninsula is similar to that within other populations in Europe. There is furthermore much less evidence of substructure within our Iberian samples than within the Italians, as shown in supplemental figure S2. This suggests that the reduced rate of shared ancestry is due to geographic isolation (by distance and/or the Pyrenees) rather than long-term stable substructure within the peninsula.
Our results show that patterns of recent identity by descent both provide evidence of ubiquitous shared common ancestry and hold the potential to shed considerable light on the complex history of Europe. However, these inferences also quickly run up against a fundamental limit to our ability to infer pairwise rates of recent common genetic ancestry. In order to make a fuller model of European history, we will need to make use of diverse sources of genomic information from large samples, including IBD segments and rare variants (46; 70), and develop methods that can more fully utilize this information across more than pairs of populations. Another profound difficulty is that Europe – and indeed any large continental region – has such complex layers of history, through which ancestry has mixed so greatly, that attempts to connect genetic signals in extant individuals to particular historical events require the corroboration of other sources of information from many disciplines. For example, the ability to isolate ancient autosomal DNA from individuals who lived during these time periods (as do 65; 34) will help to overcome some of these profound difficulties. More generally, the quickly falling cost of sequencing, along with the development of new methods, will shed light on the recent demographic and genealogical history of populations of recombining organisms, human and otherwise.
We used the two European subsets of the POPRES dataset – the CoLaus subset, collected in Lausanne, Switzerland, and the LOLIPOP subset, collected in London, England; the dataset is described in Nelson et al. (45). Those collected in Lausanne reported parental and grandparental country of origin; those collected in London did not. We followed Novembre et al. (49) in assigning each sample to the common grandparental country of origin when available, and discarding samples whose parents or grandparents were reported as originating in different countries. We took further steps to restrict to individuals whose grandparents came from the same geographic region, first performing principal components analysis on the data using SMARTPCA (52), and excluding 41 individuals who clustered with populations outside Europe (the majority of such were already excluded by self-reported non-European grandparents). These individuals certainly represent an important part of the recent genetic ancestry of Europe, but are excluded because we aim to study events stemming from older patterns of gene flow, and because we do not model the whole-genome dependencies in recently admixed genomes. We then used PLINK’s inference of the fraction of single-marker IBD (Z0, Z1, and Z2, 60) to identify very close relatives, finding 25 pairs that are first cousins or closer (including duplicated samples), and excluded one individual from each pair. We grouped samples into populations mostly by reported country, but also used reported language in a few cases. Because of the large Swiss samples, we split this group into three by language: French-speaking (CHf), German-speaking (CHd), or other (CH). Many samples reported grandparents from Yugoslavia; when possible we assigned these to a modern-day country by language, and when this was ambiguous or missing we assigned these to “Yugoslavia”. 
Most samples from the United Kingdom reported this as their country of origin; however, the few that reported “England” or “Scotland” were assigned this label. This left us with 2257 individuals from 40 populations; for sample sizes see table 1. Supplemental table S2 further breaks this down, and unambiguously gives the composition of each population. Physical distances were converted to genetic distances using the hg36 map, and the average human generation time was taken to be 30 years (15).
All figures were produced in R (61), with color palettes from packages colorspace (76) and RColorBrewer (55). Code implementing all methods described below is distributed along with IBD block data sufficient to reproduce the historical analyses through http://www.github.com/petrelharp/euroibd and in the Dryad digital repository (62).
To find blocks of IBD, we used fastIBD (implemented in BEAGLE, 5), which records putative genomic segments shared IBD by pairs of individuals, along with a score indicating the strength of support. As suggested by the authors, in all cases we ran the algorithm 10 times with different random seeds, and postprocessed the results to obtain IBD blocks. Based on our power simulations described below, we modified the postprocessing procedure recommended by Browning and Browning (5) to deal with spurious gaps or breaks introduced into long blocks of IBD by low marker density or switch error, as follows: We called IBD segments by first removing any segments not overlapping a segment seen in at least one other run (as suggested by Browning and Browning (5), except with no score cutoff); then merging any two segments separated by a gap shorter than at least one of the segments and no more than 5cM long; and finally discarding any merged segments that did not contain a subsegment with score below . As shown in figure 6, this resulted in a false positive rate of between 8–15% across length categories, and a power of at least 70% above 1cM, reaching 95% by 4cM. After post-processing, we were left with 1.9 million IBD blocks, 1 million of which were at least 2cM long (at which length we estimate 85% power and a 10% false positive rate).
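The merging rule in the postprocessing above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code distributed with the paper: it covers only the gap-merging step (not the filter on overlap between runs or the final score filter), it works on the segments of a single pair of individuals on one chromosome, and it assumes that after a merge the merged length is what later gaps are compared against. The function name and the `(start, end)` representation are hypothetical.

```python
def merge_segments(segments, max_gap_cm=5.0):
    """Merge IBD segments, given as (start, end) in cM, for one pair.

    Two neighbouring segments are joined when the gap between them is at
    most max_gap_cm and shorter than at least one of the two segments.
    Assumption: a merged segment's full length is used in later comparisons.
    """
    merged = []
    for start, end in sorted(segments):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            longest = max(prev_end - prev_start, end - start)
            if gap <= max_gap_cm and gap < longest:
                # Extend the previous segment across the gap.
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged
```

For example, segments at 0–3cM and 3.5–6cM are joined into a single 0–6cM block, while two 1cM segments separated by a 2cM gap are left apart.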
All methods to identify haplotypic IBD rely on identifying long regions of near identical haplotypes between pairs of individuals (referred to as identical by state, IBS). However, long IBS haplotypes could potentially also result from the concatenation of multiple shorter blocks of true IBD. While such runs can contain important information about deeper population history (e.g. 39; 26), we view them as false positives, as they do not represent single haplotypes shared without intervening recombination. The chance of such a false positive IBD segment decreases as the genetic length of shared haplotype increases. However, the density of informative markers also plays a role, because such markers are necessary to infer regions of IBS.
If we are to have a reasonable false positive rate, we must accept imperfect power. Power will also vary with the density and informativeness of markers and length of segment considered. For example, it is intuitive that segments of genome containing many rare alleles are easier to identify as IBD. Conversely, rare immigrant segments from a population with different allele frequencies may, if they are shared by multiple individuals within the population, cause higher false positive rates. For these reasons, when estimating statistical power and false positive rate, it is important to use a dataset as similar to the one under consideration as possible. Therefore, to determine appropriate postprocessing criteria and to estimate our statistical power, we constructed a dataset similar to the POPRES with known shared IBD segments as follows: we copied haploid segments randomly between 60 trio-phased individuals of European descent (using only one from each trio) from the HapMap dataset (haplotypes from release #21, 17/07/06; 32), reoriented alleles to match the strand orientation of POPRES, substituted these for 60 individuals from Switzerland in the POPRES data, and ran BEAGLE on the result as before. These segments were copied between single chromosomes of randomly chosen individuals, for random lengths 0.5–20cM, with gaps of at least 2cM between adjacent segments and without copying between the same two individuals twice in a row. When copying, we furthermore introduced genotyping error by flipping alleles independently with probability .002 and marking the allele missing with probability .023 (error rates were determined from duplicated individuals in the sample as given by Nelson et al. (45)). An important feature of the inferred data was that BEAGLE often reported true segments longer than about 5cM as two or more shorter segments separated by a short gap, which led us to merge blocks as described above.
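The error step applied to copied segments might look like the following sketch. It assumes biallelic markers coded 0/1 and uses `None` for a missing call; the function name and this encoding are illustrative choices, and the order of the two checks (missingness first, then flipping) is an assumption, since the text gives only the two probabilities.

```python
import random


def add_genotyping_error(alleles, flip_prob=0.002, missing_prob=0.023, rng=None):
    """Return a copy of a list of 0/1 alleles with independent errors added.

    Each allele is marked missing (None) with probability missing_prob;
    otherwise it is flipped with probability flip_prob.
    """
    rng = rng or random.Random()
    noisy = []
    for allele in alleles:
        if rng.random() < missing_prob:
            noisy.append(None)
        elif rng.random() < flip_prob:
            noisy.append(1 - allele)
        else:
            noisy.append(allele)
    return noisy
```

Passing a seeded `random.Random` instance as `rng` makes a simulated dataset reproducible.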
We also need a reasonably accurate assessment of our bias and false positive rates for our inference of numbers of genetic ancestors from the IBD length spectrum. Although the estimated IBD lengths were approximately unbiased, we fit a parametric model to the relationship between true and inferred lengths after removing inferred blocks less than 1cM long. A true IBD block of length $x$ is missed entirely with probability $1 - c(x)$, and is otherwise inferred to have length $x + \epsilon$; with probability $\gamma(x)$ the error $\epsilon$ is positive; otherwise it is negative and conditioned to be less than $x$ in magnitude. In either case, $|\epsilon|$ is exponentially distributed; if $\epsilon > 0$ its mean is $1/\lambda_+(x)$, while if $\epsilon < 0$ its (unconditional) mean is $1/\lambda_-(x)$. The parametric forms were chosen by examination of the data; these are, with final parameter values:
(1) |
where . The parameters were found by maximum likelihood, using constrained optimization as implemented in the R package optim (61) separately on three independent pieces: the parameters in and ; the parameters in ; and finally the parameters in ; the fit is shown in supplemental figure S10.
To estimate the false positive rate, we randomly shuffled segments of diploid genome between individuals from the same population (only those 12 populations with at least 19 samples) so that any run of IBD longer than about 0.5cM would be broken up among many individuals. Specifically, as we read along the genome we output diploid genotypes in random order; we shuffled this order by exchanging the identity of each output individual with another at independent increments chosen uniformly between 0.1 and 0.2cM. This ensured that no output individual had a continuous run of length longer than 0.2cM copied from a single input individual, while also preserving linkage on scales shorter than 0.1cM. The results are shown in figure 6B; from these we estimate that the mean density of false positives cM long per pair and per cM is approximately
(2) |
a parametric form again chosen by examination of the data and fit by maximum likelihood.
We found that overall, the false positive rate was around of the observed rate, except for very long blocks (longer than 5cM or so, where it was close to zero), and for very short blocks (less than 1cM, where it approached 0.4). As fastIBD depends on estimating underlying haplotype frequencies, it is expected to have a higher false positive rate in populations that are more differentiated from the rest of the sample. There was significant variation in false positive rate between different populations, with Spain, Portugal, and Italy showing significantly higher false positive rates than the other populations we examined – see supplemental figure S11 – however, the variation was significant only for blocks shorter than 2cM across all population pairs, with the exception of pairs of Portuguese individuals, where the upwards bias may be significant as high as 4cM.
Finally, one concern is that as fastIBD calls IBD based on a model of haplotype frequencies in the sample, it may be unduly affected by the large-scale sample size variation across the POPRES sample. In particular, the French-speaking Swiss sample is very large, which could lead to systematic bias in calling IBD in populations closely related to the Swiss samples. To investigate this, we randomly discarded 745 French-speaking Swiss (all but 100 of these), and a random sampling of the remaining populations (removing 812 in total, leaving 1445). We then ran BEAGLE on chromosome 1 of these individuals, postprocessing in the same way as for the full sample. Reassuringly, there was high concordance between the two – we found that 98% (95%) of the blocks longer than 2cM found in the analysis with the full dataset (respectively, with the subset) were found in both analyses. Overall, more blocks were found by the analysis with the smaller dataset; however, by adjusting the score cutoff by a fixed amount this difference could be removed, leaving nearly identical length distributions between the two analyses. This is a known attribute of the fastIBD algorithm, and can alternatively be avoided by adjusting the model complexity (S. Browning, personal communication).
We then tested the extent to which the effect of sample size varied by population, for IBD blocks in several length categories (binning block lengths at 1, 2, 4, and 10cM). Suppose that $F_{xy}$ is the number of IBD blocks found between populations $x$ and $y$ in the analysis of the full dataset, and $S_{xy}$ is the number found in the analysis of the smaller dataset (counted between the same individuals each time). We then assume that $F_{xy}$ and $S_{xy}$ are Poisson with mean $\lambda_{xy}$ and $\mu_{xy}$, respectively, so that conditioned on $N_{xy} = F_{xy} + S_{xy}$ (the total number of blocks), $S_{xy}$ is binomial with parameters $N_{xy}$ and $p_{xy} = \mu_{xy} / (\lambda_{xy} + \mu_{xy})$. We are looking for deviations from the null model under which the effect of smaller sample size affects all population pairs equally, so that $\mu_{xy} = C \lambda_{xy}$ for some constant $C$. We therefore fit a binomial GLM (41) with a logit link, with terms corresponding to each population – in other words,
$$\operatorname{logit}(p_{xy}) = \alpha + \beta_x + \beta_y .$$
We found statistically significant variation by population (i.e. several nonzero $\beta_x$), but all effect sizes were in the range of 0–4%; estimated parameters are listed in supplemental table S1. Notably, the coefficient corresponding to the French-speaking Swiss (the population with the largest change in sample size) was fairly small. We also fit the model not assuming additivity when $x = y$, i.e. adding coefficients $\gamma_x$ to the formula for $p_{xx}$, but these were not significant. These results suggest that sample size variation across the POPRES data has only minor effects on the distribution of IBD blocks shared across populations.
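A simpler check than the GLM conveys the logic of the null model in this comparison. The sketch below is a pooled-proportion diagnostic, not the fitted model: it assumes block counts are available as dictionaries keyed by population pair (hypothetical inputs), estimates the common binomial proportion from all pairs together, and reports how far each pair's smaller-dataset count lies from its expectation in standard deviations.

```python
import math


def null_model_zscores(full_counts, subset_counts):
    """Standardized deviations from a common binomial proportion.

    full_counts and subset_counts map each population pair to the number
    of blocks found in the full and the smaller dataset. Under the null
    model, the subset count given the pair's total is binomial with the
    same proportion p for every pair.
    """
    total_subset = sum(subset_counts.values())
    total = total_subset + sum(full_counts.values())
    p = total_subset / total
    zscores = {}
    for pair, full in full_counts.items():
        n = full + subset_counts[pair]
        expected = n * p
        sd = math.sqrt(n * p * (1 - p))
        zscores[pair] = (subset_counts[pair] - expected) / sd
    return p, zscores
```

Pairs with large absolute z-scores are those for which subsampling changed the number of detected blocks by more or less than it did on average.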
To look for regions of unusual levels of IBD and to examine our assumption of uniformity, we compared the density of IBD tracts of different lengths along the genome, in supplemental figure S1. To do this, we first divided blocks up into nonoverlapping bins based on length, with cutpoints at 1, 2.5, 4, 6, 8, and 10cM. We then computed at each SNP the number of IBD blocks in each length bin that covered that site. To control for the effect of nearby SNP density on the ability to detect IBD, we then computed the residuals of a linear regression predicting number of overlapping IBD blocks using the density of SNPs within 3cM. To compare between bins, we then normalized these residuals, subtracting the mean and dividing by the standard deviation; these “z-scores” for each SNP are shown in figure S1.
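As a minimal sketch of the normalization just described, the following computes residuals of a simple linear regression of block counts on local SNP density and then standardizes them. The function name and the use of ordinary least squares with a single predictor are assumptions made for illustration.

```python
from statistics import mean, stdev


def residual_z_scores(block_counts, snp_density):
    """Regress per-SNP block counts on SNP density, then z-score the residuals."""
    mx, my = mean(snp_density), mean(block_counts)
    sxx = sum((x - mx) ** 2 for x in snp_density)
    sxy = sum((x - mx) * (y - my) for x, y in zip(snp_density, block_counts))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual: observed count minus the count predicted from SNP density.
    resid = [y - (intercept + slope * x) for x, y in zip(snp_density, block_counts)]
    m, sd = mean(resid), stdev(resid)
    return [(r - m) / sd for r in resid]
```

Applying the function separately to each length bin puts the bins on a common scale, so that unusually high or low sharing can be compared across block lengths.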
We noted repeated patterns of IBD sharing across multiple populations (seen in supplemental figure S3), in which certain sets of populations tended to show similar patterns of sharing. To quantify this, we computed correlations between mean numbers of IBD blocks; in supplemental figure S7 we show correlations in numbers of blocks of various lengths. Specifically, if $I(x,y)$ is the mean number of IBD blocks of the given length shared by an individual from population $x$ with a (different) individual from population $y$, there are $n$ populations, and $\bar I_y(x) = \frac{1}{n-2} \sum_{z \notin \{x,y\}} I(x,z)$, then figure S7 shows for each $x$ and $y$,
$$ \frac{ \sum_{z \notin \{x,y\}} \left( I(x,z) - \bar I_y(x) \right) \left( I(y,z) - \bar I_x(y) \right) }{ \sqrt{ \sum_{z \notin \{x,y\}} \left( I(x,z) - \bar I_y(x) \right)^2 \, \sum_{z \notin \{x,y\}} \left( I(y,z) - \bar I_x(y) \right)^2 } } , \tag{3} $$
the (Pearson) correlation between $I(x,z)$ and $I(y,z)$ ranging across $z \notin \{x,y\}$. Other choices of block lengths are similar, although shorter blocks show higher overall correlations (due in part to false positives) and longer blocks show lower overall correlations (as rates are noisier, and sharing is more restricted to nearby populations). The geographic groupings of table 1 were then chosen by visual inspection.
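The correlation described above can be sketched as follows. The matrix `I` of mean sharing rates is hypothetical input (a list of lists indexed by population), and excluding both focal populations from the comparison follows the description in the text.

```python
import math


def sharing_correlation(I, x, y):
    """Pearson correlation between rows x and y of I, over all other populations."""
    others = [z for z in range(len(I)) if z not in (x, y)]
    a = [I[x][z] for z in others]
    b = [I[y][z] for z in others]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)
```

Two populations whose sharing with every third population is proportional receive a correlation of 1, regardless of how much they share with each other.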
We assessed the overall degree of substructure within each population by measuring, for each $x$ and $y$, the degree of inhomogeneity across individuals of population $x$ for shared ancestry with population $y$. We measured inhomogeneity by the standard deviation in number of blocks shared with population $y$, across individuals of population $x$. We assessed the significance by a permutation test, randomly reassigning each block shared between $x$ and $y$ to an individual chosen uniformly from population $x$, and recomputing the standard deviation, 1000 times. (If there are $B$ blocks shared between $x$ and $y$ and $m$ individuals in population $x$, this is equivalent to putting $B$ balls in $m$ boxes, tallying how many balls are in each box, and computing the sample standard deviation of the resulting list of numbers.) Note that some degree of inhomogeneity of shared ancestry is expected even within randomly mating populations, due to randomness of the relationship between individuals in the pedigree. These effects are likely to be small if the relationships are suitably deep, but this is still an area of active research (28; 7). The resulting $p$-values are shown in supplemental figure S2. We did not analyze these in detail, particularly as we had limited power to detect substructure in populations with few samples, but note that a large proportion (47%) of the population pairs showed greater inhomogeneity than in all 1000 permuted samples (i.e. $p < 0.001$). Some comparisons even with many samples in both populations (where we have considerable power to detect even subtle inhomogeneity) showed no structure whatsoever; in particular, the distribution of numbers of Italian IBD blocks shared by Swiss individuals is not distinguishable from Poisson, indicating a high degree of homogeneity of Italian ancestry across Switzerland.
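The balls-in-boxes permutation test can be sketched in a few lines. The function name, the fixed seed, and the convention of counting permutations whose standard deviation is at least the observed one are assumptions made for this illustration.

```python
import random
from statistics import stdev


def permutation_p_value(blocks_per_individual, n_perm=1000, seed=0):
    """Fraction of permutations at least as inhomogeneous as the observed counts.

    blocks_per_individual[i] is the number of blocks that individual i of one
    population shares with the other population.
    """
    rng = random.Random(seed)
    n_boxes = len(blocks_per_individual)
    n_balls = sum(blocks_per_individual)
    observed = stdev(blocks_per_individual)
    n_extreme = 0
    for _perm in range(n_perm):
        boxes = [0] * n_boxes
        # Reassign each block to an individual chosen uniformly at random.
        for _ball in range(n_balls):
            boxes[rng.randrange(n_boxes)] += 1
        if stdev(boxes) >= observed:
            n_extreme += 1
    return n_extreme / n_perm
```

A returned value of 0 with 1000 permutations corresponds to the case reported in the text as greater inhomogeneity than in all permuted samples.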
To assess the single marker measures of relatedness across the POPRES sample we calculated pairwise identity by state, the probability that two alleles sampled at random from a pair of individuals are identical, averaged across SNPs. This was calculated for all pairs of individuals using the “--genome” option in PLINK v1.07 (60, http://pngu.mgh.harvard.edu/purcell/plink), and is shown in Supplementary Figure S9 with points colored as in Figure 3.
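To make the definition of identity by state concrete, here is a minimal sketch that computes it directly from the verbal definition, for genotypes coded as 0, 1, or 2 copies of a reference allele. It is an illustration of the definition only and is not claimed to reproduce the exact summary that PLINK reports; the genotype coding and the function name are assumptions.

```python
def pairwise_ibs(genotypes_a, genotypes_b):
    """Mean probability that one random allele from each individual is identical.

    Genotypes are counts (0, 1, 2) of a reference allele; None marks missing data.
    """
    total = 0.0
    n_sites = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue  # skip sites where either genotype is missing
        # Both draws are the reference allele, or both are the alternate allele.
        total += (a * b + (2 - a) * (2 - b)) / 4.0
        n_sites += 1
    return total / n_sites
```

For example, two heterozygotes match with probability 1/2 at that site, while opposite homozygotes match with probability 0.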
Here, our aim is to use the distribution of IBD block lengths to infer how long ago the genetic common ancestors were alive from which these IBD blocks were inherited. A pair of individuals who share a block of IBD of genetic length at least $x$ have each inherited contiguous regions of genome from a single common ancestor some number of generations ago that overlap for length at least $x$. If we start with the population pedigree, those ancestors from which the two individuals might have inherited IBD blocks are those that can be connected to both by paths through the pedigree. The distribution of possible IBD blocks is determined by the number of links (i.e. the number of meioses) occurring along the two paths.
Throughout the article we often refer informally to ancestors living a certain “number of generations in the past” as if humans were semelparous with a fixed lifetime. Keeping with this, it is natural to write the number of IBD blocks shared by a pair of individuals as the sum over past generations of the number of IBD blocks inherited from that generation. In other words, if $N(x)$ is the number of IBD blocks of genetic length at least $x$ shared by two individual chromosomes, and $N_n(x)$ is the number of such IBD blocks inherited by the two along paths through the pedigree having a total of $n$ meioses, then $N(x) = \sum_{n \ge 1} N_n(x)$. Therefore, averaging over possible choices of pairs of individuals, the mean number of shared IBD blocks can be similarly partitioned as
$$ \mathbb{E}[N(x)] = \sum_{n \ge 1} \mathbb{E}[N_n(x)] . \tag{4} $$
In each successive generation in the past each chromosome is broken up into successively more pieces, each of which has been inherited along a different path through the pedigree, and any two such pieces of the two individual chromosomes that overlap and are inherited from the same ancestral chromosome contribute one block of IBD. Therefore, the mean number of IBD blocks coming from $n/2$ generations ago is the mean number of possible blocks multiplied by the probability that a particular block is actually inherited by both individuals from the same genealogical ancestor in generation $n/2$. Allowing for overlapping generations, the first part we denote by $K(n,x)$, the mean number of pieces of length at least $x$ obtained by cutting the chromosome at the recombination sites of $n$ meioses, and the second part we denote by $\mu(n)$, the probability that the two chromosomes have inherited at a particular site along a path of total length $n$ meioses (e.g. their common ancestor at that site lived $n/2$ generations ago). Multiplying these and summing over possible paths, we have that
$$ \mathbb{E}[N(x)] = \sum_{n \ge 1} \mu(n) K(n,x) , \tag{5} $$
i.e. the mean rate of IBD is a linear function of the distribution of the time back to the most recent common ancestor averaged across sites. The distribution $\mu(n)$ is more precisely known as the coalescent time distribution (35; 74), in its obvious adaptation to population pedigrees. As a first application, note that the distribution of ages of IBD blocks above a given length $x$ depends strongly on the demographic history: a fraction $\mu(n) K(n,x) / \sum_m \mu(m) K(m,x)$ of these are from paths $n$ meioses long.
Furthermore, it is easy to calculate that for a chromosome of genetic length $G$ (with $x$ and $G$ measured in Morgans),
$$ K(n,x) = \left( n (G - x) + 1 \right) \exp(-x n) , \tag{6} $$
assuming homogeneous Poisson recombination on the genetic map (as well as constancy of the map and ignoring the effect of interference, which is reasonable for the range of $n$ we consider). The mean number of IBD blocks of length at least $x$ shared by a pair of individuals across the entire genome is then obtained by summing (5) across all chromosomes, and multiplying by four (for the four possible chromosome pairs).
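The relationship between block lengths and ancestor ages can be explored numerically. The sketch below assumes the kernel $K(n,x) = (n(G-x)+1)e^{-xn}$, which is the mean number of pieces of length at least $x$ when a chromosome of length $G$ Morgans is cut by a Poisson process of rate $n$; the coalescent time distribution `mu` is hypothetical input supplied as a dictionary from numbers of meioses to probabilities.

```python
import math


def K(n, x, G):
    """Mean number of pieces of length >= x (Morgans) after n meioses."""
    if x > G:
        return 0.0
    return (n * (G - x) + 1.0) * math.exp(-x * n)


def mean_blocks(mu, x, chrom_lengths):
    """Mean number of IBD blocks of length >= x for a pair of individuals:
    sum over chromosomes and path lengths, times four chromosome pairings."""
    return 4.0 * sum(p * K(n, x, G) for G in chrom_lengths for n, p in mu.items())


def age_distribution(mu, x, G):
    """Fraction of blocks of length >= x that come from paths n meioses long."""
    weights = {n: p * K(n, x, G) for n, p in mu.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}
```

For instance, with equal coalescence probability at 10 and at 100 meioses, nearly all blocks longer than 0.1 Morgans on a 2 Morgan chromosome come from the 10-meiosis paths, which illustrates why long blocks are informative about recent ancestry.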
Equations (5) and (6) give the relationship between lengths of shared IBD blocks and how long ago the ancestor lived from whom these blocks are inherited. Our goal is to invert this relationship to learn about $\mu(n)$, and hence the ages of the common ancestors underlying our observed distribution of IBD block lengths. To do this, we first need to account for sampling noise and estimation error. Suppose we are looking at IBD blocks shared between any of a set of $n_p$ pairs of individuals, and assume that $N(y)$, the number of observed IBD blocks shared between any of those pairs of length at least $y$, is Poisson distributed with mean $n_p M(y)$, where
(7) | ||||
(8) |
Here the false positive rate, the power, and the components of the error kernel are estimated as above, with parametric forms given in equations (2) and (1). The Poisson assumption has been examined elsewhere (e.g. 16; 31), and is reasonable because there is a very small chance of having inherited a block from each pair of shared genealogical ancestors; there are a great number of these, and if these events are sufficiently independent, the Poisson distribution will be a good approximation (see e.g. 21). If this holds for each pair of individuals, the total number of IBD blocks is also Poisson distributed, with the per-pair mean given by the average of this number across all constituent pairs. (Note that this does not assume that each pair of individuals has the same mean number, so does not assume that our set of pairs are a homogeneous population.)
We therefore have a likelihood model for the data, with demographic history (parametrized by $\mu$) as free parameters. Unfortunately, the problem of inferring $\mu$ is ill-conditioned (unsurprising given the similarity of the kernel (6) to the Laplace transform, see 14), which in this context means that the likelihood surface is flat in certain directions (“ridged”): for each IBD block distribution there is a large set of coalescent time distributions that fit the data equally well. A common difficulty in such problems is that the unconstrained maximum likelihood solution is wildly oscillatory; in our case, the unconstrained solution is not so obviously wrong, since we are helped considerably by the knowledge that $\mu \ge 0$. For reviews of approaches to such ill-conditioned inverse problems, see e.g. Petrov and Sizikov (54) or Stuart (68); the problem is also known as “data unfolding” in particle physics (10). If one is concerned with finding a point estimate of $\mu$, most approaches add an additional penalty to the likelihood, which is known as “regularization” (71) or “ridge regression” (29).
However, our goal is parametric inference, and so we must describe the limits of the “ridge” in the likelihood surface in various directions (which can be seen as maximum a posteriori estimates under priors of various strengths).
To do this, we first discretize the data, so that $N_i$ is the number of IBD blocks shared by any of a total of $n_p$ distinct pairs of individuals with inferred genetic lengths falling between $x_{i-1}$ and $x_i$. We restrict to blocks at least 2cM long, so that $x_0 = 2$cM. To find a discretization so that each $N_i$ has roughly equal variance, we choose the $x_i$ by first dividing the range of block lengths into 100 bins with equal numbers of blocks falling in each; discarding any bins longer than 1cM; and dividing the remainder of the range up into 1cM chunks. To further reduce computational time, we also discretize time, effectively requiring $\mu(n)$ to be constant on each of a sequence of intervals of $n$ whose widths grow with $n$, so the resolution is finest for recent times, and the maximum time depth considered is 6660 meioses, or 99900 years ago. (The discretization and upper bound on time depth were found to not affect our results.) We then compute by numerical integration (using the function integrate in R) the matrix $L$ discretizing the kernel given in (7), so that $L_{in}$ is the kernel that applied to $\mu$ gives the mean number of true IBD blocks per pair observed with lengths between $x_{i-1}$ and $x_i$, and $f_i$ is the mean number of false positives per pair with lengths in the same interval. We then sum across chromosomes, as before. The likelihood of the data is thus
$$ \prod_i \frac{ \left( n_p M_i \right)^{N_i} }{ N_i ! } \exp\left( - n_p M_i \right), \qquad M_i = f_i + \sum_n L_{in} \mu_n . \tag{9} $$
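The choice of block-length bins described above can be sketched as follows. This is one reading of the procedure (equal-count bins are kept until the first one wider than 1cM, after which the range is covered by 1cM chunks); the function name and its defaults are assumptions.

```python
import math


def length_bin_edges(lengths_cm, min_len=2.0, n_quantile_bins=100, max_width=1.0):
    """Bin edges: equal-count bins while they are narrow, then fixed-width chunks."""
    xs = sorted(v for v in lengths_cm if v >= min_len)
    if not xs:
        return [min_len]
    edges = [min_len]
    for k in range(1, n_quantile_bins + 1):
        # Upper edge of the k-th equal-count bin.
        idx = max(0, math.ceil(k * len(xs) / n_quantile_bins) - 1)
        q = xs[idx]
        if q - edges[-1] > max_width:
            break  # discard this and all later (wider) equal-count bins
        if q > edges[-1]:
            edges.append(q)
    # Cover the rest of the range with chunks of width max_width.
    while edges[-1] < xs[-1]:
        edges.append(edges[-1] + max_width)
    return edges
```

Because short blocks are far more numerous than long ones, the equal-count bins are narrow at short lengths and the fixed-width chunks take over in the sparse upper tail.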
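To the (negative) log likelihood we add a penalization $\gamma(\mu)$, after rescaling by the number of pairs $n_p$ (which does not affect the result but makes penalization strengths comparable between pairs of populations), and use numerical optimization (the L-BFGS-B method in optim, 61) to minimize the resulting functional (which omits terms independent of $\mu$)
$$ \mathcal{L}(\mu; N, \gamma) = \sum_i \left( f_i + \sum_n L_{in} \mu_n - \frac{N_i}{n_p} \log\left( f_i + \sum_n L_{in} \mu_n \right) \right) + \gamma(\mu) . \tag{10} $$
Often we will fix the functional form of the penalization and vary its strength, so that $\gamma(\mu) = \gamma_0 \tilde\gamma(\mu)$, in which case we will write $\mathcal{L}(\mu; N, \gamma_0)$ for $\mathcal{L}(\mu; N, \gamma_0 \tilde\gamma)$.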
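A minimal sketch of such a penalized objective is given below, assuming the Poisson model for binned counts with per-pair means $f_i + \sum_n L_{in}\mu_n$. The argument names and the particular roughness penalty (squared first differences) are hypothetical choices for illustration, and no optimizer is included.

```python
import math


def penalized_objective(mu, L, f, N, n_pairs, penalty=lambda mu: 0.0):
    """Negative Poisson log likelihood per pair (dropping terms constant in mu),
    plus a penalty term."""
    total = 0.0
    for i, count in enumerate(N):
        mean_i = f[i] + sum(L[i][n] * mu[n] for n in range(len(mu)))
        total += mean_i - (count / n_pairs) * math.log(mean_i)
    return total + penalty(mu)


def roughness(mu, strength=1.0):
    """Assumed example penalty: squared differences of neighbouring entries."""
    return strength * sum((b - a) ** 2 for a, b in zip(mu, mu[1:]))
```

With a single bin and a single time interval, the unpenalized objective is minimized where the modeled mean equals the observed count per pair, which gives a quick sanity check on the signs.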
For instance, the leftmost panels in figure 4 show the minimizing solutions with no penalization ($\gamma = 0$) and with a “roughness” penalization. Because our aim is to describe extremal reasonable estimates of $\mu$, in this and in other cases, we have chosen the strength of penalization to be “as large as is reasonable”, choosing the largest $\gamma_0$ such that the minimizing $\mu$ has log likelihood differing by no more than 2 units from the unconstrained optimum. This choice of cutoff can be justified as in Edwards (13), gave quite similar answers to other methods, and performed well on simulated population histories (see supplemental document).
This can be thought of as taking the strongest prior that still gives us “reasonable” maximum a posteriori answers. Note that the optimization is over nonnegative distributions $\mu$ also satisfying $\sum_n \mu(n) \le 1$ (although the latter condition does not enter in practice).
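The rule for choosing the penalization strength can be sketched as a one-dimensional search. The sketch assumes a user-supplied function `fit_loglik(strength)` (hypothetical) that returns the log likelihood of the penalized minimizer at a given strength, and that this value does not increase as the strength grows; it uses doubling followed by bisection, which is not necessarily how the search was done in the original analysis.

```python
def largest_acceptable_strength(fit_loglik, max_drop=2.0, start=1.0,
                                tol=1e-6, max_doublings=60):
    """Largest strength whose fit loses at most max_drop log likelihood units."""
    best = fit_loglik(0.0)  # unpenalized optimum

    def acceptable(strength):
        return best - fit_loglik(strength) <= max_drop

    lo, hi = 0.0, start
    for _ in range(max_doublings):
        if not acceptable(hi):
            break
        lo, hi = hi, 2.0 * hi
    else:
        return lo  # the drop never exceeded max_drop in the range searched
    # Bisect between the last acceptable and first unacceptable strengths.
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if acceptable(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

In a toy case where the log likelihood falls as the square of the strength, the search returns approximately the square root of 2, the point at which the drop reaches 2 units.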
We would also like to determine bounds on total numbers of shared genetic ancestors who lived during particular time intervals, by determining e.g. the minimum and maximum numbers of such ancestors that are consistent with the data. Such bounds are shown in figure 5. To obtain a lower bound for the time period between $n_1$ and $n_2$ generations, we penalized the total amount of shared ancestry during this interval, using a penalization proportional to the total coalescence probability that $\mu$ assigns to the interval, and choosing the strength $\gamma_0$ to give a drop of 2 log likelihood units, as described above. The lower bound is then the total amount of coalescence in the interval for the minimizing $\mu$. The upper bound is found by penalizing total shared ancestry outside this interval, i.e. by applying the analogous penalization to the complement of the interval. It is almost always the case that lower bounds are zero, since there is sufficient wiggle room in the likelihood surface to explain the observed block length distribution using peaks just below $n_1$ and above $n_2$. Examples are shown in supplemental figure S14. On the other hand, upper bounds seem fairly reliable.
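The two penalizations used for the bounds can be written down directly. In this sketch `mu` is a list of coalescence probabilities and `times` gives the time associated with each entry; both names, and the use of closed intervals, are assumptions made for illustration.

```python
def inside_penalty(mu, times, start, end, strength):
    """Penalize coalescence inside [start, end]; used for a lower bound."""
    return strength * sum(m for m, t in zip(mu, times) if start <= t <= end)


def outside_penalty(mu, times, start, end, strength):
    """Penalize coalescence outside [start, end]; used for an upper bound."""
    return strength * sum(m for m, t in zip(mu, times) if not start <= t <= end)


def total_in_interval(mu, times, start, end):
    """Total coalescence probability assigned to [start, end]."""
    return sum(m for m, t in zip(mu, times) if start <= t <= end)
```

Minimizing the objective with the inside penalty pushes ancestry out of the interval as far as the data allow, so the remaining total gives the lower bound; the outside penalty does the reverse.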
In the above we have assumed that the minimizer of $\mathcal{L}$ is unique, thus glossing over e.g. finding appropriate starting points for the optimization. In practice, we obtained good starting points by solving the natural approximating least-squares problem, using quadprog (72) in R. We then evaluated uniqueness of the minimizer by using different starting points, and found that if necessary, adding only a very small penalization term was enough to ensure convergence to a unique solution.
To test this method, we implemented a whole-genome coalescent IBD simulator, and applied our inference methods to the results under various demographic scenarios. We also used these simulations to evaluate the sensitivity of our method to misestimation of power or false positive rates. The simulations, and the results, are described in Supplemental Document 1; in all cases, the simulations showed that the method performed well to the level of uncertainty discussed throughout the text and confirmed our understanding of the method described above. We also found that misestimation of false positive rate only affects estimated numbers of common ancestors by a comparable amount, and that misestimation of blocks less than 4cM long mostly affects estimates older than about 2000 years. Therefore, if our false positive rates above 2cM are off by 10% (the range that seems reasonable), which would change our estimated numbers of blocks by about 1%, this would only change our estimated numbers of shared ancestors by a few percent.
We only used blocks longer than 2cM to infer ages of common ancestors, in part because the model we use does not seem to fit the data below this threshold. Attempts to apply the methods to all blocks longer than 1cM reveal that there is no history of rates of common ancestry that, under this model, produces a block length distribution reasonably close to the one observed: small but significant deviations occur below about 2cM. This probably occurs in part because our estimate of false positive rate is expected to be less accurate at these short lengths. Furthermore, our model does not explicitly model the overlap of multiple short IBD segments deriving from different ancestors to create one long segment, which could start to have a significant effect at short lengths. (The effect on long blocks we model as error in length estimation.) This could be incorporated into a model (in a way analogous to 39), but consideration of when several contiguous blocks of IBD might have few enough differences to be detected as a long IBD block quickly runs into the need for a model of IBD detection, which we here treat as a black box. Use of these shorter blocks, which would allow inference of older ancestry, will need different methods, and probably sequencing rather than genotyping data.
Estimated numbers of genetic common ancestors can be found by simply solving for using an estimate of in equations (5) and (6) (still restricting to genetic ancestors on the autosomes). These tell us that given the distribution , the mean number of genetic common ancestors coming from generation – i.e. the mean number of IBD blocks of any length inherited from such common ancestors – is , where is the total sex-averaged genetic length of the human chromosome. Since the total sex-averaged map length of the human autosomes is about 32 Morgans, this is about . This procedure has been used in figures 4 and 5.
Converting shared IBD blocks to numbers of shared genealogical common ancestors is more problematic. Suppose that modern-day individuals and both have as a grandparent. Using equation (6) at , we know that the mean number of blocks that and both inherit from is , with , since each block has chance of being inherited across meioses. First treat the endpoints of each distinct path of length back through the pedigree as a grandparent, so that everyone has exactly grandparents, and some ancestors will be grandparents many times over. Then if and share genetic grandparents, a moment estimator for the number of genealogical grandparents is . However, the geometric growth of means that small uncertainties in have large effects on the estimated numbers of genealogical common ancestors – and we have large uncertainties in .
Despite these difficulties, we can still get some order-of-magnitude estimates. For instance, we estimate that someone from Hungary shares on average about 5 genetic common ancestors with someone from the UK between 18 and 50 generations ago. Since , we would conservatively estimate that for every genetic common ancestor there are tens of millions of genealogical common ancestors. Most of these ancestors must be genealogical common ancestors many times over, but these must still represent at least thousands of distinct individuals.
Thanks to Razib Khan, Sharon Browning, and Don Conrad for several useful discussions, and to Jeremy Berg, Ewan Birney, Yaniv Brandvain, Joe Pickrell, Jonathan Pritchard, Alisa Sedghifar, and Joel Smith for useful comments on earlier drafts. We also thank the four anonymous reviewers, as well as Amy Williams (at Haldane’s sieve), for their helpful suggestions.