|Challenge data for macroevolutionary reconstructions
Fitting birth-death cladogenic models to time-calibrated phylogenies of extant species ("extant timetrees") is a widely used technique for estimating speciation/extinction dynamics over time, especially for taxa with a poor fossil record.
In a paper published in Nature (2020), we showed that, in the absence of additional constraints,
it is impossible to reliably reconstruct past speciation/extinction rates from extant timetrees alone.
The reason is that for any candidate diversification scenario there exists a myriad of alternative, markedly different diversification scenarios
that would generate extant timetrees with the same probability distribution as the candidate scenario.
These "congruent" diversification scenarios are thus statistically indistinguishable from the candidate scenario, even for infinitely large datasets.
In practice, this means that when fitting specific birth-death models to an extant timetree (for example via maximum likelihood), we can at most recover the congruence class
of the true diversification history that generated the tree, but not the true diversification history itself.
This situation becomes even more severe when the input data consists solely of extant molecular sequences and the phylogeny itself must be jointly estimated with the diversification dynamics, as is the case in some Bayesian analyses.
A challenge data set
Here we provide multiple "challenge" data sets that illustrate the difficulties of inferring past diversification histories from extant timetrees alone.
Specifically, we provide 10 extant timetrees generated according to 5 different hypothetical diversification scenarios, i.e., using birth-death models with specific speciation and extinction rate profiles over time,
and encourage anyone to try for themselves to reconstruct the original speciation/extinction rates used in the simulations.
The diversification scenarios used in the simulations were chosen to be generally plausible, i.e., to resemble the complexity and rate values commonly inferred in eukaryotic macroevolution.
Note that nature almost never generates data exactly according to our mathematical models, and the only feasible goal is usually to find reasonable approximations of the more complex truth.
In that spirit, we do not expect anyone to exactly reconstruct the true speciation/extinction rates, but we do ask for approximations that would be reasonably acceptable in practice (for example, exhibiting the correct trends, broader features and order of magnitude).
Because information towards the root is scarce, we only ask for accurate reconstructions in the second (more recent) half of the tree's time span.
For every tree we provide the sampling fraction used, i.e., which fraction of extant lineages was ultimately sampled at present-day (in practice this information is not always available).
The provided timetrees are "exact", i.e., topology and branch lengths can be considered accurately known. This eliminates the uncertainties in tree estimation that often arise in practice and keeps the focus on the identifiability issues stemming from birth-death model congruencies.
The provided timetrees are intentionally large (>10,000 tips), in order to reduce reconstruction errors stemming from small sample sizes and the unavoidable stochasticity of the birth-death process.
Of course, you are welcome to downsample the trees if your favorite tool cannot handle such large datasets (if you do, remember to adjust the sampling fraction in your analysis accordingly).
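As a rough illustration of the adjustment, the following Python sketch assumes that tips are dropped uniformly at random, in which case the effective sampling fraction is the original fraction multiplied by the proportion of tips kept. The function names and the example numbers are hypothetical, and the actual pruning of the tree is left to whichever phylogenetics library you use.

```python
import random

def downsampled_fraction(original_fraction, n_tips_original, n_tips_kept):
    """Effective sampling fraction after keeping a uniform random subset of tips.

    Assumption: tips are removed uniformly at random, so the fraction of all
    extant lineages represented in the tree shrinks proportionally.
    """
    if not 0 < n_tips_kept <= n_tips_original:
        raise ValueError("n_tips_kept must be in (0, n_tips_original]")
    return original_fraction * n_tips_kept / n_tips_original

def choose_tips_to_keep(tip_labels, n_keep, seed=0):
    """Pick a uniform random subset of tip labels (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(list(tip_labels), n_keep)

# Made-up example: a tree with 12,000 tips and sampling fraction 0.5,
# downsampled to 3,000 tips.
tips = [f"tip{i}" for i in range(12000)]
kept = choose_tips_to_keep(tips, 3000)
new_fraction = downsampled_fraction(0.5, len(tips), len(kept))
print(new_fraction)  # 0.125
```

The adjusted value (0.125 in this made-up example) is what you would pass to your fitting tool in place of the sampling fraction listed for the full tree.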
How do I submit reconstruction attempts?
Please send us the reconstructed speciation/extinction rates for any one (or all) of the challenge sets in the form of TSV data tables (one file per tree).
Each table should list the ages (time before present), speciation rates and extinction rates in 3 separate and clearly named columns (one row per time point).
You may also include the reconstructed net diversification (speciation minus extinction) rates over time.
If you performed a Bayesian analysis and prefer sending us the posterior distribution of estimated diversification dynamics, rather than a single "best guess",
please send us a separate TSV table for each posterior sample (for example, 1000 TSV files if your MCMC chain yielded 1000 posterior samples).
The range and number of time points listed is your choice, but note that the speciation/extinction rates will be interpreted and plotted as piecewise linear curves between successive time points (so use plenty of time points to make your curves smooth).
Please also include a brief (one-paragraph) description of the methods used.
Please send your data to Stilianos Louca (email provided here).
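The table layout and the piecewise linear interpretation described above could look roughly like the following Python sketch. The column names (`age`, `speciation_rate`, `extinction_rate`, `net_diversification_rate`) and the rate values are illustrative assumptions, not a prescribed format; any clearly named columns should do.

```python
import csv
import io
from bisect import bisect_right

def write_rates_tsv(handle, ages, speciation, extinction):
    """Write one row per time point, with a header row naming each column."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Column names are an assumption; the challenge only asks for clear names.
    writer.writerow(["age", "speciation_rate", "extinction_rate",
                     "net_diversification_rate"])
    for age, lam, mu in zip(ages, speciation, extinction):
        writer.writerow([age, lam, mu, lam - mu])

def piecewise_linear(ages, values, query_age):
    """Linearly interpolate between successive time points.

    Assumes ages are strictly increasing and contain at least two points.
    """
    if not ages[0] <= query_age <= ages[-1]:
        raise ValueError("query age outside the listed range")
    i = min(bisect_right(ages, query_age), len(ages) - 1)
    a0, a1 = ages[i - 1], ages[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (query_age - a0) / (a1 - a0)

# Made-up rates on a coarse grid (a real submission should use many more points).
ages = [0.0, 10.0, 20.0]
speciation = [1.0, 0.5, 0.75]
extinction = [0.5, 0.25, 0.5]

buffer = io.StringIO()  # replace with open("my_fit.tsv", "w", newline="") to write a file
write_rates_tsv(buffer, ages, speciation, extinction)
print(buffer.getvalue())

# Value that a piecewise linear reading assigns halfway between the first two points.
print(piecewise_linear(ages, speciation, 5.0))  # 0.75
```

With only three time points the curve has visible kinks at ages 10 and 20, which is why a dense grid is preferable.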
When/how will you reveal the true diversification dynamics?
We plan to publish the true diversification dynamics (i.e., speciation/extinction rates over time) used in the simulations on this website by August 2021.
We will also post any reconstruction attempts submitted by the community for comparison; the identity of submitters will not be published unless the submitter explicitly requests otherwise.
To allow anyone to verify the posted true diversification scenarios, i.e., to confirm that we did not retrospectively change the scenarios in response
to the attempts submitted by the community, we hereby also provide the MD5 hash for each scenario's data table listing the true speciation/extinction rates over time (to be revealed in August 2021).
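Once a truth table is revealed, its hash can be checked with a few lines of Python. The sketch below uses the standard `hashlib` module; the helper names and the demo file are hypothetical, since the file names of the revealed tables are not known here. Note that an MD5 hash is computed over the exact bytes of a file, so any change (for example converted line endings) produces a different hash.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, published_md5):
    """Compare a file's MD5 digest with a published one, ignoring case and whitespace."""
    return md5_of_file(path) == published_md5.strip().lower()

# Self-contained demo on a temporary file with made-up content.
demo_bytes = b"age\tspeciation_rate\textinction_rate\n0.0\t1.0\t0.5\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".tsv") as tmp:
    tmp.write(demo_bytes)
    demo_path = tmp.name

demo_hash = hashlib.md5(demo_bytes).hexdigest()
is_match = matches_published_hash(demo_path, demo_hash.upper() + "\n")
os.remove(demo_path)
print(is_match)  # True
```

For the actual check you would call `matches_published_hash` with the path of the downloaded truth table and the corresponding MD5 value listed for that tree.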
How will reconstruction attempts be evaluated?
With every reconstruction attempt, we will publish basic measures of accuracy, including the mean relative deviation (|fit-true|/true) for the speciation rate, extinction rate and net diversification rate.
We will also assess the ability to make correct qualitative inferences, including (a) whether the speciation rate or extinction rate has been increasing or decreasing in recent times (i.e., whether the slope is positive or negative at present-day),
(b) the relative ranking of present-day extinction rates across scenarios and (c) the average trend of the speciation/extinction rates over the entire considered period.
Lastly, through visual comparison everyone is encouraged to reach their own conclusion on whether they would trust such reconstructions for their own work in the future.
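To make the accuracy measure concrete, here is a minimal Python sketch of a mean relative deviation and of a present-day trend check. It is an illustration under stated assumptions, not the evaluation code itself: the deviation is averaged over a shared grid of time points, the true values are assumed to be strictly positive, and the trend is read off the two time points closest to the present. All numbers are made up.

```python
def mean_relative_deviation(fit, true):
    """Average of |fit - true| / true over a shared grid of time points.

    Assumption: fit and true are evaluated at the same time points and all
    true values are strictly positive (a zero true rate has no relative error).
    """
    if len(fit) != len(true) or not true:
        raise ValueError("fit and true must be non-empty and of equal length")
    if any(t <= 0 for t in true):
        raise ValueError("true values must be strictly positive")
    return sum(abs(f - t) / t for f, t in zip(fit, true)) / len(true)

def present_day_trend(rates_by_age):
    """Direction of change towards the present.

    rates_by_age is ordered by increasing age, so index 0 is present-day and
    index 1 is the next-oldest time point.
    """
    change = rates_by_age[0] - rates_by_age[1]
    if change > 0:
        return "increasing"
    if change < 0:
        return "decreasing"
    return "flat"

# Made-up rates at ages 0, 10 and 20.
true_rates = [1.0, 2.0, 4.0]
fit_rates = [1.5, 2.0, 3.0]

print(mean_relative_deviation(fit_rates, true_rates))  # 0.25
print(present_day_trend(true_rates))                   # decreasing
print(present_day_trend(fit_rates))                    # decreasing
```

In this made-up case the fit is off by 25% on average yet still recovers the correct qualitative trend at present-day, which is the kind of distinction the two types of assessment are meant to capture.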
|Tree HBD1. Sampling fraction 0.5. Truth MD5 423b23bdd90b93e74d3a6bbc5e16aed8.|
|Tree HBD2. Sampling fraction 0.1. Truth MD5 60048fb5c88a87b3b434beff7a524205.|
|Tree HBD3. Sampling fraction 0.8. Truth MD5 b407625a100c659ac90f5e4799e19549.|
|Tree HBD4. Sampling fraction 1.0. Truth MD5 b21febad8e63cf1280c8ff8455a40363.|
|Tree HBD5. Sampling fraction 0.5. Truth MD5 b9eb997a56a899617025e214e1e53795.|
|Tree HBD6. Sampling fraction 1.0. Truth MD5 a5eb94e6fc56f1bb8b1e16d31f669a5c.|
|Tree HBD7. Sampling fraction 0.5. Truth MD5 5674f35c38df8c59a4f4a66e11ccc9a2.|
|Tree HBD8. Sampling fraction 1.0. Truth MD5 e8c4728fbfc3141c0c78a683b79e9742.|
|Tree HBD9. Sampling fraction 1.0. Truth MD5 43ca6e729d418c40504cce83c9616e50.|
|Tree HBD10. Sampling fraction 1.0. Truth MD5 5293e331e10f09e9da5bf30cf27f4c0f.||
Louca lab. Department of Biology, University of Oregon, Eugene, USA|
© 2021 Stilianos Louca all rights reserved