Global Prokaryotic Census
This dataset is provided as supplementary material to the following article:
Louca, S., Mazel, F., Doebeli, M. and Parfrey, L. W. (2019).
A census-based estimate of Earth's bacterial and archaeal diversity.
PLOS Biology. 17(2):e3000106. DOI:10.1371/journal.pbio.3000106

The Global Prokaryotic Census (GPC) is a large composite dataset of bacterial and archaeal 16S amplicon sequences, comprising publicly available data from roughly 34,000 samples across 492 studies worldwide. The GPC covers a wide range of environments, including the surface and deep ocean, oxygen minimum zones, freshwater and hypersaline lakes, rivers, groundwater, marine surface and deep subsurface sediments, agricultural and forest soils, peats, permafrost, deserts, animal hosts and feces, plant leaves and rhizospheres, salt marshes, bioreactors, processed food, methane seeps, mine drainage, sewage, hydrothermal vents and hot springs. Particular effort was put into representing soils (100 studies), sediments (37 studies) and animal guts (52 studies), which likely harbor a large fraction of Earth's prokaryotic diversity. The GPC's main objective was to enable a robust estimate of total extant global prokaryotic 16S diversity.

After stringent quality- and chimera-filtering, the GPC comprised about 1.7 billion high-quality reads, each covering at least 200 base pairs in the V4 hypervariable region of the 16S gene, a commonly targeted region in microbial ecology. From the GPC, we recovered roughly 740,000 high-fidelity prokaryotic OTUs at 97% sequence similarity, or 2.3 million OTUs at 99% similarity. To minimize spurious (non-biological) OTUs, each OTU was required to occur in at least two samples of the same study. As we describe in the associated paper, at 97% similarity in 16S-V4 we estimate that the GPC covers about 47-96% of global extant OTU richness, representing >99.98% of prokaryotic cells, and that on average 93% of OTUs in any new study are expected to already be in the GPC.
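
The requirement that each OTU occur in at least two samples of the same study can be illustrated with a minimal Python sketch. This is not the pipeline used for the GPC; the function name `filter_otus`, the dictionary layout and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def filter_otus(read_counts, sample_to_study, min_samples=2):
    """Return the set of OTUs found in >= min_samples samples of one study.

    read_counts:     dict, OTU id -> {sample id -> read count} (hypothetical layout)
    sample_to_study: dict, sample id -> study id (hypothetical layout)
    """
    kept = set()
    for otu, per_sample in read_counts.items():
        # Count, per study, the samples in which this OTU has at least one read.
        samples_per_study = defaultdict(int)
        for sample, count in per_sample.items():
            if count > 0:
                samples_per_study[sample_to_study[sample]] += 1
        # Keep the OTU if any single study supports it with enough samples.
        if any(n >= min_samples for n in samples_per_study.values()):
            kept.add(otu)
    return kept


# Toy data (invented): three samples from two studies.
sample_to_study = {"s1": "A", "s2": "A", "s3": "B"}
read_counts = {
    "otu1": {"s1": 5, "s2": 1},  # two samples of study A: kept
    "otu2": {"s1": 3, "s3": 7},  # one sample in each of two studies: dropped
    "otu3": {"s3": 9},           # a single sample: dropped
}
kept_otus = filter_otus(read_counts, sample_to_study)
print(sorted(kept_otus))  # ['otu1']
```

Note that `otu2` is dropped even though it occurs in two samples, because those samples belong to different studies.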

Sample summaries, including accession numbers for the raw sequence data.
Overview and accession numbers of studies included.
Additional materials & methods not included in the original paper.
OTU representative sequences, in fasta file format.
Incidence frequency counts of OTUs across samples.
Estimated taxonomic identities of OTUs.
OTU tables (read counts per sample).
Draft phylogenetic trees, generated using FastTree.
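
The layout of the incidence file is not described here. Assuming the usual meaning in richness estimation, where Q_k denotes the number of OTUs detected in exactly k samples, such counts could be derived from an OTU table roughly as follows. The function name, the dictionary layout and the toy table are hypothetical.

```python
from collections import Counter


def incidence_frequency_counts(otu_table):
    """Return a Counter mapping k -> Q_k, the number of OTUs seen in exactly k samples.

    otu_table: dict, OTU id -> {sample id -> read count} (hypothetical layout)
    """
    q = Counter()
    for counts in otu_table.values():
        # Incidence ignores abundance: a sample counts once if it has any reads.
        k = sum(1 for c in counts.values() if c > 0)
        if k > 0:
            q[k] += 1
    return q


# Toy OTU table (invented): read counts per sample.
otu_table = {
    "otu1": {"s1": 5, "s2": 1, "s3": 0},
    "otu2": {"s1": 0, "s2": 2, "s3": 7},
    "otu3": {"s1": 0, "s2": 0, "s3": 9},
}
q_counts = incidence_frequency_counts(otu_table)
print(dict(q_counts))  # {2: 2, 1: 1}
```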
Samples on world map

Collectors curves and global richness estimates

Distribution of relative OTU abundances and distances to SILVA

Taxon-specific richness estimates and coverages

