We used whole-genome shotgun sequence data from 910 individuals whose genomes were sequenced as part of the CAAPA project, available from dbGaP as accession phs001123.v1.p1. The total data set contains 1.19 trillion (1.19 × 1012) 100-bp paired end reads, representing an average of 30–40× coverage for each individual’s genome. Sequencing was performed on an Illumina HiSeq 2000. The subjects in the study were all of African ancestry and were selected from 19 populations across the Americas, the Caribbean, and continental Africa25 (Supplementary Table 6).

Assembly of novel contigs

For each sample, we aligned all reads to GRCh38.p0 using Bowtie2 (ref. 29) and extracted unaligned reads and their mates using Samtools30 (Fig. 1). GRCh38 alternate loci were excluded from the reference index, but were considered later in the process. We then assembled all unaligned reads with the MaSuRCA assembler31; if neither mate in a pair aligned to GRCh38, MaSuRCA treated the reads as paired ends with a fragment size of 300 bp, and if only one mate was unaligned, MaSuRCA treated it as an unpaired read.

We filtered the resulting assemblies to exclude contigs shorter than 1000 bp (Fig. 1) and evaluated all remaining contigs with the Centrifuge metagenomics program32, scanning against the comprehensive NCBI nucleotide database to obtain a taxonomic classification of each contig. We considered any contigs labeled by Centrifuge as non-chordates (for example, bacterial or viral contigs) to be contaminants and removed them from further consideration.

Positioning contigs within GRCh38

We attempted to place the assembled contigs in a precise location in the human genome using mapping information from paired reads (“mates”). We masked contigs with RepeatMasker33 with the low-complexity option off (–nolow) and used Bowtie2 to realign all unaligned reads from read pairs in which only one mate had aligned originally. For each read R aligning within 500 bases of the end of a contig, we examined the alignment of R’s mate to GRCh38 to determine whether the contig had a unique placement in the reference genome. The fragment length for all paired-end libraries was 300 bp; by considering reads within 500 bp of the end of a contig, we reduced the likelihood that one or both of the alignments was a spurious match. Additional details of the sequencing protocols for the CAAPA genomes are described elsewhere25. This process resulted in a pool of linking mates corresponding to the beginning and end of each contig.

We then separated contigs into several groups based on their linking information:

1. No linking mates existed on either end of the contig; the reads mates did not align to GRCh38. 2. Placement was unambiguous (or unique) for at least one end of the contig. We define ‘chromosome unambiguous’ to mean >95% of the linking mates linked to the same chromosome. We define ‘region unambiguous’ to mean that of the >95% of mates aligned to the same chromosome, all mates aligned within 2 kb of each other. When both conditions hold, we say placement is unambiguous. These contigs were further divided into two subgroups: a. Both ends of the contig were placed unambiguously, or b. Only one end was placed unambiguously. 3. At least one end of the contig was chromosome unambiguous, but neither end was region unambiguous. 4. Neither end was chromosome unambiguous.

For all contigs in the second group, we used NUCmer34 to align them to the region determined by the linking mates (Fig. 1). If a contig end had one or more consistent exact matches of at least 15 bases (and no inconsistent alignments), we then determined the contig end’s exact insertion location based on alignment coordinates (Supplementary Methods). We permitted an exact two-ended placement only if both ends aligned to the same reference region with the same orientation. The insertion position was either a single breakpoint, if both ends of the contig were placed identically, or a range, if the insertion location of the two ends was not identical. For contigs with only a single end exactly placed, we recorded their exact single-end insertion position and the number of overlapping bases (bases to be trimmed off the end of the contig).

Insertion discovery with PopIns

To supplement the list of placed contigs determined by the procedure above, we ran the PopIns program16, which was used previously for a set of genomes from Icelandic individuals, and was designed to find insertions from a relatively genetically homogenous population. We ran PopIns beginning with the popins merge step, using the cleaned MaSuRCA contig assemblies described above. We ran subsequent PopIns steps as recommended in the PopIns documentation, through the popins place-finish step. PopIns output was converted into a comparable format, and verifiable placements were added to our sets of insertions (Supplementary Methods).

Clustering of placed contigs

Once contig locations were determined for each individual sample, we aligned all insertions to one another and clustered them to determine which contigs represented the same insertion across individuals (Fig. 1).

Clustering two-ended placements

For contigs with both ends placed, we ran BEDtools merge35 to group contigs placed at approximately the same location. We used the -d option with a distance of 10 to allow placements within 10 bases of each other to be combined. We also ran the merge using -d 100, which produced identical results. For each resulting region and contig cluster, we chose the longest contig in the cluster as the cluster’s representative (R), and these representatives formed the initial set of two-end placed contigs, 2EP. Two-ended placement clusters from PopIns were then added to 2EP. We verified clusters by aligning all contigs in each cluster to its representative, R, with default nucmer parameters and removing from the cluster any contigs that did not have any alignments to R. To find the complete set of samples containing each insertion, we then aligned all remaining contigs (including unplaced contigs) to the contigs in the clusters. Any contig aligning with >99% identity that was fully contained within a contig in a cluster C and covered ≥80% of the contig in C was included in C as part of the final set. Contained, 99–100% identical contigs aligning with <80% coverage were also included if they had at least five linking mates and at least 25% of those mates linked to within 5 kb of the placement location. The longest representative contig in each cluster was used as the final insertion sequence for the African Pan-Genome (APG) contig collection (Supplementary Tables 1 and 2).

Clustering one-ended placements

We separated contigs with only one end placed into two groups: (1) contigs where the “left” end aligned to the reference, so that the contig extends into a gap to the right of the placement location; and (2) contigs with their “right” end placed, so the contig extends into a gap to the left of the placement location (Fig. 1). Left and right were determined by the orientation of the chromosomes in GRCh38. We then created clusters separately for the two groups using BEDtools merge (-d 100) as described above, identifying the longest representative R for each group. This formed the initial set of one-end placed contigs, 1EP. Any placements within 100 bases of a two-ended cluster (in the set 2EP) were then removed from 1EP, and each contig in these 1EP clusters was aligned to the representative of the 2EP cluster(s) within 100 bases. If any 1EP contig in the cluster aligned with ≥80% coverage and ≥90% identity to the 2EP contig, the 1EP contig was added to the 2EP cluster.

We then added PopIns one-ended placement clusters to the right and left placements in 1EP (Supplementary Methods). Then for all clusters, we used NUCmer with default parameters to align contigs within each cluster to the representative R. If no alignment was found between a contig and R, the contig was removed from the cluster. We then realigned all other contigs to those in each of these filtered clusters, excluding contigs already determined to be part of a two-ended insertion. Contigs >99% identical over their whole length to any member of a cluster C and covering at least 80% of the contig in C were added to C. Contained, 99–100% identical contigs aligning with less than 80% coverage, were also included if they had at least five linking mates and at least 25% of those mates linked to within 5 kb of the placement location.

We then evaluated the one-ended placements to determine whether two contigs might belong to the same longer insertion, where one contig would ‘fill’ the left side of a gap and the other would fill the right side, possibly meeting in the middle. In some of these cases, the contigs might overlap, allowing us to merge them and create a single, longer insertion sequence. If placement positions were within 500 bases of one another, the sequences were aligned with NUCmer and merged if they were determined to be part of the same insertion (Supplementary Methods). Resultant merged sequences and their clusters were moved to the 2EP set (Fig. 1).

Finally, to remove any potential redundancy from placed clusters, we aligned all representatives from both one- and two-end placed clusters to one another (using nucmer –maxmatch –nosimplify) regardless of placement distance. If two representatives aligned with ≥98% identity, covering ≥95% of one of the contigs, and were placed within 5 kb of one another, these clusters were merged. To determine the representative (and therefore reported placement) of the merged clusters, two-ended placed representatives were favored over one-ended representatives, then our placements were preferred over PopIns, then longer contigs were favored over shorter contigs. By merging only placements within 5 kb, we avoided merging contigs that were similar solely due to repetitive sequences but were unambiguously linked to different locations.

Unplaced contigs

For all unplaced contigs, we ran nucmer –maxmatch –nosimplify with a minimum seed length of 31 (-1 31) and a minimum cluster size of 100 (-c 100) to align all contigs against one another. Contigs contained within another contig and aligning with >95% identity were removed, and if contigs were annotated as identical by show-coords with >97% identity, the smaller of the two was removed. If the ends of two contigs overlapped by at least 100 bases and a third contig was contained within the joined contigs, the contained contig was also removed. Trimming of up to 100 bases was permitted for finding overlaps. Finally, we aligned all resulting unplaced contigs to the placed representatives pre-trimming. If an unplaced contig aligned with ≥80% coverage and ≥90% identity, it was removed from the unplaced set, though it was not added into the placed cluster, as it did not meet the stricter placement or containment criteria used to create the clusters.

In an additional attempt to place more contigs in the reference genome, we repeated the placement procedure described above, this time considering only the subset of linking mates that mapped to GRCh38 with a mapping quality >10, and only attempting to place a contig if the contig end had a minimum of five such linking mates. This mapping quality criterion decreased the overall ambiguity of the putative locations for unplaced contigs (Supplementary Fig. 2); however, this additional placement effort only placed 150 additional contigs. We produced a file of putative linking locations for unplaced contigs by examining separately for each end the linking mates with a mapping quality >10. If >50% of these high-quality linking mates for a given end pointed to the same region, where a region was defined by grouping mates within 2 kb of each other, we reported that region as the putative placement location for that end of the contig, as well as the total number of high-quality mates and the percentage of those mates linking to that location. For this report, the two contig ends were allowed to putatively link to different locations; in such cases both the start and end regions identified are provided, as these are the two most likely placement regions for the contig (Supplementary Table 3). The putative locations include high-copy repetitive sequences that may be underrepresented in GRCh38, and thus are overrepresented as linking locations (Supplementary Note 4 and Supplementary Fig. 3).

Additional screening and analyses

To screen for contaminants missed by Centrifuge, we used the Kraken metagenomics classifier36 on our final set of representative contigs to compare them to a database containing all complete bacterial and archaeal genomes, all viral genomes, selected fungi and protists, human, mouse, and known contaminant sequences. Any unclassified contig or contig hitting something other than mouse or human was further examined by running the blastn program26 to align the contig to NCBI’s nonredundant nucleotide database. We removed all contigs (as likely contaminants) that had alignments to a non-chordate covering >50% of the contig with a BLAST e-value <10–10. We additionally removed a single contig, also an apparent contaminant, hitting Canis familiaris at 90% identity over the entire contig, but lacking any strong matches to primates. As expected, all of these contaminant contigs were found in the set of unplaced contigs. Deleted contaminants were examined for infections of interest, resulting in the incidental discovery of 29 individuals with malaria infections and 1 with human betaherpesvirus (Supplementary Note 5 and Supplementary Table 7).

To ensure the final set of contigs were truly absent from the human reference genome, we realigned all APG contigs to GRCh38.p10 using bwa-mem28 with default parameters. Two separate alignments were performed, one to the primary sequence and one to all patches and alternate loci. We removed any APG contigs with alignments to the primary assembly sequences at or above 90% identity over at least 80% of the contig’s length, regardless of whether they had a better alignment to some alternate locus (Supplementary Methods). In Supplementary Tables 1 and 2, we report the best alignment location for each contig that had at least 50% of the contig aligned to GRCh38.p10 at ≥80% identity. All placed locations were intersected with the NCBI-provided gene annotations, GCF_000001405.36, which is the union of GenBank and RefSeq annotations for GRCh38.p10, and a translated BLAST search (blastx) was run against the comprehensive NCBI protein database to identify potential protein-coding regions in the APG sequences.

Calling presence/absence per sample

Raw contigs from the MaSuRCA assemblies (including contigs under 1 kb) of all 910 individuals were aligned to the final set of APG contigs with bwa-mem using default parameters. Alignments to an APG contig aligning within 300 bp of one another were chained to create longer alignments where possible. Identity of the chained alignment was taken to be the identity of these alignments weighted by length, and coverage was taken to be the total aligned bases over the total APG contig length. If an individual’s raw contig alignments produced an alignment with ≥90% identity and ≥80% coverage to an APG contig, that APG contig was called as present, and a “1” was included in the matrix (Supplementary Data Set 1).

Additionally, for the placed contigs, because we had already determined which individuals contained these sequences, the genotype matrix was supplemented by adding a presence call (“1”) if we had determined that an individual had a contig in the placement cluster. This additional calling allowed increased sensitivity for individuals who had mate placement information available for the insertion, even when the contigs did not meet the identity/coverage criteria used for this presence/absence genotyping. The “genotype” matrix entries indicate presence/absence calls represented as 1 or 0; heterozygous and homozygous genotypes are not differentiated.

To estimate whether the pan-genome would continue to grow as more individuals were sequenced, we randomly sampled varying numbers of individuals within our dataset and used the genotype matrix to determine, in each subset, how much of the APG sequence was present. Each data point was an average of ten random samplings, each with the same number of individuals. The amount of DNA added to the pan-genome appears to increase approximately linearly as the sample size grows, and has not reached an asymptote with 910 individuals (Supplementary Fig. 4).

We additionally called presence/absence of the APG insertions in 12 individuals from six European populations and 12 individuals from six African populations from the Simons Genome Diversity Project (Supplementary Table 5). We assembled these individual’s contigs from raw read data via the same assembly pipeline used for the CAAPA data and then used the resulting MaSuRCA assembly contigs to make the presence/absence calls.

Comparisons to other genomes

We aligned all APG contigs to two additional genome assemblies: a Chinese genome assembly14 and a Korean genome assembly15. All alignments were performed using bwa-mem with default parameters. Because bwa-mem sometimes found multiple distinct alignments for a contig, the best query-consistent set of alignments for each contig was retained, so no part of an APG contig aligned to more than one location in the reference. The best query-consistent set was determined by comparing the sums of alignment length weighted by percent identity. We then filtered these alignments to these genomes, retaining alignments with an overall identity ≥90% that covered ≥ 80% of the contig.

We compared each APG contig’s alignment(s) to the Chinese and Korean genomes to all alignments of the same contig to GRCh38.p10, including patches and alternate loci, obtained as previously described. Among the contigs aligning to the Chinese or Korean genomes, we examined further those with a better alignment (higher identity × coverage) to the Chinese or Korean genome than to GRCh38.p10. We separated these further into two categories, those contigs with a ‘reasonably good’ alignment to GRCh38.p10 (≥50% contig coverage and ≥80% identity for query-consistent sets of alignments within 1 kb of one another), and those lacking reasonably good alignments to GRCh38.p10.

Code availability

Commands and parameters are included in Supplementary Note 6. Custom scripts used are available upon reasonable request.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.