23 May 2019

It’s not just the dark web. The dark genome, too, hides some nefarious goings-on, and it will take creative sleuthing to drag them into the light. Genomic sequencing has had some success scanning for rare variants that cause disease. Still, according to a study published May 20 in Genome Biology, thousands of genes are obscured from the gaze of commonly used sequencing techniques, which rely on short snippets of the genome to assemble its sequence. Researchers led by Leonard Petrucelli and John Fryer of the Mayo Clinic in Jacksonville, Florida, devised methods to both identify and illuminate these so-called “dark regions.” They revealed some 37,000 previously hidden areas lurking within more than 6,000 genes. One dark segment accounted for more than a quarter of the sequence encoding complement receptor 1 (CR1), a top AD risk gene. Within its shadows, the researchers identified a frameshift mutation in five AD cases, but no controls, in the Alzheimer’s Disease Sequencing Project sample.

Thousands of “dark regions” in the genome are undetected by short-read sequencing.

Some 76 genes implicated in 326 diseases have such dark regions.

CR1 is 26 percent dark; exposing it revealed a potential AD risk variant.

Rapid developments in genomic sequencing over the past couple of decades have placed whole-genome and exome-sequencing studies within reach of more labs than ever before, allowing for discovery of rare variants (Aug 2017 news; Jun 2018 news; Aug 2018 news). Most studies employ short-read techniques such as those offered by Illumina, which repetitively sequence 100 base-pair segments of the genome and align them to a reference. However, some segments of the genome are inherently difficult to piece together this way. For one, the sequence of the region itself—for example, if it has a lot of GC repeats—can bungle sequencing efforts, yielding few reads. The researchers dubbed these regions “dark by depth.” For another, duplicated, or highly similar, genomic regions fly under the short-read sequencing radar because alignment algorithms fail to map them to specific locations in the reference genome. Essentially, these duplicated segments camouflage each other.

How much of the genome do standard sequencing studies miss? To gauge the scope of the problem, co-first authors Mark Ebbert and Tanner Jensen set out to identify these dark and camouflaged regions within standard short-read sequencing data from the Alzheimer’s Disease Sequencing Project (ADSP). Drawing on sequencing data from 10 men (chosen to allow analysis of the Y chromosome), the scientists searched for regions that either had few reads or mapped poorly. They limited their quest to regions that fell within gene bodies, i.e., within 5' and 3' untranslated regions (UTRs), exons, and introns, but not intergenic regions. Even so, they identified 36,794 dark regions within 6,054 gene bodies. More than three-quarters of these dark regions lurked within introns, while less than a tenth resided within protein-coding exons. Bad mapping, i.e., camouflaging, accounted for a majority of the dark regions.
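To make the two categories concrete, here is a minimal sketch of how per-base coverage might be sorted into "dark by depth" and camouflaged (poorly mapped) stretches. It is not the authors' pipeline. The cutoffs `MIN_DEPTH` and `LOW_MAPQ_FRACTION`, the `Position` record, and the toy coverage values are all assumptions made up for illustration; the article does not state the thresholds used in the study.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical cutoffs for illustration only; the article gives no thresholds.
MIN_DEPTH = 5             # depth at or below this counts as "dark by depth"
LOW_MAPQ_FRACTION = 0.9   # share of poorly mapped reads that marks camouflage


@dataclass
class Position:
    depth: int     # total reads covering this base
    low_mapq: int  # reads among them with low mapping quality


def classify(pos: Position) -> str:
    """Label one base as 'depth', 'mapq', or 'ok'."""
    if pos.depth <= MIN_DEPTH:
        return "depth"
    if pos.low_mapq / pos.depth >= LOW_MAPQ_FRACTION:
        return "mapq"
    return "ok"


def dark_regions(positions: List[Position]) -> List[Tuple[int, int, str]]:
    """Merge runs of consecutive dark bases into half-open (start, end, kind)."""
    regions = []
    start, kind = None, "ok"
    for i, pos in enumerate(positions):
        label = classify(pos)
        if label != kind:
            if kind != "ok":
                regions.append((start, i, kind))
            start, kind = i, label
    if kind != "ok":
        regions.append((start, len(positions), kind))
    return regions


# Toy coverage track (made-up numbers).
coverage = (
    [Position(30, 1)] * 3      # well covered, well mapped
    + [Position(2, 0)] * 2     # almost no reads
    + [Position(30, 2)] * 2    # well covered again
    + [Position(40, 39)] * 4   # plenty of reads, nearly all ambiguous
)
print(dark_regions(coverage))
```

In this toy track, the low-coverage stretch is reported as dark by depth and the stretch with ambiguous reads as dark by mapping quality. A real analysis would read depth and mapping quality from alignment files rather than a hand-built list.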

In nearly 600 genes, at least 5 percent of the protein-coding region was obscured. Among those genes, certain pathways and functions were enriched, including ubiquitin-specific processing proteases, defensins, and nuclear import proteins. When they investigated known protein-protein interactions between the products of dark genes, the researchers hit upon RNA transport, nuclear import, and cell stress pathways. Fryer told Alzforum that some of these pathways are prone to camouflaging because they include many genes with similar sequences, such as nuclear importins and RNA transport proteins.

The researchers identified 76 dark genes with known links to 326 diseases, including Alzheimer’s, schizophrenia, autism, and spinal muscular atrophy. The SMN1 and SMN2 genes, implicated in both SMA and ALS, were 95 and 88 percent dark, respectively, likely because they are camouflaging each other. Similarly, HSPA1A and HSPA1B—genes encoding heat-shock proteins implicated in ALS—partially camouflaged each other due to sequence similarity.

The AD risk gene CR1 was 26 percent dark, thanks to camouflaging within itself. The dark bit encompassed the sequence encoding the complement-binding domain, which repeated several times. Curiously, C4b, one of the complement proteins that bind CR1, was 73 percent dark. In some of the ADSP samples, even ApoE was 6 percent dark. In contrast to CR1, ApoE was “dark by depth,” turning up few to no reads in its dark region. The scientists then compared dark regions from the ADSP data set to those of the Genome Aggregation Database, which houses sequences of more than 125,000 whole-exomes and over 15,700 whole-genomes. Doing that, they found that while camouflaged genes were largely the same between data sets and samples, dark-by-depth genes, such as ApoE, were dark in some samples and not others. They speculated that this variability could come down to DNA quality or experimental differences. Interestingly, the same region of ApoE was always dark in the samples that were affected. Fryer and Ebbert pointed out that ApoE has been notoriously difficult to sequence, owing to its high GC content.

To find out if some of these dark regions harbor disease variants, the researchers developed a protocol to bring dark regions from standard short-read sequencing data into the light, so to speak. The process only works on camouflaged genes, and involves extracting reads from camouflaged regions, masking all highly similar regions in the genome except one, realigning, and repeating. Using this technique on more than 13,000 ADSP samples, the researchers unmasked the dark region of CR1, and identified a rare 10-nucleotide frameshift deletion in five AD cases and no controls. Fryer and Ebbert speculated that such a mutation would lead to loss of function. Given the rarity of the mutation, they had too few samples to confirm its link to AD risk. They are applying their rescue protocol to other data sets to search for more cases of the variant. The rescue algorithm is also available for other researchers to use, at https://github.com/mebbert/Dark_and_Camouflaged_genes.
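The masking idea can be illustrated with a toy example. The sketch below is not the code from the authors' repository. It uses a made-up reference string with two identical copies of a short repeat, and it substitutes exact string matching for a real aligner. The names `find_hits` and `mask_all_but_one` are hypothetical helpers introduced here.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open (start, end) on the toy reference


def find_hits(reference: str, read: str) -> List[int]:
    """Return every start position where the read matches exactly."""
    hits, start = [], reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


def mask_all_but_one(reference: str, copies: List[Region], keep: int) -> str:
    """Replace every duplicated copy except copies[keep] with N bases."""
    bases = list(reference)
    for idx, (start, end) in enumerate(copies):
        if idx != keep:
            bases[start:end] = "N" * (end - start)
    return "".join(bases)


# Toy reference: two identical copies of a repeat with unique flanks.
repeat = "ACGTTGCAAC"
reference = "GGATCC" + repeat + "TTTAGG" + repeat + "CCGGAA"
copies = [(6, 16), (22, 32)]   # where the two copies sit
read = "GTTGCA"                # a short read from inside the repeat

print(find_hits(reference, read))   # ambiguous: the read fits both copies
masked = mask_all_but_one(reference, copies, keep=0)
print(find_hits(masked, read))      # unambiguous after masking
```

Before masking, the read matches both copies, so an aligner could not place it with confidence. After all copies but one are masked, it has a single possible home, and variants it carries can be called there. The catch, which this toy version shares with any approach of this kind, is that a variant found this way still has to be assigned to the correct copy by other means.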

Even though this rescue protocol will expose some camouflaged regions in existing data sets, the researchers see it as no more than a Band-Aid. It’s better to expose these regions from the get-go, they say, with long-read sequencing techniques. Indeed, when they compared three available long-read technologies from PacBio, 10x Genomics, and Oxford Nanopore Technologies (ONT), long-read sequencing illuminated 57, 71, and 78 percent of the previously dark nucleotides, respectively, in a subset of their ADSP samples. To date, however, those technologies cost more than short-read approaches, so investigators on a tight project budget will opt for the cheaper approach, Fryer said.
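A percentage of "previously dark nucleotides illuminated" can be computed by comparing the dark bases under one technology with those that remain dark under another. The sketch below shows that arithmetic on made-up coordinates. The function names and intervals are assumptions for illustration and do not reproduce the study's figures.

```python
from typing import Iterable, Set, Tuple

Region = Tuple[int, int]  # half-open (start, end)


def bases(regions: Iterable[Region]) -> Set[int]:
    """Expand regions into the set of individual positions they cover."""
    covered: Set[int] = set()
    for start, end in regions:
        covered.update(range(start, end))
    return covered


def percent_illuminated(short_dark: Iterable[Region],
                        long_dark: Iterable[Region]) -> float:
    """Share of short-read dark bases that are no longer dark with long reads."""
    before = bases(short_dark)
    if not before:
        return 0.0
    still_dark = before & bases(long_dark)
    return 100.0 * (len(before) - len(still_dark)) / len(before)


# Made-up coordinates: 100 dark bases with short reads, 25 still dark after.
short_read_dark = [(100, 150), (300, 350)]
long_read_dark = [(120, 145)]
print(percent_illuminated(short_read_dark, long_read_dark))  # 75.0
```

Expanding intervals into sets of positions keeps the example short, but it would be too slow for a whole genome, where interval arithmetic is the usual choice.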

Carlos Cruchaga of Washington University in St. Louis agreed that this paper highlights the limitations of short-read sequencing and current analytical pipelines, and shows up a clear need for improvement. “The technology is moving quickly,” he wrote to Alzforum. “A few years ago, everybody was excited with GWAS, and now with short-read whole-genome sequencing. We are moving to long-read sequencing, and more multi-omics. There is still a lot of biology to discover.”—Jessica Shugart