Identification of candidate mobile element transduction events

As others have reported [12], among the rearrangement junctions in our paired-end sequencing were groups of apparent translocations that had highly recurrent breakpoints, ‘translocated’ to many unique sites (Fig. 2). For example, 17/22 tumours (in our discovery set Batch P) had a total of 61 ‘translocation breakpoints’ within a 1.4kb region at 29.065 Mb (reference genome GrCH37/hg19) on chromosome 22, within an intron of the TTC28 gene, up to seven in the same tumour, and the joined ‘breakpoints’ were all different. Similarly, four tumours had a total of 46 junctions at 59.220 Mb on chromosome 14 and seven tumours had a total of 29 at 11.732 Mb on chromosome X (Table 1, Additional file 1). These could be distinguished from typical artefactual ‘translocations’ resulting from misalignment of reads to repeat sequences, because usually both ‘breakpoints’ of such events are joined to multiple partner breakpoints. They also did not resemble true recurrent translocations because the breakpoints were clustered into too small a region, generally 1kb or less. For many—typically a quarter to a half—of the junctions, paired-end sequencing detected a neighbouring junction, suggesting that the ‘translocation’ was reciprocal, perhaps with a small duplication at the breakpoint [23], or was in fact an insertion. For example, 26 of the 61 chromosome 22 ‘translocations’ had two junctions (Additional file 1).

Fig. 2 Mobile element inserts mimic multiple translocations. Circos plot of mobile element inserts detected by discordant read pairs in tumour 7409, which had the highest number detected. The genome is displayed as a circle, chromosomes 1 to Y, with curved lines representing the apparent rearrangements detected. For example, the many apparent junctions between chromosome 14 and other chromosomes (green), represent copies of a chromosome 14 sequence that have been transduced and inserted all over the genome Full size image

Table 1 ‘Elements’ in the reference genome that mobile element inserts align to Full size table

In the 22 tumours of Batch P, seven distinct loci had at least five of these translocation-like junctions, spread over more than one tumour, each breakpoint within 1kb of the next (marked ‘Cl’ in Table 1; Additional file 1), which were verified by PCR as described below.

We hypothesized that these were insertions of L1 mobile elements, and to explore this idea we searched both Batch P and the 21 Batch B1 tumours for rearrangement junctions that mapped to, or close to, the 3’ ends of the 90 L1s in the reference genome that are thought to be capable of mobilisation (Table four of ref. [24]). This identified nine junction clusters, two of which were among the seven originally identified (Table 1; Additional file 1).

The combined list of apparent inserts was compared to known L1, Alu and SVA polymorphisms [13, 25, 26] and only eight (of nearly 700) ‘insert sites’ corresponded to known polymorphisms that we had failed to sample in the normals (Additional file 1).

Inspection of the inserts suggested two ways in which inserts created by mobile elements could be identified as rearrangements by conventional paired-end-read mapping: the first, when the inserts included unique sequence transduced by the mobile element, and the second, when the insert included sequence that could be—correctly or incorrectly—uniquely matched to a mobile element in the reference genome (Fig. 1).

The first way accounted for several ‘clusters’ of inserts that included unique sequence that mapped 3’ to the end of an L1 in the reference genome (Table 1). The chromosome 22 element was typical—inserts included sequence 3’ to an L1 at chr22: 29059272–29065303, extending to at least 29066424.

Two further clusters of inserts of unique sequence—at chromosome 3: 123.59 Mb and chromosome 14: 59.22 Mb (the most abundant in our original set)—resembled transduced sequence but were not adjacent to L1s in the reference genome. They were adjacent to polymorphic L1s absent from the reference genome [26] and known to transduce 3’ flanking sequence [12, 13] (Table 1). The chromosome 14 element, for example, is inserted in the reference sequence at 59220404 in 25 % of the population sampled by Tubio et al. [13], in the plus direction. Our inserts all ended in polyA added at map position 59221073–59221078 (Table 1).

The second way that candidate inserts were detected, was when a cluster of breakpoints was mapped to the 3’ terminus of an L1 or Alu in the reference genome with or without a polyA tail. Although these sequences are ‘repeats’, they are sufficiently distinct in the reference genome that aligners (which do not mask repeat sequences) may align reads confidently to these positions because they match a read better than anywhere else (Fig. 1 D; Table 1). For example, 84 rearrangement junctions, spread over nine tumours, were mapped to the 3’ end of an L1 that ends at 30,484,890 bp on chromosome 7 and is followed by a 4 bp gap then 20As.

Similarly, approximately 100 junctions, in 21 tumours, mapped to the longest polyA run in the reference genome, the 90-bp polyA tail of an Alu at chr12:66451373–66451739, while 22 junctions in 18 tumours mapped to the 79 bp polyA tail of an AluYa5 at chr15: 77910868–77911236. The interpretation of these depended on the aligner used. The Novoalign aligner (Batch B1 tumours), which can trim reads more to achieve a match, generally aligned reads to them solely because they contained 70 or more polyA/T, without Alu sequence, but with or without a few bp 3’ to the polyA that may represent common target site sequence. For example, the read TTATTC[74 polyT]GGGAGAGAGATTTTTTTTTT was aligned to the chromosome 15 locus by matching the TTATTC and polyT, then trimming off the remaining base pairs. Of three tested by PCR for each locus (see below; Additional file 2), all were somatic inserts. Two were essentially only polyA while four included an L1Hs 3’ terminus ignored by the aligner.

The aligner BWA (Batch P tumours) aligned reads to these polyA runs by matching polyA plus some recognisable Alu sequence. Their apparent insertion sites were 3’ to reference Alu sequences, and two—at about 49.3603 Mb on chromosome 19 and 141.014 Mb on chromosome 2—were common to two tumours, so these may represent germline polymorphisms of longer polyA tails on these reference Alus that we failed to sample in the matched normals and normal panel.

Thus this second class of apparent rearrangements are where an aligner confidently maps reads to the 3’ end or polyA of a mobile element in the reference genome. These may occasionally represent mobilisations of that reference mobile element, but more commonly will be incorrect mappings of an insert from an unrelated element, or a polymorphism.

A third class of apparent junction, observed only using the Novoalign aligner, appeared to be a special case of such mismappings, which we designated ‘split mappings’. Apparently, reads from within one mobile element insert were mapped to different places in the reference genome (Fig. 1E). For example, we identified a ‘translocation junction’ between chromosome 14:59.2206 Mb, which is sequence transduced by an L1, and the polyA run at chromosome 15:77.9109 Mb (Table 1). Presumably the sequence reads are from an insert of the chromosome 14 L1 that has transduced flanking sequence, and its added polyA tail. We detected ‘junctions’ of this kind that mapped to the chromosome 14 and X_A transduced sequences at one end and either the chromosome 12 or 15 polyA runs at the other. Four other junctions appeared similar, joining the chromosome 14 or 15 loci either to an L1Hs at chr2:88206897–88206997 or sequences found to be transduced by Tubio et al. [13] (Additional file 1).

Verification and structure of inserts

Using PCR, we verified that rearrangement junctions existed and were somatic for at least two inserts from each element in Table 1, except the chr3 element (not attempted) and chr6 (only one attempted). Inserts with read pairs at both ends were preferred (Additional file 1) (this would bias towards truncated inserts, see Fig. 1). For 25/45 inserts we were able to amplify across the insert, using primers designed to the expected flanking sequence (Fig. 3A; Additional file 2). For a further 17, we were able to amplify across at least one junction, giving an overall verification rate of 42/45 tumour inserts (93 %). Failure to amplify some junctions and inserts is not surprising, since some inserts would be too large to amplify and many primers were designed by guessing the other side of an insert, which could have been rearranged or deleted [1]. All inserts were tumour-specific. Amplifying across the inserts gave a product from the normal target site, common to the tumour and matched normal, plus a larger band unique to the tumour. In addition, a weaker intermediate band was often obtained, which on cloning and sequencing gave sequence the same as the normal or insert band and therefore we suspected was a hybrid between the normal and insert bands. Supporting this interpretation, when excised and re-run, the normal and larger insert bands ran as expected, while the intermediate band regenerated the pattern of three bands.

Fig. 3 Verification of representative inserts. a Flanking junctions of an insert from chromosome 22 into ENPP2 on chromosome 8 in Tumour 7396. b PCR across four representative inserts, of elements denoted Chr X_B, Chr 6, Chr 7 and Chr 14 respectively. Note that in addition to the larger, tumour-specific band in most cases there is an intermediate size band, most clearly in the last example. c sequence of insert in a, showing target site duplication (TSD), insert of chromosome 22 material and polyA tail Full size image

All verified inserts had the expected structure, allowing for known variations. All had a polyA tail, with upstream insert sequence variably truncated (Fig. 3; Additional file 2). Most were flanked by target site duplication, up to 20bp, as expected [10, 13], though some inserts showed zero duplication or even a small deletion (Additional file 2). Variations included partial inversion of the 5’ end of the inserted sequence—about 20 % (36/195) of inserts where both ends were found by discordant read analysis. For example, the insert shown in Fig. 3, of chr22 into chr 8 consisted principally of chr22 sequence, from 29066229–29066424, 3’ to the intact L1 that ends at 29065303, followed by 37bp of polyA (Fig. 3). There was an apparent small inversion at the 5’ end, with the first 17 bp of the insert mapping in the reverse orientation 54 bp upstream to the main chr22 material, consistent with the proposed mechanism for inversion [19]. The insert was flanked by an 18-bp target site duplication.

The inserts we verified and sequenced across ranged from 1bp (plus >50 polyA and target site duplication) to about 1kb excluding the polyA, and the polyA sequences ranged from roughly 25 to over 50 (polymerase slippage means that these figures are approximate). The inserts of transduced 3’ unique sequence included part or all of the transduced unique sequence, with or without part of the upstream L1HS sequence. For example, the insert previously described included no L1HS sequence, while the insert of the chr4 element into chr5 included 69bp of the 3’ end of the (negative-strand) L1HS together with 931bp of unique chromosome 4 material downstream of this end.

The sequences transduced 3’ to intact L1s mostly were bounded by polyA addition sites 0.8-1.3 kb from the L1, though one, from the element at 119.5 Mb on chromosome X, extended only 24bp from the L1. For some the polyA addition sites varied, e.g. for the chromosome 22 element, the L1 of which ends at 29,065,303, there were two cases of polyA addition at 29,065,912, two at 29,066,121, one at 29,066,424, and others. The chr8 element was found both with 3’ unique sequence transduced and without any transduced sequence.

Estimating abundance of inserts by detecting polyA

Since the total number of L1–mediated insertions is likely to be much greater than detected by discordant read pairs, which only works where reads in the insert can be mapped uniquely to the reference genome, we obtained a rough estimate of total inserts by searching for tumour-specific polyA sequence, hypothesising that most of these would be L1-mediated insertion events.

We searched the 43 tumours of sets P and B1 for reads that contained at least 20 consecutive As or Ts and were paired with sequence reads that were confidently mapped by the BWA aligner. From these we extracted candidate inserts that were supported by at least three read pairs and absent not only from the matched normal but also the 20 or 21 other normals from the same batch of tumours (Additional file 3). We compared these candidate insert sites to known polymorphic L1s and Alus [13, 25, 26], and about 1 % (63 out of approximately 5300) were within 1kb, without considering direction of read. As a control, the analysis was repeated with each B1 tumour interchanged with its matched normal: this detected only an average of 2 (range 0 to 6) candidate normal-specific polyA sequence runs (1/42 of which matched a known polymorphism), suggesting that relatively few of the inserts detected in the tumours were merely bioinformatic artefacts or polymorphisms that we had failed to detect in the matched sample.

We confirmed that the great majority of these candidate insertions resembled tumour-specific L1 events, by manually scrutinizing a random sample of 20 of the least-confident examples, i.e. those supported by only three reads. We viewed all sequence reads within an approximately 600bp window spanning the candidate insert site, using the Integrative Genomics Viewer (IGV) [27] (Additional file 4). All 20 showed additional aberrant reads including additional ‘split’ or ‘soft-clipped’ reads (i.e. reads partially aligned to the reference with a ‘tail’ of mismatched bases) that clearly identified an insertion point. In 18/20 cases these split reads included L1 sequence. In one further case the sequence appeared to be transduced. In only 1/20 cases was the insert clearly germline, while only one further case had a single read in the normal that could possibly have supported a germline insert (Additional file 4).

For around 85 % of an independent unbiased sample of the candidate tumour-specific polyA sequence we were able to verify a retrotransposon insert by PCR across the candidate insert site, and some of the failures might have been inserts that were too large to amplify (Additional file 2). 28 primer pairs were designed, of which 26 pairs successfully amplified a normal band from the matched normal. 22 of these 26 (85 %) amplified a larger band from the tumour. All these insert bands were absent from the matched normal. 11 of these 22 were cloned and sequenced. All showed an insert that included polyA, together with additional sequence (Additional file 2). 10/11 had target site duplication, suggesting they were indeed non-LTR retrotransposon events. 8/11 included the 3’ end of L1HS sequence, for example, the largest was 7416_19, almost 600bp, comprising around 450 bp of L1 sequence, slightly rearranged, with two stretches of polyA separated by a 75bp fragment of chromosome 9, flanked by duplication of TAGAAGCCCAATTTCT. The three without L1 sequence included the smallest, 7436_9, which was ATATAATAATAAT + polyA, flanked both sides by CCCA, which is a duplication of the native chromosome 9 target site. ATATAATAATAAT does not match the end of a consensus L1HS or Alu. Insert 7416_13 comprised 67 bp unique sequence from chromosome 10:120,819,239-120,819,305, plus polyA and duplication of AAGAAATATTTCCC. Insert 7416_10 was about 1kb of unique sequence from chromosome 17 chr17:46707420–46708467, inserted with a target site duplication of 14bp, but with polyA at both ends, A 31 and A 16 allowing for the original sequence, in the same orientation. Neither of these two are near reference L1s or known polymorphic L1s [13, 25, 26]. They might represent undocumented polymorphic L1s or re-mobilisation of somatic inserts [13], and, indeed, discordant read pairs show nine rearrangement breakpoints within chr17:46707414–46708480 in tumour 7416, indicating multiple transductions of this sequence.

A further four inserts were also verified, three chosen because they were in genes, PARK2, FOXP2 and SNTG (Additional file 2).

Taking 85 % as an estimate of validity, our rough estimate of tumour-specific inserts identified by polyA sequence ranged from five (which may be artefactual background) to about 700, average approximately 100 (Table 2). This is over six times the number of inserts from the elements listed in Table 1 detected as rearrangement junctions. This may be an underestimate because read pairs have to be placed precisely to detect the inserts, and some of the tumours are contaminated enough with normal cells to reduce sensitivity of detection.

Table 2 Active elements in individual tumours Full size table

Combining the inserts found by discordant read pairs and our polyA search (Table 2), most of the tumours have evidence of somatic mobile element activity. Only 10/43 (23 %) tumours did not have either five or more inserts detected by discordant pairs or >50 candidate inserts detected by polyA (Table 2), and one of these (7404) had a verified insert (Additional file 2). On the other hand, there are a few tumours that may well not have activity, notably the 4/43 tumours that have zero or one unverified insert detected by discordant reads.

Effects on genes

Among the inserts found by analysing discordant reads pairs, 198/675 (29 %) were in a gene listed by Ensembl. Of these, approximately 90 were in the same transcription direction as the gene, 108 in the opposite orientation. (These are approximate figures because only one junction was detected in most cases, and the orientation of this junction could be misleading in cases with 5’ inversion).

Five inserts seemed particularly likely to affect gene function: four were in coding sequence, of MRPL13 and ZAN (both confirmed by PCR), SYCP1 and ZDHHC14; and one was in the 3’ UTR of PDE10A (confirmed).

Several genes had inserts in more than one tumour (not confirmed by PCR): AGBL4, DLG2 and SNTG1 had insertions in three tumours; AGBL1, AREG, EDIL3, EPHA6, GPM6A, LRP1B, NEGR1, NRXN3, PGCP, and PLAKHA4 had insertions in two tumours. AGBL4, ROBO2, EYS and RYR3 appeared to have two insertions in distinct sites in the same tumour, though these could be insertions at a single rearrangement junction, since this is known to occur [28].

Among the insertions detected by polyA search (Additional file 3), which are less reliable and may be too small to affect gene function, a number of genes had inserts in several tumours, but these were very large genes (0.8 - 2.2 Mb) such as LSAMP, so the significance is unclear (Additional file 3).