The bacterium Agrobacterium tumefaciens has been the workhorse in plant genome engineering. Customized replacement of native tumor-inducing (Ti) plasmid elements enabled insertion of a sequence of interest called Transfer-DNA (T-DNA) into any plant genome. Although these transfer mechanisms are well understood, a detailed understanding of the structure and epigenomic status of insertion events has been limited by available technologies. Here we applied two single-molecule technologies and analyzed Arabidopsis thaliana lines from three widely used T-DNA insertion collections (SALK, SAIL and WISC). Optical maps for four randomly selected T-DNA lines revealed between one and seven insertions/rearrangements, with individual insertions ranging from 27 to 236 kilobases in length. De novo nanopore sequencing-based assemblies for two segregating lines partially resolved T-DNA structures and revealed multiple translocations and exchange of chromosome arm ends. For the current TAIR10 reference genome, nanopore contigs corrected 83% of non-centromeric misassemblies. The unprecedented contiguous nucleotide-level resolution enabled an in-depth study of the epigenome at T-DNA insertion sites. SALK_059379 line T-DNA insertions were enriched for 24nt small interfering RNAs (siRNA) and dense cytosine DNA methylation, resulting in transgene silencing via the RNA-directed DNA methylation pathway. In contrast, SAIL_232 line T-DNA insertions were predominantly targeted by 21/22nt siRNAs, with DNA methylation and silencing limited to a reporter, but not the resistance gene. Additionally, we profiled the H3K4me3, H3K27me3 and H2A.Z chromatin environments around T-DNA insertions using ChIP-seq in SALK_059379, SAIL_232 and five additional T-DNA lines. We discovered various effects ranging from complete loss of chromatin marks to the de novo incorporation of H2A.Z and trimethylation of H3K4 and H3K27 around the T-DNA integration sites.
This study provides new insights into the structural impact of inserting foreign fragments into plant genomes and demonstrates the utility of state-of-the-art long-range sequencing technologies to rapidly identify unanticipated genomic changes.

Our routine ability to add or alter genes in plant genomes using transgenesis has proven to be a game changer for the plant sciences. Transgenics not only enables the study of gene function but also allows the development of modern crop plants without the unwanted genetic baggage that comes with natural crossing. A major tool for creating transgenics is the Agrobacterium system, which naturally shuttles and integrates pieces of foreign DNA into its host genome. While the position and number of integrations have been relatively easy to track, molecular tools have never allowed the integrated piece of DNA to be seen within a single “picture”. Here we have utilized state-of-the-art DNA sequencing technology to capture the size and structure of multiple DNA insertion events in a plant genome. We discovered that insertion of the anticipated DNA fragment occurred as multiple concatenated full and partial fragments that led in some cases to intra- and interchromosomal rearrangements. Our analysis of the epigenetic landscapes showed variable effects, from silencing of the integrated foreign DNA to alterations of chromatin marks and thus of chromatin structure and functionality.

Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: FJ is an employee of Bayer Crop Science. JRE serves on the scientific advisory boards of Cibus, Zymo Research and Pathway Genomics Inc.

Funding: FJ was supported through a Human Frontier Science Program Organization long-term fellowship ( http://www.hfsp.org/ ). MZ was supported by the Salk Pioneer Postdoctoral Endowment Fund as well as by a Deutsche Forschungsgemeinschaft (DFG; http://www.dfg.de/ ) research fellowship (Za-730/1-1). JRE is an Investigator of the Howard Hughes Medical Institute ( www.hhmi.org ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Current knowledge of structural genome changes and epigenetic stability in transgenic plants is limited. In this study we report on the genome structures of four Arabidopsis T-DNA floral-dip transformed plants, and for the first time we report the lengths of T-DNA insertions up to 236 kilobases along with long-molecule evidence for genome structural rearrangements including chromosomal translocations and induced epigenomic variation. To study such large insertions and rearrangements at the sequence level, we de novo assembled the genomes of two multi-insert lines (SALK_059379 and SAIL_232) and the reference accession Columbia-0 (Col-0) using Oxford Nanopore Technologies MinION (ONT) reads to very high contiguity. We present polished contigs that span chromosome arms and reveal the scrambled nature of T-DNA and vector backbone insertions and rearrangements in high detail. We subsequently tested transgene expression and functionality and show differential epigenetic effects of the insertions on the transgenes between the two tested vector backbones. Small interfering RNA (siRNA) species induced transgene silencing through the RNA-directed DNA methylation (RdDM) pathway of the entire T-DNA strand in the SALK-vector background, and in contrast the transgene remained active in the SAIL-line. Moreover, by profiling the occupancy of H3K4me3, H3K27me3 and H2A.Z in SALK_059379, SAIL_232 and five additional T-DNA lines, we uncovered various effects of T-DNA insertions on the adjacent chromatin landscape. In summary, new technological advances have enabled us to assemble and analyze the genomes and epigenomes of T-DNA insertion lines with unprecedented detail, revealing novel insights into the impact of these events on plant genome/epigenome integrity.

Knowledge of structural variations induced by transgene insertions, including insertion site, copy number and potential backbone insertions, as well as evidence for epigenetic changes to the host genome, is crucial from scientific as well as regulatory perspectives. These aspects are routinely assessed using laborious Southern blotting, Thermal Asymmetric Interlaced (TAIL) PCR, targeted short-read sequencing, or, more recently, digital droplet PCR [4, 23]. One of the few attempts to gain deeper insight into an engineered genome was made for transgenic papaya, using a Sanger sequencing approach [14]. This work identified three insertion events, each less than 10 kilobases (kb) in length; however, large repeat structures with high sequence identity are generally impossible to assemble using short-read sequences [24].

The Agrobacterium strains used in research projects are no longer harmful to the plant because the oncogenic elements of the tumor-inducing (Ti) plasmid have been replaced by a customizable cassette that includes a diverse set of in planta regulatory elements. Agrobacterium-mediated transgene integration occurs through excision of the T-DNA strand between two imperfect terminal repeat sequences [5], the left border (LB) and right border (RB) [6], and translocation into the host genome (reviewed in Nester [7]). Hijacking the plant molecular machinery, the T-DNA is integrated at naturally occurring double-strand breaks through annealing and repair at sites of microhomology [8, 9]. While the exact mechanisms behind this error-prone integration are poorly understood, it is known that insertion events generally occur at multiple locations throughout the genome [5, 10]. T-DNA insertions also frequently contain the vector backbone and occur as direct or inverted repeats of the T-DNA, resulting in large intra- and inter-chromosomal rearrangements [6, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]. The phenomenon of long T-DNA concatemers has previously been attributed to replicative T-strand amplification specific to the floral dip method, and is less often observed after tissue explant transformation of roots or leaf discs [22].

Plant genome engineering using the soil microorganism Agrobacterium tumefaciens has revolutionized plant science and agriculture by enabling identification and testing of gene functions and providing a mechanism to equip plants with superior traits [1, 2, 3]. Transfer DNA (T-DNA) insertional mutant projects have been conducted in important dicot and monocot models, and over 700,000 lines with gene-affecting insertions have been generated in Arabidopsis thaliana (Arabidopsis henceforth) alone (reviewed in O’Malley [4]). Targeted T-DNA sequencing approaches were conducted on approximately 325,000 of these lines to identify the disruptive transgene insertions and to link genotype with phenotype [4]. This wealth of sequence information, much of which was released prior to publication, is available at http://signal.salk.edu/Source/AtTOME_Data_Source.html , has been iteratively updated since 2001, and had been accessed by the community over 10 million times by 2018.

Results

Assembly of highly contiguous genomes from ONT MinION data

The number of insertions in SALK_059379 and the type of rearrangements observed in SAIL_232 sparked our interest in analyzing these genomes at greater (nucleotide) resolution. We sequenced these engineered genomes, alongside the parent reference Col-0 plant (ABRC accession CS70000), using the Oxford Nanopore Technologies (ONT; Oxford) MinION device. We performed ONT sequencing on each line using a single R9.4 flow cell (Table 1, S1 Table) and assembled each genome using minimap/miniasm followed by three rounds of racon [28] and one round of Pilon [29]. We assembled the three lines into 40 contigs (Col-0; longest 16,115,063 bp), 59 contigs (SAIL_232; longest 16,070,966 bp) and 139 contigs (SALK_059379; longest 8,784,268 bp) (S1 Table). Individual whole genome alignments to the TAIR10 reference show over 98% coverage with 39 and 57 contigs for SAIL_232 and SALK_059379, respectively (Table 1, S1 Fig, S2 Table). The remaining short contigs (< 50 kb) encode only highly repetitive sequences such as ribosomal DNA and centromeric repeats that cannot be placed onto the reference. Chromosome arms are generally contained within one or two contigs, and contiguity declines with repeat content towards the centromere (S1 Fig). Chromosome arm-spanning contigs covered telomere repeats and at least the first centromeric repeat, thus capturing 100% of the genic content. When we aligned the contigs of all assembled genomes back to the TAIR10 reference, we consistently did not cover ~3.9 Mb of the reference. The Col-0 contigs covered over 99% of the TAIR10 reference, and the only discrepancies occur at the centromeres (S1 Fig). The high contiguity and quality of this genome assembly allowed correction of previously identified misassemblies (38/46 ‘N’-regions) in the TAIR10 reference genome (Fig 1A, S3 Table). Our contigs were not able to span the remaining eight.
BNG alignments confirmed that all ONT contigs were chimera-free, while only eleven (Col-0 and SALK_059379) or three contigs (SAIL_232) contained misassembled non-T-DNA repeats (S4 Table). One aim of creating near complete genome assemblies was to enable the structural resolution of transgene insertions at nucleotide level, rather than with genome scaffolding alone. We next assessed contiguity at sites of T-DNA insertion. After aligning these assemblies to BNG maps we concluded that the shorter insertions, SALK:chr2_18Mb (28 kb), SAIL:chr3_9Mb (11 kb), SAIL:chr3_21Mb (25 kb; Fig 2C) and SAIL:chr5_22Mb (11 kb) were completely assembled. Because of extensive repeats, much larger T-DNA insertions collapsed upon themselves, although contigs reaching up to 39 kb into the insertions from the flanking genomic sequences could be assembled.
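As a quick arithmetic check, the 38 of 46 corrected ‘N’-regions reported above can be related to the 83% figure given in the abstract. The snippet below is only an illustrative sketch; the function name `fraction_resolved` is hypothetical and not part of the study's pipeline.

```python
def fraction_resolved(resolved: int, total: int) -> float:
    """Return the share of regions that were resolved (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must lie between 0 and total")
    return resolved / total


# Values taken from the text: 38 of 46 TAIR10 'N'-regions were corrected.
N_REGIONS_TOTAL = 46
N_REGIONS_CORRECTED = 38

share = fraction_resolved(N_REGIONS_CORRECTED, N_REGIONS_TOTAL)
remaining = N_REGIONS_TOTAL - N_REGIONS_CORRECTED

print(f"{share:.0%} corrected, {remaining} regions not spanned")
```

The rounded share agrees with the 83% stated in the abstract, and the difference matches the eight regions that the contigs could not span.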

T-DNA independent chromosomal inversion in the SAIL Col-3 background

Chromosome 1 in the SAIL_232 line was assembled into a single contig (SAIL_contig_20), spanning an entire chromosome arm from telomere to the first centromeric repeat arrays (S1 Fig). Compared with the Col-0 reference genome, we found a 512 kb inversion in the upper arm (SAIL_chr1:11,703,634–12,215,749). Because we could not find a signature of T-DNA insertion at the inversion edges, we posited that this event may have resulted from a pre-existing structural variation of the particular Columbia strain used for the SAIL project (Col-3; ABRC accession CS873942). To test this hypothesis, we genotyped SAIL_232 alongside two randomly selected lines of the same collection (SAIL_59 and SAIL_107) using primers specific to the reference Col-0 CS70000 genome and the SAIL-inverted state (see Methods). PCR analysis confirmed that this “inversion” was common to all three independent SAIL-lines tested, and absent from Col-0 CS70000 (S2 Fig). Thus the event was not due to the T-DNA mutagenesis, but rather is an example of the genomic “drift” occurring during the propagation of the Columbia “reference” accession within individual laboratories [30].
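To make the inversion coordinates concrete, the following minimal sketch computes the size of the inverted segment and shows how a reference position inside it would map to the inverted orientation. It assumes 1-based, inclusive coordinates, which the text does not state explicitly; the function names are hypothetical.

```python
# Inversion boundaries as reported in the text (SAIL_chr1:11,703,634-12,215,749).
INV_START = 11_703_634
INV_END = 12_215_749


def inversion_length(start: int, end: int) -> int:
    """Length of the segment, assuming 1-based inclusive coordinates."""
    return end - start + 1


def map_through_inversion(pos: int, start: int, end: int) -> int:
    """Map a reference position to its location after inverting [start, end].

    Positions outside the segment are unchanged; positions inside are
    mirrored around the segment midpoint.
    """
    if start <= pos <= end:
        return start + end - pos
    return pos


length_bp = inversion_length(INV_START, INV_END)
print(f"inverted segment: {length_bp:,} bp (~{length_bp // 1000} kb)")
```

Under this convention the two segment edges swap places, which is why junction-spanning primers can distinguish the reference state from the inverted state.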

SALK_059379 T-DNA insertions are conglomerates of T-strand and vector backbone

To annotate T-DNA insertion sites within the assembled genomes, we searched for pROK2 and pCSA110 plasmid vector sequence fragments within the assembled contigs (S5 Table). The assembled SALK_059379 genome contained three of the four BNG map identified T-DNA insertions: SALK:chr1_5Mb, chr2_15Mb and chr2_18Mb (Fig 1A, 1C and 1E, Table 1). Specifically, SALK:chr2_18Mb, the shortest identified insertion with 28,356 bp, was completely assembled (contig_7:3,690,254–3,719,373) and included a genomic deletion of 5,497 bp (chr2:18,864,678–18,870,175). Annotation of the insertion revealed two independent insertions: a T-DNA/backbone-concatemer (11,838 bp) from the centromere proximal end, and a T-DNA/backbone/T-DNA-concatemer (16,463 bp) from the centromere distal end, both linked by a guanine-rich (26 G bases) segment of 55 bp (Fig 1E). Two independent insertion events, 5,497 bp apart, potentially created a double-hairpin through sequence homology which was eventually excised, removing the intermediate chromosomal stretch (S3 Fig). The second and third insertions, SALK:chr1_5Mb (131 kb) and SALK:chr2_15Mb (207 kb), were partially assembled into contigs. SALK:chr1_5Mb contig_10 and contig_5 contain 25,132 bp and 33,736 bp T-DNA segments. Similarly, the extremely long chr2_15Mb insertion was partially contained within contig_4 and contig_7 (Fig 1C), leaving an unassembled gap of approximately 131 kb. The recovered structure of this insertion is noteworthy as it represents a conglomerate of intact T-DNA/backbone concatemers, as well as various breakpoints that introduced partial vector fragments with frequent changes of the insertion direction (Fig 1C). Finally, the fourth SALK_059379 insertion (SALK:chr4_10Mb) was absent from the assembly. However, we observed a single ONT read (length 10,118 bp) supporting the presence of an insertion at this location.
We also recovered a further ONT read (length 15,758 bp) that anchors at position chr3:20,141,394 and extends 14,103 bp into a previously unidentified T-DNA insertion (S4 Fig). PCR amplification from DNA samples using the genomic/T-DNA junction sequences from segregant and homozygous seeds confirmed the presence of all five insertions, revealing heterozygosity within the ABRC-sourced seed material. While BNG maps were successful in placing long “T-DNA only” sequence contigs into the large gaps (e.g. SALK:chr1_5Mb), the four “short” T-DNA-only sequence contigs of ~50 kb or less did not contain a sufficiently unique nicking pattern to permit confident contig placement.
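The reported sizes for the fully assembled SALK:chr2_18Mb insertion can be cross-checked with simple arithmetic. The sketch below sums the three annotated components and derives the deletion size from the reference coordinates; it assumes the deletion length is the plain difference of the two reported positions, and the dictionary labels are descriptive placeholders rather than names used by the authors.

```python
# Component sizes (bp) of SALK:chr2_18Mb as reported in the text.
components_bp = {
    "proximal T-DNA/backbone concatemer": 11_838,
    "guanine-rich linker": 55,
    "distal T-DNA/backbone/T-DNA concatemer": 16_463,
}

# Reference coordinates of the associated genomic deletion (chr2).
DELETION_START = 18_864_678
DELETION_END = 18_870_175

insertion_total_bp = sum(components_bp.values())
deletion_bp = DELETION_END - DELETION_START

for name, size in components_bp.items():
    print(f"{name:<42} {size:>7,} bp")
print(f"{'total insertion':<42} {insertion_total_bp:>7,} bp")
print(f"{'genomic deletion':<42} {deletion_bp:>7,} bp")
```

Both values agree with the figures given above: the components add up to the 28,356 bp insertion, and the coordinate difference gives the 5,497 bp deletion.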

Large-scale rearrangements reshape the SAIL_232 genome

We next searched the SAIL_232 ONT contigs for pCSA110 vector fragments, and were able to confirm all BNG-observed genome insertions (Table 1). This search additionally identified a further T-DNA insertion at chr5:20,476,509 (S5 Table) that was not assembled in the BNG maps. We found that chromosome 3 harbored two major rearrangements (Fig 2). The first was a translocation of a 1.19 Mb fragment (chr3:8,902,305–10,095,395), which additionally split at an internal T-DNA insertion at reference position 9,343,053 bp (Fig 2A). The resulting two fragments were independently inverted prior to integration just before position chr3:2,586,494 (Fig 2A and 2B). The second major change was a swap between the distal arms of chromosomes 3 and 5, which is supported by two SAIL_232 BNG genome maps as well as two ONT contigs (Fig 2A–2D). Here, chromosome 3 broke at 21,094,402 nt, and chromosome 5 at 18,959,379 nt and 20,476,664 nt, and the larger chromosomal fragments swapped places. Specifically, this translocation was captured in SAIL_contig_31, showing the fusion of chr5:20,476,664-end to chromosome 3 after position 21,094,402 nt. The reciprocal event (SAIL_contig_2) joined the chromosome 3 fragment (21,094,407-end) almost seamlessly to chromosome 5 after reference position 18,959,379 nt (Fig 2C and 2D). The genomic location of an excised fragment of chromosome 5 (chr5:18,959,380–20,476,663, found within SAIL_contig_11) was not determined (Fig 2D). Finally, the 81-kb insertion at SAIL:chr1_19Mb was identified from a BNG contig that aligned together with a contig that did not harbor the insertion (Fig 2E). This was the only occasion on which we observed such apparent phasing of a heterozygous region. The insertion consists of four tandem T-DNA copies (~30 kb) followed by ~20 kb of breakpoint-interspersed T-DNA and vector backbone, and was partially assembled (50,676 bp) at the 5’ end of contig_5 (Fig 2E).
Although not assembled as part of the flanking SAIL_contig_47, we recovered multiple ONT reads that contain T-DNA as well as genomic DNA sequences. In summary, the BNG maps perfectly aligned with the ONT assembly of the T-DNA insertion haplotype (Fig 2E).
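The idea of searching assembled contigs for vector fragments can be illustrated with a small sketch. The text does not say which search tool was used, so the code below is not the authors' method: it substitutes an exact k-mer lookup on both strands for a proper alignment search, and the function names, the value of `k`, and the toy sequences are all hypothetical.

```python
from typing import List, Set, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def vector_kmers(vector: str, k: int) -> Set[str]:
    """All k-mers of the vector on both strands."""
    kmers: Set[str] = set()
    for strand in (vector.upper(), revcomp(vector.upper())):
        for i in range(len(strand) - k + 1):
            kmers.add(strand[i:i + k])
    return kmers


def find_vector_hits(contig: str, vector: str, k: int = 20) -> List[Tuple[int, int]]:
    """Return merged half-open (start, end) contig intervals covered by
    exact k-mer matches to the vector in either orientation."""
    kmers = vector_kmers(vector, k)
    contig = contig.upper()
    hits: List[Tuple[int, int]] = []
    for i in range(len(contig) - k + 1):
        if contig[i:i + k] in kmers:
            if hits and i <= hits[-1][1]:
                # Overlapping or adjacent match: extend the previous interval.
                hits[-1] = (hits[-1][0], i + k)
            else:
                hits.append((i, i + k))
    return hits


# Toy example: one forward and one inverted copy of a made-up "vector".
toy_vector = "GATTACAGGCTTAACCGT"
toy_contig = "TTTTT" + toy_vector + "CCCCC" + revcomp(toy_vector) + "AAAA"
toy_hits = find_vector_hits(toy_contig, toy_vector, k=8)
print(toy_hits)
```

Exact matching of this kind would miss fragments carrying sequencing errors, so a real analysis of long-read contigs would more plausibly rely on an error-tolerant aligner; the sketch only shows how direct and inverted vector copies appear as separate intervals on a contig.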

T-DNA integration occurs independently from both double-strand break ends

While both sequenced lines share similar numbers of T-DNA insertion events, the genome of the SAIL_232 plant line underwent more significant changes to its architecture. All genome insertion sites began and ended with the LB of the T-DNA strand, providing evidence for independent transgene integration at both ends of the DNA double-strand break. We did not recover any LB sequences at the chromosome/T-DNA junction, in line with literature reports that usually 73–113 bp are missing from the LB sequence inwards [15, 19, 31]. Internal T-DNA sequence deletions were also seen at breakpoints within the insertion (Fig 1C, Fig 2C). As observed for the SALK:chr2_18Mb chromosomal deletion (Fig 1E, S3 Fig), we cannot exclude that long homologous stretches between the independently inserted T-DNA/vector backbone concatemers represent inverted repeats.

Transgenes are functional in SAIL-lines but are silenced in SALK-lines

We next wanted to assess the effects of T-DNA insertions on the epigenomic landscape. The pROK2 T-DNA strand contains the kanamycin antibiotic-resistance gene nptII under control of the bacterial nopaline synthase promoter (NOSp) and terminator (NOSt), and an empty multiple cloning site under control of the widely used Cauliflower mosaic virus 35S (CaMV 35S) promoter and NOSt [25]. The CaMV 35S constitutive overexpression promoter has previously been described to cause homology-dependent transcriptional gene silencing (TGS) in crosses with other mutant plants that already contain a CaMV 35S promoter driven transgene [32]. In germination assays, we confirmed that the kanamycin selective marker is not functional in SALK lines propagated for more than a few generations [4]. While ~75% of SALK_059379 seed initially germinated on kanamycin-containing plates, these seedlings stopped growth after primary root and cotyledon emergence (Fig 3A). The SAIL_232 pCSA110 T-DNA segment encodes the herbicide resistance gene bar (phosphinothricin acetyltransferase), under control of a mannopine synthase promoter [26]. In contrast to SALK_059379, we confirmed proper transgene function by applying herbicide to soil-germinated plants [33, 34] (Fig 3B). We corroborated these differential phenotypes by mapping RNA-seq reads to the corresponding transformation plasmid and found that the SAIL_232 bar gene was expressed, while in SALK_059379 the nptII gene was not expressed, most likely due to epigenetic silencing (Fig 3C and 3D).


Fig 3. SALK and SAIL T-strand sequences show divergent epigenomic signatures. Transgene activity was tested by exposing plants to the respective selective agent: (a) SALK_059379 grown on media containing the antibiotic kanamycin (50ug/uL) (or empty control) and (b) the SAIL_232 sprayed with the herbicide Finale™ at two different concentrations through the middle of the tray. All phenotypes are compared to the WT Col-0 (a,b). Analysis of expression and epigenetic signatures on the corresponding T-DNA sequence is captured in genome browser shots for SALK_059379 plasmid pROK2 (c) and SAIL_232 plasmid pCSA110 (d): Illumina read mapping of bisulfite sequencing, RNA-seq and different small RNA species. Quantification of individual siRNA read length against individual parts of the two plasmid sequences are reported in (e = pROK2) and (f = pCSA110). GUS staining of leaves and flowers of SAIL_232 (g), with Col-0 (h) and SALK_059379 (i) as control. https://doi.org/10.1371/journal.pgen.1007819.g003
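The quantification of siRNA read lengths per plasmid region described in the caption can be sketched as a simple tally. This is an illustrative outline only, not the authors' analysis code: it assumes each mapped small RNA read has already been reduced to a (feature, read length) pair, and the feature names and counts in the toy data are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def tally_sirna_lengths(
    alignments: Iterable[Tuple[str, int]]
) -> Dict[str, Counter]:
    """Count mapped small RNA reads per plasmid feature and read length."""
    profile: Dict[str, Counter] = defaultdict(Counter)
    for feature, read_length in alignments:
        profile[feature][read_length] += 1
    return dict(profile)


def dominant_length(counts: Counter) -> int:
    """Most frequent read length; ties are resolved toward the shorter class."""
    return max(counts, key=lambda n: (counts[n], -n))


# Hypothetical toy input: (plasmid feature, read length in nt).
toy_alignments = [
    ("feature_A", 24), ("feature_A", 24), ("feature_A", 21),
    ("feature_B", 21), ("feature_B", 22), ("feature_B", 21),
]

toy_profile = tally_sirna_lengths(toy_alignments)
for feature, counts in sorted(toy_profile.items()):
    print(feature, dict(sorted(counts.items())), "dominant:", dominant_length(counts))
```

A profile dominated by 24 nt reads would be consistent with the RdDM-associated pattern described for SALK_059379, while a 21/22 nt dominated profile would resemble the pattern described for SAIL_232.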