Plant materials and genetic transformation

The plant materials comprising the inter-cultivar and FN classes included in this study have been previously described [1, 29]. Briefly, the inter-cultivar group consists of 41 soybean accessions used as parents in developing the SoyNAM population. The FN population was developed in the background of the variety ‘M92-220’ [43] derived from the 2006 Crop Improvement Association seed stock of variety ‘MN1302’ [44]. Two types of FN plants were studied, including ten with detectable mutant phenotypes and 35 with no detectable phenotype. All FN plants were descendants of unique M 1 individuals that were treated with either 4, 16, or 32 Gy of FN radiation [1].

Genetic transformation using Agrobacterium rhizogenes followed published methods [45, 46]. Each plant was confirmed to be transgenic based on PCR analysis and survival on selective (herbicide-treated) medium. The five T 1 soybean individuals were from unique transformation events. The constructs for these transformations included a zinc finger nuclease [47], transcription activator-like effector nuclease, GFP and RNAi hairpin, mPing-Pong transposon [32], and a magnesium chelatase [48] RNAi hairpin. These transformations were in a ‘Bert’ cultivar [30] background (subline’Bert-MN-01’) or ‘Williams 82’ (subline ‘Wm82-ISU-01’) [31, 49]. The ‘Bert-MN-01’ subline (referred to as ‘Bert’ throughout this study) was derived from a single ‘Bert’ individual to reduce heterogeneity between transformed plants. The ‘Wm82-ISU-01’ subline (referred to as ‘Williams 82’ throughout this study) was derived from a single ‘Williams 82’ individual and is the nearest known match to the soybean reference genome assembly version 1.0 (Wm82.a1.v1.1) [31, 50].

Comparative genome hybridization

The CGH data for all comparisons used in this study have been deposited in the National Center for Biotechnology Information Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo). The data for the inter-cultivar, FN, and transgenic plant comparisons can be found as accession numbers GSE56351, GSE58172, and GSE73596, respectively.

As with previous CGH analyses [1, 29], the DEVA software algorithm SegMt was used to generate raw data and identify segments in the transgenic plants. DNA samples from transgenic plants were labeled with Cy3 and the appropriate reference individual (‘Bert’ or ‘Williams 82’) was labeled with Cy5. Program parameters were: minimum segment difference = 0.1, minimum segment length (number of probes) = 2, acceptance percentile = 0.99, number of permutations = 10. Spatial correction and qspline normalization were applied. The resulting segments were processed based on their log 2 ratio mean. Segments that exceeded the upper threshold were considered “UpCNV”. Segments that were less than the lower threshold were considered “DownCNV”. The upper threshold of 0.3484 and lower threshold of −0.5257 were based on empirical data from hemizygous deletions and duplications in eight previously characterized FN plants (Additional file 1: Table S4) [1]. A custom Perl script calculated the number of genes overlapping these significant segments. Minimum segment length was adjusted to three probes to account for noise seen in control arrays. Structural variants in the transgenic plants were further investigated through visual inspection, to identify any obvious SVs that were not detected by the threshold based pipeline.

SV attributable to intra-cultivar heterogeneity were removed, as has been done in the previous studies [1, 29]. Intra-cultivar heterogeneity was seen as significant segments of the exact same location occurring in multiple plants. By overlaying the raw CGH data of the four transgenic plants in the ‘Bert’ background, heterogeneous SV in the ‘Bert’ cultivar were removed. A similar method was used to filter out heterogeneity in the transformed ‘Williams 82’ background. The comparison array in this case was ‘Williams’ (the backcross parent in ‘Williams 82’ [49]) also hybridized to ‘Williams 82’. Any identical SV event discovered in both ‘Williams’ and transformed ‘Williams 82’ was considered heterogeneity and removed.

The CGH platform, methods, and filtering steps of the inter-cultivar and FN data have been previously described [1, 29]. The SV detected in the inter-cultivar variation study were all cross validated with resequencing data and conservative thresholds. For all CGH arrays, test genotypes were labeled with Cy3 and the appropriate reference individual was labeled with Cy5 in all hybridizations (Additional file 1: Table S2).

Visual displays of the CGH data were created using Spotfire DecisionSite software. Additional file 1: Table S5 provides a list of soybean plants chosen for analysis, corresponding publication, and hybridization reference. Our previous study [29] of inter-cultivar variation assessed CNV on a gene-by-gene cross-validated basis across all 41 SoyNAM genotypes, concluding that SV affected 1528 genes. We conservatively converted this to SV genes per genotype using the CGH thresholds from the study and probe-based log 2 ratio score for each of the 1528 genes. FN data came from the “no-phenotype” class of 35 plants as described above, and ten “mutant phenotype” lines described in Additional file 1: Table S1 [1]. Only SV overlapping with genes were included in segment size summaries in all three genotypic classes.

Confirming novel SV

PCR was used to confirm structural variants found with CGH in the transgenic plants. PCR and Sanger sequencing across breakpoints was used to confirm the four CGH observed events. Confirmed events and internal primers were used to genotype the structural variants in additional plants. Primer sequences are provided in Additional file 1: Table S6. In three of these lineages, siblings and offspring of the transgenic plants were genotyped to test if the SV were heritable. The events were confirmed not to be intra-cultivar heterogeneity by PCR-genotyping 47 untransformed individuals (either in the corresponding ‘Bert’ or ‘Williams 82’ background) at these three loci. Furthermore, the SoyNAM parents as well as cultivars ‘Archer’, ‘Minsoy’, and ‘Noir1’ were also PCR-genotyped with the breakpoint and internal primers to test for novelty of the SV events.

Analyzing transgene insertion sites

Transgene integrations were analyzed using TAIL-PCR [51], Southern blot, and resequencing data. Southern blots used a BAR gene probe to detect the number of T-DNA insertions in the plants tested. TAIL-PCR was used to detect T-DNA locations in WPT_384-1-1, WPT_389-2-2 and WPT_301-3-13. Transgene insertion sites and counts were also determined by resequencing according to steps one through six outlined by Srivastava et al. [52]. Briefly, raw paired-end reads were aligned using Bowtie2 to the transgene sequence between the left and right border and the orphaned mapped reads were then aligned to the host soybean genome. The resulting putative transgene integration locations were filtered on prior knowledge of homology between components of the transgene (i.e. GmUbi promoter, RNAi hairpin targets, and their paralogs) and the genome. The location of the mapped orphaned reads, read depth coverage, and paired-end read spacing were further used to detect SV induced locally to transgene insertions. Integrated Genome Viewer (IGV) version 2.3.52 was used to visualize alignment results [53].

Sequence handling, alignment, and calling of nucleotide substitutions

The sequence read data from the ten “mutant phenotype” fast neutron plants analyzed in this study, along with the parent line of the population (cv. ‘M92-220’), are deposited in the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/) under accession number SRP036841. The sequence read data from the two transgenic plants, along with two individuals of the parent line (cv. ‘Bert’), and the cultivars ‘Archer’ and ‘Noir 1’ are deposited in the Sequence Read Archive under accession number SRP063738.

To determine the relative rates of base substitution due to FN mutagenesis, we used resequencing data from the aforementioned ten FN plants that had associated mutant phenotypes as reported in [1] (see Additional file 1: Table S1). We sequenced two transgenic plants and two controls to estimate the base substitution rate and localize T-DNA insertion sites. See Additional file 2: Figure S11 for the transgenic resequencing data analysis pipeline. All individuals were sequenced with Illumina 100 bp paired end reads.

FastQC version 0.11.2 was used on initial read data (and after any modifications to sequence data) to ensure that tools were used properly and the data was of acceptable quality for downstream applications [54]. Forward and reverse reads were treated separately, and then resynchronized for alignment using resync.pl (Riss util version 1.0, http://msi-riss.readthedocs.org/en/latest/software/riss_util.html). Cutadapt version 1.6 was used to remove adapter sequences using –b to specify both adapter sequences (GATCGGAAGAGCACACGTCTGAACTCCAGTCAC-NNNNNN-ATCTCGT-ATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCC-TACACGACGCTCTTCC-GATCT) where NNNNNN specifies the unique 6 bp sequence attached to samples when multiplexing. Sequence artifacts (low-complexity reads) were removed using fastx artifacts filter (Fastx toolkit version 0.0.14). Read quality was further filtered using fastq quality trimmer in the fastxtoolkit. Bases with phred quality of less than 20 were removed, and reads that were shorter than 30 bp after trimming were discarded.

We chose to align reads to the reference with two different read mapping programs, BWA mem (v. 0.7.10) [55], and Bowtie2 (v. 2.2.4) [56]. BWA mem alignments allowed for more accurate single base substitution calls, and Bowtie2 produces alignments more suitable for confirming CGH-identified SV. For BWA mem, the mismatch penalty was set to 6 (−B 6), which allows for approximately seven high-quality mismatches per read. Bowtie2 alignments were produced with default parameters. In both cases, reads were mapped to the Glycine max assembly version 1.0 (Wm82.a1.v1.1) [50]. Read cleaning and post-alignment filtering resulted in a realized mean coverage of 35x for the FN mutagenized plants, and 20x for WPT_389-2-2, and 21x for WPT_391-1-6.

Genotype calls for all sites were generated with the UnifiedGenotyper in the Genome Analysis Tool Kit (GATK) version 3.3 [57]. Pairwise comparisons of soybean varieties typically identify over one-million single base substitutions [33, 34]. This BWA mem resequencing and SNP detection pathway identified 1,110,325 substitutions between genotype ‘Archer’ and the ‘Williams 82’ reference genome sequence, and 1,904,061 substitutions between genotype ‘Noir 1’ and ‘Williams 82’. These findings served as a control to demonstrate our analysis pipeline identified similar polymorphism counts as have been previously reported in soybean studies.

We then applied a set of filtering criteria to look at only substitutions that are private to a single individual (termed “unique” or “novel” throughout the paper) across the most confidently called portions of the genome. This excluded sites with less than five reads per sample, sites that were monomorphic for the reference base, sites with heterozygous or missing calls, and sites with a homozygous alternate base call in more than one individual. Applied together, these filtering criteria produced variant calls that were homozygous private differences from reference. The filtering criteria assumed de novo mutations at a single base position will only be observed once. A large section in FN plant 07 on Chromosome 12 between 10 and 23 Mb was found to contain a disproportionate number of substitutions. CGH results from other FN individuals [1], not included in this sample, suggest this region is heterogeneous in the ‘M92-220’ cultivar. We therefore excluded this region of 183 substitutions when analyzing FN plant 07. The observed transition:transversion ratios were too variable between individuals to compare to previously reported ratios in FN mutagenesis [35].

Circos plots [58] were generated using 2d tile data tracks, plotting unique substitutions detected, previously published FN-induced SV [1], detected transformation-induced SV, and T-DNA mapping results. Scripts to perform data handling and analysis are available at https://github.com/TomJKono/Unintended_Consequences.

Availability of supporting data

The data sets supporting the results of this article are available in the National Center for Biotechnology Information Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) and Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/) repositories, GSE56351, GSE58172, GSE73596, SRP036841, and SRP063738.