Abstract Plant mitochondrial genomes are usually assembled and displayed as circular maps based on the widely-held view across the broad community of life scientists that circular genome-sized molecules are the primary form of plant mitochondrial DNA, despite the understanding by plant mitochondrial researchers that this is an inaccurate and outdated concept. Many plant mitochondrial genomes have one or more pairs of large repeats that can act as sites for inter- or intramolecular recombination, leading to multiple alternative arrangements (isoforms). Most mitochondrial genomes have been assembled using methods unable to capture the complete spectrum of isoforms within a species, leading to an incomplete inference of their structure and recombinational activity. To document and investigate underlying reasons for structural diversity in plant mitochondrial DNA, we used long-read (PacBio) and short-read (Illumina) sequencing data to assemble and compare mitochondrial genomes of domesticated (Lactuca sativa) and wild (L. saligna and L. serriola) lettuce species. We characterized a comprehensive, complex set of isoforms within each species and compared genome structures between species. Physical analysis of L. sativa mtDNA molecules by fluorescence microscopy revealed a variety of linear, branched, and circular structures. The mitochondrial genomes for L. sativa and L. serriola were identical in sequence and arrangement and differed substantially from L. saligna, indicating that the mitochondrial genome structure did not change during domestication. From the isoforms in our data, we infer that recombination occurs at repeats of all sizes at variable frequencies. The differences in genome structure between L. saligna and the two other Lactuca species can be largely explained by rare recombination events that rearranged the structure. Our data demonstrate that representations of plant mitochondrial genomes as simple, circular molecules are not accurate descriptions of their true nature and that in reality plant mitochondrial DNA is a complex, dynamic mixture of forms.

Author summary Plant mitochondrial genomes are commonly depicted in research articles and textbooks as circular molecules that are the size of the genome. Although research on mitochondrial DNA (mtDNA) over the past few decades has revealed that genome-sized circles are exceedingly rare and that alternative forms of mtDNA are more common, many biologists still perceive circular maps as representing one or more physical chromosomes. This misconception can lead to biases in how mitochondrial genomes are assembled and misinterpretation of their evolutionary relationships, synteny, and histories. In this study, we present an assembly methodology that uses short- and long-read sequencing data to determine the mitochondrial genome structures of three lettuce species. We show that these mitochondrial genomes are fluid and dynamic, with multiple sequence arrangements of the genome coexisting within individuals of the same species. Differences in sequence arrangements between species can be explained by rare recombination events. Inspection of physical molecules of mtDNA reveals primarily non-circular forms. We demonstrate that plant mitochondrial genomes are a complex mixture of physical forms and sequence arrangements. Our data suggest that plant mitochondrial genomes should be presented as multiple sequence units showing their variable and dynamic connections, rather than as circles.

Citation: Kozik A, Rowan BA, Lavelle D, Berke L, Schranz ME, Michelmore RW, et al. (2019) The alternative reality of plant mitochondrial DNA: One ring does not rule them all. PLoS Genet 15(8): e1008373. https://doi.org/10.1371/journal.pgen.1008373 Editor: Nathan M. Springer, University of Minnesota, UNITED STATES Received: April 24, 2019; Accepted: August 16, 2019; Published: August 30, 2019 Copyright: © 2019 Kozik et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: Data sources generated by this project: the mitochondrial genomes of L. sativa, L. serriola, and L. saligna have been deposited in GenBank under accession numbers MK642355, MK820672, and MK759657, respectively (BioProject PRJNA508811). PacBio reads generated from DNA extracted from seven-day-old dark-grown seedlings of L. sativa cv. Salinas can be found in the NCBI Sequence Read Archive (SRA) under accession numbers SRX3557844–SRX3557871. Mitochondrial PacBio reads for the L. saligna germplasm accession CGN5271 are available at NCBI SRA under accession number SRX5104332. Illumina reads for the L. serriola germplasm accession US96UC23 can be obtained from the NCBI SRA database under accession numbers SRX5097892 (paired-end 170 bp), SRX5097891 (mate pair 2.5 kb), and SRX5097890 (mate pair 10 kb). The Illumina mitochondrial paired-end read data for L. saligna used in this analysis are available at the NCBI SRA under accession number SRX5131542. Reads of Hi-C libraries generated from L. sativa leaves can be found in the NCBI SRA under accession number SRX3973834. Previously generated data sources used in this project include: the L. sativa reference chloroplast genome (GenBank DQ383816.1). L. sativa Illumina paired-end reads with insert sizes of 175 bp (NCBI SRA SRR577192), 475 bp (NCBI SRA SRR577183), and 750 bp (NCBI SRA SRR577184). L. sativa mate pair reads with insert sizes of 2.5 kb (NCBI SRA SRR577197), 5 kb (NCBI SRA SRR577193), and 10 kb (NCBI SRA SRR577207). Leucaena trichandra mitochondrial genome assembly (GenBank MH717173.1). Putative autonomous mitochondrial DNA element (GenBank MH717174.1). L. trichandra mitochondrial PacBio reads (NCBI SRA SRX2719625), and Illumina 4 kb mate-pair libraries (NCBI SRA SRX2719623 and SRX2719624). Funding: Support for this work was provided by the University of California Davis to R.W.M., and the National Science Foundation (www.nsf.gov) grant MCB-1413152 to A.C.C. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction Unlike the relatively simple mitochondrial genomes of animals, the genomes of non-parasitic flowering plant mitochondria are large and complex [1–8]. They exhibit extensive variation in size [191 kb–11,319 kb], sequence arrangement, and repeat content, yet coding sequences are highly conserved (typically 24 core genes with 17 variable genes) [9–13]. Mitochondria in plants not only have important roles in respiration, metabolism, and programmed cell death (similar to animal mitochondria), but also in conferring male sterility [14–17]. Their evolution has been the subject of numerous studies [3,9,10,18–25]. Because they can often be assembled and mapped as circles, there is a commonly-held misconception that plant mitochondrial genomes exist in vivo as circular molecules (the master circle model [26–28]). This was the consensus view until a lack of strong evidence for genome-sized circular molecules in plants and accumulating evidence for non-circular forms caused a shift in the understanding among domain experts and a transition to a more accurate understanding of plant mtDNA as primarily a collection of dynamic non-circular forms [5,6,29–33]. These forms can change during development and in response to stress, such as prolonged exposure to low temperatures [34]. However, most biologists outside of the specialized community of plant mitochondrial researchers still hold the outdated “master circle” view. This is perhaps because many contemporary publications of mitochondrial genomes persist in presenting a “master circle,” often without mention of any other forms (S1 Table) and this is what is currently presented in most biology textbooks. Additionally, the replication and recombination mechanisms of plant mitochondrial genomes are still not fully understood, nor are the adaptive reasons for the striking differences from animal mitochondrial genomes. Accurate characterizations of plant mitochondrial genome structures are required for understanding their functions, replication, inheritance, and their peculiar evolutionary trajectories. DNA sequencing and comparisons between sister taxa have revealed that plant mitochondrial synonymous mutation rates are substantially lower than in animal mitochondria, or the plant nucleus [35–38]. In contrast, once entire genomes were sequenced, it became clear that while there was conservation of gene content, there was very little conservation of gene order even between close relatives [39,40], likely due to frequent DNA repair that occurs via homologous recombination and nonhomologous end-joining [22,25,41]. Most plant mitochondrial genomes include a substantial fraction of poorly conserved DNA of unknown function and some have dramatically increased their genome sizes to millions of nucleotides, while still encoding only dozens of genes [11,42,43]. In contrast to their representation as master circles, mtDNA exhibits complex and dynamic structures, including linear and branched molecules (which could be intermediates in replication or recombination) and these may represent multiple isoforms of the genome [44]. Plant mitochondrial genomes generally have a small number of non-tandem direct or inverted repeat sequences of several kb in length. These may recombine frequently and symmetrically, isomerizing the genome [45,46]. Some genomes assemble into more than one independent molecule [47], although these may also be the consequences of assembly methods and parameters used during the assembly process [11,48]. In many cases, particularly those sequences assembled solely from short-read sequencing, recombination and isomerization have simply been assumed or ignored. There are also dispersed repeats of up to a few hundred base pairs that recombine at relatively low frequencies in wild type plants but often recombine more frequently (and asymmetrically) in DNA maintenance and repair mutants [49–53]. Such repeats are not always annotated, but their important contribution to the rearrangement and evolution of mitochondrial genomes is starting to emerge [22]. We employed new technologies to address the challenge of assembling genomes with branches and rearrangements by determining the complete mitochondrial genome sequences of three closely related species in the genus Lactuca. In terms of gene content, overall size, and repeat content, these genomes are typical of many flowering plant mitochondrial genomes [24,25]. We combined long-read and high coverage, short-read data to determine the sequences, junctions, rearrangements, and stoichiometry with great precision to produce the first high-quality mitochondrial assemblies with detailed information about structures and isoforms for species in the Compositae. This allowed us to evaluate the repeat structure and frequent isomerization by recombination at the large repeats. We propose a model for rare recombination events that rearranged the mitochondrial genome during the divergence of two Lactuca lineages. The physical structure of the genomes visualized by fluorescence microscopy showed that the mtDNA of L. sativa exists primarily in branched, linear forms, and subgenome-sized circles. Our data allowed us to document the diversity of isoforms in great detail, clarify misconceptions about mtDNA structures, explore the best methods for assembly of these dynamic, complex genomes, and examine the evolution of mitochondrial genomes in both a wild and a domesticated descendent from a common Lactuca ancestor.

Discussion The complexity of plant mitochondrial genomes has confounded efforts to characterize their sequence, structure, and dynamic evolution. Despite the general acceptance among mitochondrial biologists that plant mitochondrial genomes have a variety of configurations [5,6,29,30,32–34,72], over half of the publications on plant mitochondrial genomes since 2016 have presented circles as if they are the only form or the primary form of the genome, without any indication that the concept of a “master circle” is only a model (see S1 Table). Assumptions have to be made in any genome assembly process about the type of molecule being assembled and published sequences reflect choices made during the assembly process. Because plant mitochondrial genomes rearrange so frequently and are present in so many complex and non-circular states, assumptions as to the circularity of the genome before assembly have inevitably led to arbitrarily circularized structures [73,74] and incomplete and/or incorrect structures. In addition, GenBank restricts sequences to being designated as either circular or linear, with no provision in the annotations for dynamic structural forms. When plant mitochondrial genomes are assembled with the goal of producing a “master circle,” and their sequences are archived as such, it perpetuates the outdated notion of the existence of master circle molecules, especially for scientists who are outside of the specialized community of plant mitochondrial researchers. In addition to assuming a master circle, sometimes the goal has been to assemble the sequence into unique circular molecules of minimum size, in spite of the existence of very large repeats in different molecules that would allow for recombination to combine smaller molecules into larger ones [11,48]. Assemblies reported as multi-chromosomal may not exist as distinct molecules. In Silene conica [11], for example, the assembled constructs referred to as chromosomes 9, 11, 12, 14, and 23 total 600 kb in length; however, 389 kb of that length consists of repeats between 1 and 75 kb, matching sequences in 13 other assembled circles. These repeats can act as substrates for recombination and combine multiple chromosomes into a single sequence. We suggest that a more comprehensive analysis of structure could inform assembly choices in order to produce a more accurate understanding of plant mitochondrial genomes. Incomplete assemblies may also be missing such repeats that do exist in vivo and lead to the incorrect interpretation of multiple molecules. Genome rearrangements can occur both within species or individuals in real time and between species during their evolution and divergence [22]; therefore, accurate assembly and identification of multiple isoforms is critical to understanding the evolution of mitochondrial genomes. In this study, we aimed to produce accurate assemblies and characterize isoforms of the mitochondrial genomes of three Lactuca species. Our success was dependent on the exploitation of multiple long-read technologies and atypical assembly approaches. Our use of Illumina mate-pair, Hi-C, and long PacBio reads enabled the high resolution analysis of the mitochondrial genomes of three species of Lactuca. Long reads at high coverage depth were essential for accurately determining the in vivo sequences and arrangements because short paired-end reads would not allow the identification of recombination events that involve repeats longer than the paired-end library fragments. We assembled these genomes with CLC, reverse read mapping, and long distance analysis with mate-pair reads that made no assumptions as to the component structures and redundancies. Conventional assembly software programs, such as Falcon or Canu [75,76], are usually not capable of identifying the smallest structural components of an assembly and consequently are limited in their ability to identify all isoforms that may exist in plant mitochondrial genomes and resolve complex nested repeats. Assemblers that can split putative chimeras (isoforms) into distinct structural units in the initial steps of assembly have advantages over those that try to assemble all reads into a single linear or circular genome. Bottom-up construction of contigs and alignment to raw PacBio reads allowed a sensitive and accurate assembly of dynamic mitochondrial genomes and identification of multiple isoforms, resulting in precise representation of the subtle complexities of plant mitochondrial genomes and enabling evolutionary studies. The availability of long read sequences for an increasing number of plant species provides the opportunity for the reassessment of mitochondrial structures and diversity across the plant kingdom using our approach. When we applied our workflow to the mitochondrial genome data of Leucaena trichandra [74], we were able to generate a cyclic graph for this genome and connect all segments into contiguous isoforms (S8 Fig). Our structural studies using fluorescence microscopy of mtDNA molecules and Hi-C provided a detailed description of the complexities of plant mitochondrial genome architecture and dynamics that was consistent with our genome assemblies. This provided further evidence for the existence of multiple major and minor isoforms produced by homologous recombination at very large repeats. The prevalence of branched and linear forms of mtDNA likely represents ongoing recombination, perhaps due to recombination-dependent DNA replication [5]. The presence of smaller linear and degraded forms might be evidence of post-replication processing. In Chenopodium album cell cultures [6] and in later stage mung bean cotyledons [34], the complexity first increases during mtDNA replication and is then processed to simpler forms after replication has been completed. We would expect that the mitochondria isolated from 7-day-old whole lettuce seedlings would be a mix of those in the DNA replication stage and those post-replication and our analysis of the DNA structural forms is consistent with this expectation. L. sativa major isoforms had a genome size of ~363 kb, but no linear band of this size was observed by pulsed-field gel electrophoresis (PFGE). Circles only accounted for a small fraction of the DNA forms, yet the assemblies had a circular topology. This is similar to what has been observed with PFGE of watermelon [77], tobacco [5], soybean [32], and maize [78]. Although there was a range of percentages of molecules in each structural category, circles never accounted for the majority of the mtDNA and exhibited a broad distribution. The interpretation that the branched linear structures contain multi-genomic concatemers of the genome is consistent with all of these findings. The Hi-C data also suggest no higher-order structure of mtDNA within the nucleoid. Repeated sequences in plant mitochondria provide the substrates for contemporaneous and long-term structural variation. Large repeats, usually several kb or more in length, recombine frequently and isomerize the genome continuously. Whether the abundance of isoforms varies among cell types or tissues remains to be investigated; however, the tools are now available for the detailed analysis of specificity. There are also infrequently recombining non-tandem repeats, usually less than several hundred base pairs in length [25]. Although it is unclear what controls the interconversion of isoforms and their relative stoichiometries, repeat length may be an important factor. Homology-based ectopic recombination for the intermediate-length repeats has been shown to occur following double-strand breaks [52,79–81], when DNA repair functions are impaired by mutations [49,52,82], and occasionally during the evolution and divergence of species [22]. We were able to compare our detailed structures of the different species and infer the most parsimonious path from possible ancestral forms to the genomes of the extant species. Ectopic recombination between dispersed repeats could explain some of the structural changes that have become fixed between the two species. We also identified small linear plasmid-like molecules that have been shown to exist autonomously in plant mitochondria in a variety of species and are sometimes integrated into the mitochondrial genome (reviewed in [83]). Both Zea mays and D. carota have integrated plasmids in their mitochondrial genomes which prevented the assembly of the genomes into circular maps [65,84,85]. One such linear plasmid is integrated into the L. saligna genome but is incomplete in L. sativa. We found no evidence for its integration at any other location, nor for its existence as an autonomous molecule. Its presence in the L. saligna genome and its frequent inversions complicated the analysis of junctions in L. saligna and prevented a simple interpretation of the stoichiometry of the junctions. This is the most detailed description of the sequence, structure, and dynamics of plant mitochondrial genomes to date. Our comprehensive approach for investigating mitochondrial genomes can be applied to existing and future datasets generated for other plant species. This approach should avoid incomplete assemblies and reveal the complexity of multiple isoforms, non-circular molecules within circularly permuted maps, and recombination events that occur within and between species. This will be facilitated by the advent of even longer reads provided by rapidly advancing single molecule (PacBio) and nanopore sequencing (Oxford Nanopore Technologies). Understanding the evolutionary changes that can occur with mitochondrial plant genomes, including integration or loss of linear plasmid-like molecules, provides the foundation for future work on plant evolution and taxonomy, as well as mitochondrial genome structure and function.

Materials and methods Germplasm sources L. sativa cv. Salinas is a designated cultivar and seeds can be obtained from multiple commercial sources. Seeds of the L. saligna accession are available from the Centre for Genetic Resources, Wageningen University and Research (WUR) the Netherlands under the stock number CGN5271. We used our UC Davis L. serriola stock US96UC23 and will provide seeds upon request. PacBio reads DNA was extracted from seven-day-old dark-grown seedlings of L. sativa cv. Salinas grown in sterile conditions at 15°C using a modified CTAB protocol [86] with two chloroform extractions. DNA was removed after EtOH precipitation using a glass hook and then washed two times in 70% EtOH. The DNA was further processed with a high-salt, phenol-chloroform extraction and precipitation and removal of polysaccharides (https://www.pacb.com/support/documentation/). Finally, the DNA was precipitated using EtOH and washed two times with 70% EtOH. The DNA was divided into two samples for library preparation. For the first library, the DNA was sheared using a Megaruptor instrument to 20 kb before PacBio library construction, while the second PacBio library was prepared directly from the non-sheared DNA. SMRTbell libraries were made according to manufacturer’s standard protocol (Pacific Biosciences). Libraries were size selected > 20 kb using BluePippin (Sage Science). Sequencing on a Pacific Biosciences (PacBio) RSII Instrument generated 39.6 Gb of 2.7 million single pass reads from 28 SMRT cells. The raw reads were deposited in GenBank under accession numbers SRX3557844–SRX3557871. For the L. saligna germplasm accession CGN5271, DNA was extracted from leaves of young seedlings grown on soil in a greenhouse under long-day (16 h light) conditions with a temperature of 21°C during the day and 19°C at night using a modified version of the protocol described previously [87]. PacBio long read data was obtained from 86 SMRT cells. The selected set of mitochondrial reads (1 Gb, 101,753 reads) is available at SRA under accession number SRX5104332. Illumina genomic reads For L. sativa cv. Salinas, we used previously published Illumina whole genome libraries [88]. Paired-end reads included insert sizes of 175 bp (SRR577192), 475 bp (SRR577183), and 750 bp (SRR577184), and mate pair reads included insert sizes of 2.5 kb (SRR577197), 5 kb (SRR577193), and 10 kb (SRR577207). Paired-end libraries were filtered for high quality reads of uniform length (100 nt), yielding 25 million read pairs from the 175 and 475 bp libraries, and 15 million pairs from the 750 bp insert library. Paired-end genomic reads were aligned (mapped) to the lettuce chloroplast reference sequence (GenBank accession DQ383816.1) using the CLC Genomics Workbench with stringent parameters (Mismatch cost = 2; Insertion cost = 3; Deletion cost = 3; Length fraction = 0.9; Similarity fraction = 0.9) to find sequences that mapped to the chloroplast genome. Unmapped read pairs were collected and compiled into separate files for downstream mitochondrial assembly and analysis. Reads from mate-pair libraries were used without filtering against the lettuce chloroplast genome. L. serriola acc. US96UC23, whole genome sequence (WGS) paired-end and mate pair genomic libraries were prepared using the DNeasy Plant Mini Kit (Qiagen) from DNA isolated from dark grown seedlings. A paired-end 170 bp insert library was constructed utilizing the NEXTflex PCR-Free approach (BIOO Scientific). The 2.5 kb library was constructed using the Mate Pair Library v2 Kit (Illumina). The 10 kb library was constructed using the Nextera Mate Pair Sample Preparation Kit (Illumina). The raw reads were submitted to the SRA database under accession numbers SRX5097892 (170 bp), SRX5097891 (2.5 kb), and SRX5097890 (10 kb). Illumina paired-end read data for L. saligna (insert size 500 bp) were generated as a part of the International Lettuce Genomics Consortium (ILGC) project (http://lgr.genomecenter.ucdavis.edu/). Mitochondrial reads used in this analysis are available at NCBI Sequence Read Archive under accession number SRX5131542. Dovetail Hi-C libraries Hi-C libraries from leaves were generated by Dovetail™ using their standard proprietary protocol. Libraries were sequenced using an Illumina HiSeq 4000 at the UC Davis Genome Center. Paired reads were deposited in NCBI GenBank SRA under accession number SRX3973834. Preliminary Illumina assembly of the mitochondrial genome for L. sativa The plant mitochondrial RefSeq sequences from GenBank were used as a reference to mine mitochondrial contigs from a preliminary assembly of L. sativa Illumina reads that was generated using Velvet [89]. This preliminary assembly involved six rounds of independent assemblies using Velvet with k-mer values 51, 57, 63, 67, 71, and 75. The Velvet contigs were then joined using CAP3 [90] specifying -o 200 -p 95. A self-BLAST was performed on the CAP3 contigs and a non-redundant set of sequences was selected for downstream analysis. Optimization of CLC assembly parameters with L. sativa chloroplast PacBio genomic reads The reference chloroplast genome (GenBank accession DQ383816.1) was used to recover all PacBio reads that contained chloroplast fragments for input into the assembly with the CLC Genomics Workbench using the default approach of the Genome Finishing Module. Assembly parameters were adjusted until we achieved the expected three contigs for this genome, corresponding to the long and short unique segments and the inverted repeat. Assembly of the L. sativa mitochondrial genome with PacBio reads The preliminary Illumina-based mitochondrial genome assembly for L. sativa was used to query and select PacBio reads containing mitochondrial sequences from the raw PacBio genomic read set. This was achieved using a BLAST-N search of PacBio reads against the Illumina assembly and filtering for contiguous alignments with at least 500 nt and 80% identity. The Illumina reads had already been filtered to remove any plastid DNA reads. The total number of PacBio reads selected for the assembly was adjusted to achieve 250x coverage (~4,000 reads) before input into the first round of assembly with the CLC Genomics Workbench using the approach and parameters identified from the chloroplast genome assembly. After the first round of assembly, chloroplast segments contained in the mitochondrial assembly were masked and the masked contigs were used to mine additional PacBio reads for mitochondrial sequences for a second round of assembly. This process was repeated until all reads containing mitochondrial segments had been recovered. Assembly of the L. saligna mitochondrial genome with PacBio reads The L. sativa PacBio mitochondrial genome assembly was used to select PacBio reads from the L. saligna data that contained mitochondrial fragments. As before, the amount of reads was adjusted to ~250x coverage (~6,000 reads) for input into the first round of assembly with the CLC Genomics Workbench, following the same approach as used for L. sativa. The PacBio reads were slightly shorter for L. saligna, requiring more total reads to achieve ~250X coverage. Contig naming convention Upon assembly and preliminary comparison of contigs between species, the following convention was chosen to assign IDs to simplify downstream data analysis and interpretation. Contig naming started with K to avoid the first letters of the alphabet that could be used to designate figure panels on data presentation. Letters X and Y were skipped because of confusion with human chromosomes and letter O because of visual similarity with 0 (zero). L. sativa contigs M and N are frequently adjacent on genome isoforms, so alphabetical ordering keeps them together in data tables. Two letters were assigned to the longest L. saligna contig UV because it can be decomposed into two distinct sequences that are similar to L. sativa U and W. At the same time, L. sativa W is different from L. saligna V by having extra segments and thus was “wider.” Alphabetical sorting keeps them (U + V) in the right order. For one repeat, the letter R was assigned, and for the shortest one T (tiny). L. saligna S contig stands for “special” (integrated linear plasmid). These simple mnemonic rules were helpful for navigating the diverse data in the process of genome assembly because so many steps had to be visualized and resolved manually. This letter coding style was particularly useful in analysis of data using MS Excel spreadsheets and conditional formatting that allowed distinct coloring for different contigs and their segments. Inference of mitochondrial secondary building blocks using raw PacBio reads Contiguity of the assemblies (sequential order of contigs or primary structural units) was determined based on the analysis of full length raw PacBio reads. This led to the set of secondary building blocks that were used to infer the comprehensive set of mitochondrial genome isoforms. For the reverse PacBio read mapping approach, a sliding window of 2 kb with a step size of 1 kb was applied to each contig of the assembly (S9A Fig) to generate a library of overlapping fragments. Overlapping tiling fragments were mapped/aligned back to raw mitochondrial PacBio reads. A BLAST-N search was used to map tiling fragments to PacBio reads using the relaxed parameters to allow longer gaps (compared to default parameters) over contiguous alignment (-V T -F F -e 1e-60 -y 50 -X 75 -Z 500). A custom BLAST-N parser (https://github.com/alex-kozik/atgc-tools/blob/wiki/tcl_blast_parser.md) was used to generate a tab-delimited table that was used for sequential order analysis of primary structural units in PacBio reads. PacBio reads containing at least 10 distinct tiling fragments of the mitochondrial non-redundant contiguous genome sequence for L. sativa and 12 distinct tiling fragments of the mitochondrial non-redundant contiguous genome sequence for L. saligna were selected and compiled into maximally informative sets. Maximally informative sets resulted in 4,046 and 11,009 PacBio reads for L. sativa and L. saligna, respectively. These sets were used for the analysis of junctions between different mitochondrial genome primary structural units. The distributions of junctions between different units were first analyzed visually with MS Excel (S9B Fig). Regular expressions with egrep were utilized to quantify the junctions from tables in tab-delimited format. Re-assembly and validation of junctions Regions of PacBio reads specific to individual junctions (as identified in each set of secondary building blocks, see above) were compiled as separate, small subsets and reassembled to recover the precise sequence at each junction. For each junction, PacBio reads containing termini of two contigs were trimmed to 4–6 kb so the junction region was located in the middle of the trimmed reads. Subsets of the trimmed PacBio reads were assembled with CLC for each junction individually. The alignment of mitochondrial contigs to reassembled junction regions allowed for the determination of precise sequences between mitochondrial contigs. Error correction using Illumina reads Paired-end Illumina reads for L. sativa and L. saligna were used for error correction of the assembled PacBio contigs. Illumina reads from genomic libraries were mapped/aligned to the PacBio contigs to correct homopolymer errors and consensus sequences were selected based on what was supported by the majority of the Illumina reads. Mitochondrial contig stoichiometry The stoichiometry of mitochondrial contigs was determined by PacBio read coverage. A sliding window of 2 kb with a step size of 0.5 kb was applied to each contig of the assembly to generate a library of overlapping fragments. BLASTN was performed on each 2 kb tiling fragment versus the set of the most informative PacBio reads. To calculate the coverage, the number of PacBio reads for each tiling fragment was counted if the alignments were longer than 1.8 kb with an identity of 80% or better. Since all non-tandem repeats of medium size are shorter than 1.2 kb, alignments due to repeated sequences were not included. The total number of alignments per tiling fragment were plotted (this reflects coverage per segment across contig) and compared to each other. The coverage for each junction was calculated for the L. sativa and L. saligna mitochondrial genomes following methods similar to those above. The junction-spanning assemblies, consisting of 2 kb fragments with 1 kb of sequence on either side of the junction, were used as BLAST queries versus the PacBio reads to verify the assembly and provide numerical values for the fractionation of each junction in the pool of PacBio reads. Each segment that corresponded to a particular junction generated one of two types of alignment. The first type was a completely uninterrupted alignment that was specific for a particular junction. The second type was a fragmented alignment that corresponded to junction components in different locations of the mitochondrial genome. The alignments of different lengths were plotted for each BLAST hit (sorted by alignment length). The transition from complete (1.8 kb) to shorter segmented alignments indicated the fraction of each junction type in the pool of PacBio reads. Isoform inference and quantification Mitochondrial contigs generated with the CLC assembler were considered as primary units after polishing. Junctions between primary structural units inferred by reverse PacBio read mapping and analysis (see above) determined their sequential order. Visual inspection of secondary building blocks and analysis of their possible inter-connections taking into account their stoichiometries resulted in the set of mitochondrial genome isoforms. In this paper, we define contigs as primary structural units; secondary building blocks are combinations of several primary structural units (as detected by reverse read mapping); isoforms are combinations of several secondary building blocks with stoichiometry taken into account. Stoichiometry data were derived from contig coverage with PacBio and Illumina reads. Validation of the assemblies with Illumina mate-pair libraries For L. sativa and L. serriola, we used Illumina mate-pair libraries to validate structural accuracy of the PacBio assembly and inferred mitochondrial isoforms. Illumina mate-pair reads from genomic libraries of L. sativa with insert sizes of 2.5 and 10 kb were aligned to the assembled isoforms using BWA [91]. The distances between aligned mate-pair reads were calculated based on their positions/coordinates on the mitochondrial reference sequence. Pairs with distances greater than 1 kb were selected and compiled into subsets of equal size (320,000 pairs) for each library. Two dimensional distance heat plots for 1 kb bins were generated using modified Python scripts (https://github.com/alex-kozik/atgc-uni-cluster) and the Pixelirator visualization program (http://cgpdb.ucdavis.edu/data_pixelirator/). Higher order arrangement analysis with Hi-C libraries Read pairs from Hi-C libraries were filtered to eliminate ligation artifacts that do not reflect true long-distance interactions. The use of the Mbo-I restriction enzyme for fragmentation prior to ligation creates GATC-GATC dimer sites upon ligation of the Mbo-I digested termini, which do not exist in the original mitochondrial genome. Reads with GATC-GATC sites were selected and used for the long distance interaction analysis. Each read with a GATC-GATC site was split into pairs of sub-reads and each read of a split pair was mapped separately to the mitochondrial reference. The distances were calculated as in the mate read analysis. A subset of 640,000 pairs with distances greater than 1 kb was randomly selected for the distance analysis and visualization. Repeat analysis Repeated sequences were found as described previously [25] using BLAST-N with a word size of 50, ungapped, no masking, reward +1, penalty -20, and e-value 1,000. These parameters identified nearly identical repeats, indicating either a recent duplication, or recent homologous recombination and gene conversion of the two copies. Less similar repeats were presumed to have mutated and drifted without recent gene conversion, indicating that they are no longer engaging in productive homology-based events. Detection of rare recombination events between repeats less than 1 kb For the detection of rare recombination events between short repeats, all available mate pair reads of insert size 5 and 10 kb for L. sativa and 10 kb for L. serriola were used. The mitochondrial fraction of these libraries had ~1 million pairs per library for L. sativa and ~0.7 million for L. serriola. Mate-pair reads were aligned to the genomic sequences using BWA [91], the same parameters as above, and masking the second copy of the large repeats in the reference genome. Two-dimensional distance heat plots for 1 kb bins were generated as described above. Recombination events between short repeats resulted in distinct patterns of double short diagonals that are separated from the main diagonal. Variable intensity of the short diagonals was an indication of recombination frequency for particular repeats. Numerical values were analyzed in an MS Excel table as shown in S5D Fig. Gene content, analysis, and mitochondrial genome annotation The mitochondrial genome was annotated for expected mitochondrial features using the Mitofy web server (http://dogma.ccbb.utexas.edu/mitofy/). In order to find coding sequences missed by Mitofy, all open reading frames were manually examined. The two different junction open reading frames for atp1 and rps4 were translated and compared to other plant mitochondrial atp1 and rps4 protein sequences using BLAST-P (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Data files for the GenBank submission were compiled using GB2sequin [92]. Sequence analysis with MUMmer Mitochondrial genome sequences or their segments were compared with each other using MUMmer4 [93]. Alignments in MUMmer delta format were generated with nucmer utility with options ‘—maxmatch—nosimplify.’ SVG plots were created with the mummerplot program and custom edits of corresponding input files for gnuplot (http://www.gnuplot.info/). Leucaena trichandra dataset The L. trichandra dataset [74] was used to validate our approach to mitochondrial genome assembly with a different species. L. trichandra mitochondrial PacBio reads (SRX2719625), Illumina 4 kb mate-pair libraries (SRX2719623 and SRX2719624), and the putative autonomous mitochondrial DNA element (MH717174.1) were reanalyzed as described above and compared to the published L. trichandra mitochondrial genome assembly (MH717173.1). Isolation of mitochondria and preparation for in-gel fluorescence microscopy and pulsed-field gel electrophoresis Whole seven-day-old seedlings (roots and shoots) of L. sativa cv. Salinas grown in the dark under sterile conditions at 15°C were harvested and 5–10 g of plant material was flash frozen in liquid nitrogen. The frozen tissue was ground to a fine powder using a mortar and pestle, then ground again after adding 5 mL of High Salt Buffer (HSB: 1.25 M NaCl, 40 mM HEPES pH 7.6, 2 mM EDTA pH 8, 0.1% Bovine Serum Albumin (BSA), 0.1% 2-mercaptoethanol) until the frozen slurry melted. The liquid homogenate was filtered through 1 layer of Miracloth, then centrifuged at 4°C at 3,000 x g to pellet chloroplasts and nuclei. The supernatant was transferred to a fresh set of tubes and centrifuged at 4°C at 20,000 x g to pellet mitochondria. The mitochondrial pellet was resuspended in Liverwort Dilution Buffer (LDB; 0.4 M sorbitol, 1 mM EDTA, 0.1% BSA, 20 mM HEPES-KOH, pH 7.5) at half the volume of HSB used for the initial grinding and centrifuged at 4°C at 3,000 x g to again remove chloroplasts and nuclei by pelleting. The supernatant was transferred to a fresh set of tubes and centrifuged at 4°C at 20,000 x g to pellet mitochondria. The mitochondrial pellet was resuspended in half the original volume of LDB and the 3,000 x g centrifugation, transfer of supernatant and 20,000 x g centrifugation were repeated. The mitochondrial pellet was resuspended using 2 μL of LDB per μL of pelleted mitochondria. For PFGE, low melt point agarose (LMPA) and sorbitol were added to the suspension to achieve final concentrations of 0.7% LMPA and 0.41 M sorbitol before pipetting into an agarose plug mold, which was allowed to cool for 20 min at 4°C. For in-gel fluorescence microscopy, agarose plugs of the mitochondria were prepared as for PFGE, but the mitochondria were first diluted 1:20, 1:100, or 1:200 in LDB before embedding. Samples for PFGE were run on a 1.5% agarose gel in 0.5X Tris-Borate-EDTA (TBE) buffer at 5 V/cm with a pulse time of 30 s for 24 hours and stained in 3X Gel Red (Biotium) before imaging. For in-gel fluorescence microscopy, ¼ of an agarose plug was soaked in BET solution (3% 2-mercaptoethanol, 0.1 μg/mL EtBr, 1X TBE) for 30 mins. The BET solution was removed and the sample was soaked in fresh BET before slide preparation. Alternatively, ¼ of an agarose plug was soaked in GET solution (3% 2-mercaptoethanol, 0.5x QuantiFluor dye (Promega), 1X TBE) for 30 mins. To prepare slides, ½ of the stained agarose plug was placed on top of a glass slide on a heat block at 60–65°C and mixed with 15 μL ABET (1% LMPA in BET) and a coverslip was placed on top. When the agarose was completely melted underneath the coverslip, the slides were removed from the heat block and sealed with nail polish before microscopy. To measure the circular molecules, the images were analyzed using Fiji software (https://imagej.net/Fiji). The path along each entire circular molecule was measured three times and we used the average of these three measurements.

Acknowledgments We thank Luca Comai, Arnold Bendich, and Jeff Palmer for helpful comments on the manuscript and Elizabeth Georgian for editorial assistance. We are also grateful to Pauline Sanders for supplying and maintaining seed stocks and Huaqin Xu for assistance with data submission to NCBI GenBank.