Significance The fundamental biological functions of a living cell are stored within the DNA sequence of its genome. Classical genetic approaches dissect the functioning of biological systems by analyzing individual genes, yet uncovering the essential gene set of an organism has remained very challenging. It is argued that the rewriting of entire genomes through the process of chemical synthesis provides a powerful and complementary research concept to understand how essential functions are programed into genomes.

Abstract Understanding how to program biological functions into artificial DNA sequences remains a key challenge in synthetic genomics. Here, we report the chemical synthesis and testing of Caulobacter ethensis-2.0 (C. eth-2.0), a rewritten bacterial genome composed of the most fundamental functions of a bacterial cell. We rebuilt the essential genome of Caulobacter crescentus through the process of chemical synthesis rewriting and studied the genetic information content at the level of its essential genes. Within the 785,701-bp genome, we used sequence rewriting to reduce the number of encoded genetic features from 6,290 to 799. Overall, we introduced 133,313 base substitutions, resulting in the rewriting of 123,562 codons. We tested the biological functionality of the genome design in C. crescentus by transposon mutagenesis. Our analysis revealed that 432 essential genes of C. eth-2.0, corresponding to 81.5% of the design, are equal in functionality to natural genes. These findings suggest that neither changing mRNA structure nor changing the codon context has a significant influence on the biological functionality of synthetic genomes. Discovery of 98 genes that lost their function identified essential genes with incorrect annotation, including a limited set of 27 genes where we uncovered noncoding control features embedded within protein-coding sequences. In sum, our results highlight the promise of chemical synthesis rewriting to decode fundamental genome functions and its utility toward the design of improved organisms for industrial purposes and health benefits.

In the early 2000s, the template-independent chemical synthesis of the 7.4-kb polio virus (1) and 5.4-kb bacteriophage phiX174 genomes (2) using oligonucleotides ushered in the field of synthetic genomics. The initial progress on moderately sized viral genomes spurred whole-genome synthesis of more complex organisms. In 2008 and 2010, the Craig Venter Institute reported the chemical synthesis of genome replicas from Mycoplasma genitalium (583 kb) and Mycoplasma mycoides (1.1 Mb) (3, 4), respectively. These efforts expanded the chemical synthesis scale to megabases and improved in vitro DNA assembly strategies and genome transplantation methods. However, the work also highlighted the challenges of whole-genome synthesis, as a single missense mutation within the dnaA gene initially prevented boot up. To gain insights into a minimal gene set for cellular life, the teams of Craig Venter built a 473-gene reduced version of the M. mycoides genome (5).

Along with these accomplishments, the concept of whole-genome synthesis and genome minimization has been expanded toward the rebuilding of all 16 chromosomes of Saccharomyces cerevisiae, driven by an international consortium composed of 21 institutions. In 2014, the consortium reported synthesis of the artificial yeast chromosome synIII (273 kb) (6). Subsequently, five additional chromosomes (7–11) were generated, and as of 2018, roughly 40% of the entire yeast genome has been covered. The redesigned chromosomes removed repetitive sequences (tRNA genes, introns, and transposons) to increase targeting fidelity during stepwise homologous replacement and included seeded loxP sites to permit iterative genome reduction on completion of the yeast chromosomes. In the beginning of the yeast 2.0 synthesis project, CRISPR had not yet entered the stage, but today, it offers an alternative approach for progressive genome reduction.

The redundancy of the genetic code, in which the same amino acid is defined by multiple synonymous codons, offers the possibility to erase and reassign codons throughout an entire genome. Such rewriting efforts are used to engineer organisms with altered genetic codes and free up codons for incorporation of artificial amino acids, which do not occur within natural organisms. To date, genome-wide rewriting efforts have been primarily reported for viral genomes (12–14), and a few have focused on the rewriting of microbial genomes of Escherichia coli, Salmonella, and S. cerevisiae. Using oligo-mediated recombineering (15), all 321 instances of the TAG stop codon in E. coli were altered to TAA, demonstrating the dispensability of a stop codon within the genetic code (16). In an extension of this approach, rewriting of 13 sense codons across a set of ribosomal genes (17) and genome-wide rewriting of 123 instances of the arginine rare codons AGA and AGG (18) were accomplished in E. coli. These studies unearthed unexpected recalcitrant synonymous rewriting events that occurred primarily in the vicinity of 5′ and 3′ termini of protein-coding sequences (18, 19). Recently, to investigate the impact of more complex rewriting schemes, de novo DNA synthesis methods have been used for the rewriting of gene cassettes in conjunction with genomic replacement strategies (15, 20). Ongoing de novo synthesis toward a 57-codon E. coli genome was reported (21), with the complete genome synthesis underway.
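The idea of synonymous rewriting described above can be illustrated with a minimal Python sketch. It is not the design software used in this work: the function names (`translate`, `synonymous_codons`, `recode`), the random choice among synonymous codons, and the decision to leave the first codon untouched are all assumptions made for illustration. The codon table is the standard genetic code; stop codons are treated as synonymous with each other, so a TAG stop may be swapped for TAA or TGA.

```python
import random

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate an in-frame coding sequence; '*' marks a stop codon."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def synonymous_codons(codon):
    """Return all other codons that encode the same amino acid (or stop)."""
    amino_acid = CODON_TABLE[codon]
    return [c for c, a in CODON_TABLE.items() if a == amino_acid and c != codon]


def recode(cds, probability, rng):
    """Swap each codon for a random synonymous codon with the given probability.

    Assumption of this sketch: the first codon is kept unchanged so that
    the start codon is never altered.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    rewritten = [codons[0]]
    for codon in codons[1:]:
        alternatives = synonymous_codons(codon)
        if alternatives and rng.random() < probability:
            codon = rng.choice(alternatives)
        rewritten.append(codon)
    return "".join(rewritten)


if __name__ == "__main__":
    original = "ATGGCTCGTAAAGGTTAG"  # hypothetical toy gene: M A R K G stop
    rewritten = recode(original, probability=0.56, rng=random.Random(0))
    # The DNA sequence may differ, but the encoded protein must not.
    print(translate(original) == translate(rewritten))
```

The key property is that the nucleotide sequence changes while the translated amino acid sequence stays the same, which is what allows rewriting to probe whether anything other than the protein sequence is encoded in a gene.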

Despite this progress, the underlying rewriting design principles have remained ill defined, and debugging has remained challenging (17, 19). It has been speculated that presence of embedded transcriptional and translational control signals at the termini of coding sequences (CDSs) as well as imprecise genome annotations are the underlying cause. We hypothesized that massive synonymous rewriting in conjunction with a systematic investigation of error causes will shed light onto the general sequence design principles of how biological functions are programed into genomes. However, while some progress has been made to study recoding schemes using individual genes and gene clusters (21), the field currently lacks a broadly applicable high-throughput error diagnosis approach to probe the rewriting of entire genomes.

Here, we report the chemical synthesis of Caulobacter ethensis-2.0 (C. eth-2.0), a minimized bacterial genome composed of the most fundamental functions of a bacterial cell. We present a broadly applicable design–build–test approach to program the most fundamental functions of a cell into a customized genome sequence. By rebuilding the essential genome of Caulobacter crescentus (Caulobacter hereafter) through the process of chemical synthesis rewriting, we studied the genetic information content at the level of its essential genes.

Discussion C. crescentus has emerged as an important model organism for understanding the regulation of the bacterial cell cycle (25, 51, 52). A notable feature of Caulobacter is that the regulatory events that control polar differentiation and cell cycle progression are highly integrated and occur in a temporally restricted order (53). The advent of genomic technologies has enabled global analyses that have revolutionized our understanding of the core genetic networks that control the Caulobacter life cycle (26–29). In recent years, many components of the regulatory circuit have been identified, and simulation of the circuitry has been reported (25, 54). More recent experimental work using transposon sequencing has shown that 12% of the Caulobacter genome is essential for survival under laboratory conditions (30). The identified set of essential sequences included not only protein-coding sequences but also regulatory regions and noncoding elements that collectively store the genetic information necessary to run a living cell. Of the individual DNA regions identified as essential, 91 were noncoding regions of unknown function, and 49 were genes presumably coding for hypothetical proteins of unknown function. Although classical genetic approaches dissect the functioning of biological systems by analyzing individual native genes, uncovering the function of essential genes has remained very challenging. Herein, we show that the rewriting of entire genomes through the process of chemical synthesis provides a powerful and complementary research concept to understand how essential functions are programed into genomes. Contemporary synthetic genome projects (3, 5, 8) have largely maintained natural genome sequences, implementing only modest design changes to increase the likelihood of functionality. 
However, conservative genome design misses a key opportunity of chemical DNA synthesis: the rewriting of DNA to advance our understanding of how fundamental biological functions are encoded within genomes. Indeed, the construction of synthetic autonomous bacteria, such as M. mycoides strain JCVI-syn3.0, made up of 473 genes within a 531-kb genome (5), resulted in a replicative cell. However, that genome also encompasses 149 genes with unknown functions (84 labeled as “generic,” and 65 labeled as “unknowns”) (55). This corresponds to nearly one-third of its gene set. While these studies were highly valuable to experimentally determine the core set of genes for an independently replicating cell, they did not probe the genetic information content of its essential genes. By rebuilding the essential genome of Caulobacter through the process of chemical synthesis rewriting, we assessed the essential genetic information content of a bacterial cell on the level of its protein-coding sequences. Within the 785,701-bp genome of C. eth-2.0, we used sequence rewriting to reduce the number of genetic features present within protein-coding sequences from 6,290 to 799. Overall, we introduced 133,313 base substitutions, resulting in the synonymous rewriting of 123,562 codons. We speculated that synonymous rewriting of protein-coding sequences maintains the encoded amino acid sequences but likely erases additional genetic information layers. These include alternative reading frames as well as hidden control elements embedded within protein-coding sequences of essential genes. Rewriting of 56% of all codons resulted in complete rewriting of the essential Caulobacter transcriptome. Despite incorporating such drastic changes at the level of mRNA, our functionality analysis revealed that 432 of the transcribed essential genes of C. eth-2.0, corresponding to 81.5% of all rewritten essential genes, are equal in functionality to their natural counterparts in supporting viability. 
This result suggests that, in most essential genes, the primary mRNA sequence, the secondary structure, or the codon context has no significant influence on biological functionality. This finding is surprising given that previous studies on individual genes reported that codon translation in vivo is controlled by many factors, including codon context (56). Furthermore, our findings suggest that the vast majority of the probed ORFs exclusively encode proteins and that other layers of genetic control do not seem to play a significant role. Among the 134 enzyme-encoding genes that make up the metabolic core network of C. eth-2.0, the proportion of functional genes even exceeds 90%, suggesting that rewritten biosynthetic pathways retain their functionality in most cases. A possible explanation for the high proportion of functional metabolic genes might be that regulation of essential metabolic functions occurs by allosteric interactions at the level of enzymes rather than at the level of gene expression. In addition to 432 functional rewritten genes, our study precisely mapped 98 genes that lost functionality on synonymous rewriting as detected by our transposon-based functionality assessment. Since retaining solely the protein-coding sequences of these genes is not sufficient for their functionality, it is reasonable to conclude that these genes are misannotated or contain hitherto unknown essential genetic elements embedded within their CDS. Alternatively, it is also possible that a subset of these genes encodes RNA rather than protein functions. Taken together, our genome rewriting approach can be used to experimentally validate the annotation fidelity of entire genomes. Altogether, the identified set of 98 nonfunctional genes corresponds to less than 20% of the essential genome of C. eth-2.0 and precisely revealed where we currently have gaps in our knowledge that persisted despite previous omics-informed genome reannotation efforts. 
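The reported proportions can be checked with a few lines of Python. This is only a consistency check, and it rests on one assumption not stated explicitly in the text: that the 432 functional and 98 nonfunctional genes together make up the full set of assessed rewritten essential genes. The function name `functional_fraction` is hypothetical.

```python
def functional_fraction(functional, nonfunctional):
    """Fraction of assessed genes that retained function after rewriting.

    Assumption: functional and nonfunctional genes together form the
    complete set of assessed genes.
    """
    return functional / (functional + nonfunctional)


if __name__ == "__main__":
    fraction = functional_fraction(432, 98)
    print(f"functional:    {fraction:.1%}")      # prints "functional:    81.5%"
    print(f"nonfunctional: {1 - fraction:.1%}")  # prints "nonfunctional: 18.5%"
```

Under that assumption, 432 of 530 genes gives 81.5%, and the remaining 98 genes give 18.5%, in line with the "less than 20%" stated above.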
In the future, it will be interesting to unravel why rewriting renders particular genes nonfunctional. These studies will shed light onto hitherto unknown transcriptional and translational control layers embedded within protein-coding sequences that are of fundamental importance for proper gene functioning. Targeted repair of identified nonfunctional C. eth-2.0 genes, as exemplified within the subset of the four faulty cell division genes murG, murC, ftsQ, and ftsZ, will lead to the discovery of genetic features, such as the essential attenuator element identified upstream of the ftsZ gene, the function of which is currently unknown. We acknowledge that the 98 identified nonfunctional genes are still poorly understood, yet our findings on C. eth-2.0 serve as an excellent starting point to close current knowledge gaps in essential genome functions toward rational construction of a synthetic organism with a fully defined genetic blueprint. On the level of de novo DNA synthesis, we herein demonstrate how chemical synthesis rewriting facilitates the genome synthesis process. To simplify the entire genome build process, we used sequence design algorithms (31, 32) and collectively introduced 10,172 base substitutions to remove 5,668 DNA synthesis constraints, including 1,233 repeats, 93 homopolymeric stretches, and 4,342 regions of high GC content. Successful low-cost synthesis and subsequent higher-order assembly of C. eth-2.0 into the complete chromosome exemplify the utility of our approach to rapidly produce designer genomes. Our results highlight the promise of chemical synthesis rewriting of entire genomes to understand how the most fundamental functions of a cell are programed into DNA. On the systems engineering level, our design–build–test approach enables us to harness massive design flexibility to produce rewritten genomes that are customized in sequence while maintaining their biological functionality. 
On the level of genome synthesis, our findings also highlight how chemical synthesis facilitates rewriting of biological information into DNA sequences that can be physically manufactured in a highly reliable manner, thereby reducing costs and increasing effectiveness of the genome build process. In sum, our results highlight the promise of chemical synthesis rewriting to decode fundamental genome functions and its utility toward design of improved organisms for industrial purposes and health benefits.
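Two of the DNA synthesis constraints named in the Discussion, homopolymeric stretches and regions of high GC content, can be located with a simple scan. The following Python sketch is illustrative only and is not the published sequence design pipeline: the function names and the thresholds (minimum run length, window size, GC cutoff) are hypothetical, since the text does not state the values used. Repeat detection is omitted for brevity.

```python
import re


def find_homopolymers(seq, min_length=8):
    """Return (start, end, base) for every run of a single base that is
    at least min_length long. min_length is a hypothetical threshold."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]


def gc_fraction(seq):
    """Fraction of G and C bases in a nonempty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def high_gc_windows(seq, window=100, threshold=0.8):
    """Return start positions of non-overlapping windows whose GC fraction
    exceeds threshold. Window size and threshold are hypothetical."""
    starts = []
    for start in range(0, len(seq) - window + 1, window):
        if gc_fraction(seq[start:start + window]) > threshold:
            starts.append(start)
    return starts


if __name__ == "__main__":
    # Toy sequence: mixed region, a 10-base A run, a GC-only region, an AT-only region.
    toy = "ATGC" * 5 + "A" * 10 + "GC" * 50 + "ATAT" * 25
    print(find_homopolymers(toy, min_length=8))
    print(high_gc_windows(toy, window=50, threshold=0.8))
```

As a check on the reported totals, the three constraint classes listed above add up as stated: 1,233 repeats, 93 homopolymeric stretches, and 4,342 high-GC regions sum to 5,668.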

Materials and Methods Detailed materials and methods are in SI Appendix. The sequence of the C. eth-2.0 genome has been deposited in the NCBI database (GenBank accession no. CP035535).

Design of C. eth-1.0 Genome and Sequence Rewriting into C. eth-2.0. To streamline the C. eth-1.0 design (30) for DNA synthesis, the previously reported Genome Calligrapher algorithm and sequence design pipeline (31) were applied at a codon recoding probability of 0.56. The streamlined C. eth-2.0 design contains a low number of both synthesis constraints and unnecessary genetic features. To enable the retrosynthetic assembly route, C. eth-2.0 was partitioned into 3- to 4-kb DNA blocks using the previously published Genome Partitioner algorithm (32).

Synthesis and Hierarchical Assembly of the C. eth-2.0 Genome. The partitioned 3- to 4-kb DNA blocks for the hierarchical assembly of C. eth-2.0 were ordered from two commercial suppliers of low-cost de novo DNA synthesis. The blocks were assembled into 20-kb segments and subsequently into 40- to 60-kb megasegments using yeast homologous gap repair. To verify the assemblies, a junction-amplifying PCR was conducted. To assemble the megasegments into the 785-kb C. eth-2.0 genome, homologous gap repair was carried out in the newly generated S. cerevisiae strain YJV04. To transform the segments into the yeast cells, a spheroplast procedure was applied. The assembly was verified by a junction-amplifying PCR. The correct size of the construct was verified by pulsed-field agarose gel electrophoresis after lysing the yeast cells inside an agarose plug. The sequence of the C. eth-2.0 construct was verified using the Illumina NextSeq and iSeq systems.

Construction of Merosynthetic Caulobacter Test Strains. Sequence-confirmed C. eth-2.0 segments were conjugated from E. coli S17-1 into Caulobacter NA1000 to generate a panel of 37 merosynthetic test strains. The occurrence of toxic C. eth-2.0 genes was measured by the conjugation frequency of the different segments. To pinpoint the toxic genes, the C. eth-2.0 segments were sequenced on an Illumina system after the boot up in Caulobacter. Using the sequencing data, the mutations within the evolved C. eth-2.0 segments were analyzed, yielding the precise coordinates of toxic genes.

Fault Diagnosis of C. eth-2.0 by Transposon Sequencing. To benchmark the functionality of the C. eth-2.0 genes, transposon sequencing was applied (30). The analysis was conducted using hypersaturated transposon libraries and an Illumina system. The sequencing data were mapped onto the original Caulobacter genome, resulting in a set of all functional C. eth-2.0 genes. After analysis of the nonfunctional genes, the sequences were repaired using standard cloning techniques. To test the repaired C. eth-2.0 genes, a β-galactosidase reporter assay was conducted.
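The partitioning of the genome into 3- to 4-kb DNA blocks described in the Methods can be sketched as follows. This is not the published Genome Partitioner algorithm (32); it is a minimal illustration that splits a coordinate range into evenly sized blocks. The function name `partition`, the fixed overlap between neighboring blocks (assumed here because assembly by homologous gap repair relies on shared sequence between adjacent fragments), the overlap length of 100 bp, and the treatment of the genome as a linear rather than circular sequence are all assumptions.

```python
import math


def partition(genome_length, max_block=4000, overlap=100):
    """Split [0, genome_length) into evenly sized blocks of at most
    max_block bp, where each block shares `overlap` bp with the next.

    Hypothetical sketch: the overlap length and the linear layout are
    assumptions, not values taken from the paper.
    """
    max_step = max_block - overlap
    n_blocks = math.ceil((genome_length - overlap) / max_step)
    step = (genome_length - overlap) / n_blocks
    blocks = []
    for i in range(n_blocks):
        start = round(i * step)
        end = min(genome_length, round((i + 1) * step) + overlap)
        blocks.append((start, end))
    return blocks


if __name__ == "__main__":
    blocks = partition(785_701)
    lengths = [end - start for start, end in blocks]
    print(len(blocks), min(lengths), max(lengths))
```

Spreading the blocks evenly, rather than cutting fixed 4-kb pieces and leaving a short remainder, keeps every block within the 3- to 4-kb range for a genome of this size. A real partitioning tool would also have to choose block boundaries that avoid problematic sequence at the junctions, which this sketch does not attempt.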

Acknowledgments We thank R. Schlapbach and L. Poveda from Zürich Functional Genomics Center (ZFGC) for sequencing support; B. Maier and members from ScopeM for electron microscopy support; S. Nath from the Joint Genome Institute (JGI) for DNA synthesis and sequencing support; F. Rudolf for assistance with yeast marker design; H. Christen for conception of computational algorithms; and Samuel I. Miller, Markus Aebi, and Uwe Sauer for critical comments. This work received institutional support from Community Science Program (CSP) DNA Synthesis Award Grants JGI CSP-1593 (to M.C. and B.C.) and CSP-2840 (to M.C. and B.C.) from the US Department of Energy Joint Genome Institute, Swiss Federal Institute of Technology (ETH) Zürich ETH Research Grant ETH-08 16-1 (to B.C.), and Swiss National Science Foundation Grant 31003A_166476 (to B.C.). The work conducted by the US Department of Energy Joint Genome Institute, a Department of Energy Office of Science User Facility, is supported by Office of Science of the US Department of Energy Contract DE-AC02-05CH11231.

Footnotes Author contributions: M.C. and B.C. designed research; J.E.V., L.D.M., A.W., P.S., Y.B., D.A., F.T., C.E.F.-T., M.v.K., R.G., and S.D. performed research; J.E.V., L.D.M., M.C., and B.C. analyzed data; and J.E.V., M.C., and B.C. wrote the paper.

Conflict of interest statement: Eidgenössische Technische Hochschule holds a patent application (WO2017085249A1) with M.C. and B.C. as inventors that covers functional testing of synthetic genomes. M.C. and B.C. hold shares from Gigabases Switzerland AG.

This article is a PNAS Direct Submission.

Data deposition: The sequence of the C. eth-2.0 genome reported in this paper has been deposited in the National Center for Biotechnology Information database (GenBank accession no. CP035535).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1818259116/-/DCSupplemental.