Abstract Biomolecular information systems offer exciting potential advantages and opportunities to complement conventional semiconductor technologies. Much attention has been paid to information-encoding polymers, but small molecules also play important roles in biochemical information systems. Downstream from DNA, the metabolome is an information-rich molecular system with diverse chemical dimensions which could be harnessed for information storage and processing. As a proof of principle of small-molecule postgenomic data storage, here we demonstrate a workflow for representing abstract data in synthetic mixtures of metabolites. Our approach leverages robotic liquid handling for writing digital information into chemical mixtures, and mass spectrometry for extracting the data. We present several kilobyte-scale image datasets stored in synthetic metabolomes, which can be decoded with accuracy exceeding 99% using multi-mass logistic regression. Cumulatively, >100,000 bits of digital image data was written into metabolomes. These early demonstrations provide insight into some of the benefits and limitations of small-molecule chemical information systems.

Citation: Kennedy E, Arcadia CE, Geiser J, Weber PM, Rose C, Rubenstein BM, et al. (2019) Encoding information in synthetic metabolomes. PLoS ONE 14(7): e0217364. https://doi.org/10.1371/journal.pone.0217364

Editor: Andrew C. Gill, University of Lincoln, UNITED KINGDOM

Received: March 9, 2019; Accepted: May 10, 2019; Published: July 3, 2019

Copyright: © 2019 Kennedy et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Mass spectra from this work may be downloaded from Metabolomics Workbench data repository (study ST001173). Raw data is also available from the Brown Digital Repository (DOI: 10.26300/jwv9-ew20).

Funding: This research was supported by funding from the Defense Advanced Research Projects Agency (DARPA W911NF-18-2-0031) to BMR and JKR. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have submitted a provisional patent application (62/791,504) related to this work.

Introduction The metabolome is the complete set of small molecules found in a biological system [1]. The properties of this set of compounds are an amplified and dynamic measure of an organism’s genome, transcriptome, proteome, and environment [2]. This makes the metabolome an incredibly information-rich system, which displays diverse chemical, structural and biological dimensions [3–5]. Although much remains to be understood, improvements in protocols and efficient mass spectrometry (MS) have enabled metabolomic disease screening and drug discovery [6–12]. These technologies are supported by continually improving statistical tools and databases [13, 14]. As these tools advance, they may also suggest exciting alternative applications for metabolomics. For inspiration, we observe that researchers have mimicked living systems by using DNA [15] for long-term archival information storage [16, 17], building on rapid advances in sequencing technology. Given recent progress in proteomic and metabolic profiling tools [18–21], it is timely to explore if the metabolome can also be used in a complementary way for information representations. Whereas DNA and proteins are often large molecules which exist in small numbers, metabolites are higher in number, smaller in mass, and more structurally and energetically diverse. Like DNA, metabolites are biologically ubiquitous, and their primary pathways and processes are conserved across species [22]. The power of DNA as an information carrier comes from the combinatorial complexity that can exist within one polymer [23]. By contrast, the power of the metabolome is in the diversity of many co-existing molecules which can interact, or be acted upon, in complex combinations [5]. Non-genomic molecular data storage has also been demonstrated using fluorescent dyes on polymer films [24] and rotaxanes [25]. 
Other demonstrations have utilized collections of fluorophores which interact with information-bearing compounds in statistically identifiable ways [26]. However, all of these methods encode information into the state of a single compound at one time. In this paper, we encode abstract binary data into the chemical composition of thousands of spatially arrayed nanoliter volumes (Fig 1a). Each volume (‘spot’) contains a prescribed mixture from a library of purified metabolites: a synthetic metabolome. A key strength of this work is that it can be applied to any chemical library. Metabolites hold particular potential, because they provide access to well-regulated interconversion networks, materials, and databases which could facilitate computational operations on chemical data. The presence or absence of one metabolite in one spot encodes one bit of information. Therefore, the total number of bits stored in one spot is equal to the number of available library elements [27].


Fig 1. Writing and reading data encoded in mixtures of metabolites. (a) Binary image data is mapped onto a set of metabolite mixtures, with each bit determining the presence/absence of one compound in one mixture. For example, a spot mapped to four bits with values [0 1 0 1] may contain the 2nd and 4th metabolite at that location. (b) Small volumes of the mixtures are spotted onto a steel plate and the solvent is evaporated (scale bars: 5 mm). This chemical dataset is analyzed by MALDI mass spectrometry (b, bottom). Using the observed mass spectrum peaks, decisions are made about which metabolites are present. These decisions are assembled from the array of spots to recover the original image. The image shown is the Rhode Island Hope Regiment Colors [28]. https://doi.org/10.1371/journal.pone.0217364.g001

We recover the encoded data from metabolic mixtures using mass spectrometry (Fig 1b). The data acquisition is inherently parallelized, because a single mass spectrum provides information on every compound in a mixture. Noise characterization and logistic regression strategies for recovering the data are presented, along with examples of chemically encoded digital images. Raw error rates <1% are achieved with kilobyte-scale data sets using a simple peak analysis, illustrating the viability of both writing and reading metabolomic information. We use these experimental demonstrations to consider the benefits and limitations of encoding data into a biochemical medium in which interactions and interconversions can occur.

Materials and methods Chemical library preparation Reagent grade samples of 36 distinct metabolic compounds (Table A in S1 File) were diluted in dimethyl sulfoxide (DMSO, anhydrous), each to a nominal concentration of 25 mM. Some metabolites were initially dissolved in an alternative solvent (de-ionized water with or without 0.5 M or 1 M hydrochloric acid) to facilitate solvation in DMSO. 10 μL of each compound was aliquoted into a 384-well microplate (Labcyte 384LDV).

Data mixture preparation The chemical data mixtures were prepared on a 76 × 120 mm² stainless steel MALDI plate. An acoustic liquid handler (Labcyte Echo 550) was employed to transfer the compounds from the library wellplate onto the MALDI plate. The nominal droplet transfer volume was 2.5 nL, but to reduce variability, we typically used 2 droplets (5 nL) per compound. The destinations of the droplets were programmed to match a standard 2.25 mm pitch 1536-spot (32 × 48) target. After spotting the compounds to the MALDI plate, a MALDI matrix material was added to each location. We selected 9-Aminoacridine for its compatibility with metabolite libraries, its low background in the small molecule regime, and its support for both positive and negative ion modes. The MALDI plate was left to dry and crystallize overnight (∼10 hours). Once dried, the plate can be stored in a humidity controlled cabinet or analyzed by MALDI-FT-ICR mass spectrometry.

Mass analysis of data plates A Fourier-transform ion cyclotron resonance (FT-ICR) mass spectrometer (SolariX 7T, Bruker) was used to analyze the crystallized metabolite data mixtures. The exact resolution is a function of the measurement time allocated per spectrum. For these experiments, we typically used 0.5–1 s, yielding a resolution of <0.001 Da. The instrument was run in MALDI mode and configured to serially measure the mass spectrum of each mixture on the 32 × 48 grid. Acquisition for a full plate takes <2 hours.
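As a consistency check on the plate layout quoted above, the following Python sketch computes spot coordinates and a few derived quantities. It is illustrative only: the row-major spot ordering, the function name `spot_position`, and the idea of dispensing all 36 library compounds into a single spot are assumptions, not details stated in the text.

```python
from typing import Tuple

# Values quoted in the methods text.
PITCH_MM = 2.25
ROWS, COLS = 32, 48
PLATE_MM = (76.0, 120.0)
DROPLET_NL = 2.5
DROPLETS_PER_COMPOUND = 2
LIBRARY_SIZE = 36


def spot_position(index: int) -> Tuple[float, float]:
    """Return the (x, y) offset in mm of a spot relative to the first spot.

    Row-major ordering is an assumption made for this sketch.
    """
    if not 0 <= index < ROWS * COLS:
        raise IndexError("spot index out of range")
    row, col = divmod(index, COLS)
    return (col * PITCH_MM, row * PITCH_MM)


n_spots = ROWS * COLS  # number of addressable spots on the target

# Centre-to-centre extent of the grid along the short and long plate axes.
grid_span_mm = ((ROWS - 1) * PITCH_MM, (COLS - 1) * PITCH_MM)

# Upper bounds that would apply only if every library compound were
# available at every spot (an assumption, not a reported configuration).
max_volume_per_spot_nl = LIBRARY_SIZE * DROPLETS_PER_COMPOUND * DROPLET_NL
max_bits_per_plate = n_spots * LIBRARY_SIZE
```

Under these assumptions the grid spans 69.75 mm by 105.75 mm between spot centres, which fits within the 76 × 120 mm plate, and a spot receiving every compound would hold at most 180 nL of library solution before the matrix is added.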
To read the encoded data from the mass spectra, the probability of a metabolite being present is modeled as a combination of multiple predictor masses. A multinomial logistic regression takes the exponential of an offset plus a weighted sum of the signal-to-noise ratios (SNRs) of all identifying masses, where each SNR is multiplied by a trained weight coefficient. A limited-memory BFGS algorithm was used to fit the regression and obtain accuracy scores, given an input of the n best peaks per metabolite. This process was iterated for all metabolome constituents.
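A minimal sketch of this kind of decoder is given below, for one metabolite with two identifying peaks. It is not the authors' implementation: the optimizer is plain batch gradient descent substituted for the limited-memory BFGS algorithm named in the text (to keep the example dependency-free), the SNR values are invented for illustration, and the names `presence_probability` and `fit_logistic` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def presence_probability(snrs: Sequence[float], weights: Sequence[float],
                         offset: float) -> float:
    """P(metabolite present) = sigmoid(offset + sum_i weight_i * SNR_i)."""
    return sigmoid(offset + sum(w * s for w, s in zip(weights, snrs)))


def fit_logistic(samples: List[List[float]], labels: List[int],
                 learning_rate: float = 0.05,
                 epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and offset by batch gradient descent on the log-loss.

    Gradient descent is a stand-in for L-BFGS; both minimise the same loss.
    """
    n, k = len(samples), len(samples[0])
    weights, offset = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for snrs, label in zip(samples, labels):
            err = presence_probability(snrs, weights, offset) - label
            for j in range(k):
                grad_w[j] += err * snrs[j]
            grad_b += err
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        offset -= learning_rate * grad_b / n
    return weights, offset


# Invented training spectra: SNRs at two identifying masses for one metabolite,
# labelled 1 where the metabolite was written into the spot and 0 otherwise.
train_snrs = [[8.0, 5.0], [6.5, 7.0], [9.0, 4.0],
              [0.5, 1.0], [1.2, 0.3], [0.8, 0.9]]
train_bits = [1, 1, 1, 0, 0, 0]

weights, offset = fit_logistic(train_snrs, train_bits)
decoded = [1 if presence_probability(s, weights, offset) > 0.5 else 0
           for s in train_snrs]
```

In practice one such model would be trained per library compound, and each spot's bit vector assembled from the per-compound decisions.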

Discussion One advantage of molecular data storage is its high storage density. To date, demonstrations using DNA have reached about 214 petabytes per gram [32], although this is still orders of magnitude from theoretical limits [33]. For moderate amounts of data, an encoded metabolome written using a large small-molecule library could improve on this number [34], thanks to its increased chemical diversity. Our experiments highlight several limitations and potential benefits that warrant further discussion. Statistically discriminating m/z features were used to classify the metabolite mixtures and recover the data at 98–99.5% accuracy using a simple analysis. Further development can take advantage of the wide range of sophisticated analysis technologies for metabolic profiling, including artificial neural networks, genetic algorithms, and self-organizing maps [35]. The inclusion of these methods, in conjunction with error correcting codes, leaves ample headroom for improved data recovery from more complex mixtures. In terms of data rates, we demonstrated write speeds of 5 bits/sec, and aggregate read speeds of 11 bits/sec. We have performed little optimization of either the read or write times, and as the size of the metabolite library is increased, the MS read speed in particular has significant room to improve. Looking forward, it would be interesting to consider the upper bound on information capacity using all known metabolites (∼10⁵ [14]). Even if only a fraction are stable, detectable, and display unique masses, this conservatively predicts hundreds of bits per spectral acquisition, which could all be read in parallel. As sub-zeptomole MS and nanomolar concentration detection have been available for nearly two decades [36, 37], detection at this level of complexity seems plausible. Improvements in spatial density, and perhaps write speed, could come from reducing the volume and pitch of spots. There are opportunities for high density multilayer printing.
To avoid storage density limits arising from finite transfer volumes, the precise mixture of metabolites associated with one spot can be pre-mixed in one well of an intermediary data plate. Transfer of 2.5 nL from the intermediary plate well to one spot means that hundreds of metabolites can be present in a nanoliter-scale volume on the plate. There is also room to extend this work using larger libraries for higher capacity, or by storing multiple bits per complex, leveraging oligomerization [38]. In terms of density, we elected to use millimeter-scale arrays compatible with commercial instrumentation. Scaling the mixture spots down to diffraction-limited laser spot scales could improve data storage density by 6 orders of magnitude. Theoretically, this could facilitate extension from kilobyte- to gigabyte-scale data sets per plate. However, the true limit of data storage density depends on the available instrumentation. ICR-MS instruments (and other high-resolution mass spectrometers such as orbital traps) have a finite ion capacity per acquisition, so the number of compounds cannot be increased arbitrarily, because ions compete for that capacity. Metabolites with a lower ionization efficiency will go undetected in a large, competitive mixture even though they are present. Therefore, to increase the number of metabolites per spot, future work may need to screen libraries for ionization efficiency. Alternatively, other read strategies (e.g. nanopores [39–41]) could provide higher sensitivity. A likely source of error in more complex mixtures will be interactions between metabolites [5]. However, interspecies networks may also have benefits, such as opportunities for overwriting or transforming data, which hints at possibilities for synthetic metabolomic computation. One recurring challenge in metabolomics is obtaining trustworthy ‘ground truth’ samples. Perhaps by considering metabolomes as more abstract and mutable stores of information, we can develop new tools that allow us to overcome statistical biases, establish ground truths, and tease out subtle interactions and interconversion rates in well-regulated synthetic metabolomes.
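The six-orders-of-magnitude density estimate in this Discussion can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check: the 2.25 µm target pitch is a hypothetical value chosen to represent a roughly laser-spot-limited scale (it does not appear in the text), and the function name `areal_density_gain` is likewise illustrative.

```python
def areal_density_gain(current_pitch_um: float, target_pitch_um: float) -> float:
    """Gain in spots per unit area when the spot pitch shrinks in both dimensions."""
    if current_pitch_um <= 0 or target_pitch_um <= 0:
        raise ValueError("pitches must be positive")
    linear_gain = current_pitch_um / target_pitch_um
    # Areal density scales with the square of the linear gain.
    return linear_gain ** 2


CURRENT_PITCH_UM = 2.25 * 1000  # 2.25 mm pitch from the methods, in micrometres
ASSUMED_TARGET_PITCH_UM = 2.25  # hypothetical micrometre-scale pitch

gain = areal_density_gain(CURRENT_PITCH_UM, ASSUMED_TARGET_PITCH_UM)

# A kilobyte-scale plate (about 1e3 bytes) scaled by this gain.
scaled_bytes = 1e3 * gain
```

A 1000-fold reduction in pitch gives a 10^6-fold gain in spots per unit area, which takes a kilobyte-scale plate to roughly the gigabyte scale, matching the estimate in the text. The calculation ignores the ion-capacity and ionization-efficiency limits discussed above.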

Conclusion ‘Omics’ technologies have grown out of genomics to encompass other complex information-rich systems like the metabolome. It is natural to ask whether there exist complementary opportunities to make use of metabolites’ structural diversity and interactivity. As a proof of principle of postgenomic small-molecule information storage, we have experimentally encoded >100,000 bits of digital images into synthetic metabolomes (Table B in S1 File), and we are confident that this number can be increased significantly in the future. One novel contribution is the demonstration of data storage in a mixture of dissimilar molecules, which can improve information capacity and read times through diversity and parallelism. Perhaps more importantly, this work offers a new perspective on small-molecule chemical information, and it introduces possibilities for synthetic metabolomic computation and establishing metabolic ‘ground truths’ through interrogation of synthetic metabolomes.

Supporting information S1 File. Supporting information. Additional details about library compounds, read error rates, dataset sizes, repeated reads, data plates, error correlations, training cross-validation, and adducts. https://doi.org/10.1371/journal.pone.0217364.s001 (PDF)

Acknowledgments The authors are grateful for support from Sherief Reda, Eunsuk Kim, and Jason Sello. This research was supported by funding from the Defense Advanced Research Projects Agency (DARPA W911NF-18-2-0031, BMR and JKR). The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.