To explore the frontier of next-generation sequencing at sea, we deployed a Life Technologies Ion Torrent Personal Genome Machine (PGM) during the 2013 Southern Line Islands Research Expedition to sequence bacterial isolates and community metagenomes from these remote islands. We installed local bioinformatics capabilities to perform necessary sequence analysis. There were numerous challenges to remote DNA sequencing and analysis, however the end result—genome sequences generated at the remote central Pacific Atolls allowed us to focus our research on questions relevant to the samples we collected. In this paper we describe the sequencing and informatics pipelines established during the expedition, release the data generated during the 2013 Line Islands research expedition, and discuss some of the unexpected challenges in remote sequencing.

There are many challenges to next-generation sequencing in the field, but surmounting those obstacles will allow scientists to pursue new research avenues in exploring the environment using genetic approaches. Some of these challenges may be mitigated by the specific location being explored. For example, many terrestrial locations are accessible to mobile laboratories ( Griffith, 1963 ; Grolla et al., 2011 ) and have access to cellular communications that can provide Internet access. Though low-bandwidth, these connections can be used for analysis of next-generation DNA sequences ( Hoffman & Edwards, 2011 ). In contrast, aside from near-shore venues, Internet communication at remote marine stations typically relies on satellite transmissions, and thus is both limited in bandwidth and extremely expensive. New bioinformatics approaches are reducing the computational complexity of the algorithms in DNA sequence processing, therefore minimizing the resources needed for data analysis. Together with faster and cheaper computational technologies, these improved approaches mitigate the need for Internet-based computations ( Edwards et al., 2012 ; Silva et al., 2014 ).

The use of next-generation sequencing for microbial ecology involves two distinct components. First, the experimental aspects that include sample collection and preparation, DNA extraction, and sequencing, which are routine in the laboratory but challenging in the field. The principle limitation to taking sequencing into the field is the significant infrastructure and resources required for all steps of the sequencing process. In addition to the dedicated hardware required to generate the sequences, much of the hardware, and many of the sample preparation steps, require physically separated laboratory space (to reduce cross-contamination between the samples). Second, the informatics aspects, including processing the raw data into high quality sequences, comparing those sequences to existing databases to generate annotations, and the subsequent data analysis all require high-performance computational resources to generate meaningful biological interpretations ( Meyer et al., 2008 ; Aziz et al., 2012 ; Edwards et al., 2012 ).

DNA sequencing has revolutionized microbial ecology: next-generation sequencing has upended our traditional views of microbial communities, and enabled exploration of the microbial components of many unusual environments. In a typical environmental exploration, samples are collected at the study site, transported back to the laboratory, and analyzed after the scientific team returns from the field. This abstraction of the DNA sequencing from the sampling eliminates the possibility of immediate follow-up studies to explore interesting findings. In our previous studies of environmental microbial and viral components we identified questions and challenges that could have been answered with additional sample collection but awaited a future return to the field before they could be addressed ( Dinsdale et al., 2008 ; Bruce et al., 2012 ).

Methods

Study site This study was performed during an expedition to the Southern Line Islands, central Pacific in October–November 2013 aboard the M/Y Hanse Explorer. Departing from Papeete Harbor, Tahiti, the islands visited were Flint (11.43°S, 151.82°W), Vostok (10.1°S, 152.38°W), Starbuck (5.62°S, 155.93°W), Malden (4.017°S, 154.93°W), and Millennium (previously called Caroline) (9.94°S, 150.21°W), in order, before returning to Papeete (Fig. S1).

Sample collections Water samples were collected above the reef, and in-reef water samples were collected through crevices and against the benthos, both at 10 m depth. All sampling sites were named as either “tent sites” or “black reef”. “Supersucker” (Hass et al., in press) samples were collected from either coral or algal-surfaces with a modified syringe system which uses pre-filtered sterile seawater to flush the targeted microbial community from the respective surface (Fig. 1; Fig. S2). Metagenomics samples were collected from the benthic boundary layer of two sites at Starbuck islands; a newly discovered black reef site (Fig. S1; 5.62653°S, 155.90886°W) and the tent site (5.62891°S, 155.92529°W). The collection was performed using 19 l low density polyethylene collapsible bag (Cole-Parmer, Vernon Hills, IL, USA) connected to a modified bilge pump (Fig. 1) as we have described previously (Dinsdale et al., 2008). Large debris and eukaryotic cells were removed by filtration through 100 µm Nitex mesh and microbial cells were captured by passing the filtrate through the 0.45 µm Sterivex filter (Millipore, Inc., MA, USA). The Sterivex filters were stored at −20 °C until DNA extraction (Hass et al., in press). Figure 1: Workflow for the preparation of bacterial isolates and water samples for genome and metagenome sequencing. We started with sampling the coral reefs using supreme supersuckers, and isolates were cultured on different media. Isolates were assayed using multi-phenotype assay plates and sequenced. This work was done under permit 012/13 from the Environment and Conservation Division of the Republic of Kiribati.

Bacterial isolates collection A sample (100 µl) of each water sample was plated onto Thiosulfate-citrate-bile salts-sucrose (TCBS) agar for the isolation of Vibrio-like spp (Lotz, Tamplin & Rodrick, 1983). Typically >90% of colonies are Vibrio spp., but Pseudoalteromonas, Pseudovibrios, Shewanella and others also grow on TCBS. Therefore, colonies isolated from the TCBS plates were designated as Vibrio-like (F Thompson, pers. comm., 2014). In addition, a sample (100 µl) of each water sample was also plated onto Zobell’s Marine agar for the isolation of heterotrophic marine bacteria (ZoBell, 1941). In the naming scheme of isolates, “V” indicates Vibrio-like spp. and “Z” indicates isolates from Zobell’s Marine agar (Table 1). Single colonies were picked and re-streaked onto new agar plates for colony isolation. Vibrio-like isolates were selected based on the color and size of the colony. Non-vibrio isolates were selected based on the pigmentation (color) and colony morphology. Cells were scraped off the agar plate for DNA extraction, multi-phenotype assay plates (MAP), storage in RNA later, and metabolites extraction using 100% MeOH (Fig. 1). Permit regulations restrict the import and export of live biological material between Kritibati and the United States, and therefore viable bacteria are not available. Identifier Island Site (depth) Potential genus Culture media Barcodea VRT1 Flint Tent (10 m) Vibrio TCBS 1 VRT2 Flint Tent (10 m) Vibrio TCBS 2 VRT3 Flint Tent (10 m) Vibrio TCBS 3 VRT4 Flint Tent (10 m) Vibrio TCBS 4 VAR1 Flint Tent (10 m) Vibrio TCBS 5 VAR2 Flint Tent (10 m) Vibrio TCBS 6 VAR3 Flint Tent (10 m) Vibrio TCBS 7 VAR4 Flint Tent (10 m) Vibrio TCBS 8 VRT5B Flint Tent–20 m Vibrio TCBS 8 VRT11 Vostok Supersucker (10 m - Coral) Vibrio TCBS 7 VRT14 Vostok In-reef (10 m) Vibrio TCBS 10 VRT18 Starbuck Ambient water (10 m) Vibrio TCBS 6 VRT22 Starbuck In-reef (10 m) Vibrio TCBS 3 VRT23 Starbuck In-reef (10 m) Vibrio TCBS 5 (4) VRT25 Starbuck Black-reef water (10 m) Vibrio TCBS 4 VRT30 Starbuck Black-reef (surface) Vibrio TCBS 9 VRT35 Malden Supersucker (10 m - Coral) Serratia TCBS 11 VRT37 Malden Supersucker (10 m - Algae) Serratia TCBS 12 VRT38 Millenium Supersucker (10 m - Coral) Vibrio TCBS 13 VRT41 Millenium Supersucker (10 m - Algae) Vibrio TCBS 14 ZRT1 Flint Tent (10 m) Pseudoalteromonas Zobell 9 (3) ZRT3 Flint Tent (10 m) Pseudoalteromonas Zobell 11 ZAR1 Flint Tent (10 m) Pseudoalteromonas Zobell 13 (1) ZAR2 Flint Tent (10 m) Ruegeria Zobell 14 (2) ZRT28 Malden Supersucker (10 m - Coral) Serratia Marine 16 ZRT32 Malden Supersucker (10 m - Algae) Serratia Marine 15 SLI_3.1 Starbuck Tent (10 m) Mix N/A 1 SLI_3.2 Starbuck Black-reef water (10 m) Mix N/A 2 DOI: 10.7717/peerj.520/table-1

DNA extraction and sequencing The DNA from bacterial isolates was extracted and purified using the standard bacteria protocol in Nucleospin Tissue Kit (Macherey-Nagel, Dueren, Germany). In short, the cells were re-suspended with 180 µl T1 lysis buffer and mixed thoroughly. Proteinase K (25 µl) was added and the mixture was incubated at 37 °C for 3–8 h. The remaining extraction procedure was followed as recommended by the manufacturer protocol. Total microbial DNA was isolated from the Sterivex filters based on a modified protocol using the Nucleospin Tissue Kit (Macherey-Nagel, Dueren, Germany) (Kelly et al., 2012). Lysis steps were completed overnight at 37 °C in the Sterivex filters with double amount of Proteinase K-added T1 lysis buffer. An appropriate volume (200 µl for 180 µl T1 lysis buffer added, and 400 µl for 360 µl T1 lysis buffer added) of B3 lysis buffer was added for complete lysis before the lysate was removed from the Sterivex filter for subsequent extraction procedure as described in the manufacturers protocol. Sequence libraries were prepared using the Ion Xpress™ Plus Fragment Library Kit (Life Technologies, NY, USA) with slight protocol modification and each library is barcoded using the Ion Xpress™ Barcode Adapters 1–16 Kit. SPRI beads-based size selection according to the published New England Bioscience (NEB) E6270 protocol (https://www.neb.com/protocols/1/01/01/size-selection-e6270) was performed for 200–300 bp fragment size-selection after adapters ligation. Emulsion PCR was performed on 8-cycles amplified library using the OneTouch supplemented with Ion Torrent PGM Template OT2 200 Kit and template libraries were sequenced on the Ion Torrent PGM using the Ion Torrent PGM Sequencing 200 Kit v2 and Ion 318™ Chip Kit v2. Sequencing was performed across five different locations on the ship (Fig. 2). Figure 2: A field guide in setting up sequencing workflow, specifically on a moving ship. (A) Molecular bench for procedure including DNA extraction and library preparation was set-up in a clean room, which also serves as one of the bedrooms of scientists on-board. (B) Emulsion PCR was performed using the One-Touch technology located at the laundry room close to the hull of the ship. Damaged touch screen was replaced by re-wiring the display onto a laptop. (C) The sequencer was run at the owner’s cabin where there is an easy and safe access to the nitrogen tank that fuels the microfluidics of the sequencing technology. (D) A modified version of Ion Torrent server was set up to run the data analysis.

Multi-phenotype assay plate (MAP) Bacterial cells were resuspended from single colonies into sterile artificial seawater. Before leaving San Diego, MAPs were created as stock plates using 48 different carbon substrates arrayed on the plate in duplicate (Table S1). Each stock well contains 1 ml of 6X basal media (6X MOPS media, 57 mM NH 4 Cl, 1.5 mM NaSO 4 , 30 µM CaCl 2 , 6 mM MgSO 4 , 1.9 MNaCl, 7.92 mM K 2 HPO 4 , 60 mM KCl, 36 µM FeCl 3 ) and 1 ml of 5X carbon substrate. The substrates are used at a final concentration of 0.2% unless specified. Each experimental well on a 96-well plate consists of 50 µl of pre-mixed basal media + substrate solution, 75 µl sterile water, and 25 µl re-suspended bacterial cells. Bacteria cell optical density (OD) was read using spectrophotometer at 650 nm, at the start of the experiment (T = 0) and subsequently at the times noted. The multi-phenotype assay data were parsed and compiled using in-house PERL scripts (http://www.perl.org/). The data were visualized as growth curves by plotting OD measurements over time. Using the ggplot library in R (http://ggplot2.org/), the entire plate and curves were generated as images that were manually inspected (DA Cuevas, DR Garza, S Sanchez, JE Rostron, CS Henry, RA Overbeek, V Vonstein, F Rohwer, EA Dinsdale, RA Edwards, 2014, unpublished data). The OD measurements occurring at or after 40 h were extracted from the data for comparative analysis between the samples. These values were used to establish the 48 substrate vector profile of each sample. The Euclidean distance was calculated using the SciPy (Jones, Oliphant & Peterson, 2001) spatial distance module to generate a distance matrix that was the basis for a neighbor-joining tree (DA Cuevas, DR Garza, S Sanchez, JE Rostron, CS Henry, RA Overbeek, V Vonstein, F Rohwer, EA Dinsdale, RA Edwards, 2014, unpublished data). This code is available from GitHub at https://github.com/dacuevas/PMAnalyzer.

Bioinformatics analysis of sequence libraries As noted below, the most common computational issue was corruption of the data files on the hard drives. To mitigate this issue, the MD5 checksum values for each file were calculated on the personal genome machine using the command line md5sum application. (The PGM contains a single hard drive.) This application was chosen because it is fast and efficient. The checksum for each file was computed and compared to the expected values before the computation started and at the completion of each computation. On board ship, bases were called using a modified version of the Ion Torrent pipeline. To expedite the processing in the absence of a large compute cluster, the sequencing chip was digitally divided into four quadrants using the –cropped option to the bead finding application justBeadFind (part of the Ion Torrent suite, Life Technologies, Carlsbad, CA). A standard IonExpress 318 chip is 3,392 × 3,792 beads, and the chip was divided into four quadrants, 0–1,746, 0–1,946; 0–1,746, 1,846–3,792; 1,646–3,392, 0–1,946; and 1,646–3,392, 1,846–3,792. An overlap was provided on either side to ensure that all beads were identified. Any identical sequences from the same bead that was found in more than one quadrant were removed in post-processing steps. Bead finding, bead analysis, and base calling were performed using the Life Technologies software version 4.0. The modified pipeline consisted of three steps with the following commands: justBeadFind --cropped=0,0,1746,1946 --librarykey=TCAG --tfkey=ATCG --no-subdir --output-dir=sigproc sequences/ > justBeadFind.out 2> justBeadFind.err Analysis --from-beadfind ../sequences > analysis.out 2> analysis.err BaseCaller --trim-qual-cutoff 15 --trim-qual-window-size 30 --trim-adapter-cutoff 16 --input-dir ../sigproc > BaseCallerPost.hc.out 2> BaseCallerPost.hc.err Sequences were separated based on the IonExpress primer sequence using a custom written PERL script that looked for exact matches to those primers (available from http://edwards-sdsu.cvs.sourceforge.net/viewvc/edwards-sdsu/bioinformatics/bin/split_sequences_by_tag.pl). Any sequences with a mismatch to the primers were ignored as potentially containing sequencing errors as has been proposed elsewhere (Schmieder & Edwards, 2011). Sequences shorter than 40 nt (including the tag) were discarded, and the tag sequence was removed prior to subsequent analyses. No other quality trimming was performed prior to assembly. Sequences were assembled using Newbler version 2.7 (Margulies et al., 2005), and scaffolds were constructed to closely related genomes using Scaffold Builder (Silva et al., 2013). On the boat, contigs were annotated with a hybrid custom-written annotation pipeline based on both the RAST (Aziz et al., 2008; Overbeek et al., 2014) and the real time metagenomics system (Edwards et al., 2012). This pipeline uses amino acid k-mers to make functional assignments onto assembled DNA sequences, and extends the open reading frames (ORFs) containing the k-mer matches to identify the protein encoding regions. This heuristic pipeline eliminates largely overlapping ORFs but does not attempt to accurately identify the start positions of the genes, and is only concerned with those genes that can be functionally annotated from the k-mers. This approach was chosen as it is extremely fast, and provides a comprehensive annotation of the genes that can be identified in the database at the cost of the accuracy of exactly identifying the locations of the genes. This code is a part of the SEED toolkit and is available from http://biocvs.mcs.anl.gov/viewcvs.cgi/FigKernelScripts/assign_to_dna_using_kmers.pl. An online version of this is also available at the real time metagenomics site, http://edwards.sdsu.edu/RTMg/. In addition, tRNA-Scan SE was used to identify tRNA genes (Lowe & Eddy, 1997). Potential phage genes were identified by comparison to the PhAnToMe database (http://www.phantome.org/). On return to shore, all assembled genomes were annotated using the standard RAST pipeline (Aziz et al., 2008; Overbeek et al., 2014) and metabolic models were built using the model seed (Henry et al., 2010).

Phylogenetic relationships of isolates The 16S rRNA, rpoB, and recA gene sequences were extracted from the unassembled reads of each genome using the program genomePeek (K McNair, RA Edwards, 2014, unpublished data). Each group of sequences extracted from the same genome library were assembled into contigs using Newbler 2.7 (Margulies et al., 2005) with default parameters. The contigs were then grouped into 16S rRNA, RopB, and RecA gene group. Each group was aligned with ClustalW2 (Larkin et al., 2007) using the default parameters. The alignments were visually checked using Seaview (Galtier, Gouy & Gautier, 1996). Extraneous contigs were removed from the original set, and the remaining contigs were re-aligned, trimmed and exported in the PHYLIP format. Phylogenetic trees were generated using neighbor-joining clustering method (Larkin et al., 2007) and visualized using the interactive tree of life (Ciccarelli et al., 2006). This was not performed on the boat.

Genus designations 16S sequences were extracted as described above using Genome Peek, and the genus of the closest relative was used as the genus for the isolate. All GenomePeek analyses are available at http://edwards.sdsu.edu/GenomePeek/LineIslands/. This was performed on the boat, but was subsequently repeated.

Heat-map A blastn search (Altschul et al., 1997) using an expected value cutoff of 10−5 was performed to compare all the 2013 expedition genomic data against 35 coral metagenomic data from previous expeditions (the metagenomic libraries vary from 996,778 bp to 117,975,100 bp). The portion of reads (just using the best hits) from each metagenomic sample that matched to each genome was calculated using Eq. (1) and implemented in a Python script available from http://edwards-sdsu.cvs.sourceforge.net/viewvc/edwards-sdsu/bioinformatics/LineIslandsGenomes/. (1) ∑ i = 1 n Alignment _ size i Metagenome _ size x ∗ Genome _ size y . A heat-map was then generated to visualize the similarity between the 26 genomes sequenced on the 2013 expedition and the 35 sequenced coral metagenomes using an in-house Python script. This was initiated on the boat with a simple recruitment plot (available at the link above) and then subsequently refined.

Conserved functions Function is best conserved between orthologous proteins (i.e., proteins that are derived from the same common ancestor). On the boat we determine which functions are conserved across all genomes by counting proteins that had the same annotation. This allowed us to quickly compare the common functions and identify genes that were unique to the strains that we sequenced. Upon our return, and after the RAST annotations we composed orthologous groups (OGs) specific for the organisms sequenced here. These orthologous groups represent protein families derived from a single protein in the common ancestor of the genomes and were identified by using a similar approach as previously described (Lucena et al., 2012). Briefly, we first queried the complete proteomes using an all-by-all blastp search (Altschul et al., 1997). The resulting bitscores were used to define in-paralogous groups of recently duplicated genes (i.e., after the last speciation event) within every genome. Within the genome, all proteins with a matching score better or equal than to any protein in another genome were joined into an “in-paralogous group”. We then combined the in-paralogous groups conservatively between species by joining pairs of reciprocal best blastp hits to create the final list of orthologous groups for the complete set of genomes.