Here, we present BlobTools, a modular command-line solution for the visualisation of genome assemblies as BlobPlots, and taxonomic interrogation for purposes of quality control. BlobTools is a complete reimplementation of the Blobology pipeline ( Kumar et al ., 2013 ) focussed on usability, improved taxonomic assignment of sequences based on custom user input, and support for coverage information based on multiple formats and sequencing libraries. We demonstrate the features of BlobTools using synthetic datasets, and offer guidelines for efficient adoption of BlobTools into genome assembly programmes.

BlobPlots, or taxon-annotated GC-coverage plots ( Kumar et al ., 2013 ) are another contamination detection and data partitioning methodology. BlobPlots are two-dimensional scatter plots, in which sequences are represented by dots and coloured by taxonomic affiliation based on sequence similarity search results. For each sequence, the position on the Y-axis is determined by the base coverage of the sequence in the coverage library, a proxy for molarity of input DNA. The position on the X-axis is determined by the GC content, the proportion of G and C bases in the sequence, which can differ substantially between genomes.

Existing contaminant screening pipelines also differ in the way results are presented. Anvi’o depicts assemblies through interactive plots with rich annotations of sequence composition features, coverages across datasets and taxonomic/binning results. PhylOligo offers heatmaps of hierarchical clusterings of sequences, tree visualisations, and t-SNE (t-Distributed Stochastic Neighbor Embedding) plots, where sequence composition clusterings have been reduced to two dimensions. ProDeGe displays sequences in an interactive, three-dimensional k -mer PCA plots.

Interrogation of genome assemblies to assure single-taxon origin is an elemental step in the genome sequencing process. Failure to identify non-target sequence can lead to false conclusions regarding the biology of the target organism, such as metabolic potential and events of horizontal gene transfer (HGT) between species. Several reports of HGTs into eukaryotic genomes have later been shown to have been based on undetected contamination in assemblies. Identification of contamination can radically change the conclusions of a study, as shown for the starlet sea anemone Nematostella vectensis ( Artamonova & Mushegian, 2013 ) and the tardigrade Hypsibius dujardini ( Koutsovoulos et al ., 2016 ). Importantly, undetected non-target sequence contamination of published genomes will pollute public sequence databases and promote propagation of annotation errors.

Advances in next generation sequencing technologies have generated vast amounts of data and knowledge ( Goodwin et al ., 2016 ). The decrease in cost per nucleotide lead to an increased application of these technologies to non-model organisms, life forms which have so far not been intensively studied by the research community. Genome-enabled science on these species can then illuminate novel processes and reveal the patterns of evolution. For non-model species, the luxury of large amounts of material from cultured isolates is often not possible, and research must progress from organisms sourced from the wild or from complex mixtures of species. DNA extracted from a sample may actually contain genomes from multiple organisms – food sources, host material, symbionts, pathogens, commensals and external contaminants – in addition to the target organism. In some cases, the associated genomes can be considered “contaminants”, while in others, they can provide insights into the biology of the target organism. In all cases they should be identified, isolated and investigated with care.

Methods

Implementation BlobTools is written in Python and consists of a main executable that allows the user to interact with the implemented modules (see Table 1). It offers a simple, modular command line interface which can easily be adapted to process multiple datasets simultaneously using GNU parallel (Tange, 2011). Inputs for BlobTools are standard file formats commonly created during the course of genome assembly projects. The primary processing in BlobTools constructs a BlobDB data structure based on user input. From this data structure, BlobTools generates easily interpretable, two-dimensional visualisations ready for publication, in conjunction with tabular output, enabling the user to partition sequences and paired-end (PE) reads contributing to them, for separate downstream processing. We present two recommended workflows, one targeted at de novo genome assembly projects in the absence of a reference genome (Figure 1A) and another for projects where a reference genome is available (Figure 1B). BlobTools

module Task create Parsing of input files and creation of BlobTools (JSON) data

structure, i. e. BlobDB view Generation of tabular output for manual inspection and

subsequent partitioning of sequences in the assembly, input

files for CONCOCT, and/or COV files based on a BlobDB plot Plotting of BlobPlots based on a BlobDB covplot Plotting of CovPlots based on a BlobDB and a COV file seqfilter Partitioning of sequences from a FASTA file based on a list of

sequence IDs bamfilter Partitioning of paired-end reads from a BAM file based on a

list of sequence IDs and their mapping behaviour map2cov Generation of a COV file (containing base and read

coverage) based on a BAM/CAS file taxify Annotation of tabular sequence similarity search output (e. g.

BLAST/Diamond output) with TaxIDs from a mapping file or

generation of a BlobTools hits file based on custom user input

Taxonomy assignment Taxonomy assignment in BlobTools is based on user-supplied, tab-separated-value (TSV) files composed of three columns: the input sequence ID, a NCBI TaxID, and a numerical score. We refer to these TSV files as ‘hits’ files below. They can be generated from the output of sequence similarity searches, such as BLAST (Camacho et al., 2009) or Diamond blastx (Buchfink et al., 2015) searches against public or reference databases, or the output of other contaminant identification tools. The BlobTools module taxify allows easy conversion of tabular file formats to BlobTools compatible input, in addition to annotation of similarity search results based on NCBI TaxID mapping files, as available from UniProt and NCBI. Based on these inputs, BlobTools assigns a single NCBI taxonomy for each sequence in the assembly, based on the highest scoring NCBI TaxID at the following taxonomic ranks: species, genus, family, order, phylum, and superkingdom. Score calculation can be controlled by the user through a minimal score threshold ( --min_score ) and a minimal difference in scores ( --min_diff ) between the best and second-best scoring taxonomy. In addition, three non-canonical taxonomic annotations are possible: ‘no-hit’, the suffix ‘-undef’ and ‘unresolved’. Sequences not assigned to any taxonomic group, or not passing the --min_score threshold, are labelled ‘no-hit’. If a NCBI TaxID has no explicit parent at a taxonomic rank, the suffix ‘-undef’ is appended to the next upper taxonomic rank for which one does exist. In cases where the score difference between the best and second-best hits is smaller than --min_diff , sequences are labelled as ‘unresolved’. Multiple ‘hits’ files can be provided as input. In this case, the behaviour of the taxonomy assignment process can be controlled further through ‘taxrules’. The highest scoring taxonomy can either be inferred across all files (‘bestsum’) or successively (‘bestsumorder’) in the order they were supplied as input, allowing only sequence that received no hits from one file to be considered for taxonomic annotation in the next file, thereby leveraging reliability of scores of different input file sources. The original blobology pipeline (Kumar et al., 2013) recommended the use of a single, best BLAST hit per sequence for taxonomy assignment. However, taxonomically mis-annotated sequences in databases (derived from inclusion of un-screened genome assemblies) can lead to erroneous taxonomic annotation. BlobTools mitigates this issue by accepting multiple hits per sequence and allocating taxonomy based on the highest sum of scores. It should be noted that a definitive taxonomic placement for every sequence in the assembly is not required for successful taxonomic partitioning of sequences, since differential coverage and sequence composition profiles between the genomes are often sufficient.

Visualisations In BlobTools, sequences are depicted as circles in BlobPlots (as opposed to dots in the blobology pipeline), with diameters proportional to sequence length. The scatter-plot is decorated with coverage and GC histograms for each taxonomic group, which are weighted by the total span (cumulative length) of sequences occupying each bin. A legend reflects the taxonomic affiliation of sequences and lists count, total span and N50 by taxonomic group. Taxonomic groups can be plotted at any taxonomic rank and colours are selected dynamically from a colour map. The number of taxonomic groups to be plotted can be controlled ( --plotgroups, default is ‘7’) and remaining groups are binned into the category ‘others’. An example is shown in Figure 2A. The power of differential coverage profiles across different sequencing libraries for partitioning sequences in an assembly prompted the development of CovPlots (Figure 3) (Koutsovoulos et al., 2016), which are analogous to BlobPlots, except that the GC-axis is substituted by the coverage-axis from another sequencing library. CovPlots can be used for the visualisation of patterns of differential coverage signatures between taxonomic groups in the assembly. The modules for generating BlobPlots and CovPlots support additional input parameters controlling visualisation behaviour, including cumulative addition ( --cumulative ) or generation of separate plots for each taxonomic group ( --multiplot ), exclusion ( --exclude ) or relabelling ( --relabel ) of taxonomic groups, assignment of specific HEX colours to groups ( --colour ) or labelling sequences based on arbitrary, user defined categories ( --catcolour ). The latter could be, for instance, binned categories of RNAseq mappings to sequences in the assembly as shown in Koutsovoulos et al. (2016). ReadCovPlots (Figure 2B and 2C) visualise the proportion of reads of a library that are unmapped or mapped, showing the percentage of mapped reads by taxonomic group, as barcharts. These can be of use for rapid taxonomic screening of multiple sequencing libraries within a single project. The underlying data of ReadCovPlots and additional metrics are written to tabular text files for custom analyses by the user.

Support of multiple coverage libraries BlobTools supports coverage input (BAM/CAS format) from multiple sequencing libraries. As these data formats contain more information than needed, BlobTools parses coverage information of sequences (normalised base coverage and read coverage) into COV files in TSV format. These files can be generated through the module map2cov prior to construction of a BlobDB. Within the BlobDB data structure, base and read coverage information is stored for each sequence in the assembly. If more than one coverage file is supplied, BlobTools constructs an additional coverage library (‘cov_sum’) internally, containing the sum of coverages for each sequence across all coverage files. This internal coverage library is considered when extracting views or plotting visualisations.