In particular, a successful high-throughput genomic visualization environment for modern microbial informatics must satisfy two criteria. First, software releases must be free and open-source so that other researchers can verify the software, adapt it to their specific needs, and cope with the quick evolution of data types and dataset sizes. Second, visualization tools must be command-driven in order to be embedded in computational pipelines. This allows for a higher degree of analysis reproducibility, but the software must correspondingly be available for local installation and callable through a convenient interface (e.g., API or general scripting language). Local installations also have the advantage of avoiding the transfer of large or sensitive data to remote servers, preventing potential issues with the confidentiality of unpublished biological data. Neither of these criteria, of course, prevents tools from also being embeddable in web-based interfaces in order to facilitate use by users with limited computational expertise (Blankenberg et al., 2010; Giardine et al., 2005; Goecks et al., 2010; Oinn et al., 2004), and all such tools must regardless produce informative, clear, detailed, and publication-ready visualizations.

In the specific context of microbial genomics and metagenomics, next-generation sequencing in particular produces datasets of unprecedented size, including thousands of newly sequenced microbial genomes per month and a tremendous increase in genetic diversity sampled by isolates or culture-free assays. Displaying phylogenies with thousands of microbial taxa in hundreds of samples is infeasible with most available tools. This is especially true when sequencing profiles need to be placed in the context of sample metadata (e.g., clinical information). Among recently developed tools, iTOL (Letunic & Bork, 2007; Letunic & Bork, 2011) targets interactive analyses of large-scale phylogenies with a moderate amount of overlaid metadata, whereas ETE (Huerta-Cepas, Dopazo & Gabaldon, 2010) is a Python programming toolkit focusing on tree exploration and visualization that is targeted at scientific programmers, and Krona (Ondov, Bergman & Phillippy, 2011) emphasizes hierarchical quantitative information typically derived from metagenomic taxonomic profiles. None of these tools provides an automatable environment for non-computationally expert users in which very large phylogenies can be combined with high-dimensional metadata such as microbial community abundances, host or environmental phenotypes, or microbial physiological properties.

Modern high-throughput sequencing technologies provide comprehensive, large-scale datasets that have enabled a variety of novel genomic and metagenomic studies. A large number of statistical and computational tools have been developed specifically to tackle the complexity and high-dimensionality of such datasets and to provide robust and interpretable results. Visualizing data including thousands of microbial genomes or metagenomes, however, remains a challenging task that is often crucial to driving exploratory data mining and to compactly summarizing quantitative conclusions.

Export2graphlan performs an analysis on the abundance values and, if present, on the LDA score assigned by LEfSe, to annotate and highlight the most abundant clades and the ones found to be biomarkers. Through a number of parameters the user can control the annotations produced by export2graphlan.
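
The selection step described above can be illustrated with a minimal sketch. It is not export2graphlan's actual code: the function names, the thresholds, and the toy values are all hypothetical, and the sketch only assumes that each clade has a list of per-sample abundances and, optionally, an LDA score.

```python
from statistics import mean

def most_abundant_clades(profile, top_n=3, min_abundance=0.0):
    """Return the top_n clade names ranked by mean abundance across samples.

    profile maps a clade name to its list of per-sample abundances.
    Clades whose mean abundance is below min_abundance are ignored.
    (Hypothetical helper, not part of export2graphlan.)
    """
    means = {clade: mean(values) for clade, values in profile.items()}
    kept = [clade for clade, m in means.items() if m >= min_abundance]
    # Decreasing mean abundance; ties are broken alphabetically for determinism.
    kept.sort(key=lambda clade: (-means[clade], clade))
    return kept[:top_n]

def biomarker_clades(lda_scores, min_lda=2.0):
    """Return clades whose absolute LDA score reaches min_lda, sorted by name."""
    return sorted(c for c, score in lda_scores.items() if abs(score) >= min_lda)

# Toy, made-up values used purely for illustration.
toy_profile = {
    "Bacteria.Firmicutes": [40.0, 60.0],
    "Bacteria.Bacteroidetes": [50.0, 30.0],
    "Bacteria.Proteobacteria": [9.0, 9.5],
    "Archaea.Euryarchaeota": [1.0, 0.5],
}
toy_lda = {"Bacteria.Firmicutes": 3.1, "Archaea.Euryarchaeota": 1.2}

to_annotate = most_abundant_clades(toy_profile, top_n=2, min_abundance=1.0)
biomarkers = biomarker_clades(toy_lda, min_lda=2.0)
```

In a real run, the number of clades to highlight and the score threshold would correspond to the user-controlled parameters mentioned above.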

Export2graphlan can take as input two files: the result of the analysis of MetaPhlAn (either version 1 or 2) or HUMAnN, and the result of the analysis of LEfSe. At least one of these two input files is mandatory. Export2graphlan will then produce a tree file and an annotation file that can be used with GraPhlAn. In addition, export2graphlan can take as input a BIOM file (either version 1 or 2).
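
As a rough illustration of how a taxonomic profile could be turned into a tree file, the sketch below converts MetaPhlAn-style lineage strings into one line per clade. The rank prefixes (`k__`, `p__`, ...), the `|` input separator, and the dot-separated output are assumptions about the file conventions, and the helper names are hypothetical; the actual conversion logic of export2graphlan is not described in this text.

```python
import re

# Assumed MetaPhlAn-style rank prefixes such as "k__", "p__", "s__".
_RANK_PREFIX = re.compile(r"^[a-z]__")

def lineage_to_tree_line(lineage, in_sep="|", out_sep="."):
    """Strip rank prefixes and join taxonomic levels with out_sep."""
    levels = [_RANK_PREFIX.sub("", level) for level in lineage.split(in_sep)]
    # A separator character inside a taxon name would be ambiguous, so replace it.
    levels = [level.replace(out_sep, "_") for level in levels]
    return out_sep.join(levels)

def profile_to_tree_lines(rows):
    """Convert tab-separated 'lineage<TAB>abundance...' rows into tree lines."""
    tree_lines = []
    for row in rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and header/comment lines
        lineage = row.split("\t")[0]
        tree_lines.append(lineage_to_tree_line(lineage))
    return tree_lines

# Toy input rows, made up for illustration.
toy_rows = [
    "#clade\tsample_1",
    "k__Bacteria\t99.0",
    "k__Bacteria|p__Firmicutes\t60.0",
    "k__Bacteria|p__Firmicutes|c__Clostridia\t35.0",
]
tree_lines = profile_to_tree_lines(toy_rows)
```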

Export2graphlan is a framework to easily integrate GraPhlAn into already existing bioinformatics pipelines. Export2graphlan makes use of two external libraries: the pandas Python library (McKinney, 2012) and the BIOM library, the latter being required only when BIOM files are given as input.

GraPhlAn is composed of two Python modules: one for drawing the image and one for adding annotations to the tree. GraPhlAn exploits the annotation file to highlight and personalize the appearance of the tree and of the associated information. The annotation file does not modify the structure of the tree; it only changes the way in which nodes and branches are displayed. Internally, GraPhlAn uses the matplotlib library (Hunter, 2007) to perform the drawing functions.
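
To make the idea of a circular tree drawing concrete, the following sketch computes polar coordinates for the nodes of a small tree. It is a generic layout under simple assumptions (leaves spread evenly around the circle, radius equal to node depth), not GraPhlAn's actual layout code; the nested-dict tree representation and the function names are hypothetical.

```python
import math

def count_leaves(tree):
    """Number of leaves below a node; an empty dict is itself a leaf."""
    if not tree:
        return 1
    return sum(count_leaves(subtree) for subtree in tree.values())

def polar_layout(tree, root="root"):
    """Assign (radius, angle) to every node of a nested-dict tree.

    Leaves are spread evenly around the circle, each internal node sits at
    the mean angle of its children, and the radius equals the node depth.
    """
    n_leaves = count_leaves(tree)
    coords = {}
    next_leaf = [0]  # mutable counter shared by the recursive calls

    def visit(name, children, depth):
        if not children:
            angle = 2 * math.pi * next_leaf[0] / n_leaves
            next_leaf[0] += 1
        else:
            child_angles = [visit(c, sub, depth + 1) for c, sub in children.items()]
            angle = sum(child_angles) / len(child_angles)
        coords[name] = (depth, angle)
        return angle

    visit(root, tree, 0)
    return coords

# Toy tree with two clades and four leaves.
toy_tree = {"A": {"a1": {}, "a2": {}}, "B": {"b1": {}, "b2": {}}}
coords = polar_layout(toy_tree)
```

Coordinates of this kind could then be passed to a plotting library such as matplotlib (for example on polar axes), with the annotation step only changing colors, sizes, and labels while leaving the coordinates, and hence the tree structure, untouched.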

GraPhlAn is a new tool for compact and publication-quality representation of circular taxonomic and phylogenetic trees with potentially rich sets of associated metadata. It was developed primarily for microbial genomic and microbiome-related studies in which the complex phylogenetic/taxonomic structure of microbial communities needs to be complemented with quantitative and qualitative sample-associated metadata. GraPhlAn is available at http://segatalab.cibio.unitn.it/tools/graphlan

Results and Discussion

Plotting taxonomic trees with clade annotations

The simplest structures visualizable by GraPhlAn include taxonomic trees (i.e., those without variable branch lengths) with simple clade or taxon nomenclature labels. These can be combined with quantitative information such as taxon abundances, phenotypes, or genomic properties. GraPhlAn provides separate visualization options for trees (thus potentially unannotated) and their annotations, the latter of which (the annotation module) attaches metadata properties using the PhyloXML format (Han & Zmasek, 2009). This annotation and subsequent metadata visualization process (Fig. 1) can be repeatedly applied to the same tree.

Figure 1: Schematic and simplified example of GraPhlAn visualization of annotated phylogenies and taxonomies. The software can start from a tree in Newick, Nexus, PhyloXML, or plain text formats. The “default plot” (A) produces a basic visualization of the tree’s hierarchical structure. Through an annotation file, it is possible to configure a number of options that affect the appearance of the tree. For instance, some global parameters will affect the whole tree structure, such as the color and thickness of branches (“set global options,” B). The same annotation file can act on specific nodes, customizing their shape, size, and color (“set node options,” C). Labels and background colors for specific branches in the tree can also be configured (“set label options,” D). External to the circular area of the tree, the annotation file can include directives for plotting different shapes, heatmap colors, or bar-plots representing quantitative taxon traits (“set external ring options,” E).

The GraPhlAn tree visualization (plotting module) takes as input a tree represented in any one of the most common data formats: Newick, Nexus (Maddison, Swofford & Maddison, 1997), PhyloXML (Han & Zmasek, 2009), or plain text. Without annotations, the plotting module generates a simple version of the tree (Fig. 1A), but the process can then continue by adding a diverse set of visualization annotations. Annotations can affect the appearance of the tree at different levels, including its global appearance (“global options” e.g., the size of the image, Fig. 1B), the properties of subsets of nodes and branches (“node options” e.g., the color of a taxon, Fig. 1C), and the background features used to highlight sub-trees (“label options” e.g., the name of a species containing multiple taxa, Fig. 1D). A subset of the available configurable options includes the thickness of tree branches, their colors, highlighting background colors and labels of specific sub-trees, and the sizes and shapes of individual nodes. Wild cards are supported to share graphical and annotation details among sub-trees by affecting all the descendants of a clade or its terminal nodes only. These features in combination aim to conveniently highlight specific sub-trees and metadata patterns of interest.

Additional taxon-specific features can be plotted as so-called external rings when not directly embedded into the tree. External rings are drawn just outside the area of the tree and can be used to display specific information about leaf taxa, such as abundances of each species in different conditions/environments or their genome sizes. The shapes and forms of these rings are also configurable; for example, in Fig. 1E (“set external ring options”), the elements of the innermost external ring are triangular, indicating the directional sign of a genomic property. The second, third, and fourth external rings show leaf-specific features, using a heatmap gradient from blank to full color. Finally, the last external ring is a bar-plot representing a continuous property of leaf nodes of the tree.
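
The annotation levels described above (global, node, label, and external ring options) can be sketched as a small script that writes a tab-separated annotation file. This is an illustration only: the property names (`branch_thickness`, `clade_marker_color`, `annotation`, `annotation_background_color`, `ring_color`, `ring_alpha`) and the column layout are assumptions modeled on the GraPhlAn documentation and should be verified against it, and the helper functions, clade names, and abundances are hypothetical.

```python
def global_option(key, value):
    """A directive applying to the whole tree: key<TAB>value (assumed layout)."""
    return f"{key}\t{value}"

def node_option(clade, key, value):
    """A directive applying to one clade: clade<TAB>key<TAB>value (assumed layout)."""
    return f"{clade}\t{key}\t{value}"

def ring_option(clade, key, ring_level, value):
    """An external-ring directive: clade<TAB>key<TAB>ring level<TAB>value (assumed)."""
    return f"{clade}\t{key}\t{ring_level}\t{value}"

def scale_to_alpha(value, vmin, vmax):
    """Linearly map value from [vmin, vmax] to an opacity in [0, 1]."""
    if vmax == vmin:
        return 1.0
    return min(1.0, max(0.0, (value - vmin) / (vmax - vmin)))

def build_annotation(abundances):
    """Build annotation text with global, node, label, and ring directives."""
    lines = [global_option("branch_thickness", 1.5)]
    # Node and label options for one hypothetical clade.
    lines.append(node_option("Bacteria.Firmicutes", "clade_marker_color", "#1f77b4"))
    lines.append(node_option("Bacteria.Firmicutes", "annotation", "Firmicutes"))
    lines.append(node_option("Bacteria.Firmicutes", "annotation_background_color", "#1f77b4"))
    # Heatmap-like ring: one color, opacity scaled from blank to full color.
    vmin, vmax = min(abundances.values()), max(abundances.values())
    for leaf, value in sorted(abundances.items()):
        alpha = scale_to_alpha(value, vmin, vmax)
        lines.append(ring_option(leaf, "ring_color", 1, "#d62728"))
        lines.append(ring_option(leaf, "ring_alpha", 1, f"{alpha:.2f}"))
    return "\n".join(lines) + "\n"

# Toy leaf abundances, made up for illustration.
toy_abundances = {
    "Bacteria.Firmicutes.Clostridia": 30.0,
    "Bacteria.Firmicutes.Bacilli": 10.0,
    "Bacteria.Bacteroidetes.Bacteroidia": 20.0,
}
annotation_text = build_annotation(toy_abundances)
```

Generating the annotation text programmatically in this way keeps the tree file untouched and makes it straightforward to regenerate the figure when the underlying metadata change.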

Compact representations of phylogenetic trees with associated metadata

Visualizing phylogenetic structures and their relation to external metadata is particularly challenging when the dimension of the internal structure is large. Mainly as a consequence of the low cost of sequencing, current research in microbial genomics and metagenomics indeed needs to visualize a considerable amount of phylogenetic data. GraPhlAn can easily handle such cases, as illustrated here in an example of a large phylogenetic tree (3,737 taxa, provided as a PhyloXML file in the software repository, see Availability section) with multiple types of associated metadata (Fig. 2).

Figure 2: A large, 3,737-genome phylogeny annotated with functional genomic properties. We used the phylogenetic tree built using PhyloPhlAn (Segata et al., 2013) on all available microbial genomes as of 2013 and annotated the presence of ATP synthesis and Fatty Acid metabolism functional modules (as annotated in KEGG) and the genome length for all genomes. Colors and background annotation highlight bacterial phyla, and the functional information is reported in external rings. ATP synthesis rings visualize the presence (or absence) of each module, while Fatty Acid metabolism capability is represented with a gradient color. Data used in this image are available as indicated in the “Datasets used” paragraph, under the “Materials and Methods” section.

Specifically, we used GraPhlAn to display the microbial tree of life as inferred by PhyloPhlAn (Segata et al., 2013), annotating this evolutionary information with genome-specific metadata (Fig. 2).
In particular, we annotated the genome contents related to seven functional modules from the KEGG database (Kanehisa et al., 2012), specifically two different ATP synthesis machineries (M00157: F-type ATPase and M00159: V/A-type ATPase) and five modules for bacterial fatty acid metabolism (M00082: Fatty acid biosynthesis, initiation, M00083: Fatty acid biosynthesis, elongation, M00086: acyl-CoA synthesis, M00087: beta-Oxidation, and M00088: Ketone body biosynthesis). We then also annotated genome size as an external circular bar plot. As expected, it is immediately visually apparent that the two types of ATPase are almost mutually exclusive within available genome annotations, with the V/A-type ATPase (module M00159) present mainly in Archaea and the F-type ATPase (module M00157) mostly characterizing Bacteria. Some exceptions are easily identifiable: Thermi and Chlamydophila, for instance, completely lack the F-type ATPase, presenting only the typically archaea-specific V/A-type ATPase. As previously discussed in the literature (Cross & Müller, 2004; Mulkidjanian et al., 2007), this may be due to the acquisition of V/A-type ATPase by horizontal gene transfer and the subsequent loss of the F-type ATPase capability. Interestingly, some species such as those in the Streptococcus genus and some Clostridia still show both ATPase systems in their genomes. With respect to fatty acid metabolism, some clades (including organisms such as Mycoplasmas) completely lack all of the targeted pathways. Indeed, Mycoplasmas are the smallest living cells yet discovered, lacking a cell wall (Razin, 1992) and demonstrating an obligate parasitic lifestyle. Since they primarily exploit host molecular capabilities, Mycoplasmas do not need to be able to fulfill all typical cell functions, and this is also indicated by the very small genome sizes shown in the plot. Escherichia, on the other hand, has a much longer genome, and all the considered fatty acid metabolism capabilities are present.
Although these evolutionary aspects are well known in the literature, GraPhlAn permits them and other phylogeny-wide genomic patterns to be easily visualized for further hypothesis generation.
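
The near mutual exclusivity of the two ATPase modules that is visible in the figure could also be checked numerically. The sketch below is one possible way to tabulate it, assuming a mapping from each genome to the set of KEGG modules annotated in it; the genome names and module assignments are toy placeholders rather than data from the study, and the function names are hypothetical.

```python
from collections import Counter

F_TYPE = "M00157"   # KEGG module: F-type ATPase
VA_TYPE = "M00159"  # KEGG module: V/A-type ATPase

def atpase_category(modules):
    """Classify a genome by which of the two ATPase modules it carries."""
    has_f = F_TYPE in modules
    has_va = VA_TYPE in modules
    if has_f and has_va:
        return "both"
    if has_f:
        return "F-type only"
    if has_va:
        return "V/A-type only"
    return "neither"

def summarize(genome_modules):
    """Count genomes in each ATPase category."""
    return Counter(atpase_category(mods) for mods in genome_modules.values())

# Toy placeholder genomes; NOT results from the 3,737-genome tree.
toy_genomes = {
    "genome_A": {"M00157", "M00082", "M00083"},
    "genome_B": {"M00159"},
    "genome_C": {"M00157", "M00159"},
    "genome_D": set(),
    "genome_E": {"M00157"},
}
summary = summarize(toy_genomes)
```

The per-genome categories produced this way map naturally onto presence/absence rings, while the counts give a quick numerical complement to the visual impression.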