Syntax highlighting for computational biology file formats

bioSyntax currently recognizes FASTA, FASTQ, CLUSTAL, BED, GTF, PDB, SAM and VCF formats across all four text editors and less. Upon installation, bioSyntax automatically recognizes file extensions and seamlessly assigns syntax highlighting to these data files.

The main benefit of syntax highlighting is immediately apparent in its increased legibility (Fig. 1, Additional file 1: Figures S1 and S2), especially in the deconstruction of verbose content such as plain-text CIGAR strings (Fig. 1a). In each file format, data is organized using contrasting colours to accentuate keywords or data fields. Nucleotides and amino acids are highlighted with distinct colours, allowing users to read sequences and interpret patterns in the alignment. Data fields containing scores, such as PHRED base quality or mapping scores, are gradient coloured.
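As a minimal sketch of what gradient colouring of a score field involves, the Python snippet below decodes a FASTQ quality string into PHRED scores and maps each score onto a linear greyscale ramp. The Phred+33 offset is the common FASTQ encoding; the function names, the greyscale ramp and the cap of 40 are illustrative assumptions and do not reproduce the colours bioSyntax actually uses.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into PHRED scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]


def gradient_colour(score, max_score=40):
    """Map a score to an RGB grey level: 0 is black, max_score or above is white.

    The greyscale ramp and the cap of 40 are assumptions made for this sketch.
    """
    clamped = max(0, min(score, max_score))
    level = round(255 * clamped / max_score)
    return (level, level, level)


if __name__ == "__main__":
    # Print each quality character with its decoded score and grey level.
    for ch, score in zip("I5!", phred_scores("I5!")):
        print(ch, score, gradient_colour(score))
```

An editor's syntax definition would typically express the same idea as a small number of score bins, each assigned a fixed colour, rather than computing a continuous ramp.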

The overall system of highlighting is also designed to group biological classes, even across file formats (Additional file 2: Table S1). For instance, dark-green is reserved for genomic coordinates in BED, GTF, SAM and VCF, so even if a user is unfamiliar with the SAM format, previous experience associating dark-green in BED or GTF will inform them of the meaning of those fields when presented in a SAM file (Additional file 1: Figure S2).

Ultimately, bioSyntax aims to help computational biologists comprehend data using graphical highlighting rather than simple syntactic highlighting. When the data does not have to be read per character or per word, but can be viewed as graphical patterns, underlying information in the data becomes salient, similar to alternative nucleotide representations [37, 38]. This is best seen in complex files such as SAM, in which PCR-duplicate reads form block patterns and read density can be approximated by the diagonal similarity of reads at a locus (Fig. 1b). In a user-experience survey of bioinformaticians (Additional file 3: Text 1), 98.6% of respondents (N = 72) found nucleotide variants easier to identify in bioSyntax-highlighted alignments than in standard monochrome text. Further research is needed on how syntax highlighting can be refined to optimize user performance.

bioSyntax nucleotide representation

bioSyntax implements a novel nucleotide colouring scheme for the complete IUPAC ambiguous base set [39], unlike other colour-sets, which are designed for four or five bases (Fig. 2). The four primary base colours are chosen such that additive colour mixing also represents base ambiguity. For instance, thymine (blue) and cytosine (red) mix to magenta, the colour of the pyrimidine code, and the "any base" code, N, is white. This colour-set visually distinguishes the strong bases (G, C) and weak bases (A, T) as warm and cool colours respectively, allowing for an intuitive approximation of the sequence GC-content (Fig. 2c). Additionally, a high-contrast colour-scheme is available to aid visually impaired or colour-blind users (Additional file 1: Figure S3).
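The additive-mixing idea can be illustrated with a short Python sketch. Only the assignments of blue to thymine and red to cytosine come from the text; the RGB values, the colours chosen here for A and G, and the clamped channel-wise sum are assumptions made for illustration. The published scheme uses approximate mixes with tone adjustments, so this is a simplification and not the bioSyntax palette.

```python
# Hypothetical RGB primaries. Only T = blue and C = red are stated in the text;
# the colours for A and G are placeholders (cool and warm respectively).
PRIMARY = {
    "A": (0, 255, 0),    # assumed green (cool)
    "T": (0, 0, 255),    # blue, as stated in the text
    "G": (255, 165, 0),  # assumed orange (warm)
    "C": (255, 0, 0),    # red, as stated in the text
}

# IUPAC ambiguity codes and the bases each one stands for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mix(code):
    """Return the colour of an IUPAC code by summing the RGB channels of its
    member bases and clamping each channel at 255 (simple additive mixing)."""
    bases = IUPAC.get(code, code)  # a primary base stands for itself
    channels = zip(*(PRIMARY[base] for base in bases))
    return tuple(min(255, sum(channel)) for channel in channels)


if __name__ == "__main__":
    for code in "ACGTYN":
        print(code, mix(code))
```

With these placeholder primaries, the pyrimidine code Y (C or T) mixes to magenta and N mixes to white, matching the two examples given in the text.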

Fig. 2 bioSyntax nucleotide colour scheme. a The four primary bases are coloured in two pairs of contrasting colours. IUPAC ambiguous bases are then coloured in increasingly lighter tones of the approximately mixed colours. To accommodate 4-dimensional bases in 3-dimensional colours, aMino (A or C) and Keto (G or T) bases are darker. b A comparison of nucleotide colour-schemes in the literature. c bioSyntax colouring allows for approximation of a sequence's GC-content by how warm (high GC) or cool (high AT) it appears.

The bioSyntax repository

There are scores of biological and scientific file formats which would benefit from syntax highlighting. To facilitate future development of syntax definition files in science, the bioSyntax repository (https://bioSyntax.org) was set up. The repository is both a library for scientific syntax highlighting and a community-oriented resource for learning syntax highlighting development. In this manner, researchers experienced in the use-cases of a file format can quickly develop and share new syntax definition files.