BioWord provides an easily accessible and expandable toolkit for the manipulation and editing of biological sequences embedded within a Microsoft Word ribbon (Figure 2). To facilitate user interaction, the ribbon is divided into several functional groups that are discussed in the following sections.

Figure 2 The BioWord ribbon. Functionally related tasks are grouped in separate tabs. Additional buttons and tab-specific options can be accessed through the boxed arrow icon located at the bottom right of tabs.

Format and sequence manipulation

In its current implementation, BioWord can parse and convert to and from three widespread formats for biological sequences: FASTA [14], GenBank Flat File [15] and bare/raw sequence. Conversion buttons are available in the Manipulation group, along with reverse and complement (DNA/RNA) buttons, but output conversion can also be made implicit by setting the Format option of the Basic Options group to the desired format.
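As an illustration of the operations named above, the following minimal Python sketch parses FASTA text, writes it back out, and computes a reverse complement. It is not BioWord's VBA code; the function names (`parse_fasta`, `to_fasta`, `reverse_complement`) are hypothetical, GenBank parsing is omitted, and IUB ambiguity codes are not complemented.

```python
# Illustrative sketch only; names are hypothetical and not part of BioWord.
# Complement table for unambiguous DNA (IUB ambiguity codes are left untouched).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_fasta(header, seq, width=60):
    """Format one record as FASTA, wrapping the sequence at `width` columns."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines)
```

Converting FASTA to raw sequence then amounts to keeping only the second element of each parsed record.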

Translation and sequence statistics

BioWord features frame-dependent DNA to protein translation and translation maps using different genetic codes, as well as reverse translation using a variety of approaches (Figure 3). Reverse translation can be performed assuming a uniform codon distribution and using IUB characters to encode redundancy, or following a codon usage table, provided by the user in GCG Wisconsin Package format, as generated by the Codon Usage Database [8, 16, 17]. Basic statistics for DNA and protein sequences are also implemented in this distribution of BioWord. Among others, the toolkit can provide n-gram statistics and window-based analyses of DNA %GC content, as well as protein-specific indices, such as the GRAVY score [18]. The output for these analyses is generated in table format and can be readily pasted into spreadsheet software for graph generation.
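Reverse translation under a uniform codon distribution can be sketched in Python as follows. This is a sketch under stated assumptions rather than BioWord's implementation: it uses only the standard genetic code, the names `translate`, `degenerate_codon` and `reverse_translate` are hypothetical, and each codon position is collapsed independently into one IUB character. That position-wise collapse over-generates for amino acids with split codon families (for example, Leu becomes YTN, which also covers the Phe codons TTT and TTC).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUB/IUPAC codes keyed by the alphabetically sorted set of bases they denote.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def translate(dna):
    """Translate DNA in frame 1; incomplete or unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def degenerate_codon(aa):
    """Collapse all codons for one amino acid into a single IUB-coded codon."""
    codons = [c for c, a in CODON_TABLE.items() if a == aa]
    if not codons:
        raise ValueError("unknown amino acid: %r" % aa)
    # Collapse each codon position independently (may over-generate).
    return "".join(IUPAC["".join(sorted({c[i] for c in codons}))]
                   for i in range(3))


def reverse_translate(protein):
    """Reverse translate a protein assuming every synonymous codon is equally likely."""
    return "".join(degenerate_codon(aa) for aa in protein.upper())
```

A codon-usage-based variant would instead pick, for each amino acid, the most frequent codon in the supplied table.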

Figure 3 Comparison between reverse translation of the Escherichia coli K-12 MG1655 LexA protein (NP_418467) assuming a uniform codon distribution (RT UNIF) and using the E. coli codon usage table (RT CUT) supplied by the Codon Usage Database [16]. Red bold indicates deviation from the real DNA sequence shown at the bottom.

Search methods and consensus logos

String and pattern-based search methods comprise a significant part of BioWord’s functionality. The output for search methods can be overlaid on the sequence (highlighted) or provided in table format. BioWord provides a simple-to-use ORF search tool, which can maximize ORF length alone or combined with a supplied codon usage table from a reference genome. Basic string search methods (Substring Search) enable mismatch-based search for sequences and the ability to specify variable spacers in Gapped search. Mismatch-based search can operate on DNA sequences incorporating IUB redundancy codes or apply standard (e.g. BLOSUM62) scoring matrices to weigh matches in amino acid sequences. Pattern-based methods (Site Search) provide a more robust approach to sequence search by incorporating PSFM models and using methods derived from Shannon’s mutual information or relative entropy to score putative sites [19–21]. PSFM models are built from collections of sites and/or IUB consensus sequences provided by the user in either raw or FASTA sequence format. Like mismatch-based methods, pattern-based methods can also search for variable-spacer motifs (Dyad Pattern) based on direct or inverted repeats of a provided pattern (Figure 4).
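The gapped, mismatch-based search described above can be illustrated with a short Python sketch. It assumes that the mismatch limit applies to the sum of mismatches over both substrings and that IUB codes appear only in the query; both are assumptions, and `gapped_search` and `mismatches` are hypothetical names rather than BioWord functions.

```python
# Bases denoted by each IUB/IUPAC nucleotide code.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatches(pattern, window):
    """Count positions where the window base is not allowed by the IUB pattern."""
    return sum(1 for p, b in zip(pattern, window) if b not in IUPAC_SETS[p])


def gapped_search(seq, left, right, min_gap, max_gap, max_mismatch):
    """Find left-spacer-right matches; returns (start, gap, total_mismatches) tuples.

    Assumption: max_mismatch bounds the summed mismatches of both substrings.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - len(left) + 1):
        left_mm = mismatches(left, seq[start:start + len(left)])
        if left_mm > max_mismatch:
            continue
        for gap in range(min_gap, max_gap + 1):
            r_start = start + len(left) + gap
            r_end = r_start + len(right)
            if r_end > len(seq):
                break  # longer spacers would also run past the sequence end
            total = left_mm + mismatches(right, seq[r_start:r_end])
            if total <= max_mismatch:
                hits.append((start, gap, total))
    return hits
```

For example, `gapped_search(seq, "CTGW", "WCAG", 6, 10, 2)` mirrors the parameters of the Gapped search shown in Figure 4, with `seq` holding the promoter region.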

Figure 4 Sequence search on the E. coli K-12 MG1655 lexA (b4043) promoter region (125 bp upstream of the translation start point, shown in bold), using several of the search methods implemented in BioWord and a collection of known E. coli LexA-binding sites [22]. (Top left) Consensus logo representation of the LexA-binding site collection and its dyad motif. (Bottom left) Table format results for a Dyad Pattern search using the dyad motif and a 6-10 variable spacer. The overall R_i score is the sum of individual dyad scores. (Right) Table format results for a Gapped substring search using CTGW and WCAG as substrings, a maximum mismatch of 2 and a 6-10 variable spacer. The overall score is the sum of dyad mismatch scores. (Bottom) Superimposed results for a pattern search using the LexA-binding motif. In this output mode, the grey-scale shading intensity that highlights located sites is based on the information score (R_i), with darker shades indicating higher-scoring sites.

BioWord also exploits the ability to handle PSFM models to address a pressing need in the representation of sequence motifs. It is well known that consensus sequences are an unsuitable representation of sequence motifs because they omit information on the importance of consensus bases and the relative frequency of non-consensus bases at each position of the motif [23]. Sequence logos are able to integrate these two missing elements, together with the consensus, in an encapsulated representation and are therefore a superior and preferred method for the representation of sequence motifs [24]. Unfortunately, sequence logos are graphic elements and many authors continue to use consensus sequences to represent motifs in order to avoid the need for additional figures or to allow in-text discussions about the motif. BioWord provides a solution to this problem by allowing the representation of sequence motifs in text format using the consensus sequence, while simultaneously depicting its information content. For instance, the LexA-binding motif of Escherichia coli [22] would be represented as . In this representation (the consensus logo), the vertical bar character is used to represent the y-axis scale, with the maximum value, in bits, provided next to it. The height of the consensus letter at each position corresponds to the positional information content of that position (using either mutual information or relative entropy measures). This representation does not provide frequency information on non-consensus bases and, therefore, a sequence logo should be used preferentially whenever possible. Nonetheless, the consensus logo provides the means to convey information about positional conservation in text format and its use of information theory units allows straightforward comparison of motifs (e.g. the LexA-binding motif of E. coli can be directly compared to that of the α-Proteobacteria [25]).
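The quantity that sets the letter heights in a consensus logo can be sketched as the relative entropy of each motif column against a background distribution. The Python sketch below is illustrative only: `position_information` is a hypothetical name, the background defaults to uniform base frequencies, no pseudocounts or small-sample corrections are applied, and ties for the consensus base are broken alphabetically.

```python
import math
from collections import Counter


def position_information(sites, background=None):
    """Return (consensus_base, information_in_bits) for each motif position.

    Information is the relative entropy of the observed column frequencies
    against the background; with a uniform DNA background this equals
    2 minus the Shannon entropy of the column.
    """
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumption: uniform background
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    result = []
    for column in zip(*(s.upper() for s in sites)):
        counts = Counter(column)
        n = len(column)
        info = sum((c / n) * math.log2((c / n) / background[b])
                   for b, c in counts.items())
        # Most frequent base; sorting first makes tie-breaking alphabetical.
        consensus = max(sorted(counts), key=counts.get)
        result.append((consensus, info))
    return result
```

A fully conserved DNA position scores 2 bits under the uniform background, and a position split evenly between two bases scores 1 bit.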

Motif discovery and alignment

BioWord supports several methods for motif discovery. The user can apply a greedy search strategy or Gibbs sampling to a collection of unaligned DNA or protein sequences [26, 27] in order to locate underlying motifs of a given length (Figure 5). Both greedy search and Gibbs sampling are initialized randomly and iterated as many times as specified by the user. The reported motif is the one yielding the highest information content across all iterations. The current distribution of BioWord also incorporates a Dyad Motif search tool. This is a string-based motif search tool for bipartite motifs that reports all the occurrences of direct or inverted repeats with a maximum number of mismatches on the dyad and variable spacing (Figure 5). In addition, the package incorporates global and local pair-wise sequence alignment by implementing the Needleman-Wunsch and Smith-Waterman algorithms [28, 29]. Memory management and computing power are constrained in BioWord by the use of Microsoft Word-embedded VBA code. As a result, computationally or memory intensive methods in BioWord, such as motif discovery, cannot match the capabilities of equivalent specialized resources, like MEME [30]. Nonetheless, benchmarking of the BioWord greedy search algorithm on several known E. coli transcription factor-binding motifs indicates that BioWord motif discovery algorithms can provide results that are qualitatively comparable to those obtained by MEME, locating the known motif in nearly all instances (Figure 6), and alignment of relatively long sequences (e.g. 2,500 aa) can be performed seamlessly within BioWord.
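For reference, global pair-wise alignment with the Needleman-Wunsch algorithm can be sketched in a few lines of Python. This is a textbook version under simplifying assumptions (a fixed match/mismatch score instead of a substitution matrix, and a linear gap penalty); the function name and default scores are hypothetical and do not reflect BioWord's VBA implementation or its options.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))
```

The score matrix needs memory proportional to the product of the two sequence lengths, which is one reason alignment length is a practical concern in a memory-constrained environment. The Smith-Waterman local variant differs mainly in clamping cell scores at zero and starting the traceback from the highest-scoring cell.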

Figure 5 (Top) Motif discovery with Gibbs Sampling on a set of LexA protein sequences from different bacterial phyla. Instances of the discovered motif are highlighted on the sequences using the superimposed output option. The detected 10 amino acid-long motif shown in the consensus logo is centered on the well characterized Ala-Gly cleavage site of LexA [31]. (Bottom) Dyad Motif search on the E. coli K-12 MG1655 lexA (b4043) promoter region (see Figure 4), with 4±1 bp dyad, 8±1 bp spacer and 2 allowed mismatches. The reported score is the sum of dyad mismatch scores.