Introduction to Sequence File Formats

As soon as biological data could be stored digitally, a multitude of file formats arose. The very first files contained raw DNA sequence reads in a regular .txt file, but as the range of information broadened, so did the types of files.

Several different file formats arose, each with its own purpose:

Compatibility with specific software (for visualizations, diagrams, mappings).

Simple text for easy data processing, parsing, and human readability (comma- or tab-delimited files, e.g. tsv, csv).

Improved efficiency for computers. Usually these files are in a non-human-readable binary format. You'll see that some binary files have a corresponding "index" file, which is useful for searching.

Purpose of this lesson

In this series, we'll go over the most common sequence file formats you'll come across in bioinformatics. The purpose of this lesson won't be to learn the intricate details per format, but to simply become familiarized with the different file types that are used to store biological data. This will help calm that overwhelming feeling you get when sifting through bioinformatics forums, books and software tools.

Typically, to truly understand a file format (especially raw output files), you'll need to know the technology used to generate that format. In these cases, we'll link you to the corresponding tutorial.

We'll mainly go over DNA sequence file types, and save database file formats such as EMBL or SWISS-PROT for another lesson.

What is a file format?

A file format is a way for computers (and humans) to standardize how data is organized. For example, this page was written as an HTML file with the .html extension. HTML files contain special tags that tell the browser what each block of text is, and how to display it on the page.

Additionally, computers are able to check a file's format and immediately determine whether the file should be opened in a text editor (for editing), a modern browser (for viewing) or some other software.

File types can also indicate which algorithm to use to view (or open) that file. For example, .gif , .jpg and .png all display images, but the level of compression, size and resolution differ.

Some examples of image file formats: .gif, .jpg, and .png

Plain text files

Early on, scientists held sequence information in plain text ( .txt ) files with descriptive file names. This became limiting once researchers needed to include annotations and additional information about the sequences.

csv, tsv

More common (yet still primitive) file types include csv and tsv . The former stands for comma-separated values , meaning that there is a comma between each value. The simplicity of this format allows researchers to easily exchange data among computers, a property known as portability .

id,fname,lname,occupation
314,Peter,Ignasius,Bioinformaticist
232,Sarah,Carlito,Mathematician
412,Enrique,Menezes,Microbiologist

Opening a .csv file in Excel and in a regular text editor. These files are portable because of their simple structure.

A tsv (tab-separated values) file is similar, but data is separated by tabs instead of commas. Many of the biological filetypes covered in this lesson have tab-separated values.
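
As a minimal sketch of how easily such files can be parsed, the snippet below reads the sample table from this section with Python's standard csv module. The data is embedded as a string purely for illustration, and the helper name read_delimited is hypothetical, not part of any library.

```python
import csv
import io

# Sample data from the csv example, embedded as a string for illustration.
CSV_TEXT = """id,fname,lname,occupation
314,Peter,Ignasius,Bioinformaticist
232,Sarah,Carlito,Mathematician
412,Enrique,Menezes,Microbiologist
"""


def read_delimited(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


rows = read_delimited(CSV_TEXT)
for row in rows:
    print(row["id"], row["fname"], row["occupation"])

# The same helper reads tsv data when the delimiter is a tab.
tsv_rows = read_delimited(CSV_TEXT.replace(",", "\t"), delimiter="\t")
```

Only the delimiter changes between the two formats, which is why so many tools can read both.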

What is a newline (EOL) character?

The newline (aka end of line or EOL ) is a special character or sequence of special characters that signifies the end of a line of text. In a normal text editor such as Notepad++, these characters are hidden by default.

What's tricky about the EOL character is that it differs by platform: UNIX uses a single line feed (LF), while MS Windows uses a carriage return followed by a line feed (CRLF). On the command line, you can convert files between the two types with the dos2unix and unix2dos commands.

Viewing end of line character ($) on Vim with the 'set list' command.
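
If those commands aren't available, the same idea can be sketched in a few lines of Python. This is only an illustration that works on raw bytes, where UNIX files end lines with a line feed (\n) and Windows files with a carriage return plus line feed (\r\n). The function names are hypothetical.

```python
def detect_eol(data: bytes) -> str:
    """Report which newline convention a block of bytes uses."""
    if b"\r\n" in data:
        return "windows"   # carriage return + line feed (CRLF)
    if b"\n" in data:
        return "unix"      # line feed only (LF)
    return "none"


def to_unix(data: bytes) -> bytes:
    """Replace CRLF with LF, roughly what dos2unix does."""
    return data.replace(b"\r\n", b"\n")


def to_windows(data: bytes) -> bytes:
    """Replace LF with CRLF, normalizing first so CRLF is not doubled."""
    return to_unix(data).replace(b"\n", b"\r\n")


windows_text = b">seq1\r\nACGT\r\n"
print(detect_eol(windows_text))
print(to_unix(windows_text))
```

Mixed-up line endings are a common reason a parser reports a stray character at the end of every line, so a check like this is a useful first step when a file misbehaves.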

Markdown

Another common text-based format that is becoming more and more popular is the markdown format. These files are indicated by the .md extension.

The markdown format is a markup language, just like HTML. All it does is mark up the text within the document to indicate which lines are headers, paragraphs, and so on. For example,

# This is a first-level header
## This is a second-level header
### This is a third-level header

> This is a blockquote

    Four spaces / 1 tab = line for code

*italics*, **bold**, `inline code`, [Google](http://google.com)

Unordered List:
- Illumina
- PacBio
- IonTorrent

Ordered List:
1. FASTA
2. FASTQ
3. SAM/BAM/CRAM

There are command-line tools that help convert from markdown to html, such as pandoc:

$ pandoc --from markdown --to html README.md > README.html

The left picture shows a raw .md file, while the right picture shows how it renders in a browser.

You'll often see these files as README.md when you download a source file from GitHub or another repository. The markdown format allows the page to load with proper formatting.

Great! Now that we've covered basic text-based file formats, let's move into more bioinformatics-specific file types.

FASTA format

FASTA (pronounced "fast-A") format is a simple format that bioinformaticians use to represent either nucleotide or protein sequences. It is written as plain text, which makes it easy for processing tools to parse. The general file extension is .fas , although .fasta and .fa are also common.

The FASTA file format originated from a DNA and protein sequence alignment software package called FASTP, created in the mid-1980s. The format allows you to precede each sequence with a comment.

Each sequence record has two parts: 1) the identifier line (comments, annotations) and 2) the sequence itself, which may be wrapped over several lines.

Sample FASTA sequence

Before we dig into a FASTA sequence, let's see what one looks like. Here is an example of a standard FASTA format. Pretty simple, right?

>gi|13959657|sp|Q9PTU8|VSP3_BOTJA Venom serine proteinase A precursor
MVLIRVIANLLILQLSNAQKSSELVIGGDECNITEHRFLVEIFNSSGLFCGGTLIDQEWVLSAAHCDMRN
MRIYLGVHNEGVQHADQQRRFAREKFFCLSSRNYTKWDKDIMLIRLNRPVNNSEHIAPLSLPSNPPSVGS
VCRIMGWGTITSPNATFPDVPHCANINLFNYTVCRGAHAGLPATSRTLCAGVLQGGIDTCGGDSGGPLIC
NGTFQGIVSWGGHPCAQPGEPALYTKVFDYLPWIQSIIAGNTTATCPP

1) Identifier

The top line holds information about the sequence below it and begins with a ">" character. Without this informative first line, we would just have a raw sequence.

Sequence identifiers

When the FASTA sequence comes from a biological database, the identifier marks which database. Here is a list of major database sequence identifiers:

GenBank/EMBL/DDBJ: gi|gi_number|*|accession.version|locus (the * is gb, emb, or dbj depending on the database)
NCBI RefSeq: ref|accession|locus
PRF (Protein Research Foundation): prf||name
SWISS-PROT: sp|accession|locus
PDB (Protein Data Bank): pdb|entry|chain

2) Sequence

The lines immediately following the identifier hold the raw sequence. For both DNA and proteins, standard nucleic acid and amino acid IUB/IUPAC codes are used.

Additionally, there are a few more notes to consider:

Lower-case letters are mapped to upper-case.

Hyphens represent a gap character.

In amino acid sequences, U and * are acceptable.

It is recommended that each line be shorter than 80 characters.
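
To make these rules concrete, here is a minimal sketch of a FASTA reader in plain Python. It is an illustration, not a replacement for an established parser: the function name parse_fasta and the toy records are made up for this example, and the reader assumes that every record starts with a ">" line.

```python
def parse_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # drop the leading ">"
        elif header is not None:
            chunks.append(line.upper())    # lower-case maps to upper-case
    if header is not None:
        yield header, "".join(chunks)


example = """>seq1 first toy record
acgtac
GTAC--GT
>seq2 second toy record
MVLIRV*
""".splitlines()

records = list(parse_fasta(example))
for identifier, sequence in records:
    print(identifier, len(sequence))
```

Wrapped sequence lines are joined back together, and hyphens and the * character pass through untouched. A file holding several records is read the same way as a file holding one.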

IUB/IUPAC DNA nucleic acid code

Here is a list of the standard IUB/IUPAC nucleic acid codes.

Abbreviation    Meaning
A               A
C               C
G               G
T               T
U               U
R               A or G (puRine)
Y               C, T, or U (pYrimidine)
K               G, T, or U (bases with Ketone)
M               A or C (bases with an aMino group)
S               C or G (Strong interaction)
W               A, T, or U (Weak interaction)
B               not A (B comes after A)
D               not C (D comes after C)
H               not G (H comes after G)
V               neither T nor U (V comes after U)
N               A, C, G, T, or U (Nucleic acid)
X               masked
-               Gap of unknown length
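
One way to put the ambiguity codes to work is to treat each code as the set of bases it may stand for. The sketch below follows the standard IUPAC definitions and assumes DNA only, so U is left out. The names IUPAC_DNA and matches are hypothetical.

```python
# IUPAC nucleotide codes expanded to the DNA bases they can stand for.
# Assumption: DNA only, so U is left out and T is used throughout.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(pattern, sequence):
    """Return True if each base of sequence is allowed by the pattern code."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC_DNA.get(code, "")
        for code, base in zip(pattern.upper(), sequence.upper())
    )


print(matches("GAR", "GAA"))   # R allows A or G
print(matches("GAR", "GAC"))   # C is not a purine
```

The first call prints True and the second prints False, since R stands for A or G only.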

IUB/IUPAC amino acid residue code

Here's a list of the amino acid codes: 22 amino acids, 3 ambiguity codes (B, J and Z) and 3 special characters (X, * and -).

Abbreviation    Meaning
A               Alanine
B               Aspartic Acid (D) or Asparagine (N)
C               Cysteine
D               Aspartic Acid
E               Glutamic Acid
F               Phenylalanine
G               Glycine
H               Histidine
I               Isoleucine
J               Leucine (L) or Isoleucine (I)
K               Lysine
L               Leucine
M               Methionine
N               Asparagine
O               Pyrrolysine
P               Proline
Q               Glutamine
R               Arginine
S               Serine
T               Threonine
U               Selenocysteine
V               Valine
W               Tryptophan
Y               Tyrosine
Z               Glutamic Acid (E) or Glutamine (Q)
X               Any
*               Translation Stop
-               Gap of unknown length

Specific file extensions

A generic FASTA file has the .fas extension. For more specific types, we can use the following:

.fna: FASTA nucleic acid. Specifies nucleic acids.
.ffn: FASTA nucleotide coding regions. Contains coding regions for a genome.
.faa: FASTA amino acid. Contains amino acids.
.frn: FASTA non-coding RNA. Contains non-coding RNA regions for a genome.

Multi-FASTA format

If we just append multiple sequences in FASTA format, we get multi-FASTA format. This is a single file with several sequences, and it is often used as input for multiple sequence alignment programs like ClustalW or multialign.

Obtaining FASTA-format

To get a FASTA-formatted sequence from the NCBI GenBank database, simply click Display Settings near the top of the record and then click FASTA.

Obtaining FASTA-format for the insulin protein from the NCBI protein database. Simply click Display Settings, then FASTA.

Converting FASTA sequences

Keep in mind that there are programs out there like READSEQ that allow you to convert formats to and from FASTA.

FASTQ format

The FASTA format is extremely simple, with just two parts per sequence: the first is the description and the other is the raw sequence. The simplicity is nice when running a quick pairwise alignment, but limiting when we need more information per sequence.

With next-generation sequencing instruments pumping out millions of reads per run, scientists needed a way to check the quality of each base call. To document both the sequence and the probability of each base call being correct, scientists came up with the FASTQ format. The "Q" comes from quality, as in the quality of the read.

The file extension for FASTQ is either .fq or .fastq .

Original development

The FASTQ format was developed by the Wellcome Trust Sanger Institute, and became the de facto standard for high-throughput sequencing instrument outputs.

In addition to storing biological sequence information, it also adds a line for the quality scores. Each score is encoded with a single ASCII character.

Characteristics

Let's take a look at an example FASTQ format, then look at each line.

@SEQ_ID
TTCAACTCGTTAGTAAATATCAAACGATCAGTACCATTTTGGGGTTCAAAGTGACAGTTT
+
!'>>>>CCC'*((((***(***-+*'')+))%%%++))**55CCF>>%%%%).1CCCC65

1) Sequence identifier and description

The first line begins with an '@' character and contains the sequence identifier with an optional description. This is just like FASTA's first line.

Illumina sequence identifiers

Here is an example sequence identifier from Illumina:

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R   Unique instrument name
6               Flowcell lane
73              Tile number within the flowcell lane
941             x-coordinate of the cluster within the tile
1973            y-coordinate of the cluster within the tile
#0              Index number for a multiplexed sample
/1              Member of a pair
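
Because the fields are separated by fixed punctuation, an identifier of this shape can be taken apart with a regular expression. The sketch below assumes exactly the layout of the example (instrument:lane:tile:x:y#index/pair). Other Illumina identifier layouts exist and will not match. The names ILLUMINA_ID and parse_illumina_id are hypothetical.

```python
import re

# Pattern for the identifier layout of the example:
# @instrument:lane:tile:x:y#index/pair
ILLUMINA_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_illumina_id(line):
    """Split an identifier line into named fields, or return None."""
    match = ILLUMINA_ID.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "pair"):
        fields[key] = int(fields[key])   # numeric fields become integers
    return fields


fields = parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(fields)
```

Returning None for anything that does not match keeps the helper safe to call on identifiers from other sources, such as the generic @SEQ_ID line.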

2) Raw sequence letters

The second line contains raw sequence reads, also similar to FASTA files.

3) Line 3: +

Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.

4) Quality scores

The fourth line encodes the quality score for each base call. This line must have the same length as the sequence in line 2.

Scores range from ! (the lowest quality) to ~ (the highest). These characters correspond to ASCII table values 33-126.

An ASCII table, courtesy of Wikipedia.

Subtracting 33 from each ASCII value shifts the range down to 0-93, which gives the Phred score, but we rarely see a Phred score of over 60.
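
The decoding step is small enough to sketch in Python. The example below assumes simple four-line records (no wrapped sequences) and the offset of 33 described in this section. The function names and the toy record are made up for illustration.

```python
def read_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    lines = [line.rstrip("\r\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string to a list of integer Phred scores."""
    return [ord(char) - offset for char in quality]


toy = ["@read1", "ACGT", "+", "!5I~"]
for name, seq, qual in read_fastq(toy):
    print(name, seq, phred_scores(qual))
```

For the toy record, the characters !, 5, I and ~ have ASCII values 33, 53, 73 and 126, so they decode to the scores 0, 20, 40 and 93.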

Quality

To map the quality score to the probability that a base call is wrong, we use a bit of math.

$Q_{\text{sanger}} = -10 \log_{10} p$

The p is the probability that the corresponding base call is incorrect, and Q is the Phred quality score which can range from 0 to 93.
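
Rearranging the formula gives p = 10^(-Q/10), which is easy to try out. The short sketch below converts in both directions. The function names are hypothetical.

```python
import math


def error_probability(q):
    """Probability that a base call is wrong, given its Phred score Q."""
    return 10 ** (-q / 10)


def phred_from_probability(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    print("Q=%d  P(error)=%g  accuracy=%.2f%%" % (q, p, 100 * (1 - p)))
```

Every 10 points of Q divide the error probability by ten: a score of 20 means a 1 in 100 chance that the base call is wrong, and a score of 30 means 1 in 1000.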

References

For a more complete guide on FASTQ, visit the FASTQ format Wikipedia page.

SAM, BAM and CRAM

Before we talk about SAM, BAM and CRAM, we must discuss SAMtools, the software from which these formats originate.

What is SAMtools?

SAMtools is a suite of utilities that allow for efficient post-processing of short DNA sequence read alignments. The program includes several command line programs such as view , sort , and index that allow for next-generation sequence data processing.

The SAM, BAM and CRAM file formats come from the use of SAMtools .

What is the SAM format?

The name SAM comes from Sequence Alignment/Map. In addition to regular sequence reads, SAM includes alignment data that link short reads to a reference sequence. This makes SAM the format of choice for visualizing short read alignments in genome browsers such as IGV (Integrative Genomics Viewer).

IGV (Integrative Genomics Viewer) uses SAM files to view short read alignments to a reference sequence. Image from Illumina's BaseSpace blog.

What are BAM and CRAM?

The SAM format is simple to parse, generate and check for errors. However, its large file size (~10 GB on average) gets in the way of efficiency. Thus, researchers found a way to compress it into a binary format, BAM, without losing the ability to manipulate it. BAM contains an indexable representation of nucleotide sequence alignments, allowing for intensive data processing in production pipelines.

CRAM is a restructured version of the binary BAM format that stores the data in a column-oriented layout.

References

For more reading on SAM and BAM, head over to the Center for Statistical Genetics.

BED format

BED is a tab-delimited file format that allows users to define how the data lines of an annotation track are displayed.

If you're unfamiliar with annotation tracks, they're simply the lines that are displayed in a genome browser.

itemA and itemB are sample annotation tracks.

BED files can have up to 12 columns, but only three are required for the UCSC browser, Galaxy browser and bedtools. The number of columns must be consistent across all rows of the file.

Let's look at all 12 BED fields, as explained by the UCSC Genome Browser Information section.

3 Required BED fields

The following 3 fields are required for all BED files.

chrom: Name of the chromosome (chr5, chrX, chr2_random) or scaffold (scaffold10671).
chromStart: Starting position of the feature on the chromosome or scaffold. The first base is numbered 0.
chromEnd: Ending position of the feature. The base at this position is not included in the display. For example, the first 20 bases would have a chromStart value of 0 and a chromEnd value of 20.
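
The start and end convention is the part that most often trips people up, so here is a small sketch of reading the three required fields in Python. The function name parse_bed_line and the sample line are made up for illustration.

```python
def parse_bed_line(line):
    """Parse the three required BED fields; extra columns are kept as-is."""
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) < 3:
        raise ValueError("a BED line needs at least 3 tab-separated columns")
    chrom, start, end = columns[0], int(columns[1]), int(columns[2])
    return {
        "chrom": chrom,
        "chromStart": start,     # 0-based, included in the feature
        "chromEnd": end,         # not included in the feature
        "length": end - start,   # so the length needs no +1
        "extra": columns[3:],
    }


feature = parse_bed_line("chr5\t0\t20\titemA")
print(feature["length"])
```

With a chromStart of 0 and a chromEnd of 20, the feature covers exactly 20 bases, numbered 0 through 19.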

9 Optional BED fields

These 9 BED fields are optional.

name: Name of the BED line.
score: Score between 0 and 1000. If useScore is set to 1, the score will determine the level of gray that is displayed. A higher number equates to a darker shade.
strand: Which strand, either '+' or '-'.
thickStart: The position where the feature starts being drawn thickly (the start codon for gene display).
thickEnd: The position where the feature stops being drawn thickly.
itemRgb: Determines the color of the data contained in the BED line, such as (255,0,0) for red. Use the Color Picker to translate a color.
blockCount: Number of blocks (exons) in the BED line.
blockSizes: Comma-separated list of block sizes. The size of the list should correspond to blockCount.
blockStarts: Comma-separated list of block starts. Each start should be calculated relative to chromStart. The size of the list should correspond to blockCount.

UCSC Genome BED file display.

References

BEDtools - Read the Docs

UCSC Genome Browser - BED format

Wig and BigWig

The Wiggle format ( .wig ) is an efficient way to store dense, continuous blocks of data. It is primarily used to store values such as GC percentage, probability scores and transcriptome data. Instead of specifying a value for each nucleotide position, wig allows you to bind values to entire regions that follow a certain pattern.

BigWig

Like SAM and BAM, wig has an indexed binary equivalent called bigWig. This allows for efficient data handling, as only the needed parts of the file are extracted and processed when viewing particular regions in a genome browser. For conversion, use the wigToBigWig program.

Characteristics

The .wig filetype contains one or more blocks. On the top of each block is the track declaration line , which defines the data elements with a number of options.

Track definition line

There are several options we can place on the first line to characterize that particular block of information. Each option should be formatted as a key=value pair.

name: Name of the block.
description: Describes the region in detail.
priority: Integer describing the order in which to display tracks.
color: Color per track in RGB or hexadecimal.
graphType: Bar or point graph.

The two main formatting options per block are variableStep and fixedStep .

variableStep

The variableStep option is the more common option. It includes the chromosome position in one column, and data values in another.

variableStep chrom=chr4
400001 13
400002 13
400003 13
400004 13
400005 13

The declaration line gives the chromosome and, optionally, a parameter known as span , which tells us the number of bases each value should cover.

The use of the "span" parameter can help us save space. The following describes the same data as the block above, but in far fewer lines.

variableStep chrom=chr4 span=5
400001 13

fixedStep

In case you have data blocks with regular intervals between each position, you can use the fixedStep option. This allows you to place the positions on the track definition line, along with the interval length. Thus, only one column is necessary for the data parameters.

fixedStep chrom=chr4 start=400001 step=100
13
14
15

The above block would feature chromosome 4, position 400001 as having a value of 13, position 400101 having the value 14, and position 400201 having value 15.

You may also specify a span, indicating the number of bases each value covers.

fixedStep chrom=chr4 start=400001 step=100 span=5
13
14
15

This is similar, but each value covers five nucleotides instead of just one. Thus we have 13 for 400001-400005, 14 for 400101-400105, and 15 for 400201-400205.
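
To check your understanding of start, step and span, it can help to expand a block into one value per base. The sketch below is a simplified reader written for this lesson, not the UCSC implementation: it skips track, browser and comment lines and only understands the variableStep and fixedStep declarations described here. The name expand_wig is hypothetical.

```python
def expand_wig(lines):
    """Yield (chrom, position, value) for every base covered by a wig block."""
    mode, params, position = None, {}, None
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] in ("track", "browser") or line.startswith("#"):
            continue
        if tokens[0] in ("variableStep", "fixedStep"):
            # Declaration line: remember the mode and its key=value options.
            mode = tokens[0]
            params = dict(token.split("=", 1) for token in tokens[1:])
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params.get("step", 1))
            continue
        if mode == "variableStep":
            start, value = int(tokens[0]), float(tokens[1])
        elif mode == "fixedStep":
            start, value = position, float(tokens[0])
            position += step              # next value starts one step later
        else:
            raise ValueError("data line before a declaration line")
        for base in range(start, start + span):
            yield params["chrom"], base, value


block = ["fixedStep chrom=chr4 start=400001 step=100 span=5", "13", "14", "15"]
points = list(expand_wig(block))
print(points[0], points[-1])
```

For the fixedStep block with span=5, the first value lands on positions 400001 through 400005 and the last value ends at position 400205, for 15 positions in total.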

GFF and GTF formats

GFF, or the General Feature Format, is used to describe genes and other features of DNA, RNA and protein sequences. It comes with the .gff extension.

What exactly is GFF?

GFF is an extension of a basic file with the name, start and end parameters (NSE). For example, an NSE (Chromosome2,2000,4000) specifies two kilobases found on chromosome 2. GFF allows the annotation of these segments.

Name, start and end parameters (NSE).

GFF allows for users to perform common operations such as intersection, exclusion, union, filtration, sorting, transformation and dereferencing.

What types of software use GFF?

Several types of bioinformatics software use GFF. These include genome viewers such as GBrowse, Jalview and IGB.

Different versions

There are several versions of GFF. The ones used today are GFF2, GTF and GFF3.

GFF2 (General Feature Format version 2) was limited in that it could only handle two-level feature hierarchies instead of three-level ones such as gene -> transcript -> exon. Thus the Sequence Ontology and GMOD projects expanded on it with GFF3.

GTF (General Transfer Format) has also been known as GFF version 2.5, since it improves on version 2, but not as much as version 3.

Characteristics

GFF consists of one line per feature, each containing 9 columns of data. The columns are separated by tabs, making it a tab-delimited file.

Optional track lines

Within the file, we can also include optional track definition lines. These go at the beginning of the list of features they are to affect.

Fields

refseq name: Name of the chromosome or scaffold. Chromosomes can be given without the 'chr' prefix. The name must be one used within Ensembl.
source: Source of the annotation, i.e. the name of the program that generated this feature.
feature: Feature type name, such as gene, variation or similarity.
start: Start position, counting from 1.
end: End position, counting from 1.
score: Floating point value, for scores such as similarity, identity, etc.
strand: '+' for forward and '-' for reverse.
frame: Either 0, 1 or 2. 0 indicates that the first base of the feature is the first base of a codon, 1 indicates that the second base of the feature is the first base of a codon, etc.
attribute: Semicolon-separated list of tag-value pairs. Provides additional information about each feature.
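
Since every feature line has the same nine tab-separated columns, a GFF line maps naturally onto a dictionary. The sketch below is a simplified reader with several assumptions: the column keys are my own labels (seqname stands for the first field), the sample line is invented, "." is treated as an empty score, and the attribute helper only handles the tag=value style used by GFF3 (GFF2 and GTF write their attributes as tag "value" instead).

```python
GFF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute")


def parse_gff_line(line):
    """Parse one tab-delimited GFF feature line into a dict."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != 9:
        raise ValueError("expected 9 columns, found %d" % len(values))
    record = dict(zip(GFF_COLUMNS, values))
    record["start"] = int(record["start"])   # 1-based, inclusive
    record["end"] = int(record["end"])       # 1-based, inclusive
    record["length"] = record["end"] - record["start"] + 1
    # "." marks an empty score.
    record["score"] = None if record["score"] == "." else float(record["score"])
    return record


def parse_gff3_attributes(text):
    """Split GFF3-style 'tag=value;tag=value' attributes into a dict."""
    pairs = (item.split("=", 1) for item in text.split(";") if "=" in item)
    return {tag.strip(): value.strip() for tag, value in pairs}


# Invented sample line: a gene from position 2000 to 4000 on chromosome 2.
record = parse_gff_line("2\texample\tgene\t2000\t4000\t.\t+\t.\tID=gene1;Name=demo")
print(record["length"], parse_gff3_attributes(record["attribute"]))
```

Note the +1 in the length calculation. Because GFF positions count from 1 and include both ends, a feature from 2000 to 4000 covers 2001 bases, unlike the 0-based BED convention.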

Validator

Validators allow us to ensure that a file is formatted properly. To validate a GFF3 file, go to the GFF3 validator.

References

Ensembl

Wellcome Trust Sanger Institute. GFF: an exchange format for feature description

Conversion tools

With so many different filetypes, bioinformaticists need a quick way to convert among the types.

Here is a list of widely used converters.

ReadSeq: Converts between a selection of biological sequence formats. See the Readseq homepage.
SeqVerter: Free sequence file converter. Available on GeneStudio.com.
Seqret: Not a filetype converter, but provides a number of functions for a sequence. See the EMBOSS Seqret webpage.

Conclusion

Hopefully this lesson gave you a good idea of some of the more common filetypes used in bioinformatics. Any thoughts, questions or concerns? Please leave a comment below!