Posted on January 9, 2019

Introduction

Hyrax ABIF is a Haskell package, that I created at HyraxBio to test our bioinformatics software pipeline. We have released the HyraxAbif package as open source (BSD3 licence) in the hopes that it will be useful to others.

In this post I’ll show how the package can be used as a standalone tool as well as looking at how the Haskell code works. Skip directly to the usage / ABIF format / Haskell sections if that is all you are interested in reading about.

Licence

See the LICENCE file. Please note that this package is distributed without warranties or conditions of any kind.

Chromatograms & some basic biology

Chromatograms

Part of what we do at HyraxBio is analyze DNA sequences to determine drug resistance for various pathogens. The first step in this process getting DNA data from a sequencing machine. The mechanics of sequencing are pretty complex. Fortunately for us we start with the data already sequenced which means that all the “wet-work” is done and we can analyse and interpret the results as data, i.e. bioinformatics.

DNA is made up of four bases A , C , G and T (adenine, cytosine, guanine, and thymine respectively). A sequencing machine takes DNA strands and determines the sequences of bases that are present. There is a fair amount of complexity here. You can’t simply grab a strand of DNA and read it in its entirety and certainly not with 100% accuracy (because biology). Rather the DNA is amplified and a consensus of reads for each position in the DNA strand is calculated. There can be both (many) variations of the same virus (mutations) as well as errors in the reading process itself. So each position is calculated based on which of the bases have the strongest signal per position.

ABIF files are generated by these sequencing machines by using chemical reactions that release a tiny amount of coloured light when a reagent reacts with one of the bases. Each base results in a different colour which enables the machine to detect which base is present DNA. The details behind this are fascinating see e.g. wikipedia for more detail if you are interested.

Below is a section of a chromatogram showing a wave for each of the four bases.

This is a perfect chromatogram, there are often multiple possibilities per position of different intensity. In the image below you can see that the second and third positions have more than one possible base, this is called a mix.

Even that is an unnaturally clean chromatogram. In reality they often look more like this, and take complicated base calling software and/or trained lab workers (or overworked PHD students) to decide on what base is actually represented.

The chromatogram data in ABIF format is fed into base calling software like PHRED and/or recall which analyze the chromatogram and decide on which base to call per position. The result being a string of bases (A/C/G/T).

Why we created hyraxAbif

Testing a full bioinformatics pipeline is critical to ensuring that every step works correctly and results in high quality outputs. The problem is that we could find no practical existing way to generate our own chromatograms (ABIFs). It is possible to use a set of existing ABIF files but this has two major problems

These are DNA sequences from real people so there are confidentiality issues

More practically, we needed very specific input data to test decisions further down the pipeline and finding real data with the exact mutations and no others is not realistic.

HyraxABIF was created to resolve this. It lets us easily create chromatograms from a DNA sequence and thus do all the testing we need to.

A bit more biology

The image above has a bit more detail, it shows the bases including ambiguous ones as well as the amino acids.

Ambiguous bases

As discussed above there could be multiple possibilities per position. The IUPAC ambiguity codes (see wikipedia ) or bioinformatics.org are a way of encoding the ambiguity in a single letter. For example a Y IUPAC code means that the base is either a C or a T .

Amino acids

Each group of three nucleotide bases is called a codon and encodes for a single amino acid (in coding regions…). The chromatogram also shows the amino acid per codon. This is not important to know for this post but may help if you see other pictures of other chromatograms as this will usually be show.

Using the application

HyraxAbif’s primary goal was for generating chromatograms from an input DNA sequence. It can be installed from hackage with cabal, from stack, or by cloning the git repo.

Generating a simple ABIF

The input for generating a chromatogram is a simplified FASTA format file. These files look like this

> 1 ACTG

The first line is the weight (more on this later), the second is the DNA sequence. Given this input file you would run

hyraxAbif-exe gen inputDir/ outputDir/

and you would end up with a ABIF file and a chromatogram like this

For many scenarios this is all you’ll need. You create a folder of FASTA input files and get a folder of generated corresponding ABIF files.

Generating more complex chromatograms

You can also generate chromatograms with mixes. The first line has the weight for the sequence, and each FASTA file can contain multiple reads.

> weight read > weight read

The weight is a numeric value between 0 and 1 that specifies the weight of the current read i.e. the intensity of the peak.

No other header/name is allowed (no quality data / naming etc)

The read is the set of input nucleotides, IUPAC ambiguity codes are supported (MRWSYKVHDBNX).

A read can be single or multi-line

Weights for each position are summed to a maximum of 1.0 per nucleotide

You can use _ as a “blank” nucleotide, in which case only the nucleotides from other reads will be considered

Reads need not be the same length

For example

> 0.5 ACG > 0.3 AAAA > 1 __AC

Results in the following weighted nucleotide per position

position A C G T 0 0.5 + 0.3 = 0.8 0 0 0 1 0.3 0.5 0 0 2 0.3 + 1 = 1.0 0 0.5 0 3 0.3 1 0 0

Note that 0.3 + 1.0 = 1.0 because the max value is 1.0

And this sample

> 1 ACAG > 0.3 _GT > 0.2 _G

results in this chromatogram

The IUPAC codes here are

S = G or C

= or W = A or T

Reverse reads

A weighted FASTA can represent a reverse read. To do this add a R suffix to the weight. The data you enter should be entered as if it was a forward read. This data will be complemented and reversed before writing to the ABIF

> 0.9R ACAG

which results in the sequence TGTC

Dumping an existing ABIF file

You can also dump an existing ABIF file

hyraxAbif-exe dump sample.abif

This prints two views of the file. First a detail view, partially show below

and then a summary

For certain types of data (e.g. strings) the parsed value is displayed.

The ABIF file format

The ABIF format is documented here. As you can see from the spec the ABIF format was modeled after the TIFF format. This means that there is a directory of entries and each entry has a data type.

The spec is quite thorough and explains the layout well. If you are wanting to understand the format it is your best starting point. The spec, however, only goes into detail on the ABIF structure. It does not go into much detail on how the chromatogram data itself is stored, I’ll cover that here.

In the file after the ABIF header and version number is the root directory entry. This entry points to the first of the data directory entries that can be located at any other location in the file.

Each directory entry has an offset to the location of its data in the file, the data size, the element size and number of entries. See the spec or the discussion of the Haskell code below for more details on each field.

Note that for data with a size of four bytes or less, the data is stored in the offset field itself.

Important ABIF directory entries

Towards the end of the spec are examples of the layout of ABIF files for a few sequencing machines. Lets take a look at some of these for the 3500 layout.

Chromatogram traces

Type Tag Number Contains DATA 1 - 4 The raw data per base (channel). The order of the bases is specified by the FWO_ entry DATA 5 Short Array holding measured volts/10 (EP voltage) during run DATA 6 Short Array holding measured milliAmps trace (EP current) during run DATA 7 Short Array holding measured milliWatts trace (Laser EP Power) during DATA 8 Short Array holding measured oven Temperature (polymer temperature) trace during run DATA 9 - 12 Short Array holding analyzed color data

This is a pretty intimidating set of values we thought we would have to generate from a FASTA input, just for the traces. Fortunately through trial and error we were able to see that only a small subset of the entries were required for the base calling software we were using (PHRED + Recall). All we needed to generate were the data sections 9 to 12, i.e. one per base, the analyzed colour data.

The four DATA sections we need to generate (entries 9 through 12) contain an array of shorts. Each short represents the intensity of the light for that base at a given point. Each of these DATA sections have the wave of the light intensity over time. The FWO_ directory entry specifies which base each DATA entry represents. We always generate it in the 3500 format, so the order is 9=G , 10=A , 11=T , 12=C .

Peak location

The traces above are just the wave form, the PLOC entry specifies the location of each peak. This is the location, across all four DATA entries, where the peak of the waves should be found. There is a single PLOC entry for all four DATA entries.

Other required entries

The other required entries are easy to generate, they are things like the base order (FWO_), file name (PDMF) and called based (PBAS). See the Haskell code discussion below to see them all.

Generating the waveforms

Given a base, we then need to create a wave and a single peak location entry. The data we use for each wave is this array of shorts [0, 0, 128, 512, 1024, 1024, 512, 128, 0, 0] which creates a wave like this.

The peak is the middle of the wave, nice and simple.

Again see the code discussion for more details in this.

Using the code, some basic examples

Reading and printing the ABIF structure

readAbif tries to parse an ABIF file, it returns an Either Text Abif Check if the file was parsed successfully If not (Left) then print the error If successful (Right) then clearAbif removes all the raw data. If you don’t do this then all the massive byte arrays will get printed too print the result.

The functions are all commented and visible on hackage.

Only the Right case is different than the previous example

addDirectory is called to add a new comment directory entry that is created by mkComment writeAbif writes the updated file to disk

More examples

The Examples directory contains more examples. You can also look at the Generate and Main modules to see how the code is used.

Understanding the code

Data.Binary

Data.Binary is used to read and write the raw bytes in the ABIF files.

Writing

Below is an example of writing two Int8 and an Int32 value.

runPut “runs” the PutM monad. testWrite can then simply call the put* functions to write the data in whatever format is required.

This creates a file that looks like this

Reading

Reading the data from the file looks like this

runGet is given the ByteString from the file and testRead gets the values in the appropriate format.

If you prefer applicatives you could instead have written testGet as

and testWrite as

ABIF Types

Hyrax.Abif contains the core types for the package

src/Hyrax/Abif.hs (22 to 32)

src/Hyrax/Abif.hs (37 to 64)

These three types make up most of what we need to represent an ABIF. A few things to notice

The root directory in the Abif type will point to the array of Directory entries

type will point to the array of entries dElemTypeCode is the integer value read from the file (see the spec for the codes). dElemType and dElemTypeDesc are interpreted values from this

is the integer value read from the file (see the spec for the codes). and are interpreted values from this dElemSize and dElemOffset are read from the file, but are automatically calculated when writing (see the Hyrax.Abif.Write section below)

and are read from the file, but are automatically calculated when writing (see the Hyrax.Abif.Write section below) dData is the actual raw data read from the file, or data to be written to an ABIF file

is the actual raw data read from the file, or data to be written to an ABIF file dDataDebug is populated while reading the file and used during dumping to give human readable info about the file being inspected.

Element types

The remaining code is the definition of ElemType and functions for interpreting the raw element type integer value. Note that the spec defines a number of unsupported data types, these are included here.

Hyrax.Abif.Read

Starting the read

readAbif calls getAbif to parse the data

src/Hyrax/Abif/Read.hs (51 to 53)

getAbif starts the parsing of the Abif data structure.

src/Hyrax/Abif/Read.hs (58 to 75)

The Either monad is used so any Left value will short-circuit out of the function and return the Left value immediately.

monad is used so any value will short-circuit out of the function and return the value immediately. Data.Binary.runGetOrFail returns a Left if the get operation fails. Much better than getting an exception as you would with runGet

returns a if the get operation fails. Much better than getting an exception as you would with The first step is to read the header and the root directory

Then the ABIF directory entries are read. These directories are found at the offset specified in the root directory. The code “goes” to this offset by dropping the number of bytes from the read data.

getDirectories reads the number of directory entries specified by the root entry.

getRoot gets the header and root directory (see next section)

src/Hyrax/Abif/Read.hs (218 to 223)

getHeader gets the “ABIF” magic string and version number. Similar to the read example above.

src/Hyrax/Abif/Read.hs (209 to 213)

Reading directories

getDirectories is given the number of directories to read. It tries to read a single Directory by calling getDirectory and then recursively calls itself until done.

src/Hyrax/Abif/Read.hs (272 to 279)

Reading and individual directory is done by getDirectory .

Read the values and convert into appropriate types (e.g. Int8 to Int)

If the data is four bytes or less then the offset field contains the data

If the data is larger than four bytes, go to the offset and read the entire chunk as a ByteString .

. Create a Abif value

src/Hyrax/Abif/Read.hs (228 to 267)

Reading strings

A PString is prefixed with an Int8 size. So read the size and then the string

is prefixed with an size. So read the size and then the string A CString is null terminated, so read all the data for the length of the string (from the directory entry) and drop the final null character.

src/Hyrax/Abif/Read.hs (194 to 204)

Debug info

getDebug adds human readable information for some types, e.g. for strings. This lets us print the Abif structure to the console with some useful data. Only a portion of getDebug is show here as it is a little repetitive. However it is a good function to look at to see more examples of reading the raw data.

src/Hyrax/Abif/Read.hs (94 to 112)

When printing the structure it does not make sense to print all the raw data too. So the clear* functions remove that before printing

src/Hyrax/Abif/Read.hs (80 to 89)

Hyrax.Abif.Write

As with the read functions there are two write functions for writing to ByteString or to a file.

src/Hyrax/Abif/Write.hs (54 to 64)

Writing the ABIF data is relatively simple since each directory entry already contains the ByteString raw data. putAbif does need to recalculate the data size though

src/Hyrax/Abif/Write.hs (69 to 98)

The total data size is calculated. It is the sum of all the non-root directory entries where the data is not stored in the offset field (i.e. where data size > 4 bytes)

The data will starting being written at offset 128, i.e. immediately after the header and root directory entry

Add the root directory

Write the 47 zeros required by the spec

Write all the data from each of the directory entries

Write the directory entries, incrementing the offset for each entry with data > 4 bytes

Header

Writing the header is pretty simple, write the magic string and version number.

src/Hyrax/Abif/Write.hs (117 to 121)

Strings

There are two functions for writing Text values

src/Hyrax/Abif/Write.hs (103 to 112)

Directory

When writing a directory there a few things to take care of

Ensure that the directory name is exactly 4 bytes long

Write the offset for data > 4 bytes

Write the data to the offset field if it is <= 4 bytes, ensure it is exactly 4 bytes long on disk

Append the reserved zero value

src/Hyrax/Abif/Write.hs (126 to 142)

mk* helper functions

The mk* set of functions help in constructing valid directory entries.

Below are two of these functions

mkBaseOrder which creates a FWO_ Directory entry.

which creates a FWO_ Directory entry. mkLane which creates a LANE Directory entry.

As you can see these functions take appropriately typed values in and produce a valid directory entry for the data and directory type. (See Hyrax.Abif.Generate to see them in use)

src/Hyrax/Abif/Write.hs (48 to 49)

src/Hyrax/Abif/Write.hs (208 to 244)

See the code or haddock for the full set of mk* functions.

Adding a directory

addDirectory appends a directory entry to an existing Abif . See the examples to see this in use.

src/Hyrax/Abif/Write.hs (346 to 349)

Hyrax.Abif.Generate and Hyrax.Abif.Fasta

Generating ABIFs is the main purpose of this package and the code to do this is in Hyrax.Abif.Generate . There is less than 200 lines of code, but I’ll go through how it works in some detail.

generateAb1 is the main function in this module, it controls the flow of generating a single ABIF. It has the following high level concerns

Generate the traces per base from the weighted FASTA Generate the peak locations Generate the directories Create the ABIF

src/Hyrax/Abif/Generate.hs (131 to 171)

A quick detour - Reading the weighted FASTA

readWeightedFasta reads the contents of a single weighted .fasta file. (Unless you are interested in how the FASTA parsing works, you can skip this and go to the next section. Just have a look at what the types represent).

The parsed content has the type [('Double', 'Text')] , which stores the data like this

[('Double', 'Text')] ^ ^ | | | +---- read | +---- weight

i.e. an array of weights together with the sequence at that weight.

src/Hyrax/Abif/Generate.hs (265 to 290)

The FASTA is read and parsed in Hyrax.Abif.Fasta . Note that readWeighted handles the reverse read logic by calling complementNucleotides and then reversing the string. This section of the code is not entirely relevant for this discussion of the ABIF generation so I wont spend much time on it.

src/Hyrax/Abif/Generate.hs (391 to 406)

src/Hyrax/Abif/Fasta.hs (23 to 51)

readWeightedFastas reads all the FASTA files from a directory and returns a tuple of ( file-name, f ) where f is [('Double', 'Text')] as described above.

src/Hyrax/Abif/Generate.hs (296 to 317)

Generating the trace data

generateTraceData does the bulk of the work in the ABIF data generation

src/Hyrax/Abif/Generate.hs (181 to 183)

src/Hyrax/Abif/Generate.hs (187 to 188)

Lets break (\(w, ns) -> (w,) . unIupac <$> Txt.unpack ns) <$> weighted down a bit

Its running a lambda for each weighted element So lambda <$> weighted Weighted has the type [(Double, Text)] as discussed above The lambda takes the params \(w, ns) . I.e. it destuctures a tuple from the array and gets the weight and the string of nucleotides. For each nucleotide f <$> Txt.unpack ns f is (w,) . unIupac So each nucleotide gets passed to unIupac (as a Text ) and added to a tuple with the weigh, so (weight, [nucleotide]) unIupac takes a possibly ambiguous nucleotide code and returns the list of nucleotides it represents. E.g. V -> ACG

And then List.transpose is called. This gives us all the nulceotides and weights per position

This code is perhaps a bit hard to follow, so here is an example showing how this would work for the weighted FASTA

>1 AC >0.5 WK

The weighted fasta is parsed as

Each of the nucleotides is passed to unIupac , and since W = AT and K = GT we get

Finally, the list is transposed to get the weight and nucleotide per position

position 0 has an A with weight 1 and an A / T with weight 0.5

has an with weight 1 and an / with weight 0.5 position 1 has a C with weight 1 and a G / T with weight 0.5

src/Hyrax/Abif/Generate.hs (192 to 195)

Next the shape of the curve is defined. A curve this shape, was selected as it has some space either side to avoid mixing with neighboring waves and a steep climb so that the peak is easily detectable.

src/Hyrax/Abif/Generate.hs (199 to 203)

src/Hyrax/Abif/Generate.hs (224 to 232)

getWeightedTrace is then called for each of the four bases. For each position for a base it returns a curve. If the position does not have the base then the curve is flat (zeros), if it does the curve above is returned multiplied by the weight.

Again an example may make this easier to understand

We start with the same parsed FASTA as above

We define a small curve as [0, 100, 0]

We call getWeightedTrace for the A and G bases

For A

there is a A at position 0 with a total weight of 1 (remember max is 1.0) so the full curve is used

at position with a total weight of (remember max is 1.0) so the full curve is used no A at position 1

For G

no G at position 0

at position there is a G at position 1 with a weight of 0.5 so each value in the wave is multiplied by 0.5

Notice that in the code above, these results are then concatenated so the actual results are

With that have a way to generate a wave form for the input weighted fasta

src/Hyrax/Abif/Generate.hs (207 to 209)

The ABIF needs to store the called bases in the PBAS entry. We get the bases from the input data, IUPAC encode each position and we have the sequence.

src/Hyrax/Abif/Generate.hs (213 to 219)

And return the TraceData value

For completeness here is the unIupac function

src/Hyrax/Abif/Generate.hs (329 to 352)

Generating the peak locations

src/Hyrax/Abif/Generate.hs (140 to 143)

To generate the array of peak locations

Take the midpoint of a single wave (which will always be the peak for the shape of the waves we have define)

Create an array of positions that start from this point, per wave for the total length of the input data

Given a curve of [0, 10, 10, 0]

valsPerBase = the length of the curve = 4

= the length of the curve = 4 midPeak = 4 / 2 = 2

= 4 / 2 = 2 The peaks generated are [2, 6, 10, 14, 18...... , one element per length of the input FASTA.

, one element per length of the input FASTA. This is the data that is stored in the PLOC directory entry.

Generating the ABIF

We now have all the data we need, the mk* functions are used to generate the minimal set of directories

src/Hyrax/Abif/Generate.hs (150 to 170)

And with that we can generate any test ABIF we need. The code is much shorter than the explanation. Hopefully with the guidance from this post and the code comments it should be easy enough to follow.

Testing

The package comes with property tests that test

Weighted FASTA parsing

Round tripping a ABIF, i.e. generated ABIF == generated + read + written + read

That the peaks in a generated ABIF represent the expected nucleotide sequence.

We used Hedgehog for the property tests. It made writing the properties & generators (see the Generators module) really easy.

For more details see the property tests and the following Hedgehog links

Conclusion

Hopefully you find this package useful, either as a standalone tool or as a library. If you do we would love to hear how you are using it.

If you have any questions feel free to email me.

Thanks

Andre.