Using cljam: a brief tutorial

Here are examples of interacting with SAM/BAM files using cljam. More information on usage and specific functions is provided in the README file and at https://chrovis.github.io/cljam/.

Installation

Cljam is distributed as a Clojure library that can be installed with Leiningen, a popular build tool for Clojure projects. The following statement should be added to the project's Leiningen configuration.
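A minimal sketch of such a configuration is given here, assuming the project is described by a ‘project.clj’ file. The project name and the Clojure version are placeholders, and the cljam version is assumed to be 0.1.3, the release used in the performance measurements later in this tutorial.

```clojure
;; project.clj (sketch; "my-project" and the Clojure version are placeholders)
(defproject my-project "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.7.0"]
                 ;; cljam coordinate; 0.1.3 is assumed here because it is the
                 ;; version used in the performance section of this tutorial
                 [cljam "0.1.3"]])
```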

Leiningen automatically downloads the Java Archive of cljam and resolves its dependencies within the project. Cljam functions can then be used in the code.

Reading a SAM/BAM file

Cljam provides a file reader and a namespace of I/O functions for reading SAM/BAM files. The following code opens a BAM file and retrieves the first five alignments, where pnext, tlen, flag, qname, and rname indicate the position of the mate/next read, observed template length, bitwise flag, query template name, and reference sequence name, respectively, based on the SAM format [13].
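Reading the first alignments might look like the following sketch. It assumes the cljam 0.1.x API, in which ‘cljam.core’ provides ‘reader’ and ‘cljam.io’ provides ‘read-alignments’ taking an option map; later releases may organize these namespaces differently. The helper name ‘first-alignments’ and the file path are hypothetical.

```clojure
(require '[cljam.core :refer [reader]]
         '[cljam.io :as io])

(defn first-alignments
  "Returns the first n alignments of the SAM/BAM file at path.
  Hypothetical helper; assumes the cljam 0.1.x reader API."
  [path n]
  (with-open [r (reader path)]
    ;; The alignment sequence is consumed inside with-open, so doall
    ;; realizes the result before the reader is closed.
    (doall (take n (io/read-alignments r {})))))

;; Example call; "path/to/file.bam" is a placeholder path.
;; (first-alignments "path/to/file.bam" 5)
```

Each returned alignment is expected to be a map whose keys include the fields listed above, such as qname, flag, and rname.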

Sorting a SAM/BAM file

A SAM/BAM file can be sorted by chromosomal coordinates or reference name using functions in the ‘cljam.sorter’ namespace. One typical use is to create a BAM file sorted by chromosomal coordinates.
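A coordinate sort can be sketched as below, assuming the cljam 0.1.x API, in which ‘cljam.core’ provides ‘reader’ and ‘writer’ and ‘cljam.sorter’ provides ‘sort-by-pos’. The wrapper name ‘sort-bam-by-position’ and the directory part of the paths are hypothetical.

```clojure
(require '[cljam.core :refer [reader writer]]
         '[cljam.sorter :as sorter])

(defn sort-bam-by-position
  "Writes a coordinate-sorted copy of the BAM file at in-path to out-path.
  Hypothetical wrapper; assumes the cljam 0.1.x sorter API."
  [in-path out-path]
  (with-open [r (reader in-path)
              w (writer out-path)]
    ;; Sort by chromosomal coordinates
    (sorter/sort-by-pos r w)))

;; Example call; the paths are placeholders.
;; (sort-bam-by-position "path/to/file.bam" "path/to/sorted.bam")
```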

In this case, the input and output files are file.bam and sorted.bam, respectively.

Indexing a BAM file

The ‘cljam.bam-indexer’ namespace has functions for indexing a BAM file. The following code creates a BAI file from a BAM file.
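Index creation can be sketched as follows, assuming the cljam 0.1.x API, in which ‘create-index’ takes the path of a coordinate-sorted BAM file and the path of the BAI file to write. The wrapper name ‘index-bam’ and the paths are hypothetical.

```clojure
(require '[cljam.bam-indexer :as bai])

(defn index-bam
  "Creates a BAI index at bai-path for the sorted BAM file at bam-path.
  Hypothetical wrapper; assumes the cljam 0.1.x indexer API."
  [bam-path bai-path]
  (bai/create-index bam-path bai-path))

;; Example call; the paths are placeholders.
;; (index-bam "path/to/sorted.bam" "path/to/sorted.bam.bai")
```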

Getting pileup information

The ‘cljam.pileup’ namespace provides pileup and mpileup functions equivalent to those of SAMtools. One typical use is to get a simple pileup of the first 10 genomic positions of the chr1 reference.
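A simple pileup can be sketched as below. The sketch assumes the cljam 0.1.x API, in which ‘reader’ accepts an ‘:ignore-index’ option so that the BAM index is loaded, and ‘pileup’ takes a reader, a reference name, and an option argument (nil here). It also assumes that the BAM file has been sorted and indexed beforehand. The helper name ‘first-pileup’ and the path are hypothetical.

```clojure
(require '[cljam.core :refer [reader]]
         '[cljam.pileup :as plp])

(defn first-pileup
  "Returns the pileup of the first n positions of reference rname.
  Hypothetical helper; assumes the cljam 0.1.x pileup API and an
  indexed, coordinate-sorted BAM file at path."
  [path rname n]
  ;; :ignore-index false asks the reader to load the BAM index (assumed option)
  (with-open [r (reader path :ignore-index false)]
    ;; Realize the result before the reader is closed
    (doall (take n (plp/pileup r rname nil)))))

;; Example call; the path is a placeholder.
;; (first-pileup "path/to/sorted.bam" "chr1" 10)
```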

Command line interface

The command line interface of cljam offers a quick way to try its functions. For example, the following command displays the contents of a SAM file, including header information.

Performance of indexing and pileup

We measured the execution times of BAM indexing and pileup with cljam (v0.1.3) using 1, 2, 4, 8, and 12 threads, and compared them with SAMtools (v1.2) and Picard (v1.134), each run with a single thread. We used a BAM file (about 13.2 GB) from the 1000 Genomes Project [14]. The machine specifications were CPU: Intel Core i7-4930K @ 3.40 GHz, 12 MB L3 cache, 12 logical cores (6 physical cores with Hyper-Threading), 64 GB RAM, and SSD storage.

The results for indexing and pileup are shown in Figs. 1 and 2, respectively. Each condition was measured 10 times and the average time of the 10 trials was plotted.

Fig. 1 Execution time of indexing. The green dashed line indicates SAMtools and the red dashed line indicates Picard under single thread conditions because they cannot be run using multithreaded processing. The error bar shows the standard deviation of the result.

Fig. 2 Execution time of pileup. The green dashed line indicates SAMtools under a single thread condition because it cannot be run using multithreaded processing. The error bar shows the standard deviation of the result.

The results indicate that the execution times for cljam decreased up to the 4-thread condition in indexing and the 3-thread condition in pileup. However, the execution times with more than 6 threads in indexing and 4 threads in pileup were almost the same. We believe that the overhead of file I/O when reading BAM files limits the benefit of additional threads, so the performance does not improve further under these parallel conditions. The execution time of pileup in cljam with the 3-thread condition was 1.3 times longer than with SAMtools, which can be considered almost the same performance.

Code metrics

Code readability and maintainability are more important than code optimization in our software development environment, which uses recent high-speed, multi-core CPU technologies. Thus, we used CLOC [15] to measure the logical LOC (lines of code) of the source code of cljam, SAMtools, and Picard. The results indicate that the LOC of cljam was about 1/4 that of SAMtools and 1/9 that of Picard, as shown in Table 1. These three programs do not provide exactly the same functions; thus, they cannot be compared using LOC alone. Cljam has been implemented simply in Clojure, using parallel programming on multi-core processors, with a focus on readability and maintainability.