Michael Schatz and James Taylor

In honor of the 60th anniversary of the publication of the structure of DNA, we organized a contest related to DNA and its applications in current research. The contest began on April 20 and ended on April 25, the anniversary itself, popularly known as 'DNA Day'. The contest drew nearly 1,000 participants from across the world. Reflecting the transition from genetics to genomics in the 60 years since the discovery, the contest was presented as a series of bioinformatics challenges in which participants would assemble, align or otherwise analyze nucleic acid sequences to identify a message hidden in the data.

The contest consisted of five stages, ordered so that the solution to one stage unlocked access to the next by completing its URL. There were no timing requirements for the first four stages since they were released at a predefined time for all participants, although the overall winners were determined by how quickly they could correctly solve the final stage. The top prize was an iPad, and the second- and third-place entrants had their choice of a one-year subscription to Genome Biology or registration to the Beyond The Genome conference. In addition to celebrating the discovery, we hoped to reach out to students and postdocs around the world and motivate them to learn a few new techniques and a few new concepts of molecular biology. This appears to have been quite successful: several students from outside biology participated in the contest.

The stages of the contest were presented in order of complexity: the first could be completed in a few minutes, while the final stage might require several hours. However, all of the stages could be solved using a combination of open-source software, provided one could identify the correct algorithms to use. The contest problems and solution guide are available online at http://genomebiology.com/about/update/DNA60_INTRO and http://genomebiology.com/about/update/DNA60_ANSWERS respectively.

Stage 1: motif finding

The first stage was based on the common bioinformatics problem of motif finding, such as for identifying a transcription factor binding motif or other regulatory element upstream of a set of gene sequences. Finding true biological motifs requires complex learning approaches such as Gibbs sampling to account for the variability that may be present. For the contest, we simplified the problem to identification of a 7 base-pair sequence motif without any variability or errors. As a result, the solution could be computed in a few seconds with any of a number of k-mer counting software packages. Nevertheless, the simplicity of the stage aided in explaining the process of how to use the solution to unlock the subsequent stages, and also made the contest accessible to a very large set of participants.

Solution: TAGCGAC

Recommended algorithm: Jellyfish k-mer counter [1]
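
Because the motif in this stage was exact, the idea behind the k-mer counting approach can be illustrated in a few lines. The following is a minimal sketch, not the contest's reference solution: it assumes the input is a plain list of sequence strings and that the motif is the 7-mer shared by the most sequences. The function name and the toy sequences are hypothetical, and a dedicated counter such as Jellyfish would be the practical choice for large inputs.

```python
from collections import Counter


def most_shared_kmer(sequences, k=7):
    """Return (kmer, n_sequences) for the k-mer found in the most sequences.

    Each k-mer is counted at most once per sequence, so a repeat inside a
    single sequence cannot outvote a motif shared by the whole set.
    """
    presence = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Collect the distinct k-mers of this sequence before counting them.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        presence.update(kmers)
    return presence.most_common(1)[0]


# Toy input (hypothetical): three short sequences sharing one exact 7-mer.
toy_sequences = ["ACGTTAGCGACGGA", "TTTAGCGACCCAT", "GGCATAGCGACTT"]
motif, support = most_shared_kmer(toy_sequences, k=7)
```

Counting presence per sequence rather than total occurrences is a design choice made here for illustration; with error-free data either tally would likely point to the same motif.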

Stage 2: gene finding

The second stage centered on the important problem of computational gene finding. Users were presented with an artificial one megabase-pair microbial genome, and tasked with identifying the open reading frames (ORFs) and analyzing their amino acid sequences. ORFs are regions of a genome stretching from a start codon to a stop codon, free of any in-frame internal stop codons, and represent possible protein-coding genes. While not every ORF in a microbe will be a true gene, the longest ORFs typically are, and thus constitute an effective training set for a gene finder that classifies the other ORFs in an unannotated genome [2]. Once the ORFs were identified, participants were asked to translate their codons into the corresponding amino acid sequences, and then report the 25th amino acid from each of the 15 longest ORFs in sorted order. There are several gene finding and ORF finding programs available that could be used for solving the stage, including EMBOSS [3] and Glimmer [2], although, judging from the questions we received, many participants chose to implement their own, and asked in particular about the processing of overlapping ORFs. Care was taken in designing the problem to ensure that the answer was unambiguous by ensuring the top 15 longest ORFs had distinct lengths. Several participants asked if we had a typo in designing the contest, but they should take note that there is no amino acid with the abbreviation 'O'.

Solution: THESECRETQFLIFE

Recommended algorithm: EMBOSS getorf program [3]
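
For readers who, like many participants, would rather write their own ORF finder, the procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the contest's reference implementation: it assumes ATG is the only start codon, uses the standard genetic code, scans all six reading frames, keeps the longest ORF for each stop codon, and takes 'sorted order' to mean longest first. All function names are hypothetical, and EMBOSS getorf may define ORF boundaries differently.

```python
# Standard genetic code, listed in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS))
STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate an in-frame DNA string; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_orfs(genome):
    """Yield the protein for every ATG-to-stop ORF in all six frames."""
    genome = genome.upper()
    for strand in (genome, reverse_complement(genome)):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    yield translate(strand[start:i])
                    start = None


def residues_from_longest(genome, n_orfs=15, position=25):
    """Join the residue at `position` (1-based) from the n longest ORFs."""
    proteins = sorted(find_orfs(genome), key=len, reverse=True)[:n_orfs]
    return "".join(p[position - 1] for p in proteins if len(p) >= position)
```

On the contest data one would call residues_from_longest(genome) with its defaults; whether that reproduces the published answer depends on the assumptions listed above matching the contest's own definitions.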

Stage 3: RNA-seq expression

For stage three, participants were presented with a pair of simulated RNA-seq experiments from a portion of Escherichia coli, and asked to find the most highly differentially expressed gene. While RNA-seq has the potential to discover new genes and new isoforms, in this stage we provided the annotation for the genome and, because E. coli is a microbe, the data did not include any alternatively spliced genes. As such, identifying the solution was a relatively simple matter of mapping the reads and comparing the mapped read coverage in the two conditions. Curiously, from the access logs it appears at least one person attempted to solve the stage by systematically trying all 93 annotated genes until the correct one was found.

Solution: CARB

Recommended algorithms: Bowtie [4] and SAMtools [5]
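
Once reads have been mapped and counted per gene, the comparison between the two conditions reduces to a small calculation. The sketch below assumes the per-gene read counts have already been obtained (for example from Bowtie alignments summarized with SAMtools) and are supplied as two dictionaries. The gene names, the pseudocount and the use of a library-size-normalized log2 fold change are illustrative assumptions, not a description of how the contest answer was scored.

```python
import math


def most_differential_gene(counts_a, counts_b, pseudocount=1.0):
    """Return (gene, log2_fold_change) with the largest absolute change.

    Counts are scaled by the total reads in each condition so that a
    difference in sequencing depth is not mistaken for expression change.
    The pseudocount avoids division by zero for unexpressed genes.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    best_gene, best_lfc = None, 0.0
    # Sort gene names so that ties are resolved deterministically.
    for gene in sorted(set(counts_a) | set(counts_b)):
        rate_a = (counts_a.get(gene, 0) + pseudocount) / total_a
        rate_b = (counts_b.get(gene, 0) + pseudocount) / total_b
        lfc = math.log2(rate_b / rate_a)
        if abs(lfc) > abs(best_lfc):
            best_gene, best_lfc = gene, lfc
    return best_gene, best_lfc


# Toy counts (hypothetical gene names and values).
condition_1 = {"geneA": 500, "geneB": 480, "geneC": 20}
condition_2 = {"geneA": 400, "geneB": 400, "geneC": 200}
gene, lfc = most_differential_gene(condition_1, condition_2)
```

With only two unreplicated samples, as in this stage, a fold change is about all that can be computed; real experiments with replicates would call for a proper statistical test.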

Stage 4: 16S metagenomics

Stage four simulated a metagenomics experiment, as used to explore the microbial composition at different sites on the human body or in different environments around the world. A reasonable shotgun metagenomics simulation would have required a larger dataset than was desirable for the stage, so we chose to simulate a microbial community profiling experiment using amplified 16S rRNA sequences. We randomly selected 80 or so members of the Helicobacter genus, together with a matched number from other random genera, from the Greengenes [6] database of 16S sequences. We then generated simulated 250 to 400 base-pair reads from the V1-V3 variable regions, with a progressively decreasing number of reads drawn from each species. Sequencing and other characteristic errors found in real 16S experiments were not simulated, after initial evaluations determined they would make the stage difficult to solve unambiguously without a much larger dataset. The resulting dataset was highly enriched for reads from members of Helicobacter, allowing an unambiguous answer to be determined, as we verified using the RDP classifier [7] and CAMERA [8]. In generating this stage's dataset, we found that if we reduced the prevalence of the dominant genus it quickly became difficult for common taxonomic classifiers to yield an unambiguous answer. However, because Helicobacter was so over-represented, the correct answer could easily be guessed just by aligning random reads to an appropriate database.

Solution: HELICOBACTER

Recommended algorithm: CAMERA [8]
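
Whichever classifier is used, the last step of this stage is a simple tally of genus-level assignments. The sketch below assumes each read has already been assigned a genus label by some external tool and that the labels are available as a plain list; the function name and the toy labels are hypothetical, and no particular classifier's output format is implied.

```python
from collections import Counter


def dominant_genus(read_assignments):
    """Return (genus, fraction_of_reads) for the most frequent genus label."""
    if not read_assignments:
        raise ValueError("no read assignments supplied")
    counts = Counter(read_assignments)
    genus, n_reads = counts.most_common(1)[0]
    return genus, n_reads / len(read_assignments)


# Toy labels (hypothetical): one genus clearly over-represented.
toy_labels = ["Helicobacter"] * 6 + ["Escherichia"] * 2 + ["Bacillus"] * 2
top_genus, fraction = dominant_genus(toy_labels)
```

Reporting the fraction alongside the name gives a rough sense of how unambiguous the call is, which is the property that degraded when the dominant genus was made less prevalent.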

Stage 5: decoding the genome

The final stage was to identify a secret message that we had embedded into a genome, and then email us the correct phrase as fast as possible. This simplified the scoring as we had a time-stamped electronic record of the submissions along with the email addresses of the participants. We embedded the secret message using the encoding scheme proposed by Church et al., in which text or images are represented in a binary alphabet expressed in DNA nucleotides [9]. To further complicate the stage, instead of providing the genome with the secret message embedded within it, we simulated the shotgun sequencing of it and presented just the unassembled reads. We expected the participants to then assemble the reads, BLAST the assembly at NCBI to determine the species, align the assembly to the reference, extract the inserted nucleotides, and then decode the message using the included decoder. Alternatively, one could run the decoder script directly on the unassembled reads. The majority of the reads would decode into unintelligible characters, but those with the insertion would decode into recognizable words that could then be assembled into the entire phrase. This approach would be somewhat more complex to implement since most available genome assemblers are specialized for DNA sequences, but has the advantage of skipping the time-consuming steps of assembling and BLASTing to determine the reference. Indeed, the winning entry used this shortcut to outpace the competition.

Solution: 'We went up, saw the structure, we came back to King's and looked at our Pattersons, and every section of our Pattersons we looked at screamed at you, "Double Helix!" And it was just there! - once you knew what to look for. It was amazing.' (a quote from Genome Biology's DNA Day interview with Ray Gosling [10])

Recommended algorithms: ALLPATHS-LG [11], BLAST [12] and MUMmer [13]
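
The shortcut of decoding reads directly can be illustrated with a small sketch. It assumes the one-bit-per-base mapping of the Church et al. scheme (A or C for 0, G or T for 1) and, as a further assumption not taken from the contest materials, that the bits are packed as 8-bit ASCII characters with the most significant bit first. The decoder distributed with the contest may have differed, and all function names here are hypothetical.

```python
BASE_TO_BIT = {"A": "0", "C": "0", "G": "1", "T": "1"}


def encode_text(text):
    """Encode ASCII text as DNA, one bit per base (used here for testing).

    Each bit has two possible bases; alternating between them by position
    is an arbitrary choice that avoids long runs of the same base.
    """
    bits = "".join(format(byte, "08b") for byte in text.encode("ascii"))
    return "".join(("AC" if bit == "0" else "GT")[i % 2]
                   for i, bit in enumerate(bits))


def decode_dna(seq, offset=0):
    """Decode DNA to text, starting at `offset`; bases other than ACGT are skipped."""
    bits = "".join(BASE_TO_BIT[base]
                   for base in seq.upper()[offset:] if base in BASE_TO_BIT)
    # Read complete 8-bit groups only; trailing bits are ignored.
    return "".join(chr(int(bits[i:i + 8], 2))
                   for i in range(0, len(bits) - 7, 8))


def printable_fraction(text):
    """Fraction of characters that are printable ASCII (a crude readability score)."""
    if not text:
        return 0.0
    return sum(32 <= ord(char) < 127 for char in text) / len(text)
```

Because a read can begin at any position within a character, one would decode each read (and its reverse complement) at all eight offsets and keep the decodings with a high printable fraction; the overlapping text fragments can then be pieced together into the full phrase.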

The first correct solution to the final stage was emailed by Sven-Eric Schelhorn of the Max-Planck-Institut für Informatik, Germany, just 19 minutes after the challenge was posted. Second place went to physics undergraduate Kevin Wang of the University of Chicago, USA, just seconds behind, and third place to Gustavo Lacerda of the Campinas State University, Brazil, at 24 minutes. Twenty-four participants emailed the correct solution to the final challenge in the three hours before we announced the winner, hundreds made it to that stage, and nearly 1,000 participants completed at least the first stage of the contest. Participants came from across the entire globe and most were at academic or research institutions.

Interestingly, the number of bioinformatics competitions is on the rise, including the DREAM [14], Assemblathon [15, 16] and Sequence Squeeze [17] contests to name but a few [18]. This rise reflects the increased availability of datasets, the increasing diversity of problems and approaches in the field, and perhaps even the competitive nature of bioinformaticians to strive for the best method to solve a given problem. A well-designed contest provides a unique mechanism for broad evaluation on a level playing field, especially when a well-defined gold standard is available. To that end, here we posed artificial problems with specific correct answers, although the problems had the same form as might be seen with genuine data.

We have organized similar contests to DNA60IFX for the last several years at Beyond The Genome, but this was our first all-electronic contest. Details of the contest were broadcast using Twitter, email and blogs, although it appears most participants learned of the contest over Twitter. Without a physical presence, the contest partially lacked the sense of head-to-head competition that we had seen at Beyond The Genome, but we were able to reach a much broader audience than ever before. Overall, we feel this was a worthwhile trade-off that enabled us to reach our target audience more directly. In addition, Twitter was extremely useful for rapid impromptu discussion between the participants and for clarification of the rules. Given the success of the project, we are already planning the next contests for later this fall and are also considering making the DNA Day challenge an annual event.

See you at Beyond the Genome (http://www.beyond-the-genome.com) on October 1-3 for the next contest!