Popular open-source platform offers easy access to computational analysis tools.

By Kevin Davies

September 30, 2011 | Enter the term “galaxy” in a Web search engine, Penn State’s Anton Nekrutenko muses, and the top hits are likely to be an astrophysical entity or “a very bad soccer team.” But making fast strides up the web charts is the Galaxy open-source tool, which is coming into its own as more and more researchers seek ways to easily handle and manipulate next-gen sequencing (NGS) and other large datasets.

“Galaxy allows you to do analyses you cannot do anywhere else, without the need to install or download anything,” says Nekrutenko. “We can make genomics better, easier and more efficient. You can analyze multiple sequence alignments, compare genomic annotations, profile metagenomic samples and much, much more.” Galaxy was originally developed by Nekrutenko, who is based in the Center for Comparative Genomics and Bioinformatics at Penn State, and his former Penn State colleague James Taylor, who is now an assistant professor in biology and math & computer science at Emory University. Both are quick to cite the many contributions to Galaxy’s evolution from the genomics community.

Taylor was finishing his Ph.D. when the pair started to develop Galaxy, and they have devoted much of their time to that effort ever since. Nekrutenko is the more biologically inclined of the pair. “I can script a bit,” he says, “but Galaxy could only be developed with proper software engineering practices, which was only possible after James got involved.”

Easy Access

Galaxy is primarily a platform for making computational tools accessible. Nekrutenko and Taylor observed “a huge disconnect” between the tools and algorithms developed by computer scientists on the one hand, and the researchers wanting to use them on the other. Galaxy is designed to fill that gap.

“It was a neat idea to take these existing tools and make a platform to allow how they work to be captured abstractly. Now we provide workflow functionality, provenance capture, and additional values,” says Nekrutenko. “We pushed from just an idea of making things accessible to ensuring that analyses are reproducible.”

Galaxy’s impact on biomedical research is felt in two main areas. First, Galaxy enables biomedical researchers to perform complex compute-intensive analyses without having to install and configure anything. Second, it allows tool developers to deploy their analysis applications without the need to design interfaces or maintain a compute infrastructure.

Most of the Galaxy tools are in the genomics and gene evolution space, but researchers are also adapting the platform to proteomics and other areas. “When this started, NGS didn’t exist,” says Nekrutenko. “Initially, we were addressing problems with whole genome sequence, comparative genomics, etc. In reality, very few people can use this information. We’re still the only resource to meaningfully manipulate genome alignments on a large scale.”

Nekrutenko says there are two kinds of users—biologists and computer scientists—but they are largely disconnected. “Their rewards are different. Biologists are rewarded on publications, computer scientists are rewarded on algorithm citations. Galaxy makes it easy for people like James to plug in tools, and easy for biologists like me to apply these tools on a large scale without needing to babysit or install anything.”

Getting Things Right

“From a software engineering perspective,” Taylor insists, “there’s nothing especially exciting about the Galaxy platform. It’s just that we’ve put a lot of work into getting things right.”

“There’s nothing that impressive about Google and Facebook either,” quips Nekrutenko.

The basic model is a Web-based platform. “We believe that’s really important for collaboration and communication,” says Taylor. “Having no barrier to using Galaxy other than a Web browser is very important.”

The main Galaxy software is open source, so users can download and run it on their own servers. It is very configuration driven, says Taylor. Users can customize it, integrate new tools, and plug into existing compute clusters, storage, and so on.

“If you’re a PI, using our public site for occasional analysis is a good option,” explains Nekrutenko. “If you’re a biotech or institution, you can run on your resources. If you’re concerned about patient data or security, there are many successful examples of [groups using] that.”

But as the complexity of computing and the volumes of data grow, downloading will become a rate-limiting step. “We provide the Web service but it’s a limited resource,” says Nekrutenko. “As compute needs grow, it’s going to be impossible to meet everyone’s needs. People can download on their local resources, or they can easily add Galaxy on Amazon Web Services.” Indeed, once cloud computing becomes cheaper, Nekrutenko says it will become a prominent way to use Galaxy, with considerable appeal for regular data analysts who don’t want to pay for cooling, heating, systems administration, etc.

So far, the response has been positive and some publications are coming out, but it is early days. “Getting data in or out of the cloud is mostly expensive rather than slow,” says Nekrutenko. “But for people who have very intermittent needs in terms of computation, it seems to fit well. It’s worth the extra cost.”

Applications

“We’d built something to handle large amounts of data, so we were very well positioned [for NGS],” adds Taylor. The platform is agnostic to the origin of the NGS data. Galaxy features a link to the venerable UCSC Genome Browser, developed at the University of California, Santa Cruz, more than a decade ago. “We’ve had a long friendly relationship with UCSC, the best source for vertebrate comparative genomics,” says Taylor. “But there’s integration with many other data sources and browsers, including federated sources.”

Nekrutenko cites numerous studies using Galaxy, from RNA-seq and ChIP-seq to genome mapping and annotation. He describes a typical workflow: “Imagine a simple analysis: a researcher wants to do ChIP-seq on 20 tissues or cell lines. They go to the sequencing facility, get the data on a thumb drive, open the files in Microsoft Word, and hit the wall. Fortunately there is a wealth of publication-proven community software for these emerging problems. The hard part here is that such software is often difficult to use for computationally unskilled biologists who feel confused faced with a program that needs to be run from a UNIX shell.

“This has created an opportunity to provide software to enable biologists to perform these analyses, including commercial offerings from providers such as SoftGenetics, CLCbio, and others.” (see, p. 35) However, these platforms often include proprietary algorithms.

Galaxy takes a different approach, making existing open-source tools readily usable for biologists, and providing an open platform that users and the community can extend. In the case of the ChIP-seq example mentioned above, Galaxy provides an open-source set of tools for that analysis. “[Users] need to see quality, how many good reads map against genome, identify enriched regions. So for the initial stages, Galaxy provides tools to manipulate large datasets—millions of reads. Then you can do the QC.”
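To make the "how many good reads map" check concrete, here is a minimal sketch in Python that counts mapped reads in SAM-format alignment text. It is not Galaxy's own code: the function name `mapped_fraction` and the sample records in `EXAMPLE_SAM` are hypothetical, invented purely for illustration. The only convention it relies on is from the SAM format itself, where flag bit 0x4 marks an unmapped read and bits 0x100 and 0x800 mark secondary and supplementary alignments.

```python
def mapped_fraction(sam_lines):
    """Return (mapped, total, fraction) for an iterable of SAM text lines."""
    mapped = 0
    total = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second column is the bitwise FLAG
        # Ignore secondary (0x100) and supplementary (0x800) records so
        # that each primary record is counted only once.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        # Bit 0x4 set means the read is unmapped.
        if not flag & 0x4:
            mapped += 1
    fraction = mapped / total if total else 0.0
    return mapped, total, fraction


# Hypothetical records: two mapped reads, one unmapped read,
# and one secondary alignment that should be ignored.
EXAMPLE_SAM = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read1\t256\tchr2\t300\t0\t4M\t*\t0\t0\tACGT\tIIII",
]

if __name__ == "__main__":
    m, t, f = mapped_fraction(EXAMPLE_SAM)
    print(f"{m} of {t} primary records mapped ({f:.1%})")
```

A real ChIP-seq run involves millions of reads and would normally go through an established tool rather than a hand-written loop. The sketch only shows what the mapping-rate figure measures.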

The broad Galaxy community helps, of course. “If you write a tool, it makes sense to spend 30 minutes and make a Galaxy wrapper and deposit into our tool shed,” says Nekrutenko. The tool shed—equivalent to an app store—currently holds more than 100 tools, and will likely become the central piece of Galaxy.

For example, a group at the University of Maryland has built and deposited a suite of tools for RNA-seq analysis, which Taylor and colleagues have integrated into Galaxy and built workflows around. “If someone was on their own, they’d need to work from a command line; instead, they can use the Galaxy environment.”

“Most studies are completely irreproducible, in NGS and other fields,” says Nekrutenko. “Once you do an analysis in Galaxy, we have a special mechanism called Galaxy Pages. It records everything, so you can extract and share that information, or use as supplementary data. So the reviewers [of a manuscript] can see everything too.”

Future Mission

In January 2011, the main site at Penn State (http://usegalaxy.org) handled more than 150,000 jobs (e.g., executing a tool such as Bowtie or BWA) and accumulated more than 7 terabases of data in user uploads. Nekrutenko says Galaxy usage has been growing by around 5-10% each month for the past couple of years. Galaxy was cited more than 150 times in 2010, a figure that probably understates its true usage.
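Monthly growth of 5-10% compounds quickly. As a back-of-the-envelope sketch, assuming "a couple of years" means 24 months and that the rate held steady throughout (both assumptions, not figures from the Galaxy team), the cumulative growth factor is (1 + rate) raised to the number of months. The helper name `compound_growth` is made up for this illustration.

```python
def compound_growth(monthly_rate, months):
    """Cumulative growth factor after compounding a fixed monthly rate."""
    return (1 + monthly_rate) ** months


if __name__ == "__main__":
    # Assumed horizon: 24 months at a constant rate.
    for rate in (0.05, 0.10):
        factor = compound_growth(rate, 24)
        print(f"{rate:.0%} per month over 24 months: about {factor:.1f}x")
```

Under those assumptions, usage would have grown roughly threefold at the low end of the range and nearly tenfold at the high end.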

When Nekrutenko and Taylor recently asked users for feedback on Galaxy, they were quickly overwhelmed with more than 1,500 responses. “Most of these responses were simply short statements of enthusiastic support, such as ‘Galaxy has become absolutely indispensable for our work,’” says Nekrutenko. Most respondents, including some 50 PIs, were working with NGS datasets. “The Galaxy community is becoming increasingly decentralized, with dozens of local installations used by multiple users, whole institutes or consortia. A number of responses specifically mentioned using Galaxy as the central infrastructure in sequencing cores.”

In the near term, the biggest challenge facing Galaxy’s developers may be its sheer popularity—Galaxy currently attracts some 150,000 jobs/month on the public site. “Remember the UPS commercial?” says Nekrutenko. “A guy makes a new start-up, watches his orders grow online, then it explodes. What do you do now?! That’s what we’re experiencing... Penn State is very good to us—buys us disks—but it’s really not sustainable.”

“Our big thing going forward is to facilitate the growth of the decentralized community,” adds Taylor. “Rather than the public site, we want people to run on their local resources or the cloud, and move workflows around.”

Galaxy's developers are also looking at potential applications in clinical settings. “We’re actively exploring what it would take for Galaxy to become useful in the clinical environment,” says Nekrutenko.