From Quanta Magazine.

To distill a clear message from growing piles of unruly genomics data, researchers often turn to meta-analysis—a tried-and-true statistical procedure for combining data from multiple studies. But the studies that a meta-analysis might mine for answers can diverge endlessly. Some enroll only men, others only children. Some are done in one country, others across a region like Europe. Some focus on milder forms of a disease, others on more advanced cases. Even if statistical methods can compensate for these kinds of variations, studies rarely use the same protocols and instruments to collect the data, or the same software to analyze it. Researchers performing meta-analyses go to untold lengths trying to clean up the hodgepodge of data to control for these confounding factors.

Purvesh Khatri, a computational immunologist at Stanford University, thinks they’re going about it all wrong. His approach to genomic discovery calls for scouring public repositories for data collected at different hospitals on different populations with different methods—the messier the data, the better. “We start with dirty data,” he says. “If a signal sticks around despite the heterogeneity of the samples, you can bet you’ve actually found something.”

This strategy seems too easy, but in Khatri’s hands, it works. Analyzing troves of public data, Khatri and colleagues have uncovered signature genes that could allow clinicians to detect life-threatening infections that cause sepsis, classify infections as bacterial or viral, and tell if someone has a specific disease such as tuberculosis, dengue or malaria. Last year Khatri and two other scientists launched a company to develop a device for measuring these gene signatures at a patient’s bedside. In short, they’re deciphering the host immune response and turning key genes into diagnostics.

Over the past year Khatri discussed his ideas with Quanta Magazine over the phone, by email and from his whiteboard-lined Stanford office. An edited and condensed version of the conversations follows.

What turned you on to biology?

I left India and came to the U.S. in the “fix the Y2K bug” rush with plans to get a master’s in computer science and become a software engineer. Months after arriving at Wayne State University in Detroit I realized that writing software for the rest of my life was going to be really boring. I joined a lab working on neural networks.

But then my adviser switched to bioinformatics and said he’d pay my tuition if I switched with him. I was a poor Indian grad student. I thought, “You’re going to pay my salary? I’ll do whatever you are doing.” That’s how I moved into biology.

You made a splash pretty quickly. How did that happen?

While my adviser was away on sabbatical in 2000-2001, I worked in the lab doing bioinformatics analyses with a postdoc in our collaborator’s lab, a gynecologist studying genes involved in male fertility. Microarrays for running assays on large numbers of genes at once were brand-new. From a recent experiment, he’d gotten a list of some 3,000 genes of interest, and he was trying to figure out what they were doing.

One day I saw him going from one website to another, copying and pasting text into Excel spreadsheets. I said to him, “You know, I can write software for you that will do all of that automatically. Just tell me what you are doing.” So I wrote a script for him—it took me three days—and with the results we wrote a Lancet paper.

We put the software on the web. There was huge interest. They presented it at some conference, and Pfizer wanted to buy it. I thought, wow, this is such low-hanging fruit. I can be a millionaire soon.

What does the software do?

It takes the set of genes you specify and searches annotation databases to tell you what biological processes and molecular pathways those genes are involved in. If you have a list of 100 genes, it could tell you that 15 are involved in immune response, another 15 are involved in angiogenesis and 50 play a role in glucose metabolism. Let’s say you’re studying Type 1 diabetes. You could look at these results and say, “I’m on the right path.”
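As a rough illustration of the kind of lookup described here (not the Onto-Tools code itself), the sketch below counts how many genes in a list are annotated to each biological process. The gene names, the `TOY_ANNOTATIONS` table and the `summarize_processes` helper are all hypothetical; a real tool would query curated annotation databases rather than a hard-coded dictionary.

```python
from collections import Counter

# Hypothetical toy annotation table: gene symbol -> biological processes.
# A real tool would pull these from curated annotation databases.
TOY_ANNOTATIONS = {
    "GENE_A": ["immune response"],
    "GENE_B": ["immune response", "angiogenesis"],
    "GENE_C": ["glucose metabolism"],
    "GENE_D": ["glucose metabolism"],
}


def summarize_processes(genes, annotations):
    """Count how many of the given genes are annotated to each process."""
    counts = Counter()
    for gene in genes:
        # Genes missing from the table contribute nothing.
        for process in set(annotations.get(gene, [])):
            counts[process] += 1
    return counts


# GENE_X is deliberately absent from the table to show unannotated genes.
gene_list = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_X"]
summary = summarize_processes(gene_list, TOY_ANNOTATIONS)

for process, n in summary.most_common():
    print(f"{n} gene(s): {process}")
```

A gene can belong to more than one process, so the per-process counts may add up to more than the number of genes in the list.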

This was 15 years ago, when I was getting my master’s degree. I developed more tools and expanded the work into a Ph.D. It’s now an open-access, web-based suite of tools called Onto-Tools. Last I checked a few years ago, it had 15,000 users from many countries, analyzing an average of 100 data sets a day.

Although the tools became very popular, they weren’t telling me how the results get used, how they help people. I wanted to see how research progresses from bioinformatics analyses to lab experiments and ultimately to something that could help patients.

How did you make that switch?

When I came to Stanford as a postdoc in 2008, one of my conditions was that somebody with a wet lab—someone running experiments on samples from mice or actual patients, not just analyzing data in silico—would pay half my salary, because I wanted their skin in the game. I wanted to make predictions using methods I’d develop in one lab, and then work with another lab to validate those predictions and tell me what’s clinically important. That’s how I ended up working with Atul Butte, a bioinformatician, and Minnie Sarwal, a renal transplant physician. [Editor’s note: Butte and Sarwal have both since moved from Stanford to the University of California, San Francisco.]

What shifted your attention to immunology?

Reading papers to learn the basic biology of organ transplant rejection, I had an “Aha!” moment. I realized that heart transplant surgeons, kidney transplant surgeons and lung transplant surgeons don’t really talk to each other!

No matter which organ I was reading about, I saw a common theme: The B cells and T cells of the graft recipient’s immune system were attacking the transplant. Yet diagnostic criteria for rejection were different—kidney people follow Banff criteria for renal graft rejection, heart-and-lung people follow ISHLT [International Society for Heart and Lung Transplantation] criteria. If the biological mechanism is common, why are there different diagnostic criteria? That didn’t make sense to me as a computer scientist.

I was starting to form a hypothesis that there must be a common mechanism—some common trigger that tells the recipient’s immune cells that something is “not self.” While thinking about this, I came across a fantastic paper titled “The Immunologic Constant of Rejection.” The authors basically laid out my hypothesis. They proposed that while the triggers for organ rejection may differ, they share a common pathway. And they were saying someone should test this.

What did you do at that point?

I started asking my colleagues, “Why don’t we start collecting samples from various organ transplant cohorts and do the analysis to find out what common genes are involved?” They said you can’t do it because you’d have to account for all the heterogeneity—different organs, different microarray technologies, different treatment protocols. It would be expensive to control for all of that.

Plus, it would take years to get everyone to contribute all those samples. I was in a hurry. So Atul suggested getting ahold of existing public data instead. But these data are “dirty,” as they are confounded by a number of biological and technical factors.

I wondered if we really had to control for heterogeneity. If all this “dirty” data exists, maybe we could just combine it somehow. And if we found a signal, despite the heterogeneity, wouldn’t you then say, oh, that’s what I should be looking at?

I started working on it.

What happened on that first try?

I went to the Gene Expression Omnibus website and downloaded data from several organ transplant studies—heart, kidney, lung, liver. The data came from five hospitals and used at least two different diagnostic criteria. Because we weren’t throwing out “incompatible” data, we set our [allowable] false discovery rate higher than usual (20 percent instead of the usual 5 percent). We were willing to get more false positives if we could find a common mechanism across all the solid-organ transplant rejections. We checked some other things, like making sure one data set wasn’t driving all the results, and did some additional steps to make sure we weren’t just getting a bunch of genes changing. And it worked.
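To make the trade-off concrete, here is a minimal sketch of how a looser false discovery rate lets more genes through. The interview does not say which procedure was used; the Benjamini-Hochberg step-up method below is an assumption chosen because it is a common way to control the false discovery rate, and the p-values are invented toy numbers, not data from the study.

```python
def benjamini_hochberg(p_values, fdr=0.20):
    """Return indices of tests kept at the given false discovery rate.

    Assumed procedure: Benjamini-Hochberg step-up. Sort the p-values,
    find the largest rank k with p_(k) <= (k / m) * fdr, and keep the
    k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Invented toy p-values, one per hypothetical gene.
toy_p = [0.001, 0.03, 0.04, 0.15, 0.60]

lenient = benjamini_hochberg(toy_p, fdr=0.20)  # threshold from the interview
strict = benjamini_hochberg(toy_p, fdr=0.05)   # the more usual threshold

print("kept at 20% FDR:", lenient)
print("kept at 5% FDR:", strict)
```

With these toy numbers, the 20 percent threshold keeps four of the five genes while the 5 percent threshold keeps only one. That is the bargain described above: more candidates survive, at the cost of a larger expected share of false positives among them.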

What do you mean by “worked”?

Using a lot of heterogeneous data, we found a set of 11 genes that were overexpressed in patients who rejected their transplants, and we showed that we could validate that gene signature in other cohorts from different hospitals in different countries. Plus, using this gene set, we could predict—from a biopsy six months after the graft surgery—which patients would experience significant subclinical graft injury (a harder condition to detect than acute rejection) 18 months later. So it was also a prognostic marker.

We confirmed these results in mice. We took a heart from one mouse, put it into another animal, and asked: Do these genes change when we see transplant rejection? The answer was yes.

We then did a Google search to find drugs whose mechanisms suggest they regulate the biological processes of the genes we had found. We chose two FDA-approved drugs to try on our mice. Lo and behold, they worked. Both drugs reduced graft-infiltrating immune cells [a marker for rejection]. They looked as good as a drug we currently give to transplant patients.

One of those two drugs is a statin, a drug widely prescribed to prevent heart disease. I sought help from a former colleague who now works in Belgium and has access to electronic medical records dating back to 1989. I asked him to search the database for patients who got renal transplants and see what drugs they took, when their grafts failed, all of that. He ran the analysis and a week later said to me, “Guess what? If the patients received statins, their graft failure rate was reduced 30 percent.”

Diagnosis, prognosis, therapy and validation of the findings against electronic medical records—all in one paper.