The emergence of big data, as well as advancements in data science approaches and technology, is providing pharmaceutical companies with an opportunity to gain novel insights that can enhance and accelerate drug development. These developments will increasingly help government health agencies, payers, and providers make decisions about such issues as drug discovery, patient access, and marketing. From our unique vantage points at Genentech, a leading biotechnology company with a major data science practice, and The Data Incubator, a data-science education company that trains and places data scientists, we have seen both how the pharmaceutical industry has leveraged big data for some potentially revolutionary advances and the challenges it has faced along the way.

For the industry, the biggest challenge by far has been talent: upgrading skill sets from those sufficient to analyze relatively small amounts of clinical trial data to those required to gain insights from vast amounts of real-world data, including unstructured data such as physicians’ notes, scans and images, and pathology reports. The pharmaceutical industry has seen an explosion in the amount of available data beyond that collected from traditional, tightly controlled clinical trial environments. To be sure, anonymized insurance-claims data and electronic health record (EHR) data have been accessed and analyzed for many years. But in the past, EHR data was often limited to a single research institution or provider network, and obtaining the data needed to help answer a specific research question usually involved a tedious and inefficient process. While much still needs to be done to create standardized methods for sharing and making sense of anonymized EHR and genomic data across providers, it is now possible to link different data sources, which allows complex research questions to be addressed.

For example, the analysis of comprehensive EHR patient data collected in real time during doctor or hospital visits provides an opportunity to better understand diseases, treatment patterns, and clinical outcomes in an uncontrolled, real-world setting. These valuable insights complement those gained from clinical trials and make it possible to assess a wider spectrum of patients who are traditionally excluded from clinical trials (e.g., elderly, frail, or immobile patients, as well as people with rare indications and diseases not yet studied in clinical trials). Such analysis also allows companies to assess real-world challenges that cannot be observed in a clinical trial, such as drug compliance and the utilization of health care resources.

While these advances are generating great opportunities, they also pose resourcing and capability development challenges. One of the biggest is how to make the transition from legacy technology and analytical competence to more-powerful and sophisticated analytical tools and analysis methodologies.

Historically, the pharmaceutical industry has recruited SAS programmers who have executed well-defined analyses of clinical trials in a standardized, efficient manner. This worked well, given that clinical trials have been designed to answer questions about efficacy and safety with clean data sets in an industry-standard structure with few missing values.

But real-world data comes in a variety of different formats, is often highly unstructured (containing textual and other nonnumeric data), and is rife with missing values. It is messy data, filled with inconsistencies, potential biases, and noise. These attributes force data scientists to find creative ways to answer critical research questions to support drug research and development and ultimately to provide patients with access to the right therapies.

Consequently, there is an emerging need for analysts and data scientists who can take full advantage of tools and techniques developed in Silicon Valley that are capable of handling noisy data and presenting results to stakeholders in a simple, easy-to-interpret way. These analysts must be able to deal with ambiguity and be collaborative, entrepreneurial, and adaptive in their approaches. They must be able to apply “options thinking” to figure out what questions to ask, what data to examine, and what methodologies and technologies to use to answer those questions. They must also have a deep knowledge of the health care system, including its standard practices, in order to understand how the data was originally collected, what biases may exist, and how it can be repurposed to answer clinical research questions.

Genentech, which is owned by F. Hoffmann-La Roche, has been building such capability for two years. In addition to investing in data partnerships and analytics tools, it has built a big-data infrastructure — a platform that can analyze billions of patient records in seconds. It has been aggressively recruiting and developing people with the requisite skills, partnering with universities and firms such as The Data Incubator to recruit and train data scientists, and it now has an entrepreneurial global team of more than 80.

A recent example of the kind of work the team conducts is the creation of a database on a historical cohort of real-world patients previously diagnosed with cancer. The team analyzed these patients' data to understand the outcomes of different patient subtypes and treatment regimens. This helped Genentech learn how different biomarker alterations and different treatment patterns affect clinical outcomes in the real world. This information will ultimately support critical drug development decisions. Genentech is also using real-world data in other therapeutic areas, such as neuroscience, where drug development is notoriously challenging, in order to better understand the variability of disease patterns, progression rates, and treatment responses, and to quantify cost dynamics as diseases advance.

It will not be easy to learn how to tap the full potential of real-world data. But it can be done. The potential to use that data to improve drug discovery and get the right treatments to the right patients at the right time is enormous.