Powering Personalized Medicine with Hadoop and Apache Spark

November 23, 2015



The Obama administration plans to dedicate $215 million in the 2016 budget to the “Precision Medicine Initiative.” This project centers on the collection and analysis of data from a projected one million volunteers. Funding includes $130 million for the National Institutes of Health (NIH) to collect the data for analysis; $10 million for the U.S. Food and Drug Administration (FDA) to develop databases; and $5 million to protect the privacy of the collected data.

Personalized Medicine Data Analytics

Under the 2016 budget, the National Cancer Institute (NCI) will receive $70 million for a pilot project that will investigate precision treatments for cancer. In this clinical trial, patients will receive targeted treatments based on the genetic abnormalities in their tumors, regardless of the type of cancer they have.

Personalized medicine is well underway in cancer treatment programs. Genetic profiling of tumors is being used to develop treatment plans that can best attack the cancer while also reducing the severity of treatment side effects. The National Cancer Institute, noting that “cancer is a disease of the genome,” states that oncology is already well along the path to precision medicine, with “many genetically targeted therapies currently available to cancer patients, and many more are expected to become available in the near future.”

Personalized Medicine is also showing promise in cardiovascular and infectious disease treatment and diagnostics.

Personalized Medicine in the Marketplace

A new research report by Kelly Scientific Publications puts the Personalized Medicine marketplace at $60 billion by 2019, up from its current worth of $42 billion. The market includes companion diagnostics and targeted therapeutics.

The research shows that personalized therapeutics “will be more specific and effective thereby giving pharma/biotech companies a significant advantage to recuperate R&D costs. Personalized medicine will reduce the frequency of adverse drug reactions and therefore have a dramatic impact on health economics. Developmental and diagnostic companies will benefit from lower discovery and commercialization costs and more specific market subtypes.”

Hadoop’s Role in Personalized Medicine

The amount of data generated by diagnostic testing is enormous. In 2011, the data held by the U.S. healthcare system alone reached 150 exabytes (150 billion gigabytes), and IDC predicts that U.S. healthcare data will grow to 2,314 exabytes by 2020.
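As a rough check on these figures, the short Python sketch below converts exabytes to gigabytes and computes the annual growth rate implied by moving from 150 exabytes in 2011 to 2,314 exabytes in 2020. It assumes decimal units (1 exabyte = 1 billion gigabytes) and treats the two numbers as directly comparable, which may not hold if they were measured differently. The function and variable names are ours, chosen for illustration.

```python
GB_PER_EB = 10**9  # decimal (SI) units: 1 EB = 10^18 bytes, 1 GB = 10^9 bytes


def exabytes_to_gigabytes(exabytes):
    """Convert exabytes to gigabytes using decimal units."""
    return exabytes * GB_PER_EB


def implied_annual_growth(start_value, end_value, years):
    """Compound annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the article; assumed here to be directly comparable.
gigabytes_2011 = exabytes_to_gigabytes(150)
growth = implied_annual_growth(150, 2314, 2020 - 2011)

print(f"{gigabytes_2011:,} GB in 2011")
print(f"implied growth: {growth:.1%} per year")
```

Under those assumptions, the implied growth works out to something on the order of 35 percent a year, compounding for nine years.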

The cost of storing, securing, and making this data accessible to analysts is a central issue in the development of Personalized Medicine. Much of the data is semi-structured or unstructured, and a traditional data warehouse built on a relational database is poorly suited to huge, mixed-format data sets. Apache Hadoop’s HDFS storage and MapReduce computation system enable complex analytics across large sets of structured, semi-structured, and unstructured data. Apache Spark, a data processing engine that can run on Hadoop clusters and read from HDFS, speeds up that analysis and makes it easier to extract intelligence quickly from huge, dissimilar data sets.
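To make the MapReduce model concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop or Spark code; it only imitates the map, shuffle, and reduce steps that those frameworks distribute across a cluster. The record layout, patient IDs, and mutation lists are made up for illustration, and one record deliberately lacks the `mutations` field to show how semi-structured input can be tolerated.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical semi-structured records, one per tumor sample.
records = [
    {"patient": "p1", "mutations": ["BRAF", "TP53"]},
    {"patient": "p2", "mutations": ["TP53"]},
    {"patient": "p3", "notes": "no sequencing yet"},  # field missing
    {"patient": "p4", "mutations": ["KRAS", "TP53"]},
]


def map_phase(record):
    """Emit a (gene, 1) pair per mutation; records without the field emit nothing."""
    return [(gene, 1) for gene in record.get("mutations", [])]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def count_mutations(all_records):
    pairs = chain.from_iterable(map_phase(r) for r in all_records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


mutation_counts = count_mutations(records)
print(mutation_counts)
```

Because the map step looks at each record on its own and the reduce step looks at each key on its own, a real cluster can spread both steps over many machines.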

Supporting projects developed by Hadoop vendors and community members extend Hadoop’s Big Data analytics capabilities, making it easier to pull actionable insights from clinical, non-clinical, and genomic information. Hadoop already provides cost-effective storage and processing of vast data volumes on commodity servers. Teamed with Apache Spark, which supports rapid, large-scale data processing, Hadoop is well positioned to power Personalized Medicine.

Sparking Fresh Potential for Hadoop

Apache Spark’s speed and simplicity are precisely the attributes that are needed in medical diagnostics. Spark is a general-purpose engine for large-scale data processing that offers distributed, in-memory data processing speed at scale.

Apache Spark is written in Scala, a language that combines object-oriented and functional programming. Its concise APIs for Scala, Java, and Python support rapid application development, and code can be reused across batch, interactive, and streaming applications.

The Spark stack includes:

  • Spark SQL, which allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark. Supported data sources include Parquet files, JSON datasets, and Apache Hive data stores.
  • Spark Streaming, which provides fault-tolerant stream processing of live data streams from many sources.
  • MLlib (Machine Learning Library), a library of Machine Learning algorithms.
  • GraphX for graph-analytics of data sets from sources such as social networks.

But it is the ability of Spark’s base platform to deliver real-time analytics across diverse datasets that will most directly speed the development of Personalized Medicine. Clinicians agree that a critical aspect of targeting therapies to patients will be combining data collected from the patient via biosensors (“smart” patches, mobile technology, or fixed devices) with diagnostic data sets, which are used to assess the likely outcome indicated by the patient’s data, and analyzing the combined data in real time.

The resulting data about the patient’s condition can then be monitored in real time by caregivers or healthcare providers, who can make fast decisions to address any developing problems. The hope is that rapid response will improve patient outcomes while also reducing medical care expenditures, which grow as disease progresses.
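The monitoring idea described above can be sketched in a few lines of plain Python. This is only an illustration of windowed alerting over a stream of sensor readings, not Spark Streaming code and not a clinical algorithm. The window size, the threshold, the patient IDs, and the readings are all assumptions made for the example.

```python
from collections import defaultdict, deque

WINDOW_SIZE = 3          # readings kept per patient (assumed)
ALERT_THRESHOLD = 120.0  # illustrative value only, not clinical guidance


def monitor(readings, window_size=WINDOW_SIZE, threshold=ALERT_THRESHOLD):
    """Yield (patient_id, window_mean) each time a full window's mean exceeds the threshold."""
    windows = defaultdict(lambda: deque(maxlen=window_size))
    for patient_id, value in readings:
        window = windows[patient_id]
        window.append(value)  # the oldest reading drops out once the window is full
        if len(window) == window_size:
            mean = sum(window) / window_size
            if mean > threshold:
                yield patient_id, mean


# Hypothetical interleaved readings from two patients' sensors.
stream = [
    ("p1", 80), ("p2", 118), ("p1", 82), ("p2", 121),
    ("p1", 85), ("p2", 125), ("p2", 130),
]

alerts = list(monitor(stream))
for patient_id, mean in alerts:
    print(f"alert for {patient_id}: window mean {mean:.1f}")
```

Averaging over a small window rather than reacting to single readings is one simple way to avoid alerting on a momentary spike. A production system would also need to handle late or missing data, which this sketch ignores.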