
Spark and laboratory data

25 July, 2018

At Diaceutics we use a number of tools to leverage our data and generate actionable insights. We mine data to forecast market trends and to understand the changing biomarker landscape: tracking the adoption of new biomarker targets over time, reviewing the global dissemination of new companion diagnostics, and building a picture of the patient journey and related events (the number and timeliness of tests, for example) linked to health outcomes. Lab data supports many such analyses.

A key technology we leverage is Apache Spark®. We use this big data framework to analyze our proprietary laboratory test data for our pharmaceutical clients. Spark is part of our regular processes: it handles weekly uploads to our data warehouse, and it takes just one hour to parse the entirety of our historical data. This is possible through parallel processing on Amazon Web Services (AWS), our cloud hosting provider, which lets us scale data processing up as needed. When we deploy Spark we use just the right number of machines to process the data, while Amazon manages the machines and software. Furthermore, Spark is extensible, with machine learning libraries we can use to mine deeper insights from our laboratory data.
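The parallel-processing pattern described above can be illustrated in miniature with Python's standard library: partition the weekly upload and parse each partition concurrently. The record format and parsing step here are hypothetical, and a thread pool stands in for the cluster of machines Spark would coordinate.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_record(line):
    """Hypothetical parser: split a delimited lab-test record into fields."""
    return line.strip().split("|")

def parse_chunk(lines):
    return [parse_record(line) for line in lines]

# A stand-in for one weekly upload; in production this would be files in cloud storage.
upload = ["p1|EGFR|2018-07-01", "p2|ALK|2018-07-02", "p3|KRAS|2018-07-03"]
partitions = [upload[i::2] for i in range(2)]

# Spark applies the same idea across machines: each partition is parsed
# independently, and the cluster supplies as many workers as the job needs.
with ThreadPoolExecutor(max_workers=2) as pool:
    parsed = [rec for part in pool.map(parse_chunk, partitions) for rec in part]

print(parsed[0])  # ['p1', 'EGFR', '2018-07-01']
```

At cluster scale, Spark replaces the thread pool with executors on many machines and the in-memory list with distributed partitions, but the shape of the job is the same.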

We have used Spark for both analysis and processing. One example of how we leverage its processing power is aggregating all biomarkers tested per patient over their entire history. For analysis, we can take the list of tests performed and calculate the probability that one test will lead to another. We can also take the list of biomarkers and cluster patients according to their treatment histories. Having one place to bulk process records is valuable, as our machine learning algorithms uncover deep insights by reviewing many different slices of the data.
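The two computations mentioned above, per-patient aggregation and test-to-test probabilities, can be sketched in plain Python. The patient IDs and test names are invented for illustration; in Spark this would be a groupBy over distributed records rather than an in-memory loop.

```python
from collections import defaultdict

# Hypothetical test-event records: (patient_id, test), ordered by date.
events = [
    ("p1", "IHC"), ("p1", "FISH"),
    ("p2", "IHC"), ("p2", "NGS"),
    ("p3", "IHC"), ("p3", "FISH"),
]

# Aggregate every test per patient over their full history
# (in Spark: df.groupBy("patient_id") with a collect_list aggregation).
history = defaultdict(list)
for patient, test in events:
    history[patient].append(test)

# Estimate how likely one test is to lead to another by counting
# consecutive pairs across all patient histories.
pair_counts = defaultdict(int)
first_counts = defaultdict(int)
for tests in history.values():
    for a, b in zip(tests, tests[1:]):
        pair_counts[(a, b)] += 1
        first_counts[a] += 1

transition = {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}
print(transition[("IHC", "FISH")])  # 2 of the 3 IHC tests are followed by FISH
```

The per-patient test lists produced by the first step are also the natural input for clustering patients by treatment history.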

Spark can also be used to correct anomalous data and fill gaps. Manually entered or missing values can be cleaned or given context. For example, the body site of a biopsy is often typed in by a physician or pathologist, and spelling errors can prevent accurate tracking of a sample's origin. When information is missing, Spark allows us to cross-reference the body site recorded for one test event against the entire history of that sample, letting us infer where in the body the sample originated.
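A minimal sketch of that cross-referencing step, with invented sample IDs and body sites: normalise the free-text entries, then fill each gap with the most common site seen elsewhere in the same sample's history. In Spark this would be a groupBy on the sample ID joined back onto the event table.

```python
from collections import Counter, defaultdict

# Hypothetical test events: (sample_id, body_site); None marks a missing
# entry, and free-text spellings vary in case and whitespace.
records = [
    ("s1", "Lung"), ("s1", None), ("s1", "lung "),
    ("s2", "Liver"), ("s2", None),
]

def canonical(site):
    """Normalise a free-text body site (trim whitespace, lowercase)."""
    return site.strip().lower() if site else None

# Collect every cleaned body site observed for each sample.
sites = defaultdict(Counter)
for sample, site in records:
    cleaned = canonical(site)
    if cleaned:
        sites[sample][cleaned] += 1

# Fill missing entries from the most common site in the sample's history.
filled = [
    (sample, canonical(site) or sites[sample].most_common(1)[0][0])
    for sample, site in records
]
print(filled)
```

A real pipeline would also need a fuzzy-matching step for genuine misspellings; exact normalisation is shown here to keep the sketch short.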

The possibilities for Spark are broad, and many large companies use it regularly, including Alibaba, Amazon, Autodesk, Tencent and TripAdvisor. TripAdvisor, for example, processes every review ever added to its site with Spark, applying natural language processing to make the content more useful.

Spark was open-sourced in 2010 and its use in the health sciences continues to grow. At Diaceutics, we are continually looking for new ways to leverage the latest technologies to actively break down barriers to deliver better testing, and therefore better treatment, for patients.

#ApacheSpark #Labdata #Diaceutics #MachineLearning #BigData  


About Diaceutics

At Diaceutics we believe that every patient should get the precision medicine they deserve. We are a data analytics and end-to-end services provider enabled by DXRX - the world’s first Network solution for the development and commercialization of precision medicine diagnostics. 

Diaceutics has worked on every precision medicine brought to market and provides services to 36 of the world’s leading pharmaceutical companies. We have built the world’s largest repository of diagnostic testing data with a growing network of 2500 labs in 51 countries.

Public Relations & Investor Relations advisers

Alma PR
71-73 Carter Lane

Tel: +44 (0)20 3405 0205 or [email protected]

Caroline Forde
Robyn Fisher
Kieran Breheny