
From Acquisition to Insight

A Data Science Platform for Prescriptive Healthcare Analytics

Wade L. Schulz, MD, PhD


Department of Laboratory Medicine

Jacob McPadden, MD
Department of Neonatology

Thomas J.S. Durant, MD, MPT


Department of Laboratory Medicine

SLIDE 0

Overview
Healthcare Data and Data Sources
Yale/YNHH Data Science Platform
The Data Science Journey: Computational Healthcare Use Case

Question
Data
Architecture
Implementation

Other Use Cases in Healthcare

SLIDE 1

Healthcare (Big?) Data

SLIDE 2

Healthcare (Big?) Data

SLIDE 3

The Dream for Big Data in Healthcare

SLIDE 4

Data Science Journey

Step 1: Get Data
Step 2: ?
Step 3: Profit

SLIDE 5

Step 2 (Or Better Yet, Step 1)


Define the question / hypothesis
Assess the data source, clean as necessary
Iteratively test the experiment and implementation

Descriptive

Diagnostic

Predictive

Prescriptive

SLIDE 6

Clinical Laboratory Use Case


Advanced Analytics for Laboratory Diagnostics
~10 million laboratory tests
~46 million results per year
Significant number require physician interpretation
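
As a rough sense of scale, the yearly result volume above can be converted to average rates. This is a back-of-the-envelope sketch only: it assumes results arrive uniformly over a 365-day year, which real laboratory workloads do not, so peak rates will be higher. The function name is illustrative.

```python
# "~46 million results per year" is taken from the slide; everything else
# here is an assumption made for illustration (uniform arrival, 365 days).
RESULTS_PER_YEAR = 46_000_000
DAYS_PER_YEAR = 365
SECONDS_PER_DAY = 24 * 60 * 60


def average_rates(results_per_year):
    """Return (results per day, results per second) as simple averages."""
    per_day = results_per_year / DAYS_PER_YEAR
    per_second = per_day / SECONDS_PER_DAY
    return per_day, per_second


per_day, per_second = average_rates(RESULTS_PER_YEAR)
print(f"{per_day:,.0f} results/day, {per_second:.2f} results/second on average")
```

That works out to roughly 126,000 results per day, or about 1.5 per second on average.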

SLIDE 7

Kerberized

The Data Science Platform

SLIDE 8

Step 1: Problem Definition


Can we use big data and machine learning to improve our diagnostic reproducibility and
throughput for laboratory testing?
Hypothesis: The use of an ML algorithm and a real-time data processing pipeline will improve
our turnaround time and interpretive reproducibility. Quantitative results will improve
diagnostic accuracy.

SLIDE 9

Step 2: Data Acquisition

SLIDE 10

Peripheral Blood Smear


Current
Frequent dacrocytes and occasional target cells, which can be
seen in iron deficiency anemia. Platelets and leukocytes
unremarkable.

Future
Normal RBCs: 95.5% (Low)
Dacrocytes: 2.7% (High)
Target cells: 1.2% (High)
Elliptocytes: 0.5% (Normal)
Schistocytes: 0.1% (Normal)

Based on red cell indicators and integrated laboratory studies, patient has likely
diagnosis of iron deficiency anemia, recommend iron replacement therapy
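
A quantitative report like the "Future" example could, in principle, be produced by tallying per-cell classifier labels into percentages and flagging each against a reference limit. The following is a minimal sketch under stated assumptions: the label names, the function name, and especially the reference limits are hypothetical placeholders chosen for illustration, not clinical values and not the authors' method.

```python
from collections import Counter

# HYPOTHETICAL limits (percent of red cells), for illustration only.
# These are NOT clinical reference ranges.
ASSUMED_UPPER_LIMIT_PCT = {
    "dacrocyte": 1.0,
    "target_cell": 1.0,
    "elliptocyte": 1.0,
    "schistocyte": 1.0,
}
ASSUMED_NORMAL_LOWER_LIMIT_PCT = 97.0


def summarize_morphology(labels):
    """Turn a list of per-cell labels into {cell_type: (percent, flag)}."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no cells were classified")

    report = {}
    for cell_type, n in counts.items():
        pct = round(100.0 * n / total, 1)
        if cell_type == "normal":
            # Normal cells are flagged when they fall BELOW the assumed limit.
            flag = "Low" if pct < ASSUMED_NORMAL_LOWER_LIMIT_PCT else "Normal"
        else:
            # Abnormal shapes are flagged when they EXCEED the assumed limit.
            limit = ASSUMED_UPPER_LIMIT_PCT.get(cell_type)
            flag = "High" if limit is not None and pct > limit else "Normal"
        report[cell_type] = (pct, flag)
    return report


# Made-up counts for 1,000 classified cells.
example_labels = (
    ["normal"] * 955
    + ["dacrocyte"] * 27
    + ["target_cell"] * 12
    + ["elliptocyte"] * 5
    + ["schistocyte"] * 1
)
for cell_type, (pct, flag) in summarize_morphology(example_labels).items():
    print(f"{cell_type}: {pct}% ({flag})")
```

The example counts are invented so that the output mirrors the percentages shown on the slide; a real system would take the labels from the image classifier.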

SLIDE 11

Step 3: Advanced Analytics (Machine Learning!)

Normal Erythrocyte

SLIDE 12

Step 4: Architect the Pipeline

Ingestion
- HDF, Flume, Storm

Storage
- HDFS, Elastic

Workflow Coordination
- HDF, Oozie, Python

Analysis
- Spark, Python
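
The four stages above can be pictured as a simple chain of ingestion, storage, coordination, and analysis. The sketch below is purely illustrative and uses in-memory stand-ins in plain Python: every class and function name is hypothetical, the `test_code,value` input format is an assumption, and no HDF/NiFi, HDFS, Oozie, or Spark APIs are called.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the storage layer."""
    records: list = field(default_factory=list)

    def write(self, record):
        self.records.append(record)


def ingest(raw_lines):
    """Stand-in for ingestion: parse 'test_code,value' lines, skip bad ones."""
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # malformed line
        code, raw_value = parts
        try:
            value = float(raw_value)
        except ValueError:
            continue  # non-numeric result
        yield {"test": code, "value": value}


def analyze(records):
    """Stand-in for analysis: mean value per test code."""
    totals = {}
    for r in records:
        running_sum, n = totals.get(r["test"], (0.0, 0))
        totals[r["test"]] = (running_sum + r["value"], n + 1)
    return {test: s / n for test, (s, n) in totals.items()}


def run_pipeline(raw_lines, store):
    """Stand-in for workflow coordination: run the stages in order."""
    for record in ingest(raw_lines):
        store.write(record)
    return analyze(store.records)


store = InMemoryStore()
summary = run_pipeline(["HGB,10.0", "HGB,12.0", "bad line", "PLT,250"], store)
print(summary)
```

Keeping each stage behind its own function is what lets the technologies on the slide be compared and swapped stage by stage.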

SLIDE 13

Data Acquisition Technology Assessment


Storm
Great capacity, scalability
Strong real-time streaming architecture
Requires Java development
Often difficult to debug/test

HDF / NiFi
Good capacity
Strong real-time streaming architecture
No coding / development needed for this use case
Easier to debug (with data provenance)
Other laboratory data already processed in HDF

Flume
Good capacity, easy to scale
Built-in features should work for file acquisition, but may require custom dev

SLIDE 14

Data Analysis Technology Assessment


Spark
Scalability
Easy to develop / test
Great integration with existing data sets
Less mature ML frameworks
Minimal GPU support

Python
Less scalable
More difficult to integrate existing data
Great ML support with ability to use GPU
Easily tested locally then deployed within our cluster-enabled Docker Swarm

SLIDE 15

The Tech Stack

SLIDE 16

Step 5: (Data) Science!

Data from: Durant, Olsen, and Torres 2016


SLIDE 17

Conclusions: Advanced Healthcare Analytics


Machine learning can do things!
Other frameworks from the data science toolbox can be integrated with the Hadoop
ecosystem, even in a Kerberized environment
Hortonworks DataFlow / NiFi provides an easy-to-use data ingestion / data pipeline system

SLIDE 18

Other Healthcare Use Cases: Patient Monitoring

SLIDE 19

Other Healthcare Use Cases: Real-Time Laboratory BI

SLIDE 20

Other Healthcare Use Cases: Precision Medicine

A Novel Big Data Platform to Manage Genomic Variants in the Clinical Laboratory

SLIDE 21

Take Home Messages


The Hadoop ecosystem provides a robust framework for the ingest, management, and analysis
of healthcare data
Actionable, clinical insights from big data are becoming a reality in healthcare
Healthcare data is not always big in volume, but often involves complex data types and
ontologies
Big data workflows are complex, but often have reusable architecture
Data scientists should strive to use advanced analytics and deliver high-impact analyses
Try to define questions in advance and use the scientific process

SLIDE 22

Questions?
Wade L. Schulz, MD, PhD
Department of Laboratory Medicine
wade.schulz@yale.edu

SLIDE 23
