2016-12 Hortonworks Road Show - From Acquisition To Insights

From Acquisition to Insight
A Data Science Platform for Prescriptive Healthcare Analytics
Wade L. Schulz, MD, PhD

Department of Laboratory Medicine
Jacob McPadden, MD
Department of Neonatology
Thomas J.S. Durant, MD, MPT

SLIDE 0
Overview
Healthcare Data and Data Sources
Yale/YNHH Data Science Platform
The Data Science Journey: Computational Healthcare Use Case
Question
Data
Architecture
Implementation
Other Use Cases in Healthcare
SLIDE 1
Healthcare (Big?) Data
SLIDE 2
Healthcare (Big?) Data
SLIDE 3
The Dream for Big Data in Healthcare
SLIDE 4
Data Science Journey
Step 1:
Get Data
Step 2: ?
Step 3:
Profit
SLIDE 5
Step 2 (Or Better Yet, Step 1)

Define the question / hypothesis
Assess the data source, clean as necessary
Iteratively test the experiment and implementation
Descriptive
Diagnostic
Predictive
Prescriptive
SLIDE 6
Clinical Laboratory Use Case

Advanced Analytics for Laboratory Diagnostics
~10 million laboratory tests
~46 million results per year
Significant number require physician interpretation
SLIDE 7
The Data Science Platform
Kerberized
SLIDE 8
Step 1: Problem Definition

Can we use big data and machine learning to improve our diagnostic reproducibility and throughput for laboratory testing?
Hypothesis: The use of an ML algorithm and a real-time data processing pipeline will improve our turnaround time and interpretive reproducibility. Quantitative results will improve diagnostic accuracy.
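One way to make "interpretive reproducibility" measurable is an inter-rater agreement statistic. The sketch below computes Cohen's kappa for two readers who label the same set of smears. The slides do not say which metric was used, so the choice of kappa, the function name, and the example labels are all assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items.

    Illustrative metric only; the slides do not name a specific
    reproducibility measure.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical interpretations of four smears by two readers.
    reader_1 = ["normal", "normal", "abnormal", "abnormal"]
    reader_2 = ["normal", "abnormal", "abnormal", "abnormal"]
    print(cohens_kappa(reader_1, reader_2))
```

Comparing kappa between pairs of human readers with kappa between repeated runs of an automated classifier would be one possible way to test the reproducibility half of the hypothesis.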
SLIDE 9
Step 2: Data Acquisition
SLIDE 10
Peripheral Blood Smear

Current
Frequent dacrocytes and occasional target cells, which can be
seen in iron deficiency anemia. Platelets and leukocytes
unremarkable.
Future
Normal RBCs: 95.5% (Low)
Dacrocytes: 2.7% (High)
Target cells: 1.2% (High)
Elliptocytes: 0.5% (Normal)
Schistocytes: 0.1% (Normal)
Based on red cell indicators and integrated laboratory studies, patient has likely diagnosis of iron deficiency anemia, recommend iron replacement therapy
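The quantitative "Future" report above can be sketched as a small function that turns per-cell classifier labels into percentages with Low/High/Normal flags. The reference limits, function names, and cell counts below are hypothetical and chosen only so the output reproduces the slide's example; they are not clinical values and are not taken from the talk.

```python
from collections import Counter

# Hypothetical reference limits (percent of red cells), for illustration only.
REFERENCE_RANGES = {
    "Normal RBCs": (97.0, 100.0),
    "Dacrocytes": (0.0, 1.0),
    "Target cells": (0.0, 1.0),
    "Elliptocytes": (0.0, 1.0),
    "Schistocytes": (0.0, 0.5),
}


def flag(value, low, high):
    """Return Low, High, or Normal for a value against a reference range."""
    if value < low:
        return "Low"
    if value > high:
        return "High"
    return "Normal"


def differential_report(labels):
    """Build report lines from a list of per-cell classification labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no classified cells")

    lines = []
    for cell_type, (low, high) in REFERENCE_RANGES.items():
        pct = 100.0 * counts.get(cell_type, 0) / total
        lines.append(f"{cell_type}: {pct:.1f}% ({flag(pct, low, high)})")
    return lines


if __name__ == "__main__":
    # Hypothetical classifier output for 1,000 cells.
    cells = (
        ["Normal RBCs"] * 955
        + ["Dacrocytes"] * 27
        + ["Target cells"] * 12
        + ["Elliptocytes"] * 5
        + ["Schistocytes"] * 1
    )
    for line in differential_report(cells):
        print(line)
```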
SLIDE 11
Step 3: Advanced Analytics (Machine Learning!)
Normal Erythrocyte
SLIDE 12
Step 4: Architect the Pipeline
Ingestion
- HDF, Flume, Storm
Storage
- HDFS, Elastic
Workflow Coordination
- HDF, Oozie, Python
Analysis
- Spark, Python
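The separation of stages above can be sketched in plain Python. Each function is only a stand-in for the corresponding layer (ingestion, storage, analysis), with a small coordinating function in place of a workflow tool. None of this uses HDF, HDFS, Oozie, or Spark APIs; the record layout and all names are assumptions for illustration.

```python
import json


def ingest(raw_messages):
    """Ingestion stand-in: parse JSON messages, skipping malformed ones."""
    for raw in raw_messages:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def store(records, storage):
    """Storage stand-in: persist each record, then pass it downstream."""
    for record in records:
        storage.append(record)
        yield record


def analyze(records):
    """Analysis stand-in: count results per test name."""
    counts = {}
    for record in records:
        test = record.get("test", "unknown")
        counts[test] = counts.get(test, 0) + 1
    return counts


def run_pipeline(raw_messages):
    """Coordination stand-in: wire the stages together in order."""
    storage = []
    summary = analyze(store(ingest(raw_messages), storage))
    return storage, summary


if __name__ == "__main__":
    # Hypothetical incoming messages, one of them malformed.
    messages = ['{"test": "CBC"}', "not json", '{"test": "CBC"}', '{"test": "BMP"}']
    stored, summary = run_pipeline(messages)
    print(len(stored), summary)
```

Keeping each stage behind its own function mirrors the point of the slide: the ingestion, storage, and analysis technologies can be assessed and swapped independently.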
SLIDE 13
Data Acquisition Technology Assessment

Storm
Great capacity, scalability
Strong real-time streaming architecture
Requires Java development
Often difficult to debug/test
HDF / NiFi
Good capacity
Strong real-time streaming architecture
No coding / development needed for this use case
Easier to debug (with data provenance)
Other laboratory data already processed in HDF
Flume
Good capacity, easy to scale
Built-in features should work for file acquisition, but may require custom dev
SLIDE 14
Data Analysis Technology Assessment

Spark
Scalability
Easy to develop / test
Great integration with existing data sets
Less mature ML frameworks
Minimal GPU support
Python
Less scalable
More difficult to integrate existing data
Great ML support with ability to use GPU
Easily tested locally then deployed within our cluster-enabled Docker Swarm
SLIDE 15
The Tech Stack
SLIDE 16
Step 5: (Data) Science!
Data from: Durant, Olsen, and Torres 2016

SLIDE 17
Conclusions: Advanced Healthcare Analytics

Machine learning can do things!
Other frameworks from the data science toolbox can be integrated with the Hadoop ecosystem, even in a Kerberized environment
Hortonworks DataFlow / NiFi provides an easy-to-use data ingestion / data pipeline system
SLIDE 18
Other Healthcare Use Cases: Patient Monitoring
SLIDE 19
Other Healthcare Use Cases: Real-Time Laboratory BI
SLIDE 20
Other Healthcare Use Cases: Precision Medicine
A Novel Big Data Platform to Manage Genomic Variants in the Clinical Laboratory
SLIDE 21
Take Home Messages

The Hadoop ecosystem provides a robust framework for the ingest, management, and analysis of healthcare data
Actionable, clinical insights from big data are becoming a reality in healthcare
Healthcare data is not always big in volume, but it often involves complex data types and ontologies
Big data workflows are complex, but often have reusable architecture
Data scientists should strive to use advanced analytics and deliver high-impact analyses
Try to define questions in advance, use the scientific process
SLIDE 22
Questions?
Wade L. Schulz, MD, PhD
wade.schulz@yale.edu
SLIDE 23
