Jacob McPadden, MD
Department of Neonatology
SLIDE 0
Overview
Healthcare Data and Data Sources
Yale/YNHH Data Science Platform
The Data Science Journey: Computational Healthcare Use Case
Question
Data
Architecture
Implementation
SLIDE 1
SLIDE 2
SLIDE 3
SLIDE 4
Step 1: Get Data
Step 2: ?
Step 3: Profit
SLIDE 5
Descriptive
Diagnostic
Predictive
Prescriptive
SLIDE 6
SLIDE 7
Kerberized
SLIDE 8
SLIDE 9
SLIDE 10
Future
Normal RBCs: 95.5% (Low)
Dacrocytes: 2.7% (High)
Target cells: 1.2% (High)
Elliptocytes: 0.5% (Normal)
Schistocytes: 0.1% (Normal)
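A report like the one above could, as a rough sketch, be produced by tallying per-cell classifier labels and comparing each percentage against a reference range. The reference ranges, the `differential` and `format_report` helpers, and the label counts in the usage note are all hypothetical illustrations; none of them come from the slides, and the ranges are not clinical values.

```python
from collections import Counter

# Hypothetical reference ranges (percent of all cells), for illustration only.
# These are NOT clinical values and are not taken from the presentation.
REFERENCE_RANGES = {
    "Normal RBCs": (97.0, 100.0),
    "Dacrocytes": (0.0, 1.0),
    "Target cells": (0.0, 1.0),
    "Elliptocytes": (0.0, 1.0),
    "Schistocytes": (0.0, 0.5),
}


def flag(percent, low, high):
    """Return Low/High/Normal for a percentage against an inclusive range."""
    if percent < low:
        return "Low"
    if percent > high:
        return "High"
    return "Normal"


def differential(labels):
    """Turn a list of per-cell labels into (cell type, percent, flag) rows."""
    counts = Counter(labels)
    total = sum(counts.values())
    report = []
    # most_common() orders cell types from most to least frequent.
    for cell_type, count in counts.most_common():
        percent = round(100.0 * count / total, 1)
        low, high = REFERENCE_RANGES[cell_type]
        report.append((cell_type, percent, flag(percent, low, high)))
    return report


def format_report(report):
    """Render rows in the 'Name: 12.3% (Flag)' style used on the slide."""
    return [f"{name}: {pct}% ({status})" for name, pct, status in report]
```

For example, passing 955 "Normal RBCs" labels, 27 "Dacrocytes", 12 "Target cells", 5 "Elliptocytes" and 1 "Schistocytes" to `differential` and then `format_report` yields lines in the same shape as the slide, under the assumed ranges.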
SLIDE 11
Normal Erythrocyte
SLIDE 12
Ingestion
- HDF, Flume, Storm
Storage
- HDFS, Elastic
Workflow Coordination
- HDF, Oozie, Python
Analysis
- Spark, Python
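The four stages listed above can be pictured as a simple chain. The sketch below is a toy, in-memory stand-in written in plain Python, not the platform's actual implementation: the function names, the `STORE` dictionary, and the byte-count "analysis" are all hypothetical placeholders for the real tools (HDF, Flume, Storm, HDFS, Elastic, Oozie, Spark).

```python
from typing import Dict, List

# In-memory stand-in for the storage layer (HDFS / Elastic on the slide).
STORE: Dict[str, bytes] = {}


def ingest(files: Dict[str, bytes]) -> Dict[str, bytes]:
    """Ingestion stand-in: accept incoming files keyed by name."""
    return dict(files)


def store(records: Dict[str, bytes]) -> List[str]:
    """Storage stand-in: persist records and return their keys."""
    STORE.update(records)
    return sorted(records)


def analyze(keys: List[str]) -> Dict[str, int]:
    """Analysis stand-in: a placeholder metric (payload size in bytes)."""
    return {key: len(STORE[key]) for key in keys}


def run_workflow(files: Dict[str, bytes]) -> Dict[str, int]:
    """Workflow coordination stand-in: run the stages in order."""
    return analyze(store(ingest(files)))
```

Keeping each stage behind its own function mirrors the idea that the tool chosen for one stage can be swapped without touching the others.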
SLIDE 13
HDF / NiFi
Good capacity
Strong real-time streaming architecture
No coding / development needed for this use case
Easier to debug (with data provenance)
Other laboratory data already processed in HDF
Flume
Good capacity, easy to scale
Built-in features should work for file acquisition, but may require custom dev
SLIDE 14
Python
Less scalable
More difficult to integrate existing data
Great ML support with ability to use GPU
Easily tested locally then deployed within our cluster-enabled Docker Swarm
SLIDE 15
SLIDE 16
SLIDE 18
SLIDE 19
SLIDE 20
A Novel Big Data Platform to Manage Genomic Variants in the Clinical Laboratory
SLIDE 21
Big data workflows are complex, but often have reusable architecture
Data scientists should strive to use advanced analytics and deliver high-impact analyses
Try to define questions in advance and follow the scientific process
SLIDE 22
Questions?
Wade L. Schulz, MD, PhD
Department of Laboratory Medicine
wade.schulz@yale.edu
SLIDE 23