Professional Documents
Culture Documents
Model-centric to
Data-centric AI
Andrew Ng
AI system = Code + Data
(model/algorithm)
Andrew Ng
Inspecting steel sheets for defects
Examples
of defects
Andrew Ng
Improving the code vs. the data
Andrew Ng
Data is Food for AI
~1% of AI research? ~99% of AI research?
80% 20%
PREP ACTION
Source and prepare high quality ingredients Cook a meal
Andrew Ng
Lifecycle of an ML Project
Andrew Ng
Scoping: Speech Recognition
Define project
Andrew Ng
Collect Data: Speech Recognition
Define and
collect data
Andrew Ng
Iguana Detection Example
Labeling instruction:
Andrew Ng
Making data quality
systematic: MLOps
• Ask two independent labelers to label a sample
of images.
• Measure consistency between labelers to
discover where they disagree.
• For classes where the labelers disagree, revise
the labeling instructions until they become
consistent.
Andrew Ng
Labeler consistency example
Steel defect detection (39 classes). Class 23: Foreign particle defect.
Labeler 1
Labeler 2
Andrew Ng
Labeler consistency example
Steel defect detection (39 classes). Class 23: Foreign particle defect.
Labeler 1
Labeler 2
Andrew Ng
Making it systematic: MLOps
Model-centric view Data-centric view
Collect what data you can, The consistency of the data is
and develop a model good paramount. Use tools to
enough to deal with the noise improve the data quality; this
in the data. will allow multiple models to
do well.
Hold the data fixed and Hold the code fixed and
iteratively improve the iteratively improve the
code/model. data.
Andrew Ng
Audience poll: Think about the last supervised learning
model you trained. How many training examples did you
have? Please enter an integer.
Poll results:
Andrew Ng
Kaggle Dataset Size
Andrew Ng
Small Data and Label Consistency
• Small data
• Small data • Big data
• Clean (consistent)
• Noisy labels • Noisy labels
labels
Andrew Ng
Theory: Clean vs. noisy data
You have 500 examples, and 12% of the examples are
noisy (incorrectly or inconsistently labeled).
Andrew Ng
Example: Clean vs. noisy data
0.8
0.7
Clean
0.5
Noisy
0.4
0.3
250 500 750 1000 1250 1500
Number of training examples
Note: Big data problems where there’s a long tail of rare events in the input (web
search, self-driving cars, recommender systems) are also small data problems.
Andrew Ng
Train model: Speech Recognition
Andrew Ng
Train model: Speech Recognition
Andrew Ng
Deploy: Speech Recognition
Andrew Ng
Making it systematic:
The rise of MLOps
AI systems = Code + Data
Software
Creation ML engineers
engineers
Quality/
DevOps MLOps
Infrastructure
Andrew Ng
Making it systematic:
The rise of MLOps
AI systems = Code + Data
Software
Creation ML engineers
engineers
Quality/
DevOps MLOps
Infrastructure
Andrew Ng
Traditional software vs AI software
Traditional software
AI software
Andrew Ng
MLOps: Ensuring consistently high-
quality data
Andrew Ng
Audience poll: Who do you think is best
qualified to take on an MLOps role?
Poll results:
Andrew Ng
From Big Data to Good Data
MLOps’ most important task: Ensure consistently high-quality
data in all phases of the ML project lifecycle.
Andrew Ng
Takeaways: Data-centric AI
MLOps’ most important task is to make high
quality data available through all stages of the
ML project lifecycle.
Model-centric AI Data-centric AI
How can you change the How can you systematically change
model (code) to improve your data (inputs x or labels y) to
performance? improve performance?
Andrew Ng