— Steve Oudot
The big data era
Key figures:
Data production
Data are produced at an unprecedented rate by:
• Industry / Economy
• Sciences
• End users
Challenges
Big data
(streamed, online, distributed)
Data science’s celebrated successes...
AI for games:
ImageNet Challenge:
[J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei: ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009]
[Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012]
• by now: error rates below 5%, performances better than human on narrow tasks
• narrow tasks: typically one-against-all (e.g. recognizing cats, cars, etc.)
(figure label: performance quality)
• unsupervised pre-training leads to concept learning (e.g. human face, cat face)
• unsupervised pre-training ≃ using auto-encoders as feature generators to be plugged into [...]
[Le et al.: Building high-level features using large scale unsupervised learning, ICML 2012]
Healthcare data:
[Nicolau et al.: Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile
and excellent survival, PNAS 2011]
... and notorious failures
Microsoft’s Tay:
What is data science?
Aim: develop tools to store, manipulate, and analyze data, and to extract knowledge from it
Core topics:
• statistical analysis
• machine learning
• pattern recognition
• data mining
Data?
Datum ≡ observation ≡ “chunk of information”
Vector representation
• coordinate matrix: one row per observation (x1 ··· xn), one column per variable (v1 ··· vd)
Metric representation
• distance / (dis-)similarity matrix: rows and columns both indexed by the observations (x1 ··· xn)
Note: these are the representations of the data taken as input by most [...] algorithms
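The two representations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy observations are hypothetical, and the Euclidean distance is one possible choice of metric, not one prescribed by the slides.

```python
import math

# Hypothetical toy data: n = 3 observations, each described by d = 2 variables.
# Vector representation: row i of the coordinate matrix is observation x_i.
coordinates = [
    [0.0, 0.0],  # x1
    [3.0, 4.0],  # x2
    [6.0, 8.0],  # x3
]

def distance_matrix(points):
    """Return the n x n matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]

# Metric representation: entry (i, j) is the distance between x_i and x_j.
distances = distance_matrix(coordinates)
for row in distances:
    print(row)
```

The coordinate matrix has size n × d while the distance matrix has size n × n, so the metric representation no longer depends on the number of variables.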
Feature extraction: raw data (?) → vectors ∈ R^d
Programming languages for data science
Note: all other modern languages are built on SQL (e.g. QBE is in fact just a front-end [...])
Note: what is taught is: (1) the principles of each approach; (2) how to apply it in Python
[...]
Learning paradigms
Supervised learning
Input: data with labels (examples), e.g. images labeled horse, cat, dog
Goal: predict the labels of new data
Typical problems:
• classification (categorical labels)
• regression (continuous labels)
• forecasting (regression on time series)
Example: energy consumption as a function of weather parameters
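As a rough illustration of the classification setting (labeled examples in, predicted label out), here is a minimal sketch using a 1-nearest-neighbor rule. The rule is chosen here only for brevity and is not named in the slides; the feature vectors and the helper `predict_label` are hypothetical.

```python
import math

# Hypothetical labeled examples: (feature vector, label).
training_set = [
    ((1.0, 1.0), "cat"),
    ((1.2, 0.8), "cat"),
    ((5.0, 5.0), "dog"),
    ((5.5, 4.5), "dog"),
]

def predict_label(examples, query):
    """Return the label of the training example closest to `query`."""
    _, label = min(examples, key=lambda example: math.dist(example[0], query))
    return label

# Predict the labels of two new, unlabeled data points.
print(predict_label(training_set, (0.9, 1.1)))
print(predict_label(training_set, (5.2, 4.9)))
```

Replacing the categorical labels with real numbers and averaging the values of the nearest neighbors would turn the same idea into a regression method.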
Unsupervised learning
Input: data without labels
Typical problems:
• clustering
• dimensionality reduction
• anomaly detection / noise removal
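To make the clustering problem concrete, the sketch below runs a few iterations of k-means (Lloyd's algorithm) on unlabeled points. The choice of k-means, the toy points, and the initial centers are all assumptions made for illustration; the slides do not single out a specific method.

```python
import math

def k_means(points, centers, iterations=5):
    """Run a fixed number of Lloyd iterations from the given initial centers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = []
        for cluster, center in zip(clusters, centers):
            if cluster:
                new_centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(center)
        centers = new_centers
    return centers, clusters

# Hypothetical unlabeled data forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
print(clusters)
```

No labels are used anywhere: the grouping comes only from the distances between the points.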
Reinforcement learning
Input: Markov decision process:
• agent & environment states, vis. rules, actions, transition probabilities, rewards
Typical problems:
• control learning
[Géron 2017]
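The ingredients of a Markov decision process listed above (states, actions, transition probabilities, rewards) can be written down explicitly for a tiny example. The sketch below then computes a control policy by value iteration. Note the assumptions: the two-state MDP is hypothetical, and value iteration is a dynamic-programming method that requires the transition probabilities and rewards to be known, whereas reinforcement learning proper has to estimate a good policy from interaction.

```python
# Hypothetical MDP: transitions[state][action] = [(probability, next_state, reward), ...]
transitions = {
    "low": {
        "wait": [(1.0, "low", 0.0)],
        "charge": [(1.0, "high", -1.0)],
    },
    "high": {
        "wait": [(1.0, "high", 0.0)],
        "work": [(0.8, "high", 2.0), (0.2, "low", 2.0)],
    },
}

def value_iteration(transitions, discount=0.9, sweeps=200):
    """Return approximate optimal state values and a greedy policy."""
    values = {state: 0.0 for state in transitions}

    def action_value(state, action):
        # Expected immediate reward plus discounted value of the next state.
        return sum(p * (r + discount * values[nxt])
                   for p, nxt, r in transitions[state][action])

    for _ in range(sweeps):
        # Synchronous Bellman update: all new values use the previous sweep.
        values = {s: max(action_value(s, a) for a in transitions[s])
                  for s in transitions}

    policy = {s: max(transitions[s], key=lambda a: action_value(s, a))
              for s in transitions}
    return values, policy

values, policy = value_iteration(transitions)
print(values)
print(policy)
```

The resulting policy maps each state to an action, which is the object a control-learning method tries to find.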
(source: NVIDIA)
Course outline
• Lecture 1: introduction to data science / C++ as C (1/2)