You are on page 1of 40

INF442 — Algorithms for Data Analysis in C++

Introduction to Data Science

— Steve Oudot
The big data era

Key figures:

I size of ‘global data sphere’:


obal2.75
datazBsphere = summation of all created,
predict. captured
(2012) → 33 zB (2018) → 175 zB (2025) or replicated data in the W
(1 zB = 1021 Bytes)
— source: International Data Corporation

2
The big data era

Key figures:

I size of ‘global data sphere’:


obal2.75
datazBsphere = summation of all created,
predict. captured
(2012) → 33 zB (2018) → 175 zB (2025) or replicated data in the W
(1 zB = 1021 Bytes)
— source: International Data Corporation

I correlated with World’s total storage capacity


— data centers and cloud (45% - 55% in 2025)

2
The big data era

Key figures:

I size of ‘global data sphere’:


obal2.75
datazBsphere = summation of all created,
predict. captured
(2012) → 33 zB (2018) → 175 zB (2025) or replicated data in the W
(1 zB = 1021 Bytes)
— source: International Data Corporation

I correlated with World’s total storage capacity


— data centers and cloud (45% - 55% in 2025)

I exponential growth (+ 30% each year)


— expected to be sustained on the long run

2
The big data era

Key figures:

I size of ‘global data sphere’:


obal2.75
datazBsphere = summation of all created,
predict. captured
(2012) → 33 zB (2018) → 175 zB (2025) or replicated data in the W
(1 zB = 1021 Bytes)
— source: International Data Corporation

I correlated with World’s total storage capacity


— data centers and cloud (45% - 55% in 2025)

I exponential growth (+ 30% each year)


— expected to be sustained on the long run

I small fraction of data is processed/analyzed


— shortage of trained data scientists

2
Data production
Data are produced at an unprecedented rate by:
I Industry / Economy
I Sciences

I End users

3
Challenges

Complex data Corrupted data


(non-linear, sparse, (noise, outliers,
high-dimensional) missing values)

Big data
(streamed, online, distributed)

4
Data science’s celebrated successes...

AI for games:

1997: IBM’s Deep Blue wins chess match


against world champion G. Kasparov

2016: DeepMind’s AlphaGo wins Go match


against 18-time world champion Lee Sedol

2019: DeepMind’s AlphaStar beats


Starcraft II professional players
5
Data science’s celebrated successes...

ImageNet Challenge:

I database of 40 · 106 + images, structured in 20 · 103 + categories

I images collected on the Internet

I annotation process crowdsourced


to Amazon Mechanical Turk

[J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei: ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009] 5
Data science’s celebrated successes...

ImageNet Challenge:

I annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

I until 2011, classification error rates around 25%

s network more than 60−→


hasbreakthrough
I 2012: million
deepparameters to tune
CNN (AlexNet) reduced error to 16%

[Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012] 5
Data science’s celebrated successes...

ImageNet Challenge:

I annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

I until 2011, classification error rates around 25%

s network more than 60−→


hasbreakthrough
I 2012: million
deepparameters to tune
CNN (AlexNet) reduced error to 16%

rrow I by now:
tasks: error one-against
typically rates below all
5%, performances
(e.g. recognizingbetter
cats, than
cars, human on narrow tasks
etc.) performance quality

[Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012] 5
Data science’s celebrated successes...

ImageNet Challenge:

I annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

I until 2011, classification error rates around 25%

s network more than 60−→


hasbreakthrough
I 2012: million
deepparameters to tune
CNN (AlexNet) reduced error to 16%

unsupervised pre-training
I unsupervised ' usingleads
pre-training auto-encoders
to conceptaslearning
feature(e.g.
generators
humantoface,
be plugged
cat face)into

this image is the optimal stimulus of the

[Le et al.: Building high-level features using large scale unsupervised learning, ICML 2012] 5
Data science’s celebrated successes...

Healthcare data:

2011: a new subgroup of breast cancers


with excellent survival rates is discovered
using exploratory data analysis techniques

[Nicolau et al.: Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile
and excellent survival, PNAS 2011]
5
... and notorious failures

Microsoft’s Tay:

I AI-powered chat bot, launched on Twitter (@TayandYou) on March 23, 2016

I learned from its interactions with people

6
... and notorious failures

Microsoft’s Tay:

I AI-powered chat bot, launched on Twitter (@TayandYou) on March 23, 2016

I learned from its interactions with people

I shut down only 16 hours after launch

I produced inflammatory, offensive (racist, sexually-charged) tweets

6
... and notorious failures

Microsoft’s Tay:

I AI-powered chat bot, launched on Twitter (@TayandYou) on March 23, 2016

I learned from its interactions with people

I shut down only 16 hours after launch

I produced inflammatory, offensive (racist, sexually-charged) tweets

I training overrun by trolls

I numerous questions raised


(technical, legal, ethical)

6
... and notorious failures

Some notorious AI failures in 2018:

Google Photo’s AI feature confuses skier


and mountain

Amazon’s AI recruiting tool proven to be


gender-based

Uber’s self-driving car kills pedestrian


in Arizona

6
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?
Aim: dev. tools to store, manipulate, analyze / extract knowledge from data

word cloud of paper titles at NIPS 2016


(source: http://www.kaggle.com/benhamner/nips- papers) 7
What is data science?

Core topics:

I statistical analysis

I machine learning

I pattern recognition

I data mining

I optimization (convex / combinatorial)

I database management and distributed systems

I high-performance computing (streaming, distributed, cloud)

7
Data?
Datum ≡ observation ≡ ”chunk of information”

8
Data?
Datum ≡ observation ≡ ”chunk of information”

Vector
e: ici representation
on parle des representations des donnees prise ven
ariables
entree par la plupart des algorithmes de t

{
x1
v1 ··· vd

observations
{
···
xn
coordinate
matrix

categorical variables: 1, 2, · · · , K (arbitrary labels)


(e.g. ”cat”, ”dog”, ”horse”)

continuous variables: real or complex values


(e.g. temperature, pressure, geographic coordinate, income, amplitude/phase)
8
Data?
Datum ≡ observation ≡ ”chunk of information”

Vector
e: ici representation
on parle des representations des donnees prise oen
bservations
entree par la plupart des algorithmes de t

{
Metric representation x1
x1 ··· xn

observations
{
···
xn
distance /
(dis-)similarity
matrix

distances: Euclidean, Hamming, geodesic, diffusion, edit, Jaccard, Wasserstein, etc.

(dis-)similarity measures: cosine, Kullback-Leibler, Bregman divergences, etc.

8
Data?
Datum ≡ observation ≡ ”chunk of information”

Vector
e: ici representation
on parle des representations des donnees prise en entree par la plupart des algorithmes de t

Metric representation

Rd

?
···

8
Data?
Datum ≡ observation ≡ ”chunk of information”

Vector
e: ici representation
on parle des representations des donnees prise en entree par la plupart des algorithmes de t

Metric representation

Rd

?
···

feature extraction

8
Programming languages for data science

I Databases / data manipulation: Structured Query Language (SQL)

te: all other modern languages are built on SQL (e.g. QBE is in fact just a front-e

9
Programming languages for data science

I Databases / data manipulation: Structured Query Language (SQL)

I Data analysis: Python (CS) / R (stats)

e: what is taught is: (1) principles of each approach; (2) how to apply it in Python

9
Programming languages for data science

I Databases / data manipulation: Structured Query Language (SQL)

I Data analysis: Python (CS) / R (stats)

I Effective data processing:


C / C++ / CUDA (GPGPU)

[...]

9
Learning paradigms
horse
Supervised learning
Input: data with labels (examples)
cat dog
Goal: predict the labels of new data
?
Typical problems:
I classification (categorical labels)
I regression (continuous labels)
I forecasting (regression on time series)
energy consumption

? weather parameters
10
Learning paradigms

Unsupervised learning
Input: data without labels

Goal: identify patterns, correlations

Typical problems:
I clustering
I dimensionality reduction
I anomaly detection / noise removal

10
Learning paradigms

Unsupervised learning

Semi-supervised learning (only a fraction of the input data has labels)

Supervised learning
10
Learning paradigms

Reinforcement learning
Input: Markov decision process:
I agent & environment states, vis. rules, actions, transition probabilities, rewards

Goal: find policy that minimizes the regret


es: - it is the (expected loss is
total loss that of measured,
reward compared to the process, hence penalizing every mistake
throughout
optimal strategy)

Typical problems:

I exploration vs. exploitation


(e.g. multi-armed
pically, problems where bandit)
exploration vs. exploitation dilemma appears can be modelled as reinforce

I control learning

10
Learning paradigms

Reinforcement learning
Input: Markov decision process:
I agent & environment states, vis. rules, actions, transition probabilities, rewards

Goal: find policy that minimizes the regret


es: - it is the (expected loss is
total loss that of measured,
reward compared to the process, hence penalizing every mistake
throughout
optimal strategy)

Typical problems:

I exploration vs. exploitation


(e.g. multi-armed
pically, problems where bandit)
exploration vs. exploitation dilemma appears can be modelled as reinforce

I control learning

[Géron 2017] 10
Learning paradigms

(source: NVIDIA)
10
Course outline
• Lecture 1: introduction to data science / C++ as C (1/2)

• Lecture 2: Nearest-Neighbor search / C++ as C (2/2)

• Lecture 3: k-means clustering / classes (1/2)


unsupervised

• Lecture 4: hierarchical clustering / classes (2/2)

• Lecture 5: density estimation and noise removal / inheritance

• Lecture 6: k-NN classifier / genericity


supervised

• Lecture 7: linear models for regression / STL

• Lecture 8: linear models for classification / -

• Lecture 9: artificial neural networks / C++11

• Lecture 10: feature extraction and dimensionality reduction / -


11

You might also like