INF442 — Algorithms for Data Analysis in C++
Introduction to Data Science
— Steve Oudot
The big data era
Key figures:
I size of ‘global data sphere’: 2.75 zB (2012) → 33 zB (2018) → 175 zB (2025, predicted)

(1 zB = 10^21 bytes)
global data sphere = summation of all created, captured or replicated data in the world
— source: International Data Corporation
I correlated with World’s total storage capacity

— data centers and cloud (45% - 55% in 2025)
I exponential growth (+ 30% each year)

— expected to be sustained in the long run
I small fraction of data is processed/analyzed

— shortage of trained data scientists
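The key figures above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original slides: the function names `annual_growth_rate` and `extrapolate` are illustrative assumptions, and the growth is modelled as a constant compound rate.

```cpp
#include <cassert>
#include <cmath>

// Compound annual growth rate between two volumes measured `years` apart.
// Returns e.g. 0.30 for a growth of +30% per year.
// (Hypothetical helper, introduced for illustration only.)
double annual_growth_rate(double v_start, double v_end, int years) {
    assert(v_start > 0.0 && v_end > 0.0 && years > 0);
    return std::pow(v_end / v_start, 1.0 / years) - 1.0;
}

// Volume reached after `years` of growth at a fixed yearly `rate`.
// (Hypothetical helper, introduced for illustration only.)
double extrapolate(double v_start, double rate, int years) {
    assert(years >= 0);
    return v_start * std::pow(1.0 + rate, years);
}
```

With the figures quoted above, `annual_growth_rate(33.0, 175.0, 7)` is about 0.27, which matches the "+30% each year" order of magnitude, while the 2012 to 2018 period corresponds to roughly 0.51 per year.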
Data production
Data are produced at an unprecedented rate by:
I Industry / Economy
I Sciences
I End users
Challenges
Complex data (non-linear, sparse, high-dimensional)
Corrupted data (noise, outliers, missing values)
Big data (streamed, online, distributed)
Data science’s celebrated successes...
AI for games:
1997: IBM’s Deep Blue wins chess match against world champion G. Kasparov
2016: DeepMind’s AlphaGo wins Go match against 18-time world champion Lee Sedol
2019: DeepMind’s AlphaStar beats StarCraft II professional players
ImageNet Challenge:
I database of 40 · 10^6+ images, structured in 20 · 10^3+ categories
I images collected on the Internet
I annotation process crowdsourced to Amazon Mechanical Turk
[J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei: ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009]
I annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
I until 2011, classification error rates around 25%
I 2012: breakthrough → deep CNN (AlexNet) reduced error to 16%

(this network has more than 60 million parameters to tune)
[Krizhevsky, Sutskever, Hinton: ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012]
I by now: error rates below 5%, better than human performance on narrow tasks

(narrow tasks: typically one-against-all, e.g. recognizing cats, cars, etc.)
I unsupervised pre-training leads to concept learning (e.g. human face, cat face)

(unsupervised pre-training ≈ using auto-encoders as feature generators to be plugged into ...)
(figure: this image is the optimal stimulus of the ...)
[Le et al.: Building high-level features using large scale unsupervised learning, ICML 2012]
Healthcare data:
2011: a new subgroup of breast cancers with excellent survival rates is discovered using exploratory data analysis techniques
[Nicolau et al.: Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival, PNAS 2011]
... and notorious failures
Microsoft’s Tay:
I AI-powered chat bot, launched on Twitter (@TayandYou) on March 23, 2016
I learned from its interactions with people
I shut down only 16 hours after launch
I produced inflammatory, offensive (racist, sexually-charged) tweets
I training overrun by trolls
I numerous questions raised (technical, legal, ethical)
Some notorious AI failures in 2018:
Google Photos’ AI feature confuses skier and mountain
Amazon’s AI recruiting tool proven to be gender-biased
Uber’s self-driving car kills pedestrian in Arizona
What is data science?
Aim: develop tools to store, manipulate, analyze, and extract knowledge from data
word cloud of paper titles at NIPS 2016 (source: http://www.kaggle.com/benhamner/nips-papers)
Core topics:
I statistical analysis
I machine learning
I pattern recognition
I data mining
I optimization (convex / combinatorial)
I database management and distributed systems
I high-performance computing (streaming, distributed, cloud)
Data?
Datum ≡ observation ≡ “chunk of information”
Vector representation

(note: this refers to the representations of the data taken as input by most algorithms ...)
coordinate matrix: rows = observations x1, ···, xn; columns = variables v1, ···, vd
categorical variables: 1, 2, ···, K (arbitrary labels) (e.g. “cat”, “dog”, “horse”)
continuous variables: real or complex values (e.g. temperature, pressure, geographic coordinate, income, amplitude/phase)
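As an illustration of the vector representation, here is a minimal sketch of a coordinate matrix and of an encoding of categorical values as integer labels. The names `CoordinateMatrix` and `encode_categories` are assumptions made for this example and do not come from the course material; the numbering by order of first appearance is one arbitrary choice among many.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// n x d coordinate matrix stored row by row:
// row i is observation x_i, column j is variable v_j.
// (Hypothetical type, introduced for illustration only.)
struct CoordinateMatrix {
    std::size_t n, d;
    std::vector<double> values;  // size n * d, row-major

    CoordinateMatrix(std::size_t n_, std::size_t d_)
        : n(n_), d(d_), values(n_ * d_, 0.0) {}

    // Value of variable j for observation i.
    double& at(std::size_t i, std::size_t j) {
        assert(i < n && j < d);
        return values[i * d + j];
    }
};

// Encode categorical values as arbitrary integer labels 1, 2, ..., K,
// numbered by order of first appearance in `raw`.
std::vector<int> encode_categories(const std::vector<std::string>& raw) {
    std::map<std::string, int> label_of;
    std::vector<int> labels;
    labels.reserve(raw.size());
    for (const std::string& s : raw) {
        auto it = label_of.find(s);
        if (it == label_of.end()) {
            int next = static_cast<int>(label_of.size()) + 1;
            it = label_of.emplace(s, next).first;
        }
        labels.push_back(it->second);
    }
    return labels;
}
```

Since the labels are arbitrary, their numerical order carries no meaning: "horse" being coded 3 and "cat" 1 does not make a horse "larger" than a cat.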
Metric representation

distance / (dis-)similarity matrix: rows and columns both indexed by the observations x1, ···, xn
distances: Euclidean, Hamming, geodesic, diffusion, edit, Jaccard, Wasserstein, etc.
(dis-)similarity measures: cosine, Kullback-Leibler, Bregman divergences, etc.
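A metric representation can be sketched as follows for two of the distances listed above (Euclidean and Hamming). The names `Point`, `euclidean`, `hamming` and `distance_matrix` are assumptions made for this example. The matrix builder assumes a symmetric function that vanishes on identical inputs, which holds for distances but not for every dissimilarity measure (Kullback-Leibler, for instance, is not symmetric).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Euclidean distance between two points of the same dimension.
double euclidean(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        double diff = a[k] - b[k];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Hamming distance: number of coordinates at which the labels differ.
int hamming(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    int count = 0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        if (a[k] != b[k]) ++count;
    }
    return count;
}

// n x n matrix D with D[i][j] = dist(x_i, x_j).
// Assumes dist is symmetric and dist(x, x) = 0, so only the upper
// triangle is computed and the diagonal is left at zero.
template <typename T, typename Dist>
std::vector<std::vector<double>> distance_matrix(const std::vector<T>& xs,
                                                 Dist dist) {
    const std::size_t n = xs.size();
    std::vector<std::vector<double>> D(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            D[i][j] = D[j][i] = dist(xs[i], xs[j]);
        }
    }
    return D;
}
```

Note that the matrix has n^2 entries, so this representation becomes costly to store when the number of observations is large.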
How to obtain a vector representation (∈ R^d) or a metric representation from raw data? → feature extraction
Programming languages for data science
I Databases / data manipulation: Structured Query Language (SQL)
(note: all other modern languages are built on SQL, e.g. QBE is in fact just a front-end ...)
I Data analysis: Python (CS) / R (stats)
(note: what is taught is (1) the principles of each approach; (2) how to apply it in Python)
I Effective data processing: C / C++ / CUDA (GPGPU)
[...]
Learning paradigms
Supervised learning
Input: data with labels (examples), e.g. “cat”, “dog”, “horse”
Goal: predict the labels of new data
Typical problems:
I classification (categorical labels)
I regression (continuous labels)
I forecasting (regression on time series)
(figure: energy consumption as a function of weather parameters)
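To make the supervised setting concrete, here is a minimal sketch of a classifier that predicts the label of a new observation as the label of its nearest labeled example. This 1-nearest-neighbor rule is chosen for illustration only and is not presented on this slide; the names `Example` and `predict_1nn` are assumptions, and observations are assumed to be given in vector representation.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// A labeled example: vector representation plus categorical label.
// (Hypothetical type, introduced for illustration only.)
struct Example {
    std::vector<double> x;  // coordinates of the observation
    int label;              // categorical label in 1, ..., K
};

// Predict the label of `query` as the label of its closest training
// example (squared Euclidean distance). Ties go to the earliest example.
int predict_1nn(const std::vector<Example>& training,
                const std::vector<double>& query) {
    assert(!training.empty());
    double best = std::numeric_limits<double>::infinity();
    int best_label = training.front().label;
    for (const Example& e : training) {
        assert(e.x.size() == query.size());
        double d2 = 0.0;
        for (std::size_t k = 0; k < query.size(); ++k) {
            double diff = e.x[k] - query[k];
            d2 += diff * diff;
        }
        if (d2 < best) {
            best = d2;
            best_label = e.label;
        }
    }
    return best_label;
}
```

Each prediction scans the whole training set, so the cost of one query grows linearly with the number of examples.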
Unsupervised learning
Input: data without labels
Goal: identify patterns, correlations
Typical problems:
I clustering
I dimensionality reduction
I anomaly detection / noise removal
Semi-supervised learning (only a fraction of the input data has labels): in between unsupervised and supervised learning
Reinforcement learning
Input: Markov decision process:
I agent & environment states, visibility rules, actions, transition probabilities, rewards
Goal: find policy that minimizes the regret

(regret = expected loss of reward compared to the optimal strategy)
(note: it is the total loss that is measured, throughout the process, hence penalizing every mistake)
Typical problems:
I exploration vs. exploitation (e.g. multi-armed bandit)

(typically, problems where the exploration vs. exploitation dilemma appears can be modelled as reinforcement learning problems)
I control learning
[Géron 2017]
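The notion of regret can be illustrated on a multi-armed bandit. The sketch below is an assumption-laden example, not the course's method: it uses an epsilon-greedy policy (one simple strategy among many), arms with Bernoulli rewards whose means lie in [0, 1], and a hypothetical function name `epsilon_greedy_regret`. It accumulates the expected loss of reward at every round, so every suboptimal choice is penalized.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Runs an epsilon-greedy policy on a Bernoulli multi-armed bandit and
// returns the cumulative regret: the sum over all rounds of
// (best expected reward) - (expected reward of the arm actually played).
// `means[a]` is the success probability of arm a, assumed to be in [0, 1].
double epsilon_greedy_regret(const std::vector<double>& means, double epsilon,
                             int rounds, unsigned seed) {
    assert(!means.empty() && epsilon >= 0.0 && epsilon <= 1.0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::uniform_int_distribution<std::size_t> any_arm(0, means.size() - 1);

    double best_mean = means[0];
    for (double m : means) {
        if (m > best_mean) best_mean = m;
    }

    std::vector<double> total(means.size(), 0.0);  // rewards collected per arm
    std::vector<int> pulls(means.size(), 0);       // times each arm was played
    double regret = 0.0;

    for (int t = 0; t < rounds; ++t) {
        std::size_t arm = 0;
        if (unif(rng) < epsilon) {
            arm = any_arm(rng);  // exploration: pick an arm at random
        } else {
            // Exploitation: pick the best empirical mean so far. Arms never
            // played get the optimistic estimate 1.0, so each gets tried.
            double best_estimate = -1.0;
            for (std::size_t a = 0; a < means.size(); ++a) {
                double estimate = pulls[a] > 0 ? total[a] / pulls[a] : 1.0;
                if (estimate > best_estimate) {
                    best_estimate = estimate;
                    arm = a;
                }
            }
        }
        double reward = unif(rng) < means[arm] ? 1.0 : 0.0;  // Bernoulli draw
        total[arm] += reward;
        ++pulls[arm];
        regret += best_mean - means[arm];  // expected loss of this round
    }
    return regret;
}
```

The regret is zero only if the best arm is played at every round. With a fixed epsilon > 0 the policy keeps exploring forever, so its regret grows linearly with the number of rounds; this is the price of the exploration vs. exploitation trade-off in this simple strategy.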
(source: NVIDIA)
Course outline
• Lecture 1: introduction to data science / C++ as C (1/2)
• Lecture 2: Nearest-Neighbor search / C++ as C (2/2)
• Lecture 3: k-means clustering / classes (1/2)

unsupervised
• Lecture 4: hierarchical clustering / classes (2/2)
• Lecture 5: density estimation and noise removal / inheritance
• Lecture 6: k-NN classifier / genericity

supervised
• Lecture 7: linear models for regression / STL
• Lecture 8: linear models for classification / -
• Lecture 9: artificial neural networks / C++11
• Lecture 10: feature extraction and dimensionality reduction / -
