
Week 8:

Machine Learning (in a Nutshell)


Winter term 2020 / 2021

Chair for Computational Social Science and Humanities


Markus Strohmaier, Florian Lemmerich, and Tobias Schumacher
Where are we?

Sources and Resources
➢ Sara Hajian: Algorithmic Bias (KDD Tutorial 2016)
➢ S. Bird et al.: Fairness-Aware Machine Learning in Practice, https://sites.google.com/view/fairness-tutorial
➢ S. Barocas and M. Hardt: Fair Machine Learning, https://vimeo.com/248490141
➢ S. Hajian et al.: Discrimination-Aware Machine Learning, https://www.youtube.com/watch?v=mJcWrfoGup8
➢ http://www.mlandthelaw.org/slides/hajian.pdf

Agenda
➢ Machine Learning in a Nutshell
▪ Overview
▪ Classification
▪ Clustering
▪ Where does ML work?

8.1 Overview

Machine Learning

“Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.” [Wikipedia]
➢ Subset of Artificial Intelligence
➢ Enhances AI systems by extracting patterns from raw data instead of relying only on hard-coded rules
➢ Application-oriented

➢ “Statistics with theory and assumptions dropped” (?)

Comparison of Machine Learning and Data Mining
➢ Strong overlap

➢ Data Mining: Emphasis on (interactive) exploration and discovery

➢ ML: Emphasis on prediction of new events

➢ Data Mining uses Machine Learning methods and vice versa

➢ “Data science” as an umbrella term?

ML/DM tasks
➢ Classification
➢ Regression (=Numeric Prediction)
➢ Clustering
➢ Association Rule Mining
➢ …

➢ On data types
▪ Tabular data
▪ Text
▪ Networks
▪ Images
▪ Sequences
▪ Time Series
▪ …
8.2 Classification (+ Regression)

Classification problems
➢ Given:
▪ A set of objects (data points)
▪ For each object, a set of features (attribute values) $(x_1, \ldots, x_d)$
▪ For each object, a class (label) $c \in C = \{c_1, \ldots, c_k\}$
⇒ “supervised” technique
➢ Goal:
Find the correct class for new objects (given only the features)

➢ Difference to clustering:
▪ Classification:
• Classes are known a priori
• There is training data with known labels
▪ Clustering: clusters (= classes) have to be identified from the data

Classification Process
1. Training phase: a classification algorithm learns a classifier (model) from training data.

   Training data:

   NAME  RANK            YEARS  TENURED
   Mike  Assistant Prof    3    no
   Mary  Assistant Prof    7    yes
   Bill  Professor         2    yes
   Jim   Associate Prof    7    yes
   Dave  Assistant Prof    6    no
   Anne  Associate Prof    3    no

   Learned classifier (model):
   if rank = ‘Professor’ or years > 6 then tenured = ‘yes’

2. Application phase (test phase): the classifier labels unknown data.
   (Jeff, Professor, 4) → Tenured? yes
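A minimal sketch of the two phases in Python with scikit-learn (illustrative only; the slides do not prescribe a library or algorithm, and the rank attribute is simplified here to a professor flag):

```python
# Sketch of training and application phase on the tenure data above.
from sklearn.tree import DecisionTreeClassifier  # one possible algorithm

# Training data: (is_professor, years) -> tenured
X_train = [[0, 3], [0, 7], [1, 2], [0, 7], [0, 6], [0, 3]]
y_train = ["no", "yes", "yes", "yes", "no", "no"]

# 1. Training phase: learn a classifier (model) from labeled data
clf = DecisionTreeClassifier().fit(X_train, y_train)

# 2. Application phase: predict the label of the unseen instance (Jeff, Professor, 4)
print(clf.predict([[1, 4]]))  # expected: ['yes']
```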

Evaluation scenario

Confusion matrix:

                          Ground Truth
                          positive              negative
Prediction  positive      True positive (TP)    False positive (FP)
            negative      False negative (FN)   True negative (TN)

➢ Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
➢ Precision: $\frac{TP}{TP + FP}$
➢ Recall: $\frac{TP}{TP + FN}$
➢ F1-Measure: $\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
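The four measures computed directly from predictions (a sketch with made-up labels; scikit-learn provides the same metrics as accuracy_score, precision_score, recall_score, and f1_score):

```python
# Count the confusion-matrix cells and derive the metrics from them.
y_true = ["pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "pos", "neg", "pos"]

tp = sum(t == "pos" and p == "pos" for t, p in zip(y_true, y_pred))
tn = sum(t == "neg" and p == "neg" for t, p in zip(y_true, y_pred))
fp = sum(t == "neg" and p == "pos" for t, p in zip(y_true, y_pred))
fn = sum(t == "pos" and p == "neg" for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # 0.6, 0.667, 0.667, 0.667
```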

Beyond Logistic Regression
➢ We can solve classification problems with logistic regression

➢ Why do we need anything else?

➢ Logistic regression makes certain assumptions:
▪ Independence of features ⇒ complex relationships between variables can’t be captured
▪ A linear classification boundary ⇒ other algorithms might fit better
▪ Only two outcomes (classes) are possible
➢ These assumptions do not hold for all datasets

Algorithms
➢ Naive Bayes
➢ Nearest Neighbor
➢ Decision Trees
➢ SVM
➢ Neural Networks
➢ Random Forest
➢ Boosting

Naive Bayes
➢ Probability-based classifier:
Assign object $o = (x_1, \ldots, x_d)$ to the class $c_j$ that maximizes $P(c_j \mid o)$:

$\operatorname{argmax}_{c_j \in C} P(c_j \mid o)$

➢ $P(c_j \mid o)$ is unknown! Use Bayes’ theorem:

$\operatorname{argmax}_{c_j \in C} P(c_j \mid o) = \operatorname{argmax}_{c_j \in C} \frac{P(o \mid c_j) \cdot P(c_j)}{P(o)} = \operatorname{argmax}_{c_j \in C} P(o \mid c_j) \cdot P(c_j)$

➢ We can estimate $P(c_j)$ from the training data by simple counting

➢ To compute $P(o \mid c_j)$, assume that all features are independent:

$P(o \mid c_j) = P(x_1 \mid c_j) \cdot \ldots \cdot P(x_d \mid c_j) = \prod_{i=1}^{d} P(x_i \mid c_j)$

➢ Overall decision rule for the Naïve Bayes classifier:

$\operatorname{argmax}_{c_j \in C} P(c_j) \cdot \prod_{i=1}^{d} P(x_i \mid c_j)$

Example Naive Bayes
Training data:

Color  Age    Class
Red    Old    Good
White  Old    Good
White  Young  Good
Red    Old    Bad
White  Young  Bad

New instance to classify: (Red, Young, ?)

Compute the values of the decision rule for both classes:

Class “Good”: P(Good) ⋅ P(Red | Good) ⋅ P(Young | Good) = 3/5 ⋅ 1/3 ⋅ 1/3 = 1/15
Class “Bad”:  P(Bad) ⋅ P(Red | Bad) ⋅ P(Young | Bad) = 2/5 ⋅ 1/2 ⋅ 1/2 = 1/10

⇒ Higher score for class “Bad”
⇒ The new instance is classified as “Bad”
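The slide’s computation reproduced as a small Python sketch (illustrative only; a practical implementation would add smoothing for feature values unseen in a class):

```python
# Naive Bayes decision rule on the toy data from the slide.
data = [("Red", "Old", "Good"), ("White", "Old", "Good"),
        ("White", "Young", "Good"), ("Red", "Old", "Bad"),
        ("White", "Young", "Bad")]

def score(color, age, cls):
    rows = [r for r in data if r[2] == cls]
    prior = len(rows) / len(data)                            # P(class)
    p_color = sum(r[0] == color for r in rows) / len(rows)   # P(color | class)
    p_age = sum(r[1] == age for r in rows) / len(rows)       # P(age | class)
    return prior * p_color * p_age

scores = {c: score("Red", "Young", c) for c in ("Good", "Bad")}
print(scores)                        # {'Good': 0.0667, 'Bad': 0.1}
print(max(scores, key=scores.get))   # 'Bad'
```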

Nearest Neighbor Classification
➢ Define distances between data instances
➢ Specify a value k
➢ To classify a new data instance:
▪ Search the k most similar instances (k-nearest-neighbors)
▪ Select the majority class among those neighbors (“voting”)
▪ Option: weight neighbors by distance (close by: high weight; far away: low weight)

[Figure: k-nearest-neighbor example with k = 5 and k = 10]
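A sketch of k-nearest-neighbor classification with scikit-learn (data and k are made up; weights="distance" corresponds to the distance-weighting option above):

```python
# k-NN: classify by majority vote among the k closest training points.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ["a", "a", "a", "b", "b", "b"]

knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)
print(knn.predict([[2, 2], [7, 9]]))  # ['a' 'b']
```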

Decision Trees
➢ Each inner node is an attribute
➢ Each edge is a value (range) of its source node
➢ Each leaf node is a class label
➢ Learning: greedy, top to bottom
➢ Prediction: follow the tree from top to bottom based on the instance’s attribute values

Training data (as before):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof    3    no
Mary  Assistant Prof    7    yes
Bill  Professor         2    yes
Jim   Associate Prof    7    yes
Dave  Assistant Prof    6    no
Anne  Associate Prof    3    no

Resulting tree:

                 Rank?
       Professor /    \ Not professor
             yes      Years?
                    <7 /   \ ≥7
                     no     yes
Decision Tree Induction
➢ Top down, recursive
➢ At each node: try to separate the classes as well as possible
➢ Formally: use information gain, based on entropy:

$entropy(T) = -\sum_{i=1}^{k} p_i \cdot \log p_i$

$information\_gain(T, A) = entropy(T) - \sum_{i=1}^{m} \frac{|T_i|}{|T|} \cdot entropy(T_i)$

➢ Recurse until no data remains or all instances belong to one class

➢ Many variations:
▪ Only binary splits
▪ Alternatives to information gain
▪ Pruning: limit the size of the decision tree to avoid overfitting
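The two formulas as a short Python sketch (helper names entropy and information_gain are my own; T is represented as a list of class labels, and a split as a partition of that list):

```python
# Entropy of a label list and the information gain of a candidate split.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, splits):
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in splits)

# A perfect split of a balanced binary label set yields gain = entropy = 1.0
labels = ["yes"] * 3 + ["no"] * 3
print(information_gain(labels, [["yes"] * 3, ["no"] * 3]))  # 1.0
```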

Support Vector Machines
➢ Each data point is a point in d-dimensional space
➢ Find the separating hyperplane that maximizes the distance (margin) to the instances
➢ The base form requires linearly separable data
➢ Extensions handle non-separable data as well:
▪ Kernel methods: transformation into higher-dimensional spaces
▪ Slack variables: add a penalty term if an instance is misclassified
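An illustrative scikit-learn sketch (kernel choice and penalty parameter C are example values, not from the slides); with an RBF kernel, even the XOR data below, which is not linearly separable, can be fit:

```python
# SVM with a kernel (implicit higher-dimensional mapping) and slack penalty C.
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR: not linearly separable in the original space

clf = SVC(kernel="rbf", C=10.0).fit(X, y)
print(clf.predict(X))  # should reproduce [0 1 1 0]
```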

Idea: Kernels and SVM

[Figure: one-dimensional base data x becomes linearly separable after mapping it into a higher-dimensional space, e.g. via x ↦ (x, x²)]
Neural Networks (“Deep Learning”)
➢ Connect many perceptrons with each other
➢ Based on Backpropagation learning (gradient descent)

[Figure: feed-forward network with inputs x1, x2, x3, two hidden layers, and an output y]

➢ Huge variety of architectures


➢ Requires large amounts of training data
➢ “Black box”
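A small feed-forward network in scikit-learn as a sketch (architecture, hyperparameters, and data are illustrative; deep learning in practice uses dedicated frameworks and far more data):

```python
# A multilayer perceptron trained with backpropagation (gradient descent).
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.score(X, y))  # training accuracy, typically close to 1.0
```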

No-Free-Lunch-theorem (Machine Learning edition)
➢ Consider arbitrary datasets:
▪ No assumptions about the data
▪ Labels assigned randomly

➢ Then, averaged over all such datasets, all algorithms are equally good!

➔ In general, there is no “best algorithm”

➢ Algorithms make more or less reasonable assumptions about the data, e.g.:
▪ Linearity of features
▪ “Similar” objects likely have a similar class, …

Idea of Ensemble Learning
➢ Different classifiers make different errors
➢ Try to combine classifiers to improve classification accuracy!
➢ Definition Ensemble Learning:
“An ensemble of classifiers is a set of classifiers whose individual decisions are combined
in some way to classify new examples” (Dietterich, 2000)

Base learners → single predictions → combining mechanism → final prediction

[Figure: classifiers C1, C2, C3, …, Cn each output a prediction; a combining mechanism (e.g., majority vote) produces the final prediction, here “Class = b”]

Example: Ensemble Learning
➢ Can this work?
➢ Example with 3 classifiers on a test set i1, …, i5:

                 i1   i2   i3   i4   i5   Overall accuracy
True class       +    +    +    -    -
Classifier 1     +    +    +    +    -    0.8
Classifier 2     +    +    -    -    -    0.8
Classifier 3     -    -    +    -    -    0.6
Majority voting  +    +    +    -    -    1.0

➢ The ensemble classifies better than each individual classifier!
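The majority vote from the table in a few lines of Python (a sketch; scikit-learn packages the same idea as VotingClassifier):

```python
# Majority voting over the three classifiers' predictions for i1..i5.
predictions = {
    "C1": ["+", "+", "+", "+", "-"],
    "C2": ["+", "+", "-", "-", "-"],
    "C3": ["-", "-", "+", "-", "-"],
}

majority = [max(set(votes), key=votes.count)
            for votes in zip(*predictions.values())]
print(majority)  # ['+', '+', '+', '-', '-'] — matches the true classes
```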

Random Forests
➢ Ensemble method
➢ Build many slightly different trees
▪ Subset of instances for each tree (bootstrap sampling)
▪ Choose from a subset of attributes at each node
➢ Collect the predictions of all trees
➢ The majority decides
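An illustrative scikit-learn sketch (dataset and parameters are example choices): n_estimators is the number of trees, max_features limits the attributes considered at each split, and bootstrap sampling of instances is the default:

```python
# Random forest: an ensemble of slightly different decision trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.predict(X[:3]))  # each prediction is a majority vote over 100 trees
```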

Boosting
➢ Train a simple classifier
➢ Identify misclassified instances of this classifier
➢ Increase the weight of these instances
➢ Iterate

➢ Overall classifier: Combination of all previously trained models
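A hedged example with AdaBoost in scikit-learn (parameters are illustrative); by default it boosts simple decision stumps, reweighting misclassified instances between rounds:

```python
# Boosting: iteratively train weak learners on reweighted instances.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(ada.score(X, y))  # training accuracy of the combined model
```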

Task Variation: Multiclass Classification
➢ More than two possible classes (outcomes)
➢ What to do?

➢ Some algorithms extend to this naturally


▪ Decision Trees
▪ Nearest Neighbor
▪ Naïve Bayes

➢ Others?
▪ One-vs-all: Train a classifier for each class – is it this class or something else?
▪ One-vs-one: Train a classifier for each pair of classes – which one is more likely?
Then, do a majority voting.
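Both strategies as scikit-learn wrappers around a binary base learner (a sketch; logistic regression serves only as an example base classifier, and the iris data stands in for any 3-class problem):

```python
# One-vs-rest trains one classifier per class; one-vs-one trains one per pair.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # 3 classifiers
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # 3 pairwise classifiers
print(ovr.predict(X[:2]), ovo.predict(X[:2]))
```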

Other Variations
➢ Multilabel classification

➢ Cost-sensitive classification

➢ Unbiased classification

Regression
➢ We learned most of the basics earlier in this course

➢ Most classification approaches have a regression counterpart

➢ Examples:
▪ Regression trees
▪ Support Vector Regression (SVR)
▪ Random forest regressors
▪ Boosted trees
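A sketch of three regression counterparts in scikit-learn (models and data are illustrative); the interface mirrors classification, but predict returns numeric values:

```python
# Regression pendants of tree, SVM, and random forest classifiers.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
for model in (DecisionTreeRegressor(), SVR(), RandomForestRegressor()):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X[:1]))  # a numeric prediction
```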

8.3 Clustering
Clustering: Goal and Definition
➢ Identification of a finite set of clusters in the data
(= categories, “classes”, groups)
➢ Objects in the same cluster should be as similar as possible.
➢ Objects in different clusters should be as dissimilar as possible
➢ “Unsupervised learning” ⇒ no classes/labels given in advance

Example domains: gene sequences, books/documents, language dialects, websites, political opinions

What does “Similar” Mean?

Similarity?

Distance functions & similarity functions
➢ Formalization of distance and similarity:
▪ Define functions dist(o1, o2) or sim(o1, o2) (distance is more common)
▪ Small distance ⇒ similar objects
➢ A “good” distance measure is crucial, but depends on the application
➢ Common distance measures:

▪ Euclidean distance: $dist_{Eucl}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$

▪ Manhattan distance: $dist_{Man}(x, y) = \sum_{i=1}^{d} |x_i - y_i|$

▪ Maximum metric: $dist_{Max}(x, y) = \max\{|x_i - y_i| \mid 1 \le i \le d\}$

▪ Hamming distance: number of differing attribute values
  E.g., x = (young, teacher, yes); y = (old, student, yes) ⇒ $dist_{Hamming}(x, y) = 2$
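The four measures in NumPy/Python (a sketch; the two points are chosen so the results match the example slide below):

```python
# Euclidean, Manhattan, maximum, and Hamming distance for two vectors.
import numpy as np

x, y = np.array([1.0, 4.0]), np.array([3.0, 1.0])
print(np.sqrt(np.sum((x - y) ** 2)))  # Euclidean: sqrt(13) ≈ 3.606
print(np.sum(np.abs(x - y)))          # Manhattan: 5.0
print(np.max(np.abs(x - y)))          # Maximum metric: 3.0

a, b = ("young", "teacher", "yes"), ("old", "student", "yes")
print(sum(ai != bi for ai, bi in zip(a, b)))  # Hamming: 2
```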

Example: Distance metrics
➢ $dist_{Eucl}(A, B) = \sqrt{13}$

➢ $dist_{Man}(A, B) = 5$

➢ $dist_{Max}(A, B) = 3$

[Figure: points A and B on a grid, differing by 2 in one coordinate and by 3 in the other]

K-means
1. Choose the number of clusters k
2. Pick k initial centroids (cluster centers) randomly
3. Assign each data point to the nearest centroid
4. Compute the new centroids as the mean (vector!) of all points in a cluster
5. If any assignment changed, go to step 3
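A minimal k-means sketch following steps 1–5 above (illustrative; it omits edge cases such as empty clusters, and in practice sklearn.cluster.KMeans with several random restarts would be used):

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    centroids = X[rng.choice(len(X), k, replace=False)]  # step 2: random init
    while True:
        # step 3: assign each point to the nearest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        # step 4: recompute centroids as the mean of each cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # step 5: stop once nothing changes
            return labels, new
        centroids = new

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
print(kmeans(X, k=2)[0])  # e.g. [0 0 0 1 1 1]
```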

K-Means: Example

[Figure: k-means iterations on a 2D dataset, alternating between assigning data points to clusters and recomputing centroids until convergence]

Videos:
https://www.youtube.com/watch?v=-lu8Ku5Cw9E
https://www.youtube.com/watch?v=BVFG7fd1H30
Density-based Clustering (DBSCAN)
➢ Idea:
▪ Clusters are dense regions in d-dimensional space
▪ Clusters are separated by sparse regions

➢ Requirements for a cluster:
▪ Define neighborhoods (e.g., all points within a given radius)
▪ Each data point in the cluster has at least n neighbors
▪ Data points in one cluster are connected via “dense points”
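An illustrative DBSCAN sketch with scikit-learn (eps is the neighborhood radius, min_samples the required number of neighbors; values and data are made up):

```python
# DBSCAN finds dense clusters and marks sparse points as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [50, 50]])
db = DBSCAN(eps=2.0, min_samples=3).fit(X)
print(db.labels_)  # two dense clusters; -1 marks the outlier (50, 50)
```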

Hierarchical Clustering
➢ Goal: A hierarchy of clusters
➢ Construction (bottom up): join the two closest existing clusters
➢ Result: dendrogram

[Figure: example data points in the plane and the corresponding dendrogram; merge heights reflect the distance at which clusters are joined]
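A sketch of bottom-up (agglomerative) clustering with SciPy (data is made up): linkage repeatedly merges the two closest clusters, and dendrogram draws the resulting tree:

```python
# Hierarchical clustering: build the full merge history, then plot it.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
Z = linkage(X, method="single")  # each row: [cluster1, cluster2, distance, size]
dendrogram(Z)
plt.show()
```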

Example Dendrogram

http://upload.wikimedia.org/wikipedia/commons/3/39/Swiss_complete.png
8.4 Where does ML work?

Why now?

Why is there so much “snake oil”, i.e., absurd or even harmful applications of AI/data science?

1. Some technologies have made remarkable progress: content identification (reverse image search), facial recognition, speech-to-text, medical diagnosis, translation.

2. Companies advertise algorithms/AI as a solution to all problems:
   https://governanceai.github.io/US-Public-Opinion-Report-Jan-2019/us_public_opinion_report_jan_2019.pdf

https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf

What determines where “AI” will work?

https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf

Can social outcomes be predicted?

The Fragile Families Study is an ongoing study of a large group of young people throughout their lives. Since the late 1990s, a cohort of nearly 5,000 children from the US has been tracked and repeatedly surveyed.

The aim is to understand how family structure relates to the children’s outcomes.

A research team led by Matt Salganik issued a challenge to researchers: can you predict social outcomes?

https://fragilefamilies.princeton.edu/about

https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf

