Florian Lemmerich
Social Data Science
Sources and Resources
➢ Sara Hajian (Algorithmic Bias, KDD Tutorial 2016)
➢ S. Bird et al: Fairness-Aware Machine Learning in Practice
https://sites.google.com/view/fairness-tutorial
➢ S. Barocas and M. Hardt: Fair Machine Learning https://vimeo.com/248490141
Agenda
➢ Machine Learning in a Nutshell
▪ Overview
▪ Classification
▪ Clustering
▪ Where does ML work?
8.1 Overview
Machine Learning
[Wikipedia]
➢ Subset of Artificial Intelligence
➢ Enhances AI systems by extracting patterns from raw data instead of relying only on
hard-coded rules
➢ Application-oriented
Comparison of Machine Learning and Data Mining
➢ Strong overlap
MLDM tasks
➢ Classification
➢ Regression (=Numeric Prediction)
➢ Clustering
➢ Association Rule Mining
➢ …
➢ On data types
▪ Tabular data
▪ Text
▪ Networks
▪ Images
▪ Sequences
▪ Time Series
▪ …
8.2 Classification (+Regression)
Classification problems
➢ Given:
▪ A set of objects (data points)
▪ For each object, a set of features (attribute values) (x1, …, xd)
▪ For each object, a class (label) c ∈ C = {c1, …, ck}
→ a “supervised” technique
➢ Goal:
Find correct class for new objects (given only the features)
➢ Difference to Clustering:
▪ Classification:
• classes are known a-priori,
• There is training data with known labels
▪ Clustering: clusters (=classes) have to be identified
Classification Process
1. Training phase: a classification algorithm learns a classifier from the training data
(features plus known labels).
2. Application phase (test phase): the classifier assigns a label to unknown data,
e.g., (Jeff, Professor, 4) → Tenured? yes
Evaluation scenario
                  Ground Truth
                  positive         negative
Pred.  positive   True positive    False positive
       negative   False negative   True negative
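The four cells can be counted directly from predictions and ground truth; a minimal sketch with hypothetical 0/1 labels:

```python
# Counting the four confusion-matrix cells for a toy prediction run.
truth = [1, 1, 1, 0, 0, 0]   # ground-truth labels (hypothetical)
pred  = [1, 1, 0, 0, 0, 1]   # classifier predictions (hypothetical)

tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))  # true positives
fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))  # false positives
fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))  # false negatives
tn = sum(p == 0 and t == 0 for p, t in zip(pred, truth))  # true negatives
print(tp, fp, fn, tn)  # → 2 1 1 2
```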
Beyond Logistic Regression
➢ We can solve Classification problems with Logistic Regression
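As a sketch of that route, a tiny logistic-regression classifier trained by stochastic gradient ascent; the 1-D dataset is hypothetical:

```python
import math

# Minimal logistic regression for binary classification on hypothetical 1-D data.
xs = [1, 2, 3, 7, 8, 9]   # feature values
ys = [0, 0, 0, 1, 1, 1]   # class labels

w, b = 0.0, 0.0
lr = 0.1                  # learning rate
for _ in range(1000):
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(w * x + b)))  # predicted P(y = 1 | x)
        w += lr * (y - p) * x                 # log-likelihood gradient step
        b += lr * (y - p)

def predict(x):
    return 1 / (1 + math.exp(-(w * x + b))) > 0.5

print(predict(0), predict(10))  # → False True
```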
Algorithms
➢ Naive Bayes
➢ Nearest Neighbor
➢ Decision Trees
➢ SVM
➢ Neural Networks
➢ Random Forest
➢ Boosting
Naive Bayes
➢ Probability-based classifier:
Assign object o = (x1, …, xd) to the class cj ∈ C such that P(cj | o) is maximized:
argmax_{cj ∈ C} P(cj | o)
➢ Problem: P(cj | o) is unknown!
(Bayes’ theorem and the naive independence assumption reduce it to the product of the
class prior P(cj) and the per-feature likelihoods P(xi | cj).)
Example Naive Bayes
Training data:

Color  Age    Class
Red    Old    Good
White  Old    Good
White  Young  Good
Red    Old    Bad
White  Young  Bad

New instance: (Red, Young) → ?

Compute the values of the decision rule for both classes:

Class “Good”:
P(Good) ⋅ P(Red | Good) ⋅ P(Young | Good) = 3/5 ⋅ 1/3 ⋅ 1/3 = 1/15
Class “Bad”:
P(Bad) ⋅ P(Red | Bad) ⋅ P(Young | Bad) = 2/5 ⋅ 1/2 ⋅ 1/2 = 1/10

Since 1/10 > 1/15, the new instance is classified as “Bad”.
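The slide’s computation can be replicated directly (plain counts, no smoothing):

```python
# Toy training data from the slide: (color, age, class).
data = [("Red", "Old", "Good"), ("White", "Old", "Good"), ("White", "Young", "Good"),
        ("Red", "Old", "Bad"), ("White", "Young", "Bad")]

def nb_score(color, age, cls):
    """Naive Bayes decision value: prior times per-feature likelihoods."""
    rows = [r for r in data if r[2] == cls]
    prior = len(rows) / len(data)
    p_color = sum(r[0] == color for r in rows) / len(rows)
    p_age = sum(r[1] == age for r in rows) / len(rows)
    return prior * p_color * p_age

print(nb_score("Red", "Young", "Good"))  # ≈ 1/15
print(nb_score("Red", "Young", "Bad"))   # = 1/10
```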
Nearest Neighbor Classification
➢ Define distances between data instances
➢ Specify a value k
➢ To classify a new data instance:
▪ Search the k most similar instances (k-nearest-neighbors)
▪ Select the majority class among those neighbors (“voting”)
▪ Option: weight votes by distance (close-by: high weight, far away: low weight)
(Figure: neighborhoods of a query point for k = 5 and k = 10.)
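A minimal unweighted k-NN sketch; the 2-D training points are hypothetical:

```python
from collections import Counter

# Toy labeled 2-D data: two well-separated groups "A" and "B".
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]

def knn_predict(x, k=3):
    """Classify x by majority vote among its k nearest training instances."""
    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    neighbors = sorted(train, key=lambda item: dist(item[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn_predict((2, 2)), knn_predict((6, 5)))  # → A B
```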
Decision Trees
➢ Each inner node tests an attribute
➢ Each edge corresponds to a value (range) of its source node’s attribute
➢ Each leaf node is a class label
➢ Learning: greedy, top-to-bottom
➢ Prediction: follow the tree from top to bottom based on the instance’s attribute values

Example training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

(Figure: example tree with a split on RANK at the root and a split on YEARS < 7 below,
leading to leaves no and yes.)
Decision Tree Induction
➢ Top down, recursive
➢ At each point: try to separate classes
➢ Formally: use Information Gain:

entropy(T) = − Σ_{i=1}^{k} p_i · log(p_i)

information_gain(T, A) = entropy(T) − Σ_{i=1}^{m} (|T_i| / |T|) · entropy(T_i)
➢ Many variations
▪ Only binary splits
▪ Alternatives to information gain
▪ Pruning: limit the size of the decision tree to avoid overfitting
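Both formulas in a short sketch, applied to the tenure table from the decision-tree example (splitting by RANK is my choice of illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partition):
    """Entropy reduction when the label list is split into the given sublists."""
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in partition)

# Tenure example: the six TENURED labels, split by RANK.
labels = ["no", "yes", "yes", "yes", "no", "no"]  # Mike .. Anne
by_rank = [["no", "yes", "no"],                   # Assistant Prof
           ["yes"],                               # Professor
           ["yes", "no"]]                         # Associate Prof
print(entropy(labels), information_gain(labels, by_rank))  # 1.0, ≈ 0.21
```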
Support Vector Machines
➢ Each data point is a point in d-dimensional space
➢ Find separating hyperplane that maximizes distance to instances
➢ Base form requires linear separability
Idea: Kernels and SVM
➢ Base data: points on a 1-D axis (x) that are not linearly separable
➢ Mapped into a higher-dimensional space (e.g., x → (x, x²)), the classes become
linearly separable
(Figure: the 1-D base data and the same data plotted against x².)
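The 1-D-to-parabola idea can be checked numerically; the points and the threshold here are hypothetical:

```python
# One class far from the origin, one class near it: no single threshold on x
# separates them, but after the feature map x -> x^2 a threshold does.
A = [-3, -2, 2, 3]       # class A (outer points)
B = [-1, 0, 1]           # class B (inner points)
phi = lambda x: x * x    # feature map into the higher-dimensional space

threshold = 2.5          # hypothetical threshold between max(phi(B)) and min(phi(A))
print(all(phi(x) > threshold for x in A) and
      all(phi(x) < threshold for x in B))  # → True
```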
Neural Networks (“Deep Learning”)
➢ Connect many perceptrons with each other
➢ Based on Backpropagation learning (gradient descent)
(Figure: network with inputs x1, x2, x3 and output y.)
No-Free-Lunch-theorem (Machine Learning edition)
➢ Consider all possible datasets:
▪ No assumptions about the data
▪ Labels assigned randomly
➢ Then, averaged over all such datasets, no learning algorithm performs better than
any other (or than random guessing)
Idea of Ensemble Learning
➢ Different classifiers make different errors
➢ Try to combine classifiers to improve classification accuracy!
➢ Definition Ensemble Learning:
“An ensemble of classifiers is a set of classifiers whose individual decisions are combined
in some way to classify new examples” (Dietterich, 2000)
(Figure: classifiers C1, C2, C3, …, Cn each predict a class; a combining mechanism
merges the individual predictions into the final output, e.g., Class = b.)
Example: Ensemble Learning
➢ Can this work?
➢ Example with 3 classifiers on a test set i1, …, i5:

                 i1  i2  i3  i4  i5   Overall Accuracy
True class       +   +   +   -   -
Classifier 1     +   +   +   +   -    0.8
Classifier 2     +   +   -   -   -    0.8
Classifier 3     -   -   +   -   -    0.6
Majority voting  +   +   +   -   -    1.0
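The table’s majority vote can be replayed directly:

```python
from collections import Counter

# Predictions from the ensemble example above.
true = ["+", "+", "+", "-", "-"]
c1   = ["+", "+", "+", "+", "-"]
c2   = ["+", "+", "-", "-", "-"]
c3   = ["-", "-", "+", "-", "-"]

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# Instance-wise majority vote over the three classifiers.
majority = [Counter(votes).most_common(1)[0][0] for votes in zip(c1, c2, c3)]
print([accuracy(c, true) for c in (c1, c2, c3)], accuracy(majority, true))
# → [0.8, 0.8, 0.6] 1.0
```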
Random Forests
➢ Ensemble method
➢ Build many slightly different trees
▪ Subset of instances for each tree (bootstrap sampling)
▪ Choose from a subset of attributes at each node
➢ Collect predictions for each tree
➢ Majority decides
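A minimal sketch of the bagging half of this recipe, using hypothetical one-threshold “stumps” on toy 1-D data; real random forests additionally sample a subset of attributes at each node:

```python
import random

random.seed(0)
# Toy labeled dataset: "+" for x > 5, "-" otherwise.
data = [(x, "+" if x > 5 else "-") for x in range(10)]

def train_stump(sample):
    # Choose the threshold with the fewest misclassifications on the sample.
    best_t = min(range(10),
                 key=lambda t: sum((x > t) != (y == "+") for x, y in sample))
    return lambda x: "+" if x > best_t else "-"

# Each "tree" is trained on a bootstrap sample (drawn with replacement).
trees = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]

def forest_predict(x):
    votes = [tree(x) for tree in trees]
    return max(set(votes), key=votes.count)  # majority decides

print(forest_predict(0), forest_predict(9))  # → - +
```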
Boosting
➢ Train a simple classifier
➢ Identify misclassified instances of this classifier
➢ Increase the weight of these instances
➢ Iterate
Task Variation: Multiclass Classification
➢ More than two possible classes (outcomes)
➢ What to do?
▪ One-vs-all: Train a classifier for each class – is it this class or something else?
▪ One-vs-one: Train a classifier for each pair of classes – which one is more likely?
Then, take a majority vote.
➢ Others?
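The one-vs-one scheme can be sketched with hard-coded (hypothetical) pairwise decisions standing in for trained binary classifiers:

```python
from collections import Counter
from itertools import combinations

classes = ["a", "b", "c"]
# Hypothetical outcome of each pairwise binary classifier:
# pairwise_winner[(x, y)] is the class it prefers between x and y.
pairwise_winner = {("a", "b"): "b", ("a", "c"): "a", ("b", "c"): "b"}

# Majority vote over all pairwise decisions.
votes = Counter(pairwise_winner[pair] for pair in combinations(classes, 2))
prediction = votes.most_common(1)[0][0]
print(prediction)  # → b
```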
Other Variations
➢ Multilabel classification
➢ Cost-sensitive classification
➢ Unbiased classification
Regression
➢ We learned most of the basics already in this course
➢ Examples:
▪ Regression trees
▪ Support Vector Regression (SVR)
▪ Random forest regressor
▪ Boosted trees
8.3 Clustering
Clustering: Goal and Definition
➢ Identification of a finite set of clusters in the data
(= categories, “classes”, groups)
➢ Objects in the same cluster should be as similar as possible.
➢ Objects in different clusters should be as dissimilar as possible
➢ “Unsupervised learning” => no groups given
➢ Examples: language dialects, websites, political opinions
What does “Similar“ Mean?
Distance functions & similarity functions
➢ Formalization of distance and similarity:
▪ Determine functions dist(o1, o2) or sim(o1, o2) (distance is more common)
▪ Small distance ⇔ similar objects
➢ A “good” distance measure is crucial, but depends on the application
➢ Common distance measures:
▪ Euclidean distance: distEucl(x, y) = sqrt( Σ_{i=1}^{d} (xi − yi)² )
▪ Manhattan distance: distMan(x, y) = Σ_{i=1}^{d} |xi − yi|
Example: Distance metrics
➢ distEucl(A, B) = √13 ≈ 3.61
➢ distMan(A, B) = 5
➢ distMax(A, B) = 3
(Figure: points A and B in the plane, differing by 3 in one coordinate and 2 in the other.)
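The three measures can be computed for a point pair with the same offsets as the example (3 and 2); the concrete coordinates are hypothetical:

```python
import math

# Hypothetical coordinates reproducing the example's offsets (Δx = 3, Δy = 2).
A, B = (1, 1), (4, 3)

def dist_eucl(p, q): return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
def dist_man(p, q):  return sum(abs(a - b) for a, b in zip(p, q))
def dist_max(p, q):  return max(abs(a - b) for a, b in zip(p, q))

print(dist_eucl(A, B), dist_man(A, B), dist_max(A, B))  # ≈ 3.61, 5, 3
```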
K-means
1. Choose the number of clusters k
2. Pick k initial centroids (cluster centers) randomly
3. Assign each data point to its closest centroid
4. Recompute each centroid as the mean of its assigned points
5. Repeat steps 3 and 4 until the assignments no longer change
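The full k-means loop (random initialization, assignment to the nearest centroid, centroid recomputation until convergence) can be sketched in a few lines; the point set is hypothetical, and results generally depend on the random initialization:

```python
import random

def kmeans(points, k, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)  # random initial centroids
    while True:
        # Assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its assigned points.
        new = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:       # assignments stable -> done
            return centroids, clusters
        centroids = new

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(pts, 2)
print(sorted(clusters, key=len))
```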
K-Means: Example
(Figures: four iterations of k-means on a 2-D point set; in each iteration, points are
assigned to the nearest centroid, then the centroids are recomputed, until the
clustering stabilizes.)
https://www.youtube.com/watch?v=-lu8Ku5Cw9E
https://www.youtube.com/watch?v=BVFG7fd1H30
Density-based Clustering (DBSCAN)
➢ Idea:
▪ Clusters are dense regions in d-dimensional space
▪ Clusters are separated by sparse regions
Hierarchical Clustering
➢ Goal: A hierarchy of clusters
➢ Construction (bottom up): join the two closest existing clusters
➢ Result: dendrogram
(Figure: example point set and the resulting dendrogram; the height at which two
clusters are joined reflects their distance.)
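The bottom-up construction can be sketched directly; single linkage (cluster distance = distance of the closest pair) is one common choice, assumed here since the slide does not name a linkage criterion, and the points are hypothetical:

```python
def agglomerate(points, n_clusters):
    """Bottom-up clustering: repeatedly join the two closest existing clusters."""
    clusters = [[p] for p in points]
    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # Single linkage: distance between clusters = distance of their closest pair.
    link = lambda c1, c2: min(dist(p, q) for p in c1 for q in c2)
    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # join the two closest clusters
    return clusters

pts = [(1, 1), (1.5, 1), (5, 5), (5.5, 5.2), (9, 1)]
print(agglomerate(pts, 3))
# → [[(1, 1), (1.5, 1)], [(5, 5), (5.5, 5.2)], [(9, 1)]]
```

Recording the order and distance of the merges, instead of stopping at a fixed cluster count, yields exactly the dendrogram from the slide.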
Example Dendrogram
http://upload.wikimedia.org/wikipedia/commons/3/39/Swiss_complete.png
8.4 Where does ML work?
Why now?
Why is there so much “snake oil” – absurd or even harmful applications of AI/data
science – in some fields?
https://governanceai.github.io/US-Public-Opinion-Report-Jan-2019/us_public_opinion_report_jan_2019.pdf
https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf
What determines where “AI” will work?
https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf
Can social outcomes be predicted?
The Fragile Families Study is an ongoing study of a large group of young people
throughout their lives. Since the late 1990s, a cohort of nearly 5,000 children from
the US has been tracked and repeatedly surveyed.
https://fragilefamilies.princeton.edu/about
https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf