
Week 8:

Machine Learning (in a Nutshell)


Winter term 2020 / 2021

Chair for Computational Social Science and Humanities


Markus Strohmaier, Florian Lemmerich, and Tobias Schumacher
Where are we?

Sources and Resources
➢ Sara Hajian: Algorithmic Bias (KDD Tutorial 2016)
➢ S. Bird et al.: Fairness-Aware Machine Learning in Practice, https://sites.google.com/view/fairness-tutorial
➢ S. Barocas and M. Hardt: Fair Machine Learning, https://vimeo.com/248490141
➢ S. Hajian et al.: Discrimination-Aware Machine Learning, https://www.youtube.com/watch?v=mJcWrfoGup8
➢ http://www.mlandthelaw.org/slides/hajian.pdf

Agenda
➢ Machine Learning in a Nutshell
▪ Overview
▪ Classification
▪ Clustering
▪ Where does ML work?

8.1 Overview

Machine Learning

“Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.” [Wikipedia]
➢ Subset of Artificial Intelligence
➢ Enhances AI systems by extracting patterns from raw data instead of relying only on hard-coded rules
➢ Application-oriented

➢ “Statistics with theory and assumptions dropped” (?)

Comparison of Machine Learning and Data Mining
➢ Strong overlap

➢ Data Mining: Emphasis on (interactive) exploration and discovery

➢ ML: Emphasis on prediction of new events

➢ Data Mining uses Machine Learning methods and vice versa

➢ “Data science” as an umbrella term?

ML/DM tasks
➢ Classification
➢ Regression (=Numeric Prediction)
➢ Clustering
➢ Association Rule Mining
➢ …

➢ On data types
▪ Tabular data
▪ Text
▪ Networks
▪ Images
▪ Sequences
▪ Time Series
▪ …
8.2 Classification (+ Regression)

Classification problems
➢ Given:
▪ A set of objects (data points)
▪ For each object, a set of features (attribute values) $(x_1, \ldots, x_d)$
▪ For each object, a class (label) $c \in C = \{c_1, \ldots, c_k\}$
⇒ “supervised” technique
➢ Goal:
Find the correct class for new objects (given only the features)

➢ Difference to clustering:
▪ Classification:
• Classes are known a priori
• There is training data with known labels
▪ Clustering: clusters (= classes) have to be identified from the data

Classification Process
1. Training phase: a classification algorithm learns a classifier (model) from training data.

   Training data:

   NAME  RANK            YEARS  TENURED
   Mike  Assistant Prof    3    no
   Mary  Assistant Prof    7    yes
   Bill  Professor         2    yes
   Jim   Associate Prof    7    yes
   Dave  Assistant Prof    6    no
   Anne  Associate Prof    3    no

   Learned classifier (model):
   if rank = ‘Professor’ or years > 6 then tenured = ‘yes’

2. Application phase (test phase): the classifier labels unknown data.
   (Jeff, Professor, 4) → Tenured? yes
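A minimal sketch of the two phases in Python with scikit-learn (illustrative only; the slides do not prescribe a library or algorithm, and the rank attribute is simplified here to a professor flag):

```python
# Sketch of training and application phase on the tenure data above.
from sklearn.tree import DecisionTreeClassifier  # one possible algorithm

# Training data: (is_professor, years) -> tenured
X_train = [[0, 3], [0, 7], [1, 2], [0, 7], [0, 6], [0, 3]]
y_train = ["no", "yes", "yes", "yes", "no", "no"]

# 1. Training phase: learn a classifier (model) from labeled data
clf = DecisionTreeClassifier().fit(X_train, y_train)

# 2. Application phase: predict the label of the unseen instance (Jeff, Professor, 4)
print(clf.predict([[1, 4]]))  # expected: ['yes']
```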

Evaluation scenario

Confusion matrix:

                          Ground Truth
                          positive              negative
Prediction  positive      True positive (TP)    False positive (FP)
            negative      False negative (FN)   True negative (TN)

➢ Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
➢ Precision: $\frac{TP}{TP + FP}$
➢ Recall: $\frac{TP}{TP + FN}$
➢ F1-Measure: $\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
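The four measures computed directly from predictions (a sketch with made-up labels; scikit-learn provides the same metrics as accuracy_score, precision_score, recall_score, and f1_score):

```python
# Count the confusion-matrix cells and derive the metrics from them.
y_true = ["pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "pos", "neg", "pos"]

tp = sum(t == "pos" and p == "pos" for t, p in zip(y_true, y_pred))
tn = sum(t == "neg" and p == "neg" for t, p in zip(y_true, y_pred))
fp = sum(t == "neg" and p == "pos" for t, p in zip(y_true, y_pred))
fn = sum(t == "pos" and p == "neg" for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # 0.6, 0.667, 0.667, 0.667
```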

Beyond Logistic Regression
➢ We can solve classification problems with logistic regression

➢ Why do we need anything else?

➢ Logistic regression makes certain assumptions:
▪ Independence of features ⇒ complex relationships between variables can’t be captured
▪ A linear classification boundary ⇒ other algorithms might fit better
▪ Only two outcomes (classes) are possible
➢ These assumptions do not hold for all datasets

Algorithms
➢ Naive Bayes
➢ Nearest Neighbor
➢ Decision Trees
➢ SVM
➢ Neural Networks
➢ Random Forest
➢ Boosting

Naive Bayes
➢ Probability-based classifier:
Assign object $o = (x_1, \ldots, x_d)$ to the class $c_j$ that maximizes $P(c_j \mid o)$:

$\operatorname{argmax}_{c_j \in C} P(c_j \mid o)$

➢ $P(c_j \mid o)$ is unknown! Use Bayes’ theorem:

$\operatorname{argmax}_{c_j \in C} P(c_j \mid o) = \operatorname{argmax}_{c_j \in C} \frac{P(o \mid c_j) \cdot P(c_j)}{P(o)} = \operatorname{argmax}_{c_j \in C} P(o \mid c_j) \cdot P(c_j)$

➢ We can estimate $P(c_j)$ from the training data by simple counting

➢ To compute $P(o \mid c_j)$, assume that all features are independent:

$P(o \mid c_j) = P(x_1 \mid c_j) \cdot \ldots \cdot P(x_d \mid c_j) = \prod_{i=1}^{d} P(x_i \mid c_j)$

➢ Overall decision rule for the Naïve Bayes classifier:

$\operatorname{argmax}_{c_j \in C} P(c_j) \cdot \prod_{i=1}^{d} P(x_i \mid c_j)$

Example Naive Bayes
Training data:

Color  Age    Class
Red    Old    Good
White  Old    Good
White  Young  Good
Red    Old    Bad
White  Young  Bad

New instance to classify: (Red, Young, ?)

Compute the values of the decision rule for both classes:

Class “Good”: P(Good) ⋅ P(Red | Good) ⋅ P(Young | Good) = 3/5 ⋅ 1/3 ⋅ 1/3 = 1/15
Class “Bad”:  P(Bad) ⋅ P(Red | Bad) ⋅ P(Young | Bad) = 2/5 ⋅ 1/2 ⋅ 1/2 = 1/10

⇒ Higher score for class “Bad”
⇒ The new instance is classified as “Bad”
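The slide’s computation reproduced as a small Python sketch (illustrative only; a practical implementation would add smoothing for feature values unseen in a class):

```python
# Naive Bayes decision rule on the toy data from the slide.
data = [("Red", "Old", "Good"), ("White", "Old", "Good"),
        ("White", "Young", "Good"), ("Red", "Old", "Bad"),
        ("White", "Young", "Bad")]

def score(color, age, cls):
    rows = [r for r in data if r[2] == cls]
    prior = len(rows) / len(data)                            # P(class)
    p_color = sum(r[0] == color for r in rows) / len(rows)   # P(color | class)
    p_age = sum(r[1] == age for r in rows) / len(rows)       # P(age | class)
    return prior * p_color * p_age

scores = {c: score("Red", "Young", c) for c in ("Good", "Bad")}
print(scores)                        # {'Good': 0.0667, 'Bad': 0.1}
print(max(scores, key=scores.get))   # 'Bad'
```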

Nearest Neighbor Classification
➢ Define distances between data instances
➢ Specify a value k
➢ To classify a new data instance:
▪ Search the k most similar instances (k-nearest-neighbors)
▪ Select the majority class among those neighbors (“voting”)
▪ Option: weight neighbors by distance (close by: high weight; far away: low weight)

[Figure: k-nearest-neighbor example with k = 5 and k = 10]
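A sketch of k-nearest-neighbor classification with scikit-learn (data and k are made up; weights="distance" corresponds to the distance-weighting option above):

```python
# k-NN: classify by majority vote among the k closest training points.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ["a", "a", "a", "b", "b", "b"]

knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)
print(knn.predict([[2, 2], [7, 9]]))  # ['a' 'b']
```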

Decision Trees
➢ Each inner node is an attribute
➢ Each edge is a value (range) of its source node
➢ Each leaf node is a class label
➢ Learning: greedy, top to bottom
➢ Prediction: follow the tree from top to bottom based on the instance’s attribute values

Training data (as before):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof    3    no
Mary  Assistant Prof    7    yes
Bill  Professor         2    yes
Jim   Associate Prof    7    yes
Dave  Assistant Prof    6    no
Anne  Associate Prof    3    no

Resulting tree:

                 Rank?
       Professor /    \ Not professor
             yes      Years?
                    <7 /   \ ≥7
                     no     yes
Decision Tree Induction
➢ Top down, recursive
➢ At each node: try to separate the classes as well as possible
➢ Formally: use information gain, based on entropy:

$entropy(T) = -\sum_{i=1}^{k} p_i \cdot \log p_i$

$information\_gain(T, A) = entropy(T) - \sum_{i=1}^{m} \frac{|T_i|}{|T|} \cdot entropy(T_i)$

➢ Recurse until no data remains or all instances belong to one class

➢ Many variations:
▪ Only binary splits
▪ Alternatives to information gain
▪ Pruning: limit the size of the decision tree to avoid overfitting
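The two formulas as a short Python sketch (helper names entropy and information_gain are my own; T is represented as a list of class labels, and a split as a partition of that list):

```python
# Entropy of a label list and the information gain of a candidate split.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, splits):
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in splits)

# A perfect split of a balanced binary label set yields gain = entropy = 1.0
labels = ["yes"] * 3 + ["no"] * 3
print(information_gain(labels, [["yes"] * 3, ["no"] * 3]))  # 1.0
```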

Support Vector Machines
➢ Each data point is a point in d-dimensional space
➢ Find the separating hyperplane that maximizes the distance (margin) to the instances
➢ The base form requires linearly separable data
➢ Extensions handle non-separable data as well:
▪ Kernel methods: transformation into higher-dimensional spaces
▪ Slack variables: add a penalty term if an instance is misclassified
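An illustrative scikit-learn sketch (kernel choice and penalty parameter C are example values, not from the slides); with an RBF kernel, even the XOR data below, which is not linearly separable, can be fit:

```python
# SVM with a kernel (implicit higher-dimensional mapping) and slack penalty C.
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR: not linearly separable in the original space

clf = SVC(kernel="rbf", C=10.0).fit(X, y)
print(clf.predict(X))  # should reproduce [0 1 1 0]
```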

Idea: Kernels and SVM

[Figure: one-dimensional base data x becomes linearly separable after mapping it into a higher-dimensional space, e.g. via x ↦ (x, x²)]
Neural Networks (“Deep Learning”)
➢ Connect many perceptrons with each other
➢ Based on Backpropagation learning (gradient descent)

[Figure: feed-forward network with inputs x1, x2, x3, two hidden layers, and an output y]

➢ Huge variety of architectures


➢ Requires large amounts of training data
➢ “Black box”
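A small feed-forward network in scikit-learn as a sketch (architecture, hyperparameters, and data are illustrative; deep learning in practice uses dedicated frameworks and far more data):

```python
# A multilayer perceptron trained with backpropagation (gradient descent).
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.score(X, y))  # training accuracy, typically close to 1.0
```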

No-Free-Lunch-theorem (Machine Learning edition)
➢ Consider arbitrary datasets:
▪ No assumptions about the data
▪ Labels assigned randomly

➢ Then, averaged over all such datasets, all algorithms are equally good!

➔ In general, there is no “best algorithm”

➢ Algorithms make more or less reasonable assumptions about the data, e.g.:
▪ Linearity of features
▪ “Similar” objects likely have a similar class, …

Idea of Ensemble Learning
➢ Different classifiers make different errors
➢ Try to combine classifiers to improve classification accuracy!
➢ Definition Ensemble Learning:
“An ensemble of classifiers is a set of classifiers whose individual decisions are combined
in some way to classify new examples” (Dietterich, 2000)

Base learners → single predictions → combining mechanism → final prediction

[Figure: classifiers C1, C2, C3, …, Cn each output a prediction; a combining mechanism (e.g., majority vote) produces the final prediction, here “Class = b”]

Example: Ensemble Learning
➢ Can this work?
➢ Example with 3 classifiers on a test set i1, …, i5:

                 i1   i2   i3   i4   i5   Overall accuracy
True class       +    +    +    -    -
Classifier 1     +    +    +    +    -    0.8
Classifier 2     +    +    -    -    -    0.8
Classifier 3     -    -    +    -    -    0.6
Majority voting  +    +    +    -    -    1.0

➢ The ensemble classifies better than each individual classifier!
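The majority vote from the table in a few lines of Python (a sketch; scikit-learn packages the same idea as VotingClassifier):

```python
# Majority voting over the three classifiers' predictions for i1..i5.
predictions = {
    "C1": ["+", "+", "+", "+", "-"],
    "C2": ["+", "+", "-", "-", "-"],
    "C3": ["-", "-", "+", "-", "-"],
}

majority = [max(set(votes), key=votes.count)
            for votes in zip(*predictions.values())]
print(majority)  # ['+', '+', '+', '-', '-'] — matches the true classes
```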

Random Forests
➢ Ensemble method
➢ Build many slightly different trees
▪ Subset of instances for each tree (bootstrap sampling)
▪ Choose from a subset of attributes at each node
➢ Collect the predictions of all trees
➢ The majority decides
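An illustrative scikit-learn sketch (dataset and parameters are example choices): n_estimators is the number of trees, max_features limits the attributes considered at each split, and bootstrap sampling of instances is the default:

```python
# Random forest: an ensemble of slightly different decision trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.predict(X[:3]))  # each prediction is a majority vote over 100 trees
```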

Boosting
➢ Train a simple classifier
➢ Identify misclassified instances of this classifier
➢ Increase the weight of these instances
➢ Iterate

➢ Overall classifier: Combination of all previously trained models
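A hedged example with AdaBoost in scikit-learn (parameters are illustrative); by default it boosts simple decision stumps, reweighting misclassified instances between rounds:

```python
# Boosting: iteratively train weak learners on reweighted instances.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(ada.score(X, y))  # training accuracy of the combined model
```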

Task Variation: Multiclass Classification
➢ More than two possible classes (outcomes)
➢ What to do?

➢ Some algorithms extend to this naturally


▪ Decision Trees
▪ Nearest Neighbor
▪ Naïve Bayes

➢ Others?
▪ One-vs-all: Train a classifier for each class – is it this class or something else?
▪ One-vs-one: Train a classifier for each pair of classes – which one is more likely?
Then, do a majority voting.
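Both strategies as scikit-learn wrappers around a binary base learner (a sketch; logistic regression serves only as an example base classifier, and the iris data stands in for any 3-class problem):

```python
# One-vs-rest trains one classifier per class; one-vs-one trains one per pair.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # 3 classifiers
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # 3 pairwise classifiers
print(ovr.predict(X[:2]), ovo.predict(X[:2]))
```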

Other Variations
➢ Multilabel classification

➢ Cost-sensitive classification

➢ Unbiased classification

Regression
➢ We learned most of the basics earlier in this course

➢ Most classification approaches have a regression counterpart

➢ Examples:
▪ Regression trees
▪ Support Vector Regression (SVR)
▪ Random forest regressors
▪ Boosted trees
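A sketch of three regression counterparts in scikit-learn (models and data are illustrative); the interface mirrors classification, but predict returns numeric values:

```python
# Regression pendants of tree, SVM, and random forest classifiers.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
for model in (DecisionTreeRegressor(), SVR(), RandomForestRegressor()):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X[:1]))  # a numeric prediction
```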

8.3 Clustering
Clustering: Goal and Definition
➢ Identification of a finite set of clusters in the data
(= categories, “classes”, groups)
➢ Objects in the same cluster should be as similar as possible.
➢ Objects in different clusters should be as dissimilar as possible
➢ “Unsupervised learning” ⇒ no classes/labels given in advance

Example domains: gene sequences, books/documents, language dialects, websites, political opinions

What does “Similar” Mean?

Similarity?

Distance functions & similarity functions
➢ Formalization of distance and similarity:
▪ Define functions dist(o1, o2) or sim(o1, o2) (distance is more common)
▪ Small distance ⇒ similar objects
➢ A “good” distance measure is crucial, but depends on the application
➢ Common distance measures:

▪ Euclidean distance: $dist_{Eucl}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$

▪ Manhattan distance: $dist_{Man}(x, y) = \sum_{i=1}^{d} |x_i - y_i|$

▪ Maximum metric: $dist_{Max}(x, y) = \max\{|x_i - y_i| \mid 1 \le i \le d\}$

▪ Hamming distance: number of differing attribute values
  E.g., x = (young, teacher, yes); y = (old, student, yes) ⇒ $dist_{Hamming}(x, y) = 2$
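The four measures in NumPy/Python (a sketch; the two points are chosen so the results match the example slide below):

```python
# Euclidean, Manhattan, maximum, and Hamming distance for two vectors.
import numpy as np

x, y = np.array([1.0, 4.0]), np.array([3.0, 1.0])
print(np.sqrt(np.sum((x - y) ** 2)))  # Euclidean: sqrt(13) ≈ 3.606
print(np.sum(np.abs(x - y)))          # Manhattan: 5.0
print(np.max(np.abs(x - y)))          # Maximum metric: 3.0

a, b = ("young", "teacher", "yes"), ("old", "student", "yes")
print(sum(ai != bi for ai, bi in zip(a, b)))  # Hamming: 2
```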

Example: Distance metrics
➢ $dist_{Eucl}(A, B) = \sqrt{13}$

➢ $dist_{Man}(A, B) = 5$

➢ $dist_{Max}(A, B) = 3$

[Figure: points A and B on a grid, differing by 2 in one coordinate and by 3 in the other]

K-means
1. Choose the number of clusters k
2. Pick k initial centroids (cluster centers) randomly
3. Assign each data point to the nearest centroid
4. Compute the new centroids as the mean (vector!) of all points in a cluster
5. If any assignment changed, go to step 3
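A minimal k-means sketch following steps 1–5 above (illustrative; it omits edge cases such as empty clusters, and in practice sklearn.cluster.KMeans with several random restarts would be used):

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    centroids = X[rng.choice(len(X), k, replace=False)]  # step 2: random init
    while True:
        # step 3: assign each point to the nearest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        # step 4: recompute centroids as the mean of each cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # step 5: stop once nothing changes
            return labels, new
        centroids = new

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
print(kmeans(X, k=2)[0])  # e.g. [0 0 0 1 1 1]
```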

K-Means: Example

[Figure: k-means iterations on a 2D dataset, alternating between assigning data points to clusters and recomputing centroids until convergence]

Videos:
https://www.youtube.com/watch?v=-lu8Ku5Cw9E
https://www.youtube.com/watch?v=BVFG7fd1H30
Density-based Clustering (DBSCAN)
➢ Idea:
▪ Clusters are dense regions in d-dimensional space
▪ Clusters are separated by sparse regions

➢ Requirements for a cluster:
▪ Define neighborhoods (e.g., all points within a given radius)
▪ Each data point in the cluster has at least n neighbors
▪ Data points in one cluster are connected via “dense points”
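An illustrative DBSCAN sketch with scikit-learn (eps is the neighborhood radius, min_samples the required number of neighbors; values and data are made up):

```python
# DBSCAN finds dense clusters and marks sparse points as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [50, 50]])
db = DBSCAN(eps=2.0, min_samples=3).fit(X)
print(db.labels_)  # two dense clusters; -1 marks the outlier (50, 50)
```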

Hierarchical Clustering
➢ Goal: A hierarchy of clusters
➢ Construction (bottom up): join the two closest existing clusters
➢ Result: dendrogram

[Figure: example data points in the plane and the corresponding dendrogram; merge heights reflect the distance at which clusters are joined]
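A sketch of bottom-up (agglomerative) clustering with SciPy (data is made up): linkage repeatedly merges the two closest clusters, and dendrogram draws the resulting tree:

```python
# Hierarchical clustering: build the full merge history, then plot it.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
Z = linkage(X, method="single")  # each row: [cluster1, cluster2, distance, size]
dendrogram(Z)
plt.show()
```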

Example Dendrogram

http://upload.wikimedia.org/wikipedia/commons/3/39/Swiss_complete.png
8.4 Where does ML work?

Why now?

Why is there so much “snake oil”, i.e., absurd or even harmful applications of AI/data science?

1. Some technologies have made remarkable progress: content identification (reverse image search), facial recognition, speech-to-text, medical diagnosis, translation.

2. Companies advertise algorithms/AI as a solution to all problems:
   https://governanceai.github.io/US-Public-Opinion-Report-Jan-2019/us_public_opinion_report_jan_2019.pdf

https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf

What determines where “AI” will work?

https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf

Can social outcomes be predicted?

The Fragile Families Study is an ongoing study of a large group of young people throughout their lives. Since the late 1990s, a cohort of nearly 5,000 children from the US has been tracked and repeatedly surveyed.

The aim is to understand how family structure relates to the children’s outcomes.

A research team led by Matt Salganik issued a challenge to researchers: can you predict social outcomes?

https://fragilefamilies.princeton.edu/about

https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf

