Frontiers of

Computational Journalism
Columbia Journalism School
Week 7: Algorithmic Accountability and Discrimination

October 27, 2017
This class
• Algorithmic accountability stories
• Analyzing bias in data
• FATML
• Unpacking ProPublica’s “Machine Bias”
• Multistage discrimination models
Algorithmic Accountability Stories
Algorithms in our lives
• Personalized search
• Political microtargeting
• Credit score / loans / insurance
• Predictive policing
• Price discrimination
• Algorithmic trading / markets
• Terrorist threat prediction
• Hiring models
From myfico.com
Predicted crime times and locations in the PredPol system.
Websites Vary Prices, Deals Based on Users' Information
Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012
How Uber surge pricing really works, Nick Diakopoulos
Message Machine
Jeff Larson, Al Shaw, ProPublica, 2012
Analyzing Bias in Data
Title VII of Civil Rights Act, 1964
• It shall be an unlawful employment practice for an employer -

• (1) to fail or refuse to hire or to discharge any individual, or
otherwise to discriminate against any individual with respect to his
compensation, terms, conditions, or privileges of employment,
because of such individual’s race, color, religion, sex, or national
origin; or

• (2) to limit, segregate, or classify his employees or applicants for
employment in any way which would deprive or tend to deprive
any individual of employment opportunities or otherwise
adversely affect his status as an employee, because of such
individual’s race, color, religion, sex, or national origin.
Investors prefer entrepreneurial ventures pitched by attractive men,
Brooks et al., 2014
Women in Academic Science: A Changing Landscape,
Ceci et al.
Swiss judges: a natural experiment

24 judges of the Swiss Federal Administrative Court are randomly assigned to cases, yet they
rule at different rates on migrant deportation cases. Here are their deportation rates,
broken down by party.

Barnaby Skinner and Simone Rau, Tages-Anzeiger.
https://github.com/barjacks/swiss-asylum-judges
Florida sentencing analysis adjusted for “points”

Bias on the Bench, Michael Braga, Herald-Tribune
Containing 1.4 million entries, the DOC database notes the exact number of points assigned to
defendants convicted of felonies. The points are based on the nature and severity of the crime
committed, as well as other factors such as past criminal history, use of a weapon and whether
anyone got hurt. The more points a defendant gets, the longer the minimum sentence required by
law.

Florida legislators created the point system to ensure defendants committing the same crime are
treated equally by judges. But that is not what happens.

The Herald-Tribune established this by grouping defendants who committed the same crimes
according to the points they scored at sentencing. Anyone who scored from 30 to 30.9 would go
into one group, while anyone who scored from 31 to 31.9 would go in another, and so on.

We then evaluated how judges sentenced black and white defendants within each point range,
assigning a weighted average based on the sentencing gap.

If a judge wound up with a weighted average of 45 percent, it meant that judge sentenced black
defendants to 45 percent more time behind bars than white defendants.

Bias on the Bench: How We Did It, Michael Braga, Herald-Tribune
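The grouping and weighting described above can be sketched in a few lines. The record fields (points, race, sentence months) and the exact case-count weighting are assumptions for illustration; the Herald-Tribune's actual code is not reproduced on this slide.

```python
from collections import defaultdict

# Sketch of the point-range grouping for one judge and one crime.
# Records are (points, race, sentence_months); these fields and the
# case-count weighting are illustrative assumptions.
def weighted_sentencing_gap(records):
    # Bucket defendants by whole point range: 30-30.9 -> 30, 31-31.9 -> 31.
    buckets = defaultdict(lambda: defaultdict(list))
    for points, race, months in records:
        buckets[int(points)][race].append(months)

    weighted_sum, total_weight = 0.0, 0.0
    for groups in buckets.values():
        if groups["black"] and groups["white"]:
            black = sum(groups["black"]) / len(groups["black"])
            white = sum(groups["white"]) / len(groups["white"])
            if white > 0:
                weight = len(groups["black"]) + len(groups["white"])
                weighted_sum += weight * (black - white) / white
                total_weight += weight
    # 0.45 would mean black defendants got 45 percent more time on average.
    return weighted_sum / total_weight

records = [
    (30.2, "black", 24), (30.7, "white", 18),   # bucket 30
    (31.1, "black", 30), (31.5, "white", 24),   # bucket 31
]
print(round(weighted_sentencing_gap(records), 3))
```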
Unadjusted disciplinary rates

The Scourge of Racial Bias in New York State’s Prisons, NY Times
Limited data for adjustment
In most prisons, blacks and Latinos were disciplined at higher rates than whites — in some cases
twice as often, the analysis found. They were also sent to solitary confinement more frequently and
for longer durations. At Clinton, a prison near the Canadian border where only one of the 998
guards is African-American, black inmates were nearly four times as likely to be sent to isolation as
whites, and they were held there for an average of 125 days, compared with 90 days for whites.

A greater share of black inmates are in prison for violent offenses, and minority inmates are
disproportionately younger, factors that could explain why an inmate would be more likely to
break prison rules, state officials said. But even after accounting for these elements, the disparities
in discipline persisted, The Times found.

The disparities were often greatest for infractions that gave discretion to officers, like disobeying a
direct order. In these cases, the officer has a high degree of latitude to determine whether a rule is
broken and does not need to produce physical evidence. The disparities were often smaller,
according to the Times analysis, for violations that required physical evidence, like possession of
contraband.
The Scourge of Racial Bias in New York State’s Prisons, NY Times
Comparing more subjective offenses

The Scourge of Racial Bias in New York State’s Prisons, NY Times
Simpson’s paradox

Sex Bias in Graduate Admissions:
Data from Berkeley
Bickel, Hammel and O'Connell,
1975
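Simpson's paradox can be reproduced with a few lines of arithmetic. The numbers below are invented to mimic the Berkeley pattern; they are not Bickel et al.'s actual data.

```python
# Invented (admitted, applied) counts per gender per department,
# mimicking the Berkeley admissions pattern.
depts = {
    "easy": {"men": (500, 800), "women": (70, 100)},
    "hard": {"men": (20, 200),  "women": (100, 800)},
}

def rate(admitted, applied):
    return admitted / applied

# Within every department, women are admitted at the higher rate...
for name, d in depts.items():
    print(name, "men", rate(*d["men"]), "women", rate(*d["women"]))

# ...yet in aggregate men are admitted far more often, because women
# mostly applied to the harder department.
men = [sum(d["men"][i] for d in depts.values()) for i in (0, 1)]
women = [sum(d["women"][i] for d in depts.values()) for i in (0, 1)]
print("overall", "men", rate(*men), "women", round(rate(*women), 3))
```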
Fairness and Transparency in
Machine Learning (FATML)
Learning from Facebook likes

From Kosinski et al., Private traits and attributes are predictable from digital records of
human behavior
Predicting gender from Twitter

Zamal et al., Homophily and Latent Attribute Inference: Inferring Latent
Attributes of Twitter Users from Neighbors
Predicting race from Twitter

Pennacchiotti and Popescu, A Machine Learning Approach to Twitter User Classification
Even if two groups of the population admit simple classifiers, the whole population may
not (from How Big Data is Unfair)
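The point can be seen in a toy one-dimensional example (hypothetical data): each group alone is perfectly classified by a single threshold, but the pooled population is not, because the two groups' rules point in opposite directions.

```python
# Each group's labels follow a simple threshold rule, but the rules
# are opposed, so no single threshold classifier fits the pooled data.
group_a = [(-1, 0), (1, 1)]   # (feature, label): label = 1 iff x > 0
group_b = [(-1, 1), (1, 0)]   # label = 1 iff x < 0

def best_threshold_accuracy(data):
    xs = sorted({x for x, _ in data})
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = 0.0
    for t in thresholds:
        for predict_above in (True, False):  # predict 1 on x > t, or on x <= t
            correct = sum(((x > t) == predict_above) == bool(y)
                          for x, y in data)
            best = max(best, correct / len(data))
    return best

print(best_threshold_accuracy(group_a))            # 1.0
print(best_threshold_accuracy(group_b))            # 1.0
print(best_threshold_accuracy(group_a + group_b))  # 0.5: no better than chance
```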
Unpacking ProPublica’s
“Machine Bias”
Should Prison Sentences Be Based On Crimes That Haven’t Been
Committed Yet?, FiveThirtyEight
How We Analyzed the COMPAS Recidivism Algorithm, ProPublica
Stephanie Wykstra, personal communication
ProPublica argument

False positive rate
P(high risk | black, no arrest) = C/(C+A) = 0.45
P(high risk | white, no arrest) = G/(G+E) = 0.23

False negative rate
P(low risk | black, arrested) = B/(B+D) = 0.28
P(low risk | white, arrested) = F/(F+H) = 0.48

Northpointe response

Positive predictive value
P(arrest | black, high risk) = D/(C+D) = 0.63
P(arrest | white, high risk) = H/(G+H) = 0.59
P(outcome | score) is fair

Fair prediction with disparate impact: A study of bias in recidivism prediction
instruments, Chouldechova
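Both sets of numbers can come from the very same confusion matrices. The counts below reproduce the rates quoted on these slides and are consistent with ProPublica's published analysis, but treat them as illustrative.

```python
# Per-group confusion-matrix counts (illustrative; chosen to reproduce
# the rates quoted on the slides above).
# tp = high risk & arrested,  fp = high risk & no arrest,
# fn = low risk & arrested,   tn = low risk & no arrest.
groups = {
    "black": {"tp": 1369, "fp": 805, "fn": 532, "tn": 990},
    "white": {"tp": 505,  "fp": 349, "fn": 461, "tn": 1139},
}

for name, c in groups.items():
    fpr = c["fp"] / (c["fp"] + c["tn"])  # ProPublica's error-rate argument
    fnr = c["fn"] / (c["fn"] + c["tp"])
    ppv = c["tp"] / (c["tp"] + c["fp"])  # Northpointe's calibration argument
    print(f"{name}: FPR={fpr:.2f} FNR={fnr:.2f} PPV={ppv:.2f}")
# black: FPR=0.45 FNR=0.28 PPV=0.63
# white: FPR=0.23 FNR=0.48 PPV=0.59
```

Both parties' figures are arithmetically correct; they simply condition on different things (the true outcome versus the score).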
Or, as ProPublica put it

How We Analyzed the COMPAS Recidivism Algorithm, ProPublica
The Problem

Fair prediction with disparate impact: A study of bias in recidivism prediction
instruments, Chouldechova
Impossibility theorem
When the base rates differ by protected group and when there is not separation, one
cannot have both conditional use accuracy equality and equality in the false negative
and false positive rates.

The goal of complete race or gender neutrality is unachievable.

Altering a risk algorithm to improve matters can lead to difficult stakeholder choices. If it
is essential to have conditional use accuracy equality, the algorithm will produce
different false positive and false negative rates across the protected group categories.
Conversely, if it is essential to have the same rates of false positives and false negatives
across protected group categories, the algorithm cannot produce conditional use
accuracy equality. Stakeholders will have to settle for an increase in one for a decrease
in the other.

Fairness in Criminal Justice Risk Assessments: The State of the Art, Berk et al.
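The incompatibility follows from an identity linking the three error measures through a group's base rate p = P(arrest); this is a restatement of Chouldechova's argument in her notation, not a quotation from the paper:

```latex
\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
```

If the instrument is calibrated, so that PPV is equal across groups, then any difference in base rates p forces the false positive rates (or the false negative rates) to differ between groups.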
Multi-stage discrimination models
All The Stops, Thomas Rhiel, Bklynr.com, 2012
In search of fairness
Benchmark Test

Is group A searched more often than group B?

Problem: Assumes A and B have identical distributions of behavior (or of whatever
signals are used to decide on searching).

Outcome Test

Do searches of group A result in a “hit” less often than searches of group B?

Problem: Infra-marginality
Simoiu et al.
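On stop-level data, both tests reduce to simple rates. A sketch with hypothetical records of the form (race, searched, hit):

```python
# Toy stop records (hypothetical): (race, searched, hit), where "hit"
# means contraband was found, and only applies if a search happened.
stops = [
    ("A", True, True), ("A", True, False), ("A", True, False), ("A", False, False),
    ("B", True, True), ("B", False, False), ("B", False, False), ("B", False, False),
]

def benchmark_and_outcome(race):
    group = [s for s in stops if s[0] == race]
    searches = [s for s in group if s[1]]
    search_rate = len(searches) / len(group)                # benchmark test
    hit_rate = sum(s[2] for s in searches) / len(searches)  # outcome test
    return search_rate, hit_rate

# Group A is searched more often (benchmark test) AND its searches
# succeed less often (outcome test): the pattern both tests read as
# evidence of bias against A.
print("A:", benchmark_and_outcome("A"))  # (0.75, 1/3)
print("B:", benchmark_and_outcome("B"))  # (0.25, 1.0)
```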
Infra-marginality
Outcome tests, however, are imperfect barometers of bias. To see this,
suppose that there are two, easily distinguishable types of white drivers: those
who have a 1% chance of carrying contraband, and those who have a 75%
chance. Similarly, assume that black drivers have either a 1% or 50% chance
of carrying contraband. If officers, in a race-neutral manner, search
individuals who are at least 10% likely to be carrying contraband, then
searches of whites will be successful 75% of the time whereas searches of
blacks will be successful only 50% of the time. This simple example illustrates a
subtle failure of outcome tests known as the problem of infra-marginality.

The Problem of Infra-marginality in Outcome Tests for Discrimination, Simoiu et al.
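The arithmetic of the quoted example, checked in code:

```python
# Each race has two easily distinguished driver types with known
# probabilities of carrying contraband; officers apply the same
# race-neutral search threshold of 10%.
types = {"white": [0.01, 0.75], "black": [0.01, 0.50]}

def search_hit_rate(probs, threshold=0.10):
    searched = [p for p in probs if p >= threshold]  # only types above threshold
    return sum(searched) / len(searched)             # hit rate among searches

for race, probs in types.items():
    print(race, "search hit rate:", search_hit_rate(probs))
# White searches succeed 75% of the time and black searches 50% of the
# time, so a naive outcome test flags bias against black drivers even
# though the 10% threshold is identical for both groups.
```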
Simoiu et al.
Threshold for searching

Threshold model (Simoiu et al.): for each race r and department d, the model infers a
search threshold together with the contraband probabilities p(contraband | r) and
p(contraband | d), fitting them to the observed counts of searches and hits: how many
drivers of this race this department has searched, and how often those searches found
contraband.