
Racial Bias in Computer Vision via Convolutional Neural Networks
This paper was downloaded from TechRxiv (https://www.techrxiv.org).

LICENSE

CC BY 4.0

SUBMISSION DATE / POSTED DATE

15-07-2022 / 19-07-2022

CITATION

Srivastava, Kunal; Lim, Sean; Chan, Connor (2022): Racial Bias in Computer Vision via Convolutional Neural
Networks. TechRxiv. Preprint. https://doi.org/10.36227/techrxiv.20323818.v1

DOI

10.36227/techrxiv.20323818.v1
Racial Bias in Computer Vision via
Convolutional Neural Networks
Kunal Srivastava, Connor Chan, Sean Lim
University of Washington
kunalsr@uw.edu, cachan25@uw.edu, seanxlim@uw.edu

July 14, 2022


Abstract

In the past fifty years, computer vision has evolved from recording and displaying images to
computationally intensive tasks like flagging criminals in airports and unlocking one's mobile
device. Technology utilizing computer vision has been improving rapidly since its creation,
allowing for implementation into everyday tasks for many users. This advancement has created an
ease-of-life experience for many people, but there is an under-discussed downside: potential bias
in the information that is produced. The widespread acceptance of such technology (which some
argue is premature) has led to modern issues in computer vision, specifically for members of the
non-white community. Our project looks into the ways that mass adoption of computer vision with
low regard for the data behind it may leave specific groups at a clear disadvantage. We used the
widely popular UTKFace dataset to implement a Convolutional Neural Network (CNN) that predicts
gender from an image of a human face. Over the course of the experiment, we examined various
statistics of the dataset to better understand what could drive the results our model returned. By
further investigating current policy on data collection and privacy, we hope to suggest practices
that will lead to a safer, more equal future. Through the analysis of images from popular datasets,
the optimization of machine learning models, and a review of changing industry standards, we aim
to find and assess differences in the accuracy of gender-prediction models across races.

I. Introduction

Technological bias becomes a more pressing issue as advancements in machine learning are made.
Today, as technology advances, so does the unjust representation of minority and ethnic groups.
Technology is made and directed toward the use of a predominantly white population. The central
issue of misrepresentation in technology stems from its creators working for major technology
companies, the vast majority of which are staffed by a white majority.

A review of Microsoft, Facebook, and Google's most recent diversity reports shows a strong lack of
representation, with 5.7% of Microsoft's workforce identifying as black [6], 5.3% of Google's
workforce identifying as black [1], and 3.9% of Facebook's workforce identifying as black [4]. One
consequence of limited racial diversity in the industry is computer vision's inability to identify
different images equally well, especially for people who identify as black. The central reasoning
for this lack of representation revolves around limited opportunities for people of color compared
to the majority population.

To eliminate bias in the technological workforce, greater representation of underrepresented
groups must be added to create a more accepting experience for all users. The teams who engineer
data that is used at scale hold control over its use and function. They are directly at fault for
bias and unjust treatment of users when there is unequal focus on data engineering [9].

In 2015, David Oppenheimer, a University of California, Berkeley law professor, stated that even
without an intent to discriminate against ethnic groups, if designers or engineers "reproduce
social preferences even in a completely rational way, they also reproduce those forms of
discrimination" [5]. The creation of artificial intelligence is similar to the creation of humans.
Both grow and develop as a result of their surroundings. The environment in which artificial
intelligence is created shapes the machine's capabilities.

When technology disproportionately affects one population compared to another, many people,
including Anna Lauren Hoffmann, an assistant professor at The Information School at the University
of Washington, describe data science as "data violence" [8]. Most notably, when referencing
machine learning models that utilize popular datasets, it is essential to ask questions such as:
Who is the creator? Is anyone benefiting or being harmed? Who decides? With these questions in
mind, data bias can be minimized while benefiting a wide array of user experiences. It is
important to understand that publicly reputable data is not necessarily truly reputable data.

Figure 1: UTKFace Dataset Sample

Our findings use a dataset entitled UTKFace, a large-scale dataset consisting of images of humans
ranging from age 0 to age 116. There are 23,708 images in this dataset, making it suitable for
complex image processing. Images were collected from across the internet, from popular image
databases like Google Images, Pinterest, and more. The images cover large variations in pose,
facial expression, illumination, occlusion, resolution, and more. Each image is labeled in its
filename by age, gender, and ethnicity.

The classification labels for each feature are as follows: Age is an integer from 0 to 116,
indicating the age. Gender is either 0 (male) or 1 (female). Race is an integer from 0 to 4,
denoting White, Black, Asian, Indian, and Others (such as Hispanic, Latino, or Middle Eastern),
respectively.
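To make this labeling scheme concrete, the short sketch below parses the filename-encoded labels.
It assumes the standard UTKFace naming convention of [age]_[gender]_[race]_[date&time].jpg and a
hypothetical local image directory; it is an illustration, not the authors' own code.

    import os
    from collections import Counter

    RACE_NAMES = ["White", "Black", "Asian", "Indian", "Others"]

    def parse_utkface_labels(filename):
        # UTKFace filenames encode labels as [age]_[gender]_[race]_[date&time].jpg
        age, gender, race = (int(part) for part in filename.split("_")[:3])
        return age, gender, race

    def race_counts(image_dir):
        # Tally how many images carry each race label (Figure 2-style raw counts).
        counts = Counter()
        for name in os.listdir(image_dir):
            if name.endswith(".jpg"):
                _, _, race = parse_utkface_labels(name)
                counts[RACE_NAMES[race]] += 1
        return counts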

II. Methods

Our project seeks to assess a neural network's accuracy for detecting gender compared to its
accuracy for either all white-labeled races or all non-white-labeled races. Convolutional Neural
Networks (CNNs) have demonstrated considerable potential in fields of computer vision, including
facial recognition, so we intend to draw comparisons between the performances of the model by race.

A Convolutional Neural Network is a deep learning algorithm that deals with image processing. It
takes various image inputs and assigns weights to them in order to distinguish between images and
patterns in similar images. It relies on the backpropagation of error to refine its weights and
accuracy, and it has seen a rise in popularity in recent years because it is more powerful than
traditional classification and regression models, such as Decision Trees and Support Vector
Machines, while requiring less image pre-processing.

i. Data Preprocessing

Each image in UTKFace is dependent on several other features in the dataset, such as the age and
race features. A person's age and race have an effect on an image: that is why the human eye is
able to distinguish a white child from an Asian man, for example. Upon initial exploration and
analysis of the dataset, we found no duplicates to remove. However, when analyzing the
distributions of the categorical variables in the dataset, one point of interest arose:
unsurprisingly, there were significantly more images of whites in the dataset than there were of
non-whites, as shown in Figure 2. This could be a possible explanation for bias in datasets, as
explained above.

Figure 2: Raw Race Counts

In addition, when looking at the age distribution, we noticed that there was an abundance of
children (specifically aged 0-4). This gave the left tail of the distribution a rather irregular
jump. In addition, the right tail seemed relatively long. We proposed a change (Figure 3) of
removing images with age 80+ and randomly keeping one third of images with ages 0-4 (inclusive).
This makes the distribution appear much more normal, which eases image processing tasks like
distinguishing gender.

Figure 3: Raw and Adjusted Age Counts

Before feeding the dataset to the neural network architecture itself, we needed to preprocess the
images and to split the dataset for training and testing. Based on the Pareto principle, we
followed the 80-20 rule when splitting between training and testing sets. Appropriate transforms
were then applied to the images. These included setting the dimensions of the input and
normalization. Neural networks process inputs using small weights, and inputs with large integer
values can disrupt or slow down the learning process, so normalizing each image was necessary,
such that each pixel value falls between 0 and 1. This allows for maximum efficiency when the
computer is processing large amounts of data. Then, the model is trained against the training set
and checked against the testing (validation) set. These processes yield the training and testing
losses and accuracies.
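A minimal preprocessing sketch following the steps above is shown here, assuming a Python pipeline
with Pillow, NumPy, and scikit-learn (the paper does not specify its tooling); the 64x64 input size
and the directory path are illustrative placeholders.

    import os
    import random
    import numpy as np
    from PIL import Image
    from sklearn.model_selection import train_test_split

    def load_utkface(image_dir, size=(64, 64), seed=42):
        # Drop ages 80+, keep roughly one third of ages 0-4, resize, and scale pixels to [0, 1].
        random.seed(seed)
        images, genders, races = [], [], []
        for name in os.listdir(image_dir):
            if not name.endswith(".jpg"):
                continue
            age, gender, race = (int(part) for part in name.split("_")[:3])
            if age >= 80:
                continue
            if age <= 4 and random.random() > 1 / 3:
                continue
            img = Image.open(os.path.join(image_dir, name)).convert("RGB").resize(size)
            images.append(np.asarray(img, dtype=np.float32) / 255.0)  # normalize to [0, 1]
            genders.append(gender)
            races.append(race)
        X, y, r = np.stack(images), np.array(genders), np.array(races)
        # 80-20 train/test split; the race labels are carried along for the later per-group tests.
        return train_test_split(X, y, r, test_size=0.2, random_state=seed)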

ii. Neural Network Architecture


Below is a visualization of the neural network architecture used in the training and testing
processes.

Figure 4: Convolutional Neural Network (CNN) Architecture

There are several types of layers in the architecture, which utilize different activation
functions. We use four ReLU and one sigmoid activation function. These activation functions tell a
neuron (an individual node) in the network whether or not to 'fire' and pass information to the
neighboring neurons in the next layer. When designing the network, we used Convolutional 2-D
layers (Conv2D), Max Pooling 2-D layers (MaxPooling2D), and Flatten, Dense, and Dropout layers.
Each serves a different function in the process of training and propagating data.

The rectified linear activation function (ReLU for short) is a piecewise linear function that
returns max(x, 0), the element-wise maximum of zero and the input tensor: positive values pass
through unchanged, and negative values are set to zero. A tensor is an algebraic object that
describes a relationship between sets of algebraic objects (in this case pixels) related to a
vector space. The sigmoid activation function is a special form of the logistic function. For
small values, sigmoid returns a value close to zero, and for large values the result of the
function gets close to 1.
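As a quick illustration of these two activation functions (a NumPy sketch, not the paper's code):

    import numpy as np

    def relu(x):
        # Element-wise max(x, 0): positive inputs pass through, negative inputs become 0.
        return np.maximum(x, 0.0)

    def sigmoid(x):
        # Squashes any real input into (0, 1); suited to a binary (male/female) output.
        return 1.0 / (1.0 + np.exp(-x))

    print(relu(np.array([-2.0, 0.5])))     # [0.  0.5]
    print(sigmoid(np.array([-6.0, 6.0])))  # roughly [0.0025 0.9975]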
The dense layer is the regular fully connected layer, being the most common and widely used. The
Conv2D layer performs a spatial convolution over images by creating a convolution kernel that is
convolved with the layer input to produce a tensor of outputs. A MaxPooling layer downsamples the
input along its spatial dimensions by taking the maximum value over an input window for each
channel of the input. The dropout layer randomly sets input units to 0 at a set frequency during
training; this helps prevent overfitting, where the model memorizes the training data instead of
generalizing. Finally, the flatten layer flattens the data into the next layer. For example, if
flatten is called on a layer with shape (batch, 2, 2), the result will be (batch, 4).

Figure 5: Example Visualization of Convolutional Neural Network (CNN)
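Putting these layer types together, the sketch below shows one plausible Keras/TensorFlow layout
with four ReLU activations, one sigmoid output, and nine layers between input and output. The
filter counts, kernel sizes, and input resolution are assumptions; the paper reports only the layer
types used (see Figure 4).

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_gender_cnn(input_shape=(64, 64, 3)):
        # Illustrative stack: three Conv2D+MaxPooling2D blocks, then Flatten, Dense, Dropout,
        # and a single sigmoid unit giving the probability of the "female" (1) class.
        return models.Sequential([
            layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(64, (3, 3), activation="relu"),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(128, (3, 3), activation="relu"),
            layers.MaxPooling2D((2, 2)),
            layers.Flatten(),                       # collapses the spatial feature maps into one vector per image
            layers.Dense(128, activation="relu"),
            layers.Dropout(0.5),                    # randomly zeroes inputs during training to curb overfitting
            layers.Dense(1, activation="sigmoid"),
        ])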
iii. Hyperparameter Tuning
Hyperparameters represent constant aspects of the program that are not affected by which
architecture is being used. Rather than leaving them as controlled variables, the goal is to see
how variations in the architecture would interact with variations in the hyperparameters to
influence the performance itself. Note that the loss function (another potential hyperparameter)
was kept constant at cross-entropy loss.

Figure 6: Proposed Hyperparameter Alterations

The batch size of a neural network is the number of training examples used in one iteration (a
single weight update, many of which make up an epoch). We explore switching between a large (64)
and small (32) batch size to increase performance during training. By observing the effect of this
change on our dataset and specific use case, we aim to discern what sort of batch size is better
suited for the task.

Next, we suggest altering the learning rate. The learning rate is a parameter in the network that
determines the step size taken when working to find the minimum of its loss function. By
experimenting with the learning rate in the training of our model(s), we want to see how the
learning rate can optimize overall performance.

Finally, we switched the optimizer. An optimizer is an algorithm that adjusts attributes of the
network, such as its weights, to achieve a higher accuracy. Optimizers can take a momentum
parameter; momentum is a technique used to accelerate the convergence of gradient vectors. The
reason for swapping between Adam and Stochastic Gradient Descent (SGD) is the nature of their
momentum parameter. SGD requires a declaration of its momentum, which we provided as 0.9 to speed
up the convergence of gradients, while Adam dynamically calculates and updates its momentum as
training proceeds. This is because Adam uses a moving average of the gradient instead of the raw
gradient that SGD uses, so the effective momentum constantly adapts in Adam but stays fixed at one
value in SGD. Therefore, by comparing these two optimizers, we are also able to see to what extent
the performance of a static momentum value varies from that of a dynamic one.
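A sketch of this tuning loop, again assuming a Keras/TensorFlow implementation; the epoch count and
the default learning-rate value are placeholders rather than the paper's settings.

    import tensorflow as tf

    def train_variant(model, X_train, y_train, X_val, y_val,
                      optimizer_name="adam", learning_rate=1e-3, batch_size=32, epochs=20):
        # Swap between Adam (adaptive momentum) and SGD with a fixed momentum of 0.9,
        # while keeping the loss fixed at (binary) cross entropy.
        if optimizer_name == "sgd":
            optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.9)
        else:
            optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
        model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
        return model.fit(X_train, y_train, validation_data=(X_val, y_val),
                         batch_size=batch_size, epochs=epochs)

Running this for batch sizes of 32 and 64, a few candidate learning rates, and both optimizer
names covers the grid of alterations proposed in Figure 6.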

III. Results
The experimental simulations were implemented in the Google Colaboratory environment on the GPU
runtime. The GPU ensured a lower processing time and quicker results. These simulations were run
on the UTKFace dataset to assess the difference in performance on different test sets and to
optimize the models by tuning their hyperparameters.

After the model was trained, the 20% of the data that had already been partitioned from the set by
a split function was used to test the algorithm. Of this 20%, 61% was white and the other 39% was
non-white. Evaluating on this held-out 20%, we successfully created a model that could predict
gender from images at 88.6% accuracy. As seen in Figure 7, the validation accuracy starts off
aligned with the training accuracy, but eventually plateaus to a relatively constant accuracy (as
expected).

Figure 7: Model Training and Validation Accuracy

It is also important to confirm whether there were any biases in false predictions. This means
checking whether the neural network might be predominantly more errant with males than with
females, thus lowering the overall accuracy. In the confusion matrix below (Figure 8), we can see
that there were 140 mislabeled males and 119 mislabeled females. This is inside our margin of
error, therefore the network is accepted as fair in this respect.

Figure 8: Confusion Matrix on Overall Validation Test
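A minimal sketch of how such a confusion matrix can be produced from the trained model, using
scikit-learn for illustration; the 0.5 decision threshold is an assumption.

    from sklearn.metrics import confusion_matrix

    def gender_confusion(model, X_test, y_test, threshold=0.5):
        # Rows are the true labels (0 = male, 1 = female), columns are the predictions;
        # the off-diagonal cells are the mislabeled males and females discussed above.
        probs = model.predict(X_test).ravel()
        preds = (probs >= threshold).astype(int)
        return confusion_matrix(y_test, preds)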
Two additional tests were proposed: testing on only whites and testing on only non-whites. The
full 20% test would serve as a form of control, a baseline establishing the standard. Then, by
comparing the results of the smaller tests, a controlled and meaningful bias may make itself
visible.

Figure 9: Validation Test Results

After performing the training, a clear difference in accuracy between whites and non-whites became
apparent, as shown in Figure 9. On the white-labeled dataset, the accuracy decreased by 2.48% from
the overall accuracy, while on the non-white-labeled dataset it decreased by 11.40%, almost five
times as drastic. This is a notable and concerning observation because people of color are
automatically at a disadvantage when using this network due to the data it was trained on.
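The per-group comparison itself can be sketched as follows, assuming the race labels were carried
through the train/test split (as in the preprocessing sketch earlier); race code 0 corresponds to
the white label in UTKFace.

    import numpy as np

    def accuracy_by_group(model, X_test, y_test, races_test, threshold=0.5):
        # Compare gender accuracy overall, on white-labeled, and on non-white-labeled images.
        preds = (model.predict(X_test).ravel() >= threshold).astype(int)
        correct = preds == y_test
        white = races_test == 0
        return {
            "overall": float(correct.mean()),
            "white": float(correct[white].mean()),
            "non_white": float(correct[~white].mean()),
        }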
From our data analysis and experimental retrials, we confirmed that balancing the distribution of
races in the image dataset reduced bias the most, and responsible developers have the ability to
do the same. However, issues with unfair technology may arise if there is no benchmark standard or
policy for this new and growing concept of data inclusiveness. The question of how to increase the
fairness of algorithms in the AI industry persists. In the next section, we discuss possible
solutions.

IV. Further Thoughts

At a high level, regardless of the implementation of image processing algorithms, the data used is
constant and shared. For example, Google hosts more than 28 billion images and videos. As stated
previously, our team analyzed and performed experiments on a reputable and ubiquitous dataset.
When a dataset is used for engineering by an esteemed corporation or an aspiring college student,
there is a trust exchange. A user will trust the software, not the dataset, and this is where a
problem arises.

Data is easily trusted yet can lead to manipulation. As a result, if there is not some sort of
clarity on the potential bias of the data, it could have negative and irreversible effects on
minority groups and communities, whether purposeful or not.

The Trump administration ushered in an age of distrust and disinformation, widely accelerated by
the exponentially growing social media industry. While the information this administration spread
was not from an official data source, it did come from the president, whom 42% of Americans would
follow, according to the statistics site FiveThirtyEight [3]. To those Americans, the Trump
administration was a data source. The danger of trusting an unreputable source spoke for itself,
as 2021 was one of the most volatile years for political disagreement via social media.
Eventually, unsupported or false statements released by the Trump administration were flagged as
untrustworthy, with the flag made visible to the user through the user interface. Other fixes
included a Twitter ban for spreading disinformation.

Rekognition is Amazon's machine learning image and analysis software that has been sold to and
used by a number of United States government agencies. An examination of Rekognition by MIT Media
Lab researcher Joy Buolamwini showed an error rate of 0.8% for light-skinned men and 34.7% for
dark-skinned women [7]. Steps can be taken to ensure AI fairness among large corporations, in
order to limit unintentional harm that affects the communities they serve. As bias in algorithms
is the effect of biased training data, it is currently up to a company's discretion to call for
responsible technology. To ensure higher standards of data quality and fairness, work must be done
to ensure unbiased experiences.

Possible plans to prevent recurrences could include requiring departments to internally publish
fairness outcome tests before any algorithms go live. Once groups that may be receiving unfair
bias from an algorithm are made aware, corporations can engage users from those demographics and
monitor the results. Finally, standardized policy can set a benchmark, in terms of bias, for the
data behind technology to meet. Recently, the U.S. Equal Employment Opportunity Commission (EEOC)
launched the Artificial Intelligence and Algorithmic Fairness Initiative, which works to ensure
that innovative systems comply with data fairness and privacy benchmarks [2].
V. Conclusion

Our experimentation yielded results that aligned with our hypothesis and our initial rationale for
conducting the experiment: computers are unaware of historical bias, and only aware of the data
they are given. In our Convolutional Neural Network (CNN) architecture, 9 hidden layers were
utilized in a custom-built layout. Regardless of architecture size or whether the model is
pre-trained or not, the data a model draws from remains constant. This network was trained on an
Apple M1 MacBook together with a virtual graphics processing unit (GPU), and took roughly 6 hours
to train. Our hyperparameter tuning yielded an accuracy increase of 1.8%, reducing our margin of
error by 7.7%, and thus represents a statistically significant improvement. In general, the
findings of our work show a difference in computer vision accuracy when it comes to race, or more
specifically, skin color.

Bias in Artificial Intelligence remains a new topic in the realm of technology. Publications
continue to release new data on the unjust treatment of users in and out of the workforce. There
remains a great area for improvement in this field of research and creation. The UTKFace dataset
contains 23,708 images whose distribution shows racial bias, and altering this range of data would
yield more responsible results. For our test, we sought to test the well-known UTKFace dataset on
gender and ethnic prediction categorized by race. Our results match our hypothesis that users are
unequally represented when comparing the race of the user, and they speak to the general standard
of unchecked and unmonitored data engineering across the industry.

VI. Contributions

All authors conceived the experiments, K.S. devised the code for the experiments, and all authors
conducted the experiments and analyzed the results. All authors wrote and reviewed the manuscript.

VII. Acknowledgements

We would like to acknowledge GitHub user susanqq for making the UTKFace dataset publicly available
for non-commercial research use. We would also like to thank Kaggle and Google Colab, which
provided ample resources to conduct our research and experimentation.
References

[1] Google 2021 Diversity Annual Report. url: https://static.googleusercontent.com/media/diversity.google/en//annual-report/static/pdfs/google_2021_diversity_annual_report.pdf.

[2] Artificial Intelligence and Algorithmic Fairness Initiative. url: https://www.eeoc.gov/ai.

[3] DataDhrumil. Donald Trump: Favorability polls. July 2022. url: https://projects.fivethirtyeight.com/polls/favorability/donald-trump/.

[4] Facebook Diversity. url: https://diversity.fb.com/2021-report/.

[5] Ben Guarino. Google faulted for racial bias in image search results for black teenagers. Oct. 2021. url: https://www.washingtonpost.com/news/morning-mix/wp/2016/06/10/google-faulted-for-racial-bias-in-image-search-results-for-black-teenagers/.

[6] Microsoft's 2021 Diversity and Inclusion Report: Demonstrating progress and remaining accountable to our commitments. Oct. 2021. url: https://blogs.microsoft.com/blog/2021/10/20/microsofts-2021-diversity-inclusion-report-demonstrating-progress-and-remaining-accountable-to-our-commitments/.

[7] Larry Hardesty, MIT News Office. Study finds gender and skin-type bias in commercial artificial-intelligence systems. url: https://news.mit.edu/2018/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212.

[8] The problem with AI? Study says it's too white and male, calls for more women, minorities. url: https://imdiversity.com/diversity-news/the-problem-with-ai-study-says-its-too-white-and-male-calls-for-more-women-minorities/.

[9] Sarah Myers West, Meredith Whittaker, and Kate Crawford. "Discriminating systems". In: AI Now (2019).
