
STA1007S Lab 5: Conditional probability

SUBMISSION INSTRUCTIONS:

Your answers need to be submitted on Vula.

Go into the Submissions section and click on Lab Session 5 to access the submission form. Please note that
the answers get automatically marked and so have to be in the correct format:

ENTER YOUR ANSWERS TO 2 DECIMAL PLACES UNLESS THE ANSWER IS A ZERO OR AN INTEGER
(for example if the answer is 0 you just enter 0 and not 0.00, or if the answer is 2 you enter 2 and not 2.00).

DO NOT INCLUDE ANY UNITS (e.g. meters, mg, etc.).

PROBABILITIES MUST BE BETWEEN 0 AND 1, SO A 50% CHANCE WOULD CORRESPOND TO A
PROBABILITY OF 0.5.

Introduction
In previous labs, we’ve become familiar with the individual variables contained in the Class_Survey2 data
set. The next step is to understand relations between variables. We determine that there is a relationship
(an association) between two variables when knowing something about one of them provides information about
the other one. This is an extremely important concept in statistical analysis, since very often we can’t
observe the variable we are interested in, but other variables that provide information about it are readily
available. For example, the type of vegetation at a site is likely to provide some information about the
presence of certain bird species. In this lab, we’ll keep using the Class_Survey2 data set to illustrate how
conditional probability can be used to determine such relations.
You will find that most of the R code you need is provided. This lab is meant to
be practice for you, so even though the code and its output are provided, you are expected to create
your own script, run the pieces of code yourself and check whether the output is what you would expect it to
be. Every now and then, you will be asked to fill in blank pieces of code marked as ---. In addition to “fill
in the code”, you will need to answer other questions for which you must produce plots, run your own code
or explore your data. The questions you need to submit through Vula will appear in the submission boxes.
At any time you can call the function help() to obtain information about any function you want. For example,
if you wanted a description of how the function sample() works, you could type in the
console (bottom left panel in RStudio):
help("sample")

or you can just type:


?sample

You should make it a habit to check the help files of the functions you use for the first time.

Start a new R script and import your data
By now, you should already know how to start a new R script in an existing R project and write a few lines
describing what you are going to do. Do this now.
Remember to add a line to clean your working environment and one to double check that your working
directory is correct.
Now, read into R the Class_Survey2 data set and check that it has been read correctly by using the functions:
head(), summary(), etc. We will assume that you named the data frame containing your data classData
and this is how we will refer to it during the lab. It is perfectly fine if you want to call it something else, you
will just need to adapt your code accordingly.
Remember to save your script frequently!

Contingency tables
Remember that in the previous lab, we used the function table() to obtain the frequency of the different
levels of our categorical variables. That is, how many observations of each level we have in our data frame.
We are now going to use the function table() to cross-tabulate some of the categorical variables. We just
need to pass an extra argument to the function table(). To explain how this works, let’s see how many
male STA1007S students have blue eyes.
# Produce a table cross-classifying the factors gender and eyes
table(classData$gender, classData$eyes)

##
##             Blue Brown Green Hazel Other
##   female      27    83    15     8     5
##   male         8    39     4     3     1
##   No Answer    0     1     0     0     0
##   other        1     1     0     0     0
The code above gives you the cross-tabulation of gender (in the rows) versus eye colour (in the columns). In
other words, it gives you the counts for each combination of factor levels. To answer our question above, we
can now see that there are 8 blue-eyed male students in the class. This type of table is often called a
contingency table.
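As a minimal sketch of how table() cross-tabulates two factors, here is a toy data frame with six made-up students. The object toyData and its values are invented for illustration and are not part of the Class_Survey2 data.

```r
# Toy data: six made-up students (NOT the class survey data)
toyData <- data.frame(
  gender = c("female", "male", "female", "male", "female", "female"),
  eyes   = c("Blue", "Blue", "Brown", "Brown", "Brown", "Blue")
)

# Cross-tabulate: the first argument goes on the rows, the second on the columns
toyTable <- table(toyData$gender, toyData$eyes)
toyTable

# Extract a single count by row name and column name
toyTable["male", "Blue"]
```

Indexing the table by name, as in the last line, is a handy way to pull out one cell without counting rows and columns by eye.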

SUBMISSION:
Vula Question 1 How many students in the class have a conservative world view and a tattoo?
Vula Question 2 How many students in the class with a tattoo get 4 or 5 hours of sleep?
Incidentally, the function barplot() will happily accept a table like this and produce a plot.
# Produce a barplot with stacked frequencies of eye colours by gender
barplot(height = table(classData$gender, classData$eyes))

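[Figure: stacked bar plot of frequencies by eye colour (Blue, Brown, Green, Hazel, Other), with the gender groups stacked within each bar; y-axis from 0 to 120.]
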
Life is good... The function barplot() understands that you want to stack the frequencies in the second row
on top of the frequencies in the first row, and so on for the remaining rows. However, you will need to work
on it a little bit to produce an informative legend. We won’t cover this now.
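For the curious, one possible starting point for a legend is the legend.text argument of barplot(). The sketch below rebuilds the contingency table from the counts printed above so that it runs on its own; with your own data you would pass table(classData$gender, classData$eyes) instead. The axis labels and the object name bar.mids are our own choices.

```r
# Rebuild the gender by eye colour table from the printed counts
gender.eyes <- as.table(matrix(
  c(27, 83, 15, 8, 5,
     8, 39,  4, 3, 1,
     0,  1,  0, 0, 0,
     1,  1,  0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("female", "male", "No Answer", "other"),
                  c("Blue", "Brown", "Green", "Hazel", "Other"))
))

# legend.text = TRUE labels the stacked segments with the row names (gender)
bar.mids <- barplot(height = gender.eyes, legend.text = TRUE,
                    xlab = "Eye colour", ylab = "Number of students")
```

barplot() also returns the positions of the bars along the x-axis, which is what gets stored in bar.mids here.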
Frequencies are good to start with; however, one often wants to know the proportion of cases that fall into
each group, rather than the absolute numbers. For example, we might want to know what the proportion of
females with hazel eyes is. To obtain a table with proportions in R, we need the function prop.table().
prop.table(), as you may see in its help file, needs a table as input. We could do this in one step, like we
did with the function barplot(), but let’s now do it in two steps: first we create an object, say gender.eyes
and we assign to it a contingency table, like the one we just produced. Then, we pass on this object as an
argument to the prop.table() function.
# Create an object with a table cross-classifying the factors gender and eyes
gender.eyes <- table(classData$gender, classData$eyes)

# Produce a table with proportions of the same crossed factors
prop.table(gender.eyes)

##
##                    Blue       Brown       Green       Hazel       Other
##   female    0.137755102 0.423469388 0.076530612 0.040816327 0.025510204
##   male      0.040816327 0.198979592 0.020408163 0.015306122 0.005102041
##   No Answer 0.000000000 0.005102041 0.000000000 0.000000000 0.000000000
##   other     0.005102041 0.005102041 0.000000000 0.000000000 0.000000000
The first line of code above shows you, once more, that we can create an object with the output of virtually
any function in R. We can then retrieve that output any time we need it in the future, by just calling this
object. Usually, we do this when we anticipate that we will be using the object we’ve created several times.
If it is going to be a once-off call, then it is preferable not to create extra objects that end up cluttering our
working environment.
If you decided to go the once-off way, you would need to nest the function table() inside the function
prop.table(). We’ve done something similar before. The code would look something like this and it would
produce the exact same output:
# Produce a table with proportions of the same crossed factors
prop.table(table(classData$gender, classData$eyes))

If we stare at the proportions table we’ve created, we will realize that all the proportions add up to 1. This is
because the function prop.table(), by default, has taken the proportions in relation to the TOTAL number
of observations. To answer the question of what proportion of students in class are females with hazel eyes,
we look at the first row of the fourth column. We see that approximately 4.1% of the class are females with
hazel eyes.
Now, we can ask R to calculate the proportions in relation to rows or columns, instead of to the whole data
set. This can get a bit confusing, so read carefully. For example, say that we are only interested in females
now. We want to know what proportion of females have hazel eyes. In other words, we want to calculate the
proportion of females with hazel eyes, out of the total number of females. We can tell R to calculate the
proportions in relation to row totals by specifying the extra argument margin in the function prop.table().
# Produce a table with proportions of eye colour out of gender totals
prop.table(gender.eyes, margin = 1)

##
##                   Blue      Brown      Green      Hazel      Other
##   female    0.19565217 0.60144928 0.10869565 0.05797101 0.03623188
##   male      0.14545455 0.70909091 0.07272727 0.05454545 0.01818182
##   No Answer 0.00000000 1.00000000 0.00000000 0.00000000 0.00000000
##   other     0.50000000 0.50000000 0.00000000 0.00000000 0.00000000
The argument margin = 1 tells R to calculate the proportions out of the row totals. We now see a table with
the proportion of each eye colour, out of the total number of observations in each gender. Note that in this
case, the sum of each row of the table adds up to 1, because we are calculating the proportions out of row
totals. It follows that if you sum all the proportions in the table, the result is no longer 1 but 4, one for
each row.
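A quick way to confirm which totals were used is to add up the rows of the proportion table. The sketch below rebuilds the table from the counts printed earlier so that it runs on its own; the object name row.props is our own.

```r
# Rebuild the gender by eye colour table from the printed counts
gender.eyes <- as.table(matrix(
  c(27, 83, 15, 8, 5,
     8, 39,  4, 3, 1,
     0,  1,  0, 0, 0,
     1,  1,  0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("female", "male", "No Answer", "other"),
                  c("Blue", "Brown", "Green", "Hazel", "Other"))
))

# Proportions out of row (gender) totals
row.props <- prop.table(gender.eyes, margin = 1)

# Each row adds up to 1 ...
rowSums(row.props)

# ... so the whole table adds up to the number of rows
sum(row.props)
```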
This type of table is useful to see if proportions are maintained across groups. For example, we see that
approximately 6% of females have hazel eyes, and 5% of males have hazel eyes. Brown eyes are clearly the
most frequent group in both males and females but the proportion of females with brown eyes is about 60%,
whereas the proportion of males with brown eyes goes up to 71%.
We can do the same exercise with the columns, by specifying the argument margin = 2 to the function
prop.table():
# Produce a table with proportions of gender out of eye colour totals
prop.table(gender.eyes, margin = 2)

##
##                    Blue       Brown       Green       Hazel       Other
##   female    0.750000000 0.669354839 0.789473684 0.727272727 0.833333333
##   male      0.222222222 0.314516129 0.210526316 0.272727273 0.166666667
##   No Answer 0.000000000 0.008064516 0.000000000 0.000000000 0.000000000
##   other     0.027777778 0.008064516 0.000000000 0.000000000 0.000000000
We now see what proportion of students with blue, brown, etc. eyes are males or females. The proportions
are calculated with respect to the column totals and therefore, the sum of the values in each column equals 1.
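If you are wondering what margin = 2 does behind the scenes, the sketch below reproduces it by hand: every column of counts is divided by its own column total. It uses the base R functions colSums() and sweep(), which have not been covered in the labs, and rebuilds the table from the printed counts so that it runs on its own.

```r
# Rebuild the gender by eye colour table from the printed counts
gender.eyes <- as.table(matrix(
  c(27, 83, 15, 8, 5,
     8, 39,  4, 3, 1,
     0,  1,  0, 0, 0,
     1,  1,  0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("female", "male", "No Answer", "other"),
                  c("Blue", "Brown", "Green", "Hazel", "Other"))
))

# Proportions out of column (eye colour) totals
col.props <- prop.table(gender.eyes, margin = 2)

# The same calculation by hand: divide each column by its column total
by.hand <- sweep(gender.eyes, MARGIN = 2, STATS = colSums(gender.eyes), FUN = "/")

# Compare the two results, then check that every column adds up to 1
all.equal(col.props, by.hand)
colSums(col.props)
```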

SUBMISSION:
Do students with a tattoo sleep more or less than students without tattoos? To answer this question, you
would need to create a table crossing the variable ‘tattoo’ on the rows and ‘sleep’ on the columns.
Vula Question 3 Choose the best statement out of the options below:
a. This question is best answered using frequencies (absolute numbers).
b. This question is best answered using proportions out of the total number of students (table totals).
c. This question is best answered using proportions out of row totals (tattoo totals).

Conditional probabilities
These proportion tables are quite useful if we wanted to calculate conditional probabilities. You might
remember the formula for the conditional probability from the lectures. It looks something like this:

\[
\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} \tag{1}
\]

It reads: “the probability of A given B is equal to the probability of A AND B divided by the probability of
B”. In other words, if we know that B has occurred, what is the probability that A occurs as well? Put
differently, does the fact that B occurred give any information about A occurring too?
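As a worked instance of the formula, take the gender and eye-colour counts from the first contingency table in this lab (196 students in total, 138 of them female, 8 of them females with hazel eyes). Let A be “has hazel eyes” and B be “is female”:

```latex
\[
\Pr(\text{hazel} \mid \text{female})
  = \frac{\Pr(\text{hazel} \cap \text{female})}{\Pr(\text{female})}
  = \frac{8/196}{138/196}
  = \frac{8}{138}
  \approx 0.058
\]
```

This agrees with the value in the female row and Hazel column of the table of proportions out of row totals.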
Let’s apply what we know about contingency tables. For example, suppose we have selected a student with a
tattoo. What would our best guess be about this student’s political views, based on our data?
Let’s first produce a contingency table with the proportion of students in class with all the combinations of
political views and having a tattoo.
# Produce a table with proportions crossing the factors views and tattoo
prop.table(table(classData$views, classData$tattoo), margin = NULL)

##
##                        No        Yes
##   communist    0.03571429 0.00000000
##   conservative 0.05612245 0.00000000
##   liberal      0.38265306 0.05612245
##   moderate     0.29081633 0.01530612
##   socialist    0.13775510 0.02551020
The argument margin = NULL tells the function to produce proportions out of the total number of observations
(this is also the default). Looking at this table, we see that amongst students with a tattoo, liberal views
make up the highest proportion. However, what is the probability that a student has liberal views, knowing
that the student has a tattoo? Is it 0.06? Think carefully about this: 0.06 is the probability of picking
a student completely at random and that student having both a tattoo and liberal political views. But we are
not picking completely at random: we know that the student has a tattoo! We can rule out a big portion of
the class and work only with those students with a tattoo.
In other words, we should calculate the proportions out of the students with a tattoo. We can use what we
learnt in the last section and produce a table with proportions out of tattoo totals. Fill in the blank code
and make sure that you get the proportions right.
# Produce a table with proportions of views out of tattoo totals
prop.table(table(classData$---, classData$---), margin = 1)

##
##        communist conservative    liberal   moderate  socialist
##   No  0.03954802   0.06214689 0.42372881 0.32203390 0.15254237
##   Yes 0.00000000   0.00000000 0.57894737 0.15789474 0.26315789
One way to check that you are getting the correct proportions is by making sure that the row (or column) of
the group you know something about sums to 1. In this case, we know that the student has a tattoo, which
means that the proportions in the row (or column) with the ‘Yes’ for tattoo must add up to 1.
As we can see, the probability of the student being liberal once you know he/she has a tattoo (about 0.58)
is much larger than the probability of being liberal and having a tattoo (about 0.06).
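The same number can be recovered from the conditional probability formula, using only the joint proportions printed in the first views-by-tattoo table. The sketch below types in the ‘Yes’ column of that table by hand (so the values carry the rounding of the printed output) and divides the joint probability by the probability of having a tattoo. The object names are our own.

```r
# Joint proportions Pr(view AND tattoo = Yes), copied from the printed table
joint.yes <- c(communist = 0, conservative = 0, liberal = 0.05612245,
               moderate = 0.01530612, socialist = 0.02551020)

# Pr(tattoo = Yes): add the joint proportions over all political views
p.tattoo <- sum(joint.yes)

# Pr(liberal | tattoo) = Pr(liberal AND tattoo) / Pr(tattoo)
p.liberal.given.tattoo <- joint.yes["liberal"] / p.tattoo
round(p.liberal.given.tattoo, 2)
```

The result is about 0.58, in line with the liberal entry of the ‘Yes’ row in the table of proportions out of tattoo totals.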

SUBMISSION:
Let’s see if you can find the ingredients to plug into the conditional probability formula. We pick one student
at random from the class:
Vula Question 4. What is the probability of the student being a male and sleeping 4 or 5 hours?
Vula Question 5. What is the probability of the student being a male?
Vula Question 6. Suppose that we know that the student is a male, what is the probability of him sleeping
4 or 5 hours? (hint: you could use the conditional probability formula, but use the function prop.table()
to avoid rounding problems)
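To see why the hint matters, here is a sketch using the gender and eye-colour counts printed earlier in this lab (not the sleep data you need for the questions). It computes the probability of blue eyes given that the student is male in two ways: first by plugging ingredients already rounded to 2 decimal places into the formula, and then directly with prop.table(). The object names are our own.

```r
# Rebuild the gender by eye colour table from the printed counts
gender.eyes <- as.table(matrix(
  c(27, 83, 15, 8, 5,
     8, 39,  4, 3, 1,
     0,  1,  0, 0, 0,
     1,  1,  0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("female", "male", "No Answer", "other"),
                  c("Blue", "Brown", "Green", "Hazel", "Other"))
))
total <- sum(gender.eyes)

# Ingredients rounded to 2 decimal places BEFORE dividing
p.male.and.blue <- round(gender.eyes["male", "Blue"] / total, 2)
p.male          <- round(sum(gender.eyes["male", ]) / total, 2)
rounded.first   <- round(p.male.and.blue / p.male, 2)

# Exact proportions, rounded only at the very end
rounded.last <- round(prop.table(gender.eyes, margin = 1)["male", "Blue"], 2)

c(rounded.first = rounded.first, rounded.last = rounded.last)
```

The two answers differ in the second decimal place (0.14 versus 0.15), which is enough for an automatically marked answer to be counted as wrong.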

The commands you learned today


These are the functions and operators that you learned today. Fill in your own description of what they do.
table()
prop.table()
