
R lab - Probability distributions

Probability mass and density functions


From the lectures you may recall the concepts of probability mass and density
functions. Probability mass functions relate to the probability distributions of discrete
variables, while probability density functions relate to the probability distributions of
continuous variables. Suppose we have the following probability mass function:

# the data frame
data <- data.frame(outcome = 0:5, probs = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.1))

# make a bar plot of the probability distribution
barplot(names.arg = data$outcome, height = data$probs)

Probability mass and density functions (2)
For continuous variables, the values of a variable are associated with a probability
density. To get a probability, you will need to consider an interval under the curve of the
probability density function. Probabilities here are thus considered surface areas.
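As a quick illustration of the area idea, such a probability can be approximated numerically. The sketch below is an addition to the original exercise: it integrates the standard normal density dnorm() over an interval to get a probability.

```r
# probability as an area under a density curve: P(-1 < X < 1) for a
# standard normal, obtained by numerically integrating dnorm()
area <- integrate(dnorm, lower = -1, upper = 1)$value
round(area, 4)  # roughly 0.6827
```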

In this exercise, we will simulate some random normally distributed data using
the rnorm() function. This data is contained within the data vector. You will then need to
visualize the data.
# simulate data
set.seed(11225)
data <- rnorm(10000)

# check the documentation of the dnorm function
help(dnorm)

# calculate the density of data and store it in the variable density
density <- dnorm(data)

# make a plot with data as the x variable and density as the y variable
plot(x = data, y = density)

The cumulative probability distribution


In the last two exercises, we saw the probability distributions of a discrete and a
continuous variable. In this exercise we will jump into cumulative probability
distributions. Let's go back to our probability density function of the first exercise:

All the probabilities in the table are included in the
data frame probability_distribution, which contains the variables outcome and probs.
We could sum individual probabilities in order to get the cumulative probability of a given
value. However, in some cases the function cumsum() may come in handy.
What cumsum() does is return a vector whose elements are the cumulative sums of
the elements of its argument. For instance, if we had a vector containing
the elements c(1, 2, 3), cumsum() would return c(1, 3, 6).
# probability that x is smaller than or equal to two
prob <- 0.1 + 0.2 + 0.3

# probability that x is 0, smaller than or equal to one,
# smaller than or equal to two, and smaller than or equal to three
cumsum(c(0.1, 0.2, 0.3, 0.2))

Summary statistics: The mean


One of the first things that you would like to know about a probability distribution is
some set of summary statistics that captures the essence of the distribution. One example of
such a summary statistic is the mean. The mean of a probability distribution is
calculated by taking the weighted average of all possible values that a random variable
can take. In the case of a discrete variable, you calculate the sum of each possible
value times its probability. Let's go back to our probability mass function of the first
exercise.

# calculate the expected value and assign it to the variable expected_score

expected_score <- sum(data$outcome * data$probs)

# print the variable expected_score

expected_score
Summary statistics: Variance and the standard deviation
In addition to the mean, sometimes you would also like to know about the spread of the
distribution. The variance is often taken as a measure of the spread of a distribution. It is
the expected squared deviation of an observation from its mean. If you want to calculate it on the
basis of a probability distribution, it is the sum of the squared differences between the
individual values and their mean, multiplied by their probabilities. See the following
formula: var(X) = ∑ (xᵢ − x̄)² ∗ P(xᵢ).

If we want to turn that variance into the standard deviation, all we need to do is to take
its square root. Let's go back to our probability mass function of the first exercise and
see if we can get the variance.

# the mean of the probability mass function

expected_score <- sum(data$outcome * data$probs)

# calculate the variance and store it in a variable called variance

variance <- sum((data$outcome - expected_score)^2 * data$probs)

# calculate the standard deviation and store it in a variable called std

std <- sqrt(variance)
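A simulation can serve as a sanity check here. The sketch below is an addition to the original exercise: it draws from the same discrete distribution with sample() and compares the empirical mean and standard deviation with the theoretical values computed above.

```r
# draw many observations from the discrete distribution of the first exercise
set.seed(11225)
draws <- sample(0:5, size = 100000, replace = TRUE,
                prob = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.1))

mean(draws)  # close to the theoretical mean of 2.3
sd(draws)    # close to the theoretical sd of about 1.42
```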


The normal distribution
Now that we know what probability mass and probability density functions are and we
know how to calculate some summary statistics, let's consider the normal distribution.
The normal distribution, also known as the Gaussian distribution, is the probability
distribution that is encountered most frequently. It is characterized by a nice bell curve.
A normal distribution is centered at its mean, called μ. Its spread is defined by the
standard deviation. The image below gives an idea how the probability density function
and the standard deviation of a normal distribution are related:

Look at the visualization; what is the probability that an observation from a normal
distribution is between 1 standard deviation below the mean and 2 standard deviations
above the mean?

0.815
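This answer can be checked with R's pnorm() function; for a standard normal the interval runs from z = −1 to z = 2.

```r
# P(mu - 1*sd < X < mu + 2*sd) for a standard normal distribution
prob <- pnorm(2) - pnorm(-1)
round(prob, 3)  # about 0.819; the 0.815 above uses the rounded 68-95 bands
```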

The normal distribution and cumulative probability
In the previous assignment we calculated probabilities according to the normal
distribution by looking at an image. However, it is not always as simple as that.
Sometimes we deal with cases where we want to know the probability that a normally
distributed variable is between a certain interval. Let's work with an example of female
hair length.

Hair length is considered to be normally distributed with a mean of 25 centimeters and a
standard deviation of 5. Imagine we wanted to know the probability that a woman's hair
length is less than 30 centimeters. We can do this in R using the pnorm() function. This function
calculates the cumulative probability. We can use it in the following way: pnorm(30, mean
= 25, sd = 5). If you wanted to calculate the probability of a woman having a hair
length larger than or equal to 30 centimeters, you can set the lower.tail argument to FALSE.
For instance, pnorm(30, mean = 25, sd = 5, lower.tail = FALSE). Let's visualize this.
Note that the first example is visualized on the left, while the second example is
visualized on the right:

# probability of a woman having a hair length of less than 20 centimeters


round(pnorm(20, mean = 25, sd = 5), 2)
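To make the left/right relationship concrete, a small sketch (an addition to the exercise, using the same mean of 25 and sd of 5) computes both tails around 30 centimeters and confirms they sum to 1.

```r
# P(hair length < 30) and P(hair length >= 30)
left  <- pnorm(30, mean = 25, sd = 5)
right <- pnorm(30, mean = 25, sd = 5, lower.tail = FALSE)

round(left, 4)   # 0.8413
round(right, 4)  # 0.1587
left + right     # 1
```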

The normal distribution and quantiles


Sometimes we have a probability that we want to associate with a value. This is
basically the opposite of the situation described in the previous question. Say
we want the value of a woman's hair length that corresponds with the 0.2 quantile
(= 20th percentile). Let's consider visually what this means:

In the visualization, we are given a blue area with a probability of 0.2. We however want
to know the value that is associated with the yellow dotted vertical line. This value is the
0.2 quantile (= 20th percentile) and divides the curve into an area that contains the lower
20% of the scores and an area that contains the rest of the scores. If our variable is normally
distributed, in R we can use the function qnorm() to do so. We specify the
probability as the first parameter, then the mean, and then the standard
deviation, for example qnorm(0.2, mean = 25, sd = 5).

# 85th percentile of female hair length


qnorm(0.85, mean = 25, sd = 5)
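Since qnorm() inverts pnorm(), feeding the result back into pnorm() should recover the original probability. This round-trip check is an addition to the original exercise.

```r
# the 85th percentile, and the round trip back to the probability
q85 <- qnorm(0.85, mean = 25, sd = 5)
round(q85, 2)                            # about 30.18
round(pnorm(q85, mean = 25, sd = 5), 2)  # 0.85
```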

The normal distribution and Z scores


A special form of the normal probability distribution is the standard normal distribution,
also known as the z-distribution. A z-distribution has a mean of 0 and a standard
deviation of 1. Often you can transform variables to z-values. You can transform the
values of a variable to z-scores by subtracting the mean and dividing the result by the
standard deviation. If you perform this transformation on the values of a data set, your
transformed data set will have a mean of 0 and a standard deviation of 1. The formula to
transform a value to a z-score is the following:

Zᵢ = (xᵢ − x̄) / sₓ

The z-score represents how many standard deviations from the mean a value lies.

# calculate the z value and store it in the variable z_value

z_value <- 2.6
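The transformation generalizes to a whole vector. This sketch is an addition beyond the original exercise: it standardizes simulated data and verifies the resulting mean of 0 and standard deviation of 1.

```r
# transform a simulated variable to z-scores
set.seed(11225)
x <- rnorm(1000, mean = 25, sd = 5)
z <- (x - mean(x)) / sd(x)

round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```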

The binomial distribution


The binomial distribution is important for discrete variables. There are a few conditions
that need to be met before you can consider a random variable to be binomially distributed:

1. There is a phenomenon or trial with two possible outcomes and a constant
probability of success - this is called a Bernoulli trial
2. All trials are independent

Other ingredients that are essential to a binomial distribution are that we need to observe
a certain number of trials, let's call this n, and we count the number of successes in
which we are interested, let's call this x. Useful summary statistics for a binomial
distribution are the same as for the normal distribution: the mean and the standard
deviation.
The mean is calculated by multiplying the number of trials n by the probability of a
success, denoted by p. The standard deviation of a binomial distribution is calculated by
the following formula: √(n ∗ p ∗ (1 − p)).

# calculate the mean and store it in the variable mean_chance

mean_chance <- 25 * 0.2

# calculate the standard deviation and store it in the variable std_chance

std_chance <- sqrt(25 * 0.2 * (1 - 0.2))
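A quick simulation with rbinom() (not part of the original exercise) confirms these formulas: with n = 25 and p = 0.2, the mean should be near 5 and the standard deviation near 2.

```r
# simulate many exam scores and compare with n*p and sqrt(n*p*(1-p))
set.seed(11225)
scores <- rbinom(100000, size = 25, prob = 0.2)

mean(scores)  # close to 25 * 0.2 = 5
sd(scores)    # close to sqrt(25 * 0.2 * 0.8) = 2
```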

Calculating probabilities of binomial distributions in R
Just as with the normal distribution, we can also calculate probabilities according to the
binomial distributions. Let's consider the example in the previous question. We had an
exam with 25 questions and 0.2 probability of guessing a question correctly. In contrast
to the normal distribution, when we have to deal with a binomial distribution we can
calculate the probability of exactly answering say 5 questions correctly. This is because
a binomial distribution is a discrete distribution.

When we want to calculate the probability of answering 5 questions correctly, we can
use the dbinom() function. This function calculates an exact probability. If we would like to
calculate an interval of probabilities, say the probability of answering 5 or more questions
correctly, we can use the pbinom() function. We have already seen a similar function
when we were dealing with the normal distribution: the pnorm() function.

# probability of answering 5 questions correctly

five_correct <- dbinom(5, size = 25, prob = 0.2)

# probability of answering at least 5 questions correctly

atleast_five_correct <- pbinom(4, size = 25, prob = 0.2, lower.tail = FALSE)
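The relationship between the two functions can be verified directly: pbinom(4, ..., lower.tail = FALSE) is P(X ≥ 5), which should equal the sum of the exact probabilities for 5 through 25 correct answers. This check is an addition to the exercise.

```r
# "at least 5 correct" two ways: upper tail vs. summed exact probabilities
atleast_five <- pbinom(4, size = 25, prob = 0.2, lower.tail = FALSE)
summed <- sum(dbinom(5:25, size = 25, prob = 0.2))
all.equal(atleast_five, summed)  # TRUE
```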


Quantiles and the binomial distribution
Remember the concept of quantiles? If not, let me briefly recap it. Quantiles are used
when you have a probability and you want to associate this probability with a value. In
our last example we had 25 questions and the probability of guessing a question
correctly was 0.2. Also, in our last example we wanted to know the probability of
answering at least 5 questions correctly and used the pbinom() function to do so. With
quantiles, we do the exact opposite; we want to calculate the value that is associated
with for instance the 0.2 quantile (=20th percentile). In case we are working with a
binomial distribution, we can use the function qbinom() for this.
# calculate the 60th percentile

qbinom(0.6, 25, 0.2)
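As a sketch of what qbinom() returns (an addition to the exercise): it gives the smallest number of correct answers x for which the cumulative probability pbinom(x, ...) reaches the requested quantile.

```r
# the 60th percentile, and the cumulative probabilities around it
q60 <- qbinom(0.6, size = 25, prob = 0.2)
q60                                            # 5
pbinom(q60, size = 25, prob = 0.2) >= 0.6      # TRUE
pbinom(q60 - 1, size = 25, prob = 0.2) < 0.6   # TRUE
```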
