
GV207 – Political Analysis, Week 04

Department of Government, University of Essex

Probabilities and the Normal Distribution

Importance of the normal distribution




Many real-world variables are approximately normally distributed: height, shoe size, the length of
tree leaves, IQ, and also marks in class tests.
In the case of a normal distribution, mean = median = mode.
Most importantly, we know the properties of the normal distribution: about 68% of the area under the
curve lies within one standard deviation of the mean (one to the left and one to the right), around
95% within two standard deviations,1 and about 99.7% within three standard deviations.
A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard normal
distribution.
Every normal distribution can be transformed into a standard normal distribution using the
z-transformation.
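
The proportions quoted above can be checked directly in Stata. The lines below are an illustrative sketch rather than part of the lecture material. They use normal(), the cumulative standard normal distribution function introduced later in this handout, and invnormal(), its inverse.

```stata
* Area within 1, 2 and 3 standard deviations of the mean
* (roughly 0.683, 0.954 and 0.997)
display normal(1) - normal(-1)
display normal(2) - normal(-2)
display normal(3) - normal(-3)

* z-score that leaves 95% of the area in the middle (roughly 1.96)
display invnormal(0.975)
```

The last line shows where the figure of roughly two standard deviations for 95% comes from.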

The z-transformation2
When we have a sample, we can use the mean (x̄) and the standard deviation (s) of a normally
distributed variable (X) to calculate the z-score for a particular observation’s value (x). The z-score
indicates by how many standard deviations the value deviates from the mean. Z-scores enable us to
find the percentile of a specific observation because the percentiles corresponding to all z-scores
are already tabulated.

z = (x − x̄) / s

Starting from the z-score for an observation, plus the mean and standard deviation of our variable, we
can rearrange the z-score equation to recover the value of our observation:

x = x̄ + z × s
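
As a quick numerical check, the z-score (the value minus the mean, divided by the standard deviation) and its rearrangement can be computed with Stata's display command. The numbers are assumptions chosen for illustration: a mean of 70 and a standard deviation of 20 (the same values used in the Stata section of this handout) and a hypothetical observed value of 100.

```stata
* Hypothetical numbers: mean = 70, standard deviation = 20, observed value = 100

* z-score: how many standard deviations the value lies from the mean (gives 1.5)
display (100 - 70)/20

* Reversing the step: recover the value from its z-score (gives 100)
display 70 + 1.5*20
```

A z-score of 1.5 means the observation lies one and a half standard deviations above the mean.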

1 The precise number is actually about 1.96, but we usually use 2 for the sake of simplicity.
2 Not to be confused with the z-transformation that comes up on Wikipedia. What we mean here is the
transformation of an observation’s value so that we get a z-score.

How to compute z-scores in Stata
We can generate our z-scores in Stata with the use of two commands, one of which we’ve already
seen: summarize and generate.
First we need to find out the mean and the standard deviation of the variable we want to compute
z-scores for. To do so we simply use:
summarize varname

Using the mean and the standard deviation, we can generate our new variable (z_varname) by using
the equation for the z-score as given above. Assume that the mean of our variable is 70 and the
standard deviation is 20.
generate z_varname = (varname - 70)/20

This generates a new variable z_varname where the values are the z-scores which correspond to the
values of varname.
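
Typing the mean and standard deviation by hand invites rounding and copying errors. As a possible alternative (a sketch, not the method shown in the lecture), summarize leaves its results in r(mean) and r(sd), which can be used directly in generate. The example assumes Stata's built-in auto data set and its variable mpg purely for illustration; it is not the course data.

```stata
* Built-in example data (an assumption for illustration; not the course data set)
sysuse auto, clear

* summarize stores the mean and standard deviation in r(mean) and r(sd)
summarize mpg

* Use the stored results instead of typing the numbers by hand
generate z_mpg = (mpg - r(mean))/r(sd)

* A standardised variable should have mean (approximately) 0 and sd 1
summarize z_mpg

* egen's std() function produces the same z-scores in one step
egen z_mpg2 = std(mpg)
```

Note that r(mean) and r(sd) are overwritten by the next summarize, so the generate line has to come directly after the summarize of the original variable.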

Calculating the area under the curve above or below an observation
In the lecture you were told how to find out what proportion of observations lies above or below a
particular value of a variable.3 In this case, you consulted a table (from a statistics textbook) that told
you the proportion of observations that corresponds to the z-score. While this is a perfectly valid
approach, it becomes time-consuming if we want to calculate this for the values of every
observation on our variable. Thankfully, Stata allows us to do this much more quickly.
The function normal() gives us the value of the cumulative standard normal distribution,4 i.e. the
proportion of a normally distributed variable’s observations that lies below a particular z-score. Thus
we just combine this function with generate to create a new variable called below_varname that tells
us what proportion of observations lies below the value of each observation on our variable:
generate below_varname = normal(z_varname)

If instead we want to know the proportion of observations above the value of a particular observation
then we don’t have to change too much. If we know the proportion below (Pbelow) then simply:
Pabove = 1 – Pbelow
Or if we want to create a new variable called above_varname that provides us with the proportion of
observations above a particular observation’s value, we simply do:
generate above_varname = 1 - normal(z_varname)
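
For a single z-score, the same calculation can be done with display instead of generate. The z-score of 1.5 below is a hypothetical value (for example, an observation of 100 when the mean is 70 and the standard deviation is 20).

```stata
* Hypothetical z-score of 1.5

* Proportion of observations below (roughly 0.933)
display normal(1.5)

* Proportion of observations above (roughly 0.067)
display 1 - normal(1.5)
```

These proportions describe the data only to the extent that the variable really is normally distributed. For a clearly skewed variable they are at best a rough approximation.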

3 Proportions here are equivalent to percentages. If the proportion of observations below a particular value is
0.2, this is equivalent to saying 20% of the observations are below this particular value.
4 As noted above, the standard normal distribution is a normal distribution with mean 0 and standard deviation
1. This is why the z-transformation is sometimes called “standardisation”.

Stata exercise
As always, we will use the data set Democracy small.dta.
1. Find a continuous variable in the data set.
2. Check to see if it is normally distributed using the commands shown last week (histogram and
kdensity). Does it look normally distributed?
3. If it isn’t, find another continuous variable and do the same checks. If you still don’t have a
normally distributed variable, just continue on to the next part with this variable.
4. Calculate the mean and standard deviation of this variable using the summarize command.
5. Using the list command and the appropriate if conditions, list the values of this variable as well as
the names of the countries in the data set.
6. Find the value of this variable for your home country and calculate (on paper or in your head) the
corresponding z-score. How many standard deviations is it above or below the mean?
7. Using the commands outlined above, create a new variable that is the z-score associated with the
variable you chose. Make sure to give the variable a meaningful name. You can also label the
variable for more detail.
8. Using your new variable, generate a new variable that is the proportion of observations below the
associated z-score.
9. Use the list command again with the appropriate if conditions to find out what proportion of
observations lie below your home country’s value for this variable.
10. Knowing this, what proportion of countries have a value higher than that of your home country?
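
One possible outline of the Stata steps is sketched below. The variable names myvar and country, and the country value "United Kingdom", are placeholders and not the actual names in the data set; replace them with the variable you chose, the country-name variable and your home country. The sketch also assumes the data file is in the current working directory and that the country variable is stored as text.

```stata
* Placeholder names: myvar, country and "United Kingdom" are assumptions
use "Democracy small.dta", clear

* Step 2: visual checks, with a normal curve overlaid for comparison
histogram myvar, normal
kdensity myvar, normal

* Steps 4 and 7: mean and standard deviation, then the z-score
summarize myvar
generate z_myvar = (myvar - r(mean))/r(sd)
label variable z_myvar "z-score of myvar"

* Steps 8 and 10: proportions below and above each observation
generate below_myvar = normal(z_myvar)
generate above_myvar = 1 - below_myvar

* Steps 5 and 9: look up one country
list country myvar z_myvar below_myvar above_myvar if country == "United Kingdom"
```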
