Computing probabilities
Probability refers to the relative likelihood of occurrence of any given outcome or event. Alternatively, the probability associated with an event is the number of times that event can occur, relative to the total number of times any event can occur.
Pr(given outcome) = (number of times outcome can occur) / (total number of times any outcome can occur)
Example: If a classroom contains 20 Democrats and 10 Republicans, then the probability that a randomly selected student will be a Democrat is 20/(20 + 10) = .667. Another example: the probability of picking the ace of spades in a single draw from a deck of cards is 1/52; the probability of drawing any ace is 4/52.
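These two calculations can be sketched directly in R (the variable names are illustrative):

```r
# Probability = favorable outcomes / total outcomes
democrats   <- 20
republicans <- 10
p_democrat  <- democrats / (democrats + republicans)
round(p_democrat, 3)    # 0.667

# Single draw from a 52-card deck
p_ace_spades <- 1 / 52
p_any_ace    <- 4 / 52  # four aces in the deck
```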
Probability distributions
A probability distribution is directly analogous to a frequency distribution, except that it is based on theory (probability theory) rather than on what is observed in sample data. In a probability distribution, we determine the possible values of outcomes and compute the probability associated with each outcome. Example:
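As a sketch, the two-coin-toss distribution plotted below can be built with dbinom(), which returns theoretical (not simulated) probabilities:

```r
# Theoretical probability distribution of the number of heads
# in two tosses of a fair coin
tab <- dbinom(0:2, size = 2, prob = 0.5)
names(tab) <- 0:2
tab            # 0.25 0.50 0.25
barplot(tab)   # one bar per outcome, height = probability
```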
[Figure: barplot(tab) — barplot of the probability distribution, with probabilities from 0.0 to 0.4 on the vertical axis]
We already know how to compute the mean of any given frequency distribution. Returning to the two-coin-flip example:
> # Means of frequency distributions of 2 coin tosses
> mean(rbinom(10, 2, .5))
[1] 0.9
> mean(rbinom(100, 2, .5))
[1] 1.04
> mean(rbinom(1000, 2, .5))
[1] 0.961
> mean(rbinom(10000, 2, .5))
[1] 0.9987
If we performed this test an infinite number of times, we would expect the average to be 1.0. This is why we call the mean of a probability distribution an expected value.
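The expected value can also be computed exactly from the probability distribution itself, with no simulation, as the sum of each outcome weighted by its probability:

```r
# E[X] = sum over outcomes of (outcome * probability)
x <- 0:2                            # possible numbers of heads
p <- dbinom(x, size = 2, prob = 0.5)
sum(x * p)                          # 1: expected number of heads
```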
For sample statistics, we use Roman characters. For population parameters, we use Greek characters.
                     Sample notation   Population notation
Mean                 X̄                 μ
Standard deviation   s                 σ
Variance             s²                σ²
[Figure: the normal density curve dnorm(x), plotted for x from −4 to 4]

The normal probability density function is

f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))

where σ > 0 is the standard deviation and μ is the expected value
Typically we consider the area relative to standard deviations from the mean. A constant proportion of the total area under the normal curve will lie between the mean and any given distance measured in units of σ. For instance, the area under the normal curve between the mean and the point 1σ above the mean always contains 34.13% of cases. The same distance below the mean contains the identical proportion of cases.
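We can confirm the 34.13% figure with pnorm(), R's normal cumulative distribution function:

```r
# Area under the standard normal curve between the mean (z = 0)
# and one standard deviation above it (z = 1)
round(pnorm(1) - pnorm(0), 4)    # 0.3413

# By symmetry, the same area lies between z = -1 and the mean
round(pnorm(0) - pnorm(-1), 4)   # 0.3413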
To determine the probability that a random variable X with a normal distribution is less than or equal to x, we evaluate the cumulative distribution function of the normal probability distribution at x. The cdf of the normal distribution is:

Φ(x) = (1 / (σ√(2π))) ∫ from −∞ to x of e^(−(t − μ)² / (2σ²)) dt
This equation (and the integral in the cdf) makes it possible for us to determine the total area under the curve for any given distance from the mean
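As a sketch, we can check this by writing out the density explicitly (here with μ = 0, σ = 1) and integrating it numerically; the result matches R's built-in cdf, pnorm():

```r
# Normal density written out explicitly (defaults: standard normal)
f <- function(t, mu = 0, sigma = 1)
  exp(-(t - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))

# Integrate the density from -Inf up to 1.4 and compare with the cdf
integrate(f, lower = -Inf, upper = 1.4)$value   # ~0.9192
pnorm(1.4)                                      # 0.9192433
```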
For instance, what if we wanted to know the percentage of cases lying beyond 1.4σ above the mean? We can do this in R, or use Table A from Appendix C of LF&F. In R:
> 1 - pnorm(1.4)
[1] 0.08075666
> round((1 - pnorm(1.4)) * 100, 2)
[1] 8.08
Standardized scores
We can transform any distribution into a set of standard deviations from the mean; this is called a z score or standardized score. The z score measures the direction and degree to which any given raw score deviates from the mean of a distribution, on a scale of σ units. Formula for computing a z score:

zi = (Xi − X̄) / s
In order to convert any raw scores into scores that can be assessed using the normal curve, we convert them into σ units (i.e., z scores) by standardizing them. The normal curve can then be used in conjunction with z scores and Table A (or pnorm) to determine the probability of obtaining any raw score in a distribution.
Example
Assume we have a normal distribution of hourly wages
The mean wage is $10, and the standard deviation is $1.5. We wish to find the probability of an hourly wage of $12.
What is the probability of obtaining a score between the mean of $10 and the value of $12?
Example continued
Steps:
1. Standardize the raw score of $12: z = (12 − 10)/1.5 = 1.33
2. Use Table A to find the total frequency under the normal curve between z = 1.33 and the mean. This is p = .4082
Alternatively, we could have used R:
> pnorm(1.33) - pnorm(0)
[1] 0.4082409
Standard normal curve: a special version of the normal curve with μ = 0, σ² = 1
Example variation
What if we had wanted to find the probability that a wage might be greater than the observed value (in this case, $12)? In this case, we would integrate from 1.33 to +∞
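A sketch of this upper-tail probability in R, using the wage example's numbers:

```r
# P(wage > 12) when wages ~ Normal(mean = 10, sd = 1.5)
z <- (12 - 10) / 1.5                # 1.33 (rounded)
1 - pnorm(z)                        # upper-tail area beyond z

# Equivalently, without standardizing by hand:
pnorm(12, mean = 10, sd = 1.5, lower.tail = FALSE)
```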
Example variation 2
We could also have obtained the probability that a wage would be either less than $8 or more than $12. The transformation would then yield z scores of −1.33 and 1.33. In R:
> pnorm(1.33) - pnorm(-1.33)
[1] 0.8164817
This means that 1 − .8165 = .1835 of the area lies below $8 and above $12
[Figure: standard normal density dnorm(x), with the region between z = −1.33 and z = 1.33 marked]
Standardizing scores in R
> x <- c(1, 4, 5, 7, 14, 0, 21)
> (stdx <- (x - mean(x)) / sd(x))
[1] -0.8518410 -0.4543152 -0.3218066 -0.0567894  0.8707708 -0.9843496  1.7983310
> scale(x)
           [,1]
[1,] -0.8518410
[2,] -0.4543152
[3,] -0.3218066
[4,] -0.0567894
[5,]  0.8707708
[6,] -0.9843496
[7,]  1.7983310
attr(,"scaled:center")
[1] 7.428571
attr(,"scaled:scale")
[1] 7.54668
Visual inspection of the kernel density. Some formal tests also exist, e.g. the Anderson-Darling test, the Kolmogorov-Smirnov test, and the Pearson χ² test. A Q-Q plot can be used to visualize normality
A quantile-quantile plot plots the ranked sample values against an equal number of ranked quantiles from a normal distribution; normality shows up as a straight-line correspondence. In R: qqnorm()
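A minimal sketch combining a Q-Q plot with one of the tests mentioned above (Kolmogorov-Smirnov), on simulated data:

```r
set.seed(42)                       # illustrative simulated sample
x <- rnorm(200, mean = 10, sd = 1.5)

qqnorm(x)                          # sample vs. theoretical quantiles
qqline(x)                          # reference line; normal data hug it

# Kolmogorov-Smirnov test against the standard normal,
# after standardizing the sample
ks.test(as.numeric(scale(x)), "pnorm")
```

Note that standardizing with parameters estimated from the same sample makes this K-S test only approximate; it is shown here as a sketch, not a rigorous procedure.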
[Figure: normal Q-Q plot, sample quantiles plotted against theoretical quantiles]