Computing probabilities
Probability refers to the relative likelihood of occurrence of any given outcome or event. Alternatively, the probability associated with an event is the number of times that event can occur, relative to the total number of times any event can occur.
Pr(given outcome) = (number of times outcome can occur) / (total number of times any outcome can occur)
Example: If a classroom contains 20 Democrats and 10 Republicans, then the probability that a randomly selected student will be a Democrat is 20/(20 + 10) = .667. Another example: the probability of picking the ace of spades in a single draw from a deck of cards is 1/52; the probability of drawing any ace is 4/52.
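These two calculations can be sketched directly in R (the variable names are illustrative):

```r
# Probability = favorable outcomes / total outcomes
democrats   <- 20
republicans <- 10
p_democrat  <- democrats / (democrats + republicans)
round(p_democrat, 3)    # 0.667

# Single draw from a 52-card deck
p_ace_spades <- 1 / 52
p_any_ace    <- 4 / 52  # four aces in the deck
```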
Probability distributions
A probability distribution is directly analogous to a frequency distribution, except that it is based on theory (probability theory) rather than on what is observed in sample data. In a probability distribution, we determine the possible values of outcomes and compute the probability associated with each outcome. Example:
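As a sketch, the two-coin-toss distribution plotted below can be built with dbinom(), which returns theoretical (not simulated) probabilities:

```r
# Theoretical probability distribution of the number of heads
# in two tosses of a fair coin
tab <- dbinom(0:2, size = 2, prob = 0.5)
names(tab) <- 0:2
tab            # 0.25 0.50 0.25
barplot(tab)   # one bar per outcome, height = probability
```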
[Figure: barplot(tab) — barplot of the probability distribution, with probabilities from 0.0 to 0.4 on the vertical axis]
We already know how to compute the mean of any given frequency distribution. Returning to the two-coin-flip example:
> # Means of frequency distributions of 2 coin tosses
> mean(rbinom(10, 2, .5))
[1] 0.9
> mean(rbinom(100, 2, .5))
[1] 1.04
> mean(rbinom(1000, 2, .5))
[1] 0.961
> mean(rbinom(10000, 2, .5))
[1] 0.9987
If we performed this test an infinite number of times, we would expect the average to be 1.0. This is why we call the mean of a probability distribution an expected value.
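The expected value can also be computed exactly from the probability distribution itself, with no simulation, as the sum of each outcome weighted by its probability:

```r
# E[X] = sum over outcomes of (outcome * probability)
x <- 0:2                            # possible numbers of heads
p <- dbinom(x, size = 2, prob = 0.5)
sum(x * p)                          # 1: expected number of heads
```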
For sample statistics, we use Roman characters. For population parameters, we use Greek characters.
                     Sample notation   Population notation
Mean                 X̄                 μ
Standard deviation   s                 σ
Variance             s²                σ²
[Figure: the normal density curve dnorm(x), plotted for x from −4 to 4]

The normal probability density function is

f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))

where σ > 0 is the standard deviation and μ is the expected value
Typically we consider the area relative to standard deviations from the mean. A constant proportion of the total area under the normal curve will lie between the mean and any given distance measured in units of σ. For instance, the area under the normal curve between the mean and the point 1σ above the mean always contains 34.13% of cases. The same distance below the mean contains the identical proportion of cases.
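We can confirm the 34.13% figure with pnorm(), R's normal cumulative distribution function:

```r
# Area under the standard normal curve between the mean (z = 0)
# and one standard deviation above it (z = 1)
round(pnorm(1) - pnorm(0), 4)    # 0.3413

# By symmetry, the same area lies between z = -1 and the mean
round(pnorm(0) - pnorm(-1), 4)   # 0.3413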
To determine the probability that a random variable X with a normal distribution is less than or equal to x, we evaluate the cumulative distribution function of the normal probability distribution at x. The cdf of the normal distribution is:

Φ(x) = (1 / (σ√(2π))) ∫ from −∞ to x of e^(−(t − μ)² / (2σ²)) dt
This equation (and the integral in the cdf) makes it possible for us to determine the total area under the curve for any given distance from the mean
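As a sketch, we can check this by writing out the density explicitly (here with μ = 0, σ = 1) and integrating it numerically; the result matches R's built-in cdf, pnorm():

```r
# Normal density written out explicitly (defaults: standard normal)
f <- function(t, mu = 0, sigma = 1)
  exp(-(t - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))

# Integrate the density from -Inf up to 1.4 and compare with the cdf
integrate(f, lower = -Inf, upper = 1.4)$value   # ~0.9192
pnorm(1.4)                                      # 0.9192433
```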
For instance, what if we wanted to know the percentage of cases lying beyond 1.4σ above the mean? We can do this in R, or use Table A from Appendix C of LF&F. In R:
> 1 - pnorm(1.4)
[1] 0.08075666
> round((1 - pnorm(1.4)) * 100, 2)
[1] 8.08
Standardized scores
We can transform any distribution into a set of standard deviations from the mean; this is called a z score or standardized score. The z score measures the direction and degree to which any given raw score deviates from the mean of a distribution, on a scale of σ units. Formula for computing a z score:

zi = (Xi − X̄) / s
In order to convert any raw scores into scores that can be assessed using the normal curve, we convert them into σ units (i.e., z scores) by standardizing them. The normal curve can then be used in conjunction with z scores and Table A (or pnorm) to determine the probability of obtaining any raw score in a distribution.
Example
Assume we have a normal distribution of hourly wages
The mean wage is $10, and the standard deviation is $1.5. We wish to find the probability of an hourly wage of $12.
What is the probability of obtaining a score between the mean of $10 and the value of $12?
Example continued
Steps:
1. Standardize the raw score of $12: z = (12 − 10)/1.5 = 1.33
2. Use Table A to find the total frequency under the normal curve between z = 1.33 and the mean. This is p = .4082
Alternatively, we could have used R:
> pnorm(1.33) - pnorm(0)
[1] 0.4082409
Standard normal curve: a special version of the normal curve with μ = 0, σ² = 1
Example variation
What if we had wanted to find the probability that a wage might be greater than the observed value (in this case, $12)? In this case, we would integrate from 1.33 to +∞
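A sketch of this upper-tail probability in R, using the wage example's numbers:

```r
# P(wage > 12) when wages ~ Normal(mean = 10, sd = 1.5)
z <- (12 - 10) / 1.5                # 1.33 (rounded)
1 - pnorm(z)                        # upper-tail area beyond z

# Equivalently, without standardizing by hand:
pnorm(12, mean = 10, sd = 1.5, lower.tail = FALSE)
```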
Example variation 2
We could also have obtained the probability that a wage would be either less than $8 or more than $12. The transformation would then yield z scores of −1.33 and 1.33. In R:
> pnorm(1.33) - pnorm(-1.33)
[1] 0.8164817
This means that 1 − .8165 = .1835 of the area lies below $8 and above $12
[Figure: standard normal density dnorm(x), with the region between z = −1.33 and z = 1.33 marked]
Standardizing scores in R
> x <- c(1, 4, 5, 7, 14, 0, 21)
> (stdx <- (x - mean(x)) / sd(x))
[1] -0.8518410 -0.4543152 -0.3218066 -0.0567894  0.8707708 -0.9843496  1.7983310
> scale(x)
           [,1]
[1,] -0.8518410
[2,] -0.4543152
[3,] -0.3218066
[4,] -0.0567894
[5,]  0.8707708
[6,] -0.9843496
[7,]  1.7983310
attr(,"scaled:center")
[1] 7.428571
attr(,"scaled:scale")
[1] 7.54668
Visual inspection of the kernel density. Some formal tests also exist, e.g. the Anderson-Darling test, the Kolmogorov-Smirnov test, and the Pearson χ² test. A Q-Q plot can be used to visualize normality
A quantile-quantile plot plots the ranked sample values against an equal number of ranked quantiles from a normal distribution; normality shows up as a straight-line correspondence. In R: qqnorm()
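A minimal sketch combining a Q-Q plot with one of the tests mentioned above (Kolmogorov-Smirnov), on simulated data:

```r
set.seed(42)                       # illustrative simulated sample
x <- rnorm(200, mean = 10, sd = 1.5)

qqnorm(x)                          # sample vs. theoretical quantiles
qqline(x)                          # reference line; normal data hug it

# Kolmogorov-Smirnov test against the standard normal,
# after standardizing the sample
ks.test(as.numeric(scale(x)), "pnorm")
```

Note that standardizing with parameters estimated from the same sample makes this K-S test only approximate; it is shown here as a sketch, not a rigorous procedure.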
[Figure: normal Q-Q plot, sample quantiles plotted against theoretical quantiles]