# Some definitions

◦ Individual: each object described by a set of data
◦ Variable: any characteristic of an individual
  ⋄ Categorical variable: places an individual into one of several groups or categories.
  ⋄ Quantitative variable: takes numerical values on which we can do arithmetic.
◦ Distribution of a variable: tells what values it takes and how often it takes these values.

Example: The following data set consists of five variables about 20 individuals.

| ID | Age | Education | Sex | Total income | Job class |
|---|---|---|---|---|---|
| 1 | 43 | 4 | 1 | 18526 | 5 |
| 2 | 35 | 3 | 2 | 5400 | 7 |
| 3 | 43 | 2 | 1 | 3900 | 7 |
| 4 | 33 | 3 | 1 | 28003 | 5 |
| 5 | 38 | 3 | 2 | 43900 | 7 |
| 6 | 53 | 4 | 1 | 53000 | 5 |
| 7 | 64 | 6 | 1 | 51100 | 6 |
| 8 | 27 | 4 | 2 | 44000 | 5 |
| 9 | 34 | 4 | 1 | 31200 | 5 |
| 10 | 27 | 3 | 2 | 26030 | 5 |
| 11 | 47 | 6 | 1 | 6000 | 6 |
| 12 | 48 | 3 | 1 | 8145 | 5 |
| 13 | 39 | 2 | 1 | 37032 | 5 |
| 14 | 30 | 3 | 2 | 30000 | 5 |
| 15 | 35 | 3 | 2 | 17874 | 5 |
| 16 | 47 | 4 | 2 | 400 | 5 |
| 17 | 51 | 4 | 2 | 22216 | 5 |
| 18 | 56 | 5 | 1 | 26000 | 6 |
| 19 | 57 | 6 | 1 | 100267 | 7 |
| 20 | 34 | 1 | 1 | 15000 | 5 |

◦ Age: age in years
◦ Education: 1=no high school, 2=some high school, 3=high school diploma, 4=some college, 5=bachelor's degree, 6=postgraduate degree
◦ Sex: 1=male, 2=female
◦ Total income: income from all sources
◦ Job class: 5=private sector, 6=government, 7=self employed

Variables Age and Total income are quantitative; variables Education, Sex, and Job class are categorical.
Graphical Description of Data, Jan 5, 2004 -1-

Categorical variable analysis
Questions to ask about a categorical variable:
◦ How many categories are there?
◦ In each category, how many observations are there?

Bar graphs and pie charts
Categorical data can be displayed by bar graphs or pie charts.
◦ In a bar graph, the horizontal axis lists the categories, in any order. The height of the bars can be either counts or percentages.
◦ For better comparison of the frequencies, the categories can be ordered from most frequent to least frequent.
◦ In a pie chart, the area of each slice is proportional to the percentage of individuals who fall into that category.

Example: Education of people aged 25 to 34
[Figure: two bar graphs of education level for people aged 25 to 34 (categories in their original order and sorted by frequency; x-axis: Education level, y-axis: Percent of people aged 25 to 34) and a pie chart with slices of 3.6%, 6.7%, 7.5%, 22.7%, 29.1%, and 30.4%.]


Categorical variable analysis
Example: Education of people aged 25 to 34
STATA commands:
. infile ID AGE EDUC SEX EARN JOB using individuals.txt, clear
. drop if AGE<25 | AGE>34
. label values EDUC Education
. label define Education 1 "no HS" 2 "some HS" 3 "HS diploma" 4 "Bachelor's"
>     5 "some college" 6 "postgrad"
. set scheme s1mono
. gen COUNT=100/_N
. graph bar (sum) COUNT, over(EDUC) ytitle("Percent of people aged 25 to 34")
>     b1title("Education level")
. translate @Graph bar1.eps, translator(Graph2eps) replace
. graph bar (sum) COUNT, over(EDUC, sort(1) descending)
>     ytitle("Percent of people aged 25 to 34") b1title("Education level")
. translate @Graph bar2.eps, translator(Graph2eps) replace
. set scheme s1color
. graph pie COUNT, over(EDUC) plabel(_all perc, format(%4.1f) gap(-5))
. translate @Graph pie.eps, translator(Graph2eps) replace


Quantitative variables: stemplots
Example: Sammy Sosa home runs
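| Year | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Home runs | 4 | 15 | 10 | 8 | 33 | 25 | 36 | 40 | 36 | 66 | 63 | 50 | 64 | 49 | 40 |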

Producing stemplots in STATA:
. infile YEAR HR using sosa.dat
. stem HR

Stem-and-leaf plot for HR

  0* | 48
  1* | 05
  2* | 5
  3* | 366
  4* | 009
  5* | 0
  6* | 346

How to make a stemplot
1. Separate each observation into a stem and a leaf, e.g. 15 → stem 1, leaf 5 and 4 → stem 0, leaf 4.
2. Write the stems in a vertical column in increasing order.
3. Write each leaf next to its stem, in increasing order out from the stem.

How to choose the stem
◦ Rounding: each leaf should have exactly one digit, so rounding long numbers before producing the stemplot can help produce a more compact and informative plot.
◦ Splitting: if each stem (or many stems) has a large number of leaves, all stems can be split, with leaves of 0-4 going to the first stem and 5-9 going to the second.
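The three steps above can be sketched in a few lines of Python (used here instead of STATA only so the example is easy to run). The function name `stemplot` is hypothetical, and the sketch assumes non-negative integers with the last digit as the leaf; it does no rounding or stem splitting.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines; assumes non-negative integers (stem = value // 10)."""
    leaves = defaultdict(list)
    for v in sorted(values):              # sorting puts leaves in increasing order
        stem, leaf = divmod(v, 10)        # step 1: split into stem and leaf
        leaves[stem].append(str(leaf))
    lo, hi = min(leaves), max(leaves)
    # steps 2 and 3: stems in increasing order, empty stems kept so gaps stay visible
    return [f"{stem} | {''.join(leaves.get(stem, []))}" for stem in range(lo, hi + 1)]

home_runs = [4, 15, 10, 8, 33, 25, 36, 40, 36, 66, 63, 50, 64, 49, 40]
for line in stemplot(home_runs):
    print(line)
```

For the Sosa data this should reproduce the stems and leaves of the STATA output, without the `*` markers.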

Quantitative variables: histograms
How to make a histogram
1. Group the observations into "bins" according to their value. Choose the bins carefully: too few hide detail, too many break up the pattern. For simplicity, use equally spaced bins.
2. Count the individuals in each bin.
3. Draw the histogram
◦ Leave no space between bars.
◦ Label the axes with units of measurement.
◦ The y-axis can be counts or percentages (per unit).

Example: Sammy Sosa home runs (yearly home runs 1989 to 2003, as in the stemplot example)
[Figure: density histogram of the home runs; x-axis: Home runs (0 to 70), y-axis: Density.]
The area of each bar is proportional to the percentage of data in that range. We care about the area, not the height, but when the bars have equal width, area is determined by the height.

Quantitative variables: histograms
Example: Sammy Sosa home runs
Histograms with different bin widths:
[Figure: four histograms of Sosa home runs with different bin widths; x-axis: Home Runs (0 to 70), y-axis: Percentage.]

Why is a histogram not a bar graph?
◦ Frequencies are represented by area, not height.
◦ There is no space between the bars.
◦ The horizontal axis represents a numerical quantity, with an inherent order.

Producing histograms in STATA:
. infile YEAR HR using sosa.dat
. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs)
. translate @Graph hist1.eps, translator(Graph2eps) replace
. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs) freq
. translate @Graph hist2.eps, translator(Graph2eps) replace
[Figure: the two resulting histograms of home runs, one on the density scale and one on the frequency scale.]

Interpreting histograms
◦ Describe the overall pattern and any significant deviations from that pattern.
◦ Shape: Is the distribution (approximately) symmetric or skewed?
[Figure: histogram of x with a long right-hand tail; x-axis: x, y-axis: Frequency.]
This distribution is skewed right because it has a long right-hand tail.
◦ Center: Where is the "middle" of the distribution?
◦ Spread: What are the smallest and largest values?
◦ Outliers: Are there any observations that lie outside the overall pattern? They could be unusual observations, or they could be mistakes. Check them!

Example: Newcomb's measurements of the passage time of light (IPS Tab 1.1)
[Figure: histogram of the measurements; x-axis: Time (−60 to 60), y-axis: Frequency.]

Time plots
Example: Average retail price of gasoline from Jan 1988 to Apr 2001
[Figure: time plot; x-axis: Year (1988 to 2000), y-axis: Retail gasoline price (0.9 to 1.8).]

Note: Whenever data are collected over time, it is a good idea to have a time plot. Stemplots and histograms ignore time order, which can be misleading when systematic change over time exists.

Producing a time plot in STATA:
. infile PRICE using gasoline.txt, clear
. * T is not defined in the commands as given; assuming a month index with 0 = Jan 1988
. gen T = _n - 1
. graph twoway line PRICE T, ylabel(0.9(0.1)1.8, format(%3.1f)) xtick(0(12)159)
>     xlabel(0 "1988" 24 "1990" 48 "1992" 72 "1994" 96 "1996" 120 "1998" 144 "2000")
>     xtitle(Year) ytitle(Retail gasoline price)

Measures of center
The mean
The mean of a distribution is the arithmetic average of the observations:
$$\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The median
The median is the midpoint of a distribution: the number $M$ such that
◦ half the observations are smaller and
◦ half are larger.

How to find the median
Suppose the observations are $x_1, x_2, \ldots, x_n$.
1. Arrange the data in increasing order and let $x_{(i)}$ denote the $i$th smallest observation.
2. If the number of observations $n$ is odd, the median is the center observation in the ordered list:
$$M = x_{((n+1)/2)}$$
3. If the number of observations $n$ is even, the median is the average of the two center observations in the ordered list:
$$M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$

Numerical Description of Data, Jan 7, 2004 -1-

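Measures of center
Examples:
Data set 1:

| $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $x_7$ | $x_8$ | $x_9$ |
|---|---|---|---|---|---|---|---|---|
| 2 | 4 | 3 | 4 | 6 | 5 | 4 | -6 | 5 |

Arrange in increasing order:

| $x_{(1)}$ | $x_{(2)}$ | $x_{(3)}$ | $x_{(4)}$ | $x_{(5)}$ | $x_{(6)}$ | $x_{(7)}$ | $x_{(8)}$ | $x_{(9)}$ |
|---|---|---|---|---|---|---|---|---|
| -6 | 2 | 3 | 4 | 4 | 4 | 5 | 5 | 6 |

There is an odd number of observations, so the median is
$$M = x_{((n+1)/2)} = x_{(5)} = 4.$$
The mean is given by
$$\bar{x} = \frac{2 + 4 + 3 + 4 + 6 + 5 + 4 + (-6) + 5}{9} = \frac{27}{9} = 3.$$

Data set 2:

| $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $x_7$ | $x_8$ | $x_9$ | $x_{10}$ |
|---|---|---|---|---|---|---|---|---|---|
| 2.3 | 8.8 | 3.9 | 4.1 | 6.4 | 5.9 | 4.2 | 2.9 | 1.3 | 5.1 |

Arrange in increasing order:

| $x_{(1)}$ | $x_{(2)}$ | $x_{(3)}$ | $x_{(4)}$ | $x_{(5)}$ | $x_{(6)}$ | $x_{(7)}$ | $x_{(8)}$ | $x_{(9)}$ | $x_{(10)}$ |
|---|---|---|---|---|---|---|---|---|---|
| 1.3 | 2.3 | 2.9 | 3.9 | 4.1 | 4.2 | 5.1 | 5.9 | 6.4 | 8.8 |

There is an even number of observations, so the median is
$$M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{4.1 + 4.2}{2} = \frac{8.3}{2} = 4.15.$$
The mean is given by
$$\bar{x} = \frac{2.3 + 8.8 + 3.9 + 4.1 + 6.4 + 5.9 + 4.2 + 2.9 + 1.3 + 5.1}{10} = \frac{44.9}{10} = 4.49.$$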
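As a cross-check of the two examples, here is a minimal Python sketch of the mean and of the odd/even median rule. The function names are illustrative only, and Python is used rather than STATA purely for convenience.

```python
def mean(xs):
    """Arithmetic average of the observations."""
    return sum(xs) / len(xs)

def median(xs):
    """Median via the ordered list, following the odd/even rule."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]            # x_((n+1)/2), shifted to 0-based indexing
    return (s[n // 2 - 1] + s[n // 2]) / 2    # average of x_(n/2) and x_(n/2+1)

data1 = [2, 4, 3, 4, 6, 5, 4, -6, 5]
data2 = [2.3, 8.8, 3.9, 4.1, 6.4, 5.9, 4.2, 2.9, 1.3, 5.1]

print(mean(data1), median(data1))
print(round(mean(data2), 2), round(median(data2), 2))
```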

Mean versus median
◦ The mean is easy to work with algebraically, while the median is not.
◦ The mean is sensitive to extreme observations, while the median is more robust.

Example:
[Figure: observations marked on a number line from 0 to 10.]
For the original observations 0, 1, 2 the mean and median are
$$\bar{x} = \frac{0 + 1 + 2}{3} = 1 \quad\text{and}\quad M = x_{((n+1)/2)} = 1.$$
For the modified observations 0, 1, 10 the mean and median are
$$\bar{x} = \frac{0 + 1 + 10}{3} = 3\tfrac{2}{3} \quad\text{and}\quad M = x_{((n+1)/2)} = 1.$$

◦ If the distribution is exactly symmetric, then mean = median.
◦ In a skewed distribution, the mean is further out in the longer tail than the median.
◦ The median is preferable for strongly skewed distributions, or when outliers are present.

Measures of spread
Example: Monthly returns on two stocks
[Figure: histograms of the returns of Stock A and Stock B; x-axis: Daily returns (in %) from −10 to 20, y-axis: Frequency.]

| | Stock A | Stock B |
|---|---|---|
| Mean | 4.95 | 4.82 |
| Median | 4.99 | 4.68 |

The distributions of the two stocks have approximately the same mean and median, but stock B is more volatile and thus more risky.
◦ Measures of center alone are an insufficient description of a distribution and can be misleading.
◦ The simplest useful numerical description of a distribution consists of both a measure of center and a measure of spread.

Common measures of spread are
◦ the quartiles and the interquartile range
◦ the standard deviation

Quartiles
Quartiles divide data into 4 even parts
◦ Lower (or first) quartile $Q_L$: median of all observations less than the median $M$
◦ Middle (or second) quartile $M = Q_M$: median of all observations
◦ Upper (or third) quartile $Q_U$: median of all observations greater than the median $M$
◦ Interquartile range: $IQR = Q_U - Q_L$, the distance between upper and lower quartile

How to find the quartiles
1. Arrange the data in increasing order and find the median $M$.
2. Find the median of the observations to the left of $M$; that is the lower quartile, $Q_L$.
3. Find the median of the observations to the right of $M$; that is the upper quartile, $Q_U$.

Example:
Data set:

| $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $x_7$ | $x_8$ | $x_9$ |
|---|---|---|---|---|---|---|---|---|
| 2 | 4 | 3 | 4 | 6 | 5 | 4 | -6 | 5 |

Arrange in increasing order:

| $x_{(1)}$ | $x_{(2)}$ | $x_{(3)}$ | $x_{(4)}$ | $x_{(5)}$ | $x_{(6)}$ | $x_{(7)}$ | $x_{(8)}$ | $x_{(9)}$ |
|---|---|---|---|---|---|---|---|---|
| -6 | 2 | 3 | 4 | 4 | 4 | 5 | 5 | 6 |

◦ $Q_L$ is the median of {−6, 2, 3, 4}: $Q_L = 2.5$
◦ $Q_U$ is the median of {4, 5, 5, 6}: $Q_U = 5$
◦ $IQR = 5 − 2.5 = 2.5$
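The quartile recipe can be sketched as follows in Python. This is only one reading of the recipe: "left of M" and "right of M" are taken by position in the ordered list (so tied values equal to M can land in either half, as in the example), and the helper names are hypothetical. Statistical packages often use slightly different quartile conventions.

```python
def median_sorted(s):
    """Median of an already sorted list."""
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def quartiles(xs):
    """Return (Q_L, M, Q_U) using halves taken by position around the median."""
    s = sorted(xs)
    n = len(s)
    lower = s[: n // 2]            # observations to the left of M
    upper = s[(n + 1) // 2:]       # observations to the right of M
    return median_sorted(lower), median_sorted(s), median_sorted(upper)

q_l, m, q_u = quartiles([2, 4, 3, 4, 6, 5, 4, -6, 5])
iqr = q_u - q_l
print(q_l, m, q_u, iqr)
```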

Percentiles
More generally we might be interested in the value which is exceeded only by a certain percentage of observations:
The $p$th percentile of a set of observations is the value such that
◦ $p$% of the observations are less than or equal to it and
◦ $(100 − p)$% of the observations are greater than or equal to it.

How to find the percentiles
1. Arrange the data into increasing order.
2. If $np/100$ is not an integer, then $x_{(k+1)}$ is the $p$th percentile, where $k$ is the largest integer less than $np/100$.
3. If $np/100$ is an integer, the $p$th percentile is the average of $x_{(np/100)}$ and $x_{(np/100+1)}$.

Five-number summary
A numerical summary of a distribution $\{x_1, \ldots, x_n\}$ is given by
$$x_{(1)} \quad Q_L \quad M \quad Q_U \quad x_{(n)}$$
A simple boxplot is a graph of the five-number summary.
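A short Python sketch of the percentile rule, assuming 0 < p < 100 and a hypothetical function name `percentile`. Note that for p = 25 this rule need not give the same value as the "median of the lower half" quartile recipe.

```python
def percentile(xs, p):
    """p-th percentile (0 < p < 100) following the two-case rule."""
    s = sorted(xs)
    n = len(s)
    pos = n * p / 100
    k = int(pos)                       # for non-integer pos: largest integer below it
    if pos != k:
        return s[k]                    # x_(k+1) in 1-based notation
    return (s[k - 1] + s[k]) / 2       # average of x_(np/100) and x_(np/100+1)

data1 = [2, 4, 3, 4, 6, 5, 4, -6, 5]
data2 = [2.3, 8.8, 3.9, 4.1, 6.4, 5.9, 4.2, 2.9, 1.3, 5.1]

print(percentile(data1, 50))   # np/100 = 4.5 is not an integer: x_(5)
print(percentile(data2, 50))   # np/100 = 5 is an integer: average of x_(5) and x_(6)
```

With p = 50 the rule reduces to the median definition, which is a quick sanity check.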

Boxplots
A common "rule" for discovering outliers is the 1.5 × IQR rule:
An observation is a suspected outlier if it lies more than 1.5 × IQR below $Q_L$ or above $Q_U$.

How to draw a boxplot (box-and-whisker plot)
1. A box (the box) is drawn from the lower to the upper quartile ($Q_L$ and $Q_U$).
2. The median of the data is shown by a line in the box.
3. Lines (the whiskers) are drawn from the ends of the box to the most extreme observations within a distance of 1.5 IQR (interquartile range).
4. Measurements falling outside 1.5 IQR from the ends of the box are potential outliers and are marked by ◦ or ∗.

Plotting a boxplot with STATA:
. infile A B using stocks.txt, clear
. label var A "Stock A"
. label var B "Stock B"
. graph box A B, xsize(2) ysize(5)
[Figure: side-by-side boxplots of Stock A and Stock B; vertical axis from −10 to 20.]
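The 1.5 × IQR rule is easy to apply by hand or in code. The sketch below (Python, hypothetical function name) takes the quartiles as given, here the values Q_L = 2.5 and Q_U = 5 from the quartile example, and lists the observations outside the fences.

```python
def suspected_outliers(xs, q_l, q_u):
    """Observations more than 1.5 * IQR below Q_L or above Q_U."""
    iqr = q_u - q_l
    low_fence = q_l - 1.5 * iqr
    high_fence = q_u + 1.5 * iqr
    return [x for x in xs if x < low_fence or x > high_fence]

data = [2, 4, 3, 4, 6, 5, 4, -6, 5]
flagged = suspected_outliers(data, q_l=2.5, q_u=5)   # fences at -1.25 and 8.75
print(flagged)
```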

Boxplots
Interpretation of Box Plots
◦ The IQR is a measure for the sample's variability.
◦ If the whiskers differ in length, the distribution of the data is probably skewed in the direction of the longer whisker.
◦ Very extreme observations (more than 3 IQR away from the lower or upper quartile, respectively) are outliers, with one of the following explanations:
a) The measurement is incorrect (error in measurement process or data processing).
b) The measurement belongs to a different population.
c) The measurement is correct, but represents a rare (chance) event.
We accept the last explanation only after carefully ruling out all others.

Variance and standard deviation
Suppose there are $n$ observations $x_1, x_2, \ldots, x_n$. The variance of the $n$ observations is:
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
This is (approximately) the average of the squared distances of the observations from the mean.
The standard deviation is:
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Why $n − 1$?
Division by $n − 1$ instead of $n$ in the variance calculation is a common cause of confusion. Note that
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0.$$
Thus, if you know any $n − 1$ of the differences, the last difference can be determined from the others. The number of "freely varying" observations, $n − 1$ in this case, is called the "degrees of freedom".
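A minimal Python sketch of these formulas, applied to the nine observations used in the earlier examples (the function name is illustrative). It also checks numerically that the deviations from the mean sum to zero, which is the fact behind the $n − 1$ divisor.

```python
import math

def variance(xs):
    """Sample variance with the n - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

data = [2, 4, 3, 4, 6, 5, 4, -6, 5]
x_bar = sum(data) / len(data)

s2 = variance(data)                              # squared deviations sum to 102, divided by 8
s = math.sqrt(s2)
deviation_sum = sum(x - x_bar for x in data)     # zero, up to rounding

print(s2, round(s, 3), deviation_sum)
```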

Properties of s
◦ Measures spread around the mean =⇒ use only if the mean is used as a measure of center.
◦ $s = 0$ ⇔ all observations are the same
◦ $s$ is in the same units as the measurements, while $s^2$ is in the square of these units.
◦ $s$, like $\bar{x}$, is not resistant to outliers.

Five-number summary versus standard deviation
◦ The 5-number summary is better for describing skewed distributions, since each side has a different spread.
◦ $\bar{x}$ and $s$ are preferred for symmetric distributions with no outliers.

Histograms and density curves
What's in our toolkit so far?
◦ Plot the data: histogram (or stemplot)
◦ Look for the overall pattern and identify deviations and outliers
◦ Numerical summary to briefly describe center and spread

A new idea: If the pattern is sufficiently regular, approximate it with a smooth curve.
Any curve that is always on or above the horizontal axis and has total area underneath equal to one is a density curve.
◦ Area under the curve in a range of values indicates the proportion of values in that range.
◦ Density curves come in a variety of shapes, but the "normal" family of familiar bell-shaped densities is commonly used.
◦ Remember the density is only an approximation, but it simplifies analysis and is generally accurate enough for practical use.

The Normal Distribution, Jan 9, 2004 -1-

Examples
[Figure: histograms with fitted density curves for sulfur oxide (in tons) and for waiting time between eruptions (min); y-axis: Density. In the sulfur oxide panels the same range is shaded once in the histogram and once under the curve.]
The shaded area of the histogram and the shaded area under the curve are nearly equal (0.29 and 0.30).

Median and mean of a density curve
Median: The equal-areas point with 50% of the "mass" on either side.
Mean: The balancing point of the curve, if it were a solid mass.
Note:
◦ The mean and median of a symmetric density curve are equal.
◦ The mean of a skewed curve is pulled away from the median in the direction of the long tail.
The mean and standard deviation of a density are denoted µ and σ, rather than $\bar{x}$ and $s$, to indicate that they refer to an idealized model, and not actual data.

Normal distributions: N(µ, σ)
The normal distribution is
◦ symmetric,
◦ single-peaked,
◦ bell-shaped.
The density curve is given by
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right).$$
It is determined by two parameters µ and σ:
◦ µ is the mean (also the median)
◦ σ is the standard deviation
Note: The point where the curve changes from concave to convex is σ units from µ in either direction.

The 68-95-99.7 rule
◦ About 68% of the data fall inside (µ − σ, µ + σ).
◦ About 95% of the data fall inside (µ − 2σ, µ + 2σ).
◦ About 99.7% of the data fall inside (µ − 3σ, µ + 3σ).

Example
Scores on the Wechsler Adult Intelligence Scale (WAIS) for the 20 to 34 age group are approximately N(110, 25).
◦ About what percent of people in this age group have scores above 110?
◦ About what percent have scores above 160?
◦ In what range do the middle 95% of all scores lie?
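One way to check answers to these questions is with Python's standard-library `statistics.NormalDist`, sketched below. The 68-95-99.7 rule gives approximate answers (50%, about 2.5% since 160 = µ + 2σ, and the range 60 to 160); the code shows the corresponding normal-curve values, which differ slightly because the rule rounds.

```python
from statistics import NormalDist

wais = NormalDist(mu=110, sigma=25)

above_110 = 1 - wais.cdf(110)        # half the scores, by symmetry around the mean
above_160 = 1 - wais.cdf(160)        # 160 is two standard deviations above the mean
middle_95_rule = (110 - 2 * 25, 110 + 2 * 25)   # range given by the 68-95-99.7 rule

print(above_110)
print(round(above_160, 4))
print(middle_95_rule)
```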

Standardization and z-scores
Linear transformation of normal distributions:
$$X \sim N(\mu, \sigma) \;\Rightarrow\; aX + b \sim N(a\mu + b, |a|\sigma)$$
In particular it follows that
$$\frac{X - \mu}{\sigma} \sim N(0, 1).$$
N(0, 1) is called the standard normal distribution.
For a real number $x$ the standardized value or z-score
$$z = \frac{x - \mu}{\sigma}$$
tells how many standard deviations $x$ is from µ, and in what direction.
Standardization enables us to use a standard normal table to find probabilities for any normal variable. For example:
◦ What is the proportion of N(0, 1) observations less than 1.2?
◦ What is the proportion of N(3, 1.5) observations greater than 5?
◦ What is the proportion of N(10, 5) observations between 3 and 9?
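The three questions can be worked through by standardizing first and then using only the standard normal distribution, which is what a table lookup does. In the Python sketch below, `NormalDist().cdf` plays the role of the standard normal table; the helper `z_score` is a hypothetical name.

```python
from statistics import NormalDist

std_normal = NormalDist()            # N(0, 1); its cdf replaces the table lookup

def z_score(x, mu, sigma):
    """Number of standard deviations x lies from mu."""
    return (x - mu) / sigma

# N(0, 1): proportion less than 1.2
p1 = std_normal.cdf(1.2)
# N(3, 1.5): proportion greater than 5, i.e. z > 1.33
p2 = 1 - std_normal.cdf(z_score(5, 3, 1.5))
# N(10, 5): proportion between 3 and 9, i.e. -1.4 < z < -0.2
p3 = std_normal.cdf(z_score(9, 10, 5)) - std_normal.cdf(z_score(3, 10, 5))

print(round(p1, 3), round(p2, 3), round(p3, 3))
```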

Normal calculations
Standard normal calculations
1. State the problem in terms of $x$.
2. Standardize: $z = \frac{x - \mu}{\sigma}$.
3. State the problem in terms of the probability of being less than some number.
4. Look up the required value(s) on the standard normal table.
5. Reality check: Does the answer make sense?

Backward normal calculations
We can also calculate the values, given the probabilities:
If MPG ∼ N(25.7, 5.88), what is the minimum MPG required to be in the top 10%?

"Backward" normal calculations
1. State the problem in terms of the probability of being less than some number.
2. Look up the required value(s) on the standard normal table.
3. "Unstandardize," i.e. solve $z = \frac{x - \mu}{\sigma}$ for $x$.
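For the MPG question, the backward steps can be sketched in Python as follows, with `NormalDist().inv_cdf` standing in for reading the table in reverse. Being in the top 10% is restated as 90% of values lying below the cutoff.

```python
from statistics import NormalDist

mu, sigma = 25.7, 5.88

# Step 1: top 10% means P(MPG < x) = 0.90
# Step 2: find the z with P(Z < z) = 0.90 (reverse table lookup)
z = NormalDist().inv_cdf(0.90)
# Step 3: unstandardize, x = mu + z * sigma
min_mpg = mu + z * sigma

print(round(z, 2), round(min_mpg, 1))
```

As a reality check, the cutoff lies a little more than one standard deviation above the mean, which is plausible for the top 10%.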

Example
Suppose X ∼ N(0, 1).
◦ P(X ≤ 2) = ?
◦ P(X > 2) = ?
◦ P(−1 ≤ X ≤ 2) = ?
◦ Find the value z such that
  ⋄ P(X ≤ z) = 0.95
  ⋄ P(X > z) = 0.99
  ⋄ P(−z ≤ X < z) = 0.68
  ⋄ P(−z ≤ X < z) = 0.95
  ⋄ P(−z ≤ X < z) = 0.997

Suppose X ∼ N(10, 5).
◦ P(X < 5) = ?
◦ P(−3 < X < 5) = ?
◦ P(−x < X < x) = 0.95

Assessing Normality
How to make a normal quantile plot
1. Arrange the data in increasing order.
2. Record the percentiles ($\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n}{n}$).
3. Find the z-scores for these percentiles.
4. Plot $x$ on the vertical axis against $z$ on the horizontal axis.

Use of normal quantile plots
◦ If the data are (approximately) normal, the plot will be close to a straight line.
◦ Systematic deviations from a straight line indicate a nonnormal distribution.
◦ Outliers appear as points that are far away from the overall pattern of the plot.
[Figure: normal quantile plots for samples from N(0, 1), Exp(1), and U(0, 1); x-axis: Theoretical Quantiles, y-axis: Sample Quantiles.]
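The plotting coordinates can be computed with a few lines of Python, sketched below. One deviation from the steps above is an assumption of this sketch: it uses the percentiles (i − 0.5)/n rather than i/n, because the z-score of the last percentile n/n = 1 is infinite; software packages use various such adjustments. The function name is hypothetical.

```python
from statistics import NormalDist

def normal_quantile_points(xs):
    """(z, x) pairs for a normal quantile plot, using percentiles (i - 0.5)/n."""
    s = sorted(xs)                                   # step 1: increasing order
    n = len(s)
    std_normal = NormalDist()
    # steps 2 and 3: percentiles and their z-scores (assumed plotting positions)
    zs = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(zs, s))                          # step 4: plot x against z

home_runs = [4, 15, 10, 8, 33, 25, 36, 40, 36, 66, 63, 50, 64, 49, 40]
points = normal_quantile_points(home_runs)
for z, x in points:
    print(f"{z:6.2f}  {x}")
```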

0.2 Density 0. some with compact mathematical formulas and many without. 2004 .0 0 10 20 Velocity of galaxy (1000km/s) 30 40 The Normal Distrbution.11 - .1 0. Jan 9. There are many others.Density Estimation The normal density is just one possible density curve. Density estimation software ﬁts an arbitrary density to data to give a smooth summary of the overall pattern.

Histogram
How to scale a histogram?
◦ Easiest way to draw a histogram:
  ⋄ equally spaced bins
  ⋄ counts on the vertical axis
[Figure: frequency histogram; x-axis: Sosa home runs (0 to 70), y-axis: Frequency (0 to 5).]
Disadvantage: Scaling depends on number of observations and bin width.
◦ Scale histogram such that area of each bar corresponds to proportion of data:
$$\text{height} = \frac{\text{counts}}{\text{width} \cdot \text{total number}}$$
[Figure: density histogram; x-axis: Sosa home runs (0 to 70), y-axis: Density (0.00 to 0.04).]

Proportion of data in interval (0, 10]: height · width = 0.02 · 10 = 0.2 = 20% Since n = 15 this corresponds to 3 observations.
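The scaling can be reproduced with a small Python sketch. It assumes right-closed bins such as (0, 10], matching the interval used above; the function name and its parameters are illustrative.

```python
import math

def density_heights(xs, start, width, n_bins):
    """Bar heights count / (width * n) for bins (start + k*width, start + (k+1)*width]."""
    n = len(xs)
    counts = [0] * n_bins
    for x in xs:
        k = math.ceil((x - start) / width) - 1   # index of the right-closed bin containing x
        if 0 <= k < n_bins:
            counts[k] += 1
    return [c / (width * n) for c in counts]

home_runs = [4, 15, 10, 8, 33, 25, 36, 40, 36, 66, 63, 50, 64, 49, 40]
heights = density_heights(home_runs, start=0, width=10, n_bins=7)

print([round(h, 4) for h in heights])
print(sum(h * 10 for h in heights))   # total area of the bars
```

The first height corresponds to the three observations 4, 8, and 10 in (0, 10], and the bar areas add up to one.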


Density curves

[Figure: density histograms of samples of size n = 250, n = 2500, and n = 250000, approaching a smooth density curve as n → ∞; x-axis: x (−4 to 4), y-axis: Density (0.0 to 0.5).]

Proportion of data in (1, 2]:
$$\frac{\#\{x_i : 1 < x_i \le 2\}}{n} \;\xrightarrow{\;n \to \infty\;}\; \int_1^2 f(x)\,dx$$

Probability that a new observation X falls into [a, b]:
$$P(a \le X \le b) = \int_a^b f(x)\,dx = \lim_{n \to \infty} \frac{\#\{x_i : a \le x_i \le b\}}{n}$$

Relationships between data
Example: Smoking and mortality
◦ Data from 25 occupational groups (condensed from data on thousands of individual men)
◦ Smoking (100 = average number of cigarettes per day)
◦ Mortality ratio for deaths from lung cancer (100 = average ratio for all English men)

Scatter plot of the data:
[Figure: scatterplot; x-axis: Smoking (index) from 70 to 130, y-axis: Mortality (index) from 60 to 140.]

In STATA:
. insheet using smoking.txt
. graph twoway scatter mortality smoking

Scatterplots and correlation, Jan 12, 2004

-1-

Relationship between data
Assessing a scatter plot:
◦ What is the overall pattern?
  ⋄ form of the relationship?
  ⋄ direction of the relationship?
  ⋄ strength of the relationship?
◦ Are there any deviations (e.g. outliers) from these patterns?

Direction of relationship/association:
◦ positive association: above-average values of both variables tend to occur together, and the same for below-average values
◦ negative association: above-average values of one variable tend to occur with below-average values of the other, and vice versa.

Strength of relationship/association:
◦ determined by how closely the points follow the overall pattern
◦ difficult to assess by eye, hence the need for a numerical measure


Correlation
Correlation is a numerical measure of the direction and strength of the linear relationship between two quantitative variables.
The sample correlation $r$ is defined as
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
where
$$s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$

Properties:
◦ dimensionless quantity
◦ not affected by linear transformations: for $x'_i = a x_i + b$ and $y'_i = c y_i + d$ with $a, c > 0$, $r_{x'y'} = r_{xy}$
◦ $-1 \le r_{xy} \le 1$
◦ $r_{xy} = 1$ if and only if $y_i = a x_i + b$ for some $a > 0$ and $b$
◦ measures linear association between $x_i$ and $y_i$
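The definition and two of the properties can be checked numerically with a short Python sketch. The five data pairs are made up purely for illustration and do not come from the notes; the function name is hypothetical.

```python
import math

def correlation(xs, ys):
    """Sample correlation r = s_xy / (s_x * s_y) with n - 1 divisors."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

xs = [1, 2, 3, 4, 5]      # made-up illustration data
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
# linear transformations with positive slopes leave r unchanged
r_transformed = correlation([10 * x + 3 for x in xs], [0.5 * y - 1 for y in ys])
# points exactly on an increasing line give r = 1
r_line = correlation(xs, [2 * x + 1 for x in xs])

print(round(r, 4), round(r_transformed, 4), round(r_line, 4))
```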

Correlation
[Figure: eight scatterplots of y against x (both axes from −2 to 2) illustrating correlations ρ = −0.9, −0.6, −0.3, 0, 0.3, 0.6, 0.9, and 0.99.]

Introduction to regression
Regression describes how one variable (response) depends on another variable (explanatory variable).
◦ Response variable: variable of interest, measures the outcome of a study
◦ Explanatory variable: explains (or even causes) changes in response variable

Examples:
◦ Hearing difficulties: response - sound level (decibels), explanatory - age (years)
◦ Real estate market: response - listing price ($), explanatory - house size (sq. ft.)
◦ Salaries: response - salary ($), explanatory - experience (years), education, sex

Least squares regression, Jan 14, 2004 -1-

Introduction to regression
Example: Food expenditures and income
Data: Sample of 20 households
[Figure: scatterplot; x-axis: income (0 to 120), y-axis: food expenditure (0 to 20).]

Questions:
◦ How does food expenditure (Y) depend on income (X)?
◦ Suppose we know that X = x0, what can we tell about Y?

Linear regression:
If the response Y depends linearly on the explanatory variable X, we can use a straight line (regression line) to predict Y from X.

Least squares regression
How to find the regression line
[Figure: scatterplot of food expenditure against income with the fitted line, and an enlarged detail marking the observed $y$, the predicted $\hat{y}$, and the difference $y - \hat{y}$.]
Since we intend to predict Y from X, the errors of interest are mispredictions of Y for fixed X.
The least squares regression line of Y on X is the line that minimizes the sum of squared errors.
For observations $(x_1, y_1), \ldots, (x_n, y_n)$, the regression line is given by
$$\hat{Y} = a + bX$$
where
$$b = r\,\frac{s_y}{s_x} \quad\text{and}\quad a = \bar{y} - b\,\bar{x}$$
($r$ correlation coefficient, $s_x$, $s_y$ standard deviations, $\bar{x}$, $\bar{y}$ means)

Least squares regression
Example: Food expenditure and income
[Table: income (X) and food expenditure (Y) for the 20 households.]

The summary statistics are:
◦ $\bar{x} = 45.50$
◦ $s_x = 23.96$
◦ $\bar{y} = 7.97$
◦ $s_y = 4.66$
◦ $r = 0.946$

The regression coefficients are:
$$b = r\,\frac{s_y}{s_x} = \frac{0.946 \cdot 4.66}{23.96} = 0.184$$
$$a = \bar{y} - b\,\bar{x} = 7.97 - 0.184 \cdot 45.5 = -0.402$$
[Figure: scatterplot with fitted regression line; x-axis: income (0 to 120), y-axis: food expenditure (0 to 20).]
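The coefficient formulas only need the five summary statistics, so they can be checked with the Python sketch below. The function name `predict` and the income value 60 are hypothetical choices for illustration. The intercept computed from the unrounded slope differs in the third decimal from the value obtained with b rounded to 0.184.

```python
# Summary statistics from the food expenditure example
x_bar, s_x = 45.50, 23.96
y_bar, s_y = 7.97, 4.66
r = 0.946

b = r * s_y / s_x          # slope
a = y_bar - b * x_bar      # intercept

def predict(x):
    """Predicted food expenditure for income x."""
    return a + b * x

print(round(b, 3), round(a, 3))
print(round(predict(60), 2))    # prediction at a hypothetical income of 60
print(round(predict(x_bar), 2)) # the line passes through (x_bar, y_bar)
```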

Interpreting the regression model
◦ The response in the model is denoted $\hat{Y}$ to indicate that these are predicted Y values, not the true Y values. The "hat" denotes prediction.
◦ The slope of the line indicates how much $\hat{Y}$ changes for a unit change in X.
◦ The intercept is the value of $\hat{Y}$ for X = 0. It may or may not have a physical interpretation, depending on whether or not X can take values near 0.
◦ To make a prediction for an unobserved X, just plug it in and calculate $\hat{Y}$.
◦ Note that the line need not pass through the observed data points. In fact, it often will not pass through any of them.

Regression and correlation
Correlation analysis:
We are interested in the joint distribution of two (or more) quantitative variables.
Example: Heights of 1,078 fathers and sons
[Figure: scatterplot with the SD line; x-axis: Father's height (inches) from 58 to 80, y-axis: Son's height (inches) from 58 to 80.]
Points are scattered around the SD line:
◦ $(y - \bar{y}) = \frac{s_y}{s_x}(x - \bar{x})$
◦ goes through center $(\bar{x}, \bar{y})$
◦ has slope $s_y/s_x$
The correlation $r$ measures how much the points spread around the SD line.

Regression and correlation
Regression analysis:
We are interested in how the distribution of one response variable depends on one (or more) explanatory variables.
Example: Heights of 1,078 fathers and sons
[Figure: scatterplots of son's height against father's height (inches) with vertical strips at father's height = 64, 68, and 72 inches, and density histograms of son's height within each strip.]
In each vertical strip, the points are distributed around the regression line.

Properties of least squares regression
◦ The distinction between explanatory and response variables is essential. Looking at vertical deviations means that changing the axes would change the regression line.
[Figure: scatterplot of son's height against father's height (inches) with the two regression lines $\hat{y} = a + bx$ and $\hat{x} = a' + b'y$.]
◦ A change of 1 sd in X corresponds to a change of $r$ sds in Y.
◦ The least squares regression line always passes through the point $(\bar{x}, \bar{y})$.
◦ $r^2$ (the square of the correlation) is the fraction of the variation in the values of $y$ that is explained by the least squares regression on $x$. When reporting the results of a linear regression, you should report $r^2$.
These properties depend on the least-squares fitting criterion and are one reason why that criterion is used.

The regression effect
Regression effect
In virtually all test-retest situations, the bottom group on the first test will on average show some improvement on the second test - and the top group will on average fall back. This is the regression effect.
The statistician and geneticist Sir Francis Galton (1822-1911) called this effect "regression to mediocrity".
[Figure: scatterplot of son's height against father's height (inches).]

Regression fallacy
Thinking that the regression effect must be due to something important, not just the spread around the line, is the regression fallacy.

Regression in STATA
. infile food income size using food.txt
. graph twoway scatter food income || lfit food income, legend(off) ytitle(food)
. regress food income

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =  151.97
       Model |  369.572965     1  369.572965           Prob > F      =  0.0000
    Residual |  43.7725361    18  2.43180756           R-squared     =  0.8941
-------------+------------------------------           Adj R-squared =  0.8882
       Total |  413.345502    19  21.7550264           Root MSE      =  1.5594

------------------------------------------------------------------------------
        food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .1841099   .0149345    12.33   0.000     .1527336    .2154862
       _cons |  -.4119994   .7637666    -0.54   0.596    -2.016613    1.192615
------------------------------------------------------------------------------

[Figure: scatterplot with fitted line; x-axis: Income (0 to 120), y-axis: Food expenditure (0 to 20).]
This graph has been generated using the graphical user interface of STATA. The complete command is:
. twoway (scatter food income, msymbol(circle) msize(medium) mcolor(black))
>     (lfit food income, range(0 120) clcolor(black) clpat(solid) clwidth(medium)),
>     ytitle(Food expenditure, size(large)) ylabel(, valuelabel angle(horizontal) labsize(medlarge))
>     xtitle(Income, size(large)) xscale(range(0 120)) xlabel(0(20)120, labsize(medlarge))
>     legend(off) ysize(2) xsize(3)

Residual plots
Residuals: difference of observed and predicted values
$$e_i = \text{observed } y - \text{predicted } y = y_i - \hat{y}_i = y_i - (a + b\,x_i)$$
For a least squares regression, the residuals always have mean zero.

Residual plot
A residual plot is a scatterplot of the residuals against the explanatory variable. It is a diagnostic tool to assess the fit of the regression line.
Patterns to look for:
◦ Curvature indicates that the relationship is not linear.
◦ Increasing or decreasing spread indicates that the prediction will be less accurate in the range of explanatory variables where the spread is larger.
◦ Points with large residuals are outliers in the vertical direction.
◦ Points that are extreme in the x direction are potential high influence points.
Influential observations are individuals with extreme x values that exert a strong influence on the position of the regression line. Removing them would significantly change the regression line.
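The mean-zero property of the residuals can be illustrated with a small Python sketch. The data are made up for illustration, and the slope is computed as $s_{xy}/s_x^2$, which is algebraically the same as $r\,s_y/s_x$; the function name is hypothetical.

```python
def least_squares(xs, ys):
    """Intercept a and slope b of the least squares line of y on x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx            # same as r * s_y / s_x (the n - 1 factors cancel)
    a = y_bar - b * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]      # made-up illustration data
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]   # e_i = y_i - (a + b x_i)

print(round(a, 2), round(b, 2))
print([round(e, 2) for e in residuals])
print(round(sum(residuals) / len(residuals), 10))       # mean of the residuals
```

Plotting `residuals` against `xs` would give the residual plot described above.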

Regression Diagnostics
Example: First data set
[Figure: scatterplot of Y against X with fitted line; residuals against fitted values; residuals against X]
Residuals are regularly distributed.

Regression Diagnostics
Example: Second data set
[Figure: scatterplot of Y against X with fitted line; residuals against fitted values; residuals against X]
Functional relationship other than linear.

Regression Diagnostics
Example: Third data set
[Figure: scatterplot of Y against X with fitted line; residuals against fitted values; residuals against X]
Outlier, regression line misfits majority of data.

Regression Diagnostics
Example: Fourth data set
[Figure: scatterplot of Y against X with fitted line; residuals against fitted values; residuals against X]
Heteroscedasticity.

Regression Diagnostics
Example: Fifth data set
[Figure: scatterplot of Y against X with fitted line; residuals against fitted values; residuals against X]
One separate point in direction of x, highly influential.

The Question of Causation

Example: Are babies brought by the stork?
◦ Data from 54 countries
◦ Variables:
  ⋄ Birth rate (newborns per 1000 women)
  ⋄ Number of storks (per 1000 women)

[Figure: scatterplot of birth rate (0 to 21) against number of storks per 1000 women (0 to 5)]

Model: Birth rate (Y) is proportional to the number of storks (X)
$Y = bX + \varepsilon$
Least squares regression yields for the slope of the regression line $\hat{b} = 4.3 \pm 0.2$.
Can we conclude that babies are brought by the stork?

Causation, Jan 16, 2004

The Question of Causation

A more serious example:
Variables:
◦ Income Y - response
◦ Level of education X - explanatory variable
There is a positive association between income and education.
Question: Does better education increase income?

[Diagram: possible relationships between X and Y, (a) causal effect, (b) confounding, (c); panels (b) and (c) involve a third variable Z]

Possible alternative explanation: Confounding
◦ People from prosperous homes are likely to receive many years of education and are more likely to have high earnings.
◦ Education and income might both be affected by personal attributes such as self assurance. On the other hand the level of education could have an impact on e.g. self assurance. The effects of education and self assurance cannot be separated.

Confounding: Response and explanatory variable both depend on a third (hidden) variable.

Establishing Causal Relationships

Controlled experiments: A cause-effect relationship between two variables X and Y can be established by conducting an experiment where
◦ the values of X are manipulated and
◦ the effect on Y is observed.
Problem: Often such experiments are not possible.

If we cannot establish a causal relationship by a controlled experiment, we can still collect evidence from observational studies:
◦ The association is strong.
◦ The association is consistent across multiple studies.
◦ Higher doses are associated with stronger responses.
◦ The alleged cause precedes the effect in time.
◦ The alleged cause is plausible.
Example: Smoking and lung cancer

Caution about Causation

Association is not causation
Two variables may be correlated because both are affected by some other (measured or unmeasured) variable.
Unmeasured confounding variables can influence the interpretation of relationships among the measured variables. They
◦ may suggest a relationship where there is none or
◦ may mask a real relationship.

Causation is, unlike association, not a statistical concept. For inference on cause-effect relationships, we need some knowledge about the causal relationships between the variables in the study.
Randomized experiments guarantee the absence of any confounding variables. Any relationship between the manipulated variable and the response must be due to a cause-effect relationship.

No causation in - no causation out
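A small simulation can make the point concrete. In the Python sketch below a hidden variable Z drives both X and Y while X has no effect on Y, so the observed association is entirely due to confounding; when X is assigned at random instead, the association disappears. The model (normal noise, unit variances), the seed and all names are assumptions made for illustration.

```python
import random


def correlation(u, v):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


rng = random.Random(16)
n = 20000

# Observational setting: hidden Z influences both X and Y; X does not influence Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x_observed = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]
corr_observational = correlation(x_observed, y)

# Randomized setting: X is assigned independently of Z.
x_randomized = [rng.gauss(0, 1) for _ in range(n)]
corr_randomized = correlation(x_randomized, y)
```

Under these assumptions the observational correlation is close to 0.5 although there is no causal effect, while the randomized correlation is close to 0.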

Experiments and Observational Studies

Two major types of statistical studies
◦ Observational study - observes individuals/objects and measures variables of interest but does not attempt to interfere with the natural process.
◦ Designed experiment - deliberately imposes some treatment on individuals to observe their responses.

Remarks:
◦ Sample surveys are an example of an observational study.
◦ In economics, most studies are observational.
◦ Clinical studies are often designed experiments.
◦ Observational studies have no control over variables. Thus the effect of the explanatory variable on the response variable might be confounded (mixed up) with the effect of some other variables. Such variables are called confounders and are a major source of bias.
◦ Designed experiments allow statements about causal relationships between treatment and response.

Experiments and Observational Studies, Jan 16, 2004

Designed Experiments

• In controlled experiments, the subjects are assigned to one of two groups,
  ◦ treatment group and
  ◦ control group (which does not receive treatment).
• A controlled experiment is randomized if the subjects are randomly assigned to one of the two groups.
• One precaution in designed experiments is the use of a placebo, which is made of a completely neutral substance. The subjects do not know whether they receive the treatment or a placebo; any difference in the response thus cannot be attributed to psychological and psychosomatic effects.
• In a double blind experiment, neither the subjects nor the treatment administrators know who is assigned to the two groups.

Example: The Salk polio vaccine field trial
◦ Randomized controlled double-blind experiment in 11 states
◦ 200,000 children in treatment group
◦ 200,000 children in control group treated with placebo
The difference between the responses of the two groups shows that the vaccine reduces the risk of polio infection.

Confounding

Confounding means a difference between the treatment and control groups, other than the treatment, which affects the responses being studied.
A confounder is a third variable, associated with exposure and with disease.

Example: Lanarkshire Milk Experiment
The purpose of the experiment was to study the effect of pasteurized milk on the health of children.
◦ The subjects of the experiment were school children.
◦ The children in the treatment group got a daily portion of pasteurized milk.
◦ The children in the control group did not receive any extra milk.
◦ The teachers assigned poorer children to the treatment group so that they got extra milk.
The effect of pasteurized milk on the health of children is confounded with the effect of wealth: Poorer children are more exposed to diseases.

Observational Studies

Confounding is a major problem in observational studies.
Association is NOT Causation

Example: Does smoking cause cancer?
• Designed experiment not possible (cannot make people smoke).
• Observation: Smokers have higher cancer rates.
• Tobacco industry: There might be a gene which
  ◦ makes people smoke and
  ◦ causes cancer.
  In that case stopping smoking would not prevent cancer since it is caused by the gene. The observed high association could be attributed to the confounding effect of such a gene.
• However: Studies with identical twins, one smoker and one nonsmoker, put some serious doubt on the gene theory.

Example

Do screening programs speed up detection of breast cancer?
◦ Large-scale trial run by the Health Insurance Plan of Greater New York, starting in 1963
◦ 62,000 women age 40 to 64 (all members of the plan)
◦ Randomly assigned to two equal groups
◦ Treatment group:
  ⋄ women were encouraged to come in for annual screening
  ⋄ 20,200 women did come in for screening
  ⋄ 10,800 refused
◦ Control group:
  ⋄ was offered usual health care
◦ All the women were followed for many years.

Epidemiologists who worked on the study found that
◦ screening had little impact on diseases other than breast cancer,
◦ poorer women were less likely to accept screening than richer ones, and
◦ most diseases fall more heavily on the poor than the rich.

Example

Deaths in the first five years of the screening trial, by cause. Rates per 1,000 women.

                                        Cause of Death
                      Number of     Breast cancer      All other
                      persons       Number   Rate      Number   Rate
Treatment group        31,000         39     1.3        837     27
  Examined             20,200         23     1.1        428     21
  Refused              10,800         16     1.5        409     38
Control group          31,000         63     2.0        879     28

Questions:
◦ Does screening save lives?
◦ Why is the death rate from all other causes in the whole treatment group (“examined” and “refused” combined) about the same as the rate in the control group?
◦ Why is the death rate from all other causes higher for the “refused” group than the “examined” group?
◦ Breast cancer (like polio, but unlike most other diseases) affects the rich more than the poor. Which numbers in the table confirm this association between breast cancer and income?
◦ The death rate (from all causes) among women who accepted screening is about half the death rate among women who refused. Did screening cut the death rate in half? If not, what explains the difference in death rates?
◦ To show that screening reduces the risk from breast cancer, someone wants to compare 1.1 and 1.5. Is this a good comparison? Is it biased against screening? For screening?

Survey Sampling

Situation: Population of N individuals (or items), e.g.
◦ students at this university
◦ light bulbs produced by a company on one day

Seek information about population, e.g.
◦ amount of money students spent on books this quarter
◦ percentage of students who bought more than 10 books in this quarter
◦ lifetime of light bulbs

Full data collection is often not possible because it is e.g.
◦ too expensive
◦ too time consuming
◦ not sensible (e.g. testing every produced light bulb for its lifetime)

Statistical approach:
◦ collect information from part of the population (sample)
◦ use information on sample to draw conclusions on whole population

Questions:
◦ How to choose a sample?
◦ What conclusions can be drawn?

Survey Sampling, Jan 19, 2004

Survey Sampling

Objective of a sample survey: Gather information on some variable for population of N individuals:
  $\tilde{x}_i$ : value of interest for the ith individual
  $\tilde{x}_1, \ldots, \tilde{x}_N$ : values for population
Sample of length n:
  $x_1, \ldots, x_n$ : values obtained from sampling

Parameter - number that describes the population, e.g.
  $\mu_{pop} = \frac{1}{N} \sum_{j=1}^{N} \tilde{x}_j$   (population mean)
  $\sigma^2_{pop} = \frac{1}{N} \sum_{j=1}^{N} (\tilde{x}_j - \mu_{pop})^2$   (population variance)

Estimate population parameter from sampled values:
  $\hat{\mu}_{pop} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$   (sample mean)
  $\hat{\sigma}^2_{pop} = s^2 = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2$   (sample variance)

A function of the sample $x_1, \ldots, x_n$ is called a statistic.
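The difference between the population formulas (divide by N) and the sample estimates (divide by n - 1 for the variance) can be sketched in a few lines of Python. The function names and the numbers are hypothetical and serve only as an illustration.

```python
def population_mean(values):
    """Mean of all N population values."""
    return sum(values) / len(values)


def population_variance(values):
    """Divide by N: a parameter describing the whole population."""
    mu = population_mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def sample_mean(sample):
    """Estimate of the population mean from n sampled values."""
    return sum(sample) / len(sample)


def sample_variance(sample):
    """Divide by n - 1: the statistic s^2 computed from a sample."""
    xbar = sample_mean(sample)
    return sum((v - xbar) ** 2 for v in sample) / (len(sample) - 1)


population = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical population, N = 8
sample = [4, 5, 9]                       # hypothetical sample, n = 3

mu_pop = population_mean(population)
var_pop = population_variance(population)
x_bar = sample_mean(sample)
s_squared = sample_variance(sample)
```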

In our example. ◦ Ask 20 students in the University bookshop. ◦ Select randomly 20 students from the register of the university.Sampling Distribution Suppose we are interested in the amount of money students at this university have spent on books this quarter. Survey Sampling. The value we obtain will vary from sample to sample. that is. Jan 19. Idea: Ask 20 students about the amount they have spent and take the average. if we asked another 20 students we would get a diﬀerent answer. Sampling distribution The sampling distribution of a statistic is the distribution of all values taken by the statistic if evaluated for all possible samples of size n taken from the same population. The design of a sample refers to the method used to choose the sample from the population. ◦ Ask 20 students in your department. the sampling distribution of the average amount obtained from the sample depends on the way we choose the sample from the population: ◦ Ask 20 students in this class. 2004 -3- .

Sampling Distribution

Example: Consider a population of 20 students who spent the following amounts on books:

  $\tilde{x}_1, \ldots, \tilde{x}_{15}$:  100 120 150 180 200 220 220 240 260 280 290 300 310 350 400

Sampling distribution of $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ for sample sizes (a) n = 2, (b) n = 3, (c) n = 4

[Figure: histograms of the sampling distribution of the sample mean (frequency in % against the sample mean, 0 to 400): (a) σ = 55.4247, (b) σ = 43.38302, (c) σ = 35.96526]
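For a population this small, the sampling distribution of the sample mean can be enumerated exactly. The Python sketch below uses the 15 amounts listed above and assumes that samples are drawn without replacement and that every sample of size n is equally likely. The slide does not state how its σ values were computed, so the spreads obtained here need not agree with them to the last digit.

```python
from itertools import combinations
from statistics import mean, pstdev

amounts = [100, 120, 150, 180, 200, 220, 220, 240,
           260, 280, 290, 300, 310, 350, 400]


def sampling_distribution(population, n):
    """Sample means of all samples of size n drawn without replacement."""
    return [mean(sample) for sample in combinations(population, n)]


# Center and spread of the sampling distribution for n = 2, 3, 4.
center = {n: mean(sampling_distribution(amounts, n)) for n in (2, 3, 4)}
spread = {n: pstdev(sampling_distribution(amounts, n)) for n in (2, 3, 4)}
population_mean = mean(amounts)
```

The center of each sampling distribution equals the population mean, and the spread shrinks as n grows.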

Bias

Example: Suppose we are interested in the amount of money students at this university have spent on books last quarter.
Sample: 20 students in the University bookshop
Do we get a good estimate for the average amount spent on books last quarter by UofC students?
◦ Students who buy more books and spend more money on books are more likely to be found in bookshops than students who buy fewer books.
◦ The sample is not representative of the population of all students.
◦ The sample mean might overestimate the true amount spent on books.

Careful: A poor sample design can produce misleading conclusions.
The design of a study is biased if it systematically favors some parts of the population over others.
A statistic is unbiased if the mean of its sampling distribution is equal to the parameter being estimated. Otherwise we say the statistic is biased.

Bias

Examples: Biased Sampling
◦ Midway Airlines: Ads in the New York Times and the Wall Street Journal stated that “84 percent of frequent business travelers to Chicago prefer Midway Metrolink to American, United, and TWA.” The survey was “conducted among Midway Metrolink passengers between New York and Chicago.”
◦ A call-in poll conducted by USA Today concluded that Americans love Donald Trump. USA Today later reported that 5,640 of the 7,800 calls for the poll came from the offices owned by one man, Cincinnati financier Carl Lindner.
◦ ABC network program Nightline once asked whether the United Nations should continue to have its headquarters in the United States. More than 186,000 callers responded, and 67% said “No.” A properly designed sample survey showed that 72% of adults want the UN to stay.
◦ A 1992 Roper poll asked “Does it seem possible or does it seem impossible to you that the Nazi extermination of Jews never happened?” 22% of the American respondents said “seems possible.” A reworded poll in 1994 asked “Does it seem possible to you that the Nazi extermination of Jews never happened, or do you feel certain that it happened?” This time only 1% of the respondents said it was “possible it never happened.”

Caution about Sample Surveys

• Undercoverage
  ◦ occurs when some groups in the population are left out of the process of choosing the sample
  ◦ no accurate list of the population
  ◦ results in bias if this group differs from the rest of the population
• Nonresponse
  ◦ occurs when a chosen individual cannot be contacted or does not cooperate
  ◦ results in bias if this group differs from the rest of the population
• Response bias
  ◦ subjects may not want to admit illegal or unpopular behaviour
  ◦ subjects may be affected by the interviewer's appearance or tone
  ◦ subjects may not remember correctly
• Question wording
  ◦ confusing or leading questions can introduce strong bias
  ◦ do not trust sample survey results unless you have read the exact questions posed

Simple Random Sampling

A simple random sample (SRS) of size n consists of n individuals chosen from the population in such a way that every set of n individuals is equally likely to be selected.
◦ Every individual has an equal chance of being selected.
◦ Every possible sample has an equal chance of being selected.
◦ Random selection eliminates bias in sampling.

SRS or Not?
Is each of the following samples an SRS or not?
◦ A deck of cards is shuffled, and the top five dealt.
◦ A telephone survey is conducted by dialing telephone numbers at random (i.e. each valid phone number is equally likely).
◦ A sample of Illinois residents is drawn by choosing all the residents in each of 100 census blocks (in such a way that each set of 100 blocks is equally likely to be chosen).
◦ A sample of 10% of all students at the University of Chicago is chosen by numbering the students 1, ..., N, drawing a random integer i from 1 to 10, and drawing every tenth student beginning with i. (E.g., if i = 5, students 5, 15, 25, ... are chosen.)
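The difference between an SRS and the "every tenth student" scheme can be explored with a short Python sketch. The register of 100 students, the seed and the function names are hypothetical; `random.Random.sample` draws without replacement so that every set of n individuals is equally likely.

```python
import random
from math import comb


def simple_random_sample(population, n, rng):
    """SRS: every set of n individuals is equally likely."""
    return rng.sample(population, n)


def systematic_sample(population, step, start):
    """Every step-th individual, beginning with position `start` (1-based)."""
    return population[start - 1::step]


students = list(range(1, 101))           # hypothetical register, N = 100
rng = random.Random(2004)

srs = simple_random_sample(students, 10, rng)

# All samples the systematic scheme can ever produce (one per starting value i).
systematic_samples = {tuple(systematic_sample(students, 10, i)) for i in range(1, 11)}

number_of_possible_srs = comb(100, 10)   # sets of 10 students an SRS can produce
```

In the systematic scheme every student has the same chance of 1/10 of being chosen, but only 10 different samples can occur, whereas an SRS gives each of the `comb(100, 10)` possible sets the same chance.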

Stratified Sampling

Example:
◦ Population: Students at this university
◦ Objective: Amount of money spent on books this quarter
◦ Knowledge: Students in e.g. humanities spend more money on books

Use knowledge to build sample:
◦ divide population into groups of similar individuals, called strata
◦ choose a simple random sample within each group
◦ size of samples in each group e.g. proportional to size of groups

Can reduce variability of estimate significantly.
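The steps above can be sketched in Python as follows. The strata, their sizes and the function name are hypothetical; the sketch uses proportional allocation and draws a simple random sample within each stratum. Because the stratum sample sizes are rounded, they may not add up exactly to n for other stratum sizes.

```python
import random


def stratified_sample(strata, n, rng):
    """Draw an SRS within each stratum, with size proportional to the stratum."""
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(n * len(members) / total)   # proportional allocation
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical student IDs: 300 in the humanities, 700 in other fields.
strata = {
    "humanities": list(range(0, 300)),
    "other": list(range(300, 1000)),
}
rng = random.Random(19)
sample = stratified_sample(strata, 20, rng)
sizes = {name: len(ids) for name, ids in sample.items()}
```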

Summary

◦ A number which describes a population is a parameter.
◦ A number computed from the data is a statistic.
◦ Use statistics to make inferences about unknown population parameters.
◦ A simple random sample (SRS) of size n consists of n individuals from the population sampled without replacement, that is, every set of n individuals has an equal chance to be the sample actually selected.
◦ A statistic from a random sample has a sampling distribution that describes how the statistic varies in repeated data production.
◦ A statistic as an estimator of a parameter may suffer from bias or from high variability. Bias means that the mean of the sampling distribution is not equal to the true value of the parameter. The variability of the statistic is described by the spread of its sampling distribution.

First Step Towards Probability

Experiment: Toss a die and observe the number on the face up.
What is the chance
◦ of getting a six?
  Event of interest: {6}
  All possible events: {1, 2, 3, 4, 5, 6}
  ⇒ 1/6 (one out of six)
◦ of getting an even number?
  Event of interest: {2, 4, 6}
  All possible events: {1, 2, 3, 4, 5, 6}
  ⇒ 1/2 (three out of six)

The classical probability concept: If there are N equally likely possibilities, of which one must occur and s are regarded favorable, or as a “success”, then the probability of a “success” is s/N.

Counting, Jan 21, 2003

First Step Towards Probability

Example: Suppose that of 100 applicants for a job 50 were women and 50 were men, all equally qualified. Further suppose that the company hired 2 women and 8 men. How likely is this outcome under the assumption that the company does not discriminate?
How many ways are there to choose
◦ 10 out of 100 applicants? (⇒ N)
◦ 2 out of 50 female applicants and 8 out of 50 male applicants? (⇒ s)
To compute such probabilities we need a way to count the number of possibilities (favorable and total).

The Multiplicative Rule

Suppose you have k choices to make, with $N_1, \ldots, N_k$ possibilities, respectively. Then the total number of possibilities is the product $N_1 \cdots N_k$.
Example: If you toss a die 5 times, the number of possible results is $6^5 = 7776$.

Sampling in order with replacement
If you sample n times in order with replacement from a set of N elements, then the total number of possible sequences $(x_1, \ldots, x_n)$ is $N^n$.

Sampling in order without replacement
If you sample n times in order without replacement from a set of N elements, then the total number of possible sequences $(x_1, \ldots, x_n)$ is
$N (N-1) \cdots (N-n+1) = \frac{N!}{(N-n)!}$.
Example: If you select 5 cards in order from a card deck of 64, the number of possible results is $64 \cdot 63 \cdot 62 \cdot 61 \cdot 60 = 914{,}941{,}440$.
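Both counts in the examples above can be reproduced with Python's standard library; `math.perm(N, n)` computes $N!/(N-n)!$. The variable names are chosen for illustration only.

```python
import math

# In order, with replacement: 5 tosses of a die with 6 faces.
die_sequences = 6 ** 5

# In order, without replacement: 5 cards drawn from a deck of 64.
ordered_card_draws = math.perm(64, 5)

# The same number written as N! / (N - n)!.
via_factorials = math.factorial(64) // math.factorial(64 - 5)
```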

Permutations and Combinations

Example: If you select 5 cards from a card deck of 64, you are typically only interested in the cards you have, not in the order in which you received them. How many different combinations of 5 cards out of 64 are there?
To answer this question we first address the question of how many different sequences of the same 5 cards exist.

Permutation: Let $(x_1, \ldots, x_n)$ be a sequence. A permutation of this sequence is any rearrangement of the elements without losing or adding any elements, that is, any new sequence $(x_{i_1}, \ldots, x_{i_n})$ with permuted indices $\{i_1, \ldots, i_n\} = \{1, \ldots, n\}$. The trivial permutation does not change the order, i.e. $i_j = j$.

How many permutations of n distinct elements are there? The multiplicative rule yields $n \cdot (n-1) \cdots 1 = n!$.
Example (contd): The number of different sequences of 5 fixed cards is $5! = 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 = 120$.

Permutations and Combinations

How many different combinations of n elements chosen from N distinct elements are there?
Recall that
◦ The number of different sequences of length n that can be chosen from N distinct elements is $\frac{N!}{(N-n)!}$.
◦ The number of permutations of any sequence of length n is $n!$.
Since two permuted (ordered) sequences $(x_1, \ldots, x_n)$ lead to the same (unordered) combination $\{x_1, \ldots, x_n\}$ we divide the number of ordered sequences by the number of permutations.
Thus the number of combinations of n elements chosen from N distinct elements is
$\frac{N!}{n! \, (N-n)!} = \binom{N}{n}$.
The numbers $\binom{N}{n} = \binom{N}{N-n}$ are referred to as binomial coefficients.

Examples

Example: If you select 5 cards from a card deck of 64, you are typically only interested in the cards you have, not in the order in which you received them. How many different combinations of 5 cards out of 64 are there? The answer is
$\binom{64}{5} = \frac{64 \cdot 63 \cdot 62 \cdot 61 \cdot 60}{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = \frac{914{,}941{,}440}{120} = 7{,}624{,}512$.

Example: Recall the example with the 100 applicants for a job. The number of ways to choose
◦ 2 women out of 50 is $\binom{50}{2}$,
◦ 8 men out of 50 is $\binom{50}{8}$,
◦ 10 applicants out of 100 is $\binom{100}{10}$.
Thus the chance of this event is
$\binom{50}{2} \binom{50}{8} \Big/ \binom{100}{10} \approx 0.038$.
Moreover, the chance of this or a more extreme event (only one or no woman is hired) is 0.046.
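The counts in these two examples can be recomputed exactly with `math.comb`. The Python sketch below assumes, as the example does, that every set of 10 applicants is equally likely to be hired; the function and variable names are illustrative.

```python
from fractions import Fraction
from math import comb


def prob_women_hired(k, women=50, men=50, hired=10):
    """Chance that exactly k of the hired applicants are women."""
    favorable = comb(women, k) * comb(men, hired - k)
    return Fraction(favorable, comb(women + men, hired))


card_hands = comb(64, 5)                       # unordered hands of 5 cards out of 64
p_two = prob_women_hired(2)                    # exactly 2 women and 8 men
p_two_or_fewer = sum(prob_women_hired(k) for k in range(3))   # 0, 1 or 2 women
```

Converting `p_two` and `p_two_or_fewer` with `float(...)` gives decimal values that can be compared with the chances quoted in the example.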

Summary

The number of possibilities to sample, with or without replacement, in order or unordered, n elements from a set of N distinct elements are summarized in the following table:

Sampling                 in order                  without order
without replacement      $\frac{N!}{(N-n)!}$       $\binom{N}{n}$
with replacement         $N^n$                     $\binom{N+n-1}{n}$
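For small N and n the four entries of the table can be checked by brute-force enumeration with `itertools`. The helper name `count_samples` and the choice N = 4, n = 2 are assumptions for illustration.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, perm


def count_samples(N, n):
    """Count all samples of n elements from N distinct elements by enumeration."""
    items = range(N)
    return {
        "in order, with replacement": len(list(product(items, repeat=n))),
        "in order, without replacement": len(list(permutations(items, n))),
        "without order, without replacement": len(list(combinations(items, n))),
        "without order, with replacement": len(
            list(combinations_with_replacement(items, n))),
    }


counts = count_samples(4, 2)

# The closed-form expressions from the table for N = 4, n = 2.
formulas = {
    "in order, with replacement": 4 ** 2,
    "in order, without replacement": perm(4, 2),
    "without order, without replacement": comb(4, 2),
    "without order, with replacement": comb(4 + 2 - 1, 2),
}
```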

Introduction to Probability

Classical Concept:
◦ requires finitely many and equally likely outcomes
◦ probability of event defined as number of favorable outcomes (s) divided by number of total outcomes (N):
  Probability of event = s/N
◦ can be determined by counting outcomes

In many practical situations the different outcomes are not equally likely:
◦ Success of treatment
◦ Chance to die of a heart attack
◦ Chance of snowfall tomorrow
It is not immediately clear how to measure chance in each of these cases.

Three Concepts of Probability
◦ Frequency interpretation
◦ Subjective probabilities
◦ Mathematical probability concept

Elements of Probability, Jan 23, 2003

The Frequentist Approach

In the long run, we are all dead.
John Maynard Keynes (1883-1946)

The Frequency Interpretation of Probability
The probability of an event is the proportion of time that events of the same kind (repeated independently and under the same conditions) will occur in the long run.

Example: Suppose we collect data on the weather in Chicago on Jan 21 and we note that in the past 124 years it snowed in 34 years on Jan 21, that is $\frac{34}{124} \cdot 100\% = 27.4\%$ of the time. Thus we would estimate the probability of snowfall on Jan 21 in Chicago as 0.274.

The frequency interpretation of probability is based on the following theorem:
The Law of Large Numbers
If a situation, trial, or experiment is repeated again and again, the proportion of successes will converge to the probability of any one outcome being a success.

The Frequentist Approach

[Figure: relative frequency of heads against the number of tosses, in three panels: tosses 1 to 1000, tosses 1000 to 100000, tosses 100000 to 1000000]
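A coin-tossing experiment of this kind is easy to simulate. The Python sketch below records the relative frequency of heads at a few checkpoints; the fair coin, the seed and the function name are assumptions, and a different seed gives a different path.

```python
import random


def relative_frequencies(n_tosses, checkpoints, seed=23):
    """Relative frequency of heads after each checkpoint number of tosses."""
    rng = random.Random(seed)
    heads = 0
    result = {}
    for toss in range(1, n_tosses + 1):
        heads += rng.random() < 0.5      # True counts as 1 head
        if toss in checkpoints:
            result[toss] = heads / toss
    return result


freq = relative_frequencies(100_000, {10, 1_000, 100_000})
```

The values in `freq` typically fluctuate noticeably for few tosses and settle close to 0.5 for many tosses, as the law of large numbers suggests.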

The Subjectivist (Bayesian) Approach

Not all events are repeatable:
◦ Will it snow tomorrow?
◦ Will Mr Jones, 42, live to 65?
◦ Will the Dow Jones rise tomorrow?
◦ Does Iraq have weapons of mass destruction?
To all these questions the answer is either “yes” or “no”, but we are uncertain about the right answer.

Need to quantify our uncertainty about an event A:
Game with two players:
◦ 1st player determines p such that he will “win” \$c · (1 − p) if event A occurs and otherwise he will “lose” \$c · p.
◦ 2nd player chooses c which can be positive or negative.

The Bayesian interpretation of probability is that probability measures the personal (subjective) uncertainty of an event.

Example: Weather forecast
Meteorologist says that the probability of snowfall tomorrow is 90%. He should be willing to bet \$90 against \$10 that it snows tomorrow and \$10 against \$90 that it does not snow.

The Elements of Probability

A (statistical) experiment is a process of observation or measurement. For a mathematical treatment we need:

Sample Space S - set of possible outcomes
Example: An urn contains five balls, numbered from 1 through 5. We choose two at random and at the same time. What is the sample space?
$S = \{ \{1,2\}, \{1,3\}, \{1,4\}, \{1,5\}, \{2,3\}, \{2,4\}, \{2,5\}, \{3,4\}, \{3,5\}, \{4,5\} \}$

Events A ⊆ S - an event is a subset of the sample space S
Example: In the example above the event A that two balls with odd numbers are chosen is
$A = \{ \{1,3\}, \{1,5\}, \{3,5\} \}$.

Probability Function $\mathbb{P}$ - assigns each A a value in [0, 1]
Example: Assuming that all outcomes are equally likely we obtain
$\mathbb{P}(A) = \frac{3}{10}$.
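The urn example can be enumerated directly. The following Python sketch builds the sample space of unordered pairs, selects the event that both balls carry odd numbers, and applies the equally-likely assumption; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Sample space: all unordered pairs of balls numbered 1 through 5.
sample_space = [set(pair) for pair in combinations(range(1, 6), 2)]

# Event A: both chosen balls have odd numbers.
event_a = [pair for pair in sample_space if all(ball % 2 == 1 for ball in pair)]

# Equally likely outcomes: probability = favorable / total.
prob_a = Fraction(len(event_a), len(sample_space))
```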

The Elements of Probability

Why not assign probabilities to outcomes?
Example: Spinner labeled from 0 to 1.
◦ Suppose that all outcomes s ∈ S = [0, 1) are equally likely.
◦ Assign probabilities uniformly on S.
◦ $\mathbb{P}(\{s\}) = c > 0 \Rightarrow \mathbb{P}(S) = \infty$
◦ $\mathbb{P}(\{s\}) = 0 \Rightarrow \mathbb{P}(S) = 0$

Solution: Assign to each subset of S a probability equal to the “length” of that subset:
◦ Probability that the spinner lands in $[0, \frac{1}{4})$ is $\frac{1}{4}$.
◦ Probability that the spinner lands in $[\frac{1}{2}, \frac{3}{4})$ is $\frac{1}{4}$.
◦ Probability that the spinner lands on $\frac{1}{2}$ is 0.
In integral notation we have
$\mathbb{P}(\text{spinner lands in } [a, b]) = \int_a^b dx = b - a$.

Remark: Strictly speaking, we can define the above probability only on a set $\mathcal{A}$ of subsets A ⊆ S which however covers all important and for this class relevant subsets. In the case of finite or countably infinite sample spaces S there are no such exceptions and $\mathcal{A}$ covers all subsets of S.

A Set Theory Primer

A set is “a collection of definite, well distinguished objects of our perception or of our thought”. (Georg Cantor, 1845-1918)

Some important sets:
◦ $\mathbb{N} = \{1, 2, 3, \ldots\}$, the set of natural numbers
◦ $\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$, the set of integers
◦ $\mathbb{R} = (-\infty, \infty)$, the set of real numbers

Intervals are denoted as follows:
  [0, 1]  the interval from 0 to 1 including 0 and 1
  [0, 1)  the interval from 0 to 1 including 0 but not 1
  (0, 1)  the interval from 0 to 1 not including 0 and 1

If a is an element of the set A then we write $a \in A$.
If a is not an element of the set A then we write $a \notin A$.

Suppose that A and B are subsets of S (denoted as A, B ⊆ S).
Union of A and B (A ∪ B): Set of all elements in S that are in A or in B.
Intersection of A and B (A ∩ B): Set of all elements in S which are both in A and in B.
Complement of A (A∁ or A′): Set of all elements in S that are not in A.
Difference of A and B (A\B): Set of all elements in A which are not in B.
Note that A ∩ A∁ = ∅ and A ∪ A∁ = S.

A and B are disjoint if A and B have no common elements, that is A ∩ B = ∅. Two events A and B with this property are said to be mutually exclusive.
The empty set is denoted by ∅ (Note: ∅ ⊆ A for all subsets A of S).

The Postulates of Probability

A probability on a sample space S (and a set $\mathcal{A}$ of events) is a function which assigns each event A a value in [0, 1] and satisfies the following rules:

Axiom 1: All probabilities are nonnegative:
  $\mathbb{P}(A) \ge 0$ for all events A.
Axiom 2: The probability of the whole sample space is 1:
  $\mathbb{P}(S) = 1$.
Axiom 3 (Addition Rule): If two events A and B are mutually exclusive then
  $\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B)$,
that is the probability that one or the other occurs is the sum of their probabilities.
More generally, if countably many events $A_i$, $i \in \mathbb{N}$, are mutually exclusive (i.e. $A_i \cap A_j = \emptyset$ whenever $i \ne j$) then
  $\mathbb{P}\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} \mathbb{P}(A_i)$.

The Postulates of Probability

Classical Concept of Probability
The probability of an event A is defined as
  $\mathbb{P}(A) = \frac{\#A}{\#S}$,
where #A denotes the number of elements (outcomes) in A.
It satisfies
◦ $\mathbb{P}(A) \ge 0$
◦ $\mathbb{P}(S) = \#S / \#S = 1$
◦ If A and B are mutually exclusive then
  $\mathbb{P}(A \cup B) = \frac{\#(A \cup B)}{\#S} = \frac{\#A}{\#S} + \frac{\#B}{\#S} = \mathbb{P}(A) + \mathbb{P}(B)$.

The Postulates of Probability

Frequency Interpretation of Probability
The probability of an event A is defined as
  $\mathbb{P}(A) = \lim_{n \to \infty} \frac{n(A)}{n}$,
where n(A) is the number of times event A occurred in n repetitions.
It satisfies
◦ $\mathbb{P}(A) \ge 0$
◦ $\mathbb{P}(S) = \lim_{n \to \infty} \frac{n}{n} = 1$
◦ If A and B are mutually exclusive then n(A ∪ B) = n(A) + n(B). Hence
  $\mathbb{P}(A \cup B) = \lim_{n \to \infty} \frac{n(A \cup B)}{n} = \lim_{n \to \infty} \left( \frac{n(A)}{n} + \frac{n(B)}{n} \right) = \lim_{n \to \infty} \frac{n(A)}{n} + \lim_{n \to \infty} \frac{n(B)}{n} = \mathbb{P}(A) + \mathbb{P}(B)$.

The Postulates of Probability

Example: Toss of one die
The events $A = \{1\}$ and $B$ (two faces other than 1) are mutually exclusive. Since all outcomes are equiprobable we obtain
  $\mathbb{P}(A) = \frac{1}{6}$ and $\mathbb{P}(B) = \frac{1}{3}$.
The addition rule yields
  $\mathbb{P}(A \cup B) = \frac{1}{6} + \frac{1}{3} = \frac{3}{6} = \frac{1}{2}$.
On the other hand we get for $C = A \cup B$ (three faces)
  $\mathbb{P}(C) = \frac{3}{6} = \frac{1}{2}$.

The first two axioms can be summarized by the Cardinal Rule: For any subset A of S
  $0 \le \mathbb{P}(A) \le 1$.
In particular
◦ $\mathbb{P}(\emptyset) = 0$
◦ $\mathbb{P}(S) = 1$

The Calculus of Probability

Let A and B be events in a sample space S.

Partition rule: $\mathbb{P}(A) = \mathbb{P}(A \cap B) + \mathbb{P}(A \cap B^{\complement})$
Example: Roll a pair of fair dice
  $\mathbb{P}(\text{Total of 10}) = \mathbb{P}(\text{Total of 10 and double}) + \mathbb{P}(\text{Total of 10 and no double}) = \frac{1}{36} + \frac{2}{36} = \frac{3}{36} = \frac{1}{12}$

Complementation rule: $\mathbb{P}(A^{\complement}) = 1 - \mathbb{P}(A)$
Example: Often useful for events of the type “at least one”:
  $\mathbb{P}(\text{At least one even number}) = 1 - \mathbb{P}(\text{No even number}) = 1 - \frac{9}{36} = \frac{3}{4}$

Containment rule: $\mathbb{P}(A) \le \mathbb{P}(B)$ for all A ⊆ B
Example: Compare two aces with doubles,
  $\frac{1}{36} = \mathbb{P}(\text{Two aces}) \le \mathbb{P}(\text{Doubles}) = \frac{6}{36} = \frac{1}{6}$

Calculus of Probability, Jan 26, 2003
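The three dice examples can be verified by enumerating the 36 equally likely outcomes of two fair dice. In the Python sketch below the helper `prob` and the event functions are illustrative names, not part of the lecture.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes


def prob(event):
    """Classical probability: favorable outcomes divided by all outcomes."""
    return Fraction(sum(1 for roll in rolls if event(roll)), len(rolls))


def total_10(roll):
    return roll[0] + roll[1] == 10


def double(roll):
    return roll[0] == roll[1]


# Partition rule: split "total of 10" by whether the roll is a double.
p_total_10 = prob(total_10)
p_partition = (prob(lambda r: total_10(r) and double(r))
               + prob(lambda r: total_10(r) and not double(r)))

# Complementation rule: "at least one even number".
p_no_even = prob(lambda r: r[0] % 2 == 1 and r[1] % 2 == 1)
p_at_least_one_even = 1 - p_no_even

# Containment rule: "two aces" is a subset of "doubles".
p_two_aces = prob(lambda r: r == (1, 1))
p_doubles = prob(double)
```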

The Calculus of Probability

Inclusion and exclusion formula
  $\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B)$

Example: Roll a pair of fair dice
  $\mathbb{P}(\text{Total of 10 or double}) = \mathbb{P}(\text{Total of 10}) + \mathbb{P}(\text{Double}) - \mathbb{P}(\text{Total of 10 and double}) = \frac{3}{36} + \frac{6}{36} - \frac{1}{36} = \frac{8}{36} = \frac{2}{9}$

The two events are
  Total of 10 = {(4,6), (5,5), (6,4)}
and
  Double = {(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)}.
The intersection is
  Total of 10 and double = {(5,5)}.
Adding the probabilities for the two events, the probability for the event {(5,5)} is added twice.
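With events represented as Python sets, the inclusion and exclusion formula for this example can be checked directly. The names below are illustrative.

```python
from fractions import Fraction
from itertools import product

rolls = set(product(range(1, 7), repeat=2))          # 36 equally likely outcomes
total_10 = {r for r in rolls if sum(r) == 10}
double = {r for r in rolls if r[0] == r[1]}


def prob(event):
    """Classical probability of a set of outcomes."""
    return Fraction(len(event), len(rolls))


# Left side: probability of the union, counted directly.
p_union = prob(total_10 | double)

# Right side: add both probabilities and subtract the doubly counted intersection.
p_formula = prob(total_10) + prob(double) - prob(total_10 & double)
```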

Conditional Probability

Probability gives chances for events in sample space S.
Often: Have partial information about event of interest; measure probability with respect to a subset of S.

Conditional probability of A given B
  $\mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$ if $\mathbb{P}(B) > 0$.
If $\mathbb{P}(B) = 0$ then $\mathbb{P}(A|B)$ is undefined.

Example: Number of Deaths in the U.S. in 1996
[Table: number of deaths by cause (rows: Heart, Cancer, HIV, Accidents (1), Homicide (2), All causes) and age group (columns: All ages, 1-4, 5-14, 15-24, 25-44, 45-64, ≥ 65); the cell values are not recoverable from this copy]
(1) Accidents and adverse effects. (2) Homicide and legal intervention.

Conditional probabilities for causes of death:
◦ $\mathbb{P}(\text{accident}) = 0.04282$
◦ $\mathbb{P}(\text{age=10}) = 0.00390$
◦ $\mathbb{P}(\text{accident}|\text{age=10}) = 0.42423$
◦ $\mathbb{P}(\text{accident}|\text{age=40}) = 0.17832$

Conditional Probability

Example: Select two cards from 32 cards
◦ What is the probability that the second card is an ace?
  ℙ(2nd card is an ace) = 1/8
◦ What is the probability that the second card is an ace if the first was an ace?
  ℙ(2nd card is an ace | 1st card was an ace) = 3/31

Multiplication rules

Example: Death Rates (per 100,000 people)

All Ages   1-4    5-14   15-24   25-44   45-64   ≥ 65
872.5      38.3   22.0   90.3    177.8   708.0   5071.4

Can we combine these rates with the table on causes of death?
◦ What is the probability to die from an accident (HIV)?
◦ What is the probability to die from an accident at age 10 (40)?

Know ℙ(accident|die) = ℙ(die from accident)/ℙ(die)
⇒ ℙ(die from accident) = ℙ(accident|die) ℙ(die)

Calculate probabilities:
◦ ℙ(die from accident) = 0.04281 · 0.00873 = 0.00037
◦ ℙ(die from HIV) = 0.01473 · 0.00873 = 0.00013
◦ ℙ(die from accident|age = 10) = 0.42423 · 0.00090 = 0.00038
◦ ℙ(die from accident|age = 40) = 0.17832 · 0.00178 = 0.00031
◦ ℙ(die from HIV|age = 10) = 0.02055 · 0.00090 = 0.00002
◦ ℙ(die from HIV|age = 40) = 0.15308 · 0.00178 = 0.00027

General multiplication rule:
ℙ(A ∩ B) = ℙ(A|B) ℙ(B) = ℙ(B|A) ℙ(A)

Independence

Example: Roll two dice
◦ What is the probability that the second die shows ⚀?
  ℙ(2nd die = ⚀) = 1/6
◦ What is the probability that the second die shows ⚀ if the first die already shows ⚀?
  ℙ(2nd die = ⚀ | 1st die = ⚀) = 1/6
◦ What is the probability that the second die shows ⚀ if the first does not show ⚀?
  ℙ(2nd die = ⚀ | 1st die ≠ ⚀) = 1/6

The chances of getting ⚀ with the second die are the same, no matter what the first die shows. Such events are called independent:

The event A is independent of the event B if its chances are not affected by the occurrence of B,
ℙ(A|B) = ℙ(A).
Equivalently, A and B are independent if
ℙ(A ∩ B) = ℙ(A) ℙ(B).
Otherwise we say A and B are dependent.

Let’s Make a Deal

The Rules:
◦ Three doors - one prize, two blanks
◦ Candidate selects one door
◦ Showmaster reveals one losing door
◦ Candidate may switch doors

Doors: 1 2 3

Would YOU change? Can probability theory help you?
◦ What is the probability of winning if candidate switches doors?
◦ What is the probability of winning if candidate does not switch doors?

The Rule of Total Probability

Events of interest:
◦ A - choose winning door at the beginning
◦ W - win the prize

Strategy: Switch doors (S)
Know:
◦ ℙS(A) = 1/3
◦ ℙS(A∁) = 2/3
◦ ℙS(W|A) = 0
◦ ℙS(W|A∁) = 1
Probability of interest: ℙS(W):
ℙS(W) = ℙS(W ∩ A) + ℙS(W ∩ A∁)
      = ℙS(W|A) ℙS(A) + ℙS(W|A∁) ℙS(A∁)
      = 0 · 1/3 + 1 · 2/3 = 2/3

Strategy: Do not switch doors (N)
Know:
◦ ℙN(A) = 1/3
◦ ℙN(A∁) = 2/3
◦ ℙN(W|A) = 1
◦ ℙN(W|A∁) = 0
Probability of interest: ℙN(W):
ℙN(W) = ℙN(W ∩ A) + ℙN(W ∩ A∁)
      = ℙN(W|A) ℙN(A) + ℙN(W|A∁) ℙN(A∁)
      = 1 · 1/3 + 0 · 2/3 = 1/3
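The two winning probabilities can also be obtained by enumerating all equally likely combinations of prize door and first pick. The sketch below is an illustration added here; the function name `win_probability` is hypothetical, and it assumes the showmaster always opens a losing door that the candidate did not pick.

```python
from fractions import Fraction

def win_probability(switch):
    """Exact winning probability for the 'switch' or 'stay' strategy.

    Prize door and first pick are each uniform over the three doors.
    Since the showmaster removes the other losing door, switching wins
    exactly when the first pick was wrong; staying wins when it was right.
    """
    doors = (1, 2, 3)
    wins = 0
    for prize in doors:
        for pick in doors:
            first_pick_correct = (pick == prize)
            wins += (not first_pick_correct) if switch else first_pick_correct
    return Fraction(wins, len(doors) ** 2)

p_switch = win_probability(switch=True)
p_stay = win_probability(switch=False)
print(p_switch, p_stay)
```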

The Rule of Total Probability

Rule of Total Probability
If B1, . . . , Bk mutually exclusive and B1 ∪ . . . ∪ Bk = S, then
ℙ(A) = ℙ(A|B1) ℙ(B1) + . . . + ℙ(A|Bk) ℙ(Bk)

Example: Suppose an applicant for a job has been invited for an interview. The chance that
◦ he is nervous is ℙ(N) = 0.7,
◦ the interview is successful if he is nervous is ℙ(S|N) = 0.2,
◦ the interview is successful if he is not nervous is ℙ(S|N∁) = 0.9.
What is the probability that the interview is successful?
ℙ(S) = ℙ(S|N) ℙ(N) + ℙ(S|N∁) ℙ(N∁) = 0.2 · 0.7 + 0.9 · 0.3 = 0.41

The Rule of Total Probability

Example: Suppose we have two unfair coins:
◦ Coin 1 comes up heads with probability 0.8
◦ Coin 2 comes up heads with probability 0.35
Choose a coin at random and flip it. What is the probability of its being a head?

Events: H=“heads comes up”, C1=“1st coin”, C2=“2nd coin”
ℙ(H) = ℙ(H|C1) ℙ(C1) + ℙ(H|C2) ℙ(C2) = 1/2 · (0.8 + 0.35) = 0.575
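Both examples of the rule of total probability (the job interview and the two coins) are weighted sums, which a small helper makes explicit. This is a sketch with a hypothetical function name `total_probability`; it is not taken from the slides.

```python
def total_probability(conditionals, priors):
    """P(A) = sum_i P(A|B_i) P(B_i) for a partition B_1, ..., B_k."""
    if len(conditionals) != len(priors):
        raise ValueError("need one conditional probability per partition event")
    if abs(sum(priors) - 1.0) > 1e-12:
        raise ValueError("the partition probabilities must sum to 1")
    return sum(c * p for c, p in zip(conditionals, priors))

# Interview: partition {nervous, not nervous}.
p_success = total_probability([0.2, 0.9], [0.7, 0.3])
# Coins: partition {coin 1 chosen, coin 2 chosen}.
p_heads = total_probability([0.8, 0.35], [0.5, 0.5])
print(round(p_success, 3), round(p_heads, 3))
```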

Bayes’ Theorem

Example: O.J. Simpson
“Only about 1/10 of one percent of wife-batterers actually murder their wives”
Lawyer of O.J. Simpson on TV

Fact: Simpson pleaded no contest to beating his wife in 1988.
So he murdered his wife with probability 0.001?

◦ Sample space S - married couples in U.S. in which the husband beat his wife in 1988
◦ Event H - all couples in S in which the husband has since murdered his wife
◦ Event M - all couples in S in which the wife has been murdered since 1988

We have
◦ ℙ(H) = 0.001
◦ ℙ(M|H) = 1 since H ⊆ M
◦ ℙ(M|H∁) = 0.0001 at most in the U.S.

Then
ℙ(H|M) = ℙ(M|H) ℙ(H) / ℙ(M)
       = ℙ(M|H) ℙ(H) / (ℙ(M|H) ℙ(H) + ℙ(M|H∁) ℙ(H∁))
       = 0.001 / (0.001 + 0.0001 · 0.999) = 0.91

Bayes’ Theorem

Reversal of conditioning (general multiplication rule)
ℙ(B|A) ℙ(A) = ℙ(A|B) ℙ(B)
Rewriting ℙ(A) using the rule of total probability we obtain

Bayes’ Theorem
ℙ(B|A) = ℙ(A|B) ℙ(B) / (ℙ(A|B) ℙ(B) + ℙ(A|B∁) ℙ(B∁))

If B1, . . . , Bk mutually exclusive and B1 ∪ . . . ∪ Bk = S, then
ℙ(Bi|A) = ℙ(A|Bi) ℙ(Bi) / (ℙ(A|B1) ℙ(B1) + . . . + ℙ(A|Bk) ℙ(Bk))
(General form of Bayes’ Theorem)

Bayes’ Theorem

Example: Testing for AIDS
Enzyme immunoassay test for HIV:
◦ ℙ(T+|I+) = 0.98 (sensitivity - positive for infected)
◦ ℙ(T-|I-) = 0.995 (specificity - negative for noninfected)
◦ ℙ(I+) = 0.0003 (prevalence)

What is the probability that the tested person is infected if the test was positive?
ℙ(I+|T+) = ℙ(T+|I+) ℙ(I+) / (ℙ(T+|I+) ℙ(I+) + ℙ(T+|I-) ℙ(I-))
         = 0.98 · 0.0003 / (0.98 · 0.0003 + 0.005 · 0.9997) = 0.0556

Consider different population with ℙ(I+) = 0.1 (greater risk)
ℙ(I+|T+) = 0.98 · 0.1 / (0.98 · 0.1 + 0.005 · 0.9) = 0.956

Testing on a large scale is not sensible (too many false positives).

Repeat test (Bayesian updating):
◦ ℙ(I+|T++) = 0.92 in 1st population
◦ ℙ(I+|T++) = 0.9998 in 2nd population
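Bayesian updating for the repeated test means feeding the posterior of the first test back in as the prior of the second. The following sketch illustrates this; the function name `posterior` is an assumption made here, and it treats the two test results as conditionally independent given the infection status, which the slide's numbers appear to assume as well.

```python
def posterior(prior, sensitivity, specificity):
    """P(infected | positive test) by Bayes' theorem."""
    true_positive = sensitivity * prior
    false_positive = (1 - specificity) * (1 - prior)
    return true_positive / (true_positive + false_positive)

SENSITIVITY, SPECIFICITY = 0.98, 0.995

for prevalence in (0.0003, 0.1):
    after_one = posterior(prevalence, SENSITIVITY, SPECIFICITY)
    # Bayesian updating: the first posterior is the prior for the second test.
    after_two = posterior(after_one, SENSITIVITY, SPECIFICITY)
    print(prevalence, round(after_one, 4), round(after_two, 4))
```

With a prevalence of 0.0003 a single positive result leaves the probability of infection below 6 percent, which is the reason the slide calls large-scale testing not sensible.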

Random Variables

Aim:
◦ Learn about population
◦ Available information: observed data x1, . . . , xn
Problem:
◦ Data affected by chance variation
◦ New set of data would look different

Suppose we observe/measure some characteristic (variable) of n individuals. The actual observed values x1, . . . , xn are the outcome of a random phenomenon.

Random variable: a variable whose value is a numerical outcome of a random phenomenon

Remark: Mathematically, a random variable is a real-valued function on the sample space S:
X: S → ℝ, ω ↦ x = X(ω)
◦ SX = X(S) is the sample space of the random variable.
◦ The outcome x = X(ω) is called realization of X.
◦ X induces a probability P(B) = ℙ(X ∈ B) on SX, the probability distribution of X.

Example: Roll one die
Outcome ω          ⚀ ⚁ ⚂ ⚃ ⚄ ⚅
Realization X(ω)   1 2 3 4 5 6

Random Variables, Jan 28, 2003

Random Variables

Example: Roll two dice
◦ X1 - number on the first die
◦ X2 - number on the second die
◦ Y = X1 + X2 - total number of points (a function of random variables is again a random variable)

Table of outcomes: value of Y for each outcome (X1, X2)

X1 \ X2   1    2    3    4    5    6
1         2    3    4    5    6    7
2         3    4    5    6    7    8
3         4    5    6    7    8    9
4         5    6    7    8    9   10
5         6    7    8    9   10   11
6         7    8    9   10   11   12

Random Variables

Two important types of random variables:
• Discrete random variable
  ◦ takes values in a finite or countable set
• Continuous random variable
  ◦ takes values in a continuum, or uncountable set
  ◦ probability of any particular outcome x is zero: ℙ(X = x) = 0 for all x ∈ SX

Example: Ten tosses of a coin
Suppose we toss a coin ten times. Let
◦ X be the number of heads in ten tosses of a coin (discrete)
◦ Y be the time it takes to toss ten times (continuous)

Discrete Random Variables

Suppose X is a discrete random variable with values x1, x2, . . .

Frequency function: The function
p(x) = ℙ(X = x) = ℙ({ω ∈ S | X(ω) = x})
is called the frequency function or probability mass function.

Note: p defines a probability on SX = {x1, x2, . . .}:
$P(B) = \sum_{x \in B} p(x) = \mathbb{P}(X \in B)$.
We call P the (probability) distribution of X.

Properties of a discrete probability distribution
◦ p(x) ≥ 0 for all values of X
◦ $\sum_i p(x_i) = 1$

Example: Roll two dice
Y = X1 + X2 total number of points

y          2     3     4     5     6     7     8     9     10    11    12
ℙ(Y = y)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Discrete Random Variables

Example: Roll one die
Let X denote the number of points on the face turned up. Since all numbers are equally likely we obtain
p(x) = ℙ(X = x) = 1/6 if x ∈ {1, . . . , 6}, and p(x) = 0 otherwise.

Example: Roll two dice
The probability mass function of the total number of points Y = X1 + X2 can be written as:
p(y) = ℙ(Y = y) = (6 − |y − 7|)/36 if y ∈ {2, . . . , 12}, and p(y) = 0 otherwise.

Example: Three tosses of a coin
Let X be the number of heads in three tosses of a coin. There are $\binom{3}{x}$ outcomes with x heads and 3 − x tails, thus
$p(x) = \binom{3}{x} \frac{1}{8}$.
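The closed-form frequency functions above can be compared against direct counting. This is an illustrative sketch; the function names `pmf_two_dice` and `pmf_heads` are chosen here and do not come from the slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb

def pmf_two_dice(y):
    """Closed form p(y) = (6 - |y - 7|) / 36 for the total of two dice."""
    return Fraction(6 - abs(y - 7), 36) if 2 <= y <= 12 else Fraction(0)

def pmf_heads(x):
    """p(x) = C(3, x) / 8 for the number of heads in three fair tosses."""
    return Fraction(comb(3, x), 8) if 0 <= x <= 3 else Fraction(0)

# Count outcomes directly and turn the counts into probabilities.
dice_counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
coin_counts = Counter(sum(t) for t in product((0, 1), repeat=3))

dice_ok = all(pmf_two_dice(y) == Fraction(dice_counts[y], 36) for y in range(2, 13))
coin_ok = all(pmf_heads(x) == Fraction(coin_counts[x], 8) for x in range(4))
print(dice_ok, coin_ok)
```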

Continuous Random Variables

For a continuous random variable X, the probability that X falls in the interval (a, b] is given by

$\mathbb{P}(a < X \le b) = \int_a^b f(x)\,dx$,

where f is the density function of X.

Note: The density defines a probability on ℝ:

$P([a, b]) = \int_a^b f(x)\,dx = \mathbb{P}(X \in [a, b])$

We call P the (probability) distribution of X.
Remark: The definition of P can be extended to (almost) all B ⊆ ℝ.

Example: Spinner
Consider a spinner that turns freely on its axis and slowly comes to a stop.
◦ X is the stopping point on the circle marked from 0 to 1.
◦ X can take any value in SX = [0, 1).
◦ The outcomes of X are uniformly distributed over the interval [0, 1).
Then the density function of X is
f(x) = 1 if 0 ≤ x < 1, and f(x) = 0 otherwise.
Consequently, for 0 ≤ a ≤ b < 1,
ℙ(X ∈ [a, b]) = b − a.
Note that for all possible outcomes x ∈ [0, 1) we have ℙ(X ∈ [x, x]) = x − x = 0.

Independence of Random Variables

Recall: Two events A and B are independent if
ℙ(A ∩ B) = ℙ(A) ℙ(B)

Independence of Random Variables
Two discrete random variables X and Y are independent if
ℙ(X ∈ A, Y ∈ B) = ℙ(X ∈ A) ℙ(Y ∈ B)
for all A ⊆ SX and B ⊆ SY.

Remark: It is sufficient to show that
ℙ(X = x, Y = y) = pX(x) pY(y) = ℙ(X = x) ℙ(Y = y)
for all x ∈ SX and y ∈ SY.

More generally, X1, X2, . . . are independent if for all n ∈ ℕ
ℙ(X1 ∈ A1, . . . , Xn ∈ An) = ℙ(X1 ∈ A1) · · · ℙ(Xn ∈ An)
for all Ai ⊆ SXi.

Example: Toss coin three times
Consider
Xi = 1 if head in ith toss of coin, and Xi = 0 otherwise.
X1, X2, and X3 are independent:
ℙ(X1 = x1, . . . , X3 = x3) = 1/8 = ℙ(X1 = x1) ℙ(X2 = x2) ℙ(X3 = x3)

Multivariate Distributions: Discrete Case

Let X and Y be discrete random variables.

Joint frequency function of X and Y
pXY(x, y) = ℙ(X = x, Y = y) = ℙ({X = x} ∩ {Y = y})

Marginal frequency function of X
$p_X(x) = \sum_i p_{XY}(x, y_i)$

Marginal frequency function of Y
$p_Y(y) = \sum_i p_{XY}(x_i, y)$

The random variables X and Y are independent if and only if
pXY(x, y) = pX(x) pY(y)
for all possible values x ∈ SX and y ∈ SY.

Conditional probability of X = x given Y = y
$\mathbb{P}(X = x \mid Y = y) = p_{X|Y}(x|y) = \frac{p_{XY}(x, y)}{p_Y(y)} = \frac{\mathbb{P}(X = x, Y = y)}{\mathbb{P}(Y = y)}$
where pX|Y(x|y) is the conditional frequency function.

Multivariate Distributions: Discrete Case

Example: Three Tosses of a Coin
◦ X - number of heads on the first toss (values in {0, 1})
◦ Y - total number of heads (values in {0, 1, 2, 3})

The joint frequency function pXY(x, y) is given by the following table (last row and column: marginal frequency functions)

x \ y    0     1     2     3     pX(x)
0        1/8   2/8   1/8   0     1/2
1        0     1/8   2/8   1/8   1/2
pY(y)    1/8   3/8   3/8   1/8   1

Marginal frequency function of Y
pY(0) = ℙ(Y = 0) = ℙ(Y = 0, X = 0) + ℙ(Y = 0, X = 1) = 1/8 + 0 = 1/8
pY(1) = ℙ(Y = 1) = ℙ(Y = 1, X = 0) + ℙ(Y = 1, X = 1) = 2/8 + 1/8 = 3/8
. . .
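The joint table for the three coin tosses and its marginals can be generated by enumerating the eight equally likely toss sequences. The sketch below is an added illustration (variable names are assumptions); it also checks whether X and Y satisfy the product criterion for independence.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint frequency function of X (heads on first toss) and Y (total heads).
joint = defaultdict(Fraction)  # missing entries default to Fraction(0)
for tosses in product((0, 1), repeat=3):  # 1 = head, 0 = tail
    x, y = tosses[0], sum(tosses)
    joint[(x, y)] += Fraction(1, 8)

# Marginals: sum the joint frequency function over the other variable.
p_x = {x: sum(joint[(x, y)] for y in range(4)) for x in (0, 1)}
p_y = {y: sum(joint[(x, y)] for x in (0, 1)) for y in range(4)}

# X and Y are independent only if the joint factorizes everywhere.
independent = all(joint[(x, y)] == p_x[x] * p_y[y]
                  for x in (0, 1) for y in range(4))
print(p_x, p_y, independent)
```

Here the check fails, for example at (x, y) = (0, 0), so the first toss and the total number of heads are dependent.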

Multivariate Distributions: Continuous Case

Let X and Y be continuous random variables.

Joint density function of X and Y: fXY such that
$\int_A \int_B f_{XY}(x, y)\,dy\,dx = \mathbb{P}(X \in A, Y \in B)$

Marginal density function of X:
$f_X(x) = \int f_{XY}(x, y)\,dy$

Marginal density function of Y:
$f_Y(y) = \int f_{XY}(x, y)\,dx$

The random variables X and Y are independent if and only if
fXY(x, y) = fX(x) fY(y)
for all possible values x ∈ SX and y ∈ SY.

Conditional density function of X given Y = y
$f_{X|Y}(x|y) = \frac{f_{XY}(x, y)}{f_Y(y)}$

Conditional probability of X ∈ A given Y = y
$\mathbb{P}(X \in A \mid Y = y) = \int_A f_{X|Y}(x|y)\,dx$

Bernoulli Distribution

Example: Toss of coin
Define X = 1 if head comes up and X = 0 if tail comes up. Both realizations are equally likely:
ℙ(X = 1) = ℙ(X = 0) = 1/2

Often: Two outcomes which are not equally likely
Examples:
◦ Success of medical treatment
◦ Interviewed person is female
◦ Student passes exam
◦ Transmittance of a disease

Bernoulli distribution (with parameter θ)
◦ X takes two values, 0 and 1, with probabilities 1 − θ and θ
◦ Frequency function of X:
  p(x) = θ^x (1 − θ)^(1−x) for x ∈ {0, 1}, and p(x) = 0 otherwise
◦ Often:
  X = 1 if event A has occurred, and X = 0 otherwise
  Example: A = blood pressure above 140/90 mmHg.

Distributions, Jan 30, 2003

Bernoulli Distribution

Let X1, . . . , Xn be independent Bernoulli random variables with same parameter θ.

Frequency function of X1, . . . , Xn
$p(x_1, \ldots, x_n) = p(x_1) \cdots p(x_n) = \theta^{x_1 + \ldots + x_n} (1 - \theta)^{n - x_1 - \ldots - x_n}$
for xi ∈ {0, 1} and i = 1, . . . , n

Example: Paired-Sample Sign Test
◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before and after installation of the program in 10 industrial plants

Plant    1    2    3    4     5    6    7    8    9    10
Before   45   73   46   124   33   57   83   34   26   17
After    36   60   44   119   35   51   77   29   24   11

Define for the ith plant
Xi = 1 if first value is greater than the second, and Xi = 0 otherwise

Plant    1    2    3    4     5    6    7    8    9    10
Xi       1    1    1    1     0    1    1    1    1    1

Result: The Xi’s are independently Bernoulli distributed with unknown parameter θ.

Binomial Distribution

Let X1, . . . , Xn be independent Bernoulli random variables
◦ Often only interested in number of successes
  Y = X1 + . . . + Xn

Example: Paired Sample Sign Test (contd)
Define for the ith plant
Xi = 1 if first value is greater than the second, and Xi = 0 otherwise
and
$Y = \sum_{i=1}^{n} X_i$
Y is the number of plants for which the number of lost hours has decreased after the installation of the safety program.

We know:
◦ Xi is Bernoulli distributed with parameter θ
◦ Xi’s are independent

What is the distribution of Y?
◦ Probability of realization x1, . . . , xn with y successes:
  p(x1, . . . , xn) = θ^y (1 − θ)^(n−y)
◦ Number of different realizations with y successes: $\binom{n}{y}$

Binomial Distribution

Binomial distribution (with parameters n and θ)
Let X1, . . . , Xn be independent and Bernoulli distributed with parameter θ and
$Y = \sum_{i=1}^{n} X_i$.
Y has frequency function
$p(y) = \binom{n}{y} \theta^y (1 - \theta)^{n-y}$ for y ∈ {0, . . . , n}
Y is binomially distributed with parameters n and θ. We write
Y ∼ Bin(n, θ).

Note that
◦ the number of trials is fixed,
◦ the probability of success is the same for each trial, and
◦ the trials are independent.

Example: Paired Sample Sign Test (contd)
Let Y be the number of plants for which the number of lost hours has decreased after the installation of the safety program. Then
Y ∼ Bin(n, θ)
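For the sign-test data, the indicators Xi and their sum Y follow directly from the before/after values, and the binomial frequency function can be evaluated with `math.comb`. This is a sketch added for illustration; the value θ = 0.5 used at the end is an assumption chosen here (it corresponds to "the program has no effect") and is not stated on the slides.

```python
from math import comb

before = [45, 73, 46, 124, 33, 57, 83, 34, 26, 17]
after = [36, 60, 44, 119, 35, 51, 77, 29, 24, 11]

# X_i = 1 if the loss before the program exceeds the loss afterwards.
x = [1 if b > a else 0 for b, a in zip(before, after)]
y = sum(x)  # number of plants with decreased losses

def binom_pmf(k, n, theta):
    """Frequency function of Bin(n, theta) at k."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

# Probability of observing exactly y successes under the assumed theta = 0.5.
p_y_given_half = binom_pmf(y, len(x), 0.5)
print(x, y, p_y_given_half)
```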

Binomial Distribution

Binomial distribution for n = 10
[Figure: frequency functions p(x), x = 0, . . . , 10, of the binomial distribution with n = 10 for four values of θ]

Geometric Distribution

Consider a sequence of independent Bernoulli trials.
◦ On each trial, a success occurs with probability θ.
◦ Let X be the number of trials up to the first success.

What is the distribution of X?
◦ Probability of no success in x − 1 trials: (1 − θ)^(x−1)
◦ Probability of one success in the xth trial: θ

The frequency function of X is
p(x) = θ (1 − θ)^(x−1), x = 1, 2, 3, . . .
X is geometrically distributed with parameter θ.

Example:
Suppose a batter has probability 1/3 to hit the ball. What is the chance that he misses the ball less than 3 times?
The number X of balls up to the first success is geometrically distributed with parameter 1/3. Thus
ℙ(X ≤ 3) = 1/3 + 1/3 · 2/3 + 1/3 · (2/3)² = 19/27 = 0.7037.
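The batter example is a partial sum of the geometric frequency function. A short sketch (function names chosen here for illustration) reproduces it exactly and compares it with the complement of "three misses in a row".

```python
from fractions import Fraction

def geom_pmf(x, theta):
    """p(x) = theta * (1 - theta)**(x - 1): first success on trial x."""
    return theta * (1 - theta) ** (x - 1)

def geom_cdf(x, theta):
    """P(X <= x): the first success occurs within the first x trials."""
    return sum(geom_pmf(k, theta) for k in range(1, x + 1))

theta = Fraction(1, 3)
p_at_most_3 = geom_cdf(3, theta)
# Equivalent shortcut: 1 - P(no success in the first 3 trials).
p_shortcut = 1 - (1 - theta) ** 3
print(p_at_most_3, float(p_at_most_3))
```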

Hypergeometric Distribution

Example: Quality Control
Quality control - sample and examine fraction of produced units
◦ N produced units
◦ M defective units
◦ n sampled units
What is the probability that the sample contains x defective units?

The frequency function of X is
$p(x) = \frac{\binom{M}{x} \binom{N-M}{n-x}}{\binom{N}{n}}$, x = 0, 1, . . . , n.
X is a hypergeometric random variable with parameters N, M, and n.

Example:
Suppose that of 100 applicants for a job 50 were women and 50 were men, all equally qualified. If we select 10 applicants at random what is the probability that x of them are female?
The number of chosen female applicants is hypergeometrically distributed with parameters 100, 50, and 10. The frequency function is
$p(x) = \frac{\binom{50}{x} \binom{50}{10-x}}{\binom{100}{10}}$ for x = 0, 1, . . . , 10.
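The hypergeometric frequency function for the applicant example can be evaluated exactly with binomial coefficients. The sketch below is an added illustration with a hypothetical helper name `hypergeom_pmf`.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, N, M, n):
    """P(x marked units in a sample of n from N units, M of them marked)."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

# 100 applicants, 50 women, 10 selected at random.
pmf = [hypergeom_pmf(x, 100, 50, 10) for x in range(11)]
mean = sum(x * p for x, p in enumerate(pmf))

for x, p in enumerate(pmf):
    print(x, float(p))
```

Because the population is split evenly, the frequency function is symmetric around 5, which is also its mean n M / N.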

Poisson Distribution
Often we are interested in the number of events which occur in a specific period of time or in a specific area or volume:
◦ Number of alpha particles emitted from a radioactive source during a given period of time
◦ Number of telephone calls coming into an exchange during one unit of time
◦ Number of diseased trees per acre of a certain woodland
◦ Number of death claims received per day by an insurance company

Characteristics
Let X be the number of times a certain event occurs during a given unit of time (or in a given area, etc).
◦ The probability that the event occurs in a given unit of time is the same for all the units.
◦ The number of events that occur in one unit of time is independent of the number of events in other units.
◦ The mean (or expected) rate is λ.
Then X is a Poisson random variable with parameter λ and frequency function
$p(x) = \frac{\lambda^x}{x!} e^{-\lambda}$, x = 0, 1, 2, . . .

Poisson Approximation

The Poisson distribution is often used as an approximation for binomial probabilities when n is large and θ is small:
$p(x) = \binom{n}{x} \theta^x (1 - \theta)^{n-x} \approx \frac{\lambda^x}{x!} e^{-\lambda}$
with λ = n θ.

Example: Fatalities in Prussian cavalry
Classical example from von Bortkiewicz (1898).
◦ Number of fatalities resulting from being kicked by a horse
◦ 200 observations (10 corps over a period of 20 years)
Statistical model:
◦ Each soldier is kicked to death by a horse with probability θ.
◦ Let Y be the number of such fatalities in one corps. Then Y ∼ Bin(n, θ) where n is the number of soldiers in one corps.
Observation: The data are well approximated by a Poisson distribution with λ = 0.61

Deaths per Year   Observed   Rel. Frequency   Poisson Prob.
0                 109        0.545            0.543
1                 65         0.325            0.331
2                 22         0.110            0.101
3                 3          0.015            0.021
4                 1          0.005            0.003
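The table for the Prussian cavalry data can be reproduced from the observed counts alone. In this sketch (names chosen here for illustration) λ is taken to be the sample mean of the 200 observations, which gives the value 0.61 used on the slide.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Frequency function of the Poisson distribution with parameter lam."""
    return lam**x * exp(-lam) / factorial(x)

# Observed number of corps-years with 0, 1, 2, 3, 4 deaths.
observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}
n_obs = sum(observed.values())

# Sample mean number of deaths per corps-year, used as lambda.
lam = sum(deaths * count for deaths, count in observed.items()) / n_obs

for deaths, count in observed.items():
    rel_freq = count / n_obs
    print(deaths, count, round(rel_freq, 3), round(poisson_pmf(deaths, lam), 3))
```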

Poisson Approximation

Poisson approximation of Bin(40, θ)
[Figure: frequency functions p(x), x = 0, . . . , 20, of Bin(40, θ) paired with the Poisson distribution with λ = 40 θ, for θ = 1/400 (λ = 1/10), θ = 1/40 (λ = 1), θ = 1/8 (λ = 5), and θ = 1/4 (λ = 10)]

Continuous Distributions

Uniform distribution U(0, θ)
◦ Range: (0, θ)
◦ $f(x) = \frac{1}{\theta}\,\mathbf{1}_{(0,\theta)}(x)$
◦ 𝔼(X) = θ/2
◦ var(X) = θ²/12

Exponential distribution Exp(λ)
◦ Range: [0, ∞)
◦ $f(x) = \lambda \exp(-\lambda x)\,\mathbf{1}_{[0,\infty)}(x)$
◦ 𝔼(X) = 1/λ
◦ var(X) = 1/λ²

Normal distribution N(µ, σ²)
◦ Range: ℝ
◦ $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$
◦ 𝔼(X) = µ
◦ var(X) = σ²

[Figure: histograms (frequency against X) for U(0, θ), Exp(λ), and N(µ, σ²)]

Expected Value

Let X be a discrete random variable which takes values in SX = {x1, x2, . . . , xn}

Expected Value or Mean of X:
$\mathbb{E}(X) = \sum_{i=1}^{n} x_i\,p(x_i)$

Example: Roll one die
Let X be outcome of rolling one die. The frequency function is
p(x) = 1/6, x = 1, . . . , 6,
and hence
$\mathbb{E}(X) = \sum_{x=1}^{6} \frac{x}{6} = \frac{7}{2} = 3.5$

Example: Bernoulli random variable
Let X ∼ Bin(1, θ).
p(x) = θ^x (1 − θ)^(1−x)
Thus the mean of X is
𝔼(X) = 0 · (1 − θ) + 1 · θ = θ.

Expected Value and Variance, Feb 2, 2003

Expected Value

Linearity of the expected value
Let X and Y be two discrete random variables. Then
𝔼(a X + b Y) = a 𝔼(X) + b 𝔼(Y)
for any constants a, b ∈ ℝ
Note: No independence is required.

Proof: Using $\sum_y p(x, y) = p(x)$ and $\sum_x p(x, y) = p(y)$,
$\mathbb{E}(aX + bY) = \sum_{x,y} (a x + b y)\,p(x, y)$
$= a \sum_{x,y} x\,p(x, y) + b \sum_{x,y} y\,p(x, y)$
$= a \sum_x x\,p(x) + b \sum_y y\,p(y)$
$= a\,\mathbb{E}(X) + b\,\mathbb{E}(Y)$

Example: Binomial distribution
Let X ∼ Bin(n, θ). Then X = X1 + . . . + Xn with Xi ∼ Bin(1, θ):
$\mathbb{E}(X) = \sum_{i=1}^{n} \mathbb{E}(X_i) = \sum_{i=1}^{n} \theta = n\theta$

Expected Value

Example: Poisson distribution
Let X be a Poisson random variable with parameter λ.
$\mathbb{E}(X) = \sum_{x=0}^{\infty} x\,\frac{\lambda^x}{x!}\,e^{-\lambda}$
$= \lambda e^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!}$
$= \lambda e^{-\lambda} e^{\lambda} = \lambda$
(The term for x = 0 vanishes, so the second sum starts at x = 1.)

Remarks:
◦ For most distributions some “advanced” knowledge of calculus is required to find the mean.
◦ Use tables for means of commonly used distributions.

Expected Value

Example: European Call Options
Agreement that gives an investor the right (but not the obligation) to buy a stock, bond, commodity, or other instruments at a specific time at a specific price.

What is a fair price P for European call options?
If ST is the price of the stock at time T and K is the agreed price, the profit will be
Profit = (ST − K)+ − P,
where (s)+ = max(s, 0). Profit is a random variable.

Fair price P for this option is expected value
P = 𝔼(ST − K)+.

Expected Value

Example: European Call Options (contd)
Consider the following simple model:
◦ St = St−1 + εt, t = 1, . . . , T
◦ ℙ(εt = 1) = p and ℙ(εt = −1) = 1 − p.
St is also called a random walk.

The distribution of ST is given by (s0 known at time 0)
ST = s0 + 2 Y − T with Y ∼ Bin(T, p)

Therefore the price P is (assuming s0 = 0 without loss of generality)
$P = \mathbb{E}(S_T - K)_+ = \sum_{y=1}^{T} (2y - T - K)\,p(y)\,\mathbf{1}_{\{y > (K+T)/2\}}$,
where p(y) is the frequency function of Y.

[Figure: frequency function p(x) of the profit for a numerical example]

Expected Value

Example: Group testing
Suppose that a large number of blood samples are to be screened for a rare disease with prevalence 1 − p.
• If each sample is assayed individually, n tests will be required.
• Alternative scheme:
  ◦ n samples, m groups with k samples
  ◦ Split each sample in half and pool all samples in one group
  ◦ Test pooled sample for each group
  ◦ If test positive test all samples in group separately

What is the expected number of tests under this alternative scheme?
Let Xi be the number of tests in group i. The frequency function of Xi is
p(x) = p^k if x = 1, and p(x) = 1 − p^k if x = k + 1

The expected number of tests in each group is
𝔼(Xi) = p^k + (k + 1)(1 − p^k) = k + 1 − k p^k
Hence
$\mathbb{E}(N) = \sum_{i=1}^{m} \mathbb{E}(X_i) = n \left(1 + \frac{1}{k} - p^k\right)$

Plot of 𝔼(N):
[Figure: expected number of tests as a proportion of n, plotted against the group size k]
The mean is minimized for groups of size 11.
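The expected number of tests can be explored numerically as a function of the group size k. This is a sketch only: the value p = 0.99 (a prevalence of 1 percent) is an assumption made here, since the value of p behind the plot is not recoverable from the text; with this choice the minimum happens to fall at k = 11.

```python
def expected_tests(n, k, p):
    """E(N) = n * (1 + 1/k - p**k) for n samples pooled in groups of size k.

    p is the probability that a single sample is negative.
    """
    return n * (1 + 1 / k - p**k)

P_NEGATIVE = 0.99  # assumed for illustration; not stated in the source

# Expected tests per sample for a range of group sizes, and the best size.
proportion = {k: expected_tests(1, k, P_NEGATIVE) for k in range(2, 51)}
best_k = min(proportion, key=proportion.get)
print(best_k, round(proportion[best_k], 4))
```

For this assumed p, pooling needs only about a fifth of the tests required when every sample is assayed individually.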

Variance

Let X be a random variable.
Variance of X:
$\operatorname{var}(X) = \mathbb{E}\left(X - \mathbb{E}(X)\right)^2$
The variance of X is the expected squared distance of X from its mean.

Suppose X is discrete random variable with SX = {x1, . . . , xn}. Then the variance of X can be written as
$\operatorname{var}(X) = \sum_{i=1}^{n} \Big(x_i - \sum_{j=1}^{n} x_j\,p(x_j)\Big)^2 p(x_i)$

Example: Roll one die
X takes values in {1, 2, 3, 4, 5, 6} with frequency function p(x) = 1/6.
$\mathbb{E}(X) = \sum_{x=1}^{6} x \cdot \frac{1}{6} = \frac{7}{2}$
$\operatorname{var}(X) = \sum_{x=1}^{6} \left(x - \frac{7}{2}\right)^2 \frac{1}{6} = \frac{1}{6}\left(\frac{25}{4} + \frac{9}{4} + \frac{1}{4} + \frac{1}{4} + \frac{9}{4} + \frac{25}{4}\right) = \frac{35}{12}$

We often denote the variance of a random variable X by σX²,
σX² = var(X),
and its standard deviation by σX.
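The die example can be checked with exact arithmetic, computing the variance both from the definition and from the shortcut var(X) = 𝔼(X²) − (𝔼(X))². This is an illustrative sketch with variable names chosen here.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # each face is equally likely

mean = sum(x * p for x in faces)
# Definition: expected squared distance from the mean.
var_definition = sum((x - mean) ** 2 * p for x in faces)
# Shortcut: E(X^2) - (E(X))^2.
var_shortcut = sum(x**2 * p for x in faces) - mean**2

print(mean, var_definition, var_shortcut)
```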

Properties of the Variance

The variance can also be written as
var(X) = 𝔼(X²) − (𝔼(X))²
To see this (using linearity of the mean):
var(X) = 𝔼(X − 𝔼(X))²
       = 𝔼(X² − 2 X 𝔼(X) + (𝔼(X))²)
       = 𝔼(X²) − 2 𝔼(X) 𝔼(X) + (𝔼(X))²
       = 𝔼(X²) − (𝔼(X))²

Example: Let X ∼ Bin(1, θ). Since X² = X, we have
var(X) = 𝔼(X²) − (𝔼(X))² = 𝔼(X) − (𝔼(X))² = θ − θ² = θ (1 − θ)

Rules for the variance:
◦ For constants a and b
  var(aX + b) = a² var(X).
◦ For independent random variables X and Y
  var(X + Y) = var(X) + var(Y).

Example: Let X ∼ Bin(n, θ). Then
var(X) = n θ (1 − θ)

Covariance

For independent random variables X and Y we have
var(X + Y) = var(X) + var(Y).
Question: What about dependent random variables?

It can be shown that
var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)
where
cov(X, Y) = 𝔼((X − 𝔼(X))(Y − 𝔼(Y)))
is the covariance of X and Y.

Properties of the covariance
◦ cov(X, Y) = 𝔼(XY) − 𝔼(X) 𝔼(Y)
◦ cov(X, X) = var(X)
◦ cov(X, 1) = 0
◦ cov(X, Y) = cov(Y, X)
◦ cov(a X1 + b X2, Y) = a cov(X1, Y) + b cov(X2, Y)

Covariance

Important: cov(X, Y) = 0 does NOT imply that X and Y are independent.

Example: Suppose X ∈ {−1, 0, 1} with probabilities
ℙ(X = x) = 1/3 for x = −1, 0, 1.
Then 𝔼(X) = 0 and
cov(X, X²) = 𝔼(X³) = 𝔼(X) = 0
On the other hand
ℙ(X = 1, X² = 0) = 0 ≠ 1/9 = ℙ(X = 1) ℙ(X² = 0),
that is, X and X² are not independent!

Note: The covariance of X and Y measures only linear dependence.

c Y + d) = corr(X. x2): x1\x2 0 1 1 1 0 3 6 1 1 1 6 3 Thus: cov(X1. X2) = 1 · 12 = 1 3 1 2 ◦ −1 ≤ ρXY ≤ 1 and 1 1 = 4 · 1+ 1 · 3+21 ·1 = 3 4 4 6 1 12 Expected Value and Variance.6) Let Xi = 1{penny on ith draw}. p) with p = joint frequency function p(x1. Y ) = cov(X. X2) = [(X1 − p)(X2 − p)] 4 1 corr(X1. i. Properties: ◦ dimensionless quantity ◦ not aﬀected by linear transformations.Correlation The correlation coeﬃcient ρ is deﬁned as ρXY = corr(X. Y ) ◦ ρXY = 1 if and only if È(Y = a + b X) = 1 for some a and b ◦ measures linear association between X and Y Example: Three boxes: pp. Feb 2. pd. Then Xi ∼ Bin(1.11 - . corr(a X + b. and dd (Ex 3. Y ) var(X)var(Y ) .e. 2003 .

Prediction

An instructor standardizes his midterm and final so the class average is µ = 75 and the SD is σ = 10 on both tests. The correlation between the tests is always around ρ = 0.50.
◦ X - score of student on the first examination
◦ Y - score of student on the second examination
Since X and Y are dependent we should be able to predict the score in the final from the midterm score.

Approach:
◦ Predict Y from linear function a + b X
◦ Minimize mean squared error
  MSE = 𝔼(Y − a − b X)² = var(Y − b X) + (𝔼(Y − a − b X))²

Solution:
a = µ − b µ and b = σXY / σX² = ρ
Thus the best linear predictor is
Ŷ = µ + ρ (X − µ)

Note: We expect the student’s score on the final to differ from the mean only by half the difference observed in the midterm (regression to the mean).

Summary

Bernoulli distribution - Bin(1, θ)
p(x) = θ^x (1 − θ)^(1−x)
𝔼(X) = θ, var(X) = θ(1 − θ)

Binomial distribution - Bin(n, θ)
$p(x) = \binom{n}{x} \theta^x (1 - \theta)^{n-x}$
𝔼(X) = nθ, var(X) = nθ(1 − θ)

Poisson distribution - Poiss(λ)
$p(x) = \frac{\lambda^x}{x!} e^{-\lambda}$
𝔼(X) = λ, var(X) = λ

Geometric distribution
p(x) = θ(1 − θ)^(x−1)
𝔼(X) = 1/θ, var(X) = (1 − θ)/θ²

Hypergeometric distribution - H(N, M, n)
$p(x) = \frac{\binom{M}{x} \binom{N-M}{n-x}}{\binom{N}{n}}$
𝔼(X) = n M / N

Properties of the Sample Mean

Consider X1, . . . , Xn independent and identically distributed (iid) with mean µ and variance σ².
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ (sample mean)
Then
$\mathbb{E}(\bar{X}) = \frac{1}{n} \sum_{i=1}^{n} \mu = \mu$
$\operatorname{var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^{n} \sigma^2 = \frac{\sigma^2}{n}$

Remarks:
◦ The sample mean is an unbiased estimate of the true mean.
◦ The variance of the sample mean decreases as the sample size increases.
◦ Law of Large Numbers: It can be shown that for n → ∞
  $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \to \mu$.

Question:
◦ How close to µ is the sample mean for finite n?
◦ Can we answer this without knowing the distribution of X?

Central Limit Theorem, Feb 4, 2004

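Properties of the Sample Mean

Chebyshev’s inequality
Let X be a random variable with mean µ and variance σ². Then for any ε > 0
$\mathbb{P}\left(|X - \mu| > \varepsilon\right) \le \frac{\sigma^2}{\varepsilon^2}$.

Proof: Let
1{|xi − µ| > ε} = 1 if |xi − µ| > ε, and 0 otherwise.
Since 1{|xi − µ| > ε} ≤ (xi − µ)²/ε², we obtain
$\mathbb{P}(|X - \mu| > \varepsilon) = \sum_{i=1}^{n} \mathbf{1}\{|x_i - \mu| > \varepsilon\}\,p(x_i) \le \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\varepsilon^2}\,p(x_i) = \frac{\sigma^2}{\varepsilon^2}$

Application to the sample mean:
$\mathbb{P}\left(\mu - \frac{3\sigma}{\sqrt{n}} \le \bar{X} \le \mu + \frac{3\sigma}{\sqrt{n}}\right) \ge 1 - \frac{1}{9} \approx 0.889$
However: Known to be not very precise

Example: Xi iid ∼ N(0, 1)
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \sim N\left(0, \frac{1}{n}\right)$
Therefore
$\mathbb{P}\left(-\frac{3}{\sqrt{n}} \le \bar{X} \le \frac{3}{\sqrt{n}}\right) = 0.997$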
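The gap between the Chebyshev bound and the exact normal probability can be tabulated for several multiples k of the standard deviation. This sketch is an added illustration; it uses the identity ℙ(|Z| ≤ k) = erf(k/√2) for a standard normal Z, and the function names are chosen here.

```python
from math import erf, sqrt

def chebyshev_lower_bound(k):
    """Chebyshev: P(|X - mu| <= k * sd) >= 1 - 1/k**2 for any distribution."""
    return 1 - 1 / k**2

def normal_within(k):
    """Exact P(|Z| <= k) for a standard normal Z."""
    return erf(k / sqrt(2))

for k in (2, 3, 4):
    print(k, round(chebyshev_lower_bound(k), 3), round(normal_within(k), 4))
```

For k = 3 the bound guarantees only about 0.889, while the normal distribution actually puts about 0.997 of its mass within three standard deviations.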

Central Limit Theorem

Let X1, X2, . . . be a sequence of random variables
◦ independent and identically distributed
◦ with mean µ and variance σ².
For n ∈ ℕ define
$Z_n = \sqrt{n}\,\frac{\bar{X} - \mu}{\sigma} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{X_i - \mu}{\sigma}$.
Zn has mean 0 and variance 1.

Central Limit Theorem
For large n, the distribution of Zn can be approximated by the standard normal distribution N(0, 1). More precisely,
$\lim_{n \to \infty} \mathbb{P}\left(a \le \sqrt{n}\,\frac{\bar{X} - \mu}{\sigma} \le b\right) = \Phi(b) - \Phi(a)$,
where Φ(z) is the standard normal probability
$\Phi(z) = \int_{-\infty}^{z} f(x)\,dx$,
that is, the area under the standard normal curve to the left of z.

Example:
◦ U1, . . . , U12 uniformly distributed on [0, 12) (mean 6, variance 12).
◦ What is the probability that the sample mean exceeds 9?
$\mathbb{P}(\bar{U} > 9) = \mathbb{P}\left(\sqrt{12}\,\frac{\bar{U} - 6}{\sqrt{12}} > 3\right) \approx 1 - \Phi(3) = 0.0013$

Central Limit Theorem

[Figure: densities f(x) of Zn on the range −3 to 3 for Xi ∼ U[0,1] and Xi ∼ Exp(1), each with n = 1, 2, 6, 12, 100]

Central Limit Theorem

Example: Shipping packages
Suppose a company ships packages that vary in weight:
◦ Packages have mean 15 lb and standard deviation 10 lb.
◦ They come from a large number of customers, i.e. packages are independent.
Question: What is the probability that 100 packages will have a total weight exceeding 1700 lb?

Let Xi be the weight of the ith package and
$T = \sum_{i=1}^{100} X_i$.
Then
$\mathbb{P}(T > 1700\text{ lb}) = \mathbb{P}\left(\frac{T - 1500\text{ lb}}{\sqrt{100} \cdot 10\text{ lb}} > \frac{1700\text{ lb} - 1500\text{ lb}}{\sqrt{100} \cdot 10\text{ lb}}\right) = \mathbb{P}\left(\frac{T - 1500\text{ lb}}{\sqrt{100} \cdot 10\text{ lb}} > 2\right) \approx 1 - \Phi(2) = 0.023$
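Both central limit theorem examples (the mean of twelve uniform variables and the total package weight) reduce to standardizing a sum and reading off a normal tail probability. The sketch below is an added illustration; `normal_cdf` and `clt_upper_tail` are names chosen here, and Φ is computed from the error function in Python's standard library.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def clt_upper_tail(total, n, mu, sigma):
    """CLT approximation to P(X1 + ... + Xn > total), Xi iid with mean mu, SD sigma."""
    z = (total - n * mu) / (sqrt(n) * sigma)
    return 1 - normal_cdf(z)

# 100 packages with mean 15 lb and SD 10 lb: total weight above 1700 lb.
p_packages = clt_upper_tail(1700, 100, 15, 10)
# 12 uniforms on [0, 12): sample mean above 9, i.e. sum above 9 * 12.
p_uniform = clt_upper_tail(9 * 12, 12, 6, sqrt(12))
print(round(p_packages, 4), round(p_uniform, 4))
```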

Central Limit Theorem

Remarks
• How fast approximation becomes good depends on distribution of Xi’s:
  ◦ If it is symmetric and has tails that die off rapidly, n can be relatively small.
    Example: If Xi iid ∼ U[0, 1], the approximation is good for n = 12.
  ◦ If it is very skewed or if its tails die down very slowly, a larger value of n is needed.
    Example: Exponential distribution.
• Central limit theorems are very important in statistics.
• There are many central limit theorems covering many situations, e.g.
  ◦ for not identically distributed random variables or
  ◦ for dependent, but not “too” dependent random variables.

The Normal Approximation to the Binomial

Let X be binomially distributed with parameters n and p.
Recall that X is the sum of n iid Bernoulli random variables,
$X = \sum_{i=1}^{n} X_i$, Xi iid ∼ Bin(1, p).
Therefore we can apply the Central Limit Theorem:

Normal Approximation to the Binomial Distribution
For n large enough, X is approximately N(np, np(1 − p)) distributed:
$\mathbb{P}(a \le X \le b) \approx \mathbb{P}\left(a - \tfrac{1}{2} \le Z \le b + \tfrac{1}{2}\right)$
where Z ∼ N(np, np(1 − p)).
Rule of thumb for n: np > 5 and n(1 − p) > 5.

In terms of the standard normal distribution we get
$\mathbb{P}(a \le X \le b) \approx \mathbb{P}\left(\frac{a - \frac{1}{2} - np}{\sqrt{np(1-p)}} \le Z' \le \frac{b + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) = \Phi\left(\frac{b + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{a - \frac{1}{2} - np}{\sqrt{np(1-p)}}\right)$
where Z′ ∼ N(0, 1).

The Normal Approximation to the Binomial
[Figure: probability functions p(x) of binomial distributions Bin(n, p) for n = 1, 2, 5, 10, 20, 50 with p = 0.5 and p = 0.1; horizontal axis x from 0 to 20, vertical axis p(x).]

The Normal Approximation to the Binomial
Example: The random walk of a drunkard
Suppose a drunkard executes a "random" walk in the following way:
◦ Each minute he takes a step north or south, with probability $\frac{1}{2}$ each.
◦ His successive step directions are independent.
◦ His step length is 50 cm.
How likely is he to have advanced at least 10 m north after one hour?
◦ Position after one hour: X · 1 m − 30 m, where X is the number of steps north
◦ X binomially distributed with parameters n = 60 and p = $\frac{1}{2}$
◦ X is approximately normal with mean 30 and variance 15:
$$P(X \cdot 1\ \text{m} - 30\ \text{m} \ge 10\ \text{m}) = P(X \ge 40) \approx P(Z > 39.5) = P\left(\frac{Z - 30}{\sqrt{15}} > \frac{9.5}{\sqrt{15}}\right) = 1 - \Phi(2.452) = 0.007, \qquad Z \sim N(30, 15)$$
How does the probability change if he has some idea of where he wants to go and steps north with probability p = $\frac{2}{3}$ and south with probability $\frac{1}{3}$?
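As a rough check of the random-walk calculation, the sketch below computes the normal tail probability beyond 39.5 used above and, for comparison, the exact binomial probability of 40 or more northward steps. Only the fair case p = 1/2 is evaluated; variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 60, 0.5                      # 60 steps, fair direction choice
mu, var = n * p, n * p * (1 - p)    # mean 30, variance 15

# Normal approximation: tail of N(30, 15) beyond 39.5
approx = 1 - NormalDist(mu, sqrt(var)).cdf(39.5)

# Exact probability of at least 40 steps north under Bin(60, 1/2)
exact = sum(comb(n, k) for k in range(40, n + 1)) / 2**n

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```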

Estimation
Example: Cholesterol levels of heart-attack patients
Data: Observational study at a Pennsylvania medical center
◦ blood cholesterol levels of patients treated for heart attacks
◦ measurements 2, 4, and 14 days after the attack (Y1, Y2, Y3)

Id   Y1   Y2   Y3      Id   Y1   Y2   Y3
 1  270  218  156      15  294  240  264
 2  236  234  193      16  282  294  220
 3  210  214  242      17  234  220  264
 4  142  116  120      18  224  200  213
 5  280  200  181      19  276  220  188
 6  272  276  256      20  282  186  182
 7  160  146  142      21  360  352  294
 8  220  182  216      22  310  202  214
 9  226  238  248      23  280  218  170
10  242  288  298      24  278  248  198
11  186  190  168      25  288  278  236
12  266  236  236      26  288  248  256
13  206  244  238      27  244  270  280
14  318  258  200      28  236  242  204

Aim: Make inference on distribution of
◦ cholesterol level 14 days after the attack: $Y_3$
◦ decrease in cholesterol level: $D = Y_1 - Y_3$
◦ relative decrease in cholesterol level: $R = \dfrac{Y_1 - Y_3}{Y_3}$
Confidence intervals I, Feb 11, 2004 -1-

Estimation
Data: $d_1, \ldots, d_{28}$ observed decrease in cholesterol level
In this example, parameters of interest might be
◦ $\mu_D = E(D)$: the mean decrease in cholesterol level,
◦ $\sigma_D^2 = \operatorname{var}(D)$: the variation of the cholesterol level,
◦ $p_D = P(D \le 0)$: probability of no decrease in cholesterol level
These parameters are naturally estimated by the following sample statistics:
$$\hat\mu_D = \frac{1}{n}\sum_{i=1}^{n} d_i \quad \text{(sample mean)}$$
$$\hat\sigma_D^2 = \frac{1}{n}\sum_{i=1}^{n} (d_i - \bar d)^2 \quad \text{(sample variance)}$$
$$\hat p_D = \frac{\#\{d_i \mid d_i \le 0\}}{n} \quad \text{(sample proportion)}$$
Such statistics are point estimators since they estimate the corresponding parameter by a single numerical value.
◦ Point estimates provide no information about their chance variation.
◦ Estimates without an indication of their variability are of limited value.
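The three point estimates can be computed directly from the cholesterol table given earlier. The following is a sketch using the Y1 and Y3 columns as transcribed from that table; `pvariance` divides by n, matching the variance formula above.

```python
from statistics import fmean, pvariance

# Cholesterol levels 2 days (y1) and 14 days (y3) after the attack, patients 1-28
y1 = [270, 236, 210, 142, 280, 272, 160, 220, 226, 242, 186, 266, 206, 318,
      294, 282, 234, 224, 276, 282, 360, 310, 280, 278, 288, 288, 244, 236]
y3 = [156, 193, 242, 120, 181, 256, 142, 216, 248, 298, 168, 236, 238, 200,
      264, 220, 264, 213, 188, 182, 294, 214, 170, 198, 236, 256, 280, 204]

d = [a - b for a, b in zip(y1, y3)]   # observed decreases d_i = y1_i - y3_i
n = len(d)

mu_hat = fmean(d)                     # sample mean
var_hat = pvariance(d)                # (1/n) * sum of squared deviations
no_decrease = sum(x <= 0 for x in d)  # number of patients with d_i <= 0
p_hat = no_decrease / n               # sample proportion

print(f"n = {n}, mean decrease = {mu_hat:.2f}")
print(f"variance estimate = {var_hat:.2f}, proportion without decrease = {p_hat:.3f}")
```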

Confidence Intervals for the Mean
Recall:
◦ CLT for the sample mean: For large n we have
$$\bar X \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$
◦ 68-95-99.7 rule: With 95% probability the sample mean differs from its mean µ by less than two standard deviations.
More precisely, we have
$$P\left(\mu - 1.96\frac{\sigma}{\sqrt n} \le \bar X \le \mu + 1.96\frac{\sigma}{\sqrt n}\right) = 0.95,$$
or equivalently, after rearranging the terms,
$$P\left(\bar X - 1.96\frac{\sigma}{\sqrt n} \le \mu \le \bar X + 1.96\frac{\sigma}{\sqrt n}\right) = 0.95.$$
Interpretation: There is 95% probability that the random interval
$$\left[\bar X - 1.96\frac{\sigma}{\sqrt n},\ \bar X + 1.96\frac{\sigma}{\sqrt n}\right]$$
will cover the mean µ.
Example: Cholesterol levels
$\bar d = 36.89$, $\sigma = 51.00$, n = 28. Therefore, the 95% confidence interval for µ is [18.00, 55.78].

Confidence Intervals for the Mean
Assumption: The population standard deviation σ is known.
◦ In the next lecture, we will drop this unrealistic assumption.
◦ Assumption is approximately satisfied for large sample sizes, since then $\hat\sigma \approx \sigma$ by the law of large numbers.
Definition: Confidence interval for µ (σ known)
The interval
$$\left[\bar X - z_{\alpha/2}\frac{\sigma}{\sqrt n},\ \bar X + z_{\alpha/2}\frac{\sigma}{\sqrt n}\right]$$
is called a 1 − α confidence interval for the population mean µ. (1 − α) is the confidence level.
For large sample sizes n, an approximate (1 − α) confidence interval for µ is given by
$$\left[\bar X - z_{\alpha/2}\frac{\hat\sigma}{\sqrt n},\ \bar X + z_{\alpha/2}\frac{\hat\sigma}{\sqrt n}\right].$$
Here, $z_\alpha$ is the α-critical value of the standard normal distribution:
◦ $z_\alpha$ has area α to its right
◦ $\Phi(z_\alpha) = 1 - \alpha$
[Figure: standard normal density f(x) for x from −3 to 3, with the area α to the right of $z_\alpha$ marked.]
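The definition translates into a few lines of code. The sketch below is one possible implementation (the function name `z_interval` is made up here); it is applied to the cholesterol values d̄ = 36.89, σ = 51.00, n = 28 quoted in the lecture.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar: float, sigma: float, n: int, conf: float = 0.95):
    """(1 - alpha) confidence interval for the mean when sigma is known."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}: area alpha/2 to its right
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Cholesterol example: mean decrease 36.89, sigma = 51.00, n = 28
lo, hi = z_interval(36.89, 51.00, 28)
print(f"95% confidence interval: [{lo:.2f}, {hi:.2f}]")
```

Passing `conf=0.99` widens the interval, since the critical value grows with the confidence level.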

Confidence Interval for the Mean
Example: Community banks
◦ Community banks are banks with less than a billion dollars of assets.
◦ Approximately 7500 such banks in the United States.
Annual survey of the Community Bankers Council of the American Bankers Association (ABA)
◦ Population: Community banks in the United States.
◦ Variable of interest: Total assets of community banks.
◦ Sample size: n = 110
◦ Sample mean: $\bar X$ = 220 millions of dollars
◦ Sample standard deviation: SD = 161 millions of dollars
◦ Histogram of sampled values:
[Figure: histogram "Assets of Community Banks in the U.S. (sample of 110 community banks)"; horizontal axis assets (in millions of dollars) from 0 to 1000, vertical axis frequency.]
Suppose we want to give a 95% confidence interval for the mean total assets of all community banks in the United States.
◦ α = 0.05, $z_{\alpha/2} = 1.96$
A 95% confidence interval for the mean assets (in millions of dollars) is
$$\left[220 - 1.96\cdot\frac{161}{\sqrt{110}},\ 220 + 1.96\cdot\frac{161}{\sqrt{110}}\right] \approx [190, 250].$$

Sample Size
Example: Cholesterol levels
Suppose we want a 99% confidence interval for the decrease in cholesterol level:
◦ α = 0.01, $z_{0.005} = 2.58$
◦ The 99% confidence interval for $\mu_D$ is
$$\left[36.89 - 2.58\cdot\frac{50.93}{\sqrt{28}},\ 36.89 + 2.58\cdot\frac{50.93}{\sqrt{28}}\right] \approx [12.06, 61.72].$$
Note: If we raise the confidence level, the confidence interval becomes wider.
Suppose we want to increase the confidence level without increasing the error of estimation (indicated by the half-width of the confidence interval). For this we have to increase the sample size n.
Question: What sample size n is needed to estimate the mean decrease in cholesterol with error e = 20 and confidence level 99%?
The error (half-width of the confidence interval) is
$$e = z_{\alpha/2}\frac{\sigma}{\sqrt n}.$$
Therefore the sample size $n_e$ needed is given by
$$n_e \ge \left(\frac{z_{\alpha/2}\,\sigma}{e}\right)^2 = \left(\frac{2.58\cdot 50.93}{20}\right)^2 = 43.16,$$
that is, a sample of 44 patients is needed to estimate $\mu_D$ with error e = 20 and 99% confidence.
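The sample size rule amounts to rounding the bound up to the next integer. A minimal sketch, with a helper name chosen here for illustration:

```python
from math import ceil

def required_sample_size(z: float, sigma: float, error: float) -> int:
    """Smallest integer n with z * sigma / sqrt(n) <= error."""
    return ceil((z * sigma / error) ** 2)

# Cholesterol example: 99% confidence (z = 2.58), sigma = 50.93, error e = 20
bound = (2.58 * 50.93 / 20) ** 2
n_needed = required_sample_size(2.58, 50.93, 20)
print(f"bound = {bound:.2f}, required sample size = {n_needed}")
```

Halving the error e quadruples the bound, because e enters the formula squared.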

Estimation of the Mean
Example: Banks' loan-to-deposit ratio
The ABA survey of community banks also asked about the loan-to-deposit ratio (LTDR), a bank's total loans as a percent of its total deposits.
[Figure: histogram "Loan-To-Deposit Ratio of Community Banks (sample of 110 community banks)"; horizontal axis LTDR (in %) from 50 to 120, vertical axis frequency.]
Sample statistics:
◦ n = 110
◦ $\hat\mu_{LTDR} = 76.7$
◦ $\hat\sigma_{LTDR} = 12.3$
Construction of 95% confidence interval:
◦ α = 0.05, $z_{\alpha/2} = 1.96$
◦ Standard error $\sigma_{\bar X} = \dfrac{\hat\sigma_{LTDR}}{\sqrt n} = 1.17$
◦ 95% confidence interval for $\mu_{LTDR}$:
$$\left[\bar X - z_{\alpha/2}\frac{\hat\sigma_{LTDR}}{\sqrt n},\ \bar X + z_{\alpha/2}\frac{\hat\sigma_{LTDR}}{\sqrt n}\right] = [74.4, 79.0]$$
◦ To get an estimation with error e = 3.0 (half-width of confidence interval) it suffices to sample $n_e$ banks,
$$n_e \ge \left(\frac{z_{\alpha/2}\,\hat\sigma_{LTDR}}{e}\right)^2 = \left(\frac{1.96\cdot 12.3}{3.0}\right)^2 = 64.6.$$
Thus a sample of $n_e = 65$ banks is sufficient.

Confidence intervals
Definition: Confidence interval
A (1 − α) confidence interval for a parameter is an interval that
◦ depends only on sample statistics and
◦ covers the parameter with probability (1 − α)
Note:
◦ Confidence intervals are random while the estimated parameter is fixed.
◦ For repeated samples, only 95% of the 95% confidence intervals will cover the true parameter µ.
Confidence intervals II, Feb 13, 2004 -1-
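The repeated-sampling interpretation can be illustrated by simulation. In the sketch below the population values (µ = 10, σ = 2), the sample size, the number of repetitions, and the seed are all hypothetical choices made for the illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(1)                              # arbitrary seed
mu, sigma, n, reps = 10.0, 2.0, 25, 2000    # hypothetical population and design

z = NormalDist().inv_cdf(0.975)             # z_{0.025}
half_width = z * sigma / sqrt(n)            # sigma treated as known

covered = 0
for _ in range(reps):
    xbar = fmean(random.gauss(mu, sigma) for _ in range(n))
    # The interval is random; mu is fixed. Count how often the interval covers mu.
    covered += (xbar - half_width <= mu <= xbar + half_width)

coverage = covered / reps
print(f"fraction of intervals covering mu: {coverage:.3f}")
```

The observed fraction fluctuates around 0.95 from run to run; it is the long-run frequency that the confidence level describes, not the behaviour of any single interval.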

Confidence Intervals for the Mean
Suppose that $X_1, \ldots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$. Then
$$\frac{\bar X - \mu}{\sigma/\sqrt n} \sim N(0, 1) \qquad (*)$$
Assuming that σ is known, we obtain
$$\left[\bar X - z_{\alpha/2}\cdot\frac{\sigma}{\sqrt n},\ \bar X + z_{\alpha/2}\cdot\frac{\sigma}{\sqrt n}\right]$$
as (1 − α) confidence interval for µ.
More realistic situation: σ is unknown.
Approach: Replace σ by estimate $\hat\sigma = s$
This approach leads to the t statistic
$$T = \frac{\bar X - \mu}{s/\sqrt n} \sim t_{n-1}.$$
It is t distributed with n − 1 degrees of freedom.
[Figure: densities f(x) of the $t_1$, $t_3$, $t_{10}$ and N(0, 1) distributions for x from −4 to 4.]
Confidence interval for the mean µ (σ unknown)
The interval
$$\left[\bar X - t_{n-1,\alpha/2}\cdot\frac{s}{\sqrt n},\ \bar X + t_{n-1,\alpha/2}\cdot\frac{s}{\sqrt n}\right]$$
is a (1 − α) confidence interval for the mean µ.
Notation: Critical values of distributions
◦ $z_\alpha$: standard normal distribution
◦ $t_{n,\alpha}$: t distribution with n degrees of freedom

Confidence Intervals for the Mean
Example: Cholesterol levels
In the study on cholesterol levels, the standard deviation of the decrease of cholesterol level was unknown.
◦ $\hat\mu_D = 36.89$, $\hat\sigma_D = 50.94$
◦ $t_{27,0.025} = 2.05$
◦ Then
$$\left[36.89 - 2.05\cdot\frac{50.94}{\sqrt{27}},\ 36.89 + 2.05\cdot\frac{50.94}{\sqrt{27}}\right] = [16.78, 57.01]$$
is a 95% confidence interval for $\mu_D$
◦ The large sample confidence interval based on (*) was [18.00, 55.78].
Example: Level of vitamin C
The following data are the amounts of vitamin C, measured in milligrams per 100 grams (mg/100 g) of corn soy blend, for a random sample of size 8 from a production run:
26 31 23 22 11 22 14 31
What is the 95% confidence interval for µ, the mean vitamin C content of the CSB produced during this run?
◦ $\hat\mu = 22.5$, $\hat\sigma = 7.2$, $t_{7,0.025} = 2.36$
◦ The 95% confidence interval for µ is
$$\left[22.5 - 2.36\cdot\frac{7.2}{\sqrt 8},\ 22.5 + 2.36\cdot\frac{7.2}{\sqrt 8}\right] = [16.5, 28.5].$$
◦ The large sample CI would be [17.5, 27.5].
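The vitamin C interval can be reproduced from the eight observations. The Python standard library has no t quantile function, so the sketch below simply takes the critical value 2.36 quoted in the example as given.

```python
from math import sqrt
from statistics import fmean, stdev

vitamin_c = [26, 31, 23, 22, 11, 22, 14, 31]   # mg/100 g, from the example
t_crit = 2.36                                   # t_{7,0.025} as quoted in the example

n = len(vitamin_c)
xbar = fmean(vitamin_c)
s = stdev(vitamin_c)             # sample standard deviation (divisor n - 1)
half_width = t_crit * s / sqrt(n)
lo, hi = xbar - half_width, xbar + half_width

print(f"mean = {xbar:.1f}, s = {s:.1f}")
print(f"95% t interval: [{lo:.1f}, {hi:.1f}]")
```

Swapping `t_crit` for 1.96 gives the narrower large-sample interval, which understates the uncertainty for a sample of only 8 values.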

Confidence Intervals for the Variance
For normally distributed data $X_1, \ldots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$, the ratio
$$\frac{(n-1)\cdot s^2}{\sigma^2}$$
has a $\chi^2$ distribution with n − 1 degrees of freedom.
The (1 − α) confidence interval for $\sigma^2$ is
$$\left[\frac{(n-1)\cdot s^2}{\chi^2_{n-1,\alpha/2}},\ \frac{(n-1)\cdot s^2}{\chi^2_{n-1,1-\alpha/2}}\right],$$
where $\chi^2_{n-1,\alpha}$ is the α fractile of the $\chi^2_{n-1}$ distribution (the value with probability α to its right).
Caution: This confidence interval is not robust against departures from normality regardless of the sample size.
Example: Cholesterol levels
Suppose we are interested in the variance of $Y_3$, the cholesterol level 14 days after the attack.
◦ Normal probability plot:
[Figure: normal probability plot of cholesterol level (150 to 300) against normal quantiles (−2 to 2).]
Data seem to be normally distributed.
◦ $s^2 = 2030.55$, $\chi^2_{27,0.025} = 43.19$, $\chi^2_{27,0.975} = 14.57$
◦ The 95% confidence interval for $\sigma^2$ is
$$\left[\frac{27\cdot 2030.55}{43.19},\ \frac{27\cdot 2030.55}{14.57}\right] = [1269.26, 3761.99].$$

Statistical Tests
Example: Suppose that of 100 applicants for a job 50 were women and 50 were men, all equally qualified. Further suppose that the company hired 2 women and 8 men.
Question:
◦ Does the company discriminate against female job applicants?
◦ How likely is this outcome under the assumption that the company does not discriminate?
Example:
◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before and after installation of the program in 10 industrial plants

Plant     1   2   3    4   5   6   7   8   9  10
Before   45  73  46  124  33  57  83  34  26  17
After    36  60  44  119  35  51  77  29  24  11

Question:
◦ Has the safety program an effect on the loss of labour due to accidents?
◦ In 9 out of 10 plants the average weekly losses have decreased after implementation of the safety program. How likely is this (or a more extreme) outcome under the assumption that there is no difference before and after implementation of the safety program?
Testing Hypotheses I, Feb 16, 2004 -1-

Statistical Tests
Example: Fair coin
Suppose we have a coin. We suspect it might be unfair. We devise a statistical experiment:
◦ Toss coin 100 times
◦ Conclude that coin is fair if we see between 40 and 60 heads
◦ Otherwise decide that the coin is not fair
Let θ be the probability that the coin lands heads, that is,
$$P(X_i = 1) = \theta \quad\text{and}\quad P(X_i = 0) = 1 - \theta.$$
Our suspicion ("coin not fair") is a hypothesis about the population parameter θ ($\theta \ne \frac{1}{2}$) and thus about P. We emphasize this dependence of P on θ by writing $P_\theta$.
Decision problem:
◦ Null hypothesis $H_0$: $X \sim \text{Bin}(100, \frac{1}{2})$
◦ Alternative hypothesis $H_a$: $X \sim \text{Bin}(100, \theta)$, $\theta \ne \frac{1}{2}$
The null hypothesis represents the default belief (here: the coin is fair). The alternative is the hypothesis we accept in view of evidence against the null hypothesis.
The data-based decision rule
◦ reject $H_0$ if $X \notin [40, 60]$
◦ do not reject $H_0$ if $X \in [40, 60]$
is called a statistical test for the test problem $H_0$ vs. $H_a$.

Statistical Tests
Example: Fair coin (contd)
Note: It is possible to obtain e.g. X = 55 (or X = 65)
◦ with probability 0.048 (resp. 0.0009) if p = 0.5
◦ with probability 0.048 (resp. 0.049) if p = 0.6
◦ with probability 0.0005 (resp. 0.047) if p = 0.7
[Figure: probability functions p(x) of Bin(100, 0.5), Bin(100, 0.6) and Bin(100, 0.7) for x from 20 to 80, with the region "accept H0: p = 0.5" and the regions "reject H0: p ≠ 0.5" marked.]

Types of errors
Example: Fair coin (contd)
It is possible that the test (decision rule) gives a wrong answer:
◦ If θ = 0.7 and x = 55, we do not reject the null hypothesis that the coin is fair although the coin in fact is not fair.
◦ If θ = 0.5 and x = 65, we reject the null hypothesis that the coin is fair although the coin in fact is fair.
The following table lists the possibilities:

Decision      H0 true             H0 false
Reject H0     type I error        correct decision
Accept H0     correct decision    type II error

Definition (Types of error)
◦ If we reject $H_0$ when in fact $H_0$ is true, this is a Type I error.
◦ If we do not reject $H_0$ when in fact $H_0$ is false, this is a Type II error.

Types of errors
Question: How good is our decision rule?
For a good decision rule, the probability of committing an error of either type should be small.
Probability of type I error: α
If the null hypothesis is true, i.e. $\theta = \frac{1}{2}$, then
$$P_\theta(\text{reject } H_0) = P_\theta(X \notin [40, 60]) = 1 - P_\theta(X \in [40, 60]) = 1 - \sum_{x=40}^{60}\binom{100}{x}\left(\frac{1}{2}\right)^{100} = 0.035.$$
Thus the probability of a type I error, denoted as α, is 3.5%.
Probability of type II error: β(θ)
If the null hypothesis is false and the true probability of observing "head" is θ with $\theta \ne \frac{1}{2}$, then
$$P_\theta(\text{accept } H_0) = P_\theta(X \in [40, 60]) = \sum_{x=40}^{60}\binom{100}{x}\theta^x(1-\theta)^{100-x}.$$
Thus, the probability of an error of type II depends on θ. It will be denoted as β(θ).
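Both error probabilities are finite binomial sums and can be evaluated exactly. The sketch below does this for the decision rule with acceptance region [40, 60]; the alternatives θ = 0.6 and θ = 0.7 are picked for illustration, and the function names are not from the lecture.

```python
from math import comb

def binom_pmf(k: int, n: int, theta: float) -> float:
    """P(X = k) for X ~ Bin(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def prob_accept(theta: float, n: int = 100, lo: int = 40, hi: int = 60) -> float:
    """P_theta(X in [lo, hi]), the probability of not rejecting H0."""
    return sum(binom_pmf(k, n, theta) for k in range(lo, hi + 1))

alpha = 1 - prob_accept(0.5)                              # type I error probability
beta = {theta: prob_accept(theta) for theta in (0.6, 0.7)}  # type II error probabilities

print(f"alpha = {alpha:.3f}")
for theta, b in beta.items():
    print(f"beta({theta}) = {b:.3f}")
```

The type II error probability is much larger for θ = 0.6 than for θ = 0.7, since a coin that is only slightly unfair is hard to distinguish from a fair one with 100 tosses.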

Power of Tests
Question: How good is our test in detecting the alternative?
Consider the probability of rejecting $H_0$
$$P_\theta(\text{reject } H_0) = P_\theta(X \notin [40, 60]) = 1 - P_\theta(\text{accept } H_0) = 1 - \beta(\theta).$$
Note:
◦ If $\theta = \frac{1}{2}$ this is the probability of committing an error of type I:
$$1 - \beta\left(\tfrac{1}{2}\right) = \alpha$$
◦ If $\theta > \frac{1}{2}$ this is the probability of correctly rejecting $H_0$.
Definition (Power of a test)
We call 1 − β(θ) the power of the test as it measures the ability to detect that the null hypothesis is false.
[Figure: power 1 − β(θ) as a function of θ from 0 to 1 for the test "reject if X ∉ [40, 60]".]

Significance Tests
Idea: minimize probability of committing an error of type I and II
[Figure: "Different probabilities of type I error": power 1 − β(θ) as a function of θ from 0 to 1 for the tests "reject if X ∉ [40, 60]", "reject if X ∉ [38, 62]" and "reject if X ∉ [42, 58]".]
Note: If we decrease the probability of a type I error,
◦ the power of the test, 1 − β(θ), decreases as well and
◦ the probability of a type II error increases.
Problem: cannot minimize both errors simultaneously
Solution:
◦ choose fixed level α for probability of a type I error
◦ under this restriction find test with small probability of a type II error
Remark:
◦ you do not have to do this minimization yourself.
◦ all tests taught in this course are of this kind.
Definition
A test of this kind is called a significance test with significance level α.
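The trade-off between the two error types can be made concrete by evaluating the power function for the three acceptance regions. This is a sketch; the alternative θ = 0.6 is an arbitrary illustrative choice.

```python
from math import comb

def power(theta: float, lo: int, hi: int, n: int = 100) -> float:
    """P_theta(X not in [lo, hi]) for X ~ Bin(n, theta)."""
    accept = sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
                 for k in range(lo, hi + 1))
    return 1 - accept

regions = [(38, 62), (40, 60), (42, 58)]   # acceptance regions of the three tests
results = []
for lo, hi in regions:
    alpha = power(0.5, lo, hi)             # power at theta = 1/2 is the type I error
    power_at_06 = power(0.6, lo, hi)       # power against the alternative 0.6
    results.append((alpha, power_at_06))
    print(f"accept on [{lo}, {hi}]: alpha = {alpha:.3f}, "
          f"power at 0.6 = {power_at_06:.3f}")
```

Narrowing the acceptance region raises both numbers at once: more power against the alternative is bought with a larger type I error probability.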

Statistical Hypotheses
A statistical hypothesis is an assertion or conjecture about a population, which may be expressed in terms of
◦ some parameter: mean is zero,
◦ some parameters: mean and median are identical, or
◦ some sampling distribution: this sample is normally distributed.
Test problem - decide between two hypotheses
◦ the null hypothesis $H_0$ and
◦ the alternative hypothesis $H_a$.
Popperian approach to scientific theories
◦ Scientific theories are subject to falsification.
◦ It is impossible to verify a scientific theory.
Null hypothesis $H_0$: default (current) theory which we try to falsify
Alternative hypothesis $H_a$: alternative to adopt if null hypothesis is rejected
Examples:
◦ Clinical study of new drug - $H_0$: drug has no effect
◦ Criminal case - $H_0$: suspect is not guilty
◦ Safety test of nuclear power station - $H_0$: power station is not safe
◦ Chances of new investment - $H_0$: project not profitable
◦ Testing for independence - $H_0$: random variables are independent
Testing Hypotheses II, Feb 18, 2004 -1-

Statistical Tests
Example: Testing for pesticide in discharge water
Suppose the Environmental Protection Agency takes 10 readings on the amount of pesticide in the discharge water of a chemical company.
Question: Does the concentration $c_P$ of pesticide in the water exceed the allowed maximum concentration $c_0$?
◦ Before taking action against the company, the agency must have some evidence that the concentration $c_P$ exceeds the allowed level.
◦ Without evidence the agency assumes that the pesticide concentration $c_P$ is within the limits of the law.
Consequently, the null hypothesis of the agency is that the pesticide concentration $c_P$ does not exceed $c_0$. Thus the question corresponds to the test problem
$$H_0: c_P \le c_0 \quad\text{vs}\quad H_a: c_P > c_0.$$
Suppose that the company regularly also runs tests on the amount of pesticide in the discharge water.
Question: Does the concentration $c_P$ of pesticide in the water exceed the allowed maximum concentration $c_0$?
◦ The aim of the company is to avoid fines for exceeding the allowed level. Thus the company wants to make sure that the concentration stays within the allowed limits.
Thus, the null hypothesis of the company should be that the pesticide concentration $c_P$ exceeds $c_0$. The question now corresponds to the test problem
$$H_0: c_P \ge c_0 \quad\text{vs}\quad H_a: c_P < c_0.$$

Six Steps of Conducting a Test
Steps of a significance test
1. Determine null hypothesis $H_0$ and alternative $H_a$.
2. Decide on probability of type I error, the significance level α.
3. Find an appropriate test statistic T.
4. Based on the sampling distribution of T, formulate a criterion for testing $H_0$ against $H_a$.
5. Calculate value of the test statistic T.
6. Decide whether or not to reject the null hypothesis $H_0$.
Example: Fair coin (contd)
We want to decide from 100 tosses of a coin whether it is fair or not. Let θ be the probability of heads.
1. Test problem: $H_0: \theta = \frac{1}{2}$ vs $H_a: \theta \ne \frac{1}{2}$
2. Significance level: α = 0.05 (most commonly used significance level)
3. Test statistic: T = X (number of heads in 100 tosses of the coin)
4. Rejection criterion: reject $H_0$ if $T \notin [40, 60]$
5. Observed value of test statistic: Suppose after 100 tosses we obtain t = 55
6. Decision: Since 55 does not lie in the rejection region, we do not reject $H_0$.

One and Two-sided Hypotheses
Example: Blood cholesterol after a heart attack
Suppose we are interested in whether the blood cholesterol level two days after a heart attack differs from the average cholesterol level in the (general) population ($\mu_0 = 193$).
Two cases:
◦ We are interested in any difference from the population mean $\mu_0$. Then we have a two-sided test problem
$$H_0: \mu_{Y_1} = \mu_0 \quad\text{vs}\quad H_a: \mu_{Y_1} \ne \mu_0.$$
◦ We suspect that the cholesterol level after a heart attack might be higher than in the general population. In this case, we have a one-sided test problem
$$H_0: \mu_{Y_1} = \mu_0 \quad\text{vs}\quad H_a: \mu_{Y_1} > \mu_0.$$
Remark:
◦ More generally, we might be interested in one-sided test problems of the form
$$H_0: \mu_{Y_1} \le \mu_0 \quad\text{vs}\quad H_a: \mu_{Y_1} > \mu_0,$$
which accounts for the possibility that µ might be smaller than $\mu_0$.
◦ For all common test situations (in particular those discussed in this course), the form of the test does not depend on the form of $H_0$, but only on the parameter value in $H_0$ that is closest to $H_a$, that is $\mu_0$.

Test Statistic
Let θ be the parameter of interest.
Two-sided test problem: $H_0: \theta = \theta_0$ against $H_a: \theta \ne \theta_0$
One-sided test problem: $H_0: \theta = \theta_0$ against $H_a: \theta > \theta_0$ (or $H_a: \theta < \theta_0$)
Suppose that $\hat\theta$ is an estimate for θ.
◦ If $\theta = \theta_0$ (null hypothesis), we expect the estimate $\hat\theta$ to take a value near $\theta_0$.
◦ Large deviations from $\theta_0$ are evidence against $H_0$.
This suggests the following decision rules:
◦ $H_a: \theta > \theta_0$: reject $H_0$ if $\hat\theta - \theta_0$ is much larger than zero
◦ $H_a: \theta < \theta_0$: reject $H_0$ if $\hat\theta - \theta_0$ is much smaller than zero
◦ $H_a: \theta \ne \theta_0$: reject $H_0$ if $|\hat\theta - \theta_0|$ is much larger than zero
Problem: Often the sampling distribution of the estimate $\hat\theta$ depends on the unknown parameter θ.
Definition (Test statistic)
A test statistic is a random variable
◦ that measures the compatibility between the null hypothesis and the data and
◦ has a sampling distribution which we know (under $H_0$).

Test Statistic
Example: Blood cholesterol after a heart attack
Data: $X_1, \ldots, X_{28}$
◦ blood cholesterol level of 28 patients two days after a heart attack
◦ assumed to be normally distributed with mean $\mu_X$ and variance $\sigma_X^2$
The parameter µ can be estimated by the sample mean
$$\bar X = \frac{1}{28}\sum_{i=1}^{28} X_i \sim N\left(\mu_X, \frac{\sigma_X^2}{28}\right).$$
This suggests using the standardized sample mean as a test statistic
$$\frac{\bar X - \mu_0}{\sigma/\sqrt{28}} \sim N(0, 1) \quad (\text{under } H_0).$$
Test $H_0: \mu \le 193$ vs $H_a: \mu > 193$ at significance level α = 0.05
◦ Test statistic: Assume σ = 47.7 to be known.
$$T = \frac{\bar X - \mu_0}{\sigma/\sqrt{28}}$$
◦ Rejection criterion: Reject $H_0$ if $T > z_{0.05} = 1.645$
◦ Outcome of test: Since the observed value of T is
$$t = \frac{253.9 - 193}{47.7/\sqrt{28}} = 6.76,$$
we reject the null hypothesis that µ = 193.
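The one-sided test with known σ can be carried out in a few lines. A minimal sketch using the numbers from the example (the variable names are chosen here):

```python
from math import sqrt
from statistics import NormalDist

# Values from the cholesterol example
xbar, mu0, sigma, n, alpha = 253.9, 193.0, 47.7, 28, 0.05

t_obs = (xbar - mu0) / (sigma / sqrt(n))   # standardized sample mean
z_crit = NormalDist().inv_cdf(1 - alpha)   # z_alpha: area alpha to its right
p_value = 1 - NormalDist().cdf(t_obs)      # one-sided P-value under H0
reject = t_obs > z_crit

print(f"t = {t_obs:.2f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

The observed statistic lies far beyond the critical value, so the one-sided P-value is extremely small.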

Tests for the Mean
Tests for the mean µ ($\sigma^2$ known):
◦ Test statistic:
$$T = \frac{\bar X - \mu_0}{\sigma/\sqrt n}$$
◦ Two sided test: $H_0: \mu = \mu_0$ against $H_a: \mu \ne \mu_0$
  reject $H_0$ if $|T| > z_{\alpha/2}$
◦ One sided tests: $H_0: \mu = \mu_0$ against $H_a: \mu > \mu_0$ ($\mu < \mu_0$)
  reject $H_0$ if $T > z_\alpha$ ($T < -z_\alpha$)
Tests for the mean µ ($\sigma^2$ unknown):
◦ Test statistic:
$$T = \frac{\bar X - \mu_0}{s/\sqrt n}$$
◦ Two sided test: $H_0: \mu = \mu_0$ against $H_a: \mu \ne \mu_0$
  reject $H_0$ if $|T| > t_{n-1,\alpha/2}$
◦ One sided tests: $H_0: \mu = \mu_0$ against $H_a: \mu > \mu_0$ ($\mu < \mu_0$)
  reject $H_0$ if $T > t_{n-1,\alpha}$ ($T < -t_{n-1,\alpha}$)
Example: Blood cholesterol after a heart attack
Estimating the standard deviation from the data, we obtain the test statistic
$$T = \frac{\bar X - \mu_0}{s/\sqrt{28}} \sim t_{27}.$$
Noting that $t_{27,0.05} = 1.703$ and t = 6.76, we still reject $H_0$.

Tests and Confidence Intervals
Consider level α significance test for the two-sided test problem $H_0: \theta = \theta_0$ vs $H_a: \theta \ne \theta_0$. Let
◦ $T = T_{\theta_0}(X)$ be the test statistic of the test (depends on $\theta_0$)
◦ R be the critical region of the test
Then
$$C(X) = \{\theta : T_\theta(X) \notin R\}$$
is a (1 − α) confidence interval for θ: If θ is the true parameter, then
$$P_\theta(\theta \in C(X)) = P_\theta(T_\theta(X) \notin R) = 1 - P_\theta(T_\theta(X) \in R) = 1 - \alpha.$$
We have
$$\theta_0 \in C(X) \iff T_{\theta_0}(X) \notin R \iff H_0 \text{ is not rejected}$$
Result
A level α two-sided significance test rejects the null hypothesis $H_0: \theta = \theta_0$ if and only if the parameter $\theta_0$ falls outside a (1 − α) confidence interval for θ.
Example: Normal distribution
Let $X_1, \ldots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$. We reject $H_0: \mu = \mu_0$ if
$$\frac{|\bar X - \mu_0|}{s/\sqrt n} > t_{n-1,\alpha/2}$$
or equivalently
$$|\bar X - \mu_0| > t_{n-1,\alpha/2}\frac{s}{\sqrt n}.$$
Rearranging terms, we find that we reject if
$$\mu_0 \notin \left[\bar X - t_{n-1,\alpha/2}\frac{s}{\sqrt n},\ \bar X + t_{n-1,\alpha/2}\frac{s}{\sqrt n}\right].$$

The P-value
Definition (P-value)
The probability that under the null hypothesis $H_0$ the test statistic would take a value as extreme or more extreme than that actually observed is called the P-value of the test.
◦ If the P-value is smaller than the chosen significance level α, we reject the null hypothesis $H_0$.
The P-value is often interpreted as a measure for the strength of evidence against the null hypothesis: the smaller the P-value, the stronger the evidence.
However:
◦ The P-value is a random variable (under $H_0$ uniformly distributed on [0, 1]).
◦ Without a measure of its variability it is not safe to interpret the actually observed P-value.
Three approaches to deciding on test problem:
◦ reject if $\theta_0 \notin C(X)$
◦ reject if $T(X) \in R$
◦ reject if P-value $p \le \alpha$
Example: Blood cholesterol after a heart attack
The observed value for the test statistic
$$T = \frac{\bar X - \mu_0}{s/\sqrt{28}} \sim t_{27}$$
is t = 6.76. The corresponding P-value is
$$P(T > 6.76) = 1.47\cdot 10^{-7}.$$
We thus reject the null hypothesis.
Equivalently, the confidence interval for µ is [235.43, 272.42]. Since it does not contain $\mu_0 = 193$ we reject $H_0$ (for the third and last time!).

Example
Data: Banks' net income
◦ percent change in net income between first half of last year and first half of this year
◦ sample mean $\bar x$ = 8.1%
◦ sample standard deviation s = 26.4%
Test problem: $H_0: \mu = 0$ against $H_a: \mu \ne 0$

    . ttesti 110 8.1 26.4 0

    One-sample t test
    ------------------------------------------------------------------------------
             |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
           x |     110         8.1    2.517141        26.4    3.111108    13.08889
    ------------------------------------------------------------------------------
    Degrees of freedom: 109

                            Ho: mean(x) = 0

       Ha: mean < 0            Ha: mean != 0             Ha: mean > 0
         t =   3.2179            t =   3.2179              t =   3.2179
     P < t =   0.9991        P > |t| =   0.0017          P > t =   0.0009

Critical value of t distribution with 109 degrees of freedom: $t_{109,0.025} = 1.982$
Result:
◦ $|t| > t_{109,0.025}$, therefore the test rejects $H_0$ at significance level α = 0.05.
◦ Equivalently, $\mu_0 = 0 \notin [3.11, 13.09]$ and thus the test rejects $H_0$.
◦ Equivalently, P-value is less than α = 0.05 and thus the test rejects $H_0$.
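The standard error, t statistic and confidence interval in the Stata output follow directly from the summary statistics. The sketch below recomputes them in Python, taking the critical value 1.982 from the slide rather than computing it (the standard library has no t quantile function).

```python
from math import sqrt

# Summary statistics from the bank example
n, xbar, s, mu0 = 110, 8.1, 26.4, 0.0
t_crit = 1.982                      # t_{109,0.025}, as quoted above

se = s / sqrt(n)                    # standard error of the mean
t_obs = (xbar - mu0) / se           # one-sample t statistic
ci = (xbar - t_crit * se, xbar + t_crit * se)

reject_by_statistic = abs(t_obs) > t_crit
reject_by_interval = not (ci[0] <= mu0 <= ci[1])

print(f"std. err. = {se:.6f}, t = {t_obs:.4f}")
print(f"95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
print(f"reject H0: {reject_by_statistic} (statistic), {reject_by_interval} (interval)")
```

The two decisions necessarily agree, which is the test and confidence interval equivalence in action.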

Exact Binomial Test
Example: Fair coin
Data: 100 tosses of a coin which we suspect might be unfair.
Modelling:
◦ θ is the probability that the coin lands heads up
◦ X is the number of heads in 100 tosses of the coin
◦ X is binomially distributed with parameters n and θ.
Decision problem:
◦ Null hypothesis $H_0$: coin is fair
◦ Alternative hypothesis $H_a$: coin is unfair
Test problem: $H_0: \theta = \frac{1}{2}$ vs $H_a: \theta \ne \frac{1}{2}$.
Under the null hypothesis $H_0$, the distribution of X is known,
$$X \sim \text{Bin}\left(100, \tfrac{1}{2}\right).$$
Reject null hypothesis if
$$X \notin [b_{100,0.5,0.975},\ b_{100,0.5,0.025}] = [40, 60],$$
where $b_{n,\theta,\alpha}$ is the α fractile of Bin(n, θ) (the value with probability about α to its right).
Note:
◦ Exact binomial tests typically have smaller significance level α due to discreteness of distribution.
◦ In the above example, the probability of a type I error is
$$P(\text{reject } H_0) = \alpha = 0.035.$$
Testing Hypotheses III, Feb 20, 2004 -1-
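One way to arrive at the acceptance region [40, 60] for a nominal level of 0.05 is to search for the outermost values whose tail probabilities still exceed α/2. The sketch below follows that reading of the fractiles; the search strategy and function name are choices made here, not necessarily how the lecture derived the region.

```python
from math import comb

def binom_cdf(x: int, n: int, theta: float) -> float:
    """P(X <= x) for X ~ Bin(n, theta); returns 0 for x < 0."""
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(0, x + 1))

n, theta0, alpha = 100, 0.5, 0.05

# Smallest x with P(X <= x) > alpha/2: values below it form the lower rejection tail
lower = next(x for x in range(n + 1) if binom_cdf(x, n, theta0) > alpha / 2)
# Largest x with P(X >= x) > alpha/2: values above it form the upper rejection tail
upper = next(x for x in range(n, -1, -1)
             if 1 - binom_cdf(x - 1, n, theta0) > alpha / 2)

# Actual probability of rejecting H0 when the coin is fair
actual_level = binom_cdf(lower - 1, n, theta0) + (1 - binom_cdf(upper, n, theta0))

print(f"acceptance region: [{lower}, {upper}]")
print(f"actual significance level: {actual_level:.3f}")
```

Because X is discrete, the achieved level falls below the nominal 0.05, as the note above points out.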

Sign Test
Example: Safety program
◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before and after installation of the program in 10 industrial plants

Plant     1   2   3    4   5   6   7   8   9  10
Before   45  73  46  124  33  57  83  34  26  17
After    36  60  44  119  35  51  77  29  24  11

Question:
◦ Has the safety program an effect on the loss of labour due to accidents?
The Sign Test for matched pairs
◦ Ignore pairs with difference 0
◦ Number of trials n is the count of the remaining pairs
◦ The test statistic is the count X of pairs with positive difference
◦ X is binomially distributed with parameters n and θ.
◦ Null hypothesis $H_0: \theta = \frac{1}{2}$ (i.e. median of the differences is zero)
Example: For the safety program data, we find
◦ n = 10, X = 9
◦ Test $H_0: \theta = \frac{1}{2}$ against $H_a: \theta > \frac{1}{2}$
◦ The P-value of the observed count X is
$$P(X \ge 9) = \binom{10}{9}\left(\frac{1}{2}\right)^{10} + \binom{10}{10}\left(\frac{1}{2}\right)^{10} = 0.0107$$
Since the P-value is smaller than α = 0.05 we reject the null hypothesis $H_0$ that the safety program has no effect on the loss of labour due to accidents.
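The sign test for the safety program data can be reproduced as follows. This is a sketch; a "positive difference" is taken here to mean that losses were higher before the program than after it.

```python
from math import comb

before = [45, 73, 46, 124, 33, 57, 83, 34, 26, 17]
after = [36, 60, 44, 119, 35, 51, 77, 29, 24, 11]

# Differences before - after, ignoring pairs with difference 0
diffs = [b - a for b, a in zip(before, after) if b != a]
n = len(diffs)                       # number of trials
x = sum(d > 0 for d in diffs)        # pairs in which the losses decreased

# One-sided P-value P(X >= x) under H0: X ~ Bin(n, 1/2)
p_value = sum(comb(n, k) for k in range(x, n + 1)) / 2**n

print(f"n = {n}, X = {x}, P-value = {p_value:.4f}")
```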

Tests for Proportions
Example: Blood cholesterol after a heart attack
Suppose we are interested in the proportion p of patients who show a decrease of cholesterol level between the second and the 14th day after a heart attack.
Question: Does a decrease occur more often than an increase?
Test problem: $H_0: p = \frac{1}{2}$ vs $H_a: p > \frac{1}{2}$
The proportion p can be estimated by the sample proportion
$$\hat p = \frac{X}{n}$$
where X is the number of patients whose cholesterol level decreased.
Exact tests: Since X is binomially distributed, we can use exact binomial tests.
Large sample approximations:
Facts:
◦ $E(\hat p) = p$
◦ $\operatorname{var}(\hat p) = \dfrac{p(1-p)}{n}$
◦ $\dfrac{\hat p - p}{\sqrt{p(1-p)/n}} \approx N(0, 1)$ (for large n)
Under the null hypothesis $H_0$, we get
$$T = \frac{\hat p - p_0}{\sqrt{p_0(1-p_0)/n}} \approx N(0, 1).$$
Hence, we reject $H_0$ if $T > z_\alpha$.
Example: Blood cholesterol after a heart attack
◦ n = 28, x = 22, $\hat p = 0.79$, α = 0.05, $z_{0.05} = 1.645$
◦ $t = \dfrac{0.79 - 0.5}{\sqrt{0.79\cdot 0.21/28}} = 3.7675$
◦ P-value: $P(T > t) = 8.24\cdot 10^{-5}$.

Confidence Intervals for Proportions
Exact binomial confidence intervals
◦ difficult to compute
◦ use statistics software
Example: Blood cholesterol after a heart attack
◦ 28 patients in the study
◦ 22 showed a decrease in cholesterol level between second and 14th day after the attack
Computation of an exact binomial confidence interval in STATA:

    . cii 28 22

                                                 -- Binomial Exact --
    Variable |     Obs        Mean    Std. Err.   [95% Conf. Interval]
    ---------+---------------------------------------------------------
             |      28    .7857143    .0775443     .590469    .9170394

Confidence Intervals for Proportions
Large sample approximations
The CLT states that for large n $\hat p$ is approximately normally distributed,
$$\hat p \approx N\left(p, \frac{p(1-p)}{n}\right)$$
Problems:
◦ variance is unknown
◦ estimate $\hat p(1-\hat p)/n$ is zero if $\hat p = 0$ or $\hat p = 1$
Example: What is the proportion of HIV+ students at the UofC?
◦ Random sample of 100 students
◦ None test positive for HIV
Are you absolutely sure that there are no HIV+ students at the UofC?
Idea: Estimate p by
$$\tilde p = \frac{X + 2}{n + 4} \quad \text{(Wilson estimate)}$$
and use
$$\left[\tilde p - z_{\alpha/2}\sqrt{\frac{\tilde p(1-\tilde p)}{n+4}},\ \tilde p + z_{\alpha/2}\sqrt{\frac{\tilde p(1-\tilde p)}{n+4}}\right]$$
as a (1 − α) confidence interval for p
Example: Blood cholesterol after a heart attack

    . cii 28 22, wilson

                                                 ------ Wilson ------
    Variable |     Obs        Mean    Std. Err.   [95% Conf. Interval]
    ---------+---------------------------------------------------------
             |      28    .7857143    .0775443    .6046141    .8978754
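The interval based on the estimate (X + 2)/(n + 4) is easy to compute by hand or in code. The sketch below implements exactly the formula stated above under a made-up function name; its limits need not coincide to all digits with the Stata `wilson` output quoted above, which may be computed by a slightly different formula.

```python
from math import sqrt
from statistics import NormalDist

def plus_four_interval(x: int, n: int, conf: float = 0.95):
    """Interval around p~ = (x + 2) / (n + 4), following the formula above."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_tilde = (x + 2) / (n + 4)
    half_width = z * sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde, p_tilde - half_width, p_tilde + half_width

# Cholesterol example: 22 of 28 patients showed a decrease
p_tilde, lo, hi = plus_four_interval(22, 28)
print(f"p~ = {p_tilde:.3f}, interval = [{lo:.3f}, {hi:.3f}]")

# Sample with no positives (x = 0, n = 100): the interval still has positive width
zero_case = plus_four_interval(0, 100)
print(f"x = 0, n = 100: p~ = {zero_case[0]:.4f}, upper limit = {zero_case[2]:.4f}")
```

The second call shows why the adjustment helps: with no positives the plain estimate is 0 and its estimated variance vanishes, whereas the adjusted interval still has positive width (its lower limit can fall below 0 and would be truncated at 0 in practice).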

Paired Samples
Example: Safety program
◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before and after installation of the program in 10 industrial plants

Plant     1   2   3    4   5   6   7   8   9  10
Before   45  73  46  124  33  57  83  34  26  17
After    36  60  44  119  35  51  77  29  24  11

Question: Does the safety program have a positive effect?
Approach:
◦ Consider differences before and after implementation of the program:
$$D_i = X_i(\text{before}) - X_i(\text{after})$$
◦ $D_i$'s are approximately normal
$$D_i \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$$
[Figure: normal quantile plot of the decrease in losses of work (0 to 25) against normal quantiles (−1.5 to 1.5).]
◦ $H_0: \mu = 0$ against $H_a: \mu > 0$
◦ Significance level α = 0.01
◦ One sample t test:
$$T = \frac{\bar D}{s/\sqrt n}$$
Reject if $T > t_{n-1,\alpha}$
Result:
◦ $\bar y = 10.27$, s = 7.98, n = 10
◦ t = 4.07 and $t_{9,0.01} = 2.82$, P-value: 0.0014
◦ Test rejects $H_0$ at significance level α = 0.01

Paired Sample t Test
Data: $(X_1, Y_1), \ldots, (X_n, Y_n)$
Assumptions:
◦ Pairs are independent
◦ $D_i = X_i - Y_i \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$
◦ Apply one-sample t test
Paired sample t test
◦ Test statistic
$$T = \frac{\bar D - \mu_0}{s/\sqrt n}$$
◦ Two-sided test: $H_0: \mu = \mu_0$ against $H_a: \mu \ne \mu_0$
  reject $H_0$ if $|T| > t_{n-1,\alpha/2}$
◦ One-sided test: $H_0: \mu = \mu_0$ against $H_a: \mu > \mu_0$
  reject $H_0$ if $T > t_{n-1,\alpha}$
Power of the paired sample t test and the paired sign test:
[Figure: power 1 − β(δ) as a function of δ from 0 to 11 for the t test and the sign test.]

Sign and t Test

t test:
◦ based on Central Limit Theorem
◦ reasonably robust against departures from normality
◦ do not use if n is small and
  ⋄ data are strongly skewed or
  ⋄ data have clear outliers
Sign test:
◦ uses much less information than t test
◦ for normal data less powerful than t test
◦ keeps significance level regardless of distribution
◦ no assumption on distribution
◦ preferable for very small data sets
Remark:
◦ The two-step procedure
  1. assess normality by normal quantile plot
  2. conduct either t test or sign test depending on result in step 1
  does not attain the chosen significance level α (two tests!).
◦ The sign test is rarely used since there are more powerful distribution-free tests.
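To make the comparison concrete, here is a minimal sketch of a one-sided exact sign test, assuming the usual definition (under $H_0$ the number of positive differences is Bin(n, 1/2), ties dropped). It uses the before/after values tabulated for the ten plants in the safety-program example; the variable names are illustrative only.

```python
from math import comb

before = [45, 73, 46, 124, 33, 57, 83, 34, 26, 17]
after = [36, 60, 44, 119, 35, 51, 77, 29, 24, 11]

# Differences taken as before - after, so a positive sign means fewer lost hours.
# Pairs with zero difference would be dropped (there are none in this table).
diffs = [b - a for b, a in zip(before, after) if b != a]
n = len(diffs)
n_pos = sum(d > 0 for d in diffs)

# One-sided exact P-value: P(N >= n_pos) with N ~ Bin(n, 1/2) under H0.
p_value = sum(comb(n, k) for k in range(n_pos, n + 1)) / 2 ** n
print(n_pos, n, p_value)
```

Only the signs enter the calculation, which is why the test makes no distributional assumption but also uses much less information than the t test.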

Two Sample Problems

Two sample problems
◦ The goal of inference is to compare the responses in two groups.
◦ Each group is a sample from a different population.
◦ The responses in each group are independent of those in the other group.
Example: Effects of ozone
Study the effects of ozone by controlled randomized experiment
◦ 45 70-day-old rats were randomly assigned to treatment or control
◦ Treatment group: 22 rats were kept in an environment containing ozone.
◦ Control group: 23 rats were kept in an ozone-free environment
◦ Data: Weight gains after 7 days
We are interested in the difference in weight gain between the treatment and control group.
[Figure: boxplots of weight gain (in gram) for the treatment and control groups]
Question: Do the weight gains differ between groups?
◦ $x_1, \ldots, x_{22}$ - weight gains for treatment group
◦ $y_1, \ldots, y_{23}$ - weight gains for control group
◦ Test problem: $H_0: \mu_X = \mu_Y$ vs $H_a: \mu_X \neq \mu_Y$
◦ Idea: Reject null hypothesis if $|\bar x - \bar y|$ is large.
Two Sample Tests, Feb 23, 2004

Comparing Means

Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be two independent normally distributed samples. Then
$$\bar X - \bar Y \sim N\!\left(\mu_X - \mu_Y,\ \frac{\sigma_X^2}{m} + \frac{\sigma_Y^2}{n}\right)$$
Two-sample t test
◦ Two-sample t statistic
$$T = \frac{\bar X - \bar Y}{\sqrt{\dfrac{s_X^2}{m} + \dfrac{s_Y^2}{n}}}$$
Distribution of T can be approximated by t distribution
◦ Two-sided test: $H_0: \mu_X = \mu_Y$ against $H_a: \mu_X \neq \mu_Y$; reject $H_0$ if $|T| > t_{df,\alpha/2}$
◦ One-sided test: $H_0: \mu_X = \mu_Y$ against $H_a: \mu_X > \mu_Y$; reject $H_0$ if $T > t_{df,\alpha}$
◦ Degrees of freedom:
  ⋄ Approximations for df provided by statistical software
  ⋄ Satterthwaite approximation
$$df = \frac{\left(\dfrac{s_X^2}{m} + \dfrac{s_Y^2}{n}\right)^2}{\dfrac{1}{m-1}\left(\dfrac{s_X^2}{m}\right)^2 + \dfrac{1}{n-1}\left(\dfrac{s_Y^2}{n}\right)^2}$$
  commonly used, conservative approximation
  ⋄ Otherwise: use $df = \min(m - 1, n - 1)$

Comparing Means

Example: Effects of ozone
Data:
◦ Treatment group: $\bar x = 11.01$, $s_X = 19.02$, $m = 22$
◦ Control group: $\bar y = 22.43$, $s_Y = 10.78$, $n = 23$
Test problem:
◦ $H_0: \mu_X = \mu_Y$ vs $H_a: \mu_X \neq \mu_Y$
◦ α = 0.05, $df = \min(m - 1, n - 1) = 21$, $t_{21,0.025} = 2.08$
The value of the test statistic is
$$t = \frac{\bar x - \bar y}{\sqrt{\dfrac{s_X^2}{m} + \dfrac{s_Y^2}{n}}} = -2.46$$
The corresponding P-value is
$$\mathbb{P}(|T| \geq |t|) = \mathbb{P}(|T| \geq 2.46) = 0.023$$
Thus we reject the hypothesis that ozone has no effect on weight gain.
Two-sample t test with STATA:
. ttest weight, by(group) unequal
Two-sample t test with unequal variances
   Group |  Obs      Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+---------------------------------------------------------------
       0 |   23  22.42609    2.247108    10.77675    17.76587    27.0863
       1 |   22  11.00909    4.054461    19.01711    2.577378    19.4408
---------+---------------------------------------------------------------
combined |   45  16.84444    2.422057    16.24765    11.96311   21.72578
---------+---------------------------------------------------------------
    diff |         11.417    4.635531                1.985043   20.84895
Satterthwaite's degrees of freedom: 32.9179
Ho: mean(0) - mean(1) = diff = 0
Ha: diff < 0     t = 2.4629    P < t = 0.9904
Ha: diff != 0    t = 2.4629    P > |t| = 0.0192
Ha: diff > 0     t = 2.4629    P > t = 0.0096
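The two-sample statistic and the Satterthwaite degrees of freedom can be recomputed from the rounded summary values on the slide. This is a minimal sketch; the function name `welch_t` is made up here, and because the inputs are rounded the results agree with the Stata output only to about two decimals.

```python
from math import sqrt


def welch_t(mean_x, sd_x, m, mean_y, sd_y, n):
    """Two-sample t statistic with unequal variances and Satterthwaite df."""
    vx = sd_x ** 2 / m            # s_X^2 / m
    vy = sd_y ** 2 / n            # s_Y^2 / n
    t = (mean_x - mean_y) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
    return t, df


# Treatment (x) and control (y) summaries as quoted on the slide.
t, df = welch_t(11.01, 19.02, 22, 22.43, 10.78, 23)
df_conservative = min(22 - 1, 23 - 1)
print(round(t, 2), round(df, 1), df_conservative)
```

The sign of t is negative here because the treatment mean is subtracted first, whereas Stata reports mean(0) − mean(1).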

Comparing Means

Suppose that $\sigma_X^2 = \sigma_Y^2 = \sigma^2$. Then
$$\frac{\sigma^2}{m} + \frac{\sigma^2}{n} = \sigma^2\left(\frac{1}{m} + \frac{1}{n}\right).$$
Estimate $\sigma^2$ by the pooled sample variance
$$s_p^2 = \frac{(m-1)s_X^2 + (n-1)s_Y^2}{m+n-2}.$$
Pooled two-sample t test
◦ Two-sample t statistic
$$T = \frac{\bar X - \bar Y}{s_p\sqrt{\dfrac{1}{m} + \dfrac{1}{n}}}$$
T is t distributed with m + n − 2 degrees of freedom.
◦ Two-sided test: $H_0: \mu_X = \mu_Y$ against $H_a: \mu_X \neq \mu_Y$; reject $H_0$ if $|T| > t_{m+n-2,\alpha/2}$
◦ One-sided test: $H_0: \mu_X = \mu_Y$ against $H_a: \mu_X > \mu_Y$; reject $H_0$ if $T > t_{m+n-2,\alpha}$
Remarks:
◦ If m ≈ n, the test is reasonably robust against
  ⋄ nonnormality and
  ⋄ unequal variances.
◦ If sample sizes differ a lot, the test is very sensitive to unequal variances.
◦ Tests for differences in variances are sensitive to nonnormality.

Comparing Means

Example: Parkinson's disease
Study on Parkinson's disease
◦ Parkinson's disease, among other things, affects a person's ability to speak
◦ Overall condition can be improved by an operation
◦ How does the operation affect the ability to speak?
◦ Treatment group: Eight patients received operation
◦ Control group: Fourteen patients
◦ Data:
  ⋄ score on several tests
  ⋄ high scores indicate problem with speaking
[Figure: boxplots of speaking ability for the treatment and control groups]
Pooled two sample t test with STATA:
. infile ability group using parkinson.txt
. ttest ability, by(group)
Two-sample t test with equal variances
   Group |  Obs      Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+---------------------------------------------------------------
       0 |   14  1.821429     .148686    .5563322    1.500212   2.142645
       1 |    8      2.45      .14516    .4105745    2.106751   2.793249
---------+---------------------------------------------------------------
combined |   22      2.05    .1249675    .5861497    1.790116   2.309884
---------+---------------------------------------------------------------
    diff |      -.6285714    .2260675                -1.10014  -.1570029
Degrees of freedom: 20
Ho: mean(0) - mean(1) = diff = 0
Ha: diff < 0     t = -2.7805    P < t = 0.0058
Ha: diff != 0    t = -2.7805    P > |t| = 0.0115
Ha: diff > 0     t = -2.7805    P > t = 0.9942
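The pooled statistic in the Stata output can be reproduced from the group means and standard deviations alone. A minimal sketch, with the hypothetical helper name `pooled_t`, using the values printed in the table:

```python
from math import sqrt


def pooled_t(mean_x, sd_x, m, mean_y, sd_y, n):
    """Pooled two-sample t statistic and its degrees of freedom."""
    sp2 = ((m - 1) * sd_x ** 2 + (n - 1) * sd_y ** 2) / (m + n - 2)  # pooled variance
    se = sqrt(sp2) * sqrt(1 / m + 1 / n)
    return (mean_x - mean_y) / se, m + n - 2


# Group 0 (control) minus group 1 (treatment), matching Stata's "diff".
t, df = pooled_t(1.821429, 0.5563322, 14, 2.45, 0.4105745, 8)
print(round(t, 4), df)
```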

Comparing Variances

Example: Parkinson's disease
In order to apply the pooled two-sample t test, the variances of the two groups have to be equal. Are the data compatible with this assumption?
F test for equality of variances
The F test statistic
$$F = \frac{s_X^2}{s_Y^2}$$
is F distributed with m − 1 and n − 1 degrees of freedom.
. sdtest ability, by(group)
Variance ratio test
   Group |  Obs      Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+---------------------------------------------------------------
       0 |   14  1.821429     .148686    .5563322    1.500212   2.142645
       1 |    8      2.45      .14516    .4105745    2.106751   2.793249
---------+---------------------------------------------------------------
combined |   22      2.05    .1249675    .5861497    1.790116   2.309884
Ho: sd(0) = sd(1)
F(13,7) observed   = F_obs           = 1.836
F(13,7) lower tail = F_L = 1/F_obs   = 0.545
F(13,7) upper tail = F_U = F_obs     = 1.836
Ha: sd(0) < sd(1)     P < F_obs = 0.7865
Ha: sd(0) != sd(1)    P < F_L + P > F_U = 0.3767
Ha: sd(0) > sd(1)     P > F_obs = 0.2135
Result: We cannot reject the null hypothesis that the variances are equal.
Problem: Are the data normally distributed?
[Figure: normal quantile plots of speaking ability for the treatment and control groups]

Comparing Proportions

Suppose we have two populations with unknown proportions $p_1$ and $p_2$.
◦ Random samples of size $n_1$ and $n_2$ are drawn from the two populations
◦ $\hat p_1$ is the sample proportion for the first population
◦ $\hat p_2$ is the sample proportion for the second population
Question: Are the two proportions $p_1$ and $p_2$ different?
Test problem: $H_0: p_1 = p_2$ vs $H_1: p_1 \neq p_2$
Idea: Reject $H_0$ if $|\hat p_1 - \hat p_2|$ is large.
Note that
$$\hat p_1 - \hat p_2 \approx N\!\left(p_1 - p_2,\ \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}\right)$$
This suggests the test statistic
$$T = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p(1-\hat p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where $\hat p$ is the combined proportion of successes in both samples
$$\hat p = \frac{X_1 + X_2}{n_1 + n_2} = \frac{n_1\hat p_1 + n_2\hat p_2}{n_1 + n_2}$$
with $X_1$ and $X_2$ denoting the number of successes in each sample.
Under $H_0$, the test statistic is approximately standard normally distributed.

Comparing Proportions

Example: Question wording
The ability of question wording to affect the outcome of a survey can be a serious issue. Consider the following two questions:
1. Would you favor or oppose a law that would require a person to obtain a police permit before purchasing a gun?
2. Would you favor or oppose a law that would require a person to obtain a police permit before purchasing a gun, or do you think such a law would interfere too much with the right of citizens to own guns?
In two surveys, the following results were obtained:

Question   Yes    No   Total
1          463   152     615
2          403   182     585

Question: Is the true proportion of people favoring the permit law the same in both groups or not?
. prtesti 615 0.753 585 0.689
Two-sample test of proportion        x: Number of obs = 615
                                     y: Number of obs = 585
Variable |     Mean   Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+--------------------------------------------------------------
       x |     .753   .0173904                     .7189155   .7870845
       y |     .689   .0191387                     .6514889   .7265111
---------+--------------------------------------------------------------
    diff |     .064   .0258595                     .0133163   .1146837
         | under Ho:  .0258799    2.47    0.013
Ho: proportion(x) - proportion(y) = diff = 0
Ha: diff < 0     z = 2.473    P < z = 0.9933
Ha: diff != 0    z = 2.473    P > |z| = 0.0134
Ha: diff > 0     z = 2.473    P > z = 0.0067
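The pooled two-proportion test can be retraced from the raw counts in the table. A minimal sketch (the function name `two_proportion_z` is illustrative); since it starts from the counts rather than the rounded proportions passed to `prtesti`, the result differs from Stata's in the third decimal.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """z statistic with pooled standard error and two-sided normal P-value."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # combined proportion of successes
    se0 = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se0
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


z, p_value = two_proportion_z(463, 615, 403, 585)
print(round(z, 2), round(p_value, 3))
```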

Final Remarks

Statistical theory focuses on the significance level, the probability of a type I error.
In practice, discussion of the power of a test is also important:
Example: Efficient Market Hypothesis
"Efficient market hypothesis" for stock prices:
◦ future stock prices show only random variation
◦ market incorporates all information available now in present prices
◦ no information available now will help to predict future stock prices
Testing of the efficient market hypothesis:
◦ Many studies tested
  $H_0$: Market is efficient
  $H_a$: Prediction is possible
◦ Almost all studies failed to find good evidence against $H_0$.
◦ Consequently the efficient market hypothesis became quite popular.
Problem:
◦ Power was generally low in the significance tests employed in the studies.
◦ Failure to reject $H_0$ is no evidence that $H_0$ is true.
◦ More careful studies showed that the size of a company and measures of value such as ratio of stock price to earnings do help predict future stock prices.

Final Remarks

Example
◦ IQ of 1000 women and 1000 men
◦ $\hat\mu_w = 100.68$, $\hat\sigma_w = 14.91$
◦ $\hat\mu_m = 98.90$, $\hat\sigma_m = 14.68$
◦ Pooled two-sample t test: $T = -2.7009$
◦ Reject $H_0: \mu_w = \mu_m$ since $|T| > t_{1998,0.005} = 2.58$.
◦ The difference in the IQ is statistically significant at the 0.01 level.
◦ However we might conclude that the difference is scientifically irrelevant.
Note: A low significance level does not mean there is a large difference, but only that there is strong evidence that there is some difference.

Final Remarks

Example: Is radiation from cell phones harmful?
◦ Observational study
◦ Comparison of brain cancer patients and similar group without brain cancer
◦ No statistically significant association between cell phone use and a group of brain cancers known as gliomas.
◦ Separate analysis for 20 types of gliomas found association between phone use and one rare form.
◦ Risk seemed to decrease with greater mobile phone use.
Think for a moment:
◦ Suppose all 20 null hypotheses are true.
◦ Each test has 5% chance of being significant - the outcome is Bernoulli distributed with parameter 0.05.
◦ The number of false positive tests is binomially distributed: $N \sim \text{Bin}(20, 0.05)$
◦ The probability of getting one or more positive results is
$$\mathbb{P}(N \geq 1) = 1 - \mathbb{P}(N = 0) = 1 - 0.95^{20} = 0.64.$$
We therefore might have expected at least one significant association.
Beware of searching for significance
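The arithmetic behind the 0.64 can be sketched in a few lines, assuming (as the binomial model on the slide implicitly does) that the 20 tests are independent. The helper `binom_pmf` is defined here for illustration.

```python
from math import comb

k, alpha = 20, 0.05   # number of tests and per-test significance level


def binom_pmf(j, n, p):
    """P(N = j) for N ~ Bin(n, p)."""
    return comb(n, j) * p ** j * (1 - p) ** (n - j)


p_none = binom_pmf(0, k, alpha)            # all 20 tests correctly non-significant
p_at_least_one = 1 - p_none                # one or more false positives
expected_false_positives = k * alpha       # mean of Bin(20, 0.05)
print(round(p_at_least_one, 2), expected_false_positives)
```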

Final Remarks

Problem: If several tests are performed, the probability of a type I error increases.
Idea: Adjust significance level of each single test.
Bonferroni procedure:
◦ Perform k tests
◦ Use significance level α/k for each of the k tests
◦ If all null hypotheses are true, the probability is at most α that any of the tests rejects its null hypothesis.
Example
Suppose we perform k = 6 tests and obtain the following P-values:

P-value   0.476   0.032   0.241   0.008*   0.010   0.001*
α/k       0.0083

Only two tests (*) are significant at the 0.05 level.
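Applied to the six P-values of the example, the Bonferroni procedure amounts to a single comparison per test. A minimal sketch, with variable names chosen here:

```python
p_values = [0.476, 0.032, 0.241, 0.008, 0.010, 0.001]
alpha = 0.05

threshold = alpha / len(p_values)                      # alpha / k
significant = [p for p in p_values if p <= threshold]  # Bonferroni-adjusted
unadjusted = [p for p in p_values if p <= alpha]       # what naive testing would flag
print(round(threshold, 4), significant, len(unadjusted))
```

Comparing `significant` with `unadjusted` shows how many nominally significant results the adjustment removes.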

Two-Way Tables

Example: Depression and marital status
Question: Does severity of depression depend on marital status?
◦ Study of 159 depression patients
◦ Patients were categorized by
  ⋄ severity of depression (severe, normal, mild)
  ⋄ marital status (single, married, widowed/divorced)
The following two-way table summarizes the data:

                 Marital Status
Depression   Single   Married   Wid/Div   Total
Severe           16        22        19      57
Normal           29        33        14      76
Mild              9        14         3      26
Total            54        69        36     159

◦ The severity of depression is a row variable.
◦ The marital status is a column variable.
◦ Each combination of values defines a cell.
Inference for Two-Way Tables, Feb 25, 2004

Two-Way Tables

From this table of counts, the sample distribution can be obtained by dividing each cell by the total sample size n = 159:

                 Marital Status
Depression   Single   Married   Wid/Div   Total
Severe        0.101     0.138     0.119   0.358
Normal        0.182     0.208     0.088   0.478
Mild          0.057     0.088     0.019   0.164
Total         0.340     0.434     0.226   1.000

◦ Joint distribution: proportion for each combination of values
◦ Marginal distribution: distribution of the row and column variables separately.
◦ Conditional distribution: distribution of one variable at a given level of the other variable
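The three kinds of distribution can be computed directly from the table of counts. A minimal sketch; the list and dictionary names are illustrative only.

```python
rows = ["Severe", "Normal", "Mild"]
cols = ["Single", "Married", "Wid/Div"]
counts = [[16, 22, 19],
          [29, 33, 14],
          [9, 14, 3]]

n = sum(map(sum, counts))

# Joint distribution: each cell divided by the total sample size.
joint = [[c / n for c in row] for row in counts]

# Marginal distributions: row and column totals divided by n.
row_marginal = [sum(row) / n for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
col_marginal = [t / n for t in col_totals]

# Conditional distribution of depression given marital status:
# each column divided by its own column total.
conditional = {
    cols[j]: [counts[i][j] / col_totals[j] for i in range(len(rows))]
    for j in range(len(cols))
}
print([round(p, 3) for p in conditional["Wid/Div"]])
```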

Test for Independence

Example: Depression and marital status
Conditional distributions of severity of depression given marital status:
[Figure: bar chart of the sample proportions of severe, normal and mild depression for single, married and wid/div patients]
Question: Is there a relationship between the row variable (depression) and the column variable (marital status)?
◦ The distribution for widowed/divorced patients seems to differ from the distributions for single or married patients.
◦ Are these differences significant or can they be attributed to chance variation?
◦ How likely are differences as large or larger than those observed if the two variables were indeed independent (and thus the conditional distributions were the same)?
A statistical test will be required to answer these questions.

Test for Independence

Test problem:
$H_0$: the row and the column variables are independent
$H_a$: the row and the column variables are dependent
How can we measure evidence against the null hypothesis?
◦ What counts would we expect to observe if the null hypothesis were true?
$$\text{Expected Cell Count} = \frac{\text{row total} \times \text{column total}}{\text{total count}}$$
Recall: For two independent events A and B, $\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B)$.
If the null hypothesis $H_0$ is true, then the table of expected counts should be "close" to the observed table of counts.
◦ We need a statistic that measures the difference between the tables.
◦ And we need to know what is the distribution of the statistic to make statistical inference.

Test for Independence

Idea of the test:
◦ construct table of expected counts
◦ compare expected with observed counts
◦ if the null hypothesis is true, the difference between the tables should be "small"
The χ² (Chi-Squared) Statistic
To measure how far the expected table is from the observed table, we use the following test statistic:
$$X = \sum_{\text{all cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
◦ Under the null hypothesis, X is approximately χ² distributed with (r − 1)(c − 1) degrees of freedom.
◦ We reject $H_0$ if observed and expected counts are very different and hence X is large.
Consequently we reject $H_0$ at significance level α if $X \geq \chi^2_{(r-1)(c-1),\alpha}$.
Why (r − 1)(c − 1)?
Recall that our "expected" table is based on some quantities estimated from the data: namely the row and column totals.
Once these totals are known, filling in any (r − 1)(c − 1) undetermined table entries actually gives us the whole table.
Thus, there are only (r − 1)(c − 1) freely varying quantities in the table.

The χ² Distribution

What does the χ² distribution look like?
[Figure: χ² densities for 1, 5, 10, 20 and 30 degrees of freedom]
◦ Unlike the Normal or t distributions, the χ² distribution takes values in (0, ∞).
◦ As with the t distribution, the exact shape of the χ² distribution depends on its degrees of freedom.
Recall that X has only an approximate $\chi^2_{(r-1)(c-1)}$ distribution. When is the approximation valid?
◦ For any two-way table larger than 2 × 2, we require that the average expected cell count is at least 5 and each expected count is at least one.
◦ For 2 × 2 tables, we require that each expected count be at least 5.

Test for Independence

Example: Depression and marital status
The following table shows the observed counts and expected counts (in brackets):

                 Marital Status
Depression   Single       Married      Wid/Div      Total
Severe       16 (19.36)   22 (24.74)   19 (12.90)      57
Normal       29 (25.81)   33 (32.98)   14 (17.21)      76
Mild          9 (8.83)    14 (11.28)    3 (5.89)       26
Total        54           69           36             159

◦ The table is 3 × 3, so there are (r − 1)(c − 1) = 2 × 2 = 4 degrees of freedom.
◦ The critical value (significance level α = 0.05) is $\chi^2_{4,0.05} = 9.49$.
◦ The observed value of the χ² test statistic is
$$x = \frac{(16 - 19.36)^2}{19.36} + \frac{(22 - 24.74)^2}{24.74} + \ldots + \frac{(3 - 5.89)^2}{5.89} = 6.83 \leq \chi^2_{4,0.05}$$
Thus we do not reject the null hypothesis of independence.
◦ The corresponding P-value is
$$\mathbb{P}(X \geq x) = \mathbb{P}(X \geq 6.83) = 0.145 \geq \alpha$$
Again we do not reject $H_0$.
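The whole calculation for this example fits in a short script. This is a minimal sketch using only the standard library; the closed-form tail probability used for the P-value holds specifically for 4 degrees of freedom, which is an assumption of this sketch rather than a general method.

```python
from math import exp

observed = [[16, 22, 19],
            [29, 33, 14],
            [9, 14, 3]]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

# Expected count = row total * column total / total count.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum over all cells of (O - E)^2 / E.
x = sum((o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row))
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# For df = 4 only, the chi-squared upper tail is exp(-x/2) * (1 + x/2).
p_value = exp(-x / 2) * (1 + x / 2)
print(round(x, 4), df, round(p_value, 3))
```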

Test for Independence

The χ² test in STATA:
. insheet using depression.txt, clear
(3 vars, 159 obs)
. tabulate depression marital, chi2
           |             Marital
Depression |   Married    Single   Wid/Div |   Total
-----------+-------------------------------+--------
      Mild |        14         9         3 |      26
    Normal |        33        29        14 |      76
    Severe |        22        16        19 |      57
-----------+-------------------------------+--------
     Total |        69        54        36 |     159
Pearson chi2(4) = 6.8281   Pr = 0.145
The same result can be obtained by the command
. tabi 16 22 19 \ 29 33 14 \ 9 14 3, chi2
           |               col
       row |         1         2         3 |   Total
-----------+-------------------------------+--------
         1 |        16        22        19 |      57
         2 |        29        33        14 |      76
         3 |         9        14         3 |      26
-----------+-------------------------------+--------
     Total |        54        69        36 |     159
Pearson chi2(4) = 6.8281   Pr = 0.145

Models for Two-Way Tables

The χ²-test for the presence of a relationship between two distributions in a two-way table is valid for data produced by several different study designs, although the exact null hypothesis varies.
◦ Examining independence between variables
  ⋄ Select random sample of size n from a population.
  ⋄ Classify each individual according to two categorical variables.
  Question: Is there a relationship between the two variables?
  Test problem:
  $H_0$: The two variables are independent
  $H_a$: The two variables are not independent
  Example: Suppose we collect an SRS of 114 college students, and categorize each by major and GPA (e.g. (0, 0.5], ..., (3.5, 4]). Then, we can use the χ²-test to ascertain whether grades and major are independent.
◦ Comparing several populations
  ⋄ Select independent random samples from each of c populations, of sizes $n_1, \ldots, n_c$.
  ⋄ Classify each individual according to a categorical response variable with r possible values (the same across populations).
  ⋄ This yields an r × c table.
  Question: Does the distribution of the response variable differ between populations?
  Test problem:
  $H_0$: The distribution is the same in all populations.
  $H_a$: The distribution is not the same.
  Example: Suppose we select independent SRSs of Psychology, Biology and Math majors, of sizes 40, 39, 35, and classify each individual by GPA range. Then, we can use a χ²-test to ascertain whether or not the distribution of grades is the same in all three populations.

Models for Two-Way Tables

Example: Literary Analysis (Rice, 1995)
When Jane Austen died, she left the novel Sanditon only partially completed, but she left a summary of the remainder. A highly literate admirer finished the novel, attempting to emulate Austen's style, and the hybrid was published.
Someone counted the occurrences of various words in several chapters from various works.

           Austen                                  Imitator
Word       Sense and Sensibility  Emma  Sanditon I  Sanditon II
a                            147   186         101           83
an                            25    26          11           29
this                          32    39          15           15
that                          94   105          37           22
with                          59    74          28           43
without                       18    10          10            4
TOTAL                        375   440         202          196

Questions:
◦ Is there consistency in Austen's work (do the frequencies with which Austen used these words change from work to work)?
  Answer: X = 12.27, df = ?, P-value = ?
◦ Was the imitator successful (are the frequencies of the words the same in Austen's work and the imitator's work)?
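For the first question, the statistic can be recomputed from the three Austen columns. The sketch below is one way to do it with the standard library; the helper names are made up here, and the tail formula is a finite sum that is valid only because the degrees of freedom turn out to be even.

```python
from math import exp, factorial

# Rows: a, an, this, that, with, without.
# Columns: Sense and Sensibility, Emma, Sanditon I (Austen's own chapters).
austen = [[147, 186, 101],
          [25, 26, 11],
          [32, 39, 15],
          [94, 105, 37],
          [59, 74, 28],
          [18, 10, 10]]


def chi2_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    x = 0.0
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            e = r * c / n                       # expected count under H0
            x += (table[i][j] - e) ** 2 / e
    return x, (len(row_totals) - 1) * (len(col_totals) - 1)


def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a chi-squared variable with EVEN df (finite sum)."""
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


x, df = chi2_statistic(austen)
p_value = chi2_upper_tail_even_df(x, df)
print(round(x, 2), df, round(p_value, 2))
```

The same function can be applied to other column selections, for example to compare Austen's chapters with the imitator's, though which columns to pool for that comparison is a modelling choice the slide leaves open.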

Simpson's Paradox

Example: Medical study
◦ contact randomly chosen people in a district in England
◦ data on 1314 women contacted
◦ either current smoker or who had never smoked
Question: Survival rate after 20 years?

         Smoker   Not
Dead        139   230
Alive       438   502

Result: A higher percent of smokers stayed alive!
Here are the same data classified by their age at time of the survey:

         Age 18 to 44      Age 45 to 64      Age 65+
         Smoker   Not      Smoker   Not      Smoker   Not
Dead         19    13          78    52          42   165
Alive       269   327         162   147           7    28

Result: In each age group a higher percent of nonsmokers survive.
Age at time of the study is a confounding variable.
Simpson's Paradox
An association/comparison that holds for all of several groups can reverse direction when the data are combined to form a single group.
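The reversal can be verified by computing the survival rates from the counts on the slide. A minimal sketch; the dictionary layout and function name are illustrative.

```python
# (dead, alive) counts as given in the tables above.
overall = {"smoker": (139, 438), "nonsmoker": (230, 502)}
by_age = {
    "18 to 44": {"smoker": (19, 269), "nonsmoker": (13, 327)},
    "45 to 64": {"smoker": (78, 162), "nonsmoker": (52, 147)},
    "65+": {"smoker": (42, 7), "nonsmoker": (165, 28)},
}


def survival_rate(dead, alive):
    return alive / (dead + alive)


overall_rates = {g: survival_rate(*c) for g, c in overall.items()}
age_rates = {age: {g: survival_rate(*c) for g, c in groups.items()}
             for age, groups in by_age.items()}

print({g: round(r, 3) for g, r in overall_rates.items()})
for age, rates in age_rates.items():
    print(age, {g: round(r, 3) for g, r in rates.items()})
```

The combined table favors smokers only because the nonsmokers in this sample are much more concentrated in the oldest age group, where survival is low for everyone.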

Simple Linear Regression

Example: Body density
Aim: Measure body density (weight per unit volume of the body)
(Body density indicates the fat content of the human body.)
Problem:
◦ Body density is difficult to measure directly.
◦ Research suggests that skinfold thickness can accurately predict body density.
◦ Skinfold thickness is measured by pinching a fold of skin between calipers.
[Figure: scatter plot of body density (10³ kg/m³) against skinfold thickness (mm)]
Questions:
◦ Are body density and skinfold thickness related?
◦ How accurately can we predict body density from skinfold thickness?
Regression: predict response variable for fixed value of explanatory variable
◦ describe linear relationship in data by regression line
◦ fitted regression line is affected by chance variation in observed data
Statistical inference: accounts for chance variation in data
Simple Linear Regression, Feb 27, 2004

Population Regression Line

Simple linear regression studies the relationship between
◦ a response variable Y and
◦ a single explanatory variable X.
We expect that different values of X will produce different mean responses of Y.
For given X = x, we consider the subpopulation with X = x:
◦ this subpopulation has mean
$$\mu_{Y|X=x} = \mathbb{E}(Y \mid X = x) \quad \text{(cond. mean of Y given X = x)}$$
◦ and variance
$$\sigma^2_{Y|X=x} = \operatorname{var}(Y \mid X = x) \quad \text{(cond. variance of Y given X = x)}$$
Linear regression model with constant variance:
$$\mathbb{E}(Y \mid X = x) = \mu_{Y|X=x} = a + b\,x \quad \text{(population regression line)}$$
$$\operatorname{var}(Y \mid X = x) = \sigma^2_{Y|X=x} = \sigma^2$$
◦ The population regression line connects the conditional means of the response variable for fixed values of the explanatory variable.
◦ This population regression line tells how the mean response of Y varies with X.
◦ The variance (and standard deviation) does not depend on x.

Conditional Mean

Sample $(x_1, y_1), \ldots, (x_n, y_n)$
[Figure: four panels showing the sampling probability $f(x, y)$; the slice $f(x_0, y)$ obtained by fixing $x = x_0$; the rescaling by $f_X(x_0)$; and the resulting conditional probability]
Conditional probability:
$$f(y \mid x_0) = \frac{f_{XY}(x_0, y)}{f_X(x_0)}$$
Conditional mean:
$$\mathbb{E}(Y \mid X = x_0) = \int y\, f_{Y|X}(y \mid x_0)\, dy$$

The Linear Regression Model

Simple linear regression
$$Y_i = a + b\,x_i + \varepsilon_i, \qquad i = 1, \ldots, n$$
where
  $Y_i$  response (also dependent variable)
  $x_i$  predictor (also independent variable)
  $\varepsilon_i$  error
Assumptions:
◦ Predictor $x_i$ is deterministic (fixed values, not random).
◦ Errors have zero mean, $\mathbb{E}(\varepsilon_i) = 0$.
◦ Variation about mean does not depend on $x_i$, i.e. $\operatorname{var}(\varepsilon_i) = \sigma^2$.
◦ Errors $\varepsilon_i$ are independent.
Often we additionally assume:
◦ The errors are normally distributed, $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$.
For fixed x the response Y is normally distributed with $Y \sim N(a + b\,x, \sigma^2)$.

Least Squares Estimation

Data: $(Y_1, x_1), \ldots, (Y_n, x_n)$
Aim: Find straight line which fits data best:
$$\hat Y_i = a + b\,x_i \quad \text{(fitted values for coefficients a and b)}$$
  a - intercept
  b - slope
Least Squares Approach:
Minimize squared distance between observed $Y_i$ and fitted $\hat Y_i$:
$$L(a, b) = \sum_{i=1}^n (Y_i - \hat Y_i)^2 = \sum_{i=1}^n (Y_i - a - b\,x_i)^2$$
Set partial derivatives to zero (normal equations):
$$\frac{\partial L}{\partial a} = 0 \iff \sum_{i=1}^n (Y_i - a - b\,x_i) = 0$$
$$\frac{\partial L}{\partial b} = 0 \iff \sum_{i=1}^n (Y_i - a - b\,x_i)\cdot x_i = 0$$
Solution: Least squares estimators
$$\hat a = \bar Y - \frac{S_{XY}}{S_{XX}}\cdot \bar x, \qquad \hat b = \frac{S_{XY}}{S_{XX}}$$
where
$$S_{XY} = \sum_{i=1}^n (Y_i - \bar Y)(x_i - \bar x), \qquad S_{XX} = \sum_{i=1}^n (x_i - \bar x)^2 \quad \text{(sum of squares)}$$

Least Squares Estimation

Least squares predictor $\hat Y$:
$$\hat Y_i = \hat a + \hat b\,x_i$$
Residuals $\hat\varepsilon_i$:
$$\hat\varepsilon_i = Y_i - \hat Y_i = Y_i - \hat a - \hat b\,x_i$$
Residual sum of squares ($SS_{Residual}$):
$$SS_{Residual} = \sum_{i=1}^n \hat\varepsilon_i^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2$$
Estimation of $\sigma^2$:
$$\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^n (Y_i - \hat Y_i)^2 = \frac{1}{n-2}\,SS_{Residual}$$
Regression standard error:
$$s_e = \hat\sigma = \sqrt{SS_{Residual}/(n-2)}$$
Variation accounting:
$$SS_{Total} = \sum_{i=1}^n (Y_i - \bar Y)^2 \quad \text{total variation}$$
$$SS_{Model} = \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \quad \text{variation explained by linear model}$$
$$SS_{Residual} = \sum_{i=1}^n (Y_i - \hat Y_i)^2 \quad \text{remaining variation}$$

Least Squares Estimation

Example: Body density
Scatter plot with least squares regression line:
[Figure: scatter plot of body density (10³ kg/m³) against skinfold thickness (mm) with the fitted regression line]
Calculation of least squares estimates:

$\bar x$   $\bar y$   $S_{XX}$   $S_{XY}$   $S_{YY}$   $SS_{Residual}$
1.064      1.568      0.0235     -0.2679    4.244      1.187

$$\hat b = \frac{S_{XY}}{S_{XX}} = \frac{-0.2679}{0.0235} = -11.40$$
$$\hat a = \bar y - \hat b\,\bar x = 1.568 + 11.40 \cdot 1.064 = 13.70$$
$$\hat\sigma^2 = \frac{RSS}{n-2} = \frac{1.187}{90} = 0.0132$$
$$s_e = \sqrt{\hat\sigma^2} = \sqrt{0.0132} = 0.1149$$
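The estimates on this slide follow from the summary quantities alone. A minimal sketch, assuming n = 92 (implied by the 90 residual degrees of freedom); variable names are chosen here.

```python
from math import sqrt

n = 92                          # n - 2 = 90 on the slide
x_bar, y_bar = 1.064, 1.568
s_xx, s_xy = 0.0235, -0.2679
ss_residual = 1.187

b_hat = s_xy / s_xx                     # slope
a_hat = y_bar - b_hat * x_bar           # intercept
sigma2_hat = ss_residual / (n - 2)      # estimate of sigma^2
s_e = sqrt(sigma2_hat)                  # regression standard error

print(round(b_hat, 2), round(a_hat, 2), round(sigma2_hat, 4), round(s_e, 4))
```

Small differences from the regression output (−11.41 and 13.71) come from the rounding of the summary quantities.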

Least Squares Estimation

Example: Body density
Using STATA:
. infile ID BODYD SKINT using bodydens.txt, clear
(92 observations read)
. regress BODYD SKINT
      Source |         SS    df           MS        Number of obs =      92
-------------+--------------------------------      F(1, 90)      =  231.89
       Model |  3.05747739     1   3.05747739       Prob > F      =  0.0000
    Residual |  1.18663025    90   .013184781       R-squared     =  0.7204
-------------+--------------------------------      Adj R-squared =  0.7173
       Total |  4.24410764    91   .046638546       Root MSE      =  .11482
-------------------------------------------------------------------------------
       BODYD |      Coef.   Std. Err.        t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       SKINT |  -11.41345    .7494999   -15.23    0.000   -12.90246   -9.924433
       _cons |   13.71221    .7975822    17.19    0.000    12.12768    15.29675
-------------------------------------------------------------------------------
. twoway (lfitci BODYD SKINT, range(1 1.1)) (scatter BODYD SKINT), xtitle(Skin thickness) ytitle(Body density) scheme(s1color) legend(off)
[Figure: scatter plot of body density against skin thickness with fitted line and confidence band]
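Several entries in the header of the regression output are simple functions of the sums of squares. The following sketch recomputes them from the Model and Residual rows, assuming the usual definitions of R-squared, the F statistic and the root mean squared error; it is a consistency check, not a description of how Stata works internally.

```python
from math import sqrt

ss_model, ss_residual = 3.05747739, 1.18663025
n = 92

ss_total = ss_model + ss_residual          # variation accounting
r_squared = ss_model / ss_total            # share of variation explained
ms_residual = ss_residual / (n - 2)        # estimate of sigma^2
f_stat = (ss_model / 1) / ms_residual      # one model degree of freedom
root_mse = sqrt(ms_residual)               # regression standard error s_e

print(round(ss_total, 4), round(r_squared, 4), round(f_stat, 2), round(root_mse, 5))
```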

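Properties of Estimators

Statistical properties of $\hat a$ and $\hat b$
Mean and variance of $\hat b$:
$$\mathbb{E}(\hat b) = b, \qquad \operatorname{var}(\hat b) = \frac{\sigma^2}{S_{XX}}$$
Recall that
$$S_{XX} = \sum_{i=1}^n (x_i - \bar x)^2$$
Distribution of $\hat b$:
$$\hat b \sim N\!\left(b,\ \frac{\sigma^2}{S_{XX}}\right)$$
Mean and variance of $\hat a$:
$$\mathbb{E}(\hat a) = a, \qquad \operatorname{var}(\hat a) = \left(\frac{1}{n} + \frac{\bar x^2}{S_{XX}}\right)\sigma^2$$
Distribution of $\hat a$:
$$\hat a \sim N\!\left(a,\ \left(\frac{1}{n} + \frac{\bar x^2}{S_{XX}}\right)\sigma^2\right)$$
Inference for Regression, Mar 1, 2004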

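Confidence Intervals

Note that $\hat b \sim N\!\left(b, \frac{\sigma^2}{S_{XX}}\right)$. Thus
$$\frac{\hat b - b}{\sigma/\sqrt{S_{XX}}} \sim N(0, 1)$$
Substituting $s_e$ for σ, we obtain
$$\frac{\hat b - b}{s_e/\sqrt{S_{XX}}} \sim t_{n-2}$$
(1 − α) confidence interval for b:
$$\hat b \pm t_{n-2,\alpha/2} \cdot \frac{s_e}{\sqrt{S_{XX}}}$$
Similarly
$$\frac{\hat a - a}{\sigma\sqrt{\dfrac{1}{n} + \dfrac{\bar x^2}{S_{XX}}}} \sim N(0, 1)$$
Substituting $s_e$ for σ, we obtain
$$\frac{\hat a - a}{s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar x^2}{S_{XX}}}} \sim t_{n-2}$$
(1 − α) confidence interval for a:
$$\hat a \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{XX}}}$$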

Tests on the Coefficients

Question: Is b equal to some value $b_0$?
The corresponding test problem is $H_0: b = b_0$ versus $H_a: b \neq b_0$.
The test statistic is given by
$$T_b = \frac{\hat b - b_0}{s_e/\sqrt{S_{XX}}} \sim t_{n-2}$$
The null hypothesis $H_0: b = b_0$ is rejected if $|T_b| > t_{n-2,\alpha/2}$.
Question: Is a equal to some value $a_0$?
The corresponding test problem is $H_0: a = a_0$ versus $H_a: a \neq a_0$.
The test statistic is given by
$$T_a = \frac{\hat a - a_0}{s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar x^2}{S_{XX}}}} \sim t_{n-2}$$
The null hypothesis $H_0: a = a_0$ is rejected if $|T_a| > t_{n-2,\alpha/2}$.

Inference for the Coefficients

Example: Body density
The confidence interval for b is given by
$$\hat b \pm t_{n-2,\alpha/2} \cdot \frac{s_e}{\sqrt{S_{XX}}} = -11.41 \pm 1.99 \cdot \frac{\sqrt{0.0132}}{\sqrt{0.023}} = [-12.92, -9.90]$$
The confidence interval for a is given by
$$\hat a \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{XX}}} = 13.71 \pm 1.99 \cdot \sqrt{0.0132} \cdot \sqrt{\frac{1}{92} + \frac{1.06^2}{0.023}} = [12.11, 15.30]$$
Furthermore we find for $\hat b$
$$T_b = \frac{\hat b}{s_e/\sqrt{S_{XX}}} = -15.22, \qquad |T_b| = 15.22 > t_{90,0.025} = 1.99$$
Thus we reject $H_0: b = 0$ at significance level 0.05: The coefficient b is statistically significantly different from zero.
Similarly
$$T_a = \frac{\hat a}{s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar x^2}{S_{XX}}}} = 17.26 > t_{90,0.025} = 1.99$$
Thus we reject $H_0: a = 0$ at significance level 0.05: The coefficient a is statistically significantly different from zero.
The corresponding P-values are
◦ $\mathbb{P}(|T_b| \geq 15.22) \approx 0$
◦ $\mathbb{P}(|T_a| \geq 17.26) \approx 0$
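The two confidence intervals can be retraced with the rounded inputs shown on the slide. A minimal sketch; the critical value 1.99 is taken from the slide rather than computed, and because every input is rounded the resulting limits and test decisions match the slide only approximately.

```python
from math import sqrt

n, x_bar, s_xx = 92, 1.06, 0.023
a_hat, b_hat = 13.71, -11.41
s_e = sqrt(0.0132)
t_crit = 1.99                     # t_{90, 0.025} as quoted on the slide

se_b = s_e / sqrt(s_xx)                           # standard error of the slope
se_a = s_e * sqrt(1 / n + x_bar ** 2 / s_xx)      # standard error of the intercept

ci_b = (b_hat - t_crit * se_b, b_hat + t_crit * se_b)
ci_a = (a_hat - t_crit * se_a, a_hat + t_crit * se_a)

# Two-sided tests of b = 0 and a = 0 at level 0.05.
reject_b = abs(b_hat / se_b) > t_crit
reject_a = abs(a_hat / se_a) > t_crit

print([round(v, 2) for v in ci_b], [round(v, 2) for v in ci_a], reject_b, reject_a)
```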

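Estimating the Mean

In the linear regression model, the mean of Y at $x = x_0$ is given by
$$\mathbb{E}(Y_{x_0}) = a + b\,x_0$$
Our estimate for the mean of Y at $X = x_0$ is
$$\hat Y_{x_0} = \hat a + \hat b\,x_0.$$
Question: How precise is this estimate?
Note that, since $\hat a = \bar Y - \hat b\,\bar x$,
$$\hat Y_{x_0} = \hat a + \hat b\,x_0 = \bar Y + \hat b\,(x_0 - \bar x).$$
Hence we obtain
$$\mathbb{E}(\hat Y_{x_0}) = a + b\,x_0$$
$$\operatorname{var}(\hat Y_{x_0}) = \left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}\right)\sigma^2$$
(1 − α) confidence interval for $\mathbb{E}(Y_{x_0})$:
$$(\hat a + \hat b\,x_0) \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}$$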

Estimating the Mean

Example: Body density
Suppose the measured skin thickness is $x_0 = 1.1$ mm. What is the mean body density for this value of skin thickness?
◦ Point estimate:
$$\hat Y_{x_0} = \hat a + \hat b\,x_0 = 13.71 - 11.41 \cdot 1.1 = 1.159$$
The mean body density is $1.159 \cdot 10^3$ kg/m³.
◦ Confidence interval:
$$(\hat a + \hat b\,x_0) \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}} = (13.71 - 11.41 \cdot 1.1) \pm 1.99 \cdot \sqrt{0.0132} \cdot \sqrt{\frac{1}{92} + \frac{(1.1 - 1.06)^2}{0.023}} = [1.09, 1.22]$$
In STATA, the standard error for estimating the mean of Y is calculated by passing the option stdp to predict (the t quantile uses the n − 2 = 90 residual degrees of freedom):
. predict BDH
. predict SE, stdp
. generate low=BDH-invttail(90,.025)*SE
. generate high=BDH+invttail(90,.025)*SE
. sort SKINT
. graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black black black) || scatter BODYD SKINT, legend(off) scheme(s1color)
[Figure: scatter plot of BODYD against SKINT with fitted line and confidence band for the mean]

Prediction

Suppose we want to predict Y at $x = x_0$.
Aim: (1 − α) confidence interval for Y
Note that
$$\hat a + \hat b\,x_0 - Y \sim N\!\left(0,\ \sigma^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}\right)\right)$$
Thus the desired (1 − α) confidence interval for $Y_{x_0}$ is given by
$$\hat a + \hat b\,x_0 \pm t_{n-2,\alpha/2} \cdot s_e \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}$$

Prediction

Example: Body density

Suppose the measured skin thickness is $x_0 = 1.1$ mm. What is the predicted body density for this value of skin thickness?

◦ Point estimate:

$$\hat Y_{x_0} = \hat a + \hat b\,x_0 = 13.71 - 11.41\cdot 1.1 = 1.159$$

The predicted body density is $1.159\cdot 10^3$ kg/m³.

◦ Confidence interval:

$$(\hat a + \hat b\,x_0) \pm t_{n-2,\alpha/2}\cdot s_e\cdot\sqrt{1+\frac{1}{n}+\frac{(x_0-\bar x)^2}{S_{XX}}} = (13.71 - 11.41\cdot 1.1) \pm 1.99\cdot\sqrt{0.0132}\cdot\sqrt{1+\frac{1}{92}+\frac{(1.1-1.06)^2}{0.023}} = [0.92,\ 1.40]$$

In STATA, the standard error for predicting Y is calculated by passing the option stdf to predict:

    . drop SE low high
    . predict SE, stdf
    . generate low=BDH-invttail(90,.025)*SE
    . generate high=BDH+invttail(90,.025)*SE
    . graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black black black) || scatter BODYD SKINT,
    > legend(off) scheme(s1color)

Alternatively, we can use the following command:

    . twoway (lfitci BODYD SKINT, range(1 1.1) stdf) (scatter BODYD SKINT),
    > xtitle(Skin thickness) ytitle(Body density) scheme(s1color) legend(off)

[Figure: two plots of body density against skin thickness, each with the fitted line and the 95% prediction band]
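The only difference between the interval for the mean and the interval for a new observation is the extra "1 +" under the square root. The following Python sketch makes that explicit. It reuses the rounded slide values for the body density example; the function name `interval` and its flag are hypothetical names introduced here.

```python
import math

# Rounded summary values from the body density example.
n, a_hat, b_hat = 92, 13.71, -11.41
s_e = math.sqrt(0.0132)
x_bar, s_xx, t_crit = 1.06, 0.023, 1.99

def interval(x0, for_new_observation=False):
    """Return (lower, upper) for the mean of Y at x0,
    or for a single new observation Y at x0."""
    center = a_hat + b_hat * x0
    extra = 1.0 if for_new_observation else 0.0  # the additional "1 +" term
    half_width = t_crit * s_e * math.sqrt(
        extra + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    return center - half_width, center + half_width

mean_ci = interval(1.1)
pred_int = interval(1.1, for_new_observation=True)
print("mean:       [%.2f, %.2f]" % mean_ci)
print("prediction: [%.2f, %.2f]" % pred_int)
```

The prediction interval is several times wider because it also accounts for the variability of a single observation around its mean, which does not shrink as n grows.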

Multiple Regression

Example: Food expenditure and family income

Data:
◦ Sample of 20 households
◦ Food expenditure (response variable)
◦ Family income and family size

    . regress food income
    ------------------------------------------------------------------------
        food |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
      income |   .1841099   .0149345    12.33   0.000    .1527336   .2154862
       _cons |  -.4119994   .7637666    -0.54   0.596   -2.016613   1.192615
    ------------------------------------------------------------------------

    . regress food number
    ------------------------------------------------------------------------
        food |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
      number |   2.287334   .4224493     5.41   0.000    1.399801   3.174867
       _cons |   1.217365   1.410627     0.86   0.399   -1.746252   4.180981
    ------------------------------------------------------------------------

[Figure: scatter plots of Food Expenditure against Income and against Family Size]

Multiple Regression, Mar 3, 2004

17) = 121.156433 Prob > F = 0.499913 .i + b2 x2.1482117 . Err. 2004 -2- . n Multiple Regression.i.9346 --------+-----------------------------Adj R-squared = 0. .005 .312865 2 193. Interval] --------+---------------------------------------------------------------income | .Multiple Regression Multiple regression model Yi = b0 + b1 x1. .i + .i predictor variables (ﬁxed. xp. .106 -2. bp regression coeﬃcients ◦ εi ∼ N (0.118295 .0326365 17 1.0000 Resid.71 0. σ 2) error variable Example: Food expenditure and family income Fitting multiple regression models in STATA: . Std.2633232 ------------------------------------------------------------------------iid i = 1.0163786 9. . Mar 3.24 0.9269 Total | 413.1136558 . t P>|t| [95% Conf. .47 Model | 386. .261 ------------------------------------------------------------------------food | Coef.i + εi where ◦ Yi response variable ◦ x1. . regress food income number Source | SS df MS Number of obs = 20 --------+-----------------------------F( 2. . .1827676 number | .000 . | 27.308831 _cons | -1. . nonrandom) ◦ b0.7550264 Root MSE = 1. + bp xp.59015509 R-squared = 0. .05 0.345502 19 21. . .2444411 3.7931055 .6548524 -1. .2773798 1.

Multiple Regression

Example: Food expenditure and family income

Data: $(\text{Food}_i, \text{Income}_i, \text{Number}_i)$, $i = 1, \dots, 20$

Fitted regression model:

$$\widehat{\text{Food}} = \hat b_0 + \hat b_1\,\text{Income} + \hat b_2\,\text{Number}$$

[Figure: three-dimensional plot of the observations $Y_i$ and fitted values $\hat Y_i$ against income and family size]

The fitted model is a two-dimensional plane, which is difficult to visualize.
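Although the fitted plane is hard to draw, it is easy to evaluate. The sketch below plugs the coefficients reported by `regress food income number` into the fitted model. The household (income 50, family size 4, in the units of the data set) is a hypothetical example, not one of the 20 observations.

```python
# Coefficients from the STATA output of "regress food income number".
B0 = -1.118295       # _cons
B_INCOME = 0.1482117
B_NUMBER = 0.7931055

def predicted_food(income, number):
    """Fitted food expenditure for a given income and family size."""
    return B0 + B_INCOME * income + B_NUMBER * number

# Hypothetical household: income 50, four family members.
base = predicted_food(50, 4)
one_more = predicted_food(50, 5)

print("fitted expenditure:", round(base, 3))
print("change for one extra family member:", round(one_more - base, 4))
```

Holding income fixed, each additional family member shifts the fitted value by the coefficient of `number`, which is how the plane's slope in that direction is read.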

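Inference for Multiple Regression

Multiple regression model (matrix notation)

$$Y = X\,b + \varepsilon$$

where
◦ $Y$ is an $n$ dimensional vector
◦ $X$ is an $n \times (1+p)$ dimensional matrix
◦ $b$ is a $1+p$ dimensional vector
◦ $\varepsilon$ is an $n$ dimensional vector

Thus the model can be written as

$$\begin{pmatrix} Y_1\\ \vdots\\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{1,1} & \cdots & x_{p,1}\\ \vdots & \vdots & & \vdots\\ 1 & x_{1,n} & \cdots & x_{p,n} \end{pmatrix} \begin{pmatrix} b_0\\ \vdots\\ b_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \vdots\\ \varepsilon_n \end{pmatrix}$$

Least squares approach: Minimize

$$\|Y - \hat Y\|^2 = \sum_{i=1}^{n} (Y_i - \hat Y_i)^2$$

Results (with $I$ the $n \times n$ identity matrix):

$$\hat b = (X^T X)^{-1} X^T Y \sim N\big(b,\ \sigma^2 (X^T X)^{-1}\big)$$

$$\hat Y = X (X^T X)^{-1} X^T Y \sim N\big(X b,\ \sigma^2 X (X^T X)^{-1} X^T\big)$$

$$\hat\varepsilon = Y - \hat Y = \big(I - X (X^T X)^{-1} X^T\big)\,Y \sim N\big(0,\ \sigma^2 \big(I - X (X^T X)^{-1} X^T\big)\big)$$

$$\hat\sigma^2 = s_e^2 = \frac{\|Y - \hat Y\|^2}{n-p-1} = \frac{1}{n-p-1}\sum_{i=1}^{n} (Y_i - \hat Y_i)^2$$

Details: course in regression analysis (STAT 22200) or econometrics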
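To make the matrix formulas concrete, here is a minimal pure-Python sketch that solves the normal equations $(X^T X)\,b = X^T Y$ by Gauss-Jordan elimination. The data are made up for illustration (five observations, two predictors), and the helper names are hypothetical. In practice a numerical library would be used instead.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Made-up data: roughly y = 1 + 2*x1 + 3*x2 plus small deviations.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
dev = [0.1, -0.2, 0.05, 0.1, -0.05]
y = [1 + 2 * u + 3 * v + d for u, v, d in zip(x1, x2, dev)]

X = [[1.0, u, v] for u, v in zip(x1, x2)]   # first column: intercept
Xt = transpose(X)
XtX = matmul(Xt, X)
XtY = [sum(c * yi for c, yi in zip(col, y)) for col in Xt]

b_hat = solve(XtX, XtY)                      # (X^T X)^{-1} X^T Y
y_hat = [sum(c * v for c, v in zip(b_hat, row)) for row in X]
resid = [yi - fi for yi, fi in zip(y, y_hat)]

n, p = len(y), 2
sigma2_hat = sum(e * e for e in resid) / (n - p - 1)

print("b_hat =", [round(c, 4) for c in b_hat])
print("sigma2_hat =", round(sigma2_hat, 5))
```

A useful sanity check is that the residual vector is orthogonal to every column of $X$, which is exactly what the normal equations state.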

Inference for Multiple Regression

Example: Food expenditure and family income

Interpretation of regression coefficients

    . quietly regress food income
    . predict e_food1, residuals
    . quietly regress number income
    . predict e_num, residuals
    . regress e_food1 e_num
    ------------------------------------------------------------------------
     e_food1 |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
       e_num |   .7931055   .2375541     3.34   0.004    .2940229   1.292188
    ------------------------------------------------------------------------

    . quietly regress food number
    . predict e_food2, residuals
    . quietly regress income number
    . predict e_inc, residuals
    . regress e_food2 e_inc
    ------------------------------------------------------------------------
     e_food2 |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
       e_inc |   .1482117   .0159172     9.31   0.000     .114771   .1816525
    ------------------------------------------------------------------------

Result:
◦ $b_j$ measures the dependence of $Y$ on $x_j$ after removing the linear effects of all other predictors $x_k$, $k \ne j$.
◦ $b_j = 0$ if $x_j$ does not provide information for the prediction of $Y$ additionally to the information given by the other predictor variables.
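The residual-on-residual interpretation can be verified numerically. The Python sketch below uses made-up data (not the 20 households from the slides) and hypothetical helper names. It compares the slope from regressing residuals on residuals with the multiple-regression coefficient, computed here from the closed-form solution for exactly two predictors.

```python
def mean(v):
    return sum(v) / len(v)

def simple_fit(y, x):
    """Least-squares intercept and slope of y on a single predictor x."""
    xb, yb = mean(x), mean(y)
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return yb - slope * xb, slope

def residuals(y, x):
    a, b = simple_fit(y, x)
    return [yi - a - b * xi for xi, yi in zip(x, y)]

def two_predictor_slopes(y, x1, x2):
    """Closed-form multiple-regression slopes for exactly two predictors."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Made-up illustration data.
income = [20, 35, 50, 65, 80, 95, 110, 40]
number = [1, 2, 2, 3, 4, 3, 5, 6]
food = [3.1, 5.9, 7.8, 10.6, 13.9, 15.2, 19.4, 9.6]

b_income, b_number = two_predictor_slopes(food, income, number)

# Remove the linear effect of "number" from both food and income,
# then regress residuals on residuals.
_, partial_income = simple_fit(residuals(food, number),
                               residuals(income, number))
# Same idea with the roles of the two predictors exchanged.
_, partial_number = simple_fit(residuals(food, income),
                               residuals(number, income))

print(round(b_income, 6), round(partial_income, 6))
print(round(b_number, 6), round(partial_number, 6))
```

The two numbers on each printed line agree, mirroring how the STATA coefficients .1482117 and .7931055 reappear in the residual regressions. The standard errors differ slightly because the residual regression uses different degrees of freedom.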

Multiple Regression

Example: Heart catheterization

Description: A Teflon tube (catheter) 3 mm in diameter is passed into a major vein or artery at the femoral region and pushed up into the heart to obtain information about the heart's physiology and functional ability. The length of the catheter is typically determined by a physician's educated guess.

Data:
◦ Study with 12 children with congenital heart defects
◦ Exact required catheter length was measured using a fluoroscope
◦ Patient's height and weight were recorded

Question: How accurately can catheter length be determined by height and weight?

[Figure: scatter plots of Distance (cm) against Height (in) and against Weight (lb)]

Multiple Regression

Example: Heart catheterization (contd)

Regression model:

$$Y = b_0 + b_1 x_1 + b_2 x_2 + \varepsilon$$

where
◦ $Y$ - distance to pulmonary artery
◦ $x_1$ - height
◦ $x_2$ - weight

STATA regression output:

    . regress distance height weight

          Source |       SS       df       MS              Number of obs =      12
    -------------+------------------------------           F(  2,     9) =   18.62
           Model |   578.81613     2  289.408065           Prob > F      =  0.0006
        Residual |  139.913037     9   15.545893           R-squared     =  0.8053
    -------------+------------------------------           Adj R-squared =  0.7621
           Total |  718.729167    11  65.3390152           Root MSE      =  3.9428

    ------------------------------------------------------------------------------
        distance |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          height |   .1963566   .3605845     0.54   0.599    -.6193422    1.012056
          weight |   .1908278    .165164     1.16   0.278    -.1827991    .5644547
           _cons |    21.0084   8.751156     2.40   0.040     1.211907    40.80489
    ------------------------------------------------------------------------------

Note:
◦ Neither height nor weight seems to be significant for predicting the distance to the pulmonary artery.
◦ The regression on both variables explains 80% of the variation of the response (length of catheter).

Multiple Regression

Example: Heart catheterization (contd)

Consider predicting the length by height alone and by weight alone:

    . regress distance height
                                                           R-squared     =  0.7765
    ------------------------------------------------------------------------------
        distance |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          height |   .5967612   .1012558     5.89   0.000     .3711492    .8223732
           _cons |   12.12405   4.247174     2.85   0.017     2.660752    21.58734
    ------------------------------------------------------------------------------

    . regress distance weight
                                                           R-squared     =  0.7989
    ------------------------------------------------------------------------------
        distance |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |   .2772687   .0439881     6.30   0.000     .1792571    .3752804
           _cons |   25.63746   2.004207    12.79   0.000     21.17181    30.10311
    ------------------------------------------------------------------------------

Note:
◦ In a simple regression of Y on either height or weight, the explanatory variable is highly significant for predicting Y.
◦ In a multiple regression of Y on height and weight, the coefficients for both height and weight are not significantly different from zero.

Problem: Explanatory variables are highly linearly dependent (collinear)

[Figure: scatter plot of Weight (lb) against Height (in)]

Analysis of Variance

Decomposition of variation:
◦ $\text{SS Total} = \sum_i (Y_i - \bar Y)^2$ - total variation
◦ $\text{SS Residual} = \sum_i (Y_i - \hat Y_i)^2$ - variation remaining around the regression model
◦ $\text{SS Model} = \text{SS Total} - \text{SS Residual} = \sum_i (\hat Y_i - \bar Y)^2$ - variation explained by regression

Coefficient of determination: The ratio

$$R^2 = \frac{\text{SS Model}}{\text{SS Total}}$$

indicates how well the regression model predicts the response. $R^2$ is also the squared multiple correlation coefficient; in a simple linear regression we have $R^2 = \rho_{XY}^2$.

Example: Heart catheterization

          Source |       SS       df       MS              Number of obs =      12
    -------------+------------------------------           F(  2,     9) =   18.62
           Model |   578.81613     2  289.408065           Prob > F      =  0.0006
        Residual |  139.913037     9   15.545893           R-squared     =  0.8053
    -------------+------------------------------           Adj R-squared =  0.7621
           Total |  718.729167    11  65.3390152           Root MSE      =  3.9428

The coefficient of determination for these data is

$$R^2 = \frac{578.82}{718.73} = 0.81.$$

Regression on height and weight explains 81% of the variation of distance.

Analysis of Variance

Question: Is improvement in prediction (decrease in variation) significant?

Our null hypothesis is that none of the explanatory variables helps to predict the response, that is,

$$H_0: b_1 = \dots = b_p = 0 \quad\text{versus}\quad H_a: b_j \ne 0 \text{ for at least one } j \in \{1, \dots, p\}.$$

Under the null hypothesis $H_0$ the F statistic

$$F = \frac{n-p-1}{p}\cdot\frac{\text{SS Total} - \text{SS Residual}}{\text{SS Residual}} = \frac{n-p-1}{p}\cdot\frac{\text{SS Model}}{\text{SS Residual}}$$

is F distributed with $p$ and $n-p-1$ degrees of freedom.

The null hypothesis $H_0$ is rejected at level $\alpha$ if $F > F_{p,n-p-1,\alpha}$.

Example: Heart catheterization

          Source |       SS       df       MS              Number of obs =      12
    -------------+------------------------------           F(  2,     9) =   18.62
           Model |   578.81613     2  289.408065           Prob > F      =  0.0006
        Residual |  139.913037     9   15.545893           R-squared     =  0.8053
    -------------+------------------------------           Adj R-squared =  0.7621
           Total |  718.729167    11  65.3390152           Root MSE      =  3.9428

The value of the F statistic is

$$F = \frac{9}{2}\cdot\frac{578.82}{139.91} = 18.62.$$

The critical value for rejecting $H_0: b_1 = b_2 = 0$ is $F_{2,9,0.05} = 4.26$. Thus the null hypothesis $H_0$ that both coefficients $b_1$ and $b_2$ are zero is rejected at significance level $\alpha = 0.05$.
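The summary numbers in the right-hand column of the STATA output follow directly from the sums of squares. The short Python sketch below recomputes them for the heart catheterization example. The sums of squares are copied from the output, the critical value 4.26 is the one quoted above, and the variable names are chosen here for illustration.

```python
# Sums of squares from "regress distance height weight".
n, p = 12, 2
ss_model = 578.81613
ss_resid = 139.913037
ss_total = 718.729167

r2 = ss_model / ss_total
adj_r2 = 1 - (ss_resid / (n - p - 1)) / (ss_total / (n - 1))
f_stat = (n - p - 1) / p * ss_model / ss_resid
root_mse = (ss_resid / (n - p - 1)) ** 0.5

f_crit = 4.26  # F_{2,9,0.05} as quoted on the slide

print("R-squared     = %.4f" % r2)
print("Adj R-squared = %.4f" % adj_r2)
print("F(2, 9)       = %.2f" % f_stat)
print("Root MSE      = %.4f" % root_mse)
print("reject H0:", f_stat > f_crit)
```

The adjusted R² divides each sum of squares by its degrees of freedom, which is why it is smaller than R² and penalizes additional predictors.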

Comparing Models

Example: Cobb-Douglas production function

$$Y = t \cdot K^a \cdot L^b \cdot M^c$$

where
◦ $Y$ - output
◦ $K$ - capital
◦ $L$ - labour
◦ $M$ - materials

Regression model:

$$\log Y = \log t + a \log K + b \log L + c \log M$$

[Figure: three scatter plots of Y against K, L, and M]

Comparing Models

Example: Cobb-Douglas production function (contd)

Regression model M0 for Cobb-Douglas function:

$$\log Y = \log t + a \log K + b \log L + c \log M$$

    . regress LY LK LM LL

      Source |       SS       df       MS              Number of obs =      25
    ---------+------------------------------           F(  3,    21) =  138.98
       Model |  1.35136742     3  .450455808           Prob > F      =  0.0000
    Residual |  .068065609    21  .003241219           R-squared     =  0.9520
    ---------+------------------------------           Adj R-squared =  0.9452
       Total |  1.41943303    24  .059143043           Root MSE      =  .05693

    ------------------------------------------------------------------------
          LY |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
          LK |   .0718626   .1543912     0.47   0.646   -.2492114   .3929366
          LM |   .7072231   .3004146     2.35   0.028    .0824768   1.331969
          LL |   .2117778   .4248755     0.50   0.623   -.6717991   1.095355
       _cons |   .0347117   .0374354     0.93   0.364   -.0431395   .1125629
    ------------------------------------------------------------------------

Two variables, log K and log L, do not improve prediction of log Y.

Alternative model M1:

$$\log Y = \log t + c \log M$$

    . regress LY LM

      Source |       SS       df       MS              Number of obs =      25
    ---------+------------------------------           F(  1,    23) =  445.69
       Model |  1.34977753     1  1.34977753           Prob > F      =  0.0000
    Residual |  .069655501    23    .0030285           R-squared     =  0.9509
    ---------+------------------------------           Adj R-squared =  0.9488
       Total |  1.41943303    24  .059143043           Root MSE      =  .05503

    ------------------------------------------------------------------------
          LY |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
          LM |   .9086794   .0430421    21.11   0.000      .81964   .9977188
       _cons |   .0512244   .0189767     2.70   0.013     .011968   .0904808
    ------------------------------------------------------------------------

Question: Is model M0 significantly better than model M1?

Comparing Models

Consider the multiple regression model with $p$ explanatory variables

$$Y_i = b_0 + b_1 x_{1,i} + \dots + b_p x_{p,i} + \varepsilon_i .$$

Problem: Test the null hypothesis

$H_0$: $q$ specific explanatory variables all have zero coefficients

versus

$H_a$: any of these $q$ explanatory variables has a nonzero coefficient.

Solution:
◦ Regress $Y$ on all $p$ explanatory variables and read $\text{SS}^{(1)}_{\text{Residual}}$ from the output.
◦ Regress $Y$ on just the $p - q$ explanatory variables that remain after you remove the $q$ variables from the model. Read $\text{SS}^{(2)}_{\text{Residual}}$ from the output.
◦ The test statistic is

$$F = \frac{n-p-1}{q}\cdot\frac{\text{SS}^{(2)}_{\text{Residual}} - \text{SS}^{(1)}_{\text{Residual}}}{\text{SS}^{(1)}_{\text{Residual}}} .$$

Under the null hypothesis, $F$ is F distributed with $q$ and $n-p-1$ degrees of freedom.
◦ Reject if $F > F_{q,n-p-1,\alpha}$.

Comparing Models

Example: Cobb-Douglas production function

Comparison of models M0 and M1:
◦ M0: $\text{SS}^{(0)}_{\text{Residual}} = .06807$ and $n - p - 1 = 21$.
◦ M1: $\text{SS}^{(1)}_{\text{Residual}} = .06966$ and $q = 2$.
◦ $F = \dfrac{21}{2}\cdot\dfrac{.06966 - .06807}{.06807} = 0.2453$
◦ Since $F < F_{2,21,0.05} = 3.47$ we cannot reject $H_0: a = b = 0$.

Using STATA:

    . test LK LL

     ( 1)  LK = 0
     ( 2)  LL = 0

           F(  2,    21) =    0.25
                Prob > F =    0.7847

    . test LK LL _cons

     ( 1)  LK = 0
     ( 2)  LL = 0
     ( 3)  _cons = 0

           F(  3,    21) =    2.43
                Prob > F =    0.0934
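The model comparison reduces to one line of arithmetic on the two residual sums of squares. The following Python sketch wraps it in a small helper (the name `partial_f` is hypothetical) and applies it to the unrounded values from the two Cobb-Douglas regressions. The critical value 3.47 is the one quoted above.

```python
def partial_f(ss_res_full, ss_res_reduced, n, p, q):
    """F statistic for testing that q of the p coefficients are zero.

    ss_res_full:    SS Residual of the model with all p predictors
    ss_res_reduced: SS Residual after removing the q predictors
    """
    return (n - p - 1) / q * (ss_res_reduced - ss_res_full) / ss_res_full

# Cobb-Douglas example: M0 has p = 3 predictors, M1 drops q = 2 of them.
f_cd = partial_f(ss_res_full=0.068065609, ss_res_reduced=0.069655501,
                 n=25, p=3, q=2)
f_crit = 3.47  # F_{2,21,0.05} as quoted on the slide

print("F(2, 21) = %.4f" % f_cd)
print("reject H0:", f_cd > f_crit)
```

The result matches the `F( 2, 21) = 0.25` reported by `test LK LL`, so dropping log K and log L does not significantly worsen the fit.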

Case Study

Example: Headaches and pain reliever

◦ 24 patients with a common type of headache were treated with a new pain reliever
◦ Medication was given to each patient in one of four dosage levels: 2, 5, 7 or 10 grams
◦ Response variable: time until noticeable relief (in minutes)
◦ Other explanatory variables:
  ⋄ sex (0=female, 1=male)
  ⋄ blood pressure (0.25=low, 0.50=medium, 0.75=high)

Box plots

[Figure: box plots of Time (in minutes) for female and male patients at each dosage level]

Multiple Regression II, Mar 5, 2004

Case Study

    . regress time dose bp if sex==0
                                                          R-squared     =  0.8861
    --------------------------------------------------------------------------
        time |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    ---------+----------------------------------------------------------------
        dose |  -5.519608   .6608907    -8.35   0.000    -7.014646   -4.024569
          bp |         -5   9.439407    -0.53   0.609    -26.35342    16.35342
       _cons |   61.11765   6.458495     9.46   0.000     46.50752    75.72778
    --------------------------------------------------------------------------

    . predict YHf
    (option xb assumed; fitted values)

    . regress time dose bp if sex==1
                                                          R-squared     =  0.5765
    --------------------------------------------------------------------------
        time |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    ---------+----------------------------------------------------------------
        dose |  -3.343137   .9564492    -3.50   0.007    -5.506776   -1.179499
          bp |       -2.5   13.66083    -0.18   0.859    -33.40294    28.40294
       _cons |   51.39216   9.346814     5.50   0.000      30.2482    72.53612
    --------------------------------------------------------------------------

    . predict YHm
    (option xb assumed; fitted values)

    . twoway line YHf dose if bp==0.25||line YHf dose if bp==0.5||
    > line YHf dose if bp==0.75||scatter time dose if(sex==0), saving(a, replace)
    (file a.gph saved)

    . twoway line YHm dose if bp==0.25||line YHm dose if bp==0.5||
    > line YHm dose if bp==0.75||scatter time dose if(sex==1), saving(b, replace)
    (file b.gph saved)

    . graph combine a.gph b.gph

[Figure: two panels showing time and fitted values against dose, one for female and one for male patients]

Case Study

Model: Time = Dose + Sex + Sex · Dose + BP + ε

    . infile time dose sex bp using headache.dat
    (24 observations read)

    . generate sexdose=sex*dose

    . regress time dose sex sexdose bp

       Source |       SS       df       MS              Number of obs =      24
    ----------+------------------------------           F(  4,    19) =   16.78
        Model |  4387.65319     4   1096.9133           Prob > F      =  0.0000
     Residual |  1242.30515    19  65.3844814           R-squared     =  0.7793
    ----------+------------------------------           Adj R-squared =  0.7329
        Total |  5629.95833    23  244.780797           Root MSE      =  8.0861

    ---------------------------------------------------------------------------
         time |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    ----------+----------------------------------------------------------------
         dose |  -5.519608   .8006399    -6.89   0.000    -7.195367   -3.843849
          sex |   -8.47549   7.553222    -1.12   0.276    -24.28457    7.333585
      sexdose |   2.176471   1.132276     1.92   0.070      -.19341    4.546351
           bp |      -3.75   8.086067    -0.46   0.648    -20.67433    13.17433
        _cons |   60.49265   6.698634     9.03   0.000     46.47224    74.51305
    ---------------------------------------------------------------------------

    . predict YH
    (option xb assumed; fitted values)

    . predict E, residuals

[Figure: residual plot of residuals (in minutes) against Dose (in grams), and normal quantile plot of the residuals (Sample Quantiles against Theoretical Quantiles)]

Case Study

Model: Time = Dose + Dose² + Sex + Sex · Dose + BP + ε

    . drop YH E

    . generate dosesq=dose^2

    . regress time dose sex sexdose dosesq bp

       Source |       SS       df       MS              Number of obs =      24
    ----------+------------------------------           F(  5,    18) =   24.20
        Model |  4901.02819     5  980.205637           Prob > F      =  0.0000
     Residual |  728.930147    18  40.4961193           R-squared     =  0.8705
    ----------+------------------------------           Adj R-squared =  0.8346
        Total |  5629.95833    23  244.780797           Root MSE      =  6.3637

    ---------------------------------------------------------------------------
         time |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    ----------+----------------------------------------------------------------
         dose |  -12.91961   2.171775    -5.95   0.000    -17.48234   -8.356878
          sex |   -8.47549   5.944312    -1.43   0.171    -20.96403    4.013047
      sexdose |   2.176471   .8910901     2.44   0.025     .3043598    4.048581
       dosesq |   .6166667   .1731968     3.56   0.002     .2527937    .9805396
           bp |      -3.75   6.363656    -0.59   0.563    -17.11955    9.619545
        _cons |   77.45098   7.104701    10.90   0.000     62.52456     92.3774
    ---------------------------------------------------------------------------

    . predict E, residuals

    . test sex bp

     ( 1)  sex = 0
     ( 2)  bp = 0

           F(  2,    18) =    1.19
                Prob > F =    0.3270

[Figure: residual plot of residuals (in minutes) against Dose (in grams), and normal quantile plot of the residuals (Sample Quantiles against Theoretical Quantiles)]

Case Study

Model: Time = Dose + Dose² + Sex · Dose + ε

    . regress time dose sexdose dosesq

       Source |       SS       df       MS              Number of obs =      24
    ----------+------------------------------           F(  3,    20) =   38.81
        Model |  4804.63916     3  1601.54639           Prob > F      =  0.0000
     Residual |  825.319178    20  41.2659589           R-squared     =  0.8534
    ----------+------------------------------           Adj R-squared =  0.8314
        Total |  5629.95833    23  244.780797           Root MSE      =  6.4239

    ---------------------------------------------------------------------------
         time |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    ----------+----------------------------------------------------------------
         dose |  -12.34823   2.154675    -5.73   0.000     -16.8428   -7.853653
      sexdose |   1.033708   .3931338     2.63   0.016     .2136452    1.853771
       dosesq |   .6166667   .1748353     3.53   0.002     .2519667    .9813667
        _cons |   71.33824   5.667294    12.59   0.000     59.51647       83.16
    ---------------------------------------------------------------------------

    . twoway line YH dose if sex==0|| line YH dose if sex==1,
    > legend(label(1 "female") label(2 "male"))

[Figure: Fitted time (in minutes) against Dose (in grams) for female and male patients]
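The fitted curves in the final plot can be tabulated directly from the coefficients of the last regression. The Python sketch below does this for the four dosage levels used in the study. The coefficients are copied from the STATA output; the function name `fitted_time` is introduced here for illustration.

```python
# Coefficients from "regress time dose sexdose dosesq".
B0 = 71.33824         # _cons
B_DOSE = -12.34823
B_SEXDOSE = 1.033708
B_DOSESQ = 0.6166667

def fitted_time(dose, sex):
    """Fitted time to relief in minutes; sex is 0 (female) or 1 (male)."""
    return B0 + B_DOSE * dose + B_SEXDOSE * sex * dose + B_DOSESQ * dose ** 2

for dose in (2, 5, 7, 10):
    print("dose %2d g: female %6.2f  male %6.2f"
          % (dose, fitted_time(dose, 0), fitted_time(dose, 1)))
```

Because sex enters only through the interaction term, the two curves share the same intercept and the gap between them grows linearly with the dose.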

Comparing Several Means

Example: Comparison of laboratories

◦ Task: Measure amount of chlorpheniramine maleate in tablets
◦ Seven laboratories were asked to make 10 determinations of one tablet
◦ Study consistency between labs and variability of measurements

Box plot

[Figure: box plots of Amount of chlorpheniramine (in mg) for Lab 1 to Lab 7]

One-Way Analysis of Variance, Mar 8, 2004

Comparing Several Means

Example: Comparison of drugs

◦ Experimental study of drugs to relieve itching
◦ Five drugs were compared to a placebo and no drug
◦ Ten volunteer male subjects
◦ Each subject underwent one treatment per day (randomized order)
◦ Drug or placebo was given intravenously
◦ Itching was induced on forearms with cowage
◦ Subjects recorded duration of itching

Box plot

[Figure: box plots of Duration of itching (sec) for No drug, Placebo, Papaverine, Morphine, Aminophylline, Pentobarbital, and Tripelennamine]

Comparing Several Means

    . infile amount lab using labs.txt
    (70 observations read)

    . graph box amount, over(lab)

    . oneway amount lab, bonferroni tabulate

                |         Summary of amount
            lab |        Mean   Std. Dev.       Freq.
    ------------+------------------------------------
              1 |       4.062   .03259178          10
              2 |       3.997   .08969706          10
              3 |       4.003   .02311808          10
              4 |       3.920   .03333330          10
              5 |       3.957   .05716445          10
              6 |       3.955   .06704064          10
              7 |       3.998   .08482662          10
    ------------+------------------------------------
          Total |   3.9845715   .07184294          70

                            Analysis of Variance
        Source              SS         df      MS            F     Prob > F
    ------------------------------------------------------------------------
    Between groups       .1247371      6   .020789517      5.66     0.0001
     Within groups     .231400073     63   .003673017
    ------------------------------------------------------------------------
        Total          .356137173     69   .005161408

    Bartlett's test for equal variances:  chi2(6) =  24.3697  Prob>chi2 = 0.000

                        Comparison of amount by lab
                                (Bonferroni)
    Row Mean-|
    Col Mean |       1        2        3        4        5        6
    ---------+------------------------------------------------------
           2 |   -.065
             |   0.408
             |
           3 |   -.059     .006
             |   0.698    1.000
             |
           4 |   -.142    -.077    -.083
             |   0.000    0.127    0.068
             |
           5 |   -.105     -.04    -.046     .037
             |   0.005    1.000    1.000    1.000
             |
           6 |   -.107    -.042    -.048     .035    -.002
             |   0.004    1.000    1.000    1.000    1.000
             |
           7 |   -.064     .001    -.005     .078     .041     .043
             |   0.448    1.000    1.000    0.115    1.000    1.000

Comparing Several Means

    . oneway duration drug, bonferroni tabulate

                |        Summary of duration
           drug |        Mean   Std. Dev.       Freq.
    ------------+------------------------------------
              1 |       191.0   54.861442          10
              2 |       204.8   105.723750         10
              3 |       118.2   52.809511          10
              4 |       148.0   44.738748          10
              5 |       144.3   42.076782          10
              6 |       176.5   68.856130          10
              7 |       167.2   67.499465          10
    ------------+------------------------------------
          Total |   164.28571   68.463709          70

                            Analysis of Variance
        Source              SS         df      MS            F     Prob > F
    ------------------------------------------------------------------------
    Between groups     53012.8857      6   8835.48095      2.06     0.0708
     Within groups       270409.4     63    4292.2127
    ------------------------------------------------------------------------
        Total          323422.286     69    4687.2795

    Bartlett's test for equal variances:  chi2(6) =  11.3828  Prob>chi2 = 0.077

                       Comparison of duration by drug
                                (Bonferroni)
    Row Mean-|
    Col Mean |       1        2        3        4        5        6
    ---------+------------------------------------------------------
           2 |    13.8
             |   1.000
             |
           3 |   -72.8    -86.6
             |   0.328    0.092
             |
           4 |     -43    -56.8     29.8
             |   1.000    1.000    1.000
             |
           5 |   -46.7    -60.5     26.1     -3.7
             |   1.000    0.904    1.000    1.000
             |
           6 |   -14.5    -28.3     58.3     28.5     32.2
             |   1.000    1.000    1.000    1.000    1.000
             |
           7 |   -23.8    -37.6       49     19.2     22.9     -9.3
             |   1.000    1.000    1.000    1.000    1.000    1.000
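With equal group sizes, the one-way ANOVA table can be rebuilt from the group means and standard deviations alone. The Python sketch below does this for both the laboratory and the drug example, using the values in the `tabulate` summaries. The helper name `oneway_f` is hypothetical, and the formula as written assumes every group has the same number of observations.

```python
def oneway_f(means, sds, n_per_group):
    """Return (SS between, SS within, F) from group summaries.

    Assumes all groups have n_per_group observations."""
    k = len(means)
    grand = sum(means) / k
    ss_between = n_per_group * sum((m - grand) ** 2 for m in means)
    ss_within = (n_per_group - 1) * sum(s ** 2 for s in sds)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (k * (n_per_group - 1))
    return ss_between, ss_within, ms_between / ms_within

# Group summaries copied from the two STATA outputs (10 observations each).
lab_means = [4.062, 3.997, 4.003, 3.920, 3.957, 3.955, 3.998]
lab_sds = [0.03259178, 0.08969706, 0.02311808, 0.03333330,
           0.05716445, 0.06704064, 0.08482662]
drug_means = [191.0, 204.8, 118.2, 148.0, 144.3, 176.5, 167.2]
drug_sds = [54.861442, 105.723750, 52.809511, 44.738748,
            42.076782, 68.856130, 67.499465]

ssb_lab, ssw_lab, f_lab = oneway_f(lab_means, lab_sds, 10)
ssb_drug, ssw_drug, f_drug = oneway_f(drug_means, drug_sds, 10)

print("labs:  SSB=%.7f SSW=%.7f F=%.2f" % (ssb_lab, ssw_lab, f_lab))
print("drugs: SSB=%.4f SSW=%.1f F=%.2f" % (ssb_drug, ssw_drug, f_drug))
```

The laboratories differ clearly relative to their within-lab spread, while for the drugs the large within-group variability keeps the F statistic small, in line with the P-values 0.0001 and 0.0708 in the two tables.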