
Introduction

Robustness

Resistance

Transformation

Outlier

Chapter 3 A Closer Look at Assumptions
STAT 3022, School of Statistics, University of Minnesota

Spring 2013



Introduction
In Chapter 2, we discussed the mechanics of using t-procedures, namely t-tests and confidence intervals, to perform statistical inference. We base these procedures on certain assumptions:
- we have random samples, representative of the populations the data come from
- the samples come from Normal populations
- the samples are drawn independently
- in pooled two-sample settings, the variances are equal (σ1 = σ2 = σ)
In practice, these assumptions are usually not strictly met. When are these procedures still "appropriate"?


Case Study: Making it Rain
Data were collected in southern Florida between 1968 and 1972 to test the hypothesis that massive injection of silver iodide (AgI) into cumulus clouds can lead to increased rainfall. This process is called "cloud seeding". Over 52 days, pilots flew through a cloud every day, and a mechanism in the plane either seeded a target cloud or left it unseeded (as control). Treatment was randomly assigned, and the researchers were blind to whether each day was treatment or control.
Question: Did cloud seeding have an effect on rainfall? If so, how much?

Graphical Summaries
library("Sleuth2")
boxplot(Rainfall ~ Treatment, data=case0301, ylab='Rainfall (acre-feet)')
[Figure: side-by-side boxplots of Rainfall (acre-feet), 0 to 2500, for the Unseeded and Seeded groups.]

Graphical Summaries
par(mfrow=c(2,1), mar=c(4,4,1,0.5))
hist(case0301$Rainfall[case0301$Treatment=="Seeded"], breaks=8,
     col="gray", xlim=c(0,3000), main="Seeded - Rainfall", xlab="")
hist(case0301$Rainfall[case0301$Treatment=="Unseeded"], breaks=10,
     col="gray", xlim=c(0,3000), main="Unseeded - Rainfall", xlab="")
[Figure: stacked histograms of rainfall for the Seeded and Unseeded groups, both on a 0 to 3000 acre-feet scale.]

Numerical Summaries and Interpretations
Numerical summaries: do it yourself (follow the R code on page 42 of the Chapter 2 slides).
Graphical and numerical summaries indicate that rainfall tended to be greater on seeded days. However, there are problems with our necessary assumptions:
- both distributions are very skewed
- both distributions have outliers
- variability is much greater in the seeded group than in the unseeded group
Can we use our usual t-tools to analyze these data? How?

Can we do this?
> t.test(Rainfall ~ Treatment, alternative="two.sided",
+        var.equal=TRUE, data=case0301)

        Two Sample t-test

data:  Rainfall by Treatment
t = -1.9982, df = 50, p-value = 0.05114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -556.224179    1.431851
sample estimates:
mean in group Unseeded   mean in group Seeded
              164.5885               441.9846

How much did the violations of our assumptions affect these results?

Robustness
The t-tools may be used, to a certain degree, even when assumptions are violated, because the t-tools are robust.
Robustness: A statistical procedure is robust to departures from a particular assumption if it is valid even when the assumption is not met.

Type 1: Robustness Against Departures from Normality
Recall that the Central Limit Theorem (CLT) states that, for large samples, sample averages have approximately Normal sampling distributions, regardless of the shape of the population distribution. As long as samples are "large enough", the t-ratio will follow an approximate t-distribution even if the data are non-Normal.

Type 1: Robustness Against Departures from Normality
Effects of Skewness
- If two populations have the same standard deviations and approximately the same shapes, and if n1 ≈ n2, then the validity of the t-tools is affected very little by skewness.
- If two populations have the same standard deviations and approximately the same shapes, but n1 ≠ n2, then the validity of the t-tools is affected substantially by skewness. Larger sample sizes diminish this effect.
- If skewness in the two populations differs considerably, the t-tools can be very misleading with small and moderate sample sizes.
See Display 3.4 in the textbook for simulation results.
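Beyond the textbook display, these effects can be checked directly by simulation. A minimal sketch (the helper name reject.rate is ours, not from the textbook): draw both samples from the same skewed (lognormal) population, so the null hypothesis of equal means is true, and count how often the pooled t-test falsely rejects at the 5% level.

```r
# Estimate the actual Type I error rate of the pooled two-sample t-test
# when both populations are skewed, for equal vs. unequal sample sizes.
set.seed(1)
reject.rate <- function(n1, n2, reps = 4000) {
  p <- replicate(reps, t.test(rlnorm(n1), rlnorm(n2),
                              var.equal = TRUE)$p.value)
  mean(p < 0.05)   # proportion of false rejections at the 5% level
}
reject.rate(30, 30)  # close to the nominal 0.05
reject.rate(30, 5)   # typically drifts farther from 0.05
```

Varying the skewness (e.g. rlnorm vs. rexp) and the sample sizes reproduces the pattern described in the bullets above.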

Type 2: Robustness Against Differing Standard Deviations
When we cannot assume σ1 = σ2, more serious problems may arise:
- sp no longer estimates any parameter
- SE(x̄1 − x̄2) no longer estimates the standard deviation of the difference between averages
- the t-ratio no longer follows a t-distribution
What can we do?
- If n1 ≈ n2, the t-tools remain fairly valid even when σ1 ≠ σ2.
- When n1 and n2 are very different, we need the ratio σ1/σ2 to be between 1/2 and 2 to have reliable results.
- Use the unequal-variance (Welch) version of the test:
> t.test(x1, x2, alternative = 'two.sided', var.equal = FALSE)
See Display 3.5 in the textbook for simulation results.

Type 3: Robustness Against Departures from Independence
There are two types of dependence (i.e., lack of independence) that commonly arise:
1. A cluster effect occurs when the data have been collected in subgroups. Observations in the same subgroup tend to be more similar in their responses than observations in different subgroups.
2. A serial effect occurs when measurements are taken over time and observations close together in time tend to be more similar (or more different) than observations collected at distant time points.
When the assumption of independence is violated, the standard error becomes very inaccurate, so the t-tools are usually not recommended in such cases.
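The damage a serial effect can do is easy to demonstrate by simulation. A minimal sketch (not from the slides): generate two unrelated AR(1) series, whose means are truly equal, and see how often the pooled t-test rejects.

```r
# Serially correlated (AR(1)) data violate independence; the usual
# standard error is then too small, and the pooled t-test rejects a
# true null hypothesis far more often than the nominal 5%.
set.seed(2)
p.serial <- replicate(2000, {
  x <- as.numeric(arima.sim(list(ar = 0.7), n = 30))  # mean-zero AR(1)
  y <- as.numeric(arima.sim(list(ar = 0.7), n = 30))  # independent of x
  t.test(x, y, var.equal = TRUE)$p.value
})
mean(p.serial < 0.05)   # well above the nominal 0.05
```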

Resistance and Outliers
An outlier is an observation judged to be far from its group average. Whether or not we should simply remove such observations depends on how resistant our tools are to changes in the data.
Resistance: A statistical procedure is resistant if it does not change very much when a small part of the data changes, perhaps drastically.
Question: Can you tell the difference between "Robustness" and "Resistance"?

Example of Outlier
[Figure: scatterplots illustrating an outlying observation far from the bulk of the data.]

Example of Resistance
Consider a hypothetical sample: 10, 20, 30, 50, 70.
The sample mean is 36, and the sample median is 30.
Now consider the sample: 10, 20, 30, 50, 700.
What happens to the sample mean? What about the sample median?
The sample median is resistant to any change in a single observation, while the sample mean is not.
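The comparison above takes a few lines to verify in R:

```r
# Mean vs. median when one observation changes drastically.
x <- c(10, 20, 30, 50, 70)
y <- c(10, 20, 30, 50, 700)   # one value changed drastically
mean(x)    # 36
median(x)  # 30
mean(y)    # 162: the mean is dragged far upward
median(y)  # 30: the median does not move
```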

Resistance of t-Tools
Since the t-tools are based on the mean, they are not resistant: a small portion of the data can have a major influence on the results. One or two outliers can affect a 95% CI or change a p-value enough to alter a conclusion.
Solution: When you have an outlier, it is good practice to run your analysis with and without the outlier in the data set. Compare your results to see how influential the outlier in question is.
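This with-and-without strategy can be sketched as follows (hypothetical data, not the rainfall study):

```r
# Rerun the same analysis excluding a suspected outlier and compare.
x <- c(12, 15, 14, 18, 16, 13, 90)     # 90 is the suspect
ci.with    <- t.test(x)$conf.int       # CI for the mean, all data
ci.without <- t.test(x[x != 90])$conf.int
ci.with     # wide interval, pulled toward the outlier
ci.without  # much narrower; compare the two before concluding
```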

Practical Strategies for the Two-Sample Problem
Our task is to size up actual conditions, using available data, and evaluate the appropriateness of the t-tools:
1. Think about possible cluster and serial effects.
2. Evaluate the suitability of the t-tools by examining graphical displays (side-by-side histograms or box plots).
3. Consider alternatives:
   a. transform the data (Section 3.5) to see if the transformed data look "nicer"
   b. use alternative tools that do not require model assumptions (Chapter 4)

Transformations of Data
For positive data, the most useful transformation is the logarithm (log), particularly the natural (base e) logarithm (e = 2.71828...).
log(1) = 0
log(e^x) = x
[Figure: plot of the log function, log(x), for x from 0 to 10.]
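The two facts above are easy to confirm in R, whose log() is the natural logarithm by default:

```r
# R's log() uses base e unless another base is given.
log(1)       # 0
log(exp(3))  # 3
exp(1)       # 2.718282, the base e
```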

When Do We Use a Log Transformation
In a set of data:
- if the ratio max/min > 10, a log transformation might be useful
- if the samples are skewed, with the group with the larger average having a greater spread, a log transformation could be a good choice
[Figure: boxplots of the same data before (y) and after (log(y)) a log transformation.]

Cloud Seeding - Transformation
Recall both groups are skewed, with the "seeded" days having a larger average and a greater spread.

> max(case0301$Rainfall[case0301$Treatment=="Seeded"])/
+   min(case0301$Rainfall[case0301$Treatment=="Seeded"])
[1] 669.6586
> max(case0301$Rainfall[case0301$Treatment=="Unseeded"])/
+   min(case0301$Rainfall[case0301$Treatment=="Unseeded"])
[1] 1202.6
> case0301$logRain <- with(case0301, log(Rainfall))
> head(case0301)
  Rainfall Treatment  logRain
1   1202.6  Unseeded 7.092241
2    830.1  Unseeded 6.721546
3    372.4  Unseeded 5.919969
4    345.5  Unseeded 5.844993
5    321.2  Unseeded 5.772064
6    244.3  Unseeded 5.498397

[Figure: boxplots of Rainfall before the transformation and of log(Rainfall) after it, for the Unseeded and Seeded groups.]

Two-Sample t-Analysis
Before:
> t.test(Rainfall ~ Treatment, alternative="two.sided",
+        var.equal=TRUE, data=case0301)

        Two Sample t-test

data:  Rainfall by Treatment
t = -1.9982, df = 50, p-value = 0.05114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -556.224179    1.431851
sample estimates:
mean in group Unseeded   mean in group Seeded
              164.5885               441.9846

After:
> t.test(logRain ~ Treatment, data=case0301,
+        alternative="less", var.equal=TRUE)

        Two Sample t-test

data:  logRain by Treatment
t = -2.5444, df = 50, p-value = 0.007041
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -0.3904045
sample estimates:
mean in group Unseeded   mean in group Seeded
              3.990406               5.134187

There is convincing evidence that seeding increased rainfall.

Multiplicative Treatment Effect
Definition: Suppose Z = log Y. It is estimated that the response of an experimental unit to treatment 2 will be e^(Z̄2 − Z̄1) times as large as its response to treatment 1 (where Z̄1 is the average of log(Y1)).

> m1 <- mean(case0301$logRain[case0301$Treatment=='Seeded'])
> m2 <- mean(case0301$logRain[case0301$Treatment=='Unseeded'])
> (diffmeans <- m1 - m2)
[1] 1.143781
> (est.mult.effect <- exp(diffmeans))
[1] 3.138614

We interpret this in the following way: "The volume of rainfall produced by a seeded cloud is estimated to be 3.14 times as large as the volume that would have been produced in the absence of seeding."

Confidence Interval
> (test <- t.test(logRain ~ Treatment, data=case0301,
+                 var.equal=TRUE))

        Two Sample t-test

data:  logRain by Treatment
t = -2.5444, df = 50, p-value = 0.01408
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.0466973 -0.2408651
sample estimates:
mean in group Unseeded   mean in group Seeded
              3.990406               5.134187

> test$conf.int
[1] -2.0466973 -0.2408651
attr(,"conf.level")
[1] 0.95
> exp(test$conf.int)
[1] 0.1291608 0.7859476
attr(,"conf.level")
[1] 0.95

A 95% confidence interval for the multiplicative effect (unseeded relative to seeded) is 0.129 to 0.786 times.

A Strategy for Dealing with Outliers
If an outlying observation resulted from measurement error or contamination from another population:
- if the right value is known, correct the observation
- if not, leave the observation out of the analysis
Often there is no way to know what caused the outlier(s). Two tools exist:
- employ a resistant statistical tool (Chapter 4)
- adopt a careful strategy (see Display 3.6 in the text): perform the analysis with and without the suspected outliers. If both analyses give the same answer, only report results INCLUDING the suspected outliers. If not, report both results.

Removing Outliers and Other Data Points
> library(Sleuth2); ex0327[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
16 South_Africa 68.2    NaN Industrialized
17       Sweden 74.7   5596 Industrialized
> range(ex0327$Income, na.rm=TRUE)
[1]  110 5596
> data <- ex0327; data[16, 'Income'] <- 10000  # set it as an outlier
> data[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
16 South_Africa 68.2  10000 Industrialized
17       Sweden 74.7   5596 Industrialized
> d1 <- subset(data, Income < 8000); d1[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
17       Sweden 74.7   5596 Industrialized
18  Switzerland 72.1   2963 Industrialized
> ### dealing with Missing data ###
> (cc <- complete.cases(ex0327))
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[16] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> data2 <- ex0327[cc, ]; data2[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
17       Sweden 74.7   5596 Industrialized
18  Switzerland 72.1   2963 Industrialized

Q: How many conservative economists does it take to change a light bulb?

A: None. They're all waiting for the unseen hand of the market to correct the lighting disequilibrium.