
Lecture 1: Introduction

September 17, 2007

Outline
Syllabus and Course Outline
How do we analyze data?
  Enumeration, Summary, and Comparison
  Inference
Statistical Inference
  The Role of Probability
  Reversing the problem
Prediction, Explanation, and the Role of Models
  Prediction versus Explanation
  The Role of Models

Enumeration, Summary, and Comparison Inference

Figure: Diagram from Efron 1982. “Maximum Likelihood and Decision Theory.” The Annals of Statistics. 10: 340-356. (Labels in the diagram: Inference, Estimation, Hypothesis Testing, Summary, Comparison, Enumeration.)


Enumeration
Data from Fish, M. Steven. 2002. “Islam and Authoritarianism.” World Politics. 55: 4-37.

            1         2         3         4         5         6         ...  156       157
Democracy   1.100000  4.100000  2.150000  1.900000  5.650000  3.950000  ...  4.200000  3.000000
Income      2.250420  2.925312  3.214314  2.824126  3.762078  3.187803  ...  2.653213  2.848805
Muslim      1         1         1         0         0         0         ...  0         0
OPEC        0         0         1         0         0         0         ...  0         0


Summary

            Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
Democracy   1.000    2.550    4.075    4.102    5.675    7.000
Income      2.000    2.660    3.178    3.220    3.649    4.662
Muslim      0.0000   0.0000   0.0000   0.3013   1.0000   1.0000
OPEC        0.00000  0.00000  0.00000  0.07051  0.00000  1.00000


Summary


Comparison

Inference

The observed data set (sample) is interesting, but we may be more interested in a larger data set that we haven’t observed (population). For example...
- a large population of individuals
- a conceptually infinite data set (i.e. the process by which these data were generated)
- counterfactual values for the variables


Estimation and Testing
Estimation – Would the summary be accurate in the "larger data set"? For example, is 1.669 (the slope of the line in the first plot) close to the slope of the line in some larger population of countries?

Testing – Would the comparison be accurate in the "larger data set"? For example, the slope for Muslim countries (Inc v Dem) is not equal to the slope for non-Muslim countries (Inc v Dem). Is this difference due to something other than chance variation? Would the difference still be there in the larger (maybe conceptually infinite) group of countries?

The Role of Probability Reversing the problem

In order to make inference about a population/process, we often need a model.

A deterministic model will not allow us to account for observations that do not fit. Probability models allow us to deal with "noise" in the data.

Loosely speaking...
- Probability allows us to reason from populations/processes to samples.
- Statistical inference is the practice of reasoning from samples to populations/processes.


Coin Flipping Example
Suppose we have a fair coin that we plan on flipping 10 times. “Fair” means that the probability of getting “H” on any given flip is 1/2. This P(H) is the parameter of the population/process.

We will usually represent population/process parameters with Greek letters (e.g. θ ≡ P(H)).

Given θ we can answer questions like the following:

Q: What is the probability of seeing 4 or fewer heads?
A: approximately 0.377.

Q: What is the probability of seeing the sequence {T, H, H, T, T, T, H, T, H, T}?
A: approximately 0.0009766.
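
The two answers above can be checked directly. The sketch below assumes independent flips with θ = 1/2, so the number of heads in 10 flips is binomial; the function and variable names are illustrative and not part of the lecture.

```python
from math import comb

def binomial_pmf(k, n, theta):
    """Probability of exactly k heads in n independent flips with P(H) = theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

n_flips = 10
theta = 0.5  # fair coin, as assumed in the example

# P(4 or fewer heads) = sum of the binomial probabilities for k = 0, ..., 4
p_at_most_4 = sum(binomial_pmf(k, n_flips, theta) for k in range(5))

# Probability of one specific ordered sequence: multiply the per-flip probabilities
sequence = "THHTTTHTHT"
p_sequence = 1.0
for flip in sequence:
    p_sequence *= theta if flip == "H" else 1 - theta

print(round(p_at_most_4, 3))  # 0.377
print(round(p_sequence, 7))   # 0.0009766
```

With a fair coin every specific sequence of 10 flips has the same probability, (1/2)^10 = 1/1024, regardless of how many heads it contains.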

“In solving a problem of this sort, the grand thing is to be able to reason backward... Most people, if you describe a train of events to them, will tell you what the result would be. They can put those events together in their minds, and argue from them that something will come to pass. There are few people, however, who, if you told them a result, would be able to evolve from their own inner consciousness what the steps were which led up to that result. This power is what I mean when I talk of reasoning backward...” – Sherlock Holmes


[Demonstration] Let θ be the fraction of red cards in the deck: θ = (# of red cards) / (total # of cards).

What is our best guess for θ, and how good is our guess? (estimation)

Is it a regular deck? (i.e. does θ = 1/2?) (testing)

Our ability to answer these questions depends on our assumptions about the sampling process.


2004 Ohio Example

Ohio Vote Counts (from 2004 FEC report) Bush: 2,859,768 (≈ 51%) Kerry: 2,741,167

Exit Poll Counts (Freeman, S.F. 2004) Bush: 941 (≈ 48%) Kerry: 1,022
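
As a rough sketch of how far apart these two sets of counts are, the code below computes Bush's share of the two-candidate total in each and the standard error the poll share would have if the exit poll were a simple random sample of voters. That sampling assumption is an illustration only (real exit polls are not simple random samples), and the variable names are hypothetical.

```python
from math import sqrt

# Counts as reported on the slide (Bush and Kerry only)
bush_votes, kerry_votes = 2_859_768, 2_741_167
bush_poll, kerry_poll = 941, 1_022

# Population parameter: Bush's share of the two-candidate vote
theta = bush_votes / (bush_votes + kerry_votes)

# Sample estimate: Bush's share among the exit poll respondents
n_poll = bush_poll + kerry_poll
theta_hat = bush_poll / n_poll

# ASSUMPTION: the poll is a simple random sample, so the share has
# standard error sqrt(theta * (1 - theta) / n)
se_if_srs = sqrt(theta * (1 - theta) / n_poll)
z_score = (theta_hat - theta) / se_if_srs

print(f"vote share  = {theta:.4f}")
print(f"poll share  = {theta_hat:.4f} (n = {n_poll})")
print(f"SE if SRS   = {se_if_srs:.4f}")
print(f"z-score     = {z_score:.2f}")
```

Under that assumption the poll share sits a bit more than two and a half standard errors below the vote share.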


What if we knew the truth?
Population:

Population Index   Vote Choice
1                  Bush
2                  Kerry
3                  Bush
...                ...
5,600,935          Bush

Repeated Sampling:

Sample Index   Population Index   Vote Choice     Sample Index   Population Index   Vote Choice     ...
1              2,790,375          Kerry           1              3,548,192          Kerry           ...
2              47,893             Kerry           2              5,168,386          Bush            ...
...            ...                ...             ...            ...                ...             ...
1,963          3,983,486          Bush            1,963          1,926,017          Bush            ...


1,000 SRS “Polls” from Ohio Vote Counts (n=1,963)
Histogram of Simulated Polls
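
A simulation of this kind could be sketched as follows: treat the reported Ohio counts as the full population, draw 1,000 simple random samples of 1,963 voters without replacement, and record Bush's share in each. The seed, the names, and the convention of labeling the first 2,859,768 population indices as Bush voters are assumptions made for illustration; this is not the code behind the original histogram.

```python
import random
from statistics import mean, stdev

BUSH_VOTES, KERRY_VOTES = 2_859_768, 2_741_167
POPULATION_SIZE = BUSH_VOTES + KERRY_VOTES
POLL_SIZE = 1_963
N_POLLS = 1_000

def simulate_poll(rng):
    """Draw one simple random sample (without replacement) and return Bush's share."""
    # Labeling convention (an assumption): indices 0..BUSH_VOTES-1 are Bush voters
    drawn = rng.sample(range(POPULATION_SIZE), POLL_SIZE)
    return sum(index < BUSH_VOTES for index in drawn) / POLL_SIZE

rng = random.Random(2004)  # arbitrary seed, for reproducibility only
simulated_shares = [simulate_poll(rng) for _ in range(N_POLLS)]

# How often does a simulated poll come out as low as the actual exit poll?
observed_share = 941 / 1_963
share_as_low = sum(s <= observed_share for s in simulated_shares) / N_POLLS

print(f"mean of simulated shares  = {mean(simulated_shares):.4f}")
print(f"std. dev. of shares       = {stdev(simulated_shares):.4f}")
print(f"fraction <= observed poll = {share_as_low:.3f}")
```

The list `simulated_shares` is what a histogram of simulated polls would display: the sampling distribution of the poll share when the truth is known.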

Usually we don’t know the truth!
Sample Estimate of the Population:

Population Index   Vote Choice
1                  Bush
2                  Kerry
3                  Bush
...                ...
1,963              Bush

Resampling with Replacement:

Resample Index   Sample Index   Vote Choice     Resample Index   Sample Index   Vote Choice     ...
1                925            Kerry           1                447            Kerry           ...
2                396            Kerry           2                1,076          Bush            ...
...              ...            ...             ...              ...            ...             ...
1,963            842            Bush            1,963            447            Bush            ...
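
Resampling with replacement can be sketched in a few lines. Here the 1,963 exit poll responses stand in for the unknown population, and each resample draws 1,963 responses from them with replacement. The number of resamples (1,000), the seed, and the rough percentile interval are illustrative choices, not taken from the lecture.

```python
import random
from statistics import mean, stdev

# The observed sample: 941 Bush and 1,022 Kerry responses from the exit poll
sample = ["Bush"] * 941 + ["Kerry"] * 1_022
n = len(sample)

rng = random.Random(1963)  # arbitrary seed, for reproducibility only
bootstrap_shares = []
for _ in range(1_000):
    # Draw n responses with replacement, so some appear repeatedly and others not at all
    resample = rng.choices(sample, k=n)
    bootstrap_shares.append(resample.count("Bush") / n)

# Spread of the resampled shares estimates the sampling variability of the poll share
bootstrap_se = stdev(bootstrap_shares)

# Rough 95% percentile interval from the sorted resampled shares
ordered = sorted(bootstrap_shares)
interval = (ordered[24], ordered[974])

print(f"mean of resampled shares = {mean(bootstrap_shares):.4f}")
print(f"bootstrap SE             = {bootstrap_se:.4f}")
print(f"rough 95% interval       = ({interval[0]:.4f}, {interval[1]:.4f})")
```

The resampled shares center on the observed poll share rather than on the true vote share, which is the key difference from sampling repeatedly out of a known population.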


Estimation versus Testing
Null and Estimated Sampling Distributions

Prediction versus Explanation The Role of Models

The 1994 State Failure Task Force
State Failure:
- revolutionary wars
- genocide or politicide
- adverse or disruptive regime transition

In addition to information on state failures, the task force collected data on more than one thousand variables for 195 countries between 1955 and 1998.

Two Possible Questions:
- Using this data, can we predict state failure?
- Using this data, can we explain why states fail?

The Prediction Problem


The Training Set


The Validation Set


Explaining State Failure
- Does high infant mortality explain state failure?
- Does high population density explain state failure?
- Is classification enough?



The Role of Models
Speaking of functional form... This is primarily a course on linear regression, and therefore we will usually assume a linear relationship.

Q: Is linearity a reasonable modeling assumption?
A1: In some cases, the relationship may be close enough to linear that we will get reliable answers.
A2: We may be able to tell when the linear model is inadequate.
A3: We may be able to make small changes in order to fix things up.

We’ll be making lots of assumptions, and some of them can’t be tested with data!
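
One way to see what A2 might look like in practice is to fit a line and check whether the residuals still carry a pattern. The sketch below uses synthetic data (not data from the course), fits a least-squares line, and correlates the residuals with the squared, centered predictor; a correlation far from zero suggests curvature that the line misses. This particular check and all names in it are assumptions chosen for illustration, one diagnostic among many.

```python
import random
from math import sqrt
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return cov / sqrt(sum((ai - a_bar) ** 2 for ai in a) * sum((bi - b_bar) ** 2 for bi in b))

def curvature_signal(x, y):
    """Correlation between linear-fit residuals and the squared, centered predictor."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    x_bar = mean(x)
    return correlation(residuals, [(xi - x_bar) ** 2 for xi in x])

# SYNTHETIC data: one truly linear relationship, one truly quadratic
rng = random.Random(0)
x = [rng.uniform(0, 4) for _ in range(200)]
y_linear = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]
y_curved = [1 + 2 * xi**2 + rng.gauss(0, 1) for xi in x]

signal_linear = curvature_signal(x, y_linear)
signal_curved = curvature_signal(x, y_curved)
print(f"linear data: {signal_linear:.2f}")
print(f"curved data: {signal_curved:.2f}")
```

In the spirit of A3, a signal like the second one might be addressed by a small change such as adding a squared term, though whether that is appropriate depends on the problem.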

Mice and Tigers
“Since all models are wrong the scientist cannot obtain a ‘correct’ one by excessive elaboration. ... Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity. Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.” – George E. P. Box, 1976

“All models are wrong. Some are useful.” – George E. P. Box

Summary

- In this course, we will focus on estimation (point and interval) and testing.
- Predictive and explanatory models have different goals, and often utilize different statistical techniques. In this course, we focus on explanation.
- We will not get the model “right”, but we may be close enough to say something useful. In this course, we will learn when the linear model is (and is not) useful.

