
ENGRD 2700 Engineering Probability and Statistics Lecture 1: Introduction

David S. Matteson
School of Operations Research and Information Engineering Rhodes Hall, Cornell University Ithaca NY 14853 USA dm484@cornell.edu January 20, 2009

The Greatest Scope More Details EDA Describing sample data


1. Why you can't live without prob & stat

Basis for the scientific method, management science, intelligent empiricism.
Tidbits:
  Fact based decision making.
  Scientific decision making.
  "In God we trust; everyone else bring data."
Conclusion: Hunches can be wrong, intuition can be fooled, instincts may be immature or just wrong.


Positive societal contribution; ubiquity:


Drug testing: Drug companies are not allowed to market a drug until it is proven safe and effective via clinical trials. (Living link with history: SIR1 was part of the largest public health clinical trial ever, the Salk polio vaccine clinical trial.) The drug company employs statisticians to supervise the clinical trial; the FDA employs statisticians to review the drug company results.

Digital network traffic: Fundamentally different from POTS = plain old telephone traffic. Prize winning Bellcore study (1993) statistically analyzed data network traces and concluded they exhibit characteristics not compatible with telephone models. Probability models were then constructed to explain the anomalies.


Risk management: Governmental bodies set standards which control risk.

  Hydrology: 10,000 year flood. Build the dam so high that water will exceed the height only once per 10,000 years on average. (Small quantile estimation to the cognoscenti.)
  Finance: Value at risk. The bank's risk reserve needs to be high enough to cover a loss so big that it occurs with probability 1/10,000.
  Environment: A city is ruled out of compliance with EPA regulations if pollutant concentration exceeds a specified level more than 5% of the days in a year; that is, the standard is set so the probability of exceeding the standard is 5%.


Weather prediction: Chance of rain tomorrow is 53%. What does this mean?
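The risk standards above (flood height, value at risk, pollutant limit) each compare how often a level is exceeded with a target probability. A minimal Python sketch of the environmental rule follows. The daily readings are simulated, the limit of 2.5 is made up, and the function names are illustrative only; none of these come from the lecture.

```python
import random

def exceedance_fraction(readings, limit):
    """Fraction of readings strictly above the limit."""
    return sum(1 for r in readings if r > limit) / len(readings)

def in_compliance(readings, limit, allowed=0.05):
    """True if the limit is exceeded on at most `allowed` of the days."""
    return exceedance_fraction(readings, limit) <= allowed

# Hypothetical data: 365 simulated daily concentrations (arbitrary units).
rng = random.Random(2700)
daily = [rng.lognormvariate(0.0, 0.5) for _ in range(365)]
limit = 2.5  # hypothetical regulatory level

frac = exceedance_fraction(daily, limit)
print(f"exceeded on {frac:.1%} of days; in compliance: {in_compliance(daily, limit)}")
```

The same comparison with a target of 1/10,000 instead of 5% corresponds to the flood and value-at-risk examples, although estimating such small probabilities reliably needs far more data than one year of daily readings.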


But image problems persist:


Pick your favorite quote:
  "There are lies, damn lies and statistics." Sir Winston Churchill
  "There are three kinds of lies: Lies, Damn Lies, and Statistics." Benjamin Disraeli
  "You can use statistics to prove anything." Homer Simpson
  "He uses statistics as a drunken man uses a lamppost - for support rather than illumination." Andrew Lang
  "Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital." Aaron Levenstein
  "It is easy to lie with statistics, but it is easier to lie without them." Frederick Mosteller

Subjects perceived as dull. (Why?)
  Tell someone at a social gathering that you are studying statistics and watch their reaction. (Is the reaction the same if you say Engineering?)
  A statistician is someone who is good with numbers, but lacks the personality to be an accountant. (Or is it the other way round?)

Statisticians sometimes perceived as hired guns. (My statistician can beat up your statistician.) Statisticians are often used as expert witnesses in court. Government witnesses in regulatory hearings.

Lowest blow
Best selling book: How to Lie with Statistics


Figure 1: A best seller.


Figure 2: Blowup


2. What this course is about.


Probability: Build models of random phenomena.
  Random phenomena: Phenomena whose outcomes cannot be predicted in advance.
  Outcome: roll of a die.
  Outcome: tomorrow's stock price.
  Outcome: result of the upcoming presidential election.

Caution: "All models are wrong; some models are useful." George Box

Statistics: The science of organizing and summarizing data and using information in the data to draw conclusions; scientific extraction of information from data.
  Statisticians are consumers of probability.
  Statisticians fit parameters in probability models.
  Statisticians draw conclusions about a population based on the partial information contained in a sample.
  The manner in which the conclusions are drawn uses probability tools and reasoning. ("The sampling error in the estimate of p is 5%.")
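To make the remark about the sampling error in an estimate of p concrete, here is a hedged Python sketch. It assumes a hypothetical population of 100,000 voters with true p = 0.5, a simple random sample of n = 400, and the usual normal-approximation margin 1.96 * sqrt(p_hat * (1 - p_hat) / n). The slide states none of these; they are chosen only because they give a margin close to 5%.

```python
import math
import random

rng = random.Random(1)

# Hypothetical population: 100,000 voters coded 1 (yes) or 0 (no), true p = 0.5.
population = [1] * 50_000 + [0] * 50_000

n = 400
sample = rng.sample(population, n)   # sampling without replacement
p_hat = sum(sample) / n              # sample proportion, the estimate of p

# Approximate 95% margin of error from the normal approximation (an assumption;
# it ignores the small finite-population correction).
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"p_hat = {p_hat:.3f}, margin of error about {margin:.3f}")
```

With n = 400 the margin cannot exceed 1.96 * 0.5 / 20 = 0.049, which is where a figure of roughly 5% can come from.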


3. More Details



Population: a well-defined set of items under attention and discussion. (It is clear how to decide if an item is or is not in the population.)

Sample: a well-defined subset of the population that has been selected for study or measurement.

Observation: an individual measurement from a sample.

Examples:
  Population: all US universities. Sample: Ivy League universities.
  Population: all voters in the US. Sample: all voters with incomes over $200,000.
  Population: Yearly best times in seconds (minus 3 minutes) in the mile run over 121 years (ending in mid-80s). Sample: Yearly best times which were record times during this time period.


Figure 3: Time series plot of yearly best times in mile. (Vertical axis: mile; horizontal axis: Time.)

More on populations and samples:


Four stages in statistical analysis:
1. Precisely define the population and then formulate clear, answerable questions about the population.
2. Collect data to help answer the questions; requires a sampling scheme and experimental design.
3. Exploratory step: Describe and present data using the tools of graphics and descriptive statistics, such as numerical summaries of data: mean, median, variance, quantiles, . . . .
4. Confirmatory step; formal inference: Analyze the data to draw conclusions about the population.


Types of populations and sets:

1. Large but finite population: Difficult to take a census (i.e., sample the whole population). Thus, typically only a relatively small number of items are sampled.
   Population of the US. Actual census only every 10 years.
   Even in presidential elections, the sample consisting of eligible voters who vote is a relatively small percentage of the population. Only 53.9 percent of the voting-age population actually voted in 1980 when Ronald Reagan was elected.
   Pollsters predict outcomes based on much smaller samples, of the order of hundreds.
2. Small and finite population: can be sampled in its entirety.
   Average age for ENGRD 2700 students this semester.
   Average number of publications of ORIE professors.
3. Abstracted population represented by an infinite set, say an interval of real numbers (or worse).
   Population representing time to failure of a machine component; the population could be represented by the set of all nonnegative real numbers [0, ∞).
   Population representing the study of the point of maximum pollution concentration in a city; the population could be represented by a two dimensional set {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}.


Types of samples:

A simple random sample of size n: a sample of size n in which each subset of size n in the population has the same likelihood of being selected.
  Problem with the definition: "likelihood" is undefined. For populations from continuous infinite sets, the likelihood may be zero; e.g., the probability of drawing a sample of size 2 from [0, 1] and getting {1/4, 3/4} is 0.
Stratified random sample: The population consists of sub-populations. E.g.: Take a sample of size 700 from the Cornell student population by sampling 100 from each college. (Useful to predict outcome of union vote?)
Biased sample: The samples are clearly dependent and unrepresentative.
  Population: residents of Florida. Determine the average income of residents of Florida.
  Biased sample: NBA players living in Florida.
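The difference between a simple random sample and a stratified random sample can be sketched in Python as follows. The seven college names and their enrollments are invented for illustration; only the design (700 students in total, 100 per college in the stratified case) comes from the example above.

```python
import random

rng = random.Random(42)

# Hypothetical population: 7 colleges with made-up enrollments;
# each student is a (college, id) pair.
enrollment = {f"college_{k}": 1000 + 500 * k for k in range(1, 8)}
population = [(c, i) for c, size in enrollment.items() for i in range(size)]

# Simple random sample: every subset of size 700 is equally likely.
srs = rng.sample(population, 700)

# Stratified random sample: a simple random sample of 100 within each college.
stratified = []
for college in enrollment:
    members = [s for s in population if s[0] == college]
    stratified.extend(rng.sample(members, 100))

srs_counts = {c: sum(1 for s in srs if s[0] == c) for c in enrollment}
print("SRS counts per college:", srs_counts)
print("Stratified sample size:", len(stratified))
```

In the simple random sample the per-college counts vary and tend to follow enrollment; the stratified design fixes them at 100 each, so small colleges are over-represented relative to their size.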

More later.

Meanwhile, read the book. Carefully!


4. Need for descriptive statistics; exploratory data analysis.


Most (not all) populations coded by numbers.

  Population of US voters: Vote to re-elect Mayor = 1; vote to elect challenger = 0.
  Epidemiology: Infected = 0; not infected = 1.
  Failure times; population is all real numbers.
  Data networks: Observe packet counts per unit time.
  Finance: Observe stock index like DJ.

Exceptions:
  Makes of cars sold in the US {Buick Le Sabre, Volkswagen Passat, VW Golf, Honda Accord, Honda Civic, . . . }.
  Marketing: Brands of toothpaste {Colgate, Crest, Tom's, . . . }.

Conclusion:

Result of sampling often yields a set of numbers. Sometimes the set is large. We need to make sense of the set of numbers.

Pay tribute to computer


The computer makes handling big data sets tolerable; 30 years ago statistics was frequently done
  By hand (pencil & paper, abacus, slide rule) or with a mechanical calculator. (Hence elementary stat texts had examples consisting of data sets of 5 points.)
  Then: Electronic hand calculator. (Tedious to enter data.)
  Then: Computer.

Large data sets becoming increasingly common.


Walmart records and stores a record of each transaction. What to do with so much data? Emerging field of data mining.
Internet joke (but true): Question: What network does a research physicist use to ship his data from one facility to another? Answer: UPS; they ship the hard drive from one location to another by UPS.
In Internet studies, sniffers and other automatic data collection devices give arbitrarily large data sets.
In finance, there is high resolution data (recordings at very small time intervals) and even tick by tick data.

Example:

Trace is packet counts per 100 milliseconds = 1/10 second for Financial Company X's wide area network link including USA-UK traffic. Length of dataset = 288,009; 8 hours of collection from 9am to 5pm. Top plot too muddy; bottom represents subsets sized 20,000.


Conclusion: How do we make sense of large (and small) amounts of data?

First step: descriptive statistics and exploratory data analysis.

Descriptive statistics: organization and summarization of large amounts of data for the purposes of drawing conclusions. Use
  Graphics
  Summary statistics (mean, median, variance, ...)
Use of descriptive statistics is often a first step in an exploratory data analysis. Somewhat informal, pictorial.
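As a small illustration of the summary statistics just listed, the following Python sketch applies the standard library `statistics` module to the exam scores used in the stem-and-leaf illustration of Section 5.1. The choice of Python and of the sample variance (dividing by n - 1) is an assumption made here, not something the slides specify.

```python
import statistics

# Exam scores from the stem-and-leaf illustration in Section 5.1.
scores = [48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95]

mean = statistics.mean(scores)
median = statistics.median(scores)
variance = statistics.variance(scores)          # sample variance (divides by n - 1)
quartiles = statistics.quantiles(scores, n=4)   # default "exclusive" method

print(mean, median, round(variance, 2), quartiles)
```

Quantile conventions differ between packages, so quartiles computed elsewhere (for example in R or Minitab) may differ slightly from these.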

Formal inference:

Draw scientific inferences about the population from the data. More formal methods. For example we can formally test hypotheses. (Sometimes the hypotheses are suggested by the EDA.) Most famous clinical trial: Salk vaccine given to a sample of kids in the early 50s, with a control group receiving a placebo. The formal hypothesis was that the Salk vaccine was more effective than randomness.


5. Describing sample data

Use of graphs and other descriptive statistics.

5.1. Stem-and-leaf plot

Older method originated for small univariate data sets when analysis was often by hand. This is just a clever arrangement of the data values to reflect the shape of the distribution.
Advantage: simple, quick, easy to construct.
Disadvantage: a little primitive; doesn't capitalize on graphics capabilities of packages and computers.


Procedure: given data x1, . . . , xn where each xi consists of (at least) 2 digits.

1. Split each xi into a stem of leading digits and a leaf of the remaining digits.
2. List the stem values in the left hand margin column and to the right list the leaves corresponding to each stem, listed in the order they are encountered in the data set.


A simple illustration: Suppose student scores on an exam are 48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95. A stem-and-leaf plot is below:

  9 | 5
  8 | 0038
  7 | 03699
  6 | 379
  5 |
  4 | 8


Positive features:
  The entire data set can be read with ease from the display.
  Gives a clear indication of the shape of the distribution of data values.

R does this automatically with the following commands:

  > grades<-c( 48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95)
  > stem(grades)


  4 | 8
  5 |
  6 | 379
  7 | 03699
  8 | 0038
  9 | 5
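For readers who want to see the two-step procedure spelled out rather than delegated to a package, here is a minimal Python sketch. It assumes non-negative integers with at least two digits and a leaf unit of 1 (the last digit is the leaf); the function name is illustrative, and this is not how R's `stem` is implemented.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display rows for non-negative integers with at least 2 digits.

    Stem = all digits but the last; leaf = last digit (assumed leaf unit of 1).
    """
    leaves = defaultdict(list)
    for x in values:                       # leaves kept in the order encountered
        stem, leaf = divmod(x, 10)
        leaves[stem].append(str(leaf))
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):   # include empty stems
        rows.append(f"{stem} | {''.join(leaves.get(stem, []))}".rstrip())
    return rows

grades = [48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95]
print("\n".join(stem_and_leaf(grades)))
```

Because the scores are already sorted, the leaves within each stem come out in increasing order; unsorted input would list them in order of appearance, as step 2 of the procedure describes.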

Minitab output (check the drop down menu under GRAPH)

Stem-and-Leaf Display: C1

Stem-and-leaf of C1   N = 14
Leaf Unit = 1.0

   1    4  8
   1    5
   4    6  379
  (5)   7  03699
   5    8  0038
   1    9  5

Minitab output: extra column on the left. Features:
  The number in parentheses gives the number of observations on the line that contains the median (or the middle value).
  The 4 in the row above that gives the total number of observations in the first three rows, i.e. there were 4 scores below 70.
  The 1 above that indicates that there was one score in the 50s or below.
  The 5 below the median line indicates that there are 5 scores in the 80s and above.
  The 1 on the last line indicates one score of 90 or above.
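The extra left-hand column can be reproduced from the number of leaves in each row. The Python sketch below is one way to do so, not Minitab's actual implementation. It assumes the middle observation (or both middle observations, when N is even) falls in a single row; `depth_column` is a name introduced here for illustration.

```python
def depth_column(row_counts):
    """Depth labels for stem rows, given the number of leaves in each row."""
    n = sum(row_counts)
    lo, hi = (n + 1) // 2, (n + 2) // 2    # positions of the middle value(s)
    labels, before = [], 0
    for count in row_counts:
        through = before + count           # observations in this row and above
        if before < lo and hi <= through:  # this row contains the median
            labels.append(f"({count})")
        else:                              # count in from the nearer end
            labels.append(str(min(through, n - before)))
        before = through
    return labels

counts = [1, 0, 3, 5, 4, 1]   # leaves on stems 4, 5, 6, 7, 8, 9
print(depth_column(counts))
```

Rows above the median row count observations from the top of the display, and rows below it count from the bottom, which matches the description of the 4, 1, 5 and 1 given above.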

Note: The help file gives a detailed explanation. More extensive data sets, particularly those in which three or more digits vary, are dealt with in a variety of ways, but all are similar to the simple case shown above.
