Lecture slides from ENGRD 2700 Basic Engineering Probability and Statistics

David S. Matteson

School of Operations Research and Information Engineering
Rhodes Hall, Cornell University
Ithaca NY 14853 USA
dm484@cornell.edu
January 20, 2009

1.

Basis for the scientific method, management science, intelligent empiricism. Tidbits: Fact-based decision making. Scientific decision making. "In God we trust; everyone else bring data." Conclusion: Hunches can be wrong, intuition can be fooled, instincts may be immature or just wrong.

Drug testing: Drug companies are not allowed to market a drug until it is proven safe and effective via clinical trials. (Living link with history: SIR1 was part of the largest public health clinical trial ever, the Salk polio vaccine clinical trial.) The drug company employs statisticians to supervise the clinical trial; the FDA employs statisticians to review the drug company's results.

Digital network traffic: Fundamentally different from POTS = plain old telephone traffic. A prize-winning Bellcore study (1993) statistically analyzed data network traces and concluded that they exhibit characteristics not compatible with telephone models. Probability models were then constructed to explain the anomalies.

Hydrology: 10,000-year flood. Build the dam high enough that the water exceeds its height only about once per 10,000 years. (Small quantile estimation to the cognoscenti.)

Finance: Value at risk. The bank's risk reserve needs to be high enough to cover a loss so big that it occurs with probability 1/10,000.

Environment: A city is ruled out of compliance with EPA regulations if the pollutant concentration exceeds a specified level on more than 5% of the days in a year; that is, the standard is set so that the probability of exceeding it is 5%.
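The compliance rule in the Environment example can be sketched in a few lines of Python (the slides themselves use R). This is only an illustration: the function names, the limit of 100.0, and the daily readings are hypothetical, not EPA values.

```python
def exceedance_fraction(readings, limit):
    """Fraction of readings strictly above the limit."""
    return sum(1 for r in readings if r > limit) / len(readings)


def out_of_compliance(readings, limit, allowed_fraction=0.05):
    """True if the limit is exceeded on MORE than the allowed fraction of days."""
    return exceedance_fraction(readings, limit) > allowed_fraction


# Hypothetical readings for 20 days (illustrative values only).
limit = 100.0
one_bad_day = [80.0] * 19 + [120.0]            # 1 of 20 days above the limit
two_bad_days = [80.0] * 18 + [120.0, 130.0]    # 2 of 20 days above the limit

print(out_of_compliance(one_bad_day, limit))   # exactly 5% is still compliant
print(out_of_compliance(two_bad_days, limit))  # 10% exceeds the allowance
```

The same empirical-fraction idea underlies the flood and value-at-risk examples, except that there the target probability (1/10,000) is far too small to estimate by simple counting.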

Weather prediction: Chance of rain tomorrow is 53%. What does this mean?

Pick your favorite quote:
"There are lies, damn lies and statistics." Sir Winston Churchill
"There are three kinds of lies: Lies, Damn Lies, and Statistics." Benjamin Disraeli
"You can use statistics to prove anything." Homer Simpson
"He uses statistics as a drunken man uses a lamppost - for support rather than illumination." Andrew Lang

"Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital." Aaron Levenstein
"It is easy to lie with statistics, but it is easier to lie without them." Frederick Mosteller

Subject perceived as dull. (Why?) Tell someone at a social gathering that you are studying statistics and watch their reaction. (Is the reaction the same if you say Engineering?) A statistician is someone who is good with numbers, but lacks the personality to be an accountant. (Or is it the other way round?)

Statisticians sometimes perceived as hired guns. (My statistician can beat up your statistician.) Statisticians are often used as expert witnesses in court. Government witnesses in regulatory hearings.

Probability: Build models of random phenomena. Random phenomena: Phenomena whose outcomes cannot be predicted in advance. Outcome: roll of a die. Outcome: tomorrow's stock price. Outcome: result of the upcoming presidential election.

Caution: "All models are wrong; some models are useful." George Box

Statistics: The science of organizing and summarizing data and using information in the data to draw conclusions; scientific extraction of information from data. Statisticians are consumers of probability. Statisticians fit parameters in probability models. Statisticians draw conclusions about a population based on the partial information contained in a sample. The manner in which the conclusions are drawn uses probability tools and reasoning. (Example: "The sampling error in the estimate of p is 5%.")

Population: a well-defined set of items under attention and discussion. (It is clear how to decide if an item is or is not in the population.)

Sample: a well-defined subset of the population that has been selected for study or measurement.

Observation:

Examples:
Population: all US universities. Sample: Ivy League universities.
Population: all voters in the US. Sample: all voters with incomes over $200,000.
Population: Yearly best times in seconds (minus 3 minutes) in the mile run over 121 years (ending in the mid-80s). Sample: Yearly best times which were record times during this time period.

Four stages in statistical analysis:
1. Precisely define the population and then formulate clear, answerable questions about the population.
2. Collect data to help answer the questions; this requires a sampling scheme and experimental design.
3. Exploratory step: Describe and present the data using graphics and descriptive statistics, such as numerical summaries of the data (mean, median, variance, quantiles, ...).
4. Confirmatory step; formal inference: Analyze the data to draw conclusions about the population.

Types of populations and sets:

1. Large but finite population: Difficult to take a census (i.e., sample the whole population). Thus, typically only a relatively small number of items are sampled.
Population of the US. Actual census only every 10 years.
Even in presidential elections, the sample consisting of eligible voters who vote is a relatively small percentage of the population. Only 53.9 percent of the voting-age population actually voted in 1980, when Ronald Reagan was elected.
Pollsters predict outcomes based on much smaller samples, of the order of hundreds.

2. Small and finite population: can be sampled in its entirety.
Average age of ENGRD 2700 students this semester.
Average number of publications of ORIE professors.

3. Abstracted population represented by an infinite set, say an interval of real numbers (or worse).
Population representing time to failure of a machine component; population could be represented by the set of all nonnegative real numbers $[0, \infty)$.

Population representing the study of the point of maximum pollution concentration in a city; population could be represented by a two-dimensional set $\{(x, y) : 0 \le x \le 1,\ 0 \le y \le 1\}$.

Types of samples:

A simple random sample of size n: a sample of size n in which each subset of size n in the population has the same likelihood of being selected. Problem with the definition: "likelihood" is undefined. For populations from continuous infinite sets, the likelihood may be zero; e.g., the probability of drawing a sample of size 2 from [0, 1] and getting {1/4, 3/4} is 0.

Stratified random sample: The population consists of sub-populations, and a sample is drawn from each of them. E.g.: Take a sample of size 700 from the Cornell student population by sampling 100 from each college. (Useful to predict the outcome of a union vote?)

Biased sample: The samples are clearly dependent and unrepresentative. Population: residents of Florida. Goal: determine the average income of residents of Florida. Biased sample: NBA players living in Florida.
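For a finite population, the first two sampling schemes above can be sketched with Python's standard library (the slides themselves use R). The population below is hypothetical: seven colleges of 1,000 student IDs each, chosen only so that 100 per college gives the sample of size 700 from the example. The function names are illustrative too.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct items; every size-n subset is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample of fixed size from each sub-population."""
    rng = random.Random(seed)
    return {name: rng.sample(items, n_per_stratum)
            for name, items in strata.items()}


# Hypothetical population: 7 colleges, 1,000 made-up student IDs in each.
strata = {f"college_{k}": [f"c{k}_s{i}" for i in range(1000)]
          for k in range(1, 8)}
everyone = [student for items in strata.values() for student in items]

srs = simple_random_sample(everyone, 700, seed=1)
strat = stratified_sample(strata, 100, seed=1)

print(len(srs), sum(len(v) for v in strat.values()))
```

Both samples have 700 students, but only the stratified one guarantees exactly 100 from every college; the simple random sample leaves the per-college counts to chance.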

More later.

4.

Population of US voters: Vote to re-elect mayor = 1, vote to elect challenger = 0.
Epidemiology: Infected = 0, not infected = 1.
Failure times: population is all real numbers.
Data networks: Observe packet counts per unit time.
Finance: Observe a stock index such as the Dow Jones (DJ).

Exceptions: Makes of cars sold in the US {Buick LeSabre, Volkswagen Passat, VW Golf, Honda Accord, Honda Civic, ...}. Marketing: Brands of toothpaste {Colgate, Crest, Tom's, ...}.

Conclusion: Sampling often yields a set of numbers. Sometimes the set is large. We need to make sense of the set of numbers.

The computer makes handling big data sets tolerable. 30 years ago, statistics was frequently done
By hand (pencil and paper, abacus, slide rule) or with a mechanical calculator. (Hence elementary stat texts had examples consisting of data sets of 5 points.)
Then: electronic hand calculator. (Tedious to enter data.)
Then: computer.

Walmart records and stores a record of each transaction. What to do with so much data? This is the emerging field of data mining.

Internet joke (but true): Question: What network does a research physicist use to ship his data from one facility to another? Answer: UPS; they ship the hard drive from one location to another by UPS.

In Internet studies, sniffers and other automatic data collection devices give arbitrarily large data sets.

In finance, there is high-resolution data (recordings at very small time intervals) and even tick-by-tick data.

Example: Trace is packet counts per 100 milliseconds (= 1/10 second) for Financial Company X's wide area network link, including USA-UK traffic. Length of dataset = 288,009; 8 hours of collection, from 9am to 5pm. Top plot too muddy; bottom represents subsets of size 20,000.
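As a quick check on the numbers in this example, the Python sketch below compares the stated length with what 8 hours of 100-millisecond bins would give, and cuts a trace of that length into blocks of 20,000 for plotting. The trace here is a placeholder list of zeros, not the real data, and the helper name `chunk` is illustrative.

```python
def chunk(seq, size):
    """Split seq into consecutive blocks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


bins_per_second = 10                              # one count per 100 ms
expected_length = 8 * 60 * 60 * bins_per_second   # 8 hours of bins

trace_length = 288_009                            # length stated on the slide
placeholder_trace = [0] * trace_length            # stand-in for the real counts

blocks = chunk(placeholder_trace, 20_000)
print(expected_length, len(blocks), len(blocks[-1]))  # 288000 15 8009
```

So the stated length is within a second of a full 8-hour day, and blocks of 20,000 give 14 full blocks plus a shorter final one.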

Conclusion:

First step: Descriptive statistics: organization and summarization of large amounts of data for the purposes of drawing conclusions. Use
Graphics
Summary statistics (mean, median, variance, ...)
Use of descriptive statistics is often a first step in an exploratory data analysis. Somewhat informal, pictorial.
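The summary statistics named here can be computed with Python's standard `statistics` module (the slides themselves use R). As a small sketch, the data are the 14 exam scores used in the stem-and-leaf illustration later in these slides; the variable names are illustrative.

```python
import statistics

# Exam scores from the stem-and-leaf illustration in these slides.
grades = [48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95]

mean = statistics.mean(grades)            # arithmetic average
median = statistics.median(grades)        # middle value (average of the two middle ones)
variance = statistics.variance(grades)    # sample variance, divisor n - 1
quartiles = statistics.quantiles(grades, n=4)  # three cut points; method-dependent

print(mean, median, round(variance, 2))   # 75 77.5 132.15
print(quartiles)
```

Note that `statistics.variance` uses the n - 1 divisor, and that quantile conventions differ between packages, so the quartiles may not match R or Minitab exactly.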

Formal inference: Draw scientific inferences about the population from the data. More formal methods. For example, we can formally test hypotheses. (Sometimes the hypotheses are suggested by the EDA.) Most famous clinical trial: the Salk vaccine, given to a sample of kids in the early 1950s, with a control group receiving a placebo. The formal hypothesis was that the Salk vaccine was more effective than could be explained by chance.

5.

stem-and-leaf plot

Older method, originated for small univariate data sets when analysis was often done by hand. This is just a clever arrangement of the data values to reflect the shape of the distribution. Advantage: simple, quick, easy to construct. Disadvantage: a little primitive; doesn't capitalize on the graphics capabilities of packages and computers.

Procedure:

1. Split each x_i into a stem of leading digits and a leaf of the remaining digits.
2. List the stem values in the left-hand margin column and, to the right, list the leaves corresponding to each stem, in the order they are encountered in the data set.

A simple illustration: Suppose student scores on an exam are 48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95. A stem-and-leaf plot is below:

    9 | 5
    8 | 0038
    7 | 03699
    6 | 379
    5 |
    4 | 8
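The two-step procedure can be sketched in Python (the slides themselves use R). This is a minimal version that assumes nonnegative integer data with a one-digit leaf; the function names are illustrative. It lists stems in increasing order and includes empty stems.

```python
from collections import defaultdict


def stem_and_leaf(values, base=10):
    """Return {stem: [leaves]}, leaves in the order encountered in the data."""
    rows = defaultdict(list)
    for x in values:
        stem, leaf = divmod(x, base)   # step 1: split into stem and leaf
        rows[stem].append(leaf)
    lo, hi = min(rows), max(rows)
    # step 2: one row per stem, including stems that have no leaves
    return {s: rows.get(s, []) for s in range(lo, hi + 1)}


def format_plot(rows):
    """Render each row as 'stem | leaves'."""
    lines = []
    for stem, leaves in rows.items():
        digits = "".join(str(d) for d in leaves)
        lines.append(f"{stem} | {digits}" if digits else f"{stem} |")
    return lines


grades = [48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95]
plot_lines = format_plot(stem_and_leaf(grades))
print("\n".join(plot_lines))
```

For the exam scores this prints one row per stem from 4 to 9, with the 50s row left empty.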

Positive features: The entire data set can be read with ease from the display. Gives a clear indication of the shape of the distribution of data values.

R does this automatically with the following commands:

    > grades <- c(48, 63, 67, 69, 70, 73, 76, 79, 79, 80, 80, 83, 88, 95)
    > stem(grades)

    4 | 8
    5 |
    6 | 379
    7 | 03699
    8 | 0038
    9 | 5

Minitab output (check the drop-down menu under GRAPH):

    Stem-and-Leaf Display: C1
    Stem-and-leaf of C1   N = 14
    Leaf Unit = 1.0

      1    4  8
      1    5
      4    6  379
     (5)   7  03699
      5    8  0038
      1    9  5

Minitab output: extra column on the left. Features:
The number in parentheses gives the number of observations on the line that contains the median (or the middle value).
The 4 in the row above that gives the total number of observations in the first three rows, i.e., there were 4 scores below 70.
The 1 above that indicates that there was one score in the 50s or below.
The 5 below the median line indicates that there are 5 scores in the 80s and above.
The 1 on the last line indicates one score of 90 or above.
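The rule described for the left-hand column can be sketched in Python (the slides themselves use R). The sketch reproduces the column shown for this data set from the row counts alone. It is an illustration of the rule as explained here, not Minitab's actual implementation, and the function name is hypothetical.

```python
def depth_column(counts):
    """Left-hand column labels for a stem-and-leaf display.

    counts: number of leaves in each row, in stem order.
    The row holding the median shows its own count in parentheses;
    every other row shows the cumulative count from the nearer end.
    """
    n = sum(counts)
    labels = []
    before = 0                        # observations in earlier rows
    for c in counts:
        through = before + c          # cumulative count from the top
        from_bottom = n - before      # cumulative count from the bottom
        if before < n / 2 < through:  # the median falls inside this row
            labels.append(f"({c})")
        else:
            labels.append(str(min(through, from_bottom)))
        before = through
    return labels


# Row counts for stems 4 to 9 of the exam scores: 8 | (none) | 379 | 03699 | 0038 | 5
counts = [1, 0, 3, 5, 4, 1]
labels = depth_column(counts)
print(labels)  # ['1', '1', '4', '(5)', '5', '1']
```

If the median falls exactly between two rows, no row satisfies the parenthesis condition and every row shows a cumulative count.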

Note: The help file gives a detailed explanation. More extensive data sets, particularly those in which three or more digits vary, are dealt with in a variety of ways, but all are similar to the simple case shown above.

