S01 Handout - Intro to Stats

Ch 1: Introduction to Statistics

Statistics is a discipline concerned with the
- collection, classification and interpretation of quantitative data
- application of probability theory to the analysis and estimation of population parameters

A statistic is a sample characteristic.
- a sample is a subset of the population
- a population is the entire set of objects being studied

A parameter is a population characteristic.

A. Random Samples

When selecting a sample from the population, a random sample should be used.
- minimize any bias

simple random sample = sample in which every member of the population has an equal, non-zero and known probability of being selected

random sample = sample in which every member of the population has a non-zero, known but not necessarily equal probability of being selected

Problems with Random Samples:
1. population list may be unavailable
2. may be time consuming and/or costly
3. non-response may be high
4. may be biased if certain subgroups are numerically small but important to the study

Stratified Random Sampling ensures key sub-groups in the population are included in the random sample.
- First, divide the population into sub-groups (strata).
- Calculate how many observations are needed from each stratum to reflect its proportion in the population.
- Then, choose a random sample of that many observations from each stratum.

Classification of Employees

Category          Male    Female    Total
Management          10        20       30
Professional        50        40       90
Administration      40        60      100
Services            60        20       80
Total              160       140      300

This firm wishes to administer an office climate survey to a random sample of 30 employees. How many people from each stratum should get the survey?

Step 1: Calculate the proportion of the population which is required for the sample.

proportion of the population to be sampled = total sample size / total population size

Step 2: Multiply each group's size by the proportion you calculated.
- round to the nearest whole number

Number Needed from Each Group

Category          Male    Female
Management        ____    ____
Professional      ____    ____
Administration    ____    ____
Services          ____    ____
Total             ____    ____
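Steps 1 and 2 can also be checked with a short script. The following is a minimal Python sketch, not part of the original handout: the variable names and the `round_half_up` helper are illustrative assumptions, and the counts come from the Classification of Employees table.

```python
import math

# Employee counts from the "Classification of Employees" table.
employees = {
    "Management": {"Male": 10, "Female": 20},
    "Professional": {"Male": 50, "Female": 40},
    "Administration": {"Male": 40, "Female": 60},
    "Services": {"Male": 60, "Female": 20},
}

sample_size = 30
population_size = sum(sum(group.values()) for group in employees.values())


def round_half_up(x):
    """Round to the nearest whole number; ties round up (an assumed convention)."""
    return math.floor(x + 0.5)


# Step 1 and Step 2 combined: group size * (sample size / population size).
allocation = {
    category: {
        sex: round_half_up(count * sample_size / population_size)
        for sex, count in group.items()
    }
    for category, group in employees.items()
}

for category, group in allocation.items():
    print(category, group)

print("Total sampled:", sum(sum(g.values()) for g in allocation.values()))
```

With other data the rounded group sizes may not add up to the intended sample size, so it is worth checking the total as the last line does.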

Step 3: Select the required number from each group randomly. Suppose we start with the men from management. Assign a number between 1 and 10 to each.

Name        Assign ID
Ben         1
Damian      2
Greg        3
Jeremy      4
Matt        5
Mohammad    6
Nick        7
Simon       8
Teddy       9
Will        10

Use the RANDBETWEEN function in Excel, e.g. =RANDBETWEEN(1, 10), to randomly generate a whole number between 1 and 10. The number that Excel generates will be the ID number of the person chosen for the sample. Suppose the random number generated was _______. _________________ would be chosen for the sample. We only need one member from this group.
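For readers working outside Excel, the same draw can be sketched in Python. This is an illustrative equivalent and not part of the original handout: `random.randint(1, 10)` includes both endpoints, as RANDBETWEEN does, and the variable names are assumptions.

```python
import random

# Men from management, in the order listed in the handout.
names = ["Ben", "Damian", "Greg", "Jeremy", "Matt",
         "Mohammad", "Nick", "Simon", "Teddy", "Will"]

# Assign IDs 1 to 10 in that order.
ids = {i: name for i, name in enumerate(names, start=1)}

rng = random.Random()  # pass a seed, e.g. random.Random(1), for a repeatable draw
chosen_id = rng.randint(1, 10)  # both endpoints are included
chosen_name = ids[chosen_id]

print(chosen_id, chosen_name)
```

For a group that needs more than one member, `rng.sample(names, k)` draws k different people without repeats.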

Cluster Sampling

Clusters are geographical areas or units like schools, households, etc. Once the clusters have been defined, the required number of clusters is selected randomly. Then, depending on the nature of the research, all or some of the individuals in each cluster are surveyed. This is called One-Stage Clustering. If each cluster is divided into smaller clusters and a random sample of them is chosen, it is called Two-Stage Clustering. Multi-Stage Clustering is another option.
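The difference between one-stage and two-stage clustering can be sketched in Python. Everything in this example is hypothetical: the schools, classes and student labels are invented for illustration, as are the choices of two schools and one class per school.

```python
import random

# Hypothetical data: every school, class and student label below is invented.
schools = {
    "School A": {"Class 1": ["a1", "a2"], "Class 2": ["a3", "a4"], "Class 3": ["a5", "a6"]},
    "School B": {"Class 1": ["b1", "b2"], "Class 2": ["b3", "b4"], "Class 3": ["b5", "b6"]},
    "School C": {"Class 1": ["c1", "c2"], "Class 2": ["c3", "c4"], "Class 3": ["c5", "c6"]},
    "School D": {"Class 1": ["d1", "d2"], "Class 2": ["d3", "d4"], "Class 3": ["d5", "d6"]},
}

rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# Select the required number of clusters (here, 2 schools) at random.
chosen_schools = rng.sample(list(schools), k=2)

# One-stage: survey every individual in each chosen cluster.
one_stage = [
    student
    for school in chosen_schools
    for students in schools[school].values()
    for student in students
]

# Two-stage: within each chosen school, randomly choose one smaller
# cluster (a class) and survey its members.
two_stage = []
for school in chosen_schools:
    chosen_class = rng.choice(list(schools[school]))
    two_stage.extend(schools[school][chosen_class])

print("Schools:", chosen_schools)
print("One-stage sample:", one_stage)
print("Two-stage sample:", two_stage)
```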

Problems with Stratified and Cluster Sampling:
- A stratified sample suffers from the same problems as a simple random sample.
- Each cluster should be representative of the population, but in reality this may be difficult to achieve.

B. Non-Random Sampling

A population list may not be available. The researcher may have to use their judgment to determine the selection of the sample (judgment samples).

Stratified Quota Sample
- calculate the number needed from each stratum
- the selection of individuals from each stratum is not random

Other
- self-selected sample
- focus group
- opportunity sample

II. Sorting and Classifying Data

Qualitative or Categorical Data = defined by some characteristic or quality
- nominal data = a group characteristic like gender or profession
- ordinal data = the result of ranking something in order of preference (e.g. products, TV shows)

Quantitative or Numeric Data = described numerically by counts or measurements
- discrete = can only take certain distinct values
  e.g. a die can only turn up 1, 2, 3, 4, 5 or 6 when thrown once
- continuous = can be any value from a continuous set of values
  e.g. temperature
  - usually round to a specified number of decimal places

Example: Suppose we have the following raw data on MP3 player sales for the month of January.

Number of MP3 Players Sold Daily

30  49  15  55  38  11  29  34  54
31  42  45  25  18  13  25  13  55
38  31  43  22  37  20  36  36  25

Let's construct a Frequency Distribution Table in order to make some sense of the numbers.
- Typically between 5 and 20 intervals are chosen.

Let's choose 10-19, 20-29, 30-39, 40-49 and 50-59 as our intervals.

First, we need to sort our data. Once the data are sorted, tally the frequency of sales in each interval.

Number of MP3 Players Sold Daily (sorted)

11  20  29  36  42  55
13  22  30  36  43  55
13  25  31  37  45
15  25  31  38  49
18  25  34  38  54

January Sales of MP3 Players

Daily Sales    Frequency
10 to 19       ____
20 to 29       ____
30 to 39       ____
40 to 49       ____
50 to 59       ____
Total          ____
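The sort-and-tally step can be sketched in Python as a check on a hand count. This is an illustrative example, not part of the original handout; the variable names are assumptions, and the data and intervals are the ones given above.

```python
# Raw daily sales figures from the handout.
sales = [30, 49, 15, 55, 38, 11, 29, 34, 54, 31, 42, 45, 25, 18,
         13, 25, 13, 55, 38, 31, 43, 22, 37, 20, 36, 36, 25]

# The five intervals chosen in the handout (both endpoints included).
intervals = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]

# Sort first, then count how many values fall in each interval.
sorted_sales = sorted(sales)
frequency = {
    f"{low} to {high}": sum(low <= x <= high for x in sorted_sales)
    for low, high in intervals
}

print(f"{'Daily Sales':<12}Frequency")
for label, count in frequency.items():
    print(f"{label:<12}{count}")
print(f"{'Total':<12}{sum(frequency.values())}")
```

Comparing the printed total with the number of raw observations is a quick way to confirm that no value fell outside the chosen intervals.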

Now that we have sorted and tallied our data, we can more easily make observations about it. For example:

[Figure: chart "Daily MP3 Player Sales in January"; vertical axis: Frequency (0 to 10); horizontal axis: daily sales intervals 10 to 19, 20 to 29, 30 to 39, 40 to 49, 50 to 59]

Now that we have constructed our graph, we can visually make sense of the data.

A histogram is a graphical representation of frequency distributions for numeric data.
- the area of each rectangle is proportional to the frequency of the interval
- intervals may be equal or unequal
- typically no gaps in bars

[Figure: histogram; vertical axis: Frequency (0 to 10); horizontal axis: daily sales intervals 10 to 19, 20 to 29, 30 to 39, 40 to 49, 50 to 59]

Total International Emigration by Age: 1995 and 2002
Outflow (thousands)

Age            1995     2002
Under 15       32.6     25
15 - 24        69.1     91.9
25 - 44       106.5    186.4
45 - 64        21       46.2
65 and over     7.3      9.9

The age group Under 15 contains all ages that round to a value from 0 to 14 (not 15). So, it has a lower bound of zero and an upper bound of 14.499999.
- interval of about 14.5 (0 to 14.4999999)

The age group 15 - 24 contains all ages that round from 15 to 24.
- interval of 10 (14.5 to 24.4999999)

The age group 25 - 44 contains all ages that round from 25 to 44.
- interval of ______

The age group 45 - 64 contains all ages that round from 45 to 64.
- interval of ______

The age group 65 and over has no upper bound. We may wish to choose a reasonable one. If we choose 84, the interval will be 20 (64.5 to 84.4999999).

There is a way to calculate the height of the histogram bars by hand, but most people simply use the command in their data processing software.
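Because the area of each rectangle is proportional to the frequency, one common way to get bar heights for unequal intervals is to divide each frequency by its interval width (often called the frequency density). The Python sketch below applies that convention to the 1995 outflow figures. It is an assumption that this matches the hand method the handout has in mind, and the boundaries use the rounding bounds described above with 84 as the chosen top age.

```python
# Class boundaries based on the rounding bounds in the handout; the
# final boundary assumes 84 was chosen as the top age for "65 and over".
boundaries = [0, 14.5, 24.5, 44.5, 64.5, 84.5]
labels = ["Under 15", "15 - 24", "25 - 44", "45 - 64", "65 and over"]

# 1995 outflow in thousands, from the emigration table.
outflow_1995 = [32.6, 69.1, 106.5, 21.0, 7.3]

# Width of each interval = upper boundary - lower boundary.
widths = [upper - lower for lower, upper in zip(boundaries, boundaries[1:])]

# Bar height = frequency / width, so that height * width (the area)
# equals the frequency.
heights = [freq / width for freq, width in zip(outflow_1995, widths)]

for label, width, height in zip(labels, widths, heights):
    print(f"{label:<12} width {width:>5.1f}  height {height:.3f}")
```

Under this convention the 25 - 44 group has the largest frequency but not the tallest bar, because its frequency is spread over a wider interval than the 15 - 24 group.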

A Cumulative Frequency Graph (Ogive) shows the total number of data values that are less than the upper class boundary of each interval as given in the frequency distribution table. Recall our frequency distribution table for MP3 players.

January Sales of MP3 Players

Daily Sales    Frequency
10 to 19        5
20 to 29        6
30 to 39        9
40 to 49        4
50 to 59        3
Total          27

Daily Sales    Frequency    Cumulative Frequency
0 to 9          0           ____
10 to 19        5           ____
20 to 29        6           ____
30 to 39        9           ____
40 to 49        4           ____
50 to 59        3           ____
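The cumulative frequency column is a running total of the frequency column. The Python sketch below computes it as a check on a hand calculation. It is illustrative and not part of the original handout; the upper class boundaries (9.5, 19.5, and so on) are an assumption based on the rounding convention used for the age groups earlier.

```python
from itertools import accumulate

# Intervals and frequencies from the table above.
intervals = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]
frequency = [0, 5, 6, 9, 4, 3]

# Running total: each entry counts all values in this interval or below.
cumulative = list(accumulate(frequency))

# Assumed upper class boundaries, halfway to the next interval.
upper_boundaries = [high + 0.5 for _, high in intervals]

print(f"{'Daily Sales':<12}{'Frequency':<11}{'Cumulative':<12}Upper boundary")
for (low, high), freq, cum, bound in zip(intervals, frequency, cumulative, upper_boundaries):
    print(f"{f'{low} to {high}':<12}{freq:<11}{cum:<12}{bound}")
```

The last cumulative value should equal the total number of observations, which gives a simple check on the arithmetic. An ogive plots each cumulative value against its upper class boundary.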

[Figure: ogive "Cumulative Frequency of Daily MP3 Player Sales"; vertical axis: Cumulative Frequency]

____________________________________________________________________

Skills:
- basic terminology of statistics
- given raw data:
  - perform stratified random sampling
  - construct frequency distribution table
  - construct frequency distribution chart
  - construct ogive graph

