
Lecture 1: Introduction

Statistics is concerned with

Collecting and presenting data to assist decision making
Processing and analyzing data
Obtaining reliable forecasts

Examples involving statistics

To inspect the incoming goods from a supplier (One-sample hypothesis testing)
Developers of a new hypertension drug want to determine if the drug lowers blood pressure (Two-sample hypothesis testing)
In marketing, statistics is used to evaluate whether higher spending on advertising is justified (Simple linear regression)
To forecast economic indices, such as GNP, GDP, etc., related to many factors (Multiple linear regression)

Key Definitions
A population (universe) is the collection of all members of a group
N represents the population size

A sample is a portion of the population selected for analysis

n represents the sample size

A parameter is a numerical measure that describes a characteristic of a population
A statistic is a numerical measure that describes a characteristic of a sample

Population vs. Sample

Population: a b c d e f g h i j k l m n o p q r s t u v w x y z
Measures used to describe a population are called parameters

Sample: b c g i n o r u y
Measures computed from sample data are called statistics

Examples
Population                                                   Sample
All eligible voters                                          1000 voters polled
All light bulbs manufactured in a day                        100 light bulbs selected
All patients with high blood pressure for a clinical study   200 hypertension patients enrolled for a clinical study

Two branches of statistics

Descriptive Statistics
Collecting, presenting, and characterizing data

Inferential Statistics
Drawing conclusions and/or making decisions concerning a population based only on sample data

Descriptive Statistics
Collect data
e.g., Survey

Present data
e.g., Tables and graphs

Characterize data

e.g., Sample mean $\bar{X} = \frac{\sum X_i}{n}$

Inferential statistics


Two types of Inferential Statistics

Estimation
e.g., Estimate the population mean weight using the sample mean weight
Hypothesis testing
e.g., Test the claim that earnings are higher for males than for females

Reasons for Drawing a Sample

Less Time Consuming Than a Census
Less Costly to Administer Than a Census
Less Cumbersome and More Practical to Administer Than a Census of the Population

Types of Data
Data

Categorical
Examples: Marital Status, Political Party, Eye Color (Defined categories)

Numerical

Discrete
Examples: Number of Children, Defects per hour (Counted items)

Continuous
Examples: Weight, Distance (Measured characteristics)


Descriptive Statistics: Graphical description of Numerical Data

Numerical Data
41, 24, 32, 26, 27, 27, 30, 24, 38, 21

Frequency Distributions & Cumulative Distributions

Histograms

Tables

Stem-and-Leaf Display
A simple way to see distribution details in a data set

METHOD: Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)

Data in Raw Form (as Collected): 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Data in Ordered Array from Smallest to Largest: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem-and-Leaf Display:
2 | 144677
3 | 028
4 | 1
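The method above can be sketched in a few lines of Python. This is a minimal illustration that assumes non-negative two-digit integers; the function name `stem_and_leaf` is hypothetical and not part of the lecture.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group sorted values by tens digit (stem); units digits are the leaves."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 27 -> stem 2, leaf 7
        groups[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in sorted(groups.items())}

raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
display = stem_and_leaf(raw)
for stem, leaves in display.items():
    print(stem, "|", leaves)
```

Each printed row is one stem followed by its leaves in ascending order.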

Tabulating Numerical Data: Frequency Distributions

What is a Frequency Distribution?
A frequency distribution is a list or a table containing class groupings (ranges within which the data fall) and the corresponding frequencies with which data fall within each grouping or category
It allows for a quick visual interpretation of the data

Tabulating Numerical Data: Frequency Distributions

Condenses the raw data and allows for a quick visual interpretation of the data
Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Sort Raw Data in Ascending Order

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Find Range: 58 - 12 = 46
Select Number of Classes: 5 (usually between 5 and 15)
Compute Class Interval (Width): 10 (46/5 = 9.2, then round up)
Determine Class Boundaries (Limits): 10, 20, 30, 40, 50, 60
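The steps above can be traced in Python. This is a sketch under two assumptions not stated in the slides: the width is rounded up to the next whole number, and the first boundary is the largest multiple of the width not exceeding the minimum. Classes are half-open, [lower, upper).

```python
import math

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

num_classes = 5
data_range = max(temps) - min(temps)           # 58 - 12 = 46
width = math.ceil(data_range / num_classes)    # 9.2 rounded up to 10
start = (min(temps) // width) * width          # assumption: start at a multiple of the width
boundaries = [start + k * width for k in range(num_classes + 1)]

# Count the observations that fall in each class [lower, upper)
classes = list(zip(boundaries, boundaries[1:]))
freq = [sum(lo <= t < hi for t in temps) for lo, hi in classes]

for (lo, hi), f in zip(classes, freq):
    print(f"[{lo}, {hi}): {f}")
```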

Frequency Distributions and Percentage Distributions

Data in Ordered Array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class       Frequency   Relative Frequency   Percentage
[10, 20)    3           .15                  15
[20, 30)    6           .30                  30
[30, 40)    5           .25                  25
[40, 50)    4           .20                  20
[50, 60)    2           .10                  10
Total       20          1                    100

Histogram Example
Class       Class Midpoint   Frequency
[10, 20)    15               3
[20, 30)    25               6
[30, 40)    35               5
[40, 50)    45               4
[50, 60)    55               2

(No gaps between bars)
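As a rough text-only stand-in for the histogram, the class midpoints and bar lengths can be produced from the frequency table. The bar format here is an arbitrary illustrative choice, not the plot used in the lecture.

```python
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [3, 6, 5, 4, 2]

# Class midpoint = (lower boundary + upper boundary) / 2
midpoints = [(lo + hi) // 2 for lo, hi in classes]

# One row per class: midpoint, then a bar whose length equals the frequency
bars = [f"{mid:>3} | {'#' * f} ({f})" for mid, f in zip(midpoints, frequencies)]
print("\n".join(bars))
```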

Distribution Shape
The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the center.
[Figure: Symmetric Distribution, frequency histogram]

Distribution Shape
(continued)

The shape of the distribution is said to be skewed if the observations are not symmetrically distributed around the center.
A positively skewed distribution (skewed to the right) has a tail that extends to the right in the direction of positive values.
[Figure: Positively Skewed Distribution, frequency histogram]

A negatively skewed distribution (skewed to the left) has a tail that extends to the left in the direction of negative values.
[Figure: Negatively Skewed Distribution, frequency histogram]

What is the shape of distribution of daily high temperature?

[Figure: Histogram: Daily high temperature, frequency by class midpoint (15, 25, 35, 45, 55)]

Numerical description
Summary Measures

Mean, Median, Mode

Quartiles

Variation: Range, Interquartile range, Variance, Standard Deviation

Mean
Mean (Arithmetic Mean) of Data Values

Sample mean (n = Sample Size):

$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$

Population mean (N = Population Size):

$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$

An example
TV watching hours/week: 5, 7, 3, 38, 7
Mean = (5 + 7 + 3 + 38 + 7)/5 = 60/5 = 12

If the correct time for the 4th subject is 8 (not 38):

Mean = (5 + 7 + 3 + 8 + 7)/5 = 30/5 = 6
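The effect of the single value 38 on the mean can be checked with a short Python sketch (variable names are illustrative):

```python
from statistics import mean

hours = [5, 7, 3, 38, 7]        # as recorded
corrected = [5, 7, 3, 8, 7]     # 4th subject corrected from 38 to 8

mean_recorded = mean(hours)         # 60 / 5
mean_corrected = mean(corrected)    # 30 / 5
print(mean_recorded, mean_corrected)
```

Changing one observation out of five halves the mean, which is why the mean is described as sensitive to extreme values.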


Mean (Cont'd)

The Most Common Measure of Central Tendency, especially when n is large
Affected by Extreme Values (Outliers)

Median
Robust measure of central tendency
Not affected by extreme values
TV watching hours/week in order: 3, 5, 7, 7, 38, so Median = 7
With the 4th subject corrected to 8: 3, 5, 7, 7, 8, so Median = 7

In an ordered array, the median is the middle number

If n is odd, the median is the middle number (i.e., the (n+1)/2 th measurement)
If n is even, the median is the average of the n/2 th and (n/2 + 1) th measurements
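The odd/even rule can be written directly as a small function. This is a sketch; the function name `median_by_position` is hypothetical, and positions are converted from the 1-based convention of the rule to Python's 0-based indexing.

```python
def median_by_position(values):
    """Median via the (n+1)/2 rule for odd n, or the average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                 # (n+1)/2 th measurement
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2   # n/2 th and (n/2 + 1) th

tv_recorded = median_by_position([5, 7, 3, 38, 7])
tv_corrected = median_by_position([5, 7, 3, 8, 7])
even_case = median_by_position([3, 6, 6, 12, 12, 12, 15, 15, 18, 21])
print(tv_recorded, tv_corrected, even_case)
```

The median of the TV data is the same with or without the extreme value 38.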

Mode
A Measure of Central Tendency
Value that Occurs Most Often
Not Affected by Extreme Values
There May Not Be a Mode
There May Be Several Modes
Used for Either Numerical or Categorical Data
[Figure: two dot plots, one on the axis 0 to 14 with Mode = 9 and one on the axis 0 to 6 with No Mode]

Which measure of location is the best?

The mean is generally used, unless extreme values (outliers) exist
In that case the median is often used, since the median is not sensitive to extreme values.
Example: Median home prices may be reported for a region, since the median is less sensitive to outliers

Quartiles
Split ordered data into 4 quarters: 25% | 25% | 25% | 25%, with Q1, Q2, and Q3 as the dividing points

Position of i-th quartile: $(Q_i) = \frac{i(n+1)}{4}$

Noncentral Location
Q1, Q2, and Q3 are called the 25th, 50th, and 75th percentiles, respectively.
A pth percentile is the value of X such that p% of the measurements are less than X and (100-p)% are greater than X.

Quartiles (example)

Data in Ordered Array: 3 6 6 12 12 12 15 15 18 21

Position of Q1: $\frac{1(10+1)}{4} = 2.75$, so $Q_1 = 6$ (the 2nd and 3rd values are both 6)

Position of Q3: $\frac{3(10+1)}{4} = 8.25$, so $Q_3 = 15 + 0.25(18 - 15) = 15.75$
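The position rule with linear interpolation between neighbouring values, as used for Q3 above, can be sketched as follows. The function name `quartile` and the handling of positions beyond the ends of the data are assumptions made for illustration.

```python
def quartile(ordered, i):
    """i-th quartile at 1-based position i(n+1)/4, interpolating between neighbours."""
    n = len(ordered)
    pos = i * (n + 1) / 4
    lower = int(pos)           # whole part of the position
    frac = pos - lower         # fractional part used for interpolation
    if lower < 1:              # assumption: clamp to the smallest value
        return ordered[0]
    if lower >= n:             # assumption: clamp to the largest value
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + frac * (above - below)

data = [3, 6, 6, 12, 12, 12, 15, 15, 18, 21]
q1, q2, q3 = (quartile(data, i) for i in (1, 2, 3))
print(q1, q2, q3)
```

Other textbooks and software packages use slightly different position rules, so quartiles computed elsewhere may differ a little from these values.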

5-number summary

Box-and-Whisker Plot
Graphical display of data using the 5-number summary
Data in Ordered Array: 3 6 6 12 12 12 15 15 18 21

X_smallest = 3, Q1 = 6, Median (Q2) = 12, Q3 = 15.75, X_largest = 21

Example: Comparing variations

Suppose that you are a purchasing agent for a large manufacturing firm and that you regularly place orders with two different suppliers (A & B). The numbers of days required to fill orders are the following:
A: 9, 10, 10, 10, 10, 10, 11, 11, 11, 11
B: 7, 7, 8, 10, 10, 10, 11, 12, 13, 15

Which supplier do you prefer?

Supplier A: Mean = 10.3, Median = Mode = 10
Supplier B: Mean = 10.3, Median = Mode = 10
[Figure: frequency histograms of # of days (7 to 15) for Supplier A and Supplier B]

Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation

Measures of variation give information on the spread or variability of the data values.

Same center, different variation
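For the two suppliers, the point can be checked numerically. This sketch uses Python's standard `statistics` module, with the range and the sample standard deviation as the measures of variation; the dictionary layout is just for illustration.

```python
from statistics import mean, median, mode, stdev

supplier_a = [9, 10, 10, 10, 10, 10, 11, 11, 11, 11]
supplier_b = [7, 7, 8, 10, 10, 10, 11, 12, 13, 15]

summary = {}
for name, days in (("A", supplier_a), ("B", supplier_b)):
    summary[name] = {
        "mean": mean(days),
        "median": median(days),
        "mode": mode(days),
        "range": max(days) - min(days),   # largest minus smallest
        "stdev": stdev(days),             # sample standard deviation (n - 1 divisor)
    }

for name, stats in summary.items():
    print(name, stats)
```

Both suppliers share the same mean, median, and mode, but Supplier B's delivery times are far more spread out.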

Range
Easy to compute
Difference between the Largest and the Smallest Observations:

$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$

Example: observations on the axis 7 to 12
Range = 12 - 7 = 5

Ignores the way in which data are distributed
[Figure: two dot plots on the axis 7 to 12 with different distributions, both with Range = 12 - 7 = 5]

Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119

Interquartile Range

Difference between the First and Third Quartiles
Data in Ordered Array: 3 6 6 12 12 12 15 15 18 21

Interquartile range = Q3 - Q1 = 15.75 - 6 = 9.75
Not Affected by Extreme Values
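A short sketch contrasting the two measures: it recomputes the interquartile range above and then applies both measures to the two outlier data sets from the Range slide. The helper names are hypothetical, and quartiles use the i(n+1)/4 position rule with linear interpolation.

```python
def quartile(ordered, i):
    """i-th quartile at 1-based position i(n+1)/4 with linear interpolation."""
    n = len(ordered)
    pos = i * (n + 1) / 4
    lower = min(max(int(pos), 1), n - 1)   # assumption: keep both neighbours inside the data
    frac = pos - lower
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def iqr(values):
    ordered = sorted(values)
    return quartile(ordered, 3) - quartile(ordered, 1)

def value_range(values):
    return max(values) - min(values)

data = [3, 6, 6, 12, 12, 12, 15, 15, 18, 21]
base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = base[:-1] + [120]           # largest value replaced by 120

print(iqr(data))
print(value_range(base), value_range(with_outlier))
print(iqr(base), iqr(with_outlier))
```

Replacing the largest value with 120 changes the range from 4 to 119, while the interquartile range stays the same.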

Variance
Sample Variance:

$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

Population Variance:

$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

Standard Deviation
Most widely used Measure of Variation
Has the Same Units as the Original Data

Sample Standard Deviation:

$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$

Population Standard Deviation:

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$

Examples
Data set: 11, 12, 13, 16, 16, 17, 18, 21 (n = 8)

$\bar{X} = \frac{1}{8}(11 + 12 + \cdots + 21) = 15.5$

$X_i - \bar{X}$ = -4.5, -3.5, -2.5, 0.5, 0.5, 1.5, 2.5, 5.5

$S^2 = \frac{1}{7}\left[(-4.5)^2 + (-3.5)^2 + \cdots + (5.5)^2\right] = 11.14$

$S = \sqrt{S^2} = \sqrt{11.14} = 3.34$
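The worked example can be reproduced step by step in Python. This sketch follows the definitional formula with the n - 1 divisor; variable names are illustrative.

```python
import math

data = [11, 12, 13, 16, 16, 17, 18, 21]
n = len(data)

x_bar = sum(data) / n                        # sample mean
deviations = [x - x_bar for x in data]       # X_i minus the sample mean
sum_sq_dev = sum(d ** 2 for d in deviations)

sample_variance = sum_sq_dev / (n - 1)       # divide by n - 1, not n
sample_sd = math.sqrt(sample_variance)

print(x_bar, deviations)
print(round(sample_variance, 2), round(sample_sd, 2))
```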

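Computational formula for S:

$S = \sqrt{\frac{1}{n - 1}\left[\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2\right]}$

It requires only $\sum_{i=1}^{n} X_i$ and $\sum_{i=1}^{n} X_i^2$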

Example (revisit)

Data set: 11, 12, 13, 16, 16, 17, 18, 21

$\sum_{i=1}^{8} X_i = 11 + 12 + \cdots + 21 = 124$

$\sum_{i=1}^{8} X_i^2 = 11^2 + 12^2 + \cdots + 21^2 = 2000$
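A brief sketch of the computational formula applied to the same data. It needs only the sum of the values and the sum of their squares; variable names are illustrative.

```python
import math

data = [11, 12, 13, 16, 16, 17, 18, 21]
n = len(data)

sum_x = sum(data)                    # sum of X_i
sum_x2 = sum(x * x for x in data)    # sum of X_i squared

# S = sqrt( (sum of squares - (sum)^2 / n) / (n - 1) )
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

print(sum_x, sum_x2, round(variance, 2), round(sd, 2))
```

The result agrees with the definitional formula for the same data set, since both compute 78/7 before taking the square root.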

Advantages of Variance and Standard Deviation

Each value in the data set is used in the calculation
Values far from the mean are given extra weight (because deviations from the mean are squared)

Visualizing variation