
Lecture 1: Introduction

Statistics is concerned with


Collecting and presenting data to assist decision making
Processing and analyzing data
Obtaining reliable forecasts

Examples involving statistics


To inspect the incoming goods from a supplier (One-sample hypothesis testing)
Developers of a new hypertension drug want to determine if the drug lowers blood pressure (Two-sample hypothesis testing)
In marketing, statistics is used to evaluate whether higher spending on advertising is justified (Simple linear regression)
To forecast economic indices, such as GNP, GDP, etc., related to many factors (Multiple linear regression)

Key Definitions
A population (universe) is the collection of all members of a group
N represents the population size

A sample is a portion of the population selected for analysis


n represents the sample size

A parameter is a numerical measure that describes a characteristic of a population
A statistic is a numerical measure that describes a characteristic of a sample

Population vs. Sample


Population
a b c d e f g h i j k l m n o p q r s t u v w x y z
Measures used to describe a population are called parameters

Sample
b c g i n o r u y
Measures computed from sample data are called statistics

Examples
Population                                                    Sample
All eligible voters                                           1000 voters polled
All light bulbs manufactured in a day                         100 light bulbs selected
All patients with high blood pressure for a clinical study    200 hypertension patients enrolled for a clinical study

Two branches of statistics


Descriptive Statistics
Collecting, presenting, and characterizing data

Inferential Statistics
Drawing conclusions and/or making decisions concerning a population based only on sample data

Descriptive Statistics
Collect data
e.g., Survey

Present data
e.g., Tables and graphs

Characterize data

e.g., Sample mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

Inferential statistics

Population: Use parameters to summarize features

Sample: Use statistics to summarize features

Drawing conclusions about a population based on sample results.



Two types of Inferential Statistics


Estimation
e.g., Estimate the population mean weight using the sample mean weight
Hypothesis testing
e.g., Test the claim that earnings for males are higher than earnings for females

Reasons for Drawing a Sample


Less Time Consuming Than a Census
Less Costly to Administer Than a Census
Less Cumbersome and More Practical to Administer Than a Census of the Population


Types of Data
Data

Categorical
Examples: Marital Status, Political Party, Eye Color (Defined categories)

Numerical

Discrete
Examples: Number of Children, Defects per Hour (Counted items)

Continuous
Examples: Weight, Distance (Measured characteristics)

Descriptive Statistics: Graphical description of Numerical Data


Numerical Data
41, 24, 32, 26, 27, 27, 30, 24, 38, 21

Stem-and-Leaf Display
2 | 144677
3 | 028
4 | 1

Frequency Distributions & Cumulative Distributions

Histograms

Tables

Stem-and-Leaf Display
A simple way to see distribution details in a data set

METHOD: Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)

Data in Raw Form (as Collected): 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Data in Ordered Array from Smallest to Largest: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem-and-Leaf Display:
2 | 144677
3 | 028
4 | 1
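The METHOD above can be sketched in a few lines of Python. This is only an illustration, assuming two-digit values so that the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group sorted values by tens digit (stem); units digits are the leaves."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    # One string of leaves per stem, stems in increasing order
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(groups.items())}

raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
display = stem_and_leaf(raw)
for stem, leaves in display.items():
    print(stem, "|", leaves)
```

Sorting first keeps the leaves on each stem in increasing order, which is what makes the display readable as a sideways histogram.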

Tabulating Numerical Data: Frequency Distributions


What is a Frequency Distribution?
A frequency distribution is a list or a table containing class groupings (ranges within which the data fall) and the corresponding frequencies with which data fall within each grouping or category
It allows for a quick visual interpretation of the data

Tabulating Numerical Data: Frequency Distributions


Condenses the raw data and allows for a quick visual interpretation of the data
Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Sort Raw Data in Ascending Order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Find Range: 58 - 12 = 46
Select Number of Classes: 5 (usually between 5 and 15)
Compute Class Interval (Width): 10 (46/5, then round up)
Determine Class Boundaries (Limits): 10, 20, 30, 40, 50, 60
Count Observations & Assign to Classes

Frequency Distributions and Percentage Distributions

Data in Ordered Array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class       Frequency   Relative Frequency   Percentage
[10, 20)    3           .15                  15
[20, 30)    6           .30                  30
[30, 40)    5           .25                  25
[40, 50)    4           .20                  20
[50, 60)    2           .10                  10
Total       20          1                    100
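The tabulation steps (range, class width, boundaries, counting) could be reproduced roughly as follows. This is a sketch only: the width of 10 and the first boundary of 10 are taken from the slide rather than computed by a general rule, and the variable names are illustrative.

```python
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

num_classes = 5
data_range = max(temps) - min(temps)   # 58 - 12 = 46
width = 10   # 46 / 5 = 9.2, rounded up to a convenient width as on the slide
lower = 10   # first class boundary, chosen as on the slide

# Each class is the half-open interval [lo, hi)
classes = [(lower + k * width, lower + (k + 1) * width)
           for k in range(num_classes)]
freq = [sum(lo <= t < hi for t in temps) for lo, hi in classes]
rel_freq = [f / len(temps) for f in freq]
pct = [100 * r for r in rel_freq]

for (lo, hi), f, r, p in zip(classes, freq, rel_freq, pct):
    print(f"[{lo}, {hi})  {f:>2}  {r:.2f}  {p:.0f}")
```

Using half-open intervals means a value that falls exactly on a boundary, such as 30, is counted in exactly one class.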

Histogram Example

Class       Class Midpoint   Frequency
[10, 20)    15               3
[20, 30)    25               6
[30, 40)    35               5
[40, 50)    45               4
[50, 60)    55               2

[Figure: Histogram: Daily High Temperature (horizontal axis: Class Midpoints; vertical axis: Frequency)]

(No gaps between bars)

Distribution Shape
The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the center.
[Figure: Symmetric Distribution (vertical axis: Frequency)]

Distribution Shape
(continued)

The shape of the distribution is said to be skewed if the observations are not symmetrically distributed around the center.

A positively skewed distribution (skewed to the right) has a tail that extends to the right in the direction of positive values.
[Figure: Positively Skewed Distribution (vertical axis: Frequency)]

A negatively skewed distribution (skewed to the left) has a tail that extends to the left in the direction of negative values.
[Figure: Negatively Skewed Distribution (vertical axis: Frequency)]

What is the shape of distribution of daily high temperature?


[Figure: Histogram: Daily high temperature (vertical axis: Frequency)]

Numerical description
Summary Measures

Central Tendency (location measures)
Mean
Median
Mode

Quartiles

Variation
Range
Interquartile Range
Variance
Standard Deviation

Mean
Mean (Arithmetic Mean) of Data Values
Sample mean (n = Sample Size):
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$

Population mean (N = Population Size):
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$

An example
TV watching hours/week: 5, 7, 3, 38, 7
Mean = (5 + 7 + 3 + 38 + 7)/5 = 60/5 = 12

If the correct time for the 4th subject is 8 (not 38):
Mean = (5 + 7 + 3 + 8 + 7)/5 = 30/5 = 6

[Figure: number-line plots of the two data sets, marked Mean = 12 (with the value 38) and Mean = 6]
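The effect of the single value 38 on the mean can be checked with a short Python sketch. The helper `mean` and the variable names are illustrative only.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

hours = [5, 7, 3, 38, 7]       # TV watching hours/week as recorded
corrected = [5, 7, 3, 8, 7]    # 4th subject corrected from 38 to 8

print(mean(hours))       # 60 / 5
print(mean(corrected))   # 30 / 5
```

Changing one of five observations halves the mean, which is the sensitivity to extreme values that the example illustrates.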

Mean (Cont'd)

The Most Common Measure of Central Tendency, especially when n is large
Affected by Extreme Values (Outliers)

Median
Robust measure of central tendency
Not affected by extreme values
[Figure: number-line plots of the two TV-watching data sets (largest value 38 vs. 8); Median = 7 in both]

In an ordered array, the median is the middle number
If n is odd, the median is the middle number (i.e., the (n+1)/2-th measurement)
If n is even, the median is the average of the n/2-th and (n/2 + 1)-th measurements
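A minimal Python sketch of the odd/even rule follows. The function name `median` is chosen for illustration, and the 1-based positions from the rule are translated to Python's 0-based indexing.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (n + 1)/2 is 0-based index n // 2
        return ordered[mid]
    # 1-based positions n/2 and n/2 + 1 are 0-based indices mid - 1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([5, 7, 3, 38, 7]))   # data with the extreme value 38
print(median([5, 7, 3, 8, 7]))    # corrected data
print(median([3, 6, 6, 12, 12, 12, 15, 15, 18, 21]))   # even n
```

The first two calls return the same value, showing that replacing 38 by 8 leaves the median unchanged.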

Mode
A Measure of Central Tendency
Value that Occurs Most Often
Not Affected by Extreme Values
There May Not Be a Mode
There May Be Several Modes
Used for Either Numerical or Categorical Data
[Figure: two dot plots, one with Mode = 9 (axis 0 to 14) and one with No Mode (axis 0 to 6)]
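The three cases (one mode, several modes, no mode) might be handled as in the sketch below. The convention that a data set has no mode when every distinct value occurs equally often is an assumption made here, and the sample lists are hypothetical, since the data behind the figure are not recoverable.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); empty list if all values tie."""
    counts = Counter(values)
    top = max(counts.values())
    # Assumed convention: several distinct values, all equally frequent -> no mode
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([0, 1, 2, 3, 4, 5, 6]))                # no mode
print(modes([2, 5, 9, 9, 9, 12, 14]))              # a single mode
print(modes([1, 1, 2, 2, 3]))                      # several modes
print(modes(["blue", "brown", "brown", "green"]))  # categorical data
```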

Which measure of location is the best?

Mean is generally used, unless extreme values (outliers) exist
When outliers exist, the median is often used, since the median is not sensitive to extreme values
Example: Median home prices may be reported for a region, because the median is less sensitive to outliers

Quartiles
Split ordered data into 4 quarters

  25%  |  25%  |  25%  |  25%
      (Q1)    (Q2)    (Q3)

Position of i-th quartile: $(Q_i) = \frac{i(n + 1)}{4}$

Q1 (1st quartile) and Q3 (3rd quartile) are measures of Noncentral Location
Q1, Q2, and Q3 are called the 25th, 50th, and 75th percentiles, respectively. A pth percentile is the value of X such that p% of the measurements are less than X and (100-p)% are greater than X.

Quartiles (example)

Data in Ordered Array: 3 6 6 12 12 12 15 15 18 21

Position of first quartile is $\frac{1(10 + 1)}{4} = 2.75$
Q1 = 6 (the 2nd and 3rd ordered values are both 6)

Position of third quartile is $\frac{3(10 + 1)}{4} = 8.25$
Q3 = 15 + 0.25 (18 - 15) = 15.75
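The position rule i(n + 1)/4 with linear interpolation between neighbouring ordered values (as used for Q3 above) can be sketched as below. The function name `quartile` and the clamping at the ends of the array are assumptions for this illustration; software packages often use other quartile conventions and may give slightly different values.

```python
def quartile(ordered, i):
    """i-th quartile (i = 1, 2, 3) of already sorted data, position i(n + 1)/4."""
    n = len(ordered)
    pos = i * (n + 1) / 4        # 1-based, possibly fractional position
    whole = int(pos)             # ordered value just below the position
    frac = pos - whole           # how far to move toward the next value
    if whole < 1:                # position falls before the first value
        return ordered[0]
    if whole >= n:               # position falls at or after the last value
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + frac * (above - below)

data = [3, 6, 6, 12, 12, 12, 15, 15, 18, 21]
q1, q2, q3 = (quartile(data, i) for i in (1, 2, 3))
print(q1, q2, q3)
```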

5-number summary

Box-and-Whisker Plot
Graphical display of data using 5 numbers
Data in Ordered Array: 3 6 6 12 12 12 15 15 18 21

Xsmallest   Q1   Median (Q2)   Q3      Xlargest
3           6    12            15.75   21
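The five numbers can also be obtained with Python's standard `statistics` module. The sketch assumes Python 3.8 or later, where `statistics.quantiles` with `method="exclusive"` places the i-th quartile at position i(n + 1)/4, the same rule used in these slides.

```python
import statistics

data = [3, 6, 6, 12, 12, 12, 15, 15, 18, 21]

# n=4 asks for quartiles; "exclusive" uses positions i(n + 1)/4
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
five_number = (min(data), q1, q2, q3, max(data))
print(five_number)
```

These are the values a box-and-whisker plot draws: the whiskers end at the smallest and largest observations, and the box spans Q1 to Q3 with a line at the median.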

Example: Comparing variations


Suppose that you are a purchasing agent for a large manufacturing firm and that you regularly place orders with two different suppliers (A & B). The numbers of days required to fill orders are the following:
A: 9, 10, 10, 10, 10, 10, 11, 11, 11, 11
B: 7, 7, 8, 10, 10, 10, 11, 12, 13, 15

Which supplier do you prefer?


Supplier A: Mean = 10.3, Median = Mode = 10
Supplier B: Mean = 10.3, Median = Mode = 10
[Figure: histograms for Supplier A and Supplier B (horizontal axis: # of days, 7 to 15; vertical axis: Frequency)]

Measures of Variation
Variation
Range
Interquartile Range
Variance
Standard Deviation

Measures of variation give information on the spread or variability of the data values.

Same center, different variation
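For the two suppliers, "same center, different variation" can be made concrete with a short sketch using Python's standard `statistics` module. The helper `summarize` and the dictionary keys are illustrative names only.

```python
import statistics

a = [9, 10, 10, 10, 10, 10, 11, 11, 11, 11]   # Supplier A, days to fill orders
b = [7, 7, 8, 10, 10, 10, 11, 12, 13, 15]     # Supplier B, days to fill orders

def summarize(days):
    """Location measures plus two measures of variation."""
    return {
        "mean": statistics.mean(days),
        "median": statistics.median(days),
        "mode": statistics.mode(days),
        "range": max(days) - min(days),
        "stdev": statistics.stdev(days),   # sample standard deviation (n - 1)
    }

summary_a = summarize(a)
summary_b = summarize(b)
print(summary_a)
print(summary_b)
```

The location measures agree for both suppliers, while the range and standard deviation are clearly smaller for Supplier A, whose delivery times are more predictable.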

Range
Easy to compute
Difference between the Largest and the Smallest Observations:

$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$

Example:
[Figure: dot plot of data on a scale from 7 to 12]
Range = 12 - 7 = 5

Disadvantages of the Range


Ignores the way in which data are distributed
[Figure: two differently distributed data sets on a scale from 7 to 12, each with Range = 12 - 7 = 5]

Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
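The sensitivity of the range to a single outlier can be verified with a few lines of Python. The function name `data_range` is illustrative.

```python
def data_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

without_outlier = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                   2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5]
with_outlier = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 120]

print(data_range(without_outlier))
print(data_range(with_outlier))
```

Only the last observation differs between the two lists, yet the range changes from 4 to 119.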

Interquartile Range

Difference between the First and Third Quartiles
Data in Ordered Array: 3 6 6 12 12 12 15 15 18 21

Interquartile range = Q3 - Q1 = 15.75 - 6 = 9.75
Not Affected by Extreme Values
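A short sketch of why the interquartile range is not affected by extreme values, assuming Python 3.8 or later and the i(n + 1)/4 position rule (the "exclusive" method of `statistics.quantiles`). The second data set, with 21 replaced by 210, is hypothetical and introduced only for this comparison.

```python
import statistics

def iqr(values):
    """Interquartile range Q3 - Q1 using quartile positions i(n + 1)/4."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1

data = [3, 6, 6, 12, 12, 12, 15, 15, 18, 21]
with_outlier = [3, 6, 6, 12, 12, 12, 15, 15, 18, 210]   # hypothetical outlier

print(iqr(data), max(data) - min(data))
print(iqr(with_outlier), max(with_outlier) - min(with_outlier))
```

The range jumps from 18 to 207, while the interquartile range stays the same because Q1 and Q3 depend only on the middle of the ordered data.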

Variance
Sample Variance:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

Population Variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

Standard Deviation
Most widely used Measure of Variation
Has the Same Units as the Original Data

Sample Standard Deviation:
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$

Population Standard Deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$

Examples

Data set: 11, 12, 13, 16, 16, 17, 18, 21 (n = 8)
$\bar{X} = \frac{1}{8}(11 + 12 + \cdots + 21) = 15.5$
$X_i - \bar{X}$ = -4.5, -3.5, -2.5, 0.5, 0.5, 1.5, 2.5, 5.5
$s^2 = \frac{1}{7}\left[(-4.5)^2 + (-3.5)^2 + \cdots + (5.5)^2\right] = 11.14$
$s = \sqrt{s^2} = \sqrt{11.14} = 3.34$
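The same calculation, written step by step in Python as a sketch of the definitional formula for the sample variance (divisor n - 1). The variable names are illustrative.

```python
import math

data = [11, 12, 13, 16, 16, 17, 18, 21]
n = len(data)

xbar = sum(data) / n                                 # sample mean
deviations = [x - xbar for x in data]                # X_i - X-bar
s2 = sum(d ** 2 for d in deviations) / (n - 1)       # sample variance
s = math.sqrt(s2)                                    # sample standard deviation

print(xbar, deviations)
print(round(s2, 2), round(s, 2))
```

The deviations always sum to zero, which is why they are squared before being averaged.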

Computational formula for s:
$s = \sqrt{\frac{1}{n - 1}\left[\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2\right]}$

All we need to know are $\sum_{i=1}^{n} X_i$ and $\sum_{i=1}^{n} X_i^2$

Example (revisit)

Data set: 11, 12, 13, 16, 16, 17, 18, 21
$\sum_{i=1}^{8} X_i = 11 + 12 + \cdots + 21 = 124$
$\sum_{i=1}^{8} X_i^2 = 11^2 + 12^2 + \cdots + 21^2 = 2000$
$s = \sqrt{\frac{1}{7}\left[2000 - \frac{1}{8}(124)^2\right]} = 3.34$
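As a check, the computational formula can be evaluated from the two sums alone and compared with the sample standard deviation from Python's standard `statistics` module. This is a sketch with illustrative variable names.

```python
import math
import statistics

data = [11, 12, 13, 16, 16, 17, 18, 21]
n = len(data)

sum_x = sum(data)                    # sum of X_i
sum_x2 = sum(x ** 2 for x in data)   # sum of X_i squared

# Computational formula: only the two sums and n are needed
s = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(sum_x, sum_x2, round(s, 2))
print(round(statistics.stdev(data), 2))   # definitional formula, for comparison
```

Both routes give the same value; the computational form is convenient for hand calculation because it avoids computing each deviation from the mean.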

Advantages of Variance and Standard Deviation

Each value in the data set is used in the calculation
Values far from the mean are given extra weight (because deviations from the mean are squared)

Visualizing variation

Small standard deviation

Large standard deviation