MTH410

Probability and Statistics


Spring 2014
Nursel S. Ruzgar
Mathematics Department
nruzgar@ryerson.ca
416-979 5000/ext. 3173

MTH410 S14- Lecture 1

Discussion of Syllabus
Required Text:
Solved problems in Statistics, Part I- P. Ghargbouri, B. Todorow
Exercises in Statistics, Part I- P. Gharghbouri, B. Todorow
Meets
Mondays: 2:00-5:00pm-KHE221,
Wednesdays: 2:00-5:00pm- KHE221
Office Hours: Tuesdays: 5pm-5:45pm-VIC707

Labs:

Section 1: Fridays: 10:00-12:00pm-ENGLG12


Section 2: Fridays: 13:00-15:00pm-ENGLG12
Section 3: Fridays: 16:00-18:00pm-ENGLG12
Section 4: Wednesdays: 11:00-13:00pm-ENG102


Discussion of Syllabus (cont'd)


Course Web Site: Blackboard

Labs and Quizzes

Labs will start in the first week, May 12.


There will be a quiz each week, except the
first week.


Discussion of Syllabus (cont'd)


Academic Dishonesty (Strongly
discouraged)

Refer to the senate policy

Tentative Course Outline


Course Objectives
Identify and formulate problems where
statistics can have an impact.
See the relevance of statistics. Apply what has
been learned to other engineering courses and
to career practice.
Understand the basics of Statistics and
Probability Theory
Interpret the statistical results and retrieve
necessary information to help decision making
Develop the bases for the other courses.


Evaluation
30% Midterm Test (100 minutes), 10:00am,
Saturday, June 14, 2014
60% Final exam (180 minutes), room: TBA
10% Lab quizzes


OUTLINE Lecture 1
Statistics-Descriptive and Inferential Statistics
Populations, Parameters, and Samples,
Statistic, Variable
Data & Types of Data

Cross-Sectional vs Time Series Data


Interval, Nominal Data, Ordinal

Graphical descriptive techniques for each type of data

Histograms, Pie and Bar Charts


Scatter Diagrams, Contingency Table
Line Chart


In today's world we are constantly surrounded by statistics and statistical information.
For example:
Political Polls, Customer Surveys
Interest rates, Economic Predictions
Course Marks, Job Market Information
How can we make sense out of all these data?
How can we differentiate valid from flawed claims?

What is Statistics?!


What is Statistics? (Cont'd)


Statistics is a way to get information from data
Data -> Statistics -> Information

Data: Facts, especially numerical facts, collected together for reference or information.

Information: Knowledge communicated concerning some particular fact.

Statistics is a tool for creating new understanding from a set of numbers.


Definitions: Oxford English Dictionary


Example
A student is somewhat apprehensive about the statistics
course because the student believes the myth that the
course is difficult. The professor provides last term's marks to
the student. What information can the student obtain from this
list?

Data -> Statistics -> Information

Data: List of last term's marks: 95, 89, 70, 65, 78, 57, ...

Information: New information about the statistics class,
e.g. the median of all marks, the typical mark (i.e. the average),
the mark distribution, etc.


Population, Parameter, Sample, Statistic, Variable
Population: group of all items of interest to the statistics
practitioner.

All the members of the Ryerson University.

Parameter: A descriptive measure of a population.

Mean number of soft drinks sold at Ryerson every week.

Sample: A set of items drawn from the population.

500 students surveyed.

Statistic: A descriptive measure of a sample.

Average number of soft drinks these students buy per week.

Variable: A characteristic of a population or sample that is of interest to us.

Number of soft drinks a student buys every week.


Key Statistical Concepts


Population
A population is the entire set of all items under study.
It is frequently very large, sometimes infinite.

E.g. All 5 million Florida voters


Sample
A sample is a set of data drawn from the population.
Potentially very large, but smaller than the population.
E.g. a sample of 765 voters taken from exit polls on election day.


Key Statistical Concepts (Cont'd)


Parameter
A descriptive measure of a population.
In most applications of inferential statistics, the parameter
represents the information we need.
E.g. The proportion of the 5 million Florida voters who voted
for Obama.
Statistic
A descriptive measure of a sample.
E.g. The proportion of the sample of 765 Floridians who voted
for Obama.


Key Statistical Concepts (Cont'd)


Population -> (subset) -> Sample
Statistic -> (inference) -> Parameter

Populations have Parameters
Samples have Statistics


Types of Statistics
Descriptive statistics: involves the
arrangement, summary, and presentation of
data, to enable meaningful interpretation, and
to support decision making.
Inferential Statistics: a set of methods used
to draw conclusions about characteristics of a
population based on sample data.


Descriptive Statistics
Descriptive Statistics is a set of methods of organizing,
summarizing, and presenting data in a convenient and
informative way. These methods include:
Graphical Techniques
Numerical Techniques
The actual method used depends on what information we
would like to extract. Are we interested in:

measure(s) of central location? and/or


measure(s) of variability (dispersion)?

Inferential Statistics
Descriptive statistics describes the data set that's being
analyzed, but doesn't allow us to draw any conclusions
or make any inferences about the data. Hence we need
another branch of statistics: inferential statistics.
Inferential statistics is also a set of methods, but it is used
to draw conclusions or inferences about characteristics of
populations based on data from a sample.


Statistical Inference
Statistical inference is the process of making an estimate,
prediction, or decision about a population based on a sample.
Population (Parameter) <- Inference <- Sample (Statistic)

What can we infer about a population's parameters
based on a sample's statistics?


Statistical Inference (Cont'd)
We use statistics to make inferences about parameters.
Therefore, we can make an estimate, prediction, or
decision about a population based on sample data.
Then, we can apply what we know about a sample to the
larger population from which the sample was drawn!
What is the purpose or/and which kind of benefits


Statistical Inference (Cont'd)


Rationale:
Large populations make investigating each member
impractical, extremely expensive and time-consuming.
Easier and cheaper to take a sample and make estimates
about the population from the sample.
However:
Such conclusions and estimates are not always going to be
correct.
Hence, we have to build into the statistical inference
measures of reliability, namely confidence level and
significance level.

Confidence & Significance Levels

The confidence level is the proportion of times that an
estimating procedure will be correct if the sampling
procedure were repeated a very large number of times.
E.g. a confidence level of 95% means that estimates based on
this form of statistical inference will be correct 95% of the time.

When the purpose of the statistical inference is to draw a
conclusion about a population, the significance level
measures how frequently the conclusion will be wrong in
the long run.
E.g. a 5% significance level means that, in repeated samples,
this type of conclusion will be wrong 5% of the time.


Confidence & Significance Levels (Cont'd)
So if we use $\alpha$ (Greek letter alpha) to
represent the significance level (how frequently
the conclusion will be wrong), then our
confidence level is $1 - \alpha$.
This relationship can also be stated as:

Confidence Level + Significance Level = 1


Confidence & Significance Levels (Cont'd)
Consider a statement from polling data you
may hear about in the news these days:
"This poll is considered accurate within 3.4
percentage points, 19 times out of 20."

In this case, the confidence level is 95%
(19/20 = 0.95), and the significance level is 5%.


Graphical Descriptive
Techniques


Agenda
Types of Data and Information
Graphical and Tabular Techniques for Nominal Data
Graphical Techniques for Interval Data
Describing Time-Series Data
Describing the Relationship Between Two Variables


Definitions
A variable is some characteristic of a population or sample.
Typically denoted with a capital letter: X, Y, Z
E.g. student marks. Not all students achieve the same mark. The
marks vary from student to student, hence the name "variable".
Values of a variable are all possible observations of the variable.
E.g. student marks: all integers between 0 and 100.
Data are the observed values of a variable.
E.g. marks of 6 students in an exam: {67, 74, 71, 83, 93, 48}


Types of data analysis


Knowing the type of data is necessary to properly
select the technique to be used when analyzing data.
Type of analysis allowed for each type of data

Interval data - arithmetic calculations
Nominal data - counting the number of observations in each category
Ordinal data - computations based on an ordering process


Types of data and information


Variable - a characteristic of a population or
sample that is of interest to us.

Number of soft drinks a student buys every week


The waiting time for medical services
The score of a student in the Stats Exam.

Data - the actual values of variables

Interval data are numerical observations


Nominal data are categorical observations
Ordinal data are ordered categorical observations


Types of Data & Information


Data (at least for purposes of Statistics)
fall into three main groups:
Interval Data
Nominal Data
Ordinal Data


Interval Data
Real numbers, e.g. weights, prices,
distances, etc.
Also called quantitative or numerical.
Arithmetic operations can be performed on
Interval Data, so it's meaningful to talk
about 2*Weight, or Price + $1.5, and so on.


Nominal Data
The values of nominal data are categories.
E.g. responses to questions about marital status, coded as:
Single = 1, Married = 2, Divorced = 3, Widowed = 4

Because the numbers are arbitrary, arithmetic operations
don't make any sense (e.g. does Widowed ÷ 2 = Married?!)
Any other numbering system is also valid provided that
each category has a different number assigned to it.
E.g. Another coding system as valid as the previous one:
Single = 7, Married = 4, Divorced = 13, Widowed = 1

Nominal data are also called qualitative or categorical.
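
The point about arbitrary codes can be checked with a small sketch. The list of responses below is hypothetical (it is not course data); the two coding schemes are the ones shown above. Counting gives the same answer under either coding, while an arithmetic mean of the codes changes with the coding, which is why it carries no meaning for nominal data.

```python
from collections import Counter
from statistics import mean

# Hypothetical responses to the marital-status question (labels, not codes).
responses = ["Single", "Married", "Married", "Divorced",
             "Single", "Widowed", "Married"]

# The two coding schemes from the slide.
coding_a = {"Single": 1, "Married": 2, "Divorced": 3, "Widowed": 4}
coding_b = {"Single": 7, "Married": 4, "Divorced": 13, "Widowed": 1}


def category_counts(labels, coding):
    """Encode the labels, count each code, and report the counts by label."""
    decode = {code: label for label, code in coding.items()}
    counts = Counter(coding[label] for label in labels)
    return {decode[code]: n for code, n in counts.items()}


# Frequencies do not depend on which numbers represent the categories.
counts_a = category_counts(responses, coding_a)
counts_b = category_counts(responses, coding_b)

# The "mean code" does depend on the coding, so it is not meaningful.
mean_a = mean(coding_a[label] for label in responses)
mean_b = mean(coding_b[label] for label in responses)
```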



Ordinal Data
Ordinal Data appear to be nominal, but their values have an
order, a ranking to them.
E.g. The most active stocks traded on the NASDAQ in
descending order
MSFT = 1, CSCO = 2, Dell = 3, SunW = 4
Any other numbering system is valid provided the order is
maintained.
E.g. Another coding system as valid as the previous one:
MSFT = 6, CSCO = 11, Dell = 23, SunW = 45
We can say something like the number of stocks traded from:
Microsoft > Cisco or Sun Microsystems < Dell
It is still not meaningful to do arithmetic operations on this kind of data (e.g.
does 2*MSFT = CSCO?!).

Nominal data vs. Ordinal data


The critical difference between nominal data and ordinal
data is that the values of the latter are in order.
E.g. It is valid for nominal data to have:
Single = 7, Married = 4, Divorced = 13, Widowed = 1
However, it won't be valid for ordinal data:
MSFT = 7, CSCO = 4, Dell = 13, SunW = 1
(The order changes to SunW, CSCO, MSFT, Dell.
We must keep the order MSFT, CSCO, Dell, SunW.)
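
As a rough illustration of this rule, the sketch below checks whether a coding preserves the required ranking. The function names are made up for this example; the three codings are the ones from the slides.

```python
# Stocks in descending order of activity, as ranked on the slide.
ranking = ["MSFT", "CSCO", "Dell", "SunW"]


def order_implied_by(coding):
    """Return the stocks sorted by their numeric code, smallest code first."""
    return sorted(coding, key=coding.get)


def preserves_ranking(coding):
    """A coding is valid for ordinal data only if it keeps the ranking."""
    return order_implied_by(coding) == ranking


valid_a = {"MSFT": 1, "CSCO": 2, "Dell": 3, "SunW": 4}
valid_b = {"MSFT": 6, "CSCO": 11, "Dell": 23, "SunW": 45}
invalid = {"MSFT": 7, "CSCO": 4, "Dell": 13, "SunW": 1}
```

Under these definitions `preserves_ranking(valid_a)` and `preserves_ranking(valid_b)` hold, while the third coding reorders the stocks.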

Interval data vs. Ordinal data


The critical difference between interval data and ordinal data
is that the intervals or differences between values of interval
data are consistent and meaningful.
E.g. The difference between marks of 85 and 80 is the same
five-mark difference as that between 75 and 70.
However, for a coding system like:
MSFT = 1, CSCO = 2, Dell = 3, SunW = 4
We can see CSCO - MSFT = 1, and SunW - Dell = 1
But we can't conclude that the difference between the
number of stocks traded in Microsoft and Cisco Systems is
the same as the difference in the number of stocks traded
between Dell Computer and Sun Microsystems.

Types of Data & Information (Cont'd)


Knowing the type of data is necessary to properly select
the technique to be used when analyzing data.
Data: Categorical?
  No: Interval Data
  Yes: Categorical Data. Ranked?
    Yes: Ordinal Data
    No: Nominal Data

Calculations for Types of Data


All calculations are permitted on interval data.
Only calculations involving a ranking process are
allowed for ordinal data.

No calculations are allowed for nominal data; we are only
allowed to count the number of observations in each
category.
This leads to the following hierarchy of data:

Hierarchy of Data
Interval
Values are real numbers.
All calculations are valid.
Data may be treated as ordinal or nominal.

Ordinal
Values must represent the ranked order of the data.
Calculations based on an ordering process are valid.
Data may be treated as nominal but not as interval.

Nominal
Values are the arbitrary numbers that represent categories.
Only calculations based on the frequencies of occurrence are valid.
Data may not be treated as ordinal or interval.

A higher level may be treated as lower level(s).

E.g. Representing Student Grades


Data: Categorical?
  N: Interval Data, e.g. integers in {0..100}
  Y: Categorical Data. Ranked?
    Y: Ordinal Data, e.g. {F, D, C, B, A} (ranked order to data)
    N: Nominal Data, e.g. {Pass | Fail} (NO ranked order to data)

Agenda
Types of Data and Information
Graphical and Tabular Techniques for
Nominal Data


Graphical & Tabular Techniques for Nominal Data
The only allowable calculation on nominal data is to
count the frequency of each value of the variable.
We can summarize the data in a table that presents the
categories and their counts called a frequency
distribution.
A relative frequency distribution lists the categories
and the proportion with which each occurs.


Relative frequency
It is often preferable to show the relative frequency
(proportion) of observations falling into each class, rather
than the frequency itself.

Class relative frequency = (Class frequency) / (Total number of observations)

Relative frequencies should be used when
the population relative frequencies are studied
two or more histograms are compared
the numbers of observations in the samples studied are different
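
A minimal sketch of the relative-frequency calculation follows. The `sample` list is hypothetical coded data made up for illustration, and `frequency_table` is a name chosen for this example.

```python
from collections import Counter


def frequency_table(observations):
    """Return {category: (frequency, relative frequency)}, sorted by category."""
    counts = Counter(observations)
    total = len(observations)
    return {cat: (n, n / total) for cat, n in sorted(counts.items())}


# Hypothetical coded responses (made up for this sketch).
sample = [1, 3, 1, 5, 2, 1, 3, 5, 5, 1]
table = frequency_table(sample)

for category, (freq, rel) in table.items():
    print(f"{category}: frequency = {freq}, relative frequency = {rel:.1%}")
```

Whatever the data, the relative frequencies add up to 1, which makes samples of different sizes comparable.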


Class width
It is generally best to use equal class widths, but
sometimes unequal class widths are called for.
Unequal class widths are used when the frequency
associated with some classes is too low. Then,
several classes are combined together to form a
wider and more populated class.
It is possible to form an open-ended class at the
higher end or lower end of the histogram.


Example: Light Beer Preference Survey
In 2006 total light beer sales in the United States were
approximately 3 million gallons.
With a market this large, breweries often need to know more
about who is buying their product.
The marketing manager of a major brewery wanted to
analyze the light beer sales among college and university
students who do drink light beer.
A random sample of 285 graduating students was asked to
report which of the following is their favorite light beer.


Example
1. Budweiser Light
2. Busch Light
3. Coors Light
4. Michelob Light
5. Miller Lite
6. Natural Light
7. Other brand
The responses were recorded using the codes. Construct a
frequency and relative frequency distribution for these data
and graphically summarize the data by producing a bar
chart and a pie chart.

Example
1
1
5
1
3
3
3
7
2
6
1
6
3
4
5
2
5
5
2
1

1
5
1
2
3
3
6
1
5
5
4
1
3
1
6
3
1
1
2
1

1
2
1
1
5
1
2
1
3
7
6
3
7
4
4
2
4
3
5
1

1
1
3
1
4
3
6
1
1
1
1
3
5
5
3
7
6
5
1
7

2
5
3
5
7
5
3
5
1
3
3
1
5
3
5
5
3
1
3
3

4
1
5
5
6
3
6
1
3
2
1
3
1
1
6
1
5
1
5
1

3
3
5
3
6
3
6
3
1
1
1
7
1
5
4
6
1
1
5
5

5
3
6
2
4
7
6
1
1
3
5
1
3
3
6
6
1
3
2
3

1
3
3
1
4
3
5
3
7
1
5
1
5
3
5
2
2
7
3
3

3
1
5
6
6
7
6
7
5
1
5
1
1
3
5
3
1
3
1
3

1
1
3
1
5
2
1
7
3
7
5
2
5
1
5
3
5
1
1
5

3
5
5
1
2
1
1
2
2
5
1
4
4
1
5
3
6
6
3
3

7
3
5
4
1
5
6
1
1
5
5
1
5
5
3
1
1
3
6
1

5
1
5
5
1
7
3
1
1
6
5
1
3
3
1
1
1
1
1
7

1
5
1
1
5


Frequency and Relative Frequency Distributions

Light Beer Brand    Frequency    Relative Frequency (%)
Budweiser Light        90           31.6
Busch Light            19            6.7
Coors Light            62           21.8
Michelob Light         13            4.6
Miller Lite            59           20.7
Natural Light          25            8.8
Other brands           17            6.0
Total                 285          100


Nominal Data (Frequency)

[Figure: bar chart of the frequency of each light beer brand; vertical axis from 0 to 100; bar heights 90, 19, 62, 13, 59, 25, 17]

Bar Charts are often used to display frequencies



Nominal Data (Relative Frequency)

[Figure: pie chart of relative frequencies by brand code: 1: 31%, 2: 7%, 3: 22%, 4: 4%, 5: 21%, 6: 9%, 7: 6%]

Pie Charts show relative frequencies



Nominal Data

[Figure: the frequency table, the bar chart, and the pie chart for the light beer data shown together]

It is all the same information (based on the same data),
just a different presentation.


Agenda
Types of Data and Information
Graphical and Tabular Techniques for
Nominal Data

Graphical Techniques for Interval Data


Graphical Techniques for Interval Data


There are several graphical methods that are used when
the data are interval.
The most important of these graphical methods is the
histogram, which is created by drawing rectangles
whose bases are the intervals and whose heights are the
frequencies.

The histogram is not only a powerful graphical
technique used to summarize interval data, but is also used
to help explain probabilities.

Building a Histogram
Example: The marketing manager of a long-distance telephone
company conducted a survey of 200 new customers in which the
first month's bills were recorded. What information can be extracted
from those data?
At first, the manager was only able to find that the smallest bill is $0,
the largest bill is $119.63, and most of the bills are less than $100.
However, a lot of other information may be more interesting:
Bill distribution,
Are there many small bills and few large bills?
What is the typical bill?
Are the bills somewhat similar or different?

Building a Histogram (Cont'd)
1) Collect the Data

2) Create a frequency distribution for the data


a) Determine the number of classes to use
Refer to Table.
With 200 observations, we
should have between 7 and 10
classes; 9 seems the best. For
our purpose, let us pick 8.

Alternatively, we could use Sturges' formula:
Number of class intervals = 1 + 3.3 log(n), where n
is the number of observations. With n = 200, we get 9.
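
A quick check of Sturges' formula, as a sketch. It assumes that "log" means the base-10 logarithm, which is the usual reading of this version of the formula; the function name is made up for this example.

```python
import math


def sturges_classes(n):
    """Number of class intervals suggested by 1 + 3.3 * log10(n).

    Assumption: "log" in the formula is the base-10 logarithm.
    """
    return 1 + 3.3 * math.log10(n)


raw = sturges_classes(200)   # roughly 8.59 for the 200 telephone bills
suggested = round(raw)       # rounded to a whole number of classes
print(f"1 + 3.3 log(200) = {raw:.2f}, so about {suggested} classes")
```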


Building a Histogram (Cont'd)
1) Collect the Data
2) Create a frequency distribution for the data
a) Determine the number of classes to use. [8]

b) Determine how large to make each class


Look at the range of the data, that is,
Range = Largest Observation - Smallest Observation
Range = $119.63 - $0 = $119.63
Then each class width becomes:
Range ÷ (# of classes) = 119.63 ÷ 8 ≈ 15

FYI: if we pick 9 classes, the width should be about 13.3
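
The class-width arithmetic can be sketched as follows, using the smallest and largest bills quoted in the example. The helper name `class_width` is hypothetical, and rounding the width up to a convenient value is a judgment call rather than a fixed rule.

```python
import math


def class_width(smallest, largest, n_classes):
    """Range of the data divided by the number of classes."""
    return (largest - smallest) / n_classes


width_8 = class_width(0.0, 119.63, 8)   # 14.95375, rounded up to 15 on the slide
width_9 = class_width(0.0, 119.63, 9)   # about 13.29

# With 8 classes of width 15 the upper class limits are 15, 30, ..., 120,
# which is enough to cover the largest bill of $119.63.
upper_limits = [15 * k for k in range(1, 9)]
```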



Building a Histogram (Cont'd)
1) Collect the Data
2) Create a frequency distribution for the data
a) Determine the number of classes to use. [8]
b) Determine how large to make each class. [15]
c) Place the data into each class
each item can only belong to one class;
each class contains observations greater than its
lower limit and less than or equal to its upper limit.
That means there is no overlap between any
two classes.
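
One possible way to place observations into such classes is sketched below. The bills are hypothetical values chosen to sit on and around the class limits. One assumption is added here: a value equal to the lowest limit (a $0 bill) is put in the first class, since otherwise it would belong to no class.

```python
import math


def class_index(value, lower, width):
    """0-based index of the class (lower + i*width, lower + (i+1)*width].

    Assumption: a value equal to `lower` is assigned to the first class.
    """
    if value <= lower:
        return 0
    return math.ceil((value - lower) / width) - 1


def frequency_distribution(data, lower, width, n_classes):
    """Count how many observations fall into each class."""
    counts = [0] * n_classes
    for x in data:
        counts[class_index(x, lower, width)] += 1
    return counts


# Hypothetical bills (not the survey data).
bills = [0.0, 15.0, 15.01, 29.99, 30.0, 42.19, 119.63]
counts = frequency_distribution(bills, lower=0.0, width=15.0, n_classes=8)
```

A bill of exactly $15.00 lands in the first class and a bill of $15.01 in the second, so every observation belongs to exactly one class.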

Building a Histogram (Cont'd)
1) Collect the Data
2) Create a frequency
distribution for the data.

3) Draw the Histogram



Building a Histogram (Cont'd)
1) Collect the Data
2) Create a frequency distribution for the data.
3) Draw the Histogram


Example: Interpret

[Figure: histogram of the telephone bills; horizontal axis "Bills" with class limits 15, 30, 45, 60, 75, 90, 105, 120; vertical axis "Frequency"]

About half of all the bills are small: 71 + 37 = 108
A few bills are in the middle range: 13 + 9 + 10 = 32
A relatively large number of bills are large: 18 + 28 + 14 = 60


FYI: The difference from a bar chart

Bar chart:
All data are nominal,
Each bin is one category,
There is a gap between two neighboring bins.

Histogram:
All data are interval,
Each bin is an interval of values,
There is no gap between two neighboring bins.


Shapes of histograms
There are four typical shape characteristics


Shapes of Histograms

Symmetry
A histogram is said to be symmetric if, when we
draw a vertical line down the center of the histogram,
the two sides are identical in shape and size:

[Figure: three symmetric histograms; horizontal axis "Variable", vertical axis "Frequency"]

Shapes of Histograms (Cont'd)
Skewness
A skewed histogram is one with a long tail extending to
either the right or the left:

Negatively skewed

Positively skewed

Shapes of Histograms (Cont'd)

Modality
A unimodal histogram is one with a single peak,
while a bimodal histogram is one with two peaks:

[Figure: a unimodal histogram and a bimodal histogram; horizontal axis "Variable", vertical axis "Frequency"]

A modal class is the class with the largest number of observations.



Modal classes
A modal class is the one with the largest number of
observations.

A unimodal histogram

The modal class



Modal classes
A bimodal histogram

A modal class

A modal class


Bell Shaped Histograms

A special type of symmetric unimodal histogram
is bell shaped:

[Figure: a bell-shaped histogram; horizontal axis "Variable", vertical axis "Frequency"]

Many statistical techniques
require that the population
be bell shaped.


Bell shaped histograms


Many statistical techniques require that the
population be bell shaped.
Drawing the histogram helps verify the shape of
the population in question


Stem & Leaf Display


Retains information about individual observations that would
normally be lost in the creation of a histogram.
Split each observation into two parts, a stem and a leaf:
e.g. Observation value: 42.19
There are several ways to split it up.

We could split it at the decimal point:

Stem  Leaf
42    19

Or split it at the tens position (while rounding to the nearest
integer in the ones position):

Stem  Leaf
4     2

Stem & Leaf Display


Continue this process for all the observations.
Then, use the stems for the classes and
each leaf becomes part of the histogram as
follows:

Stem  Leaf
 0    0000000000111112222223333345555556666666778888999999
 1    000001111233333334455555667889999
 2    0000111112344666778999
 3    001335589
 4    124445589
 5    33566
 6    3458
 7    022224556789
 8    334457889999
 9    00112222233344555999
10    001344446699
11    124557889

Thus, we still have access to our original data points' values!
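
The splitting rule can be sketched in a few lines. The bills below are hypothetical values (only 42.19 comes from the slide), and `stem_and_leaf` is a name made up for this example. Each value is rounded to the nearest integer and then split at the tens position.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Build {stem: leaves} by splitting rounded values at the tens position."""
    display = defaultdict(list)
    for v in values:
        # round() gives the nearest integer; divmod splits tens from ones.
        stem, leaf = divmod(round(v), 10)
        display[stem].append(leaf)
    return {
        stem: "".join(str(leaf) for leaf in sorted(leaves))
        for stem, leaves in sorted(display.items())
    }


# Hypothetical bills; 42.19 is the observation used on the slide.
bills = [42.19, 38.45, 29.23, 89.35, 118.04, 110.46, 0.0, 72.88, 83.05, 95.73]
display = stem_and_leaf(bills)

for stem, leaves in display.items():
    print(f"{stem:>2} | {leaves}")
```

Note that Python's `round` sends exact halves to the nearest even integer, so a textbook "round half up" rule would need a different helper.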

Histogram and Stem & Leaf


Ogive
An ogive (pronounced "Oh-jive") is a graph of
a cumulative relative frequency distribution.
We create an ogive in three steps.
First, from the frequency distribution created earlier,
calculate relative frequencies.


Relative Frequencies
For example, we had 71 observations in the first class (telephone
bills from $0.00 to $15.00). Hence, the relative frequency for this
class is 71 ÷ 200 (the total # of phone bills) = 0.355 (or 35.5%)


Ogive (Cont'd)
An ogive is a graph of a cumulative relative frequency distribution.
We create an ogive in three steps:
1) Calculate relative frequencies.
2) Calculate cumulative relative frequencies by adding
the current class relative frequency to the previous
class cumulative relative frequency.
(For the first class, its cumulative relative frequency is just its relative
frequency)


Cumulative Relative
Frequencies
first class, just itself
next class: .355+.185=.540

last class: .930+.070=1.00

Always or by chance?
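
The running totals can be reproduced with a short sketch. The class frequencies are taken from the sums quoted on the "Interpret" slide (71 + 37, 13 + 9 + 10, 18 + 28 + 14); listing them in this class order, from smallest to largest bills, is an assumption. The last cumulative value is always 1, because the relative frequencies of all classes sum to 1.

```python
from itertools import accumulate

# Class frequencies for the 8 classes of telephone bills (assumed order).
frequencies = [71, 37, 13, 9, 10, 18, 28, 14]

total = sum(frequencies)                      # 200 bills
relative = [f / total for f in frequencies]   # e.g. 71 / 200 = 0.355
cumulative = list(accumulate(relative))       # running sum of relative frequencies

for upper_limit, cum in zip(range(15, 121, 15), cumulative):
    print(f"bills up to ${upper_limit}: {cum:.3f}")
```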

Ogive (Cont'd)
An ogive is a graph of a cumulative relative frequency distribution.
1) Calculate relative frequencies.
2) Calculate cumulative relative frequencies.
3) Graph the cumulative relative frequencies.


Ogive (Cont'd)
The ogive can be used to answer questions like:
What telephone bill value is at the 50th percentile?
Around $35.

Agenda
Types of Data and Information
Graphical and Tabular Techniques for
Nominal Data
Graphical Techniques for Interval
Data
Describing Time-Series Data


Describing Time Series Data


Observations measured at the same point in time
are called cross-sectional data.
Observations measured at successive points in
time are called time-series data.
Time-series data are graphed on a line chart, which
plots the value of the variable on the vertical axis
against the time periods on the horizontal axis.

Example
We recorded the monthly average retail
price of gasoline since 1978.
Draw a line chart to describe these data
and briefly describe the results.


Example

[Figure: line chart of the monthly average retail price of gasoline; horizontal axis: month number, 1 to 337; vertical axis: price, 0 to 3.5]



Agenda
Types of Data and Information
Graphical and Tabular Techniques for
Nominal Data
Graphical Techniques for Interval Data
Describing Time-Series Data

Describing the Relationship Between Two Variables
Two Nominal Variables

Relationship between Two Nominal Variables
So far we've looked at tabular and graphical
techniques for one variable (either nominal or
interval data).
A cross-classification table (or cross-tabulation
table) is used to describe the relationship between
two nominal variables.
A cross-classification table lists the frequency of
each combination of the values of the two
variables

Example
In a major North American city there are four competing
newspapers: the Post, Globe and Mail, Sun, and Star.
To help design advertising campaigns, the advertising
managers of the newspapers need to know which segments of
the newspaper market are reading their papers.
A survey was conducted to analyze the relationship between
newspapers read and occupation.
A sample of newspaper readers was asked to report which
newspaper they read: Globe and Mail (1), Post (2), Star (3),
Sun (4), and to indicate whether they were a blue-collar worker
(1), white-collar worker (2), or professional (3).

Example
By counting the number of times each of the 12 combinations occurs,
we produced the table:

                        Occupation
Newspaper   Blue Collar   White Collar   Professional   Total
G&M              27            29             33           89
Post             18            43             51          112
Star             38            21             22           81
Sun              37            15             20           72
Total           120           108            126          354


Example
If occupation and newspaper are related, then there will be differences in
the newspapers read among the occupations. An easy way to see this is
to convert the frequencies in each column to relative frequencies in each
column. That is, compute the column totals and divide each frequency by
its column total.
                        Occupation
Newspaper   Blue Collar     White Collar    Professional
G&M         27/120 = .23    29/108 = .27    33/126 = .26
Post        18/120 = .15    43/108 = .40    51/126 = .40
Star        38/120 = .32    21/108 = .19    22/126 = .17
Sun         37/120 = .31    15/108 = .14    20/126 = .16
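
The column relative frequencies can be recomputed with a short sketch. The counts come from the cross-classification table; the variable names are made up for this example.

```python
# Counts from the cross-classification table.
# Row order: Blue Collar, White Collar, Professional.
table = {
    "G&M":  [27, 29, 33],
    "Post": [18, 43, 51],
    "Star": [38, 21, 22],
    "Sun":  [37, 15, 20],
}
occupations = ["Blue Collar", "White Collar", "Professional"]

# Column totals: one total per occupation.
column_totals = [sum(row[j] for row in table.values()) for j in range(3)]

# Divide each frequency by the total of its own column.
column_relative = {
    paper: [row[j] / column_totals[j] for j in range(3)]
    for paper, row in table.items()
}

for paper, shares in column_relative.items():
    print(paper, [f"{share:.2f}" for share in shares])
```

Each column of `column_relative` sums to 1, so the three occupations can be compared even though the groups differ in size.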


Example
Interpretation: The relative frequencies in columns 2 and 3 are similar,
but there are large differences between columns 1 and 2 and between
columns 1 and 3.

This tells us that blue collar workers tend to read different newspapers
from both white collar workers and professionals and that white collar workers and
professionals are quite similar in their newspaper choice.

Graphing the Relationship Between Two Nominal Variables
Use the data from the cross-classification table to create bar charts
Professionals tend to read the Post more than
twice as often as the Star or Sun
(51 readers versus 22 and 20 in the table).


Agenda
Types of Data and Information
Graphical and Tabular Techniques for Nominal
Data
Graphical Techniques for Interval Data
Describing Time-Series Data
Describing the Relationship Between Two
Variables
Two Nominal Variables

Two Interval Variables



Graphing the Relationship Between Two Interval Variables
Moving from nominal data to interval data, we are
frequently interested in how two interval variables are
related.
To explore this relationship, we employ a scatter
diagram, which plots two variables against one another.
The independent variable is labeled X and is usually
placed on the horizontal axis, while the other, dependent
variable, Y, is mapped to the vertical axis.


Example
A real estate agent wanted to know to what extent the selling
price of a home is related to its size. To acquire this
information he took a sample of 12 homes that had recently
sold, recording the price in thousands of dollars and the size
in hundreds of square feet. These data are listed in the
accompanying table. Use a graphical technique to describe
the relationship between size and price.
Size    23   18   26   20   22   14   33   28   23   20   27   18
Price  315  229  355  261  234  216  308  306  289  204  265  195


Example
It appears that in fact there is a relationship,
that is, the greater the house size the greater
the selling price


Patterns of Scatter Diagrams


Linearity and Direction are two concepts we are interested in.

[Figure: four scatter diagrams]
Positive Linear Relationship
Negative Linear Relationship
Non-Linear Relationship
No Relationship


Summary

                                     Interval Data       Nominal Data
Single Set of Data                   Histogram, Ogive    Frequency and Relative Frequency Tables, Bar and Pie Charts
Relationship Between Two Variables   Scatter Diagram     Cross-classification Table, Bar Charts


Agenda
Introduction
Measures of Central Location
Measures of Variability
Measures of Relative Standing


Numerical Descriptive Techniques


Measures of Central Location
Mean, Median, Mode

Measures of Variability
Range, Standard Deviation, Variance,
Coefficient of Variation

Measures of Relative Standing


Percentiles, Quartiles

Agenda
Introduction
Measures of Central Location


Measures of Central Location


Usually, we focus our attention on two
types of measures when describing
population characteristics:

Central location (e.g. average)


Variability or spread

The measure of central location reflects the locations of all the actual data points.

Measures of Central Location


The measure of central location reflects the locations of all the actual data points.
How?
With one data point, clearly the central location is at the point itself.
With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
But if the third data point appears on the left-hand side of the midrange, it should pull the central location to the left.

Arithmetic Mean
The arithmetic mean, also called the average or simply the mean, is the most popular and useful
measure of central location.
It is computed by simply adding up all the observations and dividing by the total
number of observations:
Mean = (Sum of the observations) / (Number of observations)
The arithmetic mean for a sample is denoted with an x-bar: $\bar{x}$

Notation
When referring to the number of observations in a population, we use the uppercase letter N.
When referring to the number of observations in a sample, we use the lowercase letter n.
The arithmetic mean for a population is denoted with the Greek letter mu: $\mu$

Statistics is a pattern language

            Population    Sample
Size        $N$           $n$
Mean        $\mu$         $\bar{x}$

Mean(Contd)

Population Mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$


Mean(Contd)
The mean is appropriate for describing interval data, e.g. heights of people, marks of student papers, etc.
It is seriously affected by extreme values, called outliers.
E.g. if Bill Gates moved into any neighborhood, the average household income
for that neighborhood would increase dramatically beyond what it was previously!

The Arithmetic Mean


Example
The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33,
14, 8, 0, 9, 22 hours. Find the mean time on the Internet.

$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{x_1 + x_2 + \dots + x_{10}}{10} = \frac{0 + 7 + \dots + 22}{10} = 11.0$
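The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function name `arithmetic_mean` is chosen here for illustration and is not part of the course material.

```python
# Internet usage (hours) for the 10 adults in the example above.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


x_bar = arithmetic_mean(hours)
print(x_bar)  # 11.0

# Deviations from the mean always sum to zero (up to rounding error).
deviations = [x - x_bar for x in hours]
print(sum(deviations))
```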

Example
Suppose the telephone bills of Example 2.1 represent
the population of measurements. The population mean is

$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{x_1 + x_2 + \dots + x_{200}}{200} = \frac{42.19 + 38.45 + \dots + 45.77}{200} = 43.59$

Properties of Mean
Calculated by using every data point.
Every set of interval data has a unique mean.
The sum of deviations from the mean is 0.
Affected by extreme (very large or very small) values.
Not meaningful for nominal or ordinal data.
Useful for comparing 2 or more data sets.

Median
The median is calculated by placing all the observations in
order; the observation that falls in the middle is the median.
Data: {0, 7, 12, 5, 14, 8, 0, 9, 22}      N = 9 (odd)
Sort them bottom to top, find the middle:
0 0 5 7 8 9 12 14 22
median = 8
Data: {0, 7, 12, 5, 14, 8, 0, 9, 22, 33}      N = 10 (even)
Sort them bottom to top, there are two elements in the middle:
0 0 5 7 8 9 12 14 22 33
median = (8 + 9) / 2 = 8.5
Sample and population medians are computed the same way.
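The odd and even cases can be sketched in Python as follows. The helper name `median` is an illustrative choice, and the code simply follows the sort-and-pick-the-middle rule described on this slide.

```python
def median(values):
    """Middle observation of the sorted data; average of the two middle ones if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [0, 7, 12, 5, 14, 8, 0, 9, 22]        # N = 9
even_data = [0, 7, 12, 5, 14, 8, 0, 9, 22, 33]   # N = 10

print(median(odd_data))   # 8
print(median(even_data))  # 8.5
```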

Properties of Median
Calculated by using only 1 or at most 2 values.
Every set of interval data has a unique median.
Not affected by extreme values.
Can be calculated for ordinal data as well,
but can't be interpreted as the centre of location.

Mode
The mode of a set of observations is the value that
occurs most frequently. Sometimes we say
MODE = PEAK of a curve.
A set of data may have one mode (or modal class),
or two modes, or more modes.
Mode can be used for all data types, although mainly
used for nominal data.
For populations and large samples, the modal class is preferable.
Sample and population modes are computed the same way.

Mode(Contd)
E.g. Data: {0, 7, 12, 5, 14, 8, 0, 9, 22, 33}
N=10
Which observation appears most often?
The mode for this data set is 0. How about
this as a measure of central location?
In a small sample, it may not be a good measure.
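A small Python sketch of the mode calculation is shown below. The helper `modes` is a name introduced here; it returns every value tied for the highest frequency, so multi-modal data can be detected. The second data set is a made-up illustration, not course data.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


data = [0, 7, 12, 5, 14, 8, 0, 9, 22, 33]
print(modes(data))             # [0]

# Hypothetical bimodal data set, for illustration only.
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```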


Mode(Contd)
The mode may not be unique, e.g. there are 2 modes for bimodal data.
Note: if you are using Excel for your data
analysis and your data is multi-modal (i.e.
there is more than one mode), Excel only
calculates the smallest one.
You will have to use other techniques (e.g. a histogram) to determine if your data is bimodal,
trimodal, etc.

Properties of Mode
Not affected by extreme values.
Multiple modes are possible, hence not a good
measure of central location.
Sometimes no mode exists, e.g. when all observations
occur with the same frequency.
Can be calculated for nominal data as well,
but can't be interpreted as the centre of location.

Mean, Median, Mode


If a distribution is symmetrical, the mean,
median and mode may coincide
[Figure: symmetric distribution with the mode, median, and mean marked at the same point]

Mean, Median, Mode(Contd)


If a distribution is asymmetrical, say skewed to the
left or to the right, the three measures may differ.
E.g.:
A positively skewed distribution (skewed to the right)
[Figure: right-skewed curve with the mode, median, and mean marked]
A negatively skewed distribution (skewed to the left)
[Figure: left-skewed curve with the mean, median, and mode marked]
Note: The median is not as sensitive as the mean to skewness.

Mean, Median, Mode(Contd)


If data are symmetric, the mean, median,
and mode will be approximately the same.
If data are multimodal, report the mean,
median and/or mode for each subgroup.
If data are skewed, report the median.


About Ordinal & Nominal Data


For ordinal and nominal data the calculation
of the mean is not valid.
Median is appropriate for ordinal data.
For nominal data, a mode calculation is
useful for determining highest frequency but
not central location.


Course Marks: Mean & Median


Example: Suppose you got 35 marks on an exam and the average was 45 marks.
What kind of result would you expect? Failed?
Not sure. It depends on the actual marks of all students.
Suppose the marks are:
15, 20, 25, 25, 25, 30, 35, 75, 100, 100
Congratulations! Good job, you're the fourth highest!
What if you were told that the median was 27.5, i.e. (25 + 30)/2? Would you worry again?

Mean, Median, Mode: Which is Best


The mean is generally the first choice.
In the following scenarios, the median is the best:
there are extreme observations
we want to determine the rank of a particular value relative to the data set
The mode is rarely the best measure.

Geometric Mean
The geometric mean is used when the variable is a
growth rate or rate of change, such as the value of
an investment over periods of time.
For a given series of rates of return $R_1, R_2, \dots, R_n$, the nth-period return is
calculated by:
$(1 + R_1)(1 + R_2) \cdots (1 + R_n)$
If the rate of return was $R_g$ in every period, the nth-period return would
be calculated by:
$(1 + R_g)^n$
The geometric mean $R_g$ is selected such that
$(1 + R_1)(1 + R_2) \cdots (1 + R_n) = (1 + R_g)^n$
that is,
$R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$

Finance Example
Suppose a 2-year investment of $1,000 grows by 100% to
$2,000 in the first year, but loses 50% from $2,000 back to
the original $1,000 in the second year. What is the average
return?
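Using the arithmetic mean,
$\frac{R_1 + R_2}{2} = \frac{100\% + (-50\%)}{2} = 25\%$ (misleading)
This would indicate having more than $1,000 at the end of the second
year; however, in fact we only have $1,000.
Solving for the geometric mean yields a rate of 0% (more precise):
$R_g = \sqrt[n]{\prod_{i=1}^{n} (1 + R_i)} - 1 = \sqrt{(1 + 1)(1 + (-0.5))} - 1 = 0$
The uppercase Greek letter Pi ($\prod$) represents a product of terms.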
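The contrast between the two averages in this example can be reproduced with a short Python sketch. The function name `geometric_mean_return` is an illustrative choice; returns are written as decimals (100% = 1.00, -50% = -0.50).

```python
def geometric_mean_return(returns):
    """n-th root of the product of (1 + R_i), minus 1."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth ** (1.0 / len(returns)) - 1.0


returns = [1.00, -0.50]  # +100% in year 1, -50% in year 2

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(arithmetic)  # 0.25, which overstates the growth
print(geometric)   # 0.0, which matches ending with the original $1,000
```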



Measures of Central Location Summary


Compute mean to
Describe the central location of a single set of
interval data
Compute median to
Describe the central location of a single set of
ordinal or interval data (with extreme observations)
Compute mode to
Describe a single set of nominal, ordinal or interval
data
Compute Geometric mean to
Describe a single set of interval data based on
growth rates

Agenda

Introduction
Measures of Central Location
Measures of Variability


Measures of variability
Measures of central location fail to tell
the whole story about the distribution.
A question of interest still remains
unanswered:
How much are the observations spread out
around the mean value?


Why not use mean


Observe two data sets:
The average value provides
a good representation of the
observations in the data set.

Small variability

This data set is now


changing to...


Why not use mean(Contd)


Observe two data sets:
The average value provides
a good representation of the
observations in the data set.

Small variability

Larger variability

The same average value does not


provide as good representation of the
observations in the data set as before.


Range
The range is the simplest measure of variability, and is
calculated as:
Range = Largest observation - Smallest observation
E.g. Data set: {4, 4, 4, 4, 4, 50}       Range = 46
     Data set: {4, 8, 15, 24, 39, 50}    Range = 46
The range is the same in both cases, but the data
sets have very different distributions.
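As a quick check of the two data sets above, a minimal Python sketch (the helper name `value_range` is chosen here for illustration):

```python
def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


set_1 = [4, 4, 4, 4, 4, 50]
set_2 = [4, 8, 15, 24, 39, 50]

print(value_range(set_1))  # 46
print(value_range(set_2))  # 46
```

Both calls give the same result even though the observations are spread very differently between the two extremes.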

Range(Contd)

But, how do all the observations spread out?


The range cannot assist in answering this question
Range

? ? ?

Smallest
observation

Largest
observation


Variance
Variance and its related measure,
standard deviation, are arguably the most
important statistics. Used to measure
variability, they also play a vital role in
almost all statistical inference procedures.
Population variance is denoted by $\sigma^2$
(lowercase Greek letter sigma, squared).
Sample variance is denoted by $s^2$
(lowercase s, squared).

Statistics is a pattern language

            Population    Sample
Size        $N$           $n$
Mean        $\mu$         $\bar{x}$
Variance    $\sigma^2$    $s^2$

Variance(Contd)

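The variance of a population is:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
where $\mu$ is the population mean and $N$ is the population size.

The variance of a sample is:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
where $\bar{x}$ is the sample mean. Note the denominator: sample size minus one!
We will discuss the reason later.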

Variance(Contd)
As you can see, you have to calculate the sample mean
in order to calculate the sample variance.

Alternatively, there is a shortcut formula
to calculate the sample variance directly from the
data without the intermediate step of
calculating the mean. It is given by:
$s^2 = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right]$

Why not use the sum of deviations...


Consider two small populations:
A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16

The mean of both populations is 10, but the measurements in B are more dispersed than those in A.
Any good measure of dispersion should agree with this observation.
Can the sum of deviations be a good measure of variability?

A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2      Sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6      Sum = 0

The sum of deviations is zero for all populations and
therefore is not a good measure of variability.

Let us calculate the variances


$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$

$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$

Why is the variance defined as
the average squared deviation
rather than the sum of squared
deviations?

Why not use the sum of squared deviations...


Let us calculate the sum of squared deviations for both
data sets in this example.
Which data set has a larger dispersion?
Data set B is more dispersed around the mean.

Data set A: {1, 1, 1, 1, 1, 3, 3, 3, 3, 3}
Data set B: {1, 5}

Why not use the sum of squared deviations...

Sum of squared deviations for A = 5(1-2)^2 + 5(3-2)^2 = 10
Sum of squared deviations for B = (1-3)^2 + (5-3)^2 = 8
Sum_A > Sum_B. This is inconsistent
with the observation that set B is
more dispersed.

Data set A: {1, 1, 1, 1, 1, 3, 3, 3, 3, 3}
Data set B: {1, 5}

How about averaged squared deviations...


When calculated on a per-observation basis (variance),
the data set dispersions are properly ranked.
$\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$
$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$
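The comparison of data sets A and B can be sketched in Python as follows. The function names are illustrative; both sets are treated as populations, so the sum of squared deviations is divided by N.

```python
def sum_squared_deviations(values):
    """Sum of squared deviations from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values)


def population_variance(values):
    """Average squared deviation (divide by the population size N)."""
    return sum_squared_deviations(values) / len(values)


set_a = [1, 1, 1, 1, 1, 3, 3, 3, 3, 3]
set_b = [1, 5]

# The raw sums rank the sets the wrong way round...
print(sum_squared_deviations(set_a), sum_squared_deviations(set_b))  # 10.0 8.0
# ...while the per-observation averages rank them correctly.
print(population_variance(set_a), population_variance(set_b))        # 1.0 4.0
```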


Application
Example
The following sample consists of the
number of jobs six students applied for:
17, 15, 23, 7, 9, 13.
Find its mean and variance.
What are we looking to calculate?

Sample Mean & Variance


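Sample Mean
$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$ jobs

Sample Variance
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(17-14)^2 + (15-14)^2 + \dots + (13-14)^2}{6 - 1} = \frac{166}{5} = 33.2$

Sample Variance (shortcut method)
$s^2 = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right] = \frac{1}{5} \left[ 1342 - \frac{84^2}{6} \right] = 33.2$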
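For the job-application sample (17, 15, 23, 7, 9, 13), both routes to the sample variance can be sketched in Python. The variable names are illustrative, and the shortcut form used is the usual textbook one, sum of squares minus (sum)^2/n, all divided by n - 1.

```python
jobs = [17, 15, 23, 7, 9, 13]
n = len(jobs)

# Sample mean.
x_bar = sum(jobs) / n

# Definitional formula: squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in jobs) / (n - 1)

# Shortcut formula: no separate mean calculation is needed.
s2_shortcut = (sum(x * x for x in jobs) - sum(jobs) ** 2 / n) / (n - 1)

print(x_bar)          # 14.0
print(s2_definition)  # 33.2
print(s2_shortcut)    # 33.2
```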


Standard Deviation
The standard deviation is the square root of
the variance.
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$

Statistics is a pattern language

                      Population    Sample
Size                  $N$           $n$
Mean                  $\mu$         $\bar{x}$
Variance              $\sigma^2$    $s^2$
Standard Deviation    $\sigma$      $s$

Mean Absolute Deviation


There is another measure of deviation: the Mean Absolute Deviation
(MAD), which is calculated by averaging the absolute
values of the deviations. However, this statistic is rarely
used.

$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$

E.g. Given data set {17, 15, 23, 7, 9, 13}:

$MAD = \frac{|17-14| + |15-14| + |23-14| + |7-14| + |9-14| + |13-14|}{6} = \frac{26}{6} = 4\frac{1}{3}$
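The MAD example can be verified with a minimal Python sketch; the function name `mean_absolute_deviation` is introduced here for illustration.

```python
def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean."""
    x_bar = sum(values) / len(values)
    return sum(abs(x - x_bar) for x in values) / len(values)


jobs = [17, 15, 23, 7, 9, 13]
mad = mean_absolute_deviation(jobs)
print(mad)  # 26/6, about 4.33
```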

Measures of Variability Summary


If data are symmetric, with no serious outliers, use
range and standard deviation.
If comparing variation across two data sets, use
coefficient of variation.
The measures of variability introduced in this
section can be used only for interval data.
The next section will discuss a measure that can
be used to describe the variability of ordinal data.
There are no measures of variability for nominal
data.

Agenda
Introduction
Measures of Central Location
Measures of Variability
Measures of Relative Standing


Measures of Relative Standing


Measures of relative standing are designed to provide
information about the position of particular values relative
to the entire data set.
Percentile: the Pth percentile is the smallest point in a
distribution at or below which P percent of the cases are
found.
Example: Suppose your score is the 60th percentile of a
GMAT test. That is,
[Figure: distribution of scores; 60% of all the scores lie below your score and 40% lie above it]
Note: The 60th percentile doesn't mean you scored 60% on the
exam. It means that 60% of your peers scored lower than you on
the exam.

Quartiles
We have special names for the 25th, 50th, and 75th
percentiles, namely quartiles.
The first or lower quartile is labeled Q1 = 25th percentile.
The second quartile, Q2 = 50th percentile (also the
median).
The third or upper quartile, Q3 = 75th percentile.
We can also convert percentiles into quintiles (fifths) and
deciles (tenths).

Quartiles vs. Variability


Quartiles can provide an idea about the shape of a histogram

[Figure: two histograms with Q1, Q2, and Q3 marked]
Positively skewed histogram: (Q3 - Q2) > (Q2 - Q1)
Negatively skewed histogram: (Q3 - Q2) < (Q2 - Q1)

Commonly Used Percentiles


First (lower) decile             = 10th percentile
First (lower) quartile, Q1       = 25th percentile
Second (middle) quartile, Q2     = 50th percentile
Third quartile, Q3               = 75th percentile
Ninth (upper) decile             = 90th percentile

Location of Percentiles
The following formula allows us to
approximate the location of any percentile:
$L_P = (n + 1) \frac{P}{100}$
where $L_P$ is the location of the Pth percentile.

Location of Percentiles(Contd)
Given the data :
0 0 5 7 8 9 12 14 22 33
Where is the location of the 25th percentile?
0 0 5 7 8 9 12 14 22 33

L25 = (10+1)(25/100) = 2.75


The 25th percentile is three-quarters of the distance between
the second observation (which is 0) and the third observation (which is 5).
Three-quarters of the distance is (0.75)(5 - 0) = 3.75; because
the second observation is 0, the 25th percentile is
0 + 3.75 = 3.75

Location of Percentiles(Contd)
What about the upper quartile?
L75 = (10+1)(75/100) = 8.25
0 0 5 7 8 9 12 14 22 33
It is located one-quarter of the distance between the eighth
and the ninth observations, which are 14 and 22, respectively.
One-quarter of the distance is: (.25)(22 - 14) = 2, which
means the 75th percentile is at: 14 + 2 = 16
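The location rule and the interpolation step used in these two examples can be sketched in Python. The function name `percentile` is an illustrative choice, and clamping to the smallest or largest observation when the location falls outside the data is an assumption made here, not something the slides specify.

```python
def percentile(values, p):
    """Percentile via the location L_P = (n + 1) * P / 100, with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100   # 1-based, possibly fractional position
    lower = int(location)          # whole part of the position
    fraction = location - lower    # fractional part of the position
    if lower < 1:                  # assumption: clamp below the first observation
        return ordered[0]
    if lower >= n:                 # assumption: clamp above the last observation
        return ordered[-1]
    below = ordered[lower - 1]
    above = ordered[lower]
    return below + fraction * (above - below)


data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]
print(percentile(data, 25))  # 3.75
print(percentile(data, 75))  # 16.0
print(percentile(data, 50))  # 8.5
```

Note that statistical packages use several slightly different percentile conventions, so library results may differ from this rule.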


Location of Percentiles(Contd)
[Figure: sorted data 0 0 5 7 8 9 12 14 22 33, with position 2.75 corresponding to the value 3.75 and position 8.25 corresponding to the value 16]

$L_P$ determines the position in the data set where the percentile value
lies, not the percentile itself.
We have already shown how to find the median, which is the 50th
percentile. Its location is
$L_{50} = (10 + 1) \frac{50}{100} = 5.5$
so it is the 5.5th observation: halfway between the fifth and sixth observations (in the middle
between 8 and 9), that is, (8 + 9)/2 = 8.5.

Interquartile Range
The quartiles can be used to create another
measure of variability, the interquartile range,
which is defined as follows:

Interquartile range = Q3 - Q1
The interquartile range measures the spread of the
middle 50% of the observations.
Large values of this statistic mean that the 1st and 3rd
quartiles are far apart indicating a high level of
variability.
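As a sketch, the interquartile range of the ten Internet-usage observations used earlier (0 0 5 7 8 9 12 14 22 33) can be computed in Python. The helper `quartile_value` is a name introduced here; it applies the (n + 1)P/100 location rule with linear interpolation and assumes the location falls strictly inside the data.

```python
def quartile_value(ordered, p):
    """Value at location (n + 1) * p / 100 in already-sorted data (location assumed inside the data)."""
    location = (len(ordered) + 1) * p / 100
    lower = int(location)
    fraction = location - lower
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data = sorted([0, 7, 12, 5, 14, 8, 0, 9, 22, 33])

q1 = quartile_value(data, 25)
q3 = quartile_value(data, 75)
iqr = q3 - q1

print(q1, q3, iqr)  # 3.75 16.0 12.25
```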

1. It is a summary.

2. It is also a guideline for selecting techniques.