Statistics Lecture Notes


Spring 2014

Nursel S. Ruzgar

Mathematics Department

nruzgar@ryerson.ca

416-979 5000/ext. 3173

Discussion of Syllabus

Required Text:

Solved problems in Statistics, Part I- P. Ghargbouri, B. Todorow

Exercises in Statistics, Part I- P. Gharghbouri, B. Todorow

Meets

Mondays: 2:00-5:00pm-KHE221,

Wednesdays: 2:00-5:00pm- KHE221

Office Hours: Tuesdays: 5pm-5:45pm-VIC707

Labs:

Section 2: Fridays: 13:00-15:00pm-ENGLG12

Section 3: Fridays: 16:00-18:00pm-ENGLG12

Section 4: Wednesdays: 11:00-13:00pm-ENG102

2/153

Course Web Site: Blackboard

There will be a quiz each week, except the

first week.

3/153

Academic Dishonesty (Strongly

discouraged)

4/153

Course Objectives

Identify and formulate problems where

statistics can have an impact.

See the relevance of statistics. Apply what has

been learned to other engineering courses and

to career practice.

Understand the basics of Statistics and

Probability Theory

Interpret the statistical results and retrieve

necessary information to help decision making

Develop the bases for the other courses.

5/153

Evaluation

30% Midterm Test (100 minutes), 10:00am,

Saturday, June 14, 2014

60% Final exam (180 minutes), room: TBA

10% Lab quizzes

6/153

OUTLINE Lecture 1

Statistics-Descriptive and Inferential Statistics

Populations, Parameters, and Samples,

Statistic, Variable

Data & Types of Data

Interval, Nominal Data, Ordinal

data

Scatter Diagrams, Contingency Table

Line Chart

7/153

In today's world,

we are constantly surrounded by statistics and statistical information.

For example:

Political Polls, Customer Surveys

Interest rates, Economic Predictions

Course Marks, Job Market Information

How can we make sense out of all these data?

How can we differentiate valid from flawed claims?

What is Statistics?!

8/153

Statistics is a way to get information from data.

Data → Statistics → Information

Data: numerical facts, collected together for reference or information.

Information: knowledge communicated concerning some particular fact.

Definitions: Oxford English Dictionary

9/153

Example

A student is somewhat apprehensive about the statistics

course because the student believes the myth that the

course is difficult. The professor provides last term's marks to

the student. What information can the student obtain from this

list?

Data → Statistics → Information

Data: 95, 89, 70, 65, 78, 57, ...

Information about the statistics class: typical mark (i.e. average), mark distribution, etc.

MTH410 S14- Lecture 1

10/153

Sample, Statistics, Variable

Population: group of all items of interest to the statistics

practitioner.

Variable: a characteristic of a population or sample that is of interest to us.

11/153

Population

A population is the entire set of all items under study.

Frequently very large, sometimes infinite.

Sample

A sample is a set of data drawn from the population.

Potentially very large, but less than the population.

E.g. a sample of 765 voters surveyed in exit polls on election day.

12/153

Parameter

A descriptive measure of a population.

In most applications of inferential statistics, the parameter

represents the information we need.

E.g. The proportion of the 5 million Florida voters who voted

for Obama.

Statistic

A descriptive measure of a sample.

E.g. The proportion of the sample of 765 Floridians who voted

for Obama.

13/153

[Diagram: a sample is a subset of the population. Populations have parameters; samples have statistics. Inference goes from the sample statistic to the population parameter.]

14/153

Types of Statistics

Descriptive statistics: involves the

arrangement, summary, and presentation of

data, to enable meaningful interpretation, and

to support decision making.

Inferential Statistics: a set of methods used

to draw conclusions about characteristics of a

population based on sample data.

15/153

Descriptive Statistics

Descriptive Statistics is a set of methods of organizing,

summarizing, and presenting data in a convenient and

informative way. These methods include:

Graphical Techniques

Numerical Techniques

The actual method used depends on what information we

would like to extract. Are we interested in:

measure(s) of variability (dispersion)?

MTH410 S14- Lecture 1

16/153

Inferential Statistics

Descriptive statistics describe the data set that's being

analyzed, but don't allow us to draw any conclusions

or make any inferences about the data. Hence we need

another branch of statistics: inferential statistics.

Inferential statistics is also a set of methods, but it is used

to draw conclusions or inferences about characteristics of

populations based on data from a sample.


17/153

Statistical Inference

Statistical inference is the process of making an estimate,

prediction, or decision about a population based on a sample.

[Diagram: inference goes from the sample statistic to the population parameter.]

What can we infer about a population's parameters

based on a sample's statistics?

18/153

Statistical Inference (Cont'd)

We use statistics to make inferences about parameters.

Therefore, we can make an estimate, prediction, or

decision about a population based on sample data.

Then, we can apply what we know about a sample to the

larger population from which the sample was drawn!

What is the purpose or/and which kind of benefits

19/153

Rationale:

Large populations make investigating each member

impractical, extremely expensive and time-consuming.

Easier and cheaper to take a sample and make estimates

about the population from the sample.

However:

Such conclusions and estimates are not always going to be

correct.

Hence, we have to build into the statistical inference

measures of reliability, namely confidence level and

significance level.

MTH410 S14- Lecture 1

20/153

Confidence and Significance Levels

The confidence level is the proportion of times that an

estimating procedure will be correct, if the sampling

procedure were repeated a very large number of times.

E.g. a confidence level of 95% means that estimates based on

this form of statistical inference will be correct 95% of the time.

When the purpose of the statistical inference is to draw a

conclusion about a population, the significance level

measures how frequently the conclusion will be wrong in

the long run.

E.g. a 5% significance level means that, in repeated samples,

this type of conclusion will be wrong 5% of the time.

21/153

(Cont'd)

So if we use α (Greek letter alpha) to

represent significance level (how frequently

the conclusion will be wrong), then our

confidence level is 1 − α.

This relationship can also be stated as:

Confidence Level + Significance Level = 1

22/153

(Cont'd)

Consider a statement from polling data you

may hear about in the news these days:

"This poll is considered accurate within 3.4

percentage points, 19 times out of 20."

Here the confidence level is 95%

(19/20 = 0.95), and the significance level is

5%.
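The arithmetic behind the poll statement can be checked with a short sketch. The variable names are illustrative only; the numbers are the ones quoted in the poll statement.

```python
# "19 times out of 20" expressed as a confidence level and a significance level.
times_correct = 19
times_total = 20

confidence_level = times_correct / times_total   # 19/20
significance_level = 1 - confidence_level        # alpha = 1 - confidence level

print(f"confidence level   = {confidence_level:.2f}")
print(f"significance level = {significance_level:.2f}")
```

The two values always add up to 1, which is the relationship stated on the previous slide.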

MTH410 S14- Lecture 1

23/153

Graphical Descriptive

Techniques

2014/5/8

24/153

Agenda

Types of Data and Information

Graphical and Tabular Techniques for Nominal

Data

Graphical Techniques for Interval Data

Describing the Relationship Between Two

Variables

MTH410 S14- Lecture 1

25/153

Definitions

A variable is some characteristic of a population or sample.

Typically denoted with a capital letter: X, Y, Z

E.g. student marks. Not all students achieve the same mark. The

marks vary from student to student, hence the name "variable".

Values of a variable are all possible observations of the variable.

E.g. student marks: all integers between 0 and 100.

Data are the observed values of a variable.

E.g. marks of 6 students in an exam: {67, 74, 71, 83, 93, 48}

26/153

Knowing the type of data is necessary to properly

select the technique to be used when analyzing data.

Type of analysis allowed for each type of data

Nominal data - counting the number of observations in each

category

Ordinal data - computations based on an ordering process

27/153

Variable - a characteristic of a population or

sample that is of interest to us.

The waiting time for medical services

The score of a student in the Stats Exam.

Nominal data are categorical observations

Ordinal data are ordered categorical observations

28/153

Data (at least for purposes of Statistics)

fall into three main groups:

Interval Data

Nominal Data

Ordinal Data

29/153

Interval Data

Real numbers, i.e. weights, prices,

distance, etc.

Also called quantitative or numerical data.

Arithmetic operations can be performed on

Interval Data, so it's meaningful to talk

about 2*Weight, or Price + $1.5, and so on.

30/153

Nominal Data

The values of nominal data are categories.

E.g. responses to questions about marital status, coded as:

Single = 1, Married = 2, Divorced = 3, Widowed = 4

Arithmetic operations on these codes don't make any sense (e.g. does Widowed ÷ 2 = Married?!)

Any other numbering system is also valid provided that

each category has a different number assigned to it.

E.g. Another coding system as valid as the previous one:

Single = 7, Married = 4, Divorced = 13, Widowed = 1

MTH410 S14- Lecture 1

31/153

Ordinal Data

Ordinal Data appear to be nominal, but their values have an

order, a ranking to them.

E.g. The most active stocks traded on the NASDAQ in

descending order

MSFT = 1, CSCO = 2, Dell = 3, SunW = 4

Any other numbering system is valid provided the order is

maintained.

E.g. Another coding system as valid as the previous one:

MSFT = 6, CSCO = 11, Dell = 23, SunW = 45

We can say something like the number of stocks traded from:

Microsoft > Cisco or Sun Microsystems < Dell

It is still not meaningful to do arithmetic operations on this kind of data (e.g.

does 2*MSFT = CSCO?!).

MTH410 S14- Lecture 1

32/153

The critical difference between nominal data and ordinal

data is that the values of the latter are in order.

E.g. It is valid for nominal data to have:

Single = 7, Married = 4, Divorced = 13, Widowed = 1

However, it won't be valid for ordinal data:

MSFT = 7, CSCO = 4, Dell = 13, SunW = 1

(The order has changed to SunW, CSCO, MSFT, Dell.

We must keep the order MSFT, CSCO, Dell, SunW.)
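Whether a recoding is acceptable for ordinal data can be checked mechanically: it must rank the items the same way as the original coding. The sketch below uses the stock codes from the slides; the helper name `preserves_order` is an illustrative assumption, not something from the course material.

```python
def preserves_order(old_codes, new_codes, items):
    """Return True if both codings rank the items identically."""
    by_old = sorted(items, key=lambda item: old_codes[item])
    by_new = sorted(items, key=lambda item: new_codes[item])
    return by_old == by_new

stocks = ["MSFT", "CSCO", "Dell", "SunW"]
original = {"MSFT": 1, "CSCO": 2, "Dell": 3, "SunW": 4}
valid_recode = {"MSFT": 6, "CSCO": 11, "Dell": 23, "SunW": 45}
invalid_recode = {"MSFT": 7, "CSCO": 4, "Dell": 13, "SunW": 1}

print(preserves_order(original, valid_recode, stocks))    # order is kept
print(preserves_order(original, invalid_recode, stocks))  # order is broken
print(sorted(stocks, key=lambda s: invalid_recode[s]))    # ranking implied by the invalid codes
```

For nominal data no such check is needed, since any coding that gives each category a distinct number is equally valid.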

MTH410 S14- Lecture 1

33/153

The critical difference between interval data and ordinal data

is that the intervals or differences between values of interval

data are consistent and meaningful.

E.g. The difference between marks of 85 and 80 is the same

five-mark difference as that between 75 and 70.

However for coding system like:

MSFT = 1, CSCO = 2, Dell = 3, SunW = 4

We can see CSCO − MSFT = 1, and SunW − Dell = 1

But we can't conclude that the difference between the

number of stocks traded in Microsoft and Cisco Systems is

the same as the difference in the number of stocks traded

between Dell Computer and Sun Microsystems.

MTH410 S14- Lecture 1

34/153

Knowing the type of data is necessary to properly select

the technique to be used when analyzing data.

[Flowchart: Data → Categorical? If no: interval data. If yes: categorical data → Ranked? If yes: ordinal data. If no: nominal data.]

35/153

All calculations are permitted on interval data.

Only calculations involving a ranking process are

allowed for ordinal data.

For nominal data, we are only

allowed to count the number of observations in each

category.

This leads to the following hierarchy of data.

36/153

Hierarchy of Data

Higher level may be treated as lower level(s).

Interval

Values are real numbers.

All calculations are valid.

Data may be treated as ordinal or nominal.

Ordinal

Values must represent the ranked order of the data.

Calculations based on an ordering process are valid.

Data may be treated as nominal but not as interval.

Nominal

Values are the arbitrary numbers that represent categories.

Only calculations based on the frequencies of occurrence are valid.

Data may not be treated as ordinal or interval.

37/153

[Flowchart: Data → Categorical? N: Interval Data, e.g. integers in {0..100}. Y: Categorical Data → Ranked? Y: Ordinal Data, e.g. {F, D, C, B, A}. N: Nominal Data, e.g. {Pass | Fail}; NO ranked order to data.]

38/153

Agenda

Types of Data and Information

Graphical and Tabular Techniques for

Nominal Data

39/153

Nominal Data

The only allowable calculation on nominal data is to

count the frequency of each value of the variable.

We can summarize the data in a table that presents the

categories and their counts called a frequency

distribution.

A relative frequency distribution lists the categories

and the proportion with which each occurs.

40/153

Relative frequency

It is often preferable to show the relative frequency

(proportion) of observations falling into each class, rather

than the frequency itself.

Class relative frequency = Class frequency / Total number of observations

Relative frequencies are particularly useful when

comparing two or more histograms, and when

the numbers of observations of the samples studied are different.

41/153

Class width

It is generally best to use equal class widths, but

sometimes unequal class widths are called for.

Unequal class widths are used when the frequency

associated with some classes is too low. Then,

several classes are combined to form a

wider and more populated class.

It is possible to form an open ended class at the

higher end or lower end of the histogram.

42/153

Survey

In 2006 total light beer sales in the United States were

approximately 3 million gallons.

With this large a market, breweries often need to know more

about who is buying their product.

The marketing manager of a major brewery wanted to

analyze the light beer sales among college and university

students who do drink light beer.

A random sample of 285 graduating students was asked to

report which of the following is their favorite light beer.

43/153

Example

1. Budweiser Light

2. Busch Light

3. Coors Light

4. Michelob Light

5. Miller Lite

6. Natural Light

7. Other brand

The responses were recorded using the codes. Construct a

frequency and relative frequency distribution for these data

and graphically summarize the data by producing a bar

chart and a pie chart.

MTH410 S14- Lecture 1

44/153

Example

Recorded responses (brand codes) of the 285 students, in the order they appear in the source, 15 per row:

1 1 5 1 3 3 3 7 2 6 1 6 3 4 5
2 5 5 2 1 1 5 1 2 3 3 6 1 5 5
4 1 3 1 6 3 1 1 2 1 1 2 1 1 5
1 2 1 3 7 6 3 7 4 4 2 4 3 5 1
1 1 3 1 4 3 6 1 1 1 1 3 5 5 3
7 6 5 1 7 2 5 3 5 7 5 3 5 1 3
3 1 5 3 5 5 3 1 3 3 4 1 5 5 6
3 6 1 3 2 1 3 1 1 6 1 5 1 5 1
3 3 5 3 6 3 6 3 1 1 1 7 1 5 4
6 1 1 5 5 5 3 6 2 4 7 6 1 1 3
5 1 3 3 6 6 1 3 2 3 1 3 3 1 4
3 5 3 7 1 5 1 5 3 5 2 2 7 3 3
3 1 5 6 6 7 6 7 5 1 5 1 1 3 5
3 1 3 1 3 1 1 3 1 5 2 1 7 3 7
5 2 5 1 5 3 5 1 1 5 3 5 5 1 2
1 1 2 2 5 1 4 4 1 5 3 6 6 3 3
7 3 5 4 1 5 6 1 1 5 5 1 5 5 3
1 1 3 6 1 5 1 5 5 1 7 3 1 1 6
5 1 3 3 1 1 1 1 1 7 1 5 1 1 5

45/153

Frequency and Relative Frequency Distributions

Light Beer Brand    Frequency    Relative Frequency (%)
Budweiser Light     90           31.6
Busch Light         19           6.7
Coors Light         62           21.8
Michelob Light      13           4.6
Miller Lite         59           20.7
Natural Light       25           8.8
Other brands        17           6.0
Total               285          100
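The relative frequency column follows directly from the counts: each frequency is divided by the total number of observations. A minimal sketch using the counts from the table (the dictionary layout and variable names are just one convenient choice):

```python
# Frequencies of each light beer brand, taken from the table in these notes.
counts = {
    "Budweiser Light": 90,
    "Busch Light": 19,
    "Coors Light": 62,
    "Michelob Light": 13,
    "Miller Lite": 59,
    "Natural Light": 25,
    "Other brands": 17,
}

total = sum(counts.values())

# Relative frequency (in percent) = class frequency / total number of observations.
relative = {brand: 100 * n / total for brand, n in counts.items()}

for brand, n in counts.items():
    print(f"{brand:<16}{n:>5}{relative[brand]:>8.1f}%")
print(f"{'Total':<16}{total:>5}")
```

Because each percentage is rounded to one decimal place, the rounded column need not add up to exactly 100.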


46/153

[Bar chart of frequencies by brand, vertical axis from 0 to 100; bar heights 90, 19, 62, 13, 59, 25, 17 for brand codes 1 to 7.]

47/153

[Pie chart (relative frequency) by brand code: 1: 31%, 2: 7%, 3: 22%, 4: 4%, 5: 21%, 6: 9%, 7: 6%.]

48/153

Nominal Data

[This slide repeats the frequency table, the bar chart, and the pie chart side by side.]

All three convey the same information

(based on the same data).

Just different presentation.

49/153

Agenda

Types of Data and Information

Graphical and Tabular Techniques for

Nominal Data

50/153

There are several graphical methods that are used when

the data are interval.

The most important of these graphical methods is the

histogram, which is created by drawing rectangles

whose bases are the intervals and whose heights are the

frequencies.

The histogram is not only a

technique used to summarize interval data, but is also used

to help explain probabilities.

51/153

Building a Histogram

Example: The marketing manager of a long-distance telephone

company conducted a survey of 200 new customers wherein the

first month's bills are recorded. What information can be extracted

from those data?

This manager was only able to find that the smallest bill is $0,

the largest bill is $119.63, and most of the bills are less than $100.

However, much more interesting information can be extracted:

Bill distribution,

Are there many small bills and few large bills?

What is the typical bill?

Are the bills somewhat similar or different?

MTH410 S14- Lecture 1

52/153

Building a Histogram (Cont'd)

1) Collect the Data

2) Create a frequency distribution for the data

a) Determine the number of classes to use

Refer to Table.

With 200 observations, we

should have between 7 & 10

classes; 9 seems the best. For

our purpose, let us pick 8.

Alternatively, number of class intervals = 1 + 3.3 log10(n), where n

is the number of observations; with n = 200 we get 9 after rounding.
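The rule of thumb quoted here (commonly known as Sturges's formula) is easy to evaluate. The sketch below assumes the logarithm is base 10, which is what reproduces the value 9 for 200 observations; the function name is illustrative.

```python
import math

def suggested_classes(n):
    """Number of class intervals suggested by 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)

raw = suggested_classes(200)
print(f"raw value: {raw:.2f}, rounded: {round(raw)}")
```

The formula is only a guide: the notes pick 8 classes for convenience even though it suggests 9.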

2014/5/8

53/153

Building a Histogram(Contd)

1) Collect the Data

2) Create a frequency distribution for the data

a) Determine the number of classes to use. [8]

Look at the range of the data, that is,

Range = Largest Observation − Smallest Observation

Range = $119.63 − $0 = $119.63

Then each class width becomes:

Range ÷ (# of classes) = 119.63 ÷ 8 ≈ 15

54/153

Building a Histogram(Contd)

1) Collect the Data

2) Create a frequency distribution for the data

a) Determine the number of classes to use. [8]

b) Determine how large to make each class. [15]

c) Place the data into each class

each item can only belong to one class;

each class contains observations greater than its

lower limit and less than or equal to its upper limit.

That means there is no overlap between any

two classes.
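Steps a) to c) can be sketched in a few lines. This is only an illustration under stated assumptions: the smallest observation ($0) is placed in the first class, and `sample_bills` is a tiny hypothetical list (some values are quoted elsewhere in these notes, while 15.00 and 15.01 are made-up boundary cases), not the full data set of 200 bills.

```python
import math
from collections import Counter

SMALLEST, LARGEST, N_CLASSES = 0.0, 119.63, 8

# Class width: range / number of classes, rounded up to a convenient whole number.
CLASS_WIDTH = math.ceil((LARGEST - SMALLEST) / N_CLASSES)

def class_index(bill):
    """0-based index of the class (lower, upper] that contains the bill."""
    if not (SMALLEST <= bill <= SMALLEST + N_CLASSES * CLASS_WIDTH):
        raise ValueError(f"bill {bill} is outside the class limits")
    if bill == SMALLEST:
        return 0  # assumption: the smallest observation belongs to the first class
    return math.ceil((bill - SMALLEST) / CLASS_WIDTH) - 1

# Hypothetical mini-sample used only to exercise the function.
sample_bills = [0.00, 15.00, 15.01, 42.19, 38.45, 45.77, 119.63]
frequency = Counter(class_index(b) for b in sample_bills)

for i in range(N_CLASSES):
    lower = SMALLEST + i * CLASS_WIDTH
    upper = lower + CLASS_WIDTH
    print(f"{lower:>5.0f} to {upper:<5.0f} {frequency[i]}")
```

A bill of exactly $15.00 falls in the first class and $15.01 in the second, so no observation can belong to two classes.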

MTH410 S14- Lecture 1

55/153

Building a Histogram(Contd)

1) Collect the Data

2) Create a frequency

distribution for the data.

MTH410 S14- Lecture 1

56/153

Building a Histogram(Contd)

1) Collect the Data

2) Create a frequency distribution for the data.

3) Draw the Histogram

57/153

Example: Interpret

[Histogram: frequency (vertical axis) against bills (horizontal axis), with class limits 0, 15, 30, 45, 60, 75, 90, 105, 120.]

About half of all the bills are small: 71 + 37 = 108

Bills in the middle range: 13 + 9 + 10 = 32

A relatively large number of bills are large: 18 + 28 + 14 = 60

58/153

Bar chart:

All data are nominal,

Each bin is one category,

There is a gap between

two neighbor bins.

Histogram:

Each bin is an interval of values,

There is no gap between two

neighbor bins.

59/153

Shapes of histograms

There are four typical shape characteristics

60/153

Shapes of Histograms

Symmetry

A histogram is said to be symmetric if, when we

draw a vertical line down the center of the histogram,

the two sides are identical in shape and size:

[Figure: three symmetric histograms (frequency against variable).]

61/153

Shapes of Histograms (Cont'd)

Skewness

A skewed histogram is one with a long tail extending to

either the right or the left:

[Figure: a positively skewed histogram (long tail to the right) and a negatively skewed histogram (long tail to the left).]

62/153

Shapes of Histograms (Cont'd)

Modality

A unimodal histogram is one with a single peak,

while a bimodal histogram is one with two peaks:

[Figure: a unimodal histogram and a bimodal histogram (frequency against variable).]

63/153

Modal classes

A modal class is the one with the largest number of

observations.

A unimodal histogram

MTH410 S14- Lecture 1

64/153

Modal classes

A bimodal histogram

A modal class

A modal class

65/153

Bell Shape

A special type of symmetric unimodal histogram

is bell shaped.

[Figure: a bell-shaped histogram (frequency against variable).]

66/153

Many statistical techniques require that the

population be bell shaped.

Drawing the histogram helps verify the shape of

the population in question

67/153

Stem and Leaf Display

Retains information about individual observations that would

normally be lost in the creation of a histogram.

Split each observation into two parts, a stem and a leaf:

e.g. Observation value: 42.19

There are several ways to split it up:

Stem    Leaf
42      19    (split at the decimal point)
4       2     (split at the tens position, rounding to the nearest integer in the ones position)

68/153

Continue this process for all the observations.

Then, use the stems for the classes and

each leaf becomes part of the histogram as

follows:

Stem    Leaf
0       0000000000111112222223333345555556666666778888999999
1       000001111233333334455555667889999
2       0000111112344666778999
3       001335589
4       124445589
5       33566
6       3458
7       022224556789
8       334457889999
9       00112222233344555999
10      001344446699
11      124557889

Thus, we still have access to our

original data points' values!

69/153


70/153

Ogive

An ogive (pronounced "oh-jive") is a graph of

a cumulative relative frequency distribution.

We create an ogive in three steps.

First, from the frequency distribution created earlier,

calculate relative frequencies.

71/153

Relative Frequencies

For example, we had 71 observations in the first class (telephone

bills from $0.00 to $15.00). Hence, the relative frequency for this

class is 71 ÷ 200 (the total # of phone bills) = 0.355 (or 35.5%)

72/153

Ogive (Cont'd)

An ogive is a graph of a cumulative relative frequency distribution.

We create an ogive in three steps

1) Calculate relative frequencies.

2) Calculate cumulative relative frequencies by adding

the current class relative frequency to the previous

class cumulative relative frequency.

(For the first class, its cumulative relative frequency is just its relative

frequency)

2014/5/8

73/153

Cumulative Relative Frequencies

First class: just itself (.355)

Next class: .355 + .185 = .540

Always or by chance?
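The first two steps of building the ogive can be sketched as follows. The class frequencies are the ones annotated on the histogram in these notes (71, 37, 13, 9, 10, 18, 28, 14); listing them in that order from the lowest class to the highest is an assumption, consistent with the quoted values .355 and .185.

```python
from itertools import accumulate

# Frequencies of the eight classes 0-15, 15-30, ..., 105-120 (order assumed).
frequencies = [71, 37, 13, 9, 10, 18, 28, 14]
total = sum(frequencies)

# Step 1: relative frequencies.
relative = [f / total for f in frequencies]

# Step 2: cumulative relative frequencies (running total divided by n).
cumulative = [running / total for running in accumulate(frequencies)]

for rel, cum in zip(relative, cumulative):
    print(f"{rel:.3f}  {cum:.3f}")
```

The last cumulative relative frequency is exactly 1 because every observation falls into some class.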

2014/5/8

74/153

Ogive (Cont'd)

An ogive is a graph of a cumulative relative frequency distribution.

1) Calculate relative frequencies.

2) Calculate cumulative relative frequencies.

3) Graph the cumulative relative frequencies.

2014/5/8

75/153

Ogive(Contd)

The ogive can be

used to answer

questions like:

What telephone bill

value is at the 50th

percentile?

around $27 (the cumulative relative frequency already reaches .540 at $30)
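Reading a percentile off an ogive amounts to linear interpolation between the plotted points. The sketch below is an illustration under assumptions: the class upper limits are 15, 30, ..., 120, and the cumulative relative frequencies are derived from the class frequencies annotated on the histogram (only .355 and .540 are quoted directly in these notes). The function name is hypothetical.

```python
upper_limits = [15, 30, 45, 60, 75, 90, 105, 120]
cumulative = [0.355, 0.540, 0.605, 0.650, 0.700, 0.790, 0.930, 1.000]

def value_at_percentile(p, upper_limits, cumulative, lowest=0.0):
    """Interpolate linearly along the ogive to find the value at proportion p."""
    prev_x, prev_c = lowest, 0.0
    for x, c in zip(upper_limits, cumulative):
        if p <= c:
            return prev_x + (p - prev_c) / (c - prev_c) * (x - prev_x)
        prev_x, prev_c = x, c
    raise ValueError("p must be between 0 and 1")

median_bill = value_at_percentile(0.50, upper_limits, cumulative)
print(f"50th percentile is roughly ${median_bill:.2f}")
```

A value read by eye from the plotted ogive is only approximate, so it may differ somewhat from the interpolated figure.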

2014/5/8

76/153

Agenda

Types of Data and Information

Graphical and Tabular Techniques for

Nominal Data

Graphical Techniques for Interval

Data

Describing Time-Series Data

2014/5/8

77/153

Observations measured at the same point in time

are called cross-sectional data.

Observations measured at successive points in

time are called time-series data.

Time-series data are graphed on a line chart, which

plots the value of the variable on the vertical axis

against the time periods on the horizontal axis.

2014/5/8

78/153

Example

We recorded the monthly average retail

price of gasoline since 1978.

Draw a line chart to describe these data

and briefly describe the results.

2014/5/8

79/153

Example

[Line chart: monthly average retail price of gasoline (vertical axis, 0 to 3.5) against month number (horizontal axis, 1 to 337).]

80/153

Agenda

Types of Data and Information

Graphical and Tabular Techniques for

Nominal Data

Graphical Techniques for Interval Data

Describing Time-Series Data

Describing the Relationship Between Two

Variables

Two Nominal Variables

81/153

Nominal Variables

So far we've looked at tabular and graphical

techniques for one variable (either nominal or

interval data).

A cross-classification table (or cross-tabulation

table) is used to describe the relationship between

two nominal variables.

A cross-classification table lists the frequency of

each combination of the values of the two

variables

2014/5/8

82/153

Example

In a major North American city there are four competing

newspapers: the Post, Globe and Mail, Sun, and Star.

To help design advertising campaigns, the advertising

managers of the newspapers need to know which segments of

the newspaper market are reading their papers.

A survey was conducted to analyze the relationship between

newspapers read and occupation.

A sample of newspaper readers was asked to report which

newspaper they read: Globe and Mail (1), Post (2), Star (3),

Sun (4), and to indicate whether they were blue-collar worker

(1), white-collar worker (2), or professional (3).

2014/5/8

83/153

Example

By counting the number of times each of the 12 combinations occurs,

we produced the following table:

                          Occupation
Newspaper    Blue Collar    White Collar    Professional    Total
G&M          27             29              33              89
Post         18             43              51              112
Star         38             21              22              81
Sun          37             15              20              72
Total        120            108             126             354

84/153

Example

If occupation and newspaper are related, then there will be differences in

the newspapers read among the occupations. An easy way to see this is

to convert the frequencies in each column to relative frequencies in each

column. That is, compute the column totals and divide each frequency by

its column total.

                          Occupation
Newspaper    Blue Collar      White Collar     Professional
G&M          27/120 = .23     29/108 = .27     33/126 = .26
Post         18/120 = .15     43/108 = .40     51/126 = .40
Star         38/120 = .32     21/108 = .19     22/126 = .17
Sun          37/120 = .31     15/108 = .14     20/126 = .16
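The column relative frequencies can be reproduced from the cross-classification counts. A minimal sketch (the nested-dictionary layout and names are one possible choice, not part of the course material):

```python
occupations = ["Blue Collar", "White Collar", "Professional"]

# Cross-classification counts: newspaper -> frequency per occupation.
table = {
    "G&M":  [27, 29, 33],
    "Post": [18, 43, 51],
    "Star": [38, 21, 22],
    "Sun":  [37, 15, 20],
}

# Column totals: number of respondents in each occupation.
column_totals = [sum(row[j] for row in table.values()) for j in range(len(occupations))]

# Divide each frequency by its column total.
column_relative = {
    paper: [row[j] / column_totals[j] for j in range(len(occupations))]
    for paper, row in table.items()
}

for paper, shares in column_relative.items():
    print(f"{paper:<6}" + "".join(f"{share:>8.2f}" for share in shares))
```

Each column of relative frequencies sums to 1, which makes the three occupations directly comparable even though their sample sizes differ.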

2014/5/8

85/153

Example

Interpretation: The relative frequencies in the columns 2 & 3 are similar,

but there are large differences between columns 1 and 2 and between

columns 1 and 3.

[Figure annotations: columns 2 and 3 are marked "similar"; column 1 compared with the others is marked "dissimilar".]

This tells us that blue collar workers tend to read different newspapers

from both white collar workers and professionals and that white collar and

professionals are quite similar in their newspaper choice.

2014/5/8

86/153

Variables

Use the data from the cross-classification table to create bar charts

Professionals tend

to read the Post more than

twice as often as the

Star or Sun (51 against 22 and 20)

2014/5/8

87/153

Agenda

Types of Data and Information

Graphical and Tabular Techniques for Nominal

Data

Graphical Techniques for Interval Data

Describing Time-Series Data

Describing the Relationship Between Two

Variables

Two Nominal Variables

2014/5/8

MTH410 S14- Lecture 1

88/153

Interval Variables

Moving from nominal data to interval data, we are

frequently interested in how two interval variables are

related.

To explore this relationship, we employ a scatter

diagram, which plots two variables against one another.

The independent variable is labeled X and is usually

placed on the horizontal axis, while the other, dependent

variable, Y, is mapped to the vertical axis.

2014/5/8

89/153

Example

A real estate agent wanted to know to what extent the selling

price of a home is related to its size. To acquire this

information he took a sample of 12 homes that had recently

sold, recording the price in thousands of dollars and the size

in hundreds of square feet. These data are listed in the

accompanying table. Use a graphical technique to describe

the relationship between size and price.

Size (hundreds of sq. ft.)      23   18   26   20   22   14   33   28   23   20   27   18
Price (thousands of dollars)   315  229  355  261  234  216  308  306  289  204  265  195

90/153

Example

It appears that in fact there is a relationship,

that is, the greater the house size the greater

the selling price
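Without drawing the scatter diagram, the same tendency can be seen numerically by comparing the average price of the smaller homes with that of the larger homes. This is only a rough illustrative check, not a technique from the lecture; the split into two halves is an arbitrary choice.

```python
# Size in hundreds of square feet, price in thousands of dollars (from the table).
sizes = [23, 18, 26, 20, 22, 14, 33, 28, 23, 20, 27, 18]
prices = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]

# Pair each home's size with its price and order the homes by size.
homes = sorted(zip(sizes, prices))
half = len(homes) // 2
smaller, larger = homes[:half], homes[half:]

mean_price_smaller = sum(price for _, price in smaller) / len(smaller)
mean_price_larger = sum(price for _, price in larger) / len(larger)

print(f"six smallest homes: mean price {mean_price_smaller:.2f}")
print(f"six largest homes:  mean price {mean_price_larger:.2f}")
```

The larger homes sell for noticeably more on average, which matches the upward pattern seen in the scatter diagram.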

2014/5/8

91/153

Linearity and Direction are two concepts we are interested in.

[Figure: example scatter diagrams, including a non-linear relationship and no relationship.]

92/153

Summary

                                      Interval Data        Nominal Data
Single Set of Data                    Histogram, Ogive     Frequency and Relative Frequency Tables, Bar and Pie Charts
Relationship Between Two Variables    Scatter Diagram      Cross-classification Table, Bar Charts

93/153

Agenda

Introduction

Measures of Central Location

Measures of Variability

Measures of Relative Standing

94/153

Measures of Central Location

Mean, Median, Mode

Measures of Variability

Range, Standard Deviation, Variance,

Coefficient of Variation

Measures of Relative Standing

Percentiles, Quartiles

95/153

Agenda

Introduction

Measures of Central Location

96/153

Usually, we focus our attention on two

types of measures when describing

population characteristics:

Central location

Variability or spread

The measure of central location

reflects the locations of all the actual

data points.

97/153

The measure of central location reflects the locations

of all the actual data points.

How?

With one data point, clearly the central location is at the point itself.

With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).

But if the third data point appears on the left hand-side of the midrange, it should pull the central location to the left.

98/153

Arithmetic Mean

The arithmetic mean, also called the average or simply

the mean, is the most popular & useful

measure of central location.

It is computed by simply adding up all the

observations and dividing by the total

number of observations:

Mean = (Sum of the observations) / (Number of observations)

The arithmetic mean for a sample is denoted with an

x-bar: $\bar{x}$

99/153

Notation

When referring to the number of

observations in a population, we use

uppercase letter N

When referring to the number of

observations in a sample, we use lower

case letter n

The arithmetic mean for a population is

denoted with the Greek letter mu: $\mu$

100/153

          Population    Sample
Size      $N$           $n$
Mean      $\mu$         $\bar{x}$

101/153

Mean (Cont'd)

Population Mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

102/153


103/153

Mean (Cont'd)

The mean is appropriate for describing interval data,

e.g. heights of people, marks of student

papers, etc.

It is seriously affected by extreme values

called "outliers".

E.g. If Bill Gates moved into any

neighborhood, the average household income

for that neighborhood would increase

dramatically beyond what it was previously!

MTH410 S14- Lecture 1

104/153

Example

The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33,

14, 8, 0, 9, 22 hours. Find the mean time on the Internet.

$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{x_1 + x_2 + \dots + x_{10}}{10} = \frac{0 + 7 + \dots + 22}{10} = 11.0$$

Example

Suppose the telephone bills of Example 2.1 represent

the population of measurements. The population mean is

$$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{x_1 + x_2 + \dots + x_{200}}{200} = \frac{42.19 + 38.45 + \dots + 45.77}{200} = 43.59$$
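The sample mean of the Internet-use example can be verified with a short sketch. It also checks a property of the mean that holds in general: the deviations from the mean sum to zero.

```python
# Reported weekly hours on the Internet for the 10 adults in the example.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Sample mean: sum of the observations divided by the number of observations.
mean_hours = sum(hours) / len(hours)

# Deviations of each observation from the mean.
deviations = [x - mean_hours for x in hours]

print(f"mean = {mean_hours:.1f}")
print(f"sum of deviations = {sum(deviations):.1f}")
```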

MTH410 S14- Lecture 1

105/153

Properties of Mean

Calculated by using every data point.

Every set of interval data has a unique mean.

Sum of deviations from the mean is 0.

Affected by extreme (very large or small)

values.

Not meaningful for nominal or ordinal data.

Useful for comparing 2 or more data sets.

106/153

Median

The median is calculated by placing all the observations in

order; the observation that falls in the middle is the median.

Data: {0, 7, 12, 5, 14, 8, 0, 9, 22}

N=9 (odd)

Sort them bottom to top; the element in the middle is the median:

0 0 5 7 8 9 12 14 22

median = 8

Data: {0, 7, 12, 5, 14, 8, 0, 9, 22, 33} N=10 (even)

Sort them bottom to top; there are two elements in

the middle:

0 0 5 7 8 9 12 14 22 33

median = (8 + 9) ÷ 2 = 8.5

Sample and population medians are computed the same way.
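The two cases (odd and even number of observations) can be written as one small function. The sketch below uses the two data sets from the slide; the function is a plain illustration of the rule, and it also shows how little the median moves when the extreme value 33 is added, compared with the mean.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

nine = [0, 7, 12, 5, 14, 8, 0, 9, 22]
ten = nine + [33]  # the same data with one large observation added

for data in (nine, ten):
    mean = sum(data) / len(data)
    print(f"n = {len(data)}: median = {median(data)}, mean = {mean:.2f}")
```

Adding the observation 33 moves the median only from 8 to 8.5, while the mean rises from about 8.56 to 11.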

MTH410 S14- Lecture 1

107/153

Properties of Median

Calculated by using only 1 or at most 2

values.

Every set of interval data has a unique median.

Not affected by extreme values.

Can be calculated for ordinal data as well,

but can't be interpreted as the centre of

location.

108/153

Mode

The mode of a set of observations is the value that

occurs most frequently. Sometimes we say

MODE = PEAK of a curve.

A set of data may have one mode (or modal class),

or two modes, or more modes.

Mode can be used for all data types, although mainly

used for nominal data.

For populations and large samples the modal class

is preferable.

Sample and population modes are computed the same way.

109/153

Mode (Cont'd)

E.g. Data: {0, 7, 12, 5, 14, 8, 0, 9, 22, 33}, N = 10

Which observation appears most often?

The mode for this data set is 0. How about

this as a measure of central location?

In a small sample, it may not be a good measure.
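Counting frequencies makes the mode easy to find. The sketch below returns every value tied for the highest frequency, so multimodal data are reported in full; the function name `modes` and the `bimodal` list are hypothetical and only for illustration.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


sample = [0, 7, 12, 5, 14, 8, 0, 9, 22, 33]
sample_modes = modes(sample)

# Hypothetical bimodal data, not taken from the lecture.
bimodal = [1, 1, 2, 3, 3]
bimodal_modes = modes(bimodal)

print(sample_modes, bimodal_modes)
```

For the sample, the only mode is 0, which sits at the edge of the data rather than near its centre.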


Mode (Cont'd)

The mode may not be unique, e.g. bimodal data have two modes.

Note: if you are using Excel for your data analysis and your data is multi-modal (i.e. there is more than one mode), Excel only calculates the smallest one. Use another technique (e.g. a histogram) to determine if your data is bimodal, trimodal, etc.


Properties of Mode

Not affected by extreme values.

Multiple modes are possible, hence not a good measure of central location.

Sometimes no mode exists, e.g. when every value occurs equally often.

Can be calculated for nominal data as well, but can't be interpreted as the centre of location.


If a distribution is symmetrical, the mean, median and mode may coincide.

[Figure: symmetric distribution with the mode, median and mean at the same central point]


If a distribution is asymmetrical, say skewed to the left or to the right, the three measures may differ.

E.g.:

A positively skewed distribution (skewed to the right): mode < median < mean

A negatively skewed distribution (skewed to the left): mean < median < mode


If data are symmetric, the mean, median,

and mode will be approximately the same.

If data are multimodal, report the mean,

median and/or mode for each subgroup.

If data are skewed, report the median.


For ordinal and nominal data the calculation

of the mean is not valid.

Median is appropriate for ordinal data.

For nominal data, a mode calculation is

useful for determining highest frequency but

not central location.


Example: Assume you got 35 marks on an exam and the average was 45 marks. What kind of result would you expect? Failed?

Not sure. It depends on the actual marks of all the students.

If the marks were:

15, 20, 25, 25, 25, 30, 35, 75, 100, 100

Congratulations! Good job, you're the fourth highest!

What if you were told that the median was 27.5, i.e. (25 + 30) / 2? Would you still worry?
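The figures in this example can be reproduced with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
import statistics

marks = [15, 20, 25, 25, 25, 30, 35, 75, 100, 100]

mean_mark = statistics.mean(marks)      # pulled upward by the three high marks
median_mark = statistics.median(marks)  # average of the 5th and 6th sorted marks

# Rank of a mark of 35 counted from the top (1 = highest).
rank_from_top = sorted(marks, reverse=True).index(35) + 1

print(mean_mark, median_mark, rank_from_top)
```

A mark of 35 is below the mean of 45 but above the median of 27.5, which matches its rank as the fourth highest of ten.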


The mean is generally the first choice.

In scenarios such as the following, the median is the best choice:

To determine the rank of a particular value relative to the data set.


Geometric Mean

The geometric mean is used when the variable is a growth rate or rate of change, such as the value of an investment over periods of time.

For a given series of rates of return $R_1, R_2, \dots, R_n$, the n-period return is calculated by:

$(1 + R_1)(1 + R_2) \cdots (1 + R_n)$

If the rate of return were $R_g$ in every period, the n-period return would be calculated by:

$(1 + R_g)^n$

The geometric mean $R_g$ is selected such that

$(1 + R_g)^n = (1 + R_1)(1 + R_2) \cdots (1 + R_n)$, that is, $R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$


Finance Example

Suppose a 2-year investment of $1,000 grows by 100% to

$2,000 in the first year, but loses 50% from $2,000 back to

the original $1,000 in the second year. What is the average

return?

Using the arithmetic mean, the average return is (100% + (-50%)) / 2 = 25%, which is misleading.

This would indicate having more than $1,000 at the end of the second year; in fact we only have $1,000.

Solving for the geometric mean yields a rate of 0%, which is more precise.
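The two averages for this investment can be compared with a short Python sketch. The function name `geometric_mean_return` is an illustrative assumption; returns are written as decimals (100% = 1.00, -50% = -0.50).

```python
import math


def geometric_mean_return(returns):
    """R_g such that (1 + R_g)^n equals the product of (1 + R_i)."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


returns = [1.00, -0.50]  # +100% in year 1, -50% in year 2

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

# Value of $1,000 after two years at the geometric mean rate.
final_value = 1000 * (1 + geometric) ** len(returns)

print(arithmetic, geometric, final_value)
```

The arithmetic mean of 0.25 would suggest growth, while the geometric mean of 0 reproduces the actual ending value of $1,000.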


Compute mean to

Describe the central location of a single set of

interval data

Compute median to

Describe the central location of a single set of

ordinal or interval data (with extreme observations)

Compute mode to

Describe a single set of nominal, ordinal or interval

data

Compute Geometric mean to

Describe a single set of interval data based on

growth rates


Agenda

Introduction

Measures of Central Location

Measures of Variability


Measures of variability

Measures of central location fail to tell

the whole story about the distribution.

A question of interest still remains

unanswered:

How much are the observations spread out

around the mean value?



Observe two data sets:

Small variability: the average value provides a good representation of the observations in the data set.

Larger variability: the same average value does not provide as good a representation of the observations in the data set as before.


Range

The range is the simplest measure of variability, and

calculated as:

Range = Largest observation - Smallest observation

E.g. Data set: {4, 4, 4, 4, 4, 50}, Range = 46

Data set: {4, 8, 15, 24, 39, 50}, Range = 46

The range is the same in both cases, but the data sets have very different distributions.
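A small Python sketch of the same comparison; `data_range` is a hypothetical helper name.

```python
def data_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


set_1 = [4, 4, 4, 4, 4, 50]
set_2 = [4, 8, 15, 24, 39, 50]

range_1 = data_range(set_1)
range_2 = data_range(set_2)

print(range_1, range_2)
```

Both calls return 46 because only the two end points enter the calculation.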


Range (Cont'd)

How are the observations spread out between the smallest observation and the largest observation? The range cannot assist in answering this question.


Variance

Variance and its related measure,

standard deviation, are arguably the most

important statistics. Used to measure

variability, they also play a vital role in

almost all statistical inference procedures.

Population variance is denoted by $\sigma^2$ (lower case Greek letter sigma squared).

Sample variance is denoted by $s^2$ (lower case s squared).


            Population    Sample

Size        $N$           $n$

Mean        $\mu$         $\bar{x}$

Variance    $\sigma^2$    $s^2$


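Variance (Cont'd)

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where $\mu$ is the population mean and $N$ is the population size.

Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean and the denominator is the sample size minus one! The reason will be discussed later.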


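Variance (Cont'd)

As you can see, you have to calculate the sample mean in order to calculate the sample variance.

Alternatively, there is a shortcut formula to calculate the sample variance directly from the data without the intermediate step of calculating the mean. It's given by:

$s^2 = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right]$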


Consider two small populations:

A: 8, 9, 10, 11, 12

B: 4, 7, 10, 13, 16

The mean of both populations is 10, but the measurements in B are more dispersed than those in A. Any good measure of dispersion should agree with this observation.

Can the sum of deviations be a good measure of variability?

A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; Sum = 0

B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; Sum = 0

The sum of deviations is zero for both populations and, therefore, is not a good measure of variability.

$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$

$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$

Why use the average squared deviation rather than the sum of squared deviations?
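The two population variances can be verified with a short Python sketch. Populations A (8, 9, 10, 11, 12) and B (4, 7, 10, 13, 16) are the ones used on these slides; the function name `population_variance` is illustrative.

```python
def population_variance(values):
    """Average squared deviation from the population mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


pop_a = [8, 9, 10, 11, 12]
pop_b = [4, 7, 10, 13, 16]

# Plain deviations cancel out, so their sum cannot rank the two populations.
dev_sum_a = sum(x - 10 for x in pop_a)
dev_sum_b = sum(x - 10 for x in pop_b)

var_a = population_variance(pop_a)
var_b = population_variance(pop_b)

print(dev_sum_a, dev_sum_b, var_a, var_b)
```

Squaring removes the cancellation: the variances are 2 and 18, so B is correctly ranked as more dispersed.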


Let us calculate the sum of squared deviations for both data sets in this example.

Data set A: {1, 1, 1, 1, 1, 3, 3, 3, 3, 3}

Data set B: {1, 5}

Which data set has a larger dispersion? Data set B is more dispersed around the mean.


Sum of squared deviations for A = 5(1-2)^2 + 5(3-2)^2 = 10

Sum of squared deviations for B = (1-3)^2 + (5-3)^2 = 8

Sum_A > Sum_B. This is inconsistent with the observation that set B is more dispersed.


When calculated on a per-observation basis (variance), the data set dispersions are properly ranked.

$\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$

$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$
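The contrast between the raw sum and the per-observation average can be sketched in Python. Data set B is taken to be {1, 5}, as implied by the deviations (1-3) and (5-3); the helper name is hypothetical.

```python
def sum_squared_deviations(values):
    """Sum of squared deviations from the mean of the data set."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values)


set_a = [1, 1, 1, 1, 1, 3, 3, 3, 3, 3]
set_b = [1, 5]  # assumed from the deviations shown on the slide

sum_a = sum_squared_deviations(set_a)
sum_b = sum_squared_deviations(set_b)

# Dividing by the number of observations gives the population variance.
var_a = sum_a / len(set_a)
var_b = sum_b / len(set_b)

print(sum_a, sum_b, var_a, var_b)
```

The raw sums rank A above B only because A has more observations; dividing by N reverses the order.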


Application

Example

The following sample consists of the

number of jobs six students applied for:

17, 15, 23, 7, 9, 13.

Find its mean and variance.

What are we looking to calculate?


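Sample Mean: $\bar{x} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14$

Sample Variance: $s^2 = \frac{(17-14)^2 + (15-14)^2 + (23-14)^2 + (7-14)^2 + (9-14)^2 + (13-14)^2}{6 - 1} = \frac{166}{5} = 33.2$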
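As a check, the sample variance of the job-application data can be computed in three ways in Python: from the definition, from the shortcut formula that avoids the mean, and with the standard library. The variable names are illustrative.

```python
import statistics

# Number of jobs applied for by the six students in the example.
jobs = [17, 15, 23, 7, 9, 13]
n = len(jobs)

# Definition: squared deviations from the sample mean, divided by n - 1.
x_bar = sum(jobs) / n
s2_definition = sum((x - x_bar) ** 2 for x in jobs) / (n - 1)

# Shortcut: needs only the sum of the values and the sum of their squares.
s2_shortcut = (sum(x * x for x in jobs) - sum(jobs) ** 2 / n) / (n - 1)

# Cross-check with the standard library (also divides by n - 1).
s2_library = statistics.variance(jobs)

print(x_bar, s2_definition, s2_shortcut, s2_library)
```

All three routes should agree on a mean of 14 and a sample variance of 33.2.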


Standard Deviation

The standard deviation is the square root of

the variance.

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample standard deviation: $s = \sqrt{s^2}$
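A brief Python sketch of both standard deviations, reusing the job-application sample and population A (8, 9, 10, 11, 12) from the earlier slides. `statistics.stdev` divides by n - 1 and `statistics.pstdev` divides by N.

```python
import math
import statistics

jobs = [17, 15, 23, 7, 9, 13]       # sample from the application example
population_a = [8, 9, 10, 11, 12]   # population A from the variance slides

s = statistics.stdev(jobs)                  # sample standard deviation
sigma_a = statistics.pstdev(population_a)   # population standard deviation

# Each is the square root of the corresponding variance (33.2 and 2).
print(s, math.sqrt(33.2))
print(sigma_a, math.sqrt(2))
```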


                      Population    Sample

Size                  $N$           $n$

Mean                  $\mu$         $\bar{x}$

Variance              $\sigma^2$    $s^2$

Standard Deviation    $\sigma$      $s$


There is another measure of deviation, the Mean Absolute Deviation (MAD), which is calculated by averaging the absolute values of the deviations. However, this statistic is rarely used.

$\text{MAD} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$

$\text{MAD} = \frac{|17-14| + |15-14| + |23-14| + |7-14| + |9-14| + |13-14|}{6} = \frac{26}{6} = 4\tfrac{1}{3}$
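The same calculation as a minimal Python sketch; the function name `mean_absolute_deviation` is an illustrative choice.

```python
def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(abs(x - x_bar) for x in values) / n


jobs = [17, 15, 23, 7, 9, 13]
mad = mean_absolute_deviation(jobs)

print(mad)  # 26 / 6, about 4.33
```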


If data are symmetric, with no serious outliers, use

range and standard deviation.

If comparing variation across two data sets, use

coefficient of variation.

The measures of variability introduced in this

section can be used only for interval data.

The next section will discuss a measure that can

be used to describe the variability of ordinal data.

There are no measures of variability for nominal

data.


Agenda

Introduction

Measures of Central Location

Measures of Variability

Measures of Relative Standing


Measures of relative standing are designed to provide

information about the position of particular values relative

to the entire data set.

Percentile: the Pth percentile is the smallest point in a

distribution at or below which p percentage of cases is

found.

Example: Suppose your score is the 60th percentile of a GMAT test. That is, 60% of all the scores lie below your score and 40% lie above it.

Note: The 60th percentile doesn't mean you scored 60% on the exam. It means that 60% of your peers scored lower than you on the exam.


Quartiles

We have special names for the 25th, 50th, and 75th

percentiles, namely quartiles.

The first or lower quartile is labeled Q1 = 25th percentile.

The second quartile, Q2 = 50th percentile (also the

median).

The third or upper quartile, Q3 = 75th percentile.

We can also convert percentiles into quintiles (fifths) and

deciles (tenths).


Quartiles can provide an idea about the shape of a histogram:

Positively skewed histogram: Q2 - Q1 < Q3 - Q2

Negatively skewed histogram: Q2 - Q1 > Q3 - Q2


First (lower) decile = 10th percentile

First (lower) quartile, Q1 = 25th percentile

Second (middle) quartile, Q2 = 50th percentile

Third quartile, Q3 = 75th percentile

Ninth (upper) decile = 90th percentile


Location of Percentiles

The following formula allows us to approximate the location of any percentile:

$L_P = (n + 1) \frac{P}{100}$

where $L_P$ is the location of the $P$th percentile.


Location of Percentiles (Cont'd)

Given the data:

0 0 5 7 8 9 12 14 22 33

Where is the location of the 25th percentile?

L25 = (10+1)(25/100) = 2.75

The 25th percentile is three-quarters of the distance between the second (which is 0) and the third observations (which is 5). Three-quarters of the distance is: (.75)(5 - 0) = 3.75; because the second observation is 0, the 25th percentile is 0 + 3.75 = 3.75


Location of Percentiles (Cont'd)

What about the upper quartile?

L75 = (10+1)(75/100) = 8.25

0 0 5 7 8 9 12 14 22 33

It is located one-quarter of the distance between the eighth

and the ninth observations, which are 14 and 22, respectively.

One-quarter of the distance is: (.25)(22 - 14) = 2, which

means the 75th percentile is at: 14 + 2 = 16


Location of Percentiles (Cont'd)

0 0 5 7 8 9 12 14 22 33

Position 2.75: 25th percentile = 3.75

Position 8.25: 75th percentile = 16

Lp determines the position in the data set where the percentile value lies, not the percentile itself.

We have already shown how to find the median, which is the 50th percentile. Its location is

L50 = (10+1)(50/100) = 5.5

so it is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is (8+9)/2 = 8.5.
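The location rule and the interpolation step can be combined in a short Python sketch. The function name `percentile` is illustrative, and clamping positions that fall before the first or after the last observation is an assumption, since the slides do not cover that case.

```python
def percentile(values, p):
    """Approximate the p-th percentile using the location L_P = (n + 1) * P / 100."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100   # 1-based position, possibly fractional

    # Assumption: positions outside 1..n are clamped to the end points.
    if location <= 1:
        return ordered[0]
    if location >= n:
        return ordered[-1]

    k = int(location)              # whole part: the k-th observation
    fraction = location - k        # how far to move toward observation k + 1
    lower, upper = ordered[k - 1], ordered[k]
    return lower + fraction * (upper - lower)


data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]

p25 = percentile(data, 25)
p50 = percentile(data, 50)
p75 = percentile(data, 75)

print(p25, p50, p75)
```

For this data the sketch gives 3.75, 8.5 and 16, matching the hand calculations at positions 2.75, 5.5 and 8.25.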


Interquartile Range

The quartiles can be used to create another

measure of variability, the interquartile range,

which is defined as follows:

Interquartile range = Q3 - Q1

The interquartile range measures the spread of the

middle 50% of the observations.

Large values of this statistic mean that the 1st and 3rd

quartiles are far apart indicating a high level of

variability.
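For the ten Internet-usage observations used earlier, the interquartile range can be sketched with Python's `statistics.quantiles`. To my understanding its `"exclusive"` method interpolates at position (n + 1)P/100, which is the location rule used in these slides; treat that correspondence as an assumption and check it against your Python version.

```python
import statistics

data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]

# Three cut points dividing the data into four parts: Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

iqr = q3 - q1  # spread of the middle 50% of the observations

print(q1, q2, q3, iqr)
```

With Q1 = 3.75 and Q3 = 16, the interquartile range is 12.25, whereas the full range of the same data is 33.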


1. It is a summary.

selecting techniques.
