K

LECTURER IN MATHEMATICS DEPARTMENT

KODAIKANAL CHRISTIAN COLLEGE

KODAIKANAL

CONTENT
1. INTRODUCTION STATISTICS
2. DIAGRAMMATIC PRESENTATION
3. MEASURE OF CENTRAL TENDENCY
4. MEASURE OF DISPERSION
5. INDEX NUMBERS

1.INTRODUCTION STATISTICS

Introduction  Application  Collection of Data  Sampling Introduction  Types of Sampling  Types of Distribution

2.DIAGRAMMATIC PRESENTATION

Introduction  Types of Diagrams  Types of Graphs  Examples

3.MEASURE OF CENTRAL TENDENCY

Mean  Median  Mode  Geometric Mean  Harmonic Mean  Quartiles, Deciles  Merits & Demerits

4.MEASURE OF DISPERSION
Types of Dispersion  Lorenz Curve  Combined mean & Standard Deviation  Coefficient of Variation  Consistency of data

5. INDEX NUMBERS
Types of Methods  Simple Average of Price Relatives  Weighted Index Numbers  Laspeyres’, Bowley’s, Fisher’s & Marshall-Edgeworth Index Numbers  Test of Consistency of Index Numbers  Fisher’s Index Number as an Ideal Index Number

UNIT-I

Introduction Statistics Application Collection of Data Sampling Introduction Types of Sampling Types of Distribution

Learning Objectives
1. Define Statistics
2. Describe the Uses of Statistics
3. Distinguish Descriptive & Inferential Statistics
4. Define Population, Sample, Parameter, and Statistic
5. Define Quantitative and Qualitative Data
6. Define Random Sample

What Is Statistics?
1. Collecting Data
e.g., Sample, Survey, Observe, Simulate
2. Characterizing Data
e.g., Organize/Classify, Count, Summarize
3. Presenting Data
e.g., Tables, Charts, Statements
4. Interpreting Results
e.g., Infer, Conclude, Specify Confidence

Data Analysis
Why? Decision-Making

Populations & Samples
Population Sample

Subset

The graphical & tabular methods presented here apply to both entire populations and samples drawn from populations.

Definitions…
A variable is some characteristic of a population or sample.  E.g. student grades.  Typically denoted with a capital letter: X, Y, Z…

The values of the variable are the range of possible values for a variable.  E.g. student marks (0..100)

Data are the observed values of a variable.  E.g. student marks: {67, 74, 71, 83, 93, 55, 48}

Application Areas
• Economics
• Product Development
• Forecasting, Demographics
• Design, Quality
• Sports
• Individual & Team Performance
• Consumer Preferences, Financial Trends

Statistical Methods

Descriptive Statistics

Inferential Statistics

Descriptive Statistics
1. Involves
• Collecting Data
• Organizing Data
• Presenting Data
• Characterizing Data

2. Purpose
• Describe Data

[Figure: bar chart of \$ (0 to 50) by quarter, Q1 to Q4, with summary values $\bar{X} = 30.5$, $S^2 = 113$]

Types of Statistical Applications in Business

Descriptive Statistics - describe collected data
“51.4% of all credit card purchases in 2003 were made with a Visa Card” “The average Pay-to-Return Rating of Retailing Industry CEOs in 2005 was 126.6”

Inferential Statistics
1. Involves
• Estimation
• Hypothesis Testing

2. Purpose
• Make inferences about a population from sample data

3. Example
• Retail CEOs were overpaid

Key Terms
1. Population (Universe)
• All items of interest

2. Sample
• Portion of population

3. Parameter
• Summary measure about a population

4. Statistic
• Summary measure about a sample

Memory aid: P in Population & Parameter, S in Sample & Statistic

Fundamental Elements of Statistics
• Item of interest (experimental unit): graduating senior
• Population, the set of items we are interested in learning about: all 1450 graduating seniors at “State U”
• Variable, a characteristic of a single population unit: age at graduation
• Value, a symbol [number, letter, word(s), …] associating one option of a variable with one item: graduating senior Anne Baker’s age at graduation will be 22
• Triplet, the fundamental data unit: (Anne Baker, age, 22)

Data Organized in Tables
Graduating Senior   Age   Major        Home
Anne Baker          22    Accounting   Santa Fe
Charles Durango     21    Comp Lit     Ruidoso
Ellen Fong          22    Ecology      Taiwan

Rows for items in population, columns for variables, cells for values – variables are the focus
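The link between the table and the triplet can be sketched in a few lines of Python. This is only an illustration: the helper name `to_triplets` is an assumption made for this example, and the rows are the three seniors from the table above.

```python
# Rows are items (graduating seniors); keys are variables; cells are values.
table = [
    {"Graduating Senior": "Anne Baker", "Age": 22, "Major": "Accounting", "Home": "Santa Fe"},
    {"Graduating Senior": "Charles Durango", "Age": 21, "Major": "Comp Lit", "Home": "Ruidoso"},
    {"Graduating Senior": "Ellen Fong", "Age": 22, "Major": "Ecology", "Home": "Taiwan"},
]


def to_triplets(rows, item_key="Graduating Senior"):
    """Flatten a table into (item, variable, value) triplets (hypothetical helper)."""
    return [
        (row[item_key], variable, value)
        for row in rows
        for variable, value in row.items()
        if variable != item_key
    ]


triplets = to_triplets(table)
print(triplets[0])  # ('Anne Baker', 'Age', 22)
```

Each table cell becomes one triplet, so three seniors and three variables give nine triplets.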

Definitions

2 types of variables:

Independent Variable (IV): A variable that is manipulated by the researcher (Example: I assign you to drink either 1) coffee with caffeine or 2) decaf)
Dependent Variable (DV): The variable that is measured to see if the independent variable had an effect (Example: I measure how alert you are after you drink the coffee)

Types of Data

Quantitative Data

Qualitative Data

Types of Variables

Quantitative Variables
• measured on a naturally occurring scale
• equal intervals along scale (allows for meaningful mathematical calculations)

Ratio scale
• zero value properly describes the underlying phenomenon, e.g., bank balance, length of a material entity
• ratios of scale values properly describe relative values, e.g., 4 feet long is indeed twice as long as 2 feet

Interval scale
• zero value is arbitrarily assigned, e.g., zero temperature in the F or C scale is not no heat at all, zero calendar time is not the beginning of time
• ratios of scale values do not describe relative values correctly, e.g., 40°F does not represent twice as much heat as 20°F

Types of Variables

Qualitative Variables
• measured by classification only
• Non-numerical in nature
• Meaningfully ordered categories identify ordinal data (best to worst ranking, income categories, price ranges)
• Categories without a meaningful order identify nominal data (gender, political affiliation, industry classification, ethnic/cultural groups, cause of defectives)

Types of Data & Information…
Data
Categorical?
  N: Interval Data, e.g. {0..100}
  Y: Categorical Data. Ordered?
    Y: Ordinal Data, e.g. {F, D, C, B, A} (rank order to data)
    N: Nominal Data, e.g. {Pass | Fail} (NO rank order to data)

Hierarchy of Data…
Interval  Values are real numbers.  All calculations are valid.  Data may be treated as ordinal or nominal.

Ordinal  Values must represent the ranked order of the data.  Calculations based on an ordering process are valid.  Data may be treated as nominal but not as interval.

Nominal  Values are the arbitrary numbers that represent categories.  Only calculations based on the frequencies of occurrence are valid.  Data may not be treated as ordinal or interval.
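The hierarchy can be illustrated with one summary per data type. The sketch below uses Python's standard `statistics` module. The marks are the student marks listed earlier; the letter grades and pass/fail outcomes are hypothetical values invented for this illustration.

```python
import statistics

marks = [67, 74, 71, 83, 93, 55, 48]                 # interval data
grades = ["C", "B", "B", "A", "A", "D", "F"]           # hypothetical ordinal data
outcomes = ["Pass"] * 6 + ["Fail"]                     # hypothetical nominal data

# Interval: all calculations are valid, including the mean.
mean_mark = statistics.mean(marks)

# Ordinal: only order-based calculations are valid, e.g. the median grade.
GRADE_ORDER = ["F", "D", "C", "B", "A"]                # worst to best
ranks = [GRADE_ORDER.index(g) for g in grades]
median_grade = GRADE_ORDER[statistics.median_low(ranks)]

# Nominal: only frequency-based calculations are valid, e.g. the mode.
modal_outcome = statistics.mode(outcomes)

print(round(mean_mark, 2), median_grade, modal_outcome)
```

Averaging the grade ranks or the pass/fail codes would run without error, but the result would not be meaningful, which is the point of the hierarchy.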

Relationships between Variables.
(Source. Rowntree 2000: 33)

Variables
  Category
    Nominal
    Ordinal (ordered categories, ranks)
  Quantity
    Discrete (counting)
    Continuous (measuring)

Classroom Exercise

For undergraduate students, what type of variable is the following: Student Status (e.g., Freshman)
Enter: A for Ratio, B for Interval, C for Ordinal, D for Nominal

Why bother with variable types?

• Different statistical techniques are used for quantitative and qualitative variables
• Quantitative variables can be transformed into qualitative data through category creation
• Qualitative variables cannot be meaningfully transformed into quantitative data: coding their values with numbers does not make them quantitative

Collecting Data

Sampling
• When all elements of a population cannot be measured, sampling is necessary
• Inferential statistics are then used to make estimates of population parameters (e.g., average age) from the sample values

Samples need to be representative
• Reflect population of interest

Random Sampling
• Most common sampling method to ensure sample is largely representative
• Ensures that each subset of fixed size is equally likely to be selected

Stratified Sampling
• Most representative sample technique
• Requires prior knowledge of population strata (sub-population)
• Uses random sampling within strata
Question (10thPERSON)

A local TV station conducts exit polling during an election, selecting every 10th person who exits the polling station. Is this a random sample? Enter Yes or No

Why or why not?

Common Sources of Error in Survey Data

• Selection bias: exclusion of a subset of the population of interest prior to sampling
• Non-response bias: introduced when responses are not received from all sample members. What can be done?
• Measurement error: inaccuracy in recorded data. Can be due to survey design, transcription error, or surveyor sabotage
• Example: prediction of CEO performance based on their golf handicap (Chapter 1 Statistics In Action, p22)

The Role of Statistics in Managerial Decision Making

• Statistical literacy is useful, if not necessary, to make informed decisions both at work (selecting a new employee based on education) and at home (selecting a new car based on repair data)
• It requires statistical thinking to critically assess data and the inferences drawn from it
• Statistical thinking assists you in identifying research resulting from unethical or uninformed statistical practices

Statistical Computer Packages
1. Typical Software
• Excel
• SPSS
• SAS
• MINITAB

2. Need Statistical Understanding
• Assumptions
• Limitations

Total Population

The total collection of units, elements or individuals that you want to analyse. These can be countries, lab-rats, light bulbs, university students, banks, residents of a particular area, regional health authorities etc. The population for a study of infant health might be all children born in the U.K. in the 1980's.

Sample

A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group. Using the example of the study of infant health, the sample might be all babies born on 7th May in any of the years. Samples are selected because the population is too large to study in its entirety. It is important that the researcher carefully and completely defines the population, including a description of the members to be included.

Representative sample

A sample whose characteristics correspond to, or reflect, those of the original population or reference population. To ensure representativeness, the sample may be either completely random or stratified, depending upon the conceptualized population and the sampling objective (i.e., upon the decision to be made). A thorny issue in the social sciences: is it possible to achieve?

A probability provides a quantitative description of the likely occurrence of a particular event.

Probability Sampling

A probability sampling method is any method of sampling that uses some form of random selection. In order to have a random selection method, you must set up some process or procedure that assures that the different units in your population have equal probabilities of being chosen (Clark 2002: 37).

Most Common Types of Probability Sampling
• Simple Random Sampling
• Stratified Random Sampling
• Systematic Random Sampling
• Cluster Or Multistage Sampling

Simple Random Sampling

In simple random sampling we select a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen randomly and each member of the population has an equal chance of being included in the sample. Every possible sample of a given size has the same chance of selection; that is, each member of the population is equally likely to be chosen at any stage in the sampling process (Easton & McColl 2004). A lottery draw is a good example of simple random sampling. A sample of 6 numbers is randomly generated from a population of 45, with each number having an equal chance of being selected.
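The lottery draw can be sketched with Python's standard `random` module. The seed value is an arbitrary choice made here so that the run is repeatable; it is not part of the method.

```python
import random

rng = random.Random(2004)          # arbitrary seed, for a repeatable illustration
population = range(1, 46)          # the 45 lottery numbers
draw = rng.sample(population, k=6)  # sampling without replacement

print(sorted(draw))
```

`sample` draws without replacement, so no number can appear twice and every subset of six numbers is equally likely.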

Stratified Random Sampling

Often there are factors which divide up the population into sub-populations (groups / strata), and the measurement of interest may vary among the different sub-populations. This has to be accounted for when we select a sample from the population to ensure our sample is representative of the population. This is achieved by stratified sampling. A stratified sample is obtained by taking samples from each stratum or sub-group of a population. Suppose a farmer wishes to work out the average milk yield of each cow type in his herd, which consists of Ayrshire, Friesian, Galloway and Jersey cows. He could divide up his herd into the four sub-groups and take samples from these (Easton and McColl 2004).
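The farmer's approach might look like the following sketch. The herd sizes, the cow labels and the function name `stratified_sample` are assumptions made up for this illustration; only the four breeds come from the example above.

```python
import random


def stratified_sample(strata, per_stratum, rng):
    """Draw a simple random sample of the same size from every stratum."""
    return {
        name: rng.sample(units, k=min(per_stratum, len(units)))
        for name, units in strata.items()
    }


# Hypothetical herd: the number of cows per breed is invented.
herd_sizes = {"Ayrshire": 12, "Friesian": 30, "Galloway": 8, "Jersey": 20}
herd = {
    breed: [f"{breed}-{i}" for i in range(1, n + 1)]
    for breed, n in herd_sizes.items()
}

sample = stratified_sample(herd, per_stratum=5, rng=random.Random(0))
for breed, cows in sample.items():
    print(breed, cows)
```

Equal-sized samples per stratum are one possible design; sampling in proportion to stratum size is a common alternative.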

Systematic Random Sampling

Systematic sampling, sometimes called interval sampling, means that there is a gap, or interval, between each selection.

It is often used in industry, where an item is selected for testing from a production line (say, every fifteen minutes) to ensure that machines and equipment are working to specification. Alternatively, the manufacturer might decide to select every 20th item on a production line to test for defects and quality. This technique requires the first item to be selected at random as a starting point for testing and, thereafter, every 20th item is chosen.

It is also used when questioning people in surveys, e.g. a market researcher selecting every 10th person who enters a particular store, after selecting a person at random as a starting point; or interviewing occupants of every 5th house in a street, after selecting a house at random as a starting point.

If the researcher wants to select a fixed-size sample, it is first necessary to know the whole population size from which the sample is being selected. The appropriate sampling interval, I, is then calculated by dividing population size, N, by required sample size, n. If a systematic sample of 500 students were to be drawn from a university with an enrolled population of 10,000, the sampling interval would be: I = N/n = 10,000/500 = 20
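The university example can be sketched as follows. The function name `systematic_sample` and the use of student numbers 1 to 10,000 are assumptions for illustration; the sketch also assumes that N is a multiple of n, as it is in the example.

```python
import random


def systematic_sample(population, n, rng):
    """Pick a random start within the first interval, then every I-th unit."""
    interval = len(population) // n        # I = N / n
    start = rng.randrange(interval)        # random starting point
    return [population[start + k * interval] for k in range(n)]


students = list(range(1, 10_001))          # hypothetical student numbers
chosen = systematic_sample(students, n=500, rng=random.Random(1))

print(len(chosen), chosen[:3])
```

Only the starting point is random; every later selection is fixed by the interval of 20.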

Cluster Or Multistage Sampling

Cluster sampling is a sampling technique where the entire population is divided into groups, or clusters, and a random sample of these clusters is selected. All observations in the selected clusters are included in the sample. Every element should have a specified (equal) chance of being selected into the final sample. It is typically used when the researcher cannot get a complete list of the members of a population they wish to study but can get a complete list of groups or 'clusters' of the population. It is a cheap, easy and economical method of data collection.
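A minimal sketch of one-stage cluster sampling is given below. The schools and pupil labels are hypothetical, as is the function name `cluster_sample`.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Select whole clusters at random and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), k=n_clusters)
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units


# Hypothetical population: six schools (clusters) with four pupils each.
schools = {f"School {c}": [f"{c}{i}" for i in range(1, 5)] for c in "ABCDEF"}

chosen, sample = cluster_sample(schools, n_clusters=2, rng=random.Random(7))
print(chosen, sample)
```

Only the list of clusters is needed up front; the members are listed for the selected clusters alone.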

Non-Probability Sampling

Main Types
• Convenience/opportunity/accidental sampling
• Purposive/judgemental sampling
• Quota sampling
• Snowball sampling

Convenience/opportunity/accidental sampling
• Volunteer samples
• Sometimes access through contacts or gatekeepers
• ‘Easy to reach’ population

Purposive/ judgemental sampling

Involves selecting a group of people because they have particular traits that the researcher wants to study, e.g. consumers of a particular product or service in some types of market research. Example: my own questionnaire research on ‘New Age’ Travellers.

Quota sampling

Widely used in opinion polls and market research. Interviewers are given a quota of subjects of a specified type to attempt to recruit, e.g. an interviewer might be told to go out and select 20 male smokers and 20 female smokers so that they could interview them about their health and smoking behaviours.

Snowball sampling

Involves two main steps.
1. Identify a few key individuals
2. Ask these individuals to volunteer to distribute the questionnaire to people they know who fit the traits of the desired sample (e.g. my research on Travellers)

Sample Size

In general, the larger the sample size (selected with the use of probability techniques) the better. The more heterogeneous a population is on a variety of characteristics (e.g. race, age, sexual orientation, religion), the larger the sample needed to reflect that diversity (Papadopoulos 2003). Response rates vary with the type of survey (e.g. mail surveys, telephone surveys). Response rates under 60 or 70 per cent may compromise the integrity of the random sample (ibid).


Frequencies and Distributions

Frequency: A frequency is the number of times a value is observed in a distribution or the number of times a particular event occurs.
Distribution: When the observed values are arranged in order they are called a rank order distribution or an array. Distributions demonstrate how the frequencies of observations are distributed across a range of values.

Two elements to a distribution

• Scale with a number of values (usually the scores are arranged from the highest to the lowest).
• Corresponding observations: tally up the scores and convert them into frequencies.

Types of Distribution
• Frequency distribution
• Class intervals
• Relative (proportional or percentage) distributions
• Cumulative distributions

Frequency Distributions

Shows the number of cases having each of the attributes of a particular variable. Divided into two types:
1. Ungrouped distribution: scores are not collapsed into categories; each score is represented as a separate value.
2. Grouped distribution: scores are collapsed into categories so that several scores are presented together as a group. Groups are usually referred to as class intervals.

Relative (proportional or percentage) distributions

The proportion of cases in the whole distribution observed at each score or value.

Cumulative distribution.

The number of cases up to and including the scale value. Can appear in grouped or ungrouped format. The cumulative relative distribution for any particular value is the total proportion of cases up to, and including, that value.

Look at the distribution below. It shows the recorded ages of patients receiving treatment for heart disease in the Stroud district. There are 50 observed values, and we can easily see how often each value occurs.
• What is the frequency of the following values: 79; 81; 94?
• What is the range of this distribution (r = h – l)?
• What is the mode?
• What is the median?
From this distribution we can also tell that most of the values tend to cluster around the middle of the range.

62 73 78 81 86
64 74 78 81 87
65 74 79 81 87
66 74 79 82 88
68 75 79 82 89
70 75 80 82 90
71 76 80 83 90
71 77 80 83 92
72 77 81 85 94
72 78 81 85 96
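The questions above can be checked with a short Python sketch using the standard library. The variable names are chosen for this illustration only, and the list holds the 50 ages in ascending order.

```python
from collections import Counter
import statistics

ages = [
    62, 64, 65, 66, 68, 70, 71, 71, 72, 72,
    73, 74, 74, 74, 75, 75, 76, 77, 77, 78,
    78, 78, 79, 79, 79, 80, 80, 80, 81, 81,
    81, 81, 81, 82, 82, 82, 83, 83, 85, 85,
    86, 87, 87, 88, 89, 90, 90, 92, 94, 96,
]

freq = Counter(ages)                    # frequency of each observed value
value_range = max(ages) - min(ages)     # r = h - l
mode = statistics.mode(ages)
median = statistics.median(ages)

print(freq[79], freq[81], freq[94])     # 3 5 1
print(value_range, mode, median)        # 34 81 79.5
```

With 50 values the median is the average of the 25th and 26th ordered values, 79 and 80.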

Example

UNIT-II

Diagrammatic Presentation Types of Diagrams Types of Graphs Examples

Chapter Topics

Organizing numerical data
• The ordered array and stem-leaf display

Tabulating and graphing Univariate numerical data
• Frequency distributions: tables, histograms, polygons
• Cumulative distributions: tables, the Ogive

Graphing Bivariate numerical data


Tabulating and graphing Univariate categorical data
• The summary table
• Bar and pie charts, the Pareto diagram

Tabulating and graphing Bivariate categorical data
• Contingency tables
• Side by side bar charts

Graphical excellence and common errors in presenting data

Organizing Numerical Data
Numerical Data
41, 24, 32, 26, 27, 27, 30, 24, 38, 21

Ordered Array
21, 24, 24, 26, 27, 27, 30, 32, 38, 41

Frequency Distributions and Cumulative Distributions: Tables, Histograms, Polygons, Ogive

Stem and Leaf Display
2 | 144677
3 | 028
4 | 1

Organizing Numerical Data
(continued)

Data in raw form (as collected): 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Data in ordered array from smallest to largest: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem-and-leaf display:
2 | 144677
3 | 028
4 | 1
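The display can be built mechanically: the tens digit is the stem and the units digit is the leaf. The sketch below assumes two-digit whole numbers, as in the raw data above; the function name `stem_and_leaf` is chosen for this illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return one 'stem | leaves' line per tens digit, leaves in ascending order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return [
        f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]


raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
display = stem_and_leaf(raw)
for line in display:
    print(line)
# 2 | 144677
# 3 | 028
# 4 | 1
```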

Graphing Numerical Data: The Histogram
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: histogram of frequency (0 to 7) against class midpoints 5, 15, 25, 35, 45, 55; there are no gaps between bars, which meet at the class boundaries.]

Graphing Numerical Data: The Frequency Polygon
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: frequency polygon of frequency (0 to 7) against class midpoints 5, 15, 25, 35, 45, 55.]

Tabulating Numerical Data: Cumulative Frequency
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class              Cumulative Frequency   Cumulative % Frequency
10 but under 20    3                      15
20 but under 30    9                      45
30 but under 40    14                     70
40 but under 50    18                     90
50 but under 60    20                     100
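The cumulative columns can be reproduced from the ordered array with a short sketch. The class limits follow the table above; the variable names are chosen for this illustration.

```python
from itertools import accumulate

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

lower_bounds = [10, 20, 30, 40, 50]   # "10 but under 20", "20 but under 30", ...
width = 10

# Count the observations falling in each class [lower, lower + width).
frequencies = [sum(lo <= x < lo + width for x in data) for lo in lower_bounds]
cumulative = list(accumulate(frequencies))
cumulative_pct = [100 * c / len(data) for c in cumulative]

for lo, f, c, p in zip(lower_bounds, frequencies, cumulative, cumulative_pct):
    print(f"{lo} but under {lo + width}: {f:2d} {c:2d} {p:5.1f}")
```

The class frequencies 3, 6, 5, 4, 2 are the bar heights of the histogram for the same data.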

Ogive…

An ogive is a graph of a cumulative frequency distribution.

We create an ogive in three steps…
1) Calculate relative frequencies.
2) Calculate cumulative relative frequencies by adding the current class’ relative frequency to the previous class’ cumulative relative frequency. (For the first class, its cumulative relative frequency is just its relative frequency.)
3) Graph the cumulative relative frequencies.

Cumulative Relative Frequencies…
first class: .355
next class: .355 + .185 = .540
:
last class: .930 + .070 = 1.00

Ogive…
The ogive can be used to answer questions like: What telephone bill value is at the 50th percentile?

“around \$35”

(Refer also to Fig. 2.13 in your textbook)

Graphing Numerical Data:
The Ogive (Cumulative % Polygon)
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: ogive of cumulative % (0 to 100) plotted against class boundaries 10 to 60 (not midpoints).]

Graphing Bivariate Numerical Data (Scatter Plot)
[Figure: Mutual Funds Scatter Plot of Total Year to Date Return (%) (0 to 40) against Net Asset Values (0 to 40).]

Tabulating and Graphing Categorical Data:Univariate Data
Categorical Data

Tabulating Data The Summary Table

Graphing Data

Pie Charts Bar Charts Pareto Diagram

Summary Table
(for an Investor’s Portfolio)

Investment Category   Amount (in thousands \$)   Percentage
Stocks                46.5                       42.27
Bonds                 32                         29.09
CD                    15.5                       14.09
Savings               16                         14.55
Total                 110                        100

Variables are Categorical
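The percentage column, and the cumulative percentages that a Pareto diagram plots as its line graph, can be derived from the amounts. This is a sketch with variable names chosen for illustration.

```python
amounts = {"Stocks": 46.5, "Bonds": 32, "CD": 15.5, "Savings": 16}  # thousands of dollars

total = sum(amounts.values())
percentages = {k: round(100 * v / total, 2) for k, v in amounts.items()}

# Pareto ordering: categories sorted from largest to smallest amount,
# together with the running (cumulative) percentage.
ordered = sorted(amounts.items(), key=lambda kv: kv[1], reverse=True)
running = 0.0
pareto = []
for category, amount in ordered:
    running += amount
    pareto.append((category, round(100 * running / total, 2)))

print(percentages)
print(pareto)
```

Note that Savings (16) ranks ahead of CD (15.5) once the categories are sorted for the Pareto diagram.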


Bar Chart
(for an Investor’s Portfolio)
[Figure: horizontal bar chart "Investor's Portfolio" of Amount in K\$ (0 to 50) for Savings, CD, Bonds and Stocks.]

Pie Chart
(for an Investor’s Portfolio)
[Figure: pie chart of Amount Invested in K\$: Stocks 42%, Bonds 29%, Savings 15%, CD 14%.]

Percentages are rounded to the nearest percent.

Pareto Diagram
[Figure: Pareto diagram for Stocks, Bonds, Savings and CD. The axis for the bar chart (0% to 45%) shows the % invested in each category; the axis for the line graph (0% to 100%) shows the cumulative % invested.]

Tabulating and Graphing Bivariate Categorical Data

Contingency tables: investment in thousands of dollars

Investment Category   Investor A   Investor B   Investor C   Total
Stocks                46.5         55           27.5         129
Bonds                 32           44           19           95
CD                    15.5         20           13.5         49
Savings               16           28           7            51
Total                 110          147          67           324

Side by side charts
[Figure: side by side bar chart "Comparing Investors", showing the amounts (0 to 60) for Investor A, Investor B and Investor C in each category: Savings, CD, Bonds, Stocks.]

Principles of Graphical Excellence
• Presents data in a way that provides substance, statistics and design
• Communicates complex ideas with clarity, precision and efficiency
• Gives the largest number of ideas in the most efficient manner
• Almost always involves several dimensions
• Tells the truth about the data

Errors in Presenting Data
• Using “chart junk”
• Failing to provide a relative basis in comparing data between groups
• Compressing the vertical axis
• Providing no zero point on the vertical axis

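“Chart Junk”
Minimum Wage: 1960: \$1.00; 1970: \$1.60; 1980: \$3.10; 1990: \$3.80
[Figure: the same minimum wage series drawn twice, once decorated with "chart junk" and once as a good presentation, a plain chart of \$ (0 to 4) against year (1960 to 1990).]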

No Relative Basis
[Figure: A’s received by students. The poor version plots raw frequencies (0 to 300) for FR, SO, JR, SR; the good presentation plots percentages (0 to 30%) for the same groups.]

FR = Freshmen, SO = Sophomore, JR = Junior, SR = Senior

Compressing Vertical Axis
[Figure: Quarterly Sales in \$ for Q1 to Q4. The poor version uses a vertical axis from 0 to 200, which flattens the bars; the good presentation uses an axis from 0 to 50.]

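No Zero Point on Vertical Axis
[Figure: Monthly Sales in \$ for the first six months (J, F, M, A, M, J). The poor version uses a vertical axis that runs from 36 to 45 with no zero point; the good presentation shows the same values (36 to 45) on an axis that includes 0.]

Graphing the first six months of sales.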

Summary II…
Single Set of Data
• Interval Data: Histogram, Ogive, or Stem-and-Leaf Display
• Nominal Data: Frequency and Relative Frequency Tables, Bar and Pie Charts

Relationship Between Two Variables
• Interval Data: Scatter Diagram
• Nominal Data: Contingency Table, Bar Charts

Chapter Summary

Organized numerical data
• The ordered array and stem-leaf display

Tabulated and graphed univariate numerical data
• Frequency distributions: tables, histograms, polygon
• Cumulative distributions: tables and the Ogive

Graphed bivariate numerical data

Tabulated and graphed univariate categorical data
• The summary table
• Bar and pie charts, the Pareto diagram

Tabulated and graphed bivariate categorical data
• Contingency tables
• Side by side charts

Discussed graphical excellence and common errors in presenting data

UNIT-III

Measure of Central Tendency  Mean  Median  Mode  Geometric Mean  Harmonic Mean  Quartiles, Deciles  Merits & Demerits

WHAT DO THEY ALL MEAN?

Numerical Descriptive Techniques…

Measures of Central Location

Mean, Median, Mode

Measures of Variability

Range, Standard Deviation, Variance, Coefficient of Variation

Measures of Relative Standing

Percentiles, Quartiles

Measures of Linear Relationship

Covariance, Correlation, Least Squares Line

Measures of Central Location…
The arithmetic mean, a.k.a. average, shortened to mean, is the most popular & useful measure of central location.

It is computed by simply adding up all the observations and dividing by the total number of observations:

$$\text{Mean} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$$

Notation…
When referring to the number of observations in a population, we use uppercase letter N

When referring to the number of observations in a sample, we use lower case letter n

The arithmetic mean for a population is denoted with Greek letter “mu”: $\mu$

The arithmetic mean for a sample is denoted with an “x-bar”: $\bar{x}$

Statistics is a pattern language…

        Population   Sample
Size    N            n
Mean    $\mu$        $\bar{x}$

Arithmetic Mean…

Population Mean

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Sample Mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The Arithmetic Mean…
…is appropriate for describing measurement data, e.g. heights of people, marks of student papers, etc.

…is seriously affected by extreme values called “outliers”. E.g. as soon as a billionaire moves into a neighborhood, the average household income increases beyond what it was previously!

Measures of Central Location…
The median is calculated by placing all the observations in order; the observation that falls in the middle is the median.

Data: {0, 7, 12, 5, 14, 8, 0, 9, 22} N=9 (odd)
Sort them bottom to top, find the middle: 0 0 5 7 8 9 12 14 22
median = 8

Data: {0, 7, 12, 5, 14, 8, 0, 9, 22, 33} N=10 (even)
Sort them bottom to top, the middle is the simple average between 8 & 9: 0 0 5 7 8 9 12 14 22 33
median = (8+9)÷2 = 8.5
Sample and population medians are computed the same way.
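Both cases can be checked with Python's standard `statistics` module. The sketch also shows how the single added value 33 moves the mean much more than the median; the variable names are chosen for this illustration.

```python
import statistics

odd = [0, 7, 12, 5, 14, 8, 0, 9, 22]      # N = 9
even = odd + [33]                          # N = 10

median_odd = statistics.median(odd)        # middle value of the sorted data
median_even = statistics.median(even)      # average of the two middle values

mean_odd = statistics.mean(odd)            # 77 / 9
mean_even = statistics.mean(even)          # 110 / 10

print(median_odd, median_even)             # 8 8.5
print(round(mean_odd, 2), mean_even)
```

Adding 33 shifts the median by 0.5 but raises the mean from about 8.56 to 11, a small-scale version of the outlier effect described for the arithmetic mean.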

Measures of Central Location…
The mode of a set of observations is the value that occurs most frequently.

A set of data may have one mode (or modal class), or two, or more modes.

The mode is useful for all data types, though mainly used for nominal data.

For large data sets the modal class is much more relevant than a single-value mode.

Sample and population modes are computed the same way.

Mode…

E.g. Data: {0, 7, 12, 5, 14, 8, 0, 9, 22, 33} N=10

Which observation appears most often? The mode for this data set is 0. How is this a measure of “central” location?

[Figure: histogram of frequency against the variable, with the tallest bar marked as a modal class.]

=MODE(range) in Excel…
Note: if you are using Excel for your data analysis and your data is multi-modal (i.e. there is more than one mode), Excel only calculates the smallest one.

You will have to use other techniques (i.e. histogram) to determine if your data is bimodal, trimodal, etc.
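Outside Excel, multiple modes can be listed directly. The sketch below uses `statistics.multimode` from the Python standard library (available from Python 3.8); the extra value 9 is a hypothetical addition made to create a second mode.

```python
import statistics

data = [0, 7, 12, 5, 14, 8, 0, 9, 22, 33]
modes = statistics.multimode(data)             # every value with the highest count

bimodal = data + [9]                           # hypothetical: 9 now also occurs twice
bimodal_modes = statistics.multimode(bimodal)

print(modes)          # [0]
print(bimodal_modes)  # [0, 9]
```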

Mean, Median, Mode…
If a distribution is symmetrical, the mean, median and mode may coincide…

[Figure: symmetric distribution with the mode, median and mean at the same point.]

Mean, Median, Mode…
If a distribution is asymmetrical, say skewed to the left or to the right, the three measures may differ. E.g.:

[Figure: skewed distribution with the mode, median and mean marked at three different points.]

Mean, Median, Mode…
• If data are symmetric, the mean, median, and mode will be approximately the same.
• If data are multimodal, report the mean, median and/or mode for each subgroup.
• If data are skewed, report the median.

Mean, Median, & Modes for Ordinal & Nominal Data…
• For ordinal and nominal data the calculation of the mean is not valid.
• Median is appropriate for ordinal data.
• For nominal data, a mode calculation is useful for determining highest frequency but not “central location”.

Geometric Mean…

The geometric mean is used when the variable is a growth rate or rate of change, such as the value of an investment over periods of time.

If $R_i$ denotes the rate of return in period $i$ ($i = 1, 2, \dots, n$), then the geometric mean $R_g$ of the returns $R_1, R_2, \dots, R_n$ is defined such that:

$$(1 + R_g)^n = (1 + R_1)(1 + R_2)\cdots(1 + R_n)$$

Solving for $R_g$ we produce the following formula:

$$R_g = \sqrt[n]{(1 + R_1)(1 + R_2)\cdots(1 + R_n)} - 1$$

Harmonic Mean
The harmonic mean of a set of n values is defined as the reciprocal of the mean of the reciprocals of these values. That is, if $x_1, x_2, x_3, \dots, x_n$ are the n values,

$$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
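As a quick check of the definition, the sketch below computes the reciprocal of the mean of the reciprocals and compares it with `statistics.harmonic_mean` from the Python standard library. The two speeds are hypothetical values chosen for this illustration.

```python
import statistics


def harmonic_mean(values):
    """Reciprocal of the mean of the reciprocals: n / sum(1 / x_i)."""
    return len(values) / sum(1 / x for x in values)


speeds = [40, 60]   # hypothetical: km/h over two legs of equal distance
by_definition = harmonic_mean(speeds)
by_library = statistics.harmonic_mean(speeds)

print(round(by_definition, 6), by_library)
```

For these two values the harmonic mean is 48, below the arithmetic mean of 50.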

Presentation of data & descriptive statistics
The geometric mean: An alternative measure, which is particularly applicable when the average rate of growth is to be measured, is the geometric mean

$$X_g = \sqrt[n]{X_1 X_2 \cdots X_n} - 1$$

where each $X_i$ is a growth factor (1 plus the growth rate).

E.g.: annual growth rates: 10%, 20%, 15%, -30%, 20%. The average annual growth rate is

$$X_g = \sqrt[5]{1.1 \times 1.2 \times 1.15 \times 0.7 \times 1.2} - 1 = 1.0498 - 1 = 4.98\%$$
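The calculation can be reproduced with a few lines of Python. The function name `geometric_mean_growth` is an assumption made for this sketch; `math.prod` requires Python 3.8 or later.

```python
import math


def geometric_mean_growth(rates):
    """Average growth rate: n-th root of the product of (1 + rate), minus 1."""
    factors = [1 + r for r in rates]
    return math.prod(factors) ** (1 / len(factors)) - 1


annual_rates = [0.10, 0.20, 0.15, -0.30, 0.20]
average_growth = geometric_mean_growth(annual_rates)

print(f"{average_growth:.2%}")   # 4.98%
```

The same function gives 0% for a gain of 100% followed by a loss of 50%, since the growth factors 2 and 0.5 multiply to 1.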

Finance Example…
Suppose a 2-year investment of \$1,000 grows by 100% to \$2,000 in the first year, but loses 50% from \$2,000 back to the original \$1,000 in the second year. What is your average return?

Using the arithmetic mean, we have

$$\bar{R} = \frac{R_1 + R_2}{2} = \frac{100\% + (-50\%)}{2} = 25\%$$

This would indicate we should have \$1,250 at the end of our investment, not \$1,000.

Solving for the geometric mean yields a rate of 0%, which is correct:

$$R_g = \sqrt[n]{\prod_{i=1}^{n} (1 + R_i)} - 1 = \sqrt{(1 + 1)(1 - 0.5)} - 1 = 0$$

The upper case Greek Letter “Pi” represents a product of terms…


Measures of Central Location • Summary…
• Compute the Mean to describe the central location of a single set of interval data
• Compute the Median to describe the central location of a single set of interval or ordinal data
• Compute the Mode to describe a single set of nominal data
• Compute the Geometric Mean to describe a single set of interval data based on growth rates

Calculation of quartiles from grouped data:

$$Q_1 = L + i\left(\frac{\frac{n+1}{4} - F}{f}\right); \qquad Q_3 = L + i\left(\frac{\frac{3(n+1)}{4} - F}{f}\right)$$

L = lower bound of the quartile group
i = width of the quartile group
F = cumulative frequency up to the quartile group
f = frequency in the quartile group


Example of calculation of quartiles from grouped data. See Table 2.9 (page 55):

$$Q_1 = -3 + 1\left(\frac{\frac{51+1}{4} - 12}{3}\right) = -2.667\%$$

$$Q_3 = 3 + 1\left(\frac{\frac{3(51+1)}{4} - 37}{3}\right) = 3.667\%$$

Quartile range = Q3 – Q1 = 3.667 – (–2.667) = 6.333%
Quartile deviation = 6.333/2 = 3.167%
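The worked example can be verified with a small function that applies the grouped-data formula. The function name `grouped_quantile` is an assumption for this sketch; the inputs (n = 51, L, i, F, f) are the values used in the example above.

```python
def grouped_quantile(position, L, i, F, f):
    """L + i * (position - F) / f, where position is (n+1)/4 or 3(n+1)/4."""
    return L + i * (position - F) / f


n = 51
q1 = grouped_quantile((n + 1) / 4, L=-3, i=1, F=12, f=3)
q3 = grouped_quantile(3 * (n + 1) / 4, L=3, i=1, F=37, f=3)

quartile_range = q3 - q1
quartile_deviation = quartile_range / 2

print(round(q1, 3), round(q3, 3))                           # -2.667 3.667
print(round(quartile_range, 3), round(quartile_deviation, 3))  # 6.333 3.167
```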

How is the range of a set of numbers identified?

Arrange the numbers in the set in order from least to greatest. Subtract the lowest number from the highest number in the set of numbers. The difference of the two numbers is the range of a set of numbers.

Example: your test scores are 95, 87, 92, 100, and 94.

Your highest score is 100.
Your lowest score is 87.
Your range is 100 – 87 = 13.

It’s Time To Practice!
Number of pets owned by 7 students: 2, 1, 1, 4, 3, 2, 1
What is the mean? 2
What is the median? 2
What is the mode? 1
What is the range? 3
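The practice answers can be checked with Python's standard `statistics` module. This is a sketch for verification only; the variable name `pets` is illustrative.

```python
from statistics import mean, median, mode

pets = [2, 1, 1, 4, 3, 2, 1]

print(mean(pets))              # 2
print(median(pets))            # 2
print(mode(pets))              # 1
print(max(pets) - min(pets))   # 3 (the range)
```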

4 ways to describe data
The Mean – what we usually think of as the average.
The Median – the middle number in an ordered data set; 50% of the numbers are above and 50% are below the median.
The Mode – the number that occurs most often.
The Range – the difference between the largest and the smallest number.

UNIT-IV

Measure of Dispersion  Types of Dispersion  Lorenz Curve  Combined mean & Standard Deviation  Coefficient of Variation  Consistency of data

Definition

Measures of dispersion are descriptive statistics that describe how similar a set of scores are to each other

The more similar the scores are to each other, the lower the measure of dispersion will be.
The less similar the scores are to each other, the higher the measure of dispersion will be.
In general, the more spread out a distribution is, the larger the measure of dispersion will be.

Measures of Dispersion

Which of the distributions of scores has the larger dispersion?

The upper distribution has more dispersion because the scores are more spread out; that is, they are less similar to each other.

[Figure: two distributions of scores, one above the other, each drawn for scores 1 to 10 against a vertical scale of 0 to 125]

Measures of Dispersion

There are three main measures of dispersion:

The range
The semi-interquartile range (SIR)
Variance / standard deviation

The Range

The range is defined as the difference between the largest score in the set of data and the smallest score in the set of data, XL - XS.
What is the range of the following data: 4 8 1 6 6 2 9 3 6 9?
The largest score (XL) is 9; the smallest score (XS) is 1; the range is XL - XS = 9 - 1 = 8.

When To Use the Range

The range is used when

you have ordinal data, or
you are presenting your results to people with little or no knowledge of statistics

The range is rarely used in scientific work as it is fairly insensitive to how the scores are distributed

It depends on only two scores in the set of data, XL and XS
Two very different sets of data can have the same range: 1 1 1 1 9 vs 1 3 5 7 9
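A short Python sketch makes the point with the two data sets above. It compares the range with the population variance (defined later in this unit, here taken from the standard `statistics` module); the helper name `value_range` is illustrative.

```python
from statistics import pvariance

a = [1, 1, 1, 1, 9]
b = [1, 3, 5, 7, 9]

def value_range(scores):
    """Largest score minus smallest score."""
    return max(scores) - min(scores)

print(value_range(a), value_range(b))   # 8 8
print(pvariance(a), pvariance(b))       # the two variances differ
```

The ranges are identical even though the scores are arranged very differently, which the variance does pick up.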

The Semi-Interquartile Range

The semi-interquartile range (or SIR) is defined as the difference between the third and first quartiles, divided by two

The first quartile is the 25th percentile
The third quartile is the 75th percentile

SIR = (Q3 - Q1) / 2

Interquartile Range…
The quartiles can be used to create another measure of variability, the interquartile range, which is defined as follows:
 

Interquartile Range = Q3 – Q1

The interquartile range measures the spread of the middle 50% of the observations. Large values of this statistic mean that the 1st and 3rd quartiles are far apart indicating a high level of variability.

Deciles

SIR Example

What is the SIR for the following data?

2 4 6 8 10 12 14 20 30 60

25% of the scores are below 5, so 5 is the first quartile (25th percentile)
25% of the scores are above 25, so 25 is the third quartile (75th percentile)

SIR = (Q3 - Q1) / 2 = (25 - 5) / 2 = 10

When To Use the SIR

The SIR is often used with skewed data as it is insensitive to the extreme scores
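The following Python sketch illustrates this insensitivity using the example data and a hypothetical variant in which the largest score is replaced by 600. It relies on `statistics.quantiles`, whose default interpolation rule is only one of several quartile conventions, so its quartiles need not equal the 5 and 25 read off in the example.

```python
from statistics import quantiles

def semi_interquartile_range(scores):
    """(Q3 - Q1) / 2, with quartiles from statistics.quantiles (default method)."""
    q1, _, q3 = quantiles(scores, n=4)
    return (q3 - q1) / 2

scores = [2, 4, 6, 8, 10, 12, 14, 20, 30, 60]
with_extreme = scores[:-1] + [600]   # hypothetical: largest score replaced by 600

# The SIR is the same for both data sets ...
print(semi_interquartile_range(scores), semi_interquartile_range(with_extreme))
# ... while the range changes drastically.
print(max(scores) - min(scores), max(with_extreme) - min(with_extreme))   # 58 598
```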

Variance

Variance is defined as the average of the squared deviations:

$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

What Does the Variance Formula Mean?

First, it says to subtract the mean from each of the scores

This difference is called a deviate or a deviation score.
The deviate tells us how far a given score is from the typical, or average, score.
Thus, the deviate is a measure of dispersion for a given score.

What Does the Variance Formula Mean?

Why can’t we simply take the average of the deviates? That is, why isn’t variance defined as:

$\sigma^2 \neq \frac{\sum (X - \mu)}{N}$

This is not the formula for variance!

What Does the Variance Formula Mean?

One of the definitions of the mean was that it always made the sum of the scores minus the mean equal to 0.
Thus, the average of the deviates must be 0, since the sum of the deviates must equal 0.
To avoid this problem, statisticians square the deviate scores prior to averaging them.

Squaring the deviate scores makes every squared score non-negative, so they can no longer cancel each other out
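A minimal Python sketch of this argument, using the scores 1 3 5 7 9 from the earlier range example:

```python
scores = [1, 3, 5, 7, 9]
mu = sum(scores) / len(scores)            # 5.0

deviates = [x - mu for x in scores]       # [-4.0, -2.0, 0.0, 2.0, 4.0]
squared = [d ** 2 for d in deviates]      # [16.0, 4.0, 0.0, 4.0, 16.0]

print(sum(deviates) / len(scores))        # 0.0 -> the average deviate is always zero
print(sum(squared) / len(scores))         # 8.0 -> the variance
```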

What Does the Variance Formula Mean?
 

Variance is the mean of the squared deviation scores.
The larger the variance is, the more the scores deviate, on average, away from the mean.
The smaller the variance is, the less the scores deviate, on average, from the mean.

Coefficient of Variation…
The coefficient of variation of a set of observations is the standard deviation of the observations divided by their mean, that is:

Population coefficient of variation = CV = $\frac{\sigma}{\mu}$

Sample coefficient of variation = cv = $\frac{s}{\bar{x}}$

Statistics is a pattern language…
                            Population    Sample
Size                        N             n
Mean                        µ             x̄
Variance                    σ²            s²
Standard Deviation          σ             s
Coefficient of Variation    CV            cv

Coefficient of Variation…
This coefficient provides a proportionate measure of variation, e.g.

A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
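The comparison can be written as a short Python sketch; the function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a proportion of the mean."""
    return std_dev / mean

print(coefficient_of_variation(10, 100))   # 0.1  -> 10% of the mean
print(coefficient_of_variation(10, 500))   # 0.02 -> 2% of the mean
```

The same standard deviation is five times larger, relative to the mean, in the first case.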

Standard Deviation

When the deviate scores are squared in variance, their unit of measure is squared as well

E.g. If people’s weights are measured in pounds, then the variance of the weights would be expressed in pounds² (or squared pounds)

Since squared units of measure are often awkward to deal with, the square root of variance is often used instead

The standard deviation is the square root of variance

Standard Deviation
 

Standard deviation = √variance
Variance = (standard deviation)²

Computational Formula

When calculating variance, it is often easier to use a computational formula which is algebraically equivalent to the definitional formula:

$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N} = \frac{\sum (X - \mu)^2}{N}$

$\sigma^2$ is the population variance, X is a score, µ is the population mean, and N is the number of scores

Computational Formula Example
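X         X²         X - µ     (X - µ)²
9         81          2         4
8         64          1         1
6         36         -1         1
5         25         -2         4
8         64          1         1
6         36         -1         1
Σ = 42    Σ = 306    Σ = 0     Σ = 12

Using the computational formula:

$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N} = \frac{306 - \frac{42^2}{6}}{6} = \frac{306 - 294}{6} = \frac{12}{6} = 2$

Using the definitional formula:

$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{12}{6} = 2$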
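The agreement between the two formulas can be verified with a few lines of Python, using the six scores from this example. The variable names are illustrative.

```python
scores = [9, 8, 6, 5, 8, 6]
n = len(scores)

# Computational formula: (sum of X^2 - (sum of X)^2 / N) / N
sum_x = sum(scores)                       # 42
sum_x2 = sum(x ** 2 for x in scores)      # 306
computational = (sum_x2 - sum_x ** 2 / n) / n

# Definitional formula: sum of squared deviations from the mean, divided by N
mu = sum_x / n                            # 7.0
definitional = sum((x - mu) ** 2 for x in scores) / n

print(computational, definitional)        # 2.0 2.0
```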

Variance of a Sample

Because the sample mean is not a perfect estimate of the population mean, the formula for the variance of a sample is slightly different from the formula for the variance of a population:

$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$

$s^2$ is the sample variance, X is a score, $\bar{X}$ is the sample mean, and N is the number of scores
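As a sketch of the difference, Python's standard `statistics` module offers both versions: `pvariance` divides by N and `variance` divides by N - 1. The scores below are reused from the computational formula example and treated as a sample purely for illustration.

```python
from statistics import pvariance, variance

scores = [9, 8, 6, 5, 8, 6]

print(pvariance(scores))   # population formula: sum of squared deviations / N
print(variance(scores))    # sample formula: sum of squared deviations / (N - 1)
```

With a sum of squared deviations of 12, the population formula gives 12/6 = 2 and the sample formula gives 12/5 = 2.4.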

Which measure of dispersion?
 

If the median is used -> quartile deviation
If the arithmetic mean is used -> variance / std. deviation / negative semi-variance

Coefficient of variation

The SD is expressed in the underlying units of measurement.
Thus, when comparing the degree of dispersion between variables, we must take into account the difference in magnitude of the variables, e.g. FTSE index vs. S&P 500 index, or Dow Jones index vs. Ibovespa.
We can use the SD for returns but not for levels!
The CV overcomes this problem:

$CV = \frac{\sigma}{\bar{X}}$

UNIT-V

Index Numbers

Index Numbers  Types of Methods  Simple Average of Price Relatives  Weighted Index Numbers  Laspeyres’, Bowley’s, Fisher’s & Marshall-Edgeworth Index Numbers  Test of Consistency of Index Numbers  Fisher’s Index Number an Ideal Index Number

Index Numbers

Index numbers are used to summarize many variables or numbers with one number.
The most common index numbers are price indexes:

Consumer Price Index (CPI – TÜFE)
Producer Price Index (PPI – ÜFE)
ISE Index
Dow Jones Industrial Average

Index Numbers

Index numbers may be computed for things other than prices:

quantity indexes
quality indexes

Price Indexes

Price indexes are used to measure the general movement of prices (inflation).
Common types of indexes:

price relatives
unweighted
Laspeyres
Paasche

Price Relatives

Price relatives are used to find the change in price of a single item

$PR = \frac{P_t}{P_0} \times 100$

Prices of different fruits in different years (NTL/kg)

Price Relatives for Banana and Kiwi
$PR = \frac{P_t}{P_0} \times 100$

From 2001 to 2002
Banana: PRB = (0.94/0.91)*100 = 103.3
Kiwi: PRK = (2.10/1.90)*100 = 110.5
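The two price relatives can be reproduced with a short Python sketch; the function name is illustrative, and the prices are the 2001 and 2002 values quoted above.

```python
def price_relative(p_t, p_0):
    """Price in period t as a percentage of the price in the reference period."""
    return p_t / p_0 * 100

banana = price_relative(0.94, 0.91)
kiwi = price_relative(2.10, 1.90)

print(round(banana, 1), round(kiwi, 1))   # 103.3 110.5
```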

  

Banana price index BASE YEAR=2000

Unweighted Price Indexes
 

Price Relatives only represent the change in price of one item over time.
Unweighted Price indexes are formed by adding the prices in the year of interest and dividing by the sum of the prices in the base year:

$P_{uw} = \frac{\sum P_t}{\sum P_0} \times 100$

Unweighted Fruits Price Index (Base Year = 2000)

$P_{uw} = \frac{\sum P_t}{\sum P_0} \times 100$

Problems with Unweighted Price Indexes

Unweighted Price Indexes have a couple of problems:

they may be influenced by items with high prices
items that are relatively unimportant in the goods bundle may have undue influence

The usual solution is to weight the prices by some quantities
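The first problem can be seen in a small Python sketch. The two goods and their prices are hypothetical, invented only to illustrate the effect; they are not taken from the fruit price table.

```python
def unweighted_index(current_prices, base_prices):
    """Sum of current prices divided by sum of base-year prices, times 100."""
    return sum(current_prices) / sum(base_prices) * 100

# Hypothetical prices: a cheap item rises by 50%, an expensive item does not change.
base = {"cheap item": 1.00, "expensive item": 20.00}
current = {"cheap item": 1.50, "expensive item": 20.00}

index = unweighted_index(current.values(), base.values())
print(round(index, 1))   # 102.4
```

The index moves by only about 2.4% because the expensive item dominates both sums, regardless of how much of each item is actually bought.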

Weighted Price Indexes

If the price index is to be weighted by quantities, which quantities?

base year quantities (Laspeyres)
current year quantities (Paasche)

Most CPIs and PPIs use these or some variation of them (e.g. Fisher’s ideal price index)

Laspeyres Index

The Laspeyres index uses base year quantities as weights

$P_L = \frac{\sum P_t Q_0}{\sum P_0 Q_0} \times 100$

Quantities purchased (1000 kg)

Laspeyres fruit price index (BY=2000)
$P_L = \frac{\sum_{i=1}^{3} P_{i,t} Q_{i,2000}}{\sum_{i=1}^{3} P_{i,2000} Q_{i,2000}} \times 100$

Paasche Price Index

The Paasche index uses current quantities as the weighting factor

$P_P = \frac{\sum P_t Q_t}{\sum P_0 Q_t} \times 100$

Paasche fruit price index (BY=2000)

$P_P = \frac{\sum_{i=1}^{3} P_{i,t} Q_{i,t}}{\sum_{i=1}^{3} P_{i,2000} Q_{i,t}} \times 100$

Which is Best?

The advantage of the Laspeyres index is that once the quantities are set they do not change. This index is easy to update. The advantage of the Paasche is that the quantities reflect the current production/consumption. However, it is difficult to update and may not be as easy to compare over time.
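The two weighting schemes can be compared in a short Python sketch. All prices and quantities below are hypothetical, made up for illustration; they are not the fruit data from the slides.

```python
def laspeyres(p0, pt, q0):
    """Price index weighted by base-year quantities."""
    return sum(pt[k] * q0[k] for k in p0) / sum(p0[k] * q0[k] for k in p0) * 100

def paasche(p0, pt, qt):
    """Price index weighted by current-year quantities."""
    return sum(pt[k] * qt[k] for k in p0) / sum(p0[k] * qt[k] for k in p0) * 100

# Hypothetical data: the price of fruit A doubles and buyers shift toward fruit B.
p0 = {"fruit A": 1.0, "fruit B": 2.0}   # base-year prices
pt = {"fruit A": 2.0, "fruit B": 2.0}   # current prices
q0 = {"fruit A": 10, "fruit B": 10}     # base-year quantities
qt = {"fruit A": 5, "fruit B": 15}      # current quantities

print(round(laspeyres(p0, pt, q0), 1))  # 133.3
print(round(paasche(p0, pt, qt), 1))    # 114.3
```

In this made-up case the Laspeyres index is higher because it keeps the base-year quantities of the item whose price rose, while the Paasche index reflects the shift away from it.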

Weights – CPI (442 items)