
Chapter 3

Numerical Descriptive
Techniques (6 hours)
Learning Objectives
In this chapter you learn:
 1. Measures of centre and location
 2. Measures of dispersion and variation
 3. Measures of correlation
Definitions
 The central tendency is the extent to which the
values of a numerical variable group around a typical
or central value.

 The variation is the amount of dispersion or


scattering away from a central value that the values
of a numerical variable show.

 The shape is the pattern of the distribution of values


from the lowest value to the highest value.
Central tendency
The central tendency of the set of
measurements–that is, the tendency of the data to
cluster, or center, about certain numerical values.

Central Tendency
(Location)
Variation
The variability of the set of measurements–that is,
the spread of the data.

Variation
(Dispersion)
Measures of Central Tendency:
The Mean

 The arithmetic mean (often just called the “mean”)


is the most common measure of central tendency

 For a sample of size n:


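$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

where $\bar{X}$ (pronounced "x-bar") is the sample mean, $n$ is the sample size, and $X_i$ is the $i$th observed value.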
Measures of Central Tendency:
The Mean (cont'd)

 The most common measure of central tendency


 Mean = sum of values divided by the number of values
 Affected by extreme values (outliers)

[Figure: two dot plots on an axis from 11 to 20]

Data 11, 12, 13, 14, 15: Mean = (11 + 12 + 13 + 14 + 15)/5 = 65/5 = 13
Data 11, 12, 13, 14, 20: Mean = (11 + 12 + 13 + 14 + 20)/5 = 70/5 = 14
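
The effect of a single extreme value on the mean can be checked with a short sketch. The function name `arithmetic_mean` is an illustrative choice, not something defined in the slides; the two data sets are the ones used above.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


original = [11, 12, 13, 14, 15]
with_outlier = [11, 12, 13, 14, 20]  # the largest value is replaced by 20

mean_original = arithmetic_mean(original)
mean_with_outlier = arithmetic_mean(with_outlier)

# Changing one value from 15 to 20 pulls the mean up by a full unit.
print(mean_original, mean_with_outlier)
```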
Numerical Descriptive
Measures for a Population

 Descriptive statistics discussed previously described a


sample, not the population.

 Summary measures describing a population, called


parameters, are denoted with Greek letters.

 Important population parameters are the population mean,


variance, and standard deviation.
Numerical Descriptive Measures
for a Population: The mean µ

 The population mean is the sum of the values in


the population divided by the population size, N
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$

Where μ = population mean
N = population size
Xi = ith value of the variable X
Arithmetic Mean
 The arithmetic mean (mean) is the most
common measure of central tendency

 For a sample of size n:


$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

where $n$ is the sample size and $X_1, \ldots, X_n$ are the observed values.


Example
An investment of $100,000 declined to $50,000 at the
end of year one and rebounded to $100,000 at end
of year two:

X1 = $100,000, X2 = $50,000, X3 = $100,000

Year 1: 50% decrease. Year 2: 100% increase.

The overall two-year return is zero, since it started and ended at the same level.
 Year 0: Invested $100,000
 Year 1: Declined to $50,000
 Year 2: Rebounded to $100,000
Geometric Mean
 Geometric mean
 Used to measure the rate of change of a variable
over time
$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$
 Geometric mean rate of return
 Measures the status of an investment over time

$$R_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$$
 Where Ri is the rate of return in time period i
Stock price: recorded each month, Jan to Dec
Monthly return:
Example

An investment of $100,000 declined to $50,000 at the


end of year one and rebounded to $100,000 at end
of year two:

X1 = $100,000, X2 = $50,000, X3 = $100,000

Year 1: 50% decrease. Year 2: 100% increase.

The overall two-year return is zero, since it started and


ended at the same level.
Example
(continued)

Use the 1-year returns to compute the arithmetic mean and the geometric mean:

Arithmetic mean rate of return (misleading result):
$$\bar{X} = \frac{(-50\%) + (100\%)}{2} = 25\%$$

Geometric mean rate of return (more accurate result):
$$R_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$$
$$= [(1 + (-50\%)) \times (1 + (100\%))]^{1/2} - 1$$
$$= [(0.50) \times (2)]^{1/2} - 1 = 1^{1/2} - 1 = 0\%$$
Measures of Central Tendency:
The Median

 In an ordered array, the median is the “middle”


number (50% above, 50% below)

[Figure: two dot plots on an axis from 11 to 20; Median = 13 for both data sets]

 Less sensitive than the mean to extreme values


Measures of Central Tendency:
Locating the Median

 The location of the median when the values are in numerical order (smallest to largest):

$$\text{Median position} = \frac{n + 1}{2} \text{ position in the ordered data}$$

 If the number of values is odd, the median is the middle number
 If the number of values is even, the median is the average of the two middle numbers

Note that $\frac{n + 1}{2}$ is not the value of the median, only the position of the median in the ranked data
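
The position rule can be sketched as follows. The function name `median` is an illustrative choice; the code uses 0-based list indexes, so position (n + 1)/2 in the ranked data corresponds to index n // 2 when n is odd.

```python
def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a whole number (index `middle`).
        return ordered[middle]
    # Even n: position (n + 1) / 2 falls between two values, so average them.
    return (ordered[middle - 1] + ordered[middle]) / 2


median_odd = median([11, 12, 13, 14, 15])
median_with_outlier = median([11, 12, 13, 14, 20])
median_even = median([10.3, 4.9, 8.9, 11.7, 6.3, 7.7])
print(median_odd, median_with_outlier, median_even)
```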
Measures of Central Tendency:
The Mode

 Value that occurs most often


 Not affected by extreme values
 Used for either numerical or categorical data
 There may be no mode
 There may be several modes

[Figure: two dot plots; left, on an axis from 0 to 14, Mode = 9; right, on an axis from 0 to 6, No Mode]
Measures of Central Tendency:
Which Measure to Choose?

 The mean is generally used, unless extreme values (outliers) exist.
 The median is often used when outliers are present, since it is less sensitive to extreme values. For example, median home prices may be reported for a region.
 In some situations it makes sense to report both the mean and the median.
Shape of a Distribution
 Describes how data are distributed
 Two useful shape related statistics are:
 Skewness
 Measures the extent to which data values are not symmetrical
 Kurtosis
 Measures the peakedness of the curve of the distribution, that is, how sharply the curve rises approaching the center of the distribution
Shape of a Distribution
(Skewness)

 Measures the extent to which data are not symmetrical

                     Left-Skewed      Symmetric        Right-Skewed
Mean vs. median      Mean < Median    Mean = Median    Median < Mean
Skewness statistic   < 0              0                > 0
Measures of Variation
Variation: Range, Variance, Standard Deviation, Coefficient of Variation

 Measures of variation give information on the spread or variability or dispersion of the data values.

[Figure: two distributions with the same center, different variation]
Measures of Variation:
The Range

 Simplest measure of variation


 Difference between the largest and the smallest values:

Range = Xlargest – Xsmallest

Example (values plotted on an axis from 0 to 14; smallest value 1, largest value 13):

Range = 13 - 1 = 12
Measures of Variation:
Why The Range Can Be Misleading

 Does not account for how the data are distributed

[Figure: two dot plots on an axis from 7 to 12 with different distributions]
Range = 12 - 7 = 5 for both data sets

 Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Measures of Variation:
The Sample Variance

 Average (approximately) of squared deviations of values from the mean
 Sample variance:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Where $\bar{X}$ = arithmetic mean
n = sample size
Xi = ith value of the variable X
Measures of Variation:
The Sample Standard Deviation

 Most commonly used measure of variation


 Shows variation about the mean
 Is the square root of the variance
 Has the same units as the original data
 Sample standard deviation:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Measures of Variation:
The Standard Deviation

Steps for Computing Standard Deviation

1. Compute the difference between each value and the


mean.
2. Square each difference.
3. Add the squared differences.
4. Divide this total by n-1 to get the sample variance.
5. Take the square root of the sample variance to get
the sample standard deviation.
Measures of Variation:
Sample Standard Deviation:
Calculation Example

Sample Data (Xi): 10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{X}$ = 16

$$S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$$
$$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$$
$$= \sqrt{\frac{130}{7}} = 4.3095$$

A measure of the "average" scatter around the mean
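
The five computing steps can be followed line by line in a small sketch applied to the sample data of this example. The function name `sample_standard_deviation` is an illustrative assumption.

```python
import math


def sample_standard_deviation(values):
    """Sample standard deviation, computed step by step."""
    n = len(values)
    mean = sum(values) / n
    differences = [x - mean for x in values]   # 1. difference from the mean
    squared = [d ** 2 for d in differences]    # 2. square each difference
    total = sum(squared)                       # 3. add the squared differences
    variance = total / (n - 1)                 # 4. divide by n - 1
    return math.sqrt(variance)                 # 5. take the square root


data = [10, 12, 14, 15, 17, 18, 18, 24]
s = sample_standard_deviation(data)
print(round(s, 4))
```

For these eight values the sum of squared differences is 130, so the result agrees with the square root of 130/7 computed by hand.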
Measures of Variation:
Comparing Standard Deviations

[Figure: two distributions, one with a smaller standard deviation and one with a larger standard deviation]


Numerical Descriptive Measures For A
Population: The Variance σ²

 Average of squared deviations of values from


the mean
 Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Where μ = population mean


N = population size
Xi = ith value of the variable X
Numerical Descriptive Measures For A
Population: The Standard Deviation σ

 Most commonly used measure of variation


 Shows variation about the mean
 Is the square root of the population variance
 Has the same units as the original data

 Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Sample statistics versus
population parameters

                     Population parameter   Sample statistic
Mean                 μ                      $\bar{X}$
Variance             σ²                     S²
Standard deviation   σ                      S
Interpreting Standard
Deviation: Empirical Rule

For data sets that are mound shaped and symmetric:

 Approximately 68% of the measurements lie in the interval $\bar{x} - s$ to $\bar{x} + s$
 Approximately 95% of the measurements lie in the interval $\bar{x} - 2s$ to $\bar{x} + 2s$
 Approximately 99.7% of the measurements lie in the interval $\bar{x} - 3s$ to $\bar{x} + 3s$
Interpreting Standard
Deviation: Empirical Rule

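[Figure: mound-shaped curve with the axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$]

Approximately 68% of the measurements lie within $\bar{x} \pm s$
Approximately 95% of the measurements lie within $\bar{x} \pm 2s$
Approximately 99.7% of the measurements lie within $\bar{x} \pm 3s$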
Empirical Rule Example
Previously we found the mean closing stock
price of new stock issues is 15.5 and the
standard deviation is 3.34. If we can assume
the data is symmetric and mound shaped,
calculate the percentage of the data that lie
within the intervals
$\bar{x} \pm s$, $\bar{x} \pm 2s$, $\bar{x} \pm 3s$.
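
A minimal sketch of the interval arithmetic for this example, using the stated mean of 15.5 and standard deviation of 3.34. The percentages are the Empirical Rule approximations and apply only under the mound-shaped, symmetric assumption; the variable names are illustrative.

```python
mean = 15.5
s = 3.34

# Empirical Rule: approximate share of data within k standard deviations.
rule = {1: 68.0, 2: 95.0, 3: 99.7}

intervals = {}
for k, percent in rule.items():
    lower = mean - k * s
    upper = mean + k * s
    intervals[k] = (lower, upper)
    print(f"about {percent}% of prices between {lower:.2f} and {upper:.2f}")
```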
Numerical Measures of
Relative Standing: z–Scores
 Describes the relative location of a measurement
compared to the rest of the data

Sample z-score: $z = \frac{x - \bar{x}}{s}$
Population z-score: $z = \frac{x - \mu}{\sigma}$
Measures the number of standard deviations
away from the mean a data value is located
Problem

 A random sample of 2,000 students who sat for the


Graduate Management Admission Test (GMAT) is
selected. For this sample, the mean GMAT score is x
= 540 points and the standard deviation is s = 100
points. One student from the sample, Kara Smith, had
a GMAT score of x = 440 points. What is Kara’s
sample z-score?
Problem
 The mean time to assemble a product is 22.5
minutes with a standard deviation of 2.5 minutes.
 Find the z–score for an item that took 20 minutes
to assemble.
 Find the z–score for an item that took 27.5
minutes to assemble.
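
Both problems reduce to the same one-line calculation. The sketch below uses an illustrative helper named `z_score` and the figures given in the two problem statements.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


# GMAT problem: mean 540, standard deviation 100, Kara scored 440.
z_kara = z_score(440, 540, 100)

# Assembly problem: mean 22.5 minutes, standard deviation 2.5 minutes.
z_20_minutes = z_score(20, 22.5, 2.5)
z_27_5_minutes = z_score(27.5, 22.5, 2.5)

print(z_kara, z_20_minutes, z_27_5_minutes)
```

A negative z-score means the value lies below the mean, a positive one above it.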
Interpretation of z–Scores for
Mound-Shaped Distributions of
Data
1. Approximately 68% of the measurements will
have a z-score between –1 and 1.
2. Approximately 95% of the measurements will
have a z-score between –2 and 2.
3. Approximately 99.7% of the measurements will
have a z-score between –3 and 3.
(see the figure on the next slide)
Interpretation of z–Scores
z–Score Example
 The mean time to assemble a product is
22.5 minutes with a standard deviation of
2.5 minutes.
 Find the z–score for an item that took 20
minutes to assemble.
 Find the z–score for an item that took 27.5
minutes to assemble.
Numerical Measures of Relative
Standing: Percentiles

 Measures that describe the relative location of a measurement compared to the rest of the data are called measures of relative standing.
 The pth percentile is a number such that p% of
the data falls below it and (100 – p)% falls above
it
 Median = 50th percentile
Quartiles
Measure of noncentral tendency
Split ordered data into 4 quarters
25% 25% 25% 25%
Q1 Q2 Q3
Lower quartile QL is 25th percentile.
Middle quartile m is the median.

Upper quartile QU is 75th percentile.


Percentile Example

 You scored 560 on the GMAT exam. This score


puts you in the 58th percentile.
 What percentage of test takers scored lower
than you did?
 58% of test takers scored lower than 560.
 What percentage of test takers scored higher
than you did?
 (100 – 58)% = 42% of test takers scored
higher than 560.
Outlier
An observation (or measurement) that is unusually
large or small relative to the other values in a data
set is called an outlier. Outliers typically are
attributable to one of the following causes:
1. The measurement is observed, recorded, or
entered into the computer incorrectly.
2. The measurement comes from a different
population.
3. The measurement is correct but represents a
rare (chance) event.
Measure of noncentral tendency

Split ordered data into 4 quarters


25% 25% 25% 25%
Q1 Q2 Q3
Lower quartile QL is 25th percentile.
Middle quartile m is the median.
Upper quartile QU is 75th percentile.
Interquartile range: IQR = QU – QL
Quartile (Q2) Example

 Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7


 Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
 Position: 1 2 3 4 5 6

Q2 is the median, the average of the two middle


scores (7.7 + 8.9)/2 = 8.3
Quartile (Q1) Example

 Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7


 Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
 Position: 1 2 3 4 5 6

QL is median of bottom half = 6.3


Quartile (Q3) Example

 Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7


 Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
 Position: 1 2 3 4 5 6

QU is median of top half = 10.3
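
The three quartile examples use the median of each half of the ordered data. A sketch of that approach follows; the function names are illustrative, and the handling of an odd number of values (leaving the middle value out of both halves) is an assumption, since the slides only show an even-sized data set. Other textbooks and software packages use different quartile conventions and can give slightly different answers.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def quartiles(values):
    """Return (QL, median, QU) using the median of each half of the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: for odd n the middle value belongs to neither half.
    upper_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower_half), median(ordered), median(upper_half)


raw_data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
q_lower, q_middle, q_upper = quartiles(raw_data)
iqr = q_upper - q_lower
print(q_lower, q_middle, q_upper, iqr)
```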


Interquartile Range

1. Measure of dispersion
2. Also called midspread
3. Difference between upper and lower quartiles
 Interquartile Range = QU – QL
4. Spread in middle 50%
5. Not affected by extreme values
Thinking Challenge
 You’re a financial analyst for Prudential-Bache
Securities. You have collected the following
closing stock prices of new stock issues: 17,
16, 21, 18, 13, 16, 12, 11.
 What are the quartiles, Q1 and Q3, and the
interquartile range?
Box Plot

1. Graphical display of data using 5-number


summary

Xsmallest, Q1, Median, Q3, Xlargest

[Figure: box plot drawn on an axis marked 4, 6, 8, 10, 12]
Box Plot

1. Draw a rectangle (box) with the ends (hinges) drawn at the lower and upper quartiles (QL and QU). The median is shown by a line or symbol (such as "+").
2. The points at distances 1.5(IQR) from each hinge define the inner fences of the data set. Lines (whiskers) are drawn from each hinge to the most extreme measurements inside the inner fence.
Box Plot
3. A second pair of fences, the outer fences, are
defined at a distance of 3(IQR) from the
hinges. One symbol (*) represents
measurements falling between the inner and
outer fences, and another (0) represents
measurements beyond the outer fences.
4. Symbols that represent the median and
extreme data points vary depending on
software used. You may use your own
symbols if you are constructing a box plot by
hand.
Shape & Box Plot

[Figure: three box plots, for Left-Skewed, Symmetric, and Right-Skewed distributions, each marked with Q1, Median, and Q3]
Detecting Outliers

Box Plots: Observations falling between the inner


and outer fences are deemed suspect outliers.
Observations falling beyond the outer fence
are deemed highly suspect outliers.
z-scores: Observations with z-scores greater than
3 in absolute value are considered outliers.
(For some highly skewed data sets,
observations with z-scores greater than 2 in
absolute value may be outliers.)
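
Both detection rules can be sketched in a few lines. The function names, the labels returned, and the numbers in the usage lines are hypothetical illustrations, not values from the BUYLOV data or any other data set in these slides.

```python
def classify_by_fences(value, q_lower, q_upper):
    """Label a value using the inner (1.5 IQR) and outer (3 IQR) fences."""
    iqr = q_upper - q_lower
    inner_low, inner_high = q_lower - 1.5 * iqr, q_upper + 1.5 * iqr
    outer_low, outer_high = q_lower - 3.0 * iqr, q_upper + 3.0 * iqr
    if value < outer_low or value > outer_high:
        return "highly suspect outlier"
    if value < inner_low or value > inner_high:
        return "suspect outlier"
    return "not flagged"


def is_z_score_outlier(value, mean, std_dev, cutoff=3.0):
    """True when the z-score exceeds the cutoff in absolute value."""
    return abs((value - mean) / std_dev) > cutoff


# Hypothetical quartiles QL = 10 and QU = 20, so IQR = 10,
# inner fences at -5 and 35, outer fences at -20 and 50.
label_25 = classify_by_fences(25, 10, 20)
label_40 = classify_by_fences(40, 10, 20)
label_60 = classify_by_fences(60, 10, 20)

# Hypothetical mean 15 and standard deviation 4.
flag_30 = is_z_score_outlier(30, 15, 4)   # z = 3.75
flag_20 = is_z_score_outlier(20, 15, 4)   # z = 1.25
print(label_25, label_40, label_60, flag_30, flag_20)
```

Passing `cutoff=2.0` gives the looser rule mentioned for highly skewed data sets.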
Example

 In the Journal of Experimental Social


Psychology (Vol. 45, 2009) study on whether
money can buy love, recall that the researchers
measured the quantitative variable birthday gift
price (dollars) for each of the 237 participants.
Are there any unusual reported prices in the
BUYLOV data set?
Example
The Sample Covariance
 The sample covariance measures the strength of the
linear relationship between two variables (called
bivariate data)

 The sample covariance:


$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
 Only concerned with the strength of the relationship
 No causal effect is implied
Interpreting Covariance

 Covariance between two random variables:

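cov(X,Y) > 0: X and Y tend to move in the same direction

cov(X,Y) < 0: X and Y tend to move in opposite directions

cov(X,Y) = 0: X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not imply independence)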


Coefficient of Correlation
 Measures the relative strength of the linear
relationship between two variables

 Sample coefficient of correlation:


$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{\text{cov}(X, Y)}{S_X S_Y}$$
Features of
Correlation Coefficient, r
 Unit free
 Ranges between -1 and 1
 The closer to -1, the stronger the negative linear relationship
 The closer to 1, the stronger the positive linear relationship
 The closer to 0, the weaker the linear relationship
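
A minimal sketch of the sample covariance and the coefficient of correlation, written directly from their definitions. The function names and the small data sets are hypothetical, chosen so the results are easy to check by hand.

```python
import math


def sample_covariance(xs, ys):
    """Sum of (Xi - X-bar)(Yi - Y-bar), divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


def sample_std(values):
    # The covariance of a variable with itself is its sample variance.
    return math.sqrt(sample_covariance(values, values))


def correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # rises exactly in line with x
y_down = [10, 8, 6, 4, 2]   # falls exactly in line with x

cov_up = sample_covariance(x, y_up)
r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
print(cov_up, r_up, r_down)
```

The covariance depends on the units of X and Y, while r is unit free, which is why r is the easier of the two to interpret.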
Scatter Plots of Data with Various
Correlation Coefficients
[Figure: six scatter plots of Y against X; top row r = -1, r = -.6, r = 0; bottom row r = +1, r = +.3, r = 0]
Applications of standard deviation

 Quality Management: control chart


 Risk Management
Quality Management
Control Chart
Process Variation

Total Process Variation = Common Cause Variation + Special Cause Variation

 Variation is natural; inherent in the world


around us
 No two products or service experiences are
exactly the same
 With a fine enough gauge, all things can be
seen to differ
Process Variation

Total Process Variation = Common Cause Variation + Special Cause Variation

Variation is often due to differences in:


 People
 Machines
 Materials
 Methods
 Measurement
 Environment
Process Variation

Total Process Variation = Common Cause Variation + Special Cause Variation

Common cause variation


 naturally occurring and expected
 the result of normal variation in materials,
tools, machines, operators, and the
environment
Process Variation

Total Process Variation = Common Cause Variation + Special Cause Variation

Special cause variation


 abnormal or unexpected variation
 has an assignable cause
 variation beyond what is considered
inherent to the process
Control Limits
Forming the Upper control limit (UCL) and the Lower
control limit (LCL):

UCL = Process Mean + 3 Standard Deviations


LCL = Process Mean – 3 Standard Deviations

[Figure: control chart over time with the center line at the Process Average, the UCL at +3σ, and the LCL at -3σ]
Control Chart Basics

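Special cause variation: range of unexpected variability (outside the control limits)
Common cause variation: range of expected variability (between the LCL and the UCL)

[Figure: control chart over time with the center line at the Process Mean, the UCL at +3σ, and the LCL at -3σ]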
UCL = Process Mean + 3 Standard Deviations
LCL = Process Mean – 3 Standard Deviations
Process Variability
Special Cause of Variation:
A measurement this far from the process average is very
unlikely if only expected variation is present

±3σ → 99.7% of process values should be in this range (between the LCL and the UCL)

[Figure: control chart over time showing the UCL, the Process Mean, and the LCL]
UCL = Process Mean + 3 Standard Deviations
LCL = Process Mean – 3 Standard Deviations
Using Control Charts

Control Charts are used to check for process control

If the process is found to be out of control, steps


should be taken to find and eliminate the special
causes of variation
In-control Process

 A process is said to be in control when the control chart does not indicate any out-of-control condition
 Contains only common causes of variation
 If the common causes of variation are small, then the control chart can be used to monitor the process
 If the common causes of variation are too large, you need to alter the process
Process In Control

 Process in control: points are randomly


distributed around the center line and all
points are within the control limits

[Figure: control chart over time with all points between the LCL and the UCL, scattered around the Process Mean]
Process Not in Control

Out of control conditions:

 One or more points outside control limits


 8 or more points in a row on one side of the
center line
 8 or more points in a row moving in the same
direction
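
The control limits and the three out-of-control conditions can be checked with a short sketch. The function names, the signal labels, and the sample readings are hypothetical. One interpretation is assumed and not stated in the slides: "8 points in a row moving in the same direction" is read as eight consecutive points, each higher (or each lower) than the one before, which means seven consecutive moves in one direction.

```python
def control_limits(process_mean, process_std):
    """Return (LCL, UCL) at 3 standard deviations from the process mean."""
    return process_mean - 3 * process_std, process_mean + 3 * process_std


def sign(value):
    return (value > 0) - (value < 0)


def longest_run(signs):
    """Length of the longest streak of identical nonzero signs."""
    best = current = previous = 0
    for s in signs:
        current = current + 1 if (s != 0 and s == previous) else (1 if s != 0 else 0)
        previous = s
        best = max(best, current)
    return best


def out_of_control_signals(points, process_mean, process_std, run_length=8):
    lcl, ucl = control_limits(process_mean, process_std)
    signals = []
    if any(p < lcl or p > ucl for p in points):
        signals.append("outside limits")
    sides = [sign(p - process_mean) for p in points]
    if longest_run(sides) >= run_length:
        signals.append("run on one side")
    moves = [sign(b - a) for a, b in zip(points, points[1:])]
    if longest_run(moves) >= run_length - 1:  # 8 points give 7 moves
        signals.append("trend")
    return signals


# Hypothetical process: mean 50, standard deviation 2, so LCL = 44 and UCL = 56.
steady = out_of_control_signals([50.5, 49.5, 51, 49, 50.2, 49.8, 50.6, 49.1], 50, 2)
spike = out_of_control_signals([50, 51, 57, 49], 50, 2)
shifted = out_of_control_signals([51, 52, 51, 52, 51, 52, 51, 52], 50, 2)
drifting = out_of_control_signals([46, 47, 48, 49, 50, 51, 52, 53], 50, 2)
print(steady, spike, shifted, drifting)
```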
Process Not in Control
[Figure: three control charts, each showing the UCL, the Process Average, and the LCL, one per condition: one or more points outside the control limits; eight or more points in a row on one side of the center line; eight or more points in a row moving in the same direction]
Out-of-control Processes

 When the control chart indicates an out-of-control condition (a point outside the control limits or a trend, for example):
 The process contains both common causes of variation and assignable causes of variation
 The assignable causes of variation must be identified
 If detrimental to quality, assignable causes of variation must be removed
 If they increase quality, assignable causes must be incorporated into the process design
