
Chapter 3, Part A

Descriptive Statistics: Numerical Measures
• Measures of Location
• Measures of Variability

Numerical Measures
• If the measures are computed for data from a sample, they are called sample statistics.
• If the measures are computed for data from a population, they are called population parameters.
• A sample statistic is referred to as the point estimator of the corresponding population parameter.

Measures of Location
- Mean
- Median
- Mode
- Weighted Mean
- Geometric Mean
- Percentiles
- Quartiles

Mean
Perhaps the most important measure of location is the mean.
• The mean provides a measure of central location.
• The mean of a data set is the average of all the data values.
• The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.

Sample Mean $\bar{x}$
$\bar{x} = \frac{\sum x_i}{n}$
where:
$\sum x_i$ = sum of the values of the n observations
n = number of observations in the sample

Population Mean $\mu$
$\mu = \frac{\sum x_i}{N}$
where:
$\sum x_i$ = sum of the values of the N observations
N = number of observations in the population

Sample Mean $\bar{x}$
Example: Monthly Starting Salary
A placement office wants to know the average starting salary of business graduates. Monthly starting salaries for a sample of 12 business school graduates are provided here.
$\bar{x} = \frac{\sum x_i}{n} = \frac{47{,}280}{12} = 3{,}940$

Median
• The median of a data set is the value in the middle when the data items are arranged in ascending order.
• Whenever a data set has extreme values, the median is the preferred measure of central location.
• The median is the measure of location most often reported for annual income and property value data.
• A few extremely large incomes or property values can inflate the mean.
For an odd number of observations (7 observations in ascending order):
Median is the middle value; Median = 19
For an even number of observations (8 observations in ascending order):
Median is the average of the middle two values.
Median = (19 + 26)/2 = 22.5
Example: Monthly Starting Salary
Averaging the 6th and 7th data values:
Median = (3,890 + 3,920)/2 = 3,905

Trimmed Mean
• Another measure sometimes used when extreme values are present is the trimmed mean.
• It is obtained by deleting a percentage of the smallest and largest values from a data set and then computing the mean of the remaining values.
• For example, the 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values.
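
The mean, median, and trimmed mean for the starting-salary example can be sketched in Python as follows. The twelve salary values are an assumption: the data table is not reproduced in this text, so the list below is an illustrative set chosen to agree with the quoted figures (sum 47,280, with 3,890 and 3,920 as the 6th and 7th ordered values). The function names and the truncation rule in `trimmed_mean` are also illustrative choices, not part of the slides.

```python
# Assumption: the twelve salaries are not reproduced in the text. These values
# are illustrative and were chosen to agree with the figures it quotes
# (sum 47,280; 6th and 7th ordered values 3,890 and 3,920).
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]


def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Middle ordered value, or the average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def trimmed_mean(values, proportion):
    """Mean after dropping int(proportion * n) values from each end.

    Truncating the count with int() is an assumed convention; other
    conventions round or interpolate instead.
    """
    ordered = sorted(values)
    k = int(proportion * len(ordered))
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion trims away every observation")
    return sum(kept) / len(kept)


print(sample_mean(salaries))         # 3940.0
print(median(salaries))              # 3905.0
print(trimmed_mean(salaries, 0.10))  # 3924.5
```

With these assumed values, the 10% trimmed mean drops the single smallest and single largest salary, which pulls the result below the ordinary mean because the largest value sits farther from the center than the smallest.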

Mode
• The mode of a data set is the value that occurs with greatest frequency.
• The greatest frequency can occur at two or more different values.
• If the data have exactly two modes, the data are bimodal.
• If the data have more than two modes, the data are multimodal.
Example: Monthly Starting Salary
The only monthly starting salary that occurs more than once is $3,880.
Mode = 3,880

Geometric Mean
• The geometric mean is calculated by finding the nth root of the product of n values.
• It is often used in analyzing growth rates in financial data (where using the arithmetic mean will provide misleading results).
• It should be applied anytime you want to determine the mean rate of change over several successive periods (be it years, quarters, weeks, ...).
• Other common applications include changes in populations of species, crop yields, pollution levels, and birth and death rates.
$\bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n}$
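
A minimal Python sketch of the mode and the geometric mean is given below. The salary list is an assumed data set (the table is not reproduced in this text) in which 3,880 is the only repeated value, and the two growth factors are purely hypothetical numbers used to show why the arithmetic mean can mislead for rates of change.

```python
import math
from collections import Counter


def modes(values):
    """Every value that occurs with the greatest frequency, in ascending order.

    If each value occurs exactly once, this returns all of the values.
    """
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


def geometric_mean(factors):
    """nth root of the product of n positive values."""
    return math.prod(factors) ** (1 / len(factors))


# Assumed salary data, consistent with the mode quoted in the text.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]
print(modes(salaries))  # [3880]

# Hypothetical growth factors: +10% in one period, then -10% in the next.
growth_factors = [1.10, 0.90]
print(geometric_mean(growth_factors))             # about 0.995
print(sum(growth_factors) / len(growth_factors))  # about 1.0
```

The arithmetic mean of the two factors suggests no change, while the geometric mean of about 0.995 reflects the actual outcome: after both periods the quantity is at 99% of its starting level.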

Weighted Mean
• In some instances the mean is computed by giving each observation a weight that reflects its relative importance.
• The choice of weights depends on the application.
• The weights might be the number of credit hours earned for each grade, as in GPA.
• In other weighted mean computations, quantities such as pounds, dollars, or volume are frequently used.
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
where:
$x_i$ = value of observation i
$w_i$ = weight for observation i
Numerator: sum of the weighted data values
Denominator: sum of the weights
If the data are from a population, $\mu$ replaces $\bar{x}$.

Example: Purchase of Raw Material
Consider the following sample of five purchases of a raw material over a period of three months:
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i} = \frac{18{,}500}{6{,}250} = 2.96$
Weighted mean = $2.96
FYI, equally-weighted (simple) mean = $3.07

Percentiles
• A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
• Admission test scores for colleges and universities are frequently reported in terms of percentiles.
• The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.
Arrange the data in ascending order.
Compute $L_p$, the location of the pth percentile:
$L_p = (p/100)(n + 1)$

80th Percentile
Example: Monthly Starting Salary
$L_p = (p/100)(n + 1) = (80/100)(12 + 1) = 10.4$
(the 10th value plus 0.4 times the difference between the 11th and 10th values)
80th Percentile = 4,050 + 0.4(4,130 - 4,050) = 4,082
"At least 80% of the items take on a value of 4,082 or less." 10/12 = .833 or 83.3%
"At least 20% of the items take on a value of 4,082 or more." 2/12 = .167 or 16.7%
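
The weighted mean and the percentile location rule can be sketched in Python as below. Both data sets are assumptions, since neither table is reproduced in this text: the five purchases (cost per unit and quantity) are hypothetical values chosen to reproduce the quoted totals of 18,500 and 6,250, and the salaries are an illustrative set whose 10th and 11th ordered values are 4,050 and 4,130. The handling of locations that fall outside the data (clamping to the smallest or largest value) is also an assumed choice.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def percentile(values, p):
    """pth percentile using the location L_p = (p/100)(n + 1).

    The integer part of L_p picks the lower ordered value (1-based) and the
    fractional part interpolates toward the next ordered value.
    """
    ordered = sorted(values)
    n = len(ordered)
    location = (p / 100) * (n + 1)
    lower = int(location)
    fraction = location - lower
    if lower < 1:          # assumed: clamp below the first ordered value
        return ordered[0]
    if lower >= n:         # assumed: clamp above the last ordered value
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Hypothetical purchases: cost per unit and quantity bought.
costs = [3.00, 3.40, 2.80, 2.90, 3.25]
quantities = [1200, 500, 2750, 1000, 800]
print(round(weighted_mean(costs, quantities), 2))  # 2.96
print(round(sum(costs) / len(costs), 2))           # 3.07

# Assumed salary data, consistent with the figures quoted in the text.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]
print(round(percentile(salaries, 80), 2))          # 4082.0
```

The weighted mean falls below the simple mean here because the largest purchase was made at the lowest cost.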
Quartiles
Quartiles are specific percentiles.
• First Quartile = 25th Percentile
• Second Quartile = 50th Percentile = Median
• Third Quartile = 75th Percentile

Third Quartile (75th Percentile)
Example: Monthly Starting Salary
$L_p = (p/100)(n + 1) = (75/100)(12 + 1) = 9.75$
(the 9th value plus 0.75 times the difference between the 10th and 9th values)
Third quartile = 3,950 + 0.75(4,050 - 3,950) = 4,025

Interquartile Range
• The interquartile range of a data set is the difference between the third quartile and the first quartile.
• It is the range for the middle 50% of the data.
• It overcomes the sensitivity to extreme data values.
Example: Monthly Starting Salary
3rd Quartile (Q3) = 4,025
1st Quartile (Q1) = 3,857.5
IQR = Q3 - Q1 = 4,025 - 3,857.5 = 167.5
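
As a rough check, the quartiles and interquartile range can be computed with the same location rule $L_p = (p/100)(n + 1)$. The sketch below uses an assumed salary list (the data table is not reproduced in this text) and a hypothetical helper named `quartile_at`. With these values it gives the third quartile of 4,025 worked out above, together with the first quartile of 3,857.5 and IQR of 167.5 that this chapter uses later in its five-number summary and box plot limits.

```python
def quartile_at(values, p):
    """Percentile by the (p/100)(n + 1) location rule with interpolation.

    Assumes 1 <= L_p < n, which holds for p = 25, 50, 75 when n >= 3.
    """
    ordered = sorted(values)
    location = (p / 100) * (len(ordered) + 1)
    lower = int(location)            # 1-based position of the lower neighbour
    fraction = location - lower
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Assumed salary data, consistent with the figures quoted in the text.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

q1 = quartile_at(salaries, 25)
q2 = quartile_at(salaries, 50)
q3 = quartile_at(salaries, 75)
iqr = q3 - q1

print(q1, q2, q3)  # 3857.5 3905.0 4025.0
print(iqr)         # 167.5
```

Other software may use a different interpolation rule and report slightly different quartiles for the same data, so the rule in use is worth stating alongside the results.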

Measures of Variability
• It is often desirable to consider measures of variability (dispersion), as well as measures of location.
• For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.
- Range
- Interquartile Range
- Variance
- Standard Deviation
- Coefficient of Variation

Range
• The range of a data set is the difference between the largest and smallest data values.
Range = Largest value - Smallest value
• It is the simplest measure of variability.
• It is very sensitive to the smallest and largest data values.
Example: Monthly Starting Salary
Range = largest value - smallest value
Range = 4,325 - 3,710 = 615

Variance
• The variance is a measure of variability that utilizes all the data.
• It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
• The variance is useful in comparing the variability of two or more variables.
• The variance is the average of the squared differences between each data value and the mean.
• The variance is computed as follows:
for a sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
for a population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

Standard Deviation
• The standard deviation of a data set is the positive square root of the variance.
• It is measured in the same units as the data, making it more easily interpreted than the variance.
for a sample: $s = \sqrt{s^2}$
for a population: $\sigma = \sqrt{\sigma^2}$
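
The range, variance, and standard deviation, along with the coefficient of variation (the standard deviation expressed as a percentage of the mean), can be sketched in Python as follows. The salary list is an assumption, since the data table is not reproduced in this text; it was chosen so that the smallest and largest values are 3,710 and 4,325 and the mean is 3,940. Function names are illustrative.

```python
import math

# Assumed salary data, consistent with the figures quoted in the text.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]


def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


s2 = sample_variance(salaries)
s = math.sqrt(s2)
cv = coefficient_of_variation(s, sum(salaries) / len(salaries))

print(data_range(salaries))  # 615
print(round(s2, 2))          # 27440.91
print(round(s, 2))           # 165.65
print(round(cv, 1))          # 4.2
```

Dividing by n - 1 rather than N is what distinguishes the sample formula from the population formula; for the same data the sample variance is always the larger of the two.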
Coefficient of Variation
• The coefficient of variation indicates how large the standard deviation is in relation to the mean.
The coefficient of variation is computed as follows:
for a sample: $\left[\frac{s}{\bar{x}} \times 100\right]\%$
for a population: $\left[\frac{\sigma}{\mu} \times 100\right]\%$

Sample Variance, Standard Deviation, and Coefficient of Variation
Example: Monthly Starting Salary
Variance
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 27{,}440.91$
Standard Deviation
$s = \sqrt{s^2} = \sqrt{27{,}440.91} = 165.65$
Coefficient of Variation
$\left[\frac{s}{\bar{x}} \times 100\right]\% = \left[\frac{165.65}{3{,}940} \times 100\right]\% = 4.2\%$

Chapter 3, Part B
Descriptive Statistics: Numerical Measures

Measures of Distribution Shape, Relative Location, and Detecting Outliers
• Distribution Shape
• z-Scores
• Chebyshev's Theorem
• Empirical Rule
• Detecting Outliers

Distribution Shape: Skewness
• An important measure of the shape of a distribution is called skewness.
• The formula for the skewness of sample data is
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left[\frac{x_i - \bar{x}}{s}\right]^3$
• Skewness can be easily computed using statistical software.
1. Symmetric (not skewed)
- Skewness is zero
- Mean and median are equal
2. Moderately Skewed Left
- Skewness is negative
- Mean will usually be less than the median
3. Moderately Skewed Right
- Skewness is positive
- Mean will usually be more than the median
4. Highly Skewed Right
- Skewness is positive (often above 1.0)
- Mean will usually be more than the median

z-Scores
• The z-score is often called the standardized value.
• It denotes the number of standard deviations a data value $x_i$ is from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
• Excel's STANDARDIZE function can be used to compute the z-score.
• An observation's z-score is a measure of the relative location of the observation in a data set.
• A data value less than the sample mean will have a z-score less than zero.
• A data value greater than the sample mean will have a z-score greater than zero.
• A data value equal to the sample mean will have a z-score of zero.
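
The z-score and sample skewness formulas can be sketched in Python as below. The five class sizes are an assumption: the class-size table is not reproduced in this text, so the values were chosen to be consistent with the quoted mean of 44, standard deviation of 8, and the z-score of -1.5 for the fifth class. Function names are illustrative.

```python
import math

# Assumed class-size data, consistent with mean 44 and s = 8 quoted in the text.
class_sizes = [46, 54, 42, 46, 32]


def sample_std(values):
    """Square root of the sample variance (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def z_scores(values):
    """Number of sample standard deviations each value lies from the mean."""
    mean = sum(values) / len(values)
    s = sample_std(values)
    return [(x - mean) / s for x in values]


def skewness(values):
    """n / ((n-1)(n-2)) times the sum of cubed z-scores; needs n >= 3 and s > 0."""
    n = len(values)
    return n / ((n - 1) * (n - 2)) * sum(z ** 3 for z in z_scores(values))


print(z_scores(class_sizes))            # [0.25, 1.25, -0.25, 0.25, -1.5]
print(round(skewness(class_sizes), 3))  # -0.586
```

For this assumed data set the skewness is negative, which corresponds to the skewed-left case: the one small class of 32 pulls the mean (44) below the median (46).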
Example: Class Size Data
$z_i = \frac{x_i - \bar{x}}{s}$
Note: $\bar{x}$ = 44 and s = 8 for the given data.

Chebyshev's Theorem
• At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.
• Chebyshev's theorem requires z > 1, but z need not be an integer.
• At least 75% of the data values must be within z = 2 standard deviations of the mean.
• At least 89% of the data values must be within z = 3 standard deviations of the mean.
• At least 94% of the data values must be within z = 4 standard deviations of the mean.
• Example: Marks of students - Suppose the marks of 100 students in a course had a mean of 70 and a standard deviation of 5. We want to know the number of students having test scores between 60 and 80.
60 and 80 are 2 standard deviations below and above the mean, respectively.
- 60 = 70 - 2(5)
- 80 = 70 + 2(5)
- With z = 2, $(1 - 1/2^2) = 0.75$, so at least 75% of the students (at least 75 of the 100) have test scores between 60 and 80.
Number of students having test scores between 58 and 82:
(58 - 70)/5 = -2.4
(82 - 70)/5 = 2.4
z = 2.4
$(1 - 1/z^2) = (1 - 1/(2.4)^2) = 0.826 = 82.6\%$, so at least 82.6% of the students have test scores between 58 and 82.

Empirical Rule
When the data are believed to approximate a bell-shaped distribution:
• The empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.
• The empirical rule is based on the normal distribution, which is covered in Chapter 6.
For a bell-shaped distribution:
• Approximately 68% of the data values will be within +/- 1 standard deviation of the mean.
• Approximately 95% of the data values will be within +/- 2 standard deviations of the mean.
• Almost all of the data values will be within +/- 3 standard deviations of the mean.

Detecting Outliers
• An outlier is an unusually small or unusually large value in a data set.
• A data value with a z-score less than -3 or greater than +3 might be considered an outlier.
• It might be:
  - an incorrectly recorded data value
  - a data value that was incorrectly included in the data set
  - a correctly recorded unusual data value that belongs in the data set
Example: Class Size Data
• The z-score of -1.5 shows that the fifth class size is farthest from the mean.
• No outliers are present, as every z-score is within the +/- 3 guideline for outliers.

Five-Number Summaries and Box Plots
• Summary statistics and easy-to-draw graphs can be used to quickly summarize large quantities of data.
• Two tools that accomplish this are five-number summaries and box plots.
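
Chebyshev's bound and the z-score guideline for outliers can be checked numerically with a short Python sketch. The marks example (mean 70, standard deviation 5) comes from the text; the class sizes are an assumed data set, since the table is not reproduced here, chosen to be consistent with the quoted mean of 44 and standard deviation of 8. Function names and the default threshold argument are illustrative.

```python
import math


def chebyshev_lower_bound(z):
    """Minimum proportion of any data set within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


def z_score_outliers(values, threshold=3.0):
    """Values whose z-score is below -threshold or above +threshold."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [x for x in values if abs((x - mean) / s) > threshold]


mean_mark, sd_mark = 70, 5

z = (80 - mean_mark) / sd_mark             # 2.0 for the interval 60 to 80
print(chebyshev_lower_bound(z))            # 0.75

z = (82 - mean_mark) / sd_mark             # 2.4 for the interval 58 to 82
print(round(chebyshev_lower_bound(z), 3))  # 0.826

# Assumed class-size data, consistent with mean 44 and s = 8 quoted in the text.
class_sizes = [46, 54, 42, 46, 32]
print(z_score_outliers(class_sizes))       # []
```

Chebyshev's bound holds for any distribution, which is why it is weaker than the empirical rule: it guarantees at least 75% within 2 standard deviations, whereas a bell-shaped distribution has approximately 95% there.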
A five-number summary consists of:
• Smallest Value
• First Quartile
• Median
• Third Quartile
• Largest Value
Example: Monthly Starting Salary
• Lowest Value = 3,710
• First Quartile = 3,857.5
• Median = 3,905
• Third Quartile = 4,025
• Largest Value = 4,325

Box Plot
• A box plot is a graphical summary of data that is based on a five-number summary.
• A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3.
• Box plots provide another way to identify outliers.
Example: Monthly Starting Salary
• A box is drawn with its ends located at the first and third quartiles.
• A vertical line is drawn in the box at the location of the median (second quartile).
• Limits are located (not drawn) using the interquartile range (IQR).
• Data outside these limits are considered outliers.
• The location of each outlier is shown with a symbol.
• The lower limit is located 1.5(IQR) below Q1.
Lower Limit: Q1 - 1.5(IQR) = 3,857.5 - 1.5(167.5) = 3,606.25
• The upper limit is located 1.5(IQR) above Q3.
Upper Limit: Q3 + 1.5(IQR) = 4,025 + 1.5(167.5) = 4,276.25
• There is one outlier, 4,325, in this example.
• Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits.
Smallest value inside limits = 3,710
Largest value inside limits = 4,130

Measures of Association Between Two Variables
• Thus far we have examined numerical methods used to summarize the data for one variable at a time.
• Often a manager or decision maker is interested in the relationship between two variables.
• Two descriptive measures of the relationship between two variables are covariance and correlation coefficient.

Covariance
• The covariance is a measure of the linear association between two variables.
• Positive values indicate a positive relationship.
• Negative values indicate a negative relationship.
For sample: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
For population: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$

Correlation Coefficient
• Correlation is a measure of linear association and not necessarily causation.
• Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.
For sample: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
For population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
• The coefficient can take on values between -1 and +1.
• Values near -1 indicate a strong negative linear relationship.
• Values near +1 indicate a strong positive linear relationship.
• The closer the correlation is to zero, the weaker the relationship.
Example: Stereo and Sound Equipment Store
The store's manager wants to determine the relationship between the number of weekend television commercials shown and the sales at the store during the following week.

Sample Covariance
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{99}{9} = 11$

Sample Correlation Coefficient
$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{11}{(1.49)(7.93)} = 0.93$
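
The sample covariance and correlation calculations can be sketched in Python as below. The ten pairs of observations are an assumption: the data table for the store is not reproduced in this text, so the values were chosen to reproduce the quoted intermediate results (sum of cross-products 99 with n - 1 = 9, and standard deviations of about 1.49 and 7.93). Function and variable names are illustrative.

```python
import math

# Assumed data: weekly commercial counts and the following week's sales,
# chosen to be consistent with the figures quoted in the text.
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]


def sample_covariance(x, y):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def sample_std(values):
    """Sample standard deviation: the covariance of a variable with itself, square-rooted."""
    return math.sqrt(sample_covariance(values, values))


def sample_correlation(x, y):
    """Covariance scaled by both standard deviations, so the result lies in [-1, +1]."""
    return sample_covariance(x, y) / (sample_std(x) * sample_std(y))


print(sample_covariance(commercials, sales))             # 11.0
print(round(sample_std(commercials), 2))                 # 1.49
print(round(sample_std(sales), 2))                       # 7.93
print(round(sample_correlation(commercials, sales), 2))  # 0.93
```

Unlike the covariance, the correlation coefficient does not depend on the units of measurement, which is why 0.93 can be read directly as a strong positive linear relationship.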

Data Dashboards:
Adding Numerical Measures to Improve Effectiveness
• Data dashboards are not limited to graphical displays.
• The addition of numerical measures, such as the mean and standard deviation of KPIs, to a data dashboard is often critical.
• Dashboards are often interactive.
• Drilling down refers to functionality in interactive dashboards that allows the user to access information and analyses at an increasingly detailed level.
