
DATA DESCRIPTION

After completing this chapter, you should be able to


1. Summarize data, using measures of central tendency, such as the mean, median, mode, and midrange.
2. Describe data, using measures of variation, such as the range, variance, and standard deviation.
3. Identify the position of a data value in a data set, using various measures of position, such as percentiles,
deciles, and quartiles.

Summation Notation
The most common symbol or notation used in statistics is the summation notation or simply summation ( ∑ ).

$$\sum_{i=1}^{n} X_i = X_1 + X_2 + X_3 + \cdots + X_n$$

Read: "The summation of X sub i, from i = 1 to i = n."


i is called the index of summation,
1 the lower limit, and
n the upper limit

Theorems on Summation
1. The summation of the sum of two or more variables is the sum of their summations. Thus,
$$\sum_{i=1}^{n} (x_i + y_i + z_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i + \sum_{i=1}^{n} z_i$$
2. If c is a constant, then
$$\sum_{i=1}^{n} c x_i = c \sum_{i=1}^{n} x_i$$
3. If c is a constant, then
$$\sum_{i=1}^{n} c = nc$$
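
The three theorems can be checked numerically. The following is a minimal Python sketch, assuming the x and y values given in Exercise C and a constant c = 5 chosen only for illustration. It also shows that a sum of squares is generally not the square of a sum.

```python
# Values taken from Exercise C; c is an arbitrary constant for illustration.
x = [4, -3, 6, 2]
y = [-2, 5, -1, 3]
c = 5
n = len(x)

# Theorem 1: the sum of (x_i + y_i) equals sum(x) + sum(y).
lhs1 = sum(xi + yi for xi, yi in zip(x, y))
rhs1 = sum(x) + sum(y)

# Theorem 2: a constant factor can be moved outside the summation.
lhs2 = sum(c * xi for xi in x)
rhs2 = c * sum(x)

# Theorem 3: summing a constant n times gives n * c.
lhs3 = sum(c for _ in range(n))
rhs3 = n * c

# Caution: squaring inside the sum differs from squaring the sum.
sum_of_squares = sum(xi ** 2 for xi in x)   # x_1^2 + ... + x_4^2
square_of_sum = sum(x) ** 2                 # (x_1 + ... + x_4)^2

print(lhs1 == rhs1, lhs2 == rhs2, lhs3 == rhs3)
print(sum_of_squares, square_of_sum)
```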

Exercise:
A. Write the following in full.
1. $\sum_{i=1}^{5} 5x_i$
2. $\sum_{i=3}^{6} (x_i + 2y_i - 3)$
3. $\sum_{i=2}^{5} 3x_i^2$
4. $\sum_{i=1}^{4} (3x_i)^2$
5. $\sum_{i=1}^{4} (x_{i+2} - 3)$
B. Write each of the following expressions in summation notation with appropriate limits.
1. $2x_1 + 2x_2 + 2x_3 + 2x_4 + 2x_5 + 2x_6 + 2x_7 + 2x_8$
2. $(x_2 - 3y_2) + (x_3 - 3y_3) + (x_4 - 3y_4) + (x_5 - 3y_5)$
3. $(a_3 + 5)^2 + (a_4 + 5)^2 + (a_5 + 5)^2 + (a_6 + 5)^2$
4. $(x_1^2 + 1) + (x_2^2 + 2) + (x_3^2 + 3) + (x_4^2 + 4) + (x_5^2 + 5)$
5. $1 + 4 + 9 + 16 + 25 + \cdots + n^2$


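C. Given:
$x_1 = 4$, $y_1 = -2$
$x_2 = -3$, $y_2 = 5$
$x_3 = 6$, $y_3 = -1$
$x_4 = 2$, $y_4 = 3$
Evaluate the following.
1. $\sum_{i=1}^{4} (2x_i + 3y_i - 4)$
2. $\sum_{i=1}^{4} x_i^2 y_i^3$
3. $\left(\sum_{i=1}^{4} x_i^2\right)\left(\sum_{i=1}^{4} y_i^3\right)$
4. $\left(\sum_{i=1}^{4} x_i\right)^2 \left(\sum_{i=1}^{4} y_i\right)^3$
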
MEASURES OF CENTRAL TENDENCY AND LOCATION

Measures of Central Tendency for Raw Data


Any measure indicating the center of a set of data, arranged in an increasing or decreasing order of magnitude,
is called a measure of central tendency. The most commonly used measures of central tendency are the mean,
median, and mode.

The Mean ( $\bar{x}$ )
The mean, also known as the arithmetic average, is found by adding the values of the data and dividing by
the total number of values. If $x_1, x_2, x_3, \ldots, x_n$ represent a finite set of observations of size n, then the mean is
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The population mean is denoted by μ and the sample mean is denoted by $\bar{x}$. The mean should be rounded to one
more decimal place than occurs in the raw data.

Examples
1. The data represent the number of days off per year for a sample of individuals selected from nine different
countries. Find the mean.
20, 26, 40, 36, 23, 42, 35, 24, 30
2. The numbers of building permits issued last month to 12 construction firms in a small city were 4, 7, 0, 7,
11, 4, 1, 15, 3, 5, 8, and 7. Find the mean.
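
As a rough illustration, the mean of both example data sets can be computed in Python. This is a sketch only; the variable names are illustrative, and `statistics.mean` from the standard library is used as a cross-check on the manual formula.

```python
from statistics import mean

days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]
permits = [4, 7, 0, 7, 11, 4, 1, 15, 3, 5, 8, 7]

# Sum of the values divided by the number of values.
mean_days = sum(days_off) / len(days_off)

# statistics.mean performs the same computation.
mean_permits = mean(permits)

# Round to one more decimal place than the raw data (whole numbers here).
print(round(mean_days, 1))
print(round(mean_permits, 1))
```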

The Median ( $\tilde{X}$ )
The median is the halfway point in a data set. Before you can find this point, the data must be arranged in
order. When the data set is ordered, it is called a data array. The median either will be a specific value in the data
set or will fall between two values.

Examples
1. The number of rooms in the seven hotels in downtown Pittsburgh is 713, 300, 618, 595, 311, 401, and 292.
Find the median.
2. The numbers of building permits issued last month to 12 construction firms in a small city were 4, 7, 0, 7,
11, 4, 1, 15, 3, 5, 8, and 7. Find the median.
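
The procedure can be sketched in Python as follows: order the data, then take the middle value when n is odd or the average of the two middle values when n is even. The function name `median` is illustrative; the standard library's `statistics.median` behaves the same way.

```python
def median(values):
    """Return the halfway point of the ordered data (the data array)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the median is a specific value in the data set.
        return ordered[mid]
    # Even n: the median falls between the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

rooms = [713, 300, 618, 595, 311, 401, 292]
permits = [4, 7, 0, 7, 11, 4, 1, 15, 3, 5, 8, 7]

print(median(rooms))
print(median(permits))
```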

The Mode ( X̂ )
The mode of a set of observations is that value which occurs most often or with the greatest frequency. A
data set that has only one value that occurs with the greatest frequency is said to be unimodal. If a data set has two
values that occur with the same greatest frequency, both values are considered to be the mode and the data set is said
to be bimodal. If a data set has more than two values that occur with the same greatest frequency, each value is used
as the mode, and the data set is said to be multimodal. When no data value occurs more than once, the data set is
said to have no mode. A data set can have more than one mode or no mode at all.

Examples
1. Find the mode of the signing bonuses of eight NFL players for a specific year. The bonuses in millions
of dollars are 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10.
2. Find the mode for the number of coal employees per county for 10 selected counties in southwestern
Pennsylvania. 110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752
3. The data show the number of licensed nuclear reactors in the United States for a recent 15-year period. Find
the mode.
104 104 104 104 104
107 109 109 109 110
109 111 112 111 109
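
One way to find the mode programmatically is to count how often each value occurs. The sketch below assumes a non-empty list and returns every value tied for the greatest frequency, or an empty list when no value repeats; the function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return all values with the greatest frequency, or [] if there is no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        # No value occurs more than once: the data set has no mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)

bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
coal = [110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752]
reactors = [104, 104, 104, 104, 104,
            107, 109, 109, 109, 110,
            109, 111, 112, 111, 109]

print(modes(bonuses))    # unimodal
print(modes(coal))       # no mode
print(modes(reactors))   # bimodal
```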

The mode is the only measure of central tendency that can be used in finding the most typical case when
the data are nominal or categorical.

Example. A survey showed this distribution for the number of students enrolled in each field. Find the mode.
Business 1425
Liberal arts 878
Computer science 632
Education 471
General studies 95

The Midrange (MR)


The midrange is a rough estimate of the middle. It is found by adding the lowest and highest values in the
data set and dividing by 2. It is a very rough estimate of the average and can be affected by one extremely high or
low value.
Examples
1. The numbers of accidents at a city intersection last month were reported as 2, 3, 6, 8, 4, 1. Find the
midrange.
2. Find the midrange of the signing bonuses of eight NFL players for a specific year. The bonuses in
millions of dollars are 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10.
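
A short Python sketch of the midrange, applied to the two examples above. The second result shows how a single high value (34.5) pulls the midrange well above most of the data.

```python
def midrange(values):
    """Average of the lowest and highest values."""
    return (min(values) + max(values)) / 2

accidents = [2, 3, 6, 8, 4, 1]
bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]

print(midrange(accidents))
print(midrange(bonuses))
```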

Weighted Mean
The type of mean that considers an additional factor is called the weighted mean, and it is used when the
values are not all equally represented.
If k quantities $x_1, x_2, \ldots, x_k$ have weights $w_1, w_2, \ldots, w_k$, respectively, where the weights
represent measures of relative importance, then the weighted mean is
$$\bar{x}_w = \frac{\sum x_i w_i}{\sum w_i}$$

Examples:
1. What is the average for a student who received grades of 85, 76, and 82 on three tests and a 79 on the final
examination in a certain course if the final examination counts three times as much as each of the three
tests?

2. On a vacation trip a family bought 21.3 liters of gasoline at 39.9 cents per liter, 18.7 liters at 42.9 cents per
liter, and 23.5 liters at 40.9 cents per liter. Find the mean price paid per liter.

3. A savings and loan association makes one car loan of $5000 at 10.5% interest, a second car loan of $6300
at 10.8% interest, and a third car loan of $4500 at 11% interest. What is the average percentage return to the
savings and loan association for these three loans?
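
The weighted mean formula can be sketched in Python as below. The choice of weights is an interpretation of the examples: each test gets weight 1 and the final examination weight 3, and each gasoline price is weighted by the number of liters bought at that price.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

# Example 1: three tests (weight 1 each) and a final exam (weight 3).
grades = [85, 76, 82, 79]
grade_weights = [1, 1, 1, 3]

# Example 2: prices in cents per liter, weighted by liters purchased.
prices = [39.9, 42.9, 40.9]
liters = [21.3, 18.7, 23.5]

print(weighted_mean(grades, grade_weights))
print(round(weighted_mean(prices, liters), 2))
```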

Combined Mean
If k finite groups having $n_1, n_2, \ldots, n_k$ measurements, respectively, have means $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$, the
combined mean is
$$\bar{x}_c = \frac{\sum n_i \bar{x}_i}{\sum n_i}$$
Examples:
1. Three sections of a statistics class containing 28, 32, and 35 students averaged 83, 80, and 76, respectively,
on the same final examination. What is the combined mean for all three sections?

2. A survey of a random sample of people leaving an amusement park showed an average expenditure of
$10.30 for the evening. The average expenditure for the 20 girls in the sample was $9.70 and for the boys it
was $11.10. How many boys are there in the random sample?
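
A minimal Python sketch of the combined mean for the first example, followed by the second example solved by rearranging the same formula for the unknown group size. The rearrangement is an added step, not part of the original text: from $\bar{x}_c (n_g + n_b) = n_g \bar{x}_g + n_b \bar{x}_b$ it follows that $n_b = n_g(\bar{x}_c - \bar{x}_g) / (\bar{x}_b - \bar{x}_c)$.

```python
def combined_mean(sizes, means):
    """Group means weighted by group sizes."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# Example 1: three sections of a statistics class.
sections = [28, 32, 35]
averages = [83, 80, 76]
print(round(combined_mean(sections, averages), 2))

# Example 2: solve the combined-mean formula for the number of boys.
n_girls = 20
overall, girls_mean, boys_mean = 10.30, 9.70, 11.10
n_boys = n_girls * (overall - girls_mean) / (boys_mean - overall)
print(round(n_boys))
```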

Measures of Central Tendency for Grouped Data


Mean:
$$\bar{x} = \frac{\sum f_i x_i}{n}$$
where: $f_i$ = frequency
$x_i$ = class mark
n = total number of observations

Median:
$$\tilde{X} = L + \frac{\left(\frac{n}{2} - S_b\right) i}{f_m}$$
where: L = lower boundary of the median class
n = total number of observations
$S_b$ = sum of the frequencies before the median class
$f_m$ = frequency of the median class
i = size of the class interval

Mode:
$$\hat{X} = L + \frac{\Delta_1 \, i}{\Delta_1 + \Delta_2}$$
where: L = lower boundary of the modal class
$\Delta_1$ = difference between the frequencies of the modal class and the next lower class
$\Delta_2$ = difference between the frequencies of the modal class and the next higher class
i = size of the class interval
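
The grouped-data formulas can be sketched in Python as follows. The frequency table is borrowed from the net worth example in the Measures of Variation section. The sketch assumes integer class limits, so each lower class boundary lies 0.5 below the lower class limit, and it treats a missing neighboring class as having frequency 0 when computing the mode.

```python
# Class limits and frequencies from the net worth table (45 corporations).
lower_limits = [10, 21, 32, 43, 54, 65]
upper_limits = [20, 31, 42, 53, 64, 75]
freqs = [2, 8, 15, 7, 10, 3]

n = sum(freqs)
width = lower_limits[1] - lower_limits[0]            # size of the class interval, i
marks = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]
boundaries = [lo - 0.5 for lo in lower_limits]       # assumed lower class boundaries

# Mean: sum of f_i * x_i divided by n.
grouped_mean = sum(f * x for f, x in zip(freqs, marks)) / n

# Median class: first class whose cumulative frequency reaches n/2.
cum = 0                                              # S_b
for k, f in enumerate(freqs):
    if cum + f >= n / 2:
        break
    cum += f
grouped_median = boundaries[k] + (n / 2 - cum) * width / freqs[k]

# Modal class: class with the highest frequency.
m = freqs.index(max(freqs))
d1 = freqs[m] - (freqs[m - 1] if m > 0 else 0)
d2 = freqs[m] - (freqs[m + 1] if m < len(freqs) - 1 else 0)
grouped_mode = boundaries[m] + d1 * width / (d1 + d2)

print(round(grouped_mean, 2), round(grouped_median, 2), round(grouped_mode, 2))
```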
Properties and Uses of Central Tendency
The Mean
1. The mean is found by using all the values of the data.
2. The mean varies less than the median or mode when samples are taken from the same population and all
three measures are computed for these samples.
3. The mean is used in computing other statistics, such as the variance.
4. The mean for the data set is unique and not necessarily one of the data values.
5. The mean cannot be computed for the data in a frequency distribution that has an open-ended class.
6. The mean is affected by extremely high or low values, called outliers, and may not be the appropriate average
to use in these situations.
The Median
1. The median is used to find the center or middle value of a data set.
2. The median is used when it is necessary to find out whether the data values fall into the upper half or lower
half of the distribution.
3. The median is used for an open-ended distribution.
4. The median is affected less than the mean by extremely high or extremely low values.
The Mode
1. The mode is used when the most typical case is desired.
2. The mode is the easiest average to compute.
3. The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation.
4. The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a
data set.
The Midrange
1. The midrange is easy to compute.
2. The midrange gives the midpoint.
3. The midrange is affected by extremely high or low values in a data set.

MEASURES OF VARIATION
Measures of variability or dispersion are measures of the average distance of each observation from the
center of the distribution. They measure the homogeneity or heterogeneity of a particular group.

A small measure of variability would indicate that the data are: clustered closely around the mean, more
homogeneous, less variable, more consistent, and more uniformly distributed.

There are two general classifications of measures of variability or dispersion: measures of absolute
dispersion and measures of relative dispersion.

Measures of Absolute Dispersion


The measures of absolute dispersion are expressed in the units of the original observations. They cannot be
used to compare variations of two data sets when the averages of these data sets differ a lot in value or when the
observations differ in units of measurement.

1. The Range
The range is the difference between the highest and the lowest values. This is the simplest but the most
unreliable measure of variability since it uses only two values in the distribution.

Range = Highest Value – Lowest Value

2. The Variance
Variance is the average of the squared deviation from the mean.

For raw data:


Population variance:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
Sample variance:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

For grouped data:


Population variance:
$$\sigma^2 = \frac{\sum f (X - \mu)^2}{N}$$

Sample variance:
$$s^2 = \frac{n \sum f_i X_i^2 - \left(\sum f_i X_i\right)^2}{n(n - 1)}$$

where: $X_i$ = class marks
$f_i$ = frequencies
n = total number of observations
3. The Standard Deviation
Standard deviation is the square root of the average squared deviation from the mean, or simply the square root of
the variance.

For raw data:


Population standard deviation:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
Sample standard deviation:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

For grouped data :

Population standard deviation:
$$\sigma = \sqrt{\frac{\sum f (X - \mu)^2}{N}}$$

Sample standard deviation:
$$s = \sqrt{\frac{n \sum f X^2 - \left(\sum f X\right)^2}{n(n - 1)}}$$

Example :
1. Consider the following sets of grades in Mathematics of two groups of 5 students each.
Male group : 70, 95, 60, 80, 100
Female group : 82, 80, 83, 81, 79
Find the range, variance, and standard deviation for each set.

2. Net Worth of Corporations. These data represent the net worth (in millions of dollars) of 45 national corporations.
Class limits Frequency
10–20 2
21–31 8
32–42 15
43–53 7
54–64 10
65–75 3

Calculate the variance and standard deviation.
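
Both examples can be worked in Python. This is a sketch under two assumptions: each group of grades is treated as a sample (so the n - 1 formulas apply, as in `statistics.variance` and `statistics.stdev`), and the class marks for the grouped data are the midpoints of the class limits.

```python
from math import sqrt
from statistics import variance, stdev

male = [70, 95, 60, 80, 100]
female = [82, 80, 83, 81, 79]

def describe(values):
    """Return (range, sample variance, sample standard deviation)."""
    return max(values) - min(values), variance(values), stdev(values)

# Grouped data: class marks (midpoints) and frequencies from the net worth table.
marks = [15, 26, 37, 48, 59, 70]
freqs = [2, 8, 15, 7, 10, 3]
n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, marks))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, marks))

# Sample variance for grouped data: [n*sum(fX^2) - (sum(fX))^2] / [n(n - 1)].
grouped_var = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
grouped_sd = sqrt(grouped_var)

print(describe(male))
print(describe(female))
print(round(grouped_var, 1), round(grouped_sd, 1))
```

The two groups have the same mean (81), but the male group's range, variance, and standard deviation are far larger, so its grades are much less consistent.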

The range can be used to approximate the standard deviation. The approximation is called the range rule
of thumb. A rough estimate of the standard deviation is $s \approx \frac{\text{range}}{4}$.

Uses of the Variance and Standard Deviation


1. Variances and standard deviations can be used to determine the spread of the data. If the variance or standard
deviation is large, the data are more dispersed. This information is useful in comparing two (or more) data
sets to determine which is more (most) variable.
2. The measures of variance and standard deviation are used to determine the consistency of a variable. For
example, in the manufacture of fittings, such as nuts and bolts, the variation in the diameters must be small,
or the parts will not fit together.
3. The variance and standard deviation are used to determine the number of data values that fall within a
specified interval in a distribution. For example, Chebyshev’s theorem shows that, for any distribution, at
least 75% of the data values will fall within 2 standard deviations of the mean.
4. Finally, the variance and standard deviation are used quite often in inferential statistics.

Measures of Relative Dispersion


Measures of relative dispersion are unitless and are used when one wishes to compare the scatter of one
distribution with another distribution.

Coefficient of Variation
- describes the standard deviation relative to the mean
Coefficient of variation (CV) is the ratio of the standard deviation to the mean and is usually expressed in
percentage. It is used to compare the variability of two or more sets of data even when they are expressed in
different units of measurement.
$$CV = \frac{s}{\bar{x}}$$
where: s = standard deviation
$\bar{x}$ = mean
Example: Height and Weight of Men
Using the height and weight data for 40 males included in a sample, the statistics are given below.
Find the coefficient of variation for the height, the coefficient of variation for the weight, and then compare the
results.
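
The height and weight statistics for this example are not reproduced here, so the sketch below illustrates the coefficient of variation with the male and female grade data from the variance example instead. It assumes sample statistics and expresses the result as a percentage.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return stdev(values) / mean(values) * 100

male = [70, 95, 60, 80, 100]
female = [82, 80, 83, 81, 79]

cv_male = coefficient_of_variation(male)
cv_female = coefficient_of_variation(female)

# The larger CV indicates the more variable data set.
print(round(cv_male, 2), round(cv_female, 2))
```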

Applications of the Standard Deviation


1. EMPIRICAL (NORMAL) RULE – applies to bell-shaped distributions

Example. IQ scores of normal adults on the Wechsler test have a bell shape distribution with a mean of 100
and a standard deviation of 15. What percentage of adults have IQ scores between 55 and 145?
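
The empirical rule is commonly stated as follows: in a bell-shaped distribution, about 68% of the values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3. A minimal Python sketch applying it to the IQ example, assuming the interval is symmetric about the mean (the function name and lookup table are illustrative):

```python
# Approximate percentages of the empirical rule for bell-shaped distributions.
EMPIRICAL = {1: 68.0, 2: 95.0, 3: 99.7}

def empirical_percentage(low, high, mean, sd):
    """Approximate percentage within [low, high] when the interval is
    exactly 1, 2, or 3 standard deviations on each side of the mean."""
    k_low = (mean - low) / sd
    k_high = (high - mean) / sd
    if k_low != k_high or k_low not in EMPIRICAL:
        raise ValueError("interval is not symmetric at 1, 2, or 3 standard deviations")
    return EMPIRICAL[k_low]

# IQ scores: mean 100, standard deviation 15, interval 55 to 145.
print(empirical_percentage(55, 145, 100, 15))
```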

2. CHEBYSHEV’S THEOREM
- Applies to distributions of any shape
At least the fraction $1 - \frac{1}{k^2}$ of the measurements of any set of data must lie within k standard
deviations of the mean, for any k > 1.

Examples:
1. If the IQs of a random sample of 1080 students at a large university have a mean score of 120 and a
standard deviation of 8,
a. Determine the interval containing at least 810 of the IQs in the sample,
b. In what range can we be sure that no more than 120 of the scores fall?

2. A coffee-maker is regulated so that it takes an average of 5.8 minutes to brew a cup of coffee with a
standard deviation of 0.6 minute. According to Chebyshev’s theorem, what percentage of the times
that this coffee-maker is used will the brewing time take anywhere from
a. 4.6 minutes to 7.0 minutes
b. 3.4 minutes to 8.2 minutes
c. 4.3 minutes to 7.3 minutes

Note: For typical data sets, it is unusual for a score to differ from the mean by more than 2 or 3 standard deviations.

3. A study of the nicotine contents of a certain brand of cigarette shows that on the average one cigarette
contains 1.52 milligrams of nicotine with a standard deviation of 0.07 milligram. According to
Chebyshev’s theorem, between what values must the nicotine content be for
a. at least $\frac{24}{25}$ of all cigarettes of this brand?
b. at least $\frac{48}{49}$ of all cigarettes of this brand?
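
Chebyshev's bound is straightforward to evaluate in code. The sketch below applies it to the coffee-maker example, assuming each interval is symmetric about the mean so that k is half the interval width divided by the standard deviation; the function names are illustrative.

```python
def chebyshev_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is informative only for k > 1")
    return 1 - 1 / k ** 2

def k_for_interval(low, high, sd):
    """k for an interval assumed to be symmetric about the mean."""
    return (high - low) / (2 * sd)

sd_time = 0.6   # standard deviation of the brewing time, in minutes
for low, high in [(4.6, 7.0), (3.4, 8.2), (4.3, 7.3)]:
    k = k_for_interval(low, high, sd_time)
    percent = 100 * chebyshev_bound(k)
    print(low, high, round(k, 2), round(percent, 2))
```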
MEASURES OF POSITION
Standard Scores
A z score or standard score for a value is obtained by subtracting the mean from the value and dividing
the result by the standard deviation. The symbol for a standard score is z.
For samples, the formula is $z = \frac{x - \bar{x}}{s}$
For populations, the formula is $z = \frac{x - \mu}{\sigma}$
The z score represents the number of standard deviations that a data value falls above or below the mean.

Example: A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she
scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative positions
on the two tests.
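
A short Python sketch of the comparison. The higher z score marks the test on which the student's relative position is better.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above or below the mean."""
    return (value - mean) / sd

z_calculus = z_score(65, 50, 10)
z_history = z_score(30, 25, 5)

print(z_calculus, z_history)
```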

Other Measures of Location


There are several other measures of location that describe or locate the position of certain non-central
pieces of data relative to the entire set of data. These measures, often referred to as fractiles or quantiles, are values
below which a specific fraction or percentage of the observations in a given set must fall. Of special interest are
those fractiles commonly referred to as quartiles, deciles, and percentiles.

Quartiles
Quartiles are values that divide a set of observations into 4 equal parts. These values, denoted by Q1, Q2,
and Q3, are such that
25% of the data falls below Q1,
50% of the data falls below Q2 , and
75% of the data falls below Q3.

Determining the Quartiles for Ungrouped Data


1. Arrange the data from lowest to highest
2. Find the location of the kth quartile, denoted by $LQ_k$, using the formula $LQ_k = \frac{k}{4}(n + 1)$.
3. Determine the value of the kth quartile, denoted by $Q_k$, using the formula
$Q_k = LS + dec\,(HS - LS)$
where LS = score corresponding to the integral part of $LQ_k$
dec = decimal part of $LQ_k$
HS = score succeeding LS

Example. The lengths of service (in years) of 16 employees in a certain town hall are
7 1 5 35 28 10 15 22
11 10 12 6 8 14 18 16
Find Q1, Q2, and Q3.
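
The three steps can be sketched in Python as follows. The sketch assumes at least three observations so that the location always falls inside the ordered data; the function name `quartile` is illustrative. Note that this (n + 1) location rule is only one of several conventions, so other software may report slightly different quartiles.

```python
def quartile(values, k):
    """kth quartile (k = 1, 2, 3) using the location k(n + 1)/4."""
    ordered = sorted(values)            # step 1: arrange from lowest to highest
    n = len(ordered)
    location = k * (n + 1) / 4          # step 2: location of the kth quartile
    whole = int(location)               # integral part of the location
    dec = location - whole              # decimal part of the location
    ls = ordered[whole - 1]             # LS: score at the integral position (1-based)
    if dec == 0:
        return ls
    hs = ordered[whole]                 # HS: score succeeding LS
    return ls + dec * (hs - ls)         # step 3: interpolate between LS and HS

service = [7, 1, 5, 35, 28, 10, 15, 22,
           11, 10, 12, 6, 8, 14, 18, 16]

print(quartile(service, 1), quartile(service, 2), quartile(service, 3))
```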

Deciles
Deciles are values that divide a set of observations into 10 equal parts. These values, denoted by D1, D2, . . . ,
D9, are such that
10% of the data falls below D1 ,
20% of the data falls below D2 ,
.
.
.
90% of the data falls below D9 .

Determining the Deciles for Ungrouped Data


1. Arrange the data from lowest to highest
2. Find the location of the kth decile, denoted by $LD_k$, using the formula $LD_k = \frac{k}{10}(n + 1)$.
3. Determine the value of the kth decile, denoted by $D_k$, using the formula
$D_k = LS + dec\,(HS - LS)$
where LS = score corresponding to the integral part of $LD_k$
dec = decimal part of $LD_k$
HS = score succeeding LS

Example. The lengths of service (in years) of 16 employees in a certain town hall are
7 1 5 35 28 10 15 22
11 10 12 6 8 14 18 16
Find D1, D4, and D9.
Percentiles
Percentiles are values that divide a set of observations into 100 equal parts. These values, denoted by P1,
P2, P3, . . . , P99 , are such that
1% of the data falls below P1,
2% of the data falls below P2,
3% of the data falls below P3,
.
.
.
99% of the data falls below P99.

Determining the Percentiles for Ungrouped Data


1. Arrange the data from lowest to highest
2. Find the location of the kth percentile, denoted by $LP_k$, using the formula $LP_k = \frac{k}{100}(n + 1)$.
3. Determine the value of the kth percentile, denoted by $P_k$, using the formula
$P_k = LS + dec\,(HS - LS)$
where LS = score corresponding to the integral part of $LP_k$
dec = decimal part of $LP_k$
HS = score succeeding LS

Example. The lengths of service (in years) of 16 employees in a certain town hall are
7 1 5 35 28 10 15 22
11 10 12 6 8 14 18 16
Find P17, P43, and P87.
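
Deciles and percentiles follow the same procedure as quartiles, with the location k(n + 1)/10 for deciles and k(n + 1)/100 for percentiles. A generalized Python sketch, where `parts` is 10 or 100. Clamping a location that falls before the first or at or beyond the last observation is an assumption added here, since the text does not cover that case.

```python
def fractile(values, k, parts):
    """kth fractile for `parts` equal parts, using the location k(n + 1)/parts."""
    ordered = sorted(values)
    n = len(ordered)
    location = k * (n + 1) / parts
    whole = int(location)               # integral part of the location
    dec = location - whole              # decimal part of the location
    # Assumption (not in the text): clamp locations that fall outside the data.
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    ls, hs = ordered[whole - 1], ordered[whole]
    return ls + dec * (hs - ls)

service = [7, 1, 5, 35, 28, 10, 15, 22,
           11, 10, 12, 6, 8, 14, 18, 16]

deciles = [round(fractile(service, k, 10), 2) for k in (1, 4, 9)]
percentiles = [round(fractile(service, k, 100), 2) for k in (17, 43, 87)]
print(deciles)
print(percentiles)
```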

Quantiles for Grouped Data

$$Q_k = L + \frac{\left(\frac{kn}{4} - S_b\right) i}{f_q}$$
where:
L = lower boundary of the quartile class
n = total number of observations
$S_b$ = sum of the frequencies before the quartile class
$f_q$ = frequency of the quartile class
i = size of the class interval

$$D_k = L + \frac{\left(\frac{kn}{10} - S_b\right) i}{f_d}$$
where:
L = lower boundary of the decile class
n = total number of observations
$S_b$ = sum of the frequencies before the decile class
$f_d$ = frequency of the decile class
i = size of the class interval

$$P_k = L + \frac{\left(\frac{kn}{100} - S_b\right) i}{f_p}$$
where:
L = lower boundary of the percentile class
n = total number of observations
$S_b$ = sum of the frequencies before the percentile class
$f_p$ = frequency of the percentile class
i = size of the class interval
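
The three formulas differ only in the divisor (4, 10, or 100), so a single function can cover them. The following Python sketch uses the net worth frequency table from the Measures of Variation section and assumes lower class boundaries 0.5 below the integer lower class limits; the function and parameter names are illustrative.

```python
def grouped_quantile(lower_boundaries, freqs, width, k, parts):
    """kth quantile of grouped data; parts is 4, 10, or 100."""
    if not 0 < k <= parts:
        raise ValueError("k must satisfy 0 < k <= parts")
    n = sum(freqs)
    target = k * n / parts              # kn/4, kn/10, or kn/100
    cum = 0                             # S_b: frequencies before the quantile class
    for boundary, f in zip(lower_boundaries, freqs):
        if cum + f >= target:
            # First class whose cumulative frequency reaches the target.
            return boundary + (target - cum) * width / f
        cum += f
    raise ValueError("frequencies must be non-empty")

# Net worth table: classes 10-20, 21-31, 32-42, 43-53, 54-64, 65-75.
boundaries = [9.5, 20.5, 31.5, 42.5, 53.5, 64.5]
freqs = [2, 8, 15, 7, 10, 3]
width = 11

q1 = grouped_quantile(boundaries, freqs, width, 1, 4)
q3 = grouped_quantile(boundaries, freqs, width, 3, 4)
p50 = grouped_quantile(boundaries, freqs, width, 50, 100)
print(round(q1, 2), round(q3, 2), round(p50, 2))
```

The 50th percentile computed this way coincides with the grouped median, since both use n/2 as the target.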
SKEWNESS AND KURTOSIS
Skewness refers to the degree of symmetry or asymmetry of a distribution.

A normal distribution is a distribution with a bell-shaped appearance. In a normal distribution, $\bar{X} = \tilde{X} = \hat{X}$ (the mean, median, and mode are equal).

A distribution is skewed to the left if the mean is less than the median. The bulk of the distribution is on the
right. This is otherwise known as negatively skewed. The graph has a long left tail.

A distribution is skewed to the right if the mean is greater than its median. The bulk of the distribution is on
the left. This is otherwise known as positively skewed. The graph has a long right tail.

The extent of skewness can be obtained by getting the coefficient of skewness.

$$SK = \frac{3(\bar{X} - \tilde{X})}{s}$$
where: SK = coefficient of skewness
$\bar{X}$ = the mean
$\tilde{X}$ = the median
s = standard deviation

Note:
If SK = 0, the distribution is symmetric (as in a normal distribution).
If SK < 0, the distribution is skewed to the left.
If SK > 0, the distribution is skewed to the right.

Kurtosis refers to the peakedness or flatness of a distribution.

Mesokurtic is a normal distribution.


Leptokurtic is more peaked than the normal distribution.
Platykurtic is flatter than the normal distribution.
Kurtosis (Ku) is obtained using the following formulas:

$$Ku = \frac{\sum (X - \bar{X})^4}{n s^4}$$, for ungrouped data

$$Ku = \frac{\sum f (X - \bar{X})^4}{n s^4}$$, for grouped data
where:
Ku = the kurtosis
X = raw data or class mark
$\bar{X}$ = mean
s = standard deviation
n = sample size
Note:
If Ku = 3, the distribution is normal.
If Ku > 3, the distribution is leptokurtic.
If Ku < 3, the distribution is platykurtic.

Example: For each of the following, compute the coefficients of skewness and kurtosis. Indicate whether the
distribution is normal, skewed to the right, or skewed to the left, and whether it is mesokurtic, platykurtic, or
leptokurtic.
1. Given the data set:
72 81 67 83 61 75 78 82 71 67

2. Refer to the table and find the coefficient of skewness. Describe the distribution.
Class limits Frequency
10–20 2
21–31 8
32–42 15
43–53 7
54–64 10
65–75 3
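
For the ungrouped data set in item 1, both coefficients can be sketched in Python as below. The text does not say which standard deviation to use, so the sketch assumes s is the sample standard deviation (n - 1 in the denominator); the function names are illustrative.

```python
from statistics import mean, median, stdev

def skewness_coefficient(values):
    """SK = 3(mean - median) / s."""
    return 3 * (mean(values) - median(values)) / stdev(values)

def kurtosis(values):
    """Ku = sum((X - mean)^4) / (n * s^4), with s assumed to be the sample SD."""
    m = mean(values)
    s = stdev(values)
    return sum((x - m) ** 4 for x in values) / (len(values) * s ** 4)

data = [72, 81, 67, 83, 61, 75, 78, 82, 71, 67]
sk = skewness_coefficient(data)
ku = kurtosis(data)

# SK > 0 suggests skew to the right; Ku < 3 suggests a platykurtic shape.
print(round(sk, 3), round(ku, 3))
```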
