ENGINEERING DATA ANALYSIS
MODULE 1.2: Measures of Central Tendency and Dispersion
MEASURES OF CENTRAL TENDENCY
UNGROUPED DATA

Measure of Central Tendency or Location: a single value about which the set of observations tends to cluster.

Arithmetic Mean: the sum of all the observations divided by the total number of observations; denoted as 𝜇 (Greek letter mu)

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$$

where $X_i$ is the value of the ith observation, i = 1, …, N, and N is the total number of observations.
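As a minimal sketch of the definition of the arithmetic mean, the Python function below adds up the observations and divides by their count. The function name `arithmetic_mean` and the sample values are illustrative assumptions.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations N."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    return sum(values) / n


# Illustrative observations (assumed values)
print(arithmetic_mean([1, 2, 3, 4, 5]))  # 3.0
```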

Median: a single value which divides an array (the data set arranged in ascending or descending order) of observations into two equal parts such that 50% of the observations fall below it and 50% of the observations fall above it; denoted as Md.

If N (the number of observations) is odd, the median is the middle value of the array.
If N is even, the median is the mean of the two middle values of the array.
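The odd and even cases of the median can be sketched in Python as follows. This is a minimal illustration; the function name `median` and the sample values are assumptions.

```python
def median(values):
    """Median of ungrouped data: the middle value for odd N,
    the mean of the two middle values for even N."""
    ordered = sorted(values)  # arrange the observations into an array
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd N: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even N: mean of the two middle values


print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0
```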

Mode
- the value which occurs most frequently in the data set
- denoted as Mo

For an ungrouped data set, the mode is found by counting how often each value occurs and taking the value with the highest count.
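A minimal Python sketch of finding the most frequent value is shown below. It returns every value that ties for the highest frequency, which is one possible convention for data with more than one mode; the function name `modes` and the sample values are assumptions.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest frequency."""
    counts = Counter(values)
    if not counts:
        raise ValueError("need at least one observation")
    top = max(counts.values())
    # Keep values in order of first appearance
    return [value for value, count in counts.items() if count == top]


print(modes([2, 2, 3, 4, 4]))         # [2, 4]
print(modes(["red", "blue", "red"]))  # ['red']
```

Because only frequencies are counted, the same function works for qualitative values such as colors.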

Geometric mean
- the Nth root of the product of N positive numbers
- used mainly to average ratios, rates of change, economic indices, etc.
- in practice, geometric means are calculated by making use of the fact that the logarithm of the geometric mean of a set of positive numbers equals the arithmetic mean of their logarithms.
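The logarithm route to the geometric mean can be sketched in Python as below, alongside the direct Nth-root definition for comparison. The function names and the growth factors are illustrative assumptions.

```python
import math


def geometric_mean_via_logs(values):
    """Exponentiate the arithmetic mean of the logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one observation, all positive")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


def geometric_mean_direct(values):
    """Nth root of the product of N positive numbers."""
    return math.prod(values) ** (1 / len(values))


growth_factors = [1.10, 1.20, 0.95]  # hypothetical yearly growth factors
print(round(geometric_mean_via_logs(growth_factors), 6))
print(round(geometric_mean_direct(growth_factors), 6))
```

The two functions agree up to floating-point rounding. The logarithm form avoids overflow or underflow when the product of many values becomes very large or very small.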
Comparison among measures of central tendency

MEAN
- Reflects the magnitude of every observation
- Easily affected by the presence of extreme values
- Amenable to further computation
- Most commonly used measure of central tendency (mct) because of its good statistical properties
- Most meaningful mct when there are no extreme values

MEDIAN
- It is a positional value and hence is not affected by the presence of extreme values (suggested mct when there are few extreme values)
- Not amenable to further computation
- The median of grouped data can be calculated even with open-ended intervals provided the median class is not open-ended

MODE
- Determined by the frequency and not by the values of the observations
- Used when a quick measure of location is needed
- It cannot be manipulated algebraically
- Can be defined with quantitative as well as qualitative random variables
- Very much affected by the method of grouping data
- Can be computed with open-ended intervals provided the modal class is not open-ended
MEASURES OF CENTRAL TENDENCY
GROUPED DATA
MEDIAN OF GROUPED DATA
In grouped data, the median cannot be read off directly from the cumulative frequencies. The middle value of the data lies inside some class interval, so it is necessary to find the value inside that class interval which divides the whole distribution into two halves. To do this, we first have to find the median class.

To find the median class, compute the cumulative frequencies of all the classes and n/2. Then locate the class whose cumulative frequency is greater than (and nearest to) n/2. This class is called the median class.

After finding the median class, use the formula below to find the median value.

$$\text{Median} = l + \left(\frac{\frac{n}{2} - cf}{f}\right) \times h$$

where
l is the lower limit of the median class
n is the number of observations
f is the frequency of the median class
h is the class size
cf is the cumulative frequency of the class preceding the median class.
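The procedure for grouped data (locate the median class from the cumulative frequencies, then interpolate inside it using l, n, f, h and cf) can be sketched in Python. This is a minimal illustration under stated assumptions: the function name `grouped_median`, its parameters, and the class table are hypothetical, and all classes are assumed to share the same size h.

```python
def grouped_median(lower_limits, frequencies, h):
    """Median of grouped data: l + ((n/2 - cf) / f) * h."""
    n = sum(frequencies)
    if n <= 0:
        raise ValueError("frequencies must sum to a positive number")
    half = n / 2
    cumulative = 0
    for l, f in zip(lower_limits, frequencies):
        # The median class is the first class whose cumulative frequency reaches n/2
        if cumulative + f >= half:
            cf = cumulative  # cumulative frequency of the preceding class
            return l + (half - cf) / f * h
        cumulative += f


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (class size h = 10)
lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 8, 12, 9, 6]
print(round(grouped_median(lower_limits, frequencies, 10), 2))  # 25.83
```

Here n = 40 and n/2 = 20. The cumulative frequencies are 5, 13, 25, 34, 40, so the median class is 20-30, with l = 20, cf = 13 and f = 12.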
MEDIAN OF GROUPED DATA
EXAMPLE
The following data represent a survey of the heights (in cm) of 51 girls of Class X. Find the median height.

Answer: Median = 149.03


MODE OF GROUPED DATA
In the case of grouped data, the mode cannot be identified just by looking at the frequencies of the data. Instead, we first locate the class with the maximum frequency, called the modal class. Inside the modal class, we then locate the mode value by using the formula

$$\text{Mode} = l + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times h$$

where
f1 is the frequency of the modal class
f0 is the frequency of the class preceding the modal class
f2 is the frequency of the class succeeding the modal class
h is the size of the class intervals
l is the lower limit of the modal class
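A minimal Python sketch of the grouped-data mode calculation, Mode = l + ((f1 − f0) / (2·f1 − f0 − f2)) × h, follows. The function name `grouped_mode`, its parameters, and the class table are hypothetical. Treating a missing neighbouring class as having frequency 0 is an assumption made here for the case where the modal class is the first or last class.

```python
def grouped_mode(lower_limits, frequencies, h):
    """Mode of grouped data: l + ((f1 - f0) / (2*f1 - f0 - f2)) * h."""
    # Index of the modal class (the first class with the maximum frequency)
    i = max(range(len(frequencies)), key=frequencies.__getitem__)
    f1 = frequencies[i]
    # Assumption: a missing neighbouring class counts as frequency 0
    f0 = frequencies[i - 1] if i > 0 else 0
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula undefined: 2*f1 - f0 - f2 is zero")
    return lower_limits[i] + (f1 - f0) / denominator * h


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (class size h = 10)
print(round(grouped_mode([0, 10, 20, 30, 40], [5, 8, 12, 9, 6], 10), 2))  # 25.71
```

Here the modal class is 20-30 with f1 = 12, f0 = 8 and f2 = 9, so the mode is 20 + (4/7) × 10.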
MODE OF GROUPED DATA
EXAMPLE
A survey has been conducted by a group of students on 20
households in a locality as shown in the following frequency
distribution table. Find the mode for the given data.

Answer: Mode = 3.286.



MEASURES OF VARIATION
MEASURES OF DISPERSION

Measure of Dispersion: a quantity that measures the spread or variability of the observations in a given population

Illustration:
Data Set 1: 3,3,3,3,3
Data Set 2: 1,2,3,4,5
Data Set 3: 2,2,3,4,4

All three data sets have mean equal to 3 yet they are not identical. There is
a need for another quantity to measure the spread of the values in a given
population.
Some common measures of dispersion:

1. Range
2. Variance
3. Standard Deviation
4. Coefficient of Variation
Range: the difference between the highest value and the lowest value of the population

Example.
The range of the actual body weight values is 46.8 − 8.00 = 38.8.

Properties:
1. It is a quick but rough measure of dispersion.
2. The larger the value of the range, the more dispersed are the observations.
3. It considers only the highest and lowest observations in the population. Hence, it may not be reflective of the dispersion characteristic of the majority.
Variance: the mean of the squared deviations of the observations from the mean, denoted by 𝜎²

$$\sigma^2 = \frac{\sum_{i=1}^{N}(X_i-\mu)^2}{N} = \frac{\sum_{i=1}^{N}X_i^2}{N} - \mu^2$$
Properties:

1. The variance is always non-negative.


2. A large variance corresponds to a highly dispersed set of
values.
3. The variance is easy to manipulate for further mathematical
treatment.
4. The variance makes use of all observations.
5. The variance comes in a unit of measure that is the square of
the unit of measure of the given set of values
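As a minimal sketch, the two equivalent forms of the population variance (the mean squared deviation, and the mean of the squares minus the squared mean) can be compared in Python on Data Sets 1 to 3 from the illustration. The function names are illustrative assumptions.

```python
import math


def variance_definition(values):
    """Mean of the squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def variance_shortcut(values):
    """Mean of the squared observations minus the squared mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


data_sets = {
    "Data Set 1": [3, 3, 3, 3, 3],
    "Data Set 2": [1, 2, 3, 4, 5],
    "Data Set 3": [2, 2, 3, 4, 4],
}
for name, data in data_sets.items():
    v1 = variance_definition(data)
    v2 = variance_shortcut(data)
    # The two forms agree up to floating-point rounding
    print(name, round(v1, 4), math.isclose(v1, v2, abs_tol=1e-12))
```

All three data sets have mean 3, but their variances are 0, 2 and 0.8, which is the difference in spread that the mean alone does not show.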
Standard deviation: the positive square root of the variance. That is,

$$\sigma = \sqrt{\sigma^2}$$

Properties: The standard deviation has the same set of properties as the variance, except that its unit of measurement is the same as the unit of measurement of the observations.

Example. For the actual body weight of sheep, the standard deviation is

$$\sigma = \sqrt{66.3533917} = 8.145759091$$

Remark: The standard deviation, coupled with the arithmetic mean, gives a lot of information about the distribution of a given population.
Interquartile range- the difference between the third and the first
quartiles of a set of data. It is denoted by IR. It provides a measure
of the range of the middle 50% of the observations.

Quartiles are values from a given array of data which divide the
array into four equal parts.

The First Quartile, denoted by Q1, is the value for which 25% of the
observations are less than Q1 and 75% are greater than it.

The Third Quartile, denoted by Q3, is the value for which 75% of
the observations are less than Q3 and 25% are greater than it.
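A hedged Python sketch of the interquartile range is given below. It relies on `statistics.quantiles` from the standard library (Python 3.8+) with its default "exclusive" method. Several conventions for locating quartiles exist and can give slightly different values on small data sets, so this is one choice among several; the function name and the sample data are assumptions.

```python
from statistics import quantiles


def interquartile_range(values):
    """Return (Q1, Q3, Q3 - Q1) using the default 'exclusive' quantile method."""
    q1, _q2, q3 = quantiles(values, n=4)  # cut points dividing the array into four parts
    return q1, q3, q3 - q1


# Illustrative observations (assumed values)
print(interquartile_range([1, 2, 3, 4, 5, 6, 7]))  # (2.0, 6.0, 4.0)
```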
The Empirical Rule states that if the distribution of our data values appears to be mound-shaped or bell-shaped with mean 𝜇 and standard deviation 𝜎, then approximately

a) 68% of the population values lie between 𝜇 − 𝜎 and 𝜇 + 𝜎
b) 95% of the population values lie between 𝜇 − 2𝜎 and 𝜇 + 2𝜎
c) 99.7% of the population values lie between 𝜇 − 3𝜎 and 𝜇 + 3𝜎
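The percentages in the Empirical Rule can be checked against an exactly normal (bell-shaped) distribution. The sketch below uses `statistics.NormalDist` from the Python standard library. Treating the population as exactly normal is an assumption; real mound-shaped data will only match these figures approximately. The function name `coverage_within` is illustrative.

```python
from statistics import NormalDist


def coverage_within(k, mu=0.0, sigma=1.0):
    """Fraction of a normal population lying between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(coverage_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The result does not depend on the particular values of `mu` and `sigma`, only on the number of standard deviations k.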
Coefficient of Variation: the ratio of the standard deviation to the mean
- denoted as CV

CV = 𝜎/𝜇, provided 𝜇 is not equal to zero

Properties:
1. CV can be expressed as a decimal or as a percentage.
2. CV is a relative measure of dispersion.
3. The CV, being unitless, can be used to compare the dispersion of two or more populations measured in different units.

Example. For the actual body weight of sheep,

CV = 𝜎/𝜇 = 8.145759091 ÷ 16.4882 = 0.494035679 (about 49.4%)
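The sheep body weight figures quoted in the examples (variance 66.3533917, mean 16.4882) can be run through a short Python sketch to reproduce the standard deviation and the coefficient of variation. The function name `coefficient_of_variation` is an illustrative assumption; the two numbers are taken from the examples.

```python
import math


def coefficient_of_variation(sigma, mu):
    """Ratio of the standard deviation to the mean; undefined when the mean is zero."""
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sigma / mu


variance = 66.3533917  # variance of actual body weight, from the example
mu = 16.4882           # mean of actual body weight, from the example

sigma = math.sqrt(variance)  # standard deviation: positive square root of the variance
cv = coefficient_of_variation(sigma, mu)
print(f"sigma = {sigma:.4f}, CV = {cv:.4f} ({cv:.1%})")
# sigma = 8.1458, CV = 0.4940 (49.4%)
```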


GRAPHICAL SUMMARY
Stem-and-Leaf Plots
A stem-and-leaf plot is a simple way to summarize a data set.
Figure 1.5 presents a stem-and-leaf plot of the geyser data.
Each item in the sample is divided into two parts: a stem, consisting of the leftmost one or two digits, and a leaf, which consists of the next digit.
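A minimal Python sketch of building a stem-and-leaf plot is shown below. It assumes non-negative whole-number observations, takes the last digit as the leaf and the remaining leading digits as the stem. The sample values are hypothetical and are not the geyser data; the function names are illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (leading digits) to the sorted list of its leaves (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 53 -> stem 5, leaf 3
        plot[stem].append(leaf)
    return dict(plot)


def render(plot):
    """One text row per stem, including stems that have no leaves."""
    rows = []
    for stem in range(min(plot), max(plot) + 1):
        leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}")
    return "\n".join(rows)


sample = [42, 45, 51, 53, 53, 58, 60, 67, 74, 75, 79, 81]  # hypothetical data
print(render(stem_and_leaf(sample)))
```

Each printed row lists a stem followed by its leaves in increasing order, so the row lengths give a rough picture of the shape of the sample.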
Dotplots
A dotplot is a graph that can be used to give a rough
impression of the shape of a sample. It is useful when the
sample size is not too large and when the sample contains
some repeated values. Figure 1.7 presents a dotplot for the
geyser data in Table 1.3.
Histograms
A histogram is a graphic that gives an idea of the “shape” of a
sample, indicating regions where sample points are
concentrated and regions where they are sparse.
We will construct a histogram for the PM emissions of 62
vehicles driven at high altitude, as presented in Table 1.2. The
sample values range from a low of 1.11 to a high of 23.38, in
units of grams of emissions per gallon of fuel.
The first step is to construct a frequency table, shown in Table 1.4.
Histogram for the data in Table 1.4. In this histogram the heights of the
rectangles are the relative frequencies. Since the class widths are all the
same, the frequencies, relative frequencies, and densities are
proportional to one another, so it would have been equally appropriate to
set the heights equal to the frequencies or to the densities.
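The relationship between frequencies, relative frequencies and densities can be sketched in Python with a small frequency table. The data below are hypothetical and are not the PM emissions in Table 1.2. The function name, the half-open class convention (lower limit included, upper limit excluded), and the definition of density as relative frequency divided by class width are assumptions of this sketch.

```python
def frequency_table(values, start, width, n_classes):
    """Rows of (lower, upper, frequency, relative frequency, density) for equal-width classes."""
    if not values:
        raise ValueError("need at least one observation")
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)  # index of the class [lower, upper) containing v
        if not 0 <= k < n_classes:
            raise ValueError(f"value {v} falls outside the classes")
        counts[k] += 1
    n = len(values)
    rows = []
    for k, count in enumerate(counts):
        lower = start + k * width
        relative = count / n
        rows.append((lower, lower + width, count, relative, relative / width))
    return rows


sample = [1.5, 2.2, 3.7, 4.1, 4.8, 5.3, 6.9, 7.4]  # hypothetical data
for row in frequency_table(sample, start=1, width=2, n_classes=4):
    print(row)
```

With equal class widths, each density is the relative frequency divided by the same constant, so a histogram drawn from any of the three columns has the same shape.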
