
APPLIED STATISTICS

BEED 2A

Course Outcome:

At the end of the semester, it is expected that the students are able to:

1. Discuss the importance of Statistics in the context of education and other related
sciences.
2. Identify different statistical methods and tools appropriate for educational research.
3. Process data using the statistical tools with confidence.
4. Explain the differences among experimental designs used in scientific research in terms of
application and limitations.

CHAPTER 1

INTRODUCTION TO STATISTICS

Desired Learning Outcomes/ Competencies

At the end of the lesson, the students should be able to:

1. Explain the importance of statistics in human lives.


2. Discuss how to collect, present, analyze and interpret the data in such a way that it
can be used for decision-making.
3. Identify the different levels of measurement, describe the different kinds of variables, and
solve problems in summation notation.
4. Describe the sources of data and determine the methods of collecting data.

Key Words

Statistics

Parametric Statistics

Non-Parametric Statistics

Measurement

Variable

Summation

Data

A. STATISTICS

In its plural sense, statistics is a set of numerical data (e.g., vital statistics in a beauty contest,
monthly sales of a company, daily peso-dollar exchange rate). In its singular sense, Statistics
is that branch of science which deals with the collection, presentation, analysis and
interpretation of data.

Branches of Statistics (What are the types of Statistics?)

Descriptive Statistics and Inferential Statistics

The field of statistics may be divided into descriptive and inferential statistics. Descriptive
statistics is only concerned with summarizing values to describe group characteristics of the
data after gathering, classifying, and presenting data. To do this, it employs graphs, tables
and frequency distributions, percentages, measures of central tendency and position, and
measures of variability. It does not need to generalize or draw conclusions. In contrast,
inferential statistics is concerned with a higher order of critical thinking and judgment, and
it requires more complex mathematical procedures. Its aim is to give generalization,
conclusion, or information regarding large groups of data called the population without
necessarily dealing with each and every element of these groups. It only uses a small portion
of the total set of data or only a representative portion called a sample to give conclusions
or generalizations regarding the entire population.

Classification of Statistics

Parametric Statistics and Non-Parametric Statistics

Parametric statistics are inferential techniques which make the following assumptions
regarding the nature of the population from which the observations or data are drawn:

1. The observations must be independent. This means that in choosing any element from the
population to be included in the sample, it must not affect the chances of other elements
for inclusion.

2. The observations must be drawn from normally distributed populations. A crude way of
knowing that a distribution is normal is when the mean, the median and the mode are all
equal (mean = median = mode). If we draw the curve, we produce a bell-shaped curve
which has an area of one and is symmetric about the mean.
3. If we analyze two groups/populations, these populations must have the same variance;
we call these homoscedastic populations.

4. The variables must be measured in the interval or ratio scale, so that we can interpret the
results.

Non-parametric statistics, on the other hand, makes fewer and weaker assumptions:

1. The observations must be independent, and the variable has the underlying continuity.
2. The observations are measured in either the nominal or ordinal scales.
To have a better understanding on when to use the parametric and non-parametric statistics,
please refer to the table below:

Inferential Techniques Distribution Measurement


Parametric Statistics Normal Interval or Ratio
Non-Parametric Statistics Unknown or Any Distribution Nominal or Ordinal

B. Levels of Measurement

It is necessary to give attention to different levels of measurement especially when


contemplating the use of statistics. The measurement scale is an important factor in
determining the appropriate statistical methods to be used in analyzing the data of a
research study. It is classified into nominal scale, ordinal scale, interval scale, and ratio scale.

NOMINAL SCALE- is the first and the lowest level of measurement. It is merely grouping or
classifying different objects into categories based upon some defined characteristics without
paying attention to order or arrangement. Following the identification of the various
categories, frequencies or the number of objects in each category are counted.

Properties of the nominal data are as follows:

1. The data categories are mutually exclusive (an object can belong to only one category).
2. The data categories have no logical order or arrangement.
There are two ways of classifying: the one-way classification and the two-way classification.
Example of one-way classification:

Example No. 1: Students may be classified according to College.

COLLEGE FREQUENCY
College of Arts and Sciences 50

College of Agriculture 100


College of Veterinary Medicine 45
College of Education 75

Example No. 2: Responses to a questionnaire.

RESPONSE FREQUENCY

Strongly Agree 50

Agree 30
Moderately Agree 20

Disagree 10

Strongly Disagree 10

In the two-way classification, an individual may be classified twice. For example, Peter is
classified as male under sex and, at the same time, under Yes, Neutral, or No, whichever is
his response.

Example 1.

SEX YES NEUTRAL NO TOTAL


Male 20 10 30 60
Female 45 10 20 75
Total 65 20 50 135

The ORDINAL SCALE is the second level of measurement. Here, there is a logical ordering or
arrangement of categories, aside from the categories being mutually exclusive. The process of
measurement is the same as in the nominal scale, where the number of objects is counted in each
category. However, we can discern which is highest or lowest. For example, with ranks in the
military, we know that private < corporal < sergeant < lieutenant, etc.

Example:

RANK FREQUENCY
Private 20
Corporal 15
Sergeant 10
Lieutenant 25

The following are the properties of ordinal data:


1. Data categories are mutually exclusive
2. Data categories have some logical orders
3. Data categories are scaled according to the amount of the characteristics they possess.
INTERVAL SCALE- is the third level of measurement. It possesses all the properties of the
preceding scales with one additional property: the differences between the various levels of
categories on any part of the scale are equal.

A common variable measured on an interval scale is temperature. The difference between
temperatures of 65 and 68 degrees is regarded as the same as the difference between
temperatures of 13 and 16 degrees. Here zero is just another point on the scale; it does not
mean that there is no temperature. In fact, on the Celsius scale, zero is the freezing point
of water.

The properties of interval data are as follows:


1. Data categories are mutually exclusive.
2. Data categories have logical order.
3. Data categories are scaled according to the amount of the characteristics they possess.
4. Equal differences in the characteristics are represented by equal differences in the
numbers assigned to the categories.
5. The point zero is just another point in the scale.

RATIO SCALE is the highest level of measurement. All properties of the interval scale are
applicable in the ratio scale plus one additional property which is known as the “true zero
point” which reflects the absence of the characteristics measured.
Example: Number of correct answers in an exam

Speed

In Summary:

• The nominal scale categorizes without order.
• The ordinal scale categorizes with order.
• The interval scale categorizes with order and establishes an equal unit in the scale.
• The ratio scale categorizes with order, establishes an equal unit in the scale, and
contains a true zero point.

C. Variable

• A variable is a characteristic or attribute of persons or objects which can assume


different values or labels for different persons or objects under consideration.
• Measurement is the process of determining the value or label of a particular variable
for a particular experimental unit.
• An experimental unit is the individual or object on which a variable is measured.
Response Variable- a variable which is affected by the value of some other variable. This
may be continuous, ordinal or nominal. In regression setting, they are called dependent
variables or Y variables.

Explanatory Variable- is a variable that is thought to affect the values of the response
variable. It is sometimes called independent variable or X variable in regression setting. In
this case, explanatory variable, like the response variable may be continuous, ordinal or
nominal.

1. Discrete Vs. Continuous

• Discrete Variable- a variable which can assume a finite or, at most, countably infinite
number of values, usually measured by counting or enumeration.
• Continuous Variable- a variable which can assume infinitely many values
corresponding to an interval on the number line.

2. Qualitative Vs. Quantitative

• Qualitative Variable- a variable that yields categorical response (e.g., political


affiliation, occupation, marital status)
• Quantitative Variable- a variable that takes on numerical values representing
an amount or quantity (e.g., weight, height, no. of cars).

D. Summation Notation

Important Symbols

                         POPULATION    SAMPLE
Number of observations       N            n
Characteristic           Parameter    Statistic
Mean                         μ         x̄, ȳ, z̄
Variance                     σ²           s²
Standard Deviation           σ            s
In statistics it is frequently necessary to work with sums of numerical values. For example,
we may wish to compute the average cost of a certain brand of toothpaste sold at 10
different stores. Perhaps we would like to know the total number of heads that occur when
3 coins are tossed several times.

Consider a controlled experiment in which the decreases in weight over a 6-month
period were 15, 10, 18 and 6 kilograms, respectively. If we designate the first recorded
value x1, the second x2 and so on, then we can write x1 = 15, x2 = 10, x3 = 18 and x4 = 6.

Using the Greek letter Σ (capital sigma) to indicate "summation of," we can write the sum
of the 4 weights as

4
∑ xi
i=1

where we read "summation of xi, i going from 1 to 4." The numbers 1 and 4 are called the
lower and upper limits of summation. Hence

4
∑ xi = x1 + x2 + x3 + x4 = 15 + 10 + 18 + 6 = 49
i=1


Also,

3
∑ xi = x2 + x3 = 10 + 18 = 28
i=2


In general, the symbol

n
∑
i=1

means that we replace i wherever it appears after the summation symbol by 1, then by 2,
and so on up to n, and then add up the terms. Therefore, we can write

5
∑ xjyj = x2y2 + x3y3 + x4y4 + x5y5
j=2


The subscript may be any letter, although i, j and k seem to be preferred by statisticians.
Obviously,

n        n
∑ xi  =  ∑ xj
i=1      j=1

The lower limit of summation is not necessarily a subscript. For instance, the sum of the natural
numbers from 1 to 9 may be written

9
∑ x = 1 + 2 + … + 9 = 45
x=1
When we are summing over all the values of xi that are available, the limits of
summation are often omitted, and we simply write ∑ xi. If in the diet experiment only 4 people
were involved, then ∑ xi = x1 + x2 + x3 + x4. In fact, some authors even drop the subscript and let
∑ x represent the sum of all available data. Example: If x1 = 3, x2 = 5 and x3 = 7, find

a) ∑ xi
b)
c) ∑ (xi − i), i going from 2 to 3

Solution

a) ∑ xi = x1 + x2 + x3 = 3 + 5 + 7 = 15
b)
c) ∑ (xi − i) = (x2 − 2) + (x3 − 3) = 3 + 4 = 7

Example: Given x1 = 2, x2 = −3, x3 = 1, y1 = 4, y2 = 2 and y3 = 5, evaluate

a) ∑ xiyi, i going from 1 to 3

b) (∑ xi, i going from 2 to 3)(∑ yj², j going from 1 to 2)

Solution

a) ∑ xiyi = x1y1 + x2y2 + x3y3 = (2)(4) + (−3)(2) + (1)(5) = 7

b) (x2 + x3)(y1² + y2²) = (−3 + 1)(16 + 4) = (−2)(20) = −40
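The summation examples above can be checked with a short Python sketch (Python is used here only as a calculator; it is not part of the original text):

```python
# Values from the worked examples above
x = [3, 5, 7]        # x1, x2, x3 (first example)
u = [2, -3, 1]       # x1, x2, x3 (second example)
v = [4, 2, 5]        # y1, y2, y3 (second example)

# Sum of all xi (omitted limits mean "sum over all available values")
assert sum(x) == 15

# Sum of (xi - i) for i = 2 to 3; Python lists are 0-indexed, so xi is x[i-1]
assert sum(x[i - 1] - i for i in range(2, 4)) == 7

# Sum of xi*yi for i = 1 to 3
assert sum(a * b for a, b in zip(u, v)) == 7

# (Sum of xi, i = 2 to 3) times (sum of yj squared, j = 1 to 2)
assert sum(u[i - 1] for i in range(2, 4)) * sum(b ** 2 for b in v[:2]) == -40

print("all summation examples check out")
```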

E. Classification of Data

1. External Vs. Internal

• Internal Data- information that relates to the operations and functions of the
organization collecting the data.
• External Data- Information that relates to some activity outside the organization
collecting the data.

Example: The sales data of SM are internal data for SM but external data for any other
organization collecting them, such as Robinsons.

2. Primary Vs. Secondary

• Primary Source- data measured by the researcher/agency that published it


• Secondary Source- any republication of data by another agency.
Example: The publication of the National Statistics Office are primary sources and all
subsequent publications of other agencies are secondary sources.

F. Methods of Data Collection

1. Survey Method- questions are asked to obtain information, either through self- administered
questionnaire or personal interview.
Self-administered Questionnaire
• Obtained information is limited to subjects' written answers to pre-arranged questions.
• Lower response rate.
• It can be administered to a large number of people simultaneously.
• Respondents may feel freer to express views and are less pressured to answer
immediately.
• It is more appropriate for obtaining objective information.

Personal Interview
• Missing information and vague responses are minimized with the proper probing of the
interviewer.
• Higher response rate through callbacks.
• It is administered to a person or group one at a time.
• Respondents may feel more cautious, particularly in answering sensitive questions, for
fear of disapproval.
• It is more appropriate for obtaining information about complex, emotionally laden topics
or probing sentiments underlying an expressed opinion.
2. Observation Method- makes possible the recording of behavior but only at the time of
occurrence (e.g., observing reactions to a particular stimulus, traffic count).

Advantages over Survey Method

• Does not rely on the respondent’s willingness to provide the desired data
• Certain types of data can be collected only by observation (e.g. behavior patterns of
which the subject is not aware of or is ashamed to admit)
• The potential bias caused by the interviewing process is reduced or eliminated.
Disadvantages over Survey Method:
• Things such as awareness, beliefs, feelings and preferences cannot be observed.
• The observed behavior patterns can be rare or too unpredictable thus increasing the
data collection costs and time requirements.

3. Experimental Method- a method designed for collecting data under controlled conditions.
An experiment is an operation where there is actual human interference with the conditions
that can affect the variable under study. This is an excellent method of collecting data for
causation studies. If properly designed and executed, experiments will reveal with a good
deal of accuracy, the effect of a change in one variable on another variable.

4. Use of existing studies- e.g., census, health statistics and weather Bureau reports.

Two types:

• Documentary sources- published or written reports, periodicals, unpublished


documents, etc.
• Field sources- researchers who have done studies on the area of interest are asked
personally or directly for information needed.

5. Registration Method- e.g., car registration, student registration and hospital admission.

CHAPTER 2
SAMPLING METHODS

Introduction

In Chapter 1, the concepts of population and sample were discussed. A population is any
defined aggregate of objects, persons, or events, the variables used as the basis for
classification or measurement being specified. A sample is any sub aggregate drawn from
the population. Any statistic calculated on a sample of observations is an estimate of a
corresponding population value or parameter. The symbol 𝑋̅ is used to refer to the arithmetic
mean of X calculated on a sample of size n. The symbol 𝜇 is used to refer to the mean of the
population. Similarly, s2 is used to refer to the variance in the sample, and 𝜎2 is the
corresponding population parameter. 𝑋̅ is an estimate of 𝜇 and s2 is an estimate of 𝜎2. Likewise,
any other statistic calculated on a sample is an estimate of a corresponding population
parameter. In most situations the parameters are unknown and must be estimated in some
manner from the sample data.

Much statistical work in practice is concerned with the use of sample statistics as estimates
of population parameters and more particularly with describing the magnitude of error which
attaches to such statistics. The body of statistical method concerned with the making of
statements about population parameters from sample statistics is called sampling statistics
and the logical process involved is called statistical inference, this being a rigorous form of
inductive inference. If inferences about population parameters are to be drawn from sample
statistics, certain conditions must attach to the methods of sampling used.

Desired Learning Outcomes/ Competencies

At the end of the lesson, the students should be able to:

1. Differentiate between probability and non-probability sampling.


2. Determine the sampling methods under probability and non-probability sampling.

Key Terms/Words

Sampling

Population

Sample

Probability Sampling

Non-Probability Sampling

A. Definition of Terms

• Population- refers to the entire group of individuals of interest under study. It is


classified into target population and sampled population.
• Target Population- is the population from which representative information is desired
and to which the inferences will be made.
• Sampled Population- refers to the population from which a sample will actually be
drawn.
• Sampling- The act of studying only a portion of the population.
• Sample- a representative portion of the population.
• Elementary units- the individual in the population on which measurement is actually
taken and made.
• Sampling unit- the units which are chosen in selecting a sample.
• Sampling Frame- a collection of all the sampling units.
B. Criteria of Sampling Methods

There are four criteria for sampling designs: the design must be representative of the
population, reliable, practicable, and efficient and economical.
• Representative of the Population- The sample must be selected so that it properly
represents the population that is to be covered, i.e., each individual must have a
chance of being selected and this chance must not be zero.
• Reliability- It should be possible to measure the reliability of the estimates made from
the sample. In addition to the desired estimates of the characteristics of the
population, the sample should give measures of the precision of these estimates.
• Practicable- The third criterion is that the sampling design must be practical. It must
be sufficiently simple and straight-forward so that it can be carried out substantially as
designed.
• Efficient and Economical- the design should be efficient and economical. Among the
various sampling methods (discussed later), one must naturally choose the method
which, to the best of our knowledge, produces the most information at the smallest cost.

C. Methods of Sampling

PROBABILITY AND NON-PROBABILITY SAMPLING

A sampling procedure that gives every element of the population a (known) nonzero
chance of being selected in the sample is called probability sampling. Otherwise, the
sampling procedure is called non-probability sampling.

• Whenever possible, probability sampling is used because there is no objective way of
assessing the reliability of inferences under non-probability sampling.

• Target Population- is the population from which information is desired.

• Sampled Population- is the collection of elements from which the sample is actually
taken.

• The Population Frame- is a listing of all the individual units in the population.

Methods of Non-Probability Sampling


1. Purposive Sampling- Sets out to make a sample agree with the profile of the
population based on some preselected characteristics.
2. Quota Sampling- Selects a specified number (quota) of sampling units
possessing certain characteristics.
3. Convenience Sampling- selects sampling units that come to hand or are
convenient to get information from.
4. Judgment Sampling- selects sample in accordance with an expert’s
judgement.

Methods of Probability Sampling

1. Simple random sampling- is a method of selecting n units out of the N units in


the population in such a way that every distinct sample of size n has an equal
chance of being drawn. The process of selecting the sample must give an
equal chance of selection to any one of the remaining elements in the
population at any one of the n draws.

Random sampling may be with replacement (SRSWR) or without


replacement (SRSWOR). In SRSWR, a chosen element is always replaced
before the next selection is made, so that an element may be chosen more
than once.

Sample Selection Procedure

Step 1: Make a list of the sampling units and number them from 1 to N
Step 2: Select n (distinct for SRSWOR, not necessarily distinct for SRSWR)
numbers from 1 to N using some random process, for example, a table of
random numbers.
Step 3: The sample consists of the units corresponding to the selected random
numbers.
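A minimal sketch of this procedure in Python, using the standard library's random module in place of a table of random numbers (the population of N = 10 labeled units and the fixed seed are hypothetical choices for illustration):

```python
import random

N, n = 10, 4
population = list(range(1, N + 1))   # Step 1: units numbered 1 to N

random.seed(42)                      # fixed seed so the illustration is reproducible

# SRSWOR: n distinct numbers; every distinct sample of size n is equally likely
srswor = random.sample(population, k=n)

# SRSWR: each chosen unit is "replaced" before the next draw, so repeats can occur
srswr = [random.choice(population) for _ in range(n)]

print("SRSWOR:", srswor)
print("SRSWR: ", srswr)
assert len(set(srswor)) == n         # no repeats without replacement
```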
Advantages
• The theory involved is much easier to understand than the theory
behind other sampling designs.
• Inferential methods are simple and easy.
Disadvantages

• The sample chosen may be widely spread, thus entailing high


transportation costs.
• A population frame, or list, is needed.
• Less precise estimates result if the population is heterogeneous with
respect to the characteristics under study.

2. Stratified Random Sampling- the population of N units is first divided into


subpopulation called strata. Then a simple random sample is drawn from each
stratum, the selection being made independently in different strata.

Sample Selection Procedure

Step 1: Divide the population into strata. Ideally, each stratum must consist of more or
less homogeneous units.
Step 2: After the population has been stratified, select a simple random sample
from each stratum.
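The two steps can be sketched as follows; the strata (year levels) and the 10% sampling fraction are hypothetical choices, not from the text:

```python
import random

# Step 1: divide the population into (ideally homogeneous) strata
strata = {
    "first_year":  [f"F{i}" for i in range(1, 31)],   # 30 units
    "second_year": [f"S{i}" for i in range(1, 21)],   # 20 units
}

random.seed(1)

# Step 2: draw an independent simple random sample from each stratum
sample = {
    name: random.sample(units, k=max(1, len(units) // 10))
    for name, units in strata.items()
}
print(sample)
```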

Advantages
• Stratification may produce a gain in precision in the estimates of characteristics
of the population.
• It allows for more comprehensive data analysis since information is provided for
each stratum
• It is administratively convenient.
Disadvantages

• A listing of the population for each stratum is needed.


• The stratification of the population may require additional prior information
about the population and its strata.

3. Systematic Sampling- with a “random start” is a method of selecting a sample by taking


every kth unit from an ordered population, the first unit being selected at random. Here k is
called the sampling interval, the reciprocal 1/k is the sampling fraction.

Sample Selection Procedure

Method A
Step 1: Number the units of the population consecutively from 1 to N.
Step 2: Determine k, the sampling interval using the formula k= N÷ n
Step 3: Select the random start r, where 1≤ r ≤ k. The unit corresponding to r is the first unit of
the sample.
Step 4: The other units of the sample correspond to r + k, r + 2k, r + 3k and so on.

Method B
Step 1: Number the units of the population consecutively from 1 to N.
Step 2: Let k be the nearest integer to N/n.
Step 3: Select the random start r, where 1 ≤ r ≤ N. The unit corresponding to r is the first unit
of the sample.
Step 4: Consider the list of units of the population as a circular list, i.e., the last unit in the list is
followed by the first. The other units in the sample are the units corresponding to r + k, r + 2k, r
+ 3k,…r + (n-1)k.
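Method A can be sketched as follows (N = 20 and n = 5 are hypothetical numbers chosen so that k = N/n is a whole number):

```python
import random

N, n = 20, 5
population = list(range(1, N + 1))   # Step 1: units numbered 1 to N

k = N // n                           # Step 2: sampling interval k = N/n = 4
random.seed(7)
r = random.randint(1, k)             # Step 3: random start, 1 <= r <= k

# Step 4: take units r, r + k, r + 2k, ..., r + (n-1)k
sample = [population[(r - 1) + j * k] for j in range(n)]
print("start:", r, "sample:", sample)
assert len(sample) == n
```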
Advantages
• It is easier to draw the sample and often easier to execute without mistakes than simple
random sampling.
• It is possible to select a sample in the field without a sampling frame.
• The systematic sample is spread more evenly over the population.

Disadvantages
• If periodic regularities are found in the list, a systematic sample may consist only of
similar types. (Example: Store sales over seven days of the week- estimating total sales
based on a systematic sample every Tuesday would be unwise.)
• Knowledge of the structure of the population is necessary for its most effective use.

4. Cluster Sampling- method of sampling where a sample of distinct groups or clusters of


elements is selected and then a census of every element in the selected clusters is taken.
Similar to the strata in stratified sampling, clusters are non-overlapping sub-populations which
together comprise the entire population. For example, a household is a cluster of individuals
living together or a city block might also be considered as a cluster. Unlike strata, however,
clusters are preferably formed with heterogeneous, rather than homogeneous elements so
that each cluster will be typical of the population.

Clusters may be of equal or unequal size. When all of the clusters are of the same size, the
number of elements in a cluster will be denoted by M while the number of clusters in the
population will be denoted by N.

Sample Selection Procedure

Step 1: Number the clusters from 1 to N.


Step 2: Select n numbers from 1 to N at random. The clusters corresponding to the selected
numbers form the sample of clusters.
Step 3: Observe all the elements in the sample of clusters.
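A small sketch of the three steps; the eight clusters of five households each are hypothetical numbers for illustration:

```python
import random

# Step 1: number the clusters (here, city blocks of 5 households each)
blocks = {b: [f"household_{b}_{h}" for h in range(1, 6)] for b in range(1, 9)}

random.seed(3)
chosen = random.sample(sorted(blocks), k=2)   # Step 2: random sample of n = 2 clusters

# Step 3: observe (take a census of) every element in each selected cluster
sample = [unit for b in chosen for unit in blocks[b]]
print("chosen blocks:", chosen)
print("sample size:", len(sample))
assert len(sample) == 2 * 5
```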

Advantages
• A population list of elements is not needed; only a population list of clusters is required.
Listing cost is reduced.
• Transportation cost is reduced.

Disadvantages
• The costs and problems of statistical analysis are greater.
• Estimation procedures are more difficult.

5. Multistage Sampling- the population is divided into a hierarchy of sampling units


corresponding to the different sampling stages. In the first stage of sampling, the population
is divided into primary stage units (PSU) then a sample of PSU’s is drawn. In the second stage
of sampling, each selected PSU is subdivided into second-stage units (SSU), then a sample of
SSU’s is drawn. The process of subsampling can be carried to a third stage, fourth stage and
so on by sampling the subunits instead of enumerating them completely at each stage.

Advantages
• Listing cost is reduced
• Transportation cost is reduced.

Disadvantages
• Estimation procedure is difficult, especially when the primary stage units are not of the
same size.
• Estimation procedure gets more complicated as the number of sampling stages
increases.
• The sampling procedure entails much planning before selection is done.
6. Sequential Sampling- units are drawn one by one in a sequence without prior fixing of
the total number of observations and the results of the drawing at any stage are used to
decide whether to terminate sampling or not.

CHAPTER 3

MEASURE OF CENTRAL TENDENCY OR CENTRAL LOCATION


Introduction

A variety of statistical measures are employed to summarize and describe sets of data.
Some of these statistical measures define, in some sense, the center of a set of data and
consequently are called measures of central location or measures of central tendency.

The term central location refers to a central reference value which is usually close to the point
of greatest concentration of the measurements and may in some sense be thought to typify
the whole set. Measures of central location in common use are the mode, median, and
arithmetic mean. Other less frequently used measures are the geometric mean and the
harmonic mean. By far the most widely used measure of central location is the arithmetic
mean. This statistic is an appropriate measure of central location for interval and ratio
variables. The median and mode are sometimes viewed as appropriate measures for ordinal
and nominal variables, respectively, although they can also be used with interval and ratio
variables.

Desired Learning Outcome/Competencies

At the end of the lesson, the students should be able to:

1. Solve problems about mean, median, mode, deciles and percentiles.


2. Discuss the advantages and disadvantages of using central values.

Key Words

Central Location

Statistical Measure

Mean

Median

Mode

A. Measures of Central Location

To investigate a set of quantitative data, it is useful to define numerical measures that


describe important features of data. One of the important ways of describing a group of
measurements whether it be a sample or a population, is by the use of an average.

An average is a measure of the center of a set of data when the data are arranged in
an increasing or decreasing order of magnitude. For example, if an automobile averages
14.5 kilometers to 1 liter of gasoline, this can be considered a value indicating the center of
several more values. In the country, 1 liter of gasoline may give considerably more kilometers
per liter than in the congested traffic of a large city. The number 14.5 in some sense defines a
center value.

Any measure indicating the center of a set of data, arranged in an increasing or


decreasing order of magnitude is called a measure of central location or a measure of central
tendency. The most commonly used measures of central location are the mean, median, and
mode. The most important of these and the one we shall consider first is the mean.

B. Population Mean- if the set of data x1, x2, …, xN, not necessarily all distinct, represents a finite
population of size N, then the population mean is

     N
μ = (∑ xi) / N
    i=1

Example 1: The number of employees at 5 different drugstores are 3, 5, 6, 4 and 6. Treating
the data as a population, find the mean number of employees for the 5 stores.

Solution: Since the data are considered to be a finite population,

μ = (3 + 5 + 6 + 4 + 6) / 5 = 24/5 = 4.8
C. Sample Mean- if the set of data x1, x2, …, xn, not necessarily all distinct, represents a finite
sample of size n, then the sample mean is

     n
x̄ = (∑ xi) / n
    i=1

Example 2: A food inspector examined a random sample of 7 cans of a certain brand
of tuna to determine the percent of foreign impurities. The following data were recorded: 1.8,
2.1, 1.7, 1.6, 0.9, 2.7 and 1.8. Compute the sample mean.

Solution: This being a sample, we have

x̄ = (1.8 + 2.1 + 1.7 + 1.6 + 0.9 + 2.7 + 1.8) / 7 = 12.6/7 = 1.8%
Often, it is possible to simplify the work in computing a mean by using coding
techniques. For example, it is sometimes convenient to add (or subtract) a constant to all our
observations and then compute the mean. How is this new mean related to the mean of the
original set of observations? If we let yi = xi + a, then

ȳ = (∑ (xi + a)) / n = (∑ xi)/n + a = x̄ + a

Therefore, the addition (or subtraction) of a constant to all observations changes the
mean by the same amount. To find the mean of the numbers −5, −3, 1, 4 and 6, we might add
5 first to give the set of all positive values 0, 2, 6, 9 and 11, which have a mean of 5.6. Therefore,
the original numbers have a mean of 5.6 − 5 = 0.6. Now suppose that we let yi = axi. It follows
that

ȳ = (∑ axi) / n = a (∑ xi)/n = a x̄

Therefore, if all observations are multiplied or divided by a constant, the new
observations will have a mean that is the same constant multiple of the original mean. The
mean of the numbers 4, 6 and 14 is equal to 8; therefore, after dividing by 2, the mean of the
set 2, 3 and 7 must be 8/2 = 4.
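The mean computations and the two coding properties above can be verified in a few lines of Python (the numbers are the ones from the text):

```python
# Example 1: population mean of the drugstore data
xs = [3, 5, 6, 4, 6]
assert sum(xs) / len(xs) == 4.8

# Adding a constant a to every observation shifts the mean by a
ys = [-5, -3, 1, 4, 6]
shifted = [y + 5 for y in ys]            # 0, 2, 6, 9, 11
assert sum(shifted) / 5 == 5.6
assert abs(sum(ys) / 5 - 0.6) < 1e-9     # original mean is 5.6 - 5 = 0.6

# Multiplying every observation by a constant multiplies the mean by it
assert sum([4, 6, 14]) / 3 == 8
assert sum([2, 3, 7]) / 3 == 8 / 2       # after dividing each value by 2

print("mean and coding properties verified")
```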
The second most useful measure of central location is the median. For a population we
designate the median by μ̃ and for a sample we write x̃ .

D. Median- The median of a set of observations arranged in an increasing or decreasing


order of magnitude is the middle value when the number of observations is odd or the
arithmetic mean of the two middle values when the number of observations is even.

Example 3: On 5 term tests in sociology a student has made grades of 82, 93, 86, 92 and
79. Find the median for this population of grades.

Solution: Arranging the grades in an increasing order of magnitude, we get

79 82 86 92 93

and hence μ̃ = 86
Example 4: The nicotine contents for a random sample of 6 cigarettes of a certain brand
are found to be 2.3, 2.7, 2.5, 2.9, 3.1 and 1.9 milligrams. Find the median.

Solution: If we arrange these nicotine contents in an increasing order of magnitude, we


get
1.9 2.3 2.5 2.7 2.9 3.1

and the median is then the mean of 2.5 and 2.7. Therefore,
x̃ = (2.5 + 2.7)/2 = 2.6 milligrams.
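Both median examples can be checked with the standard library's statistics module:

```python
from statistics import median

# Example 3: odd number of observations -> the middle value
grades = [82, 93, 86, 92, 79]
assert median(grades) == 86

# Example 4: even number of observations -> mean of the two middle values
nicotine = [2.3, 2.7, 2.5, 2.9, 3.1, 1.9]
assert abs(median(nicotine) - 2.6) < 1e-9

print("median examples verified")
```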

The third and final measure of central location that we shall discuss is the mode.

E. Mode: The mode of a set of observations is that value which occurs most often or with
the greatest frequency.

The mode does not always exist. This is certainly true when all observations occur with
the same frequency. For some sets of data there may be several values occurring with the
greatest frequency in which case we have more than one mode.

Example 5: If the donations from the residents of Fairway Forest toward the Virginia Lung
Association are recorded as 9, 10, 5, 9, 9,7, 8, 6, 10 and 11 dollars, then 9 dollars, the value
that occurs with the greatest frequency, is the mode.

Example 6: The number of movies attended last month by a random sample of 12 high
school students were recorded as follows: 2, 0, 3, 1, 2, 4, 2, 5, 4, 0, 1 and 4. In this case, there
are two modes, 2 and 4, since both 2 and 4 occur with the greatest frequency. The distribution is
said to be bimodal.

Example 7: No mode exists for the sociology grades of Example 3, since each grade
occurs only once.
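Since a data set may have one mode, several modes, or none, a mode finder should return a collection. A small sketch (the helper name is ours):

```python
from collections import Counter

# All modes of a data set: the values occurring with the greatest frequency.
def modes(xs):
    counts = Counter(xs)
    top = max(counts.values())
    if top == 1:               # every value occurs once: no mode exists
        return []
    return sorted(v for v, c in counts.items() if c == top)

assert modes([9, 10, 5, 9, 9, 7, 8, 6, 10, 11]) == [9]        # Example 5
assert modes([2, 0, 3, 1, 2, 4, 2, 5, 4, 0, 1, 4]) == [2, 4]  # Example 6 (bimodal)
assert modes([82, 93, 86, 92, 79]) == []                      # Example 7 (no mode)
```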
Answer the following problems.

Assignment

1, 3, 7, 10

Bring-home Quiz

2, 5, 13
CHAPTER 4

MEASURE OF VARIABILITY

Introduction

Of great concern to the statistician is the variation in the events of nature. The variation of
one measurement from another is a persisting characteristic of any sample of measurements.
Measurements of intelligence, eye color, reaction time, and skin resistance for example
exhibit variation in any sample of individuals. Anthropometric measurements such as height,
weight, diameter of the skull, length of the forearm and angular separation of the metatarsals
show variation between individuals. Anatomical and physiological measurements vary; also,
the measurements made by the physicist, chemist, botanist and agronomist. Statistics can be
viewed as the study of variation. The experimental scientist is concerned with the different
circumstances, conditions or sources which contribute to the variation in the measurements
he or she obtains. Among the possible measures used to describe this variation are the
range, the mean deviation and the standard deviation. The most important of these is the
standard deviation.

Desired Learning Outcomes/ Competencies

At the end of the lesson, the students should be able to:

1. Identify the most typical measures of dispersion – the range, variance, standard
deviation and coefficient of variation.
2. Determine the extent of the scatter so that steps may be taken to control the existing
variation

Key Words

Range

Variance

Mean Deviation

Standard Deviation

z Scores

A. Measures of Variation

The three measures of central location discussed, do not by themselves give an


adequate description of our data. We need to know how the observations spread out from
the average. It is quite possible to have two sets of observations with the same mean or
median that differ considerably in the variability of their measurements about the average.

Consider the following measurements, in liters, for two samples of orange juice bottled
by companies A and B:
Sample A 0.97 1.00 0.94 1.03 1.06
Sample B 1.06 1.01 0.88 0.91 1.14

Both samples have the same mean, 1.00 liters. It is quite obvious that company A bottles
orange juice with a more uniform content than company B. We say that the variability or the
dispersion of the observations from the average is less for sample A than for sample B.
Therefore, in buying orange juice, we would feel more confident that the bottle we select will
be closer to the advertised average if we buy from company A.

The most important statistics for measuring the variability of a set of data are the range
and the variance. The simplest of these to compute is the range.
B. Range- the range of a set of data is the difference between the largest and smallest
number in the set.

Example 8: The IQs of 5 members of a family are 108, 112, 127, 118 and 113. Find the
range.

Solution: The range of the 5 IQs is 127 - 108 = 19.

In the case of the companies bottling orange juice, the range for company A is 0.12 liters
compared to a range of 0.26 liters for company B, indicating a greater spread in the values
for company B.
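The range is simple enough to express in one line (a sketch; the helper name is ours):

```python
# The range: largest observation minus smallest observation.
def data_range(xs):
    return max(xs) - min(xs)

assert data_range([108, 112, 127, 118, 113]) == 19   # Example 8
```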

The range is a poor measure of variation, particularly if the size of the sample or population is
large. It considers only the extreme values and tells us nothing about the distribution of
numbers in between. Consider, for example, the following two sets of data, both with a range
of 12.
Set A 3 4 5 6 8 9 10 12 15
Set B 3 7 7 7 8 8 8 9 15

In set A the mean and median are both 8, but the numbers vary over the entire interval from
3 to 15. In set B the mean and median are also 8, but most of the values are closer to the
center of the data. Although the range fails to measure this variation between the upper and
lower observations, it does not have some useful applications. In industry the range for
measurements on items coming off an assembly line might be specified in advance. As long
as all measurements fall within the specified range, the process is said to be in control.

To overcome the disadvantage of the range, we shall consider a measure of variation,


namely, the variance, that considers the position of each observation relative to the mean
of the set. This is accomplished by examining the deviations from the mean. The deviation of
an observation from the mean is found by subtracting the mean of our set of data from the
given observation. For the finite population x1, x2,…xN, the deviations are

x1 - µ, x2 - µ,…xN - µ
Similarly, if our set of data is the random sample x1, x2,…xn, the deviations are

x1 - x̄, x2 - x̄,…xn - x̄
An observation greater than the mean will yield a positive deviation, whereas an observation
smaller than the mean will produce a negative deviation. Comparing the deviations for the
two sets of data below, we have the following:

Set A -5 -4 -3 -2 0 1 2 4 7
Set B -5 -1 -1 -1 0 0 0 1 7

Clearly, most of the deviations of set B are smaller in magnitude than those of set A,
indicating less variation among the observations of set B. Our aim now is to obtain a single
numerical measure of variation that incorporates all the deviations from the mean. The most
obvious procedure would be to average the deviations. The sum of the deviations from the
mean is zero for any set of data and consequently their mean is also zero. To circumvent this
problem, we could find a measure of variation called the mean deviation whereby we
compute the mean of the absolute values of the deviations. An absolute value of a number
is the number without the associated algebraic sign. Thus, the absolute value of -4 is simply
4.
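Both facts above can be verified with set B (the variable names here are ours):

```python
# Deviations from the mean sum to zero, so the mean deviation
# averages their absolute values instead.
set_b = [3, 7, 7, 7, 8, 8, 8, 9, 15]
m = sum(set_b) / len(set_b)                      # mean = 8
deviations = [x - m for x in set_b]

assert abs(sum(deviations)) < 1e-9               # the deviations cancel out

mean_deviation = sum(abs(d) for d in deviations) / len(set_b)
assert abs(mean_deviation - 16 / 9) < 1e-9       # (5+1+1+1+0+0+0+1+7)/9
```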

In practice, the mean of the absolute values of deviation from the mean is seldom used.
The use of absolute values makes its mathematical treatment awkward. Instead, we shall
work with the squares of all the deviations in computing the variance. In the case of a finite
population of size N, the variance denoted by the symbol σ2 (sigma squared), may be
computed directly from the following summation formula.

C. Population Variance: Given finite population x1, x2,…xN, the population variance is

σ2 = ∑Ni=1(xi − μ)2 / N

Assuming that the two sets A and B are populations, we now use the deviations in the
preceding table to calculate their variances. For set A,

σ2 = [(-5)2 + (-4)2 + (-3)2 + (-2)2 + 02 + 12 + 22 + 42 + 72]/9 = 124/9 ≈ 13.8

and for set B,

σ2 = [(-5)2 + (-1)2 + (-1)2 + (-1)2 + 02 + 02 + 02 + 12 + 72]/9 = 78/9 ≈ 8.7

A comparison of the two variances shows that the data of set A are more variable than the
data of set B.
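The population-variance formula translates directly into code; this sketch (helper name ours) checks the two sets above:

```python
# Population variance computed directly from the definition.
def pop_variance(xs):
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

set_a = [3, 4, 5, 6, 8, 9, 10, 12, 15]
set_b = [3, 7, 7, 7, 8, 8, 8, 9, 15]
assert abs(pop_variance(set_a) - 124 / 9) < 1e-9   # ≈ 13.8
assert abs(pop_variance(set_b) - 78 / 9) < 1e-9    # ≈ 8.7
assert pop_variance(set_a) > pop_variance(set_b)   # set A is more variable
```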

By using the square of the deviations to compute the variance, we obtain a number in
squared units. That is, if the original measurements were in feet the variance would be
expressed in squared feet. To get a measure of variation expressed in the same units as the
raw data, as was the case for the range, we take the square root of the variance. Such a
measure is called standard deviation.

Example 9: The following scores were given by 6 judges for a gymnast’s performance in
the vault of an international meet: 7, 5, 9, 7, 8 and 6. Find the standard deviation of this
population.

Solution: First, we compute

μ = (7 + 5 + 9 + 7 + 8 + 6)/6 = 7

and then

σ2 = [02 + (-2)2 + 22 + 02 + 12 + (-1)2]/6 = 10/6 ≈ 1.67

D. Standard Deviation- the positive square root of the variance. The standard deviation for the gymnast’s scores is then

σ = √(10/6) ≈ 1.29

The variance of a sample, denoted by s2, is a statistic. Therefore, different random
samples of size n, selected from the same population, would generally yield different values
for s2. In most statistical applications the parameter σ2 is unknown and estimated by the value
s2. For our estimate to be good, it must be computed from a formula that on the average
produces the true answer σ2. That is, if we were to take all possible random samples of size n from
a population and compute s2 for each sample, the average of all the s2 values should be
equal to σ2. A statistic that estimates the true parameter on the average is said to be
unbiased.
Intuitively, we would expect the formula for s2 to be the same summation formula as
that used for σ2, with the summation now extending over the sample observations and with μ
replaced by x̄. This is indeed done in many texts, but the values so computed for the sample
variance tend to underestimate σ2 on the average. To compensate for this bias, we replace
n by n-1 in the divisor.
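A small simulation can make this bias concrete. The sketch below (entirely ours, with arbitrary population parameters) repeatedly draws samples of size 5 and compares the two divisors:

```python
import random

# Simulate why dividing by n - 1 rather than n gives a better
# (approximately unbiased) estimate of the population variance.
random.seed(1)
population = [random.gauss(50, 10) for _ in range(1000)]
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

n = 5
biased, unbiased = [], []
for _ in range(20000):
    sample = random.sample(population, n)
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    biased.append(ss / n)          # tends to underestimate sigma^2
    unbiased.append(ss / (n - 1))  # averages out close to sigma^2

avg_biased = sum(biased) / len(biased)
avg_unbiased = sum(unbiased) / len(unbiased)
assert avg_biased < avg_unbiased   # the n divisor is systematically smaller
```

Running this, the average of the n-divisor estimates falls noticeably below σ2, while the n − 1 divisor lands close to it.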
E. Sample Variance-Given a random sample x1, x2,…xn, the sample variance is

s2 = ∑ni=1(xi − x̄)2 / (n − 1)

Example 10: A comparison of coffee prices at 4 randomly selected grocery stores in San
Diego showed increases from the previous month of 12, 15, 17 and 20 cents for a 200- gram
jar. Find the variance of this random sample of price increases.

Solution: Calculating the sample mean, we get

x̄ = (12 + 15 + 17 + 20)/4 = 16 cents

Therefore,

s2 = [(-4)2 + (-1)2 + 12 + 42]/3 = 34/3 ≈ 11.3
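Example 10 can be checked with a short sketch of the definition form (helper name ours):

```python
# Sample variance from the definition, with the n - 1 divisor.
def sample_variance(xs):
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

increases = [12, 15, 17, 20]                              # Example 10, in cents
assert sum(increases) / 4 == 16                           # the sample mean
assert abs(sample_variance(increases) - 34 / 3) < 1e-9    # about 11.3
```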
If x̄ is a decimal number that has been rounded off, we accumulate a large error using
the sample-variance formula in the form given above. To avoid this, we use the more
convenient computational formula

s2 = [n ∑ni=1 xi2 − (∑ni=1 xi)2] / [n(n − 1)]

The sample standard deviation, denoted by s, is defined to be the positive square root of the
sample variance.

Example 11: Find the variance of the data 3, 4, 5, 6, 6 and 7, representing the number of trout
caught by a random sample of 6 fishermen on June 19, 1981 at Lake Muskoka.

Solution: In tabular form we write

xi    xi2
3     9
4     16
5     25
6     36
6     36
7     49

Totals: ∑xi = 31, ∑xi2 = 171, n = 6

Hence,

s2 = [(6)(171) − (31)2] / [(6)(5)] = 13/6
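The computational formula avoids subtracting a rounded mean from every observation; a sketch (helper name ours) applied to the trout data:

```python
# The computational form of the sample variance:
# s^2 = (n * sum(x_i^2) - (sum x_i)^2) / (n * (n - 1))
def sample_variance_comp(xs):
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x * x for x in xs)
    return (n * sx2 - sx * sx) / (n * (n - 1))

trout = [3, 4, 5, 6, 6, 7]                                # Example 11
assert (sum(trout), sum(x * x for x in trout)) == (31, 171)
assert abs(sample_variance_comp(trout) - 13 / 6) < 1e-9
```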

Often, it is possible to simplify the computational procedure for calculating the variance of a
set of data by using coding techniques. Recall that coding was used in Section 2.2 to
compute the mean. The effects of coding on the variance by subtracting a constant from
each observation or by dividing each observation by a constant will be of particular interest
to us. We shall investigate these effects here only for random samples, but the results are
equally valid for populations.
If we let yi = xi + c, it follows that ȳ = x̄ + c, and hence the variance of the yi's is

s2 = ∑ni=1(yi − ȳ)2/(n − 1) = ∑ni=1[(xi + c) − (x̄ + c)]2/(n − 1) = ∑ni=1(xi − x̄)2/(n − 1)

Therefore, if each observation of a set of data is transformed to a new set by the addition (or
subtraction) of a constant c, the variance of the original set of data is the same as the
variance of the new set.

Now suppose we let yi = cxi, so that ȳ = cx̄. It follows that the variance is

s2 = ∑ni=1(yi − ȳ)2/(n − 1) = ∑ni=1(cxi − cx̄)2/(n − 1) = c2 ∑ni=1(xi − x̄)2/(n − 1)

Therefore, if a set of data is transformed to a new set by multiplying (or dividing) each
observation by a constant 𝑐,the variance of the original set is equal to the variance of the
new set divided (or multiplied) by 𝑐2.

Example 12: A random sample of 5 bank presidents indicated annual salaries of $63,000,
$48,000, $62,000, $35,000 and $41,000. Find the variance of this set of data by using appropriate
coding techniques.

Solution: If we divide all the salaries by 1000 and then subtract 50, we obtain the numbers 13,
-2, 12, -15 and -9, for which ∑yi = -1 and ∑yi2 = 623. Now, for the coded data,

s2 = [(5)(623) − (−1)2] / [(5)(4)] = 155.7

and after multiplying by 10002, the variance of the original set of salaries is 𝑠2 = 1.557 × 108.
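The coding trick and the undo step can be checked numerically. The salary figures in this sketch follow the coded values -- 13, -2, 12, -15, -9 -- given in the text:

```python
# Example 12's coding trick: code the data, find its variance, then undo.
salaries = [63000, 48000, 62000, 35000, 41000]

def sample_variance(xs):
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

coded = [s / 1000 - 50 for s in salaries]     # divide by 1000, subtract 50
assert coded == [13, -2, 12, -15, -9]
assert abs(sample_variance(coded) - 155.7) < 1e-9

# Subtracting 50 leaves the variance unchanged; dividing by 1000
# divides it by 1000^2, so we multiply back to recover the original.
assert abs(sample_variance(coded) * 1000**2 - sample_variance(salaries)) < 1e-3
```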

The standard deviation seems to be the best measure of variation that we have. At this point,
however, it has meaning only when comparing two or more sets of data having the same units
of measurement and approximately the same mean. Therefore, we could compare the
variances of the observations of two companies bottling orange juice and the larger value
would indicate the company whose product is more variable or less uniform provided that
bottles of the same size were used. It would not be meaningful to compare the variance of
a set of heights to the variance of a set of aptitude scores.

F. Z Scores

In assessing the accomplishments of a student in chemistry and economics during the


summer session at a certain college, we might compare her numerical grades for the two
courses. If we assume that the student made a grade of 82 in chemistry and a grade of 89 in
economics, can we conclude that she is a better student in economics than in chemistry?
Perhaps we should consider how this student performed relative to the other students in each
of her classes. Is it not possible that one examination was much more difficult than the other
and that she actually did better in chemistry relative to the other students enrolled in chemistry
than she did in economics? After all, the mean grade in chemistry was 68 and the standard
deviation was 8, whereas the distribution of economics grades had a mean of 80 and a
standard deviation of 6.
The problem before us, then, is one of comparing two observations from two different
populations in order to determine their relative rank. In our illustration the entire set of
chemistry grades constitutes one of the populations, the entire set of economics grades
represents the other populations and the student’s grades are the two observations. One
method for ranking these two observations is to convert the individual observations into
standard units known as z scores or z values.

Z Score- An observation, x, from a population with mean μ and standard deviation σ has a z
score or z value defined by

z = (x − μ)/σ
A z score measures how many standard deviations an observation is above or below the
mean. Since σ is never negative, a positive z score measures the number of standard
deviations an observation is above the mean, and a negative z score gives the number of
standard deviations an observation is below the mean. Note that the units of the denominator
and the numerator of a z score cancel. Hence a z score is unitless, thereby permitting a
comparison of two observations relative to their groups, measured in completely different
units.

Let us now compute the z scores corresponding to our student’s grades in chemistry
and economics. For chemistry we obtain

z = (82 − 68)/8 = 1.75

and for economics

z = (89 − 80)/6 = 1.50

Since her chemistry grade is 1.75 standard deviations above the class mean while her
economics grade is only 1.50 standard deviations above its class mean, she actually
performed better in chemistry relative to her classmates.

Example 13: Different typing skills are required for secretaries depending on whether one is
working in a law office, an accounting firm, or a research mathematical group at a major
university. In order to evaluate candidates for these positions, an employment agency
administers three distinct standardized typing samples. A time penalty has been incorporated
into the scoring of each sample based on the number of typing errors. The mean and
standard deviation for each test, together with the score achieved by a recent applicant are
given in Table 2.1.

Table 2.1

Data for Standardized Typing Samples


Sample Applicant’s Score Mean Standard Deviation

Law 141 sec 180 sec 30 sec


Accounting 7 min 10 min 2 min
Scientific 33 min 26 min 5 min

For what type of position does this applicant seem to be best suited?

Solution: First we compute the z score for each sample.

Law: z = (141 − 180)/30 = −1.3

Accounting: z = (7 − 10)/2 = −1.5

Scientific: z = (33 − 26)/5 = 1.4
Since speed is of primary importance, we are looking for the z score that represents the
greatest number of standard deviations to the left of the mean and in our case that would
be -1.5. Therefore, this particular applicant ranks higher among typists in accounting firms than
when compared to typists in the other two areas, and consequently should be placed with
an accounting firm.
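The z-score comparison of Example 13 can be sketched in a few lines (the function and dictionary names are ours):

```python
# Ranking the typing applicant of Example 13 by z score.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

# (applicant's score, mean, standard deviation) from Table 2.1
tests = {
    "Law":        (141, 180, 30),
    "Accounting": (7, 10, 2),
    "Scientific": (33, 26, 5),
}
z = {name: z_score(*t) for name, t in tests.items()}
assert z["Law"] == -1.3
assert z["Accounting"] == -1.5
assert z["Scientific"] == 1.4

# Lower time is better, so the most negative z score wins.
assert min(z, key=z.get) == "Accounting"
```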
Answer the following problems.

Assignment:

6, 9, 16

Bring-home quiz:

7, 11, 18

CHAPTER 5

STATISTICAL DESCRIPTION OF DATA


Introduction

Often, we are confronted with the problem of disseminating large masses of statistical data
in compact form. Although numerical measures of location and variation are certainly useful
compact descriptions of a set of observations, they do not by themselves identify all the
important features of the data. Considerable information can be retrieved from large masses
of data when they are summarized and displayed by means of appropriate tables, charts
and graphs.

Desired Learning Outcome/ Competencies

At the end of the lesson, the students should be able to:

1. Construct the different presentation of data in textual, tabular and graphical


methods.
2. Build up frequency distribution, histogram, frequency polygon and cumulative
frequency distribution.

Key Terms
Distribution
Graphical Representation
Symmetry
Skewness

A. Frequency Distribution- an arrangement of data in tabular form in which the important
characteristics of a large mass of data can be readily assessed by grouping the data into
different classes and then determining the number of observations that fall in each class.
Grouped Data- Data that are presented in the form of a frequency distribution.
Table 3.1: Frequency Distribution for the Weights of 50 Pieces of Luggage.
Weight(kg) Number of Pieces
7-9 2
10-12 8
13-15 14
16-18 19
19-21 7
Table 3.1 shows the frequency distribution of the weights of 50 pieces of luggage
recorded to the nearest kilogram, belonging to the passengers on a commercial flight from
Denver to Chicago. For these data we have used the 5 class intervals 7-9, 10-12, 13-15, 16-18
and 19-21.

Class Limits- smallest and largest values that can fall in a given class interval. For the interval
10-12, the smaller number is 10 (lower class limit) and the larger number is 12 (upper class
limit).The original data were recorded to the nearest kilogram, so the 8 observations in the
interval 10-12 are the weights of all the pieces of luggage weighing more than 9.5 kilograms
but less than 12.5 kilograms. The numbers 9.5 and 12.5 are called the class boundaries for the
given interval. For the interval 10-12, the number 9.5 is called the lower class boundary and
12.5 is called the upper class boundary. However, 12.5 would also be the lower class
boundary for the interval 13-15.

Class Frequency (f)- the number of observations falling in a particular class.

Class Width- the numerical difference between the upper and the lower class boundaries of
a class interval.

Class Mark or Class Midpoint- The midpoint between the upper and lower class boundaries
or class limits of a class interval.

Table 3.2: Frequency Distribution for the Weights of 50 Pieces of Luggage.


Class Interval   Class Boundaries   Class Mark, x   Frequency, f
7-9              6.5-9.5            8               2
10-12            9.5-12.5           11              8
13-15            12.5-15.5          14              14
16-18            15.5-18.5          17              19
19-21            18.5-21.5          20              7
To illustrate the construction of a frequency distribution, consider the data of
Table 3.3 below, which represent the lives of 40 similar car batteries recorded to the nearest
tenth of a year. The batteries were guaranteed to last 3 years.

Table 3.3: Car Battery Lives


2.2 4.1 3.5 4.5 3.2 3.7 3.0 2.6

3.4 1.6 3.1 3.3 3.8 3.1 4.7 3.7


2.5 4.3 3.4 3.6 2.9 3.3 3.9 3.1
3.3 3.1 3.7 4.4 3.2 4.1 1.9 3.4
4.7 3.8 3.2 2.6 3.9 3.0 4.2 3.5
The steps in grouping a large set of data into a frequency distribution may be summarized as
follows.

1. Decide on the number of class intervals required.


2. Determine the range.
3. Divide the range by the number of classes to estimate the approximate width of the
interval.
4. List the lower class limit of the bottom interval and then the lower class boundary. Add
the class width to the lower class boundary to obtain the upper class boundary. Write
down the upper class limit.
5. List all the class limits and class boundaries by adding the class width to the limits and
boundaries of the previous interval.
6. Determine the class marks of each interval by averaging the class limits or the class
boundaries.
7. Tally the frequencies for each class.
8. Sum the frequency column and check against the total number of observations.
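The tallying in steps 7 and 8 can be sketched in Python for the battery data, using 7 classes of width 0.5 starting at 1.5 (the class choice follows Table 3.4; the code itself is ours):

```python
# Tally the battery lives of Table 3.3 into 7 classes of width 0.5.
lives = [2.2, 4.1, 3.5, 4.5, 3.2, 3.7, 3.0, 2.6,
         3.4, 1.6, 3.1, 3.3, 3.8, 3.1, 4.7, 3.7,
         2.5, 4.3, 3.4, 3.6, 2.9, 3.3, 3.9, 3.1,
         3.3, 3.1, 3.7, 4.4, 3.2, 4.1, 1.9, 3.4,
         4.7, 3.8, 3.2, 2.6, 3.9, 3.0, 4.2, 3.5]

classes = [(1.5 + 0.5 * k, 1.9 + 0.5 * k) for k in range(7)]   # class limits
freq = [sum(1 for x in lives if lo - 0.05 <= x <= hi + 0.05)   # class boundaries
        for lo, hi in classes]

assert freq == [2, 1, 4, 15, 10, 5, 3]   # matches Table 3.4
assert sum(freq) == 40                   # step 8: check the total
```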

Table 3.4: The Frequency Distribution of Battery Lives


Class Interval   Class Boundaries   Class Midpoint   Frequency
1.5-1.9          1.45-1.95          1.7              2
2.0-2.4          1.95-2.45          2.2              1
2.5-2.9          2.45-2.95          2.7              4
3.0-3.4          2.95-3.45          3.2              15
3.5-3.9          3.45-3.95          3.7              10
4.0-4.4          3.95-4.45          4.2              5
4.5-4.9          4.45-4.95          4.7              3
Variations of Table 3.4 are obtained by listing the relative frequencies or percentages
for each interval. The relative frequency of each class can be obtained by dividing the class
frequency by the total frequency. A table listing relative frequencies is called a relative
frequency distribution. If each relative frequency is multiplied by 100%, we have a percentage
distribution. The relative frequency distribution for the data in Table 3.2 is given in Table 3.5.

Table 3.5: Relative Frequency Distribution for 50 Pieces of Luggage


Class Interval   Class Boundaries   Class Mark, x   Relative Frequency
7-9              6.5-9.5            8               0.04
10-12            9.5-12.5           11              0.16
13-15            12.5-15.5          14              0.28
16-18            15.5-18.5          17              0.38
19-21            18.5-21.5          20              0.14

In many situations we are concerned not with the number of observations in a given
class but in the number that fall above or below a specified value. For example, in Table 3.4
the number of batteries lasting less than 3 years is 7. The total frequency of all values less than
the upper class boundary of a given class interval is called the cumulative frequency up to and
including that class. Table 3.6 shows the cumulative frequencies; a table of this kind is called a
cumulative frequency distribution.

Table 3.6: Cumulative Frequency of Distribution of Battery Lives


Class Boundaries Cumulative Frequency
Less than 1.45 0
Less than 1.95 2
Less than 2.45 3
Less than 2.95 7
Less than 3.45 22
Less than 3.95 32
Less than 4.45 37
Less than 4.95 40
Two additional forms of Table 3.6 are possible using relative frequencies and percentages.
Such distributions are called relative cumulative frequency distributions and percentage
cumulative distributions. The percentage cumulative distribution enables one to read off the
percentage of observations falling below certain specified values. For example, in Table 3.7
we can see that 80% of the batteries last less than 3.95 years.
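Tables 3.6 and 3.7 follow mechanically from the class frequencies of Table 3.4; a sketch (variable names ours):

```python
# Build the cumulative and percentage cumulative distributions
# from the class frequencies of Table 3.4.
freq = [2, 1, 4, 15, 10, 5, 3]     # classes 1.5-1.9 ... 4.5-4.9

cumulative = []
running = 0
for f in freq:
    running += f
    cumulative.append(running)

# Table 3.6 (after the initial "less than 1.45": 0 row)
assert cumulative == [2, 3, 7, 22, 32, 37, 40]

# Table 3.7: divide by n = 40 and multiply by 100%
percent = [100 * c / 40 for c in cumulative]
assert percent == [5.0, 7.5, 17.5, 55.0, 80.0, 92.5, 100.0]
```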

Table 3.7: Percentage Cumulative Distribution of Battery Lives


Class Boundaries   Cumulative Percent
Less than 1.45     0
Less than 1.95     5.0
Less than 2.45     7.5
Less than 2.95     17.5
Less than 3.45     55.0
Less than 3.95     80.0
Less than 4.45     92.5
Less than 4.95     100.0

B. Graphical Representations

The information provided by a frequency distribution in tabular form is easier to grasp if


presented graphically. Most people find a visual picture beneficial in comprehending the
essential features of a frequency distribution. A widely used form of graphic presentation of
numerical data is the bar chart.

One can quickly observe from the bar chart that most of the batteries lasted from 3.0 to 3.4
years, only a very few batteries lasted less than 2.5 years and no battery lasted longer than 4.9
years. In a bar chart the base of each bar corresponds to a class interval of a frequency
distribution and the heights of the bars represent the frequencies associated with each class.
Although the bar chart provides immediate information about a set of data in a condensed
form, we are usually more interested in a related pictorial representation called histogram. A
histogram differs from a bar chart in that the bases of each bar are the class boundaries
rather than the class limits. The use of class boundaries for the base eliminates the spaces
between the bars to give the solid appearance of Figure 3.2
For some problems it will be more convenient to let the vertical axis represent relative
frequencies or percentages. The graphs called relative frequency histograms or percentage
histograms have exactly the same shape as the frequency histogram but a different vertical
scale.
In viewing a histogram, the eye tends to compare the areas of the different rectangles rather
than their heights. Although this is appropriate for class intervals of equal width, it can be very
misleading if some of the class width differ. Unscrupulous individuals have been known to
deliberately misrepresent

data by erroneously constructing histograms with unequal class widths. Suppose for example,
that we combine the two class intervals 2.5-2.9 and 3.0-3.4 of Table 3.4 into the single interval
2.5-3.4 containing the 19 observations of the combined frequencies 4 and 15.
In Figure 3.3, we get the mistaken impression that well over half of the observations fall in the
longer class interval 2.5-3.4 when the actual number is just one less than half. To correct for
this misconception, we must reduce the height of this new rectangle by the inverse of the
factor that extends the class interval. Since we doubled the class width by combining the two
intervals, we must therefore divide the height of this new rectangle by 2 to give the correct
visual picture as shown in the Figure 3.4. Of course, now that areas and not heights represent
the frequencies, we have no further need for the vertical axis, and it is therefore omitted.

A second useful way of presenting numerical data in graphic form is by means of a
frequency polygon. Frequency polygons are constructed by plotting class frequencies
against class marks and connecting the consecutive points by straight lines.

A polygon is a many-sided closed figure. To close the frequency polygon an additional class
interval is added to both ends of the distribution, each with zero frequency. For our example,
the midpoints of these two additional classes will be 1.2 and 5.2. These two points enable us
to connect both ends to the horizontal axis, resulting in a polygon. The frequency polygon for
the data of Table 3.4 is shown in Figure 3.5. We can obtain the frequency polygon very quickly
from the histogram by joining the midpoints of the tops of adjacent rectangles and then
adding the two intervals at each end.

If we wish to compare two sets of data with unequal sample sizes by constructing two
frequency polygons on the same graph, we must use relative frequencies or percentages. A
graph similar to Figure 3.5, but using relative frequencies or percentages, is called a relative
frequency polygon or a percentage polygon.
A second line graph, called a cumulative frequency polygon or ogive, is obtained by
plotting the cumulative frequency less than any upper class boundary against the upper class
boundary and joining all the consecutive points by straight lines. The cumulative frequency
polygon for the data of Table 3.6 is shown in the Figure 3.6. If relative cumulative frequencies
or percentages had been used, we would call the graph a relative frequency ogive or a
percentage ogive.

C. Symmetry and Skewness

The shape or distribution of a set of measurements is best displayed by means of a histogram.
Some of the many possible shapes that might arise are illustrated in Figures 3.7 and 3.8. A
distribution is said to be symmetric if it can be folded along a vertical axis so that the two sides
coincide. We see that the distributions in Figure 3.7 are indeed symmetric, although quite
different in appearance. A distribution that lacks symmetry with respect to a vertical axis is
said to be skewed.

The distribution illustrated in Figure 3.8a is said to be skewed to the right, or positively skewed,
since it has a long right tail compared to a much shorter left tail. In Figure 3.8b the distribution is
skewed to the left, or negatively skewed.
For a symmetric distribution of measurements, the mean and median are both located at
the same position along the horizontal axis. However, if the data are skewed to the right as in
Figure 3.8a, the large values in the right tail are not offset by correspondingly low values in the
left tail and consequently the mean will be greater than the median. In Figure 3.8b the reverse
is true and the small values in the left tail will make the mean less than the median. We shall
use this behavior between the mean and median relative to the standard deviation to define
a numerical measure of skewness.

D. Percentiles, Deciles and Quartiles

Percentile- Percentiles are values that divide a set of observations into 100 equal parts. These values,
denoted by P1, P2,…. P99, are such that 1% of the data falls below P1, 2% falls below P2,…and
99% falls below P99.

To illustrate the procedure in calculating a percentile, let us find P85 for the distribution of
battery lives in Table 3.3. First, we must rank the given data in increasing order of magnitude
as displayed in Table 3.9. Since the table contains 40 observations, we seek the value below
which (85/100) x 40 =34 observations fall. As seen in table 3.9, P85 could be any value between
4.1 years and 4.2 years. In order to give a unique value, we shall define P85= 4.15 years. This
procedure works very well whenever the number of observations below the given percentile
is a whole number. However, when the required number of observations is fractional, it is
customary to use the next highest whole number to find the required percentile. For example,
in finding P48 we seek the value below which (48/100) x 40= 19.2 observations fall. Rounding
up to the next integer, we use the 20th observation as our location point. Hence P48 = 3.4 years.
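The whole-number and fractional cases of this rule can be sketched as a small function (the name and interface are ours):

```python
import math

# The percentile rule described above, for raw (ungrouped) sorted data.
def percentile(sorted_xs, p):
    n = len(sorted_xs)
    pos = p * n / 100
    if pos == int(pos):              # whole number: average the two straddling values
        k = int(pos)
        return (sorted_xs[k - 1] + sorted_xs[k]) / 2
    k = math.ceil(pos)               # fractional: take the next highest observation
    return sorted_xs[k - 1]

lives = sorted([2.2, 4.1, 3.5, 4.5, 3.2, 3.7, 3.0, 2.6,
                3.4, 1.6, 3.1, 3.3, 3.8, 3.1, 4.7, 3.7,
                2.5, 4.3, 3.4, 3.6, 2.9, 3.3, 3.9, 3.1,
                3.3, 3.1, 3.7, 4.4, 3.2, 4.1, 1.9, 3.4,
                4.7, 3.8, 3.2, 2.6, 3.9, 3.0, 4.2, 3.5])

assert abs(percentile(lives, 85) - 4.15) < 1e-9   # between 4.1 and 4.2
assert abs(percentile(lives, 48) - 3.4) < 1e-9    # the 20th observation
```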

Table 3.9: Car Battery Lives by Rank

1.6 2.6 3.1 3.2 3.4 3.7 3.9 4.3
1.9 2.9 3.1 3.3 3.4 3.7 3.9 4.4
2.2 3.0 3.1 3.3 3.5 3.7 4.1 4.5
2.5 3.0 3.2 3.3 3.5 3.8 4.1 4.7
2.6 3.1 3.2 3.4 3.6 3.8 4.2 4.7

(The data are ranked reading down each column.)

Although one can always determine a percentile from the original data, it may be
advantageous and less time- consuming to calculate a percentile directly from the
frequency distribution. In grouping the data, we have chosen to ignore the identity of the
individual observations. The only information that remains, assuming the original raw data
have been discarded is the number of observations falling in each class interval. To evaluate
a percentile from a frequency distribution, we assume the measurements within a given class
interval to be uniformly distributed between the lower and upper class boundaries. This is
equivalent to interpreting a percentile as a value below which a specific fraction or
percentage of the area of a histogram falls. To illustrate the calculation of a percentile from
a frequency distribution, we consider the following example.

Example 3: Find P48 for the distribution of battery lives in Table 3.4.

Solution: We are seeking the value below which (48/100) x 40= 19.2 of the observations fall.
The fact that the observations are assumed uniformly distributed over the class interval permits
us to use fractional observations as is the case here. There are 7 observations falling below
the class boundary 2.95. We still need 12.2 of the next 15 observations falling between 2.95
and 3.45. Therefore, we must go a distance (12.2/15) x 0.5 = 0.41 beyond 2.95. Hence

P48 = 2.95 + 0.41

= 3.36 years

compared with 3.4 years obtained above from the ungrouped data. Therefore, we conclude
that 48% of all batteries of this type will last less than 3.36 years.
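The interpolation in Example 3 (and, identically, for deciles and quartiles) can be sketched as one function (names ours), under the stated assumption that measurements are uniform within each class:

```python
# A percentile from a grouped frequency distribution, interpolating
# within the class that contains the target count.
def grouped_percentile(boundaries, freq, p, n):
    target = p * n / 100            # observations that must fall below the percentile
    running = 0
    for (lo, hi), f in zip(boundaries, freq):
        if running + f >= target:   # the percentile lies in this class
            return lo + (target - running) / f * (hi - lo)
        running += f

bounds = [(1.45, 1.95), (1.95, 2.45), (2.45, 2.95), (2.95, 3.45),
          (3.45, 3.95), (3.95, 4.45), (4.45, 4.95)]
freq = [2, 1, 4, 15, 10, 5, 3]      # Table 3.4

assert abs(grouped_percentile(bounds, freq, 48, 40) - 3.357) < 0.01   # ≈ 3.36 years
assert abs(grouped_percentile(bounds, freq, 70, 40) - 3.75) < 1e-9    # D7 of Example 4
```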

Decile- Deciles are values that divide a set of observations into 10 equal parts. These values
denoted by D1, D2,…, D9, are such that 10% of the data falls below D1, 20% falls below D2…
and 90% falls below D9.

Deciles are found in exactly the same way that we found percentiles. To find D7 for the
distribution of battery lives, we need the value below which (70/100) x 40 =28 of the
observations in Table 3.9 fall. Since this can be any value between 3.7 years and 3.8 years,
we take their average and hence D7 = 3.75 years. Therefore, we conclude that 70% of all
batteries of this type will last less than 3.75 years.

Example 4: Use the frequency distribution of Table 3.4 to find the D7 for the distribution of
battery lives.

Solution: We need the value below which (70/100) x 40= 28 observations fall. There are
22 observations falling down 3.45. We still need 6 of the next 10 observations and therefore
must go a distance (6/10) x 0.5 =0.3 beyond 3.45. Hence,

D7= 3.45 + 0.3

= 3.75 years

which is identical to the value obtained by using the ungrouped data.

Quartile- are values that divide a set of observations into 4 equal parts. These values, denoted
by Q1, Q2 and Q3, are such that 25% of the data falls below Q1, 50% falls below Q2 and 75% falls
below Q3.

To find Q1 for the distribution of battery lives, we need the value below which (25/100) x
40 =10 of the observations in Table 3.9 fall. Since the 10th and 11th measurements are both
equal to 3.1 years, their average will also be 3.1 years and hence Q1 = 3.1 years.

Example 5: Use the frequency distribution of Table 3.2 to find Q3 for the distribution of
weights of 50 pieces of luggage.

Solution: We need the value below which (75/100) x 50 = 37.5 observations fall. There are
24 observations falling below 15.5 kilograms. We still need 13.5 of the next 19 observations and
therefore must go a distance (13.5/19) x 3 = 2.1 beyond 15.5. Hence

Q3= 15.5 +2.1

= 17.6 kilograms.

Therefore, we conclude that 75% of all 50 pieces of luggage weigh less than 17.6 kilograms.

The 50th percentile, fifth decile and second quartile of a distribution are all equal to the
same value, commonly referred to as the median. All the quartiles and deciles are
percentiles. For example, the seventh decile is the 70th percentile and the first quartile is the
25th percentile. Any percentile, decile or quartile can also be estimated from a percentage
ogive.
Answer the following problems

Assignment:

4, 12

Bring-home Quiz

5, 13
