BEED 2A
Course Outcome:
At the end of the semester, students are expected to be able to:
1. Discuss the importance of Statistics in the context of education and other related
sciences.
2. Identify different statistical methods and tools appropriate for educational research.
3. Process data using the statistical tools with confidence.
4. Explain the differences among experimental designs used in scientific research in terms of
application and limitations.
CHAPTER 1
INTRODUCTION TO STATISTICS
Key Words
Statistics
Parametric Statistics
Measurement
Variable
Summation
Data
A. STATISTICS
In its plural sense, statistics is a set of numerical data (e.g., vital statistics in a beauty contest,
monthly sales of a company, daily peso-dollar exchange rate). In its singular sense, statistics
is the branch of science which deals with the collection, presentation, analysis and
interpretation of data.
The field of statistics may be divided into descriptive and inferential statistics. Descriptive
statistics is concerned only with summarizing values to describe the group characteristics of
the data after gathering, classifying, and presenting them. To do this, it employs graphs, tables
and frequency distributions, percentages, measures of central tendency and position, and
measures of variability. It does not generalize or draw conclusions beyond the data at hand.
Inferential statistics, in contrast, involves a higher order of critical thinking and judgment and
requires more complex mathematical procedures. Its aim is to give generalizations,
conclusions, or information regarding large groups of data, called the population, without
necessarily dealing with each and every element of these groups. It uses only a small,
representative portion of the total set of data, called a sample, to draw conclusions
or generalizations regarding the entire population.
Classification of Statistics
Parametric statistics are inferential techniques which make the following assumptions
regarding the nature of the population from which the observations or data are drawn:
1. The observations must be independent. This means that the selection of any element from the
population for inclusion in the sample must not affect the chances of other elements
being included.
2. The observations must be drawn from normally distributed populations. A crude way of
knowing that a distribution is normal is when the mean, the median and the mode are all
equal (mean = median = mode). If we draw the curve, we produce a bell-shaped curve
which has an area of one and is symmetrical about the vertical line through the mean.
3. If we analyze two groups/populations, these populations must have the same variance;
such populations are called homoscedastic.
4. The variables must be measured in the interval or ratio scale, so that we can interpret the
results.
Non-parametric statistics, in contrast, make fewer and weaker assumptions:
1. The observations must be independent, and the variable has underlying continuity.
2. The observations are measured in either the nominal or ordinal scale.
For a better understanding of when to use parametric and non-parametric statistics,
please refer to the table below:
B. Levels of Measurement
NOMINAL SCALE- is the first and lowest level of measurement. It merely groups or
classifies different objects into categories based upon some defined characteristics, without
paying attention to order or arrangement. Once the categories have been identified, the
frequencies, or the number of objects in each category, are counted.
1. The data are mutually exclusive (an object can belong to only one category).
2. The data categories have no logical order or arrangement.
There are two ways of classifying: the one-way classification and the two-way classification.
Example of one-way classification:
COLLEGE FREQUENCY
College of Arts and Sciences 50
RESPONSE FREQUENCY
Strongly Agree 50
Agree 30
Moderately Agree 20
Disagree 10
Strongly Disagree 10
In the two-way classification, an individual is classified twice. For example, Peter is
classified as male under sex and, at the same time, as Yes, Neutral, or whatever his
response is under response.
Example 1.
The ORDINAL SCALE is the second level of measurement. Here, in addition to the categories
being mutually exclusive, there is a logical ordering or arrangement of categories. The process of
measurement is the same as in the nominal scale, where the number of objects in each
category is counted. However, we can discern which category is the highest or lowest. For
example, in military rank we know that private < corporal < sergeant < lieutenant, etc.
Example:
RANK FREQUENCY
Private 20
Corporal 15
Sergeant 10
Lieutenant 25
RATIO SCALE is the highest level of measurement. All properties of the interval scale are
applicable in the ratio scale plus one additional property which is known as the “true zero
point” which reflects the absence of the characteristics measured.
Example: Number of correct answers in an exam
Speed
In Summary:
C. Variable
Explanatory Variable- is a variable that is thought to affect the values of the response
variable. It is sometimes called the independent variable or X variable in a regression setting.
The explanatory variable, like the response variable, may be continuous, ordinal or
nominal.
• Discrete Variable- a variable which can assume a finite or, at most, countably infinite
number of values, usually measured by counting or enumeration.
• Continuous Variable- a variable which can assume the infinitely many values
corresponding to a line interval.
D. Summation Notation
Important Symbol
                          POPULATION    SAMPLE
Number of observations    N             n
Characteristic            Parameter     Statistic
Mean                      µ             x̄, ȳ, z̄
Variance                  σ²            s²
Standard Deviation        σ             s
In statistics it is frequently necessary to work with sums of numerical values. For example,
we may wish to compute the average cost of a certain brand of toothpaste sold at 10
different stores. Perhaps we would like to know the total number of heads that occur when
3 coins are tossed several times.
Using the Greek letter Σ (capital sigma) to indicate “summation of,” we can write the sum
of 4 weights \(x_1, x_2, x_3, x_4\) as
\[ \sum_{i=1}^{4} x_i \]
where we read “summation of \(x_i\), \(i\) going from 1 to 4.” The numbers 1 and 4 are called the
lower and upper limits of summation. Hence
\[ \sum_{i=1}^{4} x_i = x_1 + x_2 + x_3 + x_4 \]
The lower limit of summation is not necessarily a subscript. For instance, the sum of the natural
numbers from 1 to 9 may be written
\[ \sum_{x=1}^{9} x = 1 + 2 + \cdots + 9 = 45 \]
When we are summing over all the values of \(x_i\) that are available, the limits of
summation are often omitted, and we simply write \(\sum x_i\). If in the diet experiment only 4 people
were involved, then \(\sum x_i = x_1 + x_2 + x_3 + x_4\). In fact, some authors even drop the subscript
and let \(\sum x\) represent the sum of all available data.
Example: If \(x_1 = 3\), \(x_2 = 5\) and \(x_3 = 7\), find
a) \(\sum x_i\) b) c) \(\sum_{i=2}^{3} (x_i - i)\)
Solution
a) \(\sum x_i = x_1 + x_2 + x_3 = 3 + 5 + 7 = 15\)
b)
c) \(\sum_{i=2}^{3} (x_i - i) = (x_2 - 2) + (x_3 - 3) = 3 + 4 = 7\)
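Parts (a) and (c) of the example can be checked with a short Python sketch. The helper name x_sub is hypothetical; it only translates the 1-based subscripts of the text into Python's 0-based list positions.

```python
# Observations from the example: x1 = 3, x2 = 5, x3 = 7.
x = [3, 5, 7]

def x_sub(i):
    """Return x_i using the 1-based subscript of the text (hypothetical helper)."""
    return x[i - 1]

# (a) Sum over all available values of x_i.
total = sum(x)

# (c) Sum of (x_i - i) for i going from 2 to 3.
partial = sum(x_sub(i) - i for i in range(2, 4))

print(total)    # 15
print(partial)  # 7
```

Note that range(2, 4) stops before 4, so it covers exactly the limits of summation i = 2 and i = 3.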
a) xiyi
Solution
b)
=(-2)(20) = -40
E. Classification of Data
• Internal Data- information that relates to the operations and functions of the
organization collecting the data.
• External Data- Information that relates to some activity outside the organization
collecting the data.
Example: The sales data of SM are internal data for SM but external data for any other
organization collecting them, such as Robinsons.
1. Survey Method- questions are asked to obtain information, either through a self-administered
questionnaire or a personal interview.
Self-administered Questionnaire
• Obtained information is limited to subjects’ written answers to pre-arranged questions.
• Lower response rate.
• It can be administered to a large number of people simultaneously.
• Respondents may feel freer to express views and are less pressured to answer
immediately.
• It is more appropriate for obtaining objective information.
Personal Interview
• Missing information and vague responses are minimized with the proper probing of the
interviewer.
• Higher response rate through callbacks.
• It is administered to a person or group one at a time.
• Respondents may feel more cautious, particularly in answering sensitive questions, for
fear of disapproval.
• It is more appropriate for obtaining information about complex, emotionally laden topics
or probing sentiments underlying an expressed opinion.
2. Observation Method- makes possible the recording of behavior but only at the time of
occurrence (e.g., observing reactions to a particular stimulus, traffic count).
Advantages over Survey Method:
• Does not rely on the respondent’s willingness to provide the desired data.
• Certain types of data can be collected only by observation (e.g., behavior patterns of
which the subject is not aware or which the subject is ashamed to admit).
• The potential bias caused by the interviewing process is reduced or eliminated.
Disadvantages over Survey Method:
• Things such as awareness, beliefs, feelings and preferences cannot be observed.
• The observed behavior patterns can be rare or too unpredictable thus increasing the
data collection costs and time requirements.
3. Experimental Method- a method designed for collecting data under controlled conditions.
An experiment is an operation where there is actual human interference with the conditions
that can affect the variable under study. This is an excellent method of collecting data for
causation studies. If properly designed and executed, experiments will reveal with a good
deal of accuracy, the effect of a change in one variable on another variable.
4. Use of existing studies- e.g., census, health statistics and weather bureau reports.
Two types:
5. Registration Method- e.g., car registration, student registration and hospital admission.
CHAPTER 2
SAMPLING METHODS
Introduction
In Chapter 1, the concepts of population and sample were discussed. A population is any
defined aggregate of objects, persons, or events, the variables used as the basis for
classification or measurement being specified. A sample is any subaggregate drawn from
the population. Any statistic calculated on a sample of observations is an estimate of a
corresponding population value or parameter. The symbol 𝑋̅ is used to refer to the arithmetic
mean of X calculated on a sample of size N. The symbol 𝜇 is used to refer to the mean of the
population. Similarly, s2 is used to refer to the variance in the sample, and 𝜎2 is the
corresponding population parameter. 𝑋̅ is an estimate of 𝜇 and s2 is an estimate of 𝜎2. Likewise,
any other statistic calculated on a sample is an estimate of a corresponding population
parameter. In most situations the parameters are unknown and must be estimated in some
manner from the sample data.
Much statistical work in practice is concerned with the use of sample statistics as estimates
of population parameters and more particularly with describing the magnitude of error which
attaches to such statistics. The body of statistical method concerned with the making of
statements about population parameters from sample statistics is called sampling statistics
and the logical process involved is called statistical inference, this being a rigorous form of
inductive inference. If inferences about population parameters are to be drawn from sample
statistics, certain conditions must attach to the methods of sampling used.
Key Terms/Words
Sampling
Population
Sample
Probability Sampling
A. Definition of Terms
A sampling design should satisfy four criteria: it should be representative of the population,
reliable, practicable, and efficient and economical.
• Representative of the Population- The sample must be selected so that it properly
represents the population that is to be covered, i.e., each individual must have a
chance of being selected and this chance must not be zero.
• Reliability- It should be possible to measure the reliability of the estimates made from
the sample. In addition to the desired estimates of the characteristics of the
population, the sample should give measures of the precision of these estimates.
• Practicable- The third criterion is that the sampling design must be practical. It must
be sufficiently simple and straight-forward so that it can be carried out substantially as
designed.
• Efficient and Economical- The design should be efficient and economical. Among the
various sampling methods (discussed later), one must naturally choose the method
which, to the best of our knowledge, produces the most information at the smallest cost.
C. Methods of Probability Sampling
A sampling procedure that gives every element of the population a (known) nonzero
chance of being selected in the sample is called probability sampling. Otherwise, the
sampling procedure is called non-probability sampling.
• Sampled Population- is the collection of elements from which the sample is actually
taken.
• The Population Frame- is a listing of all the individual units in the population.
1. Simple Random Sampling
Step 1: Make a list of the sampling units and number them from 1 to N
Step 2: Select n numbers (distinct for simple random sampling without replacement,
SRSWOR; not necessarily distinct for simple random sampling with replacement,
SRSWR) from 1 to N using some random process, for example, the table of
random numbers.
Step 3: The sample consists of the units corresponding to the selected random
numbers.
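The three steps above can be sketched in Python. This is a minimal sketch, assuming the sampling units are simply labeled 1 to N and that the standard library's pseudo-random generator stands in for a table of random numbers; the function name simple_random_sample and its parameters are hypothetical.

```python
import random

def simple_random_sample(N, n, with_replacement=False, seed=None):
    """Select n unit labels from 1..N (hypothetical helper).

    with_replacement=False gives SRSWOR (all labels distinct);
    with_replacement=True gives SRSWR (labels may repeat).
    """
    rng = random.Random(seed)        # seed only makes the draw repeatable
    labels = range(1, N + 1)         # Step 1: units numbered 1 to N
    if with_replacement:
        return [rng.choice(labels) for _ in range(n)]   # Step 2, SRSWR
    return rng.sample(labels, n)                        # Step 2, SRSWOR

# Step 3: the sample consists of the units carrying the selected labels.
srswor = simple_random_sample(50, 10, seed=1)
srswr = simple_random_sample(50, 10, with_replacement=True, seed=1)
print(srswor)
print(srswr)
```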
Advantages
• The theory involved is much easier to understand than the theory
behind other sampling designs.
• Inferential methods are simple and easy.
Disadvantages
2. Stratified Random Sampling
Step 1: Divide the population into strata. Ideally, each stratum must consist of more or
less homogeneous units.
Step 2: After the population has been stratified, a simple random sample is selected
from each stratum.
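The two steps above can be sketched in Python as follows. The strata, the stratum names and the sample sizes are hypothetical and serve only as an illustration; how many units to draw from each stratum is a design decision that the steps themselves do not fix.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample (without replacement) from each stratum.

    strata: dict mapping a stratum name to the list of its units
    sizes:  dict mapping a stratum name to the sample size for that stratum
    """
    rng = random.Random(seed)
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}

# Step 1: a hypothetical population of 90 student numbers divided by year level.
strata = {
    "first_year": list(range(1, 41)),
    "second_year": list(range(41, 71)),
    "third_year": list(range(71, 91)),
}
sizes = {"first_year": 4, "second_year": 3, "third_year": 2}

# Step 2: a simple random sample is selected from each stratum.
sample = stratified_sample(strata, sizes, seed=7)
print(sample)
```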
Advantages
• Stratification may produce a gain in precision in the estimates of characteristics
of the population.
• It allows for more comprehensive data analysis since information is provided for
each stratum
• It is administratively convenient.
Disadvantages
3. Systematic Sampling
Method A
Step 1: Number the units of the population consecutively from 1 to N.
Step 2: Determine k, the sampling interval, using the formula k = N/n.
Step 3: Select the random start r, where 1≤ r ≤ k. The unit corresponding to r is the first unit of
the sample.
Step 4: The other units of the sample correspond to r + k, r + 2k, r + 3k and so on.
Method B
Step 1: Number the units of the population consecutively from 1 to N.
Step 2: Let k be the nearest integer to N/n.
Step 3: Select the random start r, where 1 ≤ r ≤ N. The unit corresponding to r is the first unit
of the sample.
Step 4: Consider the list of units of the population as a circular list, i.e., the last unit in the list is
followed by the first. The other units in the sample are the units corresponding to r + k, r + 2k,
r + 3k, …, r + (n-1)k.
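Methods A and B can be compared with a short Python sketch. The function names are hypothetical. For Method A the sketch assumes that the integer part of N/n is used as the interval when N is not a multiple of n, and for Method B it relies on Python's round for "nearest integer"; both are assumptions, not part of the steps above.

```python
import random

def systematic_sample_a(N, n, seed=None):
    """Method A: interval k = N // n and random start r with 1 <= r <= k."""
    rng = random.Random(seed)
    k = N // n                       # assumed: integer part of N/n
    r = rng.randint(1, k)            # randint includes both endpoints
    return [r + j * k for j in range(n)]

def systematic_sample_b(N, n, seed=None):
    """Method B: k nearest integer to N/n, random start 1 <= r <= N, circular list."""
    rng = random.Random(seed)
    k = round(N / n)
    r = rng.randint(1, N)
    # Wrap around the end of the list: unit N is followed by unit 1.
    return [(r + j * k - 1) % N + 1 for j in range(n)]

sample_a = systematic_sample_a(100, 10, seed=3)
sample_b = systematic_sample_b(103, 10, seed=3)
print(sample_a)
print(sample_b)
```

In Method B the modulo operation implements the circular list, so a start near the end of the list still yields n units.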
Advantages
• It is easier to draw the sample and often easier to execute without mistakes than simple
random sampling.
• It is possible to select a sample in the field without a sampling frame.
• The systematic sample is spread more evenly over the population.
Disadvantages
• If periodic regularities are found in the list, a systematic sample may consist only of
similar types. (Example: Store sales over seven days of the week- estimating total sales
based on a systematic sample every Tuesday would be unwise.)
• Knowledge of the structure of the population is necessary for its most effective use.
4. Cluster Sampling
Clusters may be of equal or unequal size. When all of the clusters are of the same size, the
number of elements in a cluster will be denoted by M while the number of clusters in the
population will be denoted by N.
Advantages
• A population list of elements is not needed; only a population list of clusters is required.
Listing cost is reduced.
• Transportation cost is reduced.
Disadvantages
• The costs and problems of statistical analysis are greater.
• Estimation procedures are more difficult.
5. Multistage Sampling
Advantages
• Listing cost is reduced
• Transportation cost is reduced.
Disadvantages
• Estimation procedure is difficult, especially when the primary stage units are not of the
same size.
• Estimation procedure gets more complicated as the number of sampling stages
increases.
• The sampling procedure entails much planning before selection is done.
6. Sequential Sampling- units are drawn one by one in a sequence without prior fixing of
the total number of observations and the results of the drawing at any stage are used to
decide whether to terminate sampling or not.
CHAPTER 3
A variety of statistical measures are employed to summarize and describe sets of data.
Some of these statistical measures define, in some sense, the center of a set of data and
consequently are called measures of central location or measures of central tendency.
The term central location refers to a central reference value which is usually close to the point
of greatest concentration of the measurements and may in some sense be thought to typify
the whole set. Measures of central location in common use are the mode, median, and
arithmetic mean. Other less frequently used measures are the geometric mean and the
harmonic mean. By far the most widely used measure of central location is the arithmetic
mean. This statistic is an appropriate measure of central location for interval and ratio
variables. The median and mode are sometimes viewed as appropriate measures for ordinal
and nominal variables, respectively, although they can also be used with interval and ratio
variables.
Key Words
Central Location
Statistical Measure
Mean
Median
Mode
An average is a measure of the center of a set of data when the data are arranged in
an increasing or decreasing order of magnitude. For example, if an automobile averages
14.5 kilometers to 1 liter of gasoline, this can be considered a value indicating the center of
several more values. In the country 1 liter of gasoline may give considerably more kilometers
per liter than in the congested traffic of a large city. The number 14.5 in some sense defines a
center value.
B. Population Mean- if the set of data \(x_1, x_2, \ldots, x_N\), not necessarily all distinct, represents a
finite population of size N, then the population mean is
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]
Example 1: The numbers of employees at 5 different drugstores are 3, 5, 6, 4 and 6. Treating
the data as a population, find the mean number of employees for the 5 stores.
Solution: Since the data are considered to be a finite population,
\[ \mu = \frac{3 + 5 + 6 + 4 + 6}{5} = \frac{24}{5} = 4.8 \]
C. Sample Mean- if the set of data \(x_1, x_2, \ldots, x_n\), not necessarily all distinct, represents a
finite sample of size n, then the sample mean is
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
x = =1.8%
Often, it is possible to simplify the work in computing a mean by using coding
techniques. For example, it is sometimes convenient to add (or subtract) a constant to all our
observations and then compute the mean. How is this new mean related to the mean of the
original set of observations? If we let \(y_i = x_i + a\), then
\[ \bar{y} = \frac{\sum_{i=1}^{n} (x_i + a)}{n} = \frac{\sum_{i=1}^{n} x_i}{n} + \frac{na}{n} = \bar{x} + a \]
Therefore, the addition (or subtraction) of a constant to all observations changes the
mean by the same amount. To find the mean of the numbers -5, -3, 1, 4 and 6, we might add
5 first to give the set of nonnegative values 0, 2, 6, 9 and 11 that have a mean of 5.6. Therefore,
the original numbers have a mean of 5.6 - 5 = 0.6. Now suppose that we let \(y_i = a x_i\). It follows
that
\[ \bar{y} = \frac{\sum_{i=1}^{n} a x_i}{n} = a \bar{x} \]
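The effect of coding on the mean can be checked numerically. The sketch below uses exact fractions to avoid floating-point rounding; the helper name mean and the multiplier a = 3 are hypothetical choices for illustration, while the data and the added constant 5 come from the text.

```python
from fractions import Fraction

def mean(values):
    """Arithmetic mean as an exact fraction (hypothetical helper)."""
    return Fraction(sum(values), len(values))

x = [-5, -3, 1, 4, 6]

# Adding the constant 5 to every observation shifts the mean by 5.
shifted = [xi + 5 for xi in x]
print(mean(shifted))        # 28/5, i.e. 5.6
print(mean(shifted) - 5)    # 3/5, i.e. 0.6, the mean of the original numbers

# Multiplying every observation by a constant multiplies the mean by it.
a = 3                       # hypothetical multiplier, not from the text
scaled = [a * xi for xi in x]
print(mean(scaled) == a * mean(x))   # True
```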
Example 3: On 5 term tests in sociology a student has made grades of 82, 93, 86, 92 and
79. Find the median for this population of grades.
Solution: Arranging the grades in increasing order, we get
79 82 86 92 93
and hence \(\tilde{\mu} = 86\).
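Example 4: The nicotine contents for a random sample of 6 cigarettes of a certain brand
are found to be 2.3, 2.7, 2.5, 2.9, 3.1 and 1.9 milligrams. Find the median.
Solution: Arranging the measurements in increasing order, we get
1.9 2.3 2.5 2.7 2.9 3.1
and the median is then the mean of 2.5 and 2.7. Therefore,
\[ \tilde{x} = \frac{2.5 + 2.7}{2} = 2.6 \text{ milligrams.} \]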
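Both median examples can be reproduced with Python's standard statistics module, which sorts the data and, for an even number of observations, averages the two middle values. This is only a quick check of the hand computation.

```python
import statistics

grades = [82, 93, 86, 92, 79]                 # Example 3: odd number of values
nicotine = [2.3, 2.7, 2.5, 2.9, 3.1, 1.9]     # Example 4: even number of values

median_grades = statistics.median(grades)
median_nicotine = statistics.median(nicotine)

print(sorted(grades))       # [79, 82, 86, 92, 93]
print(median_grades)        # 86
print(median_nicotine)      # approximately 2.6 (mean of 2.5 and 2.7)
```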
The third and final measure of central location that we shall discuss is the mode.
E. Mode: The mode of a set of observations is that value which occurs most often or with
the greatest frequency.
The mode does not always exist. This is certainly true when all observations occur with
the same frequency. For some sets of data there may be several values occurring with the
greatest frequency in which case we have more than one mode.
Example 5: If the donations from the residents of Fairway Forest toward the Virginia Lung
Association are recorded as 9, 10, 5, 9, 9,7, 8, 6, 10 and 11 dollars, then 9 dollars, the value
that occurs with the greatest frequency, is the mode.
Example 6: The numbers of movies attended last month by a random sample of 12 high
school students were recorded as follows: 2, 0, 3, 1, 2, 4, 2, 5, 4, 0, 1 and 4. In this case, there
are two modes, 2 and 4, since both 2 and 4 occur with the greatest frequency. The distribution is
said to be bimodal.
Example 7: No mode exists for the sociology grades of Example 3, since each grade
occurs only once.
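Examples 5 to 7 can be checked with a small Python sketch. The helper name modes is hypothetical; it follows the convention of the text that no mode exists when all observations occur with the same frequency, and returns every value tied for the greatest frequency otherwise.

```python
from collections import Counter

def modes(data):
    """Return the list of modes, or [] when every value is equally frequent."""
    counts = Counter(data)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []                      # no mode exists
    return sorted(v for v, c in counts.items() if c == top)

donations = [9, 10, 5, 9, 9, 7, 8, 6, 10, 11]       # Example 5
movies = [2, 0, 3, 1, 2, 4, 2, 5, 4, 0, 1, 4]       # Example 6
grades = [82, 93, 86, 92, 79]                       # Example 7

print(modes(donations))   # [9]
print(modes(movies))      # [2, 4]
print(modes(grades))      # []
```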
Answer the following problems.
Assignment
1, 3, 7, 10
Bring-home Quiz
2, 5, 13
CHAPTER 4
MEASURES OF VARIABILITY
Introduction
Of great concern to the statistician is the variation in the events of nature. The variation of
one measurement from another is a persisting characteristic of any sample of measurements.
Measurements of intelligence, eye color, reaction time, and skin resistance for example
exhibit variation in any sample of individuals. Anthropometric measurements such as height,
weight, diameter of the skull, length of the forearm and angular separation of the metatarsals
show variation between individuals. Anatomical and physiological measurements vary, as do
the measurements made by the physicist, chemist, botanist and agronomist. Statistics can be
viewed as the study of variation. The experimental scientist is concerned with the different
circumstances, conditions or sources which contribute to the variation in the measurements
he or she obtains. Among the possible measures used to describe this variation are the
range, the mean deviation and the standard deviation. The most important of these is the
standard deviation.
1. Identify the most typical measures of dispersion – the range, variance, standard
deviation and coefficient of variation.
2. Determine the extent of the scatter so that steps may be taken to control the existing
variation
Key Words
Range
Variance
Mean Deviation
Standard Deviation
z Scores
A. Measures of Variation
Consider the following measurements, in liters, for two samples of orange juice bottled
by companies A and B:
Sample A 0.97 1.00 0.94 1.03 1.11
Sample B 1.06 1.01 0.88 0.91 1.14
Both samples have nearly the same mean, about 1.00 liter (1.01 liters for sample A and
1.00 liter for sample B). It is quite obvious that company A bottles
orange juice with a more uniform content than company B. We say that the variability or the
dispersion of the observations from the average is less for sample A than for sample B.
Therefore, in buying orange juice, we would feel more confident that the bottle we select will
be closer to the advertised average if we buy from company A.
The most important statistics for measuring the variability of a set of data are the range
and the variance. The simplest of these to compute is the range.
B. Range- the range of a set of data is the difference between the largest and smallest
number in the set.
Example 8: The IQs of 5 members of a family are 108, 112, 127, 118 and 113. Find the
range.
Solution: The range of the 5 IQs is 127 - 108 = 19.
In the case of the companies bottling orange juice, the range for company A is 0.17 liters
compared to a range of 0.26 liters for company B, indicating a greater spread in the values
for company B.
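The ranges quoted above can be verified with a few lines of Python. The helper name data_range is hypothetical; the juice results are rounded to two decimals only to hide floating-point noise.

```python
def data_range(values):
    """Difference between the largest and smallest number in the set."""
    return max(values) - min(values)

iqs = [108, 112, 127, 118, 113]
sample_a = [0.97, 1.00, 0.94, 1.03, 1.11]
sample_b = [1.06, 1.01, 0.88, 0.91, 1.14]

print(data_range(iqs))                  # 19
print(round(data_range(sample_a), 2))   # 0.17
print(round(data_range(sample_b), 2))   # 0.26
```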
The range is a poor measure of variation, particularly if the size of the sample or population is
large. It considers only the extreme values and tells us nothing about the distribution of
numbers in between. Consider, for example, the following two sets of data, both with a range
of 12.
Set A 3 4 5 6 8 9 10 12 15
Set B 3 7 7 7 8 8 8 9 15
In set A the mean and median are both 8, but the numbers vary over the entire interval from
3 to 15. In set B the mean and median are also 8, but most of the values are closer to the
center of the data. Although the range fails to measure this variation between the upper and
lower observations, it does have some useful applications. In industry the range for
measurements on items coming off an assembly line might be specified in advance. As long
as all measurements fall within the specified range, the process is said to be in control.
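If our set of data is the finite population \(x_1, x_2, \ldots, x_N\) with mean \(\mu\), the deviations
from the mean are
\[ x_1 - \mu,\ x_2 - \mu,\ \ldots,\ x_N - \mu \]
Similarly, if our set of data is the random sample \(x_1, x_2, \ldots, x_n\), the deviations are
\[ x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_n - \bar{x} \]
An observation greater than the mean will yield a positive deviation, whereas an observation
smaller than the mean will produce a negative deviation. Comparing the deviations for the
two sets of data above, we have the following: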
Set A -5 -4 -3 -2 0 1 2 4 7
Set B -5 -1 -1 -1 0 0 0 1 7
Clearly, most of the deviations of set B are smaller in magnitude than those of set A,
indicating less variation among the observations of set B. Our aim now is to obtain a single
numerical measure of variation that incorporates all the deviations from the mean. The most
obvious procedure would be to average the deviations. The sum of the deviations from the
mean is zero for any set of data and consequently their mean is also zero. To circumvent this
problem, we could find a measure of variation called the mean deviation whereby we
compute the mean of the absolute values of the deviations. The absolute value of a number
is the number without the associated algebraic sign. Thus, the absolute value of -4 is simply
4.
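The argument above can be illustrated with sets A and B. The sketch below first confirms that the signed deviations sum to zero and then computes the mean deviation; the helper names deviations and mean_deviation are hypothetical.

```python
def deviations(values):
    """Signed deviations of each observation from the mean."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def mean_deviation(values):
    """Mean of the absolute values of the deviations from the mean."""
    return sum(abs(d) for d in deviations(values)) / len(values)

set_a = [3, 4, 5, 6, 8, 9, 10, 12, 15]
set_b = [3, 7, 7, 7, 8, 8, 8, 9, 15]

print(sum(deviations(set_a)), sum(deviations(set_b)))   # 0.0 0.0
print(round(mean_deviation(set_a), 2))   # 3.11
print(round(mean_deviation(set_b), 2))   # 1.78
```

The smaller mean deviation of set B agrees with the observation that most of its deviations are smaller in magnitude.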
In practice, the mean of the absolute values of deviation from the mean is seldom used.
The use of absolute values makes its mathematical treatment awkward. Instead, we shall
work with the squares of all the deviations in computing the variance. In the case of a finite
population of size N, the variance denoted by the symbol σ2 (sigma squared), may be
computed directly from the following summation formula.
C. Population Variance: Given the finite population \(x_1, x_2, \ldots, x_N\), the population variance is
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
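Assuming that the two sets A and B are populations, we now use the deviations in the
preceding table to calculate their variances. For set A,
\[ \sigma^2 = \frac{(-5)^2 + (-4)^2 + (-3)^2 + (-2)^2 + 0^2 + 1^2 + 2^2 + 4^2 + 7^2}{9} = \frac{124}{9} \]
and for set B,
\[ \sigma^2 = \frac{(-5)^2 + (-1)^2 + (-1)^2 + (-1)^2 + 0^2 + 0^2 + 0^2 + 1^2 + 7^2}{9} = \frac{78}{9} \]
A comparison of the two variances shows that the data of set A are more variable than the
data of set B.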
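The population variances of sets A and B can be computed with the standard statistics module, whose pvariance and pstdev functions divide by N as in the population formula. This is a quick numerical check, not part of the original solution.

```python
import statistics

set_a = [3, 4, 5, 6, 8, 9, 10, 12, 15]
set_b = [3, 7, 7, 7, 8, 8, 8, 9, 15]

var_a = statistics.pvariance(set_a)   # sum of squared deviations 124, divided by 9
var_b = statistics.pvariance(set_b)   # sum of squared deviations 78, divided by 9

print(round(var_a, 2), round(var_b, 2))     # 13.78 8.67
print(round(statistics.pstdev(set_a), 2))   # 3.71
print(round(statistics.pstdev(set_b), 2))   # 2.94
```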
By using the square of the deviations to compute the variance, we obtain a number in
squared units. That is, if the original measurements were in feet the variance would be
expressed in squared feet. To get a measure of variation expressed in the same units as the
raw data, as was the case for the range, we take the square root of the variance. Such a
measure is called standard deviation.
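Example 9: The following scores were given by 6 judges for a gymnast’s performance in
the vault of an international meet: 7, 5, 9, 7, 8 and 6. Find the standard deviation of this
population.
Solution: First we compute
\[ \mu = \frac{7 + 5 + 9 + 7 + 8 + 6}{6} = 7 \]
and then
\[ \sigma^2 = \frac{(7-7)^2 + (5-7)^2 + (9-7)^2 + (7-7)^2 + (8-7)^2 + (6-7)^2}{6} = \frac{10}{6} = \frac{5}{3} \]
so that \(\sigma = \sqrt{5/3} \approx 1.29\).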
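Given a random sample \(x_1, x_2, \ldots, x_n\), the sample variance is
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]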
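Example 10: A comparison of coffee prices at 4 randomly selected grocery stores in San
Diego showed increases from the previous month of 12, 15, 17 and 20 cents for a 200-gram
jar. Find the variance of this random sample of price increases.
Solution: Calculating the sample mean, we get
\[ \bar{x} = \frac{12 + 15 + 17 + 20}{4} = 16 \]
Therefore,
\[ s^2 = \frac{(12-16)^2 + (15-16)^2 + (17-16)^2 + (20-16)^2}{3} = \frac{34}{3} \]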
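The sample variance of the price increases in Example 10 can be checked with the standard statistics module, whose variance and stdev functions divide by n - 1 as in the sample-variance formula.

```python
import statistics

increases = [12, 15, 17, 20]    # price increases in cents, from Example 10

s2 = statistics.variance(increases)   # divides by n - 1 = 3
s = statistics.stdev(increases)       # positive square root of s2

print(statistics.mean(increases))   # 16
print(round(s2, 2))                 # 11.33, i.e. 34/3
print(round(s, 2))                  # 3.37
```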
If \(\bar{x}\) is a decimal number that has been rounded off, we accumulate a large error using
the sample-variance formula in the form given above. To avoid this, we use the more
useful computational formula
\[ s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n-1)} \]
The sample standard deviation, denoted by s, is defined to be the positive square root of the
sample variance.
Example 11: Find the variance of the data 3, 4, 5, 6, 6 and 7, representing the number of trout
caught by a random sample of 6 fishermen on June 19, 1981 at Lake Muskoka.
          x_i     x_i²
          3       9
          4       16
          5       25
          6       36
          6       36
          7       49
Total     31      171      n = 6
Hence,
\[ s^2 = \frac{(6)(171) - (31)^2}{(6)(5)} = \frac{13}{6} \]
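The computation for Example 11 can be reproduced exactly in Python. The sketch below evaluates the computational formula with exact fractions and compares it with the defining formula based on squared deviations; both function names are hypothetical.

```python
from fractions import Fraction

def variance_computational(values):
    """s^2 = [n * sum(x^2) - (sum x)^2] / [n (n - 1)], as an exact fraction."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    return Fraction(n * sum_x2 - sum_x ** 2, n * (n - 1))

def variance_definition(values):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / (n - 1)

trout = [3, 4, 5, 6, 6, 7]

print(sum(trout), sum(v * v for v in trout))   # 31 171
print(variance_computational(trout))           # 13/6
print(variance_definition(trout))              # 13/6
```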
Often, it is possible to simplify the computational procedure for calculating the variance of a
set of data by using coding techniques. Recall that coding was used in Section 2.2 to
compute the mean. The effects of coding on the variance by subtracting a constant from
each observation or by dividing each observation by a constant will be of particular interest
to us. We shall investigate these effects here only for random samples, but the results are
equally valid for populations.
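If we let \(y_i = x_i + c\), it follows that \(\bar{y} = \bar{x} + c\), and hence the variance of the \(y_i\)'s is
\[ s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} = \frac{\sum_{i=1}^{n} [(x_i + c) - (\bar{x} + c)]^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = s_x^2 \]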
Therefore, if each observation of a set of data is transformed to a new set by the addition (or
subtraction) of a constant 𝑐, the variance of the original set of data is the same as the
variance of the new set.
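Now suppose we let \(y_i = c x_i\), so that \(\bar{y} = c\bar{x}\). It follows that the variance of the \(y_i\)'s is
\[ s_y^2 = \frac{\sum_{i=1}^{n} (c x_i - c \bar{x})^2}{n-1} = c^2 \, \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = c^2 s_x^2 \]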
Therefore, if a set of data is transformed to a new set by multiplying (or dividing) each
observation by a constant 𝑐,the variance of the original set is equal to the variance of the
new set divided (or multiplied) by 𝑐2.
Example 12: A random sample of 5 bank presidents indicated annual salaries of $63,000,
$48,000, $62,000, $35,000 and $41,000. Find the variance of this set of data by using
appropriate coding techniques.
Solution: If we divide all the salaries by 1000 and then subtract 50, we obtain the numbers 13,
-2, 12, -15 and -9, for which
\[ \sum_{i=1}^{5} y_i = -1 \]
and
\[ \sum_{i=1}^{5} y_i^2 = 623 \]
Now, for the coded data
\[ s_y^2 = \frac{(5)(623) - (-1)^2}{(5)(4)} = 155.7 \]
and after multiplying by \(1000^2\), the variance of the original set of salaries is \(s^2 = 1.557 \times 10^8\).
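The coding argument of Example 12 can be checked numerically. The sketch below starts from the coded values given in the solution, computes their variance exactly, and confirms that multiplying by 1000² gives the variance of the decoded values 1000(y + 50); the function name sample_variance is hypothetical.

```python
from fractions import Fraction

def sample_variance(values):
    """Sample variance via the computational formula, as an exact fraction."""
    n = len(values)
    return Fraction(n * sum(v * v for v in values) - sum(values) ** 2, n * (n - 1))

coded = [13, -2, 12, -15, -9]             # salaries / 1000, minus 50

s2_coded = sample_variance(coded)
s2_original = s2_coded * 1000 ** 2        # undo the division by 1000

# Decoding the values and computing the variance directly gives the same result,
# because subtracting 50 does not change the variance.
decoded = [1000 * (y + 50) for y in coded]
print(sample_variance(decoded) == s2_original)   # True

print(float(s2_coded))      # 155.7
print(float(s2_original))   # 155700000.0
```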
The standard deviation seems to be the best measure of variation that we have. At this point,
however, it has meaning only when comparing two or more sets of data having the same units
of measurement and approximately the same mean. Therefore, we could compare the
variances of the observations of two companies bottling orange juice and the larger value
would indicate the company whose product is more variable or less uniform provided that
bottles of the same size were used. It would not be meaningful to compare the variance of
a set of heights to the variance of a set of aptitude scores.
F. Z Scores
Z Score- An observation, \(x\), from a population with mean \(\mu\) and standard deviation \(\sigma\) has a z
score or z value defined by
\[ z = \frac{x - \mu}{\sigma} \]
A z score measures how many standard deviations an observation is above or below the
mean. Since 𝜎 is never negative, a positive z score measures the number of standard
deviations an observation is above the mean, and a negative z score gives the number of
standard deviations an observation is below the mean. Note that the units of the denominator
and the numerator of a z score cancel. Hence a z score is unitless, thereby permitting a
comparison of two observations relative to their groups, measured in completely different
units.
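The definition translates directly into a small Python function. The function name z_score is hypothetical, and the numbers used in the calls are made-up values for illustration only, not data from the text.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical observations from a population with mean 68 and standard deviation 8.
print(z_score(82, 68, 8))   # 1.75  (above the mean)
print(z_score(60, 68, 8))   # -1.0  (below the mean)
```

Because the units cancel, z scores computed for measurements in different units can be compared directly.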
Let us now compute the z scores corresponding to our student’s grades in chemistry
and economics. For chemistry we obtain
𝑧= = 1.75
and for economics
𝑧= = 1.50
Example 13: Different typing skills are required for secretaries depending on whether one is
working in a law office, an accounting firm, or a research mathematical group at a major
university. In order to evaluate candidates for these positions, an employment agency
administers three distinct standardized typing samples. A time penalty has been incorporated
into the scoring of each sample based on the number of typing errors. The mean and
standard deviation for each test, together with the score achieved by a recent applicant are
given in Table 2.1.
Table 2.1
For what type of position does this applicant seem to be best suited?
Law: 𝑧 = = −1.3
Accounting: 𝑧 = = −1.5
Scientific: 𝑧 = = 1.4
Since speed is of primary importance, we are looking for the z score that represents the
greatest number of standard deviations to the left of the mean and in our case that would
be -1.5. Therefore, this particular applicant ranks higher among typists in accounting firms than
when compared to typists in the other two areas, and consequently should be placed with
an accounting firm.
Answer the following problems.
Assignment:
6, 9, 16
Bring-home quiz:
7, 11, 18
Often, we are confronted with the problem of disseminating large masses of statistical data
in compact form. Although numerical measures of location and variation are certainly useful
compact descriptions of a set of observations, they do not by themselves identify all the
important features of the data. Considerable information can be retrieved from large masses
of data when they are summarized and displayed by means of appropriate tables, charts
and graphs.
Key Terms
Distribution
Graphical Representation
Symmetry
Skewness
Class Limits- the smallest and largest values that can fall in a given class interval. For the interval
10-12, the smaller number is 10 (lower class limit) and the larger number is 12 (upper class
limit). The original data were recorded to the nearest kilogram, so the 8 observations in the
interval 10-12 are the weights of all the pieces of luggage weighing more than 9.5 kilograms
but less than 12.5 kilograms. The numbers 9.5 and 12.5 are called the class boundaries for the
given interval. For the interval 10-12, the number 9.5 is called the lower class boundary and
12.5 is called the upper class boundary. However, 12.5 would also be the lower class
boundary for the interval 13-15.
Class Width- the numerical difference between the upper and the lower class boundaries of
a class interval.
Class Mark or Class Midpoint- The midpoint between the upper and lower class boundaries
or class limits of a class interval.
Class Interval    Class Boundaries    Class Mark    Frequency
7-9               6.5-9.5             8             2
10-12             9.5-12.5            11            8
13-15             12.5-15.5           14            14
16-18             15.5-18.5           17            19
19-21             18.5-21.5           20            7
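The three definitions above can be checked against the table with a short Python sketch. It assumes the data were recorded to the nearest whole unit (the unit parameter), so that each boundary lies half a unit beyond its class limit. The function name describe_class is illustrative.

```python
def describe_class(lower_limit, upper_limit, unit=1):
    """Return ((lower boundary, upper boundary), class width, class mark).

    unit is the precision of the recorded data (1 for whole kilograms);
    each boundary lies half a unit beyond the corresponding class limit.
    """
    half = unit / 2
    lower_boundary = lower_limit - half
    upper_boundary = upper_limit + half
    width = upper_boundary - lower_boundary
    mark = (lower_limit + upper_limit) / 2
    return (lower_boundary, upper_boundary), width, mark

# Class limits of the luggage-weight intervals.
for limits in [(7, 9), (10, 12), (13, 15), (16, 18), (19, 21)]:
    print(limits, describe_class(*limits))
```

Every interval here has class width 3, and each class mark is the same whether it is computed from the limits or from the boundaries.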
To illustrate the construction of a frequency distribution, consider the data of
Table 3.3 below, which represent the lives of 40 similar car batteries recorded to the nearest
tenth of a year. The batteries were guaranteed to last 3 years.
In many situations we are concerned not with the number of observations in a given
class but with the number that fall above or below a specified value. For example, in Table 3.4
the number of batteries lasting less than 3 years is 7. The total frequency of all values less than
the upper class boundary of a given class interval is called the cumulative frequency up to and
including that class. A table showing the cumulative frequencies, such as Table 3.6, is called a
cumulative frequency distribution.
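As an illustration, cumulative frequencies are just running totals of the class frequencies. The sketch below uses the luggage-weight classes tabulated earlier (frequencies 2, 8, 14, 19, 7) rather than the battery data; the variable names are assumptions.

```python
from itertools import accumulate

# Upper class boundaries and class frequencies of the luggage-weight distribution.
upper_boundaries = [9.5, 12.5, 15.5, 18.5, 21.5]
frequencies = [2, 8, 14, 19, 7]

# Running total: number of observations below each upper class boundary.
cumulative = list(accumulate(frequencies))
for boundary, total in zip(upper_boundaries, cumulative):
    print(f"less than {boundary}: {total}")
```

The last cumulative frequency always equals the total number of observations, here 50.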
B. Graphical Representations
One can quickly observe from the bar chart that most of the batteries lasted from 3.0 to 3.4
years, only a very few batteries lasted less than 2.5 years, and no battery lasted longer than 4.9
years. In a bar chart the base of each bar corresponds to a class interval of a frequency
distribution and the heights of the bars represent the frequencies associated with each class.
Although the bar chart provides immediate information about a set of data in a condensed
form, we are usually more interested in a related pictorial representation called a histogram. A
histogram differs from a bar chart in that the bases of the bars are the class boundaries
rather than the class limits. The use of class boundaries for the bases eliminates the spaces
between the bars to give the solid appearance of Figure 3.2.
For some problems it will be more convenient to let the vertical axis represent relative
frequencies or percentages. Such graphs, called relative frequency histograms or percentage
histograms, have exactly the same shape as the frequency histogram but a different vertical
scale.
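A relative frequency is a class frequency divided by the total number of observations, and a percentage is that value times 100. A minimal sketch, again assuming the luggage-weight frequencies tabulated earlier:

```python
# Class frequencies of the luggage-weight distribution.
frequencies = [2, 8, 14, 19, 7]
total = sum(frequencies)

# Rescale each frequency; the shape of the histogram is unchanged.
relative = [f / total for f in frequencies]
percentages = [100 * f / total for f in frequencies]
print(relative)
print(percentages)
```

The relative frequencies sum to 1 and the percentages to 100, which is why only the vertical scale changes.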
In viewing a histogram, the eye tends to compare the areas of the different rectangles rather
than their heights. Although this is appropriate for class intervals of equal width, it can be very
misleading if some of the class widths differ. Unscrupulous individuals have been known to
deliberately misrepresent data by erroneously constructing histograms with unequal class
widths. Suppose, for example, that we combine the two class intervals 2.5-2.9 and 3.0-3.4 of
Table 3.4 into the single interval 2.5-3.4 containing 19 observations, the sum of the
frequencies 4 and 15.
In Figure 3.3, we get the mistaken impression that well over half of the observations fall in the
wider class interval 2.5-3.4 when the actual number is just one less than half. To correct for
this misconception, we must reduce the height of this new rectangle by the inverse of the
factor that extends the class interval. Since we doubled the class width by combining the two
intervals, we must therefore divide the height of this new rectangle by 2 to give the correct
visual picture as shown in Figure 3.4. Of course, now that areas and not heights represent
the frequencies, we have no further need for the vertical axis, and it is therefore omitted.
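The correction can be sketched as dividing the frequency by the factor by which the class width was stretched. The function and parameter names below are illustrative assumptions; the class width of 0.5 year is the distance between consecutive class boundaries of the battery data.

```python
def adjusted_height(frequency, class_width, standard_width):
    """Bar height that makes the area, not the height, proportional to frequency."""
    return frequency / (class_width / standard_width)

# Combined interval 2.5-3.4: frequencies 4 + 15 = 19, width doubled from 0.5 to 1.0.
combined = adjusted_height(19, class_width=1.0, standard_width=0.5)
# An ordinary class keeps its frequency as its height.
ordinary = adjusted_height(15, class_width=0.5, standard_width=0.5)
print(combined, ordinary)
```

Here the combined rectangle is drawn at height 9.5, half of 19, while classes of the standard width are unchanged.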
Data can also be presented in graphical form by means of a
frequency polygon. Frequency polygons are constructed by plotting class frequencies
against class marks and connecting the consecutive points by straight lines.
A polygon is a many-sided closed figure. To close the frequency polygon an additional class
interval is added to both ends of the distribution, each with zero frequency. For our example,
the midpoints of these two additional classes will be 1.2 and 5.2. These two points enable us
to connect both ends to the horizontal axis, resulting in a polygon. The frequency polygon for
the data of Table 3.4 is shown in Figure 3.5. We can obtain the frequency polygon very quickly
from the histogram by joining the midpoints of the tops of adjacent rectangles and then
adding the two intervals at each end.
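The points of a frequency polygon can be listed with a short sketch. It assumes equal class widths and uses the luggage-weight class marks and frequencies tabulated earlier; the function name polygon_points is illustrative. The extra midpoints 1.2 and 5.2 quoted for the battery data arise in the same way, one class width beyond the first and last class marks.

```python
def polygon_points(class_marks, frequencies):
    """Return (class mark, frequency) points, closed by a zero-frequency class at each end."""
    width = class_marks[1] - class_marks[0]  # assumes equal class widths
    first = (class_marks[0] - width, 0)
    last = (class_marks[-1] + width, 0)
    return [first, *zip(class_marks, frequencies), last]

points = polygon_points([8, 11, 14, 17, 20], [2, 8, 14, 19, 7])
print(points)
```

Joining these points in order with straight lines gives the closed polygon.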
If we wish to compare two sets of data with unequal sample sizes by constructing two
frequency polygons on the same graph, we must use relative frequencies or percentages. A
graph similar to Figure 3.5, but using relative frequencies or percentages, is called a relative
frequency polygon or a percentage polygon.
A second line graph, called a cumulative frequency polygon or ogive, is obtained by
plotting the cumulative frequency less than any upper class boundary against the upper class
boundary and joining all the consecutive points by straight lines. The cumulative frequency
polygon for the data of Table 3.6 is shown in Figure 3.6. If relative cumulative frequencies
or percentages had been used, we would call the graph a relative frequency ogive or a
percentage ogive.
Percentile- Percentiles are values that divide a set of observations into 100 equal parts. These
values, denoted by P1, P2,…, P99, are such that 1% of the data falls below P1, 2% falls below P2,… and
99% falls below P99.
To illustrate the procedure in calculating a percentile, let us find P85 for the distribution of
battery lives in Table 3.3. First, we must rank the given data in increasing order of magnitude
as displayed in Table 3.9. Since the table contains 40 observations, we seek the value below
which (85/100) x 40 = 34 observations fall. As seen in Table 3.9, P85 could be any value between
4.1 years and 4.2 years, the 34th and 35th ranked observations. In order to give a unique value,
we take their average and define P85 = 4.15 years. This
procedure works very well whenever the number of observations below the given percentile
is a whole number. However, when the required number of observations is fractional, it is
customary to use the next higher whole number to find the required percentile. For example,
in finding P48 we seek the value below which (48/100) x 40 = 19.2 observations fall. Rounding
up to the next integer, we use the 20th observation as our location point. Hence P48 = 3.4 years.
1.6 2.6 3.1 3.2 3.4 3.7 3.9 4.3
1.9 2.9 3.1 3.3 3.4 3.7 3.9 4.4
2.2 3.0 3.1 3.3 3.5 3.7 4.1 4.5
2.5 3.0 3.2 3.3 3.5 3.8 4.1 4.7
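The location rule described above can be sketched in Python. The ten values below are hypothetical, chosen only to exercise both branches of the rule; they are not the battery data. The function name percentile_ungrouped is an assumption, and p is taken to be a whole number from 1 to 99.

```python
def percentile_ungrouped(data, p):
    """Percentile by the location rule: find where p% of the n observations fall.

    If p*n/100 is a whole number k, average the k-th and (k+1)-th ranked values;
    otherwise round up to the next whole number and take that ranked value.
    """
    if not 1 <= p <= 99:
        raise ValueError("p must be a whole number from 1 to 99")
    ranked = sorted(data)
    n = len(ranked)
    k, remainder = divmod(p * n, 100)  # integer arithmetic avoids round-off
    if remainder == 0:
        return (ranked[k - 1] + ranked[k]) / 2
    return ranked[k]  # 0-based index k is the (k+1)-th ranked value

# Hypothetical sample of 10 observations (illustration only).
sample = [1.9, 2.2, 2.5, 2.6, 3.0, 3.1, 3.3, 3.4, 3.7, 4.1]
p85 = percentile_ungrouped(sample, 85)  # 8.5 rounds up to the 9th value
p50 = percentile_ungrouped(sample, 50)  # average of the 5th and 6th values
print(p85, p50)
```

Other textbooks and software packages use different interpolation conventions, so results from library routines may differ slightly from this rule.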
Although one can always determine a percentile from the original data, it may be
advantageous and less time-consuming to calculate a percentile directly from the
frequency distribution. In grouping the data, we have chosen to ignore the identity of the
individual observations. The only information that remains, assuming the original raw data
have been discarded, is the number of observations falling in each class interval. To evaluate
a percentile from a frequency distribution, we assume the measurements within a given class
interval to be uniformly distributed between the lower and upper class boundaries. This is
equivalent to interpreting a percentile as a value below which a specific fraction or
percentage of the area of a histogram falls. To illustrate the calculation of a percentile from
a frequency distribution, we consider the following example.
Example 3: Use the frequency distribution of Table 3.4 to find P48 for the distribution of
battery lives.
Solution: We are seeking the value below which (48/100) x 40 = 19.2 of the observations fall.
The fact that the observations are assumed uniformly distributed over the class interval permits
us to use fractional observations, as is the case here. There are 7 observations falling below
the class boundary 2.95. We still need 12.2 of the next 15 observations falling between 2.95
and 3.45. Therefore, we must go a distance (12.2/15) x 0.5 = 0.41 beyond 2.95. Hence
P48 = 2.95 + 0.41 = 3.36 years,
compared with 3.4 years obtained above from the ungrouped data. Therefore, we conclude
that 48% of all batteries of this type will last less than 3.36 years.
Decile- Deciles are values that divide a set of observations into 10 equal parts. These values
denoted by D1, D2,…, D9, are such that 10% of the data falls below D1, 20% falls below D2…
and 90% falls below D9.
Deciles are found in exactly the same way that we found percentiles. To find D7 for the
distribution of battery lives, we need the value below which (70/100) x 40 =28 of the
observations in Table 3.9 fall. Since this can be any value between 3.7 years and 3.8 years,
we take their average and hence D7 = 3.75 years. Therefore, we conclude that 70% of all
batteries of this type will last less than 3.75 years.
Example 4: Use the frequency distribution of Table 3.4 to find the D7 for the distribution of
battery lives.
Solution: We need the value below which (70/100) x 40 = 28 observations fall. There are
22 observations falling below 3.45. We still need 6 of the next 10 observations and therefore
must go a distance (6/10) x 0.5 = 0.3 beyond 3.45. Hence,
D7 = 3.45 + 0.3 = 3.75 years.
Quartile- Quartiles are values that divide a set of observations into 4 equal parts. These values,
denoted by Q1, Q2 and Q3, are such that 25% of the data falls below Q1, 50% falls below Q2 and
75% falls below Q3.
To find Q1 for the distribution of battery lives, we need the value below which (25/100) x
40 =10 of the observations in Table 3.9 fall. Since the 10th and 11th measurements are both
equal to 3.1 years, their average will also be 3.1 years and hence Q1 = 3.1 years.
Example 5: Use the frequency distribution of Table 3.2 to find Q3 for the distribution of
weights of 50 pieces of luggage.
Solution: We need the value below which (75/100) x 50 = 37.5 observations fall. There are
24 observations falling below 15.5 kilograms. We still need 13.5 of the next 19 observations and
therefore must go a distance (13.5/19) x 3 = 2.1 beyond 15.5. Hence
Q3 = 15.5 + 2.1 = 17.6 kilograms.
Therefore, we conclude that 75% of all 50 pieces of luggage weigh less than 17.6 kilograms.
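The interpolation used in the grouped-data examples can be written as one small function. This is a sketch under the same assumption the text makes, namely that observations are spread uniformly within each class. The function name percentile_grouped and its argument layout are illustrative, and the data are the luggage-weight classes tabulated earlier.

```python
def percentile_grouped(boundaries, frequencies, p):
    """Estimate the p-th percentile from a frequency distribution.

    boundaries: class boundaries in increasing order (one more than the classes)
    frequencies: class frequencies, one per class
    """
    if not 0 < p < 100:
        raise ValueError("p must be between 0 and 100")
    target = p * sum(frequencies) / 100  # observations that must fall below
    below = 0                            # cumulative frequency below current class
    for i, freq in enumerate(frequencies):
        if below + freq >= target:
            lower, upper = boundaries[i], boundaries[i + 1]
            # Go the required fraction of the class width beyond the lower boundary.
            return lower + (target - below) / freq * (upper - lower)
        below += freq
    raise ValueError("frequencies must contain at least one observation")

luggage_boundaries = [6.5, 9.5, 12.5, 15.5, 18.5, 21.5]
luggage_frequencies = [2, 8, 14, 19, 7]
q3 = percentile_grouped(luggage_boundaries, luggage_frequencies, 75)
print(round(q3, 1))
```

The same function handles deciles and quartiles by passing p = 10k for the k-th decile or p = 25k for the k-th quartile.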
The 50th percentile, fifth decile and second quartile of a distribution are all equal to the
same value, commonly referred to as the median. All the quartiles and deciles are
percentiles. For example, the seventh decile is the 70th percentile and the first quartile is the
25th percentile. Any percentile, decile or quartile can also be estimated from a percentage
ogive.
Answer the following problems
Assignment:
4, 12
Bring-home Quiz
5, 13