Prepared by:
RACHEL M. PATANGAN
Instructor III
LESSON 5.1
DATA GATHERING AND ORGANIZING DATA; REPRESENTING DATA USING GRAPHS AND
CHARTS; INTERPRETING ORGANIZED DATA
Fact sheets
You may be familiar with probability and statistics through radio, television,
newspapers, and magazines. For example, you may have read statements like the following:
1. In the Philippines, 60% of adults aged 55–65 were infected by the
coronavirus disease.
2. The back-to-school student plans to spend, on average, 15,050 on electronics and
computer-related items.
3. A person needs an average of 8 hours of sleep per day.
4. The average in-state college tuition and fees for a 4-year public college is 10,500.
1. Like professional people, you must be able to read and understand the various
statistical studies performed in your fields. To have this understanding, you must be
knowledgeable about the vocabulary, symbols, concepts, and statistical procedures
used in these studies.
2. You may be called on to conduct research in your field, since statistical procedures
are basic to research. To accomplish this, you must be able to design experiments;
collect, organize, analyze, and summarize data; and possibly make reliable
predictions or forecasts for future use. You must also be able to communicate the
results of the study in your own words.
3. You can also use the knowledge gained from studying statistics to become better
consumers and citizens. For example, you can make intelligent decisions about what
products to purchase based on consumer studies, about government spending based
on utilization studies, and so on.
These reasons can be considered the goals for studying statistics. It is the purpose of this
chapter to introduce the goals for studying statistics by answering questions such as the
following:
Data are the values (measurements or observations) that the variables can assume.
A collection of data values forms a data set. Each value in the data set is called a data value
or a datum.
Data can be used in different ways. The body of knowledge called statistics is
sometimes divided into two main areas, depending on how data are used. The two areas are
1. Descriptive statistics
2. Inferential statistics
Here, the statistician tries to make inferences from samples to populations. Inferential
statistics uses probability, i.e., the chance of an event occurring. You may be familiar with the
concepts of probability through various forms of gambling. If you play cards, dice, bingo, and
lotteries, you win or lose according to the laws of probability. Probability theory is also used in
the insurance industry and other areas.
Most of the time, due to the expense, time, size of population, medical concerns, etc., it is not
possible to use the entire population for a statistical study; therefore, researchers use samples.
If the subjects of a sample are properly selected, most of the time they should
possess the same or similar characteristics as the subjects in the population. The techniques
used to properly select a sample will be explained in the next lesson.
Qualitative variables are variables that can be placed into distinct categories, according to
some characteristic or attribute. For example, if subjects are classified according to gender
(male or female), then the variable gender is qualitative. Other examples of qualitative
variables are religious preference and geographic locations.
Quantitative variables are numerical and can be ordered or ranked. For example, the variable
age is numerical, and people can be ranked in order according to the value of their ages.
Other examples of quantitative variables are heights, weights, and body temperatures.
Quantitative variables can be further classified into two groups: discrete and continuous.
Discrete variables can be assigned values such as 0, 1, 2, 3 and are said to be countable.
Examples of discrete variables are the number of children in a family, the number of students
in a classroom, and the number of calls received by a switchboard operator each day for a
month.
The first level of measurement is called the nominal level of measurement. A sample
of college instructors classified according to subject taught (e.g., English, history, psychology,
or mathematics) is an example of nominal-level measurement. Classifying survey subjects as
male or female is another example of nominal-level measurement. No ranking or order can be
placed on the data. Classifying residents according to zip codes is also an example of the
nominal level of measurement. Even though numbers are assigned as zip codes, there is no
meaningful order or ranking. Other examples of nominal-level data are political party
(Democratic, Republican, Independent, etc.), religion (Christianity, Judaism, Islam, etc.),
and marital status (single, married, divorced, widowed, separated).
The next level of measurement is called the ordinal level. Data measured at this level
can be placed into categories, and these categories can be ordered, or ranked. For example,
from student evaluations, guest speakers might be ranked as superior, average, or poor.
Floats in a homecoming parade might be ranked as first place, second place, etc. Note that
precise measurement of differences in the ordinal level of measurement does not exist.
For instance, when people are classified according to their build (small, medium, or large), a
large variation exists among the individuals in each class.
The third level of measurement is called the interval level. This level differs from the
ordinal level in that precise differences do exist between units. For example, many standardized
psychological tests yield values measured on an interval scale. IQ is an example of such a
variable. There is a meaningful difference of 1 point between an IQ of 109 and an IQ of 110.
Temperature is another example of interval measurement, since there is a meaningful difference
of 1°F between each unit, such as 72°F and 73°F. One property is lacking in the interval scale:
There is no true zero. For example, IQ tests do not measure people who have no intelligence.
For temperature, 0°F does not mean no heat at all.
The final level of measurement is called the ratio level. Examples of ratio scales are
those used to measure height, weight, area, and number of phone calls received. Ratio scales
have differences between units (1 inch, 1 pound, etc.) and a true zero. In addition, the ratio
scale contains a true ratio between values. For example, if one person can lift 200 pounds
and another can lift 100 pounds, then the ratio between them is 2 to 1. Put another way, the
first person can lift twice as much as the second person.
When conducting a statistical study, the researcher must gather data for the particular
variable under study. For example, if a researcher wishes to study the number of people who
were bitten by poisonous snakes in a specific geographic area over the past several years, he
or she has to gather the data from various doctors, hospitals, or health departments.
After organizing the data, the researcher must present them so they can be
understood by those who will benefit from reading the study. The most useful method of
presenting the data is by constructing statistical charts and graphs. There are many different
types of charts and graphs, and each one has a specific purpose.
This lesson explains how to organize data by constructing frequency distributions and
how to present the data by constructing charts and graphs. The charts and graphs illustrated
here are histograms, frequency polygons, ogives, and pie graphs.
Since little information can be obtained from looking at raw data, the researcher
organizes the data into what is called a frequency distribution. A frequency distribution
consists of classes and their corresponding frequencies. Each raw data value is placed into a
quantitative or qualitative category called a class. The frequency of a class then is the number
of data values contained in a specific class. A frequency distribution is shown for the
preceding data set.
Now some general observations can be made from looking at the frequency
distribution. For example, it can be stated that the majority of the wealthy people in the study
are over 55 years old.
The classes in this distribution are 35–41, 42–48, etc. These values are called class
limits. The data values 35, 36, 37, 38, 39, 40, 41 can be tallied in the first class; 42, 43, 44,
45, 46, 47, 48 in the second class; and so on.
Two types of frequency distributions that are most often used are the categorical
frequency distribution and the grouped frequency distribution. The procedures for constructing
these distributions are shown.
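A categorical frequency distribution can be sketched in a few lines of Python. The blood types below are hypothetical values made up for illustration, and the names `blood_types` and `distribution` are assumptions of this sketch, not part of the lesson.

```python
from collections import Counter

# Hypothetical blood types for 25 people (illustrative only).
blood_types = [
    "A", "B", "B", "AB", "O", "O", "O", "B", "AB", "B",
    "B", "B", "O", "A", "O", "A", "O", "O", "O", "AB",
    "AB", "A", "O", "B", "A",
]

# Tally each class, then convert each frequency to a percent of the total.
counts = Counter(blood_types)
total = len(blood_types)
distribution = {cls: (freq, 100 * freq / total) for cls, freq in counts.items()}

for cls in sorted(distribution):
    freq, percent = distribution[cls]
    print(f"{cls:>2}  frequency={freq:2d}  percent={percent:.0f}%")
```

Each class appears once with its frequency and its percent of the total, and the frequencies add up to the number of data values.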
Example 5.1.2.1
Twenty-five army inductees were given a blood test to determine their blood type. The
data set is
Students sometimes have difficulty finding class boundaries when given the class limits.
The basic rule of thumb is that the class limits should have the same decimal place value as the
data, but the class boundaries should have one additional place value and end in a 5. For
example, if the values in the data set are whole numbers, such as 24, 32, and 18, the limits for a
class might be 31–37, and the boundaries are 30.5–37.5. Find the boundaries by subtracting 0.5
from 31 (the lower class limit) and adding 0.5 to 37 (the upper class limit).
If the data are in tenths, such as 6.2, 7.8, and 12.6, the limits for a class hypothetically
might be 7.8–8.8, and the boundaries for that class would be 7.75–8.85. Find these values by
subtracting 0.05 from 7.8 and adding 0.05 to 8.8.
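The rule of thumb above can be checked with a small Python sketch. The function name `class_boundaries` is hypothetical, and `Decimal` is used here only to avoid floating-point rounding noise.

```python
from decimal import Decimal

def class_boundaries(lower_limit, upper_limit):
    """Return (lower boundary, upper boundary) for one class."""
    lower = Decimal(str(lower_limit))
    upper = Decimal(str(upper_limit))
    # Exponent of the last decimal place used by the limits:
    # 0 for whole numbers such as 31, -1 for tenths such as 7.8.
    places = min(lower.as_tuple().exponent, upper.as_tuple().exponent)
    # Half a unit in that place: 0.5 for whole numbers, 0.05 for tenths.
    half_unit = Decimal(5).scaleb(places - 1)
    return lower - half_unit, upper + half_unit

print(class_boundaries(31, 37))    # limits 31-37
print(class_boundaries(7.8, 8.8))  # limits 7.8-8.8
```

The two calls reproduce the boundaries quoted in the text, 30.5–37.5 and 7.75–8.85.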
Finally, the class width for a class in a frequency distribution is found by subtracting
the lower (or upper) class limit of one class from the lower (or upper) class limit of the next
class. For example, the class width in the preceding distribution on the duration of boat
batteries is 7, found from 31 - 24 = 7.
The class width can also be found by subtracting the lower boundary from the upper
boundary for any given class. In this case, 30.5 - 23.5 = 7.
Note: Do not subtract the limits of a single class. It will result in an incorrect answer.
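The short sketch below illustrates the note with the class 24–30 and the next class starting at 31, as implied by the numbers above. The variable names are hypothetical.

```python
# Class 24-30 followed by a class whose lower limit is 31.
first_class_limits = (24, 30)
next_lower_limit = 31
first_class_boundaries = (23.5, 30.5)

# Correct: lower limit of the next class minus lower limit of this class.
width_from_limits = next_lower_limit - first_class_limits[0]

# Also correct: upper boundary minus lower boundary of the same class.
width_from_boundaries = first_class_boundaries[1] - first_class_boundaries[0]

# Incorrect: subtracting the limits of a single class undercounts by one,
# because the class 24-30 contains the seven values 24, 25, ..., 30.
wrong_width = first_class_limits[1] - first_class_limits[0]

print(width_from_limits, width_from_boundaries, wrong_width)
```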
The researcher must decide how many classes to use and the width of each class. To
construct a frequency distribution, follow these rules:
Example 5.1.3.1
These data represent the record high temperatures in degrees Fahrenheit (°F) for each of the
50 states. Construct a grouped frequency distribution for the data using 7 classes.
Cumulative frequencies are used to show how many data values are accumulated up
to and including a specific class. In Example 5.1.3.1, 28 of the total record high temperatures
are less than or equal to 114°F. Forty-eight of the total record high temperatures are less than
or equal to 124°F.
After the raw data have been organized into a frequency distribution, they are
analyzed by looking for peaks and extreme values. The peaks show which class or classes
have the most data values compared to the other classes. Extreme values, called outliers,
are data values that are very large or very small relative to the other data values.
When the range of the data values is relatively small, a frequency distribution can be
constructed using single data values for each class. This type of distribution is called an
ungrouped frequency distribution and is shown next.
Example 5.1.3.2
The data shown here represent the number of miles per gallon (mpg) that 30 selected four-
wheel-drive sports utility vehicles obtained in city driving. Construct a frequency distribution,
and analyze the distribution.
The steps for constructing a grouped frequency distribution are summarized in the
following Procedure Table.
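As a rough illustration, the construction can be sketched in Python for whole-number data. This follows one common textbook procedure (find the range, divide by the number of classes, round the width up, start at the lowest value). The function name and the temperature values are hypothetical and are not the data of Example 5.1.3.1.

```python
import math

def grouped_frequency_distribution(values, num_classes):
    """Return rows of (lower limit, upper limit, frequency, cumulative frequency)."""
    low, high = min(values), max(values)
    # Class width: range divided by the number of classes, rounded up.
    width = math.ceil((high - low) / num_classes)
    # If the classes would stop short of the highest value, widen by one.
    if low + width * num_classes <= high:
        width += 1
    rows, cumulative = [], 0
    for i in range(num_classes):
        lower = low + i * width
        upper = lower + width - 1  # whole-number data: limits do not overlap
        freq = sum(lower <= v <= upper for v in values)
        cumulative += freq
        rows.append((lower, upper, freq, cumulative))
    return rows

# Hypothetical record-high temperatures in degrees Fahrenheit.
temps = [100, 105, 110, 112, 114, 118, 120, 122, 125, 134]
rows = grouped_frequency_distribution(temps, 7)
for lower, upper, freq, cum in rows:
    print(f"{lower}-{upper}: frequency={freq}, cumulative={cum}")
```

The last cumulative frequency always equals the number of data values, which is a quick check that every value was tallied exactly once.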
After you have organized the data into a frequency distribution, you can present them
in graphical form. The purpose of graphs in statistics is to convey the data to the viewers in
pictorial form. It is easier for most people to comprehend the meaning of data presented
graphically than data presented numerically in tables or frequency distributions. This is
especially true if the users have little or no statistical knowledge.
Statistical graphs can be used to describe the data set or to analyze it. Graphs are
also useful in getting the audience’s attention in a publication or a speaking presentation.
They can be used to discuss an issue, reinforce a critical point, or summarize a data set. They
can also be used to discover a trend or pattern in a situation over a period of time.
1. The histogram.
2. The frequency polygon.
3. The cumulative frequency graph, or ogive (pronounced o-jive).
Example 5.1.4.1
Construct a histogram to represent the data shown for the record high temperatures for
each of the 50 states (see Example 5.1.3.1).
Example 5.1.4.2 Using the frequency distribution given in Example 5.1.4.1, construct a
frequency polygon.
The third type of graph that can be used represents the cumulative frequencies for
the classes. This type of graph is called the cumulative frequency graph, or ogive. The
cumulative frequency is the sum of the frequencies accumulated up to the upper boundary of
a class in the distribution.
Example 5.1.4.3
The steps for drawing these three types of graphs are shown in the following Procedure Table.
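The points needed to draw the three graphs can be computed before any plotting is done. In the sketch below the class frequencies are assumed values, chosen to be consistent with the cumulative figures quoted earlier (28 values up to 114°F and 48 up to 124°F). All variable names are hypothetical.

```python
from itertools import accumulate

class_limits = [(100, 104), (105, 109), (110, 114), (115, 119),
                (120, 124), (125, 129), (130, 134)]
frequencies = [2, 8, 18, 13, 7, 1, 1]  # assumed, not taken from the lesson

# Histogram: one bar per class, drawn between the class boundaries.
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in class_limits]
histogram_bars = list(zip(boundaries, frequencies))

# Frequency polygon: frequencies plotted at the class midpoints.
midpoints = [(lo + hi) / 2 for lo, hi in class_limits]
polygon_points = list(zip(midpoints, frequencies))

# Ogive: cumulative frequencies plotted at the upper class boundaries,
# starting from zero at the first lower boundary.
cumulative = list(accumulate(frequencies))
ogive_points = [(boundaries[0][0], 0)] + [
    (upper, cf) for (_, upper), cf in zip(boundaries, cumulative)
]

print(histogram_bars)
print(polygon_points)
print(ogive_points)
```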
LESSON 5.2
Measures of central tendency: mean, median, mode, weighted mean
Lesson 5.1 stated that statisticians use samples taken from populations; however, when
populations are small, it is not necessary to use samples since the entire population can be
used to gain information. For example, suppose an insurance manager wanted to know the
average weekly sales of all the company’s representatives. If the company employed a large
number of salespeople, say, nationwide, he would have to use a sample and make an
inference to the entire sales force. But if the company had only a few salespeople, say, only
87 agents, he would be able to use all representatives’ sales for a randomly chosen week and
thus use the entire population.
Measures found by using all the data values in the population are called parameters.
Measures obtained by using the data values from samples are called statistics; hence, the
average of the sales from a sample of representatives is a statistic, and the average of sales
obtained from the entire population is a parameter.
The Mean
Example 5.2.1
The data represent the number of days off per year for a sample of individuals selected from
nine different countries. Find the mean.
Example 5.2.2
The number of rooms in the seven hotels in downtown Pittsburgh is 713, 300, 618, 595, 311,
401, and 292. Find the median.
Example 5.2.3
The number of tornadoes that have occurred in the United States over an 8-year period
follows. Find the median. 684, 764, 656, 702, 856, 1133, 1132, 1303
In Example 5.2.2 the data set had an odd number of values; hence, the median was
an actual data value. When there is an even number of values in the data set, the median
will fall between two given values, as illustrated in Example 5.2.3.
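Both cases can be checked with Python's standard `statistics` module, using the data given in Examples 5.2.2 and 5.2.3. The variable names are chosen here for illustration.

```python
from statistics import median

hotel_rooms = [713, 300, 618, 595, 311, 401, 292]         # 7 values (odd)
tornadoes = [684, 764, 656, 702, 856, 1133, 1132, 1303]   # 8 values (even)

# Odd count: the median is the middle value of the sorted data.
print(sorted(hotel_rooms), median(hotel_rooms))

# Even count: the median is the average of the two middle values.
print(sorted(tornadoes), median(tornadoes))
```

For the hotels the middle value of the sorted list is 401. For the tornadoes the two middle values are 764 and 856, so the median is their average, 810.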
The Mode
A data set that has only one value that occurs with the greatest frequency is said to be
unimodal. If a data set has two values that occur with the same greatest frequency, both
values are considered to be the mode and the data set is said to be bimodal. If a data set has
more than two values that occur with the same greatest frequency, each value is used as the
mode, and the data set is said to be multimodal. When no data value occurs more than once,
the data set is said to have no mode. A data set can have more than one mode or no mode at
all.
Example 5.2.4
Find the mode of the signing bonuses of eight NFL players for a specific year. The bonuses in
millions of dollars are 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
Example 5.2.5
Find the mode for the number of coal employees per county for 10 selected counties in
Southwestern Pennsylvania.
110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752
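A small helper makes the "no mode" case explicit. The function name `modes` is hypothetical. It returns an empty list when no value occurs more than once, which matches the convention stated above.

```python
from collections import Counter

def modes(values):
    """Return every value with the greatest frequency, or [] if there is no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:          # no value repeats, so there is no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]               # Example 5.2.4
coal_employees = [110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752]  # Example 5.2.5

print(modes(bonuses))         # 10 occurs three times
print(modes(coal_employees))  # every value occurs once
```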
The Midrange
Example 5.2.6 In the last two winter seasons, the city of Brownsville, Minnesota, reported these
numbers of water-line breaks per month. Find the midrange.
2, 3, 6, 8, 4, 1
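The midrange is the sum of the lowest and highest values divided by 2. A minimal sketch for the data above, with a hypothetical function name:

```python
def midrange(values):
    """Average of the lowest and highest data values."""
    return (min(values) + max(values)) / 2

breaks = [2, 3, 6, 8, 4, 1]
print(midrange(breaks))  # (1 + 8) / 2
```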
LESSON 5.3
Measures of dispersion: range, standard deviation and
variance
Measures of Dispersion
Example 5.3.1
Find the variance and standard deviation for 35, 45, 30, 35, 40, 25.
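The example does not say whether the six values form a population or a sample, so the sketch below computes both versions with the standard `statistics` module. The population formulas divide by N and the sample formulas divide by n − 1.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [35, 45, 30, 35, 40, 25]

# The mean is 35 and the squared deviations sum to 250.
population_variance = pvariance(data)   # 250 / 6
population_sd = pstdev(data)
sample_variance = variance(data)        # 250 / 5
sample_sd = stdev(data)

print(round(population_variance, 2), round(population_sd, 2))
print(round(sample_variance, 2), round(sample_sd, 2))
```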
Example 5.3.2 Find the sample variance and standard deviation for the amount of European
auto sales for a sample of 6 years shown. The data are in millions of dollars.
LESSON 5.4
Measures of relative position: z-score, percentiles, quartiles and box and
whiskers plots
Measures of Position
For example, if a value is located at the 80th percentile, then 80% of the values fall below it
in the distribution and 20% of the values fall above it. The median is the value that
corresponds to the 50th percentile, since one-half of the values fall below it and one-half of
the values fall above it. This section discusses these measures of position.
Standard Scores
There is an old saying, “You can’t compare apples and oranges.” But with the use of
statistics, it can be done to some extent. Suppose that a student scored 90 on a music test
and 45 on an English exam. Direct comparison of raw scores is impossible, since the exams
might not be equivalent in terms of number of questions, value of each question, and so on.
However, a comparison of a relative standard similar to both can be made. This comparison
uses the mean and standard deviation and is called a standard score or z score. (We also use
z scores in later chapters.)
A standard score or z score tells how many standard deviations a data value is above or below the
mean for a specific distribution of values. If a standard score is zero, then the data value is the
same as the mean.
For the purpose of this section, it will be assumed that when we find z scores, the data were
obtained from samples.
Example 5.4.1
A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10;
she scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her
relative positions on the two tests.
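The comparison can be worked out with the z score formula, z = (value − mean) / standard deviation. The function name below is hypothetical.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

z_calculus = z_score(65, 50, 10)
z_history = z_score(30, 25, 5)

print(z_calculus, z_history)
```

The calculus score is 1.5 standard deviations above its mean and the history score is 1.0 above its mean, so her relative position is higher in the calculus class.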
Percentiles
Percentiles are position measures used in educational and health-related fields to indicate
the position of an individual in a group.
Example 5.4.2
A teacher gives a 20-point test to 10 students. The scores are shown here. Find the
percentile rank of a score of 12.
Example 5.4.3
Using the data in Example 5.4.2, find the percentile rank for a score of 6.
Example 5.4.4
Using the scores in Example 5.4.2, find the value corresponding to the 25th percentile.
Example 5.4.5
Using the scores in Example 5.4.2, find the value corresponding to the 60th percentile.
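The sketch below uses one common textbook definition of percentile rank, (number of values below X + 0.5) / n × 100, and the matching rule for locating the value at a given percentile. Both rules and the ten scores are assumptions made for illustration, and other references define percentiles slightly differently.

```python
import math

def percentile_rank(scores, x):
    """Percent of scores below x, counting half of x itself."""
    below = sum(s < x for s in scores)
    return (below + 0.5) * 100 / len(scores)

def value_at_percentile(scores, p):
    """Value at the p-th percentile, for 0 < p < 100."""
    ordered = sorted(scores)
    c = len(ordered) * p / 100
    if c != int(c):
        # Not a whole number: round up and take that position.
        return ordered[math.ceil(c) - 1]
    # Whole number: average the c-th and (c + 1)-th values.
    c = int(c)
    return (ordered[c - 1] + ordered[c]) / 2

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]  # hypothetical 20-point test scores

print(percentile_rank(scores, 12), percentile_rank(scores, 6))
print(value_at_percentile(scores, 25), value_at_percentile(scores, 60))
```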
Quartiles divide the distribution into four groups, separated by Q1, Q2, Q3.
Note that Q1 is the same as the 25th percentile; Q2 is the same as the 50th percentile, or the median; Q3
corresponds to the 75th percentile, as shown:
Example 5.4.6
Find Q1, Q2, and Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18.
In addition to dividing the data set into four groups, quartiles can be used as a rough measurement of
variability. The interquartile range (IQR) is defined as the difference between Q3 and Q1 (Q3 − Q1) and is the range of
the middle 50% of the data.
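For the data set in Example 5.4.6, the quartiles and the IQR can be sketched as follows. The method assumed here takes Q2 as the median, then Q1 and Q3 as the medians of the lower and upper halves. Software packages often use other interpolation rules and may give slightly different answers.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, the middle value is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

data = [15, 13, 6, 5, 12, 50, 22, 18]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Sorted, the data are 5, 6, 12, 13, 15, 18, 22, 50, so Q2 is 14, Q1 is 9, Q3 is 20, and the IQR is 11.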
Deciles divide the distribution into 10 groups, as shown. They are denoted by D1, D2, etc.
LESSON 5.5
Probabilities and normal distributions
What is Normal?
Medical researchers have determined so-called normal intervals for a person’s blood
pressure, cholesterol, triglycerides, and the like. For example, the normal range of systolic
blood pressure is 110 to 140. The normal interval for a person’s triglycerides is from 30 to 200
milligrams per deciliter (mg/dl). By measuring these variables, a physician can determine if a
patient’s vital statistics are within the normal interval or if some type of treatment is needed to
correct a condition and avoid future illnesses. The question then is, how does one determine
the so-called normal intervals?
In this chapter, you will learn how researchers determine normal intervals for specific
medical tests by using a normal distribution. You will see how the same methods are used to
determine the lifetimes of batteries, the strength of ropes, and many other traits.
The normal distribution curve can be used to study many variables that are not
perfectly normally distributed but are nevertheless approximately normal.
This equation may look formidable, but in applied statistics, tables or technology are used for
specific problems instead of the equation. Another important consideration in applied statistics
is that the area under a normal distribution curve is used more often than the values on the y
axis. Therefore, when a normal distribution is pictured, the y axis is sometimes omitted.
The shape and position of a normal distribution curve depend on two parameters, the
mean and the standard deviation. Each normally distributed variable has its own normal
distribution curve, which depends on the values of the variable’s mean and standard
deviation. Figure (a) shows two normal distributions with the same mean values but different
standard deviations. The larger the standard deviation, the more dispersed, or spread out, the
distribution is. Figure (b) shows two normal distributions with the same standard deviation but
with different means. These curves have the same shapes but are located at different
positions on the x axis. Figure (c) shows two normal distributions with different means and
different standard deviations.
Since each normally distributed variable has its own mean and standard deviation, as
stated earlier, the shape and location of these curves will vary. In practical applications, then,
you would have to have a table of areas under the curve for each variable. To simplify this
situation, statisticians use what is called the standard normal distribution.
Table E for Standard Normal Distribution gives the area under the normal distribution curve to
the left of any z value given in two decimal places. For example, the area to the left of a z
value of 1.39 is found by looking up 1.3 in the left column and 0.09 in the top row. Where the
two lines meet gives an area of 0.9177. See the figure below.
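Where technology is preferred to a table, the same area can be obtained with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is only one of several tools that could be used.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Area under the curve to the left of z = 1.39.
area_left = standard_normal.cdf(1.39)
print(round(area_left, 4))

# Area to the right is the complement, since the total area is 1.
area_right = 1 - area_left
print(round(area_right, 4))
```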
Example 5.5.1
Example 5.5.2
Example 5.5.3
The standard normal distribution curve can be used to solve a wide variety of practical
problems. The only requirement is that the variable be normally or approximately normally
distributed. There are several mathematical tests to determine whether a variable is normally
distributed. For all the problems presented in this chapter, you can assume that the variable is
normally or approximately normally distributed.
To solve problems by using the standard normal distribution, transform the original
variable to a standard normal distribution variable by using the formula
Example 5.5.4
A survey by the National Retail Federation found that women spend on average $146.21
for the Christmas holidays. Assume the standard deviation is $29.44. Find the percentage
of women who spend less than $160.00. Assume the variable is normally distributed.
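A sketch of this calculation in Python: first transform 160.00 to a z value with z = (X − mean) / standard deviation, then find the area to its left. The variable names are hypothetical. A printed table gives a slightly different answer because z is rounded to two decimal places before the lookup.

```python
from statistics import NormalDist

mean, std_dev = 146.21, 29.44
amount = 160.00

# Step 1: transform the value to a z score.
z = (amount - mean) / std_dev

# Step 2: area to the left of z under the standard normal curve.
proportion = NormalDist().cdf(z)

# The same area, computed directly from the original distribution.
direct = NormalDist(mu=mean, sigma=std_dev).cdf(amount)

print(round(z, 2), f"{proportion:.1%}")
```

The result is roughly 68%, meaning about two-thirds of the women spend less than $160.00 under the stated assumptions.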
Example 5.5.5
LESSON 5.6
Linear regression and correlation, least squares line, linear correlation
coefficient
To answer the third question, you must ascertain what type of relationship exists.
There are two types of relationships: simple and multiple. In a simple relationship, there
are two variables—an independent variable, also called an explanatory variable or a
predictor variable, and a dependent variable, also called a response variable. A simple
relationship analysis is called simple regression, and there is one independent variable that
is used to predict the dependent variable. For example, a manager may wish to see whether
the number of years the salespeople have been working for the company has anything to
do with the amount of sales they make. This type of study involves a simple relationship,
since there are only two variables: years of experience and amount of sales.
Finally, the fourth question asks what type of predictions can be made. Predictions are
made in all areas and daily. Examples include weather forecasting, stock market analyses,
sales predictions, crop predictions, gasoline price predictions, and sports predictions. Some
predictions are more accurate than others, due to the strength of the relationship. That is,
the stronger the relationship is between variables, the more accurate the prediction is.
The scatter plot is a visual way to describe the nature of the relationship between the
independent and dependent variables.
Example 5.6.1
Example 5.6.2:
REGRESSION
In studying relationships between two variables, collect the data and then construct a
scatter plot. The purpose of the scatter plot, as indicated previously, is to determine the nature
of the relationship. The possibilities include a positive linear relationship, a negative linear
relationship, a curvilinear relationship, or no discernible relationship. After the scatter plot is
drawn, the next steps are to compute the value of the correlation coefficient and to test the
significance of the relationship. If the value of the correlation coefficient is significant, the next
step is to determine the equation of the regression line, which is the data’s line of best fit.
Note: Determining the regression line when r is not significant and then making predictions
using the regression line are meaningless. The purpose of the regression line is to enable the
researcher to see the trend and make predictions on the basis of the data.
Figure 5.6.1 shows a scatter plot for the data of two variables. It shows that several
lines can be drawn on the graph near the points.
Given a scatter plot, you must be able to draw the line of best fit. Best fit means that
the sum of the squares of the vertical distances from each point to the line is at a minimum.
The reason you need a line of best fit is that the values of y will be predicted from the values
of x; hence, the closer the points are to the line, the better the fit and the prediction will be.
See Figure 5.6.2. When r is positive, the line slopes upward from left to right. When r is
negative, the line slopes downward from left to right.
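The least squares line and the correlation coefficient can be computed from the usual sums. The sketch below uses the standard computational formulas for the slope b, the intercept a, and r. The function name and the five data points are hypothetical and serve only as an illustration.

```python
import math

def least_squares(xs, ys):
    """Return (intercept a, slope b, correlation r) for the line y' = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    sxy = n * sum_xy - sum_x * sum_y
    sxx = n * sum_x2 - sum_x ** 2
    syy = n * sum_y2 - sum_y ** 2

    b = sxy / sxx                   # slope
    a = (sum_y - b * sum_x) / n     # intercept
    r = sxy / math.sqrt(sxx * syy)  # linear correlation coefficient
    return a, b, r

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r = least_squares(xs, ys)

print(f"y' = {a:.3f} + {b:.3f}x, r = {r:.3f}")
print("predicted y at x = 6:", round(a + b * 6, 3))
```

Here r is positive, so the fitted line slopes upward from left to right. A prediction such as the one for x = 6 is only meaningful when r is significant.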
Figure 5.6.3
Example 5.6.3:
Find the equation of the regression line for the data in Car Rental Companies, as shown
below, and graph the line on the scatter plot of the data.
Then plot the two points (15, 1.986) and (40, 4.639) and draw a line connecting the two points.
See Figure 5.6.4.
Figure 5.6.4
Example 5.6.4:
START
ACTIVITY 5.1.1: Let’s Diagnose Your Knowledge
The following items cover gathering and organizing data and diagnose your
understanding of representing data using graphs and charts and interpreting the data.
Read each item carefully and determine whether each statement is true or false.
DISCOVER
1. It deals largely with calculations and graphical displays to describe important features
of the data.
EXAMINE
ACTIVITY 5.2.1: ONE MORE TRY
The following items cover data management and diagnose your understanding
of measures of central tendency. Read each item carefully and encircle the letter of
your choice.
20, 6, 23, 19, 9, 14, 15, 3, 1, 12, 10, 20, 13, 3, 17, 10, 11, 6, 21, 9, 6, 10, 9, 4, 5, 1, 5, 11, 7,
24
Find the (a.) mean, (b.) median, (c.) mode, and (d.) midrange.
EVALUATE
ACTIVITY 5.3.1: We solve!
1. The number of incidents in which police were needed for a sample of 10 schools in
Allegheny County is 7, 37, 3, 8, 48, 11, 6, 0, 10, 3.
The following items cover data management and diagnose your understanding
of probabilities and normal distributions. Read each item carefully and determine
whether each statement is true or false.
The following items cover data management and diagnose your understanding
of linear regression and correlation. Read each item carefully and determine whether
each statement is true or false.
1. The dependent variable is the variable that is being described, predicted, or controlled.
2. A simple linear regression model is an equation that describes the straight-line
relationship between a dependent variable and an independent variable.
3. The residual is the difference between the observed value of the dependent variable and
the predicted value of the dependent variable.
4. When using simple regression analysis, if there is a strong correlation between the
independent and dependent variable, then we can conclude that an increase in the value
of the independent variable causes an increase in the value of the dependent variable.
5. In a simple linear regression model, the correlation coefficient not only indicates the
strength of the relationship between independent and dependent variable, but also shows
whether the relationship is positive or negative.
6. If r = -1, then we can conclude that there is a perfect relationship between X and Y.
7. The slope of the simple linear regression equation represents the average change in the
value of the dependent variable per unit change in the independent variable (X).
8. The least squares simple linear regression line minimizes the sum of the vertical
deviations between the line and the data points.
9. The notation Ŷ refers to the average value of the dependent variable Y.
10. A significant positive correlation between X and Y implies that changes in X cause Y to
change.
11. The estimated simple linear regression equation minimizes the sum of the squared
deviations between each value of Y and the line.