
CHAPTER 4
Data Management
by Rebecca C. Tolentino

Chapter Outline
4.1 Introduction
4.2 Descriptive Statistics
4.3 Linear Regression and Correlation

Learning Objectives

1. Use a variety of statistical tools to process and manage numerical data.
2. Use the methods of linear regression and correlation to predict the value
of a variable given certain conditions.
3. Advocate the use of statistical data in making important decisions.
Chapter 4. Data Management

4.1 Introduction
Data management pertains to the “practice of managing data as a valuable resource to unlock
its potential for an organization” (SAS, 2020). This practice is essential in the digital age, when
large volumes of data are produced every day. Statistics is one of the tools that aid in the
effective management of data.

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting
quantitative or numerical data. It is a branch of mathematics that transforms numbers into useful
information for decision makers. It is used in almost all fields of endeavor and is also useful in
understanding data that we see everywhere.

Several software packages now come equipped with statistical functions. These packages
generate statistical measures that are used in decision making. One of them is Excel, which
will be used extensively in this chapter.

4.2 Descriptive Statistics


Definition 4.2.1. Descriptive Statistics is the division of statistics that is involved in the collection,
organization, and presentation of data in an understandable way.

The following are the most commonly used descriptive statistics and the equivalent Excel
functions.

Suppose the data you want to process is encoded in cells B2 to B26.

Measures of Central Tendency and Position

| Measure | Description | Excel Function |
| --- | --- | --- |
| Mean | Sum of all values divided by the number of values | =AVERAGE(B2:B26) |
| Median | Middle value in an ordered array of data | =MEDIAN(B2:B26) |
| Mode | Value that appears most frequently in a set of data | =MODE(B2:B26) |
| Percentile | Value below which a given percentage of the data falls | =PERCENTILE(B2:B26, k), where k is between 0 and 1 |
| Quartile | Three summary measures that divide an ordered array of data into four equal parts | =QUARTILE(B2:B26, k), where k = 1, 2, or 3 gives the first, second, or third quartile (k = 0 and k = 4 return the minimum and maximum) |

Measures of Variability

| Measure | Description | Excel Function |
| --- | --- | --- |
| Population Variance | Sum of the squared differences around the mean divided by the population size | =VARP(B2:B26) |
| Sample Variance | Sum of the squared differences around the mean divided by the sample size minus 1 | =VAR(B2:B26) |
| Population Standard Deviation | Square root of the population variance | =STDEVP(B2:B26) |
| Sample Standard Deviation | Square root of the sample variance | =STDEV(B2:B26) |
| Mean Absolute Deviation | Sum of the absolute deviations from the mean divided by the number of values | =AVEDEV(B2:B26) |
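
For readers who want to check spreadsheet results outside Excel, the measures above can be sketched in Python using only the standard library. This is a minimal illustration, not part of the chapter's Excel workflow: the data values are made up, and the quartile and percentile lines assume that the "inclusive" interpolation method of `statistics.quantiles` corresponds to what QUARTILE and PERCENTILE compute.

```python
import statistics

# Hypothetical data, made up for illustration (not taken from the exercises).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Measures of central tendency
mean = statistics.mean(data)          # counterpart of =AVERAGE(range)
median = statistics.median(data)      # counterpart of =MEDIAN(range)
mode = statistics.mode(data)          # counterpart of =MODE(range)

# Measures of position (assumption: "inclusive" interpolation
# corresponds to Excel's QUARTILE and PERCENTILE)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
p90 = statistics.quantiles(data, n=10, method="inclusive")[8]  # 90th percentile

# Measures of variability
pop_var = statistics.pvariance(data)  # counterpart of =VARP(range), divides by n
samp_var = statistics.variance(data)  # counterpart of =VAR(range), divides by n - 1
pop_sd = statistics.pstdev(data)      # counterpart of =STDEVP(range)
samp_sd = statistics.stdev(data)      # counterpart of =STDEV(range)

# Mean absolute deviation, counterpart of =AVEDEV(range)
mad = sum(abs(x - mean) for x in data) / n

print("mean:", mean, "median:", median, "mode:", mode)
print("quartiles:", q1, q2, q3, "90th percentile:", p90)
print("population variance:", pop_var, "sample variance:", samp_var)
print("population sd:", pop_sd, "sample sd:", samp_sd)
print("mean absolute deviation:", mad)
```

The only difference between the population and sample versions is the divisor (n versus n - 1), so the sample variance and standard deviation are always slightly larger for the same data.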

4.3 Correlation and Regression Analysis


Definition 4.3.1. Correlation Analysis is a group of statistical techniques to measure the association
between two variables.

Definition 4.3.2. Correlation coefficient is a measure of the relative strength of a linear relationship
between two numerical variables. Its value ranges from -1 (perfect negative correlation) to +1
(perfect positive correlation).

Definition 4.3.3. Pearson’s product moment correlation coefficient or Pearson’s r is a measure of
the strength of the relationship between two variables that are measured on at least an interval
scale. The Excel function used to generate Pearson’s r is =CORREL(range1, range2)
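
To show what a function like CORREL computes, here is a minimal Python sketch of Pearson's r based on deviations from the means. The function name `pearson_r` and the paired data are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from sums of products of deviations about the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of the deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print("Pearson's r:", round(r, 4))
```

A value of r close to +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one. The sketch does not handle the case where one variable is constant, since r is undefined there.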

Definition 4.3.4. The coefficient of determination (r²) is the proportion of the total variation in the
dependent variable (Y) that is explained or accounted for by the variation in the independent variable
(X).

Definition 4.3.5. Regression analysis is carried out to develop a model to predict the values of a
dependent variable (Y), based on the value of the independent variable (X).

The Dependent Variable, denoted by Y, is the variable being predicted or estimated.

The Independent Variable, denoted by X, provides the basis for estimation. It is the predictor
variable.

The regression equation is Y’= a + bX

where

Y’ is the average predicted value of Y for any X.

a is the Y-intercept. It is the estimated Y value when X=0.

b is the slope of the line, or the average change in Y’ for each change of one unit in X.
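
The chapter does not state how a and b are estimated. Assuming the usual least-squares method, which is what Excel's SLOPE and INTERCEPT functions apply, the estimates can be written as follows, where X̄ and Ȳ denote the means of the X and Y values (notation introduced here for illustration):

```latex
b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
a = \bar{Y} - b\,\bar{X}
```

Under this assumption the fitted line always passes through the point (X̄, Ȳ).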

Suppose the X values are in cells B2 to B32 and the Y values are in cells C2 to C32.

Regression and Correlation

| Measure | Description | Excel Function |
| --- | --- | --- |
| Slope (Regression Coefficient) | Average change in Y for every unit change in X | =SLOPE(C2:C32, B2:B32) |
| Intercept (Regression Constant) | Predicted value of Y when X is zero | =INTERCEPT(C2:C32, B2:B32) |
| Pearson’s r | Correlation coefficient for two interval or ratio-scaled variables | =CORREL(B2:B32, C2:C32) |
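
The regression quantities above can also be sketched in Python. This is an illustration under the assumption that the line is fitted by least squares; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y' = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope (regression coefficient)
    a = mean_y - b * mean_x  # intercept (regression constant)
    return a, b

# Hypothetical paired observations (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)

# Coefficient of determination: share of the variation in Y explained by X
mean_y = sum(y) / len(y)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
sst = sum((yi - mean_y) ** 2 for yi in y)                    # total variation
r_squared = 1 - sse / sst

# Predicted Y' for a new X value
predicted = a + b * 6

print("intercept a:", round(a, 4))
print("slope b:", round(b, 4))
print("r squared:", round(r_squared, 4))
print("predicted Y' at X = 6:", round(predicted, 4))
```

Note the argument order in the Excel functions: SLOPE and INTERCEPT take the Y range first and the X range second, whereas `fit_line` here takes X first.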

Watch the following videos for further explanation and examples:

• Calculating Mean, Median, Mode and Standard Deviation In Excel.
https://www.youtube.com/watch?v=2rEhWFhSqnI

• Calculating The Standard Deviation, Mean, Median, Mode, Range, & Variance
Using Excel. https://www.youtube.com/watch?v=k17_euuiTKw

• Using Excel to calculate a correlation coefficient || interpret relationship
between variables. https://www.youtube.com/watch?v=sGlsdHD-lcA

• How to Calculate a Correlation in Microsoft Excel - Pearson's r.
https://www.youtube.com/watch?v=8a_etQN-qso


Exercise 4.1
Descriptive Statistics
Name: ________________________________________________________
Score:
Course-Block: _________________ Schedule: ________________________

Professor: _____________________________________________________

Use Excel or any spreadsheet software to answer the following questions.

1. A student obtained the following scores in twelve 100-item assignments:


93 91 90 86 86 94 97 90 98 85 95 97

a. What is his mean score in the assignments? _____________________________


b. What is the median score in the assignments? _____________________________
c. What is the modal score in the assignments? _____________________________

2. A travelling salesman checks the prices of gasoline in gas stations within his area of
assignment. The following are the prices per liter of unleaded gasoline in a sample of 15
gasoline stations in his area:
50.15 51.89 48.84 51.87 46.59 51.61 49.54 47.98
50.96 51.22 51.08 50.88 51.94 46.50 47.90

a. What is the highest price? _________________________________________________


b. What is the lowest price? _________________________________________________
c. What is the mean? _________________________________________________
d. What is the median? _________________________________________________
e. What is the mode? _________________________________________________

3. A student obtained the following scores in twelve 100-item assignments:


93 91 90 86 86 94 97 90 98 85 95 97

a. What is the standard deviation? _____________________________


b. What is the Mean Absolute Deviation? _____________________________
c. Based on the descriptive measures you obtained in item 1 and in this item, what can you
say about his scores in the assignments?
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________


4. A travelling salesman checks the prices of gasoline in gas stations within his area of
assignment. The following are the prices per liter of unleaded gasoline in a sample of 15
gasoline stations in his area:
50.15 51.89 48.84 51.87 46.59 51.61 49.54 47.98
50.96 51.22 51.08 50.88 51.94 46.50 47.90

a. What is the standard deviation? ____________________________________


b. What is the coefficient of variation? ____________________________________
c. Based on the descriptive measures you obtained in item 2 and in this item, what can
you say about the price of gasoline in the area?
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________
_______________________________________________________________________

5. A commuter from Cavite travels to work in Manila each morning. He records his travel
time (in minutes) during the last two weeks as follows:

| Week | Mon | Tue | Wed | Thurs | Fri |
| --- | --- | --- | --- | --- | --- |
| Week 1 | 104 | 84 | 62 | 97 | 70 |
| Week 2 | 115 | 54 | 74 | 101 | 108 |

a. Compute the mean, median, first quartile, and third quartile.

b. Compute the range, interquartile range, variance, standard deviation, and coefficient of
variation.

c. What would you tell a person who asks how long it would take to commute from Cavite to
Manila in the morning?


6. One of the major issues in customer service is the speed with which a company responds to
customer complaints. The manager of a telecommunication company wants baseline data on
how long the company takes to resolve customer complaints. The data will be used as a
reference for a new system the company wants to adopt. The following data from a
random sample of 25 complaints represent the number of days between the receipt of a
complaint and the resolution of the complaint:

a. Compute the mean, median, first quartile, and third quartile.

b. Compute the range, interquartile range, variance, standard deviation, and coefficient of
variation.

c. On the basis of the results of (a) and (b), if you had to tell the president of the company how
long a customer should expect to wait to have a complaint resolved, what would you say?
Explain.


Exercise 4.2
Linear Regression and Correlation
Name: ________________________________________________________
Score:
Course-Block: _________________ Schedule: ________________________

Professor: _____________________________________________________

1. A researcher is developing a regression model to predict college general weighted average
based on high school average grade.
a. The independent variable is ________________________________________________.
b. The dependent variable is __________________________________________________.

2. A fitness instructor is interested in studying decrease in weight as a function of the number
of hours spent in the gym. In his study,
a. The independent variable is ________________________________________________.
b. The dependent variable is __________________________________________________.

3. The following data is to be used to construct a regression model:

X 12 11 13 5 19 14 17 6 17 14 18 7 8 18 14
Y 20 19 24 14 27 22 25 14 22 26 26 15 15 23 20

a. The value of the regression constant is _______________________________________.


b. The value of the regression coefficient is ______________________________________.
c. Pearson’s correlation coefficient is ___________________________________________.
d. The coefficient of determination is __________________________________________.

4. A college faculty member collected data on his students’ general weighted average (GWA) in
the first semester and their high school (HS) average grade.

GWA 2.06 2.08 2.11 1.52 1.62 1.47 2.18 1.7 1.85 1.69
HS grade 92 93 85 87 89 89 89 85 88 95
GWA 2.03 2 1.46 1.27 2.06 1.26 2.14 2.2 1.29 1.96
HS grade 85 92 87 92 91 85 91 92 91 85

a. The correlation coefficient is ____________________________________________.


b. The coefficient of determination is _______________________________________.
c. Based on the results of (a) and (b), what conclusions can you reach concerning
general weighted average in the first semester and high school average grade?
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________


5. A Mathematics faculty member of a public university conducted a study among students
enrolled in an online Mathematics course. He collected data on the number of hours each
student spent in the online classroom and the student's score in the assessment test. The
following table shows the data collected from 30 randomly selected students:

student hours score student hours score


1 28 90 16 10 42
2 29 92 17 30 98
3 24 82 18 21 84
4 25 85 19 11 51
5 23 79 20 16 66
6 24 82 21 19 77
7 28 98 22 17 70
8 30 80 23 25 85
9 28 78 24 14 52
10 11 45 25 18 82
11 28 95 26 15 63
12 27 86 27 13 51
13 26 97 28 15 60
14 15 55 29 30 92
15 17 67 30 29 93

a. The correlation coefficient is ____________________________________________.


b. The coefficient of determination is _______________________________________.
c. Based on the results of (a) and (b), what conclusions can you reach from this
correlation?
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________

6. In the study in item 4, a college faculty member collected data on his students’ general
weighted average in the first semester and their high school average grade. If a regression
equation is developed for GWA as a function of high school average,

a. The value of the regression constant is ____________________________________.


b. The value of the regression coefficient is ___________________________________.
c. Based on the regression model, what is the predicted general weighted average of a
student whose high school average is 93?
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________


7. In the study in item 5, a Mathematics faculty member collected data on the number of hours
a student spent in the online classroom and the student's score in the assessment test. If a
regression model is developed for the score in the assessment test based on the number of
hours spent in the online classroom,

a. The value of the regression constant is ____________________________________.


b. The value of the regression coefficient is ___________________________________.
c. Based on the regression model, what is the predicted score of a student who spent
21 hours in the online classroom?
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________


References

Berenson, M.L., Levine, D.M., & Krehbiel, T.C. (2012). Basic Business Statistics: Concepts and
Applications (12th Edition). Prentice Hall.

Lind, D.A., Marchal, W.G., & Wathen, S.A. (2012). Basic Statistics for Business and Economics
(8th Edition). McGraw Hill.

Mann, P.S. (2010). Introductory Statistics. John Wiley & Sons, Inc.
