
MODULE 4

Data Management

LEARNING OUTCOMES

At the end of the lesson, students should be able to:

1. Understand the basic concepts and terminology of statistics;
2. Categorize data and identify the scale of measurement on which the data are measured;
3. Paraphrase and differentiate the methods of collecting data and use appropriate
sampling techniques;
4. Chart data in various forms such as graphs, tables, and/or text using MS Excel.

Definition of Statistics
plural sense: numerical facts, e.g. CPI, peso-dollar exchange rate
singular sense: scientific discipline consisting of theory and methods for processing numerical
information that one can use when making decisions in the face of uncertainty.

History of Statistics
• The term statistics came from the Latin phrase “ratio status”, which means the study of
practical politics or the statesman’s art.
• In the middle of the 18th century, the term statistik (a term due to Achenwall) was used, a
German term defined as “the political science of several countries”.
• From statistik it became statistics, defined as a statement in figures and facts of the
present condition of a state.

Application of Statistics
• Diverse applications
“During the 20th Century statistical thinking and methodology have become the scientific
framework for literally dozens of fields including education, agriculture, economics, biology, and
medicine, and with increasing influence recently on the hard sciences such as astronomy,
geology, and physics. In other words, we have grown from a small obscure field into a big obscure
field.” – Brad Efron
• Comparing the effects of five kinds of fertilizers on the yield of a particular variety of corn
• Determining the income distribution of Filipino families

URS-IM-AA-CI-0167 Rev 00 Effective Date: August 24, 2020


• Comparing the effectiveness of two diet programs
• Prediction of daily temperatures
• Evaluation of student performance
Two Aims of Statistics
Statistics aims to uncover structure in data, to explain variation…
• Descriptive
• Inferential
Descriptive Statistics includes all the techniques used in organizing, summarizing, and
presenting the data on hand, while Inferential Statistics includes all the techniques used in
analyzing the sample data that will lead to generalizations about the population from which the
sample was taken.

Areas of Statistics
Descriptive statistics
• methods concerned with collecting, describing, and analyzing a set of data without
drawing conclusions (or inferences) about a larger group.

Example of Descriptive Statistics
Present the Philippine population by constructing a graph indicating the total number of Filipinos
counted during the last census by age group and sex



Inferential statistics
• methods concerned with the analysis of a subset of data leading to predictions or
inferences about the entire set of data.

Example of Inferential Statistics

A new milk formulation designed to improve the psychomotor development of infants was
tested on randomly selected infants.

Based on the results, it was concluded that the new milk formulation is effective in improving the
psychomotor development of infants.



Figure: a larger set (N units/observations) and a smaller set (n units/observations) taken
from it; inferences and generalizations about the larger set are made from the smaller set.



Key Definitions

• A universe is the collection of things or observational units under consideration.
• A variable is a characteristic observed or measured on every unit of the universe.
• A population is the set of all possible values of the variable.
• Parameters are numerical measures that describe the population or universe of interest.
They are usually denoted by Greek letters: μ (mu), σ (sigma), ρ (rho), λ (lambda), τ (tau),
θ (theta), α (alpha), and β (beta).
• Statistics are numerical measures of a sample.
A parameter is a summary measure describing a specific characteristic of the population,
while a statistic is a summary measure describing a specific characteristic of the sample.
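
The difference between a parameter and a statistic can be illustrated with a short Python sketch. The population values, the sample size of 4, and the random seed are hypothetical and chosen only for illustration.

```python
import random
import statistics

# Hypothetical population: the ages of all 10 members of a small club.
population = [18, 21, 19, 25, 30, 22, 27, 24, 20, 24]

# Parameter: a summary measure of the whole population (mu).
mu = statistics.mean(population)

# Statistic: the same summary measure computed on a sample only (x-bar).
random.seed(1)  # fixed seed so the sample is reproducible
sample = random.sample(population, 4)
x_bar = statistics.mean(sample)

print("population mean (parameter):", mu)
print("sample mean (statistic):", x_bar)
```

The parameter stays fixed for a given population, while the statistic changes from sample to sample.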



ACTIVITY No. 1

Answer the following questions as briefly as possible.


1. Differentiate descriptive from inferential statistics. [4]

2. Give a specific application of statistics in each of the following fields: [14]
2.1 Business & Accountancy
2.2 Computer Studies
2.3 Education
2.4 Social Sciences & Humanities
2.5 Agriculture
2.6 Literature & Fine Arts
2.7 Technology & Livelihood

3. Look for any printed material, identify the statistics mentioned in it, and
classify each as descriptive or inferential statistics. [12]



Types of Variables

Qualitative variable
• non-numerical values
Quantitative variable
• numerical values
a. Discrete
• countable
b. Continuous
• measurable
c. Constant

Figure: VARIABLES divide into Qualitative and Quantitative; Quantitative divides into
Discrete and Continuous.



Scales of Measurement

1. Nominal scale
• Numbers or symbols used to classify
• Examples are sex, marital status, occupation,
nationality, etc.

2. Ordinal scale
• Accounts for order; no indication of distance
between positions.
• Examples are curriculum level, socio-economic
status, military ranks, Latin honors, etc.

3. Interval scale
• Equal intervals; no absolute zero.
• Examples are temperature, test scores, etc.

4. Ratio scale
• Has absolute zero.
• Examples are bank account, cellphone load, etc.

The ratio level of measurement has all the following properties:


a. the numbers in the system are used to classify a person/object into
distinct, non-overlapping and exhaustive categories;
b. the system arranges the categories according to magnitude;
c. the system has a fixed unit of measurement representing a set size
throughout the scale and
d. the system has an absolute zero.
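
Properties (a) to (d) accumulate from one scale to the next: each scale keeps the properties of the scale before it and adds one more. A minimal Python sketch of that idea follows. The names `PROPERTIES`, `SCALES`, and `scale_of` are hypothetical, and the short property labels are paraphrases of the list above.

```python
# Properties (a)-(d) listed above, in order.
PROPERTIES = ["classifies", "orders", "fixed unit", "absolute zero"]

# Each scale has every property of the scale before it plus one more.
SCALES = {
    "nominal": PROPERTIES[:1],
    "ordinal": PROPERTIES[:2],
    "interval": PROPERTIES[:3],
    "ratio": PROPERTIES[:4],
}

def scale_of(has_order, has_fixed_unit, has_absolute_zero):
    """Return the highest scale whose properties a variable satisfies."""
    if not has_order:
        return "nominal"
    if not has_fixed_unit:
        return "ordinal"
    if not has_absolute_zero:
        return "interval"
    return "ratio"

print(scale_of(False, False, False))  # e.g. marital status
print(scale_of(True, False, False))   # e.g. military ranks
print(scale_of(True, True, False))    # e.g. temperature
print(scale_of(True, True, True))     # e.g. cellphone load
```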



ACTIVITY No. 2

Enumerate five (5) variables that you can think of and classify each as
qualitative or quantitative data. If quantitative, state whether it is discrete or
continuous data. State the level at which each variable is measured. [15]
1. __________________________
2. __________________________
3. __________________________
4. __________________________
5. _________________________



Definition

Measurement is the process of determining the value or label of the variable based
on what has been observed.

For example, we can measure the educational level of a person by using the
International Standard Classification of Education designed by UNESCO:

0 pre-primary; 1 primary; 2 lower secondary; 3 upper secondary; 4 post-secondary
non-tertiary; 5 first-stage tertiary; 6 second-stage tertiary

Methods of Data Collection

• Objective Method
• Subjective Method
• Use of Existing Records


Methods of Data Presentation

• Textual
• Tabular
• Graphical


ACTIVITY No. 2

Sketch a pie chart on your own monthly family income and expenditures. [20]

SELF ASSESSMENT QUESTION NO. 1

Identify whether the given situation belongs to the area of descriptive statistics or
inferential statistics. [20]
1. Synchronous vs Asynchronous Learning: Their Effects in the Teaching-
Learning Process
2. Average of a student in his 10 subjects
3. Statistics on COVID-19 cases in the world
4. Effect of music in reviewing for the exams
5. One wishes to find out which gives a better salary between companies in the
rural areas or urban areas
6. Enrolment rate in tertiary private institutions
7. Percentage of PUIs by municipality in the Province of Rizal
8. Impact of COVID 19 Pandemic in the life of tertiary students
9. Average sales for the first quarter of 2020
10. Amount of time spent in studying vs success of passing



SELF ASSESSMENT QUESTION No. 2

Classify the following variables as qualitative or quantitative. If quantitative, further
tell whether the data are discrete or continuous. State the scale on which each is measured.
[30]
1. breeds of dogs
2. birth order (first, second, etc)
3. monthly income
4. cellphone number
5. night differential of cashiers in a convenience store
6. spot on a die
7. jersey number of a basketball player
8. IQ test scores
9. Students classification (continuing, irregular, returning)
10. COVID 19 cases in a barangay



SELF ASSESSMENT QUESTION No. 3

Sketch an appropriate graph in each of the following problems.


1. Enrolment Profile by College of a certain university for SY 2019-2020. [10]

College             First Semester   Second Semester
Accountancy         450              650
Business            1250             1500
Computer Studies    600              750

2. Verbal Ability Test Scores and Math Ability Test Scores of ten (10) students in a
certain class. [15]

Student   Verbal Ability Test Score   Math Ability Test Score
1         80                          95
2         95                          88
3         82                          89
4         85                          94
5         84                          92
6         80                          87
7         86                          89
8         89                          92
9         85                          90
10        90                          85



ANSWERS TO SELF ASSESSMENT QUESTION No. 1

Identify whether the given situation belongs to the area of descriptive statistics or
inferential statistics. [20]

1. Synchronous vs Asynchronous Learning: Their Effects in the Teaching-Learning
Process: Inferential Statistics
2. Average of a student in his 10 subjects: Descriptive Statistics
3. Statistics on COVID-19 cases in the world: Descriptive Statistics
4. Effect of music in reviewing for the exams: Inferential Statistics
5. One wishes to find out which gives a better salary between companies in the
rural areas or urban areas: Inferential Statistics
6. Enrolment rate in tertiary private institutions: Descriptive Statistics
7. Percentage of PUIs by municipality in the Province of Rizal: Descriptive Statistics
8. Impact of COVID-19 Pandemic in the life of tertiary students: Inferential Statistics
9. Average sales for the first quarter of 2020: Descriptive Statistics
10. Amount of time spent in studying vs success of passing: Inferential Statistics



ANSWERS TO SELF ASSESSMENT QUESTION No. 2

Classify the following variables as qualitative or quantitative. If quantitative, further
tell whether the data are discrete or continuous. State the scale on which each is measured.
[30]
1. breeds of dogs: qualitative, nominal
2. birth order (first, second, etc.): qualitative, ordinal
3. monthly income: quantitative, continuous, ratio
4. cellphone number: qualitative, nominal
5. night differential of cashiers in a convenience store: quantitative, continuous, ratio
6. spot on a die: quantitative, discrete, ratio
7. jersey number of a basketball player: qualitative, nominal
8. IQ test scores: quantitative, continuous, interval
9. Student classification (continuing, irregular, returning): qualitative, nominal
10. COVID-19 cases in a barangay: quantitative, discrete, ratio



ANSWERS TO SELF ASSESSMENT QUESTION No. 3

Sketch an appropriate graph in each of the following problems.


1. Enrolment Profile by College of a certain university for SY 2019-2020. [10]

College             First Semester   Second Semester
Accountancy         450              650
Business            1250             1500
Computer Studies    600              750

Figure: chart of enrolment by college (Accountancy, Business, Computer Studies) with two
series, First Semester and Second Semester.


2. Verbal Ability Test Scores and Math Ability Test Scores of ten (10) students in a
certain class. [15]

Student   Verbal Ability Test Score   Math Ability Test Score
1         80                          95
2         95                          88
3         82                          89
4         85                          94
5         84                          92
6         80                          87
7         86                          89
8         89                          92
9         85                          90
10        90                          85

Figure: chart of Verbal Ability Test Score and Math Ability Test Score for students 1 to 10.

Reference: Slides Presentation Used During The Training on Teaching Basic Statistics for Tertiary
Level Teachers Summer 2008

Most of the slides were taken from Elementary Statistics: A Handbook of Slide Presentation
prepared by ZVJ Albacea, CE Reano, RV Collado, LN Comia, NA Tandang in 2005 for the Institute
of Statistics, CAS UP Los Baños
LEARNING OUTCOMES

General Objectives

The purpose of this module is to familiarize students with Descriptive Statistics
using the Data Analysis ToolPak.

Specific Objectives
As a result of this lesson, students should be able to:

1. Analyze data using Data Analysis ToolPak and other functions in MS Excel;
2. Explain and interpret the results of the data analysis.

Descriptive Statistics
A descriptive statistic (in the count noun sense) is a summary statistic that
quantitatively describes or summarizes features from a collection of information
while descriptive statistics (in the mass noun sense) is the process of using and
analyzing those statistics. Descriptive statistics is distinguished from inferential statistics
(or inductive statistics) by its aim to summarize a sample rather than use the data to learn
about the population that the sample of data is thought to represent. This generally means
that descriptive statistics, unlike inferential statistics, is not developed on the basis of
probability theory and are frequently non-parametric statistics. Even when a data analysis
draws its main conclusions using inferential statistics, descriptive statistics are generally
also presented. For example, in papers reporting on human subjects, typically a table is
included giving the overall sample size, sample sizes in important subgroups (e.g., for
each treatment or exposure group), and demographic or clinical characteristics such as
the average age, the proportion of subjects of each sex, the proportion of subjects with
related co-morbidities, etc. (Source: https://en.wikipedia.org/wiki/Descriptive_statistics)



Summary Measures



Measures of Location

Maximum and Minimum



Measures of Central Tendency

Mean



Median

Mode



MODULE 6
Range (R)

Interquartile Range (IR)

Variance



Standard Deviation

Remarks on Standard Deviation

Comparing Standard Deviation



Measures of Skewness

What is Symmetry?



Measures of Kurtosis



Percentiles

Deciles

Quartiles



Measures of Variation



Let’s try to work on some data samples

Encoded Data



Data Analysis Using ToolPak

Recalling Module 1 on MS Excel Fundamentals, first enable the Data Analysis
ToolPak by following the steps shown below:



Click Data, Data Analysis, then Descriptive Statistics



This will be displayed on your screen.

Data Interpretation
Based on the summary measures, it can be noted that the distribution (weight in
pounds), whose mean is 145.13 with a standard deviation of approximately 18.67, is
positively skewed (0.15 is greater than 0) and platykurtic (-1.32 is less than 0).
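
The reading rule used in this interpretation (compare the skewness and the kurtosis with 0) can be written as a small Python sketch. The function name `describe_shape` is hypothetical, and the labels for the negative and zero cases ("negatively skewed", "symmetric", "leptokurtic", "mesokurtic") are standard textbook terms assumed here. The sketch also assumes that the kurtosis reported by Excel is excess kurtosis, so that 0 is the reference value.

```python
def describe_shape(skewness, kurtosis):
    """Label a distribution from its skewness and (excess) kurtosis."""
    if skewness > 0:
        skew_label = "positively skewed"
    elif skewness < 0:
        skew_label = "negatively skewed"
    else:
        skew_label = "symmetric"

    if kurtosis > 0:
        kurt_label = "leptokurtic"
    elif kurtosis < 0:
        kurt_label = "platykurtic"
    else:
        kurt_label = "mesokurtic"
    return skew_label, kurt_label

# Values from the weight example above.
print(describe_shape(0.15, -1.32))
```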



ACTIVITY NO. 1

Consider the data below on the daily wages of 15 employees:

Daily Wages
P300 P450 P550 P650 P650
P350 P435 P500 P650 P600
P400 P400 P550 P600 P450

a. Compute for Descriptive Statistics using Data Analysis ToolPak.


b. Interpret results.

ACTIVITY NO. 2

A survey in a certain barangay showed the number of members in each household
as follows: 3, 5, 6, 4, 7, 8, 6, 9, 10, 4, 6, 7, 5, 8, 9, 8, 3, 4, 5 and 5.

a. Compute for Descriptive Statistics using Data Analysis ToolPak.


b. Interpret results.



SELF-ASSESSMENT QUESTION NO. 1

Fifty families were surveyed and the number of children x was recorded for each
family as follows:
0,1,2,3,4,2,2,2,3,3,4,5,6,1,0,1,6,2,5,4,3,0,1,2,3,3,3,6,4,2,6,2,1,5,3,0,0,2,5,6,1,0,1,2,5,3
,4,2,2,3

a. Compute for Descriptive Statistics using Data Analysis ToolPak.


b. Interpret results.



ANSWERS TO SELF-ASSESSMENT QUESTION NO. 1

a. Compute for Descriptive Statistics using Data Analysis ToolPak.

Number of Children of 50 Families

Mean                 2.72
Standard Error       0.255614506
Median               2.5
Mode                 2
Standard Deviation   1.807467503
Sample Variance      3.266938776
Kurtosis             -0.771635469
Skewness             0.308046539
Range                6
Minimum              0
Maximum              6
Sum                  136
Count                50

b. Interpret results.

Based on the summary measures, it can be observed that the distribution
(number of children of a sample of 50 families), whose mean is 2.72 or approximately
3 children with a standard deviation of approximately 2 children, is positively skewed
(0.308 is greater than 0) and platykurtic (-0.77 is less than 0).
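
As a cross-check outside Excel, the same summary measures can be recomputed with Python's standard library. This is a sketch, not the ToolPak's own code: the skewness and kurtosis lines use the sample formulas documented for Excel's SKEW and KURT functions, which are assumed here to be what the ToolPak reports.

```python
import math
import statistics

children = [0, 1, 2, 3, 4, 2, 2, 2, 3, 3, 4, 5, 6, 1, 0, 1, 6, 2, 5, 4, 3, 0, 1, 2, 3,
            3, 3, 6, 4, 2, 6, 2, 1, 5, 3, 0, 0, 2, 5, 6, 1, 0, 1, 2, 5, 3, 4, 2, 2, 3]

n = len(children)
mean = statistics.mean(children)
s = statistics.stdev(children)          # sample standard deviation (divisor n - 1)
z = [(x - mean) / s for x in children]  # standardized values

summary = {
    "Mean": mean,
    "Standard Error": s / math.sqrt(n),
    "Median": statistics.median(children),
    "Mode": statistics.mode(children),
    "Standard Deviation": s,
    "Sample Variance": statistics.variance(children),
    # Sample excess kurtosis and sample skewness (assumed Excel KURT / SKEW forms).
    "Kurtosis": (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * sum(v ** 4 for v in z)
                - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)),
    "Skewness": n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z),
    "Range": max(children) - min(children),
    "Minimum": min(children),
    "Maximum": max(children),
    "Sum": sum(children),
    "Count": n,
}

for name, value in summary.items():
    print(f"{name:20s}{value}")
```

The printed values should agree with the ToolPak table above up to rounding.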



SELF-ASSESSMENT QUESTION NO. 2

Consider the systolic blood pressures of 12 smokers and 12 non-smokers as follows:

Smokers: 112,146,120,114,124,126,118,128,130,134,116,130
Non-Smokers: 114,134,114,116,138,110,112,116,132,126,108,116

a. Compute for Descriptive Statistics using Data Analysis ToolPak separately.


b. Compare and interpret results.



ANSWERS TO SELF-ASSESSMENT QUESTION NO. 2

Consider the systolic blood pressures of 12 smokers and 12 non-smokers as follows:

Smokers: 112,146,120,114,124,126,118,128,130,134,116,130
Non-Smokers: 114,134,114,116,138,110,112,116,132,126,108,116

a. Compute for Descriptive Statistics using Data Analysis ToolPak separately.

SMOKERS

Mean                 124.8333333
Standard Error       2.790224639
Median               125
Mode                 130
Standard Deviation   9.665621678
Sample Variance      93.42424242
Kurtosis             0.643159284
Skewness             0.731605702
Range                34
Minimum              112
Maximum              146
Sum                  1498
Count                12

NON-SMOKERS

Mean                 119.6666667
Standard Error       2.921532737
Median               116
Mode                 116
Standard Deviation   10.12048627
Sample Variance      102.4242424
Kurtosis             -0.830332271
Skewness             0.819101367
Range                30
Minimum              108
Maximum              138
Sum                  1436
Count                12
b. Compare and interpret results.

Based on the summary measures for smokers, it can be observed that their
systolic blood pressure, whose mean is 124.83 with a standard deviation of 9.67, has a
positively skewed (0.73 is greater than 0) and leptokurtic (0.64 is greater than 0)
distribution. Meanwhile, the systolic blood pressure of the non-smokers, whose mean is
119.67 with a standard deviation of 10.12, has a distribution that is also positively
skewed (0.82 is greater than 0) but platykurtic (-0.83 is less than 0). In this survey,
the systolic blood pressures of smokers are, on average, higher and slightly less spread
out around their mean than those of the non-smokers.

Reference: Slides Presentation Used During The Training on Teaching Basic Statistics for Tertiary
Level Teachers Summer 2008

Most of the slides were taken from Elementary Statistics: A Handbook of Slide Presentation
prepared by ZVJ Albacea, CE Reano, RV Collado, LN Comia, NA Tandang in 2005 for the Institute
of Statistics, CAS UP Los Baños



Linear Regression and
Correlation

LEARNING OUTCOMES

At the end of the lesson, students should be able to:

1. Distinguish the measure of association to be used given the raw data;


2. Analyze correlational problems using Data Analysis Toolpak in MS Excel;

Definition of a Measure of Association


A measure of association or relationship is used to determine the degree of
relationship between two variables (x and y). These variables are observed in their
natural setting; they can be neither manipulated nor controlled.
The correlation coefficient takes values in the interval [-1.0, +1.0]. The
quantity r, called the linear correlation coefficient, measures the strength and the
direction of a linear relationship between two variables.

Direction of Relationship
1. Perfect Positive Correlation
If x and y have a strong positive linear correlation, r is close to +1.0. An r value
which is exactly equal to +1.0 indicates a perfect positive fit. Positive values
indicate a relationship between x and y variables such that as values for x increase,
values for y also increase.



2. Perfect Negative Correlation
If x and y have a strong negative linear correlation, r is close to -1.0. An r value
which is exactly equal to -1.0 indicates a perfect negative fit. Negative values
indicate a relationship between x and y variables such that as values for x increase,
values for y decrease, and vice versa.



Some Correlational Techniques
1. Pearson Product-Moment Correlation or Pearson r is used when both variables
are measured on an interval or ratio scale. The computational formula is given by:

2. Spearman Rank-Order Correlation Coefficient is used when both variables are
measured on an ordinal scale. There are two scenarios: (a) the original data
are ranks; (b) the original data are measured on an interval/ratio scale and converted
into ranks. The computational formula is given by:



3. Point Biserial Correlation Coefficient is used when one of the variables is
measured on an interval or ratio scale and the other variable is a dichotomous
variable (a variable that has two categories). The computational formula is given
by:

4. Phi Coefficient or the Four-fold Coefficient is used when both x and y are
dichotomous. The computational formula is given by:
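
For reference, the four coefficients in items 1 to 4 are commonly written in the textbook forms below. The notation is an assumption and may differ from the notation used in the original slides.

```latex
% Pearson product-moment correlation (computational form)
r = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}

% Spearman rank-order correlation (no tied ranks);
% d_i = difference between the two ranks of unit i
r_s = 1 - \frac{6\sum d_i^{2}}{n\left(n^{2}-1\right)}

% Point-biserial correlation; M_1, M_0 = means of the interval/ratio variable
% in the two groups, n_1, n_0 = group sizes, n = n_1 + n_0,
% s_n = standard deviation of all n values computed with divisor n
r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^{2}}}

% Phi coefficient for a 2 x 2 table with cell counts
% a, b (first row) and c, d (second row)
\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}
```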



Testing the Significance of an r

5. Chi Square Test for Independence compares two variables in a contingency table
to see if they are related. In a more general sense, it tests whether the
distributions of categorical variables differ from each other. A very small chi
square test statistic means that the observed data fit the counts expected under
independence extremely well; in other words, there is no evidence of a relationship
between the two variables. A very large chi square test statistic means that the
data do not fit the independence model well; in other words, the two variables are
related.



Illustrative Example

Sample  Sex  Socio-Economic Status  QPA in Math  QPA in English  Rank in Abstract Reasoning  Rank in Oral Communication  Attendance to Kindergarten
1       M    Poor                   1.3          1.8             2                           5                           Did Not Attend
2       F    Poor                   1.2          1.7             3                           4                           Attended
3       M    Non-Poor               1.5          1.5             5                           2                           Attended
4       M    Poor                   1.4          1.6             4                           3                           Did Not Attend
5       F    Non-Poor               1.0          1.2             1                           1                           Attended

Utilizing Data Analysis ToolPak in MS Excel

                 QPA in Math   QPA in English
QPA in Math      1
QPA in English   0.485512      1

The computed r value of 0.485512 indicates that there is a moderate positive correlation
between QPA in Math and QPA in English in the sample.
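
The r value above can be checked by hand with the computational form of the Pearson formula. The following Python sketch uses the five pairs of QPA values from the illustrative table; the function name `pearson_r` is hypothetical.

```python
import math

math_qpa = [1.3, 1.2, 1.5, 1.4, 1.0]
english_qpa = [1.8, 1.7, 1.5, 1.6, 1.2]

def pearson_r(x, y):
    """Pearson r from the sums n, sum x, sum y, sum xy, sum x^2, sum y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(math_qpa, english_qpa)
print(round(r, 6))
```

The result should agree with the value reported by the ToolPak, about 0.4855.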



ACTIVITY No. 1

Answer the following questions as briefly as possible.


In a survey conducted with university students on a controversial issue, the
following results were obtained:
Students vs Opinion Agree Disagree
Seniors 176 139
Freshmen 157 204

Analyze using Data Analysis ToolPak. Use 0.05 alpha to test whether their
opinions differ or not.

ACTIVITY No. 2

A random sample of fifty men and fifty women were surveyed as to drinking
habits and classified as alcoholics, heavy drinkers and light drinkers. The results
were:

Sex vs Alcohol Consumption   Alcoholic   Heavy Drinkers   Light Drinkers
Male                         11          18               21
Female                       7           15               28

Analyze using Data Analysis ToolPak. Use 0.05 alpha to test their
independence.



SELF ASSESSMENT QUESTION NO. 1

One hundred individuals, aged 20-58, were given a test of psychomotor skill. Both age and
score were classified as shown in the accompanying table:

Score
Age High Average Low
40-59 23 20 17
20-39 18 12 10

SELF ASSESSMENT QUESTION No. 2

Test the relationship of Sex and their Attendance to Kindergarten in the table below:

Sample  Sex  Socio-Economic Status  QPA in Math  QPA in English  Rank in Abstract Reasoning  Rank in Oral Communication  Attendance to Kindergarten
1       M    Poor                   1.3          1.8             2                           5                           Did Not Attend
2       F    Poor                   1.2          1.7             3                           4                           Attended
3       M    Non-Poor               1.5          1.5             5                           2                           Attended
4       M    Poor                   1.4          1.6             4                           3                           Did Not Attend
5       F    Non-Poor               1.0          1.2             1                           1                           Attended


ANSWERS TO SELF ASSESSMENT QUESTION No. 1
Utilizing the Chi Square Test for Independence, the computed Chi Square is 0.44. The
tabular value is 5.99 at 0.05 alpha with 2 degrees of freedom. Since the computed value
is less than the tabular value, we fail to reject the null hypothesis that there is no
relationship between age group and psychomotor skill test score among the one hundred
individuals. This implies that the psychomotor skills of the two age groups do not differ
significantly from each other at the 0.05 level of significance.
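
The computed Chi Square can be reproduced from the table of observed counts with a short Python sketch. The function name `chi_square_independence` is hypothetical; the critical value 5.991 is the standard chi-square table value for 0.05 alpha and 2 degrees of freedom.

```python
observed = [
    [23, 20, 17],  # age 40-59: High, Average, Low
    [18, 12, 10],  # age 20-39: High, Average, Low
]

def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

chi2, df = chi_square_independence(observed)
CRITICAL_05_DF2 = 5.991  # tabular value: alpha = 0.05, df = 2
decision = "reject H0" if chi2 > CRITICAL_05_DF2 else "fail to reject H0"
print(round(chi2, 2), df, decision)
```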

ANSWERS TO SELF ASSESSMENT QUESTION No. 2

Coding: M = 0, F = 1; did not attend = 0, attended = 1

Sample   Sex   Attendance to Kindergarten
1        0     0
2        1     1
3        0     1
4        0     0
5        1     1

                             Sex        Attendance to Kindergarten
Sex                          1
Attendance to Kindergarten   0.666667   1

The correlational coefficient value of 0.666667 suggests that there is a moderate correlation
between Sex and Attendance to Kindergarten of the sampled population.

Examples for Self-Assessment Questions were taken from the book: Probability & Statistics.
Ymas Jr., Sergio E. Sta Monica Printing Corporation. Manila, Philippines. 2009.
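
Because both Sex and Attendance to Kindergarten are dichotomous, the coefficient 0.666667 in Self Assessment Question No. 2 is a phi coefficient. The Python sketch below recomputes it from the 0/1 coding (M = 0, F = 1; did not attend = 0, attended = 1) using the usual 2 x 2 cell-count form of phi; the variable names are hypothetical.

```python
import math

sex = [0, 1, 0, 0, 1]         # M = 0, F = 1
attendance = [0, 1, 1, 0, 1]  # did not attend = 0, attended = 1

pairs = list(zip(sex, attendance))

# Cell counts of the 2 x 2 table (rows: sex, columns: attendance).
a = pairs.count((0, 0))
b = pairs.count((0, 1))
c = pairs.count((1, 0))
d = pairs.count((1, 1))

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 6))
```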
Linear Regression
Regression is primarily used to build models or equations that predict a key response, Y,
from a set of predictor (X) variables. Correlation is primarily used to quickly and concisely
summarize the direction and strength of the relationships among a set of two or more
numeric variables.
Regression describes how an independent variable is numerically related to the
dependent variable. Correlation represents the linear relationship between two
variables. In contrast, regression is used to fit the best line and to estimate one variable
on the basis of another variable.
Use correlation for a quick and simple summary of the direction and strength of the
relationship between two or more numeric variables. Use regression when you want
to predict, optimize, or explain a numeric response (how x influences y).
When investigating the relationship between two or more variables, it is important
to know the difference between correlation and regression. Correlation quantifies the
direction and strength of the relationship between two numeric variables, X and Y, with a
coefficient that always lies between -1.0 and +1.0. Meanwhile, simple linear regression
relates X and Y through an equation of the form y = a + bx.
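
A minimal Python sketch of how the constants a and b in y = a + bx can be estimated by least squares is shown below. The function name `fit_line` and the four data points are hypothetical; the points were chosen to lie exactly on y = 1 + 2x so the result is easy to check.

```python
def fit_line(x, y):
    """Least-squares estimates of a (intercept) and b (slope) in y = a + bx."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
a, b = fit_line(x, y)
print(a, b)
```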



Illustrative Examples
A researcher would like to know which among the high school grade, pre-board rating,
age and college grade are predictors of the board rating.
Let us try to simulate the analysis by encoding the data below.

Name       Board Rating   High School Grade   Pre-Board Rating   Age   College Grade
Zsakira    90             94                  88                 30    86
Wajid      92             90                  91                 32    90
Ysabelle   95             92                  92                 24    93
Xhandra    93             88                  90                 22    91
Zhnarah    88             90                  86                 21    89
Gio        91             92                  90                 24    93
Airah      93             90                  92                 25    94
Wilxon     96             88                  94                 23    94
Wlei       99             89                  97                 22    97
Vinh       94             90                  91                 21    92
Fairuz     89             92                  91                 32    91
Adrian     95             91                  94                 40    93
Shairah    98             90                  96                 34    96

EXCEL VIEW

Encode the data using five columns, first column for the dependent variable (board
rating) and the remaining columns for the independent variables (high school grade,
pre-board rating, age, and college grade).

Figure 6.1. Data View (encoded data)


To analyze the data we need to follow these steps.

Regression Data Analysis Tool Steps

1. From the toolbar, click Data, then Data Analysis, then Regression.
2. Click OK.
3. Click in the Input Y Range box and select the range of the dependent variable in
the first column, including the label.
4. Click in the Input X Range box and select the range of the independent variables in
the remaining columns, including the labels.
5. Tick the Labels check box.
6. Click OK.
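
Excel's Regression tool fits the model by the least-squares method. As a rough sketch of what such a fit involves, the Python code below solves the normal equations by Gauss-Jordan elimination. The function name `fit_multiple` and the small data set are hypothetical; the data were generated from y = 1 + 2(x1) + 3(x2) so the fitted coefficients are easy to check. This is an illustration only, not Excel's own algorithm.

```python
def fit_multiple(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    # Add a column of 1s so that the first coefficient is the intercept.
    A = [[1.0] + [float(v) for v in row] for row in X]
    p = len(A[0])
    # Augmented matrix [A^T A | A^T y].
    M = [
        [sum(r[i] * r[j] for r in A) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(A, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Clear this column in every other row.
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]
y = [9, 8, 19, 18, 29]
beta = fit_multiple(X, y)
print([round(b, 4) for b in beta])
```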

EXCEL GUIDE

From the Tool bar, click Data\ Data Analysis\Regression\OK.

Figure 6.2. Data, Data Analysis, Regression


Click OK

Figure 6.3. Regression Dialog Box

Click in the Input Y Range and select the range of the dependent variable in the first
column including the label.

Figure 6.4. Input Range A1:A14, including the labels


Click in the Input X Range and select the range of the independent variable in the
remaining columns.
Figure 6.5. Input Range B1:E14, including the labels

Click Labels and then click OK.

Figure 6.6. Labels in First Row


Result

Table 6.1. Regression Output

Table 6.2. Regression Statistics

R Square equals 0.893, which indicates a good fit: 89.3% of the variation in the dependent
variable (board rating) is explained by the independent variables (high school grade,
pre-board rating, age, college grade).

F-Value, Probability Value


Table 6.3
ANOVA

Since the value of Significance F (0.00059) is less than 0.05, the regression model
as a whole is statistically significant.
Had Significance F been greater than 0.05, it would be better to stop using this set
of independent variables. You may delete some variables and/or add other
variables.


Regression Line Coefficients

Table 6.4
Coefficients

Based on the probability values, only the Pre-Board Rating, with a p-value of 0.0052, is
below 0.05, which makes it the only significant predictor of the board rating.

The regression line:

Y = 13.482 - 0.0995 (High School Grade) + 1.2856 (Pre-Board Rating)
    - 0.1424 (Age) - 0.2738 (College Grade)

In other words, holding the other variables constant, for each unit increase in high
school grade, board rating decreases by 0.0995. For each unit increase in Pre-Board
Rating, board rating increases by 1.2856. For each unit increase in age, board rating
decreases by 0.1424. For each unit increase in college grade, board rating decreases by
0.2738.

The regression line can also be used to forecast or predict the dependent variable
based on the given independent variables by simply substituting the values.
For example, you would like to predict the board rating of a student whose high
school grade is 90, whose pre-board rating is 80, who is 30 years old, and whose college
grade is 85.
Y = 13.482 - 0.0995 (High School Grade) + 1.2856 (Pre-Board Rating)
    - 0.1424 (Age) - 0.2738 (College Grade)
Y = 13.482 - 0.0995 (90) + 1.2856 (80) - 0.1424 (30) - 0.2738 (85) = 79.83
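
The substitution above can also be carried out in a few lines of Python. The dictionary names are hypothetical; the coefficients and the student's values are the ones given in the text.

```python
coefficients = {
    "Intercept": 13.482,
    "High School Grade": -0.0995,
    "Pre-Board Rating": 1.2856,
    "Age": -0.1424,
    "College Grade": -0.2738,
}
student = {
    "High School Grade": 90,
    "Pre-Board Rating": 80,
    "Age": 30,
    "College Grade": 85,
}

# Intercept plus each coefficient times the student's value for that variable.
predicted = coefficients["Intercept"] + sum(
    coefficients[name] * value for name, value in student.items()
)
print(round(predicted, 2))
```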



EXERCISE No. 1

1. The following data relate the selling price Y to the living space x1, lot size x2,
and the number of bathrooms x3, for 10 recently sold homes in a common
area.
Selling Price      House Size       Lot Size         Number of
(Million Pesos)    (Square Meter)   (Square Meter)   Bathrooms
1.8                48               52               2
2.2                54               60               2
3.4                52               65               3
4.3                50               100              3
6.5                100              250              4
10.2               120              500              6

a. Fit a multiple linear regression model to the above data.


b. Predict the selling price of a home of 60 square meter house size, 80
square meter lot size, and with 2 bathrooms.

EXERCISE No. 2

A researcher would like to know whether the profile of the respondents in


terms of age, number of children, and distance from work predicts their
performance in a certain company.

Performance   Age   Number of Children   Distance from Work (in kilometers)
88            45    4                    15
90            28    2                    4
94            25    3                    4
86            32    6                    8
92            40    3                    6
95            21    1                    6
80            58    10                   20

a. Fit a multiple linear regression model to the above data.


b. Predict the performance of an employee who is 30 years old with 5
children and 10 kilometers away from work.

SELF ASSESSMENT No. 1

Fit a multiple linear regression model to the following data set.


Y X1 X2 X3 X4
12.2 3 3 9 5
16.5 2 4 10 4
13.3 1.5 8 14 2
17.4 3 9 8 3
14.2 2.5 7 12 4
11.4 3 2 7 3

ANSWER TO SELF ASSESSMENT QUESTION

𝑌 = 32.0386 − 5.8121𝑥1 + 1.1255𝑥2 − 1.6028𝑥3 + 1.8405𝑥4

Reference: http://www.graphpad.com
