
MODULE 4
Data Management
LEARNING OUTCOMES
At the end of the lesson, students should be able to:

Determine the solution set for linear equations in one or two variables;
Apply the different system of linear equations in one or two variables;
Graph the system of inequalities in one or two variables;
Formulate linear programming models;
Use graphical method for solving both maximization and minimization linear
programming problems.

Definition of Statistics
Statistics is an art and a science of collecting, presenting, analyzing and interpreting data.

Examples
■ The term statistics came from the Latin phrase “ratio status” which means study of
practical politics or the statesman’s art.
■ In the middle of the 18th century, the term statistik (a term due to Achenwall) was used, a
German term defined as “the political science of several countries”.
■ From statistik it became statistics defined as a statement in figures and facts of the
present condition of a state.

Application of Statistics
■ Diverse applications
“During the 20th Century statistical thinking and methodology have become the scientific
framework for literally dozens of fields including education, agriculture, economics, biology, and
medicine, and with increasing influence recently on the hard sciences such as astronomy,
geology, and physics. In other words, we have grown from a small obscure field into a big
obscure field.” – Brad Efron
■ Comparing the effects of five kinds of fertilizers on the yield of a particular variety of corn
■ Determining the income distribution of Filipino families
■ Comparing the effectiveness of two diet programs
■ Prediction of daily temperatures
■ Evaluation of student performance
Two Aims of Statistics
Statistics aims to uncover structure in data, to explain variation…
■ Descriptive
■ Inferential
Descriptive Statistics includes all the techniques used in organizing, summarizing and
presenting the data on hand while Inferential Statistics includes all the techniques used in
analyzing the sample data that will lead to generalizations about a population from which the
sample was taken.

Areas of Statistics
Descriptive statistics
■ methods concerned with collecting, describing, and analyzing a set of data without
drawing conclusions (or inferences) about a large group.

Example of Descriptive Statistics
Present the Philippine population by constructing a graph indicating the total number of Filipinos
counted during the last census by age group and sex
Inferential statistics
■ methods concerned with the analysis of a subset of data leading to predictions or
inferences about the entire set of data.

Example of Inferential Statistics

A new milk formulation designed to improve the psychomotor development of infants was
tested on randomly selected infants.

Based on the results, it was concluded that the new milk formulation is effective in improving the
psychomotor development of infants.
Figure: Larger Set (N units/observations); Smaller Set (n units/observations); Inferences and Generalizations


Key Definitions

■ A universe is the collection of things or observational units under consideration.


■ A variable is a characteristic observed or measured on every unit of the universe.
■ A population is the set of all possible values of the variable.
■ Parameters are numerical measures that describe the population or universe of interest.
Usually denoted by Greek letters: μ (mu), σ (sigma), ρ (rho), λ (lambda), τ (tau), θ
(theta), α (alpha) and β (beta).
■ Statistics are numerical measures of a sample.
Parameter is a summary measure describing a specific characteristic of the population
while Statistic is a summary measure describing a specific characteristic of the sample.
ACTIVITY No. 1

Answer the following questions as briefly as possible.


Differentiate descriptive from inferential statistics.
[4]
Give specific application of statistics in the following fields:
[14]
Business & Accountancy
Computer Studies
Education
Social Sciences & Humanities
Agriculture
Literature & Fine Arts
Technology & Livelihood

Look for any printed material and identify the statistics mentioned in the material and classify
them as to whether it is descriptive or inferential statistics. [12]

Types of Variables
Qualitative variable: non-numerical values
Quantitative variable: numerical values
Discrete: countable
Continuous: measurable
Constant
Scales of Measurement

1. Nominal scale
Numbers or symbols used to classify.
Examples are sex, marital status, occupation, nationality, etc.
2. Ordinal scale
Accounts for order; no indication of distance between positions.
Examples are curriculum level, socio-economic status, military ranks, Latin honors, etc.
3. Interval scale
Equal intervals; no absolute zero.
The ratio level of measurement has all the following properties:
a. the numbers in the system are used to classify a person/object into distinct,
non-overlapping and exhaustive categories;
ACTIVITY No. 2

Enumerate five (5) variables that you can think of and classify each as
qualitative or quantitative data. If quantitative, state whether it is discrete or
continuous data. State the level at which each variable is measured. [15]
1. __________________________
2. __________________________
3. __________________________
4. __________________________
5. _________________________

Definition
Measurement is the process of determining the value or label of the variable based on what
has been observed.

For example, we can measure the educational level of a person by using the International
Standard Classification of Education designed by UNESCO:

0 pre-primary; 1 primary; 2 lower secondary; 3 upper secondary; 4 post-secondary
non-tertiary; 5 1st stage tertiary; 6 2nd stage tertiary

Methods of Data Collection

Objective Method
Subjective Method
Use of Existing Records

Methods of Data Presentation


Textual
Tabular
Graphical

ACTIVITY No. 2

Sketch a pie chart on your own monthly family income and expenditures. [20]
SELF ASSESSMENT QUESTION NO. 1
Identify whether the given situation belongs to the area of descriptive statistics or
inferential statistics. [20]
Synchronous vs Asynchronous Learning: Their Effects in the Teaching-
Learning Process
Average of a student in his 10 subjects
Statistics on COVID-19 cases in the world
Effect of music in reviewing for the exams
One wishes to find out which gives a better salary between companies in the
rural areas or urban areas
Enrolment rate in tertiary private institutions
Percentage of PUIs by municipality in the Province of Rizal
Impact of COVID 19 Pandemic in the life of tertiary students
Average sales for the first quarter of 2020
Amount of time spent in studying vs success of passing
SELF ASSESSMENT QUESTION No. 2
Classify the following variables as to qualitative or quantitative. If quantitative, further
tell if it is discrete or continuous data. Be able to state the scale each is measured.
[30]
breeds of dogs
birth order (first, second, etc)
monthly income
cellphone number
night differential of cashiers in a convenience store
spot on a die
jersey number of a basketball player
IQ test scores
Students classification (continuing, irregular, returning)
COVID 19 cases in a barangay
SELF ASSESSMENT QUESTION No. 3

Sketch an appropriate graph in each of the following problems.


Enrolment Profile by College of a certain university for SY 2019-2020. [10]

Verbal Ability Test Scores and Math Ability Test Scores of ten (10) students
in a certain class. [15]
ANSWERS TO SELF ASSESSMENT QUESTION No. 1

Identify whether the given situation belongs to the area of descriptive statistics or
inferential statistics. [20]

Synchronous vs Asynchronous Learning: Their Effects in the Teaching-Learning Process - Inferential Statistics
Average of a student in his 10 subjects - Descriptive Statistics
Statistics on COVID-19 cases in the world - Descriptive Statistics
Effect of music in reviewing for the exams - Inferential Statistics
One wishes to find out which gives a better salary between companies in the
rural areas or urban areas - Inferential Statistics
Enrolment rate in tertiary private institutions - Descriptive Statistics
Percentage of PUIs by municipality in the Province of Rizal - Descriptive Statistics
Impact of COVID 19 Pandemic in the life of tertiary students - Inferential Statistics
Average sales for the first quarter of 2020 - Descriptive Statistics
Amount of time spent in studying vs success of passing - Inferential Statistics
ANSWERS TO SELF ASSESSMENT QUESTION No. 2

Classify the following variables as to qualitative or quantitative. If quantitative, further
tell if it is discrete or continuous data. Be able to state the scale each is measured.
[30]
breeds of dogs - qualitative, nominal
birth order (first, second, etc) - qualitative, ordinal
monthly income - quantitative, continuous, ratio
cellphone number - qualitative, nominal
night differential of cashiers in a convenience store - quantitative, continuous, ratio
spot on a die - quantitative, discrete, ratio
jersey number of a basketball player - qualitative, nominal
IQ test scores - quantitative, continuous, interval
Students classification (continuing, irregular, returning) - qualitative, nominal
COVID 19 cases in a barangay - quantitative, discrete, ratio
ANSWERS TO SELF ASSESSMENT QUESTION No. 3

Sketch an appropriate graph in each of the following problems.


Enrolment Profile by College of a certain university for SY 2019-2020. [10]
Verbal Ability Test Scores and Math Ability Test Scores of ten (10) students
in a certain class. [15]

Reference: Slides Presentation Used During The Training on Teaching Basic Statistics for Tertiary
Level Teachers Summer 2008
Most of the slides were taken from Elementary Statistics: A Handbook of Slide Presentation
prepared by ZVJ Albacea, CE Reano, RV Collado, LN Comia, NA Tandang in 2005 for the Institute
of Statistics, CAS UP Los Baños

LEARNING OUTCOMES

General Objectives
The purpose of this module is to familiarize students with Descriptive Statistics using the Data
Analysis ToolPak.

Specific Objectives
As a result of this lesson, students should be able to:

Analyze data using Data Analysis ToolPak and other functions in MS Excel;
Explain and interpret the results of the data analysis.
Descriptive Statistics
A descriptive statistic (in the count noun sense) is a summary statistic that
quantitatively describes or summarizes features from a collection of information
while descriptive statistics (in the mass noun sense) is the process of using and
analyzing those statistics. Descriptive statistics is distinguished from inferential (or
inductive statistics) by its aim to summarize a sample rather than use the data to learn
about the population that the sample of data is thought to represent. This generally
means that descriptive statistics, unlike inferential statistics, is not developed on the
basis of probability theory and are frequently non-parametric statistics. Even when a
data analysis draws its main conclusions using inferential statistics, descriptive statistics
are generally also presented. For example, in papers reporting on human subjects,
typically a table is included giving the overall sample size, sample sizes in important
subgroups (e.g., for each treatment or exposure group), and demographic  or clinical
characteristics such as the average age, the proportion of subjects of each sex, the
proportion of subjects with related co-morbidities, etc.
https://en.wikipedia.org/wiki/Descriptive_statistics

Summary Measures
Measures of Location
Maximum and Minimum

Measures of Central Tendency


Mean

Median
Mode
Range (R)

Interquartile Range (IR)

Variance

Standard Deviation
Remarks on Standard Deviation

Comparing Standard Deviation

Measures of Skewness
What is Symmetry?

Measures of Kurtosis
Percentiles

Deciles

Quartiles
Measures of Variation

Let’s try to work on some data samples

Encoded Data
Data Analysis Using ToolPak

Recall Module 1 on MS Excel Fundamentals, enable first your Data Analysis
ToolPak by following the steps as shown below:
Click Data, Data Analysis, then Descriptive Statistics
This will be displayed on your screen.

Data Interpretation
Based on the summary measures, it can be noted that the distribution (weight in
pounds), whose mean is 145.13 with a standard deviation of approximately 18.67, is a
positively skewed (0.15 is greater than 0) and a platykurtic (-1.32 is less than 0)
distribution.
ACTIVITY NO. 1

Consider the data on daily wages of 15 employees below:

Compute for Descriptive Statistics using Data Analysis ToolPak.
Interpret results.

ACTIVITY NO. 2

A survey in a certain barangay showed the number of members in each household
as follows: 3, 5, 6, 4, 7, 8, 6, 9, 10, 4, 6, 7, 5, 8, 9, 8, 3, 4, 5 and 5.

Compute for Descriptive Statistics using Data Analysis ToolPak.
Interpret results.

SELF-ASSESSMENT QUESTION NO. 1

Fifty families were surveyed and the number of children x was recorded for each
family as follows:
0,1,2,3,4,2,2,2,3,3,4,5,6,1,0,1,6,2,5,4,3,0,1,2,3,3,3,6,4,2,6,2,1,5,3,0,0,2,5,6,1,0,1,2,5,3
,4,2,2,3

Compute for Descriptive Statistics using Data Analysis ToolPak.


Interpret results.

ANSWERS TO SELF-ASSESSMENT QUESTION NO. 1

Compute for Descriptive Statistics using Data Analysis ToolPak.


Interpret results.

Based on the summary measures, it can be observed that the distribution (number of
children of a sample of 50 families), whose mean is 2.72 or approximately 3 children
with a standard deviation of approximately 2 children, is a positively skewed (0.308 is
greater than 0) and a platykurtic (-0.77 is less than 0) distribution.
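As a cross-check outside Excel, the same summary measures can be sketched in Python using only the standard library. This is a minimal sketch, not the ToolPak's own code: the function names are illustrative, and the skewness and kurtosis formulas are the bias-adjusted sample forms that Excel documents for its SKEW and KURT functions.

```python
from statistics import mean, median, mode, stdev

# Number of children per family, copied from Self-Assessment Question No. 1.
children = [
    0, 1, 2, 3, 4, 2, 2, 2, 3, 3, 4, 5, 6, 1, 0, 1, 6, 2, 5, 4, 3, 0, 1, 2, 3,
    3, 3, 6, 4, 2, 6, 2, 1, 5, 3, 0, 0, 2, 5, 6, 1, 0, 1, 2, 5, 3, 4, 2, 2, 3,
]


def sample_skewness(data):
    """Bias-adjusted sample skewness (same form as Excel's SKEW)."""
    n = len(data)
    m, s = mean(data), stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in data)


def sample_excess_kurtosis(data):
    """Bias-adjusted sample excess kurtosis (same form as Excel's KURT)."""
    n = len(data)
    m, s = mean(data), stdev(data)
    fourth = sum(((x - m) / s) ** 4 for x in data)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


summary = {
    "count": len(children),
    "mean": mean(children),                    # 2.72
    "median": median(children),
    "mode": mode(children),
    "standard deviation": stdev(children),     # sample SD, about 1.81
    "range": max(children) - min(children),
    "skewness": sample_skewness(children),     # about 0.308 (positively skewed)
    "kurtosis": sample_excess_kurtosis(children),  # about -0.77 (platykurtic)
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A positive skewness and a negative kurtosis lead to the same reading as the interpretation above: positively skewed and platykurtic.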
SELF-ASSESSMENT QUESTION NO. 2

Consider the systolic blood pressures of 12 smokers and 12 non-smokers as follows:

Smokers: 122,146,120,114,124,126,118,128,130,134,116,130
Non-Smokers: 114,134,114,116,138,110,112,116,132,126,108,116

Compute for Descriptive Statistics using Data Analysis ToolPak separately.
Compare and interpret results.

ANSWERS TO SELF-ASSESSMENT QUESTION NO. 2

Consider the systolic blood pressures of 12 smokers and 12 non-smokers as follows:

Smokers: 122,146,120,114,124,126,118,128,130,134,116,130
Non-Smokers: 114,134,114,116,138,110,112,116,132,126,108,116

Compute for Descriptive Statistics using Data Analysis ToolPak separately.
Compare and interpret results.

Based on the summary measures for smokers, it can be observed that their
systolic blood pressure, whose mean is 124.83 with a standard deviation of 9.67, is a
positively skewed (0.73 is greater than 0) and a leptokurtic (0.64 is greater than 0)
distribution. Meanwhile, the non-smokers whose mean systolic blood pressure is
119.67 with a standard deviation of 10.12 is also a positively skewed (0.82 is greater
than 0) and a leptokurtic (0.83 is greater than 0) distribution. In this survey, it can be
concluded that systolic blood pressure of smokers is closer to the mean than that of
the distribution of the non-smokers.

Reference: Slides Presentation Used During The Training on Teaching Basic Statistics for Tertiary
Level Teachers Summer 2008
Most of the slides were taken from Elementary Statistics: A Handbook of Slide Presentation
prepared by ZVJ Albacea, CE Reano, RV Collado, LN Comia, NA Tandang in 2005 for the Institute
of Statistics, CAS UP Los Baños

Linear Regression and Correlation

LEARNING OUTCOMES

At the end of the lesson, students should be able to:

Distinguish the measure of association to be used given the raw data;


Analyze correlational problems using Data Analysis Toolpak in MS Excel;

Definition of a Measure of Association


A measure of association or relationship is used to determine the degree of
relationship between two variables (x and y). These variables are observed in their
natural setting. They can be neither manipulated nor controlled.
The correlation coefficient takes on values in the range [-1.0, +1.0]. The
quantity r, called the linear correlation coefficient, measures the strength and the
direction of a linear relationship between two variables.

Direction of Relationship
1. Perfect Positive Correlation
If x and y have a strong positive linear correlation, r is close to +1.0. An r value
which is exactly equal to +1.0 indicates a perfect positive fit. Positive values
indicate a relationship between x and y variables such that as values for x
increase, values for y also increase.

2. Perfect Negative Correlation
If x and y have a strong negative linear correlation, r is close to -1.0. An r value
which is exactly equal to -1.0 indicates a perfect negative fit. Negative values
indicate a relationship between x and y variables such that as values for x
increase, values for y decrease, and vice versa.
Some Correlational Techniques
1. Pearson Product-Moment Correlation or Pearson r is used when both
variables are measured on an interval or ratio scale. The computational formula
is given by:
2. Spearman Rank-Order Correlation Coefficient is used when both variables
are measured on an ordinal scale. We may have two scenarios here: (a) original
data are ranked; (b) original data measured on an interval/ratio scale are
converted into ranks. The computational formula is given by:

3. Point Biserial Correlation Coefficient is used when one of the variables is
measured on an interval or ratio scale and the other variable is a dichotomous
variable (a variable that has two categories). The computational formula is given
by:

4. Phi Coefficient or the Four-fold Coefficient is used when both x and y are
dichotomous. The computational formula is given by:
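For reference, the commonly used textbook forms of the four coefficients above are sketched below in LaTeX. The notation is an assumption and not necessarily the module's own: n is the number of paired observations, d is the difference between the two ranks of a unit (no tied ranks), M_1 and M_0 are the means of the interval/ratio variable in the two categories, s_n is the standard deviation of all scores computed with divisor n, p and q are the proportions of units in the two categories, and a, b, c, d are the cell counts of a 2 x 2 table read row by row.

```latex
% 1. Pearson r (computational form)
r = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}

% 2. Spearman rank-order coefficient (no tied ranks)
r_{s} = 1 - \frac{6\sum d^{2}}{n\left(n^{2} - 1\right)}

% 3. Point biserial coefficient
r_{pb} = \frac{M_{1} - M_{0}}{s_{n}}\sqrt{pq}

% 4. Phi (four-fold) coefficient for a 2 x 2 table
\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}
```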
Testing the Significance of an r

5. Chi Square Test for Independence compares two variables in a contingency
table to see if they are related. In a more general sense, it tests to see whether
distributions of categorical variables differ from each other. A very small chi
square test statistic means that the observed data fit the data expected under
independence extremely well, meaning that there is no evidence of a relationship
between the two variables. Conversely, a very large chi square test statistic means
that the observed data do not fit the expected data very well. In other words, there
is a relationship between the two variables.

Illustrative Example

Sample  Sex  Socio-Economic  QPA in  QPA in   Rank in Abstract  Rank in Oral   Attendance to
             Status          Math    English  Reasoning         Communication  Kindergarten
1       M    Poor            1.3     1.8      2                 5              Did Not Attend
2       F    Poor            1.2     1.7      3                 4              Attended
3       M    Non-Poor        1.5     1.5      5                 2              Attended
4       M    Poor            1.4     1.6      4                 3              Did Not Attend
5       F    Non-Poor        1.0     1.2      1                 1              Attended
Utilizing Data Analysis ToolPak in MS Excel

                 QPA in Math   QPA in English
QPA in Math      1
QPA in English   0.485512      1

The computed r value of 0.485512 indicates that there is a moderate positive correlation
between QPA in Math and QPA in English in the sample.
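The Excel result can be checked by hand or with a short script. The sketch below, using only the Python standard library, computes Pearson r for the two QPA columns and the Spearman coefficient for the two rank columns of the illustrative table. The function names are illustrative, and the Spearman shortcut formula assumes no tied ranks, which holds for this table.

```python
from math import sqrt

# Columns copied from the illustrative example (samples 1 to 5).
qpa_math = [1.3, 1.2, 1.5, 1.4, 1.0]
qpa_english = [1.8, 1.7, 1.5, 1.6, 1.2]
rank_abstract = [2, 3, 5, 4, 1]
rank_oral = [5, 4, 2, 3, 1]


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(rank_x, rank_y):
    """Spearman rank-order coefficient; valid only when there are no ties."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


r_math_english = pearson_r(qpa_math, qpa_english)
rho_abstract_oral = spearman_rho(rank_abstract, rank_oral)

print(f"Pearson r (Math vs English): {r_math_english:.6f}")       # 0.485512
print(f"Spearman rho (Abstract vs Oral): {rho_abstract_oral:.2f}")  # 0.00
```

For the rank columns the squared rank differences sum to 20, so the Spearman coefficient is 1 - 120/120 = 0, that is, no monotonic relationship in this small sample.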

ACTIVITY No. 1

Answer the following questions as briefly as possible.


In a survey conducted with university students on a controversial issue, the
following results were obtained:

Analyze using Data Analysis ToolPak. Use 0.05 alpha to test whether their
ACTIVITY No. 2

A random sample of fifty men and fifty women were surveyed as to drinking
habits and classified as alcoholics, heavy drinkers and light drinkers. The results
were:

Analyze using Data Analysis ToolPak. Use 0.05 alpha to test their
independence.

SELF ASSESSMENT QUESTION NO. 1

One hundred individuals, aged 20-58, were given a test of psychomotor skill. Both age and
score were classified as shown in the accompanying table:

SELF ASSESSMENT QUESTION No. 2

Test the relationship of Sex and their Attendance to Kindergarten in the table below:

ANSWERS TO SELF ASSESSMENT QUESTION No. 1


Utilizing the Chi Square Test for Independence, the computed Chi Square is 0.44. The
tabular value is 5.99 at 0.05 alpha with 2 degrees of freedom. Since 0.44 is less than
5.99, we fail to reject the null hypothesis that there is no relationship between age and
psychomotor skill test scores among the one hundred individuals. This implies that the
psychomotor skills of the two age groups do not differ significantly from each other at
the 0.05 level of significance.
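The decision rule for this test can be checked numerically. For exactly 2 degrees of freedom the chi-square distribution has a closed form, P(X > x) = e^(-x/2), so the critical value and p-value need no statistical tables. The sketch below uses the reported statistic (0.44), alpha (0.05) and degrees of freedom (2); the function names are illustrative, and the shortcut is valid only for 2 degrees of freedom.

```python
from math import exp, log

chi_square = 0.44   # computed statistic reported in the answer
alpha = 0.05        # level of significance


def chi2_upper_tail_df2(x):
    """P(X > x) for a chi-square variable with exactly 2 degrees of freedom."""
    return exp(-x / 2)


def chi2_critical_df2(level):
    """Critical value whose upper-tail area equals `level` (2 df only)."""
    return -2 * log(level)


critical_value = chi2_critical_df2(alpha)     # about 5.99
p_value = chi2_upper_tail_df2(chi_square)     # about 0.80
reject_null = chi_square > critical_value

print(f"critical value: {critical_value:.3f}")
print(f"p-value: {p_value:.4f}")
print("reject H0" if reject_null else "fail to reject H0")
```

A computed statistic below the critical value (equivalently, a p-value above alpha) means the null hypothesis of independence is not rejected.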
ANSWERS TO SELF ASSESSMENT QUESTION No. 2

The correlational coefficient value of 0.666667 suggests that there is a moderate correlation
between Sex and Attendance to Kindergarten of the sampled population.
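Because both Sex and Attendance to Kindergarten are dichotomous, the coefficient here is a phi coefficient. The sketch below rebuilds the 2 x 2 table from the five sampled students in the illustrative example and applies the usual four-fold formula. The row and column ordering (female first, attended first) is an assumption; reversing either ordering flips the sign but not the magnitude.

```python
from math import sqrt

# Sex and kindergarten attendance of samples 1 to 5 in the illustrative table.
sex = ["M", "F", "M", "M", "F"]
attendance = ["Did Not Attend", "Attended", "Attended", "Did Not Attend", "Attended"]

pairs = list(zip(sex, attendance))

# 2 x 2 cell counts: rows = sex (F, M), columns = attendance (yes, no).
a = pairs.count(("F", "Attended"))
b = pairs.count(("F", "Did Not Attend"))
c = pairs.count(("M", "Attended"))
d = pairs.count(("M", "Did Not Attend"))

phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(f"cell counts: a={a}, b={b}, c={c}, d={d}")
print(f"phi = {phi:.6f}")   # 0.666667
```

With only five observations the coefficient is very unstable, so it is best read as an illustration of the computation rather than as evidence about students in general.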

Examples for Self-Assessment Questions were taken from the book: Probability & Statistics.
Ymas Jr., Sergio E. Sta Monica Printing Corporation. Manila, Philippines. 2009.

Linear Regression
Regression is primarily used to build models/equations to predict a key response, Y,
from a set of predictor (X) variables. Correlation is primarily used to quickly and concisely
summarize the direction and strength of the relationships between a set of 2 or more
numeric variables.
Regression describes how an independent variable is numerically related to the
dependent variable. Correlation is used to represent the linear relationship between two
variables. In contrast, regression is used to fit the best line and estimate one
variable on the basis of another variable.
Use correlation for a quick and simple summary of the direction and strength of the
relationship between two or more numeric variables. Use regression when you're
looking to predict, optimize, or explain a numeric response between the variables (how x
influences y).
When investigating the relationship between two or more variables, it is important
to know the difference between correlation and regression. Correlation quantifies the
direction and strength of the relationship between two numeric variables, X and Y, through
a coefficient whose value always lies between -1.0 and 1.0. Meanwhile, simple linear
regression relates X and Y through an equation of the form y = a + bx.

Illustrative Examples
A researcher would like to know which among the high school grade, pre-board
rating, age and college grade are predictors of the board rating.
Let us try to simulate the analysis by encoding the data below.
Name      Board   High School  Pre-Board  Age  College
          Rating  Grade        Rating          Grade
Zsakira   90      94           88         30   86
Wajid     92      90           91         32   90
Ysabelle  95      92           92         24   93
Xhandra   93      88           90         22   91
Zhnarah   88      90           86         21   89
Gio       91      92           90         24   93
Airah     93      90           92         25   94
Wilxon    96      88           94         23   94
Wlei      99      89           97         22   97
Vinh      94      90           91         21   92
Fairuz    89      92           91         32   91
Adrian    95      91           94         40   93
Shairah   98      90           96         34   96

EXCEL VIEW

Encode the data using five columns, first column for the dependent variable (board
rating) and the remaining columns for the independent variables (high school grade,
pre-board rating, age, and college grade).
Figure 6.1 Data View: Encoded Data

To analyze the data we need to follow these steps.

Regression Data Analysis Tool Steps

1. From the Tool bar, click Data\Data Analysis\Regression.
2. Click OK.
3. Click in the Input Y Range and select the range of the dependent variable in
the first column including the label.
4. Click in the Input X Range and select the range of the independent variables in
the remaining columns including the labels.
5. Click in Labels.
6. Click OK.
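The same fit can be sketched outside Excel. The block below is a minimal ordinary least squares sketch in plain Python, not Excel's implementation: it centers the predictors, solves the normal equations by Gaussian elimination, and then recovers the intercept. The function names are illustrative. The rows are copied from the table in this module, so the estimates should be comparable to the Excel regression output, provided the table is transcribed correctly.

```python
# Columns: board rating (Y), high school grade, pre-board rating, age, college grade.
ROWS = [
    (90, 94, 88, 30, 86), (92, 90, 91, 32, 90), (95, 92, 92, 24, 93),
    (93, 88, 90, 22, 91), (88, 90, 86, 21, 89), (91, 92, 90, 24, 93),
    (93, 90, 92, 25, 94), (96, 88, 94, 23, 94), (99, 89, 97, 22, 97),
    (94, 90, 91, 21, 92), (89, 92, 91, 32, 91), (95, 91, 94, 40, 93),
    (98, 90, 96, 34, 96),
]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_ols(rows):
    """Return (intercept, slopes, r_square, residuals) for Y on the X columns."""
    n = len(rows)
    y = [row[0] for row in rows]
    xs = [row[1:] for row in rows]
    k = len(xs[0])
    y_mean = sum(y) / n
    x_means = [sum(x[j] for x in xs) / n for j in range(k)]
    # Centering keeps the normal equations well conditioned.
    xc = [[x[j] - x_means[j] for j in range(k)] for x in xs]
    yc = [value - y_mean for value in y]
    xtx = [[sum(r[i] * r[j] for r in xc) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(xc, yc)) for i in range(k)]
    slopes = solve(xtx, xty)
    intercept = y_mean - sum(b * m for b, m in zip(slopes, x_means))
    fitted = [intercept + sum(b * v for b, v in zip(slopes, x)) for x in xs]
    residuals = [v - f for v, f in zip(y, fitted)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum(v ** 2 for v in yc)
    return intercept, slopes, 1 - ss_res / ss_tot, residuals


intercept, slopes, r_square, residuals = fit_ols(ROWS)
print("intercept:", round(intercept, 4))
print("slopes (HS grade, pre-board, age, college):", [round(b, 4) for b in slopes])
print("R Square:", round(r_square, 3))
```

Running the block prints the intercept, the four slopes and R Square, which can then be compared with the Regression Statistics and Coefficients tables produced by the ToolPak.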

EXCEL GUIDE

From the Tool bar, click Data\Data Analysis\Regression\OK.

Figure 6.2 Data, Data Analysis, Regression

Figure 6.3 Regression Dialog Box

Click in the Input Y Range and select the range of the dependent variable in the first
column including the label.

Figure 6.4 Input Y Range A1:A14, including the labels

Click in the Input X Range and select the range of the independent variables in the
remaining columns.

Figure 6.5 Input X Range B1:E14, including the labels

Click in Labels and then click OK.

Figure 6.6 Labels in First Row

Result

Table 6.1 Regression Output

Table 6.2 Regression Statistics

R Square equals 0.893, which is a good fit: 89.3% of the variation in the dependent
variable (board rating) is explained by the independent variables (high school grade,
pre-board rating, age, college grade).

F-Value, Probability Value

Table 6.3 ANOVA

Since the value of the Significance F (0.00059) is less than 0.05, the regression
model as a whole is statistically significant.
Had the Significance F been greater than 0.05, it would be better to stop using this
set of independent variables. You may delete some variables and/or add other
variables.
Regression Line Coefficients

Table 6.4 Coefficients

Based on the probability values, only the Pre-Board Rating, with a p-value of 0.0052, is
below 0.05, which makes it the only significant predictor of the board rating.

The regression line:

Y = 13.482 − 0.0995 (High School Grade) + 1.2856 (Pre-Board Rating) − 0.1424 (Age) − 0.2738 (College Grade)

In other words, holding the other variables constant, for each unit increase in high
school grade, board rating decreases by 0.0995. For each unit increase in Pre-Board
Rating, board rating increases by 1.2856. For each unit increase in age, board rating
decreases by 0.1424. For each unit increase in college grade, board rating decreases
by 0.2738.

The regression line can also be used to forecast or predict the dependent
variable based on the given independent variables by simply substituting the values.
For example, you would like to predict the board rating of a student whose high
school grade is 90, pre-board rating is 80, 30 years old and with a college grade of 85.
Y = 13.482 − 0.0995 (90) + 1.2856 (80) − 0.1424 (30) − 0.2738 (85) = 79.83
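The substitution can be wrapped in a small function so that other students can be predicted the same way. This is a minimal sketch: the coefficients are the ones reported in the regression line above, while the function and dictionary names are illustrative.

```python
# Coefficients of the fitted regression line reported in this module.
COEFFICIENTS = {
    "intercept": 13.482,
    "high_school_grade": -0.0995,
    "pre_board_rating": 1.2856,
    "age": -0.1424,
    "college_grade": -0.2738,
}


def predict_board_rating(high_school_grade, pre_board_rating, age, college_grade):
    """Substitute the predictor values into the regression line."""
    return (COEFFICIENTS["intercept"]
            + COEFFICIENTS["high_school_grade"] * high_school_grade
            + COEFFICIENTS["pre_board_rating"] * pre_board_rating
            + COEFFICIENTS["age"] * age
            + COEFFICIENTS["college_grade"] * college_grade)


predicted = predict_board_rating(
    high_school_grade=90, pre_board_rating=80, age=30, college_grade=85
)
print(f"Predicted board rating: {predicted:.2f}")   # 79.83
```

Predictions are most trustworthy for values inside the range of the encoded data; a pre-board rating of 80 lies below every pre-board rating in the sample, so this particular prediction is an extrapolation.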


EXERCISE No. 1

The following data relate the selling price Y to the living space x1, lot size x2, and
the number of bathrooms x3, for 10 recently sold homes in a common area.

Fit a multiple linear regression model to the above data.


Predict the selling price of a home of 60 square meter house size, 80 square meter
lot size, and with 2 bathrooms.

EXERCISE No. 2

A researcher would like to know whether the profile of the respondents in terms of
age, number of children, and distance from work predicts their performance in a
certain company.

Fit a multiple linear regression model to the above data.


Predict the performance of an employee who is 30 years old with 5 children and 10
kilometers away from work.
SELF ASSESSMENT QUESTION

A recently completed study attempted to relate job satisfaction to income and years
in service for a random sample of 10 workers.

Fit a multiple linear regression model to the following data set.

ANSWER TO SELF ASSESSMENT No. 1

Reference: http://www.graphpad.com
