Math 1 Module 4
Data Management
LEARNING OUTCOMES
Definition of Statistics
plural sense: numerical facts, e.g. CPI, peso-dollar exchange rate
singular sense: scientific discipline consisting of theory and methods for processing numerical
information that one can use when making decisions in the face of uncertainty.
History of Statistics
The term statistics came from the Latin phrase “ratio status” which means study of
practical politics or the statesman’s art.
In the middle of the 18th century, the term statistik (a term due to Achenwall) was used, a
German term defined as “the political science of several countries.”
From statistik it became statistics defined as a statement in figures and facts of the
present condition of a state.
Application of Statistics
Diverse applications
“During the 20th Century statistical thinking and methodology have become the scientific
framework for literally dozens of fields including education, agriculture, economics, biology, and
medicine, and with increasing influence recently on the hard sciences such as astronomy,
geology, and physics. In other words, we have grown from a small obscure field into a big obscure
field.” – Brad Efron
Comparing the effects of five kinds of fertilizers on the yield of a particular variety of corn
Determining the income distribution of Filipino families
Areas of Statistics
Descriptive statistics
methods concerned with collecting, describing, and analyzing a set of data without
drawing conclusions (or inferences) about a large group.
Example of Descriptive Statistics
Present the Philippine population by constructing a graph indicating the total number of Filipinos
counted during the last census by age group and sex
Example of Inferential Statistics
A new milk formulation designed to improve the psychomotor development of infants was
tested on randomly selected infants.
Based on the results, it was concluded that the new milk formulation is effective in improving the
psychomotor development of infants.
(n units/observations)
3. Look for any printed material and identify the statistics mentioned in the material
and classify them as to whether it is descriptive or inferential statistics. [12]
Qualitative variable
non-numerical values
Quantitative variable
numerical values
VARIABLES
a. Discrete
countable
b. Continuous
measurable
c. Constant
1. Nominal scale
Numbers or symbols used to classify
Examples are sex, marital status, occupation,
nationality, etc.
2. Ordinal scale
Accounts for order; no indication of distance
between positions.
Examples are curriculum level, socio-economic
status, military ranks, Latin honors, etc.
3. Interval scale
Equal intervals; no absolute zero.
Examples are temperature, test scores, etc.
4. Ratio scale
Has absolute zero.
Examples are bank account, cellphone load, etc.
Enumerate five (5) variables that you can think of and classify each as
qualitative or quantitative. If quantitative, state whether it is discrete or
continuous. State the level of measurement of each variable. [15]
1. __________________________
2. __________________________
3. __________________________
4. __________________________
5. _________________________
Measurement is the process of determining the value or label of the variable based
on what has been observed.
For example, we can measure the educational level of a person by using the
International Standard Classification of Education designed by UNESCO:
Subjective Method
Use of Existing
Records
Textual
Tabular
Graphical
Sketch a pie chart on your own monthly family income and expenditures. [20]
Identify whether the given situation belongs to the area of descriptive statistics or
inferential statistics. [20]
1. Synchronous vs Asynchronous Learning: Their Effects in the Teaching-
Learning Process
2. Average of a student in his 10 subjects
3. Statistics on COVID-19 cases in the world
4. Effect of music in reviewing for the exams
5. One wishes to find out which gives a better salary between companies in the
rural areas or urban areas
6. Enrolment rate in tertiary private institutions
7. Percentage of PUIs by municipality in the Province of Rizal
8. Impact of COVID 19 Pandemic in the life of tertiary students
9. Average sales for the first quarter of 2020
10. Amount of time spent in studying vs success of passing
2. Verbal Ability Test Scores and Math Ability Test Scores of ten (10) students in a
certain class. [15]
Identify whether the given situation belongs to the area of descriptive statistics or
inferential statistics. [20]
[Chart: default title "Chart Title"; categories: Accountancy, Business, Computer Studies; two vertical axes, 0 to 1400 and 0 to 1600]
[Chart: default title "Chart Title"; vertical axis 78 to 96, horizontal axis 0 to 12]
Reference: Slides Presentation Used During The Training on Teaching Basic Statistics for Tertiary
Level Teachers Summer 2008
Most of the slides were taken from Elementary Statistics: A Handbook of Slide Presentation
prepared by ZVJ Albacea, CE Reano, RV Collado, LN Comia, NA Tandang in 2005 for the Institute
of Statistics, CAS UP Los Baños
URS-IM-AA-CI-0167 Rev 00 Effective Date: August 24, 2020
LEARNING OUTCOMES
General Objectives
Specific Objectives
As a result of this lesson, students should be able to:
1. Analyze data using Data Analysis ToolPak and other functions in MS Excel;
2. Explain and interpret the results of the data analysis.
Descriptive Statistics
A descriptive statistic (in the count noun sense) is a summary statistic that
quantitatively describes or summarizes features of a collection of information,
while descriptive statistics (in the mass noun sense) is the process of using and
analyzing those statistics. Descriptive statistics is distinguished from inferential (or
inductive) statistics by its aim to summarize a sample rather than use the data to learn
about the population that the sample is thought to represent. This generally means
that descriptive statistics, unlike inferential statistics, are not developed on the basis of
probability theory and are frequently non-parametric. Even when a data analysis
draws its main conclusions using inferential statistics, descriptive statistics are generally
also presented. For example, in papers reporting on human subjects, typically a table is
included giving the overall sample size, sample sizes in important subgroups (e.g., for
each treatment or exposure group), and demographic or clinical characteristics such as
the average age, the proportion of subjects of each sex, the proportion of subjects with
related co-morbidities, etc. https://en.wikipedia.org/wiki/Descriptive_statistics
Mean
Mode
Variance
What is Symmetry?
Deciles
Quartiles
Measures of Variation
Encoded Data
Data Interpretation
Based on the summary measures, the distribution of weight (in pounds) has a mean of
145.13 and a standard deviation of approximately 18.67. It is positively skewed
(skewness 0.15 is greater than 0) and platykurtic (kurtosis -1.32 is less than 0).
ACTIVITY NO. 2
Fifty families were surveyed and the number of children x was recorded for each
family as follows:
0,1,2,3,4,2,2,2,3,3,4,5,6,1,0,1,6,2,5,4,3,0,1,2,3,3,3,6,4,2,6,2,1,5,3,0,0,2,5,6,1,0,1,2,5,3
,4,2,2,3
Mean 2.72
Standard Error 0.255614506
Median 2.5
Mode 2
Standard Deviation 1.807467503
Sample Variance 3.266938776
Kurtosis -0.771635469
Skewness 0.308046539
Range 6
Minimum 0
Maximum 6
Sum 136
Count 50
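The summary measures above can be reproduced outside Excel. The following is a minimal sketch in Python using only the standard library. The helper names `excel_skew` and `excel_kurt` are hypothetical; they assume the bias-corrected sample formulas that Excel documents for its SKEW and KURT functions, which appear to be what the Descriptive Statistics tool reports.

```python
import math
import statistics

# Number of children in each of the fifty surveyed families (Activity No. 2).
children = [
    0, 1, 2, 3, 4, 2, 2, 2, 3, 3, 4, 5, 6, 1, 0, 1, 6, 2, 5, 4,
    3, 0, 1, 2, 3, 3, 3, 6, 4, 2, 6, 2, 1, 5, 3, 0, 0, 2, 5, 6,
    1, 0, 1, 2, 5, 3, 4, 2, 2, 3,
]


def excel_skew(data):
    """Sample skewness, assuming the formula documented for Excel's SKEW."""
    n = len(data)
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in data)


def excel_kurt(data):
    """Sample excess kurtosis, assuming the formula documented for Excel's KURT."""
    n = len(data)
    m = statistics.mean(data)
    s = statistics.stdev(data)
    fourth = sum(((x - m) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


summary = {
    "Mean": statistics.mean(children),
    "Standard Error": statistics.stdev(children) / math.sqrt(len(children)),
    "Median": statistics.median(children),
    "Mode": statistics.mode(children),
    "Standard Deviation": statistics.stdev(children),
    "Sample Variance": statistics.variance(children),
    "Kurtosis": excel_kurt(children),
    "Skewness": excel_skew(children),
    "Range": max(children) - min(children),
    "Minimum": min(children),
    "Maximum": max(children),
    "Sum": sum(children),
    "Count": len(children),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A positive skewness together with a negative kurtosis would be read the same way as in the Data Interpretation example: positively skewed and platykurtic.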
b. Interpret results.
Consider the systolic blood pressures of 12 smokers and 12 non-smokers, as follows:
Smokers: 112,146,120,114,124,126,118,128,130,134,116,130
Non-Smokers: 114,134,114,116,138,110,112,116,132,126,108,116
SMOKERS
Mean 124.8333333
Standard Error 2.790224639
Median 125
Mode 130
Standard Deviation 9.665621678
Sample Variance 93.42424242
Kurtosis 0.643159284
Skewness 0.731605702
Range 34
Minimum 112
Maximum 146
Sum 1498
Count 12
NON-SMOKERS
Mean 119.6666667
Standard Error 2.921532737
Median 116
Mode 116
Standard Deviation 10.12048627
Sample Variance 102.4242424
Kurtosis -0.830332271
Skewness 0.819101367
Range 30
Minimum 108
Maximum 138
Sum 1436
Count 12
b. Compare and interpret results.
Based on the summary measures for smokers, their systolic blood pressure, with a
mean of 124.83 and a standard deviation of 9.67, follows a positively skewed (0.73 is
greater than 0) and leptokurtic (0.64 is greater than 0) distribution. Meanwhile, the
systolic blood pressure of non-smokers, with a mean of 119.67 and a standard deviation
of 10.12, is also positively skewed (0.82 is greater than 0) but platykurtic (-0.83 is
less than 0). In this survey, smokers have the higher mean systolic blood pressure, and
their readings are slightly less spread out around the mean (9.67 versus 10.12) than
those of the non-smokers.
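The comparison can be checked with a short Python sketch using the standard library. One assumption is made about the data: the first smoker reading is taken as 112, because the reported minimum (112), sum (1498), and sample variance are only reproduced with that value. The helper name `describe` is hypothetical.

```python
import statistics

# Systolic blood pressures. The first smoker value is ASSUMED to be 112,
# the value implied by the reported Minimum (112) and Sum (1498).
smokers = [112, 146, 120, 114, 124, 126, 118, 128, 130, 134, 116, 130]
non_smokers = [114, 134, 114, 116, 138, 110, 112, 116, 132, 126, 108, 116]


def describe(values):
    """Return the summary measures used in the comparison."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
    }


smoker_summary = describe(smokers)
non_smoker_summary = describe(non_smokers)

print("Smokers:", smoker_summary)
print("Non-smokers:", non_smoker_summary)

# The group with the smaller standard deviation is less spread out around its mean.
less_dispersed = (
    "smokers" if smoker_summary["stdev"] < non_smoker_summary["stdev"] else "non-smokers"
)
print("Less dispersed group:", less_dispersed)
```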
LEARNING OUTCOMES
Direction of Relationship
1. Perfect Positive Correlation
If x and y have a strong positive linear correlation, r is close to +1.0. An r value
which is exactly equal to +1.0 indicates a perfect positive fit. Positive values
indicate a relationship between x and y variables such that as values for x increase,
values for y also increase.
4. Phi Coefficient or the Four-fold Coefficient is used when both x and y are
dichotomous. The computational formula is given by:
5. Chi Square Test for Independence compares two variables in a contingency table
to see if they are related. In a more general sense, it tests whether
distributions of categorical variables differ from each other. The expected counts
are computed on the assumption that the two variables are independent, so a very
small chi square test statistic means that the observed data fit the expected data
extremely well; in other words, there is no evidence of a relationship between the
two variables. Conversely, a very large chi square test statistic means that the data
do not fit the expected counts well, which suggests that the two variables are related.
                 QPA in Math   QPA in English
QPA in Math      1
QPA in English   0.485512      1
The computed r value of 0.485512 indicates a moderate positive correlation between
QPA in Math and QPA in English in the sampled population.
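The r value reported by the Correlation tool is the Pearson correlation coefficient. A minimal sketch of the computation follows. The QPA values are hypothetical and chosen only for illustration, since the original data are not shown here; `pearson_r` is likewise a hypothetical helper name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL QPA values for six students (not the data behind 0.485512).
qpa_math = [1.25, 1.5, 2.0, 2.25, 2.75, 3.0]
qpa_english = [1.5, 1.25, 2.25, 1.75, 2.0, 2.5]

r = pearson_r(qpa_math, qpa_english)
print(f"r = {r:.4f}")

# A perfectly linear increasing relationship gives r = +1.0.
print(pearson_r([1, 2, 3], [2, 4, 6]))
```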
Analyze using Data Analysis ToolPak. Use 0.05 alpha to test whether their
opinions differ or not.
ACTIVITY No. 2
A random sample of fifty men and fifty women were surveyed as to drinking
habits and classified as alcoholics, heavy drinkers and light drinkers. The results
were:
Analyze using Data Analysis ToolPak. Use 0.05 alpha to test their
independence.
One hundred individuals, aged 20-58, were given a test of psychomotor skill. Both age and
score were classified as shown in the accompanying table:
         Score
Age      High   Average   Low
40-59    23     20        17
20-39    18     12        10
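As an illustration of how the chi square statistic for independence is built from a table like this one, here is a minimal Python sketch. It computes the expected counts under independence and sums (observed - expected)^2 / expected. The function name is hypothetical. The p-value uses the closed form exp(-x/2), which holds only for 2 degrees of freedom, the case for a 2 x 3 table.

```python
import math

# Observed counts: rows are age groups (40-59, 20-39),
# columns are score levels (High, Average, Low).
observed = [
    [23, 20, 17],
    [18, 12, 10],
]


def chi_square_independence(table):
    """Return the chi square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


statistic, df = chi_square_independence(observed)

# For df = 2 only, the upper-tail chi square probability is exp(-x / 2).
p_value = math.exp(-statistic / 2) if df == 2 else None

alpha = 0.05
print(f"chi square = {statistic:.4f}, df = {df}, p-value = {p_value}")
if p_value is not None:
    print("Reject independence" if p_value < alpha else "Do not reject independence")
```

For tables with other degrees of freedom, the p-value would have to come from a chi square table or a statistics library rather than this shortcut.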
Test the relationship of Sex and their Attendance to Kindergarten in the table below:
Sample   Sex   Attendance to Kindergarten
1        0     0
2        1     1
3        0     1
4        0     0
5        1     1
Coding: Sex: M = 0, F = 1; Attendance to Kindergarten: did not attend = 0, attended = 1
                             Sex        Attendance to Kindergarten
Sex                          1
Attendance to Kindergarten   0.666667   1
The correlation coefficient value of 0.666667 suggests that there is a moderate correlation
between Sex and Attendance to Kindergarten of the sampled population.
Examples for Self-Assessment Questions were taken from the book: Probability & Statistics.
Ymas Jr., Sergio E. Sta Monica Printing Corporation. Manila, Philippines. 2009.
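For two dichotomous variables coded 0 and 1, the phi coefficient equals the Pearson correlation computed on the codes, which is why the Correlation tool can be used here. The sketch below, offered as an illustration only, computes phi from the 2 x 2 cell counts using the commonly cited form (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)). The cell labels a, b, c, d are an assumed convention, not taken from the module.

```python
import math

# Coded data from the five sampled cases: Sex (M = 0, F = 1) and
# Attendance to Kindergarten (did not attend = 0, attended = 1).
sex = [0, 1, 0, 0, 1]
attended = [0, 1, 1, 0, 1]

pairs = list(zip(sex, attended))

# Cell counts of the 2 x 2 table (labels a-d are an assumed convention).
a = pairs.count((0, 0))  # male, did not attend
b = pairs.count((0, 1))  # male, attended
c = pairs.count((1, 0))  # female, did not attend
d = pairs.count((1, 1))  # female, attended

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(f"a={a}, b={b}, c={c}, d={d}, phi={phi:.6f}")
```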
Linear Regression
Regression is primarily used to build models/equations to predict a key response, Y,
from a set of predictor (X) variables. Correlation is primarily used to quickly and concisely
summarize the direction and strength of the relationships between a set of 2 or more
numeric variables.
Regression describes how an independent variable is numerically related to the
dependent variable. Correlation is used to represent the linear relationship between two
variables. In contrast, regression is used to fit the best line and estimate one variable
on the basis of another variable.
Use correlation for a quick and simple summary of the direction and strength of the
relationship between two or more numeric variables. Use regression when you're looking
to predict, optimize, or explain a numeric response between the variables (how x
influences y).
When investigating the relationship between two or more variables, it is important
to know the difference between correlation and regression. Correlation quantifies the
direction and strength of the relationship between two numeric variables, X and Y,
through a coefficient whose value always lies between -1.0 and +1.0. Meanwhile, simple
linear regression relates X and Y through an equation of the form y = a + bx.
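To make the form y = a + bx concrete, the following minimal sketch fits a least-squares line by hand. The data and the function name are hypothetical and used only for illustration. The slope is b = Sxy / Sxx and the intercept is a = mean(y) - b * mean(x).

```python
def fit_simple_linear_regression(x, y):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx             # slope: change in y per unit change in x
    a = mean_y - b * mean_x   # intercept: fitted y when x = 0
    return a, b


# HYPOTHETICAL data: x could be hours of study, y a quiz score.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_simple_linear_regression(x, y)
print(f"y = {a:.2f} + {b:.2f}x")

# Using the fitted line to estimate y for a new x value.
new_x = 6
print(f"Estimated y at x = {new_x}: {a + b * new_x:.2f}")
```

Correlation would summarize how strongly x and y move together, while the fitted line is what allows one variable to be estimated from the other.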
EXCEL VIEW
Encode the data using five columns, first column for the dependent variable (board
rating) and the remaining columns for the independent variables (high school grade,
pre-board rating, age, and college grade).
Figure 6.1 Data View: Encoded Data
EXCEL GUIDE
Figure 6.2 Data, Data Analysis, Regression
Figure 6.3 Regression Dialog Box
Click in the Input Y Range and select the range of the dependent variable in the first
column including the label.
Figure 6.4 Input Range, including the labels (Input Range A1:A14)
Figure 6.6 Labels in First Row
Table 6.1 Regression Output
Table 6.2 Regression Statistics
R Square equals 0.893, which indicates a good fit: 89.3% of the variation in the dependent
variable (board rating) is explained by the independent variables (high school grade,
pre-board rating, age, college grade).
Since the value of Significance F (0.00059) is less than 0.05, the regression model as a
whole is statistically significant.
Had Significance F been greater than 0.05, it would be better to stop using this set of
independent variables. You may delete some variables and/or add other variables.
Table 6.4
Coefficients
Based on the probability values, only the Pre-Board Rating, with a p-value of 0.0052, is
below 0.05, which makes it the only statistically significant predictor of the board rating.
In terms of the coefficients, and holding the other variables constant, for each unit
increase in high school grade, board rating decreases by 0.0995. For each unit increase in
Pre-Board Rating, board rating increases by 1.2856. For each unit increase in age, board
rating decreases by 0.1424. For each unit increase in college grade, board rating
decreases by 0.2738.
The regression line can also be used to forecast or predict the dependent variable
based on the given independent variables by simply substituting the values.
For example, you would like to predict the board rating of a student whose high
school grade is 90, pre-board rating is 80, 30 years old and with a college grade of 85.
Y = 13.482 - 0.0995 (High School Grade) + 1.2856 (Pre-Board Rating) - 0.1424 (Age)
- 0.2738 (College Grade)
Y = 13.482 - 0.0995 (90) + 1.2856 (80) - 0.1424 (30) - 0.2738 (85) = 79.83
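The substitution above can be written as a small Python function. This is only a sketch: the coefficients are copied from the regression output quoted in the text, while the function and dictionary names are hypothetical.

```python
# Coefficients as reported in the regression output discussed above.
COEFFICIENTS = {
    "intercept": 13.482,
    "high_school_grade": -0.0995,
    "pre_board_rating": 1.2856,
    "age": -0.1424,
    "college_grade": -0.2738,
}


def predict_board_rating(high_school_grade, pre_board_rating, age, college_grade):
    """Substitute the predictor values into the fitted regression equation."""
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["high_school_grade"] * high_school_grade
        + COEFFICIENTS["pre_board_rating"] * pre_board_rating
        + COEFFICIENTS["age"] * age
        + COEFFICIENTS["college_grade"] * college_grade
    )


# The student from the worked example: HS grade 90, pre-board rating 80,
# 30 years old, college grade 85.
predicted = predict_board_rating(90, 80, 30, 85)
print(f"Predicted board rating: {predicted:.2f}")
```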
1. The following data relate the selling price Y to the living space x1, lot size x2,
and the number of bathrooms x3, for 10 recently sold homes in a common
area.
Selling Price      House Size       Lot Size         Number of
(Million Pesos)    (Square Meter)   (Square Meter)   Bathrooms
1.8                48               52               2
2.2                54               60               2
3.4                52               65               3
4.3                50               100              3
6.5                100              250              4
10.2               120              500              6
EXERCISE No. 2
Reference: http://www.graphpad.com