
Mathematics

in the Modern World


Module 9
Statistics: Data Management
Introduction
This chapter covers two of the most important statistical tools
in data management. The first part discusses the normal
distribution and the empirical rule to solve application problems.
The second part tackles regression and correlation, two
statistical techniques used to establish the linear association
of variables.
Chapter Objectives
At the end of this chapter, the students should be able to:
1. use empirical rules to solve an application problem.
2. use the normal distribution to solve an application problem
involving probabilities.
3. determine the linear regression model.
4. use linear regression to make predictions.
Introduction: Statistics Overview
STATISTICS
• Collection
• Organization
• Presentation
• Analysis
• Interpretation
Uses: Make decisions; Solve problems; Design products and processes
Introduction: Statistics Overview
STATISTICS is the science of learning information from data.
STATISTICS uses inductive reasoning.
[Diagram: POPULATION and SAMPLE]
Introduction: Statistics Overview
STATISTICS
-DESCRIPTIVE
-INFERENTIAL
  -Estimation (Point, Interval)
  -Hypothesis Testing
Probability
The Process of Statistics
[Diagram: a SAMPLE is drawn from the POPULATION (Sampling Theory).
Descriptive Statistics and Inferential Statistics connect the sample
STATISTIC to the population PARAMETER.]
The Process of Statistics
Problem Definition

Data Gathering

Data Analysis

Data Interpretation
Problem Definition: It can suggest the type of data that will be
involved in the research process.
Data Gathering: It determines the precision with which pertinent
information will be collected (retrospective, observational,
designed experiment).
Data Analysis
Statistical Objective?
-describe
-identify/classify
-compare/test
-predict
-explain
Number of Variables?
-one
-two
-more…
Type of Variable?
-independent
-dependent
-intervening
Level of Measurement?
-nominal
-ordinal
-interval
-ratio
Section 1: Normal Distribution and the Central Limit Theorem
Inferential statistics uses sampling distributions to draw
conclusions about a given population based on the
analysis of random samples. One of the most important
topics in sampling distributions is the central limit
theorem.
Components of Statistical Research
Design – the researcher must know the appropriate statistical methods to
carry out a plan, implement rules, and evaluate experiments properly.
Description – the researcher must know how to guide readers in
understanding the methods of the research and in analyzing its results.
Inference – the researcher must use the results of data analysis to make
good predictions and correct decisions.
In addition, the researcher must do his or her best to have minimal
experimental errors to obtain high precision and a high degree of
reliability. This can only be achieved if the experiment is well planned and
implemented.
The Normal Distribution
The normal distribution is perhaps the most commonly used
continuous probability distribution in the entire field of statistics.
Consider an experiment that generates continuous (interval-level) data,
for example, selecting random students in a class and recording their
heights. With a sufficiently large sample (say, at least 30 students),
the majority of the students typically have heights that are close to
the average, while only a few have “extreme” measures (either very tall
or very short students).
Properties of Normal Distribution
• The shape of the distribution is a bell-shaped curve.
• The curve is symmetric with respect to the middle value.
• All three central measures (mean, median, mode) coincide at the middle.
• The span of the “bell” is determined by the standard deviation of the
distribution; the larger the standard deviation, the wider the span (or
range) of the “bell”. (The notation 𝜎 is used for this purpose.)
• The curve is asymptotic to the horizontal axis, which means that a value
that is far from the central value has a small relative frequency.
• The area under the curve is 1.
The Empirical Rule for Normal Distribution

• About 68.3% of the population falls within the interval 𝜇 ± 𝜎.


• About 95.4% of the population falls within the interval 𝜇 ± 2𝜎.
• About 99.7% of the population falls within the interval 𝜇 ± 3𝜎.
Where 𝜇 is the population mean and 𝜎 is the population standard
deviation.
Example
Suppose the heights of 40 students are normally distributed with a mean
of 136 cm and a standard deviation of 8 cm. How many students have
heights ranging from
a. 128 cm to 144 cm?
b. 120 cm to 152 cm ?
c. 112 cm to 160 cm?
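Example: Solution
Here 𝜇 = 136 cm and 𝜎 = 8 cm, so each interval has the form 𝜇 ± k𝜎
and the Empirical Rule applies.
a. 128 cm to 144 cm is 𝜇 ± 𝜎: about 68.3% of 40, or (0.683)(40) ≈ 27 students.
b. 120 cm to 152 cm is 𝜇 ± 2𝜎: about 95.4% of 40, or (0.954)(40) ≈ 38 students.
c. 112 cm to 160 cm is 𝜇 ± 3𝜎: about 99.7% of 40, or (0.997)(40) ≈ 40 students.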
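The counts in this example can be cross-checked numerically. The following is a minimal sketch, assuming Python 3.8 or later (for `statistics.NormalDist`); the helper name `expected_count` is hypothetical and not part of the module.

```python
from statistics import NormalDist

# Values taken from the example: 40 students, mean 136 cm, SD 8 cm.
n_students = 40
heights = NormalDist(mu=136, sigma=8)

def expected_count(low, high):
    """Expected number of students with height between low and high (cm)."""
    share = heights.cdf(high) - heights.cdf(low)  # area under the curve
    return n_students * share

for low, high in [(128, 144), (120, 152), (112, 160)]:
    count = round(expected_count(low, high))
    print(f"{low} cm to {high} cm: about {count} students")
```

The exact areas differ slightly from the rounded percentages of the Empirical Rule, but the rounded student counts should agree.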
The Standard Normal Distribution
While a variable X can be used to refer to a normal random variable,
we use Z to represent the standard normal variable, which has mean 0
and standard deviation 1.
[Figure: normal curve for X and the corresponding curve for Z]
 The Empirical Rule in
Areas Under the Normal Curve
TABULAR AREA = Probability
$P(Z < 1.18) = 0.8810$
Remark
Tabular values corresponding to z-values identified by the
row-label and column-label represent either
a) area under the curve to the left of z; or
b) cumulative probability for all values less than z
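A tabular value can also be reproduced in software. This is a small sketch, assuming Python 3.8 or later; it evaluates the cumulative area to the left of z = 1.18 for the standard normal curve.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1
p = Z.cdf(1.18)    # area under the curve to the left of z = 1.18
print(f"P(Z < 1.18) = {p:.4f}")
```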
Formula
Let $X$ be a normally distributed variable with mean $\mu$ and
standard deviation $\sigma$. Then any value $x$ of $X$ can be transformed
into a standard normal score using the formula
$$z = \frac{x - \mu}{\sigma}$$
Example
A statistics examination was administered to two sections, Section ABC
and Section XYZ. In Section ABC, the average score of the students
was 85 with a standard deviation of 4. In Section XYZ, the average was
83 with standard deviation of 3. Kara and Mia, who belong to ABC and
XYZ respectively, both scored 87 in the said examination. Who scored
better in terms of their relative position in their respective sections?
Assume that test scores in the 2 sections are normally distributed.
Solution
For Kara: $z = \frac{87 - 85}{4} = 0.50$
For Mia: $z = \frac{87 - 83}{3} \approx 1.33$
The standardized score of Mia is higher than the standardized score
of Kara. This means that Mia performed better relative to her section
than Kara did relative to hers.
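The comparison of the two students can be sketched in a few lines of Python. The function name `z_score` is a hypothetical helper introduced only for illustration.

```python
def z_score(x, mean, sd):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

z_kara = z_score(87, mean=85, sd=4)  # Section ABC
z_mia = z_score(87, mean=83, sd=3)   # Section XYZ

print(f"Kara: z = {z_kara:.2f}")
print(f"Mia:  z = {z_mia:.2f}")
print("Higher relative standing:", "Mia" if z_mia > z_kara else "Kara")
```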
Example
•Let  X be a random variable that is normally distributed with mean and
a standard deviation . Find the standard score for
Example
•Let  X be a random variable that is normally distributed with mean and
a standard deviation . Find the probability
Example
•Let  X be a random variable that is normally distributed with mean and
a standard deviation . Find the probability

Solution: We compute the probability of the normally distributed
variable X using the standard normal distribution.
Finding
 
Method 1: Tabular Method
Finding
 
Method 2: Calculator Method (For CASIO only)
i) Set calculator to STAT Mode

Press Mode ==> ==> Press 3

Just press AC
Finding
 
Method 2: Calculator Method (For CASIO only)
ii) Find probability
press 1

Press Shift ==> ==> Press 5

Note: For other versions, just look for “Distr” key


Finding
 
Method 3: Excel Command

* Click the “Insert Function” command
* Using Normal Distribution, choose “NORM.DIST”
* Input the values
* Type “TRUE” for the logical argument “Cumulative”
* Click “OK”
Method 3: Excel Command
* Using Standard Normal Distribution, choose “NORM.S.DIST”
* Input the values
* Type “TRUE” for the logical argument “Cumulative”
Example
In a certain section, the scores in Quiz 1 of MMW are known to be
normally distributed with a mean of 84 and a standard deviation of 3.5.
Determine the probability that a student in this section obtained a score
of
a) Less than or equal to 90
b) 88 or better
c) Between 85 and 90
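Solution: Given $\mu = 84$, $\sigma = 3.5$
For purposes of computation, let $X$ be the random variable for the
quiz score. The required probabilities are
a) $P(X \le 90)$
b) $P(X \ge 88) = 1 - P(X < 88)$
c) $P(85 \le X \le 90) = P(X \le 90) - P(X < 85)$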
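The three probabilities for the quiz example can be evaluated as follows. This is a sketch assuming Python 3.8 or later; `NormalDist(mu, sigma).cdf(x)` plays the same role as Excel's NORM.DIST with the “Cumulative” argument set to TRUE. Because z is not rounded to two decimals here, the results may differ in the third or fourth decimal place from values read off the table.

```python
from statistics import NormalDist

# Quiz 1 scores: mean 84, standard deviation 3.5 (from the example).
scores = NormalDist(mu=84, sigma=3.5)

p_a = scores.cdf(90)                   # a) P(X <= 90)
p_b = 1 - scores.cdf(88)               # b) P(X >= 88)
p_c = scores.cdf(90) - scores.cdf(85)  # c) P(85 <= X <= 90)

print(f"a) {p_a:.4f}")
print(f"b) {p_b:.4f}")
print(f"c) {p_c:.4f}")
```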
Example
A popular burger chain sells a particular soda brand using a machine
that discharges an average of 500 milliliters (ml) per cup. If the amount
of drink is normally distributed with a standard deviation of 35 ml,
a. what fraction of the cups will contain more than 550 ml?
b. how many cups will overflow if 530-ml cups will be used for
1,500 drinks?
c. below what value do you get the smallest 20% of the drinks?
Example) Given: $\mu = 500$, $\sigma = 35$
a. what fraction of the cups will contain more than 550 ml?
With the assumption of normality, we can standardize:
$$z = \frac{550 - 500}{35} \approx 1.43$$
$$P(X > 550) = P(Z > 1.43) = 1 - 0.9236 = 0.0764$$
So about 7.6% of the cups will contain more than 550 ml.
Example) Given: $\mu = 500$, $\sigma = 35$
b. how many cups will overflow if 530-ml cups will be used for
1,500 drinks?
Note: Cups overflow if the discharged content exceeds 530 ml.
Find: $P(X > 530) = P\left(Z > \frac{530 - 500}{35}\right) \approx P(Z > 0.86) = 1 - 0.8051 = 0.1949$
Then, $(1500)(0.1949) \approx 292$ cups will overflow.
Example) Given: $\mu = 500$, $\sigma = 35$
c. below what value do you get the smallest 20% of the drinks?
Note: This is an inverse probability problem. Only Casio fx991EX
(or higher versions) has the capability to do this. You may
use Excel instead.
Find $x$ such that $P(X < x) = 0.20$
*use the syntax: NORM.INV(probability, mean, standard_dev),
here NORM.INV(0.20, 500, 35)
*or use syntax: NORM.S.INV(probability), here NORM.S.INV(0.20),
which gives $z \approx -0.8416$, so $x = 500 + (-0.8416)(35) \approx 470.5$ ml
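All three parts of the soda example can be checked with a short script. This is a sketch assuming Python 3.8 or later, where `inv_cdf` is the counterpart of Excel's NORM.INV. Since z is not rounded to two decimals here, the overflow count comes out slightly different from a table-based answer.

```python
from statistics import NormalDist

# Amount dispensed per cup: mean 500 ml, standard deviation 35 ml.
fill = NormalDist(mu=500, sigma=35)

p_over_550 = 1 - fill.cdf(550)         # a) fraction of cups above 550 ml
overflow = 1500 * (1 - fill.cdf(530))  # b) expected overflows in 1,500 cups
cutoff = fill.inv_cdf(0.20)            # c) value below which 20% of drinks fall

print(f"a) fraction above 550 ml: {p_over_550:.4f}")
print(f"b) expected overflowing cups: {overflow:.0f}")
print(f"c) 20th percentile: {cutoff:.1f} ml")
```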
Inferential Statistics
One important role of statistics is to describe a large group of
subjects (called the population) using only a part or portion
(called a sample) of the group.
Inferential Statistics serves this purpose. Given a population of
size $N$, we can consider a smaller group or a sample of size $n$
such that whatever characteristic is obtained from the
sample can be used to describe the entire population from
which the sample was drawn.
Inferential Statistics
By “characteristic” we mean the common quantities that are
computed, such as the mean, variance, and proportion.
So, if the population mean $\mu$ is needed, then the sample mean
$\bar{x}$ can be used by some rules of inferential statistics. Here,
$\mu$ is called a parameter while $\bar{x}$ is called a statistic.
[Diagram: Population ($\mu$) and Sample ($\bar{x}$)]
Inferential Statistics
Other parameters are the population variance ($\sigma^2$), proportion
($p$), and correlation ($\rho$). The corresponding statistics are the
sample variance ($s^2$), sample proportion ($\hat{p}$), and sample
correlation ($r$). Inferential statistics is concerned with the
“estimation” of the parameters using the sample statistics.
[Diagram: Population ($\mu$, $\sigma^2$, $p$, $\rho$) and Sample ($\bar{x}$, $s^2$, $\hat{p}$, $r$)]
Inferences about the population mean ($\mu$)
Consider finding the mean ($\mu$) of a large population. We first
form a sample of size $n$. Then we compute the sample mean ($\bar{x}$).
We can use $\bar{x}$ to estimate $\mu$.

…but what guarantees that this can be done?
The Central Limit Theorem (CLT)
If random samples of size $n$ are formed from a population with
mean $\mu$ and standard deviation $\sigma$, then the means of the samples
tend to a normal distribution as the sample size increases. In
this case, the standard deviation of the means is given by
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
Moreover, the mean of the sample means equals the
population mean.
Remarks
1. The CLT does not require the population to be normally
distributed. Whether the population is normally
distributed or not, the means of the samples (with fixed
size $n$) are approximately normally distributed when $n$ is
sufficiently large.
2. By the Empirical Rule, a mean computed from a
random sample is likely to be close to the
population mean. Specifically, there is about a 68% chance that
it lies within $\mu \pm \sigma/\sqrt{n}$, about a 95% chance that it
lies within $\mu \pm 2\sigma/\sqrt{n}$, and about a 99.7% chance that
it lies within $\mu \pm 3\sigma/\sqrt{n}$.
Remarks
3. Since the means of the samples are normally distributed,
any specific mean $\bar{x}$ must have a corresponding standard
value (or standard normal score). The formula in this case is
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
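The theorem and the remarks can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not part of the module: it draws samples from an exponential population (deliberately not normal, with mean and standard deviation both set to 10), uses a sample size of 40, and fixes the random seed so the run is repeatable.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

POP_MEAN = 10.0     # exponential population: mean = SD = 10 (assumed values)
n = 40              # size of each sample
num_samples = 5000  # number of samples drawn

# Mean of each random sample of size n from the skewed population.
sample_means = [
    mean(random.expovariate(1 / POP_MEAN) for _ in range(n))
    for _ in range(num_samples)
]

observed_mean = mean(sample_means)  # should be close to the population mean
observed_sd = stdev(sample_means)   # should be close to sigma / sqrt(n)
predicted_sd = POP_MEAN / sqrt(n)

print(f"mean of sample means: {observed_mean:.3f} (population mean {POP_MEAN})")
print(f"SD of sample means:   {observed_sd:.3f} (sigma/sqrt(n) = {predicted_sd:.3f})")
```

Even though the population is strongly skewed, the sample means cluster around the population mean with a spread close to the value the theorem predicts.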
Example
In the MMW class of Professor A, the students obtained an average of
86.2 in an online quiz, with a standard deviation of 8. Assume that the
scores are normally distributed.
a) What is the probability that a randomly selected student scored
less than or equal to 88?
b) If a random sample of 15 students is selected from the class, what
is the probability that their average is less than or equal to 88?
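Example) Given: $\mu = 86.2$, $\sigma = 8$
a) What is the probability that a randomly selected student scored less
than or equal to 88?
Standardize $x = 88$:
$$z = \frac{88 - 86.2}{8} = 0.225$$
$$P(X \le 88) = P(Z \le 0.225) \approx 0.589$$
b) If a random sample of 15 students is selected from the class, what
is the probability that their average is less than or equal to 88?
Standardize $\bar{x} = 88$ with $n = 15$:
$$z = \frac{88 - 86.2}{8/\sqrt{15}} \approx 0.87$$
$$P(\bar{X} \le 88) = P(Z \le 0.87) \approx 0.808$$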
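The difference between a single score and a sample average can be seen by changing only the standard deviation used. This is a sketch assuming Python 3.8 or later; part b) replaces the standard deviation 8 with 8 divided by the square root of 15, as the central limit theorem prescribes.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 86.2, 8  # class mean and standard deviation (from the example)

# a) one randomly selected student
p_single = NormalDist(mu, sigma).cdf(88)

# b) average of a random sample of n = 15 students
n = 15
p_mean = NormalDist(mu, sigma / sqrt(n)).cdf(88)

print(f"a) P(X <= 88)     = {p_single:.4f}")
print(f"b) P(X-bar <= 88) = {p_mean:.4f}")
```

The probability in b) is larger because sample averages vary less than individual scores.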
Example
A manufacturing firm produces LED lamps with a mean lifetime of 900
hours and a standard deviation of 55 hours. Find the probability that a
random sample of 100 lamps will last on the average of
a. more than 915 hours
b. between 895 and 905 hours.
Example) Given: $\mu = 900$, $\sigma = 55$, $n = 100$
The standard deviation of the sample means is
$\sigma/\sqrt{n} = 55/\sqrt{100} = 5.5$.
a. more than 915 hours
$$P(\bar{X} > 915) = P\left(Z > \frac{915 - 900}{5.5}\right) \approx P(Z > 2.73) = 1 - 0.9968 = 0.0032$$
b. between 895 and 905 hours
$$z_1 = \frac{895 - 900}{5.5} \approx -0.91; \quad z_2 = \frac{905 - 900}{5.5} \approx 0.91$$
$$P(895 < \bar{X} < 905) = P(-0.91 < Z < 0.91) = 0.8186 - 0.1814 = 0.6372$$
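The LED lamp example can be verified in the same way. This is a sketch assuming Python 3.8 or later; the distribution of the sample mean is modeled with standard deviation 55 divided by the square root of 100. Results computed without rounding z may differ slightly from table-based values.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 900, 55, 100  # lifetime mean, SD, and sample size

# Distribution of the sample mean according to the central limit theorem.
sample_mean = NormalDist(mu, sigma / sqrt(n))

p_a = 1 - sample_mean.cdf(915)                     # a) more than 915 hours
p_b = sample_mean.cdf(905) - sample_mean.cdf(895)  # b) between 895 and 905

print(f"a) {p_a:.4f}")
print(f"b) {p_b:.4f}")
```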
Correlation and Regression
Sometimes you might wonder how two separate things
relate to one another. For example, you might ask yourself:
Why do savings generally increase when expenditure
decreases? Or why does your weight change when you eat
more or eat less? These questions are about the relationship
between two variables or quantities. Data that involve two
variables are called bivariate data.
Correlation and Regression
In univariate data, the major purpose of the analysis is to
describe the data based on descriptive statistics
such as averages, standard deviations, frequency counts, and
the like. On the other hand, in bivariate data, the purpose of
the analysis is to describe relationships. We will
discuss the relationship in terms of strength and direction.
The statistical procedure that is used to do this is called
correlation analysis.
Correlation Analysis
Correlation analysis is a statistical technique used to study
relationships among variables. Regression analysis is used to
determine the nature of the relationship. In a two-variable (simple)
linear regression, a positive relationship
occurs when the two variables increase together, while
a negative relationship occurs when one variable increases as
the other decreases.
Correlation Coefficient
To determine if there exists a linear relationship between two
variables, use correlation coefficient r whose values range
from –1 to 1.
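Useful Formulas
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$$
$$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \quad SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}, \quad SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$
where n is the sample size and “SS” stands for sum of the squares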
Coefficient of Determination
The square of r is called the coefficient of
determination, which describes the proportion of the
variability in the dependent variable y that is explained by its
linear relationship with the independent variable x.
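The Regression Line: $\hat{y} = a + bx$
The line that best fits a given set of points is called the
least-squares line of the linear regression model. Here,
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}, \qquad a = \bar{y} - b\bar{x}$$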
Example
The grades of 10 senior high school students on a midterm report x and
on the final examination y are as follows:

a. Determine the correlation coefficient r.


b. Determine the linear regression line.
c. Predict the final examination grade of a student whose
midterm grade is 60.
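The three parts of this example can be sketched in code. The grade table did not survive in this copy of the module, so the data below are HYPOTHETICAL values made up purely for illustration; the function name `fit_line` and the notation y = a + b·x for the regression line are likewise assumptions. The sums of squares are computed directly from their definitions.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (r, a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    r = ss_xy / sqrt(ss_xx * ss_yy)  # correlation coefficient
    b = ss_xy / ss_xx                # slope
    a = sum_y / n - b * sum_x / n    # intercept: y-bar minus b times x-bar
    return r, a, b

# HYPOTHETICAL grades for 10 students (not the data from the example).
midterm = [70, 74, 80, 84, 80, 67, 70, 64, 74, 82]
final = [87, 79, 88, 98, 96, 73, 83, 79, 91, 94]

r, a, b = fit_line(midterm, final)
predicted = a + b * 60  # predicted final grade for a midterm grade of 60

print(f"a. r = {r:.3f} (r^2 = {r * r:.3f})")
print(f"b. y-hat = {a:.2f} + {b:.3f} x")
print(f"c. predicted final grade at x = 60: {predicted:.1f}")
```

Replacing the two lists with the actual grades from the example would give the intended answers.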
