
STT151A

Statistics for Research

Part 1: Regression Analysis


Time Frame: Weeks 1-4
1.1 Introduction
1.2 Least-Squares Regression and Correlation (SLRM and MLRM)
1.3 Model Validation & Remedial Measures, Outlier Detection, and Transformations
1.4 Variable Selection & Model Building
1.5 Intrinsically Linear Models
** 1.6 Logistic Regression
1.7 Kaplan-Meier Survival Analysis
QUIZ #1
Course Syllabus Discussion
Download Statistica
https://helpdesk.dlsu.edu.ph/guides/software/statistica-installation-guide.asp
1.2 Least-Squares Regression and Correlation
Simple Linear Regression Model and Multiple Linear Regression Model
Correlation Analysis
From: https://www.mathsisfun.com/data/correlation.html
The word Correlation is made of Co- (meaning "together"), and Relation

• Correlation is Positive when the values increase together, and


• Correlation is Negative when one value decreases as the other increases

Correlation can have a value between -1 and 1:

• 1 is a perfect positive correlation
• 0 is no correlation (the values don't seem linked at all)
• -1 is a perfect negative correlation
The value shows how good the correlation is (not how steep the line is), and if it is positive or negative.
We can easily see that warmer weather and higher sales go together. The relationship is good but not perfect.
Correlation Is Not Good at Curves
The correlation calculation only works properly for straight-line relationships.
Suppose it gets so hot that people aren't going near the shop, and sales start dropping.
[Graph: the correlation value is now 0, "No Correlation".]

The calculated correlation value is 0, which means "no correlation".
But we can see that the data follows a nice curve that reaches a peak around 25 °C.
The correlation calculation is not "smart" enough to see this.
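The effect can be reproduced with a small sketch. The temperatures and sales below are hypothetical values chosen to rise to a peak at 25 °C and fall symmetrically; they are not the data behind the graph, and the helper name `pearson_r` is only illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (an assumption, not from the slides):
# sales peak at 25 °C and drop off symmetrically on both sides.
temperature = [15, 20, 25, 30, 35]
sales = [300, 500, 600, 500, 300]

r_curve = pearson_r(temperature, sales)
print(f"r = {r_curve:.3f}")
```

The positive and negative products of deviations cancel out, so r comes out at zero even though sales clearly depend on temperature.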
Linear Correlation
[Figures: scatter plots of Y against X showing linear vs. curvilinear relationships, strong vs. weak relationships, and no relationship.]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Moral of the story: make a Scatter Plot, and look at it!
You may see a relationship that the calculation does not.
"Correlation Is Not Causation"
A common saying is "Correlation Is Not
Causation".

• What it really means is that a correlation does not prove one thing
causes the other:
• One thing might cause the other
• The other might cause the first to happen
• They may be linked by a different thing
• Or it could be random chance!
• There can be many reasons the data has a good correlation.
Example: Poor suburbs are more likely to have high pollution
Why?
• Do poor people make pollution?
• Are polluted suburbs the only place poor people can afford?
• Is it a common link, such as factories with low paying jobs and lots of
pollution?
Pearson Product-Moment Correlation
Temperature °C (x)   Ice Cream Sales (y)   x²        y²         xy
14.2                 215                   201.64    46225      3053
16.4                 325                   268.96    105625     5330
11.9                 185                   141.61    34225      2202
15.2                 332                   231.04    110224     5046
18.5                 406                   342.25    164836     7511
22.1                 522                   488.41    272484     11536
19.4                 412                   376.36    169744     7993
25.1                 614                   630.01    376996     15411
23.4                 544                   547.56    295936     12730
18.1                 421                   327.61    177241     7620
22.6                 445                   510.76    198025     10057
17.2                 408                   295.84    166464     7018
Total 224.1          4829                  4362.1    2118025    95507
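The column totals are exactly what the computational form of Pearson's r needs. As a worked sketch, assuming n = 12 temperature-sales pairs and using the rounded totals from the table (so the last digit may differ slightly from a full-precision calculation):

```latex
\begin{align*}
r &= \frac{n\sum xy - \sum x \sum y}
          {\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]
                 \left[n\sum y^2 - \left(\sum y\right)^2\right]}} \\[4pt]
  &= \frac{12(95507) - (224.1)(4829)}
          {\sqrt{\left[12(4362.1) - (224.1)^2\right]
                 \left[12(2118025) - (4829)^2\right]}} \\[4pt]
  &\approx \frac{63905.1}{\sqrt{(2124.39)(2097059)}} \approx 0.957
\end{align*}
```

A value this close to 1 indicates a strong direct linear relationship between temperature and ice cream sales.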
How to Perform Pearson Correlation (r) in Excel
How to open file in Statistica:
https://www.youtube.com/watch?v=uc_67xVZK8s

How to perform Pearson correlation in Statistica:


https://www.youtube.com/watch?v=Ev86DMtLXOk
Age (years)   BMI (kg/m²)
73            28
22            22
74            27
34            29
50            29
42            27
64            28
53            29
43            24
21            19
12            17

Excel formula: =PEARSON(array1, array2)
Coefficient (r): 0.761713
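As a cross-check on the spreadsheet result, the same coefficient can be computed by hand in a few lines. This is a minimal sketch using only the standard library; the function name `pearson_from_sums` is illustrative and is not part of Excel or Statistica.

```python
import math

# Age (years) and BMI (kg/m²) pairs from the table above.
age = [73, 22, 74, 34, 50, 42, 64, 53, 43, 21, 12]
bmi = [28, 22, 27, 29, 29, 27, 28, 29, 24, 19, 17]

def pearson_from_sums(xs, ys):
    """Pearson r via the computational (sums of products) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

r_age_bmi = pearson_from_sums(age, bmi)
print(round(r_age_bmi, 6))
```

The result should agree with the coefficient reported by =PEARSON to the displayed precision.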
Correlation
• Quantification of the relationship between two QUANTITATIVE
variables

• The quantity is called the Pearson’s correlation coefficient (r).

• -1 ≤ r ≤ 1
• (+) direct linear relationship
• (-) inverse linear relationship
Conclusion
There is a strong inverse linear relationship between water
temperature and decrease in pulse rate of children
How to Perform Pearson Correlation (r) in Statistica
Open data in Statistica:
https://www.youtube.com/watch?v=uc_67xVZK8s

Perform Pearson Correlation in Statistica:


https://www.youtube.com/watch?v=AvdKQyVr9FQ&list=PLsY7hM6ZLBNOBT9RPYo0oeIuFezrDNYXu&index=6
Correlation
Download FIES data from CANAVAS
Family Income and Expenditure Survey
Get the Pearson correlation between Total Income and all the other variables,
except the categorical variables encoded as text.

Reference : https://www.kaggle.com/datasets/grosvenpaul/family-income-and-expenditure/discussion
Note: This data is trimmed to use as a tool for class discussion. Complete data is available upon request from
PSA with recommendation of thesis adviser.
Observation on Correlation of Total Income
and other variables
What observations can you get from the data?
Which has the highest correlation with Total Income?
Which are not significantly related with Total Income?
Which variables have a direct linear relationship with Total Income?
Which variables have an inverse linear relationship with Total Income?
Simple Linear Regression Model (SLRM)
Introduction to Simple Linear Model
http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/
https://www.youtube.com/watch?v=owI7zxCqNY0
The simple linear regression is used to predict a quantitative outcome y on the basis of one single predictor
variable x. The goal is to build a mathematical model (or formula) that defines y as a function of the x variable.

Once we have built a statistically significant model, we can use it to predict future outcomes on the basis
of new x values.
From the scatter plot, it can be seen that not all the data points
fall exactly on the fitted regression line. Some of the points are
above the fitted (blue) line and some are below it;
overall, the residual errors (e) have a mean of approximately zero.

The sum of the squares of the residual errors is called the
Residual Sum of Squares (RSS).

The average variation of the points around the fitted regression
line is called the Residual Standard Error (RSE). This is one of
the metrics used to evaluate the overall quality of the fitted
regression model. The lower the RSE, the better the fit.

Mathematically, the beta coefficients (b0 and b1) are chosen so that the RSS is as small as possible. This
method of determining the beta coefficients is called least squares regression or ordinary least squares
(OLS) regression.
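For one predictor, the quantities described above can be written out explicitly. The following is a sketch of the standard textbook formulas; the notation (n observations, sample means \bar{x} and \bar{y}, fitted values \hat{y}_i) is assumed here, and the RSE uses n - 2 degrees of freedom, which is the usual convention for a model with one predictor and an intercept.

```latex
\begin{align*}
\hat{y}_i &= b_0 + b_1 x_i, \qquad e_i = y_i - \hat{y}_i \\[4pt]
\mathrm{RSS} &= \sum_{i=1}^{n} e_i^2
              = \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2 \\[4pt]
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
            {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad b_0 = \bar{y} - b_1 \bar{x} \\[4pt]
\mathrm{RSE} &= \sqrt{\frac{\mathrm{RSS}}{n - 2}}
\end{align*}
```

The expressions for b_0 and b_1 are the values that minimize the RSS.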
Least Squares Method using Excel
https://www.youtube.com/watch?v=P8hT5nDai6A
Least Squares Method using Statistica
https://www.youtube.com/watch?v=VInW7mmxzOU&list=PLsY7hM6ZLBNOBT9RPYo0oeIuFezrDNYXu&index=7
Example 1.
Encode the table below, where the dependent variable is y and the
independent variable (predictor) is x.
x y
1 1.5
2 3.8
3 6.7
4 9.0
5 11.2
6 13.6
7 16
Based on the Least Squares Method, the line of best fit to the data is
y = −0.828571 + 2.414286x
Prediction: Example
y = −0.828571 + 2.414286x

y = −0.828571 + 2.414286(5) = 11.243

The EXPECTED value of y when x = 5 is 11.243.

−0.828571:
The EXPECTED value of y when the predictor, x, is 0 is −0.828571.

2.414286:
The EXPECTED increase in the value of y for every one-unit increase in x.
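The fitted coefficients and the prediction for Example 1 can be reproduced outside Statistica. This is a minimal sketch in plain Python; `fit_line` is an illustrative helper name, not a Statistica or Excel function.

```python
# Example 1 data from the table above.
x = [1, 2, 3, 4, 5, 6, 7]
y = [1.5, 3.8, 6.7, 9.0, 11.2, 13.6, 16.0]

def fit_line(xs, ys):
    """Return (b0, b1), the intercept and slope that minimize the RSS."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

b0, b1 = fit_line(x, y)
y_hat_at_5 = b0 + b1 * 5  # expected value of y when x = 5

print(f"b0 = {b0:.6f}, b1 = {b1:.6f}, prediction at x = 5: {y_hat_at_5:.3f}")
```

The intercept and slope should match the line of best fit reported for Example 1.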
Model Validation using Coefficient of Determination (R-squared)
https://www.youtube.com/watch?v=TCtDXmvXDUc
https://www.youtube.com/watch?v=igIT6xzAH8s

R² is a measure of goodness-of-fit.
An R² closer to 1 indicates a good model.
An R² closer to 0 indicates a poor model.
Adjusted R² = 0.99876183
Interpretation: About 99.88% of the variation of y can be explained by x.
The model is significant if the p-value is less than 0.05 (p<0.05).
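The adjusted R² for Example 1 can be checked by hand. The sketch below assumes the usual definitions: for one predictor, R² equals the squared Pearson correlation, and the adjusted value rescales the unexplained share by (n - 1)/(n - p - 1), where p is the number of predictors. The variable names are illustrative.

```python
# Example 1 data (x is the predictor, y is the dependent variable).
x = [1, 2, 3, 4, 5, 6, 7]
y = [1.5, 3.8, 6.7, 9.0, 11.2, 13.6, 16.0]

n = len(x)
p = 1  # number of predictors in a simple linear regression

mean_x = sum(x) / n
mean_y = sum(y) / n

sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)  # total sum of squares
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Share of the variation in y explained by the fitted line.
r_squared = sxy ** 2 / (sxx * syy)

# Adjusted R² penalizes R² for the number of predictors used.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(f"R² = {r_squared:.8f}, adjusted R² = {adjusted_r_squared:.8f}")
```

With a single predictor and a fit this close, the two values differ only slightly; the gap widens as more predictors are added to a model.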
Example 2. Least Squares Method using Statistica
Use the FIES data, where the dependent variable is Total Income and the
independent variable (predictor) is Communication Expenditure.

Based on the Least Squares Method, the line of best fit to the data is
y = 133241.9 + 27.9x
Model Validation using Coefficient of Determination (R-squared)

Adjusted R² = 0.50428708
Interpretation: About 50.43% of the variation in total household income can be
explained by communication expenditure.
ŷ = 133241.9 + 27.9x
Total Household Income = 133241.9 + 27.9 (Communication Expenditure)
The symbol ŷ means it is an estimate only. The symbol y means it is the
actual population measurement.
Prediction:
1. The EXPECTED total household income when the communication
expenditure is 0 is Php 133,241.90.
2. The EXPECTED increase in total household income for every one-unit
increase in communication expenditure is Php 27.90.
3. A household with an annual communication expenditure of 10,000 has an
EXPECTED total household income of
Total Household Income = 133241.9 + 27.9(10000) = Php 412,241.90
