# Linear Regression Analysis

These are my personal notes only, so they are not complete. Most of the content was taken from the IBM SPSS Modeler training manual. Please refer to the training manual for a complete discussion.

Simple Linear Regression Model
Consider a scatterplot showing the relationship between mother’s weight (mweight) and baby’s birth weight (bweight).

• Regression analysis finds the straight line that best summarizes the relationship between the two variables: the line for which the sum of squared vertical distances of the points from the line is smallest (least squares).
Mathematically, the line can be expressed as an equation:
bweight = B0 + B1*mweight + E
where:
E ~ N(0, σ²)
B0 = constant (intercept)
B1 = effect on bweight of every one-pound increase in mweight
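
As a rough illustration of what "finding the line" involves, the sketch below computes a least-squares intercept and slope for a single predictor using the standard closed-form formulas. The function name `fit_simple_ols` and the data points are made up for illustration; they are not from the birth weight sample.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) that minimize the sum of squared vertical distances."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = y_mean - b1 * x_mean    # intercept
    return b0, b1


# Hypothetical data: points lying exactly on y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)
```

With real data the points do not fall exactly on a line, and the same formulas return the line with the smallest sum of squared residuals.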

Using sample data, SPSS generates the coefficients table below.
Coefficients (a)

| Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 2426.719 | 162.194 | | 14.962 | .000 |
| mweight | 3.977 | 1.214 | .167 | 3.276 | .001 |

a. Dependent Variable: bweight
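
The t column can be checked by hand: each t value is the unstandardized coefficient divided by its standard error. A small sketch, with the numbers copied from the coefficients table (the variable names are my own):

```python
# (B, Std. Error) pairs copied from the SPSS coefficients table
coefficients = {
    "(Constant)": (2426.719, 162.194),
    "mweight": (3.977, 1.214),
}

# t statistic = coefficient / standard error
t_values = {name: b / se for name, (b, se) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
```

The printed values should match the t column of the table up to rounding.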

Mathematically,

Estimated bweight = 2426.719 + 3.977*mweight
Note: This equation can be used to predict bweight when mweight is known.
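
A minimal sketch of using the estimated equation for prediction. The function name `predict_bweight` and the example weight of 130 pounds are assumptions for illustration only.

```python
B0 = 2426.719  # constant from the coefficients table
B1 = 3.977     # unstandardized coefficient of mweight


def predict_bweight(mweight):
    """Estimated bweight for a given mother's weight (in pounds)."""
    return B0 + B1 * mweight


# Hypothetical mother weighing 130 pounds
estimate = predict_bweight(130)
print(round(estimate, 3))
```

Predictions are most trustworthy for mweight values inside the range observed in the sample.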

Multiple Linear Regression Model
Consider the framework:
Predictors:
• Age (age)
• Weight at last menstrual period (mweight)
• History of hypertension (ht)
• Presence of uterine irritability (ui)

Outcome: Baby’s birth weight (bweight)

Mathematically,
bweight = constant + B1*age + B2*mweight + B3*ht + B4*ui + E
where:
E ~ N(0, σ²)
B1, B2, B3 and B4, called regression coefficients, can be estimated if sample data are available.

SPSS generates the regression table as follows:

Mathematically,
Estimated bweight = 2429.007 + 3.656*age + 4.203*mweight - 645.545*ht - 530.065*ui

Note: This equation can be used to predict bweight when the mother’s age, mweight, ht and ui are known.
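
A sketch of prediction with the multiple regression equation. It assumes that ht and ui are coded as 0 (absent) and 1 (present), which these notes do not state explicitly; the function name and the example mother are hypothetical.

```python
# Coefficients copied from the fitted equation above
COEF = {
    "constant": 2429.007,
    "age": 3.656,
    "mweight": 4.203,
    "ht": -645.545,
    "ui": -530.065,
}


def predict_bweight_multi(age, mweight, ht, ui):
    """Estimated bweight; ht and ui are assumed to be 0/1 indicators."""
    return (COEF["constant"]
            + COEF["age"] * age
            + COEF["mweight"] * mweight
            + COEF["ht"] * ht
            + COEF["ui"] * ui)


# Hypothetical mother: 25 years old, 130 pounds, no hypertension
without_ui = predict_bweight_multi(25, 130, ht=0, ui=0)
with_ui = predict_bweight_multi(25, 130, ht=0, ui=1)
print(round(without_ui, 3), round(with_ui, 3))
```

Under the 0/1 coding assumption, the two predictions differ by exactly the ui coefficient, which is how that coefficient is read: the estimated change in bweight when ui is present, holding the other predictors fixed.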

Uses of Linear Regression Analysis
• Regression analysis can be used (in applied research) to test the relationships between an outcome variable and a set of predictor variables.
• Regression analysis can also be used to predict the value of the outcome variable given the values of the predictor variables.

Fraud Detection in Insurance Claim (A Regression Analysis Example)
• The following data on patients in a hospital in the U.S. are available:
– CLAIM: total insurance claim for a single medical treatment performed in a hospital
– Age: age of patient
– LOS: length of hospital stay
– ASG: severity of illness category. This is based on several health measures, and higher scores indicate greater severity of the illness
– Sample size: n = 293

• Goals:
1) Build a predictive model for the insurance claim amount;
2) Use the model to identify outliers (patients whose claim values are far from what the model predicts), which might be instances of errors or fraud in the claims.

Dataset: InsClaim.dat

Diagram in Modeler 15.0

Generated output:

In equation format:
Predicted Claims = \$3026.754 + \$1105.646*length of stay + \$417.194*severity code – \$33.406*age

Detecting cases that deviate substantially from the model (Points Poorly fit by Model).
Compute the residual for each case (DIFF = actual claim - predicted claim).
Cases with large residuals should be examined carefully for possible errors or fraud.
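
The residual step can be sketched outside Modeler as follows. The prediction function uses the coefficients from the equation above; the three patient records are invented for illustration and are not taken from InsClaim.dat.

```python
def predicted_claim(los, asg, age):
    """Predicted claim from the fitted equation in these notes."""
    return 3026.754 + 1105.646 * los + 417.194 * asg - 33.406 * age


# Hypothetical records, NOT from InsClaim.dat
records = [
    {"id": 1, "claim": 6000.0, "los": 3, "asg": 1, "age": 40},
    {"id": 2, "claim": 7000.0, "los": 5, "asg": 2, "age": 60},
    {"id": 3, "claim": 12000.0, "los": 2, "asg": 0, "age": 30},
]

# DIFF = actual claim - predicted claim
for r in records:
    r["diff"] = r["claim"] - predicted_claim(r["los"], r["asg"], r["age"])

# Rank cases by absolute residual, largest first, for manual review
ranked = sorted(records, key=lambda r: abs(r["diff"]), reverse=True)
for r in ranked:
    print(r["id"], round(r["diff"], 3))
```

A large positive DIFF means the claim is much higher than the model expects for that length of stay, severity and age. It marks a case worth reviewing, not proof of fraud.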

Generated outputs:

Using Linear Models Node to Perform Regression
It has more features than the Regression node, including:
• the ability to create the best subset model,
• several criteria for model selection,
• the option to limit the number of predictors, and
• the use of bagging and boosting.

Additional Features of Linear Models Node
• It automatically prepares the data for modeling by transforming the target and predictors in order to maximize the predictive power of the model. This includes:
– outlier handling,
– adjusting the measurement level of predictors, and
– merging similar categories.

• It automatically creates dummy variables from categorical fields (that have nominal or ordinal measurement level).
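
To show what a dummy variable is, here is a minimal sketch that turns one categorical field into 0/1 indicator columns, leaving out one reference category. The helper `make_dummies` and the `severity` values are hypothetical, and the coding Modeler uses internally may differ.

```python
def make_dummies(values, drop_first=True):
    """Return one dict of 0/1 indicators per input value."""
    categories = sorted(set(values))
    # Drop one category as the reference level to avoid redundancy
    kept = categories[1:] if drop_first else categories
    return [{f"is_{c}": int(v == c) for c in kept} for v in values]


# Hypothetical categorical field
severity = ["low", "high", "medium", "low"]
rows = make_dummies(severity)
for value, row in zip(severity, rows):
    print(value, row)
```

Here "high" is the reference category (first in sorted order), so a row with all indicators equal to 0 represents "high".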