
Linear Regression

1. Read the File
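A minimal sketch of this step, assuming the data is a CSV file. The file, its columns (Age, Income, Job.Type) and its values are hypothetical; it is written to a temporary location only so the example runs on its own.

```r
# Hypothetical example file, written to a temp path so the sketch is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("Age,Income,Job.Type",
             "34,52000,Employed",
             "61,,Retired",
             "29,18000,Unemployed"), path)

# Read the file; empty fields and the string "NA" are treated as missing
Data <- read.csv(path, na.strings = c("", "NA"), stringsAsFactors = FALSE)

str(Data)    # column types
head(Data)   # first rows
```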


2. Missing Value Check
a. summary(Data)
i. Check for NA Values
ii. Impute Missing Values
1. Check for Normality of Data
a. Up to 5000 Rows : Shapiro-Wilk Test
b. More than 5000 Rows : Anderson-Darling Test
2. Plot the Data
a. Histogram
b. Line Plot : Smoothing Line
3. If Data is Normal : Replace with Mean
4. If Data is Not Normal (Skewed) : Replace with Median
5. Categorical Data : Replace with Mode, or Ask the Business
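The imputation rules above can be sketched as follows. The data frame, its column names and the 0.05 cutoff are assumptions made for illustration. The sketch uses shapiro.test(), which only accepts 3 to 5000 observations; for larger data an Anderson-Darling test such as nortest::ad.test() would take its place.

```r
set.seed(1)
# Hypothetical data: a roughly normal column, a right-skewed column, a categorical column
Data <- data.frame(
  Age      = rnorm(200, mean = 40, sd = 8),
  Income   = rexp(200, rate = 1 / 30000),
  Job.Type = sample(c("Employed", "Unemployed", "Retired"), 200, replace = TRUE,
                    prob = c(0.6, 0.2, 0.2)),
  stringsAsFactors = FALSE
)
Data$Age[c(3, 50)]       <- NA
Data$Income[c(7, 120)]   <- NA
Data$Job.Type[c(10, 99)] <- NA

colSums(is.na(Data))   # NA count per column

# Numeric column: mean if Shapiro-Wilk does not reject normality, otherwise median
impute_numeric <- function(x, alpha = 0.05) {
  observed  <- x[!is.na(x)]
  is_normal <- shapiro.test(observed)$p.value > alpha
  x[is.na(x)] <- if (is_normal) mean(observed) else median(observed)
  x
}

# Categorical column: most frequent value (table() ignores NA by default)
stat_mode <- function(x) names(which.max(table(x)))

Data$Age    <- impute_numeric(Data$Age)
Data$Income <- impute_numeric(Data$Income)
Data$Job.Type[is.na(Data$Job.Type)] <- stat_mode(Data$Job.Type)
```

A histogram of each column, for example hist(Data$Income), is still worth a look before trusting the test result alone.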
3. Check for Categorical Variables
a. Create Dummy
b. N-1 : If N categories
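i. Why N-1 : Because one category can always be inferred from the others (with 3 categories, the third is known from the other 2)
c. If we create N dummies : R reports "there are aliased coefficients in the model" (for example from car::vif()), and lm() returns NA for the redundant dummy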
4. Create Dummies Using the dplyr Package
library(dplyr)
# Creating dummies for Job.Type.
# If Job.Type has only these three levels, drop one dummy to follow the N-1 rule.
Data <- Data %>%
  mutate(Job.type_employed = ifelse(Job.Type == "Employed", 1, 0),
         Job.type_unemployed = ifelse(Job.Type == "Unemployed", 1, 0),
         Job.Retired = ifelse(Job.Type == "Retired", 1, 0)
  ) %>%
  select(-Job.Type)
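To see the N-1 rule in action, here is a small sketch on hypothetical data (the Spend column and the d_* dummy names are made up for this example). It shows that model.matrix() expands a three-level factor into two dummies plus an intercept, and that supplying all three dummies by hand leaves one coefficient undefined.

```r
set.seed(2)
# Hypothetical data with a three-level categorical predictor
Data <- data.frame(
  Job.Type = factor(sample(c("Employed", "Unemployed", "Retired"), 60, replace = TRUE)),
  Spend    = rnorm(60, mean = 100, sd = 15)
)

# A factor is expanded into N-1 dummies plus an intercept
X <- model.matrix(~ Job.Type, data = Data)
colnames(X)

# Building all N dummies by hand makes one of them redundant
Data$d_employed   <- as.integer(Data$Job.Type == "Employed")
Data$d_unemployed <- as.integer(Data$Job.Type == "Unemployed")
Data$d_retired    <- as.integer(Data$Job.Type == "Retired")

full <- lm(Spend ~ d_employed + d_unemployed + d_retired, data = Data)
coef(full)   # the redundant dummy has an NA coefficient
```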
5. Check for Outliers
1. Box plots
2. Replace outliers : Cap them at the 5th or 95th percentile, or alternatively at Q1 - 1.5*IQR or Q3 + 1.5*IQR
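Both capping options can be sketched in base R. The vector x and the helper cap() are hypothetical names used only for this illustration.

```r
# Cap values outside [lower, upper] to the nearest bound
cap <- function(x, lower, upper) pmin(pmax(x, lower), upper)

# Hypothetical numeric column with two extreme values
x <- c(12, 15, 14, 13, 16, 15, 14, 13, 95, -40)

boxplot.stats(x)$out   # values a box plot would flag as outliers

# Option 1: cap at the 5th and 95th percentiles
p <- unname(quantile(x, probs = c(0.05, 0.95), na.rm = TRUE))
x_pct <- cap(x, p[1], p[2])

# Option 2: cap at the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR
q   <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]
x_iqr <- cap(x, q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
```

Capping keeps every row in the data, which is why it is often preferred over deleting outliers when the sample is small.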
6. Check for Correlations
1. Use the corrplot function
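A short sketch of the correlation check on hypothetical predictors (x1, x2, x3 are made-up names). The matrix comes from base R's cor(); the plot is drawn only if the corrplot package happens to be installed.

```r
set.seed(3)
# Hypothetical numeric predictors; x2 is built to be strongly correlated with x1
x1 <- rnorm(100)
Data <- data.frame(x1 = x1,
                   x2 = x1 + rnorm(100, sd = 0.2),
                   x3 = rnorm(100))

cor_mat <- cor(Data, use = "pairwise.complete.obs")
round(cor_mat, 2)

# Draw the correlation matrix if corrplot is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "circle")
}
```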
7. Remove Variables with VIF More than 10 : One at a time
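The VIF step can be sketched without extra packages by computing VIF as 1 / (1 - R^2), where R^2 comes from regressing each predictor on the others. In practice car::vif() on a fitted model is the usual tool; the helper vif_manual() and the data below are hypothetical, and dropping the highest VIF first is an assumption of this sketch.

```r
set.seed(4)
# Hypothetical predictors; x2 is nearly a copy of x1, so both get a large VIF
x1 <- rnorm(200)
Data <- data.frame(x1 = x1,
                   x2 = x1 + rnorm(200, sd = 0.1),
                   x3 = rnorm(200))

# VIF of each predictor: 1 / (1 - R^2) from regressing it on the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Drop one predictor at a time (highest VIF first) until none exceeds 10
predictors <- c("x1", "x2", "x3")
repeat {
  if (length(predictors) < 2) break
  v <- vif_manual(Data, predictors)
  if (max(v) <= 10) break
  predictors <- predictors[-which.max(v)]
}
predictors
```

Recomputing VIF after each removal matters: once one of two collinear variables is gone, the VIF of the other usually drops sharply.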
8. Build LM and check summary of the model : Remove variables with P values greater than .05 one by one, Highest First
9. Repeat till all remaining variables are significant (all have stars in the summary output)
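Steps 8 and 9 amount to a backward-elimination loop. The sketch below assumes numeric predictors only, so that coefficient names match the predictor names; the data and variable names are hypothetical.

```r
set.seed(5)
# Hypothetical data: y depends on x1 and x2 but not on x3
Data <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
Data$y <- 3 * Data$x1 - 2 * Data$x2 + rnorm(300)

predictors <- c("x1", "x2", "x3")
repeat {
  model <- lm(reformulate(predictors, response = "y"), data = Data)
  # P values of the predictors (intercept excluded), in the order of `predictors`
  p <- summary(model)$coefficients[predictors, "Pr(>|t|)"]
  if (max(p) <= 0.05 || length(predictors) == 1) break
  # Remove the single least significant variable, then refit
  predictors <- predictors[-which.max(p)]
}

summary(model)
```

Refitting after every removal is the point of going one by one: P values of the remaining variables change each time a variable leaves the model.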
10. Check for assumptions
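plot(model, which = 1)  # Residuals vs Fitted : linearity
plot(model, which = 2)  # Normal Q-Q : normality of residuals
plot(model, which = 3)  # Scale-Location : constant variance
plot(model, which = 4)  # Cook's distance : influential points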
11. RMSE : Root mean square error : To check model performance
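RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch follows; the data is hypothetical, and the 70/30 train/test split is an assumption of this example, not something the steps above prescribe.

```r
set.seed(6)
# Hypothetical data with a single predictor
Data <- data.frame(x = runif(100, 0, 10))
Data$y <- 1.5 * Data$x + rnorm(100)

# Assumed 70/30 split so RMSE is also measured on rows the model has not seen
test_idx <- sample(nrow(Data), size = 30)
train <- Data[-test_idx, ]
test  <- Data[test_idx, ]

model <- lm(y ~ x, data = train)

# Root mean square error: lower is better, in the units of the response
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$y, fitted(model))
rmse_test  <- rmse(test$y, predict(model, newdata = test))
c(train = rmse_train, test = rmse_test)
```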
