2. Missing Value Check
   a. summary(Data)
      i. Check for NA values
      ii. Impute missing values
         1. Check the data for normality
            a. Up to 5000 rows: Shapiro-Wilk test
            b. More than 5000 rows: Anderson-Darling test
         2. Plot the data
            a. Histogram
            b. Line plot with a smoothing line
         3. If the data is normal: replace missing values with the mean
         4. If the data is not normal (skewed): replace missing values with the median
         5. Categorical data: replace with the mode, or ask the business
3. Check for Categorical Variables
   a. Create dummy variables
   b. Create N-1 dummies if there are N categories
      i. Why N-1: the last category is fully explained by the others. With 3 categories, a row that is 0 on both dummies must belong to the third category.
   c. If we create N dummies, R complains that there are aliased coefficients in the model (perfect multicollinearity)
4. Creating dummies using the dplyr package

   library("dplyr")

   # Creating dummies for Job.Type
   # Note: if Job.Type has only these three levels, drop one of the
   # dummies below so that N-1 remain (see step 3b).
   Data <- Data %>%
     mutate(
       Job.type_employed   = ifelse(Job.Type == "Employed", 1, 0),
       Job.type_unemployed = ifelse(Job.Type == "Unemployed", 1, 0),
       Job.Retired         = ifelse(Job.Type == "Retired", 1, 0)
     ) %>%
     select(-Job.Type)   # drop the original categorical column

5. Check for Outliers
   1. Box plots
   2. Replace outliers by capping them: either at the 5th and 95th percentiles, or at Q1 - 1.5*IQR and Q3 + 1.5*IQR
6. Check for Correlations
   1. Use the corrplot() function (corrplot package)
7. Remove variables with VIF greater than 10, one at a time
8. Build the LM and check the summary of the model: remove variables with p-values greater than 0.05 one by one, highest first
9. Repeat until every remaining variable is significant (all starred in the summary)
10. Check for assumptions

   plot(model, which = 1)   # Residuals vs Fitted
   plot(model, which = 2)   # Normal Q-Q
   plot(model, which = 3)   # Scale-Location
   plot(model, which = 4)   # Cook's distance

11. RMSE (root mean square error): to check model performance
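The imputation rule in step 2, the IQR capping in step 5 and the RMSE in the last step can be sketched in base R as below. This is only a minimal sketch: the helper names impute_numeric, cap_outliers and rmse are hypothetical and do not come from the notes. As a further assumption, vectors with more than 5000 observed values simply fall back to the median instead of running an Anderson-Darling test, which would need an add-on package.

```r
# Hypothetical helpers illustrating the notes; names are assumptions.

# Impute NAs in a numeric vector: mean if the Shapiro-Wilk test does not
# reject normality, median otherwise. shapiro.test() only accepts 3 to
# 5000 non-identical values, so anything outside that range falls back
# to the median (an assumption made for this sketch).
impute_numeric <- function(x, alpha = 0.05) {
  obs <- x[!is.na(x)]
  can_test <- length(obs) >= 3 && length(obs) <= 5000 &&
    length(unique(obs)) > 1
  is_normal <- can_test && shapiro.test(obs)$p.value > alpha
  fill <- if (is_normal) mean(obs) else median(obs)
  x[is.na(x)] <- fill
  x
}

# Cap values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR at those fences.
cap_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  pmin(pmax(x, q[1] - 1.5 * iqr), q[2] + 1.5 * iqr)
}

# Root mean square error between actual and predicted values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Example: 100 lies above the upper fence (Q3 + 1.5*IQR = 7).
cap_outliers(c(1, 2, 3, 4, 100))   # 1 2 3 4 7
```

For a fitted model, rmse(test$y, predict(model, newdata = test)) would give the hold-out error, where test and y are placeholder names for a hold-out data frame and its response column.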