
Model Name : Simple Linear Regression | Volume Vs Price

Dataset Name : Vol_price.csv


INDEX
• Introduction
• Exploratory Data Analysis
• Heatmap of correlation
• Model Selection
• Evaluation Metrics and its plots
• Conclusion
INTRODUCTION

Different Types of Machine Learning Models

1. A Linear Regression model is applied when a continuous dependent
value is predicted from one or more independent values
2. A Classification model is used to predict a categorical response,
for example when we have two possible outcomes
3. A Clustering model is used to group similar observations when no
labelled outcome is given
INTRODUCTION
MODEL SELECTION is a method for setting up a blueprint to analyze data and then
using it to measure new data.

There are different methods for model selection :

1. Train-Test Split - The train set is used for creating the model, and the test set is used
for testing the accuracy of the model. A common convention is to use 75% of the data for
training and 25% for testing. Splitting the data keeps some samples unseen during training,
so the model's performance can be evaluated on them and it is easy to determine whether
the model's predictions are correct.

2. Cross Validation - The data is split into k subsets (folds) of roughly equal size. In each
round, 1 fold is held out for validation and the remaining k-1 folds are used for training,
so that every fold is used for validation once.
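
The two splitting methods above can be sketched in plain Python. This is a minimal illustration, not the code used for these slides: the helper names `train_test_split` and `k_fold_indices` are hypothetical, and the shuffle seed and the contiguous fold layout are assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumption) for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx); each fold serves as the validation set once."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# 1000 rows with the 75% / 25% convention
train, test = train_test_split(range(1000))
print(len(train), len(test))  # 750 250

# 10 samples in 3 folds: validation folds of size 4, 3 and 3
for train_idx, val_idx in k_fold_indices(10, 3):
    print(val_idx)
```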
EXPLORATORY DATA ANALYSIS

• Dataset contains 1000 rows and 2 columns


• The head of the dataset is displayed (head means the top-most ‘n’ entries)
• Volume and Price both have the float data type
• There are no null or missing values in the dataset.
• Exploratory analysis covers variable identification (i.e. checking whether the given data is
continuous, categorical, discrete or geographical), whether the analysis is univariate or
bivariate, the size of the dataset, and whether the given data is incomplete or has any
missing values.
• The dataset has only one independent variable (Volume) and one dependent
variable (Price).
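
The checks listed above (size, data types, missing values) could be run with the standard `csv` module, as in the following sketch. The `summarize` helper is hypothetical and the three sample rows are made up for illustration; they are not taken from Vol_price.csv.

```python
import csv
import io


def summarize(csv_file):
    """Count rows, list columns, and check each column for missing or non-numeric values."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    n_rows = 0
    missing = {name: 0 for name in columns}
    numeric = {name: True for name in columns}
    for row in reader:
        n_rows += 1
        for name in columns:
            value = (row[name] or "").strip()
            if value == "":
                missing[name] += 1  # empty cell counts as a missing value
                continue
            try:
                float(value)
            except ValueError:
                numeric[name] = False  # column holds something that is not a float
    return {"rows": n_rows, "columns": columns, "missing": missing, "numeric": numeric}


# Made-up sample with one missing Price (illustration only)
sample = io.StringIO("Volume,Price\n1.5,7.4\n2.0,\n3.2,16.1\n")
summary = summarize(sample)
print(summary)

# With the real file, assuming it sits in the working directory:
# with open("Vol_price.csv", newline="") as f:
#     print(summarize(f))
```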
HEAT MAP OF CORRELATION
• A heat map is a two-dimensional representation of data in which values are
represented by colors.
• A simple heat map provides an immediate visual summary of information.
More elaborate heat maps allow the viewer to understand complex data sets.

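• The heatmap shows that Volume and Price have a correlation coefficient of about
0.80
• The model's accuracy can be compared with the correlation for better
evaluation: for simple linear regression, the R2 score on the fitted data equals the
square of the correlation coefficient (0.80 squared is 0.64)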
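
Each cell of a correlation heatmap holds a Pearson correlation coefficient. A minimal sketch of how one such value is computed is given below; the function name `pearson_r` is hypothetical, and the input lists are toy values rather than the Volume and Price columns.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values: a perfectly increasing and a perfectly decreasing relationship
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [3, 2, 1])
print(r_up, r_down)
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one.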
Model Selection

• The dataset contains two variables, one dependent and one independent, and both hold
continuous data, so Simple Linear Regression is chosen
• Simple Linear Regression is based on the formula y = mx + c, where y is Price, which is
dependent, and x is Volume, which is independent
• In the above equation, c is the intercept and m is the coefficient (slope). The intercept is
the value of the linear predictor when all covariates are zero. The coefficient is the change in
the dependent variable per unit change in the independent variable, and its sign indicates the
direction of the relationship.
• The model is evaluated using the K-Fold method.
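
For reference, the slope m and intercept c of y = mx + c can be obtained with the ordinary least-squares formulas. The sketch below assumes that fitting method (the slides do not state which library or routine was used), and `fit_line` is a hypothetical helper run on toy values.

```python
def fit_line(xs, ys):
    """Return (m, c) of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx            # coefficient: change in y per unit change in x
    c = mean_y - m * mean_x  # intercept: predicted y when x is zero
    return m, c


# Toy points lying exactly on y = 2x + 1
m, c = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, c)  # 2.0 1.0
```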
EVALUATION METRICS AND ITS PLOTS
KFold(9).py

Volume vs Price for original set

• Intercept is 0.07268747227117345

• Coefficient is 5.00615349
Results
• Mean absolute Error: 2.3107620550991297
• Mean Squared Error: 8.502593164969865
• Root mean square error: 2.9159206376322837
• R2 score: 0.64256142169696

• For the given dataset, the graph and results above show an R2 score of 0.64: the
best-fit line (predicted Y line) explains about 64% of the variance in Price, and the
remaining 36% is not explained by the model.
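
The four metrics reported above follow standard definitions, sketched here in plain Python. `regression_metrics` is a hypothetical helper and the input values are toy numbers, not the Price predictions from this model.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)  # RMSE is the square root of the MSE
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Toy example: three exact predictions and one that is off by 2
metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
print(metrics)
```

Because RMSE is the square root of MSE, the two reported values can be cross-checked: the square root of 8.5026 is about 2.9159, which matches the result above.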
EVALUATION METRICS AND ITS PLOTS
Volume vs Price for 6 Splits

Results
• Intercept is 0.07268747227117345
• Coefficient is 5.00615349

Evaluation of results for Test set (6 splits)
• Mean absolute error: 2.1780552084385225
• Mean Squared Error: 7.846847017015575
• Root mean squared error: 1.4758235695497353
• R2 Score: 0.6811464554522144

Evaluation of results for Train set (6 splits)
• Mean absolute error: 2.3371761276958445
• Mean Squared Error: 8.63311338146916
• Root mean squared error: 1.5287825639036587
• R2 Score: 0.6331644636128946

• Note: the root mean squared error values reported here equal the square root of the mean
absolute error. The square root of the mean squared error is about 2.80 (test) and 2.94 (train).

• The data is split into train and test sets using 6 splits
• From the above results, the test-set R2 score is 0.6811, i.e. the model explains about
68.11% of the variance in Price
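
A K-Fold evaluation of this kind can be sketched as follows: for every fold, the line is fitted on the train part and the R2 score is computed on both parts. This is an assumption-laden illustration, not the KFold(9).py script. The helpers are hypothetical, folds are formed by taking every k-th sample, and the data is randomly generated rather than read from Vol_price.csv.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def r2_score(y_true, y_pred):
    """Share of the variance in y_true explained by y_pred."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def k_fold_r2(xs, ys, k):
    """Return one (train R2, test R2) pair per fold."""
    n = len(xs)
    results = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # assumption: every k-th sample forms a fold
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        test_x = [xs[i] for i in sorted(test_idx)]
        test_y = [ys[i] for i in sorted(test_idx)]
        m, c = fit_line(train_x, train_y)  # fit on the train part only
        train_r2 = r2_score(train_y, [m * x + c for x in train_x])
        test_r2 = r2_score(test_y, [m * x + c for x in test_x])
        results.append((train_r2, test_r2))
    return results


# Made-up data for illustration (NOT Vol_price.csv): a noisy line with slope 5
rng = random.Random(0)
volumes = [float(v) for v in range(60)]
prices = [5.0 * v + rng.gauss(0.0, 3.0) for v in volumes]
results = k_fold_r2(volumes, prices, k=6)
for train_r2, test_r2 in results:
    print(f"train R2 = {train_r2:.4f}, test R2 = {test_r2:.4f}")
```

Averaging the test R2 over all folds gives a more stable estimate than reading the score of a single fold.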
EVALUATION METRICS AND ITS PLOTS
Volume vs Price for 7 Splits

Results
• Intercept is 0.07268747227117345
• Coefficient is 5.00615349

Evaluation of results for Test set (7 splits)
• Mean absolute error: 2.106690168287516
• Mean Squared Error: 7.4182507201259655
• Root mean squared error: 1.4514441664382118
• R2 Score: 0.6990249898793017

Evaluation of results for Train set (7 splits)
• Mean absolute error: 2.344536190212473
• Mean Squared Error: 8.682053103393914
• Root mean squared error: 1.5311878363585811
• R2 Score: 0.6322338273115873

• Note: the root mean squared error values reported here equal the square root of the mean
absolute error. The square root of the mean squared error is about 2.72 (test) and 2.95 (train).
• The data is split into train and test sets using 7 splits
• From the above results, the test-set R2 score is 0.6990, i.e. the model explains about
69.90% of the variance in Price
EVALUATION METRICS AND ITS PLOTS
Volume vs Price for 8 Splits

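Results
• Intercept is 0.07268747227117345
• Coefficient is 5.00615349

Evaluation of results for Test set (8 splits)
• Mean absolute error: 1.9901874002227724
• Mean Squared Error: 6.635036994344865
• Root mean squared error: 1.410740018650769
• R2 Score: 0.7377094776620245

Evaluation of results for Train set (8 splits)
• Mean absolute error: 2.3565584343671806
• Mean Squared Error: 8.76938690363058
• Root mean squared error: 1.5351086067009008
• R2 Score: 0.6274437855372744

• Note: the root mean squared error values reported here equal the square root of the mean
absolute error. The square root of the mean squared error is about 2.58 (test) and 2.96 (train).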

• The data is split into train and test sets using 8 splits
• From the above results, the test-set R2 score is 0.7377, i.e. the model explains about
73.77% of the variance in Price
EVALUATION METRICS AND ITS PLOTS
Volume vs Price for 9 Splits

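Results
• Intercept is 0.07268747227117345
• Coefficient is 5.00615349

Evaluation of results for Test set (9 splits)
• Mean absolute error: 2.048649542445251
• Mean Squared Error: 6.910707922234623
• Root mean squared error: 1.4313104283995317
• R2 Score: 0.7157499601499937

Evaluation of results for Train set (9 splits)
• Mean absolute error: 2.3434892642156426
• Mean Squared Error: 8.70135498942837
• Root mean squared error: 1.5308459309204316
• R2 Score: 0.6324243477825833

• Note: the root mean squared error values reported here equal the square root of the mean
absolute error. The square root of the mean squared error is about 2.63 (test) and 2.95 (train).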

• The data is split into train and test sets using 9 splits
• From the above results, the test-set R2 score is 0.7157, i.e. the model explains about
71.57% of the variance in Price
Conclusion

• On the original dataset, Volume and Price showed a correlation of
about 80%, and the model reached an R2 score of 0.64
• The price previously given for a particular volume
was found to be better as per the standards
• By applying the K-Fold method, the test-set R2 score rose to between
0.68 (6 splits) and 0.74 (8 splits), compared with 0.64 on the original set
• The final decision can be made by choosing any one of the train-test
split settings.
