
Name: Chinmay Tripurwar

Roll No: 22b3902


IIT Bombay DS203
Prof. Vinay Kulkarni

Simple Regression Model Analysis


Objective:
This exercise aims to gain a practical understanding of simple linear regression by modeling the relationship between
a single independent variable (x) and a dependent variable (y).

Dataset Overview:
The dataset, provided in 'e1.xlsx', contains values with 'x' ranging from 0 to 0.9346 and 'y' from 0 to 13.4821, split into
training and testing sets with an 80:20 ratio.
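The report does not say how the 80:20 split was produced. As a minimal sketch, assuming a shuffled split with a fixed seed, it could look like the following; the function name `train_test_split` and the sample rows are hypothetical stand-ins for the contents of 'e1.xlsx'.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle (x, y) rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in data; the real values come from 'e1.xlsx'.
rows = [(i / 10, 3.0 + 9.0 * i / 10) for i in range(10)]
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so the train and test metrics can be recomputed on the same rows.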

Section 1: Data Visualization and Model Creation


• Scatter Plot of Train Data (y vs. x)

[Figure: scatter plot of y vs. x for the training data]

• Creation of Simple Linear Regression (SLR) Models

1. Method 1: Using closed-form equations for ‘a’ and ‘b’:


Result:
After applying the closed-form equations for SLR, the coefficients for the model are determined to be:

• Slope (a) = 9.11037


• Intercept (b) = 3.20282

The intercept b is the value of y that the model predicts when the independent variable x equals 0.
The slope a is the change in y that the model predicts per unit increase in x.
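The closed-form equations themselves are not reproduced in this report. A minimal sketch, assuming the standard least-squares formulas a = Sxy / Sxx and b = mean(y) - a * mean(x), is shown below; the function name `fit_slr` and the sample data are illustrative and not taken from 'e1.xlsx'.

```python
def fit_slr(xs, ys):
    """Return (a, b) for the least-squares line y = a * x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sxy: sum of products of deviations; Sxx: sum of squared x deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx            # slope
    b = y_mean - a * x_mean  # intercept
    return a, b

# Illustrative data only.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [3.0, 5.0, 8.0, 10.0, 12.0]
a, b = fit_slr(xs, ys)
print(f"slope a = {a:.5f}, intercept b = {b:.5f}")
```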

2. Method 2: Using the in-built Linear Regression functionality of Excel

The coefficients from the closed-form equations and from Excel's built-in regression
functionality agree closely. The built-in functionality also reports various other
parameters that help gauge the performance of the model.

Section 2: Analysis of Train Data

• Calculation and Analysis of Regression Metrics

1. ycap (Predicted y): The value of 'y' predicted by the regression model for each 'x'
value. It is compared against the actual 'y' values to assess the model's
prediction accuracy.
2. e (Error): The difference between the actual 'y' value and the predicted 'ycap' for
each data point (e = y - ycap). It is a direct measure of the prediction error.
3. e_sq (Error Squared): The square of the error 'e'. Squaring the error emphasizes
larger errors more than smaller ones and is used in calculating other metrics like
MSE.
4. MAE (Mean Absolute Error): The average of the absolute values of the errors.
MAE provides a simple measure of prediction accuracy, with lower values
indicating better model performance.
5. SSE (Sum of Squared Errors): The sum of all squared errors 'e_sq'. SSE is a
measure of the total prediction error.
6. MSE (Mean Squared Error): The average of the squared errors 'e_sq'. MSE is a
common measure of model accuracy.
7. RMSE (Root Mean Squared Error): The square root of MSE. RMSE is a popular
metric for assessing model accuracy as it is in the same units as the dependent
variable and penalizes larger errors more.
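The definitions above can be sketched in a few lines. This is a minimal illustration, assuming MSE divides by the number of points n, as the "average" wording suggests; the function name `regression_metrics`, the coefficients, and the data are hypothetical.

```python
import math

def regression_metrics(xs, ys, a, b):
    """Compute MAE, SSE, MSE and RMSE for the line y = a * x + b."""
    y_cap = [a * x + b for x in xs]            # predicted values
    e = [y - yc for y, yc in zip(ys, y_cap)]   # errors (actual - predicted)
    e_sq = [err ** 2 for err in e]             # squared errors
    n = len(e)
    sse = sum(e_sq)
    mse = sse / n
    return {
        "MAE": sum(abs(err) for err in e) / n,
        "SSE": sse,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }

# Illustrative data and coefficients only.
m = regression_metrics([0.0, 0.5, 1.0], [3.5, 7.0, 12.5], a=9.0, b=3.0)
print(m)
```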
The calculated values of the above measures are:

• Scatter Plot with Superimposed ycap vs. x

The red line captures the trend of the data points, ascending diagonally from
the lower left to the upper right, indicating a positive linear relationship
between 'x' and 'y'. Most data points are close to the line, suggesting the
model fits the data reasonably well.

A few points lie farther from the line and may represent outliers or
variation not captured by the model.
• Scatter Plot of e vs. x
[Figure: scatter plot of the error e vs. x for the training data]

The errors are scattered randomly with no particular pattern, which
suggests that a linear model is appropriate for this data.

• Histogram of e

Ideally, the histogram of the errors should resemble a normal distribution.
Here the histogram is not perfectly normal.
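The histogram behind this observation can be rebuilt by counting errors per bin. The sketch below is an illustration under assumptions: the bin width of 0.5, the function name `histogram_counts`, and the error values are hypothetical and are not the report's residuals.

```python
import math

def histogram_counts(errors, bin_width=0.5):
    """Count errors per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    counts = {}
    for e in errors:
        k = math.floor(e / bin_width)
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))

# Illustrative errors only.
errors = [-1.2, -0.4, -0.3, -0.1, 0.0, 0.2, 0.3, 0.6, 1.1]
counts = histogram_counts(errors)
for k, c in counts.items():
    # One '#' per error that falls in the bin.
    print(f"{k * 0.5:+.1f} to {(k + 1) * 0.5:+.1f}: {'#' * c}")
```

A roughly symmetric, single-peaked shape centered near zero is what a normal error distribution would look like in such a plot.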

Section 3: Testing and Comparison

• Scatter Plot of Test Data (y vs. x)

[Figure: scatter plot of y vs. x]

The values of the regression metrics calculated on the test data are:


• Scatter Plot with Superimposed ycap vs. x

The model also fits the test data well, which indicates good performance.
The MSE and RMSE are slightly higher than on the training data; this might
be due to the limited amount of training data.
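The train-versus-test comparison can be sketched as follows: fit on the training rows only, then compute RMSE on both sets with the same coefficients. The helper names `fit_slr` and `rmse` and all data values are hypothetical and do not reproduce the report's numbers.

```python
import math

def fit_slr(xs, ys):
    """Closed-form least-squares slope a and intercept b."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    a = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    return a, y_mean - a * x_mean

def rmse(rows, a, b):
    """Root mean squared error of y = a * x + b over (x, y) rows."""
    return math.sqrt(sum((y - (a * x + b)) ** 2 for x, y in rows) / len(rows))

# Illustrative data only.
train = [(0.0, 3.1), (0.2, 4.9), (0.4, 6.8), (0.6, 8.4), (0.8, 10.3)]
test = [(0.3, 5.5), (0.7, 9.9)]

a, b = fit_slr(*zip(*train))   # coefficients come from the training data only
rmse_train = rmse(train, a, b)
rmse_test = rmse(test, a, b)
print(f"train RMSE = {rmse_train:.4f}, test RMSE = {rmse_test:.4f}")
```

A test RMSE moderately above the train RMSE is expected, since the coefficients were chosen to minimize the squared error on the training rows.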
• Scatter Plot of e vs. x

[Figure: scatter plot of the error e vs. x for the test data]

The errors are randomly scattered with no visible pattern, which again suggests that the model
performs well. The histogram of the errors resembles a normal distribution, which is another
indication of good model performance.

Analysis:
The model performs well: the errors on the test data do not increase sharply,
which suggests that the model predicts the dependent variable reliably.

1. Model Performance: The calculated coefficients 'a' and 'b' for our SLR model showed
a significant linear relationship between the independent variable 'x' and the
dependent variable 'y'. The positive slope indicated that as 'x' increases, 'y' also
increases.

2. Data Fit: The scatter plot with the superimposed line of predicted values (ycap)
closely followed the trend of the actual data points, suggesting that the SLR model has
a good fit.

3. Residual Analysis: The scatter plot of residuals 'e' did not exhibit any systematic
patterns, implying that the model's assumptions of linearity and homoscedasticity
(constant variance of errors) were largely met.

4. Error Metrics: The calculated error metrics such as MAE, SSE, MSE, and RMSE were
within acceptable ranges, indicating that the model's predictions were accurate and
reliable for the given data.

5. Predictive Accuracy: The comparison of train and test data error metrics suggested
that the model generalized well, maintaining its predictive accuracy on unseen data.

This exercise helped in understanding various concepts related to simple linear regression.
