
Class Exercise Demo (Polynomial Regression)

Soubhagya Dash
PGP/25/116
AIB - A

Class Exercise Objective:


Demonstrate how polynomial regression can be applied to non-linear data. We attempt to
estimate a new employee's compensation by comparing the salary data for equivalent
position levels at the hiring firm with the position level the employee held at their
previous employer.
Provided Database:
We are given access to a dataset titled "Position Salaries," which allows us to estimate the
starting salary of a newly hired employee based on the individual's current level of
experience and the pay scale of the organization they are joining.
Details of the Database:
There are a total of 10 rows and 3 columns in this dataset. Because the dataset is small,
we have decided to train on the entire dataset. We use the head() function to print the
first 5 rows of the dataset. A snapshot of the data is attached below.
Features of the Dataset are:
• Independent Variables: Position and Level.
• Dependent Variable: Salary.

Procedure and learnings:


To carry out polynomial regression on the provided dataset, the steps listed below were
performed:
1. First, we construct two data frames, one for the independent features and one for the
dependent feature. Because the dataset contains very few records, we use the whole dataset
for training purposes.
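As a sketch of this step, the data frames can be built with pandas. The position titles and salary figures below are illustrative stand-ins, since the actual values come from the provided "Position Salaries" file:

```python
import pandas as pd

# Illustrative stand-in for the "Position Salaries" data (10 rows, 3 columns);
# the actual titles and salary figures in the provided file may differ.
data = pd.DataFrame({
    "Position": ["Business Analyst", "Junior Consultant", "Senior Consultant",
                 "Manager", "Country Manager", "Region Manager", "Partner",
                 "Senior Partner", "C-level", "CEO"],
    "Level": list(range(1, 11)),
    "Salary": [45000, 50000, 60000, 80000, 110000, 150000,
               200000, 300000, 500000, 1000000],
})
print(data.head())  # first 5 rows, as in the report

# Level is the numeric independent feature; Salary is the dependent feature.
X = data[["Level"]].values
y = data["Salary"].values
```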
2. Linear Regression: We applied a linear regression model to the dataset and found that
it does not fit the data well. One way to look at R2 is as the fraction of the variance in
the target variable that can be explained by the model. When R2 is very near to 1, the
model is doing well and explains most of the variability around the mean of the target
variable; the low R2 we obtained confirms the poor fit.
3. Creation of the Linear Regression Graph: After the above step, we plotted the linear
regression line and observed that it does not pass close to the dataset points.
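Steps 2 and 3 can be sketched with scikit-learn as follows; the Level and Salary values here are illustrative stand-ins for the provided data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative stand-in values for Level (X) and Salary (y).
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000, 150000,
              200000, 300000, 500000, 1000000], dtype=float)

lin_reg = LinearRegression().fit(X, y)
r2 = lin_reg.score(X, y)  # R^2: fraction of variance explained by the line
print(f"Linear model R^2: {r2:.3f}")

# Plotting X against lin_reg.predict(X) over the data points would show
# the fitted straight line passing far from most of them.
```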
4. Polynomial Regression: Since linear regression does not fit the dataset well, we now
build a polynomial regression model. The PolynomialFeatures function is used to assign the
degree of the polynomial curve we are going to draw, and the degree is set to 4. During
this process, the variable X is transformed into a new matrix known as X_poly, which is
made up of all the polynomial combinations of the features up to degree 4.
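A minimal sketch of this step, again using illustrative salary values in place of the provided file:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1, 11).reshape(-1, 1)   # Level (illustrative values)
y = np.array([45000, 50000, 60000, 80000, 110000, 150000,
              200000, 300000, 500000, 1000000], dtype=float)

# Expand X into the columns [1, x, x^2, x^3, x^4] and fit a linear model on them.
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)
poly_reg = LinearRegression().fit(X_poly, y)

print("Columns in X_poly:", X_poly.shape[1])
print(f"Degree-4 model R^2: {poly_reg.score(X_poly, y):.3f}")
```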
5. When we plot a curve of the actual and predicted values, we can conclude from the plot
that the resulting curve fits the data quite well. Compared with the earlier linear
regression model, the performance metrics of this model show significant improvement;
however, the error values remain rather high. A snapshot showing this is added below:
6. Choosing the Degree of the Polynomial:
1. To understand how to determine the degree of the polynomial, we use a file called
"Curve.csv." We first run some exploratory analysis on the dataset; the results of this
study are attached further down. Using this data, we rebuild our model and try to find the
most suitable degree for the polynomial curve.
D. Model with Degree 5

E. Model with Degree 15

Note: The preceding images make it clear that overfitting sets in as the degree increases
beyond the fifth. Consequently, the degree of our polynomial should go no higher than 5,
and the best degree lies between 2 and 5.
2. Next, we experiment with different degrees (ranging from 1 to 20) and measure the
training and testing errors for each.

Additional Learning: The training and test data are split 80:20.
3. The snapshot of the data is attached below:

Additional Learning: Overfitting can be recognized as the point where the testing error
becomes greater than the training error, so we should pick a degree no higher than 5.
4. Plotting the RMSE of the training set against that of the testing set reveals the overfitting.

Additional Learning: The error on the test set first decreases, but then gradually climbs
once a certain threshold of complexity (around degree 5) is reached. As the model
complexity increases, the error on the training set keeps decreasing. Therefore, the best
model complexity is five, which gives both low bias and low variance.
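The degree-selection experiment above can be sketched as follows. Since "Curve.csv" is not reproduced here, a synthetic noisy non-linear curve stands in for it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for "Curve.csv": a noisy non-linear curve.
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 100)

# 80:20 train/test split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

train_rmse, test_rmse = [], []
for degree in range(1, 21):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    train_rmse.append(np.sqrt(mean_squared_error(
        y_train, model.predict(poly.transform(X_train)))))
    test_rmse.append(np.sqrt(mean_squared_error(
        y_test, model.predict(poly.transform(X_test)))))

# Typically, training RMSE keeps falling as the degree grows, while test RMSE
# eventually rises again once the model starts overfitting.
print("Best degree by test RMSE:", int(np.argmin(test_rmse)) + 1)
```

Plotting `train_rmse` and `test_rmse` against degree reproduces the bias-variance picture described above.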
I would like to conclude my report here.
