
Salary Prediction with Non-linear Data

Class Exercise
Dimple Rohera
PGP/25/050

We started by importing the required libraries and the dataset.

Next, we checked the size of the data and produced a summary.

We begin with linear regression because we do not yet know how the data are distributed or which
model is appropriate. We then calculate the MAE and MSE, which turn out to be on the higher side,
and after that the RMSE value.

R2 was also calculated.
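
The linear regression step and the four metrics can be sketched roughly as follows. This is only an illustration: the salary data here is synthetic (an assumed quadratic relation between a hypothetical position level and salary), and NumPy's least-squares fit stands in for whichever library the exercise actually used.

```python
import numpy as np

# Hypothetical data (not the exercise dataset): salary grows
# non-linearly with position level, plus some random noise.
rng = np.random.default_rng(0)
level = np.arange(1, 11, dtype=float)
salary = 40_000 + 3_000 * level**2 + rng.normal(0, 2_000, size=level.size)

# Fit a straight line (degree-1 polynomial) by ordinary least squares.
slope, intercept = np.polyfit(level, salary, deg=1)
predicted = slope * level + intercept

# Error metrics computed from the residuals.
residuals = salary - predicted
mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals**2)        # mean squared error
rmse = np.sqrt(mse)                # root mean squared error
# R2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(residuals**2) / np.sum((salary - salary.mean()) ** 2)

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")
```

RMSE is in the same units as the salary, which makes it easier to interpret than MSE.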

The closer the R2 value is to 1, the better the model fits and the smaller the error. When we plot the
fitted line against the data, we can see that the points follow a curve, so the straight line leaves
a large error. Hence we preferred polynomial regression. We also analysed the dataset.

We then draw the graph of the polynomial fit and can see that it fits the data better.

We then measure the MAE, MSE and RMSE again and observe that the values are smaller than before.
Thus, polynomial regression suits this dataset better.
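
The comparison between the linear and the polynomial fit might look like the sketch below. The data is again synthetic, the helper name `fit_and_score` is made up for this illustration, and degree 2 is an assumption; the report does not state which degree was used at this point.

```python
import numpy as np

# Hypothetical curved data, standing in for the exercise dataset.
rng = np.random.default_rng(0)
level = np.arange(1, 11, dtype=float)
salary = 40_000 + 3_000 * level**2 + rng.normal(0, 2_000, size=level.size)


def fit_and_score(x, y, degree):
    """Fit a polynomial of the given degree and return its error metrics."""
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = y - np.polyval(coeffs, x)
    mse = np.mean(residuals**2)
    return {
        "MAE": float(np.mean(np.abs(residuals))),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
    }


linear_scores = fit_and_score(level, salary, degree=1)
poly_scores = fit_and_score(level, salary, degree=2)  # assumed degree

for name, scores in [("linear", linear_scores), ("polynomial", poly_scores)]:
    print(name, {k: round(v, 1) for k, v in scores.items()})
```

Note that these are training-set errors, which can only go down as the degree goes up, so on their own they do not show whether the higher-degree model generalizes.
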
We then choose a different dataset, since the previous one is too small for the experiment, and we
use the summary function to get a summary of the data.
To choose the degree of the polynomial, we proceed by trial and error. First, we begin with
degree 1, i.e. linear regression. Then degree 2, then 3, then degree 5.
We then split the data into a training set and a test set. The test set is 20% of the total data. We
then calculate the RMSE for degrees ranging from 1 to 20 and choose the degree of the polynomial
that has the lowest RMSE. We then plot RMSE against degree so as to choose the right degree.
Beyond degree 5 the model overfits.
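
A minimal sketch of the degree search described above, assuming synthetic data in place of the second dataset. `Polynomial.fit` from NumPy is used here because it rescales the inputs internally, which keeps high-degree fits numerically stable; the exercise may well have used a different library.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical data: a cubic trend plus noise (not the exercise dataset).
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 0.5 * x**3 - 4 * x**2 + 3 * x + rng.normal(0, 10, size=x.size)

# Shuffle the row indices and hold out 20% of the data as the test set.
indices = rng.permutation(x.size)
n_test = int(0.2 * x.size)
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Fit on the training set, measure RMSE on the test set, degrees 1 to 20.
rmse_by_degree = {}
for degree in range(1, 21):
    model = Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
    residuals = y[test_idx] - model(x[test_idx])
    rmse_by_degree[degree] = float(np.sqrt(np.mean(residuals**2)))

# Pick the degree with the lowest test RMSE.
best_degree = min(rmse_by_degree, key=rmse_by_degree.get)

for degree, rmse in rmse_by_degree.items():
    print(f"degree {degree:2d}: test RMSE = {rmse:.2f}")
print("lowest test RMSE at degree", best_degree)
```

Plotting `rmse_by_degree` against the degree gives the curve mentioned above. Because the error is measured on held-out data, it stops improving (and may rise again) once the extra terms only fit noise.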

Clustering demo

We draw a scatter plot to identify the number of clusters. Three groups are visible: high income
in the low age group, medium income in the high age group, and lower income spread across all age
groups. We apply k-means clustering with k=3. It uses Euclidean distance to measure how far each
observation is from the centroids. But we find that some observations do not fall into the
expected clusters. We then normalize the data by standardizing the features. After this, the
scatter plot shows 3 clusters and 3 cluster centres.
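
The effect of standardizing before k-means can be sketched as below, assuming scikit-learn and made-up age and income values. Without scaling, income (tens of thousands) dominates the Euclidean distance and age (tens) barely counts, which is one likely reason observations land in unexpected clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers (columns: age, income), three rough groups.
rng = np.random.default_rng(1)
young_high = np.column_stack([rng.normal(28, 3, 30), rng.normal(90_000, 5_000, 30)])
older_mid = np.column_stack([rng.normal(55, 4, 30), rng.normal(55_000, 5_000, 30)])
low_any_age = np.column_stack([rng.uniform(20, 65, 30), rng.normal(20_000, 3_000, 30)])
data = np.vstack([young_high, older_mid, low_any_age])

# Standardize each feature to mean 0 and standard deviation 1 so that
# age and income contribute comparably to the Euclidean distance.
scaled = StandardScaler().fit_transform(data)

# k-means with k=3 on the standardized features.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)
centres = kmeans.cluster_centers_  # in standardized units

print("cluster sizes:", np.bincount(labels))
print("cluster centres (standardized):")
print(centres.round(2))
```

The centres are reported in standardized units; the fitted scaler's `inverse_transform` would map them back to years and currency if needed.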

Creating product segments

The business plans to introduce a new beer brand, so market research is necessary to gather
information about other brands' calorie, salt, and alcohol levels. Given that there are more than 2
variables in this situation, a single scatter plot cannot show all of them. Instead, we use a
dendrogram, which is essentially a tree that groups all of the observations hierarchically. From
the tree we can split the brands into segments. We also picture the result with an elbow curve.
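
A rough sketch of the hierarchical clustering behind a dendrogram, using SciPy. The brand names and feature values are invented for illustration, and both Ward linkage and the cut into 3 segments are assumptions; the report does not say which linkage or how many segments were used.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical brands (columns: calories, sodium, alcohol %).
brands = ["Brand A", "Brand B", "Brand C", "Brand D", "Brand E", "Brand F"]
features = np.array([
    [150, 15, 4.7],
    [145, 18, 4.9],
    [100,  8, 4.2],
    [ 95, 10, 4.0],
    [ 70,  6, 2.5],
    [ 68,  7, 2.3],
])

# Standardize so that calories do not dominate the distances.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Build the merge tree; Ward linkage is an assumed choice.
merge_tree = linkage(scaled, method="ward")

# Compute the dendrogram layout without drawing it (no_plot=True);
# drop no_plot to draw it when matplotlib is available.
tree = dendrogram(merge_tree, labels=brands, no_plot=True)
print("leaf order:", tree["ivl"])

# Cut the tree into at most 3 segments (assumed number).
segments = fcluster(merge_tree, t=3, criterion="maxclust")
for brand, segment in zip(brands, segments):
    print(brand, "-> segment", segment)
```

Each row of `merge_tree` records one merge and the distance at which it happened; a large jump in that distance is one common cue for where to cut the tree.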
