Class Exercise
Dimple Rohera
PGP/25/050
Later the size of data was checked and a summary was produced.
We start with linear regression because we do not yet know how the data behave or which model will
actually apply.
We then calculate the MAE and MSE, which are observed to be on the higher side, followed by the RMSE
value.
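The three error measures can be computed directly from the observed and predicted values. The following is a minimal sketch in Python; the function name `regression_errors` and the sample numbers are made up for illustration and are not taken from the class dataset.

```python
import math

def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean squared error
    rmse = math.sqrt(mse)                      # square root of MSE, in the target's units
    return mae, mse, rmse

# Illustrative values only (assumption: not the class data)
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
mae, mse, rmse = regression_errors(actual, predicted)
print(mae, mse, rmse)
```

Because MSE squares each residual, a few large misses raise it much more than they raise MAE, which is why the three values are usually read together.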
For a goodness-of-fit value (such as R-squared), the closer it is to 1, the less error-prone the model
is. When we plot the fitted line against the data, the points appear to follow a curve, so the straight
line leaves more error.
Hence we preferred polynomial regression. We also analysed the dataset.
We then draw the graph and can see that the polynomial curve fits better.
We then measure the MAE, MSE and RMSE and observe that the values are smaller than before. Thus,
polynomial regression suits this dataset better.
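The comparison between the linear and the polynomial fit can be reproduced along the following lines. This is only a sketch: the data are synthetic (a noisy quadratic generated for the example), the helper `rmse_of_fit` is a hypothetical name, and NumPy is an assumed tool rather than the one used in class.

```python
import numpy as np

# Synthetic curved data (assumption: stands in for the class dataset)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

def rmse_of_fit(x, y, degree):
    """Fit a polynomial of the given degree by least squares and return its RMSE."""
    coeffs = np.polyfit(x, y, degree)        # degree 1 is ordinary linear regression
    predictions = np.polyval(coeffs, x)
    return float(np.sqrt(np.mean((y - predictions) ** 2)))

rmse_linear = rmse_of_fit(x, y, 1)
rmse_quadratic = rmse_of_fit(x, y, 2)
print(f"linear RMSE: {rmse_linear:.3f}, quadratic RMSE: {rmse_quadratic:.3f}")
```

Note that these errors are measured on the same points used for fitting, so a higher degree can never look worse here; that is a reason to check the result on held-out data as well.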
We choose a different dataset since the previous one is too small for the experiment, and we use the
summary function to get a summary of the data.
To choose the degree of the polynomial, we proceed by trial and error. First, we begin with degree 1,
i.e. linear regression, then try degree 2, then 3, then 5.
We then split the data into a training set and a test set, with the test set holding 20% of the total
data. We then calculate the RMSE for degrees ranging from 1 to 20 and choose the degree of the
polynomial that has the lowest RMSE. We plot RMSE against degree to help choose the right degree.
Beyond degree 5 the model overfits.
Clustering demo
We draw a scatter plot to identify the number of clusters. We see a high-income, low-age group, a
medium-income, high-age group, and a low-income group that is spread across all age groups. We apply
k-means clustering with k=3. It uses Euclidean distance to calculate distances to the centroids. But
we find that some observations do not fall in the expected clusters. We then apply normalization by
standardizing the features. After this, the scatter plot shows 3 clusters and 3 cluster centres.
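The effect of standardizing before k-means can be sketched as below. This is a plain implementation of Lloyd's algorithm written for illustration, not the routine used in class; the age and income values are synthetic, and the names `standardize` and `kmeans` are hypothetical.

```python
import numpy as np

def standardize(features):
    """Rescale each column to mean 0 and standard deviation 1."""
    return (features - features.mean(axis=0)) / features.std(axis=0)

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means with Euclidean distance; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic age and income (assumption), deliberately on very different scales
rng = np.random.default_rng(1)
age = np.concatenate([rng.normal(25, 3, 30), rng.normal(55, 4, 30), rng.normal(40, 10, 30)])
income = np.concatenate([rng.normal(90000, 5000, 30), rng.normal(60000, 5000, 30),
                         rng.normal(25000, 4000, 30)])
data = np.column_stack([age, income])

scaled = standardize(data)
labels, centroids = kmeans(scaled, k=3)
print(centroids)
```

Without standardizing, income (in the tens of thousands) would dominate the Euclidean distance and age (in the tens) would barely influence the clusters, which is one plausible reason some observations land in unexpected groups.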
The business plans to introduce a new beer brand, so market research is necessary to gather
information about other brands' calorie, salt, and alcohol levels. Since there are more than 2
variables in this situation, a scatter plot cannot be used. Instead, we use a dendrogram, which is
essentially a tree depicting all of the data by classification. The tree can then be split into
clusters. We picture this with an elbow curve.
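A dendrogram of this kind can be built with hierarchical clustering. The sketch below assumes SciPy is available and uses Ward linkage, which is one common choice and not necessarily the method used in class; the six brands and their calorie, salt, and alcohol figures are invented for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Hypothetical beer data: calories, salt, alcohol (illustrative values only)
brands = ["A", "B", "C", "D", "E", "F"]
features = np.array([
    [150.0, 15.0, 4.7],
    [145.0, 14.0, 4.6],
    [100.0,  8.0, 3.2],
    [ 98.0,  9.0, 3.0],
    [175.0, 24.0, 5.5],
    [170.0, 22.0, 5.4],
])

# Standardize so that no single variable dominates the Euclidean distance
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Build the merge tree; each row records one merge and the height at which it happens
tree = linkage(scaled, method="ward")

# Split the tree into at most 3 clusters
cluster_ids = fcluster(tree, t=3, criterion="maxclust")

# Leaf order of the dendrogram; no_plot=True computes the layout without drawing it
leaf_order = dendrogram(tree, labels=brands, no_plot=True)["ivl"]
print(dict(zip(brands, cluster_ids.tolist())), leaf_order)
```

Dropping `no_plot=True` draws the tree with matplotlib. The merge heights in `tree[:, 2]` can also be plotted against the number of clusters to get an elbow-style curve for choosing where to cut.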