Submitted By :-
Name - Rishabh Kumar
CRN - 1921129
URN - 1905388
Table of Contents :-
Practical 1 : Implement Simple Linear Regression
Practical 2 : Implement Random Forest Regression
Practical 3 : Implement Logistic Regression
Practical 4 : Implement the Decision Tree classification algorithm
Practical 5 : Implement the k-nearest neighbours classification algorithm
Practical 6 : Implement the Naive Bayes classification algorithm
Practical 7 : Implement K-means clustering to Find Natural Patterns in Data
Practical 8 : Implement K-Mode Clustering
Handling Imbalanced Datasets
Performance Metrics in Machine Learning

Practical 1 : Implement Simple Linear Regression.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Figure - 1.1
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
Splitting the dataset into the Training set and Test set
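The split and model-fitting cells did not survive extraction; below is a minimal sketch using scikit-learn, where the 1/3 test size and random_state=0 are assumptions rather than the original values.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split the data (test size and seed are assumptions)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

# Fit a simple linear regressor on the training split
regressor = LinearRegression()
regressor.fit(X_train, y_train)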
y_pred = regressor.predict(X_test)
Figure - 1.2
Visualising the Test set results
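The plotting cell is missing; a sketch of the usual test-set visualisation (the title and axis labels are assumptions based on the Salary_Data columns):
plt.scatter(X_test, y_test, color = 'red')                      # actual test points
plt.plot(X_train, regressor.predict(X_train), color = 'blue')   # fitted regression line
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()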
Figure - 1.3
Practical 2 : Implement Random Forest Regression.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Figure - 2.1
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
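The cell that builds and fits the forest is missing; a sketch consistent with the truncated model printout and the prediction below (n_estimators=10 and random_state=0 are assumptions):
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on the whole (small) dataset
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X, y)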
Output :
RandomForestRegressor(..., min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0, ...)
regressor.predict([[6.5]])
array([167000.])
Figure - 2.2
Practical 3 : Implement Logistic Regression
import pandas as pd
import numpy as np
dataset = pd.read_csv('...\\User_Data.csv')
# input
x = dataset.iloc[:, [2, 3]].values
# output
y = dataset.iloc[:, 4].values
Figure - 3.1
Splitting the dataset into the Training set and Test set
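The split, scaling, and fitting cells are missing; a sketch that matches the variable names used below (xtest, classifier). The split ratio, seed, and scaler settings are assumptions.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Split into training and test sets (ratio and seed are assumptions)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.25, random_state = 0)

# Standardise the two features
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(xtrain)
xtest = sc_x.transform(xtest)

# Fit the logistic regression classifier
classifier = LogisticRegression(random_state = 0)
classifier.fit(xtrain, ytrain)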
Output :
y_pred = classifier.predict(xtest)
Test the performance of our model
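The evaluation cell is missing; a sketch using scikit-learn's confusion matrix and accuracy score:
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(ytest, y_pred)   # rows: actual class, columns: predicted class
print("Confusion Matrix :\n", cm)
print("Accuracy :", accuracy_score(ytest, y_pred))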
Output :-
Output :
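Only the tail of the decision-boundary plot survived; a sketch of the missing mesh-grid and contour setup (plotting the test split and the red/green colours are assumptions):
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

X_set, y_set = xtest, ytest  # assumption: visualise the test split
# Build a fine grid over the feature plane and colour it by the model's prediction
X1, X2 = np.meshgrid(np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
                     np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))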
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Output :-
Practical 4 : Implement the Decision Tree classification algorithm
Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
Importing datasets
data_set= pd.read_csv('user_data.csv')
Figure - 4.1
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
Splitting the dataset into training and test set
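The split cell is missing; a minimal sketch (the 25% test size and seed are assumptions):
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)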
Feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
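The cell defining the classifier is missing; a sketch using scikit-learn's decision tree (the entropy criterion and seed are assumptions):
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)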
classifier.fit(x_train, y_train)
Output :-
y_pred= classifier.predict(x_test)
Output :-
Output :-
Visualising the training set result
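The mesh-grid and contour lines of the plot were lost; a sketch of the standard setup (the purple/green colours are assumptions):
from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train
# Grid over the scaled feature plane, coloured by the tree's prediction
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))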
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output :-
Visualising the test set result
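As above, the mesh-grid setup for the test-set plot is a reconstruction:
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))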
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output :-
Practical 5 : Implement the k-nearest neighbours classification algorithm
Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
Importing datasets
data_set= pd.read_csv('user_data.csv')
Figure 5.1
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
Splitting the dataset into training and test set
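The split cell is missing; a minimal sketch (the 25% test size and seed are assumptions):
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)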
Feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
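The classifier definition is missing; a sketch with common K-NN settings (n_neighbors=5 and the Minkowski/Euclidean metric are assumptions):
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)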
classifier.fit(x_train, y_train)
Output :-
y_pred= classifier.predict(x_test)
Output :-
Output :-
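Only the tail of the decision-boundary plot survived; a reconstruction of the mesh-grid setup (plotting the training split and the red/green colours are assumptions):
from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))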
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output :-
Practical 6 : Implement the Naive Bayes classification algorithm.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
Figure 6.1
Splitting the dataset into the Training set and Test set
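The split cell is missing; a sketch matching the variable names printed below (the 25% test size and seed are assumptions); print(X_train) is included since the first Output below has no surviving print call:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
print(X_train)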
Output :
print(y_train)
Output :
print(X_test)
Output :
print(y_test)
Output :
Feature Scaling
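The scaling cell is missing; a sketch whose sc name is taken from the prediction call further below:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)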
print(X_train)
Output :
print(X_test)
Output :
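The fitting cell is missing; a sketch whose printed repr matches the line below:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)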
GaussianNB(priors=None, var_smoothing=1e-09)
print(classifier.predict(sc.transform([[30,87000]])))
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
Output :
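The metric cell that produced the confusion matrix and accuracy below is missing; a standard sketch:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))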
[[65  3]
 [ 7 25]]
0.9
Visualising the Training set results
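The plotting cell is missing; a condensed sketch of the usual decision-boundary plot (the colours and title are assumptions):
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
                     np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()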
Output :
Output :
Practical 7 : Implement K-means clustering to Find Natural Patterns in Data.
K-Means Clustering
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values
Figure 7.1
Using the elbow method to find the optimal number of clusters
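The elbow-method cell is missing; a sketch computing the within-cluster sum of squares (WCSS) for k = 1..10 (the range, init method, and seed are assumptions):
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ = within-cluster sum of squares
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()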
Output :
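Assuming the elbow appears at five clusters (a common result for these two Mall_Customers features), a sketch of fitting the final model and visualising the clusters:
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
# One scatter call per cluster, plus the centroids
for c in range(5):
    plt.scatter(X[y_kmeans == c, 0], X[y_kmeans == c, 1], s = 100, label = 'Cluster ' + str(c + 1))
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()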
Output :
Practical 8 : Implement K-Mode Clustering
import pandas as pd
import numpy as np
from kmodes.kmodes import KModes
Figure 8.1
Elbow curve to find optimal K
cost = []
K = range(1,5)
for num_clusters in list(K):
    kmode = KModes(n_clusters=num_clusters, init = "random", n_init = 5, verbose=1)
    kmode.fit_predict(data)
    cost.append(kmode.cost_)
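A sketch of plotting the resulting elbow curve (assumes matplotlib; the marker style is arbitrary):
import matplotlib.pyplot as plt
plt.plot(K, cost, 'bx-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Cost')
plt.title('Elbow Method for optimal K')
plt.show()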
Output :
Balanced Dataset: Let's take a simple example: if our dataset contains approximately as many positive values as negative values, then we can say our dataset is balanced.
Consider orange as the positive class and blue as the negative class; the number of positive values and negative values is approximately the same.
Imbalanced Dataset: If there is a very high difference between the number of positive values and negative values, then we can say our dataset is imbalanced.
Precision: the number of true positives divided by the total number of positive predictions.
Recall: the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate; it equals the number of true positives divided by the sum of true positives and false negatives.
2. Over-sampling (Up Sampling): This technique is used to modify the unequal data classes to create balanced datasets. When the quantity of data is insufficient, the over-sampling method balances the dataset by increasing the size of the rare (minority) class.
3. Under-sampling (Down Sampling): This technique balances the imbalanced dataset by reducing the size of the class which is in abundance. There are various methods for classification problems, such as cluster centroids and Tomek links. The cluster centroid method replaces a cluster of samples with the cluster centroid of a K-means algorithm, and the Tomek link method removes unwanted overlap between classes until all minimally distanced nearest neighbours are of the same class.
4. Feature Selection: To tackle the imbalance problem, we can apply a one-sided metric such as the correlation coefficient (CC) and odds ratios (OR), or a two-sided metric such as information gain (IG) and chi-square (CHI), on both the positive class and negative class. Based on the scores, we then identify the significant features from each class and take the union of these features to obtain the final set of features.
5. Cost-Sensitive Learning: Cost-Sensitive Learning (CSL) takes the misclassification costs into consideration and minimises the total cost. The goal of this technique is mainly to pursue a high accuracy of classifying examples into a set of known classes; it plays an important role in machine learning algorithms, including real-world data mining applications.
6. Ensemble Learning Techniques
The ensemble-based method is another technique used to deal with imbalanced datasets; it combines the results of several classifiers, i.e., the outputs of multiple base learners. There are various approaches in ensemble learning, such as bagging and boosting.
Imbalanced data is one of the potential problems in the field of data mining and machine learning. This problem can be approached by properly analyzing the data. A few approaches that help us in tackling the problem at the data level are under-sampling, over-sampling, and feature selection. Moving forward, there is still a lot of research to be done on handling data imbalance.
Performance Metrics in Machine Learning
Performance metrics are a part of every machine learning pipeline. They tell you whether you are making progress, and they put a number on it. All machine learning models, whether linear regression or a SOTA technique like BERT, need a metric to judge performance.
Regression metrics
Regression models have continuous output, so we need a metric based on calculating some sort of distance between predictions and ground truth.
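The metric definitions that followed were lost in extraction; as standard examples, the two most common error distances are (notation assumed: y_i is the ground truth, \hat{y}_i the prediction, N the number of samples):
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2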
The point of calculating this coefficient, R² (the coefficient of determination), is to answer the question: "How much (what %) of the total variation in Y (the target) is explained by the variation in X (the regression line)?"
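For reference, the standard definition (a reconstruction; \bar{y} is the mean of the targets):
R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}
    = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}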
Adjusted R²
The vanilla R² method suffers from some demons, like misleading the researcher into believing that the model is improving as the score increases when, in reality, no learning is happening. This can occur when a model overfits the data: the explained variance can look close to 100% even though nothing has actually been learned. To rectify this, R² is adjusted for the number of independent variables.
Adjusted R² is always lower than R², as it adjusts for the increasing predictors and
only shows improvement if there is a real improvement.
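The standard adjustment, where n is the number of samples and k the number of independent variables:
R^2_{\mathrm{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}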
Classification metrics
Classification problems are one of the world’s most widely researched areas. Use
cases are present in almost all production and industrial environments. Speech
recognition, face recognition, text classification – the list is endless.
Confusion Matrix
Confusion Matrix is a tabular visualization of the ground-truth labels versus model
predictions. Each row of the confusion matrix represents the instances in a predicted
class and each column represents the instances in an actual class. Confusion Matrix is
not exactly a performance metric but sort of a basis on which other metrics evaluate
the results.
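A minimal sketch of computing a confusion matrix with scikit-learn, using hypothetical labels (note that scikit-learn's convention puts actual classes on the rows and predicted classes on the columns):
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions
print(confusion_matrix(y_true, y_pred))
# [[3 1]
#  [1 3]]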
In order to understand the confusion matrix, we need to set some value for the null hypothesis as an assumption. For example, from our Breast Cancer data, let's take our null hypothesis H₀ to be "The individual has cancer".