Introduction To AI & ML: Lab Manual
Department of Computer Science and Engineering
The NorthCap University, Gurugram
Session 2021-22
Published by:
© Copyright Reserved
Copying or facilitating the copying of lab work constitutes cheating and is considered use
of unfair means. Students indulging in copying or facilitating copying shall be awarded zero
marks for that particular experiment. Frequent cases of copying may lead to disciplinary
action. Attendance in lab classes is mandatory.
Labs are open up to 7 PM upon request. Students are encouraged to make full use of labs
beyond normal lab hours.
PREFACE
The Machine Learning Lab Manual is designed to meet the course and program requirements of
the NCU curriculum for B.Tech III year students of the CSE branch. The aim of the lab work is
to give students brief practical experience of basic lab skills. It provides the space and
scope for self-study so that students can come up with new and creative ideas.
The lab manual is written on a “teach yourself” pattern, and it is expected that students
who come with proper preparation will be able to perform the experiments without any
difficulty. A brief introduction to each experiment, with information about self-study
material, is provided. The prerequisite is a basic working knowledge of Python. The
laboratory exercises include familiarization with data pre-processing techniques for ML,
such as handling missing data, duplicate data and outliers, feature scaling, and encoding.
Feature selection and dimensionality reduction are included to enhance performance and
reduce computational time. Various ML classification and regression techniques are taught.
Students learn the algorithms pertaining to these and implement them using a high-level
language, i.e., Python. Students are expected to come thoroughly prepared for the lab.
General discipline, safety guidelines and report writing are also discussed.
The lab manual is a part of the curriculum of The NorthCap University, Gurugram.
A teacher’s copy of the experimental results and answers to the questions is available as
sample guidelines.
We hope that the lab manual will be useful to students of the CSE, IT, ECE and BSc branches,
and the authors request readers to kindly forward their suggestions and constructive
criticism for further improvement of the workbook.
The authors express deep gratitude to the Members, Governing Body, NCU for their
encouragement and motivation.
Authors
The NorthCap University
Gurugram, India
CONTENTS
Syllabus 6
1 Introduction 9
2 Lab Requirement 10
3 General Instructions 11
4 List of Experiments 13
6 List of Projects 15
7 Rubrics 16
COURSE TEMPLATE
Department: Department of Computer Science and Engineering
Course Name: Machine Learning
Course Code: CSL236, L-T-P: 3-0-2, Credits: 4
Type of Course (Check one): Program Core Program Elective Open Elective
Pre-requisite(s), if any: Introduction to AI and ML
Frequency of offering (check one): Odd Even Either semester Every semester
Brief Syllabus:
Introduction to artificial intelligence, History of AI, Proposing and evaluating AI application, Preprocessing
and Feature Engineering, Case study: Exploratory Analysis of Delhi Pollution, Simple Linear Regression,
Multiple Regression, Polynomial Regression, Support Vector Regression SVR, Decision Tree Regression,
Random Forest Regression, Logistic Regression, K Nearest Neighbors, Support Vector Machine, Kernel
SVM, Naïve Bayes, Decision Trees Classification, Random Forest Classification, Basic Terminologies:
Overfitting, Underfitting, Bias and Variance model, Bootstrapping, Cross-Validation and Resampling
Methods, Performance Measures: Confusion matrix, ROC. Comparing two classification Algorithms:
McNemar’s Test, paired t-test.
Total Lecture, Tutorial and Practical hours for this course (taking 15 teaching weeks per semester): 75
Lectures: 40 hours; Tutorials: 0 hours; Lab Work (Practice): 35 hours
Course Outcomes (COs)
How this course will be practically useful to students once it is completed:
CO 1 Understand and implement the preprocessing of the data to be used for machine learning model.
CO 2 Understand the strengths and limitations of various ML algorithms.
CO 3 Understand why models degrade and how to maintain them.
CO 4 Implement and use model evaluation metrics.
CO 5 Apply ML techniques and technologies to solve real world business problems.
UNIT WISE DETAILS No. of Units: 5
Unit Number: 1 Title: Introduction to AI and ML No. of hours: 4
Content Summary:
Introduction to artificial intelligence, History of AI, Overview of machine learning, techniques in machine
learning, deep learning, differences between deep learning, machine learning and AI, different
applications of machine learning, different types of data.
Unit Number: 2 Title: Data Preprocessing and Engineering No. of hours: 10
Content Summary:
Introduction to Data Preprocessing, different preprocessing techniques, data cleaning, data
Logistic Regression, K Nearest Neighbors, Support Vector Machine, Kernel SVM, Naïve Bayes, Decision
Trees Classification, Random Forest Classification
1. INTRODUCTION
That ‘learning is a continuous process’ cannot be over-emphasized. The theoretical
knowledge gained during lecture sessions needs to be strengthened through practical
experimentation. Thus, practical work forms an integral part of the learning process.
OBJECTIVE:
The purpose of conducting experiments can be stated as follows:
1. To familiarize the students with the basic concepts of Machine Learning like
supervised, unsupervised and reinforcement learning.
2. The lab sessions will be based on exploring the concepts discussed in class.
3. Learning and understanding Data Preprocessing techniques.
4. Learning and understanding regression and classification problems and
algorithms.
5. Learning and understanding Feature selection and Dimensionality Reduction.
6. Learning and understanding performance metrics.
7. Hands-on experience.
2. LAB REQUIREMENTS
1 Software Requirements: Python 3
4 Required Bandwidth: NA
3. GENERAL INSTRUCTIONS
3.2 Attendance
Attendance in the lab class is compulsory.
Students should not attend a different lab group/section other than the one
assigned at the beginning of the session.
If a student misses his/her lab classes on account of illness or family problems,
he/she may be assigned a different group to make up the loss, in consultation
with the concerned faculty / lab instructor, or may work in the lab during
spare/extra hours to complete the experiment. No attendance will be granted in
such cases.
performed and bring to lab class for evaluation in the next working lab.
Sufficient space in work book is provided for independent writing of theory,
observation, calculation and conclusion.
A zero-tolerance policy applies to copying / plagiarism: zero marks will be
awarded if work is found copied, and repeat offences will lead to disciplinary
action.
Refer Annexure 1 for Lab Report Format
4. LIST OF EXPERIMENTS
1. Build an ML model from scratch using data pre-processing and regression algorithms and
calculate various performance metrics. [Python (Jupyter); Units 1,2,3,4; CO1-CO5; 5 hours]
2. Build an ML model from scratch using data pre-processing and classification algorithms and
calculate various performance metrics. [Python (Jupyter); Units 1,2,3,4; CO1-CO5; 5 hours]
6. LIST OF PROJECTS
1. Titanic Challenge:
2. House Price Prediction Using Advanced Regression Techniques:
3. Mechanism of Action (MoA) Prediction
7. RUBRICS
Marks Distribution
Continuous Evaluation (20 Marks): Each experiment shall be evaluated for 10 marks, and at
the end of the semester proportional marks shall be awarded out of a total of 20. The
breakup of the 10 marks for each experiment is:
4 Marks: Observation & conduct of the experiment. The teacher may ask questions about the
experiment.
3 Marks: Report writing.
3 Marks: A 15-minute quiz conducted in every lab.
Project Evaluations (30 Marks): Both the projects shall be evaluated for 30 marks each, and
at the end of the semester a viva will be conducted related to the projects as well as the
concepts learned in the labs; this component carries 20 marks.
Annexure 1
(Student Lab Report)
Introduction to AI and ML
(CSL 236)
Semester: 5th
Group: ECE
1. LIST OF EXPERIMENTS
EXPERIMENT NO. 1
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
To understand the basic libraries of Python and differentiate between NumPy and Pandas.
Outcome:
Students will be familiarized with handling datasets using Python's basic libraries and with
applying various operations on datasets using them.
Problem Statement:
To introduce various libraries of Python used for machine learning.
Background Study:
Basic Python libraries are necessary to import datasets and to apply various data pre-processing
and machine learning techniques to them.
Question Bank:
1. How can Pandas be used to read data from the internet and from your system?
One crucial feature of Pandas is its ability to read and write CSV, Excel, and many other
types of files. Functions like the Pandas read_csv() method, which accepts both local file
paths and URLs, enable you to work with files effectively. You can use them to save the data
and labels from Pandas objects to a file and load them later as Pandas Series or DataFrame
instances.
2. How can a Pandas DataFrame be converted to a NumPy array and vice versa?
DataFrame.to_numpy(dtype=None, copy=False) converts a DataFrame to a NumPy array;
pd.DataFrame(array) converts an array back to a DataFrame.
3. How to access different rows and columns using loc and iloc?
dataset.iloc[:, 1:3] selects columns by integer position, while loc selects by label, e.g.
dataset.loc[:, 'col_a':'col_b'] (loc does not accept positional slices such as :-1).
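As a hedged illustration (the DataFrame and its column names below are hypothetical), the answers above can be demonstrated end to end:

```python
import pandas as pd

# A small hypothetical dataset for illustration
df = pd.DataFrame({
    'Name': ['Asha', 'Ravi', 'Meena'],
    'Age': [21, 22, 20],
    'Marks': [88, 75, 92],
})

# iloc selects by integer position: all rows, columns at positions 1 and 2
print(df.iloc[:, 1:3])

# loc selects by label: all rows, columns 'Name' through 'Age' (inclusive)
print(df.loc[:, 'Name':'Age'])

# DataFrame -> NumPy array and back
arr = df[['Age', 'Marks']].to_numpy()
print(arr.shape)   # (3, 2)
df2 = pd.DataFrame(arr, columns=['Age', 'Marks'])
print(df2.equals(df[['Age', 'Marks']]))   # True
```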
5. What are the advantages of Wrapper methods over filter methods for feature selection?
Filter methods might fail to find the best subset of features in many occasions but wrapper methods
can always provide the best subset of features. Using the subset of features from the wrapper
methods make the model more prone to overfitting as compared to using subset of features from the
filter methods.
Embedded methods combine the qualities of filter and wrapper methods. It’s implemented by
algorithms that have their own built-in feature selection methods.
EXPERIMENT NO. 2
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
To understand the importance of data pre-processing techniques like handle missing values, duplicate values,
feature scaling etc.
Outcome:
Students will be familiarized with the understanding and importance of applying various data pre-processing
techniques.
Problem Statement:
Write a program to perform data pre-processing techniques for effective machine learning.
Background Study:
Data preprocessing in Machine Learning is a crucial step that helps enhance the quality of data to promote
the extraction of meaningful insights from the data. Data preprocessing in Machine Learning refers to the
technique of preparing (cleaning and organizing) the raw data to make it suitable for building
and training Machine Learning models.
Question Bank:
1. What are different ways to handle missing values both for numerical as well as categorical data?
Delete the observations,
Replace missing values with the most frequent value and
Develop a model to predict missing values
2. What is the function in Python used for finding duplicate rows in data?
DataFrame.duplicated()
print(dataset.isna().any())
dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
df[dups]
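A short sketch of the missing-value and duplicate handling described above, on a small hypothetical dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values and a duplicate row
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', np.nan],
    'Temp': [30.0, np.nan, 30.0, 28.0],
})

# Detect missing values column-wise
print(df.isna().any())

# Fill numeric NaNs with the mean, categorical NaNs with the most frequent value
df['Temp'] = df['Temp'].fillna(df['Temp'].mean())
df['City'] = df['City'].fillna(df['City'].mode()[0])

# Find and drop duplicate rows
dups = df.duplicated()
print('Number of duplicate rows = %d' % dups.sum())   # 1
df = df.drop_duplicates()
print(len(df))   # 3
```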
EXPERIMENT NO. 3
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
To perform different encoding schemes, prepare dataset by converting categorical data to numeric form for
machine learning and understand and implement label encoding and one hot encoding.
Outcome:
Students will be able to understand different encoding schemes to prepare data for machine learning.
Problem Statement:
To apply different feature encoding schemes on the given dataset.
Background Study:
Machine learning models require all input and output variables to be numeric. This means that if your data
contains categorical data, you must encode it to numbers before you can fit and evaluate a model. The two
most popular techniques are Label Encoding and One-Hot Encoding.
Question Bank:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
gender_encoded = le.fit_transform(df['Gender'])
print(gender_encoded)
df['encoded_gender'] = gender_encoded
print(df)
encoded_position = le.fit_transform(df['Position'])
df['encoded_position'] = encoded_position
print(df)
from sklearn.preprocessing import OneHotEncoder
gender_encoded = le.fit_transform(df['Gender'])
gender_encoded = gender_encoded.reshape(len(gender_encoded), -1)
one = OneHotEncoder(sparse=False)
print(one.fit_transform(gender_encoded))
pos = pd.get_dummies(df['Position'], drop_first = True)
print(pos)
EXPERIMENT NO. 4
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
To understand the importance of feature selection, differentiate between different types of feature selection
and build a model using feature selection techniques.
Outcome:
Students will be familiarized with model building using feature selection techniques and optimization.
Problem Statement:
Write a program to apply filter feature selection techniques.
Background Study:
Feature selection is the process of reducing the number of input variables when developing a predictive
model. It is desirable to reduce the number of input variables to both reduce the computational cost of
Question Bank:
2. How do feature selection techniques depend on the data types of the input features and output variable?
The Pearson correlation method is the most common method for numerical variables; it assigns
a value between −1 and 1, where 0 is no correlation, 1 is total positive correlation, and −1
is total negative correlation. This is interpreted as follows: a correlation value of 0.7
between two variables would indicate that a significant and positive relationship exists
between the two. A positive correlation signifies that if variable A goes up, then B will
also go up, whereas if the correlation is negative, then if A increases, B decreases.
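The interpretation above can be checked on a toy DataFrame (the column names are illustrative):

```python
import pandas as pd

# Hypothetical numeric features to illustrate Pearson correlation
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0],
    'B': [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly positively correlated with A
    'C': [5.0, 4.0, 3.0, 2.0, 1.0],    # perfectly negatively correlated with A
})

corr = df.corr(method='pearson')
print(corr)
print(corr.loc['A', 'B'])   # 1.0
print(corr.loc['A', 'C'])   # -1.0
```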
# The scores are assumed to come from a fitted SelectKBest (e.g. chi2) selector
from sklearn.feature_selection import SelectKBest, chi2
fit = SelectKBest(score_func=chi2, k=10).fit(X, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
# concat the two dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']
featureScores
Feature Importance
print(model.feature_importances_)
import matplotlib.pyplot as plt
import seaborn as sns
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20, 20))
g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap="RdYlGn")
EXPERIMENT NO. 5
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
Study dimensionality reduction and understand the basic principle behind Principal Component Analysis.
Outcome:
Students will be familiarized with Dimensionality Reduction especially Principal Component Analysis.
Problem Statement:
Reduce dimensionality of Iris dataset using Principal Component Analysis.
Background Study:
Principal component analysis is a statistical technique that is used to analyze the interrelationships among a
large number of variables and to explain these variables in terms of a smaller number of variables, called
principal components, with a minimum loss of information.
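A minimal PCA sketch, here applied to the Iris dataset named in the problem statement (using scikit-learn, with standardisation before projection):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load Iris (150 samples, 4 features) and standardise the features
X = load_iris().data
X_std = StandardScaler().fit_transform(X)

# Project onto the first 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X.shape, '->', X_pca.shape)      # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)   # fraction of variance kept by each PC
print(pca.explained_variance_ratio_.sum())
```

Note that the two retained components capture most of the variance of the original four features, which is the sense in which information loss is minimised.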
Question Bank:
from keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print('Training data shape:', x_train.shape)
print('Testing data shape:', x_test.shape)
y_train.shape, y_test.shape
import numpy as np
import matplotlib.pyplot as plt
# Find the unique numbers from the train labels
classes = np.unique(y_train)
nClasses = len(classes)
print('Total number of outputs : ', nClasses)
print('Output classes : ', classes)
plt.subplot(122)
curr_img = np.reshape(x_test[0], (32, 32, 3))
plt.imshow(curr_img)
# label_dict is assumed to map class indices to class names
print(plt.title("(Label: " + str(label_dict[y_test[0][0]]) + ")"))
from sklearn.preprocessing import StandardScaler
# breast_dataset and its feature-column list `features` are assumed to be loaded earlier
x = breast_dataset.loc[:, features].values
x = StandardScaler().fit_transform(x)  # standardizing the features
x.shape
np.mean(x), np.std(x)
feat_cols = ['feature' + str(i) for i in range(x.shape[1])]
normalised_breast = pd.DataFrame(x, columns=feat_cols)
normalised_breast.tail()
from sklearn.decomposition import PCA
pca_breast = PCA(n_components=2)
principalComponents_breast = pca_breast.fit_transform(x)
principal_breast_Df = pd.DataFrame(principalComponents_breast,
                                   columns=['principal component 1', 'principal component 2'])
principal_breast_Df.tail()
print('Explained variation per principal component: {}'.format(pca_breast.explained_variance_ratio_))
plt.figure(figsize=(10, 10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of Breast Cancer Dataset",fontsize=20)
targets = ['Benign', 'Malignant']
colors = ['r', 'g']
for target, color in zip(targets, colors):
    indicesToKeep = breast_dataset['label'] == target
    plt.scatter(principal_breast_Df.loc[indicesToKeep, 'principal component 1'],
                principal_breast_Df.loc[indicesToKeep, 'principal component 2'], c=color, s=50)
plt.legend(targets,prop={'size': 15})
EXPERIMENT NO. 6
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
Understand Simple Linear Regression and Study about the different performance metrics of SLR.
Outcome:
Student will be familiarized with regression problems and SLR as a solution to single feature problem.
Problem Statement:
To apply Simple Linear Regression on the given dataset.
Background Study:
Simple linear regression is an approach for predicting a response using a single feature. It is assumed that the
two variables are linearly related. Hence, we try to find a linear function that predicts the response value (y)
as accurately as possible as a function of the feature or independent variable (x).
Question Bank:
2. How does Simple Linear Regression (SLR) help in solving regression problems containing one
input feature and an output variable?
Simple linear regression is used to model the relationship between two continuous variables. Often,
the objective is to predict the value of an output variable (or response) based on the value of an input
(or predictor) variable.
3. What are the different performance metrics that can be used for evaluating SLR?
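Commonly used metrics for evaluating SLR include MAE, MSE, RMSE and R². A minimal sketch on hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical single-feature data with a roughly linear trend
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

print('MAE :', mean_absolute_error(y, y_pred))
print('MSE :', mean_squared_error(y, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y, y_pred)))
print('R^2 :', r2_score(y, y_pred))
```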
EXPERIMENT NO. 7
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
Understand the mathematics behind Multiple Linear Regression and solve linear regression
problems containing more than one independent feature using MLR.
Outcome:
Students will be familiarized with Multiple Linear Regression for solving linear regression problems.
Problem Statement:
To apply multiple linear regression on any regression dataset.
Background Study:
Multiple Linear Regression attempts to model the relationship between two or more features and a response
by fitting a linear equation to observed data.
Question Bank:
1. What is MLR?
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical
technique that uses several explanatory variables to predict the outcome of a response variable.
Multiple regression is an extension of linear (OLS) regression, which uses just one explanatory
variable.
from sklearn.metrics import r2_score
score = r2_score(y_test, y_pred)
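A self-contained MLR sketch on synthetic two-feature data (the data-generating coefficients below are assumptions for illustration, not the lab dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic data: y depends linearly on two features plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
mlr = LinearRegression().fit(X_train, y_train)
y_pred = mlr.predict(X_test)

print('Coefficients:', mlr.coef_)     # close to [3, -2]
print('Intercept   :', mlr.intercept_)  # close to 5
print('R^2         :', r2_score(y_test, y_pred))
```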
EXPERIMENT NO. 8
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
Study and understand about Polynomial Regression on non-linear regression data and the mathematics
behind Polynomial Regression.
Outcome:
Students will be familiarized with handling of regression data having non-linear relationship between input
and output.
Problem Statement:
To apply Polynomial Linear Regression on the given dataset.
Background Study:
Polynomial Regression is a form of linear regression in which the relationship between the independent
variable x and dependent variable y is modeled as an nth degree polynomial. Polynomial regression fits a
nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x).
Question Bank:
# X_train_poly / X_test_poly are assumed to come from a fitted PolynomialFeatures transformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
X_train_poly
lr = LinearRegression()
lr.fit(X_train_poly, y_train)
lr.score(X_test_poly, y_test)
lr.predict([X_test_poly[0, :]])
y_pred = lr.predict(X_test_poly)
y_pred
y_test
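A self-contained version of the polynomial-regression flow above, on hypothetical quadratic data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical non-linear data: y is quadratic in x, with small noise
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.05, size=200)

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Expand the feature to [1, x, x^2], then fit an ordinary linear model on it
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

lr = LinearRegression().fit(X_train_poly, y_train)
print('R^2 on test set:', lr.score(X_test_poly, y_test))
```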
EXPERIMENT NO. 9
Semester / Section: 5th
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
Study Logistic Regression and how it is used to solve classification problems.
Outcome:
Students will be familiarized with Logistic Regression and performance metrics to calculate its performance
on the given dataset.
Problem Statement:
To solve classification problems using Logistic Regression.
Background Study:
Logistic regression is a classification technique which helps to predict the probability of an outcome that can
only have two values. Logistic Regression is used when the dependent variable (target) is categorical. A
logistic regression produces a logistic curve, which is limited to values between 0 and 1.
Question Bank:
1. Why is the sigmoid function used in Logistic Regression?
The main reason we use the sigmoid function is that its output always lies between 0 and 1.
It is therefore especially suited to models where we need to predict a probability as the
output: since the probability of anything lies only in the range 0 to 1, the sigmoid is the
right choice.
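The sigmoid's (0, 1) range can be checked directly:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
print(sigmoid(np.array([-2.0, 0.0, 2.0])))
```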
from sklearn.linear_model import LogisticRegression
# instantiate the model (using the default parameters)
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X_train,y_train)
y_pred=logreg.predict(X_test)
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
class_names = [0, 1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
EXPERIMENT NO. 10
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
Study K-Nearest Neighbour algorithm (KNN) and understand the working principle behind KNN.
Outcome:
Students will be familiarized with classification technique using KNN.
Problem Statement:
To solve classification problems using KNN classification.
Background Study:
K-nearest neighbours (KNN) algorithm is a type of supervised ML algorithm which can be used for both
classification as well as regression predictive problems. However, it is mainly used for classification
predictive problems in industry. The following two properties would define KNN well −
Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized
training phase and uses all the data for training while classifying.
Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it
doesn’t assume anything about the underlying data.
K-nearest neighbours algorithm uses ‘feature similarity’ to predict the values of new datapoints which
further means that the new data point will be assigned a value based on how closely it matches the points in
the training set. The most common parameter used to perform matching is Euclidean distance between the
points.
Question Bank:
3. What are the other distances that can be used for nearest neighbour?
Cosine similarity, Minkowski distance, correlation, and chi-square are other measures used
besides Euclidean distance.
4. What are the various performance metrics used for classification problems?
The most commonly used Performance metrics for classification problem are as follows, Accuracy,
Confusion Matrix, Precision, Recall, and F1 score.
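As a hedged sketch, the same classification can also be done with scikit-learn's KNeighborsClassifier (Iris is used here as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Iris as a stand-in classification dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# k = 3 nearest neighbours; Euclidean distance is the default metric
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
```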
# abalone is assumed to be a pandas DataFrame holding the Abalone dataset
correlation_matrix = abalone.corr()
correlation_matrix["Rings"]
import numpy as np
a = np.array([2, 2])
b = np.array([4, 4])
np.linalg.norm(a - b)  # Euclidean distance = sqrt(8)
X = abalone.drop("Rings", axis=1)
X = X.values
y = abalone["Rings"]
y = y.values
# new_data_point is assumed to be a query row with the same features as X
distances = np.linalg.norm(X - new_data_point, axis=1)
k = 3
nearest_neighbor_ids = distances.argsort()[:k]
nearest_neighbor_ids
nearest_neighbor_rings = y[nearest_neighbor_ids]
nearest_neighbor_rings
prediction = nearest_neighbor_rings.mean()
import scipy.stats
class_neighbors = np.array(["A", "B", "B", "C"])
scipy.stats.mode(class_neighbors)
EXPERIMENT NO. 11
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
Understand and study Naïve Bayes (NB) Classifier and Naïve Bayes theorem behind it.
Outcome:
Students will be familiarized with NB classification technique.
Problem Statement:
To solve classification problems using Naïve Bayes.
Background Study:
Naïve Bayes Classifier is a probabilistic classifier and is based on Bayes Theorem. In Machine
learning, a classification problem represents the selection of the Best Hypothesis given the data. Given
a new data point, we try to classify which class label this new data instance belongs to. The prior
knowledge about the past data helps us in classifying the new data point.
Question Bank:
3. What is the condition on features that should be fulfilled for successful application of Naïve
Bayes method?
Naïve Bayes assumes that the features are conditionally independent of one another given the
class label. The method works by estimating the conditional and prior probabilities
associated with the features and predicts the class with the highest posterior probability.
# classifier is assumed to be a fitted Naïve Bayes model, e.g. GaussianNB
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB().fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_pred
y_test
EXPERIMENT NO. 12
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?
usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
Understand and study Support Vector Machines (SVM), study how a separating hyperplane is
calculated to differentiate between two classes, and gain a basic understanding of the
different variants of SVM.
Outcome:
Students will be familiarized with Support Vector Machines classifier.
Problem Statement:
To solve classification problems using SVM.
Background Study:
In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning
models with associated learning algorithms that analyze data for classification and regression analysis.
Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues, SVMs are one of the most
robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik
(1982, 1995) and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one
of two categories, an SVM training algorithm builds a model that assigns new examples to one category or
the other, making it a non-probabilistic binary linear classifier. SVM maps training examples to points in
space so as to maximise the width of the gap between the two categories. New examples are then mapped
into that same space and predicted to belong to a category based on which side of the gap they fall.
Question Bank:
1. What is SVM?
SVM, or Support Vector Machine, is a linear model for classification and regression problems.
It can solve linear and non-linear problems and works well for many practical problems. The
idea of SVM is simple: the algorithm creates a line or a hyperplane which separates the data
into classes.
# Load the breast-cancer dataset bundled with scikit-learn
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
# print the cancer data features (top 5 records)
print(cancer.data[0:5])
# print the cancer labels (0:malignant, 1:benign)
print(cancer.target)
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109)  # 70% training and 30% test
#Import svm model
from sklearn import svm
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
# Model Precision: what percentage of positive tuples are labeled as such?
print("Precision:",metrics.precision_score(y_test, y_pred))
# Model Recall: what percentage of positive tuples are labelled as such?
print("Recall:",metrics.recall_score(y_test, y_pred))
EXPERIMENT NO. 13
Link to Code:
https://colab.research.google.com/drive/19oX6X31tFmig_UMTlyndmd3yyDJr5WzJ?usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
Understand and study Decision Trees for classification problems and Study about the information gain used
to create decision trees.
Outcome:
Students will be familiarized with creation of decision trees.
Problem Statement:
Apply Decision Tree classifier for solving classification problems.
Background Study:
Decision tree analysis is a predictive modelling tool that can be applied across many areas. Decision trees
can be constructed by an algorithmic approach that splits the dataset in different ways based on different
conditions. Decision trees are among the most powerful algorithms that fall under the category of supervised
algorithms. They can be used for both classification and regression tasks. The two main entities of a tree are
decision nodes, where the data is split, and leaves, where we get the outcomes.
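The information gain mentioned in the objective can be computed by hand; a small sketch with a hypothetical 50/50 parent node and one candidate split:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical parent node: 10 positive and 10 negative examples
parent = entropy([10, 10])   # 1.0 bit

# A candidate split producing two children: (8 pos, 2 neg) and (2 pos, 8 neg)
left, right = entropy([8, 2]), entropy([2, 8])
weighted = (10 / 20) * left + (10 / 20) * right

# Information gain = parent entropy minus the weighted child entropy
info_gain = parent - weighted
print('Parent entropy  :', parent)
print('Information gain:', info_gain)
```

The split with the highest information gain is chosen at each decision node.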
Question Bank:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
clf = DecisionTreeClassifier()
# Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
from sklearn.tree import export_graphviz
from six import StringIO
from IPython.display import Image
import pydotplus
dot_data = StringIO()
# feature_cols is assumed to list the feature names of the dataset
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
# Create Decision Tree classifer object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
# Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
from six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
EXPERIMENT NO. 14
Link to Code:
https://colab.research.google.com/drive/16Rea0XO_qpeoO9wKi2MJ8MNj0zPI8nP_?usp=sharing
Date:
Faculty Signature:
Grade:
Objective(s):
Build a ML model from scratch using data-preprocessing and classification algorithms and calculating
various performance metrics.
Train-Test Split
X = df.drop(columns=['Loan_Status'], axis=1)
y = df['Loan_Status']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=42, stratify=y)
print(X_train.shape, X_test.shape, y_test.shape, y_train.shape)
Model Training
# classify function
import numpy as np
from sklearn.model_selection import cross_val_score

def classify(model, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                        random_state=42)
    model.fit(X_train, y_train)
    print("Accuracy is", model.score(X_test, y_test) * 100)
    # cross validation - it is used for better validation of the model
    score = cross_val_score(model, X, y, cv=5)
    print("Cross validation is", np.mean(score) * 100)
    predictions = model.predict(X_test)
    return predictions
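A self-contained sketch of how such a classify helper can be used (make_classification is a stand-in for the loan dataset used in the lab):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

def classify(model, X, y):
    """Fit on a 75/25 split, report test accuracy and 5-fold CV accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)
    model.fit(X_train, y_train)
    print("Accuracy is", model.score(X_test, y_test) * 100)
    # cross validation - it is used for better validation of the model
    score = cross_val_score(model, X, y, cv=5)
    print("Cross validation is", np.mean(score) * 100)
    return model.predict(X_test)

# Synthetic stand-in data: 200 samples, 5 features, 2 classes
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
predictions = classify(LogisticRegression(max_iter=1000), X, y)
print(predictions.shape)   # (50,)
```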
Confusion Matrix