INTRODUCTION TO PYTHON
B. Tech
In
Computer Science and Engineering
Sub Code – RCS-753
DECLARATION
VISION OF THE INSTITUTE
We wish to serve the nation by becoming a reputed deemed university providing value-based professional education.
1. To provide quality education in both the theoretical and applied foundations of Computer
Science and train students to effectively apply this education to solve real world problems.
2. To amplify students' potential for lifelong, high-quality careers and give them a competitive advantage in the challenging global work environment.
PEO 2: Employable: To develop the ability among students to synthesize data and
technical concepts for application to software product design for successful careers
that meet the needs of Indian and multinational companies.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modelling to complex engineering activities with an understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the
consequent responsibilities relevant to the professional engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with
the engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
PO12: Life-long learning: Recognize the need for, and have the preparation and
ability to engage in independent and life-long learning in the broadest context of
technological change.
PSO1: The ability to use standard practices and suitable programming environment to develop
software solutions.
PSO2: The ability to employ latest computer languages and platforms in creating innovative
career opportunities.
CO1 Relate to the 'real' working environment and get acquainted with the organization structure,
business operations and administrative functions.
CO2 Practice hands-on experience in the computer related fields so that they can relate and reinforce
what has been taught.
CO3 Develop synergetic collaboration with industry and the university in promoting a
knowledgeable society.
CO4 Set up the stage for future recruitment by potential employers.
CO    PO1  PO2  PO3  PO4  PO5   PO6  PO7  PO8  PO9   PO10  PO11  PO12  PSO1  PSO2
CO1    3    3    2    1    1     1    1    1    2     1     2     2     3     2
CO2    3    3    2    1    2     2    1    2    3     3
CO3    3    3    1    1    2     1    1    2    1     2     2     1
CO4    3    3    1    1    2     1    1    1    2     2     1
CO     3    3   1.5   1   1.75   1    1    1   1.75   1    1.25   2    2.5   1.75
Certificate By Company
Table of Contents
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
2.1 Planning
2.2 Data Preparation
2.3 Data Pre-processing
2.4 Model Presentation
4.1 NumPy
4.2 Pandas
4.3 SciKit-Learn
4.4 Matplotlib
4.5 Seaborn
5.1 Summary
5.1.1 Importing Library
LIST OF FIGURES
1. Linear Regression
2. Flowchart of regression
3. Logistic Regression
4. Clustering technique
5. K-nearest neighbours (KNN) algorithm
1) INTRODUCTION
Machine learning is a sub-domain of computer science which evolved from the study of
pattern recognition in data, and also from the computational learning theory in artificial
intelligence. It is a field of computer science that gives computers the ability to learn without
being explicitly programmed.
“A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if its performance on T, as measured by P, improves with
experience E.”
It is a first-class ticket to some of the most interesting careers in data analytics today. As data sources
proliferate along with the computing power to process them, going straight to the data is one
of the most straightforward ways to quickly gain insights and make predictions. Machine
Learning can be thought of as the study of a list of sub-problems, viz: decision making,
clustering, classification, forecasting, deep-learning, inductive logic programming, support
vector machines, reinforcement learning, similarity and metric learning, genetic algorithms,
sparse dictionary learning, etc.
Deep Learning - Deep learning is an artificial intelligence function that imitates the
workings of the human brain in processing data and creating patterns for use in decision
making. Deep learning is a subset of machine learning in artificial intelligence (AI) that has
networks capable of learning unsupervised from data that is unstructured or unlabeled. It is
also known as deep neural learning or a deep neural network.
Supervised Learning - We know what we are trying to predict. We use some examples that
we (and the model) know the answer to, to “train” our model. It can then generate predictions
for examples we don’t know the answer to.
Examples: Predict the price a house will sell at. Identify the gender of someone based on a
photograph.
Unsupervised Learning- We don’t know what we are trying to predict. We are trying to
identify some naturally occurring patterns in the data which may be informative.
Machine Learning - As defined in the introduction, it is the field of computer science that
gives computers the ability to learn without being explicitly programmed.
How to plan for Machine Learning?
The first step in this process is setting and defining the goal. The main purpose is to make sure all the stakeholders understand the what, how and why of the project, which results in the development of a project charter.
The second phase is data retrieval. You want to get the data for analysis from various sources, in any form, structured or unstructured. This step therefore covers finding the right and suitable data that can be used to address the business objective. Getting the required access to the data from the business owners is the next crucial step.
Data that comes in raw form usually requires cleansing, transformation and fitting to shape it into data that is more consumable for our ML algorithms.
The most critical part is data modelling, often referred to as model building. This is where we try to gain the insights or prediction results that are stated in our business charter.
The final step of the data science process is presenting the model, the results it achieved, the predictions it made and the insights gained. These depend mainly on the business objective or goal defined in the charter. This is also the stage where we influence the business in setting its future goals and objectives.
3) AI should not be used to transgress the data rights and privacy of individuals, families, or
communities.
5) AI should not be used for criminal intent, nor to subvert the values of our democracy, nor
truth, nor courtesy in public discourse.
6) The primary purpose of AI should be to enhance and augment, rather than replace,
human labour and creativity.
7) All citizens have the right to be adequately educated to flourish mentally, emotionally,
and economically in a digital and artificially intelligent world.
9) The autonomous power to hurt or destroy should never be vested in artificial intelligence.
10) Governments should ensure that the best research and application of AI is directed
toward the most urgent problems facing humanity.
CHAPTER - 2
● Planning
● Data Preparation
● Modelling
● Model Presentation
2.1) Planning
The planning phase is the most important phase of the machine learning methodology. You can never win a war if your planning is poor. The purpose of planning is to accumulate important information and facts about the selected target; this information can then be applied on the ground to reach the necessary position and to obtain the important data.
Defining Goals - The journey always begins with the primary act of questioning: what are we doing, why are we doing it, and how are we doing it?
1. Data Retrieval
2. Data Cleansing
3. Data Exploration and Refining
Data Retrieval - Extracting and retrieving the dataset on which we are going to work and create the model.
Data Cleansing - Cleansing the data retrieved in the previous phase is known as data cleansing. Common techniques include the following (a short sketch follows the list):
1. Manual overrules
2. Use string functions
3. Replace with another value
4. Treat as missing value
5. Omit the values
6. Set value to null
7. Recalculate for the same unit
8. Bring to the same level of aggregation
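A minimal sketch of a few of these techniques in pandas; the DataFrame and all of its column names are invented for illustration:
import numpy as np
import pandas as pd

# hypothetical raw data with typical errors
df = pd.DataFrame({
    'city':   ['  Agra', 'delhi ', 'Agra', 'Delhi'],
    'age':    [25.0, -3.0, 40.0, 31.0],   # -3 is an impossible value
    'weight': [70.0, 143.0, 65.0, 154.0],
    'unit':   ['kg', 'lb', 'kg', 'lb'],
})

# use string functions: strip whitespace and normalize case
df['city'] = df['city'].str.strip().str.title()

# treat an impossible value as missing (replace with another value / set to null)
df.loc[df['age'] < 0, 'age'] = np.nan

# omit the rows whose key fields are missing
df = df.dropna(subset=['age'])

# recalculate to the same unit: convert pounds to kilograms
lb = df['unit'] == 'lb'
df.loc[lb, 'weight'] = df.loc[lb, 'weight'] * 0.4536
df.loc[lb, 'unit'] = 'kg'
print(df)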
This phase takes a deep dive into understanding the data. By using various graphical techniques and statistical techniques to gain information, the data scientist tries to understand whether the data is normally distributed or not. Correlation functions are used to learn how the variables are related to each other. The visualization techniques we use can be as simple as a histogram or as complex as a pairplot; a small sketch follows.
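A minimal exploration sketch, assuming a hypothetical file data.csv with numeric columns:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')  # hypothetical dataset

print(df.describe())          # summary statistics per column
print(df.corr())              # pairwise correlations between the variables

df.hist(bins=20)              # one histogram per numeric column
plt.tight_layout()
plt.show()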
1. Model Creation
2. Model Validation
3. Model Evaluation
Model Creation - With clean data and an understanding of the content, we can build models with the goal of making better predictions, classifying objects, or gaining an understanding of the system. This phase is the most important part of the cycle, because a better selection of the algorithm and the model leads to better accuracy of the model as well as of its predictions.
Choosing an algorithm and model that suit the dataset is therefore the main driver of accuracy, although even a good choice does not by itself guarantee it.
Model Validation - Common validation statistics include the following (a sketch of computing a few of them follows the list):
1. Adjusted R square
2. Standard Error
3. P-Value
4. Z-value
5. MAPE- Mean absolute percentage error
6. MSE- mean squared error
7. RMSE - Root mean square error
8. MASE - Mean absolute scaled error
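A short sketch of computing a few of these error measures with scikit-learn and NumPy, on made-up true and predicted values:
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 195.0, 270.0])

mse  = mean_squared_error(y_true, y_pred)           # MSE: mean squared error
rmse = np.sqrt(mse)                                 # RMSE: root mean square error
mape = np.mean(np.abs((y_true - y_pred) / y_true))  # MAPE: mean absolute percentage error
print(mse, rmse, mape)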
Model Evaluation-
Hold-out method - The data is split once into a training set and a held-out test set; the model is trained on the former and its performance is estimated on the latter.
Cross-validation method - When only a limited amount of data is available, we use k-fold cross-validation to achieve an unbiased estimate of the model performance. We divide the data into k subsets of equal size and build models k times, each time leaving out one of the subsets from training and using it as the test set.
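A minimal k-fold cross-validation sketch with scikit-learn, on synthetic classification data:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# k = 5: train five times, each time holding out a different fifth of the data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores, scores.mean())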
1. Preparation of the presentation, which showcases the outcome of the analysis or the model we developed and how well it addresses the business problem
2. Deployment of the model based on the preference
3. Scaling the model based on the issues that emerge
CHAPTER - 3
3.3) Logistic Regression - Like all regression analyses, logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
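A minimal sketch of fitting a logistic regression on a toy binary outcome; the numbers are invented:
import numpy as np
from sklearn.linear_model import LogisticRegression

# hours studied (independent) vs. pass (1) / fail (0) (dependent binary variable)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))  # probability of fail and pass for 3.5 hours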
3.4) Clustering Technique - Clustering is a machine learning technique that involves the grouping of data points. In theory, data points in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. K-means is the most widely used algorithm for this technique.
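A minimal K-means sketch on toy 2-D points that fall into two obvious groups:
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # coordinates of the two centroids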
3.5) K-Nearest Neighbours (KNN) Algorithm -
The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near to each other. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with mathematics, by calculating the distance between points on a graph.
There are other ways of calculating distance, and one way might be preferable depending on
the problem we are solving. However, the straight-line distance (also called the Euclidean
distance) is a popular and familiar choice.
Figure: Classification of a bird using KNN
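A minimal KNN sketch in the spirit of the bird-classification figure; the (wing span, weight) measurements and species labels are invented, and Euclidean distance is used:
from sklearn.neighbors import KNeighborsClassifier

# hypothetical (wing span in cm, weight in g) measurements
X = [[20, 150], [22, 160], [24, 170], [55, 900], [60, 1000], [58, 950]]
y = ['sparrow', 'sparrow', 'sparrow', 'hawk', 'hawk', 'hawk']

knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X, y)
print(knn.predict([[57, 920]]))  # its 3 nearest neighbours are all hawks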
3.6) Naive Bayes -
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels
to problem instances, represented as vectors of feature values, where the class labels are
drawn from some finite set. There is not a single algorithm for training such classifiers, but a
family of algorithms based on a common principle: all naive Bayes classifiers assume that the
value of a particular feature is independent of the value of any other feature, given the class
variable. For example, a fruit may be considered to be an apple if it is red, round, and about
10 cm in diameter. A naive Bayes classifier considers each of these features to contribute
independently to the probability that this fruit is an apple, regardless of any
possible correlations between the color, roundness, and diameter features.
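A minimal Gaussian naive Bayes sketch using the apple example above; the redness/roundness/diameter values and the second class are invented:
import numpy as np
from sklearn.naive_bayes import GaussianNB

# hypothetical features: [redness 0-1, roundness 0-1, diameter in cm]
X = np.array([[0.9, 0.95, 9.5], [0.8, 0.90, 10.2],    # apples
              [0.2, 0.40, 20.0], [0.3, 0.50, 18.5]])  # watermelons
y = np.array(['apple', 'apple', 'watermelon', 'watermelon'])

nb = GaussianNB().fit(X, y)
print(nb.predict([[0.85, 0.92, 10.0]]))  # each feature contributes independently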
CHAPTER - 4
4.1) NumPy:
● Introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
● Provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
● Many other Python libraries are built on NumPy
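A small sketch of what these bullets mean in practice:
import numpy as np

a = np.arange(1_000_000, dtype=float)
b = np.sqrt(a) + 2 * a          # vectorized: no explicit Python loop

m = np.arange(9).reshape(3, 3)  # a 3x3 matrix
print(m.mean(axis=0))           # column means
print(m @ m.T)                  # matrix product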
4.2) Pandas:
● Adds data structures and tools designed to work with table-like data (similar to data frames in R)
● Provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation
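A small sketch of pandas' table-like data handling, on an invented table:
import pandas as pd

df = pd.DataFrame({'city':  ['Agra', 'Agra', 'Pune', 'Pune'],
                   'year':  [2020, 2021, 2020, 2021],
                   'sales': [10, 12, 7, 9]})

print(df.sort_values('sales'))                                 # sorting
print(df.groupby('city')['sales'].sum())                       # aggregation
print(df.pivot(index='city', columns='year', values='sales'))  # reshaping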
4.3) SciKit-Learn:
● Provides machine learning algorithms: classification, regression, clustering, model validation and model selection
● Built on NumPy, SciPy and matplotlib
4.4) Matplotlib:
● A set of functionalities similar to those of MATLAB: line plots, scatter plots, bar charts, histograms, pie charts, etc.
● Relatively low-level; some effort is needed to create advanced visualizations
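A small sketch of the MATLAB-style plotting interface:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='sin(x)')                      # line plot
plt.scatter(x[::10], np.cos(x[::10]), label='cos samples')  # scatter plot
plt.legend()
plt.xlabel('x')
plt.show()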
4.5) Seaborn:
● Similar (in style) to the popular ggplot2 library in R
● Built on top of matplotlib, providing a higher-level interface for statistical graphics
CHAPTER - 5
DESCRIPTION
Predict the onset of diabetes based on diagnostic measures.
SUMMARY
This dataset is originally from the National Institute of Diabetes and Digestive and
Kidney Diseases. The objective is to predict based on diagnostic measurements whether a
patient has diabetes.
Several constraints were placed on the selection of these instances from a larger database.
In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Some values are not in the range where they are supposed to be; these should be treated as missing values. What kind of method is best for filling in this type of missing value, and what will the subsequent classification look like?
In the dataset, there are 767 example vectors. Expert Systems have been used in the field of
medical science to assist the doctors in making certain diagnoses, and this can help save lives.
1. Importing Library-
import numpy as np               # numerical arrays and math
import pandas as pd              # tabular data handling
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # statistical visualization
# Jupyter magic to render plots inline (only valid inside a notebook)
%matplotlib inline
import math
2. Importing the Dataset-
data = pd.read_csv('diabetes.csv')  # load the diabetes data
data.head()                         # inspect the first five rows
X = data.iloc[:, 0:8].values        # the eight feature columns
y = data.iloc[:, 8].values          # the Outcome label column
First we perform a significance analysis of the 8 feature vectors, to see which vectors are more significant in representing the classes, and we find the correlation between all the vectors.
Selected attributes: 1,2,3,4,5,6,7,8 : 9. Here we can see that nearly all factors are important after we do the PCA. Only the last feature was deemed unworthy by the PCA implementation, which made little sense to us, as age is highly correlated with most diseases. We furthered our investigation by using another attribute selector, the Significance Attribute Evaluator.
Then we used a pairplot to understand which vectors are more significant and important than the others; the outcome is shown in the pairplot figure (a sketch of producing it follows). Hence we knew we could use all the vectors to compute the prediction.
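A sketch of how the pairplot and a correlation heatmap can be produced with seaborn, assuming the same diabetes.csv file loaded earlier:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('diabetes.csv')

# pairwise scatter plots of all features, colored by the Outcome label
sns.pairplot(data, hue='Outcome')
plt.show()

# correlation heatmap of all the vectors
sns.heatmap(data.corr(), annot=True)
plt.show()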
3. Checking for Missing Values-
data.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
4. Splitting and Standardizing the Data-
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Then we find that the data is not in standard form, so we standardize it using the sklearn library.
# fit the scaler on the training set only, then apply the same scaling to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
5. Selection of Algorithm-
Since this is a classification problem, we use logistic regression to solve it.
from sklearn.linear_model import LogisticRegression

# create the classifier object, fit the model on the training data,
# and predict the outcomes of the test set
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
We create an object of the logistic regression class and then build the model with its help.
6. Model Evaluation-
Confusion Matrix-
As we can see, the precision and accuracy are good. Hence we can say that our model is correctly classifying the available data.
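A sketch of how the confusion matrix and the related scores can be computed, continuing from the y_test and y_pred variables defined above:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

print(confusion_matrix(y_test, y_pred))  # rows: actual class, columns: predicted class
print(accuracy_score(y_test, y_pred))    # fraction of correct predictions
print(precision_score(y_test, y_pred))   # correctness of positive (diabetic) predictions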
We conclude that the dataset is not a complete space: there are still other feature vectors missing from it. What we were attempting to generalize is a subspace of the actual input space, where the other dimensions are not known, and hence none of the classifiers were able to do better than 71.6%. In the future, if similar studies are conducted to generate the dataset used in this report, more feature vectors need to be calculated so that the classifiers can form a better idea of the problem at hand.