INTRODUCTION TO PYTHON
B. Tech
In
Computer Science and Engineering
Sub Code – RCS-753
DECLARATION
VISION OF THE INSTITUTE
We wish to serve the nation by becoming a reputed deemed university providing value-based professional education.
1. To provide quality education in both the theoretical and applied foundations of Computer
Science and train students to effectively apply this education to solve real world problems.
2. To amplify students' potential for lifelong, high-quality careers and give them a competitive advantage in the challenging global work environment.
PEO 2: Employable: To develop the ability among students to synthesize data and
technical concepts for application to software product design for successful careers
that meet the needs of Indian and multinational companies.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modelling to complex engineering activities with an understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the
consequent responsibilities relevant to the professional engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with
the engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
PO12: Life-long learning: Recognize the need for, and have the preparation and
ability to engage in independent and life-long learning in the broadest context of
technological change.
PSO1: The ability to use standard practices and suitable programming environment to develop
software solutions.
PSO2: The ability to employ latest computer languages and platforms in creating innovative
career opportunities.
CO1 Relate to the 'real' working environment and get acquainted with the organization structure,
business operations and administrative functions.
CO2 Practice hands-on experience in the computer related fields so that they can relate and reinforce
what has been taught.
CO3 Develop synergetic collaboration with industry and the university in promoting a
knowledgeable society.
CO4 Set up the stage for future recruitment by potential employers.
CO    PO1  PO2  PO3  PO4  PO5   PO6  PO7  PO8  PO9   PO10  PO11  PO12  PSO1  PSO2
CO1    3    3    2    1    1     1    1    1    2     1     2     2     3     2
CO2    3    3    2    1    2     2    1    2    3     3
CO3    3    3    1    1    2     1    1    2    1     2     2     1
CO4    3    3    1    1    2     1    1    1    2     2     1
CO     3    3   1.5   1   1.75   1    1    1   1.75   1    1.25   2    2.5   1.75
Certificate By Company
Table of Contents
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
2.1 Planning
2.2 Data Preparation
2.3 Data Pre-processing
2.4 Model Presentation
4.1 NumPy
4.2 Pandas
4.3 SciKit-Learn
4.4 Matplotlib
4.5 Seaborn
5.1 Summary
5.1.1 Importing Library
LIST OF FIGURES
1. Linear Regression
2. Flowchart of regression
3. Logistic Regression
4. Clustering technique
5. K-nearest neighbours (KNN) algorithm
1) INTRODUCTION
Machine learning is a sub-domain of computer science which evolved from the study of
pattern recognition in data, and also from the computational learning theory in artificial
intelligence. It is a field of computer science that gives computers the ability to learn without
being explicitly programmed.
“A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if its performance on T, as measured by P, improves with
experience E.”
It is a first-class ticket to some of the most interesting careers in data analytics today. As data sources
proliferate along with the computing power to process them, going straight to the data is one
of the most straightforward ways to quickly gain insights and make predictions. Machine
Learning can be thought of as the study of a list of sub-problems, viz: decision making,
clustering, classification, forecasting, deep-learning, inductive logic programming, support
vector machines, reinforcement learning, similarity and metric learning, genetic algorithms,
sparse dictionary learning, etc.
Deep Learning - Deep learning is an artificial intelligence function that imitates the
workings of the human brain in processing data and creating patterns for use in decision
making. Deep learning is a subset of machine learning in artificial intelligence (AI) that has
networks capable of learning unsupervised from data that is unstructured or unlabeled. It is
also known as deep neural learning or a deep neural network.
Supervised Learning - We know what we are trying to predict. We use some examples that
we (and the model) know the answer to, to “train” our model. It can then generate predictions
for examples we don’t know the answer to.
Examples: Predict the price a house will sell at. Identify the gender of someone based on a
photograph.
Unsupervised Learning- We don’t know what we are trying to predict. We are trying to
identify some naturally occurring patterns in the data which may be informative.
Machine Learning - As defined in the introduction, it is the field of computer science that
gives computers the ability to learn without being explicitly programmed.
How to plan for Machine Learning?
The first step in this process is setting and defining the goal. The main purpose is to make sure all the stakeholders understand the what, how and why of the project, which results in the development of a project charter.
The second phase is data retrieval. You want to get the data for analysis from various sources, in any form, structured or unstructured. This step therefore covers finding the right and suitable data that can be used to address the business objective. Getting the required access to the data from the business owners is the next crucial step.
Data that comes in raw form usually requires cleansing, transformation and fitting to shape it into data that is more consumable for our ML algorithms.
The most critical part is data modelling, often referred to as model building. This is where we try to gain the insights or prediction results that are stated in our business charter.
The final step of the data science process is presenting the model, the results it achieved, the predictions it made and the insights gained. These depend mainly on the business objective or goal defined in the charter. This is also the stage where we influence the business in setting its future goals and objectives.
3) AI should not be used to transgress the data rights and privacy of individuals, families, or
communities.
5) AI should not be used for criminal intent, nor to subvert the values of our democracy, nor
truth, nor courtesy in public discourse.
6) The primary purpose of AI should be to enhance and augment, rather than replace,
human labour and creativity.
7) All citizens have the right to be adequately educated to flourish mentally, emotionally,
and economically in a digital and artificially intelligent world.
9) The autonomous power to hurt or destroy should never be vested in artificial intelligence.
10) Governments should ensure that the best research and application of AI is directed
toward the most urgent problems facing humanity.
CHAPTER - 2
● Planning
● Data Preparation
● Modelling
● Model Presentation
2.1) Planning
The planning phase is the most important phase of the machine learning methodology. You can never win a war if your planning is poor. The purpose of planning is to accumulate important information and facts about the selected target; this information can then be applied on the ground to reach the necessary position and to obtain the important data.
Defining Goals - The journey always begins with the primary act of questioning: what are we doing, why are we doing it, and how are we doing it?
1. Data Retrieval
2. Data Cleansing
3. Data Exploration and Refining
Data Retrieval - Extracting and retrieving the dataset on which we are going to work and create the model.
Data Cleansing - Cleansing the data retrieved in the previous phase is known as data cleansing. Common techniques include the following (a short sketch follows the list):
1. Manual overrules
2. Use string functions
3. Replace with another value
4. Treat as missing value
5. Omit the values
6. Set value to null
7. Recalculate for the same unit
8. Bring to the same level of aggregation
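A minimal sketch of a few of these techniques in pandas; the DataFrame and all of its column names are invented for illustration:
import numpy as np
import pandas as pd

# hypothetical raw data with typical errors
df = pd.DataFrame({
    'city':   ['  Agra', 'delhi ', 'Agra', 'Delhi'],
    'age':    [25.0, -3.0, 40.0, 31.0],   # -3 is an impossible value
    'weight': [70.0, 143.0, 65.0, 154.0],
    'unit':   ['kg', 'lb', 'kg', 'lb'],
})

# use string functions: strip whitespace and normalize case
df['city'] = df['city'].str.strip().str.title()

# treat an impossible value as missing (replace with another value / set to null)
df.loc[df['age'] < 0, 'age'] = np.nan

# omit the rows whose key fields are missing
df = df.dropna(subset=['age'])

# recalculate to the same unit: convert pounds to kilograms
lb = df['unit'] == 'lb'
df.loc[lb, 'weight'] = df.loc[lb, 'weight'] * 0.4536
df.loc[lb, 'unit'] = 'kg'
print(df)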
This phase takes a deep dive into understanding the data. By using various graphical techniques and statistical techniques to gain information, the data scientist tries to understand whether the data is normally distributed or not. Correlation functions are used to learn how the variables are related to each other. The visualization techniques we use can be as simple as a histogram or as complex as a pairplot; a small sketch follows.
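A minimal exploration sketch, assuming a hypothetical file data.csv with numeric columns:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')  # hypothetical dataset

print(df.describe())          # summary statistics per column
print(df.corr())              # pairwise correlations between the variables

df.hist(bins=20)              # one histogram per numeric column
plt.tight_layout()
plt.show()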
1. Model Creation
2. Model Validation
3. Model Evaluation
Model Creation - With clean data and an understanding of the content, we can build models with the goal of making better predictions, classifying objects, or gaining an understanding of the system. This phase is the most important part of the cycle, because a better selection of the algorithm and the model leads to better accuracy of the model as well as of its predictions.
Choosing an algorithm and model that suit the dataset is therefore the main driver of accuracy, although even a good choice does not by itself guarantee it.
Model Validation - Common validation statistics include the following (a sketch of computing a few of them follows the list):
1. Adjusted R square
2. Standard Error
3. P-Value
4. Z-value
5. MAPE- Mean absolute percentage error
6. MSE- mean squared error
7. RMSE - Root mean square error
8. MASE - Mean absolute scaled error
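A short sketch of computing a few of these error measures with scikit-learn and NumPy, on made-up true and predicted values:
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 195.0, 270.0])

mse  = mean_squared_error(y_true, y_pred)           # MSE: mean squared error
rmse = np.sqrt(mse)                                 # RMSE: root mean square error
mape = np.mean(np.abs((y_true - y_pred) / y_true))  # MAPE: mean absolute percentage error
print(mse, rmse, mape)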
Model Evaluation-
Hold-out method - The data is split once into a training set and a held-out test set; the model is trained on the former and its performance is estimated on the latter.
Cross-validation method - When only a limited amount of data is available, we use k-fold cross-validation to achieve an unbiased estimate of the model performance. We divide the data into k subsets of equal size and build models k times, each time leaving out one of the subsets from training and using it as the test set.
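A minimal k-fold cross-validation sketch with scikit-learn, on synthetic classification data:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# k = 5: train five times, each time holding out a different fifth of the data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores, scores.mean())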
1. Preparation of the presentation, which showcases the outcome of the analysis or the model we developed and how well it addresses the business problem
2. Deployment of the model based on the preference
3. Scaling the model based on the issues that emerge
CHAPTER - 3
3.3) Logistic Regression - Like all regression analyses, logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
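A minimal sketch of fitting a logistic regression on a toy binary outcome; the numbers are invented:
import numpy as np
from sklearn.linear_model import LogisticRegression

# hours studied (independent) vs. pass (1) / fail (0) (dependent binary variable)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))  # probability of fail and pass for 3.5 hours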
3.4) Clustering Technique - Clustering is a machine learning technique that involves the grouping of data points. In theory, data points in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. K-means is the most widely used algorithm for this technique.
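A minimal K-means sketch on toy 2-D points that fall into two obvious groups:
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # coordinates of the two centroids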
3.5) K-Nearest Neighbours (KNN) Algorithm -
The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near to each other. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with mathematics, by calculating the distance between points on a graph.
There are other ways of calculating distance, and one way might be preferable depending on
the problem we are solving. However, the straight-line distance (also called the Euclidean
distance) is a popular and familiar choice.
Figure: Classification of a bird using KNN
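A minimal KNN sketch in the spirit of the bird-classification figure; the (wing span, weight) measurements and species labels are invented, and Euclidean distance is used:
from sklearn.neighbors import KNeighborsClassifier

# hypothetical (wing span in cm, weight in g) measurements
X = [[20, 150], [22, 160], [24, 170], [55, 900], [60, 1000], [58, 950]]
y = ['sparrow', 'sparrow', 'sparrow', 'hawk', 'hawk', 'hawk']

knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X, y)
print(knn.predict([[57, 920]]))  # its 3 nearest neighbours are all hawks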
3.6) Naive Bayes -
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels
to problem instances, represented as vectors of feature values, where the class labels are
drawn from some finite set. There is not a single algorithm for training such classifiers, but a
family of algorithms based on a common principle: all naive Bayes classifiers assume that the
value of a particular feature is independent of the value of any other feature, given the class
variable. For example, a fruit may be considered to be an apple if it is red, round, and about
10 cm in diameter. A naive Bayes classifier considers each of these features to contribute
independently to the probability that this fruit is an apple, regardless of any
possible correlations between the color, roundness, and diameter features.
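A minimal Gaussian naive Bayes sketch using the apple example above; the redness/roundness/diameter values and the second class are invented:
import numpy as np
from sklearn.naive_bayes import GaussianNB

# hypothetical features: [redness 0-1, roundness 0-1, diameter in cm]
X = np.array([[0.9, 0.95, 9.5], [0.8, 0.90, 10.2],    # apples
              [0.2, 0.40, 20.0], [0.3, 0.50, 18.5]])  # watermelons
y = np.array(['apple', 'apple', 'watermelon', 'watermelon'])

nb = GaussianNB().fit(X, y)
print(nb.predict([[0.85, 0.92, 10.0]]))  # each feature contributes independently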
CHAPTER - 4
4.1) NumPy:
● Introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
● Provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
● Many other Python libraries are built on NumPy
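A small sketch of what these bullets mean in practice:
import numpy as np

a = np.arange(1_000_000, dtype=float)
b = np.sqrt(a) + 2 * a          # vectorized: no explicit Python loop

m = np.arange(9).reshape(3, 3)  # a 3x3 matrix
print(m.mean(axis=0))           # column means
print(m @ m.T)                  # matrix product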
4.2) Pandas:
● Adds data structures and tools designed to work with table-like data (similar to data frames in R)
● Provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation
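A small sketch of pandas' table-like data handling, on an invented table:
import pandas as pd

df = pd.DataFrame({'city':  ['Agra', 'Agra', 'Pune', 'Pune'],
                   'year':  [2020, 2021, 2020, 2021],
                   'sales': [10, 12, 7, 9]})

print(df.sort_values('sales'))                                 # sorting
print(df.groupby('city')['sales'].sum())                       # aggregation
print(df.pivot(index='city', columns='year', values='sales'))  # reshaping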
4.3) SciKit-Learn:
● Provides machine learning algorithms: classification, regression, clustering, model validation and model selection
● Built on NumPy, SciPy and matplotlib
4.4) Matplotlib:
● A set of functionalities similar to those of MATLAB: line plots, scatter plots, bar charts, histograms, pie charts, etc.
● Relatively low-level; some effort is needed to create advanced visualizations
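A small sketch of the MATLAB-style plotting interface:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='sin(x)')                      # line plot
plt.scatter(x[::10], np.cos(x[::10]), label='cos samples')  # scatter plot
plt.legend()
plt.xlabel('x')
plt.show()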
4.5) Seaborn:
● Similar (in style) to the popular ggplot2 library in R
● Built on top of matplotlib, providing a higher-level interface for statistical graphics
CHAPTER - 5
DESCRIPTION
Predict the onset of diabetes based on diagnostic measures.
SUMMARY
This dataset is originally from the National Institute of Diabetes and Digestive and
Kidney Diseases. The objective is to predict based on diagnostic measurements whether a
patient has diabetes.
Several constraints were placed on the selection of these instances from a larger database.
In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Some values are not in the range where they are supposed to be; these should be treated as missing values. What kind of method is best for filling in this type of missing value, and what will the subsequent classification look like?
In the dataset, there are 767 example vectors. Expert Systems have been used in the field of
medical science to assist the doctors in making certain diagnoses, and this can help save lives.
1. Importing Library-
import numpy as np               # numerical arrays and math
import pandas as pd              # tabular data handling
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # statistical visualization
# Jupyter magic to render plots inline (only valid inside a notebook)
%matplotlib inline
import math
2. Importing the Dataset-
data = pd.read_csv('diabetes.csv')  # load the diabetes data
data.head()                         # inspect the first five rows
X = data.iloc[:, 0:8].values        # the eight feature columns
y = data.iloc[:, 8].values          # the Outcome label column
First we perform a significance analysis of the 8 feature vectors, to see which vectors are more significant in representing the classes, and we find the correlation between all the vectors.
Selected attributes: 1,2,3,4,5,6,7,8 : 9. Here we can see that nearly all factors are important after we do the PCA. Only the last feature was deemed unworthy by the PCA implementation, which made little sense to us, as age is highly correlated with most diseases. We furthered our investigation by using another attribute selector, the Significance Attribute Evaluator.
Then we used a pairplot to understand which vectors are more significant and important than the others; the outcome is shown in the pairplot figure (a sketch of producing it follows). Hence we knew we could use all the vectors to compute the prediction.
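A sketch of how the pairplot and a correlation heatmap can be produced with seaborn, assuming the same diabetes.csv file loaded earlier:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('diabetes.csv')

# pairwise scatter plots of all features, colored by the Outcome label
sns.pairplot(data, hue='Outcome')
plt.show()

# correlation heatmap of all the vectors
sns.heatmap(data.corr(), annot=True)
plt.show()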
3. Checking for Missing Values-
data.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
4. Splitting and Standardizing the Data-
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Then we find that the data is not in standard form, so we standardize it using the sklearn library.
# fit the scaler on the training set only, then apply the same scaling to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
5. Selection of Algorithm-
Since this is a classification problem, we use logistic regression to solve it.
from sklearn.linear_model import LogisticRegression

# create the classifier object, fit the model on the training data,
# and predict the outcomes of the test set
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
We create an object of the logistic regression class and then build the model with its help.
6. Model Evaluation-
Confusion Matrix-
As we can see, the precision and accuracy are good. Hence we can say that our model is correctly classifying the available data.
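A sketch of how the confusion matrix and the related scores can be computed, continuing from the y_test and y_pred variables defined above:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

print(confusion_matrix(y_test, y_pred))  # rows: actual class, columns: predicted class
print(accuracy_score(y_test, y_pred))    # fraction of correct predictions
print(precision_score(y_test, y_pred))   # correctness of positive (diabetic) predictions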
We conclude that the dataset is not a complete space: there are still other feature vectors missing from it. What we were attempting to generalize is a subspace of the actual input space, where the other dimensions are not known, and hence none of the classifiers were able to do better than 71.6%. In the future, if similar studies are conducted to generate the dataset used in this report, more feature vectors need to be calculated so that the classifiers can form a better idea of the problem at hand.