INTERNSHIP REPORT ON
Internship Associate
SUSHMA 2GB16EC026
2019 – 2020
Department of Electronics and Communication Engineering
Government Engineering College
Huvina Hadagali - 583219
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
BELAGAVI
CERTIFICATE
Certified that the Internship report entitled “PYTHON WITH MACHINE
LEARNING” is presented by Ms. SUSHMA (2GB16EC026) in partial
fulfillment for the award of the Degree of Bachelor of Engineering in Electronics
and Communication Engineering by the Visvesvaraya Technological
University, Belagavi, during the academic year 2019-20. The Internship report
has been approved as it satisfies the academic requirements in respect of the
Internship work prescribed for the said Degree.
……...……………………… ..……………………………
Signature of Internship Guide Signature of Internship Co–ordinator
Mr. PRADEEP A S Asst. Prof. and HOD Mr. PRADEEP A S Asst. Prof. and HOD
Dept. of ECE, GEC Huvina Hadagali Dept. of ECE, GEC Huvina Hadagali
……...……………………… ….……………………………
Signature of HOD Signature of Principal
Mr. PRADEEP A S Asst. Prof. and HOD Shri. Dr. SHASHIDHAR S RAMTHAL
Dept. of ECE, GEC Huvina Hadagali Principal, GEC Huvina Hadagali
External Viva
Name of the Examiners Signature with date
1…………………………… ………………………
2…………………………… ………………………
ACKNOWLEDGEMENT
I am presenting my Internship Report on “PYTHON WITH MACHINE
LEARNING”.
The satisfaction that accompanies the successful completion of any task would be
incomplete without mentioning the people who made it possible, and who are responsible
for the knowledge and experience gained during the course of the work.
I would like to express my humble feeling of thanks to one and all who have helped
me, directly or indirectly, in the successful completion of the Internship Report.
I am grateful to Government Engineering College Huvina Hadagali and the
Department of Electronics and Communication Engineering for imparting to me the
knowledge with which I can do my best.
SUSHMA
2GB16EC026
CONTENTS
COMPANY PROFILE……………………………………………………….i & ii
COMPANY OVERVIEW…………………………………………………………iii
CHAPTER 2 Introduction
2.1 General Introduction: What is Machine Learning? ................................................................ 5
2.2 Machine Learning Vs. Traditional Programming ................................................................... 5
2.3 How does Machine Learning Work? ..................................................................... 5
2.4 Why Machine Learning? ......................................................................................................... 6
2.5 Supervised Machine Learning ................................................................................................ 7
2.6 Unsupervised Machine Learning ..........................................................................................10
CHAPTER 3 Clustering
3.1 What Is Clustering? ............................................................................................................... 12
3.2 Applications Of Clustering....................................................................................................12
3.3 Clustering Algorithm.............................................................................................................13
CHAPTER 4 K Means Clustering Algorithm .................................................................................... 14
4.1 What is K Means Clustering?................................................................................................14
4.2 How does the K Means Clustering Algorithm Work? .......................................... 14
4.3 K-means Clustering – Example............................................................................................. 17
4.4 Advantages of K- Means Clustering Algorithm ................................................................... 17
4.5 Disadvantages of K- Means Clustering Algorithm ............................................................... 18
4.6 Applications of K- Means Clustering Algorithm .................................................................. 18
ಕರುನಾಡು ಟೆಕ್ಾಾಲಜೀಸ್ ಪ್ೆೈವೆೀಟ್ ಲಿಮಿಟೆಡ್
Karunadu Technologies Private Limited
• Email : support@karunadutechnologies.com
• Website : www.karunadutechnologies.com
• Based in : Chikkabanvara
COMPANY OVERVIEW
Karunadu Technologies Pvt. Ltd. is a leading IT software solutions and
services company focusing on quality standards and customer values. We offer a broad
range of customized software applications powered by concrete technology and
industry expertise. Karunadu Technologies Pvt. Ltd. offers end-to-end embedded
solutions and services. We deal with a broad range of product development along
with customized features, ensuring the utmost customer satisfaction. Karunadu
Technologies Pvt. Ltd. is also a leading Skills and Talent Development company
that is building a manpower pool for global industry requirements. We empower
individuals with knowledge, skills and competencies that assist them to grow as
integrated individuals with a sense of commitment and dedication. Karunadu
Technologies Pvt. Ltd. also helps companies to find the right individuals matching their
requirements. We engage in outsourcing of talented candidates.
CERTIFICATES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION TO INDUSTRY
1.1 Mission
IQRA Software is committed to its role of training technical individuals and corporates in areas
of speech compression, image processing, control systems, wireless LAN, VHDL,
MATLAB (Sci-hub), DSP TMS320C67xx, Java, Microsoft .NET, software quality testing,
SDLC & implementation, project management, manual testing, Silk Test, Mercury test tools, QTP,
and Test Director for Quality Center.
1.2 Vision
DSP, VLSI, Embedded and software testing are among the fastest growing areas in IT
across the globe. Our vision is to create a platform where trainees/students are able to learn
different features of technologies to secure a better position in the IT industry or to improve their
careers.
I chose Python as a working title for the project, being in a slightly irreverent mood
(and a big fan of Monty Python's Flying Circus).
1.7 Dictionary
Lists are sequences, but dictionaries are mappings: they map a unique key to a value,
and these mappings may not retain order. The basic operations are:
Constructing a dictionary.
Accessing an object from a dictionary.
Nesting dictionaries.
Basic dictionary methods.
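The operations listed above can be sketched in a few lines; the keys and values here are invented for illustration:

```python
# Constructing a dictionary
student = {"name": "Sushma", "usn": "2GB16EC026"}

# Accessing an object from a dictionary by its key
print(student["name"])  # Sushma

# Nesting dictionaries
college = {"dept": {"code": "ECE", "hod": "Pradeep A S"}}
print(college["dept"]["code"])  # ECE

# Basic dictionary methods
print(list(student.keys()))    # ['name', 'usn']
print(list(student.values()))  # ['Sushma', '2GB16EC026']
```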
Python provides four modes in which a file can be opened:
"r", for reading.
"w", for writing.
"a", for appending.
"r+", for both reading and writing.
Code in Python
# Count how many times each word occurs in 101.txt,
# print the counts and append them to 105.txt
dic = {}
words = []
with open("101.txt") as f1:
    for line in f1:
        words = words + line.split()
for i in range(len(words)):
    count = 0
    for j in range(len(words)):
        if words[i] == words[j]:
            count = count + 1
    dic[words[i]] = count
for word in dic:
    print(word + " " + str(dic[word]))
f4 = open("105.txt", "a+")
for word in dic:
    f4.writelines(word + " " + str(dic[word]) + "\n")
f4.close()
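For reference, Python's standard library offers a much shorter way to build the same word-count dictionary. This sketch counts words from an in-memory string instead of the report's 101.txt so that it is self-contained:

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"
# Counter builds the same word -> occurrence-count mapping as the nested loops above
dic = Counter(text.split())
print(dic["the"])  # 3
```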
CHAPTER 2
INTRODUCTION
Machine Learning is a system that can learn from examples through self-improvement,
without being explicitly coded by a programmer. The breakthrough comes with the idea
that a machine can singularly learn from the data (i.e., examples) to produce accurate results.
Machine learning overcomes the limitations of hand-written rules: the machine learns how the
input and output data are correlated and writes a rule itself. The programmers do not need to
write new rules each time there is new data.
Machine learning is the brain where all the learning takes place. The way the machine
learns is similar to the human being. Humans learn from experience. The more we know, the
more easily we can predict. By analogy, when we face an unknown situation, the likelihood
of success is lower than the known situation.
Machines are trained the same way. To make an accurate prediction, the machine sees an example.
When we give the machine a similar example, it can figure out the outcome. However, like a human,
if it is fed a previously unseen example, the machine has difficulty predicting.
The core objectives of machine learning are learning and inference. First of all, the
machine learns through the discovery of patterns. This discovery is made thanks to the data.
One crucial task of the data scientist is to choose carefully which data to provide to the
machine. The list of attributes used to solve a problem is called a feature vector. You can
think of a feature vector as a subset of data that is used to tackle a problem. The machine uses
some fancy algorithms to simplify the reality and transform this discovery into a model.
Therefore, the learning stage is used to describe the data and summarize it into a model.
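As a small illustration, a feature vector is just an ordered list of attribute values; the attribute names below (rooms, area, age) are made up for the example:

```python
import numpy as np

# Hypothetical attributes chosen by the data scientist: [rooms, area_sqft, age_years]
feature_vector = np.array([3, 1200, 15])
print(feature_vector.shape)  # one vector with 3 features
```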
The world today is evolving and so are the needs and requirements of people.
Furthermore, we are witnessing a fourth industrial revolution of data. In order to derive
meaningful insights from this data and learn from the way in which people and the system
interface with the data, we need computational algorithms that can churn the data and provide
us with results that would benefit us in various ways.
Data is expanding exponentially and in order to harness the power of this data, added
by the massive increase in computation power, Machine Learning has added another
dimension to the way we perceive information. Machine Learning is being utilized
everywhere. The electronic devices you use, the applications that are part of your everyday
life are powered by powerful machine learning algorithms.
Machine learning example – Google is able to provide you with appropriate search
results based on browsing habits. Similarly, Netflix is capable of recommending the films or
shows that you would want to watch based on the machine learning algorithms that perform
predictions based on your watch history.
Furthermore, machine learning has facilitated the automation of redundant tasks that
have taken away the need for manual labour. All of this is possible due to the massive amount
of data that you generate on a daily basis. Machine Learning facilitates several methodologies
to make sense of this data and provide you with steadfast and accurate results.
Different Types of Machine Learning
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Reinforcement Machine Learning
4. Semi Supervised Machine Learning
2.5.1 Definition
Supervised learning algorithms are used when the output is classified or labelled.
These algorithms learn from the past data that is input, called training data, run their
analysis, and use this analysis to predict future events for any new data within the known
classifications. Accurate prediction on test data requires a large amount of data so that the
algorithm gains a sufficient understanding of the patterns. The algorithm can be trained
further by comparing the training outputs to the actual ones and using the errors to modify the algorithm.
If the shape of the object is rounded with a depression at the top and the colour is Red, then it will be
labelled as Apple.
If the shape of the object is a long curving cylinder and the colour is Green-Yellow, then it will be
labelled as Banana.
Now suppose that, after training on this data, you are given a new, separate fruit, say a Banana, from
the basket and asked to identify it.
Since the machine has already learned from the previous data, this time it
has to use that learning wisely. It will first classify the fruit by its shape and colour and would confirm
the fruit name as BANANA, putting it in the Banana category. Thus the machine learns
from the training data (the basket containing fruits) and then applies the knowledge to the test data
(the new fruit). Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category, such as
"red" or "blue", or "disease" and "no disease".
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
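As a minimal sketch of the two categories, assuming scikit-learn is available (the tiny data set below is invented, with values 0/1 standing in for "no disease"/"disease"):

```python
from sklearn.linear_model import LogisticRegression, LinearRegression

X = [[1], [2], [3], [10], [11], [12]]

# Classification: the output variable is a category (0 = "no disease", 1 = "disease")
y_class = [0, 0, 0, 1, 1, 1]
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[2], [11]]))  # [0 1]

# Regression: the output variable is a real value, such as dollars or weight
y_real = [2.0, 4.0, 6.0, 20.0, 22.0, 24.0]  # exactly y = 2x
reg = LinearRegression().fit(X, y_real)
print(round(reg.predict([[5]])[0], 1))  # 10.0
```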
2.5.4 Challenges in Supervised Machine Learning
Here are the challenges faced in supervised machine learning:
Irrelevant input features present in the training data can give inaccurate results.
Data preparation and pre-processing is always a challenge.
Accuracy suffers when impossible, unlikely, or incomplete values have been
input as training data.
If a domain expert is not available, the other approach is "brute force": you
have to guess the right features (input variables) to train the machine
on, which can be inaccurate.
2.6.3 Description
Unsupervised learning is the training of a machine using information that is neither
classified nor labelled, allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to similarities,
patterns and differences, without any prior training on the data.
Thus, the machine has no idea about the features of dogs and cats, so it cannot categorize
the pictures into dogs and cats directly. But it can categorize them according to their similarities,
patterns, and differences, i.e., it can easily divide the picture collection into two parts.
The first part may contain all pictures having dogs in them and the second part may contain all
pictures having cats in them. Here nothing was learned beforehand, meaning there is no training data or examples.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behaviour.
Association: An association rule learning problem is where you want to discover
rules that describe large portions of your data, such as people that buy X also tend to
buy Y.
CHAPTER 3
CLUSTERING
CHAPTER 4
K MEANS CLUSTERING ALGORITHM
Now let us assume we have a data set which is unlabeled and we need to divide it into
clusters.
Now we need to find the number of clusters. This can be done by two methods:
Elbow Method.
Purpose Method.
Elbow Method
In this method, a curve is drawn between the "within sum of squares" (WSS) and the
number of clusters. The plotted curve resembles a human arm. It is called the elbow method
because the point of the elbow in the curve gives us the optimum number of clusters. In the
graph, after the elbow point the value of WSS changes very slowly, so the elbow point is
taken to give the final value of the number of clusters.
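The elbow curve can be computed with scikit-learn's KMeans, whose inertia_ attribute is exactly the within-cluster sum of squares. The toy points below are invented, forming two obvious blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight blobs: WSS should drop sharply from k=1 to k=2, then flatten
points = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

wss = []
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    wss.append(km.inertia_)  # within-cluster sum of squares for this k
print(wss)  # the "elbow" is at k=2
```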
Purpose Method
In this method, the data is divided based on different metrics and then it is
judged how well the division performed for each case. For example, the arrangement of shirts in the
men's clothing department of a mall is done on the criterion of size. It could also be done on the
basis of price or brand. The most suitable arrangement would be chosen to give the optimal
number of clusters, i.e. the value of K.
Now let us get back to our given data set above. We can calculate the number of clusters using either of the methods described above.
Step 1: Initialisation
Firstly, initialize random points, called the centroids of the clusters. While
initializing, you must take care that the number of centroids is less than the number
of training data points.
This algorithm is an iterative algorithm hence the next two steps are performed iteratively.
Step 2: Cluster Assignment
After initialization, all data points are traversed and the distance between each
centroid and each data point is calculated. Clusters are then formed by assigning each
point to its minimum-distance centroid. In this example, the data is divided into two
clusters.
Step 3: Moving the Centroids
As the clusters formed in the above step are not optimized, we need to form
optimized clusters. For this, we move the centroids iteratively to new locations:
take the data points of one cluster, compute their average, and then move the centroid of
that cluster to this new location. Repeat the same step for all other clusters.
Step 4: Optimization
The above two steps are repeated until the centroids stop moving, i.e., they do
not change their positions anymore and have become static. Once this happens, the k-means
algorithm is said to have converged.
Step 5: Convergence
Now this algorithm has converged and distinct clusters are formed and clearly visible.
This algorithm can give different results depending on how the clusters were initialized in
the first step.
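The steps above are what scikit-learn's KMeans performs internally; a minimal sketch on invented 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

# n_init restarts the random initialization several times, since (as noted above)
# the result depends on how the clusters were initialized
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # each point's cluster index
print(km.cluster_centers_)  # final (converged) centroid positions
```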
They need to analyse the areas from where pizza is ordered frequently.
They need to understand how many pizza stores have to be opened to cover
delivery in the area.
They need to figure out the locations of the pizza stores within all these areas in order
to keep the distance between the store and the delivery points minimal.
Resolving these challenges involves a lot of analysis and mathematics. We will now
learn how clustering can provide a meaningful and easy method of sorting out
such real-life challenges.
CHAPTER 5
K NEAREST NEIGHBOURS (KNN)
KNN is often used in search applications where you are looking for similar items,
i.e., finding items similar to a given one.
KNN is a Supervised Learning algorithm that uses labelled input data set to predict
the output of the data points.
It is one of the simplest Machine learning algorithms and it can be easily implemented
for a varied set of problems.
It is mainly based on feature similarity: KNN checks how similar a data point is to its
neighbours and classifies the data point into the class to which it is most similar.
Unlike most algorithms, KNN is a non-parametric model, which means that it does not
make any assumptions about the data set. This makes the algorithm more effective,
since it can handle realistic data.
KNN is a lazy algorithm: it memorizes the training data set instead of
learning a discriminative function from the training data.
KNN can be used for solving both classification and regression problems.
1. Calculate distance
The number of neighbours (K) in KNN is a hyperparameter that you need to choose at the time
of model building. You can think of K as a controlling variable for the prediction model.
Research has shown that no single number of neighbours is optimal for all kinds of data sets;
each data set has its own requirements. With a small number of neighbours, noise will
have a higher influence on the result, while a large number of neighbours makes the
computation expensive. Research has also shown that a small number of neighbours gives the
most flexible fit, with low bias but high variance, while a large number of neighbours gives
a smoother decision boundary, which means lower variance but higher bias.
Generally, data scientists choose K as an odd number if the number of classes is even.
You can also check by generating the model for different values of K and comparing their
performance. You can also try the elbow method here.
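The advice above (try different values of K and compare performance) can be sketched with scikit-learn's KNeighborsClassifier on an invented, well-separated data set:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[1], [2], [3], [4], [10], [11], [12], [13]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Odd K values, as suggested for an even number of classes
for k in [1, 3, 5]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.score(X, y))  # training accuracy for this K
```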
CHAPTER 6
LINEAR REGRESSION
In linear regression, the relationships are modelled using linear predictor functions
whose unknown model parameters are estimated from the data. Such models are called linear
models. Most commonly, the conditional mean of the response given the values of the
explanatory variables (or predictors) is assumed to be an affine function of those values; less
commonly, the conditional median or some other quantile is used. Like all forms of regression
analysis, linear regression focuses on the conditional probability distribution of the response
given the values of the predictors, rather than on the joint probability distribution of all of
these variables, which is the domain of multivariate analysis.
6.1 Definition
Linear Regression establishes a relationship between a dependent variable (Y) and one
or more independent variables (X) using a best-fit straight line (also known as the regression
line).
This task can be easily accomplished by the Least Squares Method, the most common
method used for fitting a regression line.
It calculates the best-fit line for the observed data by minimizing the sum of the
squares of the vertical deviations from each data point to the line. Because the deviations are
first squared, when added, there is no cancelling out between positive and negative values.
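The least-squares fit described above has closed-form expressions for the slope and intercept; a minimal numpy sketch on invented points roughly following y = 2x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Least squares: slope b = cov(x, y) / var(x), intercept a = mean(y) - b * mean(x);
# this minimizes the sum of squared vertical deviations from the line
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(round(b, 2), round(a, 2))  # 1.99 0.09
```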
6.2 Advantages
Linear regression is an extremely simple method. It is very easy and intuitive to use
and understand. A person with only the knowledge of high school mathematics can
understand and use it. In addition, it works in most cases. Even when it doesn't fit the
data exactly, we can use it to find the nature of the relationship between two variables.
6.3 Disadvantage
By its definition, linear regression only models relationships between dependent and
independent variables that are linear. It assumes there is a straight-line relationship
between them, which is sometimes incorrect. Linear regression is also very sensitive to
anomalies in the data (outliers).
Take, for example, data that mostly lies in the range 0-10. If for any reason a single
data item falls outside this range, say at 15, it significantly
influences the regression coefficients.
Another disadvantage is that if we have more parameters than the number of
samples available, then the model starts to model the noise rather than the relationship
between the variables.
CHAPTER 7
MULTIPLE LINEAR REGRESSION
7.1 Definition
In many cases, there may be possibilities of dealing with more than one predictor
variable for finding out the value of the response variable. Therefore, the simple linear
models cannot be utilized as there is a need for undertaking Multiple Linear Regression for
analyzing the predictor variables. Using two explanatory variables, we can delineate the
equation of Multiple Linear Regression as follows:
yi = β0 + β1x1i + β2x2i + εi
The two explanatory variables x1i and x2i determine yi for the ith data point.
Furthermore, the response is determined by the three parameters β0, β1, and
β2 of the model, and by the residual εi of point i from the fitted surface.
In its general form, with several explanatory variables, the model is
yi = β0 + Σj βjxji + εi
Plotting these in a multiple regression model, she could then use these factors to see their
relationship to the prices of the homes as the criterion variable.
In the salary example, the predictor variables could be each manager's seniority, the average
number of hours worked, the number of people being managed and the manager's departmental budget.
In the housing example, the relationship between the proximity of schools and sale prices may
lead her to believe that proximity had an effect on the sale price of all homes being sold in the community.
This illustrates the pitfalls of incomplete data. Had she used a larger sample, she could
have found that, out of 100 homes sold, only ten percent of the home values were
related to a school's proximity. If she had used the buyers' ages as a predictor value,
she could have found that younger buyers were willing to pay more for homes in the
community than older buyers.
In the example of management salaries, suppose there was one outlier who had a
smaller budget, less seniority and with fewer personnel to manage but was making
more than anyone else. The HR manager could look at the data and conclude that this
individual is being overpaid. However, this conclusion would be erroneous if he didn't
take into account that this manager was in charge of the company's website and had a
highly coveted skillset in network security.
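A two-predictor model like the salary example above can be fitted with numpy's least-squares solver. All numbers below are invented, and the salaries are constructed from known coefficients so the fit can be checked:

```python
import numpy as np

# Hypothetical predictors: seniority (years) and hours worked per week
x1 = np.array([2.0, 5.0, 8.0, 3.0, 10.0])
x2 = np.array([40.0, 45.0, 50.0, 42.0, 55.0])

# Constructed so the true coefficients are beta0=20, beta1=3, beta2=0.5
salary = 20.0 + 3.0 * x1 + 0.5 * x2

# Design matrix with an intercept column of ones, as in yi = b0 + b1*x1i + b2*x2i
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)
print(np.round(beta, 2))  # recovers approximately [20, 3, 0.5]
```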
CHAPTER 8
POLYNOMIAL REGRESSION
8.1 Definition
Polynomial regression is a form of regression analysis in which the relationship
between the independent variable x and the dependent variable y is modelled as an nth degree
polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x
and the corresponding conditional mean of y, denoted E(y |x), and has been used to describe
nonlinear phenomena such as the growth rate of tissues, the distribution of carbon isotopes in
lake sediments, and the progression of disease epidemics. Although polynomial regression
fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense
that the regression function E(y | x) is linear in the unknown parameters that are estimated
from the data. For this reason, polynomial regression is considered to be a special case of
Multiple Linear Regression.
At the same time, PR has its own unique cases: when we have a problem, we
might try SLR and MLR first and see what happens, and with PR we sometimes obtain better
results. For example, PR is used to observe how epidemics spread across a population, and
similar use cases; so it's a matter of what we want to predict, and of course it's always good
to have more tools in our arsenal. But at this point you might ask yourself why PR is still
called linear regression.
We use PR when the simple linear regression straight line doesn't fit our observations well and we want
to obtain a parabolic effect:
Well, the trick here is that when we talk about linear and nonlinear models, we are not
thinking in terms of the variables but of the coefficients. So the question is
whether the function can be expressed as a linear combination of these coefficients.
Since the coefficients are ultimately unknown, the goal is to find the coefficient values,
and this is why "linear" and "nonlinear" refer to the coefficients. So PR is a special case
of MLR rather than a standard new type of regression.
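When a straight line cannot capture a parabolic trend, a degree-2 polynomial fit can; a short sketch with numpy.polyfit on invented data lying exactly on a parabola:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2  # a perfect parabola, which no straight line can fit

# Fits y = c2*x^2 + c1*x + c0; the model is linear in the coefficients c2, c1, c0
coeffs = np.polyfit(x, y, deg=2)
print(np.round(coeffs, 2))  # recovers approximately [1, 0, 0]
```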
CHAPTER 9
PROJECT DESCRIPTION
In this project, we will develop and evaluate the performance and predictive power of a model
trained and tested on data collected from houses in Boston's suburbs. Once we get a good fit, we will
use this model to predict the monetary value of a house located in the Boston area. A model like this
would be very valuable for a real estate agent, who could make use of the information provided on a
daily basis.
9.2 DATASET:
This implies that if the y_pred and y_test values are approximately similar, then the applied
algorithm suits the given data set.
9.4.2 DATASET
There is a csv file as the input dataset. The name of the csv file is “mlr11.csv”. This mlr11.csv
file consists of 11 rows and 8 columns.
9.5 FOR THE PURPOSE OF THE PROJECT THE DATASET HAS BEEN
PREPROCESSED
The essential features for the project are: ‘RM’, ‘LSTAT’, ‘PTRATIO’ and ‘MEDV’. The
remaining features have been excluded.
16 data points with a ‘MEDV’ value of 50.0 have been removed, as they likely contain censored or
missing values.
1 data point with an ‘RM’ value of 8.78 is considered an outlier and has been removed for the
optimal performance of the model.
As this data is out of date, the ‘MEDV’ value has been scaled multiplicatively to account for 35
years of market inflation.
We'll now open a Python 3 Jupyter Notebook and execute the following code snippet to load the dataset
and remove the non-essential features, receiving a success message if the actions were correctly performed.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

dataset = pd.read_csv('train.csv')
x = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

# Fit a baseline multiple linear regression and predict on the test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)

# Backward elimination: prepend a column of ones for the intercept,
# then repeatedly refit and drop the weakest predictor after each summary
x = np.append(arr=np.ones((len(x), 1)).astype(int), values=x, axis=1)
x_opt = x[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]]
regressor_ols = sm.OLS(endog=y, exog=x_opt).fit()
regressor_ols.summary()
x_opt = x[:, [0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13]]
regressor_ols = sm.OLS(endog=y, exog=x_opt).fit()
regressor_ols.summary()
x_opt = x[:, [0, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13]]
regressor_ols = sm.OLS(endog=y, exog=x_opt).fit()
regressor_ols.summary()
x_opt = x[:, [0, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13]]
regressor_ols = sm.OLS(endog=y, exog=x_opt).fit()
regressor_ols.summary()
x = x_opt
The explanatory (independent) variables resulting from the polynomial expansion of the
“baseline” variables are known as higher-degree terms. Such variables are also used in
classification settings. Notice that PR is very similar to MLR, but instead of different
variables it uses the same variable x1 raised to different powers; so basically we are using
one variable through different powers of the same original variable.
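The higher-degree terms described above can be generated explicitly. As a sketch assuming scikit-learn is available, PolynomialFeatures turns one variable x1 into the columns [1, x1, x1², x1³], after which an ordinary multiple linear regression can be fitted:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x1 = np.array([[2.0], [3.0], [4.0]])  # a single original variable

# Expand into powers of the same variable: columns 1, x1, x1^2, x1^3
expanded = PolynomialFeatures(degree=3).fit_transform(x1)
print(expanded)
```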
CONCLUSION
Throughout this report we built a machine learning regression project end-to-end and
obtained several insights about regression models and how they are developed.
This was the first of the machine learning projects to be developed in this series; the next
will be an introduction to the theory and concepts regarding classification algorithms.