

VISVESVARAYA TECHNOLOGICAL UNIVERSITY - BELAGAVI

INTERNSHIP REPORT ON

“PYTHON WITH MACHINE LEARNING”

Under the Guidance of


Mr. PRADEEP A S
Asst. Prof. and HOD

Dept. of ECE Govt. Engineering College Huvina Hadagali

Internship Associate

SUSHMA 2GB16EC026

2019 – 2020
Department of Electronics and Communication Engineering
Government Engineering College
Huvina Hadagali - 583219
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
BELAGAVI

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING


GOVERNMENT ENGINEERING COLLEGE,
HUVINA HADAGALI – 583219

CERTIFICATE
Certified that the Internship report entitled “PYTHON WITH MACHINE
LEARNING” is presented by Ms. SUSHMA (2GB16EC026) in partial
fulfillment for the award of the Degree of Bachelor of Engineering in Electronics
and Communication Engineering by the Visvesvaraya Technological
University, Belagavi, during the academic year 2019-20. The Internship report
has been approved as it satisfies the academic requirements in respect of the
Internship work prescribed for the said Degree.

……...……………………… ..……………………………
Signature of Internship Guide Signature of Internship Co–ordinator
Mr. PRADEEP A S Asst. Prof. and HOD Mr. PRADEEP A S Asst. Prof. and HOD
Dept. of ECE, GEC Huvina Hadagali Dept. of ECE, GEC Huvina Hadagali

……...……………………… ….……………………………
Signature of HOD Signature of Principal
Mr. PRADEEP A S Asst. Prof. and HOD Shri. Dr. SHASHIDHAR S RAMTHAL
Dept. of ECE, GEC Huvina Hadagali Principal, GEC Huvina Hadagali

External Viva
Name of the Examiners Signature with date
1…………………………… ………………………
2…………………………… ………………………
ACKNOWLEDGEMENT
I am presenting my Internship Report on “PYTHON WITH MACHINE
LEARNING”.
The satisfaction that accompanies the successful completion of any task would be
incomplete without mentioning the people who made it possible, many of whom are
responsible for the knowledge and experience gained during the course of the work.
I would like to express my humble thanks to one and all who have helped
me directly or indirectly in the successful completion of the Internship Report.
I am grateful to Government Engineering College Huvina Hadagali and the
Department of Electronics and Communication Engineering for imparting to me the
knowledge with which I can do my best.

I would like to thank Dr. SHASHIDHAR S RAMTHAL, Principal, Government
Engineering College Huvina Hadagali.

I express my gratitude to our Internship Co-ordinator Mr. PRADEEP A S, Asst. Prof.
and HOD, Department of Electronics and Communication Engineering, for his valuable
guidance and continual encouragement and assistance throughout the Internship work.

I extend my sense of gratitude to my guide Mr. PRADEEP A S, Asst. Prof. and
HOD, Department of Electronics and Communication Engineering, for extending support and
cooperation. I am grateful to him for discussions about technical matters and suggestions
concerning our Internship work.

I would like to express sincere thanks to my co-guides Miss PRIYANKA B, Miss
KOTRAMMA B and Miss VISHALA T, Department of ECE, for their valuable guidance and
timely suggestions.
I take this opportunity to thank my Parents and Friends who are constant source of
inspiration to me.
Finally, I would like to thank all the Professors of the Department of Electronics and
Communication Engineering, who with their constant and creative criticism made me
maintain standards throughout my endeavour to complete this Internship Report.

SUSHMA
2GB16EC026
CONTENTS
COMPANY PROFILE……………………………………………………….i & ii

COMPANY OVERVIEW…………………………………………………………iii

CERTIFICATES CREDITED……………………………………………….iv & v

List of Figures ............................................................................................................................. vi


CHAPTER 1 Introduction to industry .................................................................................................. 1
1.1 Mission ...................................................................................................................................... 1
1.2 Vision ........................................................................................................................................ 1
1.3 History of python ...................................................................................................................... 1
1.4 Why python ............................................................................................................................... 2
1.5 Characteristics of python ........................................................................................................... 2
1.6 Data structure in python lists .................................................................................................... 3
1.7 Dictionary ................................................................................................................................... 3
1.8 File handling in python ............................................................................................................. 3

CHAPTER 2 Introduction
2.1 General Introduction: What is Machine Learning? ................................................................ 5
2.2 Machine Learning Vs. Traditional Programming ................................................................... 5
2.3 How does Machine Learning Work? ..................................................................................... 5
2.4 Why Machine Learning? ......................................................................................................... 6
2.5 Supervised Machine Learning ................................................................................................ 7
2.6 Unsupervised Machine Learning ..........................................................................................10

CHAPTER 3 Clustering
3.1 What Is Clustering? ............................................................................................................... 12
3.2 Applications Of Clustering....................................................................................................12
3.3 Clustering Algorithm.............................................................................................................13
CHAPTER 4 K Means Clustering Algorithm .................................................................................... 14
4.1 What is K Means Clustering?................................................................................................14
4.2 How does the K Means Clustering Algorithm Work? .......................................................... 14
4.3 K-means Clustering – Example............................................................................................. 17
4.4 Advantages of K- Means Clustering Algorithm ................................................................... 17
4.5 Disadvantages of K- Means Clustering Algorithm ............................................................... 18
4.6 Applications of K- Means Clustering Algorithm .................................................................. 18

CHAPTER 5: KNN Clustering Algorithm .........................................................................................19


5.1 What is KNN Algorithm? ..................................................................................................... 19
5.2 Features of KNN Algorithm ................................................................................................. 19
5.3 How does the KNN Algorithm Work? ................................................................................. 20
5.4 How to decide number of K in KNN algorithms? ................................................................ 20
5.5 Advantage of KNN algorithm ............................................................................................................. 21
5.6 Disadvantage of KNN ......................................................................................................................... 22

CHAPTER 6: Linear Regression ......................................................................................................... 23


6.1 Definition .............................................................................................................................. 23
6.2 Advantages ............................................................................................................................ 24
6.3 Disadvantage ......................................................................................................................... 24

CHAPTER 7: Multiple Linear Regression .........................................................................................25


7.1 Definition .............................................................................................................................. 25
7.2 Examples of Multiple Regression ......................................................................................... 25
7.3 Advantages of Multiple Regression ...................................................................................... 26
7.4 Disadvantages of Multiple Regression ................................................................................. 26

CHAPTER 8: Polynomial Regression ................................................................................................ 28


8.1 Definition .............................................................................................................................. 28
8.2 Advantages of using Polynomial Regression ........................................................................ 29
8.3 Disadvantages of using Polynomial Regression ................................................................... 29

CHAPTER 9: Project Description .......................................................................................................30


9.1 Getting the data and preprocessors
9.2 Dataset ................................................................................................................................... 31
9.3 Python Code .......................................................................................................................... 32
9.4 Project Implementation
9.5 Preprocessed Dataset
9.6 Simple Code
9.7 Predicted Output and Results ................................................................................................ 36
CONCLUSION ...................................................................................................................................... 37
COMPANY PROFILE

KARUNADU TECHNOLOGIES PRIVATE


LIMITED
CHIKKABANAVARA BENGALURU

Head Office: #17, ATK Complex, 4th Floor, Acharya
College Main Road, Beside Karur Vysya Bank,
Gutte Basaveshwaranagar, Chikkabanavara,
Bengaluru, Karnataka - 560090

Registered Office: #59, 2nd Main, 1st Cross, Near
Bus Stop, Singapura Village, Vidyaranyapura
Post, Bengaluru, Karnataka - 560097

Support Centre: #11/1, 1st Floor, Opp. Vagadevi
College, Near Krishna Engineering College,
Chikkabanavara, Bengaluru, Karnataka - 560090

(i)
ಕರುನಾಡು ಟೆಕ್ಾಾಲಜೀಸ್ ಪ್ೆೈವೆೀಟ್ ಲಿಮಿಟೆಡ್
Karunadu Technologies Private Limited

• MD & CEO : Mr. Mahesh Deginal

• Mentor : Mr. Mahesh Deginal &


Mr. Arjun

• Email : support@karunadutechnologies.com

• Tel : 09902913646 / 09964823646

• Website : www.karunadutechnologies.com

• Based in : Chikkabanvara

• Area of Operations : Bengaluru.

( ii )
COMPANY OVERVIEW
Karunadu Technologies Pvt. Ltd. is a leading IT software solutions and
services company focusing on quality standards and customer values. We offer a broad
range of customized software applications powered by concrete technology and
industry expertise. Karunadu Technologies Pvt. Ltd. offers end-to-end embedded
solutions and services. We deal with a broad range of product development along
with customized features, ensuring utmost customer satisfaction. Karunadu
Technologies Pvt. Ltd. is also a leading Skills and Talent Development company
that is building a manpower pool for global industry requirements. We empower
individuals with knowledge, skills and competencies that assist them to grow as
integrated individuals with a sense of commitment and dedication. Karunadu
Technologies Pvt. Ltd. also helps companies find the right individuals matching their
requirements. We engage in outsourcing of talented candidates.

To empower unskilled individuals with knowledge, skills and technical
competencies in the fields of Information Technology and Embedded engineering,
which assist them to grow as integrated individuals contributing to the company's
and the Nation's growth.

To develop software and Embedded solutions and services focusing on
quality standards and customer values. To offer end-to-end embedded solutions which
ensure the best customer satisfaction. To build a skilled and talented manpower pool
for global industry requirements. To develop software and embedded products
which are globally recognized. To become a global leader in offering scalable and
cost-effective software solutions and services across various domains like E-
commerce, Banking, Finance, Healthcare and much more. To generate employment
for the skilled and highly talented youth of our country, INDIA.

( iii )
CERTIFICATES

(v)
LIST OF FIGURES

Sl. No Fig No Name of the figure Page No


1. 2.1 Example for python 2
2. 2.2 Dataflow for machine learning 6
3. 2.5 Description of supervised machine learning example 8
4. 3.1 Working for clustering algorithm 13
5. 4.1 Optimal number of cluster in elbow method 15
6. 4.2 Optimal number of clusters in purpose method 16
7. 4.3 Data is divided into two clusters 16
8. 4.4 Move the centroids iteratively to a new location 16
9. 4.5 Convergence of clusters 17
10. 5.1 KNN algorithm 19
11. 5.2 KNN working diagram 20
12. 5.3 No. of k in KNN algorithm 21
13. 6.1 Regression analysis 23
14. 6.2 Observations are assumed to be the result of random variation 24
15. 8.1 Formula of polynomial regression 28
16. 8.2 Polynomial Regression 29
17. 9.1 Dataset 31
18. 9.2 Python Code 32
19. 9.3 Python Code Result 32
PYTHON WITH MACHINE LEARNING 2019-20

CHAPTER 1
INTRODUCTION TO INDUSTRY

IQRA Software Technologies is a premier institute which provides IT and software
skills training in scientific & engineering fields with the best quality at lower costs. We are one
of the fastest growing software solutions, technical and knowledge outsourcing companies
situated in India, with offices at Bangalore, Kanpur and Lucknow.

1.1 Mission
IQRA Software is committed to its role of training individuals and corporates in areas
of speech compression, image processing, control systems, wireless LAN, VHDL,
MATLAB(Sci-hub), DSP TMS320C67xx, Java, Microsoft .NET, software quality testing,
SDLC & implementation, project management, manual testing, Silk Test, Mercury test, QTP,
and Test Director for Quality Center.

1.2 Vision
DSP, VLSI, Embedded systems and software testing are among the fastest growing areas in IT
across the globe. Our vision is to create a platform where trainees/students are able to learn
different features of these technologies to secure a better position in the IT industry or to improve
their careers.

1.3 History of Python


Python was developed in the late 1980s by Guido van Rossum at the National Research
Institute for Mathematics and Computer Science in the Netherlands, as a successor of the ABC
language capable of exception handling and interfacing. Python features a dynamic type
system and automatic memory management. It supports multiple programming paradigms,
including object-oriented, imperative, functional and procedural, and has a large and
comprehensive standard library.
In December 1989 the creator developed the first Python interpreter as a hobby, and then on 16
October 2000, Python 2.0 was released with many new features. In van Rossum's own words:
In December 1989, I was looking for a “hobby” programming project that would keep
me occupied around Christmas. My office…would be closed, but I had a home computer and not
much else on my hands. I decided to write an interpreter for the new scripting language I had been
thinking about lately: a descendant of ABC that would appeal to Unix/C hackers.

Dept. of ECE GEC Huvina Hadagali 2019-20 Page 1



I chose Python as a working title for the project, being in a slightly irreverent mood
(and a big fan of Monty Python's Flying Circus).

1.4 Why Python


The language's core philosophy is summarized in the document The Zen of Python
(PEP 20), which includes aphorisms such as:
 Beautiful is better than ugly
 Simple is better than complex
 Complex is better than complicated
 Readability counts
 Explicit is better than implicit

A simple program to print “Hello World” is shown in Fig. 1.

Fig.1 Example for Python

1.5 Characteristics of Python


 Easy to read: Python source-code is clearly defined and visible to the eyes.
 Portable: Python codes can be run on a wide variety of hardware platforms having the
same interface.
 Extendable: Users can add low level-modules to Python interpreter.
 Scalable: Python provides an improved structure and support for large programs.
 Interactive Programming Language: Users can interact with the python interpreter
directly for writing programs.
 Easy language: Python is an easy language to learn, especially for beginners.
 Straight forward Syntax: The formation of python syntax is simple and
straightforward which also makes it popular.


1.6 Data Structure in Python Lists


 Ordered collection of data.
 Supports similar slicing and indexing functionalities as in the case of strings.
 They are mutable.
 Advantages of a list over a conventional array:
 Lists have no size or type constraints (no setting restrictions beforehand).
 They can contain different object types.
 We can delete elements from a list by using del list_name[index_val].
Example:
my_list = ['one', 'two', 'three', 4, 5]
len(my_list) would output 5.
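These list properties can be checked interactively; the sketch below uses illustrative values only:

```python
# A list can mix object types and has no fixed size.
my_list = ['one', 'two', 'three', 4, 5]
print(len(my_list))      # 5

# Indexing and slicing work as with strings.
print(my_list[0])        # one
print(my_list[1:3])      # ['two', 'three']

# Lists are mutable: elements can be replaced or deleted in place.
my_list[0] = 'ONE'
del my_list[4]           # delete by index with del
print(my_list)           # ['ONE', 'two', 'three', 4]
```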

1.7 Dictionary
 Lists are sequences, but dictionaries are mappings.
 They are mappings between a unique key and a value pair.
 These mappings may not retain order.
 Constructing a dictionary.
 Accessing object from a dictionary.
 Nesting Dictionaries.
 Basic Dictionary Methods.
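The points above can be sketched with a small, made-up price dictionary:

```python
# Constructing a dictionary: a mapping between unique keys and values.
prices = {'apple': 30, 'banana': 10}

# Accessing an object from a dictionary by its key.
print(prices['apple'])            # 30

# Nesting dictionaries inside dictionaries.
store = {'fruits': {'apple': 30, 'banana': 10}}
print(store['fruits']['banana'])  # 10

# Basic dictionary methods: keys(), values(), items().
print(sorted(prices.keys()))      # ['apple', 'banana']
prices['mango'] = 50              # adding a new key-value pair
print(len(prices))                # 3
```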

1.8 File Handling in Python


Python too supports file handling and allows users to handle files, i.e., to read and write files,
along with many other file handling options.
The concept of file handling exists in various other languages, but the implementation is
either complicated or lengthy; like other concepts of Python, this concept is also easy and short.
Python treats files differently as text or binary, and this is important. Each line of a text file is a
sequence of characters and is terminated with a special character, called the EOL or End of Line
character, such as the newline character. It ends the current line and tells the interpreter a new one
has begun. Let us start with reading and writing files.
We use the open() function in Python to open a file in read or write mode. As explained
above, open() returns a file object. It accepts two arguments, the file name and the mode, whether
to read or write. So, the syntax is: open(file_name, mode).


Python provides the following modes in which a file can be opened:
 “r”, for reading.
 “w”, for writing.
 “a”, for appending.
 “r+”, for both reading and writing.

Code in Python
 dic={}
 words=[]
 with open(“101.txt”) as f1:
 for line in f1 :
 words = words + line. Split()
 for I in range(len(words)):
 count=0
 for j in range(len(words)):
 if words[i]==words[j]:
 count=count+1
 dic[words[i]]=count
 for count in dic:
 print(count+” “+str(dic[count]))
 f4=open(“105.txt, “a+”)
 f4.writelines(count+”“+str(dic[count])


CHAPTER 2
INTRODUCTION
Machine Learning is a system that can learn from examples through self-improvement,
without being explicitly coded by a programmer. The breakthrough comes with the idea
that a machine can learn on its own from the data (i.e., examples) to produce accurate results.

2.1 General Introduction: What is Machine Learning?


Machine learning combines data with statistical tools to predict an output. This output
is then used by businesses to derive actionable insights. Machine learning is closely related to
data mining and Bayesian predictive modeling. The machine receives data as input and uses an
algorithm to formulate answers.
A typical machine learning task is to provide a recommendation. For those who
have a Netflix account, all recommendations of movies or series are based on the user's
historical data. Tech companies use unsupervised learning to improve the user
experience with personalized recommendations. Machine learning is also used for a variety of
tasks like fraud detection, predictive maintenance, portfolio optimization, task automation and
so on.

2.2 Machine Learning vs. Traditional Programming

Traditional programming differs significantly from machine learning. In traditional
programming, a programmer codes all the rules in consultation with an expert in the industry
for which the software is being developed. Each rule is based on a logical foundation; the
machine will execute an output following the logical statement. When the system grows
complex, more rules need to be written, and it can quickly become unsustainable to maintain.

Machine learning is supposed to overcome this issue. The machine learns how the
input and output data are correlated and it writes a rule. The programmers do not need to
write new rules each time there is new data.

2.3 How does Machine learning work?

Machine learning is the brain where all the learning takes place. The way the machine
learns is similar to the human being. Humans learn from experience. The more we know, the
more easily we can predict. By analogy, when we face an unknown situation, the likelihood
of success is lower than in a known situation.


Machines are trained the same way. To make an accurate prediction, the machine sees examples.
When we give the machine a similar example, it can figure out the outcome. However, like a human,
if it is fed a previously unseen example, the machine has difficulty predicting.

Fig 2.1 Data flow for Machine learning.

The core objectives of machine learning are learning and inference. First of all, the
machine learns through the discovery of patterns. This discovery is made thanks to the data.
One crucial part of the data scientist's job is to choose carefully which data to provide to the
machine. The list of attributes used to solve a problem is called a feature vector. You can
think of a feature vector as a subset of data that is used to tackle a problem. The machine uses
some fancy algorithms to simplify the reality and transform this discovery into a model.
Therefore, the learning stage is used to describe the data and summarize it into a model.
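As a concrete illustration, a feature vector is simply the ordered list of attribute values describing one observation; the attribute names below (area, bedrooms, age) are hypothetical:

```python
# One observation described by three features:
# [area in square feet, number of bedrooms, age in years]
house = [1200.0, 3.0, 15.0]

# A dataset is then a collection of such feature vectors,
# one row per observation.
dataset = [
    [1200.0, 3.0, 15.0],
    [1500.0, 4.0, 5.0],
    [900.0, 2.0, 30.0],
]
print(len(dataset), "observations,", len(dataset[0]), "features each")
```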

2.4 Why Machine Learning?

The world today is evolving and so are the needs and requirements of people.
Furthermore, we are witnessing a fourth industrial revolution of data. In order to derive
meaningful insights from this data and learn from the way in which people and the system
interface with the data, we need computational algorithms that can churn the data and provide
us with results that would benefit us in various ways.


Machine Learning has revolutionized industries like medicine, healthcare,
manufacturing, banking, and several others, and it has therefore become an essential part of
modern industry.

Data is expanding exponentially, and in order to harness its power, aided
by the massive increase in computation power, Machine Learning has added another
dimension to the way we perceive information. Machine Learning is being utilized
everywhere: the electronic devices you use and the applications that are part of your everyday
life are powered by powerful machine learning algorithms.

A machine learning example: Google is able to provide you with appropriate search
results based on your browsing habits. Similarly, Netflix is capable of recommending films or
shows that you would want to watch, based on machine learning algorithms that perform
predictions from your watch history.

Furthermore, machine learning has facilitated the automation of redundant tasks that
have taken away the need for manual labour. All of this is possible due to the massive amount
of data that you generate on a daily basis. Machine Learning facilitates several methodologies
to make sense of this data and provide you with steadfast and accurate results.

The different types of Machine Learning are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Reinforcement Machine Learning
4. Semi Supervised Machine Learning

2.5 Supervised Machine Learning

2.5.1 Definition

Supervised learning algorithms are used when the output is classified or labelled.
These algorithms learn from past input data, called training data, run their
analysis, and use this analysis to predict future events for any new data within the known
classifications. Accurate prediction on test data requires a large amount of data to develop a
sufficient understanding of the patterns. The algorithm can be trained further by comparing the
training outputs to the actual ones and using the errors to modify the algorithm.


2.5.2 Real Life Examples of Supervised Machine Learning


 Image Classification: The algorithm is trained by feeding it labeled image data.
Once the algorithm is trained, it is expected that in the case of a new image it
classifies it correctly.
 Market Prediction: This is also called regression. Historical business market data is
fed to the computer. With analysis and a regression algorithm, a new price for the future is
predicted depending on the variables.

2.5.3 Description of Supervised Machine Learning


Supervised learning, as the name indicates, implies the presence of a supervisor as a teacher.
Basically, supervised learning is learning in which we teach or train the machine using data
which is well labeled, which means some data is already tagged with the correct answer. After
that, the machine is provided with a new set of examples (data) so that the supervised learning
algorithm analyses the training data (the set of training examples) and produces a correct
outcome from the labeled data. For instance, suppose you are given a basket filled with different
kinds of fruits. The first step is to train the machine with all the different fruits one by one,
like this:

If the shape of the object is rounded with a depression at the top and its colour is red, then it will
be labeled as Apple.
If the shape of the object is a long curving cylinder with colour green-yellow, then it will be
labeled as Banana.
Now suppose that after training, the machine is given a new, separate fruit, say a banana,
from the basket and is asked to identify it.
Since the machine has already learned from the previous data, this time it has to use
that knowledge wisely. It will first classify the fruit by its shape and colour, confirm
the fruit name as BANANA, and put it in the Banana category. Thus the machine learns the


things from the training data (basket containing fruits) and then applies the knowledge to test data
(the new fruit). Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category, such as
“red” or “blue”, or “disease” and “no disease”.
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
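The fruit example above can be sketched as a tiny supervised classifier. This is a minimal nearest-neighbour sketch in plain Python, not a production algorithm, and the width/height measurements are invented purely for illustration:

```python
import math

# Training data: hypothetical (width_cm, height_cm) measurements,
# each tagged with the correct answer (its label).
training = [
    ((7.0, 7.5), 'Apple'),    # roughly round
    ((7.2, 7.0), 'Apple'),
    ((4.0, 18.0), 'Banana'),  # long curving cylinder
    ((3.8, 17.0), 'Banana'),
]

def classify(fruit):
    """Label a new fruit with the label of its nearest training example."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    _, label = min(training, key=lambda pair: dist(pair[0], fruit))
    return label

# A new, unseen fruit is classified using what was learned.
print(classify((4.1, 16.5)))  # Banana
print(classify((6.8, 7.2)))   # Apple
```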
2.5.4 Challenges in Supervised Machine Learning
Here are some challenges faced in supervised machine learning:
 Irrelevant input features present in the training data can give inaccurate results.
 Data preparation and pre-processing is always a challenge.
 Accuracy suffers when impossible, unlikely, or incomplete values have been
input as training data.
 If a domain expert is not available, the other approach is “brute force”: you have to
guess which features (input variables) are the right ones to train the machine on, and
the guess could be inaccurate.

2.5.5 Advantages in Supervised Machine Learning


 Supervised learning allows you to collect data or produce a data output from
previous experience.
 It helps you to optimize performance criteria using experience.
 Supervised machine learning helps you to solve various types of real-world
computation problems.

2.5.6 Disadvantages in Supervised Machine Learning


 The decision boundary might be overtrained if your training set does not have
examples that you want to have in a class.
 You need to select lots of good examples from each class while you are training the
classifier.
 Classifying big data can be a real challenge.
 Training for supervised learning needs a lot of computation time.


2.6 Unsupervised Machine Learning


2.6.1 Definition
Unsupervised learning algorithms are used when we are unaware of the final outputs
and classified or labeled outputs are not at our disposal. These algorithms study the data and
generate a function to describe completely hidden and unlabeled patterns. Hence, there is no
correct output; instead, the algorithm studies the data to reveal unknown structures in unlabeled data.
2.6.2 Real Life Examples
 Clustering: The algorithm groups data with similar traits together; these groups are
called clusters. Clusters prove helpful in studying groups, since findings can be
applied, more or less, to all the data within a cluster.
 High Dimension Data: High dimension data is normally not easy to work with. With
the help of unsupervised learning, visualization of high dimension data becomes
possible.
 Generative Models: Once your algorithm analyses the input and comes up with its
probability distribution, the distribution can be used to generate new data. This proves
to be very helpful in cases of missing data.

2.6.3 Description
Unsupervised learning is the training of a machine using information that is neither
classified nor labelled, allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to similarities,
patterns and differences without any prior training on the data.

Unlike supervised learning, no teacher is provided, which means no training will be
given to the machine. The machine is therefore left to find the hidden structure in
unlabelled data by itself.


Thus, the machine has no idea about the features of dogs and cats, so we can't categorize
the pictures into dogs and cats directly. But it can categorize them according to their similarities,
patterns, and differences, i.e., we can easily categorize the pictures into two parts.

The first part may contain all pictures having dogs in them, and the second part may contain all
pictures having cats. Here the machine did not learn anything beforehand, meaning there were no
training data or examples.
Unsupervised learning is classified into two categories of algorithms:
 Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behaviour.
 Association: An association rule learning problem is where you want to discover
rules that describe large portions of your data, such as people that buy X also tend to
buy Y.

2.6.4 Why Unsupervised Learning?


Here, are prime reasons for using Unsupervised Learning:
 Unsupervised machine learning finds all kinds of unknown patterns in data.
 Unsupervised methods help you find features which can be useful for
categorization.
 It takes place in real time, so all the input data can be analysed and labelled in the
presence of learners.
 It is easier to get unlabelled data from a computer than labelled data, which needs
manual intervention.


CHAPTER 3
CLUSTERING

Clustering is basically a type of unsupervised learning method. An unsupervised
learning method is one in which we draw inferences from datasets consisting of input data
without labelled responses.

3.1 What Is Clustering?


Clustering is a technique widely used to find groups of observations (called clusters)
that share similar characteristics. This process is not driven by a specific purpose, which
means you don't have to tell your algorithm how to group those observations; it does so on
its own (groups are formed organically). The result is that observations (or data points) in
the same group are more similar to each other than to observations in another group. The
goal is to obtain data points in the same group that are as similar as possible, and data points
in different groups that are as dissimilar as possible.

Extremely well suited for exploratory analysis, K-means is perfect for getting to know
your data and providing insights on almost all data types. Whether it is an image, a figure or
a piece of text, K-means is so flexible it can take on almost anything.

Clustering is dividing data points into homogeneous classes or clusters:
 Points in the same group are as similar as possible.
 Points in different groups are as dissimilar as possible.
When a collection of objects is given, we put the objects into groups based on similarity.

3.2 Applications of Clustering


 Clustering is used in almost all fields. Listed here are a few applications, which
would add to what you have learnt. Clustering helps marketers improve their
customer base and work on the target areas. It helps group people (according to
different criteria such as willingness, purchasing power, etc.) based on their
similarity in many ways related to the product under consideration.
 Clustering helps in the identification of groups of houses on the basis of their value,
type and geographical location.
 Clustering is used to study earthquakes. Based on the areas hit by an earthquake,
clustering can help analyse the next probable location where an earthquake can
occur.


 A hospital-care chain wants to open a series of Emergency-Care wards within a
region. We assume that the hospital knows the locations of all the most accident-
prone areas in the region. It has to decide the number of Emergency Units to be
opened and their locations, so that all the accident-prone areas are covered in the
vicinity of these Emergency Units. The challenge is to decide the locations of these
Emergency Units so that the whole region is covered. Here is where K-means
clustering comes to the rescue!

3.3 Clustering Algorithm


A clustering algorithm tries to find natural groups of data on the basis of some
similarity. It locates the centroid of each group of data points.

Fig.3.1 working for clustering algorithm


To carry out effective clustering, the algorithm evaluates the distance of each
point from the centroid of its cluster. The goal of clustering is to determine the intrinsic
grouping in a set of unlabelled data.


CHAPTER 4

K MEANS CLUSTERING ALGORITHM


K-means is one of the most important algorithms in machine learning. K-means
clustering is an unsupervised learning technique used for data classification, and we provide
several examples to help explain how it works. Unsupervised learning means there is no
output variable to guide the learning process (no this or that, no right or wrong) and the data
is explored by algorithms to find patterns. We only observe the features but have no
established measurements of the outcomes, since we want to find them out.

As opposed to supervised learning, where your existing data is already labelled and you
know which behaviour you want to determine in the new data you obtain, unsupervised
learning techniques don't use labelled data, and the algorithms are left to themselves to
discover structure in the data. Within the universe of clustering techniques, K-means is
probably one of the best known and most frequently used. K-means uses an iterative
refinement method to produce its final clustering based on the number of clusters defined by
the user (represented by the variable K) and the dataset.

4.1 What is K Means Clustering?


K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms
that solve the well-known clustering problem. K-means clustering is a method of vector
quantization, originally from signal processing, that is popular for cluster analysis in data
mining.

4.2 How does K Means clustering Algorithms works?


K- Means Clustering Algorithm needs the following inputs:
 K = number of subgroups or clusters
 Sample or Training Set = {x1, x2, x3,……xn}

Now let us assume we have a data set which is unlabeled and we need to divide it into
clusters.

Now we need to find the number of clusters. This can be done by two methods:

 Elbow Method.
 Purpose Method.


Elbow Method

In this method, a curve is drawn between the within-cluster sum of squares (WSS) and
the number of clusters. The curve plotted resembles a human arm. It is called the elbow
method because the point of the elbow in the curve gives us the optimum number of clusters.
In the graph, after the elbow point the value of WSS changes very slowly, so the elbow point
must be taken to give the final value of the number of clusters.

Fig 4.1 Optimal number of clusters in Elbow Method
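The elbow curve described above can be sketched in a few lines with scikit-learn, whose `KMeans` exposes the WSS as the `inertia_` attribute; the blob dataset below is synthetic, generated purely for illustration.

```python
# Sketch of the elbow method: fit K-means for several k and record the WSS.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares

# WSS keeps falling as k grows; the "elbow" is where the drop flattens out,
# which for this 4-blob dataset appears around k = 4.
```

Plotting `range(1, 10)` against `wss` with matplotlib reproduces the arm-shaped curve of Fig 4.1.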

Purpose Method

In this method, the data is divided based on different metrics, and it is then judged
how well the division performed in each case. For example, the arrangement of the shirts in
the men's clothing department in a mall is done on the criterion of size. It could also be done
on the basis of price or brand. The most suitable criterion would be chosen to give the
optimal number of clusters, i.e., the value of K.

Now let us get back to our given data set. We can calculate the number of clusters,
i.e., the value of K, by using either of the above methods.

Step 1: Initialisation

Firstly, initialize random points called the centroids of the clusters. While
initializing, you must take care that the number of centroids is less than the number of
training data points.

This algorithm is an iterative algorithm hence the next two steps are performed iteratively.


Fig 4.2 Optimal number of clusters in Purpose Method

Step 2: Cluster Assignment

After initialization, all data points are traversed and the distance between each
centroid and each data point is calculated. Clusters are then formed based on the minimum
distance from the centroids. In this example, the data is divided into two clusters.

Fig 4.3 Data is divided into two clusters.

Step 3: Moving Centroid

As the clusters formed in the above step are not yet optimized, we need to refine
them. For this, we move the centroids iteratively to new locations.

Fig 4.4 Move the centroids iteratively to a new location.

Take data points of one cluster, compute their average and then move the centroid of
that cluster to this new location. Repeat the same step for all other clusters.


Step 4: Optimization

The above two steps are done iteratively until the centroids stop moving, i.e., they do
not change their positions anymore and have become static. Once this happens, the k-means
algorithm is said to have converged.

Step 5: Convergence

Now this algorithm has converged and distinct clusters are formed and clearly visible.

This algorithm can give different results depending on how the clusters were initialized in
the first step.
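The five steps above are what scikit-learn's `KMeans` performs internally (initialise, assign, move centroids, repeat until converged); a minimal sketch on a synthetic two-blob dataset, made up for illustration only:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with two well-separated groups
X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = km.labels_              # step 2: cluster assignment for each point
centroids = km.cluster_centers_  # steps 3-4: final centroid positions
n_iterations = km.n_iter_        # steps 4-5: iterations until convergence
```

Running with a different `random_state` can change the result, which illustrates the sensitivity to initialization mentioned above; `n_init=10` restarts the algorithm several times and keeps the best clustering.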

4.3 K-means Clustering – Example:


A pizza chain wants to open its delivery centres across a city. What do you think would be
the possible challenges?

 They need to analyse the areas from which pizza is being ordered frequently.
 They need to understand how many pizza stores have to be opened to cover
delivery in the area.
 They need to figure out the locations of the pizza stores within all these areas in order
to keep the distance between the store and delivery points to a minimum.
 Resolving these challenges involves a lot of analysis and mathematics. We will now
learn how clustering provides a meaningful and easy method of sorting out such
real-life challenges.

4.4 Advantages of K- Means Clustering Algorithm


 Robust,
 Easy to understand,
 Comparatively efficient,


 If data sets are distinct, then it gives the best results,
 Produces tighter clusters,
 When centroids are recomputed, the clusters change,
 Flexible,
 Easy to interpret.

4.5 Disadvantages of K- Means Clustering Algorithm


 Needs prior specification of the number of cluster centres.
 If two clusters overlap heavily, they cannot be distinguished and the algorithm cannot
tell that there are two clusters.
 With different representations of the data, the results achieved are also different.
 Euclidean distance can weight the factors unequally.
 It gives a local optimum of the squared-error function.
 Sometimes choosing the centroids randomly does not give fruitful results.
 Can be used only if the mean is defined.
 Cannot handle outliers and noisy data.
 Does not work well for non-linear datasets.
 Lacks consistency.
 Sensitive to scale.
 Very large data sets can become computationally infeasible.
 Prediction issues.

4.6 Applications of K- Means Clustering Algorithm


 Market segmentation,
 Document clustering,
 Image segmentation,
 Image compression,
 Vector quantization,
 Cluster analysis.


CHAPTER 5

K NEAREST NEIGHBOUR (KNN) ALGORITHM

5.1 What is KNN Algorithm?


K nearest neighbour, or the KNN algorithm, is a simple algorithm which uses the entire
dataset in its training phase. Whenever a prediction is required for an unseen data instance, it
searches through the entire training dataset for the k most similar instances, and the data of
the most similar instances is returned as the prediction.

KNN is often used in search applications where you are looking for similar items, as in
"find items similar to this one".

5.2 Features of KNN Algorithm


The KNN algorithm has the following features:

 KNN is a Supervised Learning algorithm that uses labelled input data set to predict
the output of the data points.
 It is one of the simplest Machine learning algorithms and it can be easily implemented
for a varied set of problems.
 It is mainly based on feature similarity. KNN checks how similar a data point is to its
neighbours and classifies the data point into the class to which it is most similar.
 Unlike most algorithms, KNN is a non-parametric model which means that it does not
make any assumptions about the data set. This makes the algorithm more effective
since it can handle realistic data.

 KNN is a lazy algorithm; this means that it memorizes the training data set instead of
learning a discriminative function from the training data.
 KNN can be used for solving both classification and regression problems.
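As a quick sketch of the last point, KNN regression predicts by averaging the values of the k nearest neighbours; here with scikit-learn's `KNeighborsRegressor` on a tiny made-up dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # y = 2 * x

reg = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# the 2 nearest neighbours of x = 2.5 are x = 2 and x = 3,
# so the prediction is the average of their y values: (4 + 6) / 2 = 5
pred = reg.predict([[2.5]])
```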


5.3 How does KNN algorithm works?


Suppose P1 is the point for which the label needs to be predicted. First, you find the k
closest points to P1, and then classify P1 by a majority vote of its k neighbours. Each
neighbour votes for its class, and the class with the most votes is taken as the prediction. For
finding the closest points, you compute the distance between points using distance measures
such as Euclidean distance, Hamming distance, Manhattan distance and Minkowski distance.
KNN has the following basic steps:

1. Calculate distance

2. Find Closest Neighbours

3. Vote for Labels

Fig 5.1 KNN working diagram
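The three steps (calculate distance, find closest neighbours, vote for labels) are bundled inside scikit-learn's `KNeighborsClassifier`; a minimal sketch on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

# k = 5 neighbours, Euclidean distance by default
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)             # "training" just stores the data points

accuracy = knn.score(X_test, y_test)  # majority vote of the 5 neighbours
```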

5.4 How to decide number of K in KNN algorithms?


Now you understand the KNN algorithm's working mechanism. At this point, the question
arises: how do we choose the optimal number of neighbours, and what is its effect on the
classifier?


The number of neighbours (K) in KNN is a hyperparameter that you need to choose at the
time of model building. You can think of K as a controlling variable for the prediction
model. Research has shown that no single number of neighbours suits all kinds of data sets;
each dataset has its own requirements. With a small number of neighbours, noise has a
higher influence on the result, while a large number of neighbours makes the prediction
computationally expensive. Research has also shown that a small number of neighbours
gives the most flexible fit, which has low bias but high variance, whereas a large number of
neighbours gives a smoother decision boundary, which means lower variance but higher
bias.

Generally, data scientists choose an odd value of K if the number of classes is even.
You can also check by generating the model on different values of K and comparing their
performance. You can also try the elbow method here.
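Checking the model on different values of K, as suggested above, can be sketched with cross-validation; odd values of K are tried, following the text's recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 12, 2):  # odd values of K only
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)  # the K with the best mean accuracy
```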

5.5 Advantage of KNN algorithm


 The algorithm is simple and easy to implement.
 There's no need to build a model, tune several parameters, or make additional
assumptions.
 The algorithm is versatile. It can be used for classification, regression, and search.
 The training phase of K-nearest neighbour classification is much faster compared to
other classification algorithms. There is no need to train a model for generalization,
which is why KNN is known as a simple, instance-based learning algorithm.
 KNN can be useful in the case of nonlinear data. It can be used for regression
problems: the output value for an object is computed as the average of the values of
its k closest neighbours.


5.6 Disadvantage of KNN


 The algorithm gets significantly slower as the number of examples and/or
predictors/independent variables increases.
 The testing phase of K-nearest neighbour classification is slower and costlier in terms
of time and memory. It requires large memory for storing the entire training data set
for prediction.
 KNN requires scaling of the data because KNN uses the Euclidean distance between
two data points to find the nearest neighbours. Euclidean distance is sensitive to
magnitudes: features with high magnitudes will weigh more than features with low
magnitudes.
 KNN is also not suitable for high-dimensional data.


CHAPTER 6

LINEAR REGRESSION
In linear regression, the relationships are modelled using linear predictor functions
whose unknown model parameters are estimated from the data. Such models are called linear
models. Most commonly, the conditional mean of the response given the values of the
explanatory variables (or predictors) is assumed to be an affine function of those values; less
commonly, the conditional median or some other quantile is used. Like all forms of
regression analysis, linear regression focuses on the conditional probability distribution of
the response given the values of the predictors, rather than on the joint probability
distribution of all of these variables, which is the domain of multivariate analysis.

6.1 Definition
Linear Regression establishes a relationship between dependent variable (Y) and one
or more independent variables (X) using a best fit straight line (also known as regression
line).

It is represented by an equation Y = a + b * X + e, where a is the intercept, b is the slope of
the line and e is the error term. This equation can be used to predict the value of the target
variable based on the given predictor variable(s).

Fig 6.1 Regression analysis

How do we obtain the best-fit line (the values of a and b)?

This task can be easily accomplished by the Least Squares Method. It is the most common
method used for fitting a regression line.


It calculates the best-fit line for the observed data by minimizing the sum of the
squares of the vertical deviations from each data point to the line. Because the deviations are
first squared, when added, there is no cancelling out between positive and negative values.

Fig 6.2 Observations are assumed to be the result of random deviations
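A minimal sketch of the least-squares fit of Y = a + b * X with NumPy; the five data points below are made up purely for illustration.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # roughly Y = 2 * X

# np.polyfit minimises the sum of squared vertical deviations;
# for degree 1 it returns [slope, intercept]
b, a = np.polyfit(X, Y, deg=1)

Y_pred = a + b * X  # predictions from the fitted line
```

For this data the fitted slope b comes out close to 2 and the intercept a close to 0, matching the trend the points were drawn around.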

6.2 Advantages
Linear regression is an extremely simple method. It is very easy and intuitive to use
and understand. A person with only the knowledge of high school mathematics can
understand and use it. In addition, it works in most cases. Even when it doesn't fit the
data exactly, we can use it to find the nature of the relationship between two variables.

6.3 Disadvantage
 By its definition, linear regression only models relationships between dependent and
independent variables that are linear. It assumes there is a straight-line relationship
between them, which is sometimes incorrect. Linear regression is also very sensitive
to anomalies in the data (outliers).
 Suppose, for example, that most of your data lies in the range 0-10. If for any reason a
single data item falls outside this range, say at 15, it significantly influences the
regression coefficients.
 Another disadvantage is that if we have more parameters than the number of samples
available, the model starts to model the noise rather than the relationship between
the variables.


CHAPTER-7

MULTIPLE LINEAR REGRESSION


Multiple linear regression (MLR), also known simply as multiple regression, is a
statistical technique that uses several explanatory variables to predict the outcome of a
response variable. The goal of multiple linear regression (MLR) is to model the linear
relationship between the explanatory (independent) variables and response (dependent)
variable.

A simple linear regression is a function that allows an analyst or statistician to make
predictions about one variable based on the information that is known about another
variable. Linear regression can only be used when one has two continuous variables: an
independent variable and a dependent variable. The independent variable is the parameter
that is used to calculate the dependent variable or outcome.

7.1 Definition
In many cases, there may be more than one predictor variable available for finding
out the value of the response variable. A simple linear model then cannot be used, and
Multiple Linear Regression is needed to analyse the predictor variables. Using two
explanatory variables, we can write the equation of Multiple Linear Regression as follows:

yi = β0 + β1x1i + β2x2i + εi

The two explanatory variables x1i and x2i determine yi for the ith data point. The
response is determined by the three parameters β0, β1 and β2 of the model, and by the
residual εi of the point i from the fitted surface.

General multiple regression models can be represented as:

yi = β0 + Σj βjxji + εi
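The equation above can be fitted directly with NumPy's least-squares solver. The two explanatory variables below are made up, with the response constructed as y = 1 + 2·x1 + 3·x2 so the coefficients to be recovered are known in advance:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 1 + 2 * x1 + 3 * x2  # beta0 = 1, beta1 = 2, beta2 = 3

# design matrix: a column of ones for the intercept beta0, then x1 and x2
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta is approximately [1, 2, 3] = [beta0, beta1, beta2]
```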

7.2 Examples of Multiple Regression


A real estate agent could use Multiple Regression to analyze the value of houses. For
example, she could use as independent variables the size of the houses, their ages, the number
of bedrooms, the average home price in the neighbourhood and the proximity to schools.


Plotting these in a multiple regression model, she could then use these factors to see their
relationship to the prices of the homes as the criterion variable.

Another example of using a multiple regression model could be someone in human


resources determining the salary of management positions – the criterion variable.

The predictor variables could be each manager's seniority, the average number of
hours worked, the number of people being managed and the manager's departmental budget.

7.3 Advantages of multiple regression


 Multiple regression allows an analyst to model the combined effect of several
explanatory variables on the response, which a simple linear regression cannot do.
 It helps determine the relative influence of each predictor variable on the criterion
variable.
 Because observations that fit the model poorly stand out against the fitted surface, it
can also help to identify outliers or anomalies in the data.

7.4 Disadvantages of multiple regression


 A multiple regression model's weaknesses usually come down to the data being used.
Two examples of this are using incomplete data and falsely concluding that a correlation is causation.
When reviewing the price of homes, for example, suppose the real estate agent looked
at only 10 homes, seven of which were purchased by young parents. In this case, the


relationship between the proximity of schools may lead her to believe that this had an
effect on the sale price for all homes being sold in the community.
 This illustrates the pitfalls of incomplete data. Had she used a larger sample, she could
have found that, out of 100 homes sold, only ten percent of the home values were
related to a school's proximity. If she had used the buyers' ages as a predictor value,
she could have found that younger buyers were willing to pay more for homes in the
community than older buyers.
 In the example of management salaries, suppose there was one outlier who had a
smaller budget, less seniority and with fewer personnel to manage but was making
more than anyone else. The HR manager could look at the data and conclude that this
individual is being overpaid. However, this conclusion would be erroneous if he didn't
take into account that this manager was in charge of the company's website and had a
highly coveted skillset in network security.


CHAPTER 8
POLYNOMIAL REGRESSION
Polynomial Regression is a form of linear regression in which the relationship
between the independent variable x and the dependent variable y is modelled as an nth-degree
polynomial. Polynomial regression fits a nonlinear relationship between the value of x and
the corresponding conditional mean of y, denoted E(y|x).

8.1 Definition
Polynomial regression is a form of regression analysis in which the relationship
between the independent variable x and the dependent variable y is modelled as an nth degree
polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x
and the corresponding conditional mean of y, denoted E(y |x), and has been used to describe
nonlinear phenomena such as the growth rate of tissues, the distribution of carbon isotopes in
lake sediments, and the progression of disease epidemics. Although polynomial regression
fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense
that the regression function E(y | x) is linear in the unknown parameters that are estimated
from the data. For this reason, polynomial regression is considered to be a special case of
Multiple Linear Regression.

The explanatory (independent) variables resulting from the polynomial expansion of


the “baseline” variables are known as higher-degree terms. Such variables are also used in
classification settings.

Fig 8.1 Formula of Polynomial Regression

At the same time, PR has its own unique cases. When we have a problem, we
might try an SLR and MLR first and see what happens; with a PR we sometimes obtain
better results. For example, PR is used to observe how epidemics spread across a population,
and similar use cases. So it's a matter of what we want to predict, and of course it's always
good to have more tools in our arsenal. But at this point you might ask yourself why PR is
still called linear regression.


We use PR when the SLR straight line doesn't fit our observations well and we want
to obtain a parabolic effect:

Fig 8.2 Polynomial Regression

Well, the trick here is that when we talk about linear and nonlinear models, we are not
thinking in terms of the variables but of the coefficients. So the question is whether the
function can be expressed as a linear combination of the coefficients; since our ultimate goal
is to find these coefficient values, this is why "linear" and "nonlinear" refer to the
coefficients. So PR is a special case of MLR rather than a brand new type of regression.
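Because PR is a special case of MLR, it can be fitted by expanding x into higher-degree terms and running an ordinary linear model; a sketch with scikit-learn on a made-up, noise-free parabola:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 0.5 * x.ravel() ** 2  # a parabola

# expand x into the higher-degree terms [x, x^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)

# the model is linear in the coefficients, even though the curve is not
model = LinearRegression().fit(X_poly, y)
# model.intercept_ is close to 1.0 and model.coef_ close to [2.0, 0.5]
```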

8.2 Advantages of using Polynomial Regression


 A polynomial provides the best approximation of the relationship between the
dependent and independent variable.
 A broad range of functions can be fit under it.
 A polynomial basically fits a wide range of curvature.

8.3 Disadvantages of using Polynomial Regression


 The presence of one or two outliers in the data can seriously affect the results of the
nonlinear analysis.
 Polynomial models are very sensitive to outliers.
 In addition, there are unfortunately fewer model-validation tools for the detection of
outliers in nonlinear regression than there are for linear regression.


CHAPTER 9
PROJECT DESCRIPTION

NAME: PREDICTION OF HEIGHT AND WEIGHT

ALGORITHM: MULTIPLE LINEAR REGRESSION

In this project, we will develop and evaluate the performance and the predictive power of a
model trained and tested on data collected from houses in Boston's suburbs. Once we get a
good fit, we will use this model to predict the monetary value of a house located in the
Boston area. A model like this would be very valuable for a real estate agent, who could
make use of the information provided on a daily basis.

9.1 GETTING THE DATA AND PREVIOUS PREPROCESS


The dataset used in this project comes from the UCI Machine Learning Repository. This data was
collected in 1978 and each of the 506 entries represents aggregate information about 14 features of homes
from various suburbs located in Boston.


The features can be summarized as follows:


 CRIM: This is the per capita crime rate by town
 ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft.
 INDUS: This is the proportion of non-retail business acres per town.
 CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0
otherwise)
 NOX: This is the nitric oxides concentration (parts per 10 million)
 RM: This is the average number of rooms per dwelling
 AGE: This is the proportion of owner-occupied units built prior to 1940
 DIS: This is the weighted distances to five Boston employment centers
 RAD: This is the index of accessibility to radial highways
 TAX: This is the full-value property-tax rate per $10,000
 PTRATIO: This is the pupil-teacher ratio by town
 B: This is calculated as 1000(Bk - 0.63)², where Bk is the proportion of people of African
American descent by town
 LSTAT: This is the percentage lower status of the population
 MEDV: This is the median value of owner-occupied homes in $1000s

9.2 DATASET:

Fig 9.1 Dataset



Using Technology: U.S. Economy Case Study

U.S. economic data 1976 to 1987


X1 = dollars/barrel crude oil
X2 = % interest on ten yr. U.S. treasury notes
X3 = foreign investments/billions of dollars
X4 = Dow Jones industrial average
X5 = GNP/billions of dollars
X6 = purchasing power U.S. dollar (1983 base)
X7 = consumer debt/billions of dollars
Reference: Statistical Abstract of the United States 103rd and 109th edition

Fig 9.2 Python Code

9.3 PYTHON CODE

Fig 9.3 OLS Regression Result


This implies that if the y_pred and y_test values are approximately similar, then the applied
algorithm suits the given data set.

9.4 PROJECT IMPLEMENTATION

9.4.1 PYTHON LIBRARIES USED


 NumPy: It is a library for the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large collection of high-
level mathematical functions to operate on these arrays.
 Matplotlib: It is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for
embedding plots into applications using general-purpose GUI toolkits like Tkinter,
wxPython, Qt, or GTK+.
 Pandas: In computer programming, pandas is a software library written for the
Python programming language for data manipulation and analysis. In particular, it
offers data structures and operations for manipulating numerical tables and time
series.

9.4.2 DATASET
A CSV file is used as the input dataset. The name of the CSV file is “mlr11.csv”. This
file consists of 11 rows and 8 columns.


9.5 FOR THE PURPOSE OF THE PROJECT THE DATASET HAS BEEN
PREPROCESSED
 The essential features for the project are ‘RM’, ‘LSTAT’, ‘PTRATIO’ and ‘MEDV’. The
remaining features have been excluded.
 16 data points with a ‘MEDV’ value of 50.0 have been removed, as they likely contain censored or
missing values.
 1 data point with an ‘RM’ value of 8.78 is considered an outlier and has been removed for the
optimal performance of the model.
 As this data is out of date, the ‘MEDV’ value has been scaled multiplicatively to account for 35
years of market inflation.
We’ll now open a Python 3 Jupyter Notebook and execute the following code snippet to load the dataset
and remove the non-essential features, receiving a success message if the actions were correctly performed.
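The preprocessing steps above can be sketched as follows. To keep the snippet self-contained, a small in-memory sample frame stands in for the CSV file, and the inflation factor of 1.35 is a hypothetical stand-in, since the exact scaling value is not given here:

```python
import pandas as pd

# In the notebook this frame would come from pd.read_csv(...); here a small
# in-memory sample (values invented for illustration) stands in.
data = pd.DataFrame({
    "RM":      [6.5, 8.78, 5.9, 7.1],
    "LSTAT":   [4.0, 3.0, 12.0, 6.5],
    "PTRATIO": [15.3, 17.8, 19.0, 16.1],
    "MEDV":    [24.0, 50.0, 18.9, 30.5],
    "CRIM":    [0.02, 0.08, 0.11, 0.05],   # a non-essential feature to drop
})

# 1. Keep only the essential features
data = data[["RM", "LSTAT", "PTRATIO", "MEDV"]]

# 2. Remove data points with a censored MEDV value of 50.0
data = data[data["MEDV"] != 50.0]

# 3. Remove the RM = 8.78 outlier
data = data[data["RM"] != 8.78]

# 4. Scale MEDV multiplicatively to account for inflation
INFLATION_FACTOR = 1.35  # hypothetical value for illustration
data["MEDV"] = data["MEDV"] * INFLATION_FACTOR

print(f"Dataset has {len(data)} points with {data.shape[1]} features each.")
```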

9.6 SIMPLE CODE


import numpy as np
import pandas as pd

# The first 13 columns are the features; the last column is the target
dataset=pd.read_csv('train.csv')
x=dataset.iloc[:,0:13].values
y=dataset.iloc[:,13:].values

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=(0.25),random_state=5)

from sklearn.linear_model import LinearRegression


regressor=LinearRegression()
regressor.fit(x_train,y_train)

y_pred=regressor.predict(x_test)


import statsmodels.api as sm

# Add a column of ones so the OLS model includes an intercept term
x=np.append(arr=np.ones((333,1)).astype(int),values=x,axis=1)

# Backward elimination: fit OLS, inspect the p-values in the summary,
# then drop the least significant feature and refit
x_opt=x[:,[0,1,2,3,4,5,6,7,8,9,10,11,12,13]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()

x_opt=x[:,[0,1,2,3,5,6,7,8,9,10,11,12,13]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()

x_opt=x[:,[0,2,3,5,6,7,8,9,10,11,12,13]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()

x_opt=x[:,[0,2,5,6,7,8,9,10,11,12,13]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()

x=x_opt

from sklearn.model_selection import train_test_split


x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=(0.25),random_state=5)

from sklearn.linear_model import LinearRegression


regressor=LinearRegression()
regressor.fit(x_train,y_train)

y_pred=regressor.predict(x_test)


9.7 PREDICTED OUTPUT

The explanatory (independent) variables resulting from the polynomial expansion of the
“baseline” variables are known as higher-degree terms. Such variables are also used in
classification settings. Notice that polynomial regression (PR) is very similar to MLR, but
instead of several different variables it considers a single variable X1 raised to different
powers; essentially, the model is built from different powers of the same original variable.
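The idea of regressing on powers of a single variable X1 can be sketched with scikit-learn's PolynomialFeatures; the quadratic data below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# One explanatory variable x1; the target is quadratic in it
x1 = np.linspace(-3, 3, 40).reshape(-1, 1)
y = 1.0 + 2.0 * x1[:, 0] + 0.5 * x1[:, 0] ** 2

# Expand x1 into [1, x1, x1^2]: the higher-degree terms described above
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x1)

# Ordinary linear regression on the expanded columns is polynomial regression
# (fit_intercept=False because the bias column is already in x_poly)
model = LinearRegression(fit_intercept=False).fit(x_poly, y)
print(np.round(model.coef_, 3))
```

Because the target is an exact quadratic in x1, the fitted coefficients recover the generating values 1.0, 2.0 and 0.5.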


CONCLUSION

We have given a simple overview of some techniques and algorithms in machine learning.


Furthermore, more and more applications are adopting machine learning as a solution; in the
future, machine learning will play an important role in our daily life.

Throughout this report we built a machine learning regression project from end to end, and we
learned and obtained several insights about regression models and how they are developed.

This was the first of the machine learning projects developed in this series; a natural next
step would be an introduction to the theory and concepts of classification algorithms.

