
FOUR WEEKS SUMMER INSTITUTIONAL TRAINING REPORT

On

[Machine Learning using Python]

at

Udemy
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE

AWARD OF THE DEGREE OF

BACHELOR OF TECHNOLOGY

(Computer Science and Engineering)

SUBMITTED BY:

NAME: Tanveer Singh

UNI. ROLL NO.: 2001352

JULY, 2021

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

BABA BANDA SINGH BAHADUR ENGINEERING COLLEGE

FATEHGARH SAHIB

CANDIDATE’S DECLARATION
I, Tanveer Singh, hereby declare that I have undertaken four weeks' training at Udemy during the period from 1 July 2021 to 24 July 2021. The project entitled "Machine Learning using Python", submitted by Tanveer Singh (2001352) in partial fulfillment of the requirements for the award of the degree of B.Tech. (Computer Science and Engineering) at Baba Banda Singh Bahadur Engineering College, Fatehgarh Sahib, is an authentic record of my own work carried out during four weeks' Summer Institutional Training. The matter presented in this project has not formed the basis for the award of any other degree, diploma, fellowship or any other similar title.

Signature of the Student


Place: Mohali
Date:

E-CERTIFICATE

ACKNOWLEDGEMENT

I express my sincere gratitude to I. K. Gujral Punjab Technical University, Jalandhar, for giving me the opportunity to undergo four weeks' Summer Institutional Training after the 2nd semester of my B.Tech.

I would like to thank Dr. Lakhvir Singh (Principal) and Dr. Kanwalvir Singh Dhindsa (Head of Department, CSE) at Baba Banda Singh Bahadur Engineering College, Fatehgarh Sahib, for their kind support.

I also owe my sincere gratitude to Mr. Andrei Neagoie for his valuable advice and healthy criticism throughout my training, which helped me immensely in completing my work successfully.

I would also like to thank everyone who has knowingly and unknowingly helped me throughout my
work.

Last but not least, a word of thanks to the authors of all the books and papers I consulted during my training and while preparing this report.

ABSTRACT

In this project, I experimented with a real-world dataset to explore how machine learning algorithms can be used to find patterns in data and make predictions from those patterns. I expected to gain experience working with a common dataset and standard machine learning libraries. After performing the required tasks on a dataset of my choice, I present my findings in this report.

Table of Contents

Declaration of the Student
E-Certificate
Acknowledgement
Abstract

1. INTRODUCTION TO COMPANY

2. INTRODUCTION TO PROJECT
2.1.1 Objective
2.1.2 Project Overview

3. TRAINING WORK UNDERTAKEN
3.1.1 Hardware Specification
3.2.1 Software Specification
3.2.2 Python
3.2.3 Jupyter Notebook (Anaconda)
3.2.4 Libraries Used
3.3.1 Flowcharts

4. RESULT

5. CONCLUSION AND FUTURE SCOPE

6. REFERENCES

INTRODUCTION TO COMPANY
Udemy

Udemy, Inc. is an American massive open online course (MOOC) provider aimed at professional adults
and students. It was founded in May 2010 by Eren Bali, Gagan Biyani, and Oktay Caglar.

Udemy is a platform that allows instructors to build online courses on their preferred topics. Using
Udemy's course development tools, they can upload videos, PowerPoint presentations, PDFs, audio, ZIP
files and live classes to create courses. Instructors can also engage and interact with users via online
discussion boards.

Courses are offered across a breadth of categories, including business and entrepreneurship, academics,
the arts, health and fitness, language, music, and technology. Most classes are in practical subjects such
as Excel software or using an iPhone camera. Udemy also offers Udemy for Business, which gives businesses access to a targeted suite of over 7,000 training courses on topics from digital marketing tactics to office
productivity, design, management, programming, and more. With Udemy for Business, organizations can
also create custom learning portals for corporate training.

Courses on Udemy can be paid or free, depending on the instructor. In 2015, the top 10 instructors made
more than $17 million in total revenue.
Massive open online course (MOOC)

Udemy is part of the growing MOOC movement available outside the traditional university system, and
has been noted for the variety of courses offered.

INTRODUCTION TO PROJECT
Objective

I chose a dataset related to Iris flower species. This project predicts whether an Iris flower is 'setosa', 'versicolor', or 'virginica' based on its sepal length, sepal width, petal length, and petal width.

Dataset Description

The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in
Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

It includes three iris species with 50 samples each as well as some properties about each flower. One
flower species is linearly separable from the other two, but the other two are not linearly separable from
each other.

The columns in this dataset are:

• Id

• Sepal Length (Cm)

• Sepal Width (Cm)

• Petal Length (Cm)

• Petal Width (Cm)

• Species

Dataset source: https://www.kaggle.com/uciml/iris (however, I imported the dataset through the scikit-learn library)

Figure 1
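As noted above, the dataset was imported through scikit-learn rather than downloaded from Kaggle. A minimal sketch of that import (assuming scikit-learn is installed):

```python
# Loading the Iris dataset directly from scikit-learn -- no external
# file download is needed, since it ships with the library.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4) -- 150 samples, 4 measurements each
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```

Note that the scikit-learn copy has no `Id` column; the sample index plays that role.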

Project Overview

Machine Learning is the science of getting computers to learn without being explicitly programmed. It is
closely related to computational statistics, which focuses on making predictions using computers. Machine
Learning focuses on the development of computer programs that can access data and use it to learn for
themselves. The process of learning begins with observations or data, such as examples providing inputs
and outputs, in order to look for patterns in data and make better decisions in the future based on the
examples that we provide. The primary aim is to allow computers to learn automatically, without human
intervention or assistance, and adjust their actions accordingly.
Types of Machine Learning

Machine learning algorithms differ in their approach, the type of data they take as input and output,
and the type of task they are intended to solve. Broadly, machine learning can be divided into three
categories.

I. Supervised Learning: Supervised learning is a type of learning in which we are given a data set and
already know what the correct output should look like, with the idea that there is a relationship between
the input and the output. Basically, it is the task of learning a function that maps an input to an output
based on example input-output pairs. It deduces a function from labeled training data consisting of a set
of training examples.

II. Unsupervised Learning: Unsupervised learning allows us to approach problems with little or no idea
of what our results should look like. We can derive structure by clustering the data based on
relationships among the variables in the data. Basically, it is a type of self-organized learning that helps
find previously unknown patterns in a data set without pre-existing labels.

III. Reinforcement Learning: Reinforcement learning is a learning method in which an agent interacts with its
environment by producing actions and discovers errors or rewards. Trial-and-error search and delayed
reward are the most relevant characteristics of reinforcement learning. This method allows machines and
software agents to automatically determine the ideal behavior within a specific context in order to
maximize their performance.
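The contrast between the first two categories can be sketched on the Iris data itself: a supervised classifier is given the species labels, while an unsupervised clusterer must group the measurements on its own. (The KNN and KMeans choices here are just illustrative; the model used in this project is discussed later.)

```python
# Supervised vs. unsupervised learning on the same measurements.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learn a mapping from measurements to the known labels y.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Unsupervised: group the measurements into 3 clusters, labels unseen.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(clf.predict(X[:1]))  # a species label (0 = setosa)
print(km.labels_[:5])      # arbitrary cluster ids, not species labels
```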

Basic steps for creating a Machine Learning model:

• Importing the data: The pandas library is widely used for importing data, but I imported this
dataset from the scikit-learn library.

• Cleaning or preprocessing the data: Machine learning algorithms do not work well on raw data.
Before we can feed such data to an ML algorithm, we must preprocess it by applying some
transformations. With data preprocessing, we convert raw data into a clean data set: we perform steps
like removing duplicate values, removing NULL values, replacing 0 with the mean, etc., because this
improves the results.

However, the iris dataset is already a clean dataset.

• Splitting the data into training and test sets: Before feeding data to an ML algorithm, we first
split it into two sets, a training set and a test set, so that the algorithm can train itself on the
training set and the model can be evaluated on the test set. Usually this split is 80-20, i.e., 80%
training set and 20% test set. For splitting we use the train_test_split function from the scikit-learn
library.

• Choosing a model: In this project I have used the k-nearest neighbors (KNN) algorithm.

KNN classifier: k-nearest neighbors is a supervised learning algorithm that can be used for both
classification and regression. It stores the training examples and classifies each new case by a
majority vote of its k nearest neighbors, measured with a distance function (usually Euclidean
distance): the class assigned to a new case is the one most common among the k training points
closest to it.

Figure 2
• Checking the output: Checking the output includes comparing the accuracy obtained with different models.

• Improving: If the output isn't accurate enough, we can improve it by changing the model or cleaning
the data more thoroughly.
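The steps above can be sketched end to end as follows. This is a minimal version of the workflow, not the exact notebook code; the random seed and k=5 are illustrative choices.

```python
# End-to-end sketch: import the data, split 80/20, fit KNN, check accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Step 1: import the data (already clean, no preprocessing needed).
X, y = load_iris(return_X_y=True)

# Step 2: 80/20 train/test split; fixed seed so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

# Step 3: choose and train the model.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Step 4: check the output on the held-out test set.
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))  # typically above 0.9 on iris
```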

TRAINING WORK UNDERTAKEN
Hardware specification

Minimum hardware requirement:

• Operating system: Windows 8 or newer, 64-bit macOS 10.13+, or Linux, including Ubuntu,
RedHat, CentOS 7+, and others.

• System architecture: Windows- 64-bit x64, 32-bit x86; MacOS- 64-bit x86; Linux- 64-bit x86, 64-
bit aarch64 (AWS Graviton2 / arm64), 64-bit Power8/Power9, s390x (Linux on IBM Z & Linux
ONE).

• Minimum 5 GB disk space to download and install Anaconda. (SSD recommended)

• Minimum 4 GB RAM to run Anaconda.

Hardware specifications I used:

• Operating system: Windows 10

• System architecture: Windows- 64-bit x64

• 8 GB of RAM

• SSD

Software Specification

Python is a widely used general-purpose, high-level programming language. It was initially designed by
Guido van Rossum, first released in 1991, and is developed by the Python Software Foundation. It was
designed with an emphasis on code readability, and its syntax allows programmers to express concepts in
fewer lines of code. It supports multiple programming paradigms, including procedural, object-oriented,
and functional programming. Python is often described as a "batteries included" language due to its
comprehensive standard library.

Features

• Interpreted: In Python there are no separate compilation and execution steps as in C/C++. The program
runs directly from the source code: internally, Python converts the source code into an intermediate form
called bytecode, which is then translated into the native language of the specific machine to run it.

• Platform Independent: Python programs can be developed and executed on multiple operating system
platforms. Python can be used on Linux, Windows, macOS, Solaris and many more.

• Multi-Paradigm: Python is a multi-paradigm programming language. Object-oriented programming and
structured programming are fully supported, and many of its features support functional programming
and aspect-oriented programming.

• Simple: Python is a very simple language. It is very easy to learn, as it reads close to English. In
Python, the emphasis is on the solution to the problem rather than on the syntax.

• Rich Library Support: Python's standard library is vast. It can help with various tasks involving
regular expressions, documentation generation, unit testing, threading, databases, web browsers, CGI,
email, XML, HTML, WAV files, cryptography, GUIs and many more.

• Free and Open Source: Firstly, Python is freely available. Secondly, it is open source, which means its
source code is available to the public. We can download it, change it, use it, and distribute it. This is
called FLOSS (Free/Libre and Open-Source Software). As the Python community, we're all headed
toward one goal: an ever-better Python.

Figure 3

Jupyter Notebook (Anaconda)

The Jupyter Notebook is an open-source web application that allows data scientists to create and share
documents that integrate live code, equations, computational output, visualizations, and other multimedia
resources, along with explanatory text in a single document. You can use Jupyter Notebooks for all sorts
of data science tasks including data cleaning and transformation, numerical simulation, exploratory data
analysis, data visualization, statistical modeling, machine learning, deep learning, and much more.

A Jupyter Notebook provides an easy-to-use, interactive data science environment that works not only as
an integrated development environment (IDE) but also as a presentation and educational tool. Jupyter is a
way of working with Python inside a virtual "notebook" and is growing in popularity with data scientists
in large part due to its flexibility. It gives you a way to combine code, images, plots, comments, etc., in
alignment with the steps of the "data science process." Further, it is a form of interactive computing, an
environment in which users execute code, see what happens, modify, and repeat, in a kind of iterative
conversation between the data scientist and the data. Data scientists can also use notebooks to create
tutorials or interactive manuals for their software.

Figure 4

Libraries used

1. Scikit Learn: Scikit-learn (formerly scikits.learn and also known as sklearn) is a free
software machine learning library for the Python programming language. It features
various classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy.

(sklearn.datasets): This module comes with some small standard datasets that do not require
downloading any files from an external website. They are as follows:

i) load_boston

ii) load_iris

iii) load_diabetes

iv) load_digits

v) load_linnerud

vi) load_wine

vii) load_breast_cancer

(train_test_split from sklearn.model_selection): Used for splitting data into training and test sets.

(KNeighborsClassifier from sklearn.neighbors): Used for applying the KNN algorithm to the dataset
after splitting the data.

(metrics from sklearn): Used to calculate the accuracy of the model.

2. Joblib: There are several reasons to integrate joblib tools as part of an ML pipeline: the ability to
cache results, which avoids recomputing some of the steps, and parallelization, to fully utilize all the
cores of the CPU/GPU.

Beyond this, there are several other reasons why I would recommend joblib:

• Can be easily integrated

• No specific dependencies

• Saves cost and time

• Easy to learn
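The two joblib capabilities mentioned above can be sketched as follows. The file and directory names (`joblib_cache`, `knn_iris.joblib`) are illustrative choices, not part of the original project.

```python
# Sketch of joblib in an ML pipeline: cache an expensive step, and
# persist a fitted model to disk so it need not be retrained.
from joblib import Memory, dump, load
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

memory = Memory("./joblib_cache", verbose=0)

@memory.cache  # repeat calls reuse the cached result instead of retraining
def train_model():
    X, y = load_iris(return_X_y=True)
    return KNeighborsClassifier(n_neighbors=5).fit(X, y)

model = train_model()
dump(model, "knn_iris.joblib")      # save the fitted model to disk
restored = load("knn_iris.joblib")  # reload it later without retraining
```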

Flowcharts

Figure 5(Workflow ML)

Figure 6

RESULT
Result: My project successfully classifies iris flowers into their three species, based on petal and sepal
dimensions, with 96.6% accuracy.

CONCLUSION AND FUTURE SCOPE
Conclusion

This training has introduced us to Machine Learning. We now know that machine learning is a technique
of training machines to perform the activities a human brain can do, a bit faster and better than an average
human being. Today, machines can beat human champions in games such as chess and Mahjong, which
are considered very complex, and machines can be trained to perform human activities in several areas
and aid humans in living better lives.

Machine learning is a quickly growing field in computer science. It has applications in nearly every other
field of study and is already being deployed commercially, because machine learning can solve problems
too difficult or time-consuming for humans to solve. In general terms, a variety of models are used to
learn patterns in data and make accurate predictions based on the patterns they observe.

Machine learning can be supervised or unsupervised. If we have a smaller amount of clearly labelled data
for training, we opt for supervised learning. Unsupervised learning generally gives better performance
and results for large data sets, and if a huge data set is easily available, we go for deep learning
techniques.

Finally, when it comes to developing machine learning models of our own, we looked at the choices of
development languages, IDEs and platforms. The next step is to start learning and practicing each
machine learning technique. The subject is vast in breadth, but if we consider the depth, each topic can be
learned in a few hours, and each topic is independent of the others. We should take one topic at a time:
learn it, practice it, and implement its algorithm(s) in a language of our choice. This is the best way to
start studying machine learning. Practicing one topic at a time, we can soon acquire the breadth that is
eventually required of a machine learning expert.

Future Scope

The future of machine learning is as vast as the limits of the human mind. We can always keep learning
and teaching computers how to learn, while wondering how some of the most complex machine learning
algorithms have been running in the back of our own minds so effortlessly all this time.

There is a bright future for machine learning. Companies like Google, Quora, and Facebook hire people
with machine learning skills, and there is intense research in machine learning at the top universities in
the world. The global machine-learning-as-a-service market is growing rapidly, mainly due to the
Internet revolution: connecting the world virtually has generated vast amounts of data, which is boosting
the adoption of machine learning solutions. Considering all these applications and the dramatic
improvements that ML has brought us, it doesn't take a genius to realize that in the coming future we will
see even more advanced applications of ML, applications that will stretch its capabilities to an
unimaginable level.

REFERENCES

• https://www.udemy.com/course/complete-python-developer-zero-to-mastery/

• https://scikit-learn.org/stable/

• https://www.kaggle.com/uciml/iris

• https://anaconda.org/anaconda

• https://www.geeksforgeeks.org/machine-learning/

• https://docs.python.org/3/

• https://tutorialspoint.com/machine_learning/index.htm

