
DEPARTMENT

OF
COMPUTER SCIENCE AND ENGINEERING

WSS-SOURCE BOOK

BIG DATA AND MACHINE LEARNING

WSCS17
Prepared by:

Mr. P.S.Dinesh (AP-II/ CSE)


Mr.A.Vignesh Kumar (AP-II/ CSE)
Ms. M.Karthiga (AP/CSE)
Ms. T. Kanimozhi (AP/CSE)
Ms. K.Dhana Shree (AP/ CSE)
Mr. R. Nirmalan (AP/ CSE)
Ms. U.Supriya (AP/ CSE)
Ms.B.Janani (AP/ CSE)
Ms.D.Prabhadevi (AP/ CSE)
Ms.P.Swathy Priyadarshini (AP/CSE)
Mr.V.Krishnamoorthy (AP/ CSE)
Ms. R.S.Soundariya (AP/ CSE)
Dr. P. Swathi Priyadarshini (AP/CSE)
Mr. S.Nithyanantham (AP-II/CSE)
Ms. P. Dhivya (AP/CSE)

With reference:

Technical Description- TDFS10_EN -WorldSkills International-Machine Learning and Big Data

HOD/CSE WSTC Coordinator Dean PDS

Description of the Skill:

Web technology refers to the means by which computers communicate with each other using
markup languages and multimedia packages. It gives us a way to interact with hosted information, such as
websites. In order to make websites look and function a certain way, web developers use different
languages. The three core languages that make up the World Wide Web are HTML, CSS, and JavaScript,
which together form the backbone of most webpages.

Client-Side Mark-up and Scripting

Client-side technologies are those that operate in the browser, with no need to interact with
the server. These languages are generally very easy to use, and we can try them out right on our own
computer.

HTML: Hypertext Mark-up Language

It is a basic mark-up language that we will use to create the structure of our web pages. We can
think of this as the framing for the house that we are building. It is the most basic and essential part of
our website: it gives our house shape, rooms, and structure.

CSS: Cascading Style Sheets

It is used to create the decoration for our website. CSS describes how a web page should look in
the browser. The key to good web page creation is to completely separate the presentation (CSS) from the
structure of our site (HTML). This way it is easy to make changes to the look of our site without
changing all of the HTML files.

JavaScript

JavaScript is a simple scripting language used to make things happen on the web page. As a
client-side language, JavaScript only works within the browser window. It cannot retrieve, create, or store
data on the server; it can only manipulate things within the browser.

Server-Side Programming Languages


There are many server-side programming languages that we can use on our websites. These are
languages that interact with the web server to manipulate data. Server-side languages are used to send
form data, store passwords and login information, or otherwise store and retrieve data from the server.
The most common server-side languages are PHP (PHP Hypertext Preprocessor) and ASP (Active Server
Pages). Higher-end technologies used to create complex web applications include .NET, ASP.NET,
Ruby on Rails, and JSP.

The Relevance and Significance of this Document:


This document contains information about the standards required to carry out the skill-training
program, the tasks to be performed in day order, and the end project (termed the Test Project), along with
the assessment principles, methods, and procedures that govern the training program.

CONTENTS

Section 1: Work Organization and Management (No. of days: 1, Weightage: 10)
Description:
● Principles and practices that enable productive team work
● The principles and behavior of systems
● The aspects of systems that contribute to sustainable products, strategies and practices
● How to take initiatives and be enterprising in order to identify, analyze and evaluate information from a variety of sources
● How to identify multiple solutions to a problem and offer them as options against time, budget, and other constraints
● How to use existing available tools to create proper solutions to a problem and requirement
Topics covered:
● Troubleshoot common web design and development problems
● Take into account time limitations and deadlines
● Debug and handle errors

Section 3: Website Design (No. of days: 4, Weightage: 15)
Description:
● How to follow design principles and patterns in order to produce aesthetically pleasing, creative, and accessible interfaces
● Issues related to the cognitive, social, cultural, accessible, technological and economic contexts for design
● How to create and adapt graphics for the web
● Different target markets and the elements of design which satisfy each market
● Protocols for maintaining a corporate identity, brand and style guide
● The limitations of Internet-enabled devices and screen resolutions
Topics covered:
● Create, analyse, and develop visual responses to communication problems, including understanding hierarchy, typography, aesthetics, and composition
● Create, manipulate and optimize images for the internet
● Identify the target market and create a concept for the design
● Create responsive designs that function correctly on multiple screen resolutions and/or devices

Section 4: Layout (No. of days: 5, Weightage: 15)
Description:
● World Wide Web Consortium (W3C) standards for HTML and CSS
● Positioning and layout methods
● Usability and interaction design
● Accessibility and communication for users with special needs
● Cross-browser and multi-device compatibility
● Search Engine Optimization (SEO) and performance optimization
Topics covered:
● Create code that conforms and validates to the W3C standards, including the accessibility guidelines
● Create accessible and usable web interfaces for a variety of devices and screen resolutions
● Use CSS or other external files to modify the appearance of the web interface

Section 5: Front-End Development (No. of days: 6, Weightage: 30)
Description:
● JavaScript
● How to integrate libraries, frameworks and other systems or features with JavaScript
● Use JavaScript pre/post processors and task-running workflow
Topics covered:
● Create website animations and functionalities to assist in context explanations and adding visual appeal
● Create and update JavaScript code to enhance a website's functionality, usability and aesthetics
● Manipulate data and custom media with JavaScript

Section 6: Back-End Development (No. of days: 6, Weightage: 30)
Description:
● Object-oriented PHP
● Open-source server-side libraries and frameworks
● Connect to the server through SSH to operate server-side libraries and frameworks
● How to design and implement databases with MySQL
● FTP (File Transfer Protocol) server and client relationships and packages
Topics covered:
● Manipulate data making use of programming skills
● Protect against security exploits
● Integrate with existing code with APIs (Application Programming Interfaces), libraries and frameworks
● Create or maintain a database to support system and software requirements
● Create code that is modular and reusable

Test Project: No. of days: 1
Test Project: No. of days: 1
Assessment: No. of days: 1
Total No. of days: 25; Total Weightage: 100%

TRAINING SCHEDULE

Skill Name: BIG DATA AND MACHINE LEARNING

Day 1: INTRODUCTION TO MACHINE LEARNING, DATA COLLECTION AND STUDY ABOUT DATASET
Day 2: INTRODUCTION TO RAPID MINER AND DATA PREPROCESSING
Day 3: DATA MINING INTRODUCTION — DATA PREPROCESSING WITH R
Day 4: DATA PREPROCESSING USING PYTHON
Day 5: MODEL EVALUATION AND SELECTION-LINEAR REGRESSION
Day 6: K-NEAREST NEIGHBOURS REGRESSION
Day 7: RANDOM FOREST REGRESSION
Day 8: GRADIENT BOOSTING REGRESSION
Day 9: SUPPORT VECTOR MACHINE REGRESSION
Day 10: LOGISTIC REGRESSION
Day 11: INTRODUCTION TO CLASSIFICATION AND DIFFERENT TYPES OF CLASSIFICATION ALGORITHMS, NAÏVE BAYES ALGORITHM
Day 12: STOCHASTIC GRADIENT DESCENT ALGORITHM
Day 13: K-NEAREST NEIGHBOURS ALGORITHM
Day 14: SUPPORT VECTOR MACHINES CLASSIFICATION
Day 15: TIME SERIES PREDICTION USING ARIMA MODEL
Day 16: IMPLEMENTATION OF DECISION TREES
Day 17: IMPLEMENTATION OF RANDOM FOREST ALGORITHM
Day 18: CONVOLUTIONAL NEURAL NETWORK
Day 19: CREATING A SIMPLE CHAT BOT
Day 20: INTRODUCTION TO BIG DATA ANALYTICS AND INSTALLATION OF HADOOP
Day 21: INSTALLATION AND DATABASE CREATION USING MONGODB
Day 22: INTRODUCTION TO SQL AND NOSQL DATABASES

Marking Scheme

Assessment is done by awarding points using two methods: Measurement and Judgement.

Measurement – marks are awarded either against a directly measurable quantity or on a binary (Yes/No)
basis.

Judgement – marks are awarded either on a scale based on industry expectations or on a binary (Yes/No)
basis.

DAY 1 INTRODUCTION TO MACHINE LEARNING AND DATA
COLLECTION AND STUDY ABOUT DATASET

Objectives: The aim of the task is to make the students understand what machine learning and
big data are, and to give them practice in collecting datasets from various resources.

Outcome: Downloading dataset and preparation of their own dataset for various analytics

Resources required: Rapid Miner/R Tool/ Python IDE/ GoogleColab

Prerequisites: Knowledge about data and information

Theory:
What is Machine Learning?
Machine learning is a subfield of artificial intelligence (AI). The goal of machine learning generally is
to understand the structure of data and fit that data into models that can be understood and utilized by
people. Machine learning is used for everything from automating mundane tasks to offering intelligent
insights, and industries in every sector try to benefit from it.

Sample Coding:

Rapid Miner: Importing Dataset into Rapid Miner

Step 1: Select blank project, to start work on a new project.

Step 2: Select the yellow highlighted drop down arrow and create a new repository

Step 3: Now select the repository you created, and create two sub-folders under the repository, named
Data and Processes.

Type in "Data" and repeat the same method to create the sub-folder for "Processes".

Step 4: Select the purple highlighted box, Import Data. In MyComputer, choose the downloaded
dataset

Step 5: Do the preliminary formatting of the dataset, such as choosing the class labels, changing roles, etc.

Step 6: Finally save the dataset in the Data folder.

Python:

1. Download a dataset from Kaggle or any other machine learning repository as a CSV file
2. from google.colab import files
uploaded = files.upload()

3. Select the downloaded dataset and click open


4. import pandas as pd
5. import io
df = pd.read_csv(io.BytesIO(uploaded['dataset.csv']))
6. print(df)
7. df.head(2)
8. df.tail(2)
9. df.dtypes

R –Tool:

1. Download the dataset from machine learning repository and save in a location in your computer
2. Open RStudio and type the following commands
mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id")
3. mydata

Schedule

Day 1:
8.45am to 10.25am: Introduction
10.25am to 10.40am: Tea Break
10.40am to 12.20pm: Installation of Rapid Miner, R Tool and Python IDE
12.20pm to 1.30pm: Lunch Break
1.30pm to 3.10pm: Dataset collection
3.10pm to 3.25pm: Tea Break
3.25pm to 4.15pm: Tasks related to importing dataset

Description of the task:

Download a health-care dataset (Excel or CSV format) from Kaggle and import it in Rapid Miner, R
Tool and Python IDE and print the imported dataset to check whether the dataset is correctly imported.

Sample Output in Python using Google Colab:

Assessment specification:

S.No | Aspect Type | Aspect Description | Additional Aspect Description | Requirement | Maximum Score (10)
1 | J | Downloading correct dataset | 2 Marks: CSV dataset downloaded; 0 Marks: wrong dataset downloaded | Browser | 2
2 | J | Creating new repository in the name of the dataset in RapidMiner | 0.5 Mark: new repository correctly created; 0 Marks: no repository created | Rapid Miner | 0.5
3 | J | Creating two subfolders in repository | 0.25 Mark: Data subfolder created correctly inside the new repository; 0.25 Mark: Process subfolder created correctly inside the new repository; 0 Marks: no subfolders created | Rapid Miner | 0.5
4 | J | Changing the roles for target labels | 1 Mark: target labels are set; 0 Marks: target labels are not set | Rapid Miner | 1
5 | J | Displaying the imported dataset | 1 Mark: dataset correctly displayed, indicating the features and target labels; 0 Marks: not imported properly | Rapid Miner | 1
6 | J | Importing dataset using Python | 2 Marks: dataframe properly used and printed; 0 Marks: dataframe not properly imported | Pycharm Editor | 2
7 | J | Displaying the different parts of the data using the pandas package | 1 Mark: first 5 rows and last 5 rows correctly displayed; 0 Marks: data manipulation not properly done | Pycharm Editor | 1
8 | J | Importing dataset using R Tool | 1 Mark: dataframe properly used and printed; 0 Marks: dataframe not properly imported | R Studio | 1
9 | J | Displaying the different parts of the data | 1 Mark: first 5 rows and last 5 rows correctly displayed; 0 Marks: data manipulation not properly done or no background colour used | R Studio | 1

Conclusion:
Thus, the various resources for data collection and the methodology for importing datasets
using Rapid Miner, R Tool and a Python IDE were explored successfully.
DAY 2 INTRODUCTION TO RAPID MINER AND DATA PRE-PROCESSING

Objectives: The aim of the task is to pre-process a dataset by formatting the data and
identifying the statistics of the data, in order to improve the results of the data analysis.

Outcome: Students will be able to apply the Rapid Miner tool for preprocessing a dataset required for
data analysis.

Resources required: Rapid Miner (Student version)

Prerequisites:
Knowledge on Machine Learning

Theory:

Rapid Miner is a data science software platform developed by the company of the same name that
provides an integrated environment for data preparation, machine learning, deep learning, text
mining, and predictive analytics. It is used for business and commercial applications as well as for
research, education, training, rapid prototyping, and application development and supports all steps
of the machine learning process including data preparation, results visualization, model validation
and optimization. Rapid Miner is developed on an open-core model. The Rapid Miner Studio Free
Edition, which is limited to 1 logical processor and 10,000 data rows, is available under
the AGPL license, while other editions depend on various non-open-source components.
Rapid Miner is written in the Java programming language. Rapid Miner provides a GUI to
design and execute analytical workflows. Those workflows are called “Processes” in Rapid Miner
and they consist of multiple “Operators”. Each operator performs a single task within the process,
and the output of each operator forms the input of the next one. Alternatively, the engine can be
called from other programs or used as an API. Individual functions can be called from the command
line. Rapid Miner provides learning schemes, models and algorithms and can be extended
using R and Python scripts.

Data preprocessing

Tasks in data preprocessing:

1. Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
2. Data integration: using multiple databases, data cubes, or files.
3. Data transformation: normalization and aggregation.
4. Data reduction: reducing the volume but producing the same or similar analytical results.
5. Data discretization: part of data reduction, replacing numerical attributes with nominal ones.

A minimal Python sketch illustrating these tasks is given below for reference.
1. Data cleaning

1.1 Fill in missing values - using Rapid Miner

a. Open Rapidminer and Import File “sales_data missing “


b. Check the missing value in meta data view and observe the amount attribute.
c. Navigate to Data Transformation and select Data Cleansing method
d. Select “Replace Missing Values” to perform the operation.
e. Run the program, check the missing values.

Step 1:

Step 2:

Step 3:

1.2 Outlier Treatment for Reducing Noise

Step 1: Import the file "sales_data Outlier".


 Observe the attribute "amount": there are some outliers (noise). Apply a data cleaning
technique to reduce this noise in the data.

Step 2:
 Using Operators, expand Transformation, then expand Data Cleansing and Outlier Detection,
and then select Detect Outlier (Distant).

 In "Detect outlier number", change the value to 1.

Step 3: Run the program, Check the outlier

1.3 Data transformation

1.3.1 Normalization:
 Import the file sales_data.

 Observe the attributes and apply normalization by using the "Range Transformation (Min-Max)
Method", with the range 0.0 to 1.0.

 Using Operators, expand Data Transformation, then expand Value Modification and Numerical
Value Modification, and open Normalize.

Step 1:

Step 2: Run the program and check the result; after normalization, all the values are replaced by values
within the given range.
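For reference, the Min-Max range transformation used here maps each value x of an attribute into the chosen range [new_min, new_max] using the standard formula

x' = (x - min(x)) / (max(x) - min(x)) * (new_max - new_min) + new_min

which, for the range 0.0 to 1.0, reduces to x' = (x - min(x)) / (max(x) - min(x)).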

Demo: 1.3.2 Aggregation:

Step 1: Select mode for importing the data


Step 2: After selecting your mode of import choose the correct sheet or file to import.

Step 3: Select your sheet and click next.

Step 4 : De-select any columns that you do not want to import. In this case I do not care to see
what teams the QBs play for. You must make the column you want to sort by ID. You can see
in the first column that contains the names was changed from attribute to ID. After you do that
click Next and save your data.

Step 5:Once you get into your main process drag and drop your data onto process area. Click
Data Transformation > Aggregation. Drag and drop the aggregate widget onto the process area.
Next connect the out port of the data to the exa port on the left side of the aggregate widget.
Then connect the exa port on the right side of the widget to the result port.

Step 6:After you connect the ports select edit list by aggregation attributes. Here make an entry for
each attribute you want to aggregate and select the functions you want to use. After you do this
click Ok.

Step 7:Next click select attribute by group by attributes. Here move your ID column (in this case
Name) into the right side by selecting your ID and clicking on the arrow pointing right. Click Ok.
Step 8: Now just click the play button on the toolbar and you get your results!
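For comparison only, the same group-by aggregation idea can be sketched outside Rapid Miner with pandas; the quarterback data below is invented to mirror the demo above, and the aggregation functions are examples.

# Hypothetical quarterback passing data, mirroring the aggregation demo above.
import pandas as pd

df = pd.DataFrame({
    "Name": ["Player A", "Player A", "Player B", "Player B", "Player B"],
    "Yards": [310, 250, 280, 300, 190],
    "TDs": [3, 1, 2, 4, 0],
})

# "group by attributes" = Name (the ID column); "aggregation attributes" = Yards and TDs
summary = df.groupby("Name").agg(
    total_yards=("Yards", "sum"),
    avg_yards=("Yards", "mean"),
    total_tds=("TDs", "sum"),
)
print(summary)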

Demo: 1.3.3 Generalization:

Step 1: The Generate Nominal Data operator generates an ExampleSet based on nominal attributes.
Select any one parameter in the Generate Nominal Data operator.

Parameters

 number_examples: This parameter specifies the number of examples to be generated. Range: integer
 number_of_attributes: This parameter specifies the number of regular attributes to be generated.
Please note that the label attribute is generated automatically besides these regular attributes. Range: integer
 number_of_values: This parameter specifies the number of unique values of the attributes. Range: integer
 use_local_random_seed: This parameter indicates if a local random seed should be used for
randomization. Using the same value of the local random seed will produce the same ExampleSet.
Changing the value of this parameter changes the way examples are randomized, thus the ExampleSet
will have a different set of values. Range: boolean
 local_random_seed: This parameter specifies the local random seed. This parameter is only
available if the use local random seed parameter is set to true. Range: integer

Step 2: The number examples parameter is set to 100, thus the ExampleSet will have 100
examples. The number of attributes parameter is set to 3, thus three nominal attributes will be
generated beside the label attribute. The number of values parameter is set to 5, thus each attribute
will have 5 possible values.

1.4 Data reduction

1.4.1 Discretization using Rapid Miner

Step 1: Import the file "sales_data Discretization".
Step 2: Observe all attributes and apply discretization on this file by using "Discretize by
Binning", providing the number of bins.

Step 3: Using Operators, expand Data Transformation, then expand Type Conversion and
Discretization, and select Discretize by Binning.

Step 4: Run the program and check the result; after discretization, see the difference by
plotting a histogram for the attribute "product_id".

Histogram of "product_id" after discretization with 5 bins.
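As a rough cross-check outside Rapid Miner, the same equal-width binning can be sketched with pandas; the product_id values below are invented for illustration.

# Hypothetical product_id values; discretize into 5 equal-width bins,
# as done with "Discretize by Binning" above.
import pandas as pd

sales = pd.DataFrame({"product_id": [3, 8, 15, 22, 27, 31, 36, 44, 47, 50]})
sales["product_id_bin"] = pd.cut(sales["product_id"], bins=5)

print(sales)
print(sales["product_id_bin"].value_counts().sort_index())  # counts per bin, i.e. the histogram heights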

Schedule:

Day 2:
8.45am to 10.25am: Rapid Miner installation and introduction to data preprocessing
10.25am to 10.40am: Tea Break
10.40am to 12.20pm: Exercise 1
12.20pm to 1.30pm: Lunch Break
1.30pm to 3.10pm: Exercise 2
3.10pm to 3.25pm: Tea Break
3.25pm to 4.15pm: Exercise 3

Description of the task:

Download IRIS dataset (Excel or CSV format) from Kaggle and import it in Rapid Miner, and do
the necessary preprocessing

Sample Output

Assessment specification:

S.No | Aspect Type | Aspect Description | Additional Aspect Description | Requirement | Maximum Score (10)
1 | J | Downloading correct dataset | 2 Marks: CSV dataset downloaded; 0 Marks: wrong dataset downloaded | Browser | 2
2 | J | Creating new repository in the name of the dataset in RapidMiner | 0.5 Mark: new repository correctly created; 0 Marks: no repository created | Rapid Miner | 0.5
3 | J | Creating two subfolders in repository | 0.25 Mark: Data subfolder created correctly inside the new repository; 0.25 Mark: Process subfolder created correctly inside the new repository; 0 Marks: no subfolders created | Rapid Miner | 0.5
4 | J | Handling missing values | 1 Mark: missing values replaced with correct values; 0 Marks: missing values not replaced | Rapid Miner | 1
5 | J | Finding outliers in dataset | 1 Mark: outliers determined and replaced; 0 Marks: no outliers detected | Rapid Miner | 1
6 | J | Transforming data to suitable form | 1 Mark: normalization applied to the dataset; 0 Marks: no normalization applied | Rapid Miner | 1
7 | J | Transforming data to suitable form | 1 Mark: aggregation applied to the dataset; 0 Marks: no aggregation applied | Rapid Miner | 1
8 | J | Apply Generate Nominal Data | 1 Mark: Generate Nominal Data operator applied to the dataset for generating an ExampleSet; 0 Marks: no Generate Nominal Data operator applied | Rapid Miner | 1
9 | J | Apply data reduction process | 1 Mark: data discretization applied on the dataset; 0 Marks: no data discretization applied | Rapid Miner | 1
10 | J | Plotting histogram | 1 Mark: histogram plotting applied for all the attributes in the dataset; 0 Marks: no histogram plotting applied | Rapid Miner | 1

Conclusion:
Thus, various pre-processing tasks can be performed using Rapid Miner, which results in clean data for
carrying out various analytics.

DAY 3 DATA MINING INTRODUCTION — DATA PREPROCESSING WITH R

Objectives: The aim of the task is to make the students understand the basic concepts of data mining
and fundamental data processing procedures using R

Outcome: Students will be able to perform pre-processing on sample data using R.

Resources required: R for windows

Prerequisites: Basic knowledge on information and data

Theory: Data Mining and R programming

Data mining is a field of research that emerged in the 1990s and is very popular
today, sometimes under different names such as "big data" and "data science", which have a similar
meaning. To give a short definition, data mining can be defined as a set of techniques for
automatically analyzing data to discover interesting knowledge or patterns in the data.

 Data mining finds valuable information hidden in large volumes of data.


 Data mining is the analysis of data and the use of software techniques for finding patterns and
regularities in sets of data.
 The computer is responsible for finding the patterns by identifying the underlying rules and
features in the data.
 It is possible to "strike gold" in unexpected places as the data mining software extracts patterns
not previously discernible or so obvious that no-one has noticed them before.
 Mining analogy:
o large volumes of data are sifted in an attempt to find something worthwhile.
o in a mining operation large amounts of low grade materials are sifted through in order
to find something of value.

To perform data mining, a process consisting of seven steps is usually followed. This process is often
called the “Knowledge Discovery in Database” (KDD) process.

1. Data cleaning: This step consists of cleaning the data by removing noise or other
inconsistencies that could be a problem for analyzing the data.
2. Data integration: This step consists of integrating data from various sources to prepare the
data that needs to be analyzed. For example, if the data is stored in multiple databases or files, it
may be necessary to integrate the data into a single file or database to analyze it.
3. Data selection: This step consists of selecting the relevant data for the analysis to be
performed.
4. Data transformation: This step consists of transforming the data to a proper format that can
be analyzed using data mining techniques. For example, some data mining techniques require
that all numerical values are normalized.

5. Data mining: This step consists of applying some data mining techniques (algorithms) to
analyze the data and discover interesting patterns or extract interesting knowledge from this
data.
6. Evaluating the knowledge that has been discovered: This step consists of evaluating the
knowledge that has been extracted from the data. This can be done in terms of objective and/or
subjective measures.
7. Visualization: Finally, the last step is to visualize the knowledge that has been extracted from
the data.
R is a programming language developed by Ross Ihaka and Robert Gentleman in 1993. R possesses
an extensive catalog of statistical and graphical methods. It includes machine learning algorithms,
linear regression, time series, and statistical inference, to name a few. Most of the R libraries are written
in R, but for heavy computational tasks, C, C++ and Fortran code is preferred.

Data Preprocessing in R

Importing the Dataset


dataset = read.csv('dataset.csv')

Dealing with Missing Values

dataset$age = ifelse(is.na(dataset$age), ave(dataset$age, FUN = function(x) mean(x, na.rm = TRUE)), dataset$age)

dataset$salary = ifelse(is.na(dataset$salary), ave(dataset$salary, FUN = function(x) mean(x, na.rm = TRUE)), dataset$salary)

The above code blocks check for missing values in the age and salary columns and update the missing
cells with the column-wise average.

dataset$age = as.numeric(format(round(dataset$age, 0)))

Since we are not interested in having decimal places for age we will round it up using the above code

Dealing with Categorical Data


Categorical variables represent types of data which may be divided into groups. Examples of
categorical variables are race, sex, age group, educational level etc. In our dataset, we have two
categorical features, nation, and purchased_item. In R we can use the factor method to convert texts
into numerical codes.

dataset$nation = factor(dataset$nation, levels = c('India','Germany','Russia'), labels = c(1,2,3))

dataset$purchased_item = factor(dataset$purchased_item, levels = c('No','Yes'), labels = c(0,1))

Splitting the Dataset into Training and Testing Sets

caTools library in R is used to split our dataset to training_set and test_set

install.packages('caTools') #install once

library(caTools) # importing caTools library

set.seed(123)

split = sample.split(dataset$purchased_item, SplitRatio = 0.8)

training_set = subset(dataset, split == TRUE)

test_set = subset(dataset, split == FALSE)

Scaling the Features


training_set[,3:4] = scale(training_set[,3:4])

test_set[,3:4] = scale(test_set[,3:4])

The scale method in R can be used to scale the features in the dataset. Here we are only scaling the
non-factors which are the age and the salary.
Training_set:

Test_set:

Schedule:

Day 3:
8.45am to 10.25am: Data mining introduction
10.25am to 10.40am: Tea Break
10.40am to 12.20pm: R programming introduction
12.20pm to 1.30pm: Lunch Break
1.30pm to 3.10pm: Data pre-processing task
3.10pm to 3.25pm: Tea Break
3.25pm to 4.15pm: Data pre-processing task

Description of the task:

Download Shopping Analysis dataset (Excel or CSV format) from Kaggle


(https://github.com/tarunlnmiit/machine_learning/blob/master/DataPreprocessing.csv) and import it using
R tool, and do the necessary preprocessing.

Sample Output:

Assessment specification:

S.No | Aspect Type | Aspect Description | Additional Aspect Description | Requirement | Maximum Score (10)
1 | J | Downloading correct dataset | 1 Mark: CSV dataset downloaded; 0 Marks: wrong dataset downloaded | Browser | 1
2 | J | Importing dataset | 1 Mark: dataset correctly imported and displayed; 0 Marks: no dataset imported | RStudio | 1
3 | J | Splitting the dataset into dependent variable and independent variable | 1 Mark: dataset split into dependent and independent variables; 0 Marks: dataset not split into dependent and independent variables | RStudio | 1
4 | J | Handling missing values | 1 Mark: missing values replaced with correct values; 0 Marks: missing values not replaced | RStudio | 1
5 | J | Handling categorical data | 1 Mark: label encoding used to convert categorical data to numerical data; 0 Marks: categorical data not converted | RStudio | 1
6 | J | Splitting data into training and testing data | 1 Mark: caTools used to split the dataset into training data and testing data; 0 Marks: no caTools function applied | RStudio | 1
7 | J | Transforming data by applying feature scaling | 1 Mark: feature scaling applied to convert different scales of the data to a standard scale; 0 Marks: no feature scaling applied | RStudio | 1
8 | J | Reducing the dimensions of the data using PCA | 1 Mark: PCA applied to the dataset for extracting the correlated features; 0 Marks: no dimensionality reduction applied | RStudio | 1
9 | J | Displaying the preprocessed dataset | 2 Marks: packages used to plot the features that have high correlation with the target labels; 0 Marks: no visualization of the dataset performed | RStudio | 2

Conclusion:

Thus, various pre-processing tasks can be performed using the R Tool, which results in clean data for
carrying out various analytics.

DAY 4 DATA PREPROCESSING USING PYTHON

Objectives: The aim of the task is to pre-process a dataset for formatting the data and identifying the
statistics of the data to improve the results of the data analysis.

Outcome:
Students will be able to apply Python libraries for preprocessing a dataset required for data analysis.

Resources required:
Python Version 3.6

Prerequisites:
Knowledge on Python Programming language

Theory:
 Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.
 Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In
other words, whenever the data is gathered from different sources it is collected in raw format
which is not feasible for the analysis.
Need of Data Preprocessing:
 For achieving better results from the applied model in Machine Learning projects, the format of
the data has to be proper. Some Machine Learning models need information in a specified format;
for example, the Random Forest algorithm does not support null values, therefore null values
have to be handled in the original raw data set before the algorithm can be executed.

 Another aspect is that the data set should be formatted in such a way that more than one Machine
Learning or Deep Learning algorithm can be executed on the same data set, and the best of them
chosen.
Sample Coding:

The “chronic_kidney_disease.arff” dataset is used for this task, which is available at the UCI Repository.
1. Read and clean the data

# kidney_dis.py
import pandas as pd
import numpy as np
# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
'ba','bgr','bu','sc','sod','pot','hemo','pcv',
'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
'classification']
# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff", 41 of 178
WSK2019_TDFS10_EN (1) ©Bannari Amman Institute of Technology. All Rights Reserved
BIG DATA AND MACHINE LEARNING
header=None,
names=header
)
# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")
# print total samples
print("Total samples:", len(df))
# print 4-rows and 6-columns
print("Partial data\n", df.iloc[0:4, 0:6])
Below is the output of the above code:
$ python kidney_dis.py
Total samples: 157
Partial data
age bp sg al su rbc
30 48 70 1.005 4 0 normal
36 53 90 1.020 2 0 abnormal
38 63 70 1.010 3 0 abnormal
41 68 80 1.010 3 2 normal

2. Saving targets with different color names


In this dataset we have two ‘targets’ i.e. ‘ckd’ and ‘notckd’ in the last column (‘classification’). It is better to
save the ‘targets’ of classification problem with some ‘color-name’ for the plotting purposes. This helps in
visualizing the scatter-plot as shown in this chapter.
# kidney_dis.py
import pandas as pd
import numpy as np
# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
'ba','bgr','bu','sc','sod','pot','hemo','pcv',
'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
'classification']
# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
header=None,
names=header
)
# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")
# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])
targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease, green: no disease

label_color = ['red' if i=='ckd' else 'green' for i in targets]


print(label_color[0:3], label_color[-3:-1])

One can also convert the ‘categorical-targets (i.e. strings ‘ckd’ and ‘notckd’) into ‘numeric-targets (i.e. 0
and 1’) using “.cat.codes” command, as shown below,
# covert 'ckd' and 'notckd' labels as '0' and '1'
targets = df['classification'].astype('category').cat.codes
# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i==0 else 'green' for i in targets]
print(label_color[0:3], label_color[-3:-1])

Below is the first three and last two samples of the ‘label_color’,
$ python kidney_dis.py
['red', 'red', 'red'] ['green', 'green']

3. Basic PCA analysis


Let’s perform the dimensionality reduction using PCA.
3.1. Preparing data for PCA analysis
Note that, for PCA the features should be ‘numerics’ only. Therefore we need to remove the ‘categorical’
features from the dataset.
# kidney_dis.py
import pandas as pd
import numpy as np
# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
'ba','bgr','bu','sc','sod','pot','hemo','pcv',
'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
'classification']
# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
header=None,
names=header
)
# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")
# print total samples
# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])
# list of categorical features
categorical_ = ['rbc', 'pc', 'pcc', 'ba', 'htn',
'dm', 'cad', 'appet', 'pe', 'ane'
]
# drop the "categorical" features
# drop the classification column
df = df.drop(labels=['classification'], axis=1)
# drop using 'inplace' which is equivalent to df = df.drop()
df.drop(labels=categorical_, axis=1, inplace=True)
print("Partial data\n", df.iloc[0:4, 0:6]) # print partial data

Below is the output of the above code. Note that, if we compare the below results with the results of Listing
8.1, we can see that the ‘rbc’ column is removed.
$ python kidney_dis.py
Partial data
age bp sg al su bgr
30 48 70 1.005 4 0 117
36 53 90 1.020 2 0 70
38 63 70 1.010 3 0 380
41 68 80 1.010 3 2 157

4. Dimensionality reduction
Let’s perform dimensionality reduction using the PCA model. The results infer where we can see that the
model can fairly classify the kidney disease based on the provided features.
# kidney_dis.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# create header for dataset


header = ['age','bp','sg','al','su','rbc','pc','pcc',
'ba','bgr','bu','sc','sod','pot','hemo','pcv',
'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
'classification']
# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
header=None,
names=header
)
# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")

# print total samples


# print("Total samples:", len(df))
# print 4-rows and 6-columns
# print("Partial data\n", df.iloc[0:4, 0:6])

targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])

# list of categorical features


categorical_ = ['rbc', 'pc', 'pcc', 'ba', 'htn',
'dm', 'cad', 'appet', 'pe', 'ane'
]

# drop the "categorical" features


# drop the classification column
df = df.drop(labels=['classification'], axis=1)
# drop using 'inplace' which is equivalent to df = df.drop()
df.drop(labels=categorical_, axis=1, inplace=True)
# print("Partial data\n", df.iloc[0:4, 0:6]) # print partial data

pca = PCA(n_components=2)
pca.fit(df)
T = pca.transform(df) # transformed data
# change 'T' to Pandas-DataFrame to plot using Pandas-plots
T = pd.DataFrame(T)

# plot the data


T.columns = ['PCA component 1', 'PCA component 2']
T.plot.scatter(x='PCA component 1', y='PCA component 2', marker='o',
alpha=0.7, # opacity
color=label_color,
title="red: ckd, green: not-ckd" )
plt.show()

5. Data Visualization
Chronic Kidney Disease

The dataset had a large number of features. PCA looks for the correlation between these features and
reduces the dimensionality. In this example, we reduce the number of features to 2 using PCA.
After the dimensionality reduction, only 2 features are extracted, therefore it is plotted using the
scatter-plot, which is easier to visualize. For example, one can clearly see the differences between the
‘ckd’ and ‘not ckd’ in the current example.
The dimensionality reduction methods, such as PCA are used to reduce the dimensionality of the
features to 2 or 3. Next, these 2 or 3 features can be plotted to visualize the information.

Schedule:

Day 4:
8.45am to 10.25am: Introduction
10.25am to 10.40am: Tea Break
10.40am to 12.20pm: Exercise 1
12.20pm to 1.30pm: Lunch Break
1.30pm to 3.10pm: Exercise 2
3.10pm to 3.25pm: Tea Break
3.25pm to 4.15pm: Exercise 3

Description of the task:

Download Shopping Analysis dataset (Excel or CSV format) from Kaggle


(https://github.com/tarunlnmiit/machine_learning/blob/master/DataPreprocessing.csv) and import it using
Python packages, and do the necessary preprocessing.

Sample Output:

Assessment specification:

S.No | Aspect Type | Aspect Description | Additional Aspect Description | Requirement | Maximum Score (10)
1 | J | Downloading correct dataset | 1 Mark: CSV dataset downloaded; 0 Marks: wrong dataset downloaded | Browser | 1
2 | J | Importing dataset using the pandas package | 1 Mark: dataset correctly imported and displayed; 0 Marks: no dataset imported | Python | 1
3 | J | Splitting the dataset into dependent variable and independent variable | 1 Mark: dataset split into dependent and independent variables using the iloc function; 0 Marks: dataset not split into dependent and independent variables | Python | 1
4 | J | Handling missing values | 1 Mark: missing values replaced with correct values; 0 Marks: missing values not replaced | Python | 1
5 | J | Handling categorical data | 1 Mark: label encoders used to convert categorical data to numerical data; 0 Marks: categorical data not converted | Python | 1
6 | J | Splitting data into training and testing data | 1 Mark: train_test_split function applied to split the dataset into training data and testing data; 0 Marks: no train_test_split function applied | Python | 1
7 | J | Transforming data by applying feature scaling | 1 Mark: feature scaling applied to convert different scales of the data to a standard scale; 0 Marks: no feature scaling applied | Python | 1
8 | J | Reducing the dimensions of the data using PCA | 1 Mark: PCA applied to the dataset for extracting the correlated features; 0 Marks: no dimensionality reduction applied | Python | 1
9 | J | Displaying the dataset using matplotlib | 2 Marks: matplotlib used to plot the features that have high correlation with the target labels; 0 Marks: no visualization of the dataset performed | Python | 2

Conclusion:

Thus, various pre-processing tasks can be performed using Python, which results in clean data for
carrying out various analytics.

DAY 5 MODEL EVALUATION AND SELECTION-LINEAR REGRESSION

Objectives:
The aim of the task is to select a model for performing analytics and to gain insight into Linear Regression.

Outcome:
Students will be able to apply a machine learning model for predicting a result using Linear Regression.

Resources required: Python for Windows

Prerequisites: Knowledge on Python Programming

Theory:
Machine learning continues to be an increasingly integral component of our lives, whether we’re applying
the techniques to research or business problems. Machine learning models ought to be able to give
accurate predictions in order to create real value for a given organization.

Linear Regression
Linear regression is a method for approximating a linear relationship between two variables.

Types of Regression Models

Linear Equations
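In its simplest form, the linear regression model fits a straight line

y = b0 + b1*x + e

where b0 is the intercept, b1 is the slope, and e is the error term; multiple linear regression generalizes this to y = b0 + b1*x1 + b2*x2 + ... + bn*xn.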

Sample Coding:

import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import seaborn as sns
import io
from google.colab import files

uploaded = files.upload()
df2 = pd.read_csv(io.BytesIO(uploaded['homeprice.csv']))
df2

heat_map = sns.heatmap(df2, vmin=0, vmax=1, center=10)


plt.show()

plt.figure(figsize=(10, 5))
sns.heatmap(df2.corr(), annot=True)
sns.heatmap(df2.corr())
plt.show()

ch = pd.read_csv(io.BytesIO(uploaded['homeprice.csv']))
print(ch.head())
grp = sns.regplot(x='area', y='price', data=ch, color='orange')
plt.title("Predicting Home Price")
plt.show()

order = ch[['area']]
print(order)
totalorders = ch[['price']]
print(totalorders)
reg = linear_model.LinearRegression()
reg.fit(order, totalorders)

from google.colab import files
uploaded = files.upload()
ch1 = pd.read_csv(io.BytesIO(uploaded['homeprice1.csv']))
print(ch1.head())

p = reg.predict(ch1)
print(p)
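The assessment below also asks for regression metrics. As a minimal sketch of how a fitted model could be evaluated (synthetic area/price values stand in for homeprice.csv, which is not reproduced here, and the split ratio is an assumption):

# Minimal sketch (assumption): evaluating a fitted LinearRegression model with a
# train/test split and common regression metrics.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic area/price data standing in for homeprice.csv
rng = np.random.default_rng(42)
area = rng.uniform(500, 3500, size=100).reshape(-1, 1)
price = 50_000 + 120 * area.ravel() + rng.normal(0, 20_000, size=100)

X_train, X_test, y_train, y_test = train_test_split(area, price, test_size=0.3, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R^2 :", r2_score(y_test, pred))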

Schedule:

Day 5:
8.45am to 10.25am: Introduction
10.25am to 10.40am: Tea Break
10.40am to 12.20pm: Linear Regression basics
12.20pm to 1.30pm: Lunch Break
1.30pm to 3.10pm: Linear Regression for predicting the home price
3.10pm to 3.25pm: Tea Break
3.25pm to 4.15pm: Task on Linear Regression

Description of the task:

Determine the rise in price of Gold using the past data and predict how the price would be in the
forthcoming years using Linear Regression in Python

Sample Output

Assessment specification:

S.No | Aspect Type | Aspect Description | Additional Aspect Description | Requirement | Maximum Score (10)
1 | J | Downloading correct dataset | 1 Mark: CSV dataset downloaded; 0 Marks: wrong dataset downloaded | Browser | 1
2 | J | Importing dataset | 1 Mark: dataset correctly imported and displayed; 0 Marks: no dataset imported | Python | 1
3 | J | Splitting the dataset into dependent variable and independent variable | 1 Mark: dataset split into dependent and independent variables using the iloc function; 0 Marks: dataset not split into dependent and independent variables | Python | 1
4 | J | Preprocessing the data | 1 Mark: preprocessing procedures carried out; 0 Marks: no preprocessing of data | Python | 1
5 | J | Separate data frame for features and labels | 1 Mark: separate dataframes created to store the features and target labels; 0 Marks: no separate data frames created | Python | 1
6 | J | Splitting data into training and testing data | 1 Mark: train_test_split function used to split the dataset into training data and testing data; 0 Marks: no train_test_split function applied | Python | 1
7 | J | Linear Regression | 1 Mark: Linear Regression used to train on the dataset; 0 Marks: no Linear Regression used | Python | 1
8 | J | Prediction | 1 Mark: test data used for evaluation; 0 Marks: no testing | Python | 1
9 | J | Evaluation metrics | 1 Mark: Linear Regression related metrics used to determine the accuracy of the model; 0 Marks: no metrics used for determining the accuracy of the model | Python | 1
10 | J | Visualization | 1 Mark: matplotlib or seaborn used for visualizing the model results; 0 Marks: no visualization results | Python | 1

Conclusion:
Thus linear regression is used to predict the future values of the target labels in the chosen dataset.

DAY 6 K-NEAREST NEIGHBOR REGRESSION

Objectives:
The aim of the task is to make the students understand the implementation of K-Nearest Neighbour
regression using Python programming

Outcome:
Students will be able to perform the K-Nearest Neighbour regression method, which aims to predict a
numerical target based on a similarity measure such as a distance function.

Resources required: Python for windows

Prerequisites: Basic knowledge on Linear Regression.

Theory:

K-Nearest Neighbor Regression:

K-Nearest Neighbors (KNN) is one of the simplest algorithms used in Machine Learning for
regression and classification problems. A classification problem has a discrete value as its output, whereas a
regression problem has a real number (a number with a decimal point) as its output. KNN algorithms use
data and classify new data points based on similarity measures (e.g. a distance function). Classification is
done by a majority vote of the neighbours: the data point is assigned to the class most common among its
nearest neighbours. As the number of nearest neighbours k is increased, accuracy might increase.

Sample Coding:

Step1: Import the Libraries

Step 2: Fetch the data


Coding:

Output:

Step 3: Define Predictor Variable


A predictor variable, also known as an independent variable, is used to determine the value of the target
variable. We use 'Open-Close' and 'High-Low' as predictor variables.

Coding:

Output:

Step 4: Define Target Variables


The target variable, also known as the dependent variable is the variable whose values are to be
predicted by predictor variables.

Coding:

Step 5: Split the dataset


The dataset will be split into a training dataset and a test dataset, where 70% of the data will be used as
training data and the remaining 30% as testing data. Assume that 'Xtrain' and 'Ytrain' form the train
dataset and 'Xtest' and 'Ytest' form the test dataset.
Coding:

Step 6: Instantiate KNN Model


After splitting the dataset into training and test datasets, the K-nearest neighbours model will be instantiated. Next,
the training data will be fitted using the 'fit' function, and finally the train and test accuracy will be calculated
using the 'accuracy_score' function.

Coding:

Output:

Step 7: Predicting and Visualizing the test result
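Since the code screenshots for Steps 1-7 are not reproduced here, the following is a minimal sketch of the same workflow. It assumes a stock-price CSV with Open, High, Low and Close columns; the file name stock_prices.csv and the value k = 15 are assumptions. As in the text, the model is a K-nearest neighbours classifier evaluated with accuracy_score.

# Minimal sketch of Steps 1-7 above (assumption: a stock-price CSV with
# Open, High, Low and Close columns; the file name is hypothetical).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 2: fetch the data
df = pd.read_csv("stock_prices.csv").dropna()

# Step 3: predictor variables 'Open-Close' and 'High-Low'
df["Open-Close"] = df["Open"] - df["Close"]
df["High-Low"] = df["High"] - df["Low"]
X = df[["Open-Close", "High-Low"]]

# Step 4: target variable: +1 if tomorrow's close is higher than today's, else -1
y = np.where(df["Close"].shift(-1) > df["Close"], 1, -1)

# Step 5: 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)

# Step 6: instantiate and fit the KNN model, then report accuracies
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
print("Train accuracy:", accuracy_score(y_train, knn.predict(X_train)))
print("Test accuracy :", accuracy_score(y_test, knn.predict(X_test)))

# Step 7: predict and visualize the test result
pred = knn.predict(X_test)
plt.scatter(X_test["Open-Close"], X_test["High-Low"], c=pred, cmap="coolwarm", alpha=0.7)
plt.xlabel("Open-Close")
plt.ylabel("High-Low")
plt.title("KNN predictions on the test set")
plt.show()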

Schedule

Day 6:
8.45am to 10.25am: Introduction – Linear Regression vs k-nearest neighbour regression
10.25am to 10.40am: Tea Break
10.40am to 12.20pm: k-nearest neighbour example
12.20pm to 1.30pm: Lunch Break
1.30pm to 3.10pm: Hands-on session for k-nearest neighbour regression
3.10pm to 3.25pm: Tea Break
3.25pm to 4.15pm: k-nearest neighbour regression task

Description of the task:

Download the iris dataset (Excel or CSV format) from Kaggle, implement the KNN algorithm and visualize
the test result

Sample Output in Python:

Assessment specification:

S.No | Aspect Type | Aspect Description | Additional Aspect Description | Requirement | Maximum Score (10)
1 | J | Downloading correct dataset | 2 Marks: CSV dataset downloaded; 0 Marks: wrong dataset downloaded | Browser | 2
2 | J | Predictor variable | 1 Mark: independent variables defined properly; 0 Marks: independent variables not defined properly | Python | 1
3 | J | Target variable | 1 Mark: dependent variables defined properly; 0 Marks: dependent variables not defined properly | Python | 1
4 | J | Test data | 2 Marks: dataset split into test data; 0 Marks: data not split properly | Python | 2
5 | J | Training data | 2 Marks: dataset split into training data; 0 Marks: data not split properly | Python | 2
6 | J | Usage of fit function | 1 Mark: training data fitted using the fit function; 0 Marks: training data not fitted properly | Python | 1
7 | J | Visualizing the test results | 1 Mark: test results visualized in the form of a graph; 0 Marks: test results not visualized | Python | 1

Conclusion:
Thus, the implementation of K-Nearest Neighbour regression using a Python IDE was completed
successfully.

DAY 7 RANDOM FOREST REGRESSION

Objectives:
The aim of the task is to predict the selling prices of houses based on some economic factors and
build a random forest regression model using Python programming
Outcome:
Students will be able to apply the Random Forest regression algorithm for predicting a result based on
the average of the predictions of different decision trees.
Resources required: Python Version 3.5
Prerequisites: Knowledge on Python Programming language
Theory:
Random Forest Regression
Random forest is a type of supervised machine learning algorithm based on ensemble learning.
Ensemble learning is a type of learning where different types of algorithms, or the same algorithm applied
multiple times, are combined to form a more powerful prediction model. The random forest algorithm combines multiple
algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence the name
"Random Forest". The random forest algorithm can be used for both regression and classification tasks.

Sample Coding:
Step1: Import libraries
Coding:

Output:

Step 2: Define the features and the target

Coding:

Output:

Step 3: Split the dataset into train and test sets

Coding:

Step 4: Build the random forest regression model with random forest regressor function
Coding:

Step 5: Evaluate the random forest regression model


Coding:

Output:
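
A minimal consolidated sketch of Steps 1 to 5 is given below, assuming a local file 'housing.csv' whose target column is 'price' and whose remaining columns are numeric economic factors (the file name, column name, and numeric-only features are all assumptions).

# Minimal sketch of Steps 1-5 (assumed file 'housing.csv' with a numeric target column 'price')
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Steps 1-2: load the data, define the features and the target
data = pd.read_csv('housing.csv')
X = data.drop('price', axis=1)        # economic factors used as features (assumed numeric)
y = data['price']                     # selling price to be predicted

# Step 3: split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 4: build the random forest regression model
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X_train, y_train)

# Step 5: evaluate the model
y_pred = regressor.predict(X_test)
print('MAE :', mean_absolute_error(y_test, y_pred))
print('MSE :', mean_squared_error(y_test, y_pred))
print('RMSE:', mean_squared_error(y_test, y_pred) ** 0.5)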

Schedule (sessions 8.45am to 4.15pm; Tea Breaks 10.25am to 10.40am and 3.10pm to 3.25pm; Lunch Break 12.20pm to 1.30pm):

Day 8: Introduction | Exercise -1 | Exercise -2 | Exercise -3

Description of the task:


Download the temperature dataset of a city (Excel or CSV format) from Kaggle and predict the
maximum temperature for tomorrow in that city using one year of past weather data.

Assessment specification:

S.No | Type | Aspect Description | Marking criteria | Requirement | Max score (total 10)
1 | J | Downloading correct dataset | 2 marks if the CSV dataset is downloaded; 0 marks if a wrong dataset is downloaded | Browser | 2
2 | J | Splitting data | 1 mark if the data is divided into attributes and labels; 0 marks if not | Python program | 1
3 | J | Training and test data | 1 mark if the data is divided into training and test data; 0 marks if not | Python program | 1
4 | J | Feature scaling | 1 mark if the StandardScaler class is used for feature scaling; 0 marks if not | Python program | 1
5 | J | Training the algorithm | 1 mark if the random forest regressor is imported; 0 marks if not | Python program | 1
6 | J | Regressor object | 1 mark if a regressor object is created to solve the regression problem; 0 marks if not | Python program | 1
7 | J | Estimate the performance | 2 marks if mean absolute error, mean squared error, and root mean squared error are used to determine the performance; 0 marks if not | Python program | 2
8 | J | Visualization of result | 1 mark if the regression result is visualized with a graph; 0 marks if not | Python program | 1

Conclusion:
Thus the implementation of the Random Forest Regression algorithm using the Python IDE is
completed successfully.
DAY 8 GRADIENT BOOSTING REGRESSION

Objectives:
The aim of the task is to make the students understand the implementation of gradient boosting
regression using R

Outcome:
Students will be able to apply the boosting method, which aims to optimise an arbitrary
(differentiable) cost function.

Resources required: R for windows

Prerequisites: Basic knowledge on R programming.

Theory: Gradient Boosting

Boosting is a class of ensemble learning techniques for regression and classification problems.
Boosting aims to build a set of weak learners (i.e. predictive models that are only slightly better than
random chance) and combine them into one 'strong' learner (i.e. a predictive model that predicts the
response variable with a high degree of accuracy). Gradient boosting is a machine learning technique for
regression and classification problems which produces a prediction model in the form of an ensemble of
weak prediction models, typically decision trees.

Sample coding
Step1: Importing packages
Coding

require(gbm)
require(MASS)#package with the boston housing dataset
#separating training and test data
train =sample(1:506,size=374)

Step2: Apply Boston housing dataset to predict the median value of houses

Coding:

Boston.boost=gbm(medv ~ . ,data = Boston[train,],distribution = "gaussian",n.trees =
10000,shrinkage = 0.01, interaction.depth = 4)
Boston.boost
summary(Boston.boost)
Output: #Summary gives a table of Variable Importance and a plot of Variable
Importance
gbm(formula = medv ~ ., distribution = "gaussian", data = Boston[-train,], n.trees = 10000,
interaction.depth = 4, shrinkage = 0.01)
A gradient boosted model with gaussian loss function.
10000 iterations were performed.
There were 13 predictors of which 13 had non-zero influence.
>summary(Boston.boost)
var rel.inf
rm rm 36.96963915
lstat lstat 24.40113288
dis dis 10.67520770
crim crim 8.61298346
age age 4.86776735
black black 4.23048222
nox nox 4.06930868
ptratio ptratio 2.21423811
tax tax 1.73154882
rad rad 1.04400159
indus indus 0.80564216
chas chas 0.28507720
zn zn 0.09297068
Step 3: Plotting the Partial Dependence Plot
#Plot of Response variable with lstat variable
plot(Boston.boost,i="lstat")
#Inverse relation with lstat variable
plot(Boston.boost,i="rm")
#as the average number of rooms increases the price increases

Step 4: Prediction on Test Set
Coding:

n.trees = seq(from=100 ,to=10000, by=100) #no of trees-a vector of 100 values
#Generating a Prediction matrix for each Tree
predmatrix<-predict(Boston.boost,Boston[-train,],n.trees = n.trees)
dim(predmatrix) #dimensions of the Prediction Matrix
#Calculating The Mean squared Test Error
test.error<-with(Boston[-train,],apply( (predmatrix-medv)^2,2,mean))
head(test.error) #contains the Mean squared test error for each of the 100 trees averaged
#Plotting the test error vs number of trees
plot(n.trees, test.error, pch=19, col="blue", xlab="Number of Trees", ylab="Test Error", main =
"Performance of Boosting on Test Set")
#adding the RandomForests Minimum Error line trained on same data and similar
parameters
abline(h = min(test.err),col="red") #test.err is the test error of a Random forest fitted on same
data
legend("topright",c("Minimum Test error Line for Random Forests"),col="red",lty=1,lwd=1)
Output:

dim(predmatrix)
[1] 206 100
head(test.error)
100 200 300 400 500 600
26.428346 14.938232 11.232557 9.221813 7.873472 6.911313

Schedule (sessions 8.45am to 4.15pm; Tea Breaks 10.25am to 10.40am and 3.10pm to 3.25pm; Lunch Break 12.20pm to 1.30pm):

Day 9: Introduction – Random Forest vs Gradient Boosting | Gradient Boosting Example | Hands on Session for Gradient boosting | Gradient boosting regression task

Description of the task:


Download the housing dataset (Excel or CSV format) from Kaggle and predict the median prices
of homes located in the Boston area given other attributes of the house.
Sample Output:

Assessment specification:

S.No | Type | Aspect Description | Marking criteria | Requirement | Max score (total 10)
1 | J | Downloading correct dataset | 2 marks if the CSV dataset is downloaded; 0 marks if a wrong dataset is downloaded | Browser | 2
2 | J | Splitting of data | 1 mark if the data is divided into training and test data; 0 marks if not | R program | 1
3 | J | Prediction of median value | 1 mark if the median value of the houses is computed; 0 marks if not | R program | 1
4 | J | Plotting graph | 1 mark if the response variable and the lstat variable are plotted as a graph; 0 marks if not | R program | 1
5 | J | Prediction matrix | 1 mark if the prediction matrix is generated for each tree; 0 marks if not | R program | 1
6 | J | Mean squared test error | 2 marks if the mean squared test error is computed; 0 marks if not | R program | 2
7 | J | Plotting test error | 1 mark if the mean squared test error is plotted as a graph; 0 marks if not | R program | 1
8 | J | Visualization of result | 1 mark if the performance of boosting on the test set is visualized with a graph; 0 marks if not | R program | 1

Conclusion:
Thus the implementation of the Gradient Boosting Regression algorithm using the R tool is
completed successfully.

DAY 9 SUPPORT VECTOR REGRESSION

Objectives:
The aim of the task is to make the students understand the implementation of support vector
regression using Python.

Outcome:
Students will be able to apply the support vector regression method, which aims to optimise an arbitrary
(differentiable) cost function by fitting the best line within a predefined threshold error value.

Resources required: Python for windows

Prerequisites: Basic knowledge on Linear Regression.

Theory:

Support Vector Regression:


Support Vector Regression is a type of Support Vector Machine that supports linear and non-linear
regression. As shown in the graph below, the aim is to fit as many instances as possible between the
lines while limiting margin violations. The margin of tolerance in this setting is represented by ε (epsilon).
SVR requires training data {X, Y} that covers the domain of interest and is accompanied by
target values on that domain.

Sample Coding:
Step1: Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step2: Reading the dataset

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

Step 3: Feature Scaling


from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y.reshape(-1, 1)).ravel()  # y must be reshaped to 2D for the scaler

Step 4: Fitting SVR to the dataset

from sklearn.svm import SVR

regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)

Step 5: Predicting a new result

# Predict for a new position level (an example value) and rescale the prediction back to the original salary scale
y_pred = sc_y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])).reshape(-1, 1))
print(y_pred)

Step 6: Visualizing the SVR results (for higher resolution and smoother curve)
X_grid = np.arange(min(X), max(X), 0.01) #this step required because data is feature scaled.
X_grid = X_grid.reshape((len(X_grid), 1))
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

Output:

Schedule (sessions 8.45am to 4.15pm; Tea Breaks 10.25am to 10.40am and 3.10pm to 3.25pm; Lunch Break 12.20pm to 1.30pm):

Day 10: Introduction – Linear Regression vs Support Vector Regression | Support Vector Example | Hands on Session for Support vector regression | Support vector regression task

Description of the task:


Download the salary dataset of employees working in an organisation (Excel or CSV format) from
Kaggle and predicting the salary of the employees according to the position held in the working
organization.

Assessment specification:

S.No | Type | Aspect Description | Marking criteria | Requirement | Max score (total 10)
1 | J | Downloading correct dataset | 2 marks if the CSV dataset is downloaded; 0 marks if a wrong dataset is downloaded | Browser | 2
2 | J | Feature scaling | 1 mark if the StandardScaler class is used to perform feature scaling; 0 marks if not | Python program | 1
3 | J | Support vector regressor | 2 marks if a support vector regressor is created to fit the regression model; 0 marks if not | Python program | 2
4 | J | Predicting new result | 1 mark if the predict method is applied to make predictions with the trained model; 0 marks if not | Python program | 1
5 | J | Mean squared test error | 2 marks if the mean squared test error is computed; 0 marks if not | Python program | 2
6 | J | Visualizing the SVR results | 2 marks if the SVR result is visualized with a graph; 0 marks if not | Python program | 2

Conclusion
By applying Support Vector Regression, the prediction of the employees' salaries according to the
position held in the working organization is completed successfully.

DAY 10
LOGISTIC REGRESSION

Objectives:
The aim of the task is to make the students understand how logistic regression is used to classify binary
response variables.
Outcome:
Students will be able to understand the need to identify dependent and independent variables for
predicting binary classes, and to understand the computation of event occurrence probability.

Resources required: Python Version 3.5


Prerequisites: Basic knowledge on Python programming
Theory:
Logistic Regression
Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability
of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that
contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression
model predicts P(Y=1) as a function of X. Linear Regression is used to determine the value of a
continuous dependent variable. Logistic Regression is generally used for classification purposes. Unlike
Linear Regression, the dependent variable can take a limited number of values only i.e, the dependent
variable is categorical. When the number of possible outcomes is only two it is called Binary Logistic
Regression.

Sample coding
Step 1: Data gathering
The data required to build a logistic regression model in Python, in order to determine whether
candidates would be admitted to a prestigious university, is gathered. The two possible outcomes
are: Admitted (represented by the value '1') and Rejected (represented by the value '0'). The
logistic regression model consists of:
(i) The dependent variable, which represents whether a person gets admitted; and
(ii) The 3 independent variables: the GMAT score, GPA, and years of work experience

Step 2: Import libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt

Step 3: Build a dataframe


import pandas as pd
candidates = {'gmat':
[780,750,690,710,680,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,
620,600,550,550,570,670,660,580,650,660,640,620,660,660,680,650,670,580,590,690],
'gpa':
[4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.
3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
'work_experience':
[3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
'admitted':
[1,1,0,1,0,1,0,1,1,0,0,1,1,0,1,0,0,1,0,0,1,0,0,0,0,1,1,0,1,1,0,0,1,1,1,0,0,0,0,1]
}

Step 4: Create the logistic regression in Python

df = pd.DataFrame(candidates, columns=['gmat', 'gpa', 'work_experience', 'admitted'])
print(df)

X = df[['gmat', 'gpa', 'work_experience']]  # Independent variables
y = df['admitted']                          # Dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)  # train_test_split applied
Apply the logistic regression as follows:

logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)

Create confusion matrix

confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'],


colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))
plt.show()

Output:

Accuracy = (TP+TN)/Total = (4+4)/10 = 0.8

Schedule (sessions 8.45am to 4.15pm; Tea Breaks 10.25am to 10.40am and 3.10pm to 3.25pm; Lunch Break 12.20pm to 1.30pm):

Day 12: Introduction to Logistic Regression | Ex. 1: Predicting the variables and data exploration | Visualization and Recursive Feature Elimination | Logistic Regression Model Fitting | ROC Curve

Description of the task


Download the dataset containing the information of users from a company's database and apply the
logistic regression model to predict whether a user will purchase the company's newly launched product
or not.
Output:

Confusion Matrix:
[[65 3]
[8 24]]

Accuracy: 0.89
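
A minimal sketch that could produce output of this form is shown below. The file name 'Social_Network_Ads.csv' and its columns 'Age', 'EstimatedSalary', and 'Purchased' are assumptions about the downloaded dataset; feature scaling with StandardScaler is included because the assessment below awards marks for it.

# Minimal sketch (assumed file 'Social_Network_Ads.csv' with columns 'Age', 'EstimatedSalary', 'Purchased')
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

data = pd.read_csv('Social_Network_Ads.csv')
X = data[['Age', 'EstimatedSalary']]
y = data['Purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # fit the scaler on the training data only
X_test = sc.transform(X_test)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))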

Assessment specification:

S.No | Type | Aspect Description | Marking criteria | Requirement | Max score (total 10)
1 | J | Downloading correct dataset | 2 marks if the CSV dataset is downloaded; 0 marks if a wrong dataset is downloaded | Browser | 2
2 | J | Splitting the data set | 2 marks if the data is split into training and test data; 0 marks if not | Python program | 2
3 | J | Feature scaling | 2 marks if the StandardScaler class is used to perform feature scaling; 0 marks if not | Python program | 2
4 | J | Training the logistic regression model | 1 mark if the logistic regression classifier is used to train the model; 0 marks if not | Python program | 1
5 | J | Prediction on testing data | 1 mark if predict() is used to make predictions on the testing data; 0 marks if not | Python program | 1
6 | J | Confusion matrix | 1 mark if the confusion matrix is generated from the predicted result; 0 marks if not | Python program | 1
7 | J | Visualizing the test results | 1 mark if the performance result is visualized in the form of a graph; 0 marks if not | Python program | 1

Conclusion:
Thus the implementation of Logistic Regression algorithm, using python programming is
completed successfully.

DAY 11 INTRODUCTION TO CLASSIFICATION AND DIFFERENT TYPES OF
CLASSIFICATION ALGORITHMS AND NAÏVE BAYES ALGORITHM

Objectives:
The aim of the task is to classify the given test dataset using Naïve Bayes algorithm to predict the
accuracy.
Outcome:
Students will be able to apply python libraries for the given dataset and also they will be able to
identify the accuracy using Naïve Bayes algorithm.
Resources required: Python Version 3.5
Prerequisites: Knowledge on Python Programming language
Theory:
Naive Bayes is a probabilistic classifier in Machine Learning which is built on the principle of Bayes
theorem. Naive Bayes classifier makes an assumption that one particular feature in a class is
unrelated to any other feature and that is why it is known as naive.
Bayes' theorem is stated below:
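
In the notation of the bullets that follow, the theorem reads:

P(h|D) = P(D|h) × P(h) / P(D)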

• P(h): the probability of hypothesis H being true (regardless of the data). This is known as the
prior probability of h.
• P(D): the probability of the data (regardless of the hypothesis). This is known as the prior
probability.
• P(h|D): the probability of hypothesis h given the data D. This is known as posterior probability.
• P(D|h): the probability of data D given that the hypothesis h was true. This is known as the
likelihood.
Sample Coding
from sklearn import datasets
iris = datasets.load_iris()
print(iris)
print("Features: ", iris.feature_names)
print ("Labels: ", iris.target_names)
print(iris.data[0:5])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3,random_state=1
09)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Schedule (sessions 8.45am to 4.15pm; Tea Breaks 10.25am to 10.40am and 3.10pm to 3.25pm; Lunch Break 12.20pm to 1.30pm):

Day 13: Introduction to different classification algorithms | Exercise -1 | Naive Bayes Algorithm | Exercise -2 | Introduction to different classification algorithms

Description of the task:

Download the dataset containing the weather information from Kaggle and classify whether
players will play or not based on weather condition
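
A minimal sketch of this task is given below; the file name 'weather.csv' and the column names 'Outlook', 'Temperature', and 'Play' are assumptions about the downloaded dataset.

# Minimal sketch (assumed file 'weather.csv' with categorical columns 'Outlook', 'Temperature' and target 'Play')
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

data = pd.read_csv('weather.csv')
le = LabelEncoder()
encoded = data.apply(le.fit_transform)        # encode every categorical column as integers

X = encoded[['Outlook', 'Temperature']]       # weather conditions used as features
y = encoded['Play']                           # whether the players play or not

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))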
Sample Output:

Assessment specification:

S.No | Type | Aspect Description | Marking criteria | Requirement | Max score (total 10)
1 | J | Downloading correct dataset | 2 marks if the CSV dataset is downloaded; 0 marks if a wrong dataset is downloaded | Browser | 2
2 | J | Splitting the data set | 2 marks if the data is split into training and test data; 0 marks if not | Python program | 2
3 | J | Feature scaling | 2 marks if the StandardScaler class is used to perform feature scaling; 0 marks if not | Python program | 2
4 | J | Training the Naïve Bayes model | 1 mark if the Naïve Bayes classifier is used to train the model; 0 marks if not | Python program | 1
5 | J | Prediction on testing data | 1 mark if predict() is used to make predictions on the testing data; 0 marks if not | Python program | 1
6 | J | Confusion matrix | 1 mark if the confusion matrix is generated from the predicted result; 0 marks if not | Python program | 1
7 | J | Visualizing the test results | 1 mark if the performance result is visualized in the form of a graph; 0 marks if not | Python program | 1

Conclusion:

Thus the implementation of Naïve Bayes algorithm, using python programming is


completed successfully.

DAY 12 STOCHASTIC GRADIENT DESCENT ALGORITHM

Objectives:
The aim of the task is to classify the given test dataset using Stochastic Gradient Descent Algorithm to
predict the accuracy level.
Outcome:
Students will be able to apply Python libraries to the given dataset and identify the accuracy using
the Stochastic Gradient Descent classifier.
Resources required: Python Version 3.5
Prerequisites: Knowledge on Python Programming language
Theory:
Gradient descent is the backbone of many machine learning algorithms. Imagine that you are on a mountain,
blindfolded, and your task is to come down from the mountain to the flat land without assistance. The
only assistance you have is a gadget which tells you your height above sea level. What would your approach
be? You would start to descend in some random direction and then ask the gadget what the height is now. If
the gadget tells you a height that is more than the initial height, then you know you started in the wrong
direction. You change the direction and repeat the process. In this way, after many iterations, you finally
descend successfully. Here is the analogy in machine learning terms:
Size of the steps taken in any direction = Learning rate
What the gadget tells you (the height) = Cost function
The direction of your steps = Gradients

It looks simple, but how can we represent this mathematically? Here is the maths:

Where m = Number of observations


where alpha = Learning Rate
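
As a hedged reconstruction (the standard squared-error cost for linear regression is assumed here), the cost function and the gradient descent update can be written as:

\[
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)^{2},
\qquad
\theta_j := \theta_j - \alpha\,\frac{\partial J(\theta)}{\partial \theta_j}
\]

In stochastic gradient descent the same update is applied, but the gradient is estimated from a single randomly chosen observation (or a small batch, as in the code below) rather than from all m observations at once.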
Sample Coding:
import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import load_boston
from sklearn import preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from prettytable import PrettyTable
from sklearn.linear_model import SGDRegressor
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from numpy import random
from sklearn.model_selection import train_test_split
print("DONE")
#Data Loading, Splitting and Standardizing
boston_data=pd.DataFrame(load_boston().data,columns=load_boston().feature_names)
Y=load_boston().target
X=load_boston().data
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.3)

print("X Shape: ",X.shape)


print("Y Shape: ",Y.shape)
print("X_Train Shape: ",x_train.shape)
print("X_Test Shape: ",x_test.shape)
print("Y_Train Shape: ",y_train.shape)
print("Y_Test Shape: ",y_test.shape)

# standardizing data
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test=scaler.transform(x_test)

## Adding the PRIZE Column in the data


train_data=pd.DataFrame(x_train)
train_data['price']=y_train
train_data.head(3)

x_test=np.array(x_test)
y_test=np.array(y_test)

#Linear Regression with Scikit Learn’s SGDRegressor

# SkLearn SGD classifier


n_iter=100
clf_ = SGDRegressor(max_iter=n_iter)
clf_.fit(x_train, y_train)
y_pred_sksgd=clf_.predict(x_test)
plt.scatter(y_test,y_pred_sksgd)
plt.grid()
plt.xlabel('Actual y')
plt.ylabel('Predicted y')
plt.title('Scatter plot from actual y and predicted y')
plt.show()

print('Mean Squared Error :',mean_squared_error(y_test, y_pred_sksgd))

#Linear Regression with our Custom SGD

def MyCustomSGD(train_data,learning_rate,n_iter,k,divideby):

# Initially we will keep our W and B as 0 as per the Training Data


w=np.zeros(shape=(1,train_data.shape[1]-1))
b=0

cur_iter=1
while(cur_iter<=n_iter):

# We will create a small training data set of size K


temp=train_data.sample(k)

# We create our X and Y from the above temp dataset


y=np.array(temp['price'])
x=np.array(temp.drop('price',axis=1))

# We keep our initial gradients as 0


w_gradient=np.zeros(shape=(1,train_data.shape[1]-1))
b_gradient=0

for i in range(k): # Calculating gradients for point in our K sized dataset


prediction=np.dot(w,x[i])+b
w_gradient=w_gradient+(-2)*x[i]*(y[i]-(prediction))
b_gradient=b_gradient+(-2)*(y[i]-(prediction))

#Updating the weights(W) and Bias(b) with the above calculated Gradients
w=w-learning_rate*(w_gradient/k)
b=b-learning_rate*(b_gradient/k)

# Incrementing the iteration value


cur_iter=cur_iter+1

#Dividing the learning rate by the specified value


learning_rate=learning_rate/divideby

return w,b #Returning the weights and Bias


#a small Predict function

def predict(x,w,b):
y_pred=[]
for i in range(len(x)):
y = (np.dot(w, x[i]) + b).item()  # extract the scalar prediction (np.asscalar is deprecated)
y_pred.append(y)
return np.array(y_pred)

#Custom SGD with the following parameters


#Learning rate = 1
#Batch Size K=10
#Divide the Learning rate by = 2

w,b=MyCustomSGD(train_data,learning_rate=1,n_iter=100,divideby=2,k=10)
y_pred_customsgd=predict(x_test,w,b)

plt.scatter(y_test,y_pred_customsgd)
plt.grid()
plt.xlabel('Actual y')
plt.ylabel('Predicted y')
plt.title('Scatter plot from actual y and predicted y')
plt.show()
print('Mean Squared Error :',mean_squared_error(y_test, y_pred_customsgd))

#Improve CustomSGD Result by changing the parameters as:


#Learning rate = 0.001
#Iterations = 1000
#Divide the Learning rate by = 1 (ie. Not Dividing :-P)
w,b=MyCustomSGD(train_data,learning_rate=0.001,n_iter=1000,divideby=1,k=10)
y_pred_customsgd_improved=predict(x_test,w,b)

plt.scatter(y_test,y_pred_customsgd_improved)
plt.grid()
plt.xlabel('Actual y')
plt.ylabel('Predicted y')
plt.title('Scatter plot from actual y and predicted y')
plt.show()
print('Mean Squared Error :',mean_squared_error(y_test, y_pred_customsgd_improved))

Schedule (sessions 8.45am to 4.15pm; Tea Breaks 10.25am to 10.40am and 3.10pm to 3.25pm; Lunch Break 12.20pm to 1.30pm):

Day 14: Gradient Descent Algorithm | Exercise -1 | Exercise -2 | Exercise -3

Description of the task:

Download the Boston dataset from Kaggle and perform Linear Regression on Boston Housing data
using Scikit Learn’s SGDRegressor and visualize the results

Sample Output:

Assessment specification:

S.No | Type | Aspect Description | Marking criteria | Requirement | Max score (total 10)
1 | J | Downloading correct dataset | 2 marks if the CSV dataset is downloaded; 0 marks if a wrong dataset is downloaded | Browser | 2
2 | J | Splitting the data set | 2 marks if the data is split into training and test data; 0 marks if not | Python program | 2
3 | J | Linear Regression | 2 marks if Linear Regression is used to train the data; 0 marks if not | Python program | 2
4 | J | Mean Squared Error | 1 mark if the mean squared error is determined; 0 marks if no mean squared error is used | Python program | 1
5 | J | SGD Regressor | 1 mark if the SGD Regressor is used to train the data; 0 marks if not | Python program | 1
6 | J | Custom SGD | 1 mark if a custom SGD is used to train the data; 0 marks if not | Python program | 1
7 | J | Visualizing the test results | 1 mark if the performance result is visualized in the form of a graph; 0 marks if not | Python program | 1

Conclusion:

Thus the implementation of Stochastic Gradient Descent algorithm, using python programming
is completed successfully.

DAY 13 K-NEAREST NEIGHBOURS ALGORITHM

Objectives:

The aim of the task is to classify the given test dataset using K- Nearest Neighbour to predict the accuracy
level.

Outcome:

Students will be able to apply Python libraries to the given dataset and identify the
accuracy using the KNN classifier.

Resources required: Python Version 3.5

Prerequisites: Knowledge on Python Programming language

Theory:

• Classifying the input data is a very important task in Machine Learning, for example, whether a mail is
genuine or spam, whether a transaction is fraudulent or not and there are multiple other examples.

• Let's say, you live in a gated housing society and your society has separate dustbins for different types of
waste: one for paper waste, one for plastic waste, and so on. What you are basically doing over here is
classifying the waste into different categories. So, classification is the process of assigning a ‘class label’
to a particular item. In the above example, we are assigning the labels ‘paper’, ‘metal’, ‘plastic’, and so on
to different types of waste.

K-Nearest Neighbor
KNN is a non-parametric and lazy learning algorithm. Non-parametric means there is no assumption about the
underlying data distribution.
Sample Coding:
# Assigning features and label variables
# First Feature
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
# Second Feature
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']
# Label or target variable
play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.

weather_encoded=le.fit_transform(weather)
print(weather_encoded)
# converting string labels into numbers
temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)
#combining weather and temp into a single list of tuples
features=list(zip(weather_encoded,temp_encoded))
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
# Train the model using the training sets
model.fit(features,label)
#Predict Output
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print(predicted)
Schedule (sessions 8.45am to 4.15pm; Tea Breaks 10.25am to 10.40am and 3.10pm to 3.25pm; Lunch Break 12.20pm to 1.30pm):

Day 15: KNN Algorithm | Exercise -1 | Exercise -2 | Exercise -3

Description of the task:

Download the wine dataset from Kaggle and perform classification on wine data to classify the three types
of wine using KNearest Neighbor algorithm and visualize the results
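
A minimal sketch of this task is given below. To keep the example self-contained it uses scikit-learn's built-in copy of the wine data (load_wine) as a stand-in for the downloaded Kaggle CSV; with the Kaggle file, pd.read_csv and the corresponding column names would be used instead.

# Minimal sketch: classify the three wine classes with K-Nearest Neighbors
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target,
                                                    test_size=0.3, random_state=1)

sc = StandardScaler()                 # KNN is distance based, so scaling the features helps
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))

# Simple visualization of the predictions on two of the (scaled) features
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred)
plt.xlabel(wine.feature_names[0] + ' (scaled)')
plt.ylabel(wine.feature_names[1] + ' (scaled)')
plt.title('KNN predictions on the wine test set')
plt.show()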

Sample Output:

Assessment specification:

S.No | Type | Aspect Description | Marking criteria | Requirement | Max score (total 10)
1 | J | Downloading correct dataset | 1 mark if the CSV dataset is downloaded; 0 marks if a wrong dataset is downloaded | Browser | 1
2 | J | Importing dataset | 1 mark if the dataset is correctly imported and displayed; 0 marks if no dataset is imported | Python | 1
3 | J | Splitting the dataset into dependent and independent variables | 1 mark if the dataset is split into dependent and independent variables using the iloc function; 0 marks if not | Python | 1
4 | J | Preprocessing the data | 1 mark if preprocessing procedures are carried out; 0 marks if there is no preprocessing of the data | Python | 1
5 | J | Separate data frames for features and labels | 1 mark if separate dataframes are created to store the features and target labels; 0 marks if not | Python | 1
6 | J | Splitting data into training and testing data | 1 mark if the train_test_split function is used to split the dataset into training and testing data; 0 marks if not | Python | 1
7 | J | K-Nearest Neighbor | 1 mark if K-Nearest Neighbor is used to train the dataset; 0 marks if not | Python | 1
8 | J | Prediction | 1 mark if the test data is used for evaluation; 0 marks if there is no testing | Python | 1
9 | J | Evaluation metrics | 1 mark if the confusion matrix is used to determine the accuracy of the model; 0 marks if no metrics are used | Python | 1
10 | J | Visualization | 1 mark if matplotlib or seaborn is used for visualizing the model results; 0 marks if there are no visualization results | Python | 1

Conclusion:

Thus the implementation of KNearest Neighbor algorithm, using python programming is completed
successfully.

DAY 14 SUPPORT VECTOR MACHINES CLASSIFICATION

Objective:
The aim of this task is to use SVM to represent different classes with a hyperplane in
multidimensional space.
Outcome:
Students will be able to
• Understand how SVM distinctly classifies data points
• Apply how to maximize the margin of the classifier
Resource Required: Python IDE (Jupyter or Anaconda or Spyder or PyCharm)

Prerequisites: Python Programming Basics

Theory:

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems. However, primarily, it is used for Classification problems
in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the future.
This best decision boundary is called a hyperplane.

Sample Coding:

#Import scikit-learn dataset library

from sklearn import datasets

#Load dataset

cancer = datasets.load_breast_cancer()

# print the names of the features

print("Features: ", cancer.feature_names)

# print the label type of cancer('malignant' 'benign')

print("Labels: ", cancer.target_names)

# Import train_test_split function

from sklearn.model_selection import train_test_split

# Split dataset into training set and test set

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.3,random_state=109)

#Import svm model

from sklearn import svm

#Create a svm Classifier

clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets

clf.fit(X_train, y_train)

#Predict the response for test dataset

y_pred = clf.predict(X_test)

#Import scikit-learn metrics module for accuracy calculation

from sklearn import metrics

# Model Accuracy: how often is the classifier correct?

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

# Model Precision: what percentage of positive tuples are labeled as such?

print("Precision:",metrics.precision_score(y_test, y_pred))

# Model Recall: what percentage of positive tuples are labelled as such?

print("Recall:",metrics.recall_score(y_test, y_pred))

Schedule (sessions 8.45am to 4.15pm; Tea Breaks 10.25am to 10.40am and 3.10pm to 3.25pm; Lunch Break 12.20pm to 1.30pm):

Day 15: SVM Algorithm | Exercise -1 | Exercise -2 | Exercise -3

Description of the task:

Download the iris dataset from Kaggle and perform classification on iris data to classify the three types of
flowers using SVM and visualize the results.
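
A minimal sketch of this task is given below, using scikit-learn's built-in iris data as a stand-in for the downloaded Kaggle CSV (an assumption; with the Kaggle file, pd.read_csv would be used instead).

# Minimal sketch: classify the three iris species with an SVM
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import svm, metrics

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.3, random_state=109)

clf = svm.SVC(kernel='linear')        # linear kernel, as in the sample coding above
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Accuracy:', metrics.accuracy_score(y_test, y_pred))
print('Confusion matrix:\n', metrics.confusion_matrix(y_test, y_pred))

# Visualize the predictions on petal length vs petal width
plt.scatter(X_test[:, 2], X_test[:, 3], c=y_pred)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.title('SVM predictions on the iris test set')
plt.show()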

Sample Output:

Assessment specification:

S.No | Type | Aspect Description | Marking criteria | Requirement | Max score (total 10)
1 | J | Downloading correct dataset | 1 mark if the CSV dataset is downloaded; 0 marks if a wrong dataset is downloaded | Browser | 1
2 | J | Importing dataset | 1 mark if the dataset is correctly imported and displayed; 0 marks if no dataset is imported | Python | 1
3 | J | Splitting the dataset into dependent and independent variables | 1 mark if the dataset is split into dependent and independent variables using the iloc function; 0 marks if not | Python | 1
4 | J | Preprocessing the data | 1 mark if preprocessing procedures are carried out; 0 marks if there is no preprocessing of the data | Python | 1
5 | J | Separate data frames for features and labels | 1 mark if separate dataframes are created to store the features and target labels; 0 marks if not | Python | 1
6 | J | Splitting data into training and testing data | 1 mark if the train_test_split function is used to split the dataset into training and testing data; 0 marks if not | Python | 1
7 | J | Support Vector Machine | 1 mark if a Support Vector Machine is used to train the dataset; 0 marks if not | Python | 1
8 | J | Prediction | 1 mark if the test data is used for evaluation; 0 marks if there is no testing | Python | 1
9 | J | Evaluation metrics | 1 mark if the confusion matrix is used to determine the accuracy of the model; 0 marks if no metrics are used | Python | 1
10 | J | Visualization | 1 mark if matplotlib or seaborn is used for visualizing the model results; 0 marks if there are no visualization results | Python | 1

Conclusion:

Thus the implementation of Support Vector Machine using python programming is completed
successfully.

DAY 15 TIME SERIES PREDICTION USING ARIMA MODEL

Objective:
The aim of this task is to use Arima Model to predict the time series data
Outcome:

Students will be able to understand the ARIMA model to predict and obtain results for time series data

Resource Required: Python IDE (Jupyter or Anaconda or Spyder or PyCharm)

Prerequisites: Python Programming Basics

Theory:

Time series analysis comprises methods for analyzing time series data in order to extract meaningful
statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values
based on previously observed values. Time series are widely used for non-stationary data, like economic, weather,
stock price, and retail sales data. Forecasting is the next step, where you want to predict the future values the series is
going to take.
ARIMA, short for ‘Auto Regressive Integrated Moving Average’ is actually a class of models that
‘explains’ a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so
that equation can be used to forecast future values.

Sample Coding:

import pandas as pd

from datetime import datetime  # pandas.datetime is deprecated; use the standard library

def parser(x):

return datetime.strptime(x,'%Y-%m')

sales=pd.read_csv('/content/sample_data/sales.csv',parse_dates=[0],date_parser=parser)

sales.head()

sales.Month[1]

sales=pd.read_csv('/content/sample_data/sales.csv',index_col=0,parse_dates=[0],date_parser=parser)

sales.head()

import matplotlib.pyplot as plt

sales.plot()

from statsmodels.graphics.tsaplots import plot_acf

plot_acf(sales)

sales.shift(1)

sales_diff=sales.diff(periods=1)

#integrated of order 1, denoted by d (for diff), one of the parameter of the ARIMA model

sales_diff

sales_diff=sales_diff[1:]

sales_diff.head()

X=sales.values

X.size

train=X[0:15]

test=X[15:]

predictions=[]

from statsmodels.tsa.arima_model import ARIMA

#p,d,q p = periods taken for autoregressive model

#d -> Integrated order, difference

# q periods in moving average model

model_arima = ARIMA(train,order=(1,0,1))

model_arima_fit = model_arima.fit()

print(model_arima_fit.aic)

predictions= model_arima_fit.forecast(steps=10)[0]

predictions

plt.plot(test)

plt.plot(predictions,color='red')

Schedule (sessions 8.45am to 4.15pm; Tea Breaks 10.25am to 10.40am and 3.10pm to 3.25pm; Lunch Break 12.20pm to 1.30pm):

Day 15: SVM Algorithm | Exercise -1 | Exercise -2 | Exercise -3

Description of the task:

Download the COVID dataset from Kaggle and perform prediction using ARIMA model to determine the
increase in cases and deaths

Sample Output:

Assessment specification:

S.No | Type | Aspect Description | Marking criteria | Requirement | Max score (total 10)
1 | J | Downloading correct dataset | 1 mark if the CSV dataset is downloaded; 0 marks if a wrong dataset is downloaded | Browser | 1
2 | J | Importing dataset | 1 mark if the dataset is correctly imported and displayed; 0 marks if no dataset is imported | Python | 1
3 | J | Splitting the dataset into dependent and independent variables | 1 mark if the dataset is split into dependent and independent variables using the iloc function; 0 marks if not | Python | 1
4 | J | Preprocessing the data | 1 mark if preprocessing procedures are carried out; 0 marks if there is no preprocessing of the data | Python | 1
5 | J | Separate data frames for features and labels | 1 mark if separate dataframes are created to store the features and target labels; 0 marks if not | Python | 1
6 | J | Splitting data into training and testing data | 1 mark if the train_test_split function is used to split the dataset into training and testing data; 0 marks if not | Python | 1
7 | J | ARIMA Model | 1 mark if the ARIMA model is used to train the dataset; 0 marks if not | Python | 1
8 | J | Prediction | 1 mark if the test data is used for evaluation; 0 marks if there is no testing | Python | 1
9 | J | Evaluation metrics | 1 mark if the confusion matrix is used to determine the accuracy of the model; 0 marks if no metrics are used | Python | 1
10 | J | Visualization | 1 mark if matplotlib or seaborn is used for visualizing the model results; 0 marks if there are no visualization results | Python | 1

Conclusion:

Thus the implementation of time series prediction with ARIMA model using python programming is
completed successfully.

DAY 16 IMPLEMENTATION OF DECISION TREES

Objectives:
The aim of the task is to provide adequate knowledge on machine learning using decision
tree in PIMA Indian Diabetes dataset.

Outcome: At the end of the task, students can:

• Learn how to use the methodology of a decision tree using Python.
• Understand how the decision tree works on the IRIS dataset.
 Understand how the decision tree works on IRIS dataset.

Resources required: Programming language Python, Chrome browser or any other available
browser.

Theory:
A decision tree is a flowchart-like tree structure where an internal node represents a feature
(or attribute), a branch represents a decision rule, and each leaf node represents the outcome.
The topmost node in a decision tree is known as the root node. It learns to partition on the basis
of the attribute value. It partitions the tree in a recursive manner, called recursive partitioning. This
flowchart-like structure helps in decision making. Its visualization, like a flowchart diagram,
easily mimics human-level thinking. That is why decision trees are easy to understand
and interpret.

Working principle of Decision Tree algorithm work:


The basic idea behind any decision tree algorithm is as follows:

1. Select the best attribute using Attribute Selection Measures (ASM) to split the records.
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Start tree building by repeating this process recursively for each child until one of the
following conditions matches:
• All the tuples belong to the same attribute value.
• There are no more remaining attributes.
• There are no more instances.

Sample Coding:

Decision Tree Classifier Building in Scikit-learn:

Step 1: Importing Required Libraries:

Let's first load the required libraries.

Step 2: Loading Data

Let's first load the required Pima Indian Diabetes dataset using pandas' read CSV function.

Step 3: Splitting Data

To understand model performance, dividing the dataset into a training set and a test set is a good
strategy. Let's split the dataset by using the function train_test_split(). You need to pass 3 parameters:
features, target, and test set size.

# Split dataset into training set and test set


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70%
training and 30% test

Step 4: Building Decision Tree Model

Let's create a Decision Tree Model using Scikit-learn.

Step 5: Evaluating Model
Let's estimate how accurately the classifier or model can predict the outcome. Accuracy
can be computed by comparing the actual test set values and the predicted values.

Here, a classification rate of 67.53% was obtained, which is considered good accuracy. You can improve this
accuracy by tuning the parameters of the Decision Tree algorithm.

Step 6: Visualizing Decision Trees


You can use Scikit-learn's export_graphviz function to display the tree within a Jupyter
notebook. For plotting the tree, you also need to install graphviz and pydotplus.

pip install graphviz

pip install pydotplus

The export_graphviz function converts the decision tree classifier into a dot file, and pydotplus converts this
dot file to PNG or a displayable form in Jupyter.
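
A minimal sketch of Steps 1 to 6 is given below; the file name 'diabetes.csv' and its column names follow the commonly distributed Pima Indians Diabetes CSV and are assumptions.

# Minimal sketch of Steps 1-6 (assumed file 'diabetes.csv' with the usual Pima Indians Diabetes columns)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv('diabetes.csv', header=0, names=col_names)   # Step 2: load the data

feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols]          # features
y = pima['label']               # target (diabetic or not)

# Step 3: 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Step 4: build the decision tree model
clf = DecisionTreeClassifier(max_depth=3)   # limiting the depth is one way to tune the tree
clf = clf.fit(X_train, y_train)

# Step 5: evaluate the model
y_pred = clf.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))

# Step 6: export the tree for visualization (requires graphviz and pydotplus)
export_graphviz(clf, out_file='tree.dot', feature_names=feature_cols,
                class_names=['0', '1'], filled=True)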

Schedule (sessions 8.45am to 4.15pm; Tea Breaks 10.25am to 10.40am and 3.10pm to 3.25pm; Lunch Break 12.20pm to 1.30pm):

Day 15: Decision Tree Algorithm | Exercise -1 | Exercise -2 | Exercise -3

Description of the task:

Download the IRIS dataset (Excel or CSV format) from Kaggle, import it in a Python IDE, and print the
decision tree.

Sample Output in Python:

Output:

Assessment specification:

S.No | Type | Aspect Description | Marking criteria | Requirement | Max score (total 10)
1 | J | Load the dataset | 1 mark if pandas is imported; 1 mark if the IRIS dataset is imported; 0 marks if neither is imported | - | 2
2 | J | Splitting data into training and test sets | 0.5 mark if 75% of the data is set for training; 0.5 mark if 25% of the data is set for testing; 0 marks if no train/test split is found | - | 1
3 | J | Create model | 0.25 mark for importing the model to use; 0.25 mark for making an instance of the model; 0 marks if no import or instance code is found | - | 0.5
4 | J | Model pattern | 0.5 mark if training is performed on the model; 0.5 mark if prediction is made on the labels of unseen data; 0 marks if neither is done | - | 1
5 | J | Measuring model performance | 1 mark if the accuracy of the model is computed; 0 marks if no accuracy calculation is found | - | 1
6 | J | Tuning the depth of the tree | 1 mark if the optimal value of the depth of the tree is found; 0 marks if no depth calculation is found | - | 1
7 | J | Calculation of feature importance | 1 mark for setting up a data frame of feature importance; 0.5 mark if the values are sorted on feature importance; 0 marks if there is no relevant code for feature importance | - | 1.5
8 | M | Time management | 1 mark if completed within 30 min; 0.5 mark if completed within 30 to 45 min; 0 marks if exceeding 45 min | 30 mins | 1
9 | J | Coding ethics: (i) indentation, (ii) overall design look | 1 mark if exhibiting both aspects; 0.5 mark if exhibiting any one aspect; 0 marks if exhibiting neither | - | 1

Conclusion:
Thus the working mechanism of decision tree has been successfully implemented on machine
learning and results have been verified successfully.

DAY 17 IMPLEMENTATION OF RANDOM FOREST ALGORITHM

Objectives:
The aim of the task is to provide adequate knowledge on machine learning using random
forest classifier in IRIS dataset and neural networks.

Outcome: At the end of the task, students can:

• Learn how to use the methodology of random forest using Python.
• Understand how random forest works on the IRIS dataset.
• Know about the concept of neural networks.

Resources required: Programming language Python, Chrome browser or any other


available browser.

Theory:
Random Forests:

Random forest is a supervised learning algorithm. It can be used both for classification
and regression. It is also a flexible and easy-to-use algorithm. A forest is comprised of
trees. It is said that the more trees it has, the more robust a forest is. Random forest creates
decision trees on randomly selected data samples, gets a prediction from each tree, and selects the
best solution by means of voting. It also provides a pretty good indicator of feature
importance.

Working Principle of Random Forest algorithm:


It works in four steps:
1. Select random samples from a given dataset.
2. Construct a decision tree for each sample and get a prediction result from each
decision tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most votes as the final prediction.

Sample code:
Step1: Loading the dataset and print the target and features names

Start by importing the datasets library from scikit-learn, and load the iris dataset with
load_iris().

#Import scikit-learn dataset library


from sklearn import datasets

#Load dataset
iris = datasets.load_iris()

The code given below is to print the names of IRIS dataset.

# print the label species(setosa, versicolor,virginica)


print(iris.target_names)
# print the names of the four features
print(iris.feature_names)

Sample Output for the given code:


['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

The code given below is to print the top 5 records of IRIS dataset.

Sample Code:
# print the iris data (top 5 records)
print(iris.data[0:5])
# print the iris labels (0:setosa, 1:versicolor, 2:virginica)
print(iris.target)

Sample Output for the given code:


[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Step 2: Create a DataFrame of Iris Dataset

Source Code:
# Creating a DataFrame of given iris dataset
import pandas as pd
data=pd.DataFrame({
'sepal length':iris.data[:,0],
'sepal width':iris.data[:,1],
'petal length':iris.data[:,2],
'petal width':iris.data[:,3],
'species':iris.target
})
data.head()

Sample Output for the given code:


sepal length sepal width petal length petal width species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

Step 3: Split features and labels into training and test data

Sample Code:
# Import train_test_split function
from sklearn.model_selection import train_test_split
X=data[['sepal length', 'sepal width', 'petal length', 'petal width']] # Features
y=data['species'] # Labels
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and
30% test

Step 4: Prediction

Sample code:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier
#Create a Random Forest classifier
clf=RandomForestClassifier(n_estimators=100)
#Train the model using the training set
clf.fit(X_train,y_train)
#Predict on the test set
y_pred=clf.predict(X_test)

The code given below is to import the scikit-learn metrics module for accuracy
calculation on IRIS dataset.

Sample code:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Sample Output for the given code:


Accuracy: 0.9333333333333333

The code given below makes a prediction for a single item (one set of flower measurements).

Sample code:
#Make a prediction for a single item
#sepal length = 3
#sepal width = 5
#petal length = 4
#petal width = 2
clf.predict([[3, 5, 4, 2]])

Sample Output for given code:


array([2])

Step 5: Creation of random forest model

The code given below creates a random forest classifier and trains it, so that feature importance
scores can be extracted from the fitted model.

Source code:
#Finding Important Features
from sklearn.ensemble import RandomForestClassifier

#Create a Random Forest classifier with 100 trees
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets
clf.fit(X_train,y_train)

Sample Output for the given code:


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
Step 6: Feature importance scores

Sample code:
import pandas as pd
feature_imp = pd.Series(clf.feature_importances_,
                        index=iris.feature_names).sort_values(ascending=False)
feature_imp

Sample Output for the given code:


petal width (cm) 0.540009
petal length (cm) 0.360691
sepal length (cm) 0.073600
sepal width (cm) 0.025701
dtype: float64
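
Instead of typing the retained feature names by hand in Step 8, the top features can also be selected
programmatically from the series above. A minimal sketch (assuming feature_imp has been computed
as shown; keeping three features mirrors the next step):

Illustrative sketch:
# Keep the three most important features; strip the "(cm)" suffix so the names
# match the DataFrame columns created in Step 2
top_features = [name.replace(' (cm)', '') for name in feature_imp.index[:3]]
print(top_features)   # e.g. ['petal width', 'petal length', 'sepal length']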

Step 7: Visualization using seaborn library


Source code:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

Sample Output for the given code:

[A horizontal bar plot ranking the four features by their importance scores is displayed.]

Step 8: Generating the model on selected features

The code given below generates the model using only the selected (most important) features.

Source code:
#Generating model on selected Features
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split dataset into features and labels
X=data[['petal length', 'petal width','sepal length']] # Removed the least important feature "sepal width"
y=data['species']
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test

The code given below retrains the random forest classifier on the selected features and computes the accuracy.

Sample code:
from sklearn.ensemble import RandomForestClassifier

#Create a Random Forest classifier with 100 trees
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets
clf.fit(X_train,y_train)

# Prediction on the test set
y_pred=clf.predict(X_test)

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Sample Output for the given code:


Accuracy: 0.9555555555555556

Schedule

Day 15:
8.45am - 10.25am  : Random Forest Classifier
10.25am - 10.40am : Tea Break
10.40am - 12.20pm : Exercise 1
12.20pm - 1.30pm  : Lunch Break
1.30pm - 3.10pm   : Exercise 2
3.10pm - 3.25pm   : Tea Break
3.25pm - 4.15pm   : Exercise 3

Description of the task:

Download the PIMA Indians Diabetes dataset (Excel or CSV format) from Kaggle, import it into the
Python IDE, train a random forest classifier on it, and print the resulting accuracy.

Assessment specification: (Maximum Score: 10)

1. Load the dataset (Type J, max 0.5)
   0.5 mark - If the IRIS dataset is loaded successfully
   0 mark   - If the dataset is not found properly

2. Print the species and names (Type J, max 0.75)
   0.5 mark  - If the labeled species are printed successfully
   0.25 mark - If the names of the features are printed successfully
   0 mark    - If none of the operations is performed

3. Print data and labels (Type J, max 1)
   0.5 mark - If the data (top 5 records) of IRIS is printed
   0.5 mark - If the labels of IRIS are printed
   0 mark   - If none of the operations is found

4. Data frame (Type J, max 1)
   1 mark - If the creation of the dataframe is performed on IRIS
   0 mark - If the dataframe creation is not performed

5. Split function (Type J, max 1.5)
   0.5 mark - If the import is performed on the train and test split function
   1 mark   - If the dataset is split into a training set and a test set
   0 mark   - If the importing and splitting operations are not performed

6. Random forest model (Type J, max 0.5)
   0.5 mark - Import the random forest model
   0 mark   - Import is not done

7. Accuracy calculation (Type J, max 0.5)
   0.5 mark - Import the scikit-learn metrics module for calculating accuracy
   0 mark   - Accuracy is not obtained properly

8. Single item (Type J, max 0.5)
   0.5 mark - If the prediction is made exactly on a single item
   0 mark   - If the prediction is not made

9. Features, Gaussian classifier and training (Type J, max 1.5)
   0.75 mark - If important features are found in the random forest classifier
   0.5 mark  - If the Gaussian classifier is successfully created
   0.25 mark - If the training function is used on the model
   0 mark    - If features, Gaussian classifier and training are not found

10. pandas pd (Type J, max 0.5)
    0.5 mark - If pandas pd is imported successfully
    0 mark   - If no pandas is found

11. matplot library (Type J, max 0.25)
    0.25 mark - If the function of the matplot library is imported
    0 mark    - If no such function is imported

12. Generating model (Type J, max 0.25)
    0.25 mark - If the model is generated on the selected features
    0 mark    - If the model is not defined

13. Random Forest and Gaussian Classifier (Type J, max 0.25)
    0.25 mark - If the random forest and Gaussian classifier are created using sklearn
    0 mark    - If no such function is imported or created

14. Time Management (Type M, requirement: 30 mins, max 0.5)
    0.5 mark  - Completed within 30 min
    0.25 mark - Completed within 30 to 45 mins
    0 mark    - Exceeded 45 mins

15. Coding Ethics: i. Indentation, ii. Overall design look (Type J, max 0.5)
    0.5 mark  - If exhibiting both aspects
    0.25 mark - If exhibiting any one aspect
    0 mark    - If none of the aspects is found

Conclusion:

Thus the working mechanism of Random forest algorithm has been successfully implemented.

DAY 18 CONVOLUTIONAL NEURAL NETWORK

Objective:
The aim of this task is to use deep learning for image classification by applying convolutional
neural network.
Outcome:
Students will be able to build a neural network model for performing image classification on a real-
world dataset.

Resource Required: Python IDE (Jupyter or Anaconda or Spyder or PyCharm)

Prerequisites: Python Programming Basics

Theory:

Deep Learning is a very popular subset of machine learning due to its high level of performance
across many types of data. A Convolutional Neural Network (CNN) is the deep learning architecture most
commonly used to classify images. Several libraries in Python help to apply CNNs; among them, the Keras
library makes the process especially simple.
Computers see images using pixels. Pixels in images are usually related. For example, a certain
group of pixels may signify an edge in an image or some other pattern. Convolutions use this to help
identify images.
A convolution multiplies a matrix of pixels with a filter matrix or kernel and sums up the
multiplication values. Then the convolution slides over to the next pixel and repeats the same process
until all the image pixels have been covered.
CNNs, like other neural networks, are made up of neurons with learnable weights and biases. Each
neuron receives several inputs, takes a weighted sum over them, passes it through an activation function
and responds with an output.
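
As a concrete illustration of the multiply-and-sum operation described above, the sketch below (an
illustrative example, not part of the exercise; the image and kernel values are made up) convolves a small
5x5 greyscale patch with a 3x3 kernel using NumPy.

Illustrative sketch:
import numpy as np

image = np.array([[1, 2, 0, 1, 2],
                  [3, 1, 1, 0, 1],
                  [0, 2, 4, 2, 0],
                  [1, 0, 2, 1, 3],
                  [2, 1, 0, 1, 1]])
kernel = np.array([[ 0, -1,  0],
                   [-1,  4, -1],
                   [ 0, -1,  0]])

# Slide the kernel over every valid position, multiply element-wise and sum,
# exactly as described in the text above.
out_h = image.shape[0] - kernel.shape[0] + 1
out_w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(patch * kernel)
print(feature_map)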

Layers in a Convolutional Neural Network:


A convolutional neural network has multiple hidden layers that help in extracting information from an
image. The four important layers in a CNN are listed below (a minimal Keras sketch illustrating them
follows the list):
1. Convolution layer
2. ReLU layer
3. Pooling layer
4. Fully connected layer
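
A minimal Keras sketch of the four layer types listed above (the filter counts and sizes here are
illustrative assumptions, not the exact architecture built in the exercise below):

Illustrative sketch:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# 1. Convolution layer, with 2. the ReLU activation applied to its output
model.add(Conv2D(32, kernel_size=3, activation='relu', input_shape=(28, 28, 1)))
# 3. Pooling layer - downsamples each feature map by taking the maximum of 2x2 blocks
model.add(MaxPooling2D(pool_size=(2, 2)))
# 4. Fully connected layer - flatten the feature maps, then a dense softmax output
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.summary()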

Sample Coding:
Loading the dataset:
The mnist dataset is conveniently provided as part of the Keras library, so one can easily load the
dataset. Out of the 70,000 images provided in the dataset, 60,000 are given for training and 10,000 are
given for testing.
While loading the dataset below, X_train and X_test will contain the images, and y_train and
y_test will contain the digits that those images represent.
from keras.datasets import mnist
#download mnist data and split into train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

Exploratory data analysis


Now take a look at one of the images in the dataset to check what we are working with. We will
plot the first image in the dataset and check its size using the 'shape' attribute.
import matplotlib.pyplot as plt
#plot the first image in the dataset
plt.imshow(X_train[0])
#check image shape
X_train[0].shape

Data pre-processing
Next, reshape the dataset inputs (X_train and X_test) to the shape that our model expects when
we train the model. The first number is the number of images (60,000 for X_train and 10,000 for X_test).
Then comes the shape of each image (28x28). The last number is 1, which signifies that the images are
greyscale.
#reshape data to fit model
X_train = X_train.reshape(60000,28,28,1)
X_test = X_test.reshape(10000,28,28,1)
We need to ‘one-hot-encode’ our target variable. This means that a column will be created for
each output category and a binary variable is inputted for each category. For example, we saw that the
first image in the dataset is a 5. This means that the sixth number in our array will have a 1 and the rest of
the array will be filled with 0.
from keras.utils import to_categorical
#one-hot encode target column
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
y_train[0]
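
As a quick illustration of the encoding (the first training label is the digit 5, as noted above),
to_categorical places the 1 at index 5 of a length-10 vector:

Illustrative sketch:
from keras.utils import to_categorical

print(to_categorical(5, num_classes=10))
# expected output: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]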

Building the model


The code to build our model:
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten
#create model
model = Sequential()
#add model layers
model.add(Conv2D(64, kernel_size=3, activation='relu', input_shape=(28,28,1)))
model.add(Conv2D(32, kernel_size=3, activation='relu'))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))

Compile the Model:


#compile model using accuracy to measure model performance
model.compile(optimizer='adam',loss='categorical_crossentropy', metrics=['accuracy'])
Train the Model:
#train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3)
Making Predictions:
#predict first 4 images in the test set
model.predict(X_test[:4])
#actual results for first 4 images in test set
y_test[:4]
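
model.predict returns, for every image, a vector of ten probabilities (one per digit). The short follow-up
sketch below (illustrative, not part of the exercise) converts both the predictions and the one-hot test
labels back into digit values so they are easier to compare:

Illustrative sketch:
import numpy as np

# Convert softmax probabilities and one-hot labels back to digit classes
predicted_digits = np.argmax(model.predict(X_test[:4]), axis=1)
actual_digits = np.argmax(y_test[:4], axis=1)
print("Predicted:", predicted_digits)
print("Actual:   ", actual_digits)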

Schedule:

Day 4:
8.45am - 10.25am  : Introduction to CNN
10.25am - 10.40am : Tea Break
10.40am - 12.20pm : Building the Model
12.20pm - 1.30pm  : Lunch Break
1.30pm - 3.10pm   : Evaluating the Model
3.10pm - 3.25pm   : Tea Break
3.25pm - 4.15pm   : Task / Assessment

Description of the task:


Build the CNN model for classifying the images of real-world dataset and evaluate the model
accuracy.

Sample Output:

Assessment Specification: (Maximum Score: 10)

1. Downloading the correct dataset (Type J, requirement: Browser, max 1)
   1 mark - If the correct CSV dataset is downloaded
   0 mark - If the wrong dataset is downloaded

2. Importing the dataset (Type J, requirement: Python, max 1)
   1 mark - Dataset correctly imported and displayed
   0 mark - No dataset imported

3. Splitting the dataset into dependent and independent variables (Type J, requirement: Python, max 1)
   1 mark - Dataset is split into dependent and independent variables using the iloc function
   0 mark - Dataset is not split into dependent and independent variables

4. Preprocessing the data (Type J, requirement: Python, max 1)
   1 mark - Preprocessing procedures are carried out
   0 mark - No preprocessing of data

5. Separate data frames for features and labels (Type J, requirement: Python, max 1)
   1 mark - Created separate dataframes to store the features and target labels
   0 mark - No separate data frames created

6. Splitting data into training and testing data (Type J, requirement: Python, max 1)
   1 mark - Used the train_test_split function to split the dataset into training and testing data
   0 mark - No train_test_split function is applied

7. Create CNN layers (Type J, requirement: Python, max 2)
   2 marks - Used TensorFlow Keras to create CNN layers to train the dataset
   0 mark  - No CNN layers created

8. Prediction (Type J, requirement: Python, max 1)
   1 mark - Used test data for evaluation
   0 mark - No testing

9. Visualization (Type J, requirement: Python, max 1)
   1 mark - Used matplotlib or seaborn for visualizing the model results
   0 mark - No visualization results

Conclusion:
Thus a Convolutional Neural Network model was built using the Keras library in Python to classify
images, and the accuracy of the model was evaluated.

DAY 19 CREATING A SIMPLE CHAT BOT

Objective:
The aim of this task is to create a simple chatbot in Python using the NLTK library.
Outcome:
Students will be able to build a chatbot in Python using the NLTK library.

Resource Required: Python IDE (Jupyter or Anaconda or Spyder or PyCharm)

Prerequisites: Python Programming Basics

Theory:
What is a Chatbot?

A chatbot is AI-based software designed to interact with humans in their natural languages. These
chatbots usually converse via auditory or textual methods, and they can effortlessly mimic human
language to communicate with human beings in a human-like manner.

How do Chatbots work?


There are broadly two variants of chatbots: Rule-Based and Self-learning.

1. In a Rule-based approach, a bot answers questions based on a set of rules on which it is trained.
The rules defined can range from very simple to very complex. Such bots can handle simple queries
but fail to manage complex ones.

2. Self-learning bots are the ones that use Machine Learning-based approaches and are
generally more capable than rule-based bots. These bots can be of two further
types: Retrieval-Based or Generative.

Sample Coding:

Importing the necessary libraries

The nltk library is used. NLTK stands for Natural Language Toolkit and is a leading python library to
work with text data. The first line of code below imports the library, while the second line uses
the nltk.chat module to import the required utilities.

import nltk
from nltk.chat.util import Chat, reflections

The code below shows that utility Chat is a class that provides logic for building the chatbot.

print(Chat)

Output:

<class 'nltk.chat.util.Chat'>

The other import above was reflections, a dictionary that contains a set of input texts
and their corresponding output values. You can examine the dictionary with the code below. This is an
optional dictionary, and you can create your own dictionary in the same format as below.

reflections

Output:

{'i am': 'you are',


'i was': 'you were',
'i': 'you',
"i'm": 'you are',
"i'd": 'you would',
"i've": 'you have',
"i'll": 'you will',
'my': 'your',
'you are': 'I am',
'you were': 'I was',
"you've": 'I have',
"you'll": 'I will',
'your': 'my',
'yours': 'mine',
'you': 'me',
'me': 'you'}

Building the Chatbot

The first step is to create rules that will be used to train the chatbot. The lines of code below create a
simple set of rules. The first element of the list is the user input, whereas the second element is the
response from the bot. Several such lists are created in the set_pairs object.
set_pairs = [
[
r"my name is (.*)",
["Hello %1, How are you doing today ?",]
],
[
r"hi|hey|hello",
["Hello", "Hey there",]
],
[
r"what is your name?",
["You can call me a chatbot ?",]

],
[
r"how are you ?",
["I am fine, thank you! How can i help you?",]
],
[
r"I am fine, thank you",
["great to hear that, how can i help you?",]
],
[
r"how can i help you? ",
["i am looking for online guides and courses to learn data science, can you suggest?", "i am looking
for data science training platforms",]
],
[
r"i'm (.*) doing good",
["That's great to hear","How can i help you?:)",]
],
[
r"i am looking for online guides and courses to learn data science, can you suggest?",
["Pluralsight is a great option to learn data science. You can check their website",]
],
[
r"thanks for the suggestion. do they have great authors and instructors?",
["Yes, they have the world class best authors, that is their strength;)",]
],
[
r"(.*) thank you so much, that was helpful",
["Iam happy to help", "No problem, you're welcome",]
],
[
r"quit",
["Bye, take care. See you soon :) ","It was nice talking to you. See you soon :)"]
],
]
After creating the pairs of rules above, we define the chatbot using the code below. The code is simple
and prints a message whenever the function is invoked.
def chatbot():
    print("Hi, I'm the chatbot you built")

chatbot()

Output:

Hi, I'm the chatbot you built

The next step is to instantiate the Chat() function containing the pairs and reflections.

chat = Chat(set_pairs, reflections)


print(chat)

Output:

<nltk.chat.util.Chat object at 0x7f49c76e3be0>

You have created a simple rule-based chatbot, and the last step is to initiate the conversation. This is done
using the code below where the converse() function triggers the conversation.

chat.converse()
if __name__ == "__main__":
    chatbot()

The code above will generate the following chatbox in your notebook, as shown in the image below.

Output:

You're ready to interact with the chatbot. Start by typing a simple greeting, "hi", in the box, and you'll get
the response "Hello" from the bot, as shown in the image below.

Output:

You can continue conversing with the chatbot and quit the conversation once you are done, as shown in
the image below.
Output:

Schedule:

Day 5:
8.45am - 10.25am  : Introduction to Chatbots
10.25am - 10.40am : Tea Break
10.40am - 12.20pm : 1) Natural Language Toolkit (NLTK)  2) Downloading and installing NLTK
12.20pm - 1.30pm  : Lunch Break
1.30pm - 3.10pm   : Create a simple chatbot using NLTK in Python
3.10pm - 3.25pm   : Tea Break
3.25pm - 4.15pm   : Task / Assessment

Description of the task:


Build a simple chatbot in Python for facilitating hotel room booking. The chatbot gathers
parameters such as star rating, hotel name, location and tariff, and then offers choices.

Sample Output:

Assessment Specification: (Maximum Score: 10)

1. Downloading the correct dataset (Type J, requirement: Browser, max 1)
   1 mark - If the correct CSV dataset is downloaded
   0 mark - If the wrong dataset is downloaded

2. Importing the necessary libraries (Type J, requirement: Python, max 2)
   2 marks - Libraries are imported correctly
   0 mark  - No libraries imported

3. Simple rules creation (Type J, requirement: Python, max 2)
   2 marks - Simple rules are created to train the chatbot
   0 mark  - Rules are not created properly

4. Converse() function (Type J, requirement: Python, max 2)
   2 marks - The converse() function is used to initiate the conversation
   0 mark  - No conversation is initiated

5. Run the chatbot (Type J, requirement: Python, max 2)
   2 marks - The chatbot is executed successfully
   0 mark  - No successful execution

6. Visualization (Type J, requirement: Python, max 1)
   1 mark - Results are visualized correctly
   0 mark - No visualization results

Conclusion:
Thus an interactive chatbot was created in Python using the NLTK library for an application and
executed successfully.

DAY 20 INTRODUCTION TO BIG DATA ANALYTICS AND INSTALLATION OF
HADOOP

Objective:
The aim of this task is to understand the basic concepts of Big Data and the installation process of
Hadoop.
Outcome:
Students will be able to understand the importance of Big Data and how big data is applied in
real-world environments.

Resource Required: Linux or Windows operating system, JDK and Hadoop (the walkthrough below targets Windows)

Prerequisites: Working with basic operating system commands

Theory:

• Big data analytics is the often complex process of examining large and varied data sets, or big
data, to uncover information -- such as hidden patterns, unknown correlations, market trends and
customer preferences -- that can help organizations make informed business decisions.
• On a broad scale, data analytics technologies and techniques provide a means to analyse data sets
and draw conclusions about them which help organizations make informed business decisions.
Business intelligence (BI) queries answer basic questions about business operations and
performance.
• Big data analytics is a form of advanced analytics, which involves complex applications with
elements such as predictive models, statistical algorithms and what-if analysis powered by high-
performance analytics systems.

The importance of big data analytics


Driven by specialized analytics systems and software, as well as high-powered computing systems, big
data analytics offers various business benefits, including:
• New revenue opportunities
• More effective marketing
• Better customer service
• Improved operational efficiency
• Competitive advantages over rivals

Big data analytics applications enable big data analysts, data scientists, predictive modelers, statisticians
and other analytics professionals to analyze growing volumes of structured transaction data, plus other
forms of data that are often left untapped by conventional BI and analytics programs. This encompasses a
mix of semi-structured and unstructured data -- for example, internet clickstream data, web server logs,
social media content, text from customer emails and survey responses, mobile phone records, and
machine data captured by sensors connected to the internet of things (IoT).

Big data analytics technologies and tools


Unstructured and semi-structured data types typically don't fit well in traditional data warehouses that are

based on relational databases oriented to structured data sets. Further, data warehouses may not be able to
handle the processing demands posed by sets of big data that need to be updated frequently or even
continually, as in the case of real-time data on stock trading, the online activities of website visitors or the
performance of mobile applications.
As a result, many of the organizations that collect, process and analyze big data turn to NoSQL databases,
as well as Hadoop and its companion data analytics tools, including:
• YARN: a cluster management technology and one of the key features in second-generation
Hadoop.
• MapReduce: a software framework that allows developers to write programs that process massive
amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone
computers.
• Spark: an open source, parallel processing framework that enables users to run large-scale data
analytics applications across clustered systems.
• HBase: a column-oriented key/value data store built to run on top of the Hadoop Distributed File
System (HDFS).
• Hive: an open source data warehouse system for querying and analyzing large data sets stored in
Hadoop files.
• Kafka: a distributed publish/subscribe messaging system designed to replace traditional message
brokers.
• Pig: an open source technology that offers a high-level mechanism for the parallel programming
of MapReduce jobs executed on Hadoop clusters.

Installation of Hadoop

Hadoop is a software framework from Apache Software Foundation that is used to store and process
Big Data. It has two main components; Hadoop Distributed File System (HDFS), its storage system and
MapReduce, is its data processing framework. Hadoop has the capability to manage large datasets by
distributing the dataset into smaller chunks across multiple machines and performing parallel
computation on it.
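
To make the MapReduce idea concrete before moving on to the installation, the sketch below shows a
minimal word-count mapper and reducer written for Hadoop Streaming. This is an illustrative example
only; the file names (mapper.py, reducer.py) and the way the job is launched are assumptions and are not
part of the installation steps that follow.

Illustrative sketch (mapper.py):
#!/usr/bin/env python
# Emits "word<TAB>1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

Illustrative sketch (reducer.py):
#!/usr/bin/env python
# Sums the counts per word; Hadoop delivers the mapper output sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

The same pair of scripts can first be tested locally with a shell pipeline such as
cat input.txt | python mapper.py | sort | python reducer.py, and then submitted to a cluster through the
hadoop-streaming jar.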

If Java is unavailable on the system, it needs to be installed from the website:


https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html

Step 1:
Download the Hadoop version 3.1 from the following Link
CLICK HERE TO INSTALL HADOOP

Step 2:
Extract it to the folder.

Step 3:
Setting Up System Environment Variables

Step 4: Configurations

Now we need to edit some configuration files located in the etc/hadoop directory of the folder where we
installed Hadoop. The files that need to be edited are listed below.

1. Edit the file core-site.xml in the hadoop directory. Copy this xml property in the configuration file

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

2. Edit mapred-site.xml and copy this property in the configuration

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

3. Create a folder ‘data’ in the hadoop directory

Create a folder with the name ‘datanode’ and a folder ‘namenode’ in this data directory

4. Edit the file hdfs-site.xml and add below property in the configuration

Note: The path of namenode and datanode across value would be the path of the datanode and namenode
folders you just created.

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value> C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\datanode</value>
</property>
</configuration>

5. Edit the file yarn-site.xml and add below property in the configuration

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

6. Edit hadoop-env.cmd and replace %JAVA_HOME% with the path of the java folder where your jdk
1.8 is installed

Hadoop needs Windows-specific binaries which do not come with the default download of Hadoop.

To include those files, replace the bin folder in the Hadoop directory with the bin folder provided in this
GitHub link:
https://github.com/s911415/apache-hadoop-3.1.0-winutils

Download it as a zip file, extract it, and copy the bin folder inside it. If you want to keep the old bin folder,
rename it to something like bin_old and then paste the copied bin folder in its place.

Check whether hadoop is successfully installed by running this command on cmd-

hadoop version

Schedule:

Day 5:
8.45am - 10.25am  : Introduction to Big Data Analytics
10.25am - 10.40am : Tea Break
10.40am - 12.20pm : Installation of Java and setting up environment variables
12.20pm - 1.30pm  : Lunch Break
1.30pm - 3.10pm   : Installation of Hadoop 2.0
3.10pm - 3.25pm   : Tea Break
3.25pm - 4.15pm   : Task / Assessment

Description of the task:

Using the installation steps of Hadoop 2.0, install Hadoop 3.0 and display the version details.

Sample Output:

Assessment Specification: (Maximum Score: 10)

1. Java Installation (Type J, max 2)
   1 mark - JDK selection & installation
   1 mark - Environment setup

2. Hadoop (Type J, max 2)
   1 mark - Hadoop download
   1 mark - Environment setup

3. Configuration (Type J, max 1)
   0.25 mark - Configure core-site.xml
   0.25 mark - Configure hadoop-env.cmd
   0.5 mark  - Configure hdfs-site.xml

4. Configuration (Type J, max 2)
   0.5 mark - Configure mapred-site.xml
   0.5 mark - Configure yarn-site.xml

5. Folder creation (Type J, max 0.5)
   0.5 mark - Create the data folder and, under it, the datanode and namenode folders

6. Include JAVA (Type J, max 0.5)
   0.5 mark - Edit hadoop-env.cmd and replace %JAVA_HOME%

7. Learning Hadoop Commands (Type J, max 1)
   1 mark - Learning the commands

8. Time Management (Type M, requirement: 30 mins, max 1)
   1 mark   - Completed within 30 min
   0.5 mark - Completed within 30 to 45 mins
   0 mark   - Exceeded 45 mins

Conclusion:

Thus Hadoop 2.0 and Hadoop 3.0 were installed and executed successfully.

DAY 21 INSTALLATION AND DATABASE CREATION USING MONGODB

Objectives:
The aim of the task is to provide adequate knowledge on creating and deploying a highly scalable
and performance-oriented database.

Outcome: At the end of the task, students can,


• Understand the installation procedure of MongoDB and its environment setup
• Create a highly scalable and performance-oriented database using MongoDB.

Resources required: MongoDB software

Prerequisites: Knowledge about Big Data Analytics

Theory:
Installation of MongoDB:

Step 1: Open the following link in a web browser:


https://www.mongodb.com/try/download/enterprise

Step 2: Move to the Community Server tab on the same page and choose the latest version for
Windows. Click "Download".

Step 3: Open the downloaded package, accept the "Terms and Conditions"
and proceed with "Next".

Step 4: Choose the setup type as "Complete" and proceed with "Next". Then select "Run Service as
Network Service User", fill in the service name and directory fields, and proceed with "Next".

Step 5: Check the field Install MongoDB Compass, proceed next and select “Install”
button.

Step 6: After Installation, License agreement screen opens up and press “Agree”.

Step 7: MongoDB Compass Community screen opens and select “Next”.

Step 8: Press “Get Started”. Check all the boxes in Private settings and press “Start using
Compass”.

Step 9: Then Select “Finish” in the installation window.

Step 10: Move to “C:/” drive in my computer and create a folder named as “data”

Step 11: Inside data folder create another new folder named “db”

Step 12: Move to the folder "C:/Program Files/MongoDB/Server/4.2/bin" (Note: The
version 4.2 gets changed based on your installation)

Step 13: Copy the entire path shown above. Move to "System Properties" and
select “Environment Variables”.

Step 14: Move to "System Variables" and choose "Path". Select the "Edit" button; a screen
will open, then select the "New" button. Paste the copied path and click "Ok".

Step 15: Press "Ok" in Environment Variables and System Properties. Open the command
prompt and type "mongod". This starts the MongoDB server.

Step 16: Open another command prompt and type "mongo".

(Note: Don't close the two running command prompts.)

Step 17: Go to the MongoDB Compass window which was minimized earlier. Click "Connect" and it
will get connected.

Step 18: After the connection is established, the MongoDB Compass Community server
shows a page to create a database.

Database Creation in MongoDB:
Step 1: Click the “Create Database” button. Enter the name of the database to create and
its first collection. Both the database name and the collection name are required.

Step 2: Click "Create Database" to create the database and its first collection.

Step 3: To access the collections screen for a database, click the database name in
the main Databases view.

Step 4: Click the “Create Collection” button and enter the name of the collection to
create.

Compass displays views in the Collections Screen with a special icon, and indicates the
collection from which the view was created.
From the Documents tab, you can view, insert, modify, clone, and delete documents in
your selected collection or view.

Step 5: To insert documents into the collection, click “Add Data” dropdown button and
select insert document.

(Note: Click the {} brackets for JSON view. This is the default view. Or click the list icon
for Field-by-Field mode)

Queries:
To pass an array of documents into the insert() method

db.post.insert([
{
title: "MongoDB Overview",
description: "MongoDB is no SQL database",
by: "BIT",
url: "http://www.bitsathy.ac.in",
tags: ["mongodb", "database", "NoSQL"],
likes: 100
},
{
title: "NoSQL Database",
description: "NoSQL database doesn't have tables",
by: "BIT",
url: "http://www. bitsathy.ac.in",
tags: ["mongodb", "database", "NoSQL"],
likes: 20,
comments: [
{
user:"user1",
message: "My first comment",
dateCreated: new Date(2013,11,10,2,35),
like: 0
}
]
}
])

To insert the details of a single employee, use the insertOne() method

> db.createCollection("empDetails")
{ "ok" : 1 }
> db.empDetails.insertOne(
{
First_Name: "Radhika",
Last_Name: "Sharma",
Date_Of_Birth: "1995-09-26",
e_mail: "radhika_sharma.123@gmail.com",
phone: "9848022338"
})
{
"acknowledged" : true,
"insertedId" : ObjectId("5dd62b4070fb13eec3963bea")
}

To insert the details of many employees, use the insertMany() method

> db.empDetails.insertMany(
[
{
First_Name: "Radhika",
Last_Name: "Sharma",
Date_Of_Birth: "1995-09-26",
e_mail: "radhika_sharma.123@gmail.com",
phone: "9000012345"
},
{
First_Name: "Rachel",
Last_Name: "Christopher",
Date_Of_Birth: "1990-02-16",
e_mail: "Rachel_Christopher.123@gmail.com",
phone: "9000054321"
},
{
First_Name: "Fathima",
Last_Name: "Sheik",
Date_Of_Birth: "1990-02-16",
e_mail: "Fathima_Sheik.123@gmail.com",
phone: "9000054321"
}
]
)
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("5dd631f270fb13eec3963bed"),
ObjectId("5dd631f270fb13eec3963bee"),
ObjectId("5dd631f270fb13eec3963bef")
]
}
>

To update any details of a document, use the update() method

>db.post.update({'title':'MongoDB Overview'},{$set:{'title':'New MongoDB Tutorial'}})


WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
>db.post.find()
{ "_id" : ObjectId(5983548781331adf45ec5), "title":"New MongoDB Tutorial"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}
>

Schedule:

Day 5:
8.45am - 10.25am  : Installation of MongoDB
10.25am - 10.40am : Tea Break
10.40am - 12.20pm : Execution of simple queries in MongoDB
12.20pm - 1.30pm  : Lunch Break
1.30pm - 3.10pm   : Exercise 1
3.10pm - 3.25pm   : Tea Break
3.25pm - 4.15pm   : Exercise 2

Description of the task:

The following is the structure of the 'restaurants' collection. Use MongoDB to create the given structure
containing the documents and collections.

{
"address": {
"building": "1007",
"coord": [-73.856077, 40.848447],
"street": "Morris Park Ave",
"zipcode": "10462"
},
"borough": "Bronx",
"cuisine": "Bakery",
"grades": [
{"date": {"$date": 1393804800000}, "grade": "A", "score": 2},
{"date": {"$date": 1378857600000}, "grade": "A", "score": 6},
{"date": {"$date": 1358985600000}, "grade": "A", "score": 10},
{"date": {"$date": 1322006400000}, "grade": "A", "score": 9},
{"date": {"$date": 1299715200000}, "grade": "B", "score": 14}
],
"name": "Morris Park Bake Shop",
"restaurant_id": "30075445"
}

1. Write a MongoDB query to display all the documents in the collection restaurants.

Sample output:

{ "_id" : ObjectId("564c2d939eb21ad392f175c9"), "address" : { "building" : "351", "coord" : [ -


73.98513559999999, 40.7676919 ], "street" : "West 57 Street", "zipcode" : "10019" }, "borough" :
"Manhattan", "cuisine" : "Irish", "grades" : [ { "date" : ISODate("2014-09-06T00:00:00Z"), "grade" :
"A", "score" : 2 }, { "date" : ISODate("2013-07-22T00:00:00Z"), "grade" : "A", "score" : 11 }, { "date" :
ISODate("2012-07-31T00:00:00Z"), "grade" : "A", "score" : 12 }, { "date" : ISODate("2011-12-29T00:0
0:00Z"), "grade" : "A", "score" : 12 } ], "name" : "Dj Reynolds Pub And Restaurant", "restaurant_id" :
"30191841" }
etc……………..

2. Write a MongoDB query to display the fields restaurant_id, name, borough and cuisine for all the
documents in the collection restaurant.

Sample output:

{ "_id" : ObjectId("564c2d939eb21ad392f175c9"), "borough" : "Manhattan", "cuisine" : "Irish", "name" :


"Dj Reynolds Pub And Restaurant", "restaurant_id" : "30191841" }
{ "_id" : ObjectId("564c2d939eb21ad392f175ca"), "borough" : "Bronx", "cuisine" : "Bakery", "name" :
"Morris Park Bake Shop", "restaurant_id" : "30075445" }
{ "_id" : ObjectId("564c2d939eb21ad392f175cb"), "borough" : "Brooklyn", "cuisine" : "American ",
"name" : "Riviera Caterer", "restaurant_id" : "40356018" }
Etc……….

3. Write a MongoDB query to display the fields restaurant_id, name, borough and cuisine, but exclude the
field _id for all the documents in the collection restaurant.

Sample output:

{ "borough" : "Manhattan", "cuisine" : "Irish", "name" : "Dj Reynolds Pub And Restaurant",
"restaurant_id" : "30191841" }
{ "borough" : "Bronx", "cuisine" : "Bakery", "name" : "Morris Park Bake Shop", "restaurant_id" :
"30075445" }
{ "borough" : "Brooklyn", "cuisine" : "American ", "name" : "Riviera Caterer", "restaurant_id" :
"40356018" }
Etc……….

4. Write a MongoDB query to display the fields restaurant_id, name, borough and zip code, but exclude the
field _id for all the documents in the collection restaurant.

Sample output:

{ "address" : { "zipcode" : "10019" }, "borough" : "Manhattan", "name" : "Dj Reynolds Pub And
Restaurant", "restaurant_id" : "30191841" }
{ "address" : { "zipcode" : "10462" }, "borough" : "Bronx", "name" : "Morris Park Bake Shop",
"restaurant_id" : "30075445" }
{ "address" : { "zipcode" : "11224" }, "borough" : "Brooklyn", "name" : "Riviera Caterer",
"restaurant_id" : "40356018" }
Etc…….

5. Write a MongoDB query to display all the restaurant which is in the borough Bronx

Sample output:

{ "_id" : ObjectId("564c2d939eb21ad392f175ca"), "address" : { "building" : "1007", "coord" : [ -


73.856077, 40.848447 ], "street" : "Morris Park Ave", "zipcode" : "10462" }, "borough" : "Bronx",
"cuisine" : "Bakery", "grades" : [ { "date" : ISODate("2014-03-03T00:00:00Z"), "grade" : "A", "score" :
2 }, { "date" : ISODate("2013-09-11T00:00:00Z"), "grade" : "A", "score" : 6 }, { "date" :
ISODate("2013-01-24T00:00:00Z"), "grade" : "A", "score" : 10 }, { "date" : ISODate("2011-11-
23T00:00:00Z"), "gra
de" : "A", "score" : 9 }, { "date" : ISODate("2011-03-10T00:00:00Z"), "grade" : "B", "score" : 14 } ],
"name" : "Morris Park Bake Shop", "restaurant_id" : "30075445" }
{ "_id" : ObjectId("564c2d939eb21ad392f175d1"), "address" : { "building" : "2300", "coord" : [ -
73.8786113, 40.8502883 ], "street" : "Southern Boulevard", "zipcode" : "10460" }, "borough" : "Bronx",
"cuisine" : "American ", "grades" : [ { "date" : ISODate("2014-05-28T00:00:00Z"), "grade" : "A",
"score" : 11 }, { "date" : ISODate("2013-06-19T00:00:00Z"), "grade" : "A", "score" : 4 }, { "date" :
ISODate("2012-06-15T00:00:00Z"), "grade" : "A", "score" : 3 } ], "name" : "Wild Asia", "restaurant_id"
: "40357217" }
Etc…….

6. Write a MongoDB query to display the first 5 restaurant which is in the borough Bronx.

Sample output:

{ "_id" : ObjectId("564c2d939eb21ad392f175ca"), "address" : { "building" : "1007", "coord" : [ -


73.856077, 40.848447 ], "street" : "Morris Park Ave", "zipcode" : "10462" }, "borough" : "Bronx",
"cuisine" : "Bakery", "grades" : [ { "date" : ISODate("2014-03-03T00:00:00Z"), "grade" : "A", "score" :
2 }, { "date" : ISODate("2013-09-11T00:00:00Z"), "grade" : "A", "score" : 6 }, { "date" :
ISODate("2013-01-24T00:00:00Z"), "grade" : "A", "score" : 10 }, { "date" : ISODate("2011-11-
23T00:00:00Z"), "gra
de" : "A", "score" : 9 }, { "date" : ISODate("2011-03-10T00:00:00Z"), "grade" : "B", "score" : 14 } ],
"name" : "Morris Park Bake Shop", "restaurant_id" : "30075445" }
{ "_id" : ObjectId("564c2d939eb21ad392f175d1"), "address" : { "building" : "2300", "coord" : [ -
73.8786113, 40.8502883 ], "street" : "Southern Boulevard", "zipcode" : "10460" }, "borough" : "Bronx",
"cuisine" : "American ", "grades" : [ { "date" : ISODate("2014-05-28T00:00:00Z"), "grade" : "A",
"score" : 11 }, { "date" : ISODate("2013-06-19T00:00:00Z"), "grade" : "A", "score" : 4 }, { "date" :
ISODate("2012-06-15T00:00:00Z"), "grade" : "A", "score" : 3 } ], "name" : "Wild Asia", "restaurant_id"
: "40357217" }
{ "_id" : ObjectId("564c2d939eb21ad392f175e7"), "address" : { "building" : "1006", "coord" : [ -
73.84856870000002, 40.8903781 ], "street" : "East 233 Street", "zipcode" : "10466" }, "borough" :
"Bronx", "cuisine" : "Ice Cream, Gelato, Yogurt, Ices", "grades" : [ { "date" : ISODate("2014-04-
24T00:00:00Z"), "grade" : "A", "score" : 10 }, { "date" : ISODate("2013-09-05T00:00:00Z"), "grade" :
"A", "score" : 10 }, { "date" : ISODate("2013-02-21T00:00:00Z"), "grade" : "A", "score" : 9 }, { "date" :
IS
ODate("2012-07-03T00:00:00Z"), "grade" : "A", "score" : 11 }, { "date" : ISODate("2011-07-
11T00:00:00Z"), "grade" : "A", "score" : 5 } ], "name" : "Carvel Ice Cream", "restaurant_id" :
"40363093" }
Etc………………

Assessment specification: (Maximum Score: 10)

1. Create collection (Type J, max 1)
   1 mark   - If the collection is created
   0.5 mark - If partially correct
   0 mark   - If no collection is created

2. Add documents (Type J, max 1)
   1 mark   - If all the documents are added
   0.5 mark - If any 2 documents are added
   0 mark   - If no documents are added

3. Display all the documents in the collection restaurants (Type J, max 1)
   1 mark   - If the output is correct
   0.5 mark - If the output is partially correct
   0 mark   - If the query is not executed

4. Display the fields restaurant_id, name, borough and cuisine for all the documents in the
   collection restaurants (Type J, max 1)
   1 mark   - If the output is correct
   0.5 mark - If the output is partially correct
   0 mark   - If the query is not executed

5. Display the fields restaurant_id, name, borough and cuisine, but exclude the field _id, for all the
   documents in the collection restaurants (Type J, max 1)
   1 mark   - If the output is correct
   0.5 mark - If the output is partially correct
   0 mark   - If the query is not executed

6. Display the fields restaurant_id, name, borough and zip code, but exclude the field _id, for all the
   documents in the collection restaurants (Type J, max 1)
   1 mark   - If the output is correct
   0.5 mark - If the output is partially correct
   0 mark   - If the query is not executed

7. Display all the restaurants which are in the borough Bronx (Type J, max 1)
   1 mark   - If the output is correct
   0.5 mark - If the output is partially correct
   0 mark   - If the query is not executed

8. Display the first 5 restaurants which are in the borough Bronx (Type J, max 1)
   1 mark   - If the output is correct
   0.5 mark - If the output is partially correct
   0 mark   - If the query is not executed

9. Time Management (Type M, requirement: 30 mins, max 1)
   1 mark   - Completed within 30 min
   0.5 mark - Completed within 30 to 45 mins
   0 mark   - Exceeded 45 mins

10. Coding Ethics: Indentation, Overall design look (Type J, max 1)
    1 mark   - If exhibiting both aspects
    0.5 mark - If exhibiting any one aspect
    0 mark   - If not exhibiting even one aspect

Conclusion:
Thus the working of MongoDB was discussed, the given queries were implemented in MongoDB, and
the results were verified successfully.

DAY 22 INTRODUCTION TO SQL AND NOSQL DATABASES

Objectives: The aim of the task is to provide adequate knowledge of SQL and NoSQL databases.

Outcome: At the end of the task, students can,


• Learn the differences between SQL and NoSQL databases.
• Understand the working of SQL and NoSQL databases for different applications.

Resources required: Oracle and MongoDB

Prerequisites: Knowledge of SQL queries

Theory:

Introduction to SQL database:


Structured Query Language (SQL) is a database query language used for storing and
managing data in a Relational DBMS. SQL was the first commercial language introduced for E.F.
Codd's relational model of database. Today almost all RDBMSs (MySQL, Oracle, Informix,
Sybase, MS Access) use SQL as the standard database query language. SQL is used to perform
all types of data operations in an RDBMS.

SQL commands:

SQL defines following ways to manipulate data stored in an RDBMS.

(i) DDL: Data Definition Language


This includes changes to the structure of the table like creation of table, altering table, deleting a
table etc.
All DDL commands are auto-committed. That means it saves all the changes permanently in the
database.

Command  - Description
create   - to create a new table or database
alter    - for alteration
truncate - to delete data from a table
drop     - to drop a table
rename   - to rename a table

(ii) DML: Data Manipulation Language


DML commands are used for manipulating the data stored in the table and not the table itself.
DML commands are not auto-committed. It means changes are not permanent to database, they
can be rolled back.

Command - Description
insert  - to insert a new row
update  - to update an existing row
delete  - to delete a row
merge   - to merge two rows or two tables

(iii) TCL: Transaction Control Language

These commands are to keep a check on other commands and their effect on the database.

These commands can annul changes made by other commands by rolling the data back to its
original state. It can also make any temporary change permanent.

Command   - Description
Commit    - to permanently save
Rollback  - to undo changes
Savepoint - to save temporarily

(iv) DCL: Data Control Language

Data control languages are the commands to grant and take back authority from any database
user.

Command - Description
Grant   - to grant permission or rights
Revoke  - to take back permission

(v) DQL: Data Query Language

Data query language is used to fetch data from tables based on conditions that we can easily
apply.

Command - Description
Select  - to retrieve records from one or more tables

Limitations of Relational databases

1. In a relational database we need to define the structure and schema of the data first, and only then can
we process the data.
2. Relational database systems provide consistency and integrity of data by enforcing the ACID
properties (Atomicity, Consistency, Isolation and Durability). There are some scenarios where this is
useful, such as a banking system. However, in most other cases these properties impose a significant
performance overhead and can make the database response very slow.
3. Most applications store their data in JSON format, and an RDBMS does not provide a convenient way
of performing operations such as create, insert, update and delete on this data. On the other hand,
NoSQL databases store their data in JSON format, which is compatible with most of today's
applications.

Introduction to NoSQL database:

NoSQL databases are different from relational databases. In a relational database you need to create
the table, define the schema, set the data types of the fields, etc. before you can actually insert the data.
In NoSQL, you can insert and update data on the fly.

One of the advantages of NoSQL databases is that they are really easy to scale and they are much
faster for most types of operations performed on the database. There are certain situations where you
would prefer a relational database over NoSQL; however, when you are dealing with a huge amount of
data, a NoSQL database is the better choice.

Advantages of NoSQL

There are several advantages of working with NoSQL databases such as MongoDB and
Cassandra. The main advantages are high scalability and high availability.

(i) High scalability: NoSQL databases such as MongoDB use sharding for horizontal scaling.
Sharding is the partitioning of data and placing it on multiple machines in such a way that the order
of the data is preserved. Vertical scaling means adding more resources to the existing machine,
while horizontal scaling means adding more machines to handle the data. Vertical scaling is not
that easy to implement; on the other hand, horizontal scaling is easy to implement. Examples of
horizontally scaling databases are MongoDB, Cassandra, etc. Because of this feature, NoSQL can
handle huge amounts of data; as the data grows, NoSQL scales itself to handle that data in an
efficient manner.

(ii) High Availability: Auto replication feature in MongoDB makes it highly available because in
case of any failure data replicates itself to the previous consistent state.

Types of NoSQL database

Here are the types of NoSQL databases and the name of the databases system that falls in that
category. MongoDB falls in the category of NoSQL document based database.

(i) Key Value Store : Memcached, Redis, Coherence

(ii) Tabular: Hbase, Big Table, Accumulo

(iii)Document based: MongoDB, CouchDB, Cloudant

When you would want to choose NoSQL over relational database:

1. When you want to store and retrieve huge amount of data.


2. The relationship between the data you store is not that important
3. The data is not structured and changing over time
4. Constraints and Joins support is not required at database level
5. The data is growing continuously and you need to scale the database regularly to handle it.

Basic of SQL and NoSQL database:

SQL database:

DDL COMMANDS

1. CREATE
CREATE statement is used to create a new database, table, index or stored procedure.

Create database example:

CREATE DATABASE explainjava;

Create table example:

CREATE TABLE user (id INT (16) PRIMARY KEY AUTO_INCREMENT, name
VARCHAR (255) NOT NULL);

2. DROP
DROP statement allows you to remove database, table, index or stored procedure.

Drop database example:

DROP DATABASE explainjava;

Drop table example:

DROP TABLE user;

3. ALTER
ALTER is used to modify existing database data structures (database, table).

Alter table example:

ALTER TABLE user ADD COLUMN lastname VARCHAR (255) NOT NULL;

4. RENAME
RENAME command is used to rename SQL table.

Rename table example:

RENAME TABLE user TO student;

5. TRUNCATE
TRUNCATE operation is used to delete all table records. Logically it is the same as
DELETE command.

Differences between DELETE and TRUNCATE commands are:

• TRUNCATE is really faster
• TRUNCATE cannot be rolled back
• TRUNCATE does not invoke ON DELETE triggers

Example:

TRUNCATE student;

DML COMMANDS:

1. SELECT
SELECT query is used to retrieve a data from SQL tables.

Example:

SELECT * FROM student;

2. INSERT
INSERT command is used to add new rows into the database table.

Example:

INSERT INTO student (name, lastname) VALUES ('Dmytro', 'Shvechikov');

3. UPDATE
UPDATE statement modifies records into the table.

Example:

UPDATE student SET name = 'Dima' WHERE lastname = 'Shvechikov';

4. DELETE
DELETE query removes entries from the table.

Example:

DELETE FROM student WHERE name = 'Dima';

DCL COMMANDS:

1. GRANT

GRANT command gives permissions to SQL user account.

For example, I want to grant all privileges to ‘explainjava’ database for user
‘dmytro@localhost’.

Let us create a user first:

CREATE USER 'dmytro'@'localhost' IDENTIFIED BY '123';

Then I can grant all privileges using GRANT statement:


GRANT ALL PRIVILEGES ON explainjava.* TO 'dmytro'@'localhost';

and we have to save the changes using the FLUSH command:

FLUSH PRIVILEGES;

2. REVOKE
REVOKE statement is used to remove privileges from user accounts.

Example:

REVOKE ALL PRIVILEGES ON explainjava.* FROM 'dmytro'@'localhost';

and save the changes:

FLUSH PRIVILEGES;

TCL COMMANDS:

1. START TRANSACTION
A transaction begins with START TRANSACTION; after that, you perform manipulations on the data
(insert, update, delete) and, at the end, you need to commit the transaction.

2. COMMIT
As mentioned above, the COMMIT command finishes the transaction and stores all changes
made inside the transaction.

Example:

START TRANSACTION;
INSERT INTO student (name, lastname) VALUES ('Dmytro', 'Shvechikov');
COMMIT;

3. ROLLBACK
ROLLBACK statement reverts all changes made in the scope of transaction.

Example:

START TRANSACTION;
INSERT INTO student (name, lastname) VALUES ('Dmytro', 'Shvechikov');
ROLLBACK;

4. SAVEPOINT
SAVEPOINT is a point in a transaction when you can roll the transaction back to a
certain point without rolling back the entire transaction.
Example:
SAVEPOINT SAVEPOINT_NAME;

NoSQL database: (MongoDB)

The data in MongoDB is stored in the form of documents. These documents are stored in
collections, and collections are stored in databases.

(i) The following is the syntax for creating a collection in MongoDB:


db.collection_name.insert({key:value, key:value…})

> use bitdb


switched to db bitdb

db.students.insert({
name: "Chaitanya",
dept: "CSE",
place: "Coimbatore"
})

We do not have a collection students in the database bitdb. This command will
create the collection named “students” on the fly and insert a document in it with the
specified key and value pairs.

(ii) To check whether the document is successfully inserted, type the following
command.

Syntax: db.collection_name.find()
It shows all the documents in the given collection.

(iii) To check whether the collection is created successfully, use the following
command.

Syntax: show collections


- This command shows the list of all the collections in the currently selected database.

(iv) To insert a document into the collection:

Syntax: db.collection_name.insert()

(v) To drop a collection, first connect to the database in which you want to delete
collection and then type the following command to delete the collection:
Syntax: db.collection_name.drop()

(vi) To update a document in MongoDB, we provide criteria in command and the


document that matches that criteria is updated.

Syntax: db.collection_name.update(criteria, update_data)

(vii) To delete documents from a collection. The remove() method is used for
removing the documents from a collection in MongoDB.

Syntax: db.collection_name.remove(delete_criteria)

Sample coding and Sample output: (SQL database)

1. SQL Commands:

Query:
1. Create table books (author varchar (20), bookid int, noofpages int);
insert into books values ('xxx', 1234, 90);
select * from books;
2. alter table books add publisher varchar (30);
select * from books;
3. insert into books values('xxx',1234, 90, 'VPN');
select * from books;
4. alter table books rename to library;
select * from library;

author bookid noofpages publisher


xxx 1234 90 VPN

5. alter table library drop column author;


select * from library;
6. insert into library values (2254, 89, 'jiuhui');
select * from library;

bookid noofpages publisher


1234 90 VPN
2254 89 jiuhui

7. select * from library where noofpages=90;

bookid noofpages publisher
1234 90 VPN

8. update library set noofpages=100 where bookid=1234;


Select * from library;

bookid noofpages publisher


1234 100 VPN
2254 89 jiuhui

9. delete from library where bookid=1234;


select * from library;
bookid noofpages publisher

2254 89 jiuhui

Sample coding and Sample output: (NoSQL database)

(i) Create collection, drop collection and show collections:
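
The original sample output here is a screenshot that is not reproduced in this text; the following is a minimal sketch of the equivalent shell commands, reusing the collection name from the earlier example:

use bitdb                          // switch to (and implicitly create) the database
db.createCollection("students")    // explicitly create the collection
show collections                   // list the collections in the current database
db.students.drop()                 // drop the collection again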

(ii) Insert document in collection:
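
The original screenshot is likewise not reproduced; a minimal insert sketch, reusing the document from the earlier example:

db.students.insert({
    name: "Chaitanya",
    dept: "CSE",
    place: "Coimbatore"
})
db.students.find()    // verify that the document was stored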

(iii) Delete document in collection:


We have a collection named students in the MongoDB database beginnersbookdb; each
document in the collection carries a StudentId field.

To remove the student whose StudentId equals 3333 from this collection, we would write a
command using the remove() method like this:

db.students.remove({"StudentId": 3333})

To verify that the document was actually deleted, type the following command:

db.students.find().pretty()

Schedule:

Day 5

8.45am to 10.25am    Introduction to SQL and NoSQL
10.25am to 10.40am   Tea Break
10.40am to 12.20pm   SQL commands
12.20pm to 1.30pm    Lunch Break
1.30pm to 3.10pm     NoSQL commands
3.10pm to 3.25pm     Tea Break
3.25pm to 4.15pm     Task / Assessment

Description of the task:

1. Create tables to store the details of students, teachers and departments using SQL, and
perform the basic operations: insert, update, delete and find.

2. Create collections and documents to store student details, teacher details and department
details using MongoDB, and perform the basic operations: insert, update, delete and find.

Assessment specification:

S.No | Aspect Type | Aspect Description | Requirement | Maximum Score (total: 10)
The marking breakdown (Additional Aspect Description) is listed under each aspect.

1 | J | Create table | - | 2
    2 Marks: all the tables are created
    1 Mark: 2 tables are created
    0 Marks: no table is created

2 | J | Insert values | - | 1
    1 Mark: values for all the tables are inserted
    0.5 Mark: values for 2 tables are inserted
    0 Marks: the values are not inserted

3 | J | Delete values | - | 0.5
    0.5 Mark: any of the values are deleted
    0 Marks: the values are not deleted

4 | J | Update values | - | 0.5
    0.5 Mark: any of the values are updated
    0 Marks: the values are not updated

5 | J | Create collections | - | 2
    2 Marks: all the collections are created
    1 Mark: 2 collections are created
    0 Marks: no collection is created

6 | J | Insert documents | - | 1
    1 Mark: values for all the collections are inserted
    0.5 Mark: values for 2 collections are inserted
    0 Marks: the values are not inserted

7 | J | Delete and update documents | - | 1
    0.5 Mark: any of the values are deleted
    0.5 Mark: any of the values are updated
    0 Marks: the values are not deleted and updated

8 | M | Time Management | 30 mins | 1
    1 Mark: completed within 30 mins
    0.5 Mark: completed within 30 to 45 mins
    0 Marks: exceeded 45 mins

9 | J | Coding Ethics (Indentation, Overall design look) | - | 1
    1 Mark: exhibiting both aspects
    0.5 Mark: exhibiting any one aspect
    0 Marks: not exhibiting either aspect

Conclusion:
Thus the working of SQL and NoSQL databases has been discussed and implemented using
SQL and MongoDB, and the results have been verified successfully.

