
SREE NARAYANA GURUKULAM

COLLEGE OF ENGINEERING
KADAYIRUPPU, KOLENCHERY 682311

LABORATORY RECORD
YEAR---------TO

NAME---------------------------------------------------------------------------
SEMESTER-----------------------------------------ROLL NO.------------------------
BRANCH----------------------------------------

Certified that this is a Bonafide Record of Practical work done in partial fulfillment of the requirements for the award of the Degree in Master of Computer Applications of Sree Narayana Gurukulam College of Engineering.

Kadayiruppu

Date:

Head of the Department Course Instructor

Submitted for University Practical Examination

External Examiner Internal Examiner


INDEX

NO  PROGRAM                                                                             PAGE NO   DATE
 1  Program to review the fundamentals of Python.
 2  Program to handle data using pandas & perform data visualization using matplotlib & seaborn.
 3  Program to implement k-NN classification using any standard dataset available in the public domain and find the accuracy of the algorithm.
 4  Program to implement Naive Bayes algorithm using any standard dataset available in the public domain and find the accuracy of the algorithm.
 5  Program to implement simple linear regression technique using any standard dataset available in the public domain and evaluate its performance.
 6  Program to implement multiple linear regression technique using any standard dataset available in the public domain and evaluate its performance.
 7  Program to implement Support Vector Machine.
 8  Program to implement k-means clustering technique using any standard dataset available in the public domain.
 9  Programs on convolutional neural network to classify images from any standard dataset in the public domain.
10  Implement problems on natural language processing - Part of Speech tagging, N-gram & smoothing and Chunking using NLTK.

Program : 1

Aim: To review the fundamentals of Python programming for data science and machine learning. Here, we focus on concepts such as:

(a) Data types

(b) Containers: List, Tuple, Dictionary, Sets

(c) Functions and return statement

Program

(a) Data types

a = 5
print("Type of a: ", type(a))

b = 5.0
print("\nType of b: ", type(b))

c = 2 + 4j
print("\nType of c: ", type(c))

Type of a: <class 'int'>

Type of b: <class 'float'>

Type of c: <class 'complex'>

# Python program for creation of strings

# Creating a string with single quotes
String1 = 'Welcome to the Data Science Lab'
print("String with the use of Single Quotes: ")
print(String1)

# Creating a string with double quotes
String1 = "I'm happy to program"
print("\nString with the use of Double Quotes: ")
print(String1)
print(type(String1))

# Creating a string with triple quotes
String1 = '''I'm happy to learn "Python"'''
print("\nString with the use of Triple Quotes: ")
print(String1)
print(type(String1))

# Triple quotes also allow multiline strings
String1 = '''Programming
is
fun'''
print("\nCreating a multiline String: ")
print(String1)


# Python program to access characters of a string
String1 = "Programming"
print("Initial String: ")
print(String1)

# Printing the first character
print("\nFirst character of String is: ")
print(String1[0])

# Printing the last character
print("\nLast character of String is: ")
print(String1[-1])

String with the use of Single Quotes:
Welcome to the Data Science Lab

String with the use of Double Quotes:
I'm happy to program
<class 'str'>

String with the use of Triple Quotes:
I'm happy to learn "Python"
<class 'str'>

Creating a multiline String:
Programming
is
fun

Initial String:
Programming

First character of String is:
P

Last character of String is:
g


(b) Containers: List, Tuple, Dictionary, Sets

# Python program to demonstrate
# creation of List

# Creating a List
List = []
print("Initial blank List: ")
print(List)

# Creating a List with the use of a String
List = ['Python']
print("\nList with the use of String: ")
print(List)

# Creating a List with the use of multiple values
List = ["Python", "C", "Java"]
print("\nList containing multiple values: ")
print(List[0])
print(List[2])

# Creating a Multi-Dimensional List
# (by nesting a list inside a List)
List = [['Python', 'Java'], ['C']]
print("\nMulti-Dimensional List: ")
print(List)

Initial blank List:
[]

List with the use of String:
['Python']

List containing multiple values:
Python
Java

Multi-Dimensional List:
[['Python', 'Java'], ['C']]

# Python program to demonstrate
# creation of Tuple

# Creating an empty tuple
Tuple1 = ()
print("Initial empty Tuple: ")
print(Tuple1)

# Creating a Tuple with the use of Strings
Tuple1 = ('Java', 'Python')
print("\nTuple with the use of String: ")
print(Tuple1)

# Creating a Tuple with the use of a list
list1 = [1, 2, 4, 5, 6]
print("\nTuple using List: ")
print(tuple(list1))

# Creating a Tuple with the use of the built-in function
Tuple1 = tuple('Python')
print("\nTuple with the use of function: ")
print(Tuple1)

# Creating a Tuple with nested tuples
Tuple1 = (0, 1, 2, 3)
Tuple2 = ('data', 'science')
Tuple3 = (Tuple1, Tuple2)
print("\nTuple with nested tuples: ")
print(Tuple3)


# Python program to demonstrate accessing a tuple
tuple1 = tuple([1, 2, 3, 4, 5])

# Accessing an element using indexing
print("First element of tuple")
print(tuple1[0])

# Accessing elements from the end using negative indexing
print("\nLast element of tuple")
print(tuple1[-1])

print("\nThird last element of tuple")
print(tuple1[-3])

Initial empty Tuple:
()

Tuple with the use of String:
('Java', 'Python')

Tuple using List:
(1, 2, 4, 5, 6)

Tuple with the use of function:
('P', 'y', 't', 'h', 'o', 'n')

Tuple with nested tuples:
((0, 1, 2, 3), ('data', 'science'))

First element of tuple
1

Last element of tuple
5

Third last element of tuple
3

# Python program to demonstrate boolean type

print(type(True))
print(type(False))

print(type(true))  # Error: case-sensitive 't' in True

<class 'bool'>
<class 'bool'>

NameError                                 Traceback (most recent call last)
<ipython-input-5-d732ccf9e5f8> in <module>()
      5 print(type(False))
      6
----> 7 print(type(true)) #Error for case sensitive t in True

NameError: name 'true' is not defined


# Python program to demonstrate
# creation of Set in Python

# Creating an empty Set
set1 = set()
print("Initial blank Set: ")
print(set1)

# Creating a Set with the use of a String
set1 = set("Programming")
print("\nSet with the use of String: ")
print(set1)

# Creating a Set with the use of a List
set1 = set(["Python", "Programming", "Java", "Programming"])
print("\nSet with the use of List: ")
print(set1)

# Creating a Set with a mixed type of values
# (having numbers and strings)
set1 = set([1, 2, 'C', 4, 'Java', 6, 'Java'])
print("\nSet with the use of Mixed Values")
print(set1)

# Python program to demonstrate
# accessing of elements in a set

# Creating a set
set1 = set(["Java", "Program", "Python", "Java"])
print("\nInitial set")
print(set1)

# Accessing elements using a for loop
print("\nElements of set: ")
for i in set1:
    print(i, end=" ")

# Checking for an element using the 'in' keyword
print("Java" in set1)

Initial blank Set:
set()

Set with the use of String:
{'r', 'P', 'o', 'n', 'g', 'a', 'm', 'i'}

Set with the use of List:
{'Java', 'Programming', 'Python'}

Set with the use of Mixed Values
{1, 2, 4, 6, 'Java', 'C'}

Initial set
{'Java', 'Python', 'Program'}

Elements of set:
Java Python Program True

# Creating an empty Dictionary
Dict = {}
print("Empty Dictionary: ")
print(Dict)

# Creating a Dictionary with Integer Keys
Dict = {1: 'Java', 2: 'C', 3: 'Python'}
print("\nDictionary with the use of Integer Keys: ")
print(Dict)

# Creating a Dictionary with Mixed keys
Dict = {'Name': 'Java', 1: [1, 2, 3, 4]}
print("\nDictionary with the use of Mixed Keys: ")
print(Dict)

# Creating a Dictionary with the dict() method
Dict = dict({1: 'Java', 2: 'C', 3: 'Python'})
print("\nDictionary with the use of dict(): ")
print(Dict)

# Creating a Dictionary with each item as a Pair
Dict = dict([(1, 'Java'), (2, 'C'), ('program', 'Python'), (3, 'Machine Learning')])
print("\nDictionary with each item as a pair: ")
print(Dict)

# Accessing an element using a key
print("Accessing an element using key:")
print(Dict['program'])

# Accessing an element using the get() method
print("Accessing an element using get:")
print(Dict.get(3))


Empty Dictionary:
{}

Dictionary with the use of Integer Keys:
{1: 'Java', 2: 'C', 3: 'Python'}

Dictionary with the use of Mixed Keys:
{'Name': 'Java', 1: [1, 2, 3, 4]}

Dictionary with the use of dict():
{1: 'Java', 2: 'C', 3: 'Python'}

Dictionary with each item as a pair:
{1: 'Java', 2: 'C', 'program': 'Python', 3: 'Machine Learning'}
Accessing an element using key:
Python
Accessing an element using get:
Machine Learning

(c) Functions and return statement

def my_function():
    print("Hello from a simple function")

my_function()

def my_function(fname):
    print("Hello " + fname + " from a function with one argument")

my_function("Amy")
my_function("Sara")
my_function("Thomas")

# Function with 2 arguments
def my_function(fname, lname):
    print(fname + " " + lname)

my_function("Amy", "Joseph")

# Function with an arbitrary number of arguments
# If the number of arguments is unknown, add a * before the parameter name:
def my_function(*kids):
    print("The youngest child is " + kids[2])

my_function("Amy", "Sara", "Stephen")

# Arguments with the key = value syntax
def my_function(child3, child2, child1):
    print("The youngest child is " + child3)

my_function(child1="Amy", child2="Sara", child3="Stephen")

# Keyword arguments
# If the number of keyword arguments is unknown, add ** before the parameter name
def my_function(**kid):
    print("His last name is " + kid["lname"])

my_function(fname="Sara", lname="Peter")

# Default parameter value
def my_function(country="Norway"):
    print("I am from " + country)

my_function("Sweden")
my_function("India")
my_function()
my_function("Brazil")

# List as argument
def my_function(food):
    for x in food:
        print(x)

fruits = ["apple", "banana", "cherry"]

my_function(fruits)

# return statement
def my_function(x):
    return 5 * x

print(my_function(3))
print(my_function(5))
print(my_function(9))

# pass statement
# Function definitions cannot be empty, but if you for some reason have a function
# definition with no content, put in the pass statement to avoid getting an error
def myfunction():
    pass

# Recursion example
def tri_recursion(k):
    if k > 0:
        result = k + tri_recursion(k - 1)
        print(result)
    else:
        result = 0
    return result

print("\n\nRecursion Example Results")
tri_recursion(6)

# Lambda function
# A lambda function is a small anonymous function.
# A lambda function can take any number of arguments, but can only have one expression.

# Example 1: Multiply argument a with argument b and return the result:
x = lambda a, b: a * b
print(x(5, 6))

# Function within another function
# The power of lambda is better shown when you use it as an anonymous function
# inside another function.
print("\n\n\n Function with function")

def myfunc(n):
    return lambda a: a * n

mydoubler = myfunc(2)
mytripler = myfunc(3)

print("My doubler: %d" % mydoubler(11))
print("My Tripler: %d" % mytripler(11))

Hello from a simple function
Hello Amy from a function with one argument
Hello Sara from a function with one argument
Hello Thomas from a function with one argument
Amy Joseph
The youngest child is Stephen
The youngest child is Stephen
His last name is Peter
I am from Sweden
I am from India
I am from Norway
I am from Brazil
apple
banana
cherry
15
25
45


Recursion Example Results
1
3
6
10
15
21
30


 Function with function
My doubler: 22
My Tripler: 33


Program : 2

Aim : Program to handle data using pandas & perform data visualization using matplotlib & seaborn.

Hunting Exoplanets In Space - Scatter & Line Plots

# Load the training dataset.
import pandas as pd

exo_train_df = pd.read_csv('/content/exoTrain.csv')
exo_train_df.head()

   LABEL   FLUX.1   FLUX.2   FLUX.3   FLUX.4   FLUX.5   FLUX.6   FLUX.7   FLUX.8   FLUX.9  FLUX.10  ...
0      2    93.85    83.81    20.10   -26.98   -39.56  -124.71  -135.18   -96.27   -79.89  -160.17  ...
1      2   -38.88   -33.83   -58.54   -40.09   -79.31   -72.81   -86.55   -85.33   -83.97   -73.38  ...
2      2   532.64   535.92   513.73   496.92   456.45   466.00   464.50   486.39   436.56   484.39  ...
3      2   326.52   347.39   302.35   298.13   317.74   312.70   322.33   311.31   312.42   323.33  ...
4      2 -1107.21 -1112.59 -1118.95 -1095.10 -1057.55 -1034.48  -998.34 -1022.71  -989.57  -970.88  ...

5 rows × 3198 columns

# Display statistical information of the train dataset
exo_train_df.describe()

             LABEL         FLUX.1         FLUX.2         FLUX.3         FLUX.4         FLUX.5  ...
count  2828.000000    2828.000000    2828.000000    2828.000000    2828.000000    2828.000000  ...
mean      1.013083    -311.926436    -336.771086    -351.412090    -360.745163    -371.939353  ...
std       0.113652    7864.298788    8578.027713    8228.290394    7913.072488   10585.024782  ...
min       1.000000 -227856.260000 -315440.760000 -284001.760000 -234006.870000 -423195.620000  ...
25%       1.000000     -51.045000     -49.790000     -47.462500     -45.537500     -41.390000  ...
50%       1.000000      -2.510000      -2.300000      -1.920000      -2.030000      -1.855000  ...
75%       1.000000      49.052500      44.495000      43.952500      40.402500      41.817500  ...
max       2.000000  150725.800000  129578.360000  102184.980000   82253.980000   67934.170000  ...

8 rows × 3198 columns

# Check the number of rows and columns in the 'exo_train_df'.
exo_train_df.shape

(2828, 3198)

Check For The Missing Values

# Find the total number of missing values in the 'exo_train_df'.
exo_train_df.isnull().sum()

LABEL 0
FLUX.1 0
FLUX.2 0
FLUX.3 0
FLUX.4 0
..
FLUX.3193 1
FLUX.3194 1
FLUX.3195 1
FLUX.3196 1
FLUX.3197 1
Length: 3198, dtype: int64


The output shows one missing value in each of the last few FLUX columns; the rest of the DataFrame has no missing values.
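A hedged sketch of one way to treat them (an illustrative choice; the recorded notebook simply proceeds with the data as loaded) is to fill each column's missing entries with that column's median:

# Fill the few missing FLUX values with each column's median (illustrative).
exo_train_df = exo_train_df.fillna(exo_train_df.median())
print(exo_train_df.isnull().sum().sum())   # now prints 0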

Slicing A DataFrame Using The iloc[] Function

Create Pandas series for the first 3 stars and the last 3 stars in the DataFrame.

Syntax:

dataframe_name.iloc[row_position_start : row_position_end, column_position_start : column_position_end]

In this syntax:

- row_position_start denotes the position of the row starting from which you want to take values into the new Pandas series or DataFrame.
- row_position_end denotes the position of the row up to which you want to take values.
- column_position_start denotes the position of the column starting from which you want to take values.
- column_position_end denotes the position of the column up to which you want to take values.

You can verify manually whether we have extracted the values from the first row by viewing the first 5 rows of the DataFrame using the head() function.

# Create a Pandas series for the first star and store it in a variable called 'star_0'.
star_0 = exo_train_df.iloc[0, :]
star_0.head()

LABEL 2.00
FLUX.1 93.85
FLUX.2 83.81
FLUX.3 20.10
FLUX.4 -26.98
Name: 0, dtype: float64

1 type(star_0)

pandas.core.series.Series

# Create a Pandas series for the second star and store it in a variable called 'star_1'.
star_1 = exo_train_df.iloc[1, :]
star_1.head()

LABEL 2.00
FLUX.1 -38.88
FLUX.2 -33.83
FLUX.3 -58.54
FLUX.4 -40.09
Name: 1, dtype: float64

# Create a Pandas series for the third star and store it in a variable called 'star_2'.
star_2 = exo_train_df.iloc[2, :]
star_2.head()

LABEL 2.00
FLUX.1 532.64
FLUX.2 535.92
FLUX.3 513.73
FLUX.4 496.92
Name: 2, dtype: float64

# Create a Pandas series for the last star and store it in a variable called 'star_5086'.
star_5086 = exo_train_df.iloc[-1, :]
star_5086.head()

LABEL 1.00
FLUX.1 -63.94
FLUX.2 -78.34
FLUX.3 -87.04
FLUX.4 -58.34
Name: 2827, dtype: float64


Scatter and Line Plots of Flux

Now plot the FLUX values on the y-axis for each observation for a star. On the x-axis, we will plot numbers ranging from 1 to 3197.

Scatter And Line Plots For First 3 Stars


To make this plot,
1. We first need to import a Python module named matplotlib.pyplot with plt as an alias. This module is exclusively designed for creating
graphs such as bar graphs, histogram, line plot, scatter plot etc. We will learn more about this module as we go on in this course.

import matplotlib.pyplot as plt

2. Then we need to call the figure() function from the plt module to resize the plot. The figure() function takes figsize=
(horizontal_width, vertical_height) parameter as an input.

plt.figure(figsize=(16, 4))

3. Then we need either a Python list, a NumPy array or a Pandas series containing the numbers between 1 and 3197 to plot them on the x-axis.

x_values_star_0 = np.arange(1, 3198)

4. Then we need the star_0 Pandas series to plot the FLUX values on the y-axis for the first star in the DataFrame.

y_values_star_0 = star_0[1:]

5. Then we need to call the scatter() function from the plt module with the required inputs as described in the third and the fourth steps.

plt.scatter(x_values_star_0, y_values_star_0)

6. Finally, we need to call the show() function from the plt module.

plt.show()

# Create a scatter plot for the 'star_0' Pandas series.

# 1. Import the 'numpy' and 'matplotlib.pyplot' modules.
import numpy as np
import matplotlib.pyplot as plt

# 2. Call the 'figure()' function to resize the plot.
plt.figure(figsize=(16, 4))
# Here, 16 means the plot is 16 units wide and 4 units high. Play with these numbers to draw different sized plots.

# 3. Call the 'scatter()' function to make a scatter plot between the x and y values.
# The scatter() function requires two inputs: x, the data plotted on the x-axis, and y, the data plotted on the y-axis.
# In our case, x is a series of numbers between 1 and 3197 and y is the 'FLUX' values for a star:
# the 'arange(1, 3198)' function from the 'numpy' module generates the numbers 1 to 3197, and
# star_0[1:] is a Pandas series containing all the 'FLUX' values from index 1 to the last index.
x_star0 = np.arange(1, 3198)
y_star0 = star_0[1:]
plt.scatter(x_star0, y_star0)

# 4. Call the 'show()' function.
plt.show()
# The 'show()' function displays the plot.


# Create a line plot for the 'star_0' Pandas series,
# i.e. a line plot for the first star in the DataFrame.
plt.figure(figsize=(16, 4))
x_star0 = np.arange(1, 3198)
y_star0 = star_0[1:]
# Call the plot(x, y) function to draw a line plot between the x and y values.
plt.plot(x_star0, y_star0)
[<matplotlib.lines.Line2D at 0x7f53f0d18f10>]

The line plot also confirms the periodic downward-peaks in the FLUX values.

# Create a scatter plot for the second star, i.e., 'star_1'.
plt.figure(figsize=(16, 4))
x_star1 = np.arange(1, 3198)
y_star1 = star_1[1:]
plt.scatter(x_star1, y_star1)

<matplotlib.collections.PathCollection at 0x7f53f026a150>

It is quite difficult to spot any clear pattern in the scatter plot for the second star in the DataFrame. Let's draw a line plot to identify a pattern.

# Create a line plot for the second star, i.e., 'star_1'.
plt.figure(figsize=(16, 4))
x_star1 = np.arange(1, 3198)
y_star1 = star_1[1:]
plt.plot(x_star1, y_star1)

[<matplotlib.lines.Line2D at 0x7f53f01f6350>]

As we can see, there are consistent sudden drops in the brightness levels for the second star in the DataFrame. This suggests that the planet is
orbiting its star at very high radial speed. Also, the planet could be very close to the star.

# Create a scatter plot for the third star, i.e., 'star_2'.
plt.figure(figsize=(16, 4))
x_star2 = np.arange(1, 3198)
y_star2 = star_2[1:]
plt.scatter(x_star2, y_star2)

<matplotlib.collections.PathCollection at 0x7f53f011bad0>

Here also, we can spot a clear repetitive downward-peak pattern, which confirms that the star has at least one planet.

# Create a line plot for the third star, i.e., 'star_2'.
plt.figure(figsize=(16, 4))
x_star2 = np.arange(1, 3198)
y_star2 = star_2[1:]
plt.plot(x_star2, y_star2)

[<matplotlib.lines.Line2D at 0x7f53f0095650>]

The line plot also confirms the repetitive downward-peak pattern.

Scatter Plots And Line Plots For The 2nd Last Star

Now, create the scatter plots and line plots for the 2nd last star, or any star in the DataFrame which has been labelled or classified as 1.

# Create a scatter plot for the second-last star, i.e., 'star_5085' in the DataFrame.
star_5085 = exo_train_df.iloc[-2, :]
plt.figure(figsize=(16, 4))
x_star5085 = np.arange(1, 3198)
y_star5085 = star_5085[1:]
plt.scatter(x_star5085, y_star5085)
7

<matplotlib.collections.PathCollection at 0x7f53f1b1bcd0>

There is no clear periodic downward-peak pattern in the FLUX values for the second-last star.

# Student Action: Create a line plot for the second-last star in the DataFrame.
plt.figure(figsize=(16, 4))
x_star5085 = np.arange(1, 3198)
y_star5085 = star_5085[1:]
plt.plot(x_star5085, y_star5085)

[<matplotlib.lines.Line2D at 0x7f53f1755c90>]

The line-plot also confirms that there is no clear periodic downward-peak pattern in the FLUX values.


Program : 3

Aim: Program to implement k-NN classification using any standard dataset available in the public domain and
find the accuracy of the algorithm

Algorithm:

The class of an unknown instance is computed using the following steps:

1. The distance between the unknown instance and all other training instances is computed.
2. The k nearest neighbors are identified.
3. The class labels of the k nearest neighbors are used to determine the class label of the unknown instance
by using techniques like majority voting.
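To make these steps concrete, here is a minimal from-scratch sketch of the same procedure (an illustration only; the recorded program below uses sklearn, and the NumPy-array inputs here are an assumption):

# Minimal k-NN classifier sketch (illustrative; assumes NumPy arrays).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 1: distance from the unknown instance to every training instance.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Step 2: indices of the k nearest neighbours.
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote over the neighbours' class labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]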

from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

iris = datasets.load_iris()
x, y = iris.data[:, :], iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=20, train_size=.8)
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
x_train

array([[-0.02627732, -1.00810966, 0.13128182, 0.00996 ],


[-0.87523698, 1.10598438, -1.34741591, -1.31804007],
[-0.75395703, 1.10598438, -1.29054292, -1.31804007],
[-1.11779688, 1.34088372, -1.34741591, -1.45084008],
[-0.51139713, 2.04558174, -1.4042889 , -1.05244006],
[-0.26883722, -0.53831098, 0.64313872, 1.07236006],
[-0.99651693, -1.71280767, -0.26682911, -0.25564001],
[ 0.94396229, -0.30341164, 0.47251976, 0.14276001],
[-0.99651693, 0.63618571, -1.34741591, -1.31804007],
[ 0.58012243, -1.243009 , 0.70001171, 0.93956005],
[ 0.58012243, 0.63618571, 1.26874161, 1.73636009],
[ 2.27804175, -0.53831098, 1.66685254, 1.07236006],
[-0.39011718, -1.47790834, 0.01753584, -0.12284001],
[-1.48163674, 0.40128637, -1.34741591, -1.31804007],
[-1.72419664, 0.40128637, -1.4042889 , -1.31804007],
[-0.87523698, 0.87108505, -1.29054292, -1.31804007],
[-0.87523698, -1.243009 , -0.43744808, -0.12284001],
[ 1.18652219, 0.40128637, 1.21186862, 1.47076008],
[ 1.30780214, 0.16638703, 0.7568847 , 1.47076008],
[ 1.06524224, 0.63618571, 1.09812264, 1.20516007],
[ 1.42908209, 0.40128637, 0.52939275, 0.27556001],
[ 1.06524224, -0.06851231, 0.70001171, 0.67396004],
[ 2.27804175, -0.06851231, 1.3256146 , 1.47076008],
[-0.26883722, -1.243009 , 0.07440883, -0.12284001],
[ 0.33756253, -1.00810966, 1.04124965, 0.27556001],
[ 0.70140238, 0.16638703, 0.98437666, 0.80676004],
[ 0.21628258, -0.30341164, 0.41564677, 0.40836002],
[ 0.45884248, -0.53831098, 0.58626573, 0.80676004],
[-1.72419664, -0.30341164, -1.34741591, -1.31804007],
[-1.60291669, -1.71280767, -1.4042889 , -1.18524006],
[-0.14755727, -1.243009 , 0.70001171, 1.07236006],

# Identify the ideal value for k
score = []
for k in range(1, 15):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    y_pred = knn.predict(x_test)
    score.append(accuracy_score(y_test, y_pred))
    print("when k = %s ,accuracy is %s" % (k, accuracy_score(y_test, y_pred)))

when k = 1 ,accuracy is 0.9666666666666667


when k = 2 ,accuracy is 0.9666666666666667
when k = 3 ,accuracy is 0.9666666666666667
when k = 4 ,accuracy is 0.9666666666666667
when k = 5 ,accuracy is 0.9666666666666667
when k = 6 ,accuracy is 0.9666666666666667
when k = 7 ,accuracy is 0.9666666666666667
when k = 8 ,accuracy is 0.9666666666666667
when k = 9 ,accuracy is 0.9666666666666667
when k = 10 ,accuracy is 0.9666666666666667
when k = 11 ,accuracy is 1.0
when k = 12 ,accuracy is 1.0
when k = 13 ,accuracy is 1.0
when k = 14 ,accuracy is 1.0

# Train a kNN classifier model for 'k = 3'.
knn3 = neighbors.KNeighborsClassifier(n_neighbors=3)
knn3.fit(x_train, y_train)

# Perform prediction using the 'predict()' function.
y_test_pred = knn3.predict(x_test)

# Call the 'score()' function to check the accuracy score of the test set.
print("Test set accuracy:", knn3.score(x_test, y_test))
print("confusion matrix:")
print(confusion_matrix(y_test, y_test_pred))
print(classification_report(y_test, y_test_pred))

Test set accuracy: 0.9666666666666667


confusion matrix:
[[10 0 0]
[ 0 10 0]
[ 0 1 9]]
precision recall f1-score support

0 1.00 1.00 1.00 10


1 0.91 1.00 0.95 10
2 1.00 0.90 0.95 10

accuracy 0.97 30
macro avg 0.97 0.97 0.97 30
weighted avg 0.97 0.97 0.97 30


Program : 4

Aim: Program to implement Naïve Bayes Algorithm using any standard dataset available in the public
domain and find the accuracy of the algorithm

Short notes: Naive Bayes

Bayes’ Theorem provides a way that we can calculate the probability of a piece of data belonging to a given class,
given our prior knowledge. Bayes’ Theorem is stated as:

P(class|data) = (P(data|class) * P(class)) / P(data)

Where P(class|data) is the probability of class given the provided data.
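For example (an illustrative calculation, not taken from the dataset): if P(data|class) = 0.8, P(class) = 0.1 and P(data) = 0.2, then P(class|data) = (0.8 * 0.1) / 0.2 = 0.4.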

We are using the Iris dataset. The Iris Flower Dataset involves predicting the flower species given measurements of iris flowers.

It is a multiclass classification problem. The number of observations for each class is balanced. There are 150
observations with 4 input variables and 1 output variable. The variable names are as follows:

Sepal length in cm.

Sepal width in cm.

Petal length in cm.

Petal width in cm., and

Class.

Algorithm:

Step 1: Separate By Class.

Step 2: Summarize Dataset.

Step 3: Summarize Data By Class.

Step 4: Gaussian Probability Density Function.

Step 5: Class Probabilities.
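As a concrete illustration of Steps 4 and 5, here is a minimal from-scratch sketch of the Gaussian density and the resulting class score (an assumption for exposition; the recorded program below uses sklearn's GaussianNB instead):

# Gaussian probability density function (Step 4) and class score (Step 5) - illustrative sketch.
import math

def gaussian_pdf(x, mean, stdev):
    # Likelihood P(x | class) under a normal distribution with the class's mean and stdev.
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

def class_score(features, prior, means, stdevs):
    # P(class) multiplied by the per-feature likelihoods; the class with the
    # highest score is chosen as the prediction.
    score = prior
    for x, m, s in zip(features, means, stdevs):
        score *= gaussian_pdf(x, m, s)
    return score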

# Import modules
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load the iris dataset & do a train_test_split
iris = datasets.load_iris()
x, y = iris.data[:, :], iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=20, train_size=.8)


# Feature scaling
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
x_train

array([[-0.02627732, -1.00810966, 0.13128182, 0.00996 ],


[-0.87523698, 1.10598438, -1.34741591, -1.31804007],
[-0.75395703, 1.10598438, -1.29054292, -1.31804007],
[-1.11779688, 1.34088372, -1.34741591, -1.45084008],
[-0.51139713, 2.04558174, -1.4042889 , -1.05244006],
[-0.26883722, -0.53831098, 0.64313872, 1.07236006],
[-0.99651693, -1.71280767, -0.26682911, -0.25564001],
[ 0.94396229, -0.30341164, 0.47251976, 0.14276001],
[-0.99651693, 0.63618571, -1.34741591, -1.31804007],
[ 0.58012243, -1.243009 , 0.70001171, 0.93956005],
[ 0.58012243, 0.63618571, 1.26874161, 1.73636009],
[ 2.27804175, -0.53831098, 1.66685254, 1.07236006],
[-0.39011718, -1.47790834, 0.01753584, -0.12284001],
[-1.48163674, 0.40128637, -1.34741591, -1.31804007],
[-1.72419664, 0.40128637, -1.4042889 , -1.31804007],
[-0.87523698, 0.87108505, -1.29054292, -1.31804007],
[-0.87523698, -1.243009 , -0.43744808, -0.12284001],
[ 1.18652219, 0.40128637, 1.21186862, 1.47076008],
[ 1.30780214, 0.16638703, 0.7568847 , 1.47076008],
[ 1.06524224, 0.63618571, 1.09812264, 1.20516007],
[ 1.42908209, 0.40128637, 0.52939275, 0.27556001],
[ 1.06524224, -0.06851231, 0.70001171, 0.67396004],
[ 2.27804175, -0.06851231, 1.3256146 , 1.47076008],
[-0.26883722, -1.243009 , 0.07440883, -0.12284001],
[ 0.33756253, -1.00810966, 1.04124965, 0.27556001],
[ 0.70140238, 0.16638703, 0.98437666, 0.80676004],
[ 0.21628258, -0.30341164, 0.41564677, 0.40836002],
[ 0.45884248, -0.53831098, 0.58626573, 0.80676004],
[-1.72419664, -0.30341164, -1.34741591, -1.31804007],
[-1.60291669, -1.71280767, -1.4042889 , -1.18524006],
[-0.14755727, -1.243009 , 0.70001171, 1.07236006],
[ 0.70140238, -0.77321032, 0.87063068, 0.93956005],
[-0.87523698, 0.63618571, -1.17679694, -0.91964005],
[ 0.70140238, -0.53831098, 1.04124965, 1.20516007],
[-0.26883722, -0.06851231, 0.18815481, 0.14276001],
[-0.99651693, -2.41750569, -0.15308313, -0.25564001],
[-1.11779688, 0.16638703, -1.29054292, -1.31804007],
[-0.39011718, 2.75027975, -1.34741591, -1.31804007],
[ 0.58012243, 0.87108505, 1.04124965, 1.60356009],
[-0.39011718, -1.243009 , 0.13128182, 0.14276001],
[-1.23907683, -0.06851231, -1.34741591, -1.45084008],
[ 0.21628258, -1.94770701, 0.13128182, -0.25564001],
[-0.51139713, 1.57578306, -1.29054292, -1.31804007],
[-1.23907683, -0.06851231, -1.34741591, -1.18524006],
[ 0.58012243, -1.71280767, 0.35877378, 0.14276001],
[ 0.58012243, -1.243009 , 0.64313872, 0.40836002],
[-1.11779688, -0.06851231, -1.34741591, -1.31804007],
[-1.84547659, -0.06851231, -1.51803488, -1.45084008],
[ 0.70140238, 0.40128637, 0.41564677, 0.40836002],
[-0.02627732, 2.28048107, -1.46116189, -1.31804007],
[-1.48163674, 0.87108505, -1.34741591, -1.18524006],
[-0.02627732, -0.77321032, 0.18815481, -0.25564001],
[ 0.58012243, -0.53831098, 0.7568847 , 0.40836002],
[-0.99651693, 1.10598438, -1.4042889 , -1.18524006],
[-0.26883722, -0.06851231, 0.41564677, 0.40836002],
[-0.39011718, -1.47790834, -0.03933715, -0.25564001],
[ 1.79292194, -0.30341164, 1.43936058, 0.80676004],
[ 0.09500263, -0.06851231, 0.7568847 , 0.80676004],


In this step, we introduce the GaussianNB class from the sklearn.naive_bayes library. Here we have used a Gaussian model; there are several other models such as Bernoulli, Categorical and Multinomial. We create an instance of the GaussianNB class and fit the x_train and y_train values to it for training.
# Implement Naive Bayes
from sklearn.naive_bayes import GaussianNB

darsana = GaussianNB()
darsana.fit(x_train, y_train)

GaussianNB()

# Predict the values for the test data
y_test_pred = darsana.predict(x_test)

# Display the accuracy score, confusion matrix & classification report
print(accuracy_score(y_test, y_test_pred))
print("confusion matrix")
print(confusion_matrix(y_test, y_test_pred))
print("classification report")
print(classification_report(y_test, y_test_pred))

1.0
confusion matrix
[[10 0 0]
[ 0 10 0]
[ 0 0 10]]
classification report
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 10
2 1.00 1.00 1.00 10

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

# Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB

darsana = BernoulliNB()
darsana.fit(x_train, y_train)
y_test_pred = darsana.predict(x_test)
print(accuracy_score(y_test, y_test_pred))
print("confusion matrix")
print(confusion_matrix(y_test, y_test_pred))
print("classification report")
print(classification_report(y_test, y_test_pred))


0.7
confusion matrix
[[10 0 0]
[ 2 2 6]
[ 0 1 9]]
classification report
precision recall f1-score support

0 0.83 1.00 0.91 10


1 0.67 0.20 0.31 10
2 0.60 0.90 0.72 10

accuracy 0.70 30
macro avg 0.70 0.70 0.65 30
weighted avg 0.70 0.70 0.65 30

From the above results, we infer that the Gaussian model classified all 30 test instances correctly (accuracy 1.0), while the Bernoulli model classified only 21 of the 30 correctly (accuracy 0.70). This is expected, since the iris features are continuous measurements rather than binary values.


Program : 5
Aim : Program to implement simple linear regression technique using any standard dataset available in the public domain and
evaluate its performance.

Problem Statement
As an owner of a startup, you wish to forecast the sales of your product to plan how much money should be spent on advertisements. This is because
the sale of a product is usually proportional to the money spent on advertisements.
Predict the impact of TV advertising on your product sales by performing simple linear regression analysis.

List of Activities
Activity 1: Analysing the dataset
Activity 2: Train-Test split
Activity 3: Model training
Activity 4: Plotting the best fit line
Activity 5: Model prediction

Activity 1: Analysing the Dataset


Create a Pandas DataFrame for the Advertising-Sales dataset using the link below. This dataset contains information about the money spent on TV, radio and newspaper advertisements (in thousands of dollars) and the sales they generated (in thousands of units); i.e., the values in the dataset are divided by 1000.
Dataset Link: https://raw.githubusercontent.com/jiss-sngce/CO_3/main/advertising.csv

Also, print the first five rows of the dataset. Check for null values and treat them accordingly.

# Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
ad_df = pd.read_csv('https://raw.githubusercontent.com/jiss-sngce/CO_3/main/advertising.csv')

# Print the first five rows using the head() function
ad_df.head()

TV Radio Newspaper Sales

0 230.1 37.8 69.2 22.1

1 44.5 39.3 45.1 10.4

2 17.2 45.9 69.3 12.0

3 151.5 41.3 58.5 16.5

4 180.8 10.8 58.4 17.9

# Check if there are any null values. If any column has null values, treat them accordingly.
ad_df.isnull().sum()

TV 0
Radio 0
Newspaper 0
Sales 0
dtype: int64

Activity 2: Train-Test Split
For simple linear regression, consider only the effect of TV ads on sales. Thus, TV is the feature variable and Sales is the target variable.

Split the dataset into training set and test set such that the training set contains 67% of the instances and the remaining instances will become the
test set.

# Split the DataFrame into the training and test sets.
from sklearn.model_selection import train_test_split

x = ad_df['TV']
y = ad_df['Sales']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.33, random_state=15)

Activity 3: Model Training

Train the simple linear regression model using the training data to obtain the best fit line y = mx + c. For this, perform the following tasks:

1. Create the following two functions:

   - A function errors_product() that calculates the product of errors for the feature and target variables, i.e. (xᵢ − x̄)(yᵢ − ȳ)
   - A function squared_errors() that calculates the squared errors for the feature variable only, i.e. (xᵢ − x̄)²

2. Calculate the slope and intercept values for the best fit line by applying the following formulae:

   slope ⇒ m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = errors_product().sum() / squared_errors().sum()

   intercept ⇒ c = ȳ − m·x̄

# Create the 'errors_product()' and 'squared_errors()' functions.
def errors_product():
    pro = (x_train - x_train.mean()) * (y_train - y_train.mean())
    return pro

def squared_errors():
    sq = (x_train - x_train.mean()) ** 2
    return sq

# Calculate the slope and intercept values for the best fit line.
slope = errors_product().sum() / squared_errors().sum()
round(slope, 3)

0.057

inc = y_train.mean() - slope * x_train.mean()
round(inc, 3)

6.767

Q: What is the equation obtained for the best fit line of this model?
A: slope (m) = 0.057 and intercept (c) = 6.767, so:

sales = 0.057 * tv + 6.767

Activity 4: Plotting the Best Fit Line


After obtaining the slope and intercept values for the best fit line, plot this line along with the scatter plot to see how well it fits the points.

# Plot the regression line in the scatter plot between Sales and TV advertisement values.
plt.style.use('dark_background')
plt.figure(figsize=(16, 14))
plt.scatter(ad_df['TV'], ad_df['Sales'])
plt.plot(ad_df['TV'], slope * ad_df['TV'] + inc, color='r', label='y=0.057*x+6.767')
plt.xlabel('TV')
plt.ylabel('Sales')
plt.legend()
plt.show()


Activity 5: Model Prediction


For TV advertising of $50,000, what is the prediction for Sales? In order to predict this value, perform the following tasks:

- Based on the regression line, create a function sales_predicted() which takes a budget to be used for TV advertising as an input and returns the corresponding units of Sales.
- Call the sales_predicted() function and pass the amount spent on TV advertising.

Note: To predict the sales for TV advertising of $50,000, pass 50 as the parameter to the sales_predicted() function, as the values in this dataset are divided by 1000. Also, the value obtained from sales_predicted(50) must be multiplied by 1000 to obtain the predicted units of sales.

# Create a function which takes a TV advertisement value as an input and returns the sales.
def sales_predicted(tv_bd):
    return 0.057 * tv_bd + 6.767

# Calculating the sales value against $50,000 spent on TV ads
bd = sales_predicted(50)
bd * 1000

9617.0

Q: If you are planning to invest $50,000 in TV advertising, how many units of sales can be predicted according to this simple linear regression model?
A: About 9617 units of sales.

x_train.shape

(134,)

y_train.shape

(134,)

type(x_train)

pandas.core.series.Series

# Deploy a linear regression model using sklearn.linear_model.

# 1. Import the 'LinearRegression' class from sklearn.linear_model.
from sklearn.linear_model import LinearRegression

x_train_res = x_train.values.reshape(-1, 1)
y_train_res = y_train.values.reshape(-1, 1)

# 2. Create an object of the LinearRegression class.
darsana = LinearRegression()

# 3. Call fit() to train the model. The input to fit() should be a 2-dimensional array.
darsana.fit(x_train_res, y_train_res)

# 4. Print the slope and intercept values.
print(darsana.intercept_)
print(darsana.coef_)

[6.76732677]
[[0.05729132]]


Program : 6

Aim : Program to implement multiple linear regression technique using any standard dataset available in the public
domain and evaluate its performance.

MultipleLinearRegression

The description for all the columns containing data for air pollutants, temperature, relative humidity and absolute humidity is provided below.

Columns         Description
PT08.S1(CO)     PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
C6H6(GT)        True hourly averaged Benzene concentration in µg/m³
PT08.S2(NMHC)   PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
PT08.S3(NOx)    PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
PT08.S4(NO2)    PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
PT08.S5(O3)     PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
T               Temperature in °C
RH              Relative Humidity (%)
AH              Absolute Humidity

Multiple Linear Regression Model Using sklearn Module

# Load the dataset & display the first 5 rows. The GitHub link is as follows:
# https://raw.githubusercontent.com/jiss-sngce/air/main/airquality.csv.csv
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/jiss-sngce/air/main/airquality.csv.csv')
df.head()

              DateTime  PT08.S1(CO)  C6H6(GT)  PT08.S2(NMHC)  PT08.S3(NOx)  PT08.S4(NO2)  PT08.S5(O3)     T    RH  ...
0  2004-03-10 18:00:00       1360.0      11.9         1046.0        1056.0        1692.0       1268.0  13.6  48.9  ...
1  2004-03-10 19:00:00       1292.0       9.4          955.0        1174.0        1559.0        972.0  13.3  47.7  ...
2  2004-03-10 20:00:00       1402.0       9.0          939.0        1140.0        1555.0       1074.0  11.9  54.0  ...

# Display the columns in the dataframe
df.info

<bound method DataFrame.info of DateTime PT08.S1(CO) C6H6(GT) ... Month Day Day Name
0 2004-03-10 18:00:00 1360.0 11.9 ... 3 10 Wednesday
1 2004-03-10 19:00:00 1292.0 9.4 ... 3 10 Wednesday
2 2004-03-10 20:00:00 1402.0 9.0 ... 3 10 Wednesday
3 2004-03-10 21:00:00 1376.0 9.2 ... 3 10 Wednesday
4 2004-03-10 22:00:00 1272.0 6.5 ... 3 10 Wednesday
... ... ... ... ... ... ... ...
9352 2005-04-04 10:00:00 1314.0 13.5 ... 4 4 Monday
9353 2005-04-04 11:00:00 1163.0 11.4 ... 4 4 Monday
9354 2005-04-04 12:00:00 1142.0 12.4 ... 4 4 Monday
9355 2005-04-04 13:00:00 1003.0 9.5 ... 4 4 Monday
9356 2005-04-04 14:00:00 1071.0 11.9 ... 4 4 Monday

[9357 rows x 14 columns]>

# Build a linear regression model using the sklearn module by including all the
# features except DateTime, Day Name & RH.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

features = list(df.columns.values[1:-1])
features.remove('RH')
X = df[features]
y = df['RH']

# Splitting the DataFrame into the train and test sets.
# The test set will have 33% of the values.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

y_train_reshaped = y_train.values.reshape(-1, 1)
y_test_reshaped = y_test.values.reshape(-1, 1)

# Build a linear regression model using the 'sklearn.linear_model' module.
sklearn_lin_reg = LinearRegression()
sklearn_lin_reg.fit(X_train, y_train_reshaped)

# Print the value of the intercept.
print('Intercept', sklearn_lin_reg.intercept_[0])

# Print the names of the features along with the values of their corresponding coefficients.
print("coefficient : ", sklearn_lin_reg.coef_)
for item in list(zip(X.columns.values, sklearn_lin_reg.coef_[0])):
    print(item[0], item[1])

Intercept -15028.451823247718
coefficient :  [[ 1.48327948e-02 -9.03464156e-01 -5.88095941e-03 1.50325488e-03
2.64965020e-02 -1.06574176e-03 -2.35491907e+00 2.95517421e+01
7.50515310e+00 1.16786097e+00 3.52321248e-02]]
PT08.S1(CO) 0.014832794792690625
C6H6(GT) -0.9034641560183382
PT08.S2(NMHC) -0.005880959405385411
PT08.S3(NOx) 0.0015032548783276978
PT08.S4(NO2) 0.026496502045666503
PT08.S5(O3) -0.001065741763271788
T -2.354919067592639
AH 29.551742104329783
Year 7.505153097892558
Month 1.1678609682998067
Day 0.03523212478929974

# Evaluate the linear regression model using the 'r2_score', 'mean_squared_error'
# & 'mean_absolute_error' functions of the 'sklearn.metrics' module.
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np

y_train_pred = sklearn_lin_reg.predict(X_train)
y_test_pred = sklearn_lin_reg.predict(X_test)

print('Train Set')
print('R-squared : ', r2_score(y_train_reshaped, y_train_pred))
print('mean squared error : ', mean_squared_error(y_train_reshaped, y_train_pred))
print('root mean squared error : ', np.sqrt(mean_squared_error(y_train_reshaped, y_train_pred)))
print('mean absolute error : ', mean_absolute_error(y_train_reshaped, y_train_pred))

print('\nTest set')
print('R-squared : ', r2_score(y_test_reshaped, y_test_pred))
print('mean squared error : ', mean_squared_error(y_test_reshaped, y_test_pred))
print('root mean squared error : ', np.sqrt(mean_squared_error(y_test_reshaped, y_test_pred)))
print('mean absolute error : ', mean_absolute_error(y_test_reshaped, y_test_pred))

Train Set
R-squared :  0.8785638240066055
mean squared error :  35.11591834141915
root mean squared error :  5.925868572742662
mean absolute error :  4.571994849644625

Test set
R-squared :  0.8787020691681189
mean squared error :  34.702124455429534
root mean squared error :  5.8908509109830245
mean absolute error :  4.564460432924346


Program : 7
Aim : Program to implement support vector machine

Loading Data

Let's load both the training and the test datasets.

Train Dataset: https://raw.githubusercontent.com/akshayr89/MNSIST_Handwritten_Digit_Recognition-SVM/master/train.csv
Test Dataset: https://raw.githubusercontent.com/akshayr89/MNSIST_Handwritten_Digit_Recognition-SVM/master/test.csv

The MNIST train & test datasets are also uploaded in Google Classroom.

Dataset credits: http://yann.lecun.com/exdb/mnist/

Now, get the information on both data frames.

# Load the train & test datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train_df = pd.read_csv('https://raw.githubusercontent.com/akshayr89/MNSIST_Handwritten_Digit_Recognition-SVM/master/train.csv')
test_df = pd.read_csv('https://raw.githubusercontent.com/akshayr89/MNSIST_Handwritten_Digit_Recognition-SVM/master/test.csv')
train_df.head()
print()
print(train_df.shape)
print(test_df.shape)

(42000, 785)
(28000, 784)

# Get the information on the train dataset.
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42000 entries, 0 to 41999
Columns: 785 entries, label to pixel783
dtypes: int64(785)
memory usage: 251.5 MB

There are 42000 rows and 785 columns in the training dataset.

# Get the information on the test dataset.
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28000 entries, 0 to 27999
Columns: 784 entries, pixel0 to pixel783
dtypes: int64(784)
memory usage: 167.5 MB

There are 28000 rows and 784 columns in the test dataset. This means we don't have the labels column for the test set.

# Print the first and last five columns of both the test and train datasets.
print("train data")
print("first 5 columns : ", list(train_df.columns[:5]))
print("last 5 columns : ", list(train_df.columns[-5:]))
print()

print("test data")
print("first 5 columns : ", list(test_df.columns[:5]))
print("last 5 columns : ", list(test_df.columns[-5:]))

train data
first 5 columns : ['label', 'pixel0', 'pixel1', 'pixel2', 'pixel3']
last 5 columns : ['pixel779', 'pixel780', 'pixel781', 'pixel782', 'pixel783']

test data
first 5 columns : ['pixel0', 'pixel1', 'pixel2', 'pixel3', 'pixel4']
last 5 columns : ['pixel779', 'pixel780', 'pixel781', 'pixel782', 'pixel783']

As you can see, the train set has the label column but the test set doesn't.

Now, let's print the first ten rows of the data frame containing the train set.

# Print the first ten rows of the data frame containing the train set.
train_df.head(10)
As you can see:

- The first row contains the pixel values of the image of the handwritten digit 1.
- Similarly, the second row contains the pixel values of the image of the handwritten digit 0.
- The 10th row contains the pixel values of the image of the handwritten digit 3.

Let's print the image of the digit 4.

The matplotlib.pyplot.imshow() Function

To display an image from its pixel values, you can use the imshow() function of the matplotlib.pyplot module. So, to create the image of the digit 4 from its pixel values, we will follow the steps given below:

1. Create a 1D array containing the pixel values from the training data frame for the image and store it in a variable.

2. Then reshape the above array into a 2D array having 28 rows and 28 columns.

3. Then pass the 2D array to the imshow() function with the parameters cmap = 'gray', vmin = 0, vmax = 255.
Note: There are other parameters that can be passed to the imshow() function as inputs. But for now, we will pass the above parameters only.

4. Provide the title to the image.

# Display the image of the handwritten digit 4 from the train data frame.
four_pixels = train_df.iloc[3, 1:]
four_pixels = four_pixels.values.reshape(28, 28)
plt.figure(figsize = (5, 5), dpi = 81)
plt.title("Handwritten Digit 4", fontsize = 16)
plt.imshow(four_pixels, cmap = 'gray', vmin = 0, vmax = 255)
plt.show()


In the above code:

- four_pixels = train_df.iloc[3, 1:] gets the pixel values of the image of the digit 4 that are stored in the 4th row of the data frame.
- four_pixels = four_pixels.values.reshape(28, 28) first gets the pixel values from the Pandas series in the form of a NumPy array and then reshapes the 1D array into a 2D array having 28 rows and 28 columns.
- plt.figure(figsize = (5, 5), dpi = 81) sets the figure size.
- plt.title("Handwritten Digit 4", fontsize = 16) sets the title of the plot.
- plt.imshow(four_pixels, cmap = 'gray', vmin = 0, vmax = 255) creates a 2D image in gray colour.

If you look at the axes of the above image, you can see that nearly the first four and last three rows are blank. Similarly, the first five and last five columns are blank, which is denoted by the black colour. So let's print the rows 5 to 26 and columns 5 to 23 of the four_pixels NumPy array to see the pixel values of the image of the handwritten digit 4.

# Print the rows 5 to 26 and columns 5 to 23 of the 'four_pixels' NumPy array to see
# the pixel values of the image of the handwritten digit 4.
print(four_pixels[4:26, 5:23])

[[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 0 220 179 6 0 0 0 0 0 0 0 0 9 77 0 0 0 0]
[ 0 28 247 17 0 0 0 0 0 0 0 0 27 202 0 0 0 0]
[ 0 0 242 155 0 0 0 0 0 0 0 0 27 254 63 0 0 0]
[ 0 0 160 207 6 0 0 0 0 0 0 0 27 254 65 0 0 0]
[ 0 0 127 254 21 0 0 0 0 0 0 0 20 239 65 0 0 0]
[ 0 0 77 254 21 0 0 0 0 0 0 0 0 195 65 0 0 0]
[ 0 0 70 254 21 0 0 0 0 0 0 0 0 195 142 0 0 0]
[ 0 0 56 251 21 0 0 0 0 0 0 0 0 195 227 0 0 0]
[ 0 0 0 222 153 5 0 0 0 0 0 0 0 120 240 13 0 0]
[ 0 0 0 67 251 40 0 0 0 0 0 0 0 94 255 69 0 0]
[ 0 0 0 0 234 184 0 0 0 0 0 0 0 19 245 69 0 0]
[ 0 0 0 0 234 169 0 0 0 0 0 0 0 3 199 182 10 0]
[ 0 0 0 0 154 205 4 0 0 26 72 128 203 208 254 254 131 0]
[ 0 0 0 0 61 254 129 113 186 245 251 189 75 56 136 254 73 0]
[ 0 0 0 0 15 216 233 233 159 104 52 0 0 0 38 254 73 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 254 73 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 254 73 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 206 106 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 186 159 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 209 101 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

From the above output, you can see the non-zero pixel values arranged in the pattern of digit 4.
It is to be noted that the pixel values for a grayscale image range from 0 to 255.
You can also look at the descriptive statistics for the first 10 images in the train data frame.
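For instance (a one-cell sketch, assuming the train_df loaded above):

# Descriptive statistics for the pixel columns of the first 10 images (illustrative).
train_df.iloc[:10, 1:].describe()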

# Create a data frame from the training data frame that contains the pixel values of the images of the digit 6.
six_pixels_train_df = train_df.loc[train_df['label'] == 6, :]
six_pixels_train_df

https://colab.research.google.com/drive/1yRxpEhgaSYM7uHL5ruJ-t6KwBKrRc90R#scrollTo=vhwhnTmmuQVO&printMode=true 3/
3/7/22, 8:10 PM SREEJITH T SHAJI Support Vector Machines - MNIST Digits Classification -
Colaboratory

       label  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  pixel9  pixel10  ...
21         6       0       0       0       0       0       0       0       0       0       0        0  ...
26         6       0       0       0       0       0       0       0       0       0       0        0  ...
45         6       0       0       0       0       0       0       0       0       0       0        0  ...
64         6       0       0       0       0       0       0       0       0       0       0        0  ...
72         6       0       0       0       0       0       0       0       0       0       0        0  ...
...      ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...      ...  ...
41921      6       0       0       0       0       0       0       0       0       0       0        0  ...
41927      6       0       0       0       0       0       0       0       0       0       0        0  ...
41967      6       0       0       0       0       0       0       0       0       0       0        0  ...
41993      6       0       0       0       0       0       0       0       0       0       0        0  ...
41998      6       0       0       0       0       0       0       0       0       0       0        0  ...

4137 rows × 785 columns

Now, from the above data frame, let's create an image of the first instance of the image of digit 6. Its index is 21.

# Create an image from the pixel values of the image of the digit 6 that are stored in row 21.
six_pixels = train_df.iloc[21, 1:]
six_pixels = six_pixels.values.reshape(28, 28)
plt.figure(figsize = (5, 5), dpi = 81)
plt.title("Handwritten Digit 6", fontsize = 16)
plt.imshow(six_pixels, cmap = 'gray', vmin = 0, vmax = 255)
plt.show()

Now, let's print the part of the array containing the pixel values of the above image such that their arrangement resembles the digit 6.

# S3.8: Print the rows 2 to 22 and columns 5 to 21 of the 'six_pixels' array.
print(six_pixels[2:22, 5:21])

[[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 70]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 27 189 254]
[ 0 0 0 0 0 0 0 0 0 0 0 0 28 219 255 206]
[ 0 0 0 0 0 0 0 0 0 0 8 94 233 248 179 31]
[ 0 0 0 0 0 0 0 0 0 0 146 254 251 84 0 0]
[ 0 0 0 0 0 0 0 0 51 173 252 209 65 0 0 0]
[ 0 0 0 0 0 0 2 119 252 254 146 20 0 0 0 0]
[ 0 0 0 0 0 18 131 254 239 130 25 0 0 0 0 0]
[ 0 0 0 0 17 237 254 239 58 0 0 0 0 0 0 20]
[ 0 0 4 70 223 251 196 61 0 0 0 30 112 138 207 226]
[ 0 0 153 254 228 68 0 0 0 34 143 249 254 233 177 179]
[ 0 67 253 208 40 0 0 31 99 226 241 195 112 14 0 18]
[ 67 241 168 8 0 0 60 239 253 161 37 0 0 0 20 165]
[185 254 74 0 0 43 224 254 116 0 0 0 3 73 205 253]
[252 121 1 0 47 205 230 53 2 0 0 53 176 254 219 118]
[254 107 2 1 127 254 65 5 24 107 198 250 252 195 27 0]
[234 254 199 172 254 254 186 254 254 254 234 134 53 0 0 0]
[109 195 233 250 254 254 254 244 129 46 20 0 0 0 0 0]
[ 0 0 24 71 254 254 254 235 84 0 0 0 0 0 0 0]]

Now, for a machine learning algorithm (in this case, SVM), to correctly identify an image for a digit, it has to figure out the arrangement of pixel values for a
digit on a 2D grid (in this case, 28×28 grid). Knowing this, we can now build a machine learning model (in this case, SVM) to classify the images of
different handwritten digits.

Check for Data Imbalance


Before building a classification model, let's check whether the training dataset is imbalanced or not.

# Find out the counts of records for each digit in the training dataset.
train_df['label'].value_counts(dropna = False, normalize = True) * 100

1 11.152381

7 10.478571
3 10.359524
9 9.971429
2 9.945238
6 9.850000
0 9.838095
4 9.695238
8 9.673810
5 9.035714
Name: label, dtype: float64

Note:
1. The dropna = False parameter counts the number of NA or null values if they are present in a Pandas series.

2. The normalize = True parameter calculates the count of a value as the fraction of the total number of records.

From the count of labels, we can see that the training dataset is balanced. Hence, we can now proceed to build a classification model.
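A bar chart makes the balance easy to see; a small sketch (matplotlib is assumed to be imported as plt, as elsewhere in this notebook):

# Plot the percentage of samples per digit
label_counts = train_df['label'].value_counts(normalize = True) * 100
plt.figure(figsize = (8, 4))
plt.bar(label_counts.index, label_counts.values)
plt.xlabel('Digit')
plt.ylabel('Percentage of samples')
plt.show()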


Activity 1: Feature Scaling or Normalisation


Now that we have ensured that there is no data imbalance, let's scale down the pixel values of each image, because support vector machines are sensitive to the scale of numeric features. Also, with large feature values, the time taken to train an SVM model increases.

So let's divide each pixel value of each image by 255 (the greatest pixel value for a grayscale image) to scale the values to the range 0 to 1.

# Create features and target data frames and divide each pixel for each image by 255.0
feature_train = train_df.iloc[:, 1:] / 255.0
target_train_actual = train_df['label']
feature_train.set_index(keys = target_train_actual, inplace = False).T.describe()

label 1 0 1 4 0 0 7 3

count 784.000000 784.000000 784.000000 784.000000 784.000000 784.000000 784.000000 784.000000 784

mean 0.083278 0.223134 0.067152 0.075155 0.255567 0.115351 0.085794 0.121489 0

std 0.253570 0.389066 0.233957 0.226616 0.411225 0.286204 0.247198 0.277065 0

min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0

25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0

50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0

75% 0.000000 0.259804 0.000000 0.000000 0.515686 0.000000 0.000000 0.000000 0

max 1.000000 1.000000 1.000000 1.000000 0.996078 1.000000 1.000000 1.000000 1

8 rows × 42000 columns
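Before moving on, a quick sanity check (a sketch, using the feature_train created above) confirms that the scaled pixel values now lie between 0 and 1:

# Sanity check: after dividing by 255, every pixel value should lie in [0, 1]
print(feature_train.values.min(), feature_train.values.max())   # expected: 0.0 1.0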

Activity 2: Model Building


Let's build a preliminary SVM classification model to classify the images of digits.
Note: Since there are 42000 training samples (or image samples or rows), the SVC model will take some time (about 4 to 6 minutes) to train.
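If you only want to experiment, a stratified subsample trains much faster. A minimal sketch (X_small and y_small are illustrative names, not from the original notebook):

# Optional: draw a stratified subsample of 5000 images for quicker experiments
from sklearn.model_selection import train_test_split

X_small, _, y_small, _ = train_test_split(
    feature_train, target_train_actual,
    train_size = 5000,
    stratify = target_train_actual,
    random_state = 10)
# An SVC fitted on (X_small, y_small) trains in seconds instead of minutes.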

# Build an SVC model with the linear kernel.
from sklearn.svm import SVC

svc_dp_linear = SVC(kernel = 'linear')
svc_dp_linear.fit(feature_train, target_train_actual)

SVC(kernel='linear')

Now that we have built a classification model using support vector machines, let's get the predicted digits and then compare the predicted values with the actual values.

Note: The code below may take 3 to 5 minutes to execute.

# Predict the target values for the training set.
target_train_pred = svc_dp_linear.predict(feature_train)
target_train_pred

array([1, 0, 1, ..., 7, 6, 9])

Now let's create a confusion matrix to check for misclassification.

# Create a confusion matrix to check for misclassification.
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(target_train_actual, target_train_pred)


array([[4130, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[ 0, 4674, 2, 1, 0, 0, 0, 0, 6, 1],
[ 2, 7, 4092, 16, 13, 3, 6, 9, 27, 2],
[ 6, 3, 48, 4188, 1, 49, 0, 5, 38, 13],
[ 2, 6, 3, 1, 3999, 0, 1, 3, 0, 57],
[ 4, 8, 12, 67, 4, 3649, 19, 0, 29, 3],
[ 1, 0, 2, 1, 4, 11, 4116, 0, 2, 0],
[ 2, 3, 22, 4, 10, 1, 0, 4308, 2, 49],
[ 11, 30, 19, 60, 2, 49, 3, 2, 3880, 7],
[ 4, 8, 2, 12, 61, 6, 0, 76, 11, 4008]])
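
The raw matrix is easier to read as a heatmap; a minimal sketch (assuming seaborn is installed, as in Program 9):

# Visualise the confusion matrix as a heatmap
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(target_train_actual, target_train_pred)
plt.figure(figsize = (10, 7))
sns.heatmap(cm, annot = True, fmt = 'd')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()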

# Print the precision, recall and f1-score values to further evaluate the efficacy of the model.
print(classification_report(target_train_actual, target_train_pred))

precision recall f1-score support

0 0.99 1.00 1.00 4132


1 0.99 1.00 0.99 4684
2 0.97 0.98 0.98 4177
3 0.96 0.96 0.96 4351
4 0.98 0.98 0.98 4072
5 0.97 0.96 0.96 3795
6 0.99 0.99 0.99 4137
7 0.98 0.98 0.98 4401
8 0.97 0.95 0.96 4063
9 0.97 0.96 0.96 4188

accuracy 0.98 42000


macro avg 0.98 0.98 0.98 42000
weighted avg 0.98 0.98 0.98 42000

The f1-scores for all the labels (or digits) are close to 1, which implies that the SVC model built to classify digits is very accurate on the training data. So now let's predict the digits on the test set.
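Keep in mind that these scores are computed on the same data the model was trained on, so they can be optimistic. A fairer estimate (a sketch, not part of the original notebook) holds out part of the labelled data:

# Hold out 20% of the labelled data to estimate generalisation accuracy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_tr, X_val, y_tr, y_val = train_test_split(
    feature_train, target_train_actual, test_size = 0.2, random_state = 10)
svc_val = SVC(kernel = 'linear')
svc_val.fit(X_tr, y_tr)
print(accuracy_score(y_val, svc_val.predict(X_val)))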

Activity 3: Prediction on Test Set


We already know that the test set does not have a label column, so we don't need to separate the features and target variables. But we do need to normalise the features in the test set with the same technique used for the train set. Hence, we will divide each pixel value in the test set by 255.

# Divide each pixel value in the test set by 255.
feature_test = test_df / 255

Now let's predict the digits for the test set using the SVC model that we just built.
Note: The code below may take 3 to 5 minutes to execute.

# Predict the digits for the test set using the SVC model built above.
target_test_pred = svc_dp_linear.predict(feature_test)
target_test_pred

array([2, 0, 5, ..., 3, 9, 2])

Now let's get the count of the predicted labels (or handwritten digits) to see their distribution.

# Get the count of the predicted labels (or handwritten digits) to see their distribution.
pd.Series(target_test_pred).value_counts()

1 3288
2 2882
7 2868
3 2818
0 2810
4 2808
6 2729
9 2677
8 2609
5 2511
dtype: int64

It seems that the handwritten digits in the test set are quite uniformly distributed.

Activity 4: Visualising Digits


Let's now visualise at least one sample of each digit. But first, let's add a new column called label to the test_df data frame so that its values are the predicted labels (or digits). Make sure that the column is added at column index = 0.

# Add 'label' at column index = 0 to the 'test_df' data frame so that its values are the predicted labels (or digits).
test_df.insert(loc = 0, column = 'label', value = target_test_pred)

Let's display the first 5 rows of the modified test_df data frame.

# Display the first 5 rows of the modified 'test_df' data frame.
test_df.head()

label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10 pixe

0 2 0 0 0 0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0 0 0 0 0 0

2 5 0 0 0 0 0 0 0 0 0 0 0

3 4 0 0 0 0 0 0 0 0 0 0 0

4 3 0 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns

Now let's group all the rows of the test_df data frame by the label column so that the pixel values of the images of each digit are clubbed together and a sample of a digit can be retrieved easily later.
E.g., you can easily retrieve one of the sample images of digit 0 from a data frame containing the pixel values of all the image samples of digit 0 only.

# Group all the rows of the 'test_df' data frame by the 'label' column. Also, get a data frame containing pixel values of images of digit 0.
grouped_test_df = test_df.groupby(by = "label")
zeros_test_df = grouped_test_df.get_group(0)
zeros_test_df

label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10

1 0 0 0 0 0 0 0 0 0 0 0 0

6 0 0 0 0 0 0 0 0 0 0 0 0

8 0 0 0 0 0 0 0 0 0 0 0 0

13 0 0 0 0 0 0 0 0 0 0 0 0

19 0 0 0 0 0 0 0 0 0 0 0 0

... ... ... ... ... ... ... ... ... ... ... ... ...

27967 0 0 0 0 0 0 0 0 0 0 0 0

27971 0 0 0 0 0 0 0 0 0 0 0 0

27974 0 0 0 0 0 0 0 0 0 0 0 0

27977 0 0 0 0 0 0 0 0 0 0 0 0

27983 0 0 0 0 0 0 0 0 0 0 0 0

2810 rows × 785 columns

Now, let's create an image from the pixel values of one of the samples of digit 0.

# Create an image from the pixel values of one of the samples of digit 0.
sample_of_zero_test_pixels = test_df.iloc[6, 1:].values.reshape(28, 28)
plt.figure(figsize = (6, 6), dpi = 81)
plt.title("Handwritten Digit 0 Image", fontsize = 16)
plt.imshow(sample_of_zero_test_pixels, cmap = "gray", vmin = 0, vmax = 255)
plt.show()

Indeed, the predicted digit is 0. Now let's create an image from one of the sample images of digit 3.

# Get a data frame containing pixel values of all images of digit 3 from the 'grouped_test_df' data frame.
grouped_test_df = test_df.groupby(by = 'label')
threes_test_df = grouped_test_df.get_group(3)
threes_test_df

label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10

4 3 0 0 0 0 0 0 0 0 0 0 0

Now, let's create an image of one of the sample images of digit 3.

# Create an image of one of the sample images of digit 3.
sample_of_three_test_pixels = test_df.iloc[4, 1:].values.reshape(28, 28)
plt.figure(figsize = (6, 6), dpi = 81)
plt.title("Handwritten Digit 3 Image", fontsize = 16)
plt.imshow(sample_of_three_test_pixels, cmap = "gray", vmin = 0, vmax = 255)
plt.show()


Program : 8

Aim : Program to implement k-means clustering technique using any standard dataset available in the public domain.

Problem Statement
Program to implement k-means clustering technique using any standard dataset available in the public domain

Dataset Description
In this project, we will be using a dataset holding information on the carbon dioxide emissions of different car models. The dataset includes 36 instances with 5 columns, which can be briefed as:


Column Description

Car Brand of the car

Model Model of the car

Volume Total space available inside the car (in litres)

Weight Total weight of the car (in kg)

CO2 Total emission of carbon dioxide from the car

Note: This is a manually created custom dataset for this project.

List of Activities
Activity 1: Import Modules and Read Data
Activity 2: Data Cleaning
Activity 3: Find Optimal Value of K
Activity 4: Plot Silhouette Scores

Activity 1: Import Modules and Read Data


Import the necessary Python modules along with the following modules:

KMeans - For clustering using K-means.

re - To remove unwanted rows using regex.

Read the data from a CSV file to create a Pandas DataFrame and go through the necessary data-cleaning process (if required).
Dataset link: https://raw.githubusercontent.com/jiss-sngce/CO_3/main/jkcars.csv
# Import the modules and read the data.
import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/jiss-sngce/CO_3/main/jkcars.csv')

# Print the first five records
df.head()

Car Model Volume Weight CO2

0 Mitsubishi Space Star 1200 1160 95

1 Skoda Citigo 1000 929 95

2 Fiat 500 900 865 90

3 Mini Cooper 1500 1140 105

4 VW Up! 1000 929 105

# Get the total number of rows and columns, data types of columns and missing values (if any) in the dataset
df.shape
df.dtypes
df.isnull().sum()

Car 0
Model 0
Volume 0
Weight 0
CO2 0
dtype: int64

Activity 3: Find Optimal value of K


In this activity, you need to find the optimal value of K using the silhouette score.

1. Create a subset of the dataset consisting of three columns, i.e. Volume, Weight, and CO2.

# Create a new DataFrame consisting of three columns 'Volume', 'Weight', 'CO2'.
new_df = df[['Volume', 'Weight', 'CO2']]

# Print the first 5 rows of this new DataFrame.
new_df.head()

Volume Weight CO2

0 1200 1160 95

1 1000 929 95

2 900 865 90

3 1500 1140 105

4 1000 929 105

2. Compute K-means clustering for the new_df DataFrame created above by varying K from 2 to 10 clusters. Also, for each K, calculate the silhouette score using the silhouette_score function.

Steps to Follow

Create an empty list to store the silhouette scores obtained for each K (let's say sil_scores).

Initiate a for loop that ranges from 2 to 10.

Perform K-means clustering for the current value of K inside the for loop. Use fit() and predict() to create clusters.

Calculate the silhouette score for the current K value using the silhouette_score() function and append it to the empty list sil_scores.

Create a DataFrame with two columns. The first column must contain the K values from 2 to 10 and the second column must contain the silhouette scores obtained after the for loop.

# Calculate silhouette scores for different values of 'K'.
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Create an empty list to store silhouette scores obtained for each 'K'
sil_scores = []
clusters = range(2, 11)

for k in clusters:
    kmean_k = KMeans(n_clusters = k, random_state = 10)
    kmean_k.fit(new_df)
    cluster_labels = kmean_k.predict(new_df)
    sil_scores.append(silhouette_score(new_df, cluster_labels))

sil_data = pd.DataFrame({'K value': clusters, 'silhouette_score': sil_scores})
sil_data

K value silhouette_score

0 2 0.466982

1 3 0.569304

2 4 0.506027

3 5 0.537547

4 6 0.549792

5 7 0.525962

6 8 0.509034

7 9 0.461402

8 10 0.434958

Q: What are the maximum silhouette score and the corresponding cluster value?

A: Maximum silhouette score: 0.569304
Corresponding cluster value: 3
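
The same answer can be read off programmatically; a small sketch using the sil_data DataFrame built above:

# Row of 'sil_data' with the highest silhouette score
print(sil_data.loc[sil_data['silhouette_score'].idxmax()])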


Activity 4: Plot Silhouette Scores & WCSS Scores to Find the Optimal Value of K

Create a line plot with K ranging from 2 to 10 on the x-axis and the silhouette scores stored in the sil_scores list on the y-axis.

# Plot silhouette scores vs number of clusters.
import matplotlib.pyplot as plt

plt.figure(figsize = (14, 5))
plt.plot(clusters, sil_scores)
plt.xlabel("K value")
plt.ylabel("silhouette_score")
plt.xticks(range(2, 11))
plt.grid()
plt.show()

Q: Write your observations of the graph.


A: From the graph, we can conclude that the optimal value of K is 3.

# S3.1: Determine 'K' using the Elbow method.
wcss = []
clusters = range(2, 11)

# Initiate a for loop that ranges from 2 to 10.
for k in clusters:
    # Inside the for loop, perform K-means clustering for the current value of K. Use 'fit()' to train the model.
    kmean_k = KMeans(n_clusters = k, random_state = 10)
    kmean_k.fit(new_df)
    # Find WCSS for the current K value using the 'inertia_' attribute and append it to the list.
    wcss.append(kmean_k.inertia_)

# Plot WCSS vs number of clusters.
plt.figure(figsize = (14, 5))
plt.plot(clusters, wcss)
plt.xlabel("K value")
plt.ylabel("WCSS")
plt.xticks(range(2, 11))
plt.grid()
plt.show()


# Cluster the dataset for K = 3
from sklearn.cluster import KMeans

# Perform K-means clustering with n_clusters = 3 and random_state = 10
kmeans_model = KMeans(n_clusters = 3, random_state = 10)

# Fit the model to new_df
kmeans_model.fit(new_df)

# Make a series using the predictions by K-means
cluster_labels = pd.Series(kmeans_model.predict(new_df))
cluster_labels.value_counts()

2 16
1 9
0 7
dtype: int64

cluster_labels

df.columns

Index(['Car', 'Model', 'Volume', 'Weight', 'CO2'], dtype='object')

# Create a DataFrame with cluster labels for cluster visualisation
df2 = pd.concat([df, cluster_labels], axis = 1)
df2.columns = list(df.columns) + ['label']
df2

Car Model Volume Weight CO2 label

0 Mitsubishi Space Star 1200 1160 95 0

1 Skoda Citigo 1000 929 95 0

2 Fiat 500 900 865 90 0

3 Mini Cooper 1500 1140 105 2

4 VW Up! 1000 929 105 0

5 Skoda Fabia 1400 1109 90 2

6 Ford Fiesta 1500 1112 98 2

7 Audi A1 1600 1150 99 2

8 Hyundai I20 1100 980 99 0

9 Suzuki Swift 1300 990 101 0

10 Ford Fiesta 1000 1112 99 0

11 Honda Civic 1600 1252 94 2

12 Hundai I30 1600 1326 97 2

13 Opel Astra 1600 1330 97 2

14 BMW 1 1600 1365 99 2

15 Mazda 3 2200 1280 104 1

16 Skoda Rapid 1600 1119 104 2

17 Ford Focus 2000 1328 105 1

18 Ford Mondeo 1600 1584 94 2

19 Mercedes C-Class 2100 1365 99 1

20 Skoda Octavia 1600 1415 99 2

21 Volvo S60 2000 1415 99 1

22 Mercedes CLA 1500 1465 102 2

23 Audi A4 2000 1490 104 1

24 Audi A6 2000 1725 114 1

25 Volvo V70 1600 1523 109 2

26 BMW 5 2000 1705 114 1

27 Volvo XC70 2000 1746 117 1

28 Ford B-Max 1600 1235 104 2

29 BMW 216 1600 1390 108 2

30 Opel Zafira 1600 1405 109 2
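
To see what the three clusters look like, one can plot two of the features coloured by cluster label; a minimal sketch using the df2 DataFrame above:

# Scatter plot of Weight vs Volume, one colour per cluster
import matplotlib.pyplot as plt

plt.figure(figsize = (8, 5))
for label in sorted(df2['label'].unique()):
    subset = df2[df2['label'] == label]
    plt.scatter(subset['Weight'], subset['Volume'], label = 'Cluster ' + str(label))
plt.xlabel('Weight (kg)')
plt.ylabel('Volume (litres)')
plt.legend()
plt.show()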


Program : 9
Aim : Programs on convolutional neural network to classify images from any standard dataset in the public domain.
Importing Necessary Libraries

import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import numpy as np

# Load the MNIST dataset available in the Keras library
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()

Number of Examples in the Training and Test Datasets

# No. of examples in the train dataset
print("Length of Xtrain:", len(X_train))
print(type(X_train))

# No. of examples in the test dataset
print("Length of Xtest:", len(X_test))
print("Dimension of Xtrain:", X_train.ndim)
print("Shape of Xtrain:", X_train.shape)

Length of Xtrain: 60000


<class 'numpy.ndarray'>
Length of Xtest: 10000
Dimension of Xtrain: 3
Shape of Xtrain: (60000, 28, 28)

# Give details of an image in the dataset
X_train[0].shape

(28, 28)

# The matshow() function OR imshow() function is used to represent an array as a matrix in a new figure window.
# plt.matshow(X_train[0]) OR plt.imshow(X_train[2])
plt.imshow(X_train[0])

<matplotlib.image.AxesImage at 0x7f760be9f990>

# Scale the values in Xtrain & Xtest
X_train = X_train / 255
X_test = X_test / 255

# Display the first image of X_train
X_train[0]


array([[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.01176471, 0.07058824, 0.07058824,
0.07058824, 0.49411765, 0.53333333, 0.68627451, 0.10196078,
0.65098039, 1. , 0.96862745, 0.49803922, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.11764706, 0.14117647,
0.36862745, 0.60392157, 0.66666667, 0.99215686, 0.99215686,
0.99215686, 0.99215686, 0.99215686, 0.88235294, 0.6745098 ,
0.99215686, 0.94901961, 0.76470588, 0.25098039, 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.19215686, 0.93333333, 0.99215686,
0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.99215686,
0.99215686, 0.99215686, 0.98431373, 0.36470588, 0.32156863,
0.32156863, 0.21960784, 0.15294118, 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.07058824, 0.85882353, 0.99215686,
0.99215686, 0.99215686, 0.99215686, 0.99215686, 0.77647059,
0.71372549, 0.96862745, 0.94509804, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0.31372549, 0.61176471,
0.41960784, 0.99215686, 0.99215686, 0.80392157, 0.04313725,
0. , 0.16862745, 0.60392157, 0. , 0. ,

# Flattening Xtrain & Xtest to a 2D array
X_train_flattened = X_train.reshape(-1, 784)
X_test_flattened = X_test.reshape(-1, 784)

# Display the shape of flattened Xtrain
X_train_flattened.shape

(60000, 784)

X_test_flattened.shape

(10000, 784)


# Build the model
model = keras.Sequential([keras.layers.Dense(10, activation='sigmoid')])

# Apply the 'adam' optimizer
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model using fit(). Give epochs as 10
model.fit(X_train_flattened, y_train, epochs=10)

# Evaluate the model
model.evaluate(X_test_flattened, y_test)

# Predict the values for the flattened test dataset
y_predicted = model.predict(X_test_flattened)

# Display the predictions for the second test image
y_predicted[1]
19

Epoch 1/10
1875/1875 [==============================] - 2s 1ms/step - loss: 0.4654 - accuracy: 0.8789
Epoch 2/10
1875/1875 [==============================] - 2s 1ms/step - loss: 0.3033 - accuracy: 0.9158
Epoch 3/10
1875/1875 [==============================] - 2s 1ms/step - loss: 0.2828 - accuracy: 0.9212
Epoch 4/10
1875/1875 [==============================] - 2s 1ms/step - loss: 0.2730 - accuracy: 0.9244
Epoch 5/10
1875/1875 [==============================] - 2s 1ms/step - loss: 0.2670 - accuracy: 0.9258
Epoch 6/10
1875/1875 [==============================] - 2s 1ms/step - loss: 0.2621 - accuracy: 0.9269
Epoch 7/10
1875/1875 [==============================] - 2s 1ms/step - loss: 0.2583 - accuracy: 0.9281
Epoch 8/10
1875/1875 [==============================] - 2s 1ms/step - loss: 0.2557 - accuracy: 0.9295
Epoch 9/10
1875/1875 [==============================] - 2s 1ms/step - loss: 0.2531 - accuracy: 0.9299
Epoch 10/10
1875/1875 [==============================] - 2s 1ms/step - loss: 0.2510 - accuracy: 0.9311
313/313 [==============================] - 0s 930us/step - loss: 0.2653 - accuracy: 0.9266
array([3.0872768e-01, 3.6514401e-03, 9.9983525e-01, 1.5331820e-01,
9.3455755e-12, 9.1628104e-01, 9.2582357e-01, 2.9462129e-15,
1.1721602e-01, 2.1716836e-12], dtype=float32)

# Display the predicted value by applying argmax()
np.argmax(y_predicted[1])

# Confirm the prediction by displaying the corresponding pixel values using matshow() or imshow()
plt.matshow(X_test[1])

<matplotlib.image.AxesImage at 0x7f760bf3d450>


# Display the predicted values for the test dataset. Display only the first 5 predicted values
y_predicted_labels = [np.argmax(i) for i in y_predicted]
y_predicted_labels[:5]

# Construct the confusion matrix
# (reconstructed: this line was elided in the printout; the heatmap cell below uses this 'cm')
cm = tf.math.confusion_matrix(labels = y_test, predictions = y_predicted_labels)

[7, 2, 1, 0, 4]

# Display the confusion matrix using a heatmap
import seaborn as sn
plt.figure(figsize = (10, 7))
sn.heatmap(cm, annot = True, fmt = 'd')
plt.xlabel('Predicted')
plt.ylabel('Truth')

Text(69.0, 0.5, 'Truth')

USING HIDDEN LAYER

model = keras.Sequential([
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(10, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train_flattened, y_train, epochs=5)

Epoch 1/5
1875/1875 [==============================] - 5s 2ms/step - loss: 0.2777 - accuracy: 0.9199
Epoch 2/5
1875/1875 [==============================] - 4s 2ms/step - loss: 0.1245 - accuracy: 0.9636
Epoch 3/5
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0875 - accuracy: 0.9743
Epoch 4/5
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0681 - accuracy: 0.9792
Epoch 5/5
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0537 - accuracy: 0.9835
<keras.callbacks.History at 0x7f6092d38290>

model.evaluate(X_test_flattened, y_test)

313/313 [==============================] - loss: 0.0786 - accuracy: 0.9751
[0.07863984256982803, 0.9750999808311]

y_predicted = model.predict(X_test_flattened)
y_predicted

array([[1.9663182e-01, 1.0661781e-03, 7.0703042e-01, ..., 9.9997139e-01,
        1.5860826e-02, 4.3196589e-02],
       [1.9181639e-01, 9.9905562e-01, 9.9999869e-01, ..., 3.5023191e-08,
        2.3184523e-01, 3.8511429e-05],
       [1.7279387e-04, 9.9927801e-01, 1.4470059e-01, ..., 4.1472608e-01,
        1.3068342e-01, 4.1025519e-02],
       ...,
       [1.7130093e-06, 8.7543085e-06, 7.7272634e-06, ..., 3.8716146e-01,
        4.9101233e-02, 6.9311082e-01],
       [2.6747435e-02, 1.8883626e-05, 8.4262902e-05, ..., 2.1099478e-02,
        6.4012915e-01, 1.1537667e-04],
       [1.7884678e-01, 9.4580650e-04, 7.7513754e-03, ..., 3.5389752e-05,
        2.2894144e-04, 5.7818677e-05]], dtype=float32)

y_predicted_labels = [np.argmax(i) for i in y_predicted]

y_predicted[0]

array([1.9663182e-01, 1.0661781e-03, 7.0703042e-01, 9.4925106e-01,
       5.5770590e-07, 1.5124649e-02, 6.6381671e-07, 9.9997139e-01,
       1.5860826e-02, 4.3196589e-02], dtype=float32)

np.argmax(y_predicted[0])

7

y_predicted_labels[:5]

[7, 2, 1, 0, 4]

plt.matshow(X_test[0])

<matplotlib.image.AxesImage at 0x7f60938b09d0>

USING FLATTEN LAYER TO CONVERT 2D to 1D

model = keras.Sequential([
keras.layers.Flatten(input_shape=(28, 28)),
keras.layers.Dense(100, activation='relu'),
keras.layers.Dense(10, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10)

Epoch 1/10
1875/1875 [==============================] - 4s 2ms/step - loss: 0.2714 - accuracy: 0.9232
Epoch 2/10
1875/1875 [==============================] - 4s 2ms/step - loss: 0.1247 - accuracy: 0.9638
Epoch 3/10
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0866 - accuracy: 0.9735
Epoch 4/10
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0651 - accuracy: 0.9807
Epoch 5/10
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0507 - accuracy: 0.9847
Epoch 6/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0406 - accuracy: 0.9874
Epoch 7/10
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0339 - accuracy: 0.9895
Epoch 8/10
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0282 - accuracy: 0.9913
Epoch 9/10
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0233 - accuracy: 0.9926
Epoch 10/10
1875/1875 [==============================] - 5s 3ms/step - loss: 0.0190 - accuracy: 0.9946
<keras.callbacks.History at 0x7f6095650d10>

model.evaluate(X_test, y_test)

313/313 [==============================] - 1s 2ms/step - loss: 0.0858 - accuracy: 0.9758


[0.0858246311545372, 0.9757999777793884]
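
The models above are fully connected networks. Since the aim mentions a convolutional neural network, a minimal CNN variant on the same data might look like the sketch below (the layer sizes and epoch count are illustrative assumptions, not from the original notebook; note the extra channel axis that Conv2D expects):

# A small CNN for MNIST (sketch)
cnn_model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
cnn_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Reshape (60000, 28, 28) -> (60000, 28, 28, 1) to add the channel axis
cnn_model.fit(X_train[..., np.newaxis], y_train, epochs=5)
cnn_model.evaluate(X_test[..., np.newaxis], y_test)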


Program : 10

Aim: Implement problems on natural language processing - Part of Speech tagging, N-gram & smoothening and Chunking using NLTK

Short notes:

The Natural Language Toolkit (NLTK) is a platform used for building programs for text analysis. One of the more powerful aspects of the
NLTK module is the Part of Speech tagging.
Part-of-speech (POS) tagging is the process of converting a sentence into a list of words and then into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.
keywords:

Corpus : Body of text, singular. Corpora is the plural of this.

Lexicon : Words and their meanings.

Token : Each “entity” that is a part of whatever was split up based on rules.

Tags and their meanings

CD cardinal digit

EX existential there (like: “there is” … think of it like “there exists”)

FW foreign word

IN preposition/subordinating conjunction

JJ adjective ‘big’

JJR adjective, comparative ‘bigger’

JJS adjective, superlative ‘biggest’

NN noun, singular ‘desk’

NNS noun, plural ‘desks’

NNP proper noun, singular ‘Harrison’

NNPS proper noun, plural ‘Americans’

PDT predeterminer ‘all the kids’

POS possessive ending parent‘s

PRP personal pronoun I, he, she

PRP$ possessive pronoun my, his, hers

RB adverb very, silently

RBR adverb, comparative better

RBS adverb, superlative best

RP particle give up

N-grams are continuous sequences of words or symbols or tokens in a document. In technical terms, they can be defined as the
neighbouring sequences of items in a document.
Steps for n-gram model:
Explore the dataset

Feature extraction


Train-test split

Basic pre-processing

Code to generate N-grams

Creating unigrams

Creating bigrams

Creating trigrams


from nltk.corpus.reader import tagged
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
sw = set(stopwords.words('english'))

# Dummy text
txt = "Hello.I am Darsana Prasad.I am a native of Athani."

# sent_tokenize is one of the instances of
# PunktSentenceTokenizer from the nltk.tokenize.punkt module
tokenized = nltk.word_tokenize(txt)
for i in tokenized:

    # Word tokenizer is used to find the words
    # and punctuation in a string
    wordlist = word_tokenize(i)

    # Removing stop words from wordlist
    wordlist = [w for w in wordlist if not w in sw]

    # Using a tagger, which is a part-of-speech
    # tagger or POS-tagger
    tagged = nltk.pos_tag(wordlist)
    print(tagged)

[nltk_data] Downloading package stopwords to /root/nltk_data...


[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
[('Hello.I', 'NN')]
[]
[('Darsana', 'NNP')]
[('Prasad.I', 'NN')]
[]
[]
[('native', 'JJ')]
[]
[('Athani', 'NN')]
[('.', '.')]

print(sw)

{'than', "wouldn't", 'having', "you'll", 'during', 'his', 'she', 'in', 'most', 'ours', 'how', "hasn't", 'sho

print(tokenized)
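
The aim also covers chunking, which groups POS-tagged words into phrases such as noun phrases. A minimal sketch using NLTK's RegexpParser (the grammar and sentence are illustrative, not from the notebook; the required NLTK resources were downloaded above):

# Chunking: group POS-tagged words into noun phrases (NP)
sentence = "The quick brown fox jumps over the lazy dog"
tagged_words = nltk.pos_tag(nltk.word_tokenize(sentence))

# NP = optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
print(chunk_parser.parse(tagged_words))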


N-gram model

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use(style='seaborn')

# Get the data from https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news/version/5
colnames = ['sentiment', 'news']
df = pd.read_csv('all-data.csv - all-data.csv.csv', encoding='ISO-8859-1', names=colnames)
df.head()

sentiment news

0 neutral According to Gran , the company has no plans t...

1 neutral Technopolis plans to develop in stages an area...

2 negative The international electronic industry company ...

3 positive With the new production plant the company woul...

4 positive According to the company 's updated strategy f...

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4846 entries, 0 to 4845
Data columns (total 2 columns):
# Column Non-Null Count Dtype

0 sentiment 4846 non-null object


1 news 4846 non-null object
dtypes: object(2)
memory usage: 75.8+ KB

df['sentiment'].value_counts()

neutral 2879
positive 1363
negative 604
Name: sentiment, dtype: int64

y = df['sentiment'].values
y.shape

(4846,)

x = df['news'].values
x.shape

(4846,)

# Split into train and test datasets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.3)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(3392,)
(1454,)
(3392,)
(1454,)


# Make the train dataset a dataframe
df1 = pd.DataFrame(x_train)
df1 = df1.rename(columns={0: 'news'})
df2 = pd.DataFrame(y_train)
df2 = df2.rename(columns={0: 'sentiment'})
df_train = pd.concat([df1, df2], axis=1)

# Make the test dataset a dataframe
df3 = pd.DataFrame(x_test)
df3 = df3.rename(columns={0: 'news'})
df4 = pd.DataFrame(y_test)
df4 = df4.rename(columns={0: 'sentiment'})
df_test = pd.concat([df3, df4], axis=1)

# Removing punctuation
# string is the library that contains punctuation
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Defining the function to remove punctuation
def remove_pun(text):
    if type(text) == float:
        return text
    ans = ""
    for i in text:
        if i not in string.punctuation:
            ans += i
    return ans

text = 'darsana*@123'
remove_pun(text)

'darsana123'

# Storing the punctuation-free text back in the 'news' column
df_train['news'] = df_train['news'].apply(remove_pun)
df_test['news'] = df_test['news'].apply(remove_pun)

# Punctuation has now been removed from the 'news' column in the train dataset
df_train.head()

news sentiment

0 ADP News Dec 11 2008 Finnish powersupply ... neutral

1 The shares subscribed will be eligible for tra... neutral

2 Both operating profit and turnover for the six... positive

3 The value of the order is nearly EUR400m neutral

4 Teleste expects to start the deliveries at the... neutral

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...


[nltk_data] Unzipping corpora/stopwords.zip.
True

# Method to generate n-grams
# params:
#   text  - the text for which we have to generate n-grams
#   ngram - number of grams to be generated from the text (1, 2, 3, 4 etc., default value = 1)
def generate_N_grams(text, ngram=1):
    words = [word for word in text.split(" ") if word not in set(stopwords.words('english'))]
    print("After removing stopwords:", words)
    temp = zip(*[words[i:] for i in range(0, ngram)])
    ans = [' '.join(ngram) for ngram in temp]
    return ans

name = ['sajil', 'lidhan']
s = ' '.join(name)
s

'sajil lidhan'

s1 = ['abeel', 'adhitya', 'akash', 'akshaya']
s2 = [1, 2, 3, 4]
s3 = zip(s1, s2)
print(set(s3))

{('abeel', 1), ('akshaya', 4), ('akash', 3), ('adhitya', 2)}

generate_N_grams("The sun rises in the east", 2)

After removing stopwords: ['The', 'sun', 'rises', 'east']
['The sun', 'sun rises', 'rises east']

generate_N_grams("The sun rises in the east", 3)

After removing stopwords: ['The', 'sun', 'rises', 'east']
['The sun rises', 'sun rises east']

generate_N_grams("The sun rises in the east", 4)

After removing stopwords: ['The', 'sun', 'rises', 'east']
['The sun rises east']
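
The aim also mentions smoothening. With n-gram language models, smoothing assigns a non-zero probability to n-grams never seen in training. A minimal add-one (Laplace) sketch on a toy corpus (illustrative, not from the notebook):

# Add-one (Laplace) smoothing for bigram probabilities:
# P(w_n | w_{n-1}) = (count(w_{n-1} w_n) + 1) / (count(w_{n-1}) + V)
from collections import Counter

tokens = "the sun rises in the east and the sun sets in the west".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)   # vocabulary size

def smoothed_prob(prev_word, word):
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + V)

print(smoothed_prob('the', 'sun'))    # seen bigram
print(smoothed_prob('the', 'moon'))   # unseen bigram still gets a small probability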
