
MADANAPALLE INSTITUTE OF TECHNOLOGY & SCIENCE

(UGC-Autonomous)

Dept. of Computer Science & Engineering


(Artificial Intelligence)
20CAI213 DATA SCIENCE LABORATORY

Academic Year 2024

Student Manual
20CAI213 Data Science Laboratory B. Tech III Year II Semester

S.No   CONTENTS                                                     PAGE No.

1.  Institute Vision                                                V
2.  Institute Mission                                               V
3.  Department Vision                                               V
4.  Department Mission                                              V
5.  PEOs                                                            VI
6.  POs                                                             VI
7.  PSOs                                                            VII

Experiments
8.  Experiment 1: Create NumPy arrays from Python Data Structures,
    Intrinsic NumPy objects and Random Functions.                   1
9.  Experiment 2: Manipulation of NumPy arrays - Indexing, Slicing,
    Reshaping, Joining and Splitting.                               7
10. Experiment 3: Computation on NumPy arrays using Universal
    Functions and Mathematical methods.                             16
11. Experiment 4: Import a CSV file and perform various Statistical
    and Comparison operations on rows/columns.                      26
12. Experiment 5: Load an image file and do crop and flip
    operations using NumPy Indexing.                                34
13. Experiment 6: Write a program to compute summary statistics
    such as mean, median, mode, standard deviation and variance of
    the given different types of data.
14. Experiment 7: Create Pandas Series and DataFrame from various
    inputs.
15. Experiment 8: Import any CSV file to a Pandas DataFrame and
    perform the following:
    a. Visualize the first and last 10 records.
    b. Get the shape, index and column details.
    c. Select/delete the records (rows)/columns based on conditions.
    d. Perform ranking and sorting operations.
    e. Do the required statistical operations on the given columns.
    f. Find the count and uniqueness of the given categorical values.
    g. Rename single/multiple columns.

Dept. of. Computer Science & Engineering (Artificial Intelligence) II



16. Experiment 9: Import any CSV file to a Pandas DataFrame and
    perform the following:
    a. Handle missing data by detecting and dropping/filling missing values.
    b. Transform data using the apply() and map() methods.
    c. Detect and filter outliers.
    d. Perform vectorized string operations on a Pandas Series.
    e. Visualize data using Line Plots, Bar Plots, Histograms,
       Density Plots and Scatter Plots.
17. Experiment 10: Write a program to demonstrate Linear Regression
    analysis with residual plots on a given data set.
18. Experiment 11: Write a program to implement the Naïve Bayesian
    classifier for a sample training data set stored as a .CSV file.
    Compute the accuracy of the classifier, considering a few test
    data sets.
19. Experiment 12: Write a program to implement the k-Nearest
    Neighbour algorithm to classify the iris data set. Print both
    correct and wrong predictions using Python ML library classes.
20. Experiment 13: Write a program to implement the k-Means
    clustering algorithm to cluster the set of data stored in a
    .CSV file. Compare the results of various "k" values for the
    quality of clustering.

1. Institution Vision
To become a globally recognized research and academic institution and thereby contribute to the
technological and socio-economic development of the nation.

2. Institution Mission
To foster a culture of excellence in research, innovation, entrepreneurship, rational thinking and
civility by providing the necessary resources for the generation, dissemination and
utilization of knowledge, and in the process create an ambience of practice-based learning for
the youth to succeed in their careers.

Mission 3: Enrich employability and entrepreneurial skills in the field of AI & DS through
experiential and self-directed learning.


3. Program Outcomes
PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics, natural
sciences, and engineering sciences.
PO3: Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate consideration
for the public health and safety, and the cultural, societal, and environmental considerations.
PO4: Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with an
understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
PO7: Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for
sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive clear
instructions.
PO11: Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member and leader in
a team, to manage projects and in multidisciplinary environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.

4. Program Specific Outcomes


PSO1: Apply mathematical foundations, algorithmic principles and computing techniques in the
modelling and design of computer-based systems.
PSO2: Design and develop software in the areas of relevance under realistic constraints.
PSO3: Analyze real world problems and develop computing solutions by applying concepts of
Computer Science.


5. Lab Syllabus

20CAI213 DATA SCIENCE LABORATORY L T P C


0 0 3 1.5
Pre-requisite: 20CSE101, Basic Programming Knowledge
Course Description:
This course is designed to equip students to use Python programming for solving
data science problems.
Course Objectives:
1. To train the students in solving computational problems
2. To elucidate solving mathematical problems using Python programming language
3. To understand the fundamentals of Python programming concepts and its applications.
4. Practical understanding of building different types of models and their evaluation

List of Programs:
1. Create NumPy arrays from Python Data Structures, Intrinsic NumPy objects and Random
Functions.
2. Manipulation of NumPy arrays- Indexing, Slicing, Reshaping, Joining and Splitting.
3. Computation on NumPy arrays using Universal Functions and Mathematical methods.
4. Import a CSV file and perform various Statistical and Comparison operations on rows/columns.
5. Load an image file and do crop and flip operation using NumPy Indexing.
6. Write a program to compute summary statistics such as mean, median, mode, standard
deviation and variance of the given different types of data.
7. Create Pandas Series and DataFrame from various inputs.
8. Import any CSV file to Pandas DataFrame and perform the following:
a. Visualize the first and last 10 records
b. Get the shape, index and column details.
c. Select/Delete the records(rows)/columns based on conditions.
d. Perform ranking and sorting operations.
e. Do required statistical operations on the given columns.
f. Find the count and uniqueness of the given categorical values.
g. Rename single/multiple columns


9. Import any CSV file to Pandas DataFrame and perform the following:
a. Handle missing data by detecting and dropping/ filling missing values.
b. Transform data using apply() and map() method.
c. Detect and filter outliers.
d. Perform Vectorized String operations on Pandas Series.
e. Visualize data using Line Plots, Bar Plots, Histograms, Density Plots and Scatter Plots
10. Write a program to demonstrate Linear Regression analysis with residual plots on a given data
set
11. Write a program to implement the Naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering few test data sets
12. Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set.
Print both correct and wrong predictions using Python ML library classes
13. Write a program to implement k-Means clustering algorithm to cluster the set of data storedin
.CSV file. Compare the results of various “k” values for the quality of clustering.
Course Outcomes:
Upon successful completion of the course, students will be able to
1. Illustrate the use of various data structures.
2. Analyze and manipulate Data using Numpy and Pandas.
3. Create static, animated, and interactive visualizations using Matplotlib.
4. Understand the implementation procedures for the machine learning algorithms.
5. Identify and apply Machine Learning algorithms to solve real-world problems using
appropriate data sets.
Text Book(s)
1. Wes McKinney, "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and
   IPython", O'Reilly, 2nd Edition, 2018.
2. Jake VanderPlas, "Python Data Science Handbook: Essential Tools for Working with Data",
   O'Reilly, 2017.
Reference Books
1. Y. Daniel Liang, "Introduction to Programming using Python", Pearson, 2012.
2. Francois Chollet, "Deep Learning with Python", 1/e, Manning Publications Company, 2017.
3. Peter Wentworth, Jeffrey Elkner, Allen B. Downey and Chris Meyers, "How to Think Like
   a Computer Scientist: Learning with Python 3", 3rd Edition, available at
   https://www.ict.ru.ac.za/Resources/cspw/thinkcspy3/thinkcspy3.pdf
4. Paul Barry, "Head First Python: A Brain-Friendly Guide", 2nd Edition, O'Reilly, 2016.
5. Daniel Y. Chen, "Pandas for Everyone: Python Data Analysis", Pearson Education, 2019.
Mode of Evaluation: Continuous Internal Evaluation and End Semester Examination


Experiments
Experiment -1

Question: To create numpy arrays from python data structures, intrinsic numpy objects
and random functions.

Aim:
To create NumPy arrays from Python Data Structures, Intrinsic NumPy objects and

Random Functions.

Algorithm:
Step 1: Install NumPy and import it as np for ease of use

Step 2: Use Numpy arrays with different data types and learn its usage

Step 3: NumPy has built-in functions for creating arrays from scratch practice it:

1. zeros(shape) will create an array filled with 0 values with the specified shape. The

default dtype is float64. Similarly, ones(shape) creates an array filled with 1s.

>> np.zeros((2, 3))

Output: array([[ 0., 0., 0.], [ 0., 0., 0.]])

2. arange() will create arrays with regularly incrementing values

>> np.arange(10)

Output: array([0,1,2,3,4,5,6,7,8,9])

>> np.arange(2, 10, dtype=float)

Output: array([2., 3., 4., 5., 6., 7., 8., 9.])

3. linspace() will create arrays with a specified number of elements, and spaced equally

between the specified beginning and end values.

>> np.linspace(1, 26, 3)

Output: array([ 1. , 13.5, 26. ])

4. indices() will create a set of arrays (stacked as a one-higher dimensioned array), one


per dimension with each representing variation in that dimension

>> np.indices((3,3))

Output: array([[[0, 0, 0], [1, 1, 1], [2, 2, 2]], [[0, 1, 2], [0, 1, 2], [0, 1, 2]]])

Step 4: Practice more functions from the NumPy documentation

Step 5: Learn the most commonly used random functions in NumPy and practice them

1. random.rand(d0, d1, ..., dn) Random values in a given shape.

>> np.random.rand()

Output: 0.6926529371565405

>>np.random.rand(3,2)

Output: array([[0.05142471, 0.93007891], [0.66937849, 0.868983 ],

[0.92730776, 0.98828719]])

2. Practice functions like shuffle(), permutation(), uniform(), seed(), choice(), etc.
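The random functions named in this step can be exercised together. A minimal sketch, seeded so that reruns are reproducible (the specific draws depend on NumPy's legacy RandomState stream, so treat the values as illustrative rather than fixed):

```python
import numpy as np

np.random.seed(42)                                  # fix the seed for reproducible runs

u = np.random.uniform(low=0.0, high=10.0, size=5)   # 5 floats drawn from [0, 10)
c = np.random.choice([1, 2, 3, 4], size=3)          # sample 3 values from the list
p = np.random.permutation(10)                       # a shuffled copy of arange(10)

arr = np.arange(5)
np.random.shuffle(arr)                              # shuffles arr in place, returns None

print(u.shape, c.shape)                 # (5,) (3,)
print(sorted(p) == list(range(10)))     # True: a permutation keeps every element
```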

Source Code:
import numpy as np

# zeros()

np.zeros((2, 3))

output: array([[0., 0., 0.], [0., 0., 0.]])

# arange()

np.arange(10)

output: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# linspace()

np.linspace(1, 26, 3)

output: array([ 1. , 13.5, 26. ])

# indices()

np.indices((3, 3))

output: array([[[0, 0, 0], [1, 1, 1], [2, 2, 2]], [[0, 1, 2], [0, 1, 2], [0, 1, 2]]])

# Random.rand()


np.random.rand()

output: 0.692652937156405

np.random.rand(3,2)

output: array([[0.05142471, 0.93007891],[0.66937849, 0.868983],[0.92730776, 0.98828719]])
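The aim also calls for creating NumPy arrays from Python data structures, which the listing above does not show. A minimal sketch (the variable names are arbitrary):

```python
import numpy as np

a = np.array([1, 2, 3])            # from a list; integer dtype is inferred
b = np.array((1.5, 2.5, 3.5))      # from a tuple; float dtype is inferred
m = np.array([[1, 2], [3, 4]])     # from a nested list -> 2-D array

print(m.shape)      # (2, 2)
print(b.dtype)      # float64
```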

Theory:
i. numpy.zeros(shape, dtype=float, order='C', *, like=None)

Return a new array of given shape and type, filled with zeros.

Parameters:
shape : int or tuple of ints

Shape of the new array, e.g., (2, 3) or 2.

dtype : data-type, optional

The desired data-type for the array, e.g., numpy.int8. Default is numpy.float64.

order : {'C', 'F'}, optional, default: 'C'

Whether to store multi-dimensional data in row-major (C-style) or column-major
(Fortran-style) order in memory.

like : array_like, optional

Reference object to allow the creation of arrays which are not NumPy arrays. If an
array-like passed in as like supports the __array_function__ protocol, the result will be
defined by it. In this case, it ensures the creation of an array object compatible with
that passed in via this argument.

Eg: np.zeros(5)
array([ 0., 0., 0., 0., 0.])

ii. numpy.arange([start, ]stop, [step, ]dtype=None, *, like=None)

Return evenly spaced values within a given interval.

arange can be called with a varying number of positional arguments:


 arange(stop): Values are generated within the half-open interval [0, stop) (in other words,
the interval including start but excluding stop).
 arange(start, stop): Values are generated within the half-open interval [start, stop).
 arange(start, stop, step) Values are generated within the half-open interval [start, stop),
with spacing between values given by step.

For integer arguments the function is roughly equivalent to the Python built-in range, but returns
an ndarray rather than a range instance.

When using a non-integer step, such as 0.1, it is often better to use numpy.linspace.

See the Warning sections below for more information.

Parameters:
start : integer or real, optional

Start of interval. The interval includes this value. The default start value is 0.

stop : integer or real

End of interval. The interval does not include this value, except in some cases where step
is not an integer and floating point round-off affects the length of out.

step : integer or real, optional

Spacing between values. For any output out, this is the distance between two adjacent
values, out[i+1] - out[i]. The default step size is 1. If step is specified as a positional
argument, start must also be given.

dtype : dtype, optional

The type of the output array. If dtype is not given, infer the data type from the other
input arguments.

like : array_like, optional

Reference object to allow the creation of arrays which are not NumPy arrays. If an
array-like passed in as like supports the __array_function__ protocol, the result will be
defined by it. In this case, it ensures the creation of an array object compatible with
that passed in via this argument.

Eg: np.arange(0, 5, 0.5, dtype=int)


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
np.arange(-3, 3, 0.5, dtype=int)


array([-3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8])


iii. numpy.indices(dimensions, dtype=<class 'int'>, sparse=False)

Return an array representing the indices of a grid.

Compute an array where the subarrays contain index values 0, 1, … varying only along the
corresponding axis.

Parameters:
dimensions : sequence of ints

The shape of the grid.

dtype : dtype, optional

Data type of the result.

sparse : boolean, optional

Return a sparse representation of the grid instead of a dense representation. Default is
False.

New in version 1.17.

Returns:
grid : one ndarray or tuple of ndarrays

If sparse is False:

Returns one array of grid indices, grid.shape = (len(dimensions),) + tuple(dimensions).

If sparse is True:

Returns a tuple of arrays, with grid[i].shape = (1, ..., 1, dimensions[i], 1, ..., 1) with
dimensions[i] in the ith place.

Eg: x = np.arange(20).reshape(5, 4)
row, col = np.indices((2, 3))
x[row, col]
array([[0, 1, 2],
       [4, 5, 6]])

iv. random.rand(d0, d1, ..., dn)

Random values in a given shape.


Create an array of the given shape and populate it with random samples from a uniform
distribution over [0, 1).

Parameters:
d0, d1, ..., dn : int, optional

The dimensions of the returned array, must be non-negative. If no argument is given, a
single Python float is returned.

Returns:
out : ndarray, shape (d0, d1, ..., dn)

Random values.

Eg: np.random.rand(3,2)
array([[ 0.14022471, 0.96360618], #random
[ 0.37601032, 0.25528411], #random
[ 0.49313049, 0.94909878]])
v. random.shuffle(x)

Modify a sequence in-place by shuffling its contents.

This function only shuffles the array along the first axis of a multi-dimensional array. The order
of sub-arrays is changed but their contents remains the same.

Parameters:
x : ndarray or MutableSequence

The array, list or mutable sequence to be shuffled.

Returns:
None

Eg: arr = np.arange(10)
np.random.shuffle(arr)
arr
[1 7 5 2 9 4 3 6 0 8]  # random
Result:
Practiced with Numpy Array and familiarized different random functions in it.


Experiment -2

Question: Manipulation of NumPy arrays- Indexing, Slicing, Reshaping, Joining and


Splitting.

Aim:
Manipulation of NumPy arrays- Indexing, Slicing, Reshaping, Joining and Splitting.

Algorithm:

Step 1: Install NumPy and import it as np for ease of use

Step 2: Use Numpy arrays and perform various operations by referring documentation

Step 3: NumPy has built-in functions for performing operations like

(i) Indexing: indexes in NumPy arrays start with 0, meaning that the first
element has index 0, and the second has index 1.
(ii) Slicing: slicing in Python means taking elements from one given index to another
given index.
(iii) Reshaping: reshape() changes the shape of an array, i.e., the number of
elements along each dimension, without changing the data.
(iv) Joining: NumPy joins the contents of two or more arrays by passing a sequence
of arrays that we want to join to the concatenate() function, along with the axis.
(v) Splitting: splitting is the reverse operation of joining; we use array_split(),
passing it the array we want to split and the number of splits.
Step 4: Try different Numpy functions


Source Code:
# Indexing: indexes in NumPy arrays start with 0, so the first element has
# index 0 and the second has index 1

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[2])

Output: 3

# Slicing: taking elements from one given index to another given index

arr=np.array([6, 12, 31, 45, 25, 36, 37,66])

print(arr[2:5])

Output: [31 45 25]

# Reshaping: reshape() changes the shape of the array, i.e., the number of
# elements along each dimension

arr=np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

newarr = arr.reshape(3,4)

print(newarr)

Output: [[ 1 2 3 4]

[ 5 6 7 8]

[ 9 10 11 12]]

# Joining: concatenate() joins the contents of two or more arrays; we pass a
# sequence of the arrays we want to join, along with the axis

arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

arr = np.concatenate((arr1, arr2))

print(arr)

Output: [1 2 3 4 5 6]

# Splitting: the reverse of joining; array_split() takes the array we want to
# split and the number of splits


arr = np.array([1, 2, 3, 4, 5, 6])

newarr = np.array_split(arr, 3)

print(newarr)

Output: [array([1, 2]), array([3, 4]), array([5, 6])]

Theory:
Python Numpy Array Indexing:

ndarrays can be indexed using the standard Python x[obj] syntax, where x is the array and obj the
selection. There are different kinds of indexing available depending on obj: basic indexing,
advanced indexing, and field access.

Most of the following examples show the use of indexing when referencing data in an array. The
examples work just as well when assigning to an array. See Assigning values to indexed arrays
for specific examples and explanations on how assignments work.

Note that in Python, x[(exp1, exp2, ..., expN)] is equivalent to x[exp1, exp2, ..., expN]; the latter
is just syntactic sugar for the former.

Basic indexing

Single element indexing

Single element indexing works exactly like that for other standard Python sequences. It is 0-
based, and accepts negative indices for indexing from the end of the array.

>>> x = np.arange(10)

>>> x[2]

Output: 2

>>> x[-2]

Output: 8

It is not necessary to separate each dimension's index into its own set of square brackets.

>>> x.shape = (2, 5) # now x is 2-dimensional

>>> x[1, 3]

Output: 8

>>> x[1, -1]

Output: 9

Note that if one indexes a multidimensional array with fewer indices than dimensions, one gets a
subdimensional array. For example:

>>> x[0]

Output:

array([0, 1, 2, 3, 4])

Python NumPy Array Slicing:

Python NumPy array slicing is used to extract some portion of data from the actual array. Slicing
in python means extracting data from one given index to another given index, however, NumPy
slicing is slightly different. Slicing can be done with the help of (:). A NumPy array slicing
object is constructed by giving start, stop, and step parameters to the built-in slicing function.
This slicing object is passed to the array to extract some portion of the array.

The syntax of Python NumPy slicing is [start : stop : step]

start : by default this index is taken as 0.

stop : by default this index is taken as the length of the array.

step : by default it is taken as 1.

Example:

>>>import numpy as np

>>>arr = np.array([1, 2, 3, 4, 5, 6, 7])

>>>print(arr[1:5])

Output:


[2 3 4 5]
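The [start : stop : step] form described above also supports an explicit step and negative indices. A short sketch:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[::2])     # every second element -> [1 3 5 7]
print(arr[1:6:2])   # from index 1 up to 5 in steps of 2 -> [2 4 6]
print(arr[-3:])     # the last three elements -> [5 6 7]
print(arr[::-1])    # the whole array reversed -> [7 6 5 4 3 2 1]
```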

Reshape NumPy Array :

NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object and tools for working with these arrays. It is the fundamental
package for scientific computing with Python, and is widely used for creating arrays of n
dimensions.

Reshaping numpy array simply means changing the shape of the given array, shape basically
tells the number of elements and dimension of array, by reshaping an array we can add or
remove dimensions or change number of elements in each dimension. In order to reshape a
numpy array we use reshape method with the given array.

Syntax : array.reshape(shape)

Argument : It takes a tuple as argument; the tuple is the new shape to be formed

Return : It returns numpy.ndarray

Example:

Reshape From 1-D to 2-D:

>>>import numpy as np

>>>arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

>>>newarr = arr.reshape(4, 3)

>>>print(newarr)

Output:

[[ 1 2 3]

[ 4 5 6]

[ 7 8 9]

[10 11 12]]

NumPy Joining Array:


Joining means putting the contents of two or more arrays in a single array.

In SQL we join tables based on a key, whereas in NumPy we join arrays by axes. We pass a
sequence of arrays that we want to join to the concatenate() function, along with the axis. If
the axis is not explicitly passed, it is taken as 0.

Example:

>>>import numpy as np

>>>arr1 = np.array([1, 2, 3])

>>>arr2 = np.array([4, 5, 6])

>>>arr = np.concatenate((arr1, arr2))

>>>print(arr)

Output:

[1 2 3 4 5 6]
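When the arrays are 2-D, the axis argument mentioned above decides the direction of the join. A small sketch:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

rows = np.concatenate((a, b), axis=0)   # stack below: shape (4, 2)
cols = np.concatenate((a, b), axis=1)   # stack beside: shape (2, 4)

print(rows.shape, cols.shape)   # (4, 2) (2, 4)
print(cols)
# [[1 2 5 6]
#  [3 4 7 8]]
```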

Splitting NumPy Arrays:

Splitting is reverse operation of Joining.

Joining merges multiple arrays into one and Splitting breaks one array into multiple.

We use array_split() for splitting arrays, we pass it the array we want to split and the number of
splits.

Example:

>>>import numpy as np

>>>arr = np.array([1, 2, 3, 4, 5, 6])

>>>newarr = np.array_split(arr, 3)

>>>print(newarr)

Output:

[array([1, 2]), array([3, 4]), array([5, 6])]

Result:
Practiced with Numpy Array and familiarized with different operations on Numpy array.


Experiment -3

Question: Computation of NumPy arrays using universal functions and mathematical


methods.

Aim:
Computation of NumPy arrays using universal functions and mathematical methods.

Algorithm:

Step 1: Install NumPy and import it as np for ease of use.

Step 2: Use Numpy arrays and perform various operations by referring documentation.

Step 3: Math operations like add, subtract, log and absolute.

Step 4: Trigonometric functions like sin, cos, tan and their inverses.

Step 5: Comparison functions like greater, greater_equal, less and not_equal.

Step 6: Floating functions like fmod, floor, ceil and trunc.

Step 7: Apply various statistical functions (mean, median, mode, std, var etc.) on values in an np
array


Source Code:
#Importing numpy as np

import numpy as np

#Multiply all elements in the array

np.multiply.reduce([2,3,5])

Output : 30

# Identity element of multiplication

np.multiply.identity

Output : 1

#Creating array from 0 to 5

x1 = np.arange(6)

# applying the power function to every element in the array

np.power(x1, 3)

Output : array([ 0, 1, 8, 27, 64, 125])

#finding square root of elements in numpy array

np.sqrt([1,4,9])

Output : array([1., 2., 3.])

#Finding sin values

np.sin(np.pi/2.)

Output : 1.0


#Finding cos values

np.cos(0)

Output : 1.0

#Applying bitwise or between two numbers

np.bitwise_or(13, 16)

Output : 29

#Inverting the values of array

np.invert(np.array([True, False]))

Output : array([False, True])

# applying the greater-than operator on elements of two different arrays

np.greater([4,2],[2,2])

Output : array([ True, False])

#applying less than operator on elements of two different arrays.

np.less([1, 2], [2, 2])

Output : array([ True, False])
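Step 6 of the algorithm names the floating-point functions fmod, floor, ceil and trunc, which the listing above omits. A minimal sketch:

```python
import numpy as np

x = np.array([-1.7, -0.2, 0.2, 1.7])

print(np.floor(x))              # [-2. -1.  0.  1.]  round toward -infinity
print(np.ceil(x))               # [-1. -0.  1.  2.]  round toward +infinity
print(np.trunc(x))              # [-1. -0.  0.  1.]  drop the fractional part
print(np.fmod([5.0, -5.0], 3))  # [ 2. -2.]  remainder keeps the dividend's sign
```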


Theory:

NumPy universal functions are essentially mathematical functions: the mathematical
functions in NumPy are framed as universal functions. These universal functions
operate on NumPy arrays and perform element-wise operations on the data values.

The universal NumPy functions belong to the numpy.ufunc class in Python. Some of the basic
mathematical operations are called internally when we invoke certain operators. For example,
when we frame x + y, it internally invokes the numpy.add() universal function.

We can even create our own universal functions using frompyfunc() method.

Syntax:

numpy.frompyfunc(function-name, input, output)

 function-name: name of the Python function to be framed as a universal function
 input: the number of input arrays
 output: the number of output arrays
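The frompyfunc() call described above can be sketched by wrapping an ordinary Python function (the function name here is an arbitrary example):

```python
import numpy as np

def cube(x):
    return x * x * x

# wrap cube as a ufunc: 1 input array, 1 output array
cube_ufunc = np.frompyfunc(cube, 1, 1)

out = cube_ufunc(np.array([1, 2, 3]))   # applied element-wise
print(out)                              # [1 8 27], with dtype=object
```

Note that frompyfunc always returns object arrays, so cast with .astype(int) if a numeric dtype is needed.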

1. Universal Trigonometric Functions in NumPy

1. numpy.deg2rad() function: Converts degree values to radians.
2. numpy.sinh() function: Calculates the hyperbolic sine value.
3. numpy.arcsinh() function: Calculates the inverse of the hyperbolic sine value.
4. numpy.hypot() function: Calculates the hypotenuse for a right-angled triangle.

Example:

import numpy as np

data = np.array([0, 30, 45])

rad = np.deg2rad(data)


# hyperbolic sine value


print('Sine hyperbolic values:')
hy_sin = np.sinh(rad)
print(hy_sin)

# inverse hyperbolic sine

print('Inverse sine hyperbolic values:')
print(np.arcsinh(hy_sin))

# hypotenuse
b = 3
h = 6
print('hypotenuse value for the right angled triangle:')
print(np.hypot(b, h))

Output:

Sine hyperbolic values:
[0. 0.54785347 0.86867096]
Inverse sine hyperbolic values:
[0. 0.52359878 0.78539816]
hypotenuse value for the right angled triangle:
6.708203932499369

2. Universal Statistical functions

1. numpy.amin() function: Returns the minimum value of the array.
2. numpy.amax() function: Returns the maximum value of the array.
3. numpy.ptp() function: Returns the range of values of an array across an axis, which
is calculated by subtracting the minimum value from the maximum value.
4. numpy.average() function: Calculates the average of the array elements.

Example:

import numpy as np

data = np.array([10.2,34,56,7.90])
print('Minimum and maximum data values from the array: ')
print(np.amin(data))
print(np.amax(data))

print('Range of the data: ')


print(np.ptp(data))


print('Average data value of the array: ')


print(np.average(data))

Output:

Minimum and maximum data values from the array:


7.9
56.0
Range of the data:
48.1
Average data value of the array:
27.025000000000002
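Step 7 of the algorithm also asks for mean, median, mode, standard deviation and variance. NumPy covers all of these except mode (which lives in scipy.stats.mode, or can be computed by hand). A sketch on the same data as above:

```python
import numpy as np

data = np.array([10.2, 34, 56, 7.90])

print(np.mean(data))     # 27.025
print(np.median(data))   # 22.1 (midpoint of the two central sorted values)
print(np.std(data))      # population standard deviation (ddof=0)
print(np.var(data))      # population variance (the square of the std)
```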

Result:
Familiarized with different Universal functions in Numpy and performed various
mathematical operations with it.


Experiment -4

Question: Import a CSV file and perform various Statistical and Comparison operations
on rows/columns.

Aim:
Import a CSV file and perform various Statistical and Comparison operations on
rows/columns.

Algorithm:

Step 1: Create or download a csv file to practice

Step 2: Import this file to the drive and to colab for practicing different operations.

Step 3: Import Numpy and Pandas to use data frames and apply various statistical operations

Step 4: Display column titles of imported CSV file and start analyzing data in it.

Step 5: Apply functions like info(), head(), describe() etc., and learn their uses.

Step 6: Apply the functions like slicing and variations in slicing.

Step 7: Choose appropriate feature from the dataset and find mean, median, mode, std and
variance


Source code:

import pandas as pd

df = pd.read_csv('Data.csv')  # any local CSV file

df.head()  # returns the first 5 rows of the dataset

df.tail()  # returns the last 5 rows of the dataset

df.shape  # returns the dimensions of the dataframe

df.columns.tolist()  # extracts all the column names as a list

df.describe()  # shows count, mean, std etc. for each numeric column

df.max()  # returns the max value of every column

df.min()  # returns the min value of every column

df['Lscore'].mean()  # returns the mean of that column

df['Lscore'].median()  # returns the median of that column

df.sort_values('Lscore').head()  # sorting
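The comparison side of the experiment can be sketched with boolean indexing. Since Data.csv is not reproduced here, a small in-memory frame with the assumed 'Lscore'/'Wscore' columns stands in for it:

```python
import pandas as pd

# In-memory stand-in for Data.csv ('Lscore'/'Wscore' are assumed column names)
df = pd.DataFrame({'Lscore': [64, 70, 52, 81],
                   'Wscore': [72, 78, 60, 85]})

# Element-wise comparison between two columns
print(df['Wscore'] > df['Lscore'])   # boolean Series (True for every row here)

# Boolean indexing: keep only the rows matching a condition
print(df[df['Lscore'] >= 64])        # rows 0, 1 and 3

# Compare a column against a scalar and count the matches
print((df['Lscore'] > 60).sum())     # -> 3
```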


Theory:

CSV is a common file format used across many domains, such as financial services. Most
applications let you import and export data in CSV format, so it is worth getting a good
understanding of the format in order to better handle the data you work with every day.

What is a CSV?

CSV (Comma-Separated Values) is a simple file format used to store tabular data, such as
a spreadsheet or database. A CSV file stores tabular data (numbers and text) in plain text. Each
line of the file is a data record, and each record consists of one or more fields separated by
commas; the use of the comma as the field separator is the source of the name of this
file format.

Basic Operations with CSV Files

In the basic operations, we cover the following three things:

How to work with CSV files

How to open a CSV file

How to save a CSV file

Working with CSV Files

Working with CSV files is not a tedious task; it is pretty straightforward. However,
depending on your workflow, there are caveats you may want to watch out for.

Opening a CSV File

If you have a CSV file, you can open it in Excel without much trouble. Just open Excel, then
open the CSV file you want to work with (or right-click the CSV file and choose Open with
Excel). After you open the file, you will notice that the data is just plain text put into
different cells.


Saving a CSV File

If you want to save your current workbook as a CSV file, use the following commands:

File -> Save As… and choose CSV file.

More often than not, Excel will warn you that some workbook features may be lost when
saving as CSV.

Pandas can read datasets in different file formats such as CSV, HTML and XLSX.

To work with a CSV file, the .read_csv() function of pandas is used.

For the Excel (XLSX) format the .read_excel() function is used, and for HTML the
.read_html() function of pandas is used.

For PDF files, the tabula-py package needs to be installed.

Result:
Learned to import csv files and started to work on selected attribute for various statistical
studies.


Experiment-5
Question: Load an image file and do crop and flip operation using NumPy Indexing.

Aim: To load an image file and do crop and flip operation using NumPy Indexing.

Algorithm:

Step1: Install required Packages.

 Numpy
 Matplotlib
 Pillow

Step2: Import Numpy as np

Step3: Import Image module of pillow library to crop and flip the image easily.

Step4: Download and load the image to do the experiment.

Step5: To crop the image:

 Convert the Image into Array to crop the image.


 Slice the array with required dimensions to crop the image.
 Convert Image into Array to plot.
 Display the image using show() function.

Step6: To flip the image by applying transpose function:

 Open the image as object.


 Flip the image vertically and horizontally.
 Closing all the image objects.
 Display the image using show() function.

Step7: Take the same image, apply the rotate function to rotate it by a specified angle, and display it.

Source Code:

#Importing the required libraries


from PIL import Image


import numpy as np
import matplotlib.pyplot as plt

#importing the image


img_in=Image.open('/content/IMG_20221128_120157.jpg')
array=np.array(img_in)

#cropping the image using array slicing


cropped_array=array[50:350,150:450,:]
Image.fromarray(cropped_array)

#flipping the image from left to right


Image.fromarray(np.fliplr(array))

#flipping the image from up to down


Image.fromarray(np.flipud(array))
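Step 7's rotation can be sketched with PIL's rotate(); a generated solid-colour image stands in for the downloaded photo:

```python
from PIL import Image

# A solid-colour image stands in for the downloaded photo
img = Image.new('RGB', (400, 300), color='steelblue')

rotated = img.rotate(45)                        # rotate 45 degrees, same canvas size
rotated_expanded = img.rotate(45, expand=True)  # enlarge the canvas to fit the corners

print(rotated.size)           # (400, 300)
print(rotated_expanded.size)  # larger than the original
```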

Result:

Loaded an image file and performed crop and flip operation using NumPy Indexing.


Experiment-6
Question: Compute summary statistics such as mean, median, mode, standard deviation, and
variance of the given different types of data.

Aim: To write a program that computes summary statistics such as mean, median, mode,
standard deviation, and variance for different types of data.

Algorithm:

Step 1: Import pandas as pd

Step 2: Apply some Descriptive Statistics in Python Pandas on list data

Step 3: Apply same statistics on dictionary

Step 4: Learn the role of summary statistics on a data set by applying them to a small CSV file

Step 5: Practice different functions like cumsum(), prod(), abs(), etc.

Source Code:

#importing pandas

import pandas as pd

#reading the csv file


data = pd.read_csv("housing.csv")

# computing summary statistics on one column


mean1=data["housing_median_age"].mean()

median1=data["housing_median_age"].median()

mode1=data["housing_median_age"].mode()

std1=data["housing_median_age"].std()

var1=data["housing_median_age"].var()


#printing inputs and outputs


print("mean is " + str(mean1))

print("median is " + str(median1))

print("mode is " + str(mode1))

print("std is " + str(std1))

print("var is " + str(var1))
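Steps 2 and 3 of the algorithm apply the same statistics to list and dictionary data; a minimal sketch:

```python
import pandas as pd

# The same summary statistics computed on plain Python data structures
marks = pd.Series([78, 85, 85, 62, 90])        # from a list
print(marks.mean())     # 80.0
print(marks.median())   # 85.0
print(marks.mode()[0])  # 85

scores = pd.DataFrame({'maths': [70, 80, 90],  # from a dictionary
                       'physics': [65, 75, 85]})
print(scores.std())  # per-column standard deviation -> 10.0 for both
print(scores.var())  # per-column variance -> 100.0 for both
```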

THEORY:
Statistics is concerned with collecting and analyzing data. It includes methods for
collecting samples, describing the data, and drawing conclusions from it. NumPy is the
fundamental package for scientific computing and hence goes hand-in-hand with
statistical functions.

NumPy contains various statistical functions that are used to perform statistical data analysis.
These statistical functions are useful when finding a maximum or minimum of elements. It is
also used to find basic statistical concepts like standard deviation, variance, etc.

NumPy Statistical Functions:


Mean
Mean is the sum of the elements divided by the number of elements, given by the
formula: mean = (x1 + x2 + … + xn) / n


It calculates the mean by adding all the items of the array and then dividing by
the number of elements. We can also specify the axis along which the mean should
be calculated.

import numpy as np
a = np.array([5,6,7])
print(a)
print(np.mean(a))
Output
[5 6 7]
6.0

Median
Median is the middle element of the array. The formula differs for odd and even
sets.

It can calculate the median for both one-dimensional and multi-dimensional arrays.
Median separates the higher and lower range of data values.
import numpy as np
a = np.array([5,6,7])
print(a)
print(np.median(a))
Output
[5 6 7]
6.0

Mode


Mode refers to the most frequently occurring element in the array.

It can calculate the mode for both one-dimensional and multi-dimensional arrays.

from scipy import stats as st


import numpy as np
abc = np.array([1, 1, 2, 2, 2, 3, 4, 5])
print(st.mode(abc))
Output
ModeResult(mode=array([2]), count=array([3]))

Standard Deviation
Standard deviation is the square root of the average of squared deviations from the mean.
The formula for standard deviation is: std = sqrt(mean((x - x.mean())**2))

import numpy as np
a = np.array([5,6,7])
print(a)
print(np.std(a))
Output
[5 6 7]
0.816496580927726
Variance
In probability theory and statistics, variance is the expectation of the squared deviation of
a random variable from its population mean or sample mean. Variance is a measure
of dispersion, meaning it is a measure of how far a set of numbers is spread out from their
average value.


import numpy as np
a = np.array([[1, 2], [3, 4]])


print(a)
print(np.var(a))

Output
[[1 2]
 [3 4]]
1.25

Summary
These functions are useful for performing statistical calculations on the array elements. NumPy
statistical functions further increase the scope of the use of the NumPy library. The objective of
statistical functions is to eliminate the need to remember lengthy formulas. It makes processing
more user-friendly.

Result:
A program to compute summary statistics such as mean, median, mode, standard deviation, and
variance of the given different types of data is successfully executed.


Experiment-7
Question: Create pandas Series and DataFrames from various types of inputs

Aim: To create pandas Series and DataFrames from various types of inputs.

Algorithm:

Step 1: Create simple panda series by importing pandas library

Step 2: Create data frame with pandas library from multiple series of data

Step 3: Add new columns to the data frame

Step 4: Create data frame from the dictionary, list and list of list

Step 5: Assign index to the data frame.

Step 6: Create pandas data frame using DataFrame() function and add data to it.

Step 7: Create a pandas data frame using the zip() function.

Step 8: Constructing Series from a dictionary with an Index specified

Step 9: Load any small csv file to pandas data frame and display it.

Source Code:
#import pandas

import pandas as pd

author = ['jitender', 'purnima', 'arpit', 'jyoti']

auth_series = pd.Series(author)

print(auth_series)

article = [210, 211, 114, 178]

article_series = pd.Series(article)

#creating dataframe using dictionary


frame = {'author': auth_series, 'article': article_series}

result = pd.DataFrame(frame)

print(result)

#adding new column to data frame

age = [21, 21, 24, 23]

result['age'] = pd.Series(age)

print(result)

#creating pandas dataframes using zip() function

ls = list(zip(author, article, age))

df1 = pd.DataFrame(ls, columns=['author', 'article', 'age'])

print(df1)

#converting dataframe into csv file

df1.to_csv("exp7.csv")

#converting csv file into dataframe

df2 = pd.read_csv("exp7.csv")

df2
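Step 8 (constructing a Series from a dictionary with an explicit index) is not shown above; a short sketch with hypothetical city data:

```python
import pandas as pd

# A Series built from a dictionary, with an explicit index
population = {'Delhi': 19.0, 'Mumbai': 20.4, 'Chennai': 10.9}
s = pd.Series(population, index=['Mumbai', 'Delhi', 'Kolkata'])
print(s)
# Mumbai     20.4
# Delhi      19.0
# Kolkata     NaN   <- label not in the dictionary, so the value is missing
```

The index both reorders the values and introduces NaN for labels absent from the dictionary.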

THEORY:

PANDAS SERIES:

A Series is a list-like structure in pandas that can hold integer values, string values, double
values and more. A pandas Series returns an object in the form of a list, with an index
running from 0 to n-1, where n is the number of values in the series. Later in this article we
will discuss dataframes in pandas, but we first need to understand the main difference


between a Series and a DataFrame. A Series can only contain a single list with an index,
whereas a dataframe can be made of more than one series; in other words, a dataframe is a
collection of series that can be used to analyse the data.

Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, python objects, etc.). The axis labels are collectively called index. Pandas Series is
nothing but a column in an excel sheet.
Labels need not be unique but must be a hashable type. The object supports both integer and label-
based indexing and provides a host of methods for performing operations involving the index.

PANDAS DATAFRAME:

Pandas is a python package designed for fast and flexible data processing, manipulation and
analysis. Pandas has a number of fundamental data structures (a data management and storage
format). If you are working with two-dimensional labelled data, which is data that has both
columns and rows with row headers — similar to a spreadsheet table, then the DataFrame is the
data structure that you will use with Pandas.

Pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data


structure with labeled axes (rows and columns). A Data frame is a two-dimensional data structure,
i.e., data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three
principal components, the data, rows, and columns.

In the real world, a Pandas DataFrame will be created by loading the datasets from existing
storage, storage can be SQL Database, CSV file, and Excel file. Pandas DataFrame can be created
from the lists, dictionary, and from a list of dictionary etc.

Result:

Created pandas Series and DataFrames from various types of inputs.


Experiment-8
Question: Import any CSV file to Pandas DataFrame and perform the following:
a. Visualize the first and last 10 records
b. Get the shape, index and column details.
c. Select/Delete the records(rows)/columns based on conditions.
d. Perform ranking and sorting operations.
e. Do required statistical operations on the given columns.
f. Find the count and uniqueness of the given categorical values.
g. Rename single/multiple columns

Aim: To familiarize basic operations on a CSV file with a Pandas DataFrame

Algorithm:

Step 1: Import the CSV into a pandas DataFrame and visualize the first and last 10 records.

Step 2: From the CSV imported in step 1 get the shape, index and column details.

Step 3: Now Select/Delete the records(rows)/columns based on conditions.

Step 4: Perform ranking and sorting operations on this CSV data.

Step 5: Apply some statistical operations on the given columns.

Step 6: In the attached CSV data find the count and uniqueness of the given categorical values.

Step 7: Change the name of single/multiple columns in the attached CSV.

Step 8: Change the name of columns while loading csv file

Source Code:
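The printed source-code pages are not reproduced here; steps a-g can be sketched as follows, using a small in-memory frame with assumed column names ('name', 'dept', 'marks'):

```python
import pandas as pd

# In-memory stand-in for the imported CSV
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                   'dept': ['CSE', 'ECE', 'CSE', 'EEE'],
                   'marks': [82, 67, 91, 58]})

print(df.head(10)); print(df.tail(10))          # a. first/last 10 records
print(df.shape, df.index, df.columns)           # b. shape, index and columns
print(df[df['marks'] > 60])                     # c. select rows by condition
df = df.drop(columns=['name'])                  # c. delete a column
df['rank'] = df['marks'].rank(ascending=False)  # d. ranking
print(df.sort_values('marks'))                  # d. sorting
print(df['marks'].mean(), df['marks'].std())    # e. statistical operations
print(df['dept'].value_counts())                # f. counts of categorical values
print(df['dept'].nunique())                     # f. number of unique categories
df = df.rename(columns={'marks': 'score'})      # g. rename single/multiple columns
print(df.columns.tolist())
```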


Result:

Familiarizing some basic operations on CSV file with Pandas Data Frame is successfully
completed.


Experiment-9
Question:
Import any CSV file to Pandas DataFrame and perform the following:
a. Handle missing data by detecting and dropping/ filling missing values.
b. Transform data using apply() and map() method.
c. Detect and filter outliers.
d. Perform Vectorized String operations on Pandas Series.
e. Visualize data using Line Plots, Bar Plots, Histograms, Density Plots and Scatter Plots
AIM:

Familiarizing more operations on csv file with Pandas DataFrame.

ALGORITHM:

1. Import the required packages


2. Upload the CSV file and read the data.
3. Display the top rows and shape of data set.
4. Find the null values and their sum
5. Fill the null values with the mode value of the particular column
6. Plot the required columns individually using a boxplot.
7. Then apply different functions like lower(), upper(), min(), max() and so on.
8. Use a histogram to plot the gender ratio
9. Plot the different columns using different plotting techniques like box plots and
scatter plots.
10. Other functions can also be used to plot the graphs.
SOURCE CODE:

import pandas as pd

# for reading the data set


df = pd.read_csv('data.csv')

# printing the dataset


print(df)

# for columns details

print(df.dtypes)


# describe

print(df.describe())

# for selecting the one particular column

print(df["owner"])

# for top and bottom rows of data set

print(df.head())

print(df.tail())

# for slicing a row

df2=df[0:3]

print(df2)

# for copy the data

copied_data=df.copy()

print(copied_data)

# for dropping Nan values

print(copied_data.dropna())
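Items (b) and (c) of the question are not covered by the listing above; a sketch using a stand-in 'selling_price' column (the name mirrors the car dataset but is an assumption here):

```python
import pandas as pd

# Stand-in data for the CSV
df = pd.DataFrame({'selling_price': [60000, 135000, 600000, 250000, 9000000]})

# b. transform with apply() and map()
df['price_lakh'] = df['selling_price'].apply(lambda x: x / 100000)
df['band'] = df['price_lakh'].map(lambda x: 'high' if x > 5 else 'low')

# c. detect and filter outliers with the IQR rule
q1, q3 = df['selling_price'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['selling_price'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df[mask])  # the outlier row (9000000) is filtered out
```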

DATASET:

https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho/download?datasetVersionNumber=3

Result:

Familiarizing more operations on CSV file with Pandas Data Frame is successfully completed.


Experiment-10
Question: Demonstrate Linear Regression analysis with residual plots on a given data set

Aim: To Implement Linear Regression

Linear regression:

Linear regression analysis is used to predict the value of a variable based on the value of another
variable. The variable you want to predict is called the dependent variable. The variable you are
using to predict the other variable's value is called the independent variable.

This form of analysis estimates the coefficients of the linear equation, involving one or more
independent variables that best predict the value of the dependent variable. Linear regression fits
a straight line or surface that minimizes the discrepancies between predicted and actual output
values. There are simple linear regression calculators that use a "least squares" method to
discover the best-fit line for a set of paired data. You then estimate the value of Y (the
dependent variable) from X (the independent variable).

Residuals Plot:

A residual plot is a type of plot that displays the fitted values against the residual values for a
regression model. This type of plot is often used to assess whether or not a linear regression
model is appropriate for a given dataset and to check for heteroscedasticity of residuals.

Residuals, in the context of regression models, are the difference between the observed value of
the target variable (y) and the predicted value (ŷ), i.e. the error of the prediction. The residuals
plot shows the difference between residuals on the vertical axis and the dependent variable on
the horizontal axis, allowing you to detect regions within the target that may be susceptible to
more or less error.

SOURCE CODE:

# Importing required Libraries
import pandas as pd


from sklearn.linear_model import LinearRegression


import matplotlib.pyplot as plt

# Reading dataset
data = pd.read_csv("/content/salary.csv")
data.head()

output :
YearsExperience Salary

0 1.1 39343.0

1 1.3 46205.0

2 1.5 37731.0

3 2.0 43525.0

4 2.2 39891.0

#taking dependent and independent values


x = data['YearsExperience'].values
x = x.reshape(-1,1)
y = data['Salary'].values

# Plotting the actual values


plt.scatter(x,y,c='r',marker='x')
plt.grid(True)

output: (scatter plot of the actual YearsExperience vs Salary values)


# fitting the data into model


lr = LinearRegression()
lr.fit(x,y)
y_pred = lr.predict(x)
# plotting the predicted values
plt.plot(x,y_pred,c='k')
plt.scatter(x,y_pred)
plt.scatter(x,y)
plt.title('Salary prediction')
plt.show()
output: (scatter plot with the fitted regression line)

# Score of our model


lr.score(x,y)

output :
0.9569566641435086
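The residual plot itself is not shown in the listing above; a minimal sketch on synthetic salary-like data (the real salary.csv is assumed unavailable here):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the salary data
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 30).reshape(-1, 1)
y = 9000 * x.ravel() + 25000 + rng.normal(0, 4000, 30)

lr = LinearRegression().fit(x, y)
residuals = y - lr.predict(x)  # observed minus predicted

# Residuals vs fitted values; the points should scatter evenly around zero
plt.scatter(lr.predict(x), residuals, c='b')
plt.axhline(0, color='k', linestyle='--')
plt.xlabel('Fitted values'); plt.ylabel('Residuals')
plt.title('Residual plot')
plt.show()
```

With an intercept fitted, ordinary least squares residuals average to zero; any visible pattern around the zero line hints the linear model is inadequate.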
Result:

Implementation of linear regression is completed successfully.


EXPERIMENT - 11
Question: Implement the Naïve Bayesian classifier for a sample training data set stored as a
.CSV file. Compute the accuracy of the classifier, considering a few test data sets.
AIM:

To implement the Naïve Bayesian classifier for a sample training data set stored as a
.CSV file and compute the accuracy of the classifier, considering a few test data sets.

ALGORITHM:

Step 1: start the program.

Step 2: import required libraries.

Step 3: upload the dataset.

Step 4: from sklearn import train_test_split.

Step 5: split the dataset into 40% testing data and 60% training data.

Step 6: train the model and test it.

Step 7: import the metrics and compute the accuracy_score of the model.

Step 8: end the program.

Theory of Naive Bayes Algorithm


Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.
It is not a single algorithm but a family of algorithms where all of them share a common
principle, i.e. every pair of features being classified is independent of each other.

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes


theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick
predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. Such as if the fruit is identified on the
bases of color, shape, and taste, then red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identify that it is an apple without
depending on each other.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes’ Theorem
Bayes' Theorem finds the probability of an event occurring given the probability of another
event that has already occurred. Bayes' theorem is stated mathematically as the following
equation:

P(A|B) = P(B|A) * P(A) / P(B)

where A and B are events and P(B) ≠ 0.


 Basically, we are trying to find the probability of event A, given that event B is true.
Event B is also termed the evidence.
 P(A) is the prior probability of A, i.e. the probability of the event before the evidence is
seen. The evidence is an attribute value of an unknown instance (here, event B).
 P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence
is seen.
Now, with regard to our dataset, we can apply Bayes' theorem in the following way:

P(y|X) = P(X|y) * P(y) / P(X)


where y is the class variable and X is a dependent feature vector (of size n):

X = (x1, x2, …, xn)

For example, a feature vector and corresponding class variable can be (refer to the 1st
row of the dataset):
X = (Rainy, Hot, High, False)
y = No
So basically, P(y|X) here means the probability of "Not playing golf" given that the weather
conditions are "Rainy outlook", "Temperature is hot", "high humidity" and "no wind".
Naive assumption
Now it is time to add the naive assumption to Bayes' theorem: independence among the
features. We therefore split the evidence into independent parts.
If any two events A and B are independent, then
P(A,B) = P(A)P(B)
Hence, we reach the result:

P(y|x1, …, xn) = P(x1|y) P(x2|y) … P(xn|y) P(y) / (P(x1) P(x2) … P(xn))

Since the denominator remains constant for a given input, we can remove that term and
write:

P(y|x1, …, xn) ∝ P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)

Now we need to create a classifier model. For this, we find the probability of a given set of
inputs for all possible values of the class variable y and pick the output with maximum
probability:

y = argmax over y of P(y) · P(x1|y) · … · P(xn|y)

So, finally, we are left with the task of calculating P(y) and P(xi|y).
Note that P(y) is also called the class probability and P(xi|y) the conditional probability.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding
the distribution of P(xi|y).
Let us try to apply the above formula manually on our weather dataset. For this, we need
to find P(xi|yj) for each feature value xi and each class yj, which amounts to building a
frequency table for each feature.
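The manual calculation described above can be made concrete in a few lines; the (outlook, play) records below are a hypothetical toy sample, not the full weather dataset:

```python
# Toy (outlook, play) records to make the formula concrete
data = [('Rainy', 'No'), ('Rainy', 'No'), ('Sunny', 'Yes'),
        ('Sunny', 'Yes'), ('Overcast', 'Yes'), ('Rainy', 'Yes')]

# Class priors P(y)
n = len(data)
prior = {c: sum(1 for _, y in data if y == c) / n for c in ('Yes', 'No')}

# Conditional probabilities P(outlook | y)
def cond(x, c):
    in_class = [o for o, y in data if y == c]
    return in_class.count(x) / len(in_class)

# Posterior (up to the constant denominator) for a Rainy day
score = {c: prior[c] * cond('Rainy', c) for c in ('Yes', 'No')}
print(score)                       # {'Yes': 0.1666..., 'No': 0.3333...}
print(max(score, key=score.get))  # -> 'No'
```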


CODE:

# load the iris dataset


from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
# training the model on training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# making predictions on the testing set
y_pred = gnb.predict(X_test)
# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy :", metrics.accuracy_score(y_test, y_pred))


EXPERIMENT - 12
Question: Implement the k-Nearest Neighbour algorithm to classify the iris data set. Print both
correct and wrong predictions using Python ML library classes.
AIM:

To Implement k-Nearest Neighbour algorithm to classify the iris data set. Print both
correct and wrong predictions using Python ML library classes.

ALGORITHM:

Step 1: start the program.

Step 2: import required libraries.

Step 3: import iris dataset.

Step 4: split the data 80:20, with 80% used for training and the remaining

20% for testing.

Step 5: using the KNN classifier, fit x_train and y_train.

Step 6: now compute the train accuracy and test accuracy.

Step 7: plot the graph between accuracy and n_neighbors.

Step 8: end the program.

SOURCE CODE:

#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

#Taking x,y co-ordinates


X, y = make_blobs(n_samples = 500, n_features = 2, centers = 4, cluster_std = 1.5, random_state = 4)

#Using the seaborn style, plot the graph


plt.style.use('seaborn')
plt.figure(figsize = (10,10))
plt.scatter(X[:,0], X[:,1], c=y, marker= '*',s=100,edgecolors='black')
plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

#Assigning the classifiers


knn10 = KNeighborsClassifier(n_neighbors=10)
knn1 = KNeighborsClassifier(n_neighbors=1)

#Train the models


knn10.fit(X_train, y_train)
knn1.fit(X_train, y_train)
y_pred_10 = knn10.predict(X_test)
y_pred_1 = knn1.predict(X_test)

#print the outputs


from sklearn.metrics import accuracy_score
print("Accuracy with k=10", accuracy_score(y_test, y_pred_10)*100)
print("Accuracy with k=1", accuracy_score(y_test, y_pred_1)*100)

#plot the graphs


plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
plt.scatter(X_test[:,0], X_test[:,1], c=y_pred_10, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=10", fontsize=20)
plt.subplot(1,2,2)
plt.scatter(X_test[:,0], X_test[:,1], c=y_pred_1, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=1", fontsize=20)
plt.show()
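The listing above classifies synthetic blobs; the question itself asks for the iris data set with correct and wrong predictions printed. A minimal sketch (k=5 and the 80:20 split from the algorithm are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)  # 80:20 split

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Print each test sample as a correct or wrong prediction
for actual, pred in zip(y_test, y_pred):
    tag = 'Correct' if actual == pred else 'Wrong'
    print(f'{tag}: predicted={iris.target_names[pred]}, '
          f'actual={iris.target_names[actual]}')
print('Accuracy:', knn.score(X_test, y_test))
```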


KNN ALGORITHM

Introduction

 K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on


Supervised Learning technique.
 K-NN algorithm assumes the similarity between the new case/data and available cases and
put the new case into the category that is most similar to the available categories.
 K-NN algorithm stores all the available data and classifies a new data point based on the
similarity. This means when new data appears then it can be easily classified into a well suite
category by using K- NN algorithm.

 K-NN algorithm can be used for Regression as well as for Classification but mostly it is used
for the Classification problems.
 The following two properties would define KNN well:

1. Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a
specialized training phase and uses all of the data during classification.
2. Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm
because it doesn't assume anything about the underlying data.


 KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
 Example: Suppose we have an image of a creature that looks similar to both a cat and a
dog, and we want to know whether it is a cat or a dog. For this identification we can use the
KNN algorithm, as it works on a similarity measure. Our KNN model will find the features
of the new image that are similar to the cat and dog images, and based on the most similar
features it will put it in either the cat or the dog category.


EXPERIMENT - 13

Question: Implement the k-Means clustering algorithm to cluster a set of data stored in a .CSV
file. Compare the results of various "k" values for the quality of clustering.

AIM:

To implement the k-Means clustering algorithm to cluster a set of data stored in a .CSV
file and compare the results of various "k" values for the quality of clustering.

ALGORITHM:

Step 1: Select the number k to decide the number of clusters.


Step 2: Select k random points as centroids (they need not come from the input dataset).
Step 3: Assign each data point to its closest centroid, which will form the

predefined k clusters.
Step 4: Calculate the variance and place a new centroid in each cluster.

Step 5: Repeat step 3, i.e. reassign each data point to the new closest
centroid of its cluster.
Step 6: If any reassignment occurs, go to step 4; otherwise finish.
Step 7: The model is ready.

Dept. of. Computer Science & Engineering (Artificial Intelligence) 51


20CAI213 Data Science Laboratory B. Tech III Year II Semester

SOURCE CODE and OUTPUTS:
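The printed source-code pages are not reproduced here; a minimal sketch of the experiment on synthetic data, comparing the inertia (within-cluster sum of squares) across several k values:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data standing in for the CSV (4 true clusters)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.2, random_state=42)
df = pd.DataFrame(X, columns=['f1', 'f2'])

# Fit k-Means for several k and record the inertia
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(df)
    inertias[k] = km.inertia_

# Elbow plot: the bend suggests a good k
plt.plot(list(inertias), list(inertias.values()), marker='o')
plt.xlabel('k'); plt.ylabel('inertia')
plt.title('Comparing cluster quality for different k')
plt.show()
```

Inertia keeps shrinking as k grows, so the "elbow" (where the decrease levels off) is used to pick k rather than the minimum itself.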

