DT EXP 1.2 Ankit

University Institute of Engineering
Department of Computer Science & Engineering
Experiment: 1.2
Student Name: Ankit Singh UID: 22BCS16684

Branch: Computer Science and Engineering (CSE) Section/Group: 22BCS106-B
Semester: 1st Date of Performance: 25th Oct 2022
Subject Name: Disruptive Technologies-1

Subject Code: 22ECH-102
1. Aim of the Practical: Explore, visualize, transform and summarize input datasets for
building Classification/regression/prediction models.
2. Tool Used: Google Colaboratory
3. Basic Concepts/Command Description: Exploratory data analysis (EDA) is an approach
or a philosophy for data analysis that employs a variety of techniques (mostly graphical) to
maximize our insight into a data set, uncover underlying structures, test underlying
assumptions etc.
Missing values can be treated by dropping records that have at least one missing value,
replacing missing values with the mean, median or mode, recording the missing values in a
separate column, or predicting the missing values using regression techniques.
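The four treatments listed above can be sketched with pandas and NumPy as follows. This is only an illustration on a made-up table: the column names "age" and "glucose" and all values are hypothetical, and the regression step uses a simple straight-line fit as one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "age" and "glucose" are illustrative names only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "glucose": [90.0, np.nan, 130.0, np.nan, 110.0],
})

# 1. Drop every record that has at least one missing value.
dropped = df.dropna()

# 2. Replace missing values with a summary statistic of the column.
mean_filled = df["glucose"].fillna(df["glucose"].mean())
median_filled = df["glucose"].fillna(df["glucose"].median())

# 3. Record which values were missing in a separate indicator column.
flagged = df.assign(glucose_missing=df["glucose"].isna())

# 4. Predict missing values from another column with a straight-line fit.
known = df.dropna()
slope, intercept = np.polyfit(known["age"], known["glucose"], deg=1)
predicted = df["glucose"].fillna(slope * df["age"] + intercept)

print(dropped)
print(flagged)
print(predicted)
```

Which treatment is appropriate depends on how many values are missing and why they are missing, so the choice should be checked against the dataset being studied.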
Some commands and libraries used in this program with their respective functions:
a.) PyCaret library: an open-source, low-code machine learning library for Python. Its
pycaret.datasets module gives access to a wide range of sample datasets, from which we pick the dataset to be studied.
b.) [ !pip install pycaret &> /dev/null ] is used to install the pycaret library in the notebook environment,
where '!' runs the line as a shell command (a shell is a computer program which exposes an operating
system's services to a human user or other program) and not as Python code, and '&> /dev/null' discards the installer's output.
c.) [ from pycaret.datasets import get_data ] imports the get_data function, which loads a dataset from the PyCaret library by name.
d.) [ dataSets = get_data('index') ] is used to list all the datasets available in the PyCaret library.
e.) [ diabetesDataSet = get_data("diabetes")
print(type(diabetesDataSet)) ] loads the dataset called "diabetes" from PyCaret into a pandas
DataFrame stored in the variable "diabetesDataSet", and print(type(...)) confirms its type. Every
later use of "diabetesDataSet" refers to this loaded data, so further operations are performed
on it directly.
f.) [ diabetesDataSet.describe() ] is used to get the statistical summary of the data set.
g.) [ pandas library ] The pandas library is used for loading, manipulating and summarizing tabular data.
h.) [ .loc and .iloc indexers ] iloc selects by integer position whereas loc selects by label (name).
Both can select rows and columns. A slice in iloc excludes its end position, while a slice in loc
includes its end label. loc accepts a boolean Series as a mask, but iloc accepts only a plain boolean array or list.
i.) [ import matplotlib.pyplot as plt ] imports the pyplot module of the matplotlib library,
which helps in visualizing the data with the help of graphs, pie charts, histograms
etc.
The module is given the short alias name "plt" so that we don't
have to type the longer name again and again.
4. Codes and Outputs:

Classifications
a.) from pycaret.datasets import get_data
diabetesDataSet = get_data("diabetes") # SN is 7
# This is binary classification dataset. The values in "Class variable" have two (binary) values.
print(type(diabetesDataSet))
b.) #Get the statistical summary of the dataset
diabetesDataSet.describe()
c.) print("diabetesDataSet.shape -->", diabetesDataSet.shape)
print("Rows -->", diabetesDataSet.shape[0]) # axis 0: rows
print("Columns -->", diabetesDataSet.shape[1]) # axis 1: columns
d.) # Syntax --> iloc[ ROW, COL_Position]
#diabetesDataSet.iloc[0:10, 5:]
# Also Try
#diabetesDataSet.iloc[10:100, :-2]
diabetesDataSet.iloc[20:30, 1:5]
e.) # Syntax --> loc[ ROW, COL_Names_in_List ]

#diabetesDataSet.loc[:, ['Diabetes pedigree function','Age (years)']]
# Also Try
#diabetesDataSet.loc[:10 , ['Diabetes pedigree function','Age (years)']]
diabetesDataSet.loc[20:80 , ['Diabetes pedigree function','Age (years)']]
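The difference between iloc and loc is easier to see on a small table whose row labels are not 0, 1, 2, and so on. The sketch below uses a hypothetical four-row DataFrame (the column names "age" and "bmi" and the index values are made up for illustration) rather than the diabetes dataset.

```python
import pandas as pd

# Hypothetical frame with a non-default index so labels and positions differ.
df = pd.DataFrame(
    {"age": [21, 35, 50, 64], "bmi": [22.5, 30.1, 27.8, 33.0]},
    index=[10, 20, 30, 40],
)

# iloc: integer positions, end of the slice is excluded (rows 0 and 1).
by_position = df.iloc[0:2, 0:1]

# loc: labels, end of the slice is included (rows labelled 10, 20 and 30).
by_label = df.loc[10:30, ["bmi"]]

# A boolean Series works as a mask with loc.
mask = df["age"] > 30
over_30 = df.loc[mask, ["age"]]

# iloc needs a plain boolean array, so the Series is converted first.
over_30_pos = df.iloc[mask.to_numpy(), [0]]

print(by_position)
print(by_label)
print(over_30)
```

In the diabetes dataset the default row labels coincide with the positions, so the inclusive end of a loc slice is the difference most likely to be noticed there.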
Missing value Treatment

f.) diabetesDataSet.isnull().sum() # counts the missing values in each column
#diabetesDataSet.dropna() # for deleting rows that contain an NA value
g.) diabetesDataSet.fillna(0) # returns a copy in which every null value is replaced with 0
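One point worth checking when using these calls: fillna and dropna return a new DataFrame by default and leave the original unchanged, so the result has to be assigned to a name to be kept. A minimal sketch on a hypothetical two-column table (the columns "a" and "b" are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value in each column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

# Count of missing values per column.
missing_per_column = df.isnull().sum()

# The returned copy is discarded here, so df itself is not changed.
df.fillna(0)
still_missing = int(df.isnull().sum().sum())

# Assigning the result keeps the treated data.
filled = df.fillna(0)
complete_rows = df.dropna()

print(missing_per_column)
print("Missing values left in df:", still_missing)
print(filled)
print(complete_rows)
```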

Exploratory Data Analysis (EDA)

h.) #histogram plot
import matplotlib.pyplot as plt
#x=tuple(range(768))
x = diabetesDataSet['Age (years)']
#y = diabetesDataSet['Number of times pregnant']
plt.xlabel('Age Range')
plt.ylabel('Frequency')
#plt.title('Salary Vs Age')
plt.hist(x, color = "b", bins=50)
plt.show()
i.) ##Scatter plot
# x (Age) is reused from the histogram step above
y = diabetesDataSet['Number of times pregnant']
plt.xlabel('Age (years)')
plt.ylabel('Number of times pregnant')
plt.scatter(x, y, marker = "*",color = "g")
plt.show()
j.) ##Subplot
# Select each subplot first, then label it, so the label lands on the right plot
#histogram plot
plt.subplot(121)
plt.ylabel('Frequency')
plt.hist(x, color = "b", bins=50)
#scatter plot
plt.subplot(122)
plt.ylabel('Number of times pregnant')
plt.scatter(x, y, marker = "*",color = "g")
plt.show()
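The same side-by-side figure can also be drawn with explicit Axes objects from plt.subplots, which makes it clear which plot each label belongs to. The sketch below is an alternative, not the code run in this experiment: it uses randomly generated stand-in values instead of the diabetes dataset, and the "Agg" backend line is an assumption so that it runs without a display (in a Colab notebook that line should be removed).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data (hypothetical), not the real diabetes columns.
rng = np.random.default_rng(0)
age = rng.integers(21, 80, size=100)
pregnancies = rng.integers(0, 12, size=100)

# One row, two columns of plots; each Axes object is labelled directly.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

ax_hist.hist(age, bins=20, color="b")
ax_hist.set_xlabel("Age (years)")
ax_hist.set_ylabel("Frequency")

ax_scatter.scatter(age, pregnancies, marker="*", color="g")
ax_scatter.set_xlabel("Age (years)")
ax_scatter.set_ylabel("Number of times pregnant")

fig.tight_layout()  # keeps the axis labels from overlapping
plt.show()
```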
5. Result and Summary:

After this experiment, we are able to explore a given dataset using pandas and
matplotlib functions: loading it, summarizing it, selecting rows and columns,
treating missing values and visualizing it with plots.
6. Learning Outcomes (What I Have Learnt):

I. Fundamentals of Python programming and its role in Exploratory
Data Analysis.
II. Purpose of the PyCaret library in Python and its use in loading and exploring the
given dataset.
III. Basics of Exploratory Data Analysis and how it helps in understanding
datasets to aid scientific studies.
Ex- In the medical field, we may have a dataset for a number of patients, but to study it we
need to analyse (explore) it, visualize it (in the form of graphs or charts) and transform it by
adding newer records from time to time.
