
University Institute of Engineering

Department of Computer Science & Engineering

Experiment: 1.2

Student Name: Ankit Singh UID: 22BCS16684


Branch: Computer Science and Engineering (CSE) Section/Group: 22BCS106-B
Semester: 1st Date of Performance: 25th Oct 2022

Subject Name: Disruptive Technologies-1


Subject Code: 22ECH-102

1. Aim of the Practical: Explore, visualize, transform and summarize input datasets for
building classification/regression/prediction models.
2. Tool Used: Google Colaboratory
3. Basic Concepts/Command Description: Exploratory data analysis (EDA) is an approach
or philosophy for data analysis that employs a variety of techniques (mostly graphical) to
maximize our insight into a data set, uncover underlying structures, test underlying
assumptions, etc.
Missing values can be treated by dropping records that have at least one missing value,
replacing missing values with the mean, median or mode, treating missingness as a
separate column, or predicting the missing values using regression techniques.
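The first three treatment options above can be sketched with pandas as follows. This is only a minimal illustration on a small hypothetical table: the column names `age` and `city` and their values are made up and are not taken from the dataset used in this experiment. Regression-based prediction of missing values is left out for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; not the dataset used in this experiment.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Option 1: drop every record that has at least one missing value.
dropped = toy.dropna()

# Option 2: replace missing values with the mean (numeric column)
# or the mode (categorical column).
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# Option 3: record missingness in a separate indicator column.
flagged = toy.assign(age_missing=toy["age"].isna())

print(dropped)
print(filled)
print(flagged)
```

Note that `dropna` and `fillna` return new objects here, so the original `toy` table is left unchanged.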
Some commands and libraries used in this program, with their respective functions:
a.) PyCaret library: an open-source, low-code machine learning library for Python. Its
pycaret.datasets module provides a wide range of sample datasets, from which we can pick the dataset to study.
b.) [ !pip install pycaret &> /dev/null ] installs the pycaret library in the notebook environment.
The '!' runs the line as a shell command (a shell is a computer program which exposes an operating
system's services to a human user or another program) and not as Python code; '&> /dev/null' discards the installer's output.
c.) [ from pycaret.datasets import get_data ] imports the get_data function, which loads the datasets bundled with the PyCaret library.
d.) [ dataSets = get_data('index') ] loads and displays the index of all the datasets available in the PyCaret library.
e.) [ diabetesDataSet = get_data("diabetes")
print(type(diabetesDataSet)) ] loads the dataset called "diabetes" from PyCaret and stores it
in the variable diabetesDataSet, so that every later operation on diabetesDataSet acts on
this data. The print call shows the type of the loaded object (a pandas DataFrame).
f.) [ diabetesDataSet.describe() ] is used to get the statistical summary of the data set.
g.) [ pandas library ] Pandas is a library for loading, manipulating and analysing tabular data.
h.) [ .loc and .iloc ] are indexers for selecting rows and columns. iloc selects by integer position,
whereas loc selects by label (name); both can select rows as well as columns. With slices, iloc
excludes the end position, whereas loc includes the end label. Both accept a boolean array as a
mask, but only loc accepts a boolean Series.
i.) [ import matplotlib.pyplot as plt ] imports the pyplot module of the matplotlib library,
which helps in visualizing the data with the help of graphs, pie charts, histograms
etc.
The module is then referred to by the short name "plt" (an alias), so that we don't
have to type the longer name again and again.
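The difference between the two indexers in item h.) can be sketched on a small hypothetical table. The column names and the row labels "a" to "d" below are made up for illustration and do not come from the experiment's dataset.

```python
import pandas as pd

# Hypothetical toy data; the row labels "a".."d" are chosen only for illustration.
toy = pd.DataFrame(
    {"height": [150, 160, 170, 180], "weight": [50, 60, 70, 80]},
    index=["a", "b", "c", "d"],
)

# iloc: integer positions; the end of the slice is excluded.
by_position = toy.iloc[0:2, 0:1]

# loc: labels; the end of the slice is included.
by_label = toy.loc["a":"c", ["weight"]]

# loc with a boolean Series used as a row filter.
tall = toy.loc[toy["height"] > 155, "weight"]

print(by_position)
print(by_label)
print(tall)
```

Here `by_position` holds two rows, while `by_label` holds three, because the label slice "a":"c" includes its end point.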

4. Codes and Outputs:


Classifications
a.) diabetesDataSet = get_data("diabetes") # SN is 7

# This is a binary classification dataset: "Class variable" takes one of two (binary) values.

print(type(diabetesDataSet))

b.) #Get the statistical summary of the dataset

diabetesDataSet.describe()

c.) print("diabetesDataSet.shape -->", diabetesDataSet.shape)

print("Rows -->", diabetesDataSet.shape[0]) # axis 0: rows

print("Columns -->", diabetesDataSet.shape[1]) # axis 1: columns

d.) # Syntax --> iloc[ ROW_Position, COL_Position ]

#diabetesDataSet.iloc[0:10, 5:]
# Also Try
#diabetesDataSet.iloc[10:100, :-2]
diabetesDataSet.iloc[20:30, 1:5]

e.) # Syntax --> loc[ ROW, COL_Names_in_List ]


#diabetesDataSet.loc[:, ['Diabetes pedigree function','Age (years)']]
# Also Try
#diabetesDataSet.loc[:10 , ['Diabetes pedigree function','Age (years)']]
diabetesDataSet.loc[20:80 , ['Diabetes pedigree function','Age (years)']]

Missing value Treatment


f.) diabetesDataSet.isnull().sum() # count the missing values in each column
#diabetesDataSet.dropna() # returns a copy without the rows that contain NA values

g.) diabetesDataSet.fillna(0) # returns a copy with every null value replaced by 0


Exploratory Data Analysis (EDA)


h.) #histogram plot
import matplotlib.pyplot as plt
#x=tuple(range(768))
x = diabetesDataSet['Age (years)']
#y = diabetesDataSet['Number of times pregnant']
plt.xlabel('Age Range')
plt.ylabel('Frequency')
#plt.title('Salary Vs Age')
plt.hist(x, color = "b", bins=50)
plt.show()

i.) ##Scatter plot


import matplotlib.pyplot as plt
#x=tuple(range(768))
x = diabetesDataSet['Age (years)']
y = diabetesDataSet['Number of times pregnant']
plt.xlabel('Age Range')
plt.ylabel('Number of times pregnant')
#plt.title('Salary Vs Age')
plt.scatter(x, y, marker = "*",color = "g")
plt.show()

j.) ##Subplot
import matplotlib.pyplot as plt
x = diabetesDataSet['Age (years)']
y = diabetesDataSet['Number of times pregnant']

# Left panel: histogram. Select the subplot first so that the labels attach to it.
plt.subplot(121)
plt.xlabel('Age Range')
plt.ylabel('Frequency')
plt.hist(x, color = "b", bins=50)

#########################################################################

# Right panel: scatter plot
plt.subplot(122)
plt.xlabel('Age Range')
plt.ylabel('Number of times pregnant')
plt.scatter(x, y, marker = "*",color = "g")

plt.show()
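As an alternative, a two-panel figure like the one above can also be built with matplotlib's object-oriented interface, where each panel is an Axes object that carries its own labels. The sketch below is not part of the original experiment: it uses a small hypothetical stand-in table, so the values and the bin count are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the two columns plotted above.
toy = pd.DataFrame({
    "Age (years)": [21, 25, 30, 30, 42, 57],
    "Number of times pregnant": [0, 1, 2, 3, 5, 8],
})

# One row, two columns of axes; each Axes object is labelled separately.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

ax_hist.hist(toy["Age (years)"], color="b", bins=5)
ax_hist.set_xlabel("Age (years)")
ax_hist.set_ylabel("Frequency")

ax_scatter.scatter(
    toy["Age (years)"], toy["Number of times pregnant"], marker="*", color="g"
)
ax_scatter.set_xlabel("Age (years)")
ax_scatter.set_ylabel("Number of times pregnant")

# Adjust spacing so that the axis labels of the two panels do not overlap.
fig.tight_layout()
```

In a notebook the figure is typically rendered when the cell finishes; in a script, a call to `plt.show()` would be needed to display it.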

5. Result and Summary:


After this experiment, we are able to explore a given data set using a set of
functions for loading, summarizing, selecting, cleaning and visualizing the
data.

6. Learning Outcomes (What I Have Learnt):


I. Fundamentals of Python programming and its role in Exploratory
Data Analysis.
II. The purpose and use of the PyCaret library in Python, and how to use it to explore a
given dataset.
III. The basics of Exploratory Data Analysis and how it helps in studying and
maintaining datasets that support scientific studies.
Example: in the medical field, we may have a data set covering a number of patients, but to study it we
need to analyse (explore) it, visualize it (in the form of graphs or charts) and transform it by
adding newer records from time to time.
