Experiment: 1.2
1. Aim of the Practical: Explore, visualize, transform, and summarize input datasets for
building classification, regression, and prediction models.
2. Tool Used: Google Colaboratory
3. Basic Concepts/Command Description: Exploratory data analysis (EDA) is an approach
to data analysis that employs a variety of techniques (mostly graphical) to
maximize insight into a dataset, uncover its underlying structure, and test underlying
assumptions.
Missing values can be treated in several ways: dropping every record that has at least one
missing value; replacing missing values with the mean, median, or mode of the column; recording
missingness in a separate column; or predicting the missing values using regression techniques.
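The first three treatments above can be sketched with pandas as follows. This is a minimal illustration, not the lab's own code: the small table and its column names (age, bmi, smoker) are made up for the example, and reading "a different column" as a boolean indicator column is an assumption. Regression-based imputation is not shown.

```python
import numpy as np
import pandas as pd

# Tiny made-up table (illustrative values only, not from the lab dataset).
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "bmi": [22.5, 30.1, np.nan, 27.4],
    "smoker": ["no", "yes", None, "no"],
})

# 1. Drop every record that has at least one missing value.
dropped = df.dropna()

# 2. Replace missing numeric values with the column mean or median,
#    and missing categorical values with the mode.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["bmi"] = imputed["bmi"].fillna(imputed["bmi"].median())
imputed["smoker"] = imputed["smoker"].fillna(imputed["smoker"].mode()[0])

# 3. Record missingness in a separate indicator column (assumed reading).
flagged = df.copy()
flagged["age_missing"] = flagged["age"].isna()

print(dropped)
print(imputed)
print(flagged)
```

Dropping records is the simplest option but discards data; imputation keeps every row at the cost of some distortion in the column's distribution.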
Some commands and libraries used in this program, with their respective functions:
a.) PyCaret library: an open-source, low-code machine learning library for Python. It also ships a
collection of sample datasets, from which we can pick the dataset to be studied.
b.) [ !pip install pycaret &> /dev/null ] installs the PyCaret library in the notebook environment and
discards the installer output. The '!' runs the line as a shell command (a shell is a program that exposes
an operating system's services to a human user or another program) rather than as Python code.
c.) [ from pycaret.datasets import get_data ] imports the get_data function, which loads the datasets bundled with PyCaret.
d.) [ dataSets = get_data('index') ] displays the list of all datasets available through the PyCaret library.
University Institute of Engineering
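# Assumption: the loading step is missing from this extract; the column names used below match PyCaret's 'diabetes' dataset.
diabetesDataSet = get_data('diabetes')
# This is a binary classification dataset: "Class variable" takes two (binary) values.
print(type(diabetesDataSet))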
diabetesDataSet.describe()
#diabetesDataSet.iloc[0:10, 5:]
# Also Try
#diabetesDataSet.iloc[10:100, :-2]
diabetesDataSet.iloc[20:30, 1:5]
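The describe() and iloc calls above can be tried on a small stand-in table, so the slicing rules are easy to check by eye. This is a sketch only: the frame below and its column names (c0 to c5) are hypothetical and stand in for the diabetes data. Note that iloc selects by position and excludes the end of each range.

```python
import pandas as pd

# Small stand-in frame: 40 rows, 6 columns (hypothetical names c0..c5).
df = pd.DataFrame({f"c{j}": range(j * 100, j * 100 + 40) for j in range(6)})

# describe() summarizes each numeric column: count, mean, std, min,
# quartiles and max.
summary = df.describe()

# Rows at positions 20..29 and columns at positions 1..4 (end excluded).
block = df.iloc[20:30, 1:5]

# First 10 rows, from the column at position 5 to the last column.
tail_cols = df.iloc[0:10, 5:]

# Rows at positions 10..39, all columns except the last two.
no_last_two = df.iloc[10:40, :-2]

print(summary)
print(block.shape, tail_cols.shape, no_last_two.shape)
```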
j.) ##Subplot
# Side-by-side plots: age histogram (left), age vs. pregnancies (right)
import matplotlib.pyplot as plt
x = diabetesDataSet['Age (years)']
y = diabetesDataSet['Number of times pregnant']
# Left panel: select the subplot first so the labels attach to it
plt.subplot(121)
plt.hist(x, color="b", bins=50)
plt.xlabel('Age Range')
plt.ylabel('Frequency')
# Right panel: the original left this panel empty. A scatter plot is assumed
# here because the axis labels pair age with number of pregnancies.
plt.subplot(122)
plt.scatter(x, y, color="b")
plt.xlabel('Age Range')
plt.ylabel('Number of times pregnant')
plt.tight_layout()  # keep the two panels' labels from overlapping
plt.show()