EDA Using Univariate Analysis
What is EDA?
Exploratory Data Analysis (EDA) is a method used to analyze and summarize
datasets.
It gives you a basic understanding of your data: its distribution, its null values and
much more.
You can explore data either graphically, using plots, or non-graphically, using Python functions.
There are three types of analysis: univariate, bivariate and multivariate. In
univariate analysis, you analyze a single attribute on its own. In bivariate analysis, you
analyze two attributes together, typically a feature against the target attribute. In
multivariate analysis, you analyze multiple attributes together.
In the non-graphical approach, you use pandas attributes and methods such as shape,
dtypes, info, describe, isnull and more.
In the graphical approach, you use plots such as scatter, box, bar, density
and correlation plots.
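As a minimal sketch of the non-graphical approach, the snippet below runs the usual pandas checks on a small, made-up DataFrame. The `toy` frame and its values are hypothetical and only stand in for a real dataset; in pandas, column types are read from `dtypes`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
})

print(toy.shape)           # (number of rows, number of columns)
print(toy.dtypes)          # data type of each column
toy.info()                 # non-null counts, dtypes and memory usage
print(toy.describe())      # summary statistics for the numeric columns
print(toy.isnull().sum())  # number of missing values per column
```

Each call answers one question about the data, so they are usually run one after another at the start of an analysis.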
localhost:8888/doc/tree/eda-using-univariate-analysis.ipynb? 1/16
9/25/24, 11:24 AM eda-using-univariate-analysis
# Shape of the data: (rows, columns)
print(f"Number of records in the dataset: {df.shape[0]}")
print(f"Number of attributes in the dataset: {df.shape[1]}")
Out[4]:
     PassengerId  Survived  Pclass  Name                                         Sex     Age   SibSp  Parch  Ticket
659  660          0         1       Newell, Mr. Arthur Webster                   male    58.0  0      2      35273
283  284          1         3       Dorking, Mr. Edward Arthur                   male    19.0  0      0      A/5. 10482
588  589          0         3       Gilinski, Mr. Eliezer                        male    22.0  0      0      14973
487  488          0         1       Kent, Mr. Edward Austin                      male    58.0  0      0      11771
375  376          1         1       Meyer, Mrs. Edgar Joseph (Leila Saks)        female  NaN   1      0      PC 17604
267  268          1         3       Persson, Mr. Ernst Ulrik                     male    25.0  1      0      347083
583  584          0         1       Ross, Mr. John Hugo                          male    36.0  0      0      13049
440  441          1         2       Hart, Mrs. Benjamin (Esther Ada Bloomfield)  female  45.0  1      1      F.C.C. 13529
453  454          1         1       Goldenberg, Mr. Samuel L                     male    49.0  1      0      17453
448  449          1         3       Baclini, Miss. Marie Catherine               female  5.0   2      1      2666
For any dataset, it is good to start with basic information such as the data type and
the non-null count of each column; df.info() reports both.
Descriptive statistics summarize the central tendency, dispersion
and shape of a dataset's distribution, excluding NaN values; df.describe() computes them.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [7]: df.describe(include="int64")
In [8]: df.describe(include="object")
Duplicate values
We can use df.duplicated().sum() to count the duplicate rows, if any.
duplicated() flags every row that repeats an earlier row, and sum() counts the flags,
so the result is the number of duplicate rows present in the data.
Out[9]: 0
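The duplicate check can be tried on a small example. This is a sketch on a hypothetical `toy` frame in which one row repeats an earlier one; `drop_duplicates()` is shown as one common follow-up, not as a step taken in this notebook.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cid"],
    "Pclass": [1, 3, 1, 2],
})

dup_mask = toy.duplicated()         # True for each row that repeats an earlier row
n_duplicates = int(dup_mask.sum())  # number of duplicate rows
deduped = toy.drop_duplicates()     # keeps the first occurrence of each row

print(n_duplicates)
print(deduped.shape)
```

By default a row counts as a duplicate only if every column matches an earlier row.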
In [10]: df.isnull().sum()
Out[10]: PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
In [11]: # A heat map can also be used to visualize null values
plt.figure(figsize=(8, 6))
sns.heatmap(df.isnull(), cbar=False)
Out[11]: <AxesSubplot:>
Both the heat map and the isnull() counts show that the Age and Cabin attributes
contain many null values. Embarked has only two, which are easy to miss on the heat map.
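Raw null counts are easier to judge as a share of all rows. The sketch below takes the three non-zero counts and the row total (891) from the outputs above and converts them to percentages; on the full DataFrame, `df.isnull().mean() * 100` should give the same figures for every column.

```python
import pandas as pd

# Null counts and row total copied from the isnull() and info() outputs above.
null_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891

# Share of missing values per column, in percent.
null_pct = (null_counts / n_rows * 100).round(2)
print(null_pct)
```

Roughly a fifth of Age and more than three quarters of Cabin are missing.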
1. Categorical Data
In [12]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [13]: x = df["Survived"].value_counts()[0]
y = df["Survived"].value_counts()[1]
In [15]: df["Pclass"].value_counts()[1]
Out[15]: 216
In [17]: df["Embarked"].value_counts()
Out[17]: S 644
C 168
Q 77
Name: Embarked, dtype: int64
a. Countplot
A countplot shows the number of records in each category of a categorical feature.
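A countplot can be sketched with seaborn's `countplot`. The `toy` frame below is hypothetical; on the real data the same call would take `data=df`. The `Agg` backend line is only there so the snippet also runs without a display and can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, used only to illustrate the call.
toy = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "S", "C"]})

# One bar per category; the bar height is the number of rows in that category.
ax = sns.countplot(x="Embarked", data=toy)
ax.set_title("Embarked")

# The same numbers without a plot.
counts = toy["Embarked"].value_counts()
print(counts)
```

The bar heights match `value_counts()`, so the plot is a visual form of the counts computed earlier.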
b. Pie Chart
To see the percentage of each category of a categorical feature, we can use a
pie chart.
# first plot
plt.subplot(2,2,1)
plt.pie(df["Survived"].value_counts(), autopct='%.2f')
plt.title("Survived")
# second plot
plt.subplot(2,2,2)
plt.pie(df["Sex"].value_counts(), autopct='%.2f')
plt.title("Sex")
# third plot
plt.subplot(2,2,3)
plt.pie(df["Pclass"].value_counts(), autopct='%.2f')
plt.title("Pclass")
# fourth plot
plt.subplot(2,2,4)
plt.pie(df["Embarked"].value_counts(), autopct='%.2f')
plt.title("Embarked")
# to plot
plt.show()
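The percentages drawn by the pie chart can also be computed directly. This sketch uses the Embarked counts from the value_counts() output above; on the raw column, `df["Embarked"].value_counts(normalize=True) * 100` should give the same shares.

```python
import pandas as pd

# Counts copied from the Embarked value_counts() output above.
embarked_counts = pd.Series({"S": 644, "C": 168, "Q": 77})

# Share of each category in percent, rounded like autopct='%.2f'.
embarked_pct = (embarked_counts / embarked_counts.sum() * 100).round(2)
print(embarked_pct)
```

The two missing Embarked values are excluded from the total, so the shares are out of 889 rows.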
2. Numerical Data
a. Histogram
A great way to get started exploring a single variable is with the histogram. A
histogram divides the variable into bins, counts the data points in each bin, and
shows the bins on the x-axis and the counts on the y-axis.
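The binning step can be seen without a plot. The sketch below uses the nine non-null ages from the sample rows shown earlier and three 20-year bins; the bin edges are an assumption chosen for illustration.

```python
import numpy as np

# Non-null ages from the ten sample rows shown earlier.
ages = np.array([58, 19, 22, 58, 25, 36, 45, 49, 5])

# Bins [0, 20), [20, 40) and [40, 60]; only the last bin includes its right edge.
counts, edges = np.histogram(ages, bins=[0, 20, 40, 60])

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

`plt.hist` performs the same counting and then draws one bar per bin.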
Age Attribute:
In [22]: df["Age"].describe()
In [23]: # A skewness value greater than 1 or less than -1 indicates a highly skewed distribution.
# A value between 0.5 and 1 or between -1 and -0.5 indicates a moderately skewed distribution.
# A value between -0.5 and 0.5 indicates that the distribution is fairly symmetrical.
df["Age"].skew()
Out[23]: 0.38910778230082704
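To make the skewness value less of a black box, the sketch below recomputes it by hand on a small hypothetical series and compares it with `Series.skew()`, which returns the bias-adjusted sample skewness. The `classify_skew` helper is hypothetical and simply encodes the rule of thumb from the comments above.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one large observation on the right.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

n = len(values)
dev = values - values.mean()
m2 = (dev ** 2).mean()                           # second central moment
m3 = (dev ** 3).mean()                           # third central moment
g1 = m3 / m2 ** 1.5                              # uncorrected skewness
manual = g1 * np.sqrt(n * (n - 1)) / (n - 2)     # bias-adjusted skewness

print(manual, values.skew())


def classify_skew(skewness):
    """Label a skewness value using the rule of thumb above."""
    size = abs(skewness)
    if size > 1:
        return "highly skewed"
    if size >= 0.5:
        return "moderately skewed"
    return "fairly symmetrical"


print(classify_skew(0.38910778230082704))  # the Age skewness reported above
```

A positive value means the longer tail is on the right, which is what the single large value in the toy series produces.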
In [24]: plt.hist(df["Age"])
plt.show()
b. Distribution Plot
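In [25]: # distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.histplot(df["Age"], kde=True, stat="density")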
plt.show()
c. Boxplot
A boxplot graphically shows the locality, spread and skewness
of numerical data through its quartiles.
It is very useful for finding outliers in numerical data.
In [26]: sns.boxplot(x=df["Age"])  # pass the column as x= so the call behaves the same across seaborn versions
plt.show()
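The points a boxplot draws beyond its whiskers follow the 1.5 × IQR rule by default. The sketch below applies that rule by hand to a small hypothetical series, so the outliers can be listed rather than only seen.

```python
import pandas as pd

# Hypothetical ages with one unusually large value.
ages = pd.Series([22.0, 25.0, 28.0, 30.0, 31.0, 33.0, 35.0, 38.0, 80.0])

q1 = ages.quantile(0.25)   # first quartile
q3 = ages.quantile(0.75)   # third quartile
iqr = q3 - q1              # interquartile range

lower = q1 - 1.5 * iqr     # values below this are flagged
upper = q3 + 1.5 * iqr     # values above this are flagged

outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper)
print(outliers.tolist())
```

Whether a flagged point is an error or a genuine extreme value still has to be judged from the data itself.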
Fare Attribute:
In [27]: df["Fare"].describe()
The minimum fare is zero, which looks suspicious. Let's find out why.
In [28]: df[df["Fare"]==0]
Out[28]:
     PassengerId  Survived  Pclass  Name                              Sex   Age   SibSp  Parch  Ticket  Fare
179  180          0         3       Leonard, Mr. Lionel               male  36.0  0      0      LINE    0.0
263  264          0         1       Harrison, Mr. William             male  40.0  0      0      112059  0.0
271  272          1         3       Tornquist, Mr. William Henry      male  25.0  0      0      LINE    0.0
277  278          0         2       Parkes, Mr. Francis "Frank"       male  NaN   0      0      239853  0.0
302  303          0         3       Johnson, Mr. William Cahoone Jr   male  19.0  0      0      LINE    0.0
413  414          0         2       Cunningham, Mr. Alfred Fleming    male  NaN   0      0      239853  0.0
466  467          0         2       Campbell, Mr. William             male  NaN   0      0      239853  0.0
481  482          0         2       Frost, Mr. Anthony Wood "Archie"  male  NaN   0      0      239854  0.0
597  598          0         3       Johnson, Mr. Alfred               male  49.0  0      0      LINE    0.0
633  634          0         1       Parr, Mr. William Henry Marsh     male  NaN   0      0      112052  0.0
674  675          0         2       Watson, Mr. Ennis Hastings        male  NaN   0      0      239856  0.0
732  733          0         2       Knight, Mr. Robert J              male  NaN   0      0      239855  0.0
806  807          0         1       Andrews, Mr. Thomas Jr            male  39.0  0      0      112050  0.0
815  816          0         1       Fry, Mr. Richard                  male  NaN   0      0      112058  0.0
822  823          0         1       Reuchlin, Jonkheer. John George   male  38.0  0      0      19972   0.0
In [29]: len(df[df["Fare"]==0])
Out[29]: 15
In [30]: # A skewness value greater than 1 or less than -1 indicates a highly skewed distribution.
# A value between 0.5 and 1 or between -1 and -0.5 indicates a moderately skewed distribution.
# A value between -0.5 and 0.5 indicates that the distribution is fairly symmetrical.
df["Fare"].skew()
Out[30]: 4.787316519674893
a. Histogram
In [31]: plt.hist(df["Fare"])
plt.show()
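b. Distribution Plot
In [32]: # distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.histplot(df["Fare"], kde=True, stat="density")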
plt.show()
c. Boxplot
In [33]: sns.boxplot(x=df["Fare"])  # pass the column as x= so the call behaves the same across seaborn versions
plt.show()
There are many methods and approaches that we can use to analyze the
features of our data.
Which ones to use depends entirely on your use case and on what you want to learn from the data.
The questions you ask of the data will be different in each case.
There is no hard-and-fast rule that the analysis must proceed in one particular way; it
depends on your experiment and use case.