Eda Using Univariate Analysis

Uploaded by

veilraj2003

9/25/24, 11:24 AM eda-using-univariate-analysis

What is EDA?
Exploratory Data Analysis (EDA) is a method used to analyze and summarize
datasets.
It gives you a basic understanding of your data: its distribution, null values and
much more.
You can explore data either with graphs or with Python functions.
There are three types of analysis: univariate, bivariate and multivariate. In
univariate analysis, you analyze a single attribute. In bivariate analysis, you
analyze two attributes together, typically one attribute against the target
attribute. In multivariate analysis, you analyze multiple attributes together.
In the non-graphical approach, you use pandas attributes and methods such as shape,
describe, isnull, info, dtypes and more.
In the graphical approach, you use plots such as scatter, box, bar, density
and correlation plots.

In this notebook we will only look at univariate analysis.

Which dataset are we going to use?

Titanic Dataset
It is one of the most popular datasets for learning machine learning
basics. It contains information about passengers aboard the RMS Titanic, which
was unfortunately shipwrecked (891 records in the file used here). This dataset
can be used to predict whether a given passenger survived or not.

The Titanic Dataset contains the following attributes:

PassengerId: Unique ID of a passenger

Survived: Whether the passenger survived (0 = No, 1 = Yes). Target feature
Pclass: Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name: Name of the passenger
Sex: Male/Female
Age: Passenger age in years
SibSp: Number of siblings/spouses aboard
Parch: Number of parents/children aboard
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [3]: #Load the required libraries

import pandas as pd
import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

#Load the data

df = pd.read_csv('../input/titanic-data-set/titanic.csv')

#View the data

df.head()

#shape of data
print(f"Number of records present in given dataset is: {df.shape[0]} ")
print(f"Number of attributes present in given dataset is: {df.shape[1]} ")

Number of records present in given dataset is: 891

Number of attributes present in given dataset is: 12

In [4]: #Sample of data

#Gives random 10 data points from whole data
df.sample(10)

Out[4]:
     PassengerId  Survived  Pclass  Name                                         Sex     Age   SibSp  Parch  Ticket
659  660          0         1       Newell, Mr. Arthur Webster                   male    58.0  0      2      35273
283  284          1         3       Dorking, Mr. Edward Arthur                   male    19.0  0      0      A/5. 10482
588  589          0         3       Gilinski, Mr. Eliezer                        male    22.0  0      0      14973
487  488          0         1       Kent, Mr. Edward Austin                      male    58.0  0      0      11771
375  376          1         1       Meyer, Mrs. Edgar Joseph (Leila Saks)        female  NaN   1      0      PC 17604
267  268          1         3       Persson, Mr. Ernst Ulrik                     male    25.0  1      0      347083
583  584          0         1       Ross, Mr. John Hugo                          male    36.0  0      0      13049
440  441          1         2       Hart, Mrs. Benjamin (Esther Ada Bloomfield)  female  45.0  1      1      F.C.C. 13529
453  454          1         1       Goldenberg, Mr. Samuel L                     male    49.0  1      0      17453
448  449          1         3       Baclini, Miss. Marie Catherine               female  5.0   2      1      2666

(The columns to the right of Ticket are cut off in the exported page.)

Basic information about data – EDA

The df.info() function gives us the basic information about the dataset.

For any data, it is good to start with basics such as the data type of each
column and the number of non-null values.

The df.describe() function generates descriptive statistics.

Descriptive statistics summarize the central tendency, dispersion
and shape of a dataset's distribution, excluding NaN values.
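
To make the "excluding NaN values" point concrete, here is a minimal sketch on a made-up Series (the values are illustrative and not taken from the Titanic data). The count reported by describe() covers only the non-null entries, and the mean is computed over those entries alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four entries, one of them missing
ages = pd.Series([20.0, 30.0, np.nan, 40.0], name="Age")

summary = ages.describe()

# count ignores the NaN, so it reports 3 entries rather than 4
print(summary["count"])
# the mean is taken over the three non-null values only
print(summary["mean"])
```

This is why the Age column below reports a count of 714 while the dataset has 891 rows.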

In [5]: #Basic information

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

In [6]: #Describe the data

df.describe(include="float64")

Out[6]: Age Fare

count 714.000000 891.000000

mean 29.699118 32.204208

std 14.526497 49.693429

min 0.420000 0.000000

25% 20.125000 7.910400

50% 28.000000 14.454200

75% 38.000000 31.000000

max 80.000000 512.329200

In [7]: df.describe(include="int64")

Out[7]: PassengerId Survived Pclass SibSp Parch

count 891.000000 891.000000 891.000000 891.000000 891.000000

mean 446.000000 0.383838 2.308642 0.523008 0.381594

std 257.353842 0.486592 0.836071 1.102743 0.806057

min 1.000000 0.000000 1.000000 0.000000 0.000000

25% 223.500000 0.000000 2.000000 0.000000 0.000000

50% 446.000000 0.000000 3.000000 0.000000 0.000000

75% 668.500000 1.000000 3.000000 1.000000 0.000000

max 891.000000 1.000000 3.000000 8.000000 6.000000

In [8]: df.describe(include="object")

Out[8]: Name Sex Ticket Cabin Embarked

count 891 891 891 204 889

unique 891 2 681 147 3

top Braund, Mr. Owen Harris male 347082 B96 B98 S

freq 1 577 7 4 644

Duplicate values
We can use df.duplicated().sum() to count the duplicate rows, if any.
It returns the number of rows that repeat an earlier row in the data.
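
As a small sketch of what this call counts, the made-up frame below (hypothetical names and values) has one row that repeats an earlier one. duplicated() flags only the repeat, not the first copy, so the sum is 1.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first one
toy = pd.DataFrame({"name": ["A", "B", "A"], "age": [22, 35, 22]})

# duplicated() marks each row that repeats an earlier row;
# the first occurrence itself is not marked
flags = toy.duplicated()
n_duplicates = flags.sum()

print(flags.tolist())
print(n_duplicates)
```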

In [9]: #Find the duplicates

df.duplicated().sum()

Out[9]: 0

No duplicate rows are present.

Find the Null values

Finding the null values is one of the most important steps in EDA.
As is often said, ensuring the quality of the data is paramount.
So, let's see how we can find the null values.

In [10]: df.isnull().sum()

Out[10]: PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

In [11]: #We can also try heat map to visualize null values
plt.figure(figsize=(8,6))
sns.heatmap(df.isnull(), cbar= False)

Out[11]: <AxesSubplot:>

It is clear from both the heat map and the isnull() counts that most of the null
values are in the Age and Cabin attributes; Embarked has only two.
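
Raw null counts are easier to judge as a share of all rows. The sketch below recomputes that share from the counts and the row total reported above; the variable names are chosen here for illustration.

```python
# Null counts and row total as reported by df.isnull().sum() and df.shape above
null_counts = {"Age": 177, "Cabin": 687, "Embarked": 2}
n_rows = 891

# Share of missing values per column, in percent
null_pct = {col: round(100 * n / n_rows, 2) for col, n in null_counts.items()}

for col, pct in null_pct.items():
    print(f"{col}: {pct}% missing")
```

On the DataFrame itself, df.isnull().mean() * 100 gives the same percentages for every column in one call.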

Graphical Method for Univariate Analysis

1. Categorical Data
In [12]: df.info()

In [13]: counts = df["Survived"].value_counts()
died = counts[0]      # Survived == 0
survived = counts[1]  # Survived == 1

print(f"Number of people survived: {survived}")

print(f"Number of people died: {died}")

Number of people survived: 342

Number of people died: 549

In [14]: # look the categories up by label rather than by position
male = df["Sex"].value_counts()["male"]

female = df["Sex"].value_counts()["female"]

print(f"Number of male passengers: {male}")

print(f"Number of female passengers: {female}")

Number of male passengers: 577

Number of female passengers: 314

In [15]: df["Pclass"].value_counts()[1]

Out[15]: 216

In [16]: first_class = df["Pclass"].value_counts()[1]

second_class = df["Pclass"].value_counts()[2]
third_class = df["Pclass"].value_counts()[3]

print(f"Number of 1st class passengers: {first_class}")

print(f"Number of 2nd class passengers: {second_class}")
print(f"Number of 3rd class passengers: {third_class}")

Number of 1st class passengers: 216

Number of 2nd class passengers: 184
Number of 3rd class passengers: 491

In [17]: df["Embarked"].value_counts()

Out[17]: S 644
C 168
Q 77
Name: Embarked, dtype: int64

In [18]: # look the ports up by label rather than by position
s_embark = df["Embarked"].value_counts()["S"]

c_embark = df["Embarked"].value_counts()["C"]
q_embark = df["Embarked"].value_counts()["Q"]

print(f"Number of passengers embarked from Southampton: {s_embark}")

print(f"Number of passengers embarked from Cherbourg: {c_embark}")
print(f"Number of passengers embarked from Queenstown: {q_embark}")

Number of passengers embarked from Southampton: 644

Number of passengers embarked from Cherbourg: 168
Number of passengers embarked from Queenstown: 77
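
The counts above all come from value_counts(). As a minimal sketch on a made-up column (the values are hypothetical), looking a category up by its label avoids depending on the order in which value_counts() happens to sort the result, and normalize=True returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical toy column with string categories
embarked = pd.Series(["S", "C", "S", "Q", "S", "C"], name="Embarked")

counts = embarked.value_counts()
# Look a category up by its label, so the result does not depend on sort order
s_count = counts["S"]

# normalize=True returns proportions instead of raw counts
shares = embarked.value_counts(normalize=True)

print(s_count)
print(shares["S"])
```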

a. Countplot
A countplot shows the count of each category of a categorical feature.

In [19]: import warnings

warnings.filterwarnings("ignore")

In [34]: # set seaborn aesthetic and default figure size before creating the figure

sns.set(rc={'figure.figsize':(11.7,8.27)})

# define dimensions of subplots (rows, columns)

fig, axes = plt.subplots(2, 2)

# create chart in each subplot; pass the column as x= so the call
# is interpreted the same way across seaborn versions

p1 = sns.countplot(x=df["Survived"], ax=axes[0,0])
p2 = sns.countplot(x=df["Pclass"], ax=axes[0,1])
p3 = sns.countplot(x=df["Embarked"], ax=axes[1,0])
p4 = sns.countplot(x=df["Sex"], ax=axes[1,1])

b. Pie Chart
To see the percentage of each category of a categorical feature, we can use a
pie chart.

In [21]: # set figure size for plot

plt.figure(figsize=(10,10))

# first plot
plt.subplot(2,2,1)
plt.pie(df["Survived"].value_counts(), autopct='%.2f')
plt.title("Survived")

# second plot
plt.subplot(2,2,2)
plt.pie(df["Sex"].value_counts(), autopct='%.2f')
plt.title("Sex")

# third plot
plt.subplot(2,2,3)
plt.pie(df["Pclass"].value_counts(), autopct='%.2f')
plt.title("Pclass")

# fourth plot
plt.subplot(2,2,4)
plt.pie(df["Embarked"].value_counts(), autopct='%.2f')
plt.title("Embarked")

# to plot
plt.show()
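
The percentages that autopct prints on each slice are simply each count divided by the total. A minimal sketch for the Survived column, using the counts reported earlier in this notebook (342 survived and 549 died, which matches the Survived mean of 0.383838); the dictionary and variable names are chosen here for illustration.

```python
# Counts of the Survived column as reported earlier in this notebook
counts = {"died": 549, "survived": 342}
total = sum(counts.values())

# Percentage share of each category, rounded to two decimals like autopct='%.2f'
percentages = {label: round(100 * n / total, 2) for label, n in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```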

2. Numerical Data

a. Histogram
A great way to get started exploring a single variable is with the histogram. A
histogram divides the variable into bins, counts the data points in each bin, and
shows the bins on the x-axis and the counts on the y-axis.
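
The binning step can be sketched without any plotting. The example below uses made-up ages (hypothetical values) and NumPy's histogram function, with four equal-width bins chosen for illustration.

```python
import numpy as np

# Hypothetical toy ages
ages = np.array([2, 5, 18, 22, 25, 31, 38, 45, 62, 80])

# Four equal-width bins spanning 0 to 80
counts, edges = np.histogram(ages, bins=4, range=(0, 80))

print(edges)   # bin boundaries (x-axis)
print(counts)  # number of data points per bin (y-axis)
```

Here the bins are 0-20, 20-40, 40-60 and 60-80, and the counts are 3, 4, 1 and 2.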

Age Attribute:

In [22]: df["Age"].describe()

Out[22]: count 714.000000

mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64

In [23]: # A skewness value greater than 1 or less than -1 indicates a highly skewed distribution.
# A value between 0.5 and 1 or between -1 and -0.5 is moderately skewed.
# A value between -0.5 and 0.5 indicates that the distribution is fairly symmetric.

df["Age"].skew()

Out[23]: 0.38910778230082704
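
The rule of thumb in the comments above can be written as a small helper. This is only a sketch: the function name describe_skew is made up here, and the treatment of the exact boundary values (0.5 and 1) is an assumption, since the comments do not say which side they fall on.

```python
def describe_skew(skew: float) -> str:
    """Label a skewness value using the rule-of-thumb thresholds above."""
    magnitude = abs(skew)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:  # assumption: 0.5 itself counts as moderately skewed
        return "moderately skewed"
    return "fairly symmetric"


age_skew = 0.38910778230082704  # df["Age"].skew() from this notebook
print(describe_skew(age_skew))
```

By this rule the Age attribute is fairly symmetric.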

In [24]: plt.hist(df["Age"])
plt.show()

b. Distribution Plot
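In [25]: # sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
sns.histplot(x=df["Age"], kde=True, stat="density")
plt.show()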

c. Boxplot
A boxplot graphically shows the location, spread and skewness
of numerical data through its quartiles.
It is very useful for finding outliers in numerical data.
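
As a sketch of how a boxplot decides what counts as an outlier, the example below assumes the common 1.5 x IQR convention (seaborn's default whisker setting) and uses the Age quartiles reported by df.describe() earlier in this notebook. The variable names are chosen here for illustration.

```python
# Quartiles of Age as reported by df.describe() in this notebook
q1, q3 = 20.125, 38.0

# Interquartile range: the height of the box
iqr = q3 - q1

# Points beyond these fences are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(iqr, lower_fence, upper_fence)
```

Under this convention, ages above 64.8125 would be drawn as outlier points, while the lower fence is negative and so flags nothing.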

In [26]: sns.boxplot(x=df["Age"])
plt.show()

Fare Attribute:

In [27]: df["Fare"].describe()

Out[27]: count 891.000000

mean 32.204208
std 49.693429
min 0.000000
25% 7.910400
50% 14.454200
75% 31.000000
max 512.329200
Name: Fare, dtype: float64

The minimum fare is zero, which looks suspicious. Let's try to find out why.

It may be the result of a wrong entry.

In [28]: df[df["Fare"]==0]

Out[28]:
     PassengerId  Survived  Pclass  Name                              Sex   Age   SibSp  Parch  Ticket  Fare
179  180          0         3       Leonard, Mr. Lionel               male  36.0  0      0      LINE    0
263  264          0         1       Harrison, Mr. William             male  40.0  0      0      112059  0
271  272          1         3       Tornquist, Mr. William Henry      male  25.0  0      0      LINE    0
277  278          0         2       Parkes, Mr. Francis "Frank"       male  NaN   0      0      239853  0
302  303          0         3       Johnson, Mr. William Cahoone Jr   male  19.0  0      0      LINE    0
413  414          0         2       Cunningham, Mr. Alfred Fleming    male  NaN   0      0      239853  0
466  467          0         2       Campbell, Mr. William             male  NaN   0      0      239853  0
481  482          0         2       Frost, Mr. Anthony Wood "Archie"  male  NaN   0      0      239854  0
597  598          0         3       Johnson, Mr. Alfred               male  49.0  0      0      LINE    0
633  634          0         1       Parr, Mr. William Henry Marsh     male  NaN   0      0      112052  0
674  675          0         2       Watson, Mr. Ennis Hastings        male  NaN   0      0      239856  0
732  733          0         2       Knight, Mr. Robert J              male  NaN   0      0      239855  0
806  807          0         1       Andrews, Mr. Thomas Jr            male  39.0  0      0      112050  0
815  816          0         1       Fry, Mr. Richard                  male  NaN   0      0      112058  0
822  823          0         1       Reuchlin, Jonkheer. John George   male  38.0  0      0      19972   0

(The columns to the right of Fare are cut off in the exported page.)

In [29]: len(df[df["Fare"]==0])

Out[29]: 15

In total, 15 records have a fare of zero.

These 15 passengers are spread across all three classes.
All 15 embarked from Southampton.
All are male.
This is most likely the result of wrong entries.
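
Observations like these can be checked in code rather than by eye. The sketch below uses a small made-up stand-in for the zero-fare subset (the rows are hypothetical, not the real 15 records): a column with a single unique value is constant across the subset, and value_counts() shows how the rows spread over the classes.

```python
import pandas as pd

# Hypothetical stand-in for df[df["Fare"] == 0]; the values are illustrative
zero_fare = pd.DataFrame({
    "Pclass": [3, 1, 2, 2],
    "Sex": ["male", "male", "male", "male"],
    "Embarked": ["S", "S", "S", "S"],
})

# A column with exactly one unique value is constant across the subset
constant_cols = [col for col in zero_fare.columns if zero_fare[col].nunique() == 1]

# How the subset is spread across passenger classes
class_counts = zero_fare["Pclass"].value_counts().sort_index()

print(constant_cols)
print(class_counts.to_dict())
```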

In [30]: # A skewness value greater than 1 or less than -1 indicates a highly skewed distribution.
# A value between 0.5 and 1 or between -1 and -0.5 is moderately skewed.
# A value between -0.5 and 0.5 indicates that the distribution is fairly symmetric.

df["Fare"].skew()

Out[30]: 4.787316519674893

a. Histogram
In [31]: plt.hist(df["Fare"])
plt.show()

Highly Right Skewed

b. Distribution Plot
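In [32]: # sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
sns.histplot(x=df["Fare"], kde=True, stat="density")
plt.show()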

c. Boxplot
In [33]: sns.boxplot(x=df["Fare"])
plt.show()

That's all for this notebook: EDA using Univariate Analysis

There are many methods and approaches that we can use to analyze the
features of our data.
Which ones to choose depends on your use case and what you want to learn from the data.
The questions you ask of the data will be different in each case.
There is no hard and fast rule that says we should proceed one way or another; it
depends on your experiment and use case.

Next Notebook: EDA using Bivariate and Multivariate Analysis.