
Data Science by Nireekshan

3. Data Visualization with Seaborn

 The Seaborn library is based on the Matplotlib library and it helps in making data
visualization easier.
 This library can be used for creating both categorical and distributional plots.
 In this chapter, we will be using the titanic dataset.
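 Seaborn makes this dataset available through its load_dataset() function, which fetches it from Seaborn's online example-data repository (an internet connection is needed the first time). You only need to call the function and pass the name of the dataset as the parameter.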
 The dataset will then be loaded into your workspace.

Program loading titanic dataset


Name demo1.py

import seaborn as sns

data = sns.load_dataset('titanic')
print(data.head())

Output

Info

 The dataset has 891 rows and 15 columns and holds information about the
passengers who boarded the ill-fated Titanic.
 The usual task is to predict whether or not a passenger survived, based on
features such as age, fare, deck, the class of the ticket, etc.
 We will use the Seaborn library to see if we can find any patterns in the data.

1|Page nireekshan.ds@gmail.com

Distributional Plots

 These are the types of plots that show how statistical data is distributed.
 We will be discussing the common distributional plots provided by the Seaborn library.

Dist Plot

 We can call the distplot() function, which shows the histogram of the values in a
single column.
 For example, we can plot the price of the ticket (the fare column) for every passenger.

Program Creating distribution plot


Name demo2.py

import seaborn as sns


import matplotlib.pyplot as plt

data = sns.load_dataset('titanic')
sns.distplot(data['fare'])
plt.show()

Output

 The above plot shows that most of the tickets were sold for a fare between 0 and 50.
 The visible line shows the kernel density estimation. To remove the line, you can pass a
value of False to the kde parameter. This is demonstrated below:

sns.distplot(data['fare'], kde=False)


Program Creating distribution plot


Name demo3.py

import seaborn as sns


import matplotlib.pyplot as plt

data = sns.load_dataset('titanic')
sns.distplot(data['fare'], kde=False)
plt.show()

Output


Joint Plot

 We can use the jointplot() function to show the joint distribution of two columns, along with the individual distribution of each of them.
 We need to provide 3 parameters to the jointplot() function.
o The first parameter is the name of the column for which you need to show the
distribution on the x-axis.
o The second parameter should be the name of the column for which you need to
show the distribution on the y-axis.
o The third and final parameter should be the name of the data frame.
 Let us now create a joint plot of the age and fare columns so that we may inspect any
underlying relationship between the two.

Program Creating joint plot


Name demo4.py

import seaborn as sns


import matplotlib.pyplot as plt

data = sns.load_dataset('titanic')

sns.jointplot(x='age', y='fare', data=data)


plt.show()

Output


Program Creating hexagon plot


Name demo5.py

import seaborn as sns


import matplotlib.pyplot as plt

data = sns.load_dataset('titanic')

sns.jointplot(x='age', y='fare', data=data, kind='hex')

plt.show()

Output

 When creating a hexagon plot, the hexagon with the highest number of points will get a
darker color.
 In the above plot, the darkest hexagons show that most of the passengers are aged between
20 and 30 and that most of the passengers paid between 10 and 50 for their tickets.
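To make the idea of "darker means more points" concrete, the sketch below counts points per two-dimensional bin in plain Python. It uses square bins instead of hexagons for simplicity, and both the (age, fare) pairs and the bin width of 10 are made-up illustration values, not rows from the dataset.

```python
from collections import Counter

# Hypothetical (age, fare) pairs for illustration only
points = [(22, 7.25), (38, 71.28), (26, 7.92), (35, 53.1), (35, 8.05),
          (27, 11.13), (24, 16.0), (28, 13.0), (21, 10.5), (54, 51.86)]

BIN_WIDTH = 10  # assumed bin size for both axes


def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains value."""
    return int(value // width) * width


# Count how many points fall into each (age bin, fare bin) cell
counts = Counter((bin_start(age), bin_start(fare)) for age, fare in points)

# The cell with the highest count is the one that would be drawn darkest
densest_cell, n_points = counts.most_common(1)[0]
print(densest_cell, n_points)  # (20, 10) 4
```

Here the densest cell covers ages 20 to 30 and fares 10 to 20, so that is where the darkest bin would appear for this sample.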


Rug Plot

 This kind of plot draws a small bar along the x-axis for every point in the dataset.
 The plot shows clearly that most of the fares lie between 0 and 100, as was the case
with the dist plot.

Program Creating rugplot


Name demo6.py

import seaborn as sns


import matplotlib.pyplot as plt

data = sns.load_dataset('titanic')
sns.rugplot(data['fare'])
plt.show()

Output


Categorical Plots

 From the name, one can tell that categorical plots are used for plotting categorical data.
 In a categorical plot, the values of one categorical column are plotted against another
categorical column or a numeric column.

Bar Plot

 A bar plot shows the mean of a numeric column for each category of a categorical
column.

Program Creating bar plot


Name demo7.py

import seaborn as sns


import matplotlib.pyplot as plt

data = sns.load_dataset('titanic')
sns.barplot(x='sex', y='age', data=data)

plt.show()

Output

 The plot clearly shows that the average age for all male passengers is above 30 while the
average age of the female passengers is between 25 and 30.
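The height of each bar is simply the mean of the numeric column within one category (by default barplot() also draws an error bar, a bootstrapped confidence interval around that mean). A minimal sketch of the per-category mean in plain Python, using made-up rows rather than the real dataset:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (sex, age) rows for illustration; None marks a missing age
rows = [("male", 22), ("female", 38), ("female", 26), ("male", 35),
        ("male", None), ("female", 27), ("male", 54), ("female", 14)]

# Collect the known ages per category, skipping missing values
ages_by_sex = defaultdict(list)
for sex, age in rows:
    if age is not None:
        ages_by_sex[sex].append(age)

# One mean per category: these would be the bar heights
bar_heights = {sex: mean(ages) for sex, ages in ages_by_sex.items()}
print(bar_heights)  # {'male': 37, 'female': 26.25}
```

With the real dataset loaded, data.groupby('sex')['age'].mean() gives the same kind of summary in pandas.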


Count Plot

 This type of plot is similar to the bar plot, with the difference being that it displays the
count of categories in a specific column.
 A good application of this is when we need to calculate the total number or count of male
and female passengers.
 The output of the program below shows that there are more males than females: about
300 females and close to 600 males.
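What a count plot computes can be sketched in a few lines of plain Python. The list of values below is made up for illustration and only stands in for the sex column.

```python
from collections import Counter

# Hypothetical values of the 'sex' column for illustration only
sex_column = ["male", "female", "female", "male",
              "male", "male", "female", "male"]

# One count per category: these would be the bar heights of the count plot
category_counts = Counter(sex_column)
print(category_counts["male"], category_counts["female"])  # 5 3
```

With the real dataset loaded, data['sex'].value_counts() returns the same kind of counts.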

Program Creating count plot


Name demo8.py

import seaborn as sns


import matplotlib.pyplot as plt

data = sns.load_dataset('titanic')
sns.countplot(x='sex', data=data)

plt.show()

Output


Box Plot

 The box plot is used to display the distribution of numeric data in the form of
quartiles, usually with one box per category.
 The line inside the box shows the median value.
 The range from the lower whisker to the bottom of the box is the first quartile.
 From the bottom of the box to the median line lies the second quartile.
 From the median line to the top of the box lies the third quartile, and finally from the
top of the box to the top whisker lies the last quartile.
 Now let's plot a box plot that displays the distribution for the age with respect to each
gender.
 You need to pass the categorical column as the x parameter (which is sex in our case)
and the numeric column (age in our case) as the y parameter.
 Finally, the dataset is passed as the data parameter.

Program Creating a boxplot


Name demo9.py

import seaborn as sns


import matplotlib.pyplot as plt

data = sns.load_dataset('titanic')
sns.boxplot(x='sex', y='age', data=data)

plt.show()

Output

 The first quartile starts at around 5 and ends at 22, which means that roughly 25% of the
passengers are aged between 5 and 22.
 The second quartile starts at around 23 and ends at around 28, which means that 25% of
the passengers are aged between 23 and 28.

 Similarly, the third quartile runs from about 29 to 38, so 25% of the passengers
are aged within this range, and finally the fourth or last quartile starts at 39 and ends
around 76.
 The part between the lower quartile and the upper quartile (the box itself) is known as the
Interquartile Range (IQR) and covers the middle 50% of the data.
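The quartiles and the IQR can also be computed directly. The sketch below uses Python's statistics module on a made-up list of ages (not the real dataset). It also applies the rule that Seaborn uses by default for the whiskers: they extend to the furthest data point within 1.5 × IQR of the box, and anything beyond is drawn as an individual outlier point. Because of this, the whisker sections can hold slightly fewer than 25% of the points each.

```python
from statistics import quantiles

# Hypothetical ages for illustration only
ages = [5, 18, 22, 23, 25, 28, 29, 33, 38, 45, 76]

# Quartile cut points; "inclusive" uses linear interpolation between data points
q1, median, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Limits beyond which a point counts as an outlier (1.5 * IQR rule)
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the limits
inside = [a for a in ages if lower_limit <= a <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [a for a in ages if a < lower_limit or a > upper_limit]

print(q1, median, q3, iqr)        # 22.5 28.0 35.5 13.0
print(whisker_low, whisker_high)  # 5 45
print(outliers)                   # [76]
```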

Program Creating a boxplot with survived


Name demo10.py

import seaborn as sns


import matplotlib.pyplot as plt

data = sns.load_dataset('titanic')
sns.boxplot(x='sex', y='age', data=data, hue="survived")

plt.show()

Output

 Other than the information about the age of the passengers, the above plot also shows the
distribution of passengers who survived.
 For instance, the plot suggests that among the male passengers, those who survived tend to be younger than those who did not.
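The hue parameter splits every category into sub-groups, so each box summarises one (sex, survived) combination. A rough sketch of that grouping in plain Python, with made-up rows and the median as the summary value:

```python
from collections import defaultdict
from statistics import median

# Hypothetical (sex, survived, age) rows for illustration only
rows = [("male", 0, 22), ("female", 1, 38), ("female", 1, 26), ("male", 0, 35),
        ("male", 1, 4), ("female", 0, 14), ("male", 0, 54), ("male", 1, 27),
        ("female", 1, 58), ("female", 0, 31)]

# Group the ages by the pair (category, hue value)
groups = defaultdict(list)
for sex, survived, age in rows:
    groups[(sex, survived)].append(age)

# One median per sub-group: the line inside each box
medians = {key: median(ages) for key, ages in groups.items()}
for key in sorted(medians):
    print(key, medians[key])
```

In this made-up sample the median age of the surviving males (15.5) is lower than that of the males who did not survive (35), which is the kind of difference the boxes make visible.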


Violin Plot

 This type of plot is similar to the box plot, but instead of only showing the quartiles, a
violin plot shows the full shape of the distribution: it is wider where there are more data points.

Program Creating violin plot


Name demo11.py

import seaborn as sns


import matplotlib.pyplot as plt

data = sns.load_dataset('titanic')
sns.violinplot(x='sex', y='age', data=data)

plt.show()

Output

 The violin for the male passengers is widest between 20 and 40, which makes it clear that
most males are aged between 20 and 40.
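The outline of a violin is a kernel density estimate (KDE) of the values, mirrored around the centre line. The sketch below is a minimal Gaussian KDE in plain Python. The ages are made up, and the fixed bandwidth of 5 is an assumption for illustration (Seaborn chooses a bandwidth automatically).

```python
import math

# Hypothetical ages for illustration only
ages = [22, 24, 25, 28, 29, 30, 31, 35, 40, 62]
BANDWIDTH = 5.0  # assumed smoothing width, not Seaborn's automatic choice


def gaussian_kde(x, samples, bandwidth=BANDWIDTH):
    """Average of Gaussian bumps centred on each sample, evaluated at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


# Evaluate the density on a grid of ages; the violin is widest where it peaks
grid = range(0, 81, 5)
density = {age: gaussian_kde(age, ages) for age in grid}
widest_at = max(density, key=density.get)
print(widest_at)  # 30
```

For this sample the estimated density peaks at age 30, so a violin drawn from it would be widest there and thin out towards the isolated value at 62.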


Program Creating violin plot with survived


Name demo12.py

import seaborn as sns


import matplotlib.pyplot as plt

data = sns.load_dataset('titanic')
sns.violinplot(x='sex', y='age', data=data, hue='survived')

plt.show()

Output

 For example, comparing the bottom of the violin for males who survived with
that of the violin for males who did not survive shows that the former is thicker than the latter.
 Since the width of each violin reflects the density within its own group, this tells us that
young passengers make up a larger share of the males who survived than of the males who did not survive.
