Data Vizualization - Jupyter Notebook

10/11/21, 2:47 PM Data Vizualization - Jupyter Notebook
1. Importing necessary libraries

In [44]:
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

# Show the charts inline in the Jupyter notebook
%matplotlib inline
In [5]:
month = ['Jan','Feb','Mar','Apr','May','Jun','July']
sales = [30000,25000,50000,45000,42000,30000,33000]
In [6]:
print(month)
print(sales)
['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'July']
[30000, 25000, 50000, 45000, 42000, 30000, 33000]
In [16]:
plt.figure(figsize=(15,6))
plt.plot(month,sales)
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Monthwise Sales')
plt.show()
Import mtcars dataset
localhost:8888/notebooks/Data science/Data Vizualization.ipynb 1/20

In [20]:
cars_data = pd.read_csv('mtcars.csv') #Automobile

cars_data
Out[20]:
mpg cyl disp hp drat wt qsec vs am gear carb
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
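
If `mtcars.csv` is not at hand, the loading step can still be tried on a small inline sample. The sketch below is only a stand-in under that assumption: it reads the first five rows of the table above from an in-memory string instead of the file, and the names `sample_csv` and `sample` are made up for illustration.

```python
import io

import pandas as pd

# First five rows of the table above, typed in as CSV text (illustrative sample only).
sample_csv = """mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
21.0,6,160.0,110,3.90,2.620,16.46,0,1,4,4
21.0,6,160.0,110,3.90,2.875,17.02,0,1,4,4
22.8,4,108.0,93,3.85,2.320,18.61,1,1,4,1
21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
18.7,8,360.0,175,3.15,3.440,17.02,0,0,3,2
"""

# read_csv accepts any file-like object, so StringIO stands in for the file on disk.
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)   # number of rows and columns in the sample
print(sample.dtypes)  # which columns were parsed as integers and which as floats
```

Checking `shape` and `dtypes` right after loading is a cheap way to confirm that the separator and the header row were picked up as expected.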

2. Perform Initial analysis and try to get 2 insights from this data.
In [21]:
cars_data.shape
Out[21]:
(32, 11)
In [22]:
cars_data.isna().sum()
Out[22]:
mpg 0
cyl 0
disp 0
hp 0
drat 0
wt 0
qsec 0
vs 0
am 0
gear 0
carb 0
dtype: int64
In [23]:
cars_data.describe()
Out[23]:
mpg cyl disp hp drat wt qsec v
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.00000
mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.43750
std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457 1.786943 0.50401
min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000 14.500000 0.00000
25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250 16.892500 0.00000
50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000 17.710000 0.00000
75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000 18.900000 1.00000
max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000 22.900000 1.00000
Insights/Inference

In [24]:
cars_data.cyl.unique()
Out[24]:
array([6, 4, 8], dtype=int64)
In [25]:
cars_data.gear.unique()
Out[25]:
array([4, 3, 5], dtype=int64)
In [26]:
cars_data.carb.unique()
Out[26]:
array([4, 1, 2, 3, 6, 8], dtype=int64)
In [28]:
cars_data.groupby(by='am')['mpg'].mean().round(2)
Out[28]:
am
0 17.15
1 24.39
Name: mpg, dtype: float64
Inferences
1. On average, automatic cars (am = 0) get 17.15 mpg and manual cars (am = 1) get 24.39 mpg.
2. On average, 4-cylinder cars get 26.66 mpg, 6-cylinder cars get 19.74 mpg and 8-cylinder cars get
15.10 mpg.
3. On average, automatic cars with a V-shaped engine (vs = 0) get 15.05 mpg and manual cars with a
V-shaped engine get 19.75 mpg, while automatic cars with a straight engine (vs = 1) get 20.74 mpg and
manual cars with a straight engine get 28.37 mpg. Manual cars with a straight engine therefore have
the highest average mileage of the four groups.
In [30]:
cars_data.groupby(by = 'cyl')['mpg'].mean().round(2)
Out[30]:
cyl
4 26.66
6 19.74
8 15.10
Name: mpg, dtype: float64

In [31]:
cars_data.groupby(by = ['vs','am'])['mpg'].mean().round(2)
Out[31]:
vs  am
0   0     15.05
    1     19.75
1   0     20.74
    1     28.37
Name: mpg, dtype: float64
In [32]:
pd.crosstab(cars_data['cyl'],cars_data['gear']) #Frequency table
Out[32]:
gear 3 4 5
cyl
4 1 8 2
6 2 4 1
8 12 0 2
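
Raw counts are hard to compare across rows of different size, so a frequency table is often read as row shares as well. The sketch below assumes the `cyl` and `gear` columns are retyped inline from the table printed earlier (so it runs without `mtcars.csv`) and uses the `normalize="index"` option of `pd.crosstab`; the names `counts` and `shares` are illustrative.

```python
import pandas as pd

# cyl and gear columns copied from the mtcars table printed earlier in the notebook.
cyl = [6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8,
       8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4]
gear = [4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3,
        3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4]
df = pd.DataFrame({"cyl": cyl, "gear": gear})

# Plain frequency table: number of cars for each (cyl, gear) pair.
counts = pd.crosstab(df["cyl"], df["gear"])

# Same table with every row divided by its row total, so each row sums to 1.
shares = pd.crosstab(df["cyl"], df["gear"], normalize="index").round(2)

print(counts)
print(shares)
```

Read this way, the 12 cars in the 8-cylinder, 3-gear cell account for about 86% of all 8-cylinder cars.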
1. Univariate Analysis
Used to understand the distribution of a single variable at a time.
In [33]:
cars_data.head()
Out[33]:
mpg cyl disp hp drat wt qsec vs am gear carb
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

In [39]:
cars_data.describe()
Out[39]:
mpg cyl disp hp drat wt qsec v
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.00000
mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.43750
std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457 1.786943 0.50401
min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000 14.500000 0.00000
25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250 16.892500 0.00000
50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000 17.710000 0.00000
75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000 18.900000 1.00000
max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000 22.900000 1.00000
In [60]:
cars_data.gear.unique()
Out[60]:
array([4, 3, 5], dtype=int64)
In [63]:
cars_data.gear.value_counts()
Out[63]:
3 15
4 12
5 5
Name: gear, dtype: int64

In [74]:
plt.figure(figsize = (10,10))
plt.pie(x = cars_data.gear.value_counts(),data=cars_data,labels=[3,4,5],explode=[0.02,0.02,
# Note: for a pie chart, pick a discrete variable and pass its value counts
plt.show()
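
The `plt.pie` call above is cut off in this export, so its remaining arguments are unknown. As a hedged, self-contained sketch of the same idea, the block below retypes the `gear` column inline and takes the labels from the value counts themselves, so labels and wedge sizes cannot drift apart. The uniform `explode` offset, the `autopct` format and the title are assumptions, not the original cell's settings.

```python
import matplotlib.pyplot as plt
import pandas as pd

# gear column copied from the mtcars table printed earlier in the notebook.
gear = pd.Series(
    [4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3,
     3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4],
    name="gear",
)

# One entry per distinct gear value; the values are the number of cars.
counts = gear.value_counts()

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(
    counts.values,
    labels=[str(g) for g in counts.index],  # labels taken from the counts, not typed by hand
    explode=[0.02] * len(counts),           # assumed: the same small gap for every wedge
    autopct="%1.1f%%",                      # assumed: print each wedge's share
)
ax.set_title("Share of cars by number of gears")
plt.show()
```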

In [38]:
plt.hist(x = 'mpg',data=cars_data) # Pick one continuous variable to see its frequency distribution

plt.title('MPG Distribution')
plt.show()
In [40]:
plt.hist(x = 'cyl',data=cars_data) # Wrong data type for a histogram: cyl is discrete

plt.title('CYL Distribution')
plt.show()
In [35]:
cars_data.cyl.value_counts()
Out[35]:
8 14
4 11
6 7
Name: cyl, dtype: int64

In [57]:
sns.countplot(x='cyl',y=None,data=cars_data)
Out[57]:
<AxesSubplot:xlabel='cyl', ylabel='count'>
Boxplot - used to detect outliers, i.e. points that lie unusually far above or below the rest of the data
In [92]:
cars_data.describe()
Out[92]:
mpg cyl disp hp drat wt qsec v
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.00000
mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.43750
std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457 1.786943 0.50401
min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000 14.500000 0.00000
25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250 16.892500 0.00000
50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000 17.710000 0.00000
75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000 18.900000 1.00000
max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000 22.900000 1.00000

In [91]:
plt.boxplot(x = 'hp',data = cars_data) # Continuous data

plt.show()
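
By default, matplotlib's boxplot draws as separate points any values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to `hp` by hand, using only the standard library. The column is retyped inline from the table printed earlier, and names such as `lower_fence` and `upper_fence` are illustrative.

```python
import statistics

# hp column copied from the mtcars table printed earlier in the notebook.
hp = [110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215,
      230, 66, 52, 65, 97, 150, 150, 245, 175, 66, 91, 113, 264, 175, 335, 109]

# method="inclusive" interpolates linearly between data points, which reproduces
# the 25% and 75% values shown by describe() for hp (96.5 and 180.0).
q1, _median, q3 = statistics.quantiles(hp, n=4, method="inclusive")

iqr = q3 - q1                  # spread of the middle half of the data
lower_fence = q1 - 1.5 * iqr   # anything below this counts as an outlier
upper_fence = q3 + 1.5 * iqr   # anything above this counts as an outlier

outliers = [value for value in hp if value < lower_fence or value > upper_fence]

print(q1, q3, iqr)               # 96.5 180.0 83.5
print(lower_fence, upper_fence)  # -28.75 305.25
print(outliers)                  # [335]
```

The single flagged value, 335 hp, is the maximum reported by `describe()`.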
2. Bivariate Analysis
Barplot
In [43]:
cars_data.groupby(by = 'cyl')['mpg'].mean().round(2)
Out[43]:
cyl
4 26.66
6 19.74
8 15.10

In [42]:
plt.bar(x = 'cyl',height = 'mpg',data=cars_data)

plt.show() # Wrong calculation: plt.bar draws one bar per row, so bars for the same cyl overlap instead of showing the mean mpg
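
One way to get the intended chart with plain matplotlib is to aggregate first and plot the aggregated values. The sketch below assumes the mean is the statistic of interest, as in the group means shown earlier, and retypes the `cyl` and `mpg` columns inline so it runs without the CSV file. `mean_mpg` is an illustrative name.

```python
import matplotlib.pyplot as plt
import pandas as pd

# cyl and mpg columns copied from the mtcars table printed earlier in the notebook.
cyl = [6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8,
       8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4]
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4,
       14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
df = pd.DataFrame({"cyl": cyl, "mpg": mpg})

# Aggregate first: one mean mpg value per cylinder count.
mean_mpg = df.groupby("cyl")["mpg"].mean().round(2)

# Then draw exactly one bar per group; string labels keep the x axis categorical.
fig, ax = plt.subplots(figsize=(6, 5))
ax.bar(mean_mpg.index.astype(str), mean_mpg.values)
ax.set_xlabel("cyl")
ax.set_ylabel("mean mpg")
ax.set_title("Mean MPG by cylinder count")
plt.show()
```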
In [47]:
cars_data.groupby(by = 'cyl')['mpg'].mean().round(2)
Out[47]:
cyl
4 26.66
6 19.74
8 15.10

In [59]:
plt.figure(figsize=(6,5))
sns.barplot(x='cyl',y='mpg',data=cars_data,)
plt.title('Cyl Vs MPG',size = 20)
plt.show()
Scatter Plot
Used to check the linear association between two variables. Note: both variables need to be
continuous.

In [75]:
cars_data.head()
Out[75]:
mpg cyl disp hp drat wt qsec vs am gear carb
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
In [76]:
plt.scatter(x = 'hp', y = 'mpg', data = cars_data)
Out[76]:
<matplotlib.collections.PathCollection at 0x27a630fc700>

In [77]:
plt.scatter(x = 'drat', y = 'mpg', data = cars_data)
Out[77]:
<matplotlib.collections.PathCollection at 0x27a63361d00>
In [78]:
sns.scatterplot(x = 'drat', y = 'mpg', data = cars_data)
Out[78]:
<AxesSubplot:xlabel='drat', ylabel='mpg'>

In [79]:
sns.lmplot(x = 'drat', y = 'mpg', data = cars_data)
Out[79]:
<seaborn.axisgrid.FacetGrid at 0x27a63540d60>
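
With its default settings, `sns.lmplot` overlays a straight-line regression fit on the scatter plot. To make that line less of a black box, the sketch below computes an ordinary least-squares slope and intercept for `mpg` against `drat` by hand. It is an illustration under the assumption that a simple one-variable linear fit is what is wanted, with both columns retyped inline from the table printed earlier. It does not reproduce seaborn's confidence band.

```python
# drat and mpg columns copied from the mtcars table printed earlier in the notebook.
drat = [3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.07, 2.93, 3.00,
        3.23, 4.08, 4.93, 4.22, 3.70, 2.76, 3.15, 3.73, 3.08, 4.08, 4.43, 3.77, 4.22, 3.62, 3.54, 4.11]
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4,
       14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]

n = len(drat)
mean_x = sum(drat) / n
mean_y = sum(mpg) / n

# Sum of cross-deviations and sum of squared x-deviations.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(drat, mpg))
s_xx = sum((x - mean_x) ** 2 for x in drat)

slope = s_xy / s_xx                   # change in mpg per unit of drat
intercept = mean_y - slope * mean_x   # fitted mpg when drat is 0

print(round(slope, 2), round(intercept, 2))
```

A positive slope here agrees with the upward trend visible in the scatter plot of `drat` against `mpg`.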
Boxplot as bivariate analysis

In [94]:
sns.boxplot(x='cyl',y='mpg',data=cars_data)
plt.show()
Correlation Matrix

In [87]:
corr = cars_data.corr().round(2)
corr
Out[87]:
mpg cyl disp hp drat wt qsec vs am gear carb
mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09
wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43
qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00 0.74 -0.23 -0.21 -0.66
vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1.00 0.17 0.21 -0.57
am 0.60 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1.00 0.79 0.06
gear 0.48 -0.49 -0.56 -0.13 0.70 -0.58 -0.21 0.21 0.79 1.00 0.27
carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1.00
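
Each cell of the matrix above is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a hedged illustration of what one such number means, the sketch below recomputes the `mpg` versus `wt` entry from the definition using only the standard library. The helper `pearson` is a made-up name and the two columns are retyped inline from the table printed earlier.

```python
import math

# mpg and wt columns copied from the mtcars table printed earlier in the notebook.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4,
       14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3.440, 4.070, 3.730, 3.780,
      5.250, 5.424, 5.345, 2.200, 1.615, 1.835, 2.465, 3.520, 3.435, 3.840, 3.845, 1.935, 2.140, 1.513,
      3.170, 2.770, 3.570, 2.780]


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)


print(round(pearson(mpg, wt), 2))  # -0.87, the mpg/wt entry in the matrix above
```

A value close to -1 means heavier cars tend to have lower mileage, which makes `wt` the variable most strongly associated with `mpg` in this table.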
3. Multivariate Analysis

In [83]:
sns.pairplot(data = cars_data)
Out[83]:
<seaborn.axisgrid.PairGrid at 0x27a63520e20>

In [89]:
plt.figure(figsize = (15,8))
sns.heatmap(data = corr,annot=True)
Out[89]:
<AxesSubplot:>

In [95]:
import plotly.express as px
# This dataframe has 244 lines, but 4 distinct values for `day`
df = px.data.tips()
fig = px.pie(df, values='tip', names='day')
fig.show()
In [ ]:
