Plotting and Visualization

Lecture 5

Sergio Caballero, Ph.D.


sergioac@mit.edu

SCM.254 Applied Programming and Data Analysis in Python | L5: Plotting and Visualization | Page 1
Outline

• Introduction to seaborn
• Plotting numerical data
• Plotting categorical data
• Plotting the distribution of a dataset

Introduction to Seaborn

seaborn
• Seaborn is a Python data visualization library based
on matplotlib.

seaborn.set_style
• Set the aesthetic style of the plots.
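A minimal sketch of set_style, assuming seaborn and matplotlib are installed. "whitegrid" is one of seaborn's five built-in presets; the Agg backend line is an assumption made only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Built-in presets: "darkgrid", "whitegrid", "dark", "white", "ticks".
sns.set_style("whitegrid")

# axes_style() with no arguments returns the style parameters now in effect.
current = sns.axes_style()
grid_on = current["axes.grid"]  # "whitegrid" switches the background grid on

# Any plot created after set_style() picks up the new look.
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
# Call fig.savefig("styled.png") or plt.show() to view the result.
```

Calling set_style once near the top of a script is usually enough, since the setting applies to every figure created afterwards.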

Choosing color palettes
• Seaborn makes it easy to select and use color palettes that
are suited to the kind of data you are working with.
§ Qualitative color palettes
§ Sequential color palettes
§ Diverging color palettes
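The three palette families can be sketched with color_palette(). The palette names used here ("deep", "Blues", "coolwarm") are common choices picked for illustration, not names taken from the slides.

```python
import seaborn as sns

# Qualitative: distinct hues for categories that have no inherent order.
qualitative = sns.color_palette("deep", 6)

# Sequential: a light-to-dark ramp for values that run from low to high.
sequential = sns.color_palette("Blues", 6)

# Diverging: two hues that meet at a neutral midpoint, for data with a
# meaningful center such as zero.
diverging = sns.color_palette("coolwarm", 7)

# Each palette behaves like a list of (r, g, b) tuples with values in [0, 1].
print(len(qualitative), len(sequential), len(diverging))
```

A palette built this way can be passed to the palette argument of most seaborn plotting functions.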

Plotting numerical data

Visualizing statistical relationships
• The goal is to understand how numerical variables in a
dataset relate to each other.
• relplot() is a function for visualizing statistical
relationships; it produces two common kinds of plot:
§ scatter plots
§ line plots

relplot()
relplot(x='A', y='B', hue='C', size='D', style='E',
kind='line', data=df)
• Arguments:
§ x,y: names of variables (columns); must be numeric
§ hue, size, style: grouping variable that produces elements
with different colors, sizes or styles; categorical or numeric
§ kind: kind of plot to draw; 'scatter' (the default) or 'line'
§ data: DataFrame
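The arguments above can be tried on a small DataFrame. This is a sketch only: the routes table and its column names (distance_km, duration_min, vehicle, stops) are hypothetical values invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical delivery-route data; names and values are invented.
routes = pd.DataFrame({
    "distance_km":  [12.0, 18.5, 25.1, 30.2, 41.7, 55.3],
    "duration_min": [35, 48, 60, 71, 95, 120],
    "vehicle":      ["van", "van", "truck", "van", "truck", "truck"],
    "stops":        [4, 6, 7, 9, 12, 15],
})

# kind="scatter" is the default. hue colors the points by a categorical
# column and size scales each marker by a numeric column.
g = sns.relplot(
    x="distance_km", y="duration_min",
    hue="vehicle", size="stops",
    kind="scatter", data=routes,
)
g.set_axis_labels("Distance (km)", "Duration (min)")
```

relplot() returns a FacetGrid, so labels and titles are set through the grid object rather than through a single matplotlib Axes.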

Scatter plots
• Scatter plots are a useful way of examining the relationship
between two one-dimensional datasets.
• Example: [Figure: the routes DataFrame and its scatter plot]

Line plots
• A line plot displays information as a series of data points
connected by straight line segments.
• Example: [Figure: the route_r001 DataFrame and its line plot]

Plotting categorical data
• catplot() is a function that shows the relationship
between a numerical variable and one or more categorical variables.
• It can be used for:
§ Categorical scatter plots
§ Categorical distribution plots
o Boxplot
o Violinplot

catplot()
catplot(x='A', y='B', hue='C', order=['A1', 'A2'],
kind='box', data=df)
• Arguments:
§ x,y: names of variables (columns)
§ hue: grouping variable that produces elements with different
colors
§ kind: kind of plot to draw; 'strip' (the default), 'box',
'violin'
§ order: order to plot the categorical levels
§ data: DataFrame
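A sketch of the call above on made-up data. The DataFrame data and its columns (route, duration_min, shift) are hypothetical names chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical trip durations by route and shift; values are invented.
data = pd.DataFrame({
    "route":        ["R001"] * 4 + ["R002"] * 4,
    "shift":        ["am", "am", "pm", "pm"] * 2,
    "duration_min": [30, 34, 41, 45, 52, 55, 60, 66],
})

# order fixes the left-to-right order of the categories on the x-axis;
# hue splits each category into one box per shift.
g = sns.catplot(
    x="route", y="duration_min", hue="shift",
    order=["R002", "R001"], kind="box", data=data,
)
ax = g.axes.flat[0]
```

Swapping the x and y arguments puts the categorical variable on the y-axis and draws the boxes horizontally.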

Categorical scatter plot
• A categorical scatter plot draws one point per observation,
with a categorical variable on one of the axes.
• Example: [Figure: the data DataFrame and its categorical scatter plot]

Categorical scatter plot (Cont.)
[Figure panels: overlapping points; categorical variable on y-axis]

Boxplots
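• A boxplot shows the three quartile values of the distribution
(25th percentile, median, 75th percentile) along with extreme
values. The whiskers extend to the most extreme points that lie
within 1.5 IQRs of the lower and upper quartiles; observations
beyond that range are drawn individually.

[Figure: box plot of the data DataFrame]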
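The quantities a boxplot draws can be computed by hand. The sketch below uses NumPy on a hypothetical list of durations and applies the 1.5 IQR rule described above.

```python
import numpy as np

# Hypothetical trip durations in minutes; the last value is unusually long.
durations = np.array([31, 34, 35, 36, 38, 40, 41, 43, 44, 75])

# The three quartile values that form the box and its center line.
q1, median, q3 = np.percentile(durations, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 IQRs beyond the lower and upper quartiles.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the fences.
inside = durations[(durations >= low_fence) & (durations <= high_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point.
outliers = durations[(durations < low_fence) | (durations > high_fence)]

print(q1, median, q3)             # 35.25 39.0 42.5
print(whisker_low, whisker_high)  # 31 44
print(outliers)                   # [75]
```

Note that the whiskers stop at actual data points (31 and 44), not at the fence values themselves.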

Boxplots (Cont.)
[Figure panels: categorical variable on x-axis; categorical variable on y-axis]

Violin plots
• A violin plot combines a boxplot with a kernel density estimate
of the distribution of values. The quartiles and whiskers are shown
inside the violin.

[Figure: violin plot of the data DataFrame]
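A violin plot can be drawn through catplot() with kind="violin". This is a sketch on invented data; the DataFrame data and its columns are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical trip durations for two routes; values are invented.
data = pd.DataFrame({
    "route":        ["R001"] * 6 + ["R002"] * 6,
    "duration_min": [30, 32, 35, 36, 39, 44, 50, 53, 55, 58, 61, 70],
})

# inner="box" draws a miniature boxplot (quartiles and whiskers) inside
# each violin; the outline is the kernel density estimate.
g = sns.catplot(
    x="route", y="duration_min",
    kind="violin", inner="box", data=data,
)
ax = g.axes.flat[0]
```

Because the outline comes from a density estimate, violins drawn from very few observations can look smoother than the data justifies.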

Bar plots
• barplot() is a function that represents an estimate of
central tendency for a numeric variable (height of each
rectangle) and provides some indication of the uncertainty
around that estimate.
[Figure: bar plot annotated with the mean (bar height) and the measure of uncertainty (error bar)]

barplot()
barplot(x='A', y='B', hue='C', order=['A1', 'A2'],
ci='sd', data=df)
• Arguments:
§ x,y: names of variables (columns); x-categorical, y-numerical
§ hue: grouping variable that produces elements with different
colors
§ order: order to plot the categorical levels
§ ci: size of confidence intervals (float or 'sd'); default 95% CI
§ data: DataFrame

Bar plots (Cont.)
Average duration by route
• 95% confidence interval
• Standard deviation

Plotting the distribution of a dataset
• When analyzing your data, often the first thing you will want
to do is to get a sense for how a variable is distributed.
• You can do this by creating a:
§ Histogram (distplot)
§ Density plot (kdeplot)

distplot()
distplot(a=df['A'], bins=x, kde=False)
• Arguments:
§ a: observed data; numerical
§ bins: number of bins; integer
§ kde: whether to plot a Gaussian kernel density estimate; default
True

Histograms
• Histogram and KDE
• Only histogram

Histograms (Cont.)
• Specifying the number of bins
[Figure panel: default binning]

kdeplot()
kdeplot(data=df['A'], shade=True)
• Arguments:
§ data: observed data; numerical
§ shade: if True, area under the KDE curve is shaded

Density plots

[Figure: density plot with default settings]

References
• Seaborn tutorial: https://seaborn.pydata.org/tutorial.html
