5-Information & Visualization

Data Science
Visualization
Topics
• Today we will have an overview of:
• Introduction / Definition
• History
• Examples
• Workflow / Pipeline
• Software overview
• Hands-on exercises
• Resources
“Sci vis” versus “Info vis”
• Visualization: converting raw data to a form that is viewable and understandable to humans.
• Scientific visualization: specifically concerned with data that has a well-defined representation in 2D or 3D space (e.g., from a simulation mesh or scanner).
*Adapted from The ParaView Tutorial, Moreland
Introduction to Information Visualization - Fall 2013

Information visualization
• Information visualization: concerned with data that does not have a well-defined representation in 2D or 3D space (i.e., “abstract data”).

Pre-history
• Selected figures
– William Playfair (1821) – line, bar charts, etc.
– Charles Joseph Minard (1869) – Napoleon’s march, etc.
– Jacques Bertin (1967) – “semiology of graphics”
– John Tukey (1977) – “exploratory data analysis”
– Edward Tufte (1983) – statistical graphics standards/practices
• 1985 NSF Workshop on Scientific Visualization
• 1999: S. K. Card et al., Readings in Information Visualization: Using Vision to Think

Examples
• Network visualization (Vizster)

Examples
• Geo data mapping
• Demo

Examples
• Treemap
• Demo

Examples
• Circle chart
• Demo

Examples
• Population (“Trendalyzer”)
• Demo

Additional Examples
• NY Times words, words, numbers
• Visual Complexity (from book by Manuel Lima)
• 50 examples (from June 2009, somewhat dated)
• D3 Gallery

Visualization components
• Color
• Size
• Texture
• Proximity
• Annotation
• Interactivity
– Selection / Filtering
– Zoom
– Animation
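
Several of these components can be combined in a single plot. The following is a minimal sketch, assuming matplotlib is installed; the point names and numbers are invented for illustration and do not come from any data set in these slides. It maps one variable to color, one to marker size, and adds annotations.

```python
import matplotlib.pyplot as plt

# Hypothetical data, invented for this illustration only
names = ["A", "B", "C", "D"]
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.5, 1.5, 4.0]
weight = [10, 40, 25, 60]       # encoded as marker size
score = [0.1, 0.5, 0.7, 0.9]    # encoded as color

fig, ax = plt.subplots()
# Color and size: each one encodes an extra variable
points = ax.scatter(x, y, s=[w * 10 for w in weight], c=score,
                    cmap="viridis", alpha=0.7)
fig.colorbar(points, ax=ax, label="score")

# Annotation: label every point and call out the one with the largest weight
for name, xi, yi in zip(names, x, y):
    ax.annotate(name, (xi, yi), textcoords="offset points", xytext=(5, 5))
largest = weight.index(max(weight))
ax.annotate("largest weight", (x[largest], y[largest]),
            xytext=(2.0, 4.2), arrowprops={"arrowstyle": "->"})

ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()
```

Interactivity (selection, zoom, animation) depends on the plotting backend or a browser library and is left out of this sketch.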

Info vis workflow / pipeline
• Acquire
• Parse
• Filter
• Mine
• Represent
• Refine
• Interact
* Adapted from Fry, Visualizing Data

https://www.oreilly.com/library/view/visualizing-data/9780596514556/ch01.html
• Acquire
Obtain the data, whether from a file on a disk or a source over a network.
[p. 7, Fry, Visualizing Data]

• Parse
Provide some structure for the data’s meaning and order it into categories.

• Filter / Mine
Filter: Remove all but the data of interest.
Mine: Apply methods from statistics or data mining as a way to discern patterns or place the data in mathematical context.

• Represent
Choose a basic visual model, such as a bar graph, list, or tree.

• Refine
Improve the basic representation to make it clearer and more visually engaging.

• Interact
Add methods for manipulating the data or controlling what features are visible.
• Demo
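
The stages above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the inline CSV, its city names, and its numbers are made up, and the choice of a horizontal bar graph is one of many possible representations. It assumes matplotlib is installed.

```python
import csv
import io
import matplotlib.pyplot as plt

# Acquire: a small inline CSV stands in for a file or network source (made-up values)
raw = """city,region,population
Alpha,north,120
Beta,south,80
Gamma,north,200
Delta,south,
Epsilon,north,50
"""

# Parse: give the text structure (named fields, numeric types)
rows = []
for record in csv.DictReader(io.StringIO(raw)):
    pop = int(record["population"]) if record["population"] else None
    rows.append({"city": record["city"], "region": record["region"],
                 "population": pop})

# Filter: keep only the data of interest (northern cities with a known population)
north = [r for r in rows if r["region"] == "north" and r["population"] is not None]

# Mine: place the data in a mathematical context (share of the regional total)
total = sum(r["population"] for r in north)
for r in north:
    r["share"] = r["population"] / total

# Represent: choose a basic visual model (a horizontal bar graph)
north.sort(key=lambda r: r["population"])
fig, ax = plt.subplots()
bars = ax.barh([r["city"] for r in north], [r["population"] for r in north])

# Refine: a title, an axis label, and per-bar annotations make the graph clearer
ax.set_title("Population of northern cities")
ax.set_xlabel("Population")
for bar, r in zip(bars, north):
    ax.text(bar.get_width(), bar.get_y() + bar.get_height() / 2,
            f" {r['share']:.0%}", va="center")

# Interact: not shown; a widget or slider could drive the filter step
plt.show()
```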

Iris Sample Data Set
• Data visualization techniques can also be illustrated with the Iris Plant data set (more later).
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– Three flower types (classes):
• Setosa
• Virginica
• Versicolour
– Four (non-class) attributes
• Sepal width and length
• Petal width and length
2.1, 2.5, 1.4, 5.0 → Setosa (S)

1.0, 1.2, 3.0, 4.0 → Virginica (V)
1.0, 2.4, 5.0, 1.0 → Versicolour (R)
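
To make the structure of the data set concrete, here is a minimal sketch that builds a tiny table in the Iris layout (four numeric attributes plus a class label) and summarizes it per class. It assumes pandas is installed. The nine rows are meant to follow the first rows of each class in the commonly distributed file, but treat them as illustrative; the column names mirror those used by seaborn's copy of the data.

```python
import pandas as pd

# A handful of rows in the Iris layout (illustrative subset, three per class)
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
]
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris_sample = pd.DataFrame(rows, columns=columns)

# Mean of each attribute per class: petal size already separates the classes
means = iris_sample.groupby("species").mean()
print(means.round(2))
```

The full 150-row table can be loaded with `sns.load_dataset("iris")`, as the hands-on examples do.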
Visualization software
• Host language (C/C++/Java/Python) plus OpenGL

• Stat/math package with graphics
– R
– MATLAB
• Special-purpose info viz software
– Earth mapping, biological network visualization, etc.
• Browser-enabled graphics/info viz packages
– Google Charts
– Processing / Processing.js
– Java + Flash (becoming rarer)

Hands-on using Python
• Line chart;
A line chart is used for the representation of continuous data points.
This visual can be effectively utilized when we want to understand the
dependence between two variables.

• Line chart;
import pandas as pd # for dataframes
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns # for plotting graphs
# Creating the dataset: sum of sepal width for each distinct sepal length
df = sns.load_dataset("iris")
df = df.groupby('sepal_length')['sepal_width'].sum().to_frame().reset_index()
# Creating the line chart
plt.plot(df['sepal_length'], df['sepal_width'])
# Adding the aesthetics
plt.title('Chart title')
plt.xlabel('X label')
plt.ylabel('Y label')
# Show the chart
plt.show()


• Bar chart;
A bar chart is used when we want to compare metric values across
different subgroups of the data. If we have a greater number of groups,
a bar chart is preferred over a column chart.

• Bar chart;
import matplotlib.pyplot as plt
import seaborn as sns
# Creating the dataset: total fare paid by each passenger type
df = sns.load_dataset('titanic')
df = df.groupby('who')['fare'].sum().to_frame().reset_index()
# Creating the bar chart (horizontal bars)
plt.barh(df['who'], df['fare'], color=['#F0F8FF', '#E6E6FA', '#B0E0E6'])
# Adding the aesthetics
plt.title('Bar Chart Who vs. Fare')
plt.xlabel('Total Paid Fare')
plt.ylabel('Who (Man, Woman, Child)')
# Show the bar chart
plt.show()

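• Bar plot;
The bar plot comes from the seaborn library. It differs slightly from the
matplotlib bar chart: it takes a DataFrame and column names, and by default
it plots the mean of each group.

• Bar plot;
import matplotlib.pyplot as plt
import seaborn as sns
# Creating the dataset: total fare paid by each passenger type
df = sns.load_dataset('titanic')
df = df.groupby('who')['fare'].sum().to_frame().reset_index()
# Creating the bar plot
sns.barplot(x='fare', y='who', data=df, palette="Blues")
# Adding the aesthetics
plt.title('Bar Chart Who vs. Fare')
# Show the bar plot
plt.show()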
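• Column chart;
Column charts are mostly used when we need to compare a single
category of data between individual sub-items, for example, when
comparing total fare across different passenger types.

• Column chart;
import matplotlib.pyplot as plt
import seaborn as sns
# Creating the dataset: total fare paid by each passenger type
df = sns.load_dataset('titanic')
df = df.groupby('who')['fare'].sum().to_frame().reset_index()
# Creating the column chart (vertical bars)
plt.bar(df['who'], df['fare'], color=['#F0F8FF', '#E6E6FA', '#B0E0E6'])
# Adding the aesthetics
plt.title('Chart Who vs. Fare')
# Show the column chart
plt.show()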

• Grouped Column chart;
A grouped column chart is used when we want to compare the values in
certain groups and sub-groups (e.g., passenger type, ticket class).
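
• Grouped Column chart;
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Creating the dataset: the full table is needed because it holds the 'class' column
df = sns.load_dataset('titanic')
# Creating a grouped column chart: mean fare by passenger type and ticket class
df_pivot = pd.pivot_table(df, values="fare", index="who", columns="class", aggfunc="mean")
ax = df_pivot.plot(kind="bar", alpha=0.5)
# Adding the aesthetics
plt.title('Chart Who vs. Fare')
plt.xlabel('Who')
plt.ylabel('Mean Fare')
# Show the chart
plt.show()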


• Stacked column chart;
A stacked column chart is used when we want to compare the total
sizes across the available groups and the composition of the different
subgroups.

• Stacked column chart;
import pandas as pd
import matplotlib.pyplot as plt
# Creating the dataset: column A holds the group labels, B to D the subgroup values
df = pd.DataFrame(columns=["A", "B", "C", "D"],
                  data=[["E", 0, 1, 1],
                        ["F", 1, 1, 0],
                        ["G", 0, 1, 0]])
# Creating the stacked column chart
df.plot.bar(x='A', y=["B", "C", "D"], stacked=True, width=0.4, alpha=0.5)
# Show the chart
plt.show()

• Pie chart;
Pie charts can be used to identify proportions of the different
components as they are related to the whole.
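
The proportions a pie chart encodes can be computed directly. The sketch below reuses the car counts from the pie chart example in this section and converts each count to a share of the whole and to a slice angle; it is an illustration of the arithmetic, not of how matplotlib draws the chart internally.

```python
# Counts reused from the pie chart example in this section
cars = ['AUDI', 'BMW', 'NISSAN', 'TESLA', 'HYUNDAI', 'HONDA']
data = [20, 15, 15, 14, 16, 30]

total = sum(data)
shares = [value / total for value in data]

# Each slice's angle is its share of the whole, scaled to 360 degrees
for name, share in zip(cars, shares):
    print(f"{name:8s} {share:6.1%} {share * 360:6.1f} degrees")
```

In `plt.pie`, passing `autopct='%1.1f%%'` prints these percentages on the slices.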

• Pie chart;
import matplotlib.pyplot as plt
# Creating the dataset
cars = ['AUDI', 'BMW', 'NISSAN',
        'TESLA', 'HYUNDAI', 'HONDA']
data = [20, 15, 15, 14, 16, 30]
# Creating the pie chart
plt.pie(data, labels=cars, colors=['#F0F8FF', '#E6E6FA', '#B0E0E6', '#7B68EE',
                                   '#483D8B', '#0000FF'])
# Show the plot
plt.show()


• Area chart;
Area charts are used to track changes over time for one or more groups.
Area graphs are preferred over line charts when we want to capture the
changes over time for more than 1 group.
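
What a stacked area chart draws can be made explicit with a running total: each group's band sits on top of the sum of the groups below it. The sketch below assumes NumPy is installed and reuses the illustrative values from the area chart example in this section.

```python
import numpy as np

# Same illustrative values as the area chart example in this section
y = np.array([[1, 4, 6, 8, 19, 2],
              [2, 2, 7, 10, 12, 1],
              [2, 8, 5, 10, 6, 3]])  # 3 groups x 6 positions

# Upper boundary of each band = running total over the groups
upper = np.cumsum(y, axis=0)
# Lower boundary of each band = upper boundary of the band beneath it (0 for the first)
lower = np.vstack([np.zeros(y.shape[1], dtype=int), upper[:-1]])

# The top edge of the stack is the total across groups at each position
print(upper[-1])
```

The height of every band (`upper - lower`) equals the original values, which is why the chart shows both the total and each group's contribution.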

• Area chart;
import matplotlib.pyplot as plt
# Creating the dataset
x = range(1, 7) # positions 1 to 6, one per column of y
y = [[1, 4, 6, 8, 19, 2], [2, 2, 7, 10, 12, 1], [2, 8, 5, 10, 6, 3]] # 3 rows x 6 columns
# Creating the area chart
ax = plt.gca()
ax.stackplot(x, y, labels=['A', 'B', 'C'], alpha=0.5)
plt.legend(loc='upper left')
plt.xlabel('X axis title')
plt.ylabel('Y axis title')
# Show the plot
plt.show()


• Column histogram;
Column histograms are used to observe the distribution (frequency of
occurrence) of a single variable within a data set.
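
The counting behind a histogram is simple: the value range is split into bins and the values falling into each bin are counted. A minimal sketch, assuming NumPy is installed; the measurements are made up for illustration.

```python
import numpy as np

# Made-up measurements, for illustration only
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range [min, max] into 4 equal-width bins and count the values in each.
# Every bin includes its left edge; only the last bin also includes its right edge.
counts, edges = np.histogram(values, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {'#' * count}")
```

`ax.hist(values, bins=4)` draws the same counts as columns.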

• Column histogram;
import matplotlib.pyplot as plt
import seaborn as sns
# Reading the dataset
penguins = sns.load_dataset("penguins")
# Creating the column histogram with 10 bins (rows with missing values are dropped)
ax = plt.gca()
ax.hist(penguins['flipper_length_mm'].dropna(), color='green', alpha=0.5, bins=10,
        label='flipper_length_mm')
plt.legend(loc='upper left')
# Show the plot
plt.show()


• Line histogram;
Line histograms are used to observe the distribution (frequency of
occurrence) of two or more variables within a data set.
Kernel Density Estimate (KDE)

KDE is a way to estimate the probability density function (PDF) of the
random variable. KDE is a means of data smoothing.
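
To show what the estimate computes, here is a hand-written Gaussian KDE: one Gaussian bump is centered on every sample and the bumps are averaged. This is a minimal sketch, assuming NumPy is installed; the function name `gaussian_kde_estimate` and the bandwidth value are choices made for this illustration, and library implementations select the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_estimate(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centered on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

rng = np.random.default_rng(0)
samples = rng.normal(loc=10, scale=2, size=500)
grid = np.linspace(0, 20, 401)
density = gaussian_kde_estimate(samples, grid, bandwidth=0.6)  # bandwidth picked by hand

# A probability density integrates to 1 (approximated here by a Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve; a smaller one follows the individual samples more closely.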

• Line histogram;
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Creating the dataset: 1000 draws from a normal distribution
mean = 10
stdv = 2
dist = pd.DataFrame(np.random.normal(loc=mean, scale=stdv, size=(1000, 1)), columns=['rnd_data'])
print(dist.agg(['min', 'max', 'mean', 'std']).round(decimals=2))
fig, ax = plt.subplots() #more details later
# Creating the line histogram: KDE curve drawn over a normalized histogram
dist.plot.kde(ax=ax, legend=False, title='Histogram: rnd_data & KDE')
dist.plot.hist(density=True, ax=ax)
ax.set_ylabel('Probability density') #Y-Label
ax.grid(axis='y') #Y-Grid lines
# Show the plot
plt.show()
• Line histogram;
Printed summary (values vary from run to run because no random seed is set):
      rnd_data
min       3.57
max      16.97
mean     10.00
std       1.97


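• Line histogram (alternative way, using SciPy);
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Creating the dataset: 1000 draws from a standard normal distribution
df_1 = np.random.normal(0, 1, (1000, ))
# Estimating the density with a Gaussian KDE
density = stats.gaussian_kde(df_1)
# Creating the line histogram: x holds the bin edges, where the KDE is evaluated
n, x, _ = plt.hist(df_1, bins=np.linspace(-3, 3, 50), histtype='step', density=True)
plt.plot(x, density(x))
# Show the plot
plt.show()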


• Scatter plot;
Scatter plots can be used to identify relationships between two variables. It
can be effectively used in circumstances where the dependent variable can
have multiple values for a value of the independent variable.
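
The strength of a linear relationship seen in a scatter plot can be summarized with the Pearson correlation coefficient. A minimal sketch, assuming NumPy is installed; the data are synthetic (a hypothetical bill amount and a tip that depends on it plus noise), not the `tips` data set.

```python
import numpy as np

rng = np.random.default_rng(1)
bill = rng.uniform(10, 50, size=200)               # hypothetical independent variable
tip = 0.15 * bill + rng.normal(0, 1.0, size=200)   # dependent variable with noise

# Pearson r lies between -1 and 1; values near 1 mean a strong positive linear relationship
r = np.corrcoef(bill, tip)[0, 1]
print(f"Pearson r = {r:.2f}")
```

The coefficient captures only linear dependence, so the scatter plot itself remains the first thing to inspect.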

• Scatter plot;
import matplotlib.pyplot as plt
import seaborn as sns
# Reading the dataset
df = sns.load_dataset("tips")
# Creating the scatter plot
plt.scatter(df['total_bill'], df['tip'], alpha=0.5)
# Show the plot
plt.show()


• Bubble chart;
They are like scatter plots, but used to depict and show relationships among
three variables. The third variable is represented via the bubble area.

• Bubble chart;
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Creating the dataset: random positions and bubble sizes
np.random.seed(42)
N = 100
x = np.random.normal(170, 20, N)
y = x + np.random.normal(5, 25, N)
colors = np.random.rand(N)
area = (25 * np.random.rand(N))**2
df = pd.DataFrame({
    'X': x,
    'Y': y,
    'Colors': colors,
    "bubble_size": area})
# Creating the bubble chart: the s argument sets each marker's area
plt.scatter('X', 'Y', s='bubble_size', alpha=0.5, data=df)
# Show the plot
plt.show()

• Box chart;
The box chart is used to show the range of the distribution, its central value,
and its variability.
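
The numbers a box plot displays can be computed by hand. The sketch below assumes NumPy is installed and uses a made-up sample with one extreme value. It follows matplotlib's default whisker rule (`whis=1.5`), under which the whiskers reach the most extreme data points within 1.5 times the interquartile range of the box.

```python
import numpy as np

# Made-up sample with one extreme value
values = np.array([1, 2, 2, 3, 4, 5, 5, 6, 7, 20])

# The box spans the first to third quartile; the line inside is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR rule; points beyond them count as outliers
low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = values[(values >= low_limit) & (values <= high_limit)]
outliers = values[(values < low_limit) | (values > high_limit)]

print(q1, median, q3, inside.min(), inside.max(), outliers)
```

Passing `sym=''` to `plt.boxplot` hides the outlier markers; they are still excluded from the whiskers.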
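
• Box chart;
import numpy as np
import matplotlib.pyplot as plt
# Creating the dataset: two series, each with three groups
df_1 = [[1, 2, 5], [5, 7, 2, 2, 5], [7, 2, 5]]
df_2 = [[6, 4, 2], [1, 2, 5, 3, 2], [2, 3, 5, 1]]
ticks = ['A', 'B', 'C']
# Creating the box plot: the two series sit side by side at each tick
plt.figure()
bpl = plt.boxplot(df_1, positions=np.array(range(len(df_1)))*2.0-0.4, sym='', widths=0.6)
bpr = plt.boxplot(df_2, positions=np.array(range(len(df_2)))*2.0+0.4, sym='', widths=0.6)
# Coloring each series so that it matches its legend entry
for part in ['boxes', 'whiskers', 'caps', 'medians']:
    plt.setp(bpl[part], color='#D7191C')
    plt.setp(bpr[part], color='#2C7BB6')
# Empty lines serve only to create the legend entries
plt.plot([], c='#D7191C', label='Label 1')
plt.plot([], c='#2C7BB6', label='Label 2')
plt.legend()
plt.xticks(range(0, len(ticks) * 2, 2), ticks)
plt.xlim(-2, len(ticks)*2)
plt.ylim(0, 8)
plt.tight_layout()
# Show the plot
plt.show()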

• Venn Diagram;
Venn diagrams are used to see the relationships between two or three sets of
items. They highlight the similarities and differences.
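
The region sizes of a three-set Venn diagram can be derived from ordinary Python sets. The sketch below uses three made-up sets and lists the seven regions in the order that `venn3` from matplotlib_venn expects for its `subsets` tuple: (Abc, aBc, ABc, abC, AbC, aBC, ABC), where a capital letter means "inside that set".

```python
# Three made-up sets, for illustration only
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7}
C = {5, 7, 8, 9}

subsets = (
    len(A - B - C),     # Abc: only in A
    len(B - A - C),     # aBc: only in B
    len((A & B) - C),   # ABc: in A and B, not in C
    len(C - A - B),     # abC: only in C
    len((A & C) - B),   # AbC: in A and C, not in B
    len((B & C) - A),   # aBC: in B and C, not in A
    len(A & B & C),     # ABC: in all three
)
print(subsets)
```

The seven regions are disjoint, so their sizes add up to the size of the union. The resulting tuple can be passed as `venn3(subsets=subsets)`.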
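
• Venn Diagram;
import matplotlib.pyplot as plt
from matplotlib_venn import venn3
# Making the Venn diagram; subset sizes are given in the order
# (Abc, aBc, ABc, abC, AbC, aBC, ABC)
venn3(subsets=(10, 8, 22, 6, 9, 4, 2))
# Show the plot
plt.show()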

• Tree Maps;
Tree Maps are primarily used to display data that is grouped and nested in a
hierarchical structure and observe the contribution of each component.
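
• Tree Maps;
import matplotlib.pyplot as plt
import squarify
# Creating the dataset: one rectangle per value, with area proportional to the value
sizes = [40, 30, 5, 25, 10]
# Creating the tree map
squarify.plot(sizes)
# Show the plot
plt.show()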

• Sub plots;
Subplots place several plots in one figure, which makes comparisons between
them easier.

• Sub plots;
import matplotlib.pyplot as plt
import seaborn as sns
# Creating the dataset: sum of sepal width for each distinct sepal length
df = sns.load_dataset("iris")
df = df.groupby('sepal_length')['sepal_width'].sum().to_frame().reset_index()
# Creating a 2 x 2 grid of subplots and drawing into the top-left panel
fig, axes = plt.subplots(nrows=2, ncols=2)
ax = df.plot('sepal_length', 'sepal_width', ax=axes[0, 0])
ax.get_legend().remove()
ax.set_title('Chart title')
ax.set_xlabel('X axis title')
ax.set_ylabel('Y axis title')
# Show the plot
plt.show()

Resources
⚫ Books
– Visual Complexity, Mapping Patterns of Information , Manuel Lima
– The Visual Display of Quantitative Information, Edward Tufte
– Information Visualization: Beyond the Horizon, Chaomei Chen
– JavaScript: The Definitive Guide, David Flanagan
– Getting Started with D3, Mike Dewar
– Visualizing Data, Ben Fry
– Interactive Data Visualization for the Web, Scott Murray

Resources
⚫ Web sites
– http://processingjs.org/
– http://d3js.org/, https://github.com/mbostock/d3/wiki/API-Reference
– http://code.google.com/apis/ajax/playground/
– http://www.edwardtufte.com/tufte/
– http://www.visualcomplexity.com/
– http://www.webdesignerdepot.com/2009/06/50-great-examples-of-data-visualization/
– http://fellinlovewithdata.com/
– http://infosthetics.com/
– http://visual.ly/

Conclusion
• We have covered an array of plotting libraries and chart types. Each can be used to its full potential only when the use case and the requirements are understood.
• The syntax and the semantics vary from package to package, so it is essential to understand the challenges and advantages of the different libraries.
Assignment 5
• Download a data set
• Perform the following visualization plots on it:
– Pie chart
– Line histogram
– Column histogram
– Bubble chart
– Subplots
– Box plot, at least two combined
