
PHUONG NGUYEN

DATA VISUALIZATION
A PICTURE IS WORTH A THOUSAND WORDS
https://www.gapminder.org/tools/

http://bit.ly/factfulness-rosling

Wealth of stock market tycoons 2010 – 2019

Most Popular Programming Languages 1999 – 2019

CONTENT
1. INTRODUCTION TO DATA VISUALIZATION
WHAT AND WHY

2. BASIC PLOTS
LINE PLOTS, BAR CHARTS, SCATTER PLOTS, BOXPLOTS, HISTOGRAMS, HEATMAPS…

3. MULTIDIMENSIONAL VISUALIZATION
ADDING VARIABLES, MANIPULATIONS…

4. FROM VISUALIZATION TO BUSINESS INTELLIGENCE
BUSINESS INTELLIGENCE WITH TABLEAU

5. DATA VISUALIZATION FOR MACHINE LEARNING TASKS
SUPERVISED LEARNING AND UNSUPERVISED LEARNING

INTRODUCTION TO DATA VISUALIZATION
Possible goals in visualization:

1. Exploration and preliminary analysis
2. Presentation (reporting & storytelling)

PYTHON LIBRARIES
FOR DATA VISUALIZATION
▪ matplotlib - oldest and most flexible
▪ https://matplotlib.org
▪ seaborn, pandas - plotting wrappers around matplotlib that help
create plots quickly; knowledge of matplotlib
allows better control over the final plot
▪ https://seaborn.pydata.org
▪ https://pandas.pydata.org

https://github.com/nnbphuong/datascience4biz/blob/master/Exploration_using_Visualization_Plots.ipynb

BASIC PLOTS
▪ LINE PLOTS
▪ BAR CHARTS
▪ BOXPLOTS
▪ HISTOGRAMS
▪ SCATTER PLOTS

AMTRAK RIDERSHIP
▪ Amtrak: a US railway company
▪ Dataset: the series of monthly ridership between January 1991 and March 2004
# Load and convert the Amtrak data for time series analysis
import pandas as pd

Amtrak_df = pd.read_csv('Amtrak.csv')
Amtrak_df['Date'] = pd.to_datetime(Amtrak_df.Month, format='%d/%m/%Y')
ridership_ts = pd.Series(Amtrak_df.Ridership.values, index=Amtrak_df.Date)

LINE PLOT FOR TIME SERIES

ALTERNATIVE CODE FOR LINE PLOT

import matplotlib.pyplot as plt

# Option 1: pandas plotting interface
ridership_ts.plot(ylim=[1300, 2300], legend=False)
plt.xlabel('Year') # set x-axis label
plt.ylabel('Ridership (in 000s)') # set y-axis label

# Option 2: matplotlib directly
plt.plot(ridership_ts.index, ridership_ts)
plt.xlabel('Year') # set x-axis label
plt.ylabel('Ridership (in 000s)') # set y-axis label

BOSTON HOUSING

## Boston housing data
housing_df = pd.read_csv('BostonHousing.csv')
housing_df = housing_df.rename(columns={'CAT. MEDV': 'CAT_MEDV'})

BAR CHART FOR CATEGORICAL VARIABLE

Average median neighborhood value for neighborhoods that do and do not border the Charles River

ALTERNATIVE CODE FOR BAR CHART

# Option 1: pandas plotting interface
# Compute mean MEDV per CHAS = (0, 1)
ax = housing_df.groupby('CHAS').MEDV.mean().plot(kind='bar')
ax.set_ylabel('Avg. MEDV')

# Option 2: matplotlib directly
# Compute mean MEDV per CHAS = (0, 1)
dataForPlot = housing_df.groupby('CHAS').MEDV.mean()
fig, ax = plt.subplots()
ax.bar(dataForPlot.index, dataForPlot, color=['C5', 'C1'])
ax.set_xticks([0, 1], minor=False)
ax.set_xlabel('CHAS')
ax.set_ylabel('Avg. MEDV')
BOXPLOT
▪ Top outliers are defined as those above Q3 + 1.5 × (Q3 - Q1)
▪ “max” = maximum of non-outliers
▪ Analogous definitions for bottom outliers and for “min”
▪ Details may differ across software
[Figure: annotated boxplot showing, from top to bottom: outliers, “max”, Quartile 3, Median, Quartile 1, “min”]
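
The outlier rule above can be checked numerically. The following is a minimal sketch, assuming NumPy's default (linear interpolation) quartiles; other software may compute quartiles differently. The function name `boxplot_stats` and the sample values are hypothetical and are not part of the original slides.

```python
import numpy as np

def boxplot_stats(values):
    """Return quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    x = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr  # values above this are top outliers
    lower_fence = q1 - 1.5 * iqr  # values below this are bottom outliers
    non_outliers = x[(x >= lower_fence) & (x <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "min": non_outliers.min(),  # "min" = minimum of non-outliers
        "max": non_outliers.max(),  # "max" = maximum of non-outliers
        "outliers": x[(x < lower_fence) | (x > upper_fence)],
    }

# Hypothetical sample with one unusually large value
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

In this sample the value 30 lies above the upper fence, so the top whisker ("max") stops at the largest remaining value.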
SIDE-BY-SIDE BOXPLOT
Side-by-side boxplots are useful for comparing subgroups.
Houses in neighborhoods on the Charles River (1) tend to be more valuable than
those that are not (0).

ax = housing_df.boxplot(column='MEDV', by='CHAS')
ax.set_ylabel('MEDV')
plt.suptitle('') # Suppress the titles
plt.title('')
HISTOGRAM

## histogram of MEDV
ax = housing_df.MEDV.hist()
ax.set_xlabel('MEDV'); ax.set_ylabel('count')
SCATTER PLOT

MEDV vs. LSTAT


ALTERNATIVE CODE FOR SCATTER PLOT

# Option 1: pandas scatter plot; axis labels are taken from the column names
housing_df.plot.scatter(x='LSTAT', y='MEDV', legend=False)

# Option 2: matplotlib; set the color of points and draw them as open circles
plt.scatter(housing_df.LSTAT, housing_df.MEDV, color='C3', facecolor='none')
plt.xlabel('LSTAT'); plt.ylabel('MEDV')

MULTIDIMENSIONAL VISUALIZATION
▪ ADDING VARIABLES
▪ Scatter plot with color added
▪ Panel of bar plots
▪ Scatter plot matrix
▪ Heatmap to highlight correlations
▪ Parallel coordinates plot…
▪ MANIPULATION: Rescaling, Aggregation,
Zooming, Filtering…
SCATTER PLOT WITH COLOR ADDED

NOX vs. LSTAT
Legend: low median value, high median value
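
A scatter plot with color added could be produced along the following lines. This is a sketch only: it assumes the color encodes the `CAT_MEDV` column, and it uses a small synthetic DataFrame (`demo_df`, hypothetical) in place of the real Boston housing data so that it runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Boston housing columns (not the real data)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "LSTAT": rng.uniform(2, 35, 100),
    "NOX": rng.uniform(0.4, 0.9, 100),
    "CAT_MEDV": rng.integers(0, 2, 100),
})

fig, ax = plt.subplots()
# One scatter layer per category, so each gets its own color and legend entry
categories = [(0, "low median value", "C0"), (1, "high median value", "C1")]
for value, label, color in categories:
    subset = demo_df[demo_df.CAT_MEDV == value]
    ax.scatter(subset.LSTAT, subset.NOX, color=color, label=label)
ax.set_xlabel("LSTAT")
ax.set_ylabel("NOX")
ax.legend()
```

With the real data, `demo_df` would be replaced by `housing_df`.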

PANEL OF BAR PLOTS

SCATTER PLOT MATRIX
Each diagonal plot is the frequency distribution of the corresponding variable.
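
One way to draw such a matrix is `pandas.plotting.scatter_matrix`, which puts a histogram of each variable on the diagonal. The sketch below assumes that function is an acceptable substitute for whatever produced the slide's figure, and it runs on synthetic data (`demo_df`, hypothetical), not the real Boston housing file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for three housing variables (not the real data)
rng = np.random.default_rng(0)
lstat = rng.uniform(2, 35, 100)
demo_df = pd.DataFrame({
    "LSTAT": lstat,
    "MEDV": 35 - 0.8 * lstat + rng.normal(0, 3, 100),
    "NOX": rng.uniform(0.4, 0.9, 100),
})

# Off-diagonal cells are pairwise scatter plots; diagonal cells are histograms
axes = scatter_matrix(demo_df, diagonal="hist", figsize=(6, 6))
```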

HEATMAP TO HIGHLIGHT CORRELATIONS
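
A correlation heatmap can be sketched with plain matplotlib as follows. This is an illustration under stated assumptions: the data are synthetic (`demo_df`, hypothetical), and `imshow` is used here only to avoid extra dependencies; the slide's figure may have been made differently, for example with seaborn.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: MEDV is built to be negatively related to LSTAT
rng = np.random.default_rng(0)
lstat = rng.uniform(2, 35, 100)
demo_df = pd.DataFrame({
    "LSTAT": lstat,
    "MEDV": 35 - 0.8 * lstat + rng.normal(0, 3, 100),
    "NOX": rng.uniform(0.4, 0.9, 100),
})

corr = demo_df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
# Diverging colormap centered at 0 so positive and negative values differ in hue
image = ax.imshow(corr.to_numpy(), cmap="RdBu", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
# Annotate each cell with its correlation value
for i in range(len(corr)):
    for j in range(len(corr)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(image, ax=ax, label="correlation")
```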

PARALLEL COORDINATE PLOT

All variables are rescaled to a 0-1 scale. Each line is a single record.
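
The rescaling and plotting steps might look like the sketch below. It assumes min-max scaling for the 0-1 rescale and coloring by `CAT_MEDV`; the data are synthetic (`demo_df`, hypothetical), and the rule `MEDV > 30` used to build the class column is an assumption for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

# Synthetic stand-in for the housing data (not the real file)
rng = np.random.default_rng(0)
lstat = rng.uniform(2, 35, 100)
demo_df = pd.DataFrame({
    "LSTAT": lstat,
    "NOX": rng.uniform(0.4, 0.9, 100),
    "MEDV": 35 - 0.8 * lstat + rng.normal(0, 3, 100),
})
demo_df["CAT_MEDV"] = (demo_df.MEDV > 30).astype(int)  # assumed class rule

# Min-max rescale every numerical variable to the 0-1 range
numeric_cols = ["LSTAT", "NOX", "MEDV"]
col_min = demo_df[numeric_cols].min()
col_max = demo_df[numeric_cols].max()
scaled = (demo_df[numeric_cols] - col_min) / (col_max - col_min)
scaled["CAT_MEDV"] = demo_df["CAT_MEDV"]

# Each record becomes one line across the variable axes, colored by class
ax = parallel_coordinates(scaled, class_column="CAT_MEDV",
                          color=["tab:blue", "tab:orange"])
```

Without the rescaling step, variables with large ranges would dominate the vertical axis and flatten the others.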

BAR CHART RACE

RESCALING

AGGREGATION AND ZOOMING

FROM VISUALIZATION
TO BUSINESS INTELLIGENCE
▪ The ability to interact with plots and link them
together turns plotting into an analytical tool
that supports continuous exploration of the
data. Several commercial visualization tools
provide powerful capabilities along these lines.
▪ Tableau has spent hundreds of millions of
dollars on software R&D and review of
interactions with customers to hone interfaces
that allow analysts to interact with data via plots
smoothly and efficiently.

Magic Quadrant for Analytics and Business Intelligence Platforms
Source: Gartner (2019)

Magic Quadrant for Analytics and Business Intelligence Platforms
Source: Gartner (2020)

Tableau on Google Trends
https://trends.google.com/trends/explore?date=all&geo=US&q=Tableau,Qlik,Power%20BI

WHAT IS TABLEAU?
▪ Tableau is one of the fastest-growing data
visualization tools currently used in
the Business Intelligence industry.
▪ It transforms raw data into an easily
understandable format and requires little or no
technical skill or coding knowledge.
▪ Tableau allows you to accomplish numerous tasks,
including: Data connection, integration, and
preparation; Data exploration; Data visualization;
Data analysis; Data storytelling

https://www.cnbc.com/2020/01/23/this-map-shows-the-latest-spread-of-the-coronavirus.html

TABLEAU PRODUCT LINE

▪ The decision on which Tableau product to
download comes down to four key attributes:
▪ Connectivity: What data sources do you need to
access?
▪ Distribution: Who do you want to see your
dashboard and how will you share it with them?
▪ Automation: Do you need your work to update
automatically on a refresh schedule?
▪ Security: Do you require an on-premise level of
security or can your work be saved in the cloud?

TABLEAU VS. EXCEL
Tableau compared with MS Excel, point by point:

▪ Tableau: a data visualization tool that provides pictorial and graphical representations of data.
▪ MS Excel: a spreadsheet for working with data in rows and columns. You first need to put your data into a tabular format, and then you can apply visualizations on top of it.

▪ Tableau: you can gain insights that you never thought possible. You can play around with interactive visualizations, use drill-down tools, and explore the available data, without knowing beforehand which insight you are looking for.
▪ MS Excel: you need prior knowledge of the insight that you want, and then you work with various formulae and the necessary tabulation to get there.

▪ Tableau: it is all about an easy and interactive approach.
▪ MS Excel: you need some programming to produce real-time data visualization.

TABLEAU TERMS
▪ Like Microsoft Excel, Tableau’s format for storing
data on your disk drive is in a workbook, with a
.twb or .twbx file extension.
▪ Tableau breaks the contents of workbooks into
three types of objects:
▪ A worksheet contains a single chart.
▪ A dashboard combines two or more charts into a
single physical screen.
▪ A story combines two or more worksheets or
dashboards into a step-by-step guided analytic.

TABLEAU DATA TYPES
▪ Tableau assigns a data type to each field automatically
and marks it with an icon. Tableau supports the following data types:
▪ Text values → abc
▪ Date values → calendar
▪ Date and time values → calendar with a clock
▪ Numerical values → #
▪ Geographic values (latitude and longitude used for maps) → small globe
▪ Boolean values (true/false conditions) → T/F

DIMENSIONS VS MEASURES IN TABLEAU

The fields from the data source, visible in the Data
pane, are divided into measures and dimensions.

▪ Measures: Measures are values that are aggregated
(Tableau supports many different aggregation types
including: Sum, Average, Median, Percentile,
Count/Count Distinct, Minimum/Maximum, Standard
Deviation/Variance).
▪ Dimensions: Dimensions are values that determine the
level of detail at which measures are aggregated. You
can think of them as slicing the measures or creating
groups into which the measures fit. The combination of
dimensions used in the view defines the view's basic
level of detail.
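
The relationship between dimensions and measures can be illustrated outside Tableau with a pandas analogy. This is only a sketch of the concept, not how Tableau works internally; the `sales` table and its column names are hypothetical. Grouping columns play the role of dimensions, and the aggregated column plays the role of a measure.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "West"],
    "Category": ["Chairs", "Desks", "Chairs", "Chairs", "Desks"],
    "Sales": [100, 250, 80, 120, 300],
})

# One dimension (Region): the measure Sales is summed per region
by_region = sales.groupby("Region")["Sales"].sum()

# Two dimensions (Region, Category): a finer level of detail for the same measure
by_region_category = sales.groupby(["Region", "Category"])["Sales"].sum()

print(by_region)
print(by_region_category)
```

Adding a second dimension splits each regional total into per-category totals, which is the sense in which dimensions define the level of detail.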

DATA VISUALIZATION
FOR MACHINE LEARNING
SUPERVISED LEARNING: PREDICTION
▪ Plot outcome on the y-axis of boxplots, bar
charts, and scatter plots.
▪ Study relation of outcome to categorical
predictors via side-by-side boxplots, bar
charts, and multiple panels.
▪ Study relation of outcome to numerical
predictors via scatter plots.
▪ Use distribution plots (boxplot, histogram) for
determining needed transformations of the
outcome variable (and/or numerical predictors).
▪ Examine scatter plots with added color/panels/size
to determine the need for interaction terms.
▪ Use various aggregation levels and zooming to
determine areas of the data with different behavior,
and to evaluate the level of global vs. local
patterns.

SUPERVISED LEARNING: CLASSIFICATION
▪ Study relation of outcome to categorical predictors using
bar charts with the outcome on the y-axis.
▪ Study relation of outcome to pairs of numerical predictors
via color-coded scatter plots (color denotes the outcome).
▪ Study relation of outcome to numerical predictors via
side-by-side boxplots: Plot boxplots of a numerical
variable by outcome. Create similar displays for each
numerical predictor. The most separable boxes indicate
potentially useful predictors.

▪ Use color to represent the outcome variable on a parallel
coordinate plot.
▪ Use distribution plots (boxplot, histogram) for
determining needed transformations of numerical
predictor variables.
▪ Examine scatter plots with added color/panels/size to
determine the need for interaction terms.
▪ Use various aggregation levels and zooming to determine
areas of the data with different behavior, and to evaluate
the level of global vs. local patterns.
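
The side-by-side boxplots by outcome described above can be sketched as a loop over the numerical predictors. The data here are synthetic (`demo_df`, hypothetical), built so that `LSTAT` separates the two classes and `NOX` does not; `CAT_MEDV` is assumed to be the outcome.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: LSTAT differs by class, NOX does not
rng = np.random.default_rng(0)
outcome = np.repeat([0, 1], 50)
demo_df = pd.DataFrame({
    "CAT_MEDV": outcome,
    "LSTAT": np.where(outcome == 1,
                      rng.normal(7, 3, 100), rng.normal(20, 3, 100)),
    "NOX": rng.normal(0.55, 0.1, 100),
})

predictors = ["LSTAT", "NOX"]
fig, axes = plt.subplots(nrows=1, ncols=len(predictors), figsize=(8, 4))
# One panel per predictor, each with one box per outcome class
for ax, predictor in zip(axes, predictors):
    demo_df.boxplot(column=predictor, by="CAT_MEDV", ax=ax)
    ax.set_ylabel(predictor)
fig.suptitle("")  # suppress the automatic "grouped by" title

# Class medians give a rough numerical view of how separated the boxes are
medians = demo_df.groupby("CAT_MEDV")[predictors].median()
print(medians)
```

In this synthetic example the `LSTAT` boxes barely overlap, which is the pattern that would flag a potentially useful predictor.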

SUPERVISED LEARNING: TIME SERIES FORECASTING
▪ Create line graphs at different temporal aggregations to
determine types of patterns.
▪ Use zooming and panning to examine various shorter periods
of the series to determine areas of the data with different
behavior.
▪ Use various aggregation levels to identify global and local
patterns.
▪ Identify missing values in the series (that will require handling).
▪ Overlay trend lines of different types to determine adequate
modeling choices.
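
Several of the steps above (temporal aggregation, checking for missing values, overlaying trend lines) can be sketched together. The series below is synthetic (`demo_ts`, hypothetical) and only mimics a monthly series over the January 1991 to March 2004 range; polynomial fits via `numpy.polyfit` are one possible choice of trend line, used here as an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic monthly series: upward trend plus yearly seasonality plus noise
dates = pd.date_range("1991-01-01", "2004-03-01", freq="MS")
t = np.arange(len(dates))
rng = np.random.default_rng(1)
values = 1800 + 2.0 * t + 150 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 20, len(t))
demo_ts = pd.Series(values, index=dates)

# Identify missing values that would need handling
n_missing = int(demo_ts.isna().sum())

# Coarser temporal aggregation: yearly means smooth out the seasonality
annual_mean = demo_ts.groupby(demo_ts.index.year).mean()

# Fit linear and quadratic trends on the time index
linear_coefs = np.polyfit(t, demo_ts.to_numpy(), 1)
quadratic_coefs = np.polyfit(t, demo_ts.to_numpy(), 2)

fig, ax = plt.subplots()
ax.plot(demo_ts.index, demo_ts.to_numpy(), label="monthly")
ax.plot(demo_ts.index, np.polyval(linear_coefs, t), label="linear trend")
ax.plot(demo_ts.index, np.polyval(quadratic_coefs, t), label="quadratic trend")
ax.set_xlabel("Year")
ax.legend()
```

Comparing the monthly line with `annual_mean` shows which patterns are seasonal (local) and which persist across years (global).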

UNSUPERVISED LEARNING
▪ Create scatter plot matrices to identify pairwise
relationships and clustering of observations.
▪ Use heatmaps to examine the correlation table.
▪ Use various aggregation levels and zooming to
determine areas of the data with different
behavior.
▪ Generate a parallel coordinates plot to identify
clusters of observations.