Chapter 3 - Data Visualization Chapter 4 - Summary Statistics

Chapter 3 – Data Visualization
Chapter 4 – Summary Statistics
Data Mining for Business Intelligence

Shmueli, Patel & Bruce
© Galit Shmueli and Peter Bruce 2010

Data Visualization
• “A picture is worth a thousand words”
• Data visualization and summary statistics help condense
data
• Effective presentation
• Supports data cleaning (identify missing values, outliers,
incorrect values, duplicates) and exploring (combine some
groups)
• Helps identify suitable variables
• Mandatory initial step for most data mining applications
Graphs for Data Exploration
Basic Plots: Line Graphs, Bar Charts, Scatterplots
Distribution Plots: Boxplots, Histograms
Two Examples
Amtrak Ridership:
Amtrak routinely collects data on ridership
Goal: To predict future ridership using the series of monthly ridership data between Jan 1991 – March 2004
Boston Housing Data:
Census tracts in Boston
Several variables (14) – crime rate, location, etc.
Goal 1: Predict median value of a home in the tract
Goal 2: Cluster census tracts
Line Graph for Time Series
Shows how ridership patterns of Amtrak trains change over time

Bar Chart for Categorical Variable
Determine differences between subgroups
Example: about 93% of tracts (471 of 506) do not border the Charles River
Scatterplot
Displays relationship between two numerical variables – median values
decrease as percentage of low status population increases
Graphs
• Three most effective plots:
• Bar charts – usually for categorical variables
• Line graphs – time series data
• Scatterplots – relationship between 2 variables
• Used widely in the business world
• Domain knowledge and the nature of the task are used to select the appropriate chart for the data at hand
Distribution Plots
• Display entire distribution of a numerical variable
• Display “how many” of each value occur in a data set or, for continuous data or data with many possible values, “how many” values are in each of a series of ranges or “bins”
• Generally useful for prediction tasks (supervised learning) and help determine the potential methods and variable transformations
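The “bins” idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name bin_counts, the equal-width binning rule, and the sample values are hypothetical and do not come from the Boston Housing data.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so they all land in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin, not in a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical sample values, for illustration only.
sample = [5, 7, 12, 15, 18, 21, 22, 24, 33, 50]
counts = bin_counts(sample, 3)
print(counts)
```

A histogram draws one bar per bin with height equal to these counts. Plotting software may choose bin edges differently.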
Histograms
Boston Housing example:
Histogram shows the distribution of the outcome variable (median house value)
Boxplots
Side-by-side boxplots are useful for comparing subgroups
Boston Housing Example:
Display distribution of outcome variable (MEDV) for neighborhoods on the Charles River (1) and not on the Charles River (0)
Box Plot
Top outliers are defined as those above Q3 + 1.5(Q3 - Q1)
“max” = maximum of non-outliers
Analogous definitions apply for bottom outliers and for “min”
Details may differ across software
Figure labels: outliers, “max”, Quartile 3, mean, Median, Quartile 1, “min”
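The outlier rule above can be sketched with the Python standard library. The function name boxplot_summary, the sample values, and the quartile method ("inclusive") are assumptions. Other software may compute quartiles slightly differently.

```python
import statistics


def boxplot_summary(values):
    """Apply the Q3 + 1.5(Q3 - Q1) rule and its mirror image for the bottom."""
    # Quartile method is an assumption; details may differ across software.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr
    lower_fence = q1 - 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(inside),  # maximum of non-outliers
        "min": min(inside),  # minimum of non-outliers
        "top_outliers": sorted(v for v in values if v > upper_fence),
        "bottom_outliers": sorted(v for v in values if v < lower_fence),
    }


# Hypothetical sample with one unusually large value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
summary = boxplot_summary(sample)
print(summary)
```

In this sample, the value 40 lies above the upper fence, so it is reported as a top outlier and “max” is 9.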
Heat Maps
• Basic charts and distribution plots can display a maximum of 2 variables and cannot represent high-dimensional data
• In data mining, data are often multi-dimensional
• Heat maps are graphical displays where color is used to convey information
• Used to visualize correlation and missing data
Heat maps
• Correlation table for p variables has p rows and p columns
• Data table has p columns (variables) and n rows (records)
• If n is large, a subset can be used
• Easier and faster to scan the color coding rather than the values
• Useful when examining a large number of values, but bar charts and plots should be used for precise graphical representations
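A correlation table with p rows and p columns can be sketched in plain Python as follows. The column names A, B, C, their values, and the text-based shade function that stands in for color coding are all hypothetical, chosen only to illustrate the idea.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_table(columns):
    """Return a p-by-p nested dict of pairwise correlations."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


def shade(r):
    """Map |r| to a darker character, a text stand-in for heat map color."""
    return " .:*#"[min(int(abs(r) * 5), 4)]


# Hypothetical data: B rises with A, C falls as A rises.
data = {"A": [1, 2, 3, 4, 5], "B": [2, 4, 6, 8, 10], "C": [5, 4, 3, 2, 1]}
table = correlation_table(data)
for row in table:
    print(row, "".join(shade(table[row][col]) for col in table))
```

Scanning the shading is quicker than reading the numbers, which is the point of the heat map.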
Heatmap to highlight correlations (Boston Housing)
Shown in Excel (using conditional formatting) and in Spotfire

            CRIM      ZN   INDUS    CHAS     NOX      RM     AGE     DIS     RAD     TAX PTRATIO       B   LSTAT    MEDV
CRIM        1.00
ZN         -0.20    1.00
INDUS       0.41   -0.53    1.00
CHAS       -0.06   -0.04    0.06    1.00
NOX         0.42   -0.52    0.76    0.09    1.00
RM         -0.22    0.31   -0.39    0.09   -0.30    1.00
AGE         0.35   -0.57    0.64    0.09    0.73   -0.24    1.00
DIS        -0.38    0.66   -0.71   -0.10   -0.77    0.21   -0.75    1.00
RAD         0.63   -0.31    0.60   -0.01    0.61   -0.21    0.46   -0.49    1.00
TAX         0.58   -0.31    0.72   -0.04    0.67   -0.29    0.51   -0.53    0.91    1.00
PTRATIO     0.29   -0.39    0.38   -0.12    0.19   -0.36    0.26   -0.23    0.46    0.46    1.00
B          -0.39    0.18   -0.36    0.05   -0.38    0.13   -0.27    0.29   -0.44   -0.44   -0.18    1.00
LSTAT       0.46   -0.41    0.60   -0.05    0.59   -0.61    0.60   -0.50    0.49    0.54    0.37   -0.37    1.00
MEDV       -0.39    0.36   -0.48    0.18   -0.43    0.70   -0.38    0.25   -0.38   -0.47   -0.51    0.33   -0.74    1.00
Multidimensional Visualization
Adding variables
• To add more variables to a plot:
• Categorical: hue, shape, multiple panels
• Numerical: color intensity
• Incorporating more variables has advantages
• Use for both classification and prediction tasks
• Helps identify the need for interaction terms
Scatterplot with color added
Boston Housing
NOX vs. LSTAT

Red = low median value
Blue = high median value
Data Manipulations
Important step in pre-processing of data
Includes – variable transformations, deriving new
variables (binning, condensing categories)
Common methods:
Rescaling – can often enhance the plot and illuminate
relationships
Aggregation – temporal scale: by granularity (monthly,
weekly), geographical (by zip codes)
Zooming and Panning – reveal patterns and outliers (Google
maps – zoom certain areas of interest)
Filtering – removing some “noise” from data to focus
attention on certain data
Rescaling to log scale (on right) “uncrowds” the data
Rescaling removes crowding and allows a better view of the linear relationship between the two log-scaled variables
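A log rescaling can be sketched as below. The function name log_rescale, the base-10 choice, and the sample values are assumptions for illustration. The transformation requires strictly positive values.

```python
import math


def log_rescale(values):
    """Return the base-10 logarithm of each value (values must be > 0)."""
    return [math.log10(v) for v in values]


# Hypothetical skewed values: most are small, a few are very large.
skewed = [1, 10, 100, 1000, 10000]
rescaled = log_rescale(skewed)

# Gaps between neighbors: huge and uneven before, equal after rescaling.
gaps_before = [b - a for a, b in zip(skewed, skewed[1:])]
gaps_after = [b - a for a, b in zip(rescaled, rescaled[1:])]
print(gaps_before)
print(gaps_after)
```

On the original scale the small values are crowded together next to the large ones. On the log scale they are evenly spread.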
Aggregation
Amtrak Ridership – Monthly Data
Aggregation – Monthly Average
“Seasonal aggregation” (monthly) – Peak ridership in July-August, and there is a dip in January-February
Aggregation – Yearly Average
“Temporal aggregation” (yearly) – Ridership decreased from 1991 – 1996 and then grew again from 1996 – 2004 (with a slight drop in 2003-2004)
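Both kinds of aggregation amount to grouping the monthly series by a key and averaging. The sketch below assumes a hypothetical list of (year, month, ridership) records. The numbers are made up and are not the Amtrak series.

```python
from collections import defaultdict
from statistics import mean


def aggregate(records, key_index):
    """Average the ridership value (position 2) grouped by the chosen field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[2])
    return {key: mean(vals) for key, vals in sorted(groups.items())}


# Hypothetical (year, month, ridership) records, for illustration only.
records = [
    (1991, 1, 100), (1991, 7, 140),
    (1992, 1, 110), (1992, 7, 150),
]

by_month = aggregate(records, 1)  # seasonal view: average per calendar month
by_year = aggregate(records, 0)   # temporal view: average per year
print(by_month)
print(by_year)
```

Grouping by month exposes the seasonal pattern, while grouping by year exposes the longer-term trend.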
Scatter Plot with Labels (Utilities)
Helps visualize and identify clusters and outliers, detect patterns.

For example: Nevada and Puget are similar and away from the rest
Scaling up: Large datasets
• Scatterplots for a large number of observations can sometimes be ineffective
• Alternatives:
• Sampling
• Reduce marker size
• Breaking data down into subsets
• Aggregation
• Jittering – slightly moving each marker by adding a small
amount of noise
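Jittering can be sketched as follows. The function name jitter, the uniform noise, the noise amount, and the fixed seed (used so the result is repeatable) are assumptions made for this example.

```python
import random


def jitter(points, amount, seed=0):
    """Shift each (x, y) marker by uniform noise in [-amount, amount]."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Five hypothetical markers that sit exactly on top of each other.
points = [(1, 1)] * 5
jittered = jitter(points, 0.1)
print(jittered)
```

After jittering, the five overlapping markers become individually visible while staying close to their true position.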
Other plots/graphs
• Matrix plot – multiple scatterplots together for pairwise
relationships
• Interactive visualization
• Multiple interlinked plots (single view)
• Interactive visualization is often preferred over “static”
graphs – all plots on one screen
• Specialized Visualization
• Network graphs – actors and relations between them
(“nodes”, “edges”)
• Tree maps for hierarchical large-scale data
• Map charts for geographical data
• Spotfire software – http://spotfire.tibco.com

Linked plots
(same record is highlighted in each plot)
Network Graph – eBay Auctions
(sellers on left, buyers on right)
Circle size = # of transactions for the node
Line width = # of auctions for the buyer-seller pair
Arrows point from seller to buyer
Treemap – eBay Auctions
(Hierarchical eBay data: Category > sub-category > Brand)
Rectangle size = average closing price (= item value)
Color = % of sellers with negative feedback (darker = more)
Map Chart
(Comparing countries’ well-being with GDP)
Darker = higher value

Summary of Data visualization tools
• Prediction and Classification
• Bar charts, scatterplots
• Boxplots, histograms
• Side-by-side boxplots, multiple panels, color added
• Aggregation methods
• Time series forecasting
• Line charts – temporal, seasonal aggregations
• Zooming and panning
• Unsupervised learning
• Matrix plots
• Heatmaps
• Aggregation, zooming and panning
• Map charts, parallel coordinate plots
Other Pre-processing steps – Chapter 2
Detecting outliers
Handling missing data
Normalizing/standardizing data
Summary Statistics: Exploring the data
• Useful initial step of data exploration
• Statistical summary of data: common metrics
• Average
• Median
• Mode
• Minimum
• Maximum
• Range
• Variance and Standard deviation
• Counts & percentages
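The metrics listed above can all be computed with the Python standard library. The values below are hypothetical, and the variance shown is the sample variance (dividing by n - 1), which is an assumption about which definition is intended.

```python
import statistics

# Hypothetical values, for illustration only.
values = [2, 4, 4, 5, 7, 9, 11]

summary = {
    "average": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "minimum": min(values),
    "maximum": max(values),
    "range": max(values) - min(values),
    "variance": statistics.variance(values),  # sample variance (n - 1)
    "std_dev": statistics.stdev(values),      # square root of the variance
}

for name, value in summary.items():
    print(f"{name}: {value}")
```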
Summary Statistics – Boston Housing
Summarize Using Pivot Tables
Counts & percentages are useful for summarizing categorical data
Boston Housing example:
35 neighborhoods border the Charles River (1)
471 neighborhoods do not (0)

Count of MEDV
CHAS          Total
0               471
1                35
Grand Total     506
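The counts in this pivot table, and the matching percentages, can be reproduced with a simple tally. The sketch rebuilds the CHAS column from the counts shown in the table (471 records coded 0 and 35 coded 1) rather than from the real data file.

```python
from collections import Counter

# CHAS column rebuilt from the pivot table counts: 471 zeros and 35 ones.
chas = [0] * 471 + [1] * 35

counts = Counter(chas)
total = sum(counts.values())
percentages = {k: round(100 * c / total, 1) for k, c in counts.items()}

for level in sorted(counts):
    print(level, counts[level], f"{percentages[level]}%")
print("Grand Total", total)
```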
Pivot Tables - cont.
Averages are useful for summarizing grouped numerical data
Boston Housing example:
Compare average home values in neighborhoods that border the Charles River (1) and those that do not (0)

Average of MEDV
CHAS          Total
0             22.09
1             28.44
Grand Total   22.53
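As a quick consistency check, the grand-total average should equal the count-weighted average of the two group averages. The sketch below combines the group averages from this table with the group counts from the count pivot table (471 and 35). The variable names are hypothetical.

```python
# CHAS level -> (count of tracts, average MEDV), taken from the pivot tables.
groups = {0: (471, 22.09), 1: (35, 28.44)}

total_n = sum(n for n, _ in groups.values())
grand_avg = sum(n * avg for n, avg in groups.values()) / total_n

print(total_n)
print(round(grand_avg, 2))
```

The result rounds to 22.53, matching the Grand Total row. The overall average sits close to the CHAS = 0 average because that group is much larger.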
Conclusion
Both data visualization and summary statistics are ways to explore, summarize and describe data
Visualization techniques are more appealing, but summary statistics are essential to quantitatively understand the information from the data
They both help in data reduction and forming groups/aggregates
