
Chapter 3 – Data Visualization

Chapter 4 – Summary Statistics

Data Mining for Business Intelligence


Shmueli, Patel & Bruce

© Galit Shmueli and Peter Bruce 2010


Data Visualization
• “A picture is worth a thousand words”
• Data visualization and summary statistics help condense
data
• Effective presentation
• Supports data cleaning (identify missing values, outliers,
incorrect values, duplicates) and exploring (combine some
groups)
• Helps identify suitable variables
• Mandatory initial step for most data mining applications
Graphs for Data Exploration
Basic Plots      Distribution Plots
Line Graphs      Boxplots
Bar Charts       Histograms
Scatterplots
Two Examples
Amtrak Ridership: Boston Housing Data:
Amtrak routinely collects Census tracts in Boston
data on ridership Several variables (14) –
Goal: To predict future crime rate, location, etc.
ridership using the series Goal 1: Predict median
of monthly ridership data value of a home in the tract
between Jan 1991 – Goal 2: Cluster census
March 2004 tracts
Line Graph for Time Series

Shows how ridership patterns of Amtrak trains change over time


Bar Chart for Categorical Variable
Determine differences
between subgroups

Example: about 93% of tracts (471 of 506) do not border the Charles River
Scatterplot
Displays the relationship between two numerical variables. Example: median home value
decreases as the percentage of low-status population increases
Graphs
• Three most effective plots:
  • Bar charts – usually for categorical variables
  • Line graphs – time series data
  • Scatterplots – relationship between 2 variables
• Used widely in the business world
• Domain knowledge and the nature of the task are used to select the appropriate chart for the data at hand
Distribution Plots
• Display entire distribution of a numerical variable
• Display “how many” of each value occur in a data set or,
for continuous data or data with many possible values,
“how many” values are in each of a series of ranges or
“bins”
• Generally useful for prediction tasks (supervised learning)
and help determine the potential methods and variable
transformations
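The binning idea above can be sketched in a few lines of Python. This is only an illustration: the function name `bin_counts`, the choice of equal-width bins, and the sample values are assumptions, not part of the Boston Housing data or of any particular software's histogram routine.

```python
def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: put everything in the first bin
        else:
            # The maximum value would land one past the last bin, so clamp it.
            idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical home values (in $1000s), made up for this sketch.
home_values = [12.0, 15.5, 18.2, 19.9, 21.0, 22.4, 23.1, 24.8, 31.5, 50.0]
edges, counts = bin_counts(home_values, num_bins=4)

# Print a text histogram: one row per bin, one '#' per value in the bin.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * count}")
```

The bar lengths give the "how many values per range" view that a histogram draws graphically. Different software may choose bin edges differently.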
Histograms

Boston Housing example:

Histogram shows the distribution of the outcome variable (median house value)
Boxplots
Side-by-side boxplots are useful for comparing subgroups

Boston Housing Example:


Display distribution of
outcome variable (MEDV)
for neighborhoods on the
Charles River (1) and not on the
Charles River (0)
Box Plot
Top outliers defined as those above Q3+1.5(Q3-Q1).
“max” = maximum of non-outliers
Analogous definitions for bottom outliers and for “min”
Details may differ across software
[Figure: annotated boxplot; labels: outliers, “max”, Quartile 3, mean, Median, Quartile 1, “min”]
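The definitions above can be sketched in Python using the standard `statistics` module. This is a sketch under stated assumptions: the function name `boxplot_stats` and the sample values are made up, and the "inclusive" quartile method is just one possible choice. As the slide notes, details differ across software.

```python
import statistics


def boxplot_stats(values):
    """Compute the quantities a boxplot displays, using the 1.5 * IQR rule."""
    # Quartile definitions vary between tools; "inclusive" is one common choice.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr  # top outliers lie above Q3 + 1.5(Q3 - Q1)
    lower_fence = q1 - 1.5 * iqr  # bottom outliers lie below Q1 - 1.5(Q3 - Q1)
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(values),
        "min": min(inside),  # "min" = minimum of non-outliers
        "max": max(inside),  # "max" = maximum of non-outliers
        "outliers": sorted(v for v in values if v < lower_fence or v > upper_fence),
    }


# Hypothetical values with one unusually large observation.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = boxplot_stats(sample)
print(stats)
```

Here Q3 + 1.5(Q3 - Q1) = 7.75 + 1.5 × 4.5 = 14.5, so the value 30 is flagged as a top outlier and the "max" whisker ends at 9.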
Heat Maps
• Basic charts and distribution plots can display a maximum of 2 variables
  • Cannot represent high-dimensional data
• In data mining, data are often multi-dimensional
• Heat maps are graphical displays where color is used to convey information
• Used to visualize:
  • Correlation
  • Missing data
Heat maps
• Correlation table for p variables has p rows and p columns
• Data table has p columns (variables) and n rows (records)
• If n is large, a subset can be used
• Easier and faster to scan the color coding than the values
• Useful when examining a large number of values, but bar charts and plots should be used for precise graphical representations
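As a rough sketch of how a p × p correlation table is built from a data table with p columns and n rows, the following Python code computes Pearson correlations by hand and maps each value to a text "shade" in place of color. The names `pearson`, `correlation_table` and `shade`, the shading characters, and the three small columns are all illustrative assumptions. The code assumes no column is constant (otherwise the correlation is undefined).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_table(columns):
    """Build a p x p table (dict of dicts) from a dict of p named columns."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


def shade(r):
    """Map |r| to a character so strong correlations stand out, as color does."""
    levels = " .:*#"
    return levels[min(int(abs(r) * len(levels)), len(levels) - 1)]


# Hypothetical data table: p = 3 variables, n = 5 records.
data = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
}
table = correlation_table(data)
for row_name, row in table.items():
    cells = " ".join(f"{value:5.2f}{shade(value)}" for value in row.values())
    print(f"{row_name}: {cells}")
```

Scanning the shade characters is quicker than reading every number, which is the point of a heat map. The exact values remain available when precision matters.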
Heatmap to highlight correlations
(Boston Housing)
In Excel (using conditional formatting)

            CRIM      ZN   INDUS    CHAS     NOX      RM     AGE     DIS     RAD     TAX PTRATIO       B   LSTAT    MEDV
CRIM        1.00
ZN         -0.20    1.00
INDUS       0.41   -0.53    1.00
CHAS       -0.06   -0.04    0.06    1.00
NOX         0.42   -0.52    0.76    0.09    1.00
RM         -0.22    0.31   -0.39    0.09   -0.30    1.00
AGE         0.35   -0.57    0.64    0.09    0.73   -0.24    1.00
DIS        -0.38    0.66   -0.71   -0.10   -0.77    0.21   -0.75    1.00
RAD         0.63   -0.31    0.60   -0.01    0.61   -0.21    0.46   -0.49    1.00
TAX         0.58   -0.31    0.72   -0.04    0.67   -0.29    0.51   -0.53    0.91    1.00
PTRATIO     0.29   -0.39    0.38   -0.12    0.19   -0.36    0.26   -0.23    0.46    0.46    1.00
B          -0.39    0.18   -0.36    0.05   -0.38    0.13   -0.27    0.29   -0.44   -0.44   -0.18    1.00
LSTAT       0.46   -0.41    0.60   -0.05    0.59   -0.61    0.60   -0.50    0.49    0.54    0.37   -0.37    1.00
MEDV       -0.39    0.36   -0.48    0.18   -0.43    0.70   -0.38    0.25   -0.38   -0.47   -0.51    0.33   -0.74    1.00

In Spotfire
Multidimensional Visualization
Adding variables
• To add more variables to a plot:
• Categorical: hue, shape, multiple panels
• Numerical: color intensity
• Incorporating more variables has advantages:
• Use for both classification and prediction tasks
• Helps decide which interaction terms to add
Scatterplot with color added
Boston Housing

NOX vs. LSTAT


Red = low median value
Blue = high median value
Data Manipulations
Important step in pre-processing of data
Includes – variable transformations, deriving new
variables (binning, condensing categories)
Common methods:
Rescaling – can often enhance the plot and illuminate
relationships
Aggregation – change the temporal granularity (monthly,
weekly) or the geographical level (by zip code)
Zooming and Panning – reveal patterns and outliers (as in Google
Maps, zooming in on areas of interest)
Filtering – removing some “noise” from data to focus
attention on certain data
Rescaling to log scale (on right) “uncrowds” the data

Rescaling removes crowding and allows a better view of the linear
relationship between the two log-scaled variables
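To see why a log scale "uncrowds" skewed data, the short Python sketch below compares where each point falls along an axis before and after taking base-10 logs. The helper names `to_log10` and `axis_positions` and the sample values are assumptions made for illustration. The transform requires strictly positive values.

```python
import math


def to_log10(values):
    """Rescale strictly positive values to a base-10 log scale."""
    return [math.log10(v) for v in values]


def axis_positions(values):
    """Position of each value along the axis, from 0.0 (min) to 1.0 (max)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical right-skewed values spanning several orders of magnitude.
raw = [1, 10, 100, 1000, 10000]
logged = to_log10(raw)

print("raw positions:", [round(p, 3) for p in axis_positions(raw)])
print("log positions:", [round(p, 3) for p in axis_positions(logged)])
```

On the raw scale the four smallest values are squeezed into the first tenth of the axis. On the log scale all five are evenly spaced.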
Aggregation
Amtrak Ridership – Monthly Data
Aggregation – Monthly Average

“Seasonal aggregation” (monthly) – Peak ridership in July-August, and there is a dip in January-February
Aggregation – Yearly Average

“Temporal aggregation” (yearly) – Ridership decreased from 1991 – 1996 and then grew again from 1996 – 2004 (with a slight drop in 2003-2004)
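Both kinds of aggregation can be sketched with one small grouping helper in Python. The record layout `(year, month, riders)`, the function name `average_by`, and the numbers are hypothetical. They are not the Amtrak series.

```python
from collections import defaultdict
from statistics import mean


def average_by(records, key_index):
    """Average the riders field (index 2) over records sharing the same key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return {key: mean(vals) for key, vals in sorted(groups.items())}


# Hypothetical monthly records: (year, month, riders in thousands).
records = [
    (2001, 1, 100), (2001, 7, 140),
    (2002, 1, 110), (2002, 7, 150),
]

by_month = average_by(records, key_index=1)  # seasonal: same month across years
by_year = average_by(records, key_index=0)   # temporal: all months within a year

print("Monthly average:", by_month)
print("Yearly average:", by_year)
```

Grouping by month of year exposes the seasonal pattern, while grouping by year smooths it away and exposes the long-run trend.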
Scatter Plot with Labels (Utilities)

Helps visualize and identify clusters and outliers, detect patterns.
For example: Nevada and Puget are similar and away from the rest
Scaling up: Large datasets
• Scatterplots with a large number of observations can be ineffective because markers overlap

• Alternatives:
• Sampling
• Reduce marker size
• Breaking data down into subsets
• Aggregation
• Jittering – slightly moving each marker by adding a small
amount of noise
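Jittering, the last alternative above, can be sketched as follows. The function name `jitter`, the uniform noise, the fixed seed, and the sample points are assumptions for illustration. Plotting tools may use a different noise distribution.

```python
import random


def jitter(values, amount, seed=0):
    """Return a copy of values, each shifted by uniform noise in [-amount, amount]."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical x coordinates where several markers share the same position.
x = [1, 1, 1, 2, 2]
x_jittered = jitter(x, amount=0.1)

print("distinct positions before:", len(set(x)))
print("distinct positions after: ", len(set(x_jittered)))
```

The noise should be small relative to the spacing of the data, so overlapping markers separate without distorting the overall pattern.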
Other plots/graphs
• Matrix plot – multiple scatterplots together for pairwise
relationships
• Interactive visualization
• Multiple interlinked plots (single view)
• Interactive visualization is often preferred over “static”
graphs – all plots on one screen
• Specialized Visualization
• Network graphs – actors and relations between them
(“nodes”, “edges”)
• Tree maps for hierarchical large-scale data
• Map charts for geographical data

• Spotfire software – http://spotfire.tibco.com


Linked plots
(same record is highlighted in each plot)
Network Graph – eBay Auctions
(sellers on left, buyers on right)

Circle size = # of transactions for the node

Line width = # of auctions for the buyer-seller pair

Arrows point from seller to buyer
Treemap – eBay Auctions
(Hierarchical eBay data: Category > Sub-category > Brand)

Rectangle size = average closing price (= item value)

Color = % sellers with negative feedback (darker = more)
Map Chart
(Comparing countries’ well-being with GDP)

Darker = higher value


Summary of Data visualization tools
• Prediction and Classification
• Bar charts, scatterplots
• Boxplots, histograms
• Side-by-side boxplots, multiple panels, color added
• Aggregation methods
• Time series forecasting
• Line charts – temporal, seasonal aggregations
• Zooming and panning
• Unsupervised learning
• Matrix plots
• Heatmaps
• Aggregation, zooming and panning
• Map charts, parallel coordinate plots
Other Pre-processing steps – Chapter 2
Detecting outliers
Handling missing data
Normalizing/standardizing data
Summary Statistics: Exploring the data
• Useful initial step of data exploration
• Statistical summary of data: common metrics
• Average
• Median
• Mode
• Minimum
• Maximum
• Range
• Variance and Standard deviation
• Counts & percentages
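Most of the metrics listed above are available in Python's standard `statistics` module. The sketch below is illustrative only: the function name `summarize` and the sample values are made up, and the variance shown is the sample variance (dividing by n - 1), which is one of two common conventions.

```python
import statistics


def summarize(values):
    """Return common summary statistics for a list of numbers."""
    return {
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std_dev": statistics.stdev(values),      # square root of the above
    }


# Hypothetical values, not taken from the Boston Housing data.
sample_values = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(sample_values)
for name, value in summary.items():
    print(f"{name:>8}: {value}")
```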
Summary Statistics – Boston Housing
Summarize Using Pivot Tables

Counts & percentages are useful for summarizing categorical data

Boston Housing example:
471 neighborhoods do not border the Charles River (0)
35 neighborhoods border it (1)

Count of MEDV
CHAS         Total
0              471
1               35
Grand Total    506
Pivot Tables - cont.
Averages are useful for summarizing grouped numerical data

Boston Housing example:
Compare average home values in neighborhoods that border the Charles River (1) and those that do not (0)

Average of MEDV
CHAS         Total
0            22.09
1            28.44
Grand Total  22.53
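What a pivot table computes (counts, percentages, and averages per group) can be sketched in plain Python. The function name `pivot`, the record layout, and the four rows are hypothetical. Only the column names CHAS and MEDV come from the slides, and the numbers are not the Boston Housing values.

```python
from collections import defaultdict


def pivot(records, group_key, value_key):
    """Count, percentage and average of value_key for each value of group_key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    total = len(records)
    return {
        group: {
            "count": len(vals),
            "percent": 100 * len(vals) / total,
            "average": sum(vals) / len(vals),
        }
        for group, vals in sorted(groups.items())
    }


# Hypothetical tracts, made up for this sketch.
tracts = [
    {"CHAS": 0, "MEDV": 20.0},
    {"CHAS": 0, "MEDV": 24.0},
    {"CHAS": 0, "MEDV": 22.0},
    {"CHAS": 1, "MEDV": 30.0},
]
result = pivot(tracts, group_key="CHAS", value_key="MEDV")
for group, row in result.items():
    print(f"CHAS={group}: {row}")
```

Counts and percentages summarize the categorical variable itself, while the per-group average summarizes a numerical variable within each category.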
Conclusion
Both data visualization and summary statistics are ways to explore, summarize and describe data

Visualization techniques are more appealing, but summary statistics are essential to quantitatively understand the information in the data

They both help in data reduction and forming groups/aggregates
