DADM S4 Basic Data Visualization

Session 4: Basic Data Visualization
“A picture is worth a thousand words”

Use of Data Visualization & Basic Charts
• Visualization condenses verbal and numerical information into a compact,
quickly understood graphical image.
• It is used primarily in the pre-processing stage of the data mining process,
where it is a mandatory step.
• It supports data cleaning by revealing incorrect values, missing values,
duplicate rows, columns with the same values, outliers etc.
• The following are the basic charts used for exploration and presentation of
data.
• Histogram, Bar Chart, Line Chart, Boxplot and Scatterplot
• Advanced visualization: Heatmaps, matrix plots, treemaps, network maps, map
charts etc.
Basic Charts: Bar, Line and Scatter plots
• Basic charts support data exploration by displaying one or two columns of the
data at a time.
• A bar chart is useful for comparing a single statistic across groups. The height
of each bar represents the value of the statistic, and different bars correspond
to different groups.
• A scatter plot is useful for identifying the association or correlation between
two variables. It also shows the spread of the data.
• A line chart is generally used for showing time series data.
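The bar chart idea above can be sketched without a plotting library: compute one statistic per group, then draw one bar per group. This is only an illustrative sketch. The records, the group names and the helper names `group_means` and `text_bar_chart` are assumptions made up for this example, and the mean is just one possible statistic.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (group, value) pairs, e.g. region and sales.
records = [
    ("North", 120.0), ("North", 80.0),
    ("South", 60.0), ("South", 40.0), ("South", 50.0),
    ("East", 150.0),
]

def group_means(rows):
    """Return {group: mean of its values}; these are the bar heights."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {group: mean(values) for group, values in groups.items()}

def text_bar_chart(heights, width=30):
    """Draw one text bar per group, scaled so the tallest bar is `width` long."""
    tallest = max(heights.values())
    lines = []
    for group, height in heights.items():
        bar = "#" * round(width * height / tallest)
        lines.append(f"{group:<6} {bar} {height:.1f}")
    return "\n".join(lines)

heights = group_means(records)
print(text_bar_chart(heights))
```

Each bar corresponds to one group and its length to the group mean, which is exactly the comparison a bar chart is meant to support.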
Distribution Plots: Histogram & Boxplot
• Generally not considered as basic charts but highly useful in statistical and
data mining contexts.
• Both the plots display the entire distribution of a numerical variable.
• Highly useful in supervised learning for determining potential data mining
methods and variable transformations.
• Both plots show the nature of the distribution, such as skewness or kurtosis,
and can also be used for identifying outliers.
• Side-by-side box plots are useful in classification tasks for evaluating the
potential of numerical predictors.
• Note that the weakness of basic charts and distribution plots is that they
display at most two variables at a time and therefore cannot reveal
high-dimensional information.
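The quantities behind the two distribution plots can be computed directly, which may help when reading them. The sketch below uses a made-up sample. The names `five_number_summary` and `histogram_counts`, the bin width of 5 and the "inclusive" quartile method are assumptions for illustration, and plotting tools may use slightly different quartile conventions.

```python
from statistics import mean, median, quantiles

# Hypothetical right-skewed sample, e.g. order values.
values = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 14, 25]

def five_number_summary(data):
    """Minimum, Q1, median, Q3 and maximum: the numbers a boxplot draws."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

def histogram_counts(data, bin_width):
    """Count values per bin [left, left + bin_width); empty bins are omitted."""
    counts = {}
    for x in data:
        left = (x // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

print(five_number_summary(values))
for left, count in histogram_counts(values, 5).items():
    print(f"{left:>3}-{left + 5:<3} {'#' * count}")

# A mean above the median is a common sign of right skew.
print("mean > median:", mean(values) > median(values))
```

In this sample the long upper whisker and the isolated bin on the right both point to the same thing: a right-skewed variable that might benefit from a transformation.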
Graphical methods for identifying outliers
• Outliers are extreme values that go against the trend of the remaining data.
• Outliers may represent errors in data entry. Certain statistical methods are
sensitive to the presence of outliers and may deliver unreliable results.
• Graphical tools for spotting outliers: histogram, boxplot and scatterplot.
Numerical Methods for Identifying Outliers
• Using the Z-score, Z = (value - mean) / SD, a data value is treated as an
outlier if its Z-score falls outside the range -3 to 3.
• Unfortunately, both the mean and the SD (parts of the Z-score) are highly
sensitive to outliers.
• Robust statistical methods, which are less sensitive to the presence of
outliers, are therefore required for outlier detection.
• One such method is based on the quartiles: Q1, Q2 and Q3.
• A data value is said to be an outlier if it lies above Q3 + (1.5)IQR or below
Q1 - (1.5)IQR, where IQR = Q3 - Q1 is the spread of the middle 50% of the data.
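The two numerical rules can be compared on a small sample. The sketch below is illustrative only. The data are made up, the function names are hypothetical, the sample standard deviation is used for the Z-score, and quartiles use Python's "inclusive" method (other software may give slightly different quartiles).

```python
from statistics import mean, quantiles, stdev

# Hypothetical sample with one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 100]

def zscore_outliers(values, threshold=3.0):
    """Values whose Z-score lies outside [-threshold, threshold]."""
    m, s = mean(values), stdev(values)
    return [x for x in values if abs((x - m) / s) > threshold]

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

print("Z-score rule:", zscore_outliers(data))
print("IQR rule:    ", iqr_outliers(data))
```

On this sample the Z-score rule flags nothing, because the value 100 inflates the mean and the SD enough to keep its own Z-score below 3, while the quartile-based rule flags 100. This is the sensitivity problem described above.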
Advanced Visualizations: Heatmap & Matrix Plot
• Heatmap: a graphical display of numerical data where color is used to denote values.
• In a data mining context, heatmaps serve two purposes: 1) visualizing correlation tables and
2) visualizing missing values in the data.
• In a table with p columns (variables) and n rows (observations), scanning the
color coding is easier and faster than reading the values.
• Darker shades correspond to stronger (positive or negative) correlation.
• In a missing value heatmap, rows correspond to records and columns to variables.
• The original data set first needs to be binary coded, with 1 denoting a missing value
and 0 otherwise.
• This new binary table is then colored such that only missing value cells are colored.
• Matrix plot: a scatter plot in multiple panels.
• All pairwise scatter plots are shown in a single display.
• Each column and each row corresponds to a variable, so the intersections create
all possible pairwise scatter plots.
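The binary coding step for a missing value heatmap can be sketched as follows. The table, the column names and the helpers `missing_mask` and `render_mask` are made up for this example, and `None` is assumed to mark a missing cell. A plotting library would color the cells, while this sketch prints `X` for missing and `.` otherwise.

```python
# Hypothetical table: rows are records, columns are variables; None = missing.
columns = ["age", "income", "score"]
rows = [
    [34, 52000, 0.71],
    [None, 48000, 0.65],
    [29, None, None],
    [41, 61000, 0.80],
]

def missing_mask(table):
    """Binary coding: 1 where a value is missing, 0 otherwise."""
    return [[1 if cell is None else 0 for cell in row] for row in table]

def render_mask(mask):
    """Text stand-in for the heatmap: only missing cells are marked."""
    return "\n".join(
        "".join("X" if cell else "." for cell in row) for row in mask
    )

mask = missing_mask(rows)
print(render_mask(mask))

# Missing values per variable (column sums of the binary table).
missing_per_column = dict(zip(columns, (sum(col) for col in zip(*mask))))
print(missing_per_column)
```

Scanning the marked cells row by row shows which records are incomplete, and scanning column by column shows which variables are most affected.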
Heatmap of Cars Data
• Heatmap based on first five variables.
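A correlation heatmap starts from a correlation table. The sketch below builds one and replaces color with three text shades. The columns are made up and are not the cars data. The function names and the shade cut-offs (0.3 and 0.7) are arbitrary assumptions for illustration.

```python
from math import sqrt

# Hypothetical numeric columns; these are NOT the cars data.
table = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises exactly with x
    "z": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls exactly as x rises
    "w": [2.0, 1.0, 4.0, 1.0, 2.0],   # no linear relation to x
}

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

def correlation_table(cols):
    """Nested dict with the correlation of every pair of columns."""
    names = list(cols)
    return {r: {c: pearson(cols[r], cols[c]) for c in names} for r in names}

def shade(r):
    """Darker symbol for stronger correlation, whatever its sign."""
    strength = abs(r)
    if strength >= 0.7:
        return "#"
    if strength >= 0.3:
        return "+"
    return "."

corr = correlation_table(table)
for name, row in corr.items():
    print(name, "".join(shade(r) for r in row.values()))
```

Because the shade depends on the absolute value, the strong negative pair (x, z) is drawn as dark as the strong positive pair (x, y).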
Manipulations in Data Visualization
• Most of the time in a data mining project is spent in pre-processing,
which includes variable transformation, derivation of new variables, changing
the numerical scale, binning numerical variables etc. The following
manipulations are generally advisable.
• Rescaling: Changing the scale of a display can enhance the visibility of the plot
and illuminate relationships. The “crowding” of data points near the axes can
be eliminated.
• Aggregation and Hierarchies: Another useful manipulation is changing the
level of aggregation. For example, time series data can be aggregated by
seasonal factors.
• Zooming: One can zoom in on a certain part of the data on a plot to reveal
patterns and outliers.
• Filtering: Removing some of the observations from the plot. The purpose is to
focus on a certain part of the data by eliminating the “noise”.
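The four manipulations can be sketched as small data transformations applied before plotting. Everything below is an illustrative assumption: the monthly series is made up, the helper names are hypothetical, a base-10 log stands in for rescaling, and quarters stand in for the seasonal level of aggregation.

```python
from math import log10

# Hypothetical monthly series: (month number, value).
monthly = [(1, 12), (2, 15), (3, 11), (4, 30), (5, 28), (6, 35),
           (7, 50), (8, 48), (9, 55), (10, 20), (11, 18), (12, 900)]

def rescale_log10(values):
    """Rescaling: a log scale spreads out points crowded near the axis."""
    return [log10(v) for v in values]

def aggregate_by_quarter(series):
    """Aggregation: sum monthly values into quarters 1 to 4."""
    totals = {}
    for month, value in series:
        quarter = (month - 1) // 3 + 1
        totals[quarter] = totals.get(quarter, 0) + value
    return totals

def zoom(series, first, last):
    """Zooming: keep only the months in the window [first, last]."""
    return [(m, v) for m, v in series if first <= m <= last]

def filter_below(series, limit):
    """Filtering: drop observations with values at or above `limit`."""
    return [(m, v) for m, v in series if v < limit]

print(rescale_log10([v for _, v in monthly]))
print(aggregate_by_quarter(monthly))
print(zoom(monthly, 4, 6))
print(filter_below(monthly, 100))
```

Here filtering removes the single extreme month so the remaining pattern becomes visible, while the log rescaling keeps every point but compresses the large value.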
