
Session 4: Basic Data Visualization

“A picture is worth a thousand words”


Use of Data Visualization & Basic Charts
• Ability to condense verbal information into a compact and quickly understood
graphical image.
• Primarily used in the pre-processing stage of the data mining process, where it is a
mandatory step.
• Supports data cleaning by finding incorrect values, missing values, duplicate
rows, columns with the same values, outliers, etc.

• The following are the basic charts used for exploration and presentation of
data.
• Histogram, Bar Chart, Line Chart, Boxplot and Scatterplot

• Advanced visualization: heatmaps, matrix plots, treemaps, network maps, map charts, etc.
Basic Charts: Bar, Line and Scatter plots
• Basic charts support data exploration by displaying one or two columns of
data at a time.
• A bar chart is useful for comparing a single statistic across groups. The height
of each bar represents the value of the statistic, and different bars correspond
to different groups.

• A scatter plot is useful for identifying the association or correlation between two
variables. It also shows the spread of the data.
• Line charts are generally used for showing time series data.
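The bar chart idea above (one statistic per group, bar height proportional to its value) can be sketched without a plotting library. This is only an illustration: the `sales` records and the helper names `group_means` and `text_bar_chart` are hypothetical and do not come from the slides.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records of (group, value); not taken from the slides.
sales = [("North", 12.0), ("South", 7.5), ("North", 9.0),
         ("East", 15.0), ("South", 8.5), ("East", 11.0)]

def group_means(records):
    """Return {group: mean value}, the statistic each bar would display."""
    grouped = defaultdict(list)
    for group, value in records:
        grouped[group].append(value)
    return {group: mean(values) for group, values in grouped.items()}

def text_bar_chart(stats, width=30):
    """Render one bar per group; bar length is proportional to the statistic."""
    largest = max(stats.values())
    lines = []
    for group, value in sorted(stats.items()):
        bar = "#" * round(width * value / largest)
        lines.append(f"{group:<6} {bar} {value:.2f}")
    return "\n".join(lines)

means = group_means(sales)
print(text_bar_chart(means))
```

In practice a plotting library would draw the bars; the point here is only that the chart is driven by a single aggregated number per group.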
Distribution Plots: Histogram & Boxplot
• Generally not considered basic charts, but highly useful in statistical and
data mining contexts.
• Both plots display the entire distribution of a numerical variable.

• Highly useful in supervised learning for determining potential data mining
methods and variable transformations.
• Both plots reveal the nature of the distribution, such as skewness or kurtosis,
and can also be used for identifying outliers.
• Side-by-side box plots are useful in classification tasks for evaluating the
potential of numerical predictors.

• Note that the weakness of basic charts and distribution plots is that they
display at most two variables at a time and therefore cannot reveal high-dimensional
information.
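As a rough sketch of what side-by-side box plots compare, the snippet below computes the five numbers a boxplot draws for one numerical predictor, separately for each class. The class labels, the values, and the helpers `five_number_summary` and `boxes_overlap` are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the slides do not specify one.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a boxplot displays."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

def boxes_overlap(summary_a, summary_b):
    """True if the Q1-to-Q3 boxes of two groups overlap."""
    return summary_a[1] <= summary_b[3] and summary_b[1] <= summary_a[3]

# Hypothetical predictor values, split by class label.
by_class = {
    "owner": [60, 65, 70, 75, 80],
    "nonowner": [30, 35, 40, 45, 50],
}

summaries = {label: five_number_summary(vals) for label, vals in by_class.items()}
for label, summary in summaries.items():
    print(label, summary)
print("boxes overlap:", boxes_overlap(summaries["owner"], summaries["nonowner"]))
```

When the boxes of the classes barely overlap, as in this made-up data, the predictor is a plausible candidate for separating the classes; heavy overlap suggests less potential.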
Graphical Methods for Identifying Outliers
• Outliers are extreme values that go against the trend of the remaining data.
• Outliers may represent errors in data entry. Certain statistical methods are
sensitive to the presence of outliers and may deliver unreliable results.
• Histograms, boxplots and scatterplots are the graphical tools commonly used to spot outliers.
Numerical Methods for Identifying Outliers
• Using the Z-score, a data value is treated as an outlier if its Z-score falls below -3 or above 3.
• Unfortunately, both the mean and the SD (the components of the Z-score) are highly sensitive to outliers.

• Robust statistical methods, which are less sensitive to the presence of
outliers, are therefore required for outlier detection.
• One such method is based on the quartiles: Q1, Q2 and Q3.
• A data value is said to be an outlier if it falls above Q3 + 1.5 × IQR or
below Q1 - 1.5 × IQR, where IQR = Q3 - Q1 is the spread of the
middle 50% of the data.
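The two numerical rules above can be compared on a small example. This is a minimal sketch under stated assumptions: the data are made up, the function names are hypothetical, the Z-score uses the sample standard deviation, and the quartiles use the `inclusive` convention of Python's `statistics.quantiles` (other quartile conventions give slightly different limits).

```python
from statistics import mean, stdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Flag values whose Z-score lies beyond -threshold or +threshold."""
    m = mean(values)
    s = stdev(values)  # sample SD; an assumption, the slides do not specify
    return [v for v in values if abs((v - m) / s) > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values above Q3 + k*IQR or below Q1 - k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical data with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 13, 12, 95]

print("Z-score rule:", zscore_outliers(data))  # -> []
print("IQR rule:    ", iqr_outliers(data))     # -> [95]
```

The extreme value 95 inflates both the mean and the SD, so its own Z-score stays below 3 and the Z-score rule misses it, while the quartile-based limits are barely affected and flag it.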
Advanced Visualizations: Heatmap & Matrix Plot
• A heatmap is a graphical display of numerical data where color is used to denote values.
• In a data mining context, heatmaps serve two purposes: 1) visualizing correlation tables and
2) visualizing missing values in the data.
• In a data set with p columns (variables) and n rows (observations), it is easier and faster to scan the
color-coding than the values.
• In a correlation heatmap, darker shades correspond to stronger (positive or negative) correlation.
• In a missing value heatmap, rows correspond to records and columns to variables.
• The original data set must first be binary coded, with 1 denoting a missing value and 0
otherwise.
• This new binary table is then colored such that only missing value cells are colored.

• A matrix plot is a scatter plot in multiple panels.
• All pairwise scatter plots are shown in a single display.
• Each column and each row corresponds to a variable, so the intersections create
all possible pairwise scatter plots.
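The two heatmap uses described above (a binary missing-value table and a shaded correlation table) can be sketched in plain Python, with characters standing in for colors. Everything here is illustrative: the records, the column meaning (mpg, hp, wt), and the helpers `missing_indicator`, `pearson` and `shade` are assumptions, not material from the slides.

```python
from math import sqrt

def missing_indicator(rows):
    """Binary coding: 1 where a cell is missing (None), 0 otherwise."""
    return [[1 if cell is None else 0 for cell in row] for row in rows]

def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def shade(r):
    """Map |r| in [0, 1] to a darker character for stronger correlation."""
    return " .:*#"[min(4, int(abs(r) * 5))]

# Hypothetical records (columns: mpg, hp, wt); None marks a missing value.
rows = [
    [21.0, 110, 2.62],
    [22.8, None, 2.32],
    [None, 175, 3.44],
    [18.1, 105, None],
    [14.3, 245, 3.57],
    [24.4, 62, 3.19],
]

# Missing-value "heatmap": only missing cells are marked.
indicator = missing_indicator(rows)
for row in indicator:
    print(" ".join("#" if flag else "." for flag in row))

# Correlation "heatmap", computed on the complete records only.
complete = [row for row in rows if None not in row]
columns = list(zip(*complete))
table = [[pearson(a, b) for b in columns] for a in columns]
for line in table:
    print(" ".join(shade(r) for r in line))
```

Dropping incomplete records before computing correlations is just one simple choice made for this sketch; a real analysis would decide how to handle missing values first.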
Heatmap of Cars Data
• Heatmap based on the first five variables.
Manipulations in Data Visualization
• Most of the time in a data mining project is spent on pre-processing,
which includes variable transformation, derivation of new variables, changing
the numerical scale, binning numerical variables, etc. The following
manipulations are generally advisable.

• Rescaling: Changing the scale in a display can enhance the visibility of the plot
and illuminate relationships. The “crowding” of data points near the axes can
be eliminated.
• Aggregation and Hierarchies: A useful manipulation in scaling is changing the
level of aggregation. For example, time series data can be aggregated by
seasonal factors.
• Zooming: One can zoom in on a certain part of the data on a plot to reveal
patterns and outliers.
• Filtering: Removing some of the observations from the plot. The purpose is to
focus on a certain part of the data by eliminating the “noise”.
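The manipulations above all transform the data before it is drawn, which the following sketch illustrates on a made-up monthly series. The series, the choice of quarters as the aggregation level, the base-10 log as the rescaling, and all function names are assumptions for illustration only.

```python
from collections import defaultdict
from math import log10

# Hypothetical monthly series of (month number, value).
monthly = list(zip(range(1, 13),
                   [120, 135, 150, 160, 180, 210, 230, 220, 190, 170, 140, 125]))

def rescale_log10(values):
    """Rescaling: a log scale spreads out points crowded near the axes."""
    return [log10(v) for v in values]

def aggregate_by_quarter(series):
    """Aggregation: move from the monthly level to the quarterly level."""
    totals = defaultdict(float)
    for month, value in series:
        quarter = (month - 1) // 3 + 1
        totals[quarter] += value
    return dict(totals)

def zoom(series, start, end):
    """Zooming: keep only the part of the x-axis between start and end."""
    return [(m, v) for m, v in series if start <= m <= end]

def filter_points(series, keep):
    """Filtering: drop observations that do not satisfy the condition."""
    return [point for point in series if keep(point)]

print(aggregate_by_quarter(monthly))
print(zoom(monthly, 6, 8))
print(filter_points(monthly, lambda point: point[1] >= 200))
print(rescale_log10([1, 10, 1000]))
```

Zooming and filtering look alike in code, but zooming restricts the displayed axis range while filtering removes observations by any condition, for example by value or by category.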
