• The following basic charts are used for exploring and presenting data:
histogram, bar chart, line chart, boxplot and scatterplot.
• Note that the weakness of basic charts and distribution plots is that they
display at most two variables and therefore cannot reveal high-dimensional
information.
Graphical methods for identifying outliers
• Outliers are extreme values that go against the trend of the remaining data.
• Outliers may represent errors in data entry. Certain statistical methods are
sensitive to the presence of outliers and may deliver unreliable results.
• Histograms, boxplots and scatterplots are the charts commonly used to spot outliers.
Numerical Methods for Identifying Outliers
• Using the Z-score, a data value is treated as an outlier if its Z-score falls
below -3 or above 3.
• Unfortunately, both the mean and the SD (the components of the Z-score) are
highly sensitive to outliers.
• Outlier detection therefore requires robust statistical methods that are less
sensitive to the presence of outliers.
• One such method is based on the quartiles Q1, Q2 and Q3.
• A data value is said to be an outlier if it lies above Q3 + 1.5 × IQR or
below Q1 - 1.5 × IQR, where IQR = Q3 - Q1 is the spread of the
middle 50% of the data.
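The two rules above can be compared on a small sample. The following is a minimal sketch, not a definitive implementation: the data values and the function names `zscore_outliers` and `iqr_outliers` are made up for illustration, the sample SD is assumed for the Z-score, and the quartiles come from Python's `statistics.quantiles` with its default method (other quartile conventions give slightly different limits).

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose Z-score lies below -threshold or above threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (an assumption; the text does not say which SD)
    return [x for x in values if abs((x - mean) / sd) > threshold]

def iqr_outliers(values, k=1.5):
    """Return the values above Q3 + k*IQR or below Q1 - k*IQR."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)  # Q1, Q2 (median), Q3
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Hypothetical sample with one obviously extreme value (95).
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]

z_flagged = zscore_outliers(data)
iqr_flagged = iqr_outliers(data)

print("Z-score rule flags:", z_flagged)
print("IQR rule flags:   ", iqr_flagged)
```

In this sample the extreme value inflates both the mean and the SD, so its own Z-score stays below 3 and the Z-score rule flags nothing, while the quartile-based limits are barely affected and the IQR rule flags 95.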
Advanced Visualizations: Heatmap & Matrix Plot
• A heatmap is a graphical display of numerical data where color is used to denote values.
• In a data mining context, heatmaps serve two purposes: 1) visualizing correlation tables and
2) visualizing missing values in the data.
• For a table with p columns (variables) and n rows (observations), scanning the color-coding
is easier and faster than reading the values.
• Darker shades correspond to stronger (positive or negative) correlation.
• In a missing value heatmap, rows correspond to records and columns to variables.
• The original data set first needs to be binary coded, with 1 denoting a missing value and
0 otherwise.
• This new binary table is then colored such that only missing value cells are colored.
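The binary coding step for a missing-value heatmap can be sketched as follows. This is only an illustration under stated assumptions: the records, the column names and the helpers `missing_indicator` and `render` are hypothetical, `None` is assumed to mark a missing value, and a text rendering stands in for the colored cells that a plotting library would draw.

```python
# Hypothetical records; None is assumed to mark a missing value.
records = [
    {"age": 34, "income": 52000, "city": "Pune"},
    {"age": None, "income": 61000, "city": "Delhi"},
    {"age": 45, "income": None, "city": None},
    {"age": 29, "income": 48000, "city": "Mumbai"},
]
columns = ["age", "income", "city"]

def missing_indicator(rows, cols):
    """Binary-code the data: 1 where a value is missing, 0 otherwise.

    Rows of the result correspond to records, columns to variables.
    """
    return [[1 if row.get(col) is None else 0 for col in cols] for row in rows]

def render(matrix):
    """Text stand-in for the heatmap: only missing cells are 'colored' (#)."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in matrix
    )

indicator = missing_indicator(records, columns)
print(render(indicator))
```

Each printed row is one record and each character one variable, so clusters of `#` show at a glance which records or variables have many missing values.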
• Rescaling: Changing the scale of a display can enhance the visibility of the plot
and illuminate relationships. The “crowding” of data points near the axes can
be eliminated.
• Aggregation and Hierarchies: A useful manipulation in scaling is changing the
level of aggregation. For example, time series data can be aggregated by
seasonal factors.
• Zooming: One can zoom in on a certain part of the data on a plot to reveal
patterns and outliers.
• Filtering: Removing some of the observations from the plot. The purpose is to
focus on a certain part of the data by eliminating the “noise”.
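Three of the manipulations above can be sketched on a small series. The example is a hedged illustration only: the daily values are invented, quarters are assumed as the seasonal level of aggregation, a base-10 logarithm is assumed as the rescaling, and the cutoff of 1000 used for filtering is arbitrary.

```python
import math
from collections import defaultdict
from datetime import date

# Hypothetical daily series of (date, value) pairs with one very large value.
daily = [
    (date(2023, 1, 5), 120.0),
    (date(2023, 1, 20), 80.0),
    (date(2023, 2, 3), 150.0),
    (date(2023, 4, 11), 4000.0),
    (date(2023, 4, 25), 90.0),
    (date(2023, 7, 2), 110.0),
]

def aggregate_by_quarter(series):
    """Aggregation: sum the values within each (year, quarter)."""
    totals = defaultdict(float)
    for day, value in series:
        quarter = (day.month - 1) // 3 + 1
        totals[(day.year, quarter)] += value
    return dict(totals)

quarterly = aggregate_by_quarter(daily)

# Rescaling: a log10 scale spreads out the small values that would otherwise
# crowd near the axis next to the single large one.
log_values = [math.log10(value) for _, value in daily]

# Filtering: drop observations above an arbitrary cutoff to focus on the rest.
filtered = [(day, value) for day, value in daily if value < 1000]

print(quarterly)
print([round(v, 2) for v in log_values])
print(len(filtered), "observations kept after filtering")
```

Zooming is the plotting counterpart of filtering: instead of removing observations, the axis limits are narrowed to the region of interest.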