
BMT6114 Business Analytics and Intelligence: Module 3
Dr. Saurav Dash,
F214, MGB
E-mail: saurav.dash@vit.ac.in
Why data visualization?
• Vision is by far our most powerful sense

• amazing connection between human visual sensors and the brain

• Humans have an extraordinary visual capacity to detect patterns

• To display data effectively we must understand a bit about visual perception

• Visualization: converting raw data to a form that is viewable and understandable to humans.
Visual perception

• Can you see any pattern in this table?


Visual perception

• Can you see any pattern in this line graph?

Statistics and Graphs

Anscombe’s Quartet
• Four distinct datasets
• Each with statistical properties that are essentially identical:
 Mean of x = 9.0
 Mean of y = 7.5
 Variance of x = 11
 Variance of y = 4.13
 Nearly identical correlation and regression line
• Anscombe's quartet comprises four datasets that have nearly identical
simple descriptive statistics, yet appear very different when graphed.
Each dataset consists of eleven (x,y) points.
• They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.
• He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."
For all four datasets:

Property                                                Value               Accuracy
Mean of x                                               9                   exact
Sample variance of x                                    11                  exact
Mean of y                                               7.50                to 2 decimal places
Sample variance of y                                    4.125               plus/minus 0.003
Correlation between x and y                             0.816               to 3 decimal places
Linear regression line                                  y = 3.00 + 0.500x   to 2 and 3 decimal places, respectively
Coefficient of determination of the linear regression   0.67                to 2 decimal places
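The shared statistics can be checked directly. The sketch below assumes the standard published values of Anscombe's quartet (they are not reproduced on these slides) and uses only the Python standard library; the helper name `describe` is illustrative.

```python
from math import sqrt

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def describe(x, y):
    """Return the summary statistics listed in the table for one dataset."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    r = sxy / sqrt(sxx * syy)
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),   # sample variance
        "mean_y": mean_y,
        "var_y": syy / (n - 1),   # sample variance
        "r": r,                   # Pearson correlation
        "slope": slope,           # least-squares regression line
        "intercept": mean_y - slope * mean_x,
        "r2": r * r,              # coefficient of determination
    }

summary = {name: describe(x, y) for name, (x, y) in quartet.items()}
for name, stats in summary.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

All four rows of output agree to the accuracy given in the table, even though the scatter plots look nothing alike.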
Statistics and Graphs

Anscombe’s Quartet

• Four distinct datasets

• Each with statistical properties that are essentially identical

• But when plotted, they suddenly appear very different

• Anscombe suggested the combined use of graphs and statistical methods in data analysis
• The first scatter plot (top left) appears to be a simple linear
relationship, corresponding to two variables correlated and following the
assumption of normality.
• The second graph (top right) is not distributed normally; while a
relationship between the two variables is obvious, it is not linear, and
the Pearson correlation coefficient is not relevant. A more general
regression and the corresponding coefficient of determination would be
more appropriate.
• In the third graph (bottom left), the distribution is linear, but should have
a different regression line (a robust regression would have been called
for). The calculated regression is offset by the one outlier which exerts
enough influence to lower the correlation coefficient from 1 to 0.816.
• Finally, the fourth graph (bottom right) shows an example when one
outlier is enough to produce a high correlation coefficient, even though
the other data points do not indicate any relationship between the
variables.
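The effect of the single outlier in the third dataset can be reproduced numerically. This is a minimal sketch assuming the standard published values for dataset III; `pearson_r` is a hypothetical helper written for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Dataset III of Anscombe's quartet.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

r_all = pearson_r(x3, y3)

# Drop the one outlying point (x = 13, y = 12.74) and recompute.
kept = [(x, y) for x, y in zip(x3, y3) if y != 12.74]
r_without = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(f"with outlier:    r = {r_all:.3f}")
print(f"without outlier: r = {r_without:.3f}")
```

With the outlier the correlation is about 0.816; without it the remaining ten points lie almost exactly on a straight line.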
Why data visualization?

• People make better decisions when they’re based on understanding.


• For information to be understood, it must often be presented in visual
form because patterns, trends, and outliers require a picture for the
human brain to see and comprehend.
• Data visualization is essential for:
 Data exploration and understanding
 communicating data
 making better decisions
What is data visualization?

• a fundamental product from the Visual analytics process


• is the graphical display of abstract information for sense-making or data
analysis, and communication in a way that leads to understanding for
action.
• goal is to visualize data in a way that leads to understanding.
• Graphical presentation of data and information for
 Presentation of data, concepts, relationships
 Confirmation of hypotheses
 Exploration to discover patterns, trends, anomalies, structure, associations
• Useful across all areas of science, engineering, manufacturing,
commerce, education…..
Data representation
• The fundamental focus of data representation is mapping from data
values to graphical representations.
• Visualization designers use elementary graphical units called “graphical encodings” to map data to graphical representation.

• Graphical encoding means the use of visual display elements such as icon color, shape, size, or position to convey information about objects represented by the icons.
Statistics
• Descriptive: organising, summarising & describing data
• Correlational: relationships
• Inferential: generalising, significance
Types of Data
• Qualitative
 Categorical/Nominal
 Sex: Male, Female
 Region: North, South, West, East

 Ordinal
 How happy are you with the customer service? 1- Very Unhappy 2- Unhappy 3- Neutral 4- Happy 5- Very Happy

• Quantitative
 Physical measurements: Continuous or Discrete
• Discrete Variables:
 can be described using a specific and distinct point on a scale
 cannot be sub-divided any further, e.g. Gender = Male or Female
• Continuous Variables:
 Can theoretically take any value between two points on a continuum
 Are dependent on the accuracy of measuring tools
 e.g. Time = yr wk d h min s ms μs ns ps…
Levels of Measurement

1. Nominal Scale (lowest level)

2. Ordinal Scale

3. Interval Scale

4. Ratio Scale (highest level)
• Nominal
 The least like real numbers
 The only property they have is identity or name (nominal=name).
 Numbers if used are simply codes for the real names of the properties
• Ordinal
 Have identity
 Have magnitude (order). A>B>C>D.
 We know relative order.
 We DO NOT know how much better A was relative to B.
 Consider two races in which we know the order of finish (ranks).
Ordinal variables

• Race 1: Earl=1, Greg=2, Mike=3, Matt=4
• Race 2: KC=1, Sarah=2, Liza=3, Marci=4

Can we say who was fastest overall?

Was Marci, who was slowest in her race, faster or slower than Greg, who was second in his race?
• Interval
 Have identity and magnitude.
 Also have known distance between values.
 Form a true scale, but without a true zero point.

• Ratio
 Have identity, magnitude and interval.
 And since they have a true zero, they can be expressed as ratios of each other.
 They are true numbers.
 All mathematical properties apply.
 They represent score data.
Summary Statistics
• not visual
• sample statistics of data X
 mean: $\mu = \sum_i X_i / n$
 mode: most common value in X
 median: X = sort(X), median = $X_{n/2}$ (half below, half above)
 quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
 interquartile range: value(Q3) - value(Q1)
 range: max(X) - min(X) = $X_n - X_1$
 variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
 skewness: $\left[\sum_i (X_i - \mu)^3 / n\right] / \left[\sum_i (X_i - \mu)^2 / n\right]^{3/2}$
 zero if symmetric

 number of distinct values for a variable (see unique() in R)
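As an illustration, these summary statistics can be computed with the Python standard library. The data values are made up for the example, and skewness is taken here as the third standardized moment (an assumption about the intended normalization).

```python
import statistics

def skewness(xs):
    """Third standardized moment: zero for symmetric data, positive if right-skewed."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n   # population variance
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

data = [2, 4, 4, 5, 7, 9, 11]               # hypothetical sample

mean = statistics.mean(data)
mode = statistics.mode(data)                 # most common value
median = statistics.median(data)             # half below, half above
data_range = max(data) - min(data)
variance = statistics.pvariance(data)        # divides by n, as in the formula above
distinct = len(set(data))                    # counterpart of unique() in R
skew = skewness(data)

print(mean, mode, median, data_range, round(variance, 3), distinct, round(skew, 3))
```

Quartile conventions differ between tools, so Q1 and Q3 are left out of this sketch.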


Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
 Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

 Standard deviation s (or σ) is the square root of variance s² (or σ²)
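"Scalable" here means the variance can be accumulated in a single pass from running sums of x and x². A small sketch with made-up data, comparing the definitional and the one-pass forms (function names are illustrative):

```python
import statistics

def sample_variance_two_pass(xs):
    """Definition: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_one_pass(xs):
    """Algebraic form: needs only the running count, sum and sum of squares."""
    n, total, total_sq = 0, 0.0, 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)

def population_variance_one_pass(xs):
    """Population form: mean of squares minus square of the mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu

data = [4, 8, 15, 16, 23, 42]                # hypothetical values

s2_two_pass = sample_variance_two_pass(data)
s2_one_pass = sample_variance_one_pass(data)
sigma2 = population_variance_one_pass(data)
print(s2_two_pass, s2_one_pass, sigma2)
```

The one-pass form is convenient for large or streaming data, but it can lose floating-point precision when the mean is very large relative to the spread.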


• The formula to find skewness manually is
skewness = (mean - median) / standard deviation
Properties of Normal Distribution Curve
• The normal (distribution) curve
 From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ:
standard deviation)
 From μ–2σ to μ+2σ: contains about 95% of it
 From μ–3σ to μ+3σ: contains about 99.7% of it
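These percentages follow from the normal cumulative distribution function. A quick check using the error function from Python's `math` module (the helper name `prob_within` is illustrative):

```python
from math import erf, sqrt

def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return erf(k / sqrt(2))

coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} sigma: {100 * p:.2f}%")
```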
Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of five-number summary
• Histogram: the x-axis shows values, the y-axis shows frequencies
• Scatter plot: each pair of values is a pair of coordinates and plotted as
points in the plane
Boxplot Analysis
• Five-number summary of a distribution
 Minimum, Q1, Median, Q3, Maximum
• Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height/length of the box is
IQR
 The median is marked by a line within the
box
 Whiskers: two lines outside the box
extended to Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually
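The numbers behind a boxplot can be computed as follows. This is a sketch with made-up data; it assumes the common 1.5 × IQR rule as the outlier threshold and the "inclusive" quartile method of Python's `statistics.quantiles` (other software may use slightly different quartile definitions).

```python
import statistics

def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, median, q3, max(xs)

def outliers(xs, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < low or x > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]   # hypothetical values

summary = five_number_summary(data)
flagged = outliers(data)
print(summary)
print(flagged)
```

Here the box runs from 3.5 to 8.5 (IQR = 5), the median line sits at 6, and the value 50 lies above the upper threshold of 16, so it would be plotted individually.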
Boxplots
• Shows a lot of information about a
variable in one plot
 Median
 IQR
 Outliers
 Range
 Skewness
• Negatives
 Overplotting
 Hard to tell distributional shape
 no standard implementation in software
(many options for whiskers, outliers)
Boxplots
• Boxplots provide visual summaries of:
 The center of the data (the median – the center line of the box).
 The variation or spread of the data (interquartile range – the box
height).
 The skewness of the data (the relative size of the box halves).
 Presence or absence of unusual values (“outside” and “far outside” values).
 Boxplots are typically put side-by-side to visually compare and contrast groups of data.
Box and Whisker Plot

Box plot:
A box plot is a concise graph showing the five point summary. Multiple box plots can be drawn side by side to compare more than one data set.

Advantages:
• Shows 5-point summary and outliers
• Easily compares two or more data sets
• Handles extremely large data sets easily

Disadvantages:
• Not as visually appealing as other graphs
• Exact values other than the five summary points (and any plotted outliers) cannot be determined from the box plot
If there is an even data set…
Histogram

• Displays large amounts of data that are difficult to interpret in tabular form
• Shows the relative frequency of occurrence of the various data values
• Reveals the centering, variation, and shape of the data
• Illustrates quickly the underlying distribution of the data
• Graph displays of tabulated frequencies, shown as bars
• shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the
value, not the height as in bar charts, a crucial distinction when the
categories are not of uniform width
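The area-versus-height distinction matters once bins have unequal widths. The sketch below uses made-up data and bin edges; the bar height is the density, count / (n × bin width), so that each bar's area equals the proportion of cases in that bin.

```python
def histogram(values, edges):
    """Return (counts, densities) for bins [edges[i], edges[i+1]); last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    densities = [c / (total * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]
    return counts, densities

ages = [3, 7, 8, 12, 15, 18, 22, 35, 38, 39]   # hypothetical data
edges = [0, 10, 20, 40]                         # last bin is twice as wide

counts, densities = histogram(ages, edges)
areas = [d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities)]
print(counts)      # cases per bin
print(densities)   # bar heights
print(areas)       # proportions per bin
```

The widest bin holds the most cases (4 of 10) yet has the lowest bar, because its cases are spread over twice the width.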
Histograms Often Tell More than Boxplots

• The two histograms shown on the left may have the same boxplot representation
 The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions
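A small constructed example of this effect: two made-up datasets whose five-number summaries coincide (using the "inclusive" quartile method of Python's `statistics.quantiles`, an assumption about the quartile convention) while their frequency counts differ.

```python
import statistics
from collections import Counter

def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, median, q3, max(xs)

evenly_spread = [0, 1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical: one case per value
clustered = [0, 0, 2, 4, 4, 4, 6, 8, 8]       # hypothetical: peak in the middle, lumps at the ends

summary_a = five_number_summary(evenly_spread)
summary_b = five_number_summary(clustered)
print(summary_a, summary_b)                    # identical summaries

# Frequency of each value: what a histogram with unit-width bins would show.
freq_a = Counter(evenly_spread)
freq_b = Counter(clustered)
print(sorted(freq_a.items()))
print(sorted(freq_b.items()))
```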
Histograms

Histogram:
A histogram is a type of bar graph that displays continuous data in ordered columns called intervals. Categories are of continuous measure such as time, length, temperature, etc. Bars have the same width and are drawn next to each other with no gaps.

Advantages:
• Visually strong
• Can compare to normal curve
• Usually vertical axis is a frequency count of items falling into each category

Disadvantages:
• Cannot read exact values from histogram because data is grouped into categories
• More difficult to compare two data sets
• Use only with continuous data (intervals)
Single Variable Visualization
• Histogram:
 Displays large amounts of data that are difficult to interpret in tabular form
 Shows the relative frequency of occurrence of the various data values
 Reveals the centering, variation, and shape of the data
 Illustrates quickly the underlying distribution of the data
 Shows center, variability, skewness, outliers, or strange patterns.
Issues with Histograms
• For small data sets, histograms can be misleading.
 Small changes in the data, bins, or anchor can deceive

• For large data sets, histograms can be quite effective at illustrating general properties of the distribution.

• Histograms effectively only work with 1 variable at a time
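To see how the anchor alone can change the picture for a small data set, the sketch below bins the same six made-up values with the same bin width but two different starting points (`bin_counts` is a hypothetical helper).

```python
from math import floor

def bin_counts(values, width, anchor):
    """Count values per bin [left, left + width), with bins aligned to `anchor`."""
    counts = {}
    for v in values:
        left = anchor + floor((v - anchor) / width) * width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

data = [1.9, 2.1, 3.9, 4.1, 5.9, 6.1]         # hypothetical small sample

anchored_at_0 = bin_counts(data, width=2, anchor=0)
anchored_at_1 = bin_counts(data, width=2, anchor=1)
print(anchored_at_0)   # {0: 1, 2: 2, 4: 2, 6: 1}
print(anchored_at_1)   # {1: 2, 3: 2, 5: 2}
```

With bins starting at 0 the histogram looks peaked in the middle; shifting the bins by one unit makes the same data look perfectly flat.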


Explanatory versus Exploratory Graphs
• An Exploratory Visualization is:
 A way to get familiar with the data one is working with.
 Often drawn from a series of set graphical techniques.
 A slow process – looking at the data in order to find the one or two things that are
interesting.

• An Explanatory Visualization is:


 Showing those one or two interesting things about the data.
 The process of using data in order to tell a story or to reinforce a narrative you have
already determined based on your other research.
 The primary type of data visualization we interact with both inside and outside of the
academy beyond our own work.

• These two are not mutually exclusive.


 Sometimes the process of expanding upon or modifying an explanatory
visualization created for another purpose can yield interesting results.
EDA and Visualization
• Exploratory Data Analysis (EDA) and Visualization are very important
steps in any analysis task.

• get to know the data!


 distributions (symmetric, normal, skewed)
 data quality problems
 outliers
 correlations and inter-relationships
 subsets of interest
 suggest functional relationships
Exploratory Data Analysis (EDA)
• Goal: get a general sense of the data
 means, medians, quantiles, histograms, boxplots
• You should always look at every variable; you will learn something!
• data-driven (model-free)
• Think interactive and visual
 Humans are the best pattern recognizers
 You can use more than 2 dimensions!
 x,y,z, space, color, time….

• Especially useful in early stages of data mining


 detect outliers (e.g. assess data quality)
 test assumptions (e.g. normal distributions or skewed?)
 identify useful raw data & transforms (e.g. log(x))

• Bottom line: it is always well worth looking at data!


Quantile Plots
[Figure: Quantile plot of Ithaca Maximum Temperature, January 1987. x-axis: Maximum Temperature (degree F), 0 to 60; y-axis: Cumulative Frequency, 0.0 to 1.0]

Quantile plots visually portray the quantiles, or percentiles (which equal the quantiles times 100) of the distribution of sample data.

Advantages:
1. All of the data are displayed, unlike a boxplot.
2. Every point has a distinct position, without overlap.
3. Arbitrary categories are not required, as with histograms.
Quantile Plots
• A quantile plot is a plot of the data values on the vertical axis against an empirical assessment of the fraction of observations exceeded by the data value….

• A very useful quantile plot is the Normal-Quantile-Quantile plot. It is often used by analysts to determine whether a data set came from a normal distribution.

• A Normal Quantile Quantile plot is a plot of the empirical (data) quantiles against the corresponding quantiles of the normal distribution…
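The coordinates for both kinds of plot can be computed without any plotting library. This sketch assumes the plotting position (i - 0.5) / n for the i-th sorted value, which is one common convention among several; the data and helper names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def quantile_plot_points(values):
    """(fraction, value) pairs: each sorted value against its plotting position."""
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]

def normal_qq_points(values):
    """(normal quantile, data quantile) pairs for a Normal Q-Q plot."""
    std_normal = NormalDist()                  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf(p), x) for p, x in quantile_plot_points(values)]

def straightness(points):
    """Pearson correlation of the plotted points; close to 1 means nearly a straight line."""
    n = len(points)
    mean_a = sum(a for a, _ in points) / n
    mean_b = sum(b for _, b in points) / n
    saa = sum((a - mean_a) ** 2 for a, _ in points)
    sbb = sum((b - mean_b) ** 2 for _, b in points)
    sab = sum((a - mean_a) * (b - mean_b) for a, b in points)
    return sab / sqrt(saa * sbb)

# Hypothetical samples: one built from normal quantiles, one strongly right-skewed.
normal_like = [50 + 10 * NormalDist().inv_cdf((i - 0.5) / 9) for i in range(1, 10)]
skewed = [1, 2, 4, 8, 16, 32, 64, 128, 256]

r_normal = straightness(normal_qq_points(normal_like))
r_skewed = straightness(normal_qq_points(skewed))
print(round(r_normal, 3), round(r_skewed, 3))
```

Points from normally distributed data fall close to a straight line; the skewed sample bends away from it, which shows up here as a lower correlation.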
Typical Deviations from Straight Line Patterns

• Outliers

• Curvature at both ends (long or short tails)

• Convex/concave curvature (asymmetry)

• Horizontal segments, plateaus, gaps


[Figure: four Normal Q-Q plot panels titled Outliers, Long Tails, Short Tails, Asymmetry]
