Intelligence:
Module 3
Dr. Saurav Dash,
F214, MGB
E-mail: saurav.dash@vit.ac.in
Why data visualization?
• Vision is by far our most powerful sense
Statistics and Graphs
Anscombe’s
Quartet
• Four distinct datasets
• Each with statistical
properties that are
essentially identical
Mean of x = 9.0
Mean of y = 7.5
Variance of x = 11
Variance of y = 4.13
Nearly identical
correlation and
regression line
• Anscombe's quartet comprises four datasets that have nearly identical
simple descriptive statistics, yet appear very different when graphed.
Each dataset consists of eleven (x,y) points.
• They were constructed in 1973 by the statistician Francis Anscombe
to demonstrate both the importance of graphing data before analyzing it
and the effect of outliers on statistical properties.
He described the article as being intended to counter the impression
among statisticians that "numerical calculations are exact, but graphs are
rough."
For all four datasets:
• Mean of x: 9 (exact)
• Coefficient of determination of the linear regression: 0.67 (to 2 decimal places)
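The shared statistics can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the data values are those of the published Anscombe (1973) quartet, which the slides do not list, and the names `QUARTET` and `summarize` are illustrative. It assumes the sample variance (divisor n - 1), since that is the convention that reproduces the variances of 11 and 4.13 quoted for the quartet.

```python
from math import sqrt

# Anscombe's quartet (values from Anscombe, 1973; not listed in the slides).
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for datasets I-III
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample variances, Pearson r and the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),  # sample variance (assumed convention)
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four rows print essentially the same numbers, which is the point of the quartet: the summary table alone cannot tell the datasets apart.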
Statistics and Graphs
Anscombe’s Quartet
• Anscombe suggested
the combined use of
graphical and statistical
methods in data
analysis
• The first scatter plot (top left) appears to be a simple linear
relationship, corresponding to two correlated variables that follow the
assumption of normality.
• The second graph (top right) is not distributed normally; while a
relationship between the two variables is obvious, it is not linear, and
the Pearson correlation coefficient is not relevant. A more general
regression and the corresponding coefficient of determination would be
more appropriate.
• In the third graph (bottom left), the relationship is linear, but should have
a different regression line (a robust regression would have been called
for). The calculated regression is offset by the one outlier which exerts
enough influence to lower the correlation coefficient from 1 to 0.816.
• Finally, the fourth graph (bottom right) shows an example when one
outlier is enough to produce a high correlation coefficient, even though
the other data points do not indicate any relationship between the
variables.
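The effect of a single outlier in the third and fourth graphs can be illustrated with a short sketch. This is an illustration under stated assumptions rather than part of the original slides: the data are the published Anscombe values for datasets III and IV, the outlier is picked out simply as the largest y in dataset III and as x = 19 in dataset IV, and `pearson_r` and `drop_index` are hypothetical helper names.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation; returns None if either variable has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined without variation
    return sxy / sqrt(sxx * syy)


def drop_index(values, i):
    """Return a copy of values without the element at position i."""
    return values[:i] + values[i + 1:]


# Dataset III: ten points on a near-perfect line plus one outlier at (13, 12.74).
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
i3 = y3.index(max(y3))  # assumption: treat the largest y as the outlier
r3_all = pearson_r(x3, y3)
r3_trimmed = pearson_r(drop_index(x3, i3), drop_index(y3, i3))

# Dataset IV: every x equals 8 except one point at x = 19.
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
i4 = x4.index(19)
r4_all = pearson_r(x4, y4)
r4_trimmed = pearson_r(drop_index(x4, i4), drop_index(y4, i4))

print("III:", r3_all, "->", r3_trimmed)
print("IV: ", r4_all, "->", r4_trimmed)
```

Without its outlier, dataset III has a correlation very close to 1, while dataset IV has no variation in x at all, so no correlation can be computed.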
Why data visualization?
• Organising, summarising & describing data
• Correlational: relationships
• Generalising: significance
Types of Data
• Qualitative
Categorical/Nominal
Sex: Male, Female
Region: North, South, West,
East
Ordinal
How happy are you with the
customer service?
1 - Very Unhappy, 2 - Unhappy, 3 - Neutral, 4 - Happy, 5 - Very Happy
• Quantitative
Physical measurements:
Continuous or Discrete
• Discrete Variables:
Can be described using a specific and distinct point on a scale
Cannot be subdivided any further, e.g. Gender = Male or Female
• Continuous Variables:
Can theoretically take any value between two points on a continuum
Are dependent on the accuracy of measuring tools
e.g. Time = yr, wk, d, h, min, s, ms, µs, ns, ps…
Levels of Measurement
1. Nominal Scale
Lowest Level
2. Ordinal Scale
3. Interval Scale
4. Ratio Scale
Highest Level
• Nominal
The least like real numbers.
The only property they have is identity or name (nominal = name).
Numbers, if used, are simply codes for the real names of the properties.
• Ordinal
Have identity
Have magnitude (order). A>B>C>D.
We know relative order.
We DO NOT know how much better A was relative to B.
Consider two races in which we know the order of finish (ranks).
Ordinal variables
• Race 1: Earl=1, Greg=2, Mike=3, Matt=4
• Race 2: KC=1, Sarah=2, Liza=3, Marci=4
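The two races can be used to show what ranks discard. In the sketch below the runner names come from the slide, but the finishing times are invented purely for illustration, and `ranks` and `gaps` are hypothetical helper names. Both races produce the same ranks 1 to 4 even though the margins between finishers are very different.

```python
# Hypothetical finishing times in seconds (invented for illustration only).
race_1 = {"Earl": 61.0, "Greg": 61.5, "Mike": 62.0, "Matt": 62.5}
race_2 = {"KC": 58.0, "Sarah": 70.0, "Liza": 71.0, "Marci": 90.0}


def ranks(times):
    """Map each runner to a finishing position (1 = fastest)."""
    ordered = sorted(times, key=times.get)
    return {name: position for position, name in enumerate(ordered, start=1)}


def gaps(times):
    """Time differences between consecutive finishers."""
    ordered = sorted(times.values())
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


ranks_1, ranks_2 = ranks(race_1), ranks(race_2)
gaps_1, gaps_2 = gaps(race_1), gaps(race_2)

print(ranks_1, gaps_1)  # evenly spaced finish
print(ranks_2, gaps_2)  # same ranks, very uneven margins
```

The rank values are identical across the two races, so any analysis based only on ranks cannot recover how much better one runner was than the next.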
• Ratio
Have identity, magnitude and interval.
And since they have a true zero, they can be expressed as ratios of each other.
They are true numbers.
All mathematical properties apply.
They represent score data.
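Why a true zero matters for ratios can be sketched with temperature, a standard textbook illustration that is not taken from the slides. Celsius is usually treated as an interval scale (its zero is arbitrary), whereas Kelvin has a true zero and behaves as a ratio scale. The function name `celsius_to_kelvin` and the two readings are assumptions made for the example.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (true zero at absolute zero)."""
    return celsius + 273.15


cold_c, warm_c = 10.0, 20.0  # hypothetical readings

# Differences are meaningful on both scales and agree with each other.
diff_celsius = warm_c - cold_c
diff_kelvin = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# The Celsius ratio depends on an arbitrary zero; only the Kelvin ratio
# can be read as "how many times as much".
ratio_celsius = warm_c / cold_c
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(diff_celsius, diff_kelvin)
print(ratio_celsius, ratio_kelvin)
```

The Celsius ratio comes out as 2, but 20 °C is not "twice as hot" as 10 °C; on the Kelvin scale the ratio is only slightly above 1.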
Summary Statistics
• not visual
• sample statistics of data X
mean: $\mu = \sum_i X_i / n$
mode: most common value in X
median: X = sort(X), median = $X_{n/2}$ (half below, half above)
quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
interquartile range: value(Q3) - value(Q1)
range: max(X) - min(X) = $X_n - X_1$
variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
skewness: $\sum_i (X_i - \mu)^3 / \left[ \left( \sum_i (X_i - \mu)^2 \right)^{3/2} \right]$
zero if symmetric;
population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
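The summary statistics listed on this slide can be computed directly from their definitions. The sketch below is one possible reading, not the author's implementation: it takes the index in $X_{0.25n}$, $X_{n/2}$ and $X_{0.75n}$ as 1-based and rounded up (an assumption, since the slide does not say how fractional indices are handled), uses the divisor n for the variance as on the slide, and computes skewness exactly as written there. The more common moment coefficient of skewness divides both sums by n first, which rescales the value by a factor of sqrt(n) but does not change the "zero if symmetric" property. The names `order_statistic` and `summary` and the sample data are illustrative.

```python
from collections import Counter
from math import ceil


def order_statistic(sorted_x, fraction):
    """Return X_(fraction * n), 1-based, rounding the index up (assumption)."""
    n = len(sorted_x)
    index = max(1, ceil(fraction * n))
    return sorted_x[index - 1]


def summary(x):
    """Summary statistics following the definitions on the slide."""
    xs = sorted(x)
    n = len(xs)
    mean = sum(xs) / n
    dev2 = sum((v - mean) ** 2 for v in xs)  # sum of squared deviations
    dev3 = sum((v - mean) ** 3 for v in xs)  # sum of cubed deviations
    q1 = order_statistic(xs, 0.25)
    q3 = order_statistic(xs, 0.75)
    return {
        "mean": mean,
        "mode": Counter(xs).most_common(1)[0][0],
        "median": order_statistic(xs, 0.5),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": xs[-1] - xs[0],
        "variance": dev2 / n,
        "skewness": dev3 / dev2 ** 1.5 if dev2 > 0 else 0.0,
    }


sample = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical data
stats = summary(sample)
symmetric_stats = summary([1, 2, 3, 4, 5])
print(stats)
print(symmetric_stats["skewness"])
```

For an even number of observations the slide's median $X_{n/2}$ is the lower of the two middle values; many libraries average the two instead, so results can differ slightly.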
• Displays large amounts of data that are difficult to interpret in tabular form
• Shows the relative frequency of occurrence of the various data values
• Reveals the centering, variation, and shape of the data
• Illustrates quickly the underlying distribution of the data
• Graphical display of tabulated frequencies, shown as bars
• Shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that the area of the bar, not its height,
denotes the value, a crucial distinction when the categories are not of
uniform width
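The area-versus-height distinction can be made concrete with unequal bin widths. The following sketch is illustrative only: the data and bin edges are invented, and `histogram_density` is a hypothetical helper. It scales each bar's height to count / (n × width), so that the bar's area equals the proportion of cases in that bin.

```python
def histogram_density(data, edges):
    """Counts, widths and densities for bins [edges[k], edges[k+1]).

    The last bin also includes its right edge.
    """
    n = len(data)
    counts = [0] * (len(edges) - 1)
    for value in data:
        for k in range(len(counts)):
            is_last = k == len(counts) - 1
            in_bin = edges[k] <= value < edges[k + 1]
            if in_bin or (is_last and value == edges[k + 1]):
                counts[k] += 1
                break
    widths = [high - low for low, high in zip(edges, edges[1:])]
    densities = [c / (n * w) for c, w in zip(counts, widths)]
    return counts, widths, densities


ages = [3, 7, 12, 15, 18, 22, 25, 31, 38, 44, 52, 67]  # hypothetical data
edges = [0, 10, 20, 40, 80]                            # unequal bin widths
counts, widths, densities = histogram_density(ages, edges)
areas = [d * w for d, w in zip(densities, widths)]

for k, (c, d, a) in enumerate(zip(counts, densities, areas)):
    print(f"bin {edges[k]}-{edges[k + 1]}: count={c}, height={d:.4f}, area={a:.3f}")
```

Here the bin 20-40 holds the most cases, yet the tallest bar belongs to the narrower bin 10-20; plotting raw counts as heights would overstate the wide bins.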
Histograms Often Tell More than Boxplots
Quantile Plots
• Quantile plots portray the quantiles, or percentiles (which equal the
quantiles times 100), of the distribution of sample data.
[Figure: quantile plot; vertical axis: Cumulative Frequency, horizontal axis: Maximum Temperature (degree F)]
Advantages:
1. All of the data are displayed, unlike a boxplot.
2. Every point has a distinct position, without overlap.
3. Arbitrary categories are not required, as with histograms.
• A quantile plot is a plot of the data values on the vertical axis against
an empirical assessment of the fraction of observations exceeded by
the data value….
• Outliers
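The construction of a quantile plot described above can be sketched as follows. The slides do not say which "empirical assessment" of the fraction is used, so this example assumes the simple plotting position p_i = (i - 0.5) / n for the i-th smallest of n values; other conventions exist. The function name `quantile_plot_points` and the temperature values are hypothetical.

```python
def quantile_plot_points(data):
    """Pair each sorted value with its plotting position (i - 0.5) / n.

    The plotting-position formula is an assumed convention, not one
    specified in the slides.
    """
    xs = sorted(data)
    n = len(xs)
    return [(value, (i - 0.5) / n) for i, value in enumerate(xs, start=1)]


# Hypothetical maximum temperatures in degrees F.
max_temps_f = [12, 25, 31, 18, 44, 37, 52, 29, 40, 22]
points = quantile_plot_points(max_temps_f)

for value, fraction in points:
    print(f"value={value:>3}  cumulative fraction={fraction:.2f}")
```

Every observation gets its own point, which is why the plot shows all of the data without overlap and needs no bin choices.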