
Unit 3.

Data Exploration

Assoc. Prof Nguyen Manh Tuan


AGENDA

 Roles of data exploration


 Datasets – Data types
 Descriptive statistics (univariate, multivariate)
 Data visualization

2/20/2024 internal use


Roles of Data Exploration

 Understanding data
 Data preparation
 Key part of 2 phases (Data Understanding & Data
Preparation in CRISP-DM)
 Data quality report
 Data mining tasks
 Interpreting data mining results





Roles of Data Exploration

 A data quality report includes tabular reports that describe the characteristics of each feature in an analytics base table (ABT) using standard statistical measures of central tendency and variation.
 The tabular reports are accompanied by data visualizations:
 A histogram for each continuous feature in an ABT
 A bar plot for each categorical feature in an ABT



Roles of Data Exploration
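 Cardinality: the number of distinct values present for a feature. Irregular cases to check:
 Card = 1: the feature takes a single value for every instance
 Categorical as continuous: a categorical feature wrongly labeled as continuous
 Missing values: a share such as 60% missing is excessive
 Outliers: values that lie far away from the central tendency of a feature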



Example: Motor Insurance Fraud



Datasets – Data types

http://commons.wikimedia.org/wiki/File:Iris_versicolor_3.jpg#mediaviewer/File:Iris_versicolor_3.jpg





Datasets – Data types
 Software developers are not encouraged to ask questions, but data scientists are:
 What exciting things might you be able to learn from a given data set?
 What data sets might get you there?
 What things do you/your people really want to know?

 The good data scientist develops a curiosity about the domain/application they are working in.
 They talk shop with the people whose data they are working on.
 They read the newspaper every day, to get a broader perspective on the world.
Let's practice asking questions with the 6W interrogative framework: Who, What, How, Where, When, and Why on the datasets.



Example: IMDb - Movie Data



Example: IMDb - Actor Data



MOVIE questions
 Can we predict how well people will like a movie? What about its gross?
 How well does movie gross correlate with viewer ratings or awards?
 What was the highest rated film each year, or the best in each genre? Which movies
lost the most money, had the highest-powered casts, or got the least favorable
reviews?
 Which actors appeared in the most films? Earned the most money? Appeared in the
lowest rated films? Had the longest career or the shortest lifespan?
 How do Hollywood movies compare to Bollywood movies, in terms of ratings,
budget, and gross? Are American movies better received than foreign
films, and how does this differ between U.S. and non-U.S. reviewers?
 What does the social network of actors look like? (Six degrees of Kevin Bacon)
 What is the age distribution of actors and actresses in film?
 Do stars live longer or shorter lives than the bit players or public?



Example: NYC TaxiCab Data
 Gives driver/owner, pickup/dropoff location, and fare data for every taxi trip
taken.
 Data obtained from NYC via a Freedom of Information Act (FOIA) request



TaxiCab Questions
 How much do drivers make each night?
 How far do they travel?
 How much slower is traffic during rush hour?
 Where are people traveling to/from at different times of the day?
 Do faster drivers get tipped better?
 Where should drivers go to pick up their next fare?
 But the bigger questions have to do with understanding transportation in
the city. We can use the taxi travel times as a sensor to measure the level of
traffic in the city at a fine level.
 How much slower is traffic during rush hour than other times, and where are delays the
worst?
 Identifying problem areas is the first step to proposing solutions, by changing the timing
patterns of traffic lights, running more buses, or creating high-occupancy only lanes.



TaxiCab Questions
 Similarly, we can use the taxi data to measure transportation flows across
the city.
 Where are people traveling to, at different times of the day? This tells
us much more than just congestion.
 By looking at the taxi data, we should be able to see tourists going from hotels to attractions,
executives from fancy neighborhoods to Wall Street, and drunks returning home from
nightclubs after a bender.
 Data like this is essential to designing better transportation systems.
 It is wasteful for a single rider to travel from point a to point b when there is another
rider at point a+ who also wants to get there.
 Analysis of the taxi data enables accurate simulation of a ride sharing system, so we can
accurately evaluate the demands and cost reductions of such a service.



Datasets – Data types
 Types: structured vs unstructured data
 Structured: matrix (objects/items; properties of objects/items)
 Unstructured: collection of tweets → matrix (tweet; frequently used vocabulary word)
 Feature (or dimensions, attributes, variables): a data field, representing a characteristic or feature of a data object.
 E.g., customer_ID, name, address
 Data types:
 Categorical/Nominal
 Continuous/Numeric: interval-scaled, ratio-scaled
 Common structured data file types:
 XLS (Excel spreadsheets)
 CSV (Comma Separated Values)
 TSV (Tab Separated Values)
 XML (eXtensible Markup Language)
 RSS (Really Simple Syndication)
 JSON (JavaScript Object Notation)



Datasets – Data types (Categorical)
 Nominal: categories, states, or "names of things"/labels describing the properties of the object under investigation
 Hair_color = {black, blond, brown, grey, red}; marital status, occupation, ID numbers, zip codes
 Numeric codes (black hair = 0, blond = 1, brown = 2, ...) are labels only: "max/min hair color" or "my hair color – your hair color" is meaningless
 Binary (binominal)
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV positive)
 Polynominal: Nominal attribute with multiple states (0, 1, 2, ...)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between successive values is not known.
 Size = {small, medium, large}, grades, army rankings



Datasets – Data types (Numeric)
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in °C or °F, calendar dates
 No true zero-point
 Additive/subtractive; logical (greater/less than; equal to) operators
 Ratio/real
 Inherent zero-point
 We can speak of values as being a multiple of the unit of measurement (10 K is twice as high as 5 K).
 e.g., temperature in Kelvin, length, counts, monetary quantities
 Ratio operations



Descriptive Statistics - Univariate



Measuring the Central Tendency

 Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Note: n is sample size and N is population size.
 Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
 Median:
 Middle value if odd number of values, or average of the middle two values otherwise

 Mode:
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 If each data value occurs only once, then there is no mode.
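
The measures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the sample `x`, the weights `w`, and the helper names `weighted_mean` and `modes` are all hypothetical.

```python
from collections import Counter
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def modes(values):
    """Return every most-frequent value, or [] if each value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no mode: each data value occurs only once
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical sample and weights (assumptions for illustration only).
x = [2, 3, 3, 5, 7, 7, 8]
w = [1, 1, 1, 1, 1, 1, 2]

sample_mean = mean(x)
sample_median = median(x)
sample_modes = modes(x)          # two values tie, so the sample is bimodal
sample_wmean = weighted_mean(x, w)
```

With equal weights the weighted mean reduces to the ordinary mean; here the last value counts twice, which pulls the result above the plain mean.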



Symmetric vs. Skewed Data

 Median, mean and mode of symmetric, positively/right and negatively/left skewed data
[Figure: three unimodal distributions: symmetric, skewed right, skewed left]



Well-known, common distributions

A feature characterized by a multimodal distribution has two or more very commonly occurring ranges of values that are clearly separated.

Properties of Normal Distribution Curve

 The 68-95-99.7 rule is a useful characteristic of the normal distribution. The rule states that approximately:
- 68% of the observations will be within one σ of µ
- 95% of observations will be within two σ of µ
- 99.7% of observations will be within three σ of µ.
 Features following a normal distribution are characterized by a strong tendency towards a central value and symmetrical variation to either side of this.
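
The three percentages in the rule can be checked numerically. The sketch below relies on the standard identity P(|X − µ| < kσ) = erf(k/√2) for a normal variable; the function name `prob_within` is a hypothetical helper, not something from the slides.

```python
import math


def prob_within(k):
    """P(|X - mu| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


# Coverage for one, two and three standard deviations.
coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} sigma: {p:.4f}")
```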


Measuring the Dispersion of Data

 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$

 Standard deviation s (or σ) is the square root of variance s² (or σ²)
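
A minimal sketch of the sample and population variance, including the "scalable" form that needs only the running sums Σx and Σx². The data and function names are hypothetical. Note that the one-pass form can lose precision on data with a large mean and a small spread, so it is shown here only to confirm that both forms agree.

```python
import math


def sample_variance(xs):
    """s^2 = sum((x - xbar)^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_variance_one_pass(xs):
    """Same quantity computed from the sums of x and x^2 only."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x * x for x in xs)
    return (s2 - s1 * s1 / n) / (n - 1)


def population_variance(xs):
    """sigma^2 = sum((x - mu)^2) / N."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


# Hypothetical data (an assumption for illustration only).
x = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(x)
s2_fast = sample_variance_one_pass(x)
sigma2 = population_variance(x)
sigma = math.sqrt(sigma2)
```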



Measuring the Dispersion of Data

X = {1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1}. What are Q1, Q2 and Q3?

1) X (ascending): X = {1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5}
2) X has 14 observations, so the middle values are the 7th (6.8) and 8th (7.2). Median = Q2 = average of 7th and 8th = 7
3) Q1: median of X1 = {1; 1; 2; 2; 4; 6; 6.8}. X1 has 7 observations, so its median is the 4th value, 2. => Q1 = 2
4) Q3: median of X2 = {7.2; 8; 8.3; 9; 10; 10; 11.5}. X2 has 7 observations, so its median is the 4th value, 9. => Q3 = 9
--> Conclusion on Q1: at least ¼ of the values in X are ≤ 2, and at least ¾ are ≥ 2.
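
The four steps of this worked example can be reproduced with a short sketch. It uses the same "median of each half" method as the slide; the function name `quartiles_by_halves` is hypothetical, and dropping the middle value from both halves when n is odd is an assumed convention (the slide only shows an even n). Library routines such as percentile functions may use other interpolation rules and give slightly different quartiles.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, upper half.

    Requires at least two values. For odd n the middle value is left out
    of both halves (an assumed convention).
    """
    xs = sorted(values)
    half = len(xs) // 2
    lower = xs[:half]
    upper = xs[-half:]
    return median(lower), median(xs), median(upper)


X = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
q1, q2, q3 = quartiles_by_halves(X)
print(q1, q2, q3)
```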



Boxplot Analysis

 Five-number summary of a distribution
 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extended to Minimum and Maximum
 Outliers: points beyond a specified outlier threshold, plotted individually
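
The numbers behind a boxplot can be computed as below. This is a sketch under stated assumptions: the outlier threshold is taken to be 1.5 × IQR beyond the quartiles, the whiskers are drawn to the most extreme values that are not outliers (a common plotting convention), and the data set, including the value 30, is hypothetical. `boxplot_stats` is not a library function.

```python
from statistics import median


def boxplot_stats(values, k=1.5):
    """Five-number summary, IQR, outlier fences, whisker ends and outliers."""
    xs = sorted(values)
    half = len(xs) // 2
    q1, q2, q3 = median(xs[:half]), median(xs), median(xs[-half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "five_number": (xs[0], q1, q2, q3, xs[-1]),
        "iqr": iqr,
        "fences": (low_fence, high_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }


# Hypothetical data: 30 is added on purpose to produce one outlier.
data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5, 30]
stats = boxplot_stats(data)
```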



Descriptive Statistics - Multivariate

 Datapoint:
 Correlation analysis
 Covariance analysis



Correlation Analysis (Nominal Data)

 χ² (chi-square) test
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
 The larger the χ² value, the more likely the variables are related
 The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
 Correlation does not imply causality
 The number of hospitals and the number of car thefts in a city are correlated
 Both are causally linked to the third variable: population (?)



Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

 χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} \approx 507.94$
 It shows that like_science_fiction and play_chess are correlated in the group.
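
The expected counts and the χ² statistic of this example can be recomputed as follows. The sketch derives each expected count as (row total × column total) / grand total, which is the usual rule under independence; the function name `chi_square` is hypothetical.

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table given as rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    return chi2, expected


# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
chi2, expected = chi_square(observed)
print(expected)
print(chi2)
```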



Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson's product moment coefficient)

$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{(n-1)\,\sigma_A \sigma_B}$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-product.
 If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation. rA,B = 0: no linear correlation (which does not by itself imply independence); rA,B < 0: negatively correlated
Correlation Analysis (viewed as linear relationship)

Correlation measures the linear relationship between objects.
To compute correlation, we standardize data objects, A and B, and then take their dot product, divided by n − 1 when std is the sample standard deviation:

$a'_k = (a_k - \text{mean}(A)) / \text{std}(A)$
$b'_k = (b_k - \text{mean}(B)) / \text{std}(B)$
$\text{correlation}(A, B) = \frac{A' \cdot B'}{n - 1}$
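
Both views of the correlation coefficient can be compared in code. The sketch below assumes the sample standard deviation (divisor n − 1) throughout, in which case the dot product of the standardized vectors has to be divided by n − 1 to match Pearson's r. The data values and function names are illustrative assumptions.

```python
import math


def pearson_r(a, b):
    """Pearson's r from the deviation cross-products."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * std_a * std_b)


def pearson_r_standardized(a, b):
    """Same value: standardize both vectors, then scaled dot product."""
    n = len(a)

    def standardize(v):
        m = sum(v) / n
        s = math.sqrt(sum((x - m) ** 2 for x in v) / (n - 1))
        return [(x - m) / s for x in v]

    a_std, b_std = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(a_std, b_std)) / (n - 1)


# Illustrative values (an assumption for this sketch).
a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]
r = pearson_r(a, b)
r_std = pearson_r_standardized(a, b)
```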



Covariance analysis (Numeric Data)

 Covariance is similar to correlation

$Cov(A, B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}$

Correlation coefficient: $r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected values of A and B, σA and σB are the respective standard deviations of A and B.
 Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values.
 Negative covariance: If CovA,B < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
 Independence: if A and B are independent, then CovA,B = 0, but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence



Covariance Analysis: An Example

 Covariance can be simplified in computation as

$Cov(A, B) = E(A \cdot B) - \bar{A} \bar{B}$

 Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question: If the stocks are affected by the same industry trends, will their prices rise or fall
together?
 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4

 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6

 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

 Thus, A and B rise together since Cov(A, B) > 0.
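
The stock example can be verified with a short sketch that computes the covariance both from its definition and from the simplified form E(A·B) − E(A)·E(B), using the divisor n as in the calculation above. The function names are hypothetical.

```python
def covariance_definition(a, b):
    """Cov(A, B) = mean of (a_i - mean(A)) * (b_i - mean(B))."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def covariance_shortcut(a, b):
    """Cov(A, B) = E(A * B) - E(A) * E(B)."""
    n = len(a)
    return sum(x * y for x, y in zip(a, b)) / n - (sum(a) / n) * (sum(b) / n)


# Weekly values of stocks A and B from the example.
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
cov_def = covariance_definition(A, B)
cov_short = covariance_shortcut(A, B)
print(cov_def, cov_short)
```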



Descriptive Statistics - Multivariate
Data Visualization - Histogram



Data Visualization - Class stratified Histogram



Data Visualization - Quantile plot



Histogram Analysis
 Histogram: Graph display of tabulated frequencies, shown as bars
 It shows what proportion of cases fall into each of several categories
 Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
 The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent.
[Figure: example histogram; y-axis from 0 to 40, x-axis from 10000 to 90000]

Histograms Often Tell More than Boxplots

 The two histograms shown on the left may have the same boxplot representation
 The same values for: min, Q1, median, Q3, max
 But they have rather different data distributions

Data Visualization - Distribution plot



Data Visualization: multivariate/high dimensional data

Scatter plot
Scatter multiple
Scatter matrix
Bubble plot
Density chart
Parallel chart
Deviation chart
Andrews curves



Data Visualization - Scatter plot



Data Visualization - Scatter multiple



Data Visualization - Scatter matrix



Data Visualization - Bubble plot



Data Visualization - Density chart



Data Visualization - Parallel chart



Data Visualization - Deviation chart



Data Visualization - Andrews curves



More on Data Preparation

 Some data preparation techniques change the way data is represented just to
make it more compatible with certain machine learning algorithms.
 Normalization
 Binning
 Sampling
 Normalization techniques can be used to change a continuous feature to fall
within a specified range while maintaining the relative differences between the
values for the feature.
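
One common way to do this is range (min-max) normalization. The slides do not show the formula, so the sketch below assumes the usual one: a'_i = (a_i − min(a)) / (max(a) − min(a)) × (high − low) + low. The function name, the target range and the data are hypothetical.

```python
def range_normalize(values, low=0.0, high=1.0):
    """Linearly rescale values into [low, high], keeping relative spacing."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a feature with a single value")
    return [(v - lo) / (hi - lo) * (high - low) + low for v in values]


# Hypothetical feature values (an assumption for illustration only).
ages = [20, 30, 50, 100]
scaled = range_normalize(ages)             # into [0, 1]
scaled_pm1 = range_normalize(ages, -1, 1)  # into [-1, 1]
```

Because the mapping is linear, the gap between 30 and 50 remains twice the gap between 20 and 30 after rescaling.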

More on Data Preparation

 Binning involves converting a continuous feature into a categorical feature. To perform binning, we define a series of ranges (called bins) for the continuous feature that correspond to the levels of the new categorical feature we are creating.
 equal-width binning
 equal-frequency binning
 Deciding on the number of bins can be difficult. The general trade-off is this:
 If we set the number of bins to a very low number we may lose a lot of information
 If we set the number of bins to a very high number then we might have very few instances in each bin or even end up with empty bins.

More on Data Preparation

 The equal-width binning algorithm splits the range of the feature values into b bins
each of size range/b.
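
A minimal sketch of equal-width binning follows. The function name and data are hypothetical, bins are numbered from 0, and the maximum value is placed in the last bin, which is an assumed boundary convention.

```python
def equal_width_bins(values, b):
    """Assign each value a bin index 0..b-1 using bins of width range/b."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / b
    if width == 0:
        raise ValueError("all values are identical; no range to split")
    # The maximum would land in bin b, so it is capped at b - 1.
    return [min(int((v - lo) / width), b - 1) for v in values]


# Hypothetical feature values (an assumption for illustration only).
values = [0, 1, 2, 5, 9, 10]
bins = equal_width_bins(values, 5)   # range 10, so each bin has width 2
```

In this example bin 3 (values from 6 up to 8) stays empty, which is the risk mentioned for a high number of bins.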



More on Data Preparation

 Equal-frequency binning first sorts the continuous feature values into ascending order
and then places an equal number of instances into each bin, starting with bin #1.
 The number of instances placed in each bin is simply the total number of instances divided by the
number of bins, b.
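
Equal-frequency binning can be sketched in the same style. The function name and data are hypothetical, and bins are numbered from 0 here rather than from 1. When the number of instances is not divisible by b, the integer arithmetic below lets bin sizes differ by at most one.

```python
def equal_frequency_bins(values, b):
    """Assign bin indices 0..b-1 so each bin gets about len(values)/b items."""
    n = len(values)
    # Positions of the values in ascending order.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * b // n
    return bins


# Hypothetical feature values (an assumption for illustration only).
values = [7, 1, 9, 3, 5, 11]
bins = equal_frequency_bins(values, 3)   # two instances per bin
```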



More on Data Preparation

 Sample: a smaller percentage from the larger dataset/population.
 We need to be careful when sampling, however, to ensure that the resulting datasets are still representative of the original data and that no unintended bias is introduced during this process.
 Common forms of sampling include:
 random sampling
 stratified sampling
 Random sampling: randomly selects a proportion of s% of the instances from a large dataset to create a smaller set.
 Random sampling is a good choice in most cases as the random nature of the selection of instances should avoid introducing bias.
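
Random sampling of s% of the instances might look like the sketch below. It samples without replacement using Python's standard `random` module; the function name, the seed and the data are hypothetical.

```python
import random


def random_sample(instances, s, seed=None):
    """Return s% of the instances, chosen at random without replacement."""
    rng = random.Random(seed)   # a fixed seed makes the sample repeatable
    k = round(len(instances) * s / 100)
    return rng.sample(instances, k)


# Hypothetical dataset of 200 instance ids (an assumption for illustration).
dataset = list(range(200))
sample = random_sample(dataset, 10, seed=42)
```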



More on Data Preparation

 Stratified sampling is a sampling method that ensures that the relative frequencies of the levels of a specific stratification feature are maintained in the sampled dataset.
 To perform stratified sampling:
 the instances in a dataset are divided into groups (or strata), where each group contains only instances that have a particular level for the stratification feature
 s% of the instances in each stratum are randomly selected
 these selections are combined to give an overall sample of s% of the original dataset.
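
The three steps above can be sketched as follows. The function name, the `fraud` feature and the dataset are hypothetical; rounding the size of each stratum sample is an assumption that matters only when s% of a stratum is not a whole number.

```python
import random
from collections import defaultdict


def stratified_sample(instances, key, s, seed=None):
    """Sample s% from each stratum defined by key(instance), then combine."""
    rng = random.Random(seed)
    # Step 1: divide the instances into strata by level of the feature.
    strata = defaultdict(list)
    for inst in instances:
        strata[key(inst)].append(inst)
    # Steps 2 and 3: sample s% within each stratum and combine the results.
    sample = []
    for group in strata.values():
        k = round(len(group) * s / 100)
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical dataset: 80 instances with fraud = "no", 20 with "yes".
dataset = [{"id": i, "fraud": "no"} for i in range(80)]
dataset += [{"id": 80 + i, "fraud": "yes"} for i in range(20)]
sample = stratified_sample(dataset, key=lambda d: d["fraud"], s=25, seed=1)
n_yes = sum(1 for d in sample if d["fraud"] == "yes")
n_no = sum(1 for d in sample if d["fraud"] == "no")
```

The 80:20 ratio of the original data is preserved in the sample (20 to 5).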



More on Data Preparation

 A data quality issue is loosely defined as anything unusual about the dataset.
 The most common data quality issues are:
- missing values
- irregular cardinality
- outliers

More on Data Preparation

 Approaches to handling data quality issues
 Drop any features that have missing values.
 Apply complete case analysis.
 Derive a missing indicator feature from features with missing values.
 Imputation replaces missing feature values with a plausible estimated value based on the feature values that are present.
 The most common approach to imputation is to replace missing values for a feature with a measure of the central tendency of that feature.
 We would be reluctant to use imputation on features missing in excess of 30% of their values and would strongly recommend against the use of imputation on features missing in excess of 50% of their values.
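
A minimal sketch of median imputation combined with a missing indicator feature is shown below. Missing values are represented by `None`; the function name and data are hypothetical, and the median is just one possible measure of central tendency (the mean or mode could be used instead).

```python
from statistics import median


def impute_with_median(values):
    """Return (imputed values, missing indicator, share of missing values)."""
    present = [v for v in values if v is not None]
    fill = median(present)                    # estimated from present values
    indicator = [v is None for v in values]   # derived indicator feature
    imputed = [fill if v is None else v for v in values]
    missing_share = sum(indicator) / len(values)
    return imputed, indicator, missing_share


# Hypothetical feature with two missing values (illustration only).
feature = [4, None, 10, 6, None, 8, 7, 5, 9, 3]
imputed, indicator, missing_share = impute_with_median(feature)
```

Checking `missing_share` first makes it easy to apply the 30% and 50% guidance above before deciding to impute.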



More on Data Preparation

 The easiest way to handle outliers is to use a clamp transformation that clamps all values above an upper threshold and below a lower threshold to these threshold values, thus removing the offending outliers

$a_i = \begin{cases} lower & \text{if } a_i < lower \\ upper & \text{if } a_i > upper \\ a_i & \text{otherwise} \end{cases}$

where ai is a specific value of feature a, and lower and upper are the lower and upper thresholds.
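
The clamp transformation is short enough to sketch in one function. The thresholds and data here are hypothetical; in practice they would be chosen by the analyst, for example from domain knowledge.

```python
def clamp(values, lower, upper):
    """Replace values below lower with lower and above upper with upper."""
    return [max(lower, min(v, upper)) for v in values]


# Hypothetical feature values and thresholds (illustration only).
feature = [-5, 3, 12, 40]
clamped = clamp(feature, lower=0, upper=20)
print(clamped)
```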



THE END
