Unit 3 Data Exploration (P)

Unit 3.
Data Exploration
Assoc. Prof Nguyen Manh Tuan

AGENDA
 Roles of data exploration

 Datasets – Data types
 Descriptive statistics (univariate, multivariate)
 Data visualization
2/20/2024 internal use

Roles of Data Exploration
 Understanding data
 Data preparation
 Key part of 2 phases (Data Understanding & Data
Preparation in CRISP-DM)
 Data quality report
 Data mining tasks
 Interpreting data mining results


 A data quality report includes tabular reports that describe the characteristics of each feature in an analytics base table (ABT) using standard statistical measures of central tendency and variation.
 The tabular reports are accompanied by data visualizations:
 A histogram for each continuous feature in an ABT
 A bar plot for each categorical feature in an ABT

Cardinality: the number of distinct values present for a feature
• Card = 1
• Categorical as continuous
Missing values:
• 60% excessive
Outliers: values that lie far away from the central tendency of a feature
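As a rough illustration of the tabular part of a data quality report, the sketch below computes count, percentage of missing values, cardinality, and a few summary statistics for one feature. The function name `quality_report`, the convention that `None` marks a missing value, and the sample rows are all assumptions made for this example, not part of the slides.

```python
from collections import Counter
from statistics import mean, median, stdev

def quality_report(rows, feature, continuous):
    """Summarise one feature of a list-of-dicts table (None marks a missing value)."""
    values = [row[feature] for row in rows]
    present = [v for v in values if v is not None]
    report = {
        "count": len(values),
        "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
        "cardinality": len(set(present)),  # number of distinct values
    }
    if continuous:
        report.update(min=min(present), median=median(present),
                      mean=mean(present), max=max(present),
                      std=stdev(present))
    else:
        mode_value, mode_count = Counter(present).most_common(1)[0]
        report.update(mode=mode_value,
                      mode_pct=100.0 * mode_count / len(present))
    return report

# Hypothetical ABT rows, invented for illustration only.
abt = [
    {"income": 30000, "injury": "soft tissue"},
    {"income": None, "injury": "back"},
    {"income": 54000, "injury": "soft tissue"},
    {"income": 41000, "injury": None},
]
income_report = quality_report(abt, "income", continuous=True)
injury_report = quality_report(abt, "injury", continuous=False)
print(income_report)
print(injury_report)
```

A cardinality of 1 or a high `missing_pct` in such a table is the kind of signal the callouts above point to.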

Example: Motor Insurance Fraud






Datasets – Data types
http://commons.wikimedia.org/wiki/File:Iris_versicolor_3.jpg#mediaviewer/File:Iris_versicolor_3.jpg


 Software developers are not encouraged to ask questions, but data scientists
are:
 What exciting things might you be able to learn from a given data set?
 What data sets might get you there?
 What things do you/your people really want to know?
 The good data scientist develops a curiosity about the domain/application they are working in.
 They talk shop with the people whose data they are working on.
 They read the newspaper every day, to get a broader perspective on the world.
Let’s Practice Asking Questions
6W – Interrogative framework: Who, What, How, Where, When, and Why on the datasets.

Example: IMDb - Movie Data

Example: IMDb - Actor Data

MOVIE questions
 Can we predict how well people will like a movie? What about its gross?
 How well does movie gross correlate with viewer ratings or awards?
 What was the highest rated film each year, or the best in each genre? Which movies
lost the most money, had the highest-powered casts, or got the least favorable
reviews?
 Which actors appeared in the most films? Earned the most money? Appeared in the
lowest rated films? Had the longest career or the shortest lifespan?
 How do Hollywood movies compare to Bollywood movies, in terms of ratings,
budget, and gross? Are American movies better received than foreign
films, and how does this differ between U.S. and non-U.S. reviewers?
 What does the social network of actors look like? (Six degrees of Kevin Bacon)
 What is the age distribution of actors and actresses in film?
 Do stars live longer or shorter lives than the bit players or public?

Example: NYC TaxiCab Data
 Gives driver/owner, pickup/dropoff location, and fare data for every taxi trip
taken.
 Data obtained from NYC via a Freedom of Information Act (FOIA) request

TaxiCab Questions
 How much do drivers make each night?
 How far do they travel?
 How much slower is traffic during rush hour?
 Where are people traveling to/from at different times of the day?
 Do faster drivers get tipped better?
 Where should drivers go to pick up their next fare?
 But the bigger questions have to do with understanding transportation in
the city. We can use the taxi travel times as a sensor to measure the level of
traffic in the city at a fine level.
 How much slower is traffic during rush hour than other times, and where are delays the
worst?
 Identifying problem areas is the first step to proposing solutions, by changing the timing
patterns of traffic lights, running more buses, or creating high-occupancy only lanes.

TaxiCab Questions
 Similarly, we can use the taxi data to measure transportation flows across
the city.
 Where are people traveling to, at different times of the day? This tells
us much more than just congestion.
 By looking at the taxi data, we should be able to see tourists going from hotels to attractions,
executives from fancy neighborhoods to Wall Street, and drunks returning home from
nightclubs after a bender.
 Data like this is essential to designing better transportation systems.
 It is wasteful for a single rider to travel from point a to point b when there is another
rider at point a+ who also wants to get there.
 Analysis of the taxi data enables accurate simulation of a ride sharing system, so we can
accurately evaluate the demands and cost reductions of such a service.

 Types: structured vs unstructured data
 Structured: matrix (objects/items; properties of objects/items)
 Unstructured: collection of tweets → matrix (tweet; frequently used vocabulary
word)
 Feature (or dimension, attribute, variable): a data field, representing a
characteristic or feature of a data object.
 E.g., customer_ID, name, address
 Data types:
 Categorical/Nominal
 Continuous/Numeric:
 Interval-scaled
 Ratio-scaled
Common structured data file types:
 XLS (Excel Spreadsheets)
 CSV (Comma Separated Values)
 TSV (Tab Separated Values)
 XML (eXtensible Markup Language)
 RSS (Really Simple Syndication)
 JSON (JavaScript Object Notation)

Datasets – Data types (Categorical)
 Nominal: categories, states, or “names of things”/labels describing the properties of the
object under investigation
 Hair_color = {black, blond, brown, grey, red}; marital status, occupation, ID numbers, zip codes
 Coding the values as integers (black hair = 0, blond = 1, brown = 2, ...) does not make
arithmetic meaningful: max/min hair color?! my hair color − your hair color?!
 Binary (binominal)
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV positive)
 Polynominal: nominal attribute with multiple states (0, 1, 2, ...)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between successive values is not
known.
 Size = {small, medium, large}, grades, army rankings

Datasets – Data types (Numeric)
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in °C or °F, calendar dates
 No true zero-point
 Additive/subtractive; logical (greater/less than; equal to) operators
 Ratio/real
 Inherent zero-point
 We can speak of a value as being a multiple of another value in the same unit of
measurement (10 K is twice as high as 5 K).
 e.g., temperature in Kelvin, length, counts, monetary quantities
 Ratio operations
Descriptive Statistics - Univariate

Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
Note: n is sample size and N is population size.
 Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
 Median:
 Middle value if odd number of values, or average of the middle two values otherwise
 Mode:
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 If each data value occurs only once, then there is no mode.
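The measures of central tendency listed above can be tried out with a short Python sketch. The helper names `weighted_mean` and `modes` and the sample values are assumptions made for illustration; `mean` and `median` come from Python's standard `statistics` module.

```python
from collections import Counter
from statistics import mean, median

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def modes(values):
    """All most-frequent values; empty list if every value occurs only once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no mode, following the convention stated above
    return sorted(v for v, c in counts.items() if c == top)

data = [2, 3, 3, 5, 7, 7, 8]  # illustrative values, not from the slides
print(mean(data), median(data), modes(data))   # two modes: bimodal
print(weighted_mean([1, 2, 3], [1, 1, 2]))
```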

Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively/right and negatively/left skewed data
[Figure: three unimodal distributions: symmetric, skewed right, skewed left]

Well-known, common distributions
A feature characterized by a multimodal distribution has two or more very commonly occurring ranges of values that are clearly separated.
Properties of Normal Distribution Curve
 The 68-95-99.7 rule is a useful characteristic of the normal distribution.
The rule states that approximately:
- 68% of the observations will be within one σ of µ
- 95% of observations will be within two σ of µ
- 99.7% of observations will be within three σ of µ.
Features following a normal distribution are characterized by a strong tendency towards a central value and symmetrical variation to either side of this.
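The percentages in the 68-95-99.7 rule can be checked numerically. The sketch below uses the standard identity that, for a normal distribution, the probability of falling within k standard deviations of the mean is erf(k/√2); the function name `within_k_sigma` is chosen for this example only.

```python
from math import erf, sqrt

def within_k_sigma(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * within_k_sigma(k), 2))
```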


Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers
individually
 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
 Standard deviation s (or σ) is the square root of variance s² (or σ²)
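The two ways of writing the variance (deviations from the mean versus sums of squares) give the same result, which a small sketch can confirm. The function names and the sample values below are assumptions for illustration; `variance` and `pvariance` from the standard `statistics` module are used only as a cross-check.

```python
from math import sqrt
from statistics import pvariance, variance

def sample_variance(xs):
    """Definition: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """One-pass form: sum of squares minus (sum)^2 / n, divided by n - 1."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

def population_variance(xs):
    """Mean of the squares minus the square of the mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, not from the slides
print(sample_variance(data), sample_variance_shortcut(data), variance(data))
print(population_variance(data), pvariance(data), sqrt(population_variance(data)))
```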

Measuring the Dispersion of Data
X = {1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1}. What are Q1, Q2 and Q3?
1) X (ascending): X = {1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5}
2) X has 14 observations, so the middle values are the 7th (6.8) and 8th (7.2). Median = Q2 = average of the 7th and
8th = 7
3) Q1: median of the lower half X1 = {1; 1; 2; 2; 4; 6; 6.8}. X1 has 7 observations, so its median is the 4th value,
2. => Q1 = 2
4) Q3: median of the upper half X2 = {7.2; 8; 8.3; 9; 10; 10; 11.5}. X2 has 7 observations, so its median is the 4th value,
9. => Q3 = 9
--> Conclusion on Q1: about ¼ of the values of X are ≤ 2 and about ¾ are ≥ 2.
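The steps of this worked example can be reproduced in a few lines of Python. This is a sketch of the median-of-halves method used above; the function name `quartiles` is an assumption, and leaving the middle value out of both halves when the number of observations is odd is one common convention, not something the slides specify. Other tools may use different quartile definitions and return slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    xs = sorted(values)
    half = len(xs) // 2
    lower = xs[:half]
    # For an odd number of values the middle one is left out of both halves.
    upper = xs[half:] if len(xs) % 2 == 0 else xs[half + 1:]
    return median(lower), median(xs), median(upper)

X = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
q1, q2, q3 = quartiles(X)
print(q1, q2, q3)
```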

Boxplot Analysis
 Five-number summary of a distribution

 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third quartiles, i.e.,
the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extended to Minimum
and Maximum
 Outliers: points beyond a specified outlier threshold,
plotted individually
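The numbers behind a boxplot can be computed without drawing it. The sketch below derives the five-number summary and flags outliers with the common 1.5 × IQR threshold; the function names, the median-of-halves quartile convention, and the extra value 25 added to the earlier example data are assumptions made for illustration. It expects at least two values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); quartiles by median of halves."""
    xs = sorted(values)
    half = len(xs) // 2
    q1 = median(xs[:half])
    q3 = median(xs[-half:])
    return xs[0], q1, median(xs), q3, xs[-1]

def outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

X = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
print(five_number_summary(X), outliers(X))
# Adding one extreme (hypothetical) value shows the threshold at work.
print(five_number_summary(X + [25]), outliers(X + [25]))
```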

Descriptive Statistics - Multivariate
 Datapoint:
 Correlation analysis
 Covariance analysis

Correlation Analysis (Nominal Data)
 Χ² (chi-square) test
$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
 The larger the Χ2 value, the more likely the variables are related
 The cells that contribute the most to the Χ2 value are those whose actual
count is very different from the expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population (?)

Chi-Square Calculation: An Example

                            Play chess    Not play chess    Sum (row)
Like science fiction        250 (90)      200 (360)         450
Not like science fiction    50 (210)      1000 (840)        1050
Sum (col.)                  300           1200              1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts
calculated based on the data distribution in the two categories)
$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} \approx 507.94$$
 It shows that like_science_fiction and play_chess are correlated in the group.
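The expected counts and the chi-square statistic in this example can be recomputed from the observed counts alone. The sketch below assumes the usual rule that the expected count of a cell is (row total × column total) / grand total; the function name `chi_square` is chosen for this example.

```python
def chi_square(observed):
    """Return (chi-square statistic, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count of a cell = row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    return stat, expected

# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
chi2, expected = chi_square(observed)
print(expected)
print(round(chi2, 2))
```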

Correlation Analysis (Numeric Data)
 Correlation coefficient (also called Pearson’s product moment coefficient)
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the
respective standard deviations of A and B, and Σ(aibi) is the sum of the AB cross-product.
 If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the
stronger the correlation. rA,B = 0: no linear correlation; rA,B < 0: negatively correlated

Correlation Analysis (viewed as linear relationship)
Correlation measures the linear relationship between objects
To compute correlation, we standardize data objects, A and B, and then take their dot product
$$a'_k = (a_k - \mathrm{mean}(A)) / \mathrm{std}(A)$$
$$b'_k = (b_k - \mathrm{mean}(B)) / \mathrm{std}(B)$$
$$\mathrm{correlation}(A, B) = A' \cdot B'$$
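A minimal sketch of the standardize-then-dot-product view follows. It assumes the sample standard deviation (dividing by n − 1), in which case the dot product of the standardized vectors has to be divided by n − 1 as well to land in the range [−1, 1], matching the (n − 1) in the denominator of the Pearson formula. The function name `pearson` and the input values are illustrative.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation via standardized values and a dot product."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    a_std = [(x - mean_a) / std_a for x in a]
    b_std = [(y - mean_b) / std_b for y in b]
    # Dot product of the standardized vectors, scaled by n - 1.
    return sum(x * y for x, y in zip(a_std, b_std)) / (n - 1)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))          # perfectly linear, increasing
print(pearson([1, 2, 3], [3, 2, 1]))                # perfectly linear, decreasing
print(pearson([2, 3, 5, 4, 6], [5, 8, 10, 11, 14]))
```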

Covariance analysis (Numeric Data)
 Covariance is similar to correlation
$$Cov(A,B) = E[(A - \bar{A})(B - \bar{B})] = \frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})$$
Correlation coefficient:
$$r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$$
where n is the number of tuples, Ā and B̄ are the respective mean or expected values of
A and B, σA and σB are the respective standard deviations of A and B.
 Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected
values. Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is
likely to be smaller than its expected value.
 Independence: if A and B are independent, then CovA,B = 0, but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not independent. Only under
some additional assumptions (e.g., the data follow multivariate normal distributions) does a
covariance of 0 imply independence

Covariance Analysis: An Example
 It can be simplified in computation as
$$Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B}$$
 Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question: If the stocks are affected by the same industry trends, will their prices rise or fall
together?
 E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
 Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
 Thus, A and B rise together since Cov(A, B) > 0.
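The stock example can be checked with a short sketch that computes the covariance both from the definition and from the simplified form E(A·B) − E(A)·E(B). Both versions divide by n, as the worked example does; the function names are chosen for this illustration.

```python
def covariance_definition(a, b):
    """Mean of the products of deviations from the means (divides by n)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

def covariance_shortcut(a, b):
    """Simplified form: E(A*B) - E(A) * E(B)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b

stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]
print(covariance_definition(stock_a, stock_b))
print(covariance_shortcut(stock_a, stock_b))
```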

Descriptive Statistics - Multivariate
Data Visualization - Histogram

Data Visualization - Class stratified Histogram

Data Visualization - Quantile plot

Histogram Analysis
 Histogram: Graph display of tabulated frequencies, shown as bars
 It shows what proportion of cases fall into each of several categories
 Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as
in bar charts, a crucial distinction when the categories are not of uniform width
 The categories are usually specified as non-overlapping intervals of some variable. The
categories (bars) must be adjacent.
[Figure: histogram; x-axis from 10000 to 90000, y-axis from 0 to 40]

Histograms Often Tell More than Boxplots
 The two histograms shown on the left may have the same boxplot representation
 The same values for: min, Q1, median, Q3, max
 But they have rather different data distributions
Data Visualization - Distribution plot

Data Visualization: multivariate/high dimensional data
Scatter plot
Scatter multiple
Scatter matrix
Bubble plot
Density chart
Parallel chart
Deviation chart
Andrews curves

Data Visualization - Scatter plot

Data Visualization - Scatter multiple

Data Visualization - Scatter matrix

Data Visualization - Bubble plot

Data Visualization - Density chart

Data Visualization - Parallel chart

Data Visualization - Deviation chart

Data Visualization - Andrews curves

More on Data Preparation
 Some data preparation techniques change the way data is represented just to
make it more compatible with certain machine learning algorithms.
 Normalization
 Binning
 Sampling
 Normalization techniques can be used to change a continuous feature to fall
within a specified range while maintaining the relative differences between the
values for the feature.
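One widely used technique of this kind is range (min-max) normalization. The slides do not give a formula, so the sketch below is an assumption: each value is rescaled as (a − min) / (max − min) × (high − low) + low, where `low` and `high` are the bounds of the target range. The function name `range_normalize` is hypothetical.

```python
def range_normalize(values, low=0.0, high=1.0):
    """Rescale values linearly so the minimum maps to low and the maximum to high."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    return [(v - lo) / (hi - lo) * (high - low) + low for v in values]

print(range_normalize([10, 20, 30]))
print(range_normalize([10, 20, 30], low=-1.0, high=1.0))
```

Because the mapping is linear, the relative differences between the values are preserved.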

 Binning involves converting a continuous feature into a categorical feature.

To perform binning, we define a series of ranges (called bins) for the continuous
feature that correspond to the levels of the new categorical feature we are creating.
 equal-width binning
 equal-frequency binning
 Deciding on the number of bins can be difficult. The general trade-off is this:
 If we set the number of bins to a very low number we may lose a lot of information
 If we set the number of bins to a very high number then we might have very few instances in
each bin or even end up with empty bins.
 The equal-width binning algorithm splits the range of the feature values into b bins
each of size range/b.

 Equal-frequency binning first sorts the continuous feature values into ascending order
and then places an equal number of instances into each bin, starting with bin #1.
 The number of instances placed in each bin is simply the total number of instances divided by the
number of bins, b.
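The two binning schemes can be compared on a small example. In this sketch the function names are assumptions, bin indices start at 0 rather than 1, the maximum value is placed in the last equal-width bin, and when the number of instances is not divisible by b the equal-frequency bins differ in size by at most one.

```python
def equal_width_bins(values, b):
    """Assign each value a bin index 0..b-1; every bin spans range / b."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / b
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin b, so it is kept in the last bin.
    return [min(int((v - lo) / width), b - 1) for v in values]

def equal_frequency_bins(values, b):
    """Sort the values, then fill the bins with (roughly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * b // len(values)
    return bins

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # illustrative values with one extreme
print(equal_width_bins(data, 3))      # middle bin stays empty
print(equal_frequency_bins(data, 3))  # bins of size 4, 3 and 3
```

The equal-width result illustrates the empty-bin problem mentioned above: a single extreme value stretches the range so that almost all instances fall into the first bin.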

 Sample: a smaller percentage from the larger dataset/population.

 We need to be careful when sampling, however, to ensure that the resulting
datasets are still representative of the original data and that no unintended
bias is introduced during this process.
 Common forms of sampling include:
random sampling
stratified sampling
 Random sampling: randomly selects a proportion of s% of the instances
from a large dataset to create a smaller set.
 Random sampling is a good choice in most cases as the random nature of the
selection of instances should avoid introducing bias.

 Stratified sampling is a sampling method that ensures that the relative
frequencies of the levels of a specific stratification feature are maintained in
the sampled dataset.
 To perform stratified sampling:
 the instances in a dataset are divided into groups (or strata), where each group
contains only instances that have a particular level for the stratification feature
 s% of the instances in each stratum are randomly selected
 these selections are combined to give an overall sample of s% of the original
dataset.
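The three steps of stratified sampling can be sketched as follows. The function name `stratified_sample`, the `key` function used to read the stratification feature, the fixed random seed, and rounding the per-stratum sample size to the nearest integer are all assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum and combine the results."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)      # step 1: group rows by level
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))  # step 2: sample within a stratum
    return sample                          # step 3: combined sample

# Hypothetical dataset: 80 instances of level "a" and 20 of level "b".
rows = [{"level": "a", "id": i} for i in range(80)]
rows += [{"level": "b", "id": i} for i in range(80, 100)]
sample = stratified_sample(rows, key=lambda r: r["level"], fraction=0.10)
counts = {lvl: sum(1 for r in sample if r["level"] == lvl) for lvl in ("a", "b")}
print(counts)
```

The 80/20 split of the original data is preserved in the sample, which plain random sampling does not guarantee for small samples.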

 A data quality issue is loosely defined as anything unusual about the dataset.
 The most common data quality issues are:
- missing values
- irregular cardinality
- outliers

 Approaches to handling data quality issues
 Drop any features that have missing values.
 Apply complete case analysis.
 Derive a missing indicator feature from features with missing values.
 Imputation replaces missing feature values with a plausible estimated value
based on the feature values that are present.
 The most common approach to imputation is to replace missing values for a feature
with a measure of the central tendency of that feature.
 We would be reluctant to use imputation on features missing in excess of 30% of
their values and would strongly recommend against the use of imputation on
features missing in excess of 50% of their values.
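A minimal sketch of imputation with a measure of central tendency is shown below. It uses the median, marks missing values with `None`, and refuses to impute when the share of missing values exceeds a threshold; the function name `impute_median`, the `None` convention, and the default threshold of 30% (taken from the guideline above) are assumptions for illustration.

```python
from statistics import median

def impute_median(values, max_missing=0.30):
    """Replace None with the median of the present values."""
    present = [v for v in values if v is not None]
    missing_share = 1 - len(present) / len(values)
    if missing_share > max_missing:
        raise ValueError("too many missing values to impute safely")
    fill = median(present)
    return [fill if v is None else v for v in values]

filled = impute_median([1, None, 3, 5, 7])
print(filled)
```

For a categorical feature the mode would play the same role as the median does here.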

 The easiest way to handle outliers is to use a clamp transformation that
clamps all values above an upper threshold and below a lower threshold to
these threshold values, thus removing the offending outliers:
$$a_i = \begin{cases} lower & \text{if } a_i < lower \\ upper & \text{if } a_i > upper \\ a_i & \text{otherwise} \end{cases}$$
where a_i is a specific value of feature a, and lower and upper are the lower and upper
thresholds.
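The clamp transformation is short enough to write directly. In this sketch the function name `clamp` and the threshold values are illustrative; in practice the thresholds might be set manually from domain knowledge or derived from the data, for example from the quartiles and the IQR.

```python
def clamp(values, lower, upper):
    """Pull every value below lower up to lower and every value above upper down to upper."""
    return [min(max(v, lower), upper) for v in values]

print(clamp([-5, 0, 3, 12], lower=0, upper=10))
```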



THE END

