Data Preprocessing I

Data Pre-processing I – Dealing with
Structured Data
Symbiosis International (Deemed University)

Session Objectives
By the end of this session, you will be able to:
1. Understand the fundamentals of data objects and the features/attributes of structured data.
2. Identify the statistical tools for performing descriptive and inferential analysis of any given data set.
3. Understand the association between variables/attributes.
Know about Data
 Voluminous data generation (several gigabytes and more).
 Data would be noisy.
 A variety of sources may be involved.
Knowing about data, values and metadata.
[Figure: Data Mining at Conceptual Level]
Data Object & Attribute
 Data sets are composed of data objects.
 An entity is represented by a data object.
 Attributes are commonly used to define objects of data.
 Data objects are also known as samples, examples, instances or data points.
[Figure: Typical Company Database]
Attribute Types
 An attribute is a data field that represents a data object’s
trait or feature.
 Dimension, feature and variable are used interchangeably with attribute.
Example: In a bank database, a customer can have attributes such as ID, Name, Address, Contact, Date of Birth and Nominees.
The set of attributes used to describe a data object or a
data point is called feature vector or attribute vector.
 Univariate or Bivariate distribution.
Attribute Types
Broad Classification
 Categorical - Qualitative Variables, Identify Group
Membership
 Numerical - Quantitative Variables, Describe Numerical
Properties of Cases, Have Measurement Units, Discrete
and Continuous
Attribute Types
Attribute Scaling
 Nominal - Labels or names used to identify the characteristics of an observation, e.g. Name, Board, Gender, Role, PRN Number, Mobile Number. May be numerically coded; no ordering.
 Ordinal – Exhibits the properties of nominal data and also tells us whether one observation has more of a characteristic than another; order is meaningful. E.g. Student Grade, Customer
Attribute Types
Attribute Scaling
 Interval - Exhibits the properties of ordinal data and also tells us how much more of a characteristic one observation has than another. The interval between observations can be expressed in a fixed unit of measure, e.g. Temperature, IQ, Time. Interval data are always numeric and have no true zero point; values can be added or subtracted.
 Ratio – Has all the properties of interval data, and the ratio of two values is meaningful, e.g. Distance, Runs, Weights, Heights. Addition, subtraction, multiplication and division are possible.
Attribute Types
Attribute Scaling (Summary)
 Numerical Data: Ratio Scale, Interval Scale
 Categorical Data: Ordinal Scale, Nominal Scale
Attribute Types
Discrete vs Continuous
 Discrete attribute - Takes a finite or countably infinite set of values, which may or may not be expressed as integers. Finite examples: hair color, gender, medical test, drink size. Countably infinite example: customer ID.
 Continuous attribute - Takes real values, typically represented as floating-point numbers, over a bounded or unbounded range. E.g. car speed, tip amount to a waiter.
Statistical Operations on Data
Describing Categorical Data
 Frequency Distribution Table – Frequency and Relative Frequency
 Frequency - Lists the distinct values of an attribute and the number of occurrences of each.
 E.g. Customer Satisfaction contains the values Good, Good, Very Good, Good, Very Good, Bad, Very Bad, Bad, Bad.
 The frequency table is: Good 3, Very Good 2, Bad 3 and Very Bad 1 (total data objects = 9).
 Relative Frequency – The ratio of each value's frequency to the total number of objects.
 E.g. the relative frequencies for Good, Very Good, Bad and Very Bad are 1/3, 2/9, 1/3 and 1/9 respectively.
 Graphical Summary – Pie Charts, Bar Charts and Pareto Charts.
 Pie Chart: proportional representation of the relative frequencies of each category.
 Bar Chart: place the distinct categories on the horizontal axis and the frequencies or relative frequencies of each category on the vertical axis, or vice versa.
 Pareto Chart: a bar chart sorted by frequency/relative frequency.
[Figures: pie chart, bar chart and Pareto chart of the Customer Satisfaction values]
 Numeric Summary – Mode and Median
 Mode – The category with the highest frequency or relative frequency.
 It is the longest bar in a bar chart and the largest slice in a pie chart.
 If two or more categories tie for the highest frequency, the attribute is called bimodal or multimodal.
 Median – The category of the middle element or value of an attribute.
 The data values must be ordered before the median can be found.
 If n, the number of objects, is odd, the ((n+1)/2)-th value is the median. Otherwise, consider both the (n/2)-th and the (n/2 + 1)-th values as the median.
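The frequency table, mode and median described above can be sketched in a few lines of Python. This is a minimal illustration using the Customer Satisfaction values from the slide; the ordering Very Bad < Bad < Good < Very Good is an assumption made here so that the values can be sorted.

```python
from collections import Counter

# Customer Satisfaction values from the example above.
ratings = ["Good", "Good", "Very Good", "Good", "Very Good",
           "Bad", "Very Bad", "Bad", "Bad"]

# Frequency and relative frequency of each distinct value.
freq = Counter(ratings)
n = len(ratings)
rel_freq = {cat: count / n for cat, count in freq.items()}

# Mode(s): every category that reaches the highest frequency.
top = max(freq.values())
modes = sorted(cat for cat, count in freq.items() if count == top)

# Median: sort by the (assumed) ordinal rank, then take the middle value(s).
order = ["Very Bad", "Bad", "Good", "Very Good"]
ranked = sorted(ratings, key=order.index)
if n % 2 == 1:
    median = [ranked[(n + 1) // 2 - 1]]             # (n+1)/2-th value, 1-based
else:
    median = [ranked[n // 2 - 1], ranked[n // 2]]   # n/2-th and (n/2 + 1)-th

# Pareto ordering: categories sorted by descending frequency.
pareto = freq.most_common()

print(dict(freq))
print(rel_freq)
print("modes:", modes, "median:", median)
```

Here Good and Bad tie with a frequency of 3, so the attribute is bimodal, and the fifth of the nine sorted values gives the median category.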
Describing Numerical Data
 Small Number of Distinct Values
 When the number of distinct values is relatively small, we can treat each distinct value as a category and evaluate frequencies and relative frequencies as we did for categorical data.
 Graphically, we can represent these observations using a bar chart.
 Large Number of Distinct Values
 Organize the data into a number of classes (5 – 20 classes).
 Each data object belongs to exactly one class.
 Evaluate the frequencies and relative frequencies of the classes.
 Use a histogram to generate the graphical summary.
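As a rough sketch of the class-based summary, the helper below splits the range of the data into equal-width classes and counts how many values fall in each. The function name `class_frequencies`, the equal-width choice and the `ages` data are assumptions made for illustration only.

```python
def class_frequencies(values, num_classes):
    """Split [min, max] into equal-width classes and count values per class."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_classes  # assumes the values are not all equal
    counts = [0] * num_classes
    for x in values:
        # Each value falls in exactly one class; the maximum goes in the last.
        index = min(int((x - lo) / width), num_classes - 1)
        counts[index] += 1
    bounds = [(lo + i * width, lo + (i + 1) * width) for i in range(num_classes)]
    return bounds, counts

ages = [21, 22, 25, 27, 30, 31, 35, 38, 40, 41]  # hypothetical data
bounds, counts = class_frequencies(ages, 5)
rel = [c / len(ages) for c in counts]
for (low, high), c, r in zip(bounds, counts, rel):
    print(f"{low:.0f}-{high:.0f}: frequency={c}, relative frequency={r:.1f}")
```

The counts are what a histogram would draw as bar heights, one bar per class.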
 Measure of Central Tendency
 Describes the typical value or values of an attribute.
 The (arithmetic) mean is the most common and effective quantitative measure of the central value of a set of data.
 $x' = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_N x_N}{w_1 + w_2 + \dots + w_N}$
 The trimmed mean is obtained by sorting the data values, cutting off the lowest and highest 2% - 20% of them, and then evaluating the mean of the rest.
 The mean can also be evaluated from an assumed mean (discrete series and continuous series).
 For continuous attributes grouped into class intervals, find the mid-point of each class interval and multiply it by the frequency of the data points belonging to that interval. The rest of the calculation is the same.
 Median evaluation is similar to that for categorical data when the number of observations is small.
 Combined Mean
 For a high number of observations: $\mathrm{Median} = L + \frac{\frac{N}{2} - CF}{f_{median}} \times w$, where L is the lower boundary of the median class, N the total number of observations, CF the cumulative frequency of the classes before the median class, $f_{median}$ the frequency of the median class and w the class width.
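The measures above can be sketched as small Python functions. The function names and all data values are hypothetical, and `trimmed_mean` assumes the same fraction is cut from each end of the sorted data.

```python
def weighted_mean(values, weights):
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Drop `fraction` of the sorted values from each end, then average."""
    data = sorted(values)
    k = int(len(data) * fraction)
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)

def grouped_mean(classes, freqs):
    """Mean of grouped data: class mid-points weighted by class frequency."""
    mids = [(lo + hi) / 2 for lo, hi in classes]
    return weighted_mean(mids, freqs)

def grouped_median(classes, freqs):
    """Median = L + ((N/2 - CF) / f_median) * w for grouped data."""
    n = sum(freqs)
    cf = 0  # cumulative frequency of the classes before the current one
    for (lo, hi), f in zip(classes, freqs):
        if cf + f >= n / 2:  # first class whose cumulative count reaches N/2
            return lo + (n / 2 - cf) / f * (hi - lo)
        cf += f

# Hypothetical class intervals and frequencies.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 8, 12, 5]

print(weighted_mean([80, 90], [1, 3]))
print(trimmed_mean([2, 4, 5, 6, 7, 8, 9, 10, 11, 100], 0.10))
print(grouped_mean(classes, freqs))
print(grouped_median(classes, freqs))
```

In the trimmed-mean call, cutting 10% from each end removes the values 2 and 100, so the extreme value 100 no longer pulls the mean upward.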
 Measure of Central Tendency
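 The mode for a continuous series, when the data values are given in class intervals and the frequency of each interval is known, is defined as
$\mathrm{Mode} = L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times w$
where L is the lower boundary of the modal class, $f_1$ its frequency, $f_0$ and $f_2$ the frequencies of the preceding and following classes, and w the class width.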
 Relationship among Mean, Median and Mode.
 mean = median = mode if perfectly symmetric
 mean > median > mode if positively skewed
 mode > median > mean if negatively skewed
[Figure: Frequency against Number of Class Intervals for symmetric, positively skewed and negatively skewed distributions]
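A minimal sketch of the grouped-data mode formula, Mode = L + ((f1 - f0) / (2 f1 - f0 - f2)) * w, follows. The function name and the class data are hypothetical, and a missing neighbouring class is treated as having frequency 0, which is an assumption of this sketch.

```python
def grouped_mode(classes, freqs):
    """Mode of grouped data, using the class with the highest frequency."""
    i = max(range(len(freqs)), key=freqs.__getitem__)  # modal class index
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                  # preceding class
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0     # following class
    lo, hi = classes[i]
    # The denominator is zero only if f0 == f1 == f2 (a flat distribution).
    return lo + (f1 - f0) / (2 * f1 - f0 - f2) * (hi - lo)

# Hypothetical class intervals and frequencies.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 8, 12, 5]

mode = grouped_mode(classes, freqs)
mids = [(lo + hi) / 2 for lo, hi in classes]
mean = sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)
print(round(mean, 2), round(mode, 2))
```

For these data the mean (about 20.67) is below the mode (about 23.64), the ordering associated with a negatively skewed distribution.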
 Measure of Dispersion
 Dispersion measures the spread of numerical data.
 Techniques - Range, Variance, Standard Deviation, Percentile and Interquartile
Range.
 Range is the difference between the largest and smallest values of the observations. It depends only on the two extreme values, so it says little about the rest of the data set and is sensitive to outliers.
 Variance takes all the data values or observations into consideration. It evaluates the deviations of the data values from the mean and aggregates the squared deviations into a single numeric value.
 Variance formula: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - x')^2$
 The square root of the variance is the Standard Deviation.
 Percentile –
o The 100p-th percentile of a data set containing n records, where p is a value in [0,1].
o Sort the records and determine n*p.
o If n*p is not an integer, take the smallest integer greater than n*p. The value at that position is the 100p-th percentile.
o If n*p is an integer, the mean of the values at positions n*p and n*p + 1 is the 100p-th percentile.
 Quartile –
o The 25th percentile is called the first quartile (Q1), the 50th percentile the second quartile or median (Q2) and the 75th percentile the third quartile (Q3). That is, the quartiles break the data set into four parts.
o Interquartile Range: IQR = Q3 – Q1.
o Values below Q1 – 1.5 IQR or above Q3 + 1.5 IQR can be classified as outliers.
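The dispersion measures above can be sketched as follows. The `percentile` function implements the n*p rule from the slides and assumes 0 < p < 1; the function names and the data are hypothetical.

```python
import math

def variance(values):
    """Population variance: mean of the squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def percentile(values, p):
    """100p-th percentile by the n*p rule; assumes 0 < p < 1."""
    data = sorted(values)
    pos = len(data) * p
    if pos != int(pos):
        return data[math.ceil(pos) - 1]   # 1-based position -> 0-based index
    k = int(pos)
    return (data[k - 1] + data[k]) / 2    # mean of positions n*p and n*p + 1

def iqr_outliers(values):
    """Values outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]."""
    q1, q3 = percentile(values, 0.25), percentile(values, 0.75)
    iqr = q3 - q1
    return [x for x in values if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
print("range:", max(data) - min(data))
print("variance:", variance(data), "std dev:", math.sqrt(variance(data)))
print("Q1, Q2, Q3:", percentile(data, 0.25), percentile(data, 0.5),
      percentile(data, 0.75))
print("outliers:", iqr_outliers(data + [15]))
```

Adding the value 15 to the data raises the range sharply, while the quartiles move only slightly and the IQR rule flags 15 as an outlier.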
Association between Categorical Variables.
 Use of Contingency Table
 E.g.1. Association between Gender and Owning a Smartphone.
 100 samples are collected: 44 female and 56 male students. 76 owned smartphones and 24 did not.
 Of these, 34 females and 42 males owned a smartphone.
 E.g.2. Income level (ordinal variable with values low, medium and high) and smartphone ownership (nominal variable with values yes or no).
 Because income level is an ordinal variable, the values low, medium and high can be coded as 1, 2 and 3 respectively in the contingency table to maintain the ordering.
 Use of Contingency Table
 For E.g. 1, 24% of the population do not own a smartphone whereas 76% do. This distribution is consistent for males and females: around 23% of females do not own a smartphone and 77% do, while 25% of males do not own a smartphone and 75% do. --- Gender and owning a smartphone are not associated.
 In contrast, for E.g. 2, the overall ownership distribution is 38% and 62%, but this distribution is not consistent across income levels. Only 10% of the high-income group do not own a smartphone, whereas 41% and 64% of the medium and low-income groups do not own one. --- Income level and owning a smartphone are associated.
 Use of Stacked Bar Chart
E.g.1. The proportion of smartphone ownership is the same for male and female.
E.g.2. The proportion of ownership is not the same for high, medium and low-income groups.
 Row Relative Frequency and Column Relative Frequency
 Dividing each cell of the contingency table by its row total gives the row relative frequency.
 Dividing each cell of the contingency table by its column total gives the column relative frequency.
[Tables: Row Relative Frequency for E.g. 1 and E.g. 2]
[Tables: Column Relative Frequency for E.g. 1 and E.g. 2]
 Utility of Row & Column Relative Frequencies for Finding Association:
 Two variables are associated when knowing the value of one variable provides information about the other.
 If the row relative frequencies (or column relative frequencies) have the same pattern for all rows (or columns), the two variables are not associated.
 If the row relative frequencies (or column relative frequencies) have different patterns for some rows (or some columns), the two variables are associated.
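As a minimal sketch, the row and column relative frequencies for E.g. 1 can be computed from the counts given earlier. The non-owner counts (10 females, 14 males) are derived here from the stated totals, and the dictionary layout is just one possible representation.

```python
# Contingency table for E.g. 1: rows are gender, columns are ownership.
table = {
    "Female": {"Owns": 34, "Does not own": 10},   # 44 females in total
    "Male":   {"Owns": 42, "Does not own": 14},   # 56 males in total
}

columns = ["Owns", "Does not own"]
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {col: sum(table[row][col] for row in table) for col in columns}

# Row relative frequency: each cell divided by its row total.
row_rel = {row: {col: table[row][col] / row_totals[row] for col in columns}
           for row in table}

# Column relative frequency: each cell divided by its column total.
col_rel = {row: {col: table[row][col] / col_totals[col] for col in columns}
           for row in table}

for row in table:
    print(row, {col: round(value, 2) for col, value in row_rel[row].items()})
```

The two rows show nearly the same pattern (about 77% versus 75% ownership), which is the basis for concluding that gender and smartphone ownership are not associated in this sample.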
Association between Numerical Variables.
 How to interpret association in a scatter plot.
 Quantification of numeric association.
[Figures: Example I and Example II]
 Patterns to observe in a scatter plot:
 Direction
 Curvature or Linear
 Variation or Tightly Clustered
 Outliers
Measuring the Strength of Association
 Quantification of association.
 Use covariance and correlation to quantify the strength of association.
 Consider two numeric attributes A and B and a collection of n observations {(a1, b1), …, (an, bn)}. The expected values of A and B, also known as their mean values, are denoted A′ and B′.
 $Cov(A, B) = \frac{\sum_{i=1}^{n} (a_i - A')(b_i - B')}{n}$
 $r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$
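The two formulas can be sketched directly in Python. The function names and the `hours`/`scores` data are hypothetical; the standard deviations use the same division by n as the covariance, so r stays within -1 to +1.

```python
import math

def covariance(a, b):
    """Cov(A, B) = sum((a_i - A')(b_i - B')) / n."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

def correlation(a, b):
    """r = Cov(A, B) / (sigma_A * sigma_B)."""
    sigma_a = math.sqrt(covariance(a, a))  # Cov(A, A) is the variance of A
    sigma_b = math.sqrt(covariance(b, b))
    return covariance(a, b) / (sigma_a * sigma_b)

hours = [1, 2, 3, 4, 5]        # hypothetical data
scores = [2, 4, 6, 8, 10]      # rises exactly in step with hours

print(covariance(hours, scores))
print(correlation(hours, scores))
print(correlation(hours, scores[::-1]))
```

A perfectly increasing linear relationship gives r close to +1 and the reversed one gives r close to -1. Covariance alone depends on the units of A and B, which is why the correlation is the usual measure of strength.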
Types of Distribution - Introduction
Bernoulli Trial
Bernoulli Trial Examples
Bernoulli Trial Real Life Examples
Binomial & Geometric Distribution
Association between Variables.
 Introduction to Hypothesis Testing:
 In hypothesis testing, we draw a conclusion about a population based on a sample data set.
 We state two competing claims about the population: the null hypothesis (H0) and the alternate hypothesis (H1).
 We collect evidence from the sample by applying different techniques to decide whether the null hypothesis should be rejected in favour of the alternate hypothesis.
Summary: Evaluate two mutually exclusive statements about a population using sample data.

Decision | H0 is true | H1 is true
Reject H0 | Type 1 Error | Ok
Not Reject H0 | Ok | Type 2 Error
Association between Variables.
 Types of Hypothesis Testing:
Gender | Age Group | Weight (Kg) | Height (m)
M | Elderly | 70 | 1.4
F | Adult | 6.5 | 1.2
…… | …… | …… | ……
 Gender | Is there a difference between the male and female proportions? | H1: Yes, H0: No | Test: One-Sample Proportion Test, since there is only one categorical variable. | If P ≤ 0.05 then reject H0.
 Gender & Age Group | Does the male and female proportion differ by age group? | H1: Yes, H0: No | Test: Chi-squared Test, since there are two categorical variables. | If P ≤ 0.05 then reject H0.
 Numeric feature such as Height | Test: T-test | One numeric variable
 Two Numerical Variables | Test: Correlation (-1 to +1)
 One Numerical and One Categorical | If the categorical variable has two categories then T-test, otherwise ANOVA Test
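As an illustration of the chi-squared test for two categorical variables, the sketch below reuses the gender and smartphone counts from the earlier contingency-table example rather than the Gender and Age Group table, whose counts are not given. It computes Pearson's statistic without a continuity correction, and the p-value shortcut via `math.erfc` is valid only for 1 degree of freedom (a 2x2 table).

```python
import math

# Observed counts: rows are gender, columns are smartphone ownership.
observed = [[34, 10],   # Female: owns, does not own
            [42, 14]]   # Male:   owns, does not own

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total under H0.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

# With 1 degree of freedom, the upper-tail probability of the chi-squared
# distribution equals erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
reject_h0 = p_value <= 0.05

print(f"chi2={chi2:.4f}, p={p_value:.2f}, reject H0: {reject_h0}")
```

The p-value is well above 0.05, so H0 (no association) is not rejected, which agrees with the earlier reading of the row relative frequencies.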
Question & Answer
 Measuring the Strength of Association (Examples)
Session Outcomes
In this session you learned about:
1. Data & Attributes
2. Statistical operations for observing data and attributes.
3. Measuring association between Categorical Variables.
4. Measuring association between Numerical Variables.
Thank You
