Types of Data

21-09-2020
DATA MINING:
Prof. Sherica Lavinia Menezes
Asst. Professor
Computer Engineering Department
Goa College of Engineering
AGENDA
01 ISSUES RELATED TO DATA
02 TYPES OF DATA
03 DATA QUALITY
LEARNING OUTCOMES
01 Define attributes, precision, bias and accuracy.
02 Explain the importance of knowing the nature of input data.
03 Discuss issues in measurement and data collection.
04 Classify attributes as binary, discrete or continuous.
ISSUES IN DATA
I Types of Data
II Quality of Data
III Preprocessing
IV Data Analysis
IMPORTANCE OF KNOWING YOUR DATA
Hi,
I have attached the data file that I mentioned earlier. Each line contains
the information for a single patient and consists of five fields. We want
to predict the last field using the other fields. I do not have time to
provide any more information about the data since I'll be out of station
but hopefully that won't slow you down.
Thanks and see you in a couple of days.
Despite some misgivings, you proceed to analyse the data. The first few
rows are as follows:
012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6

Statistician: So, you got the data for all the patients?
Data Miner: Yes. I haven't had much time for analysis, but I do have a few interesting results.
Statistician: Amazing. There were so many data issues with this set of patients that I couldn't do much.
Data Miner: Oh? I didn't hear about any possible problems.
Statistician: Well, first there is field 5, the variable we want to predict. It is common knowledge among people who analyze this type of data that results are better if you work with the log of the values, but I didn't discover this until later. Was it mentioned to you?

Data Miner: No!
Statistician: But surely you heard about what happened to field 4? It is supposed to be measured on a scale of 1 to 10, with 0 indicating a missing value, but because of a data entry error all 10s were changed to 0s. Unfortunately, since some of the patients have missing values for this field, it is impossible to say whether a 0 in this field is a real 0 or a 10. Quite a few of the records have that problem.
Data Miner: Interesting. Were there any other problems?
Statistician: Yes, fields 2 and 3 are basically the same, but I assume that you probably noticed that.
Data Miner: Yes, but these fields were only weak predictors of field 5.
Statistician: Anyway, given all those problems, I am surprised you were able to accomplish anything.
Data Miner: True, but my results are really quite good. Field 1 is a very strong predictor of field 5. I'm surprised that this wasn't noticed before.
Statistician: What? Field 1 is just an identification number.
Data Miner: Nonetheless, my results speak for themselves.
Statistician: Oh, no! I just remembered. We assigned the ID numbers after we sorted the records based on field 5. There is a strong connection, but it's meaningless. Sorry.
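The ID-number trap at the end of the conversation can be reproduced with a small simulation. This is only an illustrative sketch: the values are synthetic, and the names `field5`, `ids_arrival`, `ids_after_sort` and `pearson` are assumptions, not part of the original data file. It compares IDs handed out in arrival order with IDs assigned after sorting on the value to be predicted.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for field 5 (the value to predict).
field5 = [random.uniform(0.0, 500.0) for _ in range(200)]

# Case 1: IDs handed out in arrival order, unrelated to field 5.
ids_arrival = list(range(1, len(field5) + 1))
r_arrival = pearson(ids_arrival, field5)

# Case 2: records sorted on field 5 first, then numbered.
field5_sorted = sorted(field5)
ids_after_sort = list(range(1, len(field5_sorted) + 1))
r_sorted = pearson(ids_after_sort, field5_sorted)

print(f"IDs in arrival order:       r = {r_arrival:.2f}")
print(f"IDs assigned after sorting: r = {r_sorted:.2f}")
```

In the second case the ID is a near-perfect "predictor" only because it encodes the sort order of the target, so it would be useless for any new patient.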
TYPES OF DATA
01 ATTRIBUTES & MEASUREMENTS: We address the issue of describing data using attributes
02 TYPES OF DATA SETS: We describe some of the most common data types
TYPES OF DATA
● A data set is a collection of data objects.
● Data objects are described using a number of attributes.
What is an Attribute?
An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.
Student_ID Student_Name Student_Age
123 Apurva 18
124 Shivangi 17
125 Shreya 18
What is a Measurement?
A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.
The process of measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object.
Example:
Attribute: Weight of the apple
Measurement scale: Decimal number up to 1 decimal place
Type of an Attribute
The values used to represent an attribute may have properties that are not properties of the attribute itself, and vice versa.
Example 1: Employee Age and ID Number
Age: integer
ID Number: integer
Avg of age: makes sense.
Avg of ID? No, makes no sense.
ID captures only the distinctness of integers.
Age captures almost all properties of integers; however, it has a limit, which integers don't.
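A short sketch of the age versus ID example, using made-up employee records (the field names `id` and `age` and all values are assumptions for illustration). Both attributes are stored as integers, so the language will happily average either one; only the analyst knows which result means anything.

```python
from statistics import mean

# Hypothetical employee records: both attributes are stored as integers.
employees = [
    {"id": 1001, "age": 25},
    {"id": 1002, "age": 31},
    {"id": 1003, "age": 40},
]

ages = [e["age"] for e in employees]
ids = [e["id"] for e in employees]

# Age supports arithmetic, so its average is meaningful.
average_age = mean(ages)

# The same call runs on IDs, but the result describes no employee.
average_id = mean(ids)

# The only meaningful check on IDs is distinctness.
ids_are_distinct = len(set(ids)) == len(ids)

print(average_age, average_id, ids_are_distinct)
```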
The different types of attributes
DISTINCTNESS: NOMINAL
ORDER: ORDINAL
ADDITION: INTERVAL
MULTIPLICATION: RATIO
Different types of attributes
NOMINAL
  Description: Provide only enough information to distinguish one object from another
  Examples: Zip codes, ID nos., eye colour, gender
  Operations: Mode, entropy, contingency correlation, chi square test
ORDINAL
  Description: Provide enough information to order objects
  Examples: Grades, economic status, educational qualifications
  Operations: Median, percentiles, rank correlation, run tests, sign tests
INTERVAL
  Description: Differences between values are meaningful
  Examples: Calendar dates, temperature in C, temperature in F
  Operations: Mean, Std. Dev., Pearson's correlation, t and F tests
RATIO
  Description: Differences and ratios of the values are meaningful
  Examples: Temperature in Kelvin, age, mass, length, monetary qty
  Operations: Geometric mean, harmonic mean, percent variation
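As a rough illustration of how the attribute type limits the summary statistic, the sketch below applies one operation per type using Python's standard `statistics` module. The sample values and variable names are invented for the example, and the ordinal grades are assumed to be coded as integers that preserve their order.

```python
from statistics import mode, median, mean, geometric_mean

# Hypothetical sample values, one list per attribute type.
eye_colour = ["brown", "blue", "brown", "green"]  # nominal
grades = [1, 3, 2, 3, 2]                          # ordinal, coded so that 1 < 2 < 3
temperature_c = [20.0, 22.0, 27.0]                # interval
mass_kg = [2.0, 8.0]                              # ratio

nominal_summary = mode(eye_colour)        # most frequent category
ordinal_summary = median(grades)          # middle value by rank
interval_summary = mean(temperature_c)    # relies on meaningful differences
ratio_summary = geometric_mean(mass_kg)   # relies on meaningful ratios

print(nominal_summary, ordinal_summary, interval_summary, ratio_summary)
```

Each operation is also valid for the types listed after it (a mode can be taken of masses), but not for the types before it (a geometric mean of eye colours is undefined).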
Examples of Different types of attributes
NOMINAL
  Description: Categories (no ordering or direction)
  Examples: genotype, blood type, zip code, gender, race, eye color, political party
ORDINAL
  Description: Ordered categories (rankings, order, scaling)
  Examples: socio economic status ("low income", "middle income", "high income"), income level ("less than 50K", "50K-100K", "over 100K"), satisfaction rating ("dislike", "neutral", "like")
INTERVAL
  Description: Difference between measurements but no true zero
  Examples: temperature (Fahrenheit), temperature (Celsius), pH, SAT score (200-800), credit score (300-850)
RATIO
  Description: Difference between measurements, true zero exists
  Examples: enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight, length, temperature in Kelvin (0.0 Kelvin really does mean "no heat"), survival time
Transformations that define attribute levels
NOMINAL
  Transformation: Any one-to-one mapping
  Comment: If all employee IDs were to be reassigned it wouldn't make a difference
ORDINAL
  Transformation: Any order preserving change of values
  Comment: socio economic status ("low income", "middle income", "high income") can also be well represented as {1, 2, 3}
INTERVAL
  Transformation: new_value = a*old_value + b; a and b are constants
  Comment: Scale (0 – 800) mapped to (1000 – 1800): a = 1, b = 1000, so 400 maps to x = 400 + 1000 = 1400
  Comment: Scale (200 – 800) mapped to (1000 – 1800): a = 800/600 = 4/3, b = 1000 – (4/3)*200 ≈ 733.3, so 400 maps to x = (4/3)*400 + 733.3 ≈ 1266.7
RATIO
  Transformation: new_value = a*old_value
  Comment: applies to enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight, length, temperature in Kelvin (0.0 Kelvin really does mean "no heat"), survival time
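The interval and ratio transformations can be checked numerically. In the sketch below, `linear_map` is a hypothetical helper that derives a and b from the endpoints of the old and new scales, so that the old endpoints land exactly on the new ones; the Celsius/Fahrenheit and metre/centimetre conversions are standard examples added for illustration.

```python
def linear_map(old_lo, old_hi, new_lo, new_hi):
    """Return (a, b) such that new_value = a * old_value + b maps
    [old_lo, old_hi] onto [new_lo, new_hi]."""
    a = (new_hi - new_lo) / (old_hi - old_lo)
    b = new_lo - a * old_lo
    return a, b

# Scale (0 - 800) onto (1000 - 1800): a = 1, b = 1000.
a1, b1 = linear_map(0, 800, 1000, 1800)
mapped_1 = a1 * 400 + b1

# Scale (200 - 800) onto (1000 - 1800): a = 4/3, b is about 733.33.
a2, b2 = linear_map(200, 800, 1000, 1800)
mapped_2 = a2 * 400 + b2

# Interval attribute: Celsius to Fahrenheit is a * x + b with b != 0,
# so ratios of values change (20 C is not "twice as hot" as 10 C).
def c_to_f(c):
    return 9 / 5 * c + 32

ratio_c = 20 / 10
ratio_f = c_to_f(20) / c_to_f(10)  # 68 / 50

# Ratio attribute: metres to centimetres is a * x only, so ratios survive.
ratio_m = 6 / 3
ratio_cm = (6 * 100) / (3 * 100)

print(mapped_1, round(mapped_2, 2), ratio_c, ratio_f, ratio_m, ratio_cm)
```

The non-zero offset b is what separates interval from ratio attributes: it shifts the zero point, which is why ratios are preserved only when b = 0.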
Describing Attributes by Number of Values
Binary: Attributes assume only two values. Represented as Boolean or integers 0/1.
Discrete: Attributes have a finite or countably infinite set of values. Represented as integers.
Continuous: Attribute values are real numbers. Represented as floating point.
Think of Appropriate examples
Binary: True/false, Yes/No, Attendance, Gender
Discrete: Roll No, Name, Blood type
Continuous: Temperature, Pressure, Height, Weight
Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as important.
Object: Student
Attribute: Records whether the student took a particular course at a university
Attribute value: 1/0
Rows: students A to E; columns: courses 1 to 6
  1 2 3 4 5 6
A 1 0 0 1 1 0
B 1 1 0 0 1 0
C 0 1 1 0 1 0
D 1 0 0 1 0 1
E 1 1 0 1 0 0
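One way to act on "only presence matters" is to keep just the non-zero entries of each row. The sketch below does this for the course table from the slide, reading its rows as students and its columns as courses 1 to 6 (an interpretation of the layout); the helper name `courses_taken` is an assumption for illustration.

```python
# Course-enrolment table: rows are students, columns are courses 1-6.
enrolment = {
    "A": [1, 0, 0, 1, 1, 0],
    "B": [1, 1, 0, 0, 1, 0],
    "C": [0, 1, 1, 0, 1, 0],
    "D": [1, 0, 0, 1, 0, 1],
    "E": [1, 1, 0, 1, 0, 0],
}

def courses_taken(row):
    """Keep only the presences: the 1-based course numbers with value 1."""
    return {course for course, taken in enumerate(row, start=1) if taken == 1}

taken = {student: courses_taken(row) for student, row in enrolment.items()}

# Courses shared by two students are found from the presences alone;
# the courses that neither took (0/0 matches) are ignored.
shared_a_d = taken["A"] & taken["D"]

print(taken["A"], taken["D"], shared_a_d)
```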
Types of Data Sets
General Characteristics of Data Sets
Dimensionality: Number of attributes that the objects in the data set possess.
Sparsity: Data sets with asymmetric features have most values equal to 0 (sparse data).
Resolution: Data can be recorded at various resolutions.
Types of Data Sets
Data set:
•Record Based
•Graph Based
•Ordered Data
Record Based Data
Record based data set:
•Record Data
•Transaction Data
•The Data Matrix
•The Sparse Data Matrix
Record based Data
•No explicit relationship among records or data fields
•Every record has the same number of attributes
•Usually stored either in flat files or a relational database
Transaction or Market Basket Data
•Variation of record data
•Each record involves a set of items.
•Can be viewed as a set of records whose attributes are asymmetric
•Most often attributes are binary
•Can be discrete or continuous
The Data Matrix
•If data objects in a collection all have the same fixed set of numeric attributes then the data objects can be thought of as points in a multidimensional space.
•A set of such data can be interpreted as an m*n matrix (m rows, one per object; n columns, one per attribute)
•Variation of record data
•Because it contains numeric attributes, matrix operations can be applied
•Most used for statistical data
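A minimal sketch of a data matrix in plain Python, assuming a made-up data set of three objects with two numeric attributes (height and weight are chosen only for illustration). It shows the m*n shape, a matrix operation (transpose), and the "points in space" view via Euclidean distance.

```python
# Hypothetical data set: m = 3 objects, n = 2 numeric attributes
# (say height in cm and weight in kg).
data_matrix = [
    [170.0, 65.0],
    [160.0, 55.0],
    [180.0, 75.0],
]
m = len(data_matrix)     # number of rows (data objects)
n = len(data_matrix[0])  # number of columns (attributes)

# Transpose: a standard matrix operation; each row is now one attribute.
columns = [list(col) for col in zip(*data_matrix)]
column_means = [sum(col) / m for col in columns]

# Each object is a point in n-dimensional space, so distances are defined.
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

d01 = euclidean(data_matrix[0], data_matrix[1])

print(m, n, column_means, round(d01, 3))
```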
The Sparse Data Matrix
•Special case of data matrix in which attributes are of the same type and asymmetric.
•Eg: Transaction data represented as records, Document term matrix
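The document-term matrix example can be sketched as follows. The three-sentence corpus is invented for illustration, and storing each row as a `collections.Counter` is just one possible sparse representation: only the terms that actually occur in a document are stored.

```python
from collections import Counter

# Hypothetical three-document corpus.
documents = [
    "data mining finds patterns in data",
    "sparse data has mostly zero values",
    "patterns in graphs",
]

vocabulary = sorted({word for doc in documents for word in doc.split()})

# Sparse form: per document, store only the terms that occur (non-zero counts).
sparse_rows = [Counter(doc.split()) for doc in documents]

# Dense form for comparison: one column per vocabulary term, mostly zeros.
# A Counter returns 0 for a term it has never seen.
dense_rows = [[row[term] for term in vocabulary] for row in sparse_rows]

stored_sparse = sum(len(row) for row in sparse_rows)
stored_dense = len(documents) * len(vocabulary)

print(stored_sparse, "stored entries instead of", stored_dense)
```

The gap between the two counts widens quickly as the vocabulary grows, since each document uses only a small fraction of all terms.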
Graph Based Data
Graph based data set:
•Data with relationships among objects
•Data with objects that are graphs
Graph based Data set
Data with relationships among objects
•Relationships among objects convey a lot of information
•Data represented as a graph
•Objects: nodes
•Relationships: links
Data with objects that are graphs
•Objects contain sub objects
•Represented as graphs
•Substructure mining
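For the first kind of graph data (objects as nodes, relationships as links), an adjacency list is a common in-memory form. The sketch below uses a made-up set of linked web pages; the page names and the `links` list are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical linked web pages: objects are nodes, hyperlinks are links.
links = [
    ("home", "about"),
    ("home", "courses"),
    ("courses", "data_mining"),
    ("about", "home"),
]

# Adjacency list: each node maps to the set of nodes it links to.
graph = defaultdict(set)
for source, target in links:
    graph[source].add(target)
    graph[target]  # touching the key registers nodes with no outgoing links

# In-degree (how many pages link to a node) comes only from the relationships.
in_degree = {node: 0 for node in graph}
for source, target in links:
    in_degree[target] += 1

print(dict(graph), in_degree)
```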
Ordered Data
Ordered data set:
•Sequence Data
•Sequential Data
•Time Series Data
•Spatial Data
Sequential Data
•Also referred to as temporal data
•Each record has a time associated with it
•Eg: Sequential transaction data
Sequence Data
Consists of a data set that is a sequence of individual entities such as a sequence of words or letters.
Similar to sequential data, except no time is attached.
Eg: Genomic data
Time Series Data
Each record is a time series
Series of measurements taken over time
Temporal autocorrelation: if two measurements are close in time then the values are often very similar
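Temporal autocorrelation can be quantified with a lag-1 autocorrelation coefficient. The sketch below uses one common estimator and two invented series: a smooth run of hourly temperatures, and the same readings in a scrambled order. The function name and data are assumptions for illustration.

```python
def lag1_autocorrelation(series):
    """Sample autocorrelation between each value and the one that follows it."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    denominator = sum((x - mean) ** 2 for x in series)
    return numerator / denominator

# Hypothetical hourly temperatures: neighbouring readings are similar.
smooth = [20.0, 20.5, 21.0, 21.8, 22.5, 23.0, 22.6, 21.9, 21.0, 20.4]

# The same readings in a scrambled order: the values are unchanged,
# but closeness in time no longer implies closeness in value.
scrambled = [20.0, 23.0, 20.4, 22.5, 21.0, 22.6, 20.5, 21.9, 21.0, 21.8]

r_smooth = lag1_autocorrelation(smooth)
r_scrambled = lag1_autocorrelation(scrambled)

print(f"smooth: {r_smooth:.2f}, scrambled: {r_scrambled:.2f}")
```

A clearly positive value for the ordered series is what the slide calls temporal autocorrelation; it is a property of the ordering, not of the values themselves.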
Spatial Data
Positions or areas
Eg: weather data
Spatial autocorrelation: objects that are physically close tend to be similar in other ways as well
Handling Non-Record Data
Record oriented techniques can be applied to non-record data by extracting features from data objects and using these features to create a record corresponding to each object.
In some cases it is easy to represent the data in a record format, but this type of representation does not capture all the information.
DATA QUALITY
01 MEASUREMENT AND DATA COLLECTION ISSUES: We address the challenges faced when measuring data
02 ISSUES RELATED TO APPLICATIONS: We describe some application specific issues
Data Quality
Data mining is often applied to data that was collected for another purpose, or for future but unspecified applications.
Therefore data mining usually cannot address data quality at source.
Data mining focuses on the detection and correction of data quality problems.
Data mining uses algorithms that can tolerate poor data quality.
Measurement and Data Collection Issues
Input data may have problems due to human error, limitations of measuring devices or flaws in the data collection process.
Therefore data mining usually cannot address data quality at source.
Topics:
•Measurement and Data Collection Errors
•Noise and Artifacts
•Bias, Precision and Accuracy
Measurement and Data Collection Errors
Measurement Error: Any problem resulting from the measurement process. Eg: Value recorded is different from true value.
For continuous attributes, the numerical difference of the measured and true value is called the error.
Data Collection Error: omitting data objects or attribute values, or inappropriately including a data object.
Both measurement errors and data collection errors can be either systematic or random.
Noise and Artifacts
Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects.
Noise is often associated with data that have a temporal or spatial component; in such cases, techniques from signal or image processing can be applied to reduce noise.
Noise and Artifacts
Data errors may also be more deterministic: deterministic distortions are called artifacts.
Eg: a streak in the same place on a set of photographs
Precision, Bias and Accuracy
Precision: Closeness of repeated measurements (of the same quantity) to one another.
Bias: A systematic variation of measurements from the quantity being measured.
Accuracy: The closeness of measurements to the true value of the quantity being measured.
Precision, Bias and Accuracy
Precision is often measured as the standard deviation of a set of values.
Bias is measured by taking the difference between the mean of the set of values and the known value of the quantity being measured.
Accuracy depends upon precision and bias.
Use of significant digits: Goal is to use only as many digits to represent the result of a measurement or a calculation as are justified by the precision.
Precision, Bias and Accuracy
Eg: weighing scale: 90kg object
{87, 88, 89, 90, 87, 89, 91}
Precision: std dev ≈ 1.39 (population standard deviation)
Bias: mean(data) – 90 = 88.7 – 90 = -1.3
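The weighing-scale example can be recomputed with Python's standard `statistics` module. This sketch assumes the population standard deviation (`pstdev`) as the precision measure; the variable names are illustrative.

```python
from statistics import mean, pstdev

true_weight = 90  # kg, the known weight of the object
readings = [87, 88, 89, 90, 87, 89, 91]

# Precision: spread of the repeated readings. pstdev is the population
# standard deviation; the sample version (statistics.stdev) gives about 1.50.
precision = pstdev(readings)

# Bias: mean of the readings minus the known value.
bias = mean(readings) - true_weight

print(f"precision = {precision:.3f}, bias = {bias:.3f}")
```

A negative bias means the scale reads low on average, while the precision figure says how far the individual readings scatter around their own mean.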
Outliers
Anomalous objects: Data objects that have characteristics that are different from most of the other objects in the data set.
Anomalous values: Values of an attribute that are unusual with respect to the typical values for that attribute.
Outliers can be legitimate data objects or values.
Missing Values
•Eliminate data objects or attributes
•Estimate missing values
•Ignore the missing value during analysis
Eliminate Data Objects or Attributes
Eliminate objects with missing values: a simple and effective solution to the problem of missing values.
However, even a partially specified data object contains some information, and if many objects have missing values, reliable analysis becomes difficult.
A related strategy is to eliminate attributes with missing values.
Eliminated attributes could be critical to the analysis, therefore this should be done carefully.
Estimate Missing Values
Sometimes missing values can be reliably estimated.
If the attribute is continuous, then the average attribute value of the nearest neighbours can be used.
If the attribute is categorical, then the most commonly occurring attribute value can be taken.
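Both estimation rules can be sketched in a few lines. The helpers `impute_continuous` and `impute_categorical` are hypothetical, and "nearest" is taken here to mean closest position in the list, which is an assumption made only to keep the example short; in practice any distance between data objects could be used.

```python
from statistics import mean, mode

def impute_continuous(values, index, k=2):
    """Estimate values[index] as the mean of the k nearest known values,
    where 'nearest' means closest position in the list (an assumption)."""
    known = [(abs(i - index), v) for i, v in enumerate(values) if v is not None]
    known.sort(key=lambda pair: pair[0])
    return mean(v for _, v in known[:k])

def impute_categorical(values):
    """Estimate a missing category as the most commonly occurring known one."""
    return mode(v for v in values if v is not None)

# Hypothetical attributes with one missing entry each (None).
heights = [150.0, 152.0, None, 156.0, 170.0]
blood_types = ["O", "A", "O", None, "B"]

estimated_height = impute_continuous(heights, index=2)
estimated_blood_type = impute_categorical(blood_types)

print(estimated_height, estimated_blood_type)
```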
Inconsistent Values
Can be caused by erroneous data entry or scanning.
Should be detected and corrected if possible.
Correction of an inconsistency requires additional or redundant information.
Duplicate Values
A data set might include data objects that are duplicates of each other.
If there are two objects that actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved.
Care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates.
Issues Related to Applications
Timeliness
•Some data starts to age as soon as it is collected
•If the data is out of date so are the models and patterns based on it.
Relevance
•Available data must contain the information necessary for the application
•Sampling bias: occurs when a sample does not contain different types of objects in proportion to their actual occurrence in the population.
Knowledge about the Data
•Ideally, data sets are accompanied by documentation
•Other important characteristics are the precision, type of features, scale of measurement, origin of data
THANKS
Do you have any questions?

CREDITS: This presentation template was created
by Slidesgo, including icons by Flaticon, and
infographics & images by Freepik
Please keep this slide for attribution.