
21-09-2020

DATA MINING:
Prof. Sherica Lavinia Menezes
Asst. Professor
Computer Engineering Department
Goa College of Engineering

AGENDA
01 ISSUES RELATED TO DATA
02 TYPES OF DATA
03 DATA QUALITY

LEARNING OUTCOMES
01 Define attributes, precision, bias and accuracy.
02 Explain the importance of knowing the nature of input data.
03 Discuss issues in measurement and data collection.
04 Classify attributes as binary, discrete or continuous.

ISSUES IN DATA
I. Types of Data
II. Quality of Data
III. Preprocessing
IV. Data Analysis

IMPORTANCE OF KNOWING YOUR DATA

Hi,
I have attached the data file that I mentioned earlier. Each line contains the information for a single patient and consists of five fields. We want to predict the last field using the other fields. I do not have time to provide any more information about the data since I'll be out of station, but hopefully that won't slow you down.
Thanks and see you in a couple of days.

Despite some misgivings, you proceed to analyse the data. The first few rows are as follows:
012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6

IMPORTANCE OF KNOWING YOUR DATA


Statistician: So, you got the data for all the patients?

Data Miner: Yes, I haven't had much time for analysis, but I do have a few interesting results.

Statistician: Amazing. There were so many data issues with this set of patients that I couldn't do much.

Data Miner: Oh? I didn't hear about any possible problems.

Statistician: Well, first there is field 5, the variable we want to predict. It is common knowledge among people who analyze this type of data that results are better if you work with the log of the values, but I didn't discover this until later. Was it mentioned to you?


IMPORTANCE OF KNOWING YOUR DATA


Data Miner: No!

Statistician: But surely you heard about what happened to field 4? It is supposed to be measured on a scale of 1 to 10, with 0 indicating a missing value, but because of a data entry error all 10s were changed to 0s. Unfortunately, since some of the patients have missing values for this field, it is impossible to say whether a 0 in this field is a real 0 or a 10. Quite a few of the records have that problem.

Data Miner: Interesting. Were there any other problems?

Statistician: Yes, fields 2 and 3 are basically the same, but I assume that you probably noticed that.

IMPORTANCE OF KNOWING YOUR DATA


Data Miner: Yes, but these fields were only weak predictors of field 5.

Statistician: Anyway, given all those problems, I am surprised you were able to accomplish anything.

Data Miner: True, but my results are really quite good. Field 1 is a very strong predictor of field 5. I'm surprised that this wasn't noticed before.

Statistician: What? Field 1 is just an identification number.

Data Miner: Nonetheless, my results speak for themselves.

Statistician: Oh, no! I just remembered. We assigned the ID numbers after we sorted the records based on field 5. There is a strong connection, but it's meaningless. Sorry.
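The ID-number trap in the dialogue can be reproduced in a few lines. The sketch below is only an illustration: the 200 values standing in for field 5 are made up, and `pearson` is a hypothetical helper written for this example, not part of the original data set.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for field 5: 200 made-up target values.
field5 = [random.uniform(0, 500) for _ in range(200)]

# IDs assigned in arrival order carry no information about field 5.
ids_arrival = list(range(1, len(field5) + 1))
r_arrival = pearson(ids_arrival, field5)

# IDs assigned AFTER sorting on field 5 simply encode the rank of the target.
field5_sorted = sorted(field5)
ids_after_sort = list(range(1, len(field5_sorted) + 1))
r_leaky = pearson(ids_after_sort, field5_sorted)

print(f"ID assigned before sorting: r = {r_arrival:+.2f}")
print(f"ID assigned after sorting:  r = {r_leaky:+.2f}")
```

The second correlation comes out close to 1 even though an ID number says nothing about a patient, which is why the meaning of each field has to be known before a predictor is trusted.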


TYPES OF DATA
01 ATTRIBUTES & MEASUREMENTS: We address the issue of describing data using attributes.
02 TYPES OF DATA SETS: We describe some of the most common data types.

TYPES OF DATA

● A data set is a collection of data objects.

● Data objects are described using a number of attributes.


What is an Attribute?

An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.

Student_ID    Student_Name    Student_Age
123           Apurva          18
124           Shivangi        17
125           Shreya          18


What is a Measurement?

A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.

The process of measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object.

Example:
Attribute: weight of the apple
Measurement scale: decimal number up to 1 decimal place


Type of an Attribute

The values used to represent an attribute may have properties that are not properties of the attribute itself, and vice versa.

Example 1: Employee Age and ID Number
Age: integer
ID Number: integer

Average of age: makes sense.
Average of ID: makes no sense.

ID captures only the distinctness of integers.
Age captures almost all properties of integers; however, it has an upper limit, which integers don't.


The different types of attributes

PROPERTY          ATTRIBUTE TYPE
DISTINCTNESS      NOMINAL
ORDER             ORDINAL
ADDITION          INTERVAL
MULTIPLICATION    RATIO


Different types of attributes


NOMINAL
Description: Provide only enough information to distinguish one object from another.
Examples: Zip codes, ID nos., eye colour, gender.
Operations: Mode, entropy, contingency correlation, chi-square test.

ORDINAL
Description: Provide enough information to order objects.
Examples: Grades, economic status, educational qualifications.
Operations: Median, percentiles, rank correlation, run tests, sign tests.

INTERVAL
Description: Differences between values are meaningful.
Examples: Calendar dates, temperature in C, temperature in F.
Operations: Mean, standard deviation, Pearson's correlation, t and F tests.

RATIO
Description: Differences and ratios of the values are meaningful.
Examples: Temperature in Kelvin, age, mass, length, monetary quantity.
Operations: Geometric mean, harmonic mean, percent variation.
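As a rough illustration of which summary statistics suit which attribute type, the sketch below applies one operation per type using Python's standard `statistics` module. All sample values and the `grade_rank` encoding are invented for this example.

```python
import statistics

# Hypothetical sample values for one attribute of each type.
eye_colour = ["brown", "blue", "brown", "green", "brown"]   # nominal
grades = ["C", "A", "B", "B", "A"]                          # ordinal
temps_c = [21.0, 23.5, 19.0, 22.5]                          # interval
masses_kg = [2.0, 8.0, 4.0]                                 # ratio

# Nominal: only distinctness is meaningful, so the mode is a valid summary.
nominal_mode = statistics.mode(eye_colour)

# Ordinal: order is meaningful, so rank-based summaries such as the median work.
grade_rank = {"C": 1, "B": 2, "A": 3}        # assumed order-preserving encoding
ranked = sorted(grades, key=grade_rank.get)
ordinal_median = ranked[len(ranked) // 2]    # middle element of an odd-length list

# Interval: differences are meaningful, so mean and standard deviation work.
interval_mean = statistics.mean(temps_c)
interval_stdev = statistics.stdev(temps_c)

# Ratio: ratios are meaningful as well, so the geometric mean is defined.
ratio_gmean = statistics.geometric_mean(masses_kg)

print(nominal_mode, ordinal_median, interval_mean, ratio_gmean)
```

Each type also supports the operations of the types listed before it: a mode can be taken of ratio data, but a geometric mean of eye colours is undefined.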


Examples of Different types of attributes


NOMINAL
Description: Categories (no ordering or direction).
Examples: genotype, blood type, zip code, gender, race, eye color, political party.

ORDINAL
Description: Ordered categories (rankings, order, scaling).
Examples: socio-economic status ("low income", "middle income", "high income"), income level ("less than 50K", "50K-100K", "over 100K"), satisfaction rating ("dislike", "neutral", "like").

INTERVAL
Description: Differences between measurements are meaningful, but there is no true zero.
Examples: temperature (Fahrenheit), temperature (Celsius), pH, SAT score (200-800), credit score (300-850).

RATIO
Description: Differences between measurements are meaningful and a true zero exists.
Examples: enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight, length, temperature in Kelvin (0.0 Kelvin really does mean "no heat"), survival time.


Transformations that define attribute levels


NOMINAL
Transformation: Any one-to-one mapping.
Comment: If all employee IDs were to be reassigned, it wouldn't make a difference.

ORDINAL
Transformation: Any order-preserving change of values.
Comment: Socio-economic status ("low income", "middle income", "high income") can equally well be represented as {1, 2, 3}.

INTERVAL
Transformation: new_value = a*old_value + b, where a and b are constants.
Comment: To map the scale (0 to 800) onto (1000 to 1800), use a = 1 and b = 1000, so 400 maps to 1*400 + 1000 = 1400. To map the scale (200 to 800) onto (1000 to 1800), use a = 800/600 and b = 1000 - a*200 ≈ 733.3, so 400 maps to (800/600)*400 + 733.3 ≈ 1266.7.

RATIO
Transformation: new_value = a*old_value.
Comment: Applies to enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight, length, temperature in Kelvin (0.0 Kelvin really does mean "no heat"), survival time.
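The constants a and b of an interval transformation follow from the end points of the two ranges. The sketch below is a minimal illustration; `interval_coefficients`, `celsius_to_fahrenheit` and `kg_to_pounds` are helper names chosen for this example. It also checks which quantities survive each kind of transformation.

```python
def interval_coefficients(old_range, new_range):
    """Return (a, b) so that new_value = a * old_value + b maps old_range onto new_range."""
    old_lo, old_hi = old_range
    new_lo, new_hi = new_range
    a = (new_hi - new_lo) / (old_hi - old_lo)
    b = new_lo - a * old_lo
    return a, b


# Map the scale 200..800 onto 1000..1800.
a, b = interval_coefficients((200, 800), (1000, 1800))
mapped_400 = a * 400 + b


def celsius_to_fahrenheit(c):
    """Interval transformation with a = 9/5 and b = 32."""
    return 9 / 5 * c + 32


f10, f20, f30 = (celsius_to_fahrenheit(c) for c in (10, 20, 30))
difference_ratio = (f30 - f20) / (f20 - f10)  # ratios of differences are preserved
value_ratio = f20 / f10                       # ratios of values are not (20/10 is 2)


def kg_to_pounds(kg):
    """Ratio transformation: multiplication by a constant only."""
    return 2.20462 * kg


mass_ratio = kg_to_pounds(8) / kg_to_pounds(4)  # ratios of values are preserved

print(round(mapped_400, 1), round(difference_ratio, 2), round(value_ratio, 2), round(mass_ratio, 2))
```

Because b is not zero for the temperature conversion, "twice as hot" in Celsius does not carry over to Fahrenheit, whereas "twice as heavy" carries over from kilograms to pounds.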


Describing Attributes by Number of Values


Attributes

Binary: Attributes assume only two values. Represented as Boolean or as integers 0/1.

Discrete: Attributes have a finite or countably infinite set of values. Represented as integers.

Continuous: Values are real numbers. Represented as floating point.
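A first guess at the category of an attribute can be made from the values themselves. The function below is only a rough heuristic written for illustration (`classify_attribute` is a hypothetical name); knowledge of what the attribute means should always override it.

```python
def classify_attribute(values):
    """Rough heuristic: label a column as binary, discrete or continuous."""
    distinct = set(values)
    if len(distinct) == 2:
        return "binary"        # exactly two values, e.g. 0/1 or yes/no
    if all(isinstance(v, (int, str)) for v in distinct):
        return "discrete"      # integers or category labels
    return "continuous"        # anything containing real-valued measurements


print(classify_attribute([0, 1, 1, 0]))          # attendance flags
print(classify_attribute([17, 18, 18, 19]))      # ages in whole years
print(classify_attribute([33.5, 16.9, 24.0]))    # real-valued measurements
```

The heuristic can be fooled: a temperature column that happens to contain only two distinct readings would be labelled binary, which is one more reason to ask how the data was measured.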


Think of Appropriate examples


Attributes

Binary: True/False, Yes/No, Attendance, Gender

Discrete: Roll No, Name, Blood type

Continuous: Temperature, Pressure, Height, Weight


Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as important.

Object: Student
Attribute: Records whether the student took a particular course at a university
Attribute value: 1/0

Student   Course 1   Course 2   Course 3   Course 4   Course 5   Course 6
A         1          0          0          1          1          0
B         1          1          0          0          1          0
C         0          1          1          0          1          0
D         1          0          0          1          0          1
E         1          1          0          1          0          0


Types of Data Sets


General Characteristics of Data Sets

Dimensionality: Number of attributes that the objects in the data set possess.

Sparsity: Data sets with asymmetric features have most values equal to 0 (sparse data).

Resolution: Data can be recorded at various resolutions.
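Dimensionality and sparsity can both be read off a small 0/1 matrix. The sketch below uses a five-student, six-course enrolment matrix and stores only the non-zero entries; the dictionary-of-sets layout is one possible sparse representation chosen for this example, not the only one.

```python
# Dense 0/1 matrix: one row per student, one column per course.
dense = {
    "A": [1, 0, 0, 1, 1, 0],
    "B": [1, 1, 0, 0, 1, 0],
    "C": [0, 1, 1, 0, 1, 0],
    "D": [1, 0, 0, 1, 0, 1],
    "E": [1, 1, 0, 1, 0, 0],
}

# Dimensionality: the number of attributes (courses) per object (student).
dimensionality = len(next(iter(dense.values())))

# Sparse form: keep only the course numbers (1-based) where the value is non-zero.
sparse = {
    student: {i + 1 for i, value in enumerate(row) if value != 0}
    for student, row in dense.items()
}

# Sparsity: the fraction of entries that are zero and need not be stored.
nonzero = sum(len(courses) for courses in sparse.values())
zero_fraction = 1 - nonzero / (len(dense) * dimensionality)

print(dimensionality, sparse["A"], zero_fraction)
```

With thousands of courses and a handful of enrolments per student, the zero fraction approaches 1 and the sparse form becomes far smaller than the dense one.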


Types of Data Sets

Data set
• Record based
• Graph based
• Ordered data


Record Based Data

Record based Data set
• Record Data
• Transaction Data
• The Data Matrix
• The Sparse Data Matrix


Record based Data

• No explicit relationship among records or data fields
• Every record has the same number of attributes
• Usually stored either in flat files or in a relational database


Transaction or Market Basket Data
• Variation of record data
• Each record involves a set of items
• Can be viewed as a set of records whose attributes are asymmetric
• Most often the attributes are binary, but they can also be discrete or continuous


The Data Matrix

• If the data objects in a collection all have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space.
• A set of such data can be interpreted as an m × n matrix, with m rows (one per object) and n columns (one per attribute).
• Variation of record data
• Because it contains numeric attributes, matrix operations can be applied
• Most often used for statistical data
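A minimal sketch of a data matrix in plain Python, with one row per object and one column per attribute. The numbers are invented, and mean-centring is just one example of an operation that makes sense because every entry is numeric.

```python
# Hypothetical data matrix: m = 4 objects (rows), n = 3 numeric attributes (columns).
data = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
]
m, n = len(data), len(data[0])

# Column means: one value per attribute.
col_means = [sum(row[j] for row in data) / m for j in range(n)]

# Mean-centre every column, a common first step before further matrix operations.
centered = [[row[j] - col_means[j] for j in range(n)] for row in data]

print(m, n, col_means)
```

In practice a numerical library would hold the matrix, but the row-per-object, column-per-attribute layout is the same.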


The Sparse Data Matrix

• Special case of the data matrix in which the attributes are of the same type and asymmetric.
• E.g.: transaction data represented as records, document-term matrix
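A document-term matrix can be sketched with `collections.Counter`. The three short documents are invented, and splitting on whitespace is a simplifying assumption; real text would need proper tokenisation.

```python
from collections import Counter

# Hypothetical documents, tokenised naively on whitespace.
docs = [
    "data mining finds patterns in data",
    "graphs model relationships",
    "mining graphs",
]
counts = [Counter(doc.split()) for doc in docs]

# Vocabulary: one column per distinct term across all documents.
vocab = sorted(set().union(*counts))

# Dense document-term matrix: mostly zeros, since each document uses few terms.
dense = [[count[term] for term in vocab] for count in counts]

# Sparse form: store only the non-zero term counts of each document.
sparse = [dict(count) for count in counts]

print(vocab)
print(dense[0])
print(sparse[2])
```

A zero in this matrix only says that a term is absent, so the attributes are asymmetric in the sense described above.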


Graph Based Data

Graph based Data set
• Data with relationships among objects
• Data with objects that are graphs


Graph based Data set


Data with relationships among objects
• Relationships among objects convey a lot of information
• Data represented as a graph
• Objects: nodes
• Relationships: links

Data with objects that are graphs
• Objects contain sub-objects
• Represented as graphs
• Substructure mining


Ordered Data

Ordered Data set
• Sequence Data
• Sequential Data
• Time Series Data
• Spatial Data


Sequential Data

• Also referred to as temporal data
• Each record has a time associated with it
• E.g.: sequential transaction data


Sequence Data

Consists of a data set that is a sequence of individual entities, such as a sequence of words or letters.
Similar to sequential data, except that no time stamps are attached.
E.g.: genomic data


Time Series Data

Each record is a time series: a series of measurements taken over time.
Temporal autocorrelation: if two measurements are close in time, then their values are often very similar.
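Temporal autocorrelation can be quantified by correlating a series with a shifted copy of itself. The sketch below uses the usual sample estimate at lag 1; both series are made-up temperature readings, and `shuffled` holds the same values in a deliberately scrambled order.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance_sum / variance_sum


# Hypothetical readings taken at consecutive times: neighbours are similar.
smooth = [20.0, 20.5, 21.1, 21.8, 22.0, 21.6, 21.0, 20.4, 20.1, 19.9]

# The same ten values with the time order destroyed.
shuffled = [21.8, 19.9, 22.0, 20.0, 21.6, 20.1, 21.1, 20.4, 21.0, 20.5]

r_smooth = autocorrelation(smooth)
r_shuffled = autocorrelation(shuffled)
print(f"in time order: {r_smooth:+.2f}, scrambled: {r_shuffled:+.2f}")
```

The positive value for the ordered series reflects the similarity of neighbouring measurements; once the order is scrambled, that structure disappears.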


Spatial Data

Objects have spatial attributes such as positions or areas.
E.g.: weather data
Spatial autocorrelation: objects that are physically close tend to be similar in other ways as well.


Handling Non-Record Data

Record-oriented techniques can be applied to non-record data by extracting features from the data objects and using these features to create a record corresponding to each object.

In some cases it is easy to represent the data in a record format, but this type of representation does not capture all the information.
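One way to turn sequence data into records is to count short subsequences (k-mers). The sketch below is an assumed feature-extraction scheme chosen for illustration, with `kmer_record` as a hypothetical helper; it is not a method prescribed by the slides.

```python
from collections import Counter
from itertools import product


def kmer_record(sequence, k=2, alphabet="ACGT"):
    """Turn a sequence into a fixed-length record of k-mer counts."""
    # One feature per possible k-mer, in a fixed order shared by all records.
    kmers = ["".join(letters) for letters in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] for kmer in kmers]


record = kmer_record("ACGTAC")
print(len(record), record)
```

Every sequence now becomes a record with the same 16 attributes, so record-oriented techniques apply, but the positions of the k-mers within the sequence are lost, which is the kind of information a record format fails to capture.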


DATA QUALITY
01 MEASUREMENT AND DATA COLLECTION ISSUES: We address the challenges faced when measuring data.
02 ISSUES RELATED TO APPLICATIONS: We describe some application-specific issues.


Data Quality
Data mining applications are often applied to data that was collected for another purpose, or for future but unspecified applications.

Therefore, data mining usually cannot address data quality at the source.

Data mining focuses on the detection and correction of data quality problems.

Data mining uses algorithms that can tolerate poor data quality.


Measurement and Data Collection Issues

Input data may have problems due to human error, limitations of measuring devices, or flaws in the data collection process.

Therefore, data mining usually cannot address data quality at the source.

Topics:
• Measurement and data collection errors
• Noise and artifacts
• Bias, precision and accuracy


Measurement and Data Collection Errors

Measurement error: Any problem resulting from the measurement process, e.g. the value recorded differs from the true value.

For continuous attributes, the numerical difference between the measured and the true value is called the error.

Data collection error: Omitting data objects or attribute values, or inappropriately including a data object.

Both measurement errors and data collection errors can be either systematic or random.


Noise and Artifacts


Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects.

Noise is often associated with data that has a temporal or spatial component. In such cases, techniques from signal or image processing can be applied to reduce noise.
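A centred moving average is one of the simplest signal-processing techniques for damping random fluctuations in a series. The sketch below is a minimal illustration with invented readings and a hypothetical `moving_average` helper; the window size of 3 is an arbitrary choice.

```python
def moving_average(series, window=3):
    """Centred moving average; the window shrinks at the two ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical readings of a roughly constant quantity with one noisy spike.
noisy = [10.0, 10.4, 9.7, 10.2, 14.0, 10.1, 9.9, 10.3]
smoothed = moving_average(noisy)

print([round(value, 2) for value in smoothed])
```

Smoothing trades noise for detail: if the spike were a genuine event rather than noise, averaging would partly hide it.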


Noise and Artifacts


Data errors may also be more deterministic. Deterministic distortions of this kind are called artifacts.

E.g.: a streak in the same place on a set of photographs


Precision, Bias and Accuracy

Precision: The closeness of repeated measurements (of the same quantity) to one another.

Bias: A systematic variation of measurements from the quantity being measured.

Accuracy: The closeness of measurements to the true value of the quantity being measured.


Precision, Bias and Accuracy

Precision is often measured as the standard deviation of a set of values.

Bias is measured by taking the difference between the mean of the set of values and the known value of the quantity being measured.

Accuracy depends upon precision and bias.

Use of significant digits: The goal is to use only as many digits to represent the result of a measurement or a calculation as are justified by the precision.


Precision, Bias and Accuracy

E.g.: a weighing scale measures a 90 kg object seven times:
{87, 88, 89, 90, 87, 89, 91}

Precision: standard deviation ≈ 1.39 (population formula)

Bias: mean(data) - 90 = 88.7 - 90 = -1.3
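The weighing-scale figures can be checked with Python's `statistics` module. This sketch assumes the population standard deviation (`pstdev`, dividing by the number of readings); the sample formula (`stdev`) would give a slightly larger value.

```python
import statistics

readings = [87, 88, 89, 90, 87, 89, 91]  # repeated weighings of the same object, in kg
true_value = 90                          # known weight of the object, in kg

# Precision: spread of the repeated measurements around their own mean.
precision = statistics.pstdev(readings)

# Bias: systematic offset of the mean measurement from the true value.
bias = statistics.mean(readings) - true_value

print(f"precision (std dev) = {precision:.3f} kg, bias = {bias:.3f} kg")
```

A negative bias means the scale reads low on average, while the precision says how much individual readings scatter regardless of that offset.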


Outliers

Anomalous objects: Data objects that have characteristics that are different from most of the other objects in the data set.

Anomalous values: Values of an attribute that are unusual with respect to the typical values for that attribute.

Outliers can be legitimate data objects or values.


Missing Values

Strategies for dealing with missing values:
• Eliminate data objects or attributes
• Estimate missing values
• Ignore the missing value during analysis


Eliminate Data Objects or Attributes

Eliminating objects with missing values is a simple and effective solution to the problem of missing values.

However, even a partially specified data object contains some information, and if many objects have missing values, reliable analysis becomes difficult.

A related strategy is to eliminate attributes with missing values.

Eliminated attributes could be critical to the analysis, so this should be done carefully.


Estimate Missing Values

Sometimes missing values can be reliably estimated.

If the attribute is continuous, then the average attribute value of the nearest neighbours can be used.

If the attribute is categorical, then the most commonly occurring attribute value can be taken.
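Both estimation rules can be sketched on a tiny record set. Everything here is assumed for illustration: the records, the helper names `impute_continuous` and `impute_categorical`, the choice of k = 2, and the use of a single other attribute to decide which neighbours are nearest.

```python
import statistics


def impute_continuous(records, target, by, k=2):
    """Fill missing `target` values with the mean of the k nearest neighbours on `by`."""
    known = [r for r in records if r[target] is not None]
    for r in records:
        if r[target] is None:
            nearest = sorted(known, key=lambda other: abs(other[by] - r[by]))[:k]
            r[target] = statistics.mean(other[target] for other in nearest)


def impute_categorical(records, target):
    """Fill missing `target` values with the most commonly occurring value."""
    most_common = statistics.mode(
        r[target] for r in records if r[target] is not None
    )
    for r in records:
        if r[target] is None:
            r[target] = most_common


# Hypothetical student records; None marks a missing value.
students = [
    {"age": 17, "height": 160.0, "blood": "O"},
    {"age": 18, "height": 170.0, "blood": "A"},
    {"age": 18, "height": None, "blood": "O"},
    {"age": 25, "height": 180.0, "blood": None},
]

impute_continuous(students, target="height", by="age")
impute_categorical(students, target="blood")
print(students[2]["height"], students[3]["blood"])
```

Estimated values are guesses, so it is worth keeping a flag for which entries were imputed when the results are interpreted later.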


Inconsistent Values

Can be caused by erroneous data entry or scanning.

Should be detected and corrected if possible.

Correction of an inconsistency requires additional or redundant information.


Duplicate Values

A data set might include data objects that are duplicates of each other.

If there are two objects that actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved.

Care needs to be taken to avoid accidentally combining data objects that are similar but not duplicates.


Issues Related to Applications

Timeliness
• Some data starts to age as soon as it is collected.
• If the data is out of date, so are the models and patterns based on it.

Relevance
• Available data must contain the information necessary for the application.
• Sampling bias: occurs when a sample does not contain different types of objects in proportion to their actual occurrence in the population.

Knowledge about the Data
• Ideally, data sets are accompanied by documentation.
• Other important characteristics are the precision, type of features, scale of measurement and origin of the data.


THANKS

Do you have any questions?


CREDITS: This presentation template was created
by Slidesgo, including icons by Flaticon, and
infographics & images by Freepik
Please keep this slide for attribution.
