Professional Documents
Culture Documents
Data attributes
• An attribute is a property or characteristic of an object.
―Attribute is also known as variable, field, characteristic, dimension, or feature
• Attribute values are numbers or symbols assigned to an attribute for a
particular object
• Distinction between attributes and attribute values
• Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
Attributes
• This is the First step of Data Data-preprocessing.
• Before preprocessing data is differentiated wrt. different types of
attributes.
• Description of attribute types:
1. Qualitative (Nominal (N), Ordinal (O), Binary(B)).
2. Quantitative (Discrete, Continuous)
Nominal Attributes – related to names : The values of a Nominal attribute are name of things,
some kind of symbols.
Values of Nominal attributes represents some category or state and that’s why nominal
attribute also referred as categorical attributes and there is no order (rank, position)
among values of nominal attribute.
Types of Attributes
• There are different types of attributes
Categorical
Qualitative
female}
hardness of
Ordinal attribute median,
minerals,
Ordinal values also order {good, better, best}, percentiles, rank
objects. correlation, run
grades, street
(<, >) numbers tests, sign tests
Fahrenheit
meaningful. (+, - ) F tests
Numeric
temperature in
For ratio variables, Kelvin,
geometric mean,
both differences and monetary
Ratio ratios are quantities, harmonic mean,
percent variation
meaningful. (*, /) counts, age, mass,
length, current
Discrete and Continuous Attributes
• Discrete Attribute • Continuous Attribute
– Has only a finite or countably – Has real numbers as attribute
infinite set of values values
– Examples: zip codes, counts, or the – Examples: temperature, height, or
set of words in a collection of weight.
documents – Practically, real values can only be
– Often represented as integer measured and represented using a
variables. finite number of digits.
– Note: binary attributes are a – Continuous attributes are typically
special case of discrete attributes represented as floating point
variables.
Asymmetric vs Symmetric Attributes
• Asymmetric • Symmetric
• outcomes not equally important. • both outcomes equally important
e.g., medical test (positive vs. e.g., gender
negative)
• Convention: assign 1 to most
important outcome (e.g., HIV
positive)
Data Similarity and
Dissimilarity
Why?
• In data mining applications, such as clustering, outlier analysis, and
nearest-neighbor classification etc., we need ways to assess how alike
or unalike objects are in comparison to one another.
• A cluster is a collection of data objects such that the objects within a
cluster are similar to one another and dissimilar to the objects in
other clusters.
• Outlier analysis also employs clustering-based techniques to identify potential
outliers as objects that are highly dissimilar to others.
• d(i, i) = 0
• d(i, j) = d(j, i)
(For readability, we do not show the d(j, i) entries; the matrix is symmetric.)
• sim(i, j) = 1 – d(i, j)
Proximity Measures for Nominal Attributes
• A nominal attribute can take on two or more states
• The states are representations of some state, names, categories etc.
• Non-quantifiable in nature, thus cannot participate in numerical
calculations.
• How to measure proximity among these attributes:
• Let the number of states of a nominal attribute be M.
• The states can be denoted by letters, symbols, or a set of integers, such as 1,
2, …, M. (such integers are used just for data handling and do not represent
any specific ordering)
• The dissimilarity between two nominal objects i and j can be
computed based on the ratio of mismatches:
• 4 Objects, 2 attributes
• p=2
• d(i,j) = (p-m)/p
where ||x|| is the Euclidean norm of vector x = (x1, x2,…, xp), defined as
• Similarly, ||y|| is the Euclidean norm of vector y
• The measure computes the cosine of the angle between vectors x and
y.
• A cosine value of 0 means that the two vectors are at 90 degrees to
each other (orthogonal) and have no match.
• The closer the cosine value to 1, the smaller the angle and the greater
the match between vectors.
• Cosine similarity does not follow the 4 rules, it is referred to as a
nonmetric measure.
• Cosine similarity between two term-frequency vectors. Suppose that
x and y are the first two term-frequency vectors. That is,
x = (5,0,3,0,2,0,0,2,0,0) and y = (3,0,2,0,1,1,0,1,0,1). How similar are x
and y?
x·y=5×3+0×0+3×2+0×0+2×1+0×1+0×0+2×1
+ 0 × 0 + 0 × 1 = 25
Sim(x, y) = x.y/(||x||||y||)
= 25/(6.48 x 4.12)
=0.936 (up to 3 decimal places)
• Therefore, if we were using the cosine similarity measure to compare
these documents, they would be considered quite similar.