What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
Example object (one row of a record table):
Tid  Refund  Marital Status  Taxable Income  Cheat
10   No      Single          90K             Yes
Types of Attributes
• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10),
grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
• Examples: length, time
Properties of Attribute Values
• The type of an attribute depends on which of the following
properties it possesses:
– Distinctness: = and ≠
– Order: < and >
– Addition: + and -
– Multiplication: * and /
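The way these properties build on one another can be sketched in Python. This is only an illustration: the names `AttributeType`, `REQUIRED_TYPE`, and `supports` are hypothetical, and the sketch assumes that each attribute type also possesses every property of the types listed before it (nominal, then ordinal, then interval, then ratio).

```python
from enum import IntEnum


class AttributeType(IntEnum):
    """Attribute types, ordered by how many properties they possess."""
    NOMINAL = 1   # distinctness only
    ORDINAL = 2   # distinctness and order
    INTERVAL = 3  # distinctness, order, and addition
    RATIO = 4     # all four properties


# Lowest attribute type at which each operation becomes meaningful
# (hypothetical lookup table for this sketch).
REQUIRED_TYPE = {
    "=": AttributeType.NOMINAL,
    "<": AttributeType.ORDINAL,
    ">": AttributeType.ORDINAL,
    "+": AttributeType.INTERVAL,
    "-": AttributeType.INTERVAL,
    "*": AttributeType.RATIO,
    "/": AttributeType.RATIO,
}


def supports(attr_type: AttributeType, op: str) -> bool:
    """Return True if `op` is meaningful for values of `attr_type`."""
    return attr_type >= REQUIRED_TYPE[op]


print(supports(AttributeType.ORDINAL, "<"))   # True
print(supports(AttributeType.INTERVAL, "/"))  # False
```

Using an ordered enum keeps the check to a single comparison; a real system would likely attach the type to each column's metadata instead.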
Interval
– Description: For interval attributes, the differences between values are
meaningful, i.e., a unit of measurement exists. (+, -)
– Examples: calendar dates, temperature in Celsius or Fahrenheit
– Operations: mean, standard deviation, Pearson's correlation, t and F tests
Ratio
– Description: For ratio variables, both differences and ratios are
meaningful. (*, /)
– Examples: monetary quantities, counts, age, mass, length, electrical current
– Operations: geometric mean, harmonic mean, percent variation
Interval data is a data type measured along a scale on which each point is placed at an equal distance
from the next. Interval data always appears as numerical values, where the distance between any two
adjacent points is standardized and equal.
Interval values can be added or subtracted, but multiplying or dividing them is not meaningful. Interval
data is measured on an interval scale. A simple example: the difference between 100 degrees Fahrenheit
and 90 degrees Fahrenheit is the same as the difference between 70 degrees Fahrenheit and 60 degrees
Fahrenheit.
The difference between interval and ratio data is simple: ratio data has a true zero point, so ratios of
values are meaningful. Income, height, weight, annual sales, market share, product defect rates, time to
repurchase, unemployment rate, and crime rate are examples of ratio data.
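A short Python sketch can make the interval/ratio distinction concrete. The values and the helper name `c_to_f` are illustrative assumptions. The sketch shows that equal differences stay equal when temperatures are converted from Celsius to Fahrenheit, while a ratio of temperatures changes with the unit; for a ratio attribute such as length, the ratio is the same in any unit.

```python
def c_to_f(celsius: float) -> float:
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Interval attribute: two equal differences in Celsius...
diff_c_1 = 20.0 - 10.0
diff_c_2 = 40.0 - 30.0
# ...are still equal to each other after converting to Fahrenheit.
diff_f_1 = c_to_f(20.0) - c_to_f(10.0)
diff_f_2 = c_to_f(40.0) - c_to_f(30.0)
equal_in_c = diff_c_1 == diff_c_2
equal_in_f = diff_f_1 == diff_f_2

# The ratio of two temperatures depends on the unit, so "twice as hot"
# is not a meaningful statement on these scales.
ratio_c = 20.0 / 10.0
ratio_f = c_to_f(20.0) / c_to_f(10.0)

# Ratio attribute: length has a true zero, so the ratio is unit-independent.
ratio_m = 4.0 / 2.0        # metres
ratio_cm = 400.0 / 200.0   # the same lengths in centimetres

print(equal_in_c, equal_in_f)  # True True
print(ratio_c, ratio_f)        # 2.0 1.36
print(ratio_m == ratio_cm)     # True
```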
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a
finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a
fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
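One way to hold record data like the table above in code is a list of fixed-field records. The following is a minimal sketch: the class name `TaxRecord`, the field names, and the choice to store income in thousands are assumptions made for illustration; only the values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxRecord:
    """One record: a fixed set of attributes describing one object."""
    tid: int
    refund: bool
    marital_status: str
    taxable_income: int  # in thousands, e.g. 125 means 125K
    cheat: bool


records = [
    TaxRecord(1, True, "Single", 125, False),
    TaxRecord(2, False, "Married", 100, False),
    TaxRecord(3, False, "Single", 70, False),
    TaxRecord(4, True, "Married", 120, False),
    TaxRecord(5, False, "Divorced", 95, True),
    TaxRecord(6, False, "Married", 60, False),
    TaxRecord(7, True, "Divorced", 220, False),
    TaxRecord(8, False, "Single", 85, True),
    TaxRecord(9, False, "Married", 75, False),
    TaxRecord(10, False, "Single", 90, True),
]

# Every record exposes the same attributes, so filtering is uniform.
cheaters = [r.tid for r in records if r.cheat]
total_income = sum(r.taxable_income for r in records)

print(cheaters)      # [5, 8, 10]
print(total_income)  # 1040
```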
Data Matrix
• If data objects have the same fixed set of numeric attributes, then
the data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute
Example: document-term matrix (each row is a document, each column counts
the occurrences of one term)
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
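The idea of treating each document as a vector of term counts can be sketched as follows. The two toy documents, the whitespace tokenization, and the helper name `term_vector` are assumptions for illustration; they are not the documents behind the matrix above.

```python
from collections import Counter

# Hypothetical toy documents (not taken from the original material).
documents = {
    "doc_a": "team play play score game",
    "doc_b": "coach ball coach lost",
}

# One dimension per distinct term, in a fixed (sorted) order.
vocabulary = sorted({word for text in documents.values() for word in text.split()})


def term_vector(text: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary term occurs in `text`."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]


matrix = {name: term_vector(text, vocabulary) for name, text in documents.items()}

print(vocabulary)       # ['ball', 'coach', 'game', 'lost', 'play', 'score', 'team']
print(matrix["doc_a"])  # [0, 0, 1, 0, 2, 1, 1]
print(matrix["doc_b"])  # [1, 2, 0, 1, 0, 0, 0]
```

Because every document is mapped onto the same vocabulary, each row has the same fixed set of numeric attributes and can be treated as a point in a multi-dimensional space.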
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitutes a transaction, while
the individual products that were purchased are the items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
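Since each transaction is a set of items, one natural sketch in Python maps each TID to a set. The helper `support_count` is a hypothetical name introduced here; it simply counts how many transactions contain every item of a given itemset, which is one common question asked of transaction data.

```python
# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def support_count(itemset: set[str], transactions: dict[int, set[str]]) -> int:
    """Count the transactions that contain every item in `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


print(support_count({"Milk"}, transactions))            # 4
print(support_count({"Diaper", "Milk"}, transactions))  # 3
print(support_count({"Beer", "Coke"}, transactions))    # 1
```

Sets fit here because the order of items within a transaction carries no meaning and, unlike a record, a transaction has no fixed number of fields.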
Graph Data
• Examples: Facebook graph and HTML Links
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data
Noise
• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone
connection, and “snow” on a television screen
Duplicate Data
• Examples:
– Same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
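A minimal sketch of one data cleaning step for the duplicate case above, a single person appearing with multiple email addresses. The contact list, the helper name `merge_duplicates`, and the rule of matching people on a case- and whitespace-normalized name are all assumptions for illustration; real duplicate detection usually needs more careful matching, since different people can share a name.

```python
from collections import defaultdict

# Hypothetical contact records: (name, email).
contacts = [
    ("Ann Lee", "ann.lee@example.com"),
    ("ann  lee ", "alee@example.org"),
    ("Bob Ray", "bob@example.com"),
]


def merge_duplicates(contacts: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Group email addresses under a normalized version of each name."""
    merged: dict[str, set[str]] = defaultdict(set)
    for name, email in contacts:
        # Normalize case and collapse repeated or trailing whitespace.
        key = " ".join(name.lower().split())
        merged[key].add(email.lower())
    return dict(merged)


people = merge_duplicates(contacts)

print(len(people))                # 2
print(sorted(people["ann lee"]))  # ['alee@example.org', 'ann.lee@example.com']
```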
Difference between Data Analytics and Data Mining
A key difference between data analytics and data mining is that data
mining does not require a preconceived hypothesis before tackling the
data; it explores the data and compiles it into useful formats. Data
analytics, by contrast, starts from a hypothesis to test, because it is
looking for answers to particular questions.
Data mining can be undertaken by a single specialist with strong
technical skills. With the right software, the specialist can collect and
prepare the data for further analysis. At this stage, a larger team simply
isn’t required. The specialist then usually reports the findings to the
client, leaving the next steps in someone else’s hands.