Lecture on Data

Source: Books by Tan, Steinbach & Kumar; Han, Kamber & Pei; Evans; Dinesh Kumar; plus experiential knowledge
Big Data

• Big data refers to a high volume of data, generated at high velocity, that contains a large variety of data types.
• Much of big data is available in real time, and much of it is uncertain or unpredictable.
• According to a Gartner report, data is classified as big data when:
– Volume: exabytes (10^18 bytes)
– Velocity: sub-second
– Variety: 25+ formats
– Veracity: accuracy of the data
Big Data

“The effective use of big data has the potential to transform economies, delivering a new wave of productivity growth and consumer surplus. Using big data will become a key basis of competition for existing companies, and will create new competitors who are able to attract employees that have the critical skills for a big data world.”
- McKinsey Global Institute, 2011
Sources of Big Data

• Transactional data generated at high speed (mobile services, banking and financial services, healthcare, entertainment, etc.).
• Machine-generated structured data (electricity and water meters, sensors installed in various systems).
• Machine-generated unstructured data (videos, satellite images, etc.). Example: advance estimates of agricultural production, conventional versus satellite-based.
• Social media data.
Let us understand Data
What is a Data Set?

• A data set is a collection of data objects. A data object is also known as a record, point, vector, pattern, event, case, sample, observation, entity, or instance.
• A data object is characterized by a number of attributes.
• An attribute is a property or characteristic of an object.
– Examples: eye color of a person, temperature, etc.
– An attribute is also known as a variable, field, characteristic, or feature.

Example (rows are objects, columns are attributes):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values

• Attribute values are numbers or symbols assigned to an attribute.
• Distinction between attributes and attribute values:
– The same attribute can be mapped to different attribute values.
  • Example: height can be measured in feet or meters.
– Different attributes can be mapped to the same set of values.
  • Example: attribute values for ID and age are both integers.
  • But the properties of the attribute values can differ:
– ID has no limit, but age has a maximum and a minimum value.
– IDs cannot be averaged. IDs are used only for distinguishing employees.
The data analysis technique should be consistent with the type of attribute.
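As a rough illustration of matching the technique to the attribute type, the sketch below uses the ten records from the Tid/Refund/Marital Status/Taxable Income/Cheat table. The variable names are illustrative, and the choice of statistics (frequency counts for a nominal attribute, mean and median for a ratio attribute) is one reasonable reading of the slide, not the only one.

```python
from collections import Counter
from statistics import mean, median

# Records from the example table: (Tid, Refund, Marital Status, Taxable Income in K, Cheat)
records = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Marital Status is nominal: counting occurrences is meaningful, averaging is not.
marital_counts = Counter(r[2] for r in records)

# Taxable Income is a ratio attribute: mean and median are meaningful.
incomes = [r[3] for r in records]
mean_income = mean(incomes)
median_income = median(incomes)

# Tid is only an identifier: its mean can be computed, but it describes nothing.
meaningless_tid_mean = mean(r[0] for r in records)

print(marital_counts, mean_income, median_income)
```

The last line of the computation is included on purpose: the arithmetic on Tid runs without error, which is exactly why the analyst, not the software, has to respect the attribute type.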
Types of Attributes (S. Smith Stevens – Psychologist)
• There are different types of attributes:
– Nominal
  • Examples: ID numbers, eye color, zip codes
– Ordinal
  • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
  • Ordered values with constant differences between successive values and an arbitrary zero point. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
  • Values with a natural zero, so that ratios are meaningful. Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values

• The type of an attribute depends on which of the following properties of numbers it possesses:
– Distinctness: = ≠
– Order: < >
– Addition: + -
– Multiplication: * /

– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
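The cumulative structure of these four properties can be written down directly. The following is a minimal sketch; the names `PROPERTIES`, `LEVELS`, `properties_of`, and `supports` are hypothetical helpers introduced only for illustration.

```python
# Properties in the order the slide lists them; each attribute type adds one more.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]
LEVELS = ["nominal", "ordinal", "interval", "ratio"]


def properties_of(level):
    """Return the properties that an attribute of the given type possesses."""
    return PROPERTIES[: LEVELS.index(level) + 1]


def supports(level, prop):
    """True if an attribute of this type possesses the given property."""
    return prop in properties_of(level)


for level in LEVELS:
    print(level, properties_of(level))
```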
Attribute Type: Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Operations: mode, entropy, contingency correlation, χ² test

Attribute Type: Ordinal
Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
Examples: hardness of minerals, {good, better, best}, grades, street numbers
Operations: median, percentiles, rank correlation, run tests, sign tests

Attribute Type: Interval
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
Examples: calendar dates, temperature in Celsius or Fahrenheit
Operations: mean, standard deviation, Pearson's correlation, t and F tests

Attribute Type: Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Operations: geometric mean, harmonic mean, percent variation
Example: Classifying Data Elements

[Figure: a sample data table whose columns are labelled by attribute type (Categorical, Ordinal, Interval, Ratio); the table contents are not recoverable.]
Properties of Attribute Values (Cont)

• Nominal and ordinal variables are collectively referred to as categorical or qualitative variables
- should be treated like symbols, even if represented by integers
• Interval and ratio variables are collectively referred to as numeric or quantitative variables
- can be integer valued or continuous

Types of attributes can also be described in terms of transformations that do not change the meaning of an attribute.
Permissible Transformations for Attribute Types

Attribute Level: Nominal
Transformation: Any permutation of values
Comments: If all employee ID numbers were reassigned, would it make any difference?

Attribute Level: Ordinal
Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Attribute Level: Interval
Transformation: new_value = a * old_value + b, where a and b are constants
Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Attribute Level: Ratio
Transformation: new_value = a * old_value
Comments: Length can be measured in meters or feet.
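The effect of these transformations can be checked numerically. The sketch below is illustrative only: the temperature and length values are made up, the function names are hypothetical, and 3.28084 is an approximate feet-per-meter factor. It shows that an interval transformation preserves ratios of differences but not ratios of values, that a ratio transformation preserves ratios of values, and that two monotonic encodings of an ordinal scale give the same ordering.

```python
def celsius_to_fahrenheit(c):
    """Interval-level transformation: new_value = a * old_value + b."""
    return 9 / 5 * c + 32


def meters_to_feet(m):
    """Ratio-level transformation: new_value = a * old_value (approximate factor)."""
    return 3.28084 * m


def difference_ratio(values):
    """Ratio of two successive differences in a three-element list."""
    return (values[2] - values[1]) / (values[1] - values[0])


# Interval scale: hypothetical temperatures.
temps_c = [10.0, 20.0, 40.0]
temps_f = [celsius_to_fahrenheit(c) for c in temps_c]
diff_ratio_c = difference_ratio(temps_c)   # ratios of differences survive
diff_ratio_f = difference_ratio(temps_f)
value_ratio_c = temps_c[1] / temps_c[0]    # ratios of values do not
value_ratio_f = temps_f[1] / temps_f[0]

# Ratio scale: hypothetical lengths.
lengths_m = [2.0, 4.0]
lengths_ft = [meters_to_feet(x) for x in lengths_m]
length_ratio_m = lengths_m[1] / lengths_m[0]
length_ratio_ft = lengths_ft[1] / lengths_ft[0]

# Ordinal scale: the two encodings from the table give the same ordering.
labels = ["best", "good", "better"]
scale_a = {"good": 1, "better": 2, "best": 3}
scale_b = {"good": 0.5, "better": 1, "best": 10}
order_a = sorted(labels, key=scale_a.get)
order_b = sorted(labels, key=scale_b.get)

print(diff_ratio_c, diff_ratio_f, value_ratio_c, value_ratio_f)
print(length_ratio_m, length_ratio_ft, order_a, order_b)
```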
Describing Attributes by Number of Values - Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes and assume only 2 values

• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite number of digits (limited precision).
– Continuous attributes are typically represented as floating-point variables.
Typically, nominal and ordinal attributes are discrete, while interval and ratio attributes are continuous. (However, count attributes, which are discrete, are also ratio attributes.)
Asymmetric Attributes

For asymmetric attributes, only presence (a non-zero attribute value) is regarded as important.
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data

• Graph
– World Wide Web
– Molecular Structures

• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
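Important Characteristics of Structured Data

– Dimensionality
  • The number of attributes that the objects in the data set possess.
  • High-dimensional data suffers from the curse of dimensionality, which motivates dimensionality reduction.
– Sparsity
  • For some data sets, such as those with asymmetric features, most attributes of an object have the value 0. This is an advantage in practice, because only the non-zero values need to be stored and processed.
– Resolution
  • Properties of data differ at different resolutions (e.g., the Earth's surface looks uneven at a fine resolution and smooth at a coarse one).
  • Patterns depend on the scale of resolution (e.g., atmospheric pressure measurements reveal the movement of storms only at a sufficiently fine time scale).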
Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Transaction Data (Market Basket Data)

• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
This is an example of a sparse data matrix: most items are absent from any given transaction.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Transaction Data

• Can be viewed as a set of records whose fields are asymmetric attributes. Most often, the attributes are binary (indicating whether or not an item was purchased). They can also be discrete or continuous, e.g., the number of items purchased or the amount spent on those items.
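To make the "records with asymmetric binary attributes" view concrete, the sketch below converts the five market-basket transactions from the TID/Items table into a binary matrix, and then into a sparse form that stores only the non-zero positions. The variable names are illustrative.

```python
# The five transactions from the TID/Items table.
transactions = {
    1: ["Bread", "Coke", "Milk"],
    2: ["Beer", "Bread"],
    3: ["Beer", "Coke", "Diaper", "Milk"],
    4: ["Beer", "Bread", "Diaper", "Milk"],
    5: ["Coke", "Diaper", "Milk"],
}

# One column (attribute) per distinct item, in alphabetical order.
items = sorted({item for basket in transactions.values() for item in basket})

# Binary record per transaction: 1 if the item was purchased, 0 otherwise.
binary_matrix = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

# Sparse view: keep only the column indices of the non-zero (present) entries.
sparse_rows = {
    tid: [j for j, value in enumerate(row) if value]
    for tid, row in binary_matrix.items()
}

print(items)
for tid, row in binary_matrix.items():
    print(tid, row, sparse_rows[tid])
```

With only five items the saving is small, but in a store with thousands of products each basket still has only a handful of non-zero entries, which is where the sparse form pays off.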
Data Matrix or Pattern Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points (vectors) in a multi-dimensional space, where each dimension represents a distinct attribute describing the object.
• Such a data set can be represented by an (m x n) matrix, where there are m rows, one for each object, and n columns, one for each attribute.

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
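Because each object is a point in n-dimensional space, geometric operations apply directly. The following sketch stores the two rows of the table as a 2 x 5 matrix and computes the Euclidean distance between the two objects. Euclidean distance is just one possible choice here, and in practice the attributes would usually be rescaled first.

```python
import math

# Column names and rows taken from the slide's table.
columns = [
    "Projection of x load",
    "Projection of y load",
    "Distance",
    "Load",
    "Thickness",
]
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)       # number of objects (rows)
n = len(data_matrix[0])    # number of attributes (columns)

# Each row is a point in n-dimensional space, so distances are well defined.
distance = math.dist(data_matrix[0], data_matrix[1])

print(m, n, round(distance, 3))
```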
Document Data

• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document. (This is also a sparse data matrix.)

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
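A term vector of this kind can be built with a simple word count. In the sketch below the vocabulary is the ten terms from the slide, while the sample sentence and the helper `term_vector` are hypothetical. Tokenization is a plain whitespace split, which real text processing would refine.

```python
from collections import Counter

# The ten terms used as attributes in the slide's document-term table.
vocabulary = [
    "team", "coach", "play", "ball", "score",
    "game", "win", "lost", "timeout", "season",
]


def term_vector(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]


# Hypothetical document, used only to exercise the function.
sample = "Team play play score game game team lost"
vector = term_vector(sample, vocabulary)

print(dict(zip(vocabulary, vector)))
```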
Graph Data

Data objects can be represented as graphs:
- when it is useful to capture relationships among data objects.
- when objects have sub-objects.
Graph Data

• Examples: generic graph and HTML links

[Figure: generic graph with numeric labels 2, 5, 1, 2, 5]

<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
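One way to turn linked web pages into graph data is to treat each page as a node and each hyperlink as a directed edge. The sketch below uses Python's standard `html.parser` module on the link fragment from the slide. The page name `index.html` and the class name `LinkCollector` are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical name for the page that contains the links.
page = "index.html"

# The HTML fragment shown on the slide.
snippet = """
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
"""

collector = LinkCollector()
collector.feed(snippet)

# Each link becomes a directed edge from the containing page to its target.
edges = [(page, href) for href in collector.links]

for edge in edges:
    print(edge)
```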
Chemical Data

• Benzene molecule: C6H6
Ordered Data
(Sequential Data, Sequence Data, Time Series Data, Spatial Data)
Sequential Data
• A sequence of transactions (also called temporal data): each transaction has a time associated with it, which helps in finding patterns over time.

Time  Customer  Items Purchased
t1    C1        A, B
t2    C3        A, C
t2    C1        C, D
t3    C2        A, D
t4    C2        E
t5    C1        A, E

Customer  Time and Items Purchased
C1        (t1: A,B), (t2: C,D), (t5: A,E)
C2        (t3: A,D), (t4: E)
C3        (t2: A,C)
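The second table is just the first one regrouped by customer. A minimal sketch of that regrouping follows. It assumes that time labels such as `t1` and `t5` sort by their numeric suffix, and the variable names are illustrative.

```python
from collections import defaultdict

# Rows of the Time / Customer / Items Purchased table.
transactions = [
    ("t1", "C1", ["A", "B"]),
    ("t2", "C3", ["A", "C"]),
    ("t2", "C1", ["C", "D"]),
    ("t3", "C2", ["A", "D"]),
    ("t4", "C2", ["E"]),
    ("t5", "C1", ["A", "E"]),
]

# Group the time-stamped item sets by customer.
by_customer = defaultdict(list)
for time, customer, items in transactions:
    by_customer[customer].append((time, items))

# Order each customer's events by time (assumes labels of the form t<number>).
for events in by_customer.values():
    events.sort(key=lambda event: int(event[0][1:]))

for customer in sorted(by_customer):
    print(customer, by_customer[customer])
```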
Sequence Data
• Data consisting of a sequence of words or letters. Similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence.
• Sample genomic (genetic) sequence data, expressed using the 4 nucleotides from which DNA is constructed (A, T, G, C).
Problem: predicting similarities in the structure and function of genes from similarities in their nucleotide sequences.

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
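As a very rough illustration of comparing nucleotide sequences, the sketch below counts nucleotides and computes a position-by-position identity score for the first two rows of the sample above. This is a deliberate simplification: real sequence comparison relies on alignment methods, and the function names here are hypothetical.

```python
from collections import Counter

# First two rows of the sample genomic sequence shown on the slide.
seq1 = "GGTTCCGCCTTCAGCCCCGCGCC"
seq2 = "CGCAGGGCCCGCCCCGCGCCGTC"


def nucleotide_counts(seq):
    """Count occurrences of each nucleotide letter in the sequence."""
    return Counter(seq)


def positional_identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


counts1 = nucleotide_counts(seq1)
identity = positional_identity(seq1, seq2)

print(counts1)
print(identity)
```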
Time Series Data
• A special type of sequential data: a series of measurements taken over time. Examples:
– Average monthly temperature of Delhi from 1980 to 2017
– Daily prices of various stocks

Note: When working with temporal data, always analyze temporal auto-correlation.
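Temporal auto-correlation measures how strongly a series resembles a shifted copy of itself. The sketch below implements one common sample estimator of the lag-k autocorrelation; the two short series are made-up data, not values from the lecture.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    if variance_sum == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    covariance_sum = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum


# Hypothetical series: a steady trend and a strictly alternating signal.
trend = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
alternating = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]

r_trend = autocorrelation(trend, lag=1)
r_alternating = autocorrelation(alternating, lag=1)

print(r_trend, r_alternating)
```

A value near +1 means neighbouring measurements tend to be similar, as in the trend series. A value near -1 means they tend to swing in opposite directions, as in the alternating series.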
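Ordered Data
• Spatio-temporal data, e.g., weather data for various locations (precipitation, temperature, pressure, rainfall).

[Figure: average monthly temperature of land and ocean, plotted by latitude and longitude]

Note: spatial data usually shows spatial auto-correlation, i.e., nearby locations tend to have similar values.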
Handling Non Record Data
Most data analysis (or data mining) algorithms are designed for record data or its variations (e.g., transaction data and data matrices).

Record-oriented techniques can be applied to non-record data by extracting features from the data objects and using these features to create a record corresponding to each object (e.g., an image can be converted to a binary matrix).
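The image example can be sketched as follows. The pixel values and the threshold of 128 are assumptions chosen for illustration, and thresholding is only one of many possible feature extraction steps.

```python
# Hypothetical 3 x 3 grayscale image with pixel intensities from 0 to 255.
image = [
    [12, 200, 180],
    [30, 40, 220],
    [250, 10, 5],
]

THRESHOLD = 128  # assumed cut-off between "dark" (0) and "bright" (1)

# Feature extraction: convert each pixel to a binary value.
binary_matrix = [
    [1 if pixel >= THRESHOLD else 0 for pixel in row] for row in image
]

# Flatten the binary matrix into a single record with one attribute per pixel.
record = [value for row in binary_matrix for value in row]

print(binary_matrix)
print(record)
```

Once every image has been reduced to a record of the same length, the collection of images forms an ordinary data matrix to which record-oriented techniques apply.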
