Data Mining

Lecture 2: Data
Variety of Data
Multiple types of data: tables, time series, images, graphs, text, etc.
Spatial and temporal aspects
Interconnected, complex data:
location of the user
friendship information
check-ins to venues
opinions through Twitter
images through cameras
queries to search engines
Example- Transaction Data

Billions of real-life customers
Walmart: 20M transactions per day
AT&T: 300M calls per day
Credit card companies: billions of transactions per day
Example- Document Data
Web as a document repository: an estimated 50 billion web pages
Wikipedia: 4 million articles (and counting)
Online news portals: a steady stream of hundreds of new articles every day
Twitter: ~300 million tweets every day


Example- Network Data
Web: 50 billion pages linked via hyperlinks

Facebook: 500 million users

Twitter: 300 million users

Instant messenger: ~1 billion users

Blogs: 250 million blogs worldwide


Example- Behavioral Data
Mobile phones today record a large amount of information about user behavior:
GPS records position
Camera produces images
Communication via phone and SMS
Text via Facebook updates
Association with entities via check-ins
Amazon records every item you browsed, placed in your basket, read reviews of, or purchased.
Google and Bing record all your browsing activity via toolbar plugins. They also record the queries you asked, the pages you saw, and the clicks you made.
Data is collected for millions of users on a daily basis.
What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic, or feature
A collection of attributes describes an object
An object is also known as a record, point, case, sample, entity, or instance

Example (columns are attributes, rows are objects):

ID  Refund  Marital Status  Taxable Income  Cheat
1   Yes     Single          125K            No
2   No      Married         100K            No
3   No      Single          70K             No
4   Yes     Married         120K            No
5   No      Divorced        95K             Yes
6   No      Married         60K             No
7   Yes     Divorced        220K            No
8   No      Single          85K             Yes
9   No      Married         75K             No
10  No      Single          90K             Yes
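As a minimal sketch of the object/attribute view, the example table can be held as a list of records, one dictionary per object, keyed by attribute name. The identifiers below (`ATTRIBUTES`, `records`, `column`) are illustrative assumptions, not names from the lecture; income is stored in thousands.

```python
# Attribute names for the example table (illustrative identifiers).
ATTRIBUTES = ["id", "refund", "marital_status", "taxable_income_k", "cheat"]

# One tuple per data object, copied from the table; income is in thousands.
ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Each object (record) maps attribute name -> value.
records = [dict(zip(ATTRIBUTES, row)) for row in ROWS]


def column(records, attribute):
    """Return the values of one attribute across all objects."""
    return [record[attribute] for record in records]


if __name__ == "__main__":
    print(len(records), "objects,", len(ATTRIBUTES), "attributes")
    print(column(records, "marital_status"))
```

Reading a row gives one object; reading a column (via `column`) gives one attribute across all objects.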
Types of Attributes
Categorical
Eye color
Words
Ranking {good, fair, bad}
Height {tall, medium, short}
Gender
Model of a car
Types of Attributes
Numeric/Quantitative
Date
Time
Temperature
Weight
Length
Value
Types of Attributes

Zip code?

Phone Number?

Bank Accounts?
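One way to approach these questions is to ask which operations give meaningful results. The sketch below uses made-up zip codes (an assumption for illustration) and suggests a common reading: such identifiers are written with digits but behave like categorical attributes, since counting and equality make sense while arithmetic does not.

```python
from collections import Counter

# Hypothetical zip codes, stored as strings so that leading zeros survive.
zip_codes = ["02139", "10001", "02139", "94105"]

# Meaningful for a categorical attribute: equality tests and frequency counts.
counts = Counter(zip_codes)
most_common_zip, frequency = counts.most_common(1)[0]

# Computable but not meaningful: the arithmetic mean of the digits-as-numbers
# does not describe any property of the data.
numeric_mean = sum(int(z) for z in zip_codes) / len(zip_codes)

if __name__ == "__main__":
    print("most common:", most_common_zip, "seen", frequency, "times")
    print("numeric mean (not a useful quantity):", numeric_mean)
```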
Types of Attributes
Discrete (a finite or countably infinite set of values)
Number of children in a household
Number of languages a person speaks
Number of people sleeping in this class
Continuous (real-valued; any value within a range)
Height of children
Weight of cars
Time to wake up in the morning
Speed of the train
Types of Attributes

Nominal
Categorically discrete data
Not comparable
Name of school
Type of the car

Ordinal
Quantities with natural ordering
Rating scale of the runners finishing a race
9 vs 10 and 6 vs 7: the order is meaningful, but the two gaps need not be equal
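The nominal/ordinal distinction can be sketched in code. This is an illustrative example only: it reuses the ranking {good, fair, bad} from the categorical slide, while the car types and the helper names (`rank`, `is_better`) are assumptions made for the example.

```python
# Ordinal attribute: categories carry an order, listed here from worst to best.
RANKING_ORDER = ["bad", "fair", "good"]


def rank(value):
    """Position of an ordinal value in its ordering (0 is the lowest)."""
    return RANKING_ORDER.index(value)


def is_better(a, b):
    """True if ordinal value a ranks strictly above b."""
    return rank(a) > rank(b)


# Sorting is meaningful for ordinal values.
ratings = ["fair", "good", "bad", "good"]
sorted_ratings = sorted(ratings, key=rank)

# Nominal attribute: only equality is meaningful, there is no order to sort by.
car_types = ["sedan", "truck", "sedan"]  # hypothetical values
same_type = car_types[0] == car_types[2]

if __name__ == "__main__":
    print(sorted_ratings)
    print(is_better("good", "fair"), same_type)
```

Note that `rank` returns positions, not magnitudes: the step from "bad" to "fair" is not claimed to equal the step from "fair" to "good".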
Types of Attributes
Interval
Ordinal but equally split
Distance is meaningful
28°C and 29°C vs 34°C and 35°C
Survey options

Ratio
Interval with a meaningful zero
80°C is not twice as hot as 40°C (Celsius has no meaningful zero, so it is interval, not ratio)
Most count numbers
The number of clients in the past six months
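The temperature examples can be checked with a few lines of arithmetic. This is a small sketch; the function name `celsius_to_kelvin` is an assumption for illustration. Differences in Celsius are meaningful, but ratios only become meaningful on a scale with an absolute zero, such as Kelvin.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin, a scale whose zero is absolute."""
    return celsius + 273.15


# Interval property: equal differences mean the same thing anywhere on the scale.
diff_low = 29 - 28
diff_high = 35 - 34

# Naive ratio on the Celsius scale suggests "twice as hot", which is misleading.
naive_ratio = 80 / 40

# The ratio on the Kelvin scale is physically meaningful and much smaller.
kelvin_ratio = celsius_to_kelvin(80) / celsius_to_kelvin(40)

if __name__ == "__main__":
    print("differences:", diff_low, diff_high)
    print("Celsius ratio:", naive_ratio)
    print("Kelvin ratio:", round(kelvin_ratio, 3))
```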
Ratio
Absolute zero

Interval
Distance is meaningful

Ordinal
Attributes can be ordered

Nominal
Attributes are only names; weakest
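The hierarchy above is often summarized by the operations each level supports, with every level inheriting the operations of the weaker levels below it. The sketch below encodes that common summary; the operation labels and the function `allowed_operations` are illustrative assumptions rather than definitions from the lecture.

```python
# Levels from weakest to strongest, following the hierarchy above.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# The operation each level adds on top of the weaker levels (assumed labels).
NEW_OPERATIONS = {
    "nominal": {"equality"},     # attributes are only names
    "ordinal": {"ordering"},     # attributes can be ordered
    "interval": {"difference"},  # distance is meaningful
    "ratio": {"ratio"},          # absolute zero makes ratios meaningful
}


def allowed_operations(level):
    """Return all operations meaningful at this level, including inherited ones."""
    index = LEVELS.index(level)
    operations = set()
    for name in LEVELS[: index + 1]:
        operations |= NEW_OPERATIONS[name]
    return operations


if __name__ == "__main__":
    for level in LEVELS:
        print(level, sorted(allowed_operations(level)))
```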