
SUSHIL KULKARNI

JAI-HIND COLLEGE
sushiltry@yahoo.co.in
Social Networks : Example
Technology used
What is Data Mining?
DM Process & Example
DM Queries
DM Tasks and Methods
Relation & Data Warehouse
What is ETL ?
Data Preprocessing
What is a Network?
[Figure: a network drawn as nodes joined by links]

Web Definition : A set of nodes, points, or locations
connected by means of data, voice, and video
communications for the purpose of exchange.
Social Networks

 A social network is
a social structure of
people, related
(directly or indirectly)
to each other through
a common relation or
interest
Social Network Analysis
 Social network analysis [SNA] is the mapping and measuring
of relationships and flows between people, groups,
organizations, computers or other information/knowledge
processing entities.

 The nodes in the network are the people and groups while the
links show relationships or flows between the nodes.
A shift in approach: from ‘synthesis’ to
‘analysis’
Problems
• High cost of manual surveys
• Survey bias
- Perceptions of individuals may be incorrect
• Logistics
- Organizations are now spread across several countries.

[Figure: Synthesis: employee surveys give a cognitive network for each of A, B and C, which are combined into the social network. Shift in approach: electronic communication (email analysis, web logs) gives the social network and the cognitive network.]
Technology
Various technologies that help in creating
Social Networks are:

 Email
 Blogs
 Social networking software such as Orkut,
Facebook, Flickr, etc.
SOCIAL NETWORK:
Profile & Platforms

USENET
SOCIAL NETWORK:
Profile & Platforms

Social Community
SOCIAL NETWORK: Growth
SOCIAL NETWORK : Growth Rate
Technology :
 What is Your Network?
- When your connections invite their connections, your
Network starts to grow.
- Your Network is your connections, their connections, and
so on out from you at the center.

 How do you classify users?


- Your Network contains professionals out to “three degrees”,
that is, friends-of-friends-of-friends. If each person had 10
connections (and some have many more), then the third
degree alone would contain 10 × 10 × 10 = 1,000 professionals.

 How do you see who is in your Network?


Facebook lets you see your network as one large group of
searchable professional profiles.
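
The "three degrees" arithmetic can be checked with a short sketch. It assumes an idealised network in which every person contributes the same number of new connections and nobody is counted twice; real networks overlap, so actual counts are smaller. The function name `network_size` is hypothetical.

```python
def network_size(connections_per_person: int, degrees: int) -> list[int]:
    """Return how many people are first reached at each degree of separation.

    Assumption: everyone contributes the same number of NEW connections
    and no two people share a connection (no overlap).
    """
    return [connections_per_person ** d for d in range(1, degrees + 1)]


per_degree = network_size(10, 3)
print(per_degree)       # people at degree 1, 2 and 3
print(sum(per_degree))  # everyone within three degrees
```

With 10 connections per person this gives 10, 100 and 1,000 people at the first, second and third degree.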
SOCIAL NETWORK: Visualization
[Figure: ME at the centre, linked to several FRIEND nodes]
ON ANY SOCIAL NETWORK
Both YOU and each FRIEND have a profile listing:
Name
Gender
Age
Birth date/Home town
School attended
Interests/Hobbies
Photos
Friends
Activities
Audio clips
Video clips
ON ANY SOCIAL NETWORK
After making a friend, I can access his/her friends,
audio clips and video clips, and share information.
A friend may be from any remote site.
The YOU and FRIEND profiles list the same fields as before.
SOCIAL NETWORK : Growth Rate
SOCIAL NETWORK : Visualization
Between friends: How many of them ?
Male vs. Female Young vs. Old

Thin vs. Fat


SOCIAL NETWORK : Visualization
Between friends: Relationships

Thick Friends Just Friends


SOCIAL NETWORK : Visualization
Between friends: Likes

Coffee Chocolate

Friends Friends

HOW MANY OF MADHURI DIXIT’S FRIENDS LIKE ?
HOW MANY OF PRASHANT DAMLE’S FRIENDS LIKE ?
FRIENDS OF A FRIEND OF A FRIEND
SHOULD KNOW
 How many friends use a social network
regularly?
 How many friends send messages
frequently?
 What is the mood of your friend list?
 How many friends are vegetarian?
 How many friends are close to or far from
you?
 How many friends studied or are studying in
your school?
FRIENDS OF A FRIEND OF A FRIEND
SHOULD KNOW

INTERESTING PATTERNS
FROM UNKNOWN DATA
DEFINE DATA MINING
Data Mining is:

The analysis of (often large) observational
data sets to find unsuspected
relationships and to summarize the data
in novel ways that are both
understandable and useful to the data
owner.
THUS : DATA MINING
 Methods for exploring and modeling
relationships in large amounts of data

 Finding hidden information in a database

 Fit data to a model


Data Mining Process
 Understand the Domain
- Understand the particulars of the business
or scientific problem
 Create a Data set
- Understand structure, size, and format
of data
- Select the interesting attributes
- Data cleaning and preprocessing
Data Mining Process
 Choose the data mining task and the
specific algorithm
- Understand capabilities and limitations of
algorithms that may be relevant to the
problem

 Interpret the results, and possibly return to
step 2 (Create a Data set)
EXAMPLE
 Understand social networks.

 Grow connections.

 Choose appropriate built-in methods to
find hidden information.
Example : E-mail Communication
 A sends an e-mail to B
 With Cc to C
 And Bcc to D
 C forwards this e-mail to E
[Figure: A linked to B, C and D; C linked to E]

 From analyzing the header, we can infer


 A and D know that A, B, C and D know about this e-mail
 B and C know that A, B and C know about this e-mail
 C also knows that E knows about this e-mail
 D also knows that B and C do not know that it knows about
this e-mail; and that A knows this fact
 E knows that A, B and C exchanged this e-mail; and that
neither A nor B know that it knows about it
 and so on and so forth …
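
The header inferences above can be sketched in a few lines. This is an illustration only: it assumes that To and Cc addresses are visible to every recipient, that a Bcc address is visible only to the sender and to that Bcc recipient, and that a forwarded message carries the header the forwarder received. The function name `visible_parties` is hypothetical.

```python
def visible_parties(sender, to, cc, bcc):
    """Map each party to the set of parties it can see in the header.

    Assumption: To and Cc are visible to all; a Bcc recipient is visible
    only to the sender and to that Bcc recipient.
    """
    public = {sender, *to, *cc}
    view = {sender: public | set(bcc)}
    for person in [*to, *cc]:
        view[person] = set(public)
    for person in bcc:
        view[person] = public | {person}
    return view


view = visible_parties("A", to=["B"], cc=["C"], bcc=["D"])
# C forwards the message to E, so E sees the header that C received.
view["E"] = view["C"] | {"E"}

for person in sorted(view):
    print(person, "sees", sorted(view[person]))
```

B and C never see D, while A and D see all four parties, which matches the first two inferences on the slide.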
DB VS DM PROCESSING

Database                       Data Mining
• Query                        • Query
  – Well defined                 – Poorly defined
  – SQL                          – No precise query language
• Data                         • Data
  – Operational data             – Not operational data
• Output                       • Output
  – Precise                      – Fuzzy
  – Subset of database           – Not a subset of database
QUERY EXAMPLES
Database
– Find all credit applicants with first name of Sane.
– Identify customers who have purchased
more than Rs.10,000 in the last month.
– Find all customers who have purchased milk

Data Mining
– Find all credit applicants who are poor
credit risks. (classification)
– Identify customers with similar buying
habits. (Clustering)
– Find all items which are frequently
purchased with milk. (association rules)
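
The contrast between the two kinds of query can be sketched as follows. The baskets are made-up data for illustration, and the "confidence" computed here is one common way to score an association rule, not necessarily the measure the slides have in mind.

```python
from collections import Counter

# Hypothetical transactions (made up for illustration): one basket per visit.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"tea", "sugar"},
    {"milk", "sugar", "bread"},
]

# Database-style query: an exact, well-defined selection (a subset of the data).
milk_baskets = [b for b in baskets if "milk" in b]

# Data-mining-style question: which items tend to be bought together with milk?
co_counts = Counter(item for b in milk_baskets for item in b if item != "milk")

# Confidence of the rule milk -> item = (baskets with both) / (baskets with milk)
confidence = {item: n / len(milk_baskets) for item, n in co_counts.items()}

print(len(milk_baskets))
print(sorted(confidence.items(), key=lambda pair: -pair[1]))
```

The first result is a subset of the stored rows; the second is a derived pattern that is not itself stored anywhere in the data.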
ARE ALL THE ‘DISCOVERED’
PATTERNS INTERESTING?
 Interestingness measures:

A pattern is interesting if it is easily
understood by humans, valid on new or
test data with some degree of certainty,
potentially useful, novel, or validates
some hypothesis that a user seeks to
confirm
DATA MINING DEVELOPMENT
 Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques
 Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines
 Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis
 Algorithm Design Techniques, Algorithm Analysis, Data Structures
 Neural Networks, Decision Tree Algorithms
RELATION (r)
 D1, D2, ……, Dn are domains

 Relation r is a subset of the Cartesian
product D1 × D2 × …… × Dn

r ⊆ D1 × D2 × …… × Dn
EXAMPLE : r
D1 = {Ram, Shyam}, D2 = {24, 34}

D1 × D2 = { (Ram, 24), (Ram, 34),
(Shyam, 24), (Shyam, 34) }

r is a subset of D1 × D2

r = { (Ram, 24), (Shyam, 34) }
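
The same example can be checked mechanically. This is a minimal sketch using Python sets and `itertools.product`; the variable names are chosen here for illustration.

```python
from itertools import product

D1 = {"Ram", "Shyam"}
D2 = {24, 34}

# Cartesian product D1 x D2: every (name, age) pair.
cartesian = set(product(D1, D2))

# A relation is any subset of the Cartesian product.
r = {("Ram", 24), ("Shyam", 34)}

print(len(cartesian))   # number of pairs in D1 x D2
print(r <= cartesian)   # True when r is a subset of D1 x D2
```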


RELATION is TABLE

[Figure: table Employee with a NAME column and a row for Ram]
TUPLES OR ROWS : t
 Each element of a relation is a tuple, or row

 Notation :
t = < a(1), a(2), a(3), …, a(n) >,
where a(i) ∈ A(i) for i = 1, …, n
 Example: t = < Ram, 24 >
RELATION (r)

R      A1     A2     A3     ……    Ak     ……    An

       a11    a21    a31    ……    ak1    ……    an1
       a12    a22    a32    ……    ak2    ……    an2
       …..    …..    …..    ……    …..    ……    …..
t      a1i    a2i    a3i    ……    aki    ……    ani
       …..    …..    …..    ……    …..    ……    …..
       a1m    a2m    a3m    ……    akm    ……    anm

aki is the value of the k th attribute of the i th tuple t of R
WHAT IS
DATA WAREHOUSE ?
Subject-oriented:
customers, patients, students,
products, time.

Integrated: Gathered CENTRALLY from

1. several internal systems of records
2. sources external to the organization
WHAT IS
DATA WAREHOUSE ?

 Time-variant:

Used to study trends and changes.

 Non-updatable:

Cannot be updated by end users.


BIG PICTURE
The ETL Process
 Capture

 Scrub or data cleansing

 Transform

 Load and Index

ETL = Extract, Transform, and Load


Steps in data reconciliation

Capture = extract… obtaining a snapshot of a
chosen subset of the source data for loading
into the data warehouse

Static extract = capturing a snapshot of the source data at a
point in time

Incremental extract = capturing changes that have occurred
since the last static extract
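
The two kinds of extract can be sketched as below. The source rows and the `modified` timestamp column are assumptions made for illustration; how changes are detected in a real source system depends on that system.

```python
from datetime import datetime

# Hypothetical source rows; the "modified" column is an assumption.
source = [
    {"id": 1, "name": "Ram", "modified": datetime(2024, 1, 5)},
    {"id": 2, "name": "Shyam", "modified": datetime(2024, 2, 1)},
    {"id": 3, "name": "Sita", "modified": datetime(2024, 3, 9)},
]


def static_extract(rows):
    """Snapshot of all chosen source rows at this point in time."""
    return [dict(row) for row in rows]


def incremental_extract(rows, last_extract_time):
    """Only the rows that changed since the last extract."""
    return [dict(row) for row in rows if row["modified"] > last_extract_time]


snapshot = static_extract(source)
changes = incremental_extract(source, datetime(2024, 1, 31))
print(len(snapshot), [row["id"] for row in changes])
```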
Steps in data reconciliation

Scrub = cleanse… uses pattern
recognition and AI techniques to
upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field
usage, mismatched addresses, missing data, duplicate data,
inconsistencies

Also: decoding, reformatting, time stamping, conversion, key
generation, merging, error detection/logging, locating
missing data
Steps in data reconciliation

Transform = convert data from
format of operational system to
format of data warehouse

Record-level:
Selection – data partitioning
Joining – data combining
Aggregation – data summarization

Field-level:
single-field – from one field to one field
multi-field – from many fields to one, or one field to many
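
Each kind of transformation named above can be illustrated on a tiny data set. The records, field names and customer names below are invented for the sketch.

```python
from collections import defaultdict

# Hypothetical operational records, made up for illustration.
orders = [
    {"cust_id": 1, "amount": 500, "date": "2024-01-05"},
    {"cust_id": 2, "amount": 12000, "date": "2024-01-07"},
    {"cust_id": 1, "amount": 700, "date": "2024-02-11"},
]
customers = {1: "Ram Sane", 2: "Shyam Rao"}

# Record-level selection: keep one partition of the rows.
january = [o for o in orders if o["date"].startswith("2024-01")]

# Record-level joining: combine each order with its customer name.
joined = [{**o, "name": customers[o["cust_id"]]} for o in orders]

# Record-level aggregation: summarise amounts per customer.
totals = defaultdict(int)
for o in orders:
    totals[o["cust_id"]] += o["amount"]

# Field-level, single-field: one source field to one target field.
amount_in_thousands = [o["amount"] / 1000 for o in orders]

# Field-level, multi-field: one field to many (full name -> first, last).
first, last = customers[1].split(" ", 1)

print(len(january), dict(totals), amount_in_thousands, first, last)
```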
Steps in data reconciliation

Load/Index = place transformed
data into the warehouse and
create indexes

Refresh mode: bulk rewriting of target data at
periodic intervals

Update mode: only changes in source data are
written to data warehouse
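
A minimal sketch of the two load modes, with the warehouse modelled as a plain dictionary keyed by row id. The function names and the sample rows are hypothetical; a real warehouse would use its own bulk-load and indexing facilities.

```python
def refresh_load(source_rows):
    """Refresh mode: rebuild the whole target from the source."""
    return {row["id"]: dict(row) for row in source_rows}


def update_load(warehouse, changed_rows):
    """Update mode: write only the changed rows into the existing target."""
    for row in changed_rows:
        warehouse[row["id"]] = dict(row)
    return warehouse


source = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Mumbai"}]
warehouse = refresh_load(source)

changes = [{"id": 2, "city": "Nashik"}, {"id": 3, "city": "Pune"}]
warehouse = update_load(warehouse, changes)

# Index: a lookup structure built after loading (here, city -> row ids).
city_index = {}
for row_id, row in warehouse.items():
    city_index.setdefault(row["city"], []).append(row_id)

print(sorted(city_index["Pune"]))
```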
DIRTY DATA
Data in the real world is dirty:

– incomplete: lacking attribute values,
lacking certain attributes of interest, or
containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies
in codes or names
WHY DATA
PREPROCESSING?
No quality data, no quality mining results!
Quality decisions must be based on
quality data
Data warehouse needs consistent
integration of quality data
Required for Data Mining!
Why can Data be
Incomplete?
 Attributes of interest are not available
(e.g., customer information for sales
transaction data)

 Data were not considered important at
the time of transactions, so they were
not recorded!
Why can Data be
Incomplete?
 Data not recorded because of
misunderstanding or malfunctions

 Data may have been recorded and later
deleted!

 Missing/unknown values for some data


Why can Data be
Noisy / Inconsistent ?
 Faulty instruments for data collection

 Human or computer errors

 Errors in data transmission

 Technology limitations (e.g., sensor data come
at a faster rate than they can be processed)
Why can Data be
Noisy / Inconsistent ?
 Inconsistencies in naming conventions or
data codes (e.g., 2/5/2002 could be 2 May
2002 or 5 Feb 2002)

 Duplicate tuples, which were received twice,
should also be removed
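
Both points can be demonstrated with the standard library. The sketch parses the slide's date string under the two conventions, then removes exact duplicate tuples while keeping the original order; the sample rows are made up.

```python
from datetime import datetime

raw = "2/5/2002"

# The same string under two naming conventions.
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()    # 2 May 2002
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()  # 5 Feb 2002
print(as_day_first.isoformat(), as_month_first.isoformat())

# Exact duplicate tuples: keep the first occurrence, preserve order.
rows = [("Ram", 24), ("Shyam", 34), ("Ram", 24)]
deduplicated = list(dict.fromkeys(rows))
print(deduplicated)
```

Nothing in the string itself says which reading is right, so the convention has to be fixed per source before the data are integrated.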
Major Tasks in Data
Preprocessing
Data cleaning
– Fill in missing values, smooth noisy data,
identify or remove outliers (outliers = exceptions!),
and resolve inconsistencies
Data integration
– Integration of multiple databases or files
Data transformation
– Normalization and aggregation
Major Tasks in Data
Preprocessing
Data reduction
– Obtains reduced representation in volume
but produces the same or similar
analytical results

Data discretization
– Part of data reduction but with particular
importance, especially for numerical data
Forms of data preprocessing
DATA CLEANING

Data cleaning tasks


- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
HOW TO HANDLE MISSING
DATA?
 Ignore the tuple: usually done when the class
label is missing (assuming the task is
classification); not effective when the
percentage of missing values per attribute
varies considerably.

 Fill in the missing value manually: tedious +
infeasible?
HOW TO HANDLE MISSING
DATA?
 Use a global constant to fill in the missing value:
e.g., “unknown”, a new class?!

 Use the attribute mean to fill in the missing value

 Use the attribute mean for all samples belonging
to the same class to fill in the missing value:
smarter
 Use the most probable value to fill in the missing
value: inference-based such as Bayesian formula
or decision tree
HOW TO HANDLE MISSING
DATA?
Age   Income   Team      Gender
23    24,200   Red Sox   M
39    ?        Yankees   F
45    45,390   ?         F

Fill missing values using aggregate functions (e.g.,
average) or probabilistic estimates on the global value
distribution
E.g., for the missing income, put the average income, or put
the most probable income based on the fact that the person is
39 years old
E.g., for the missing team, put the most frequent team
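
A sketch of the aggregate-based fill on the table above: the attribute mean for the missing income and the most frequent value for the missing team. With only two known teams the "most frequent" choice is a tie, so the value picked here is arbitrary; a larger table would be needed for the rule to be meaningful.

```python
from collections import Counter
from statistics import mean

rows = [
    {"age": 23, "income": 24200, "team": "Red Sox", "gender": "M"},
    {"age": 39, "income": None, "team": "Yankees", "gender": "F"},
    {"age": 45, "income": 45390, "team": None, "gender": "F"},
]

# Attribute mean over the known incomes.
known_incomes = [r["income"] for r in rows if r["income"] is not None]
mean_income = mean(known_incomes)

# Most frequent known team (ties are broken arbitrarily).
known_teams = [r["team"] for r in rows if r["team"] is not None]
most_frequent_team = Counter(known_teams).most_common(1)[0][0]

filled = [
    {
        **r,
        "income": mean_income if r["income"] is None else r["income"],
        "team": most_frequent_team if r["team"] is None else r["team"],
    }
    for r in rows
]
print(filled[1]["income"], filled[2]["team"])
```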
HOW TO HANDLE NOISY DATA?
Discretization

The process of partitioning continuous
variables into categories is called
discretization.
HOW TO HANDLE NOISY DATA?
Discretization : Smoothing techniques

Binning method:
- first sort data and partition into (equi-depth) bins
- then one can smooth by bin means, smooth by
bin median, smooth by bin boundaries, etc.

Clustering
- detect and remove outliers
HOW TO HANDLE NOISY DATA?
Discretization : Smoothing techniques

Combined computer and human inspection
- computer detects suspicious values,
which are then checked by humans

Regression
- smooth by fitting the data to regression
functions
SIMPLE DISCRETISATION
METHODS: BINNING
Equal-width (distance) partitioning:

- It divides the range into N intervals of equal size:
uniform grid
- if A and B are the lowest and highest values of the
attribute, the width of intervals will be:
W = (B-A)/N.
- The most straightforward
- But outliers may dominate presentation
- Skewed data is not handled well.
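
The formula W = (B-A)/N can be sketched as follows. The function name `equal_width_bins` is hypothetical, and the code assumes the values are not all identical (otherwise W would be zero). The second call adds one made-up outlier to show how it dominates the partition.

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes high > low
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value falls on the upper edge, so clamp it
        # into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins


ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
width, bins = equal_width_bins(ages, 3)
print(width, bins)

# One outlier stretches the range and crowds the rest into one bin.
_, skewed = equal_width_bins(ages + [280], 3)
print([len(b) for b in skewed])
```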
SIMPLE DISCRETISATION

METHODS: BINNING
Equal-depth (frequency) partitioning:

- It divides the range into N intervals, each
containing approximately the same number of
samples
- Good data scaling – good handling of
skewed data
BINNING : EXAMPLE
 Binning is applied to each individual feature
(attribute)

 A set of values can then be discretized by replacing
each value in a bin by the bin mean, bin median, or bin
boundaries.

 Example: Set of values of attribute Age:
 0, 4, 12, 16, 16, 18, 23, 26, 28
EXAMPLE: EQUI- WIDTH BINNING
Example : Set of values of attribute Age:
0, 4, 12, 16, 16, 18, 23, 26, 28
Take bin width = 10

Bin #   Bin Elements          Bin Boundaries
1       { 0, 4 }              ( -∞, 10 )
2       { 12, 16, 16, 18 }    [ 10, 20 )
3       { 23, 26, 28 }        [ 20, +∞ )


EXAMPLE: EQUI- DEPTH BINNING
Example : Set of values of attribute Age:
0, 4, 12, 16, 16, 18, 23, 26, 28
Take bin depth = 3

Bin #   Bin Elements        Bin Boundaries
1       { 0, 4, 12 }        ( -∞, 14 )
2       { 16, 16, 18 }      [ 14, 21 )
3       { 23, 26, 28 }      [ 21, +∞ )
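
The equi-depth table can be reproduced with a short sketch. Placing each cut point halfway between neighbouring bins is an assumption made here; it gives 14 and 20.5, which agrees with the table if 20.5 is rounded up to 21.

```python
def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def cut_points(bins):
    """Cut point halfway between the last value of one bin and the
    first value of the next (an assumed convention)."""
    return [(left[-1] + right[0]) / 2 for left, right in zip(bins, bins[1:])]


ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
bins = equal_depth_bins(ages, 3)
print(bins)
print(cut_points(bins))
```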


SMOOTHING USING BINNING
METHODS
 Sorted data for price (in Rs): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
 Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
 Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
 Smoothing by bin boundaries: [4,15],[21,25],[26,34]
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
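
Both smoothing rules can be reproduced on the price data. In this sketch bin means are rounded to the nearest integer, and a value equally close to both boundaries goes to the lower one; the tie rule is an assumption, and no tie occurs in this data.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
# Equi-depth partition: three bins of four sorted values each.
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```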
SIMPLE DISCRETISATION
METHODS: BINNING
Example: customer ages

[Figure: two histograms of the number of values per bin]

Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
THANK YOU !

Any Questions?