SUSHIL KULKARNI JAI-HIND COLLEGE sushiltry@yahoo.co.in

• Social Networks: Example
• Technology used
• What is Data Mining?
• DM Process & Example
• DM Queries
• DM Tasks and Methods
• Relation & Data Warehouse
• What is ETL?
• Data Preprocessing

What is a Network?
[Figure: nodes connected to one another by links]

Web Definition : A set of nodes, points, or locations connected by means of data, voice, and video communications for the purpose of exchange.

Social Networks

• A social network is a social structure of people, related (directly or indirectly) to each other through a common relation or interest.

Social Network Analysis
• Social network analysis [SNA] is the mapping and measuring of relationships and flows between people, groups, organizations, computers or other information/knowledge processing entities.
• The nodes in the network are the people and groups, while the links show relationships or flows between the nodes.

A shift in approach: from ‘synthesis’ to ‘analysis’
Problems
• High cost of manual surveys
• Survey bias - perceptions of individuals may be incorrect
• Logistics - organizations are now spread across several countries

[Figure: shift in approach. Synthesis: Employee Surveys, Cognitive networks for A, B and C, Social Network. Analysis: Electronic communication (Email, Web logs), Social network.]

Technology
Various technologies that help in creating Social Networks are:
• Email
• Blogs
• Social networking software such as Orkut, Facebook, Flickr, etc.

SOCIAL NETWORK:
Profile & Platforms

USENET

Social Community

SOCIAL NETWORK: Growth

SOCIAL NETWORK : Growth Rate

Technology :
• What is your network?
  - When your connections invite their connections, your network starts to grow.
  - Your network is your connections, their connections, and so on, out from you at the center.
• How do you classify users?
  - Your network contains professionals out to “three degrees”, that is, friends-of-friends-of-friends. If each person had 10 connections (and some have many more) and none of them overlapped, your network would contain up to 10 + 100 + 1,000 = 1,110 professionals.
• How do you see who is in your network?
  - Facebook lets you see your network as one large group of searchable professional profiles.
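The size of a "three degrees" network can be checked with a few lines of arithmetic. The sketch below is only an upper bound under two simplifying assumptions that are not in the slides: every person has exactly the same number of connections, and no two people share a connection. The function name `max_network_size` is illustrative.

```python
def max_network_size(connections_per_person: int, degrees: int) -> int:
    """Upper bound on the people reachable within `degrees` hops.

    Assumes (hypothetically) that everyone has the same number of
    connections and that no connection is shared between two people.
    """
    # Degree 1 contributes k people, degree 2 contributes k*k, and so on.
    return sum(connections_per_person ** d for d in range(1, degrees + 1))


three_degrees = max_network_size(10, 3)  # 10 + 100 + 1000
print(three_degrees)
```

In a real network friends overlap heavily, so the actual count is usually smaller than this bound.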

SOCIAL NETWORK: Visualization
[Figure: ME at the centre, linked to several FRIEND nodes]

ON ANY SOCIAL NETWORK
Both YOU and each FRIEND have a profile with: Name, Gender, Age, Birth date / Home town, School attended, Interests / Hobbies, Photos, Friends, Activities, Audio clips, Video clips.

After making a friend, I am able to access his or her friends, audio clips and video clips, and to share information. A friend may be at any remote site.

SOCIAL NETWORK: Visualization
Between friends: How many of them?
• Male vs. Female
• Young vs. Old
• Thin vs. Fat

Between friends: Relationships
• Thick Friends
• Just Friends

Between friends: Likes
• Coffee
• Chocolate

HOW MANY OF MADHURI DIXIT’S FRIENDS LIKE [pictured item]? HOW MANY OF PRASHANT DAMLE’S FRIENDS LIKE [pictured item]?

FRIENDS OF A FRIEND OF A FRIEND SHOULD KNOW
• How many friends use a social network regularly?
• How many friends send messages frequently?
• What is the mood of your friend list?
• How many friends are vegetarian?
• How many friends are close to you or far from you?
• How many friends studied or are studying in your school?

INTERESTING PATTERNS FROM UNKNOWN DATA

DEFINE DATA MINING
Data Mining is: The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

THUS: DATA MINING
• Methods for exploring and modeling relationships in large amounts of data
• Finding hidden information in a database
• Fitting data to a model

Data Mining Process
• Understand the domain
  - Understand the particulars of the business or scientific problem
• Create a data set
  - Understand the structure, size, and format of the data
  - Select the interesting attributes
  - Data cleaning and preprocessing
• Choose the data mining task and the specific algorithm
  - Understand the capabilities and limitations of algorithms that may be relevant to the problem
• Interpret the results, and possibly return to bullet 2

EXAMPLE
• Understand social networks.
• Grow connections.
• Choose appropriate built-in methods to find hidden information.

Example: E-mail Communication
[Figure: e-mail flow among A, B, C, D and E]
• A sends an e-mail to B
• With Cc to C
• And Bcc to D
• C forwards this e-mail to E
• From analyzing the header, we can infer:
  - A and D know that A, B, C and D know about this e-mail
  - B and C know that A, B and C know about this e-mail
  - C also knows that E knows about this e-mail
  - D also knows that B and C do not know that it knows about this e-mail, and that A knows this fact
  - E knows that A, B and C exchanged this e-mail, and that neither A nor B knows that it knows about it
  - and so on and so forth …
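The header inferences above can be sketched in code. This is a minimal illustration, not a model of any real mail system: the `Email` class and the `visible_parties` and `forwarded_view` helpers are hypothetical names, and the sketch assumes that a Bcc recipient sees the To and Cc fields plus itself, while To and Cc recipients never see the Bcc field.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    sender: str
    to: list
    cc: list = field(default_factory=list)
    bcc: list = field(default_factory=list)


def visible_parties(msg: Email, viewer: str) -> set:
    """Parties that `viewer` can tell are involved, judging by its own copy."""
    public = {msg.sender, *msg.to, *msg.cc}
    if viewer == msg.sender:
        return public | set(msg.bcc)   # the sender knows everyone
    if viewer in msg.bcc:
        return public | {viewer}       # a Bcc recipient sees To/Cc and itself
    if viewer in public:
        return public                  # To/Cc recipients cannot see Bcc
    raise ValueError(f"{viewer} did not receive this e-mail")


def forwarded_view(msg: Email, forwarder: str, recipient: str) -> set:
    """What a recipient of a forwarded copy learns: the forwarder's view plus itself."""
    return visible_parties(msg, forwarder) | {recipient}


msg = Email(sender="A", to=["B"], cc=["C"], bcc=["D"])
for person in "ABCD":
    print(person, sorted(visible_parties(msg, person)))
print("E", sorted(forwarded_view(msg, "C", "E")))
```

Comparing the sets for different viewers reproduces the first-level statements in the list; the nested "knows that ... knows" statements would need a richer model.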

DB VS DM PROCESSING
          Database                        Data Mining
Query     Well defined; SQL               Poorly defined; no precise query language
Data      Operational data                Not operational data
Output    Precise; subset of database     Fuzzy; not a subset of database

QUERY EXAMPLES
Database
– Find all credit applicants with first name of Sane.
– Identify customers who have purchased more than Rs. 10,000 in the last month.
– Find all customers who have purchased milk.

Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
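The difference between the two kinds of query can be illustrated on the milk example. In this sketch the baskets are made-up data, and `frequently_bought_with` is a hypothetical helper that uses a simple confidence threshold; it is a stand-in for a real association rule algorithm, not an implementation of one.

```python
from collections import Counter

# Hypothetical market baskets, invented for illustration only.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "jam"},
    {"milk", "bread", "eggs"},
]

# Database-style query: well defined, and the answer is a subset of the data.
milk_baskets = [t for t in transactions if "milk" in t]


def frequently_bought_with(item, baskets, min_confidence=0.5):
    """Data-mining-style query: items that co-occur with `item` often enough.

    Confidence here is the fraction of baskets containing `item`
    that also contain the other item.
    """
    containing = [b for b in baskets if item in b]
    counts = Counter(other for b in containing for other in b if other != item)
    return {
        other: n / len(containing)
        for other, n in counts.items()
        if n / len(containing) >= min_confidence
    }


with_milk = frequently_bought_with("milk", transactions)
print(len(milk_baskets), with_milk)
```

The first result is a list of stored records; the second is a derived pattern that appears nowhere in the database itself.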

ARE ALL THE ‘DISCOVERED’ PATTERNS INTERESTING?
• Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.

DATA MINING DEVELOPMENT
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms

RELATION (r)
• D1, D2, …, Dn are domains
• Relation r is a subset of the Cartesian product D1 × D2 × … × Dn

r ⊆ D1 × D2 × … × Dn

EXAMPLE: r
D1 = {Ram, Shyam}, D2 = {24, 34}
D1 × D2 = {(Ram, 24), (Ram, 34), (Shyam, 24), (Shyam, 34)}

r is a subset of D1 × D2: r = {(Ram, 24), (Shyam, 34)}
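The example can be reproduced directly, which makes the "subset of a Cartesian product" definition concrete. This sketch uses the standard library's `itertools.product`; the variable names follow the slide.

```python
from itertools import product

D1 = ["Ram", "Shyam"]
D2 = [24, 34]

# Cartesian product D1 x D2: every possible (name, age) pair.
cartesian = set(product(D1, D2))

# A relation is any subset of that product.
r = {("Ram", 24), ("Shyam", 34)}
is_relation = r <= cartesian  # subset test

print(sorted(cartesian))
print(is_relation)
```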

RELATION is TABLE

Employee

NAME Ram

TUPLES OR ROWS: t
• An instance of the relation is a tuple or row
• Notation: t = (a(1), a(2), a(3), …, a(n)), where a(i) ∈ A(i) for i = 1, …, n
• Example: t = (Ram, 24)

RELATION (r)
R:
        A1     A2     A3     …    Ak     …    An
        a11    a21    a31    …    ak1    …    an1
        a12    a22    a32    …    ak2    …    an2
        …
t →     a1i    a2i    a3i    …    aki    …    ani
        …
        a1m    a2m    a3m    …    akm    …    anm

aki = value of the k-th attribute of R in the i-th tuple t

WHAT IS DATA WAREHOUSE?
• Subject-oriented: customers, patients, students, products, time.
• Integrated: gathered centrally from
  1. several internal systems of records
  2. sources external to the organization
• Time-variant: used to study trends and changes.
• Non-updatable: cannot be updated by end users.

BIG PICTURE

The ETL Process
ETL = Extract, Transform, and Load
• Capture
• Scrub (data cleansing)
• Transform
• Load and Index

Steps in data reconciliation

Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
• Static extract = capturing a snapshot of the source data at a point in time
• Incremental extract = capturing changes that have occurred since the last static extract

Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality
• Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
• Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Transform = convert data from the format of the operational system to the format of the data warehouse
• Record-level:
  - Selection – data partitioning
  - Joining – data combining
  - Aggregation – data summarization
• Field-level:
  - Single-field – from one field to one field
  - Multi-field – from many fields to one, or one field to many

Load/Index = place transformed data into the warehouse and create indexes
• Refresh mode: bulk rewriting of target data at periodic intervals
• Update mode: only changes in source data are written to the data warehouse
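The four reconciliation steps can be sketched end to end on a toy data set. Everything concrete here is an assumption made for illustration: the record fields, the sample rows, the function names, and the use of a plain dictionary as the "warehouse". Index creation is omitted, and adding totals in update mode is just one possible way of applying changes.

```python
from datetime import date

# Hypothetical operational records; field names are illustrative only.
SOURCE = [
    {"id": 1, "customer": " ram ", "amount": "1200", "updated": date(2024, 1, 5)},
    {"id": 2, "customer": "Shyam", "amount": "300", "updated": date(2024, 1, 9)},
    {"id": 2, "customer": "Shyam", "amount": "300", "updated": date(2024, 1, 9)},  # duplicate
    {"id": 3, "customer": "Ram", "amount": "800", "updated": date(2024, 2, 1)},
]


def capture(source, since=None):
    """Static extract when `since` is None, incremental extract otherwise."""
    return [r for r in source if since is None or r["updated"] > since]


def scrub(rows):
    """Normalise names, convert amounts to integers, drop duplicate ids."""
    seen, clean = set(), []
    for r in rows:
        if r["id"] in seen:
            continue
        seen.add(r["id"])
        clean.append({**r, "customer": r["customer"].strip().title(),
                      "amount": int(r["amount"])})
    return clean


def transform(rows):
    """Record-level aggregation: total amount per customer."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals


def load(warehouse, totals, mode="refresh"):
    """Refresh mode rewrites the target; update mode applies only the changes."""
    if mode == "refresh":
        warehouse.clear()
        warehouse.update(totals)
    else:
        for customer, amount in totals.items():
            warehouse[customer] = warehouse.get(customer, 0) + amount
    return warehouse


# Full refresh from a static extract.
refreshed = load({}, transform(scrub(capture(SOURCE))), mode="refresh")

# Incremental extract applied in update mode to an older warehouse state.
older = {"Ram": 1200, "Shyam": 300}
updated = load(older, transform(scrub(capture(SOURCE, since=date(2024, 1, 31)))),
               mode="update")
print(refreshed, updated)
```

Both paths end in the same warehouse contents here, which is the property a correct incremental load should preserve.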

DIRTY DATA
Data in the real world is dirty:
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names

WHY DATA PREPROCESSING?
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• A data warehouse needs consistent integration of quality data
• Required for data mining!

Why can Data be Incomplete?
• Attributes of interest are not available (e.g., customer information for sales transaction data)
• Data were not considered important at the time of the transactions, so they were not recorded
• Data were not recorded because of misunderstanding or malfunctions
• Data may have been recorded and later deleted
• Missing/unknown values for some data

Why can Data be Noisy / Inconsistent?
• Faulty instruments for data collection
• Human or computer errors
• Errors in data transmission
• Technology limitations (e.g., sensor data arrive at a faster rate than they can be processed)
• Inconsistencies in naming conventions or data codes (e.g., 2/5/2002 could be 2 May 2002 or 5 Feb 2002)
• Duplicate tuples (records that were received twice), which should be removed
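Two of the problems above are easy to demonstrate: the ambiguous date and the duplicate tuples. The sketch uses the standard library's `datetime.strptime`; the sample records are invented for illustration.

```python
from datetime import datetime

raw = "2/5/2002"

# The same string read under two different conventions.
day_first = datetime.strptime(raw, "%d/%m/%Y").date()    # 2 May 2002
month_first = datetime.strptime(raw, "%m/%d/%Y").date()  # 5 Feb 2002
print(day_first.isoformat(), month_first.isoformat())

# Hypothetical records in which one tuple was received twice.
records = [("Ram", 24), ("Shyam", 34), ("Ram", 24)]

# dict.fromkeys keeps the first occurrence of each tuple and preserves order.
deduplicated = list(dict.fromkeys(records))
print(deduplicated)
```

Nothing in the string itself says which convention applies, so the format has to be known from the data source.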

Major Tasks in Data Preprocessing
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers (outliers = exceptions!), and resolve inconsistencies
• Data integration
  – Integration of multiple databases or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a reduced representation in volume but produces the same or similar analytical results
• Data discretization
  – Part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing

DATA CLEANING
Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data

HOW TO HANDLE MISSING DATA?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Use a global constant to fill in the missing value: e.g., “unknown” (which may act as a new class)
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree

Example:
Age    Income    Team       Gender
23     24,200    Red Sox    M
39     ?         Yankees    F
45     45,390    ?          F

Fill missing values using aggregate functions (e.g., average) or probabilistic estimates on the global value distribution:
- E.g., put the average income in place of the missing income, or the most probable income given that the person is 39 years old
- E.g., put the most frequent team in place of the missing team
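Several of the fill-in strategies can be tried on the three-row table above. This is a minimal sketch: `None` stands for a "?" entry, the helper names are hypothetical, and using Gender as the class attribute is an assumption made only to show the class-conditional mean.

```python
from collections import Counter
from statistics import mean

# The three rows from the table; None marks a missing value ("?").
rows = [
    {"age": 23, "income": 24200, "team": "Red Sox", "gender": "M"},
    {"age": 39, "income": None, "team": "Yankees", "gender": "F"},
    {"age": 45, "income": 45390, "team": None, "gender": "F"},
]


def fill_constant(rows, attr, constant="unknown"):
    """Replace missing values with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Replace missing values with the mean of the known values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]


def fill_class_mean(rows, attr, class_attr):
    """Replace missing values with the mean over rows of the same class."""
    filled = []
    for r in rows:
        if r[attr] is None:
            same = [x[attr] for x in rows
                    if x[class_attr] == r[class_attr] and x[attr] is not None]
            r = {**r, attr: mean(same) if same else None}
        filled.append(r)
    return filled


def fill_mode(rows, attr):
    """Replace missing values with the most frequent known value."""
    known = [r[attr] for r in rows if r[attr] is not None]
    most_common = Counter(known).most_common(1)[0][0]
    return [{**r, attr: most_common if r[attr] is None else r[attr]} for r in rows]


by_mean = fill_mean(rows, "income")                         # (24200 + 45390) / 2
by_class_mean = fill_class_mean(rows, "income", "gender")   # mean over F rows only
by_mode = fill_mode(rows, "team")                           # the two known teams tie
print(by_mean[1]["income"], by_class_mean[1]["income"], by_mode[2]["team"])
```

With only two known teams, each seen once, "most frequent" is a tie, which shows how little a mode can say on such a small sample.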

HOW TO HANDLE NOISY DATA? Discretization
The process of partitioning continuous variables into categories is called discretization.

Discretization: Smoothing techniques
• Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering:
  - detect and remove outliers
• Combined computer and human inspection:
  - the computer detects suspicious values, which are then checked by humans
• Regression:
  - smooth by fitting the data to regression functions

SIMPLE DISCRETISATION METHODS: BINNING
Equal-width (distance) partitioning:
- It divides the range into N intervals of equal size (a uniform grid)
- If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
- The most straightforward method
- But outliers may dominate the presentation
- Skewed data is not handled well
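The width formula W = (B - A)/N translates directly into code. In this sketch the function name `equal_width_bins` is illustrative, the Age values are the ones used in the binning examples of this deck, and the extra value 100 is an invented outlier added to show how it distorts the bins.

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of width W = (B - A) / N."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the width would be zero")
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value would land in bin N, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return width, bins


ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
width, bins = equal_width_bins(ages, 3)
print(width, bins)

# One outlier stretches the range and pushes every other value into bin 1.
_, skewed = equal_width_bins(ages + [100], 3)
print(skewed)
```

The second call leaves the middle bin empty, which is the "outliers may dominate" problem in miniature.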

Equal-depth (frequency) partitioning:
- It divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling and good handling of skewed data
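Equal-depth partitioning can be sketched by sorting the values and cutting the sorted list into runs of nearly equal length. The function name `equal_depth_bins` is illustrative, and giving any leftover values to the earliest bins is an arbitrary choice made here.

```python
def equal_depth_bins(values, n_bins):
    """Split sorted values into n_bins groups of (almost) equal size."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins each take one additional value.
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins


ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
depth_bins = equal_depth_bins(ages, 3)
print(depth_bins)

# An invented outlier (100) changes bin sizes by at most one value.
with_outlier = equal_depth_bins(ages + [100], 3)
print([len(b) for b in with_outlier])
```

Because the cut is made purely by position, equal values can end up in different bins; a production version might move ties into the same bin.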

BINNING: EXAMPLE
• Binning is applied to each individual feature (attribute)
• A set of values can then be discretized by replacing each value in a bin by the bin mean, the bin median, or the nearest bin boundary
• Example: set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28

EXAMPLE: EQUI-WIDTH BINNING
Example: set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28
Take bin width = 10

Bin #    Bin Elements         Bin Boundaries
1        {0, 4}               [-, 10)
2        {12, 16, 16, 18}     [10, 20)
3        {23, 26, 28}         [20, +)

EXAMPLE: EQUI-DEPTH BINNING
Example: set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28
Take bin depth = 3

Bin #    Bin Elements         Bin Boundaries
1        {0, 4, 12}           [-, 14)
2        {16, 16, 18}         [14, 21)
3        {23, 26, 28}         [21, +)

SMOOTHING USING BINNING METHODS
• Sorted data for price (in Rs): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
• Smoothing by bin means (rounded to the nearest integer):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries: [4,15], [21,25], [26,34]
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
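The price example can be reproduced with two short functions. This is a sketch under two assumptions not stated in the slides: bin means are rounded to the nearest integer, and a value exactly halfway between the two boundaries goes to the lower one. The function names are illustrative.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an arbitrary choice).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


price_bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(price_bins)
by_boundaries = smooth_by_boundaries(price_bins)
print(by_means)
print(by_boundaries)
```

The exact bin means are 9, 22.75 and 29.25, so the slide values 9, 23 and 29 correspond to the rounded results.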

SIMPLE DISCRETISATION METHODS: BINNING
Example: customer ages
[Figure: histograms of the number of values per bin]

Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80

THANK YOU ! Any Questions?
SUSHIL KULKARNI sushiltry@yahoo.co.in
