
Database Clustering and

Summary Generation

Tae-Wan Ryu and Christoph F. Eick

Similarity Measures for Multi-valued Attributes for Database Clustering
Tae-wan Ryu and Christoph F. Eick
Department of Computer Science
University of Houston

Talk Organization
Database Clustering
Problems of Database Clustering
Extended Data Sets
Similarity Measures for Sets and Bags
An Architecture for Database Clustering
Summary and Conclusion

General KDD Steps

Data sources -> [Select/preprocess] -> Selected/preprocessed data -> [Transform] -> Transformed data -> [Data mine] -> Extracted information -> [Interpret/Evaluate/Assimilate] -> Knowledge

Data preparation: the Select/preprocess and Transform steps

Research Goal

To develop methodologies, techniques, and tools to create summaries from databases using cluster analysis and genetic programming

Our approach
Partition the database into groups of similar objects using cluster
analysis
Find commonalities that objects belonging to each group share
using genetic programming

Database Summary Generation Steps and Example
< Steps >
Database -> Database Clustering -> Groups of similar objects -> Summary Generation -> Summaries describing the commonalities within each group

< Example >
Restaurant database -> Clusters (Young, White collar, Retired) -> Summaries (Midnight, Dinner, Lunch)

An Example Schema Diagram

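Relations:
Marriage(hssn, wssn, mdate, numkid)
Employee(name, ssn, address, sex, salary, superssn, dno)
Department(dnum, dname)
Works_on(essn, pno, hours)
Project(pname, pnum, ploc, dnum)
Dept_loc(dnum, dloc)

Relationships (name, cardinality ratio):
husband, 1:n and wife, 1:n (Employee to Marriage)
ehusband, n:1 and ewife, n:1 (Marriage to Employee)
superv, n:1 (Employee to Employee)
works_for, n:1 (Employee to Department)
works_loc, 1:n (Department to Dept_loc)
works_on, 1:n (Employee to Works_on)
project, n:1 (Works_on to Project)
control, 1:n (Department to Project)
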


Preprocessing for Database Clustering

Preparing input data sets for clustering

Appropriate data selection and preparation from a database is an important task.

Key Problems
How to support a user's viewpoint, including attribute selection
Data model discrepancy between the storage format and the input format that clustering algorithms assume
How to cope with structural information, especially 1:n and n:m relationships

Input Format for Data Mining Algorithms

Data Format for Input Data Sets


Single flat file format (basically, the data set has to be
stored as a single(!) relation)
Complex and structured formats

Problem: Almost all existing data mining and clustering approaches assume that the input data set is in single flat file format.

An Example Database to Illustrate the Problems with Relationship Information in Database Clustering

(a) Person
ssn        name   age  sex
111111111  Johny  43   M
222222222  Andy   21   F
333333333  Post   67   M
444444444  Jenny  35   F

(a) Purchase
ssn        location   ptype  amount  date
111111111  Mall       1      400     02-10-96
111111111  Grocery    2      70      05-14-96
111111111  Warehouse  3      200     12-24-96
222222222  Mall       2      300     12-23-96
222222222  Grocery    3      100     06-22-96
333333333  Mall       1      30      11-05-96

(b) Joined result
name   age  sex  ptype  amount  location
Johny  43   M    1      400     Mall
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Warehouse
Andy   21   F    2      300     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

ptype (payment type): 1 for cash, 2 for credit, and 3 for check. The cardinality ratio between Person and Purchase is 1:n.

(a) An example personal relational database; (b) the table obtained by joining the Person and Purchase relations

Existing Approaches

Applying aggregate functions or generalization operators to convert a multi-valued attribute into a single-valued attribute.

Problems
User has to make a critical decision (e.g., which aggregate
function to use?)
Valuable related information may be lost.
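
To make the information loss concrete, here is a minimal Python sketch that collapses the purchase amounts of Johny and Post from the example tables with different aggregate functions. The helper name `aggregate` and the choice of functions are illustrative assumptions, not part of the talk.

```python
from statistics import mean

# Purchase amounts per person, taken from the example tables in this talk.
purchases = {
    "Johny": [400, 70, 200],
    "Post": [30],
}

def aggregate(amounts, how):
    """Collapse a multi-valued attribute into one value (hypothetical helper)."""
    functions = {"sum": sum, "avg": mean, "max": max, "count": len}
    return functions[how](amounts)

# Each aggregate function gives a different single-valued view of the same data.
for how in ("sum", "avg", "max", "count"):
    print(how, {name: aggregate(vals, how) for name, vals in purchases.items()})

# Information loss: three purchases and one large purchase look identical under "sum".
same_total = aggregate([400, 70, 200], "sum") == aggregate([670], "sum")
```

Whichever function the user picks, the individual purchases can no longer be recovered from the aggregated value, which is the second problem listed above.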

Extended Data Sets
Joined table
name   age  sex  ptype  amount  location
Johny  43   M    1      400     Mall
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Warehouse
Andy   21   F    2      100     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

Extended data set
name   age  sex  p.ptype  p.amount      p.location
Johny  43   M    {1,2,3}  {400,70,200}  {Mall, Grocery, Warehouse}
Andy   21   F    {2,3}    {100,100}     {Mall, Grocery}
Post   67   M    1        30            Mall
Jenny  35   F    null     null          null

A converted table with bags of values

How to measure similarity between bags of values?


Group similarity measures are needed.
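
The conversion from the joined table to the bag-valued table can be sketched in a few lines of Python. This is only an illustration under assumptions: rows are grouped by name because the joined table on this slide carries no ssn column, bags are modelled as Python lists, and the function name `to_extended` is hypothetical.

```python
# Rows of the joined table on this slide: (name, age, sex, ptype, amount, location).
joined = [
    ("Johny", 43, "M", 1, 400, "Mall"),
    ("Johny", 43, "M", 2, 70, "Grocery"),
    ("Johny", 43, "M", 3, 200, "Warehouse"),
    ("Andy", 21, "F", 2, 100, "Mall"),
    ("Andy", 21, "F", 3, 100, "Grocery"),
    ("Post", 67, "M", 1, 30, "Mall"),
    ("Jenny", 35, "F", None, None, None),
]

def to_extended(rows):
    """Group joined rows by person and keep the related values as bags (lists)."""
    extended = {}
    for name, age, sex, ptype, amount, location in rows:
        record = extended.setdefault(
            name, {"age": age, "sex": sex, "ptype": [], "amount": [], "location": []}
        )
        if ptype is not None:  # a person without purchases keeps empty bags
            record["ptype"].append(ptype)
            record["amount"].append(amount)
            record["location"].append(location)
    return extended

extended = to_extended(joined)
print(extended["Andy"]["amount"])  # duplicates survive, so this is a bag, not a set
```

Lists are used rather than sets so that repeated values such as Andy's two purchases of 100 are preserved.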

Approaches for Database Clustering

<Current approach>
Structured database -> Manual transformation -> Flat file -> Clustering algorithms

<Proposed approach>
Structured database -> Automated preprocessing -> Extended data set -> Generalized clustering algorithms

Related Work

LABYRINTH (Thompson et al.)
Ketterlin's extended COBWEB
KATE (Manago et al.)
SUBDUE (Holder et al.)
INLEN (Ribeiro et al.)
KBG (Bisson et al.), KLUSTER (Kietz et al.)


Research Objectives for Database Clustering

To alleviate the representational gap between databases on the one hand and the input formats of clustering algorithms on the other
To design and implement semi-automatic tools to facilitate database clustering
To generalize clustering algorithms

Generating Extended Data Sets From a Structured Database

Database (d1, d2, ..., dn) + User's interests and objectives -> Extended data set generator -> Extended data set

A Unified Similarity Measure for Clustering Extended Data Sets
Group Similarity Measures
Mixed types: qualitative and quantitative.

Qualitative type: Tversky's set-theoretical similarity models.

Contrast model
$$S(a,b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A)$$
where a and b are two objects, A and B denote their sets of features, $\theta, \alpha, \beta \ge 0$, and f is the cardinality of the set.

Ratio model (e.g., normalized similarity)
$$S(a,b) = \frac{f(A \cap B)}{f(A \cap B) + \alpha f(A - B) + \beta f(B - A)}, \quad \alpha, \beta \ge 0$$
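
As a rough illustration, Tversky's two models can be written directly over Python sets, with f taken to be set cardinality. The parameter names `theta`, `alpha`, `beta`, their default values of 1, and the treatment of two empty sets are assumptions made for this sketch.

```python
def contrast(A, B, theta=1.0, alpha=1.0, beta=1.0):
    """Contrast model: weighted common features minus weighted distinctive features."""
    A, B = set(A), set(B)
    return theta * len(A & B) - alpha * len(A - B) - beta * len(B - A)

def ratio(A, B, alpha=1.0, beta=1.0):
    """Ratio model: common features normalized by common plus distinctive features."""
    A, B = set(A), set(B)
    common = len(A & B)
    denominator = common + alpha * len(A - B) + beta * len(B - A)
    # Assumption: two empty feature sets are treated as identical.
    return common / denominator if denominator else 1.0

# Purchase locations of Johny and Andy from the extended data set example.
johny = {"Mall", "Grocery", "Warehouse"}
andy = {"Mall", "Grocery"}
print(contrast(johny, andy))  # 2 common, 1 only in Johny, 0 only in Andy
print(ratio(johny, andy))     # 2 / (2 + 1 + 0)
```

With alpha = beta = 1 the ratio model reduces to the Jaccard coefficient, which is one common way to obtain a similarity normalized to [0, 1].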

Group Similarity Measures... continued

Quantitative type: group average

Group average between groups A and B
$$d(A,B) = \frac{1}{n} \sum_{i=1}^{n} d(a,b)_i$$
where n is the total number of object pairs and $d(a,b)_i$ is the dissimilarity measure for the ith pair of objects a and b, with $a \in A$, $b \in B$.

The group average is the mean of the inter-object measures over all pairs in which the two objects belong to different groups.
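
A small Python sketch of the group average follows. It assumes absolute difference as the pairwise dissimilarity d(a,b), which the slide does not fix, and the function name `group_average` is illustrative.

```python
from itertools import product

def group_average(A, B, d=lambda a, b: abs(a - b)):
    """Average the pairwise dissimilarity over every pair (a, b) with a in A, b in B."""
    pairs = list(product(A, B))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(d(a, b) for a, b in pairs) / len(pairs)

# Purchase amounts of Johny and Post from the extended data set example.
johny_amounts = [400, 70, 200]
post_amounts = [30]
print(group_average(johny_amounts, post_amounts))  # (370 + 40 + 170) / 3
```

Because every cross-group pair contributes, duplicates in a bag count once per occurrence, which is why bags rather than sets are the natural input here.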

A Framework for Mixed Type Similarity Measures for Extended Data Sets
Gower's similarity measure for data sets with mixed types:
$$S(a,b) = \frac{\sum_{i=1}^{m} w_i \, s_i(a_i, b_i)}{\sum_{i=1}^{m} w_i}$$

Extended similarity measure for multi-valued data sets with mixed types:
$$S(a,b) = \frac{\sum_{i=1}^{l} w_i \, s_l(a_i, b_i) + \sum_{j=1}^{q} w_j \, s_q(a_j, b_j)}{\sum_{i=1}^{l} w_i + \sum_{j=1}^{q} w_j}$$
where m = l + q. The functions s_l(a,b) and s_q(a,b) are similarity functions for qualitative and quantitative attributes, respectively.
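
The weighted combination can be sketched as follows. The two component functions are assumptions chosen for illustration: the qualitative similarity is the ratio model with both distinctive-feature weights set to 1, and the quantitative similarity converts a group-average distance into a similarity by dividing by an assumed attribute range. Neither choice, nor the function names, is prescribed by the talk.

```python
def s_qual(A, B):
    """Qualitative similarity: ratio model with unit weights (Jaccard) on distinct values."""
    A, B = set(A), set(B)
    union = A | B
    return len(A & B) / len(union) if union else 1.0

def s_quant(A, B, value_range):
    """Assumed conversion: 1 - (group-average distance / attribute range); bags must be non-empty."""
    distances = [abs(a - b) for a in A for b in B]
    return 1.0 - (sum(distances) / len(distances)) / value_range

def similarity(a, b, qualitative, quantitative, weights, ranges):
    """Weighted average of per-attribute similarities over both attribute types."""
    numerator = sum(weights[k] * s_qual(a[k], b[k]) for k in qualitative)
    numerator += sum(weights[k] * s_quant(a[k], b[k], ranges[k]) for k in quantitative)
    denominator = sum(weights[k] for k in qualitative) + sum(weights[k] for k in quantitative)
    return numerator / denominator

johny = {"location": ["Mall", "Grocery", "Warehouse"], "amount": [400, 70, 200]}
post = {"location": ["Mall"], "amount": [30]}
score = similarity(
    johny, post,
    qualitative=["location"], quantitative=["amount"],
    weights={"location": 1.0, "amount": 1.0},
    ranges={"amount": 370.0},  # assumed range: 400 - 30 in the example data
)
print(round(score, 4))
```

Setting a weight to 0 removes an attribute from the comparison, which is one simple way to reflect the user's viewpoint in the measure.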

Clustering Algorithms for Extended Data Sets

Nearest-neighbor clustering
DBSCAN
Leader algorithm
Hierarchical clustering
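
Any of these algorithms only needs a pairwise similarity, so a group similarity measure can be plugged in unchanged. As an illustration, here is a minimal sketch of the single-pass leader algorithm over set-valued objects. The threshold value, the Jaccard similarity, and the function names are assumptions for this example.

```python
def jaccard(A, B):
    """Set similarity used for the demo; any group similarity measure could be substituted."""
    union = A | B
    return len(A & B) / len(union) if union else 1.0

def leader_clustering(objects, similarity, threshold):
    """Assign each object to the first leader that is similar enough, else make it a new leader."""
    leaders = []
    clusters = []
    for obj in objects:
        for index, leader in enumerate(leaders):
            if similarity(obj, leader) >= threshold:
                clusters[index].append(obj)
                break
        else:  # no existing leader was similar enough
            leaders.append(obj)
            clusters.append([obj])
    return clusters

# Purchase locations of Johny, Andy, and Post from the extended data set example.
locations = [{"Mall", "Grocery", "Warehouse"}, {"Mall", "Grocery"}, {"Mall"}]
clusters = leader_clustering(locations, jaccard, threshold=0.5)
print(clusters)
```

The result of the leader algorithm depends on the order in which objects are presented, which is a known property of single-pass methods.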

Database Clustering Environment

[Figure: the database clustering environment. Components: User Interface, Data Extraction Tool, Similarity Measure Tool, Clustering Tool, DBMS, library of clustering algorithms, library of similarity measures. Data exchanged: extended data set, similarity measure, type and weight information, default choice and domain information. Output: a set of clusters.]

A More Detailed Tool Architecture

[Figure: detailed tool architecture. Components: DBMS, preprocessor, data translator, our data mining tools, other data mining tools. Data exchanged: query, query result, extended data set, flat file, processed data. User's interests and objectives, supplied through a join form: database name, relationship definitions, data set of interest, selected attributes, other information.]

A Join Template Form

Begin-spec
Database-name: DB;
Link-definitions: Link-list;
Begin-join
Dataset-of-interest: Dsetintrest;
Selected-attributes: Attr-list;
Objective-attributes: Obj-attr-list;
Extended-data-set: E;
End-join
End-spec

An Example of the Interface of the Extended Data Set Generation Tool
Begin-spec
  DB-name: Company
  Link-definitions:
    superv(Employee.ssn, Employee.superssn),
    husband(Employee.ssn, Marriage.hssn),
    wife(Employee.ssn, Marriage.wssn),
    ehusband(Marriage.hssn, Employee.ssn),
    ewife(Marriage.wssn, Employee.ssn),
    works_on(Employee.ssn, Works_on.essn),
    project(Works_on.pno, Project.pnum),
    works_for(Employee.dno, Department.dnum),
    works_loc(Department.dnum, Dept_loc.dnum)
  Begin-join
    Dataset-of-interest: Employee
    Selected-attributes: ssn, sex, salary,
      superv.salary, wife.ewife.salary,
      works_on.hours, works_on.project.pname,
      works_for.works_loc.dloc
    Objective-attributes: ssn
    Output-data-set: E1
  End-join
End-spec

Algorithm to Generate Extended Data Sets

Project the Data Set of Interest by Primary key and Selected Attributes
Join the Data Set of Interest and related data sets to get all related attributes for each join-path
Group attributes together that describe the same object
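
The project, join, and group steps above can be sketched in Python over in-memory relations. Everything concrete here is an assumption for illustration: the sample rows are invented, only two of the Company link definitions are modelled, and the names `follow` and `extended_data_set` are hypothetical rather than the tool's actual interface.

```python
# Hypothetical sample rows; relation and attribute names follow the Company schema.
db = {
    "Employee": [{"ssn": 1, "sex": "M", "salary": 50000},
                 {"ssn": 2, "sex": "F", "salary": 60000}],
    "Works_on": [{"essn": 1, "pno": 10, "hours": 20},
                 {"essn": 1, "pno": 20, "hours": 15},
                 {"essn": 2, "pno": 10, "hours": 40}],
    "Project": [{"pnum": 10, "pname": "X"}, {"pnum": 20, "pname": "Y"}],
}

# Link name -> (source attribute, target relation, target attribute).
links = {
    "works_on": ("ssn", "Works_on", "essn"),
    "project": ("pno", "Project", "pnum"),
}

def follow(rows, path):
    """Join: follow each link on the join path and return all reachable rows."""
    for link in path:
        source_attr, target, target_attr = links[link]
        rows = [t for r in rows for t in db[target] if t[target_attr] == r[source_attr]]
    return rows

def extended_data_set(interest, key, selected):
    """Build one record per object of interest, with related values grouped into bags."""
    result = []
    for obj in db[interest]:
        record = {key: obj[key]}
        for spec in selected:
            *path, attr = spec.split(".")
            if not path:   # project: attribute of the data set of interest itself
                record[spec] = obj[attr]
            else:          # join along the path, then group the values into a bag
                record[spec] = [r[attr] for r in follow([obj], path)]
        result.append(record)
    return result

E1 = extended_data_set(
    "Employee", "ssn",
    ["sex", "salary", "works_on.hours", "works_on.project.pname"],
)
print(E1[0])
```

The dotted attribute names mirror the join-path notation used in the specification example, such as works_on.project.pname.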

Summary Representation
Our approach uses database queries as the summary representation language.
Queries that compute the objects belonging to a cluster and no other objects are considered to be perfect summaries for that cluster.
An example query for a cluster:
  SELECT ssn, name, address
  FROM person, purchase
  WHERE (amountspent > 1000) AND
        (paymenttype = 'cash') AND
        (storename = 'fleamarket')
Typically, members of this cluster have spent more than $1,000 in cash shopping at a flea market.
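
The notion of a perfect summary can be checked mechanically: run the query and compare its result with the cluster's members. The sketch below uses Python's standard sqlite3 module with an invented single-table layout and invented rows; the helper name `is_perfect_summary` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchase (ssn TEXT, amountspent REAL, paymenttype TEXT, storename TEXT)"
)
# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?, ?, ?)",
    [
        ("111", 1500, "cash", "fleamarket"),
        ("222", 1200, "cash", "fleamarket"),
        ("333", 2000, "credit", "mall"),
    ],
)

def is_perfect_summary(query, cluster):
    """True when the query returns exactly the members of the cluster and nothing else."""
    returned = {row[0] for row in conn.execute(query)}
    return returned == set(cluster)

cluster = {"111", "222"}
exact_query = (
    "SELECT ssn FROM purchase WHERE amountspent > 1000 "
    "AND paymenttype = 'cash' AND storename = 'fleamarket'"
)
loose_query = "SELECT ssn FROM purchase WHERE amountspent > 1000"

print(is_perfect_summary(exact_query, cluster))  # matches the cluster exactly
print(is_perfect_summary(loose_query, cluster))  # also returns a non-member
```

A query that is too general returns objects outside the cluster, and one that is too specific misses members; only an exact match counts as perfect under the definition above.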

Summary and Contributions
Discussed the data model discrepancy between the database storage format and the input data format of traditional clustering algorithms
Discussed the problems of dealing with relationship information in database clustering
Presented a different way of representing related information using extended data sets
Introduced the design and architecture of an automatic tool to generate extended data sets from databases
Generalized the traditional similarity measures and presented a framework to cope with extended data sets in similarity-based clustering

Architecture of MASSON
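[Figure: architecture of MASSON. Inputs through the user interface: object set (user input) and schema information (system input). The clustering module produces clusters g1, g2, ..., gk. The GP-based discovery system consists of a GP engine, a KB of domain knowledge (user input), and a DBMS interface to the DB; labeled operations are generate, apply, select, evaluate, and return, acting on a query set and its query results. Output: the discovered query set.]
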


Evolution Process

Initial generation -> generation 2 -> ... -> generation n

Each generation is evolved into the next through selection, crossover, and mutation:
Initial population: q11, q12, ..., q1m
Generation 2 (evolved population): q21, q22, ..., q2m
Generation n (evolved population): qn1, qn2, ..., qnm
The final evolved population yields the solution Q.

n: number of generations
m: the size of the population
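
The generational loop can be illustrated with a deliberately simplified sketch. It is a plain genetic algorithm over two numeric thresholds rather than genetic programming over query trees, and the data, the fitness function, the rates, and all names are assumptions made for this example, not the system described in the talk.

```python
import random

# Hypothetical objects (amount, age) and the cluster the evolved query should describe.
objects = [(1500, 30), (1200, 45), (2000, 25), (300, 50), (800, 35), (100, 20)]
cluster = {(1500, 30), (1200, 45), (2000, 25)}

def fitness(query):
    """Count objects on which 'amount > t_amount AND age > t_age' agrees with the cluster."""
    t_amount, t_age = query
    return sum(
        ((amount > t_amount and age > t_age) == ((amount, age) in cluster))
        for amount, age in objects
    )

def evolve(n=20, m=10, seed=0):
    """Run n generations over a population of m threshold queries."""
    rng = random.Random(seed)
    population = [(rng.uniform(0, 2000), rng.uniform(0, 60)) for _ in range(m)]
    history = []
    for _ in range(n):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: m // 2]           # selection: keep the fitter half
        children = [ranked[0]]               # elitism: the best query always survives
        while len(children) < m:
            p1, p2 = rng.sample(parents, 2)
            child = (p1[0], p2[1])           # crossover: one threshold from each parent
            if rng.random() < 0.3:           # mutation: perturb the amount threshold
                child = (child[0] + rng.gauss(0, 100), child[1])
            children.append(child)
        population = children
    return max(population, key=fitness), history

best, history = evolve()
print(best, fitness(best), history)
```

Because the best individual is copied into each new generation, the best fitness recorded in `history` never decreases from one generation to the next.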