Database Clustering and Summary Generation: Tae-Wan Ryu and Christoph F. Eick

Similarity Measures for Multi-valued Attributes for Database Clustering
Tae-wan Ryu and Christoph F. Eick
Department of Computer Science
University of Houston
Talk Organization
Database Clustering
Problems of Database Clustering
Extended Data Sets
Similarity Measures for Sets and Bags
An Architecture for Database Clustering
Summary and Conclusion
General KDD Steps
Data sources -> [Select/preprocess] -> Selected/Preprocessed data -> [Transform] -> Transformed data -> [Data mine] -> Extracted information -> [Interpret/Evaluate/Assimilate] -> Knowledge
Data preparation
Research Goal
To develop methodologies, techniques, and tools to create summaries from databases using cluster analysis and genetic programming
Our approach
Partition the database into groups of similar objects using cluster
analysis
Find commonalities that objects belonging to each group share
using genetic programming
Database Summary Generation Steps and Example
< Steps >                                   < Example >
Database                                    Restaurant database
Database Clustering
Clusters (groups of similar objects)        Young, White collar, Retired
Summary Generation
Summaries describing the commonalities      Midnight, Dinner, Lunch
within each group
An Example Schema Diagram
Relations
Marriage(hssn, wssn, mdate, numkid)
Employee(name, ssn, address, sex, salary, superssn, dno)
Department(dnum, dname)
Works_on(essn, pno, hours)
Project(pname, pnum, ploc, dnum)
Dept_loc(dnum, dloc)
Relationships (cardinality ratio)
husband (1:n), wife (1:n), ehusband (n:1), ewife (n:1)
superv (n:1), works_for (n:1), works_loc (1:n)
control (1:n), works_on (1:n), project (n:1)


Preprocessing for Database Clustering
Preparing input data sets for clustering

Appropriate data selection and preparation from a database is an important task
Key Problems
How to support a user's viewpoint, including attribute selection
Data model discrepancy between storage format and the input
format that clustering algorithms assume
How to cope with structural information, especially 1:n and n:m
relationships
Input Format for Data Mining Algorithms
Data Format for Input Data Sets

Single flat file format (basically, the data set has to be
stored as a single(!) relation)
Complex and structured formats
Problem: Almost all existing data mining and clustering approaches assume that the input data set is in a single flat file format.
An Example Database to Illustrate the Problems with Relationship Information in Database Clustering
Person
ssn        name   age  sex
111111111  Johny  43   M
222222222  Andy   21   F
333333333  Post   67   M
444444444  Jenny  35   F

Purchase
ssn        location   ptype  amount  date
111111111  Mall       1      400     02-10-96
111111111  Grocery    2      70      05-14-96
111111111  Warehouse  3      200     12-24-96
222222222  Mall       2      300     12-23-96
222222222  Grocery    3      100     06-22-96
333333333  Mall       1      30      11-05-96
(a)

Joined result
name   age  sex  ptype  amount  location
Johny  43   M    1      400     Mall
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Warehouse
Andy   21   F    2      300     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null
(b)

ptype (payment type): 1 for cash, 2 for credit, and 3 for check. The cardinality ratio between Person and Purchase is 1:n.
(a) An example Person and Purchase relational database. (b) The table obtained by joining the Person and Purchase relations.
Existing Approaches
Applying aggregate functions or generalization operators to convert a multi-valued attribute into a single-valued attribute.
Problems
User has to make a critical decision (e.g., which aggregate
function to use?)
Valuable related information may be lost.
Extended Data Sets
name   age  sex  ptype  amount  location
Johny  43   M    1      400     Mall
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Warehouse
Andy   21   F    2      100     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall

name   age  sex  p.ptype  p.amount      p.location
Johny  43   M    {1,2,3}  {400,70,200}  {Mall, Grocery, Warehouse}
Andy   21   F    {2,3}    {100,100}     {Mall, Grocery}
Post   67   M    1        30            Mall
A converted table with bags of values
How to measure similarity between bags of values?

Group similarity measures are needed.
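The conversion from the joined table to the table with bags of values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `to_extended` and the choice of plain lists as bags are hypothetical and not part of the presented tool.

```python
from collections import defaultdict

# Rows of the joined table: (name, age, sex, ptype, amount, location).
joined_rows = [
    ("Johny", 43, "M", 1, 400, "Mall"),
    ("Johny", 43, "M", 2, 70, "Grocery"),
    ("Johny", 43, "M", 3, 200, "Warehouse"),
    ("Andy", 21, "F", 2, 100, "Mall"),
    ("Andy", 21, "F", 3, 100, "Grocery"),
    ("Post", 67, "M", 1, 30, "Mall"),
]


def to_extended(rows, key_len=3):
    """Group rows by their first key_len fields and collect the remaining
    fields into one bag (a list that keeps duplicates) per attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[:key_len]].append(row[key_len:])
    extended = []
    for key, related in groups.items():
        # zip(*related) regroups the related tuples column by column.
        bags = tuple(list(column) for column in zip(*related))
        extended.append(key + bags)
    return extended


extended = to_extended(joined_rows)
for row in extended:
    print(row)
```

Lists are used instead of Python sets so that repeated values, such as Andy's two purchases of 100, are preserved, which is what distinguishes a bag from a set.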
Approaches for Database Clustering
<Current approach>
Structured database -> Manual transformation -> Flat file -> Clustering algorithms
<Proposed approach>
Structured database -> Automated preprocessing -> Extended data set -> Generalized clustering algorithms
Related Work
LABYRINTH (Thompson et al.)

Ketterlin's extended COBWEB
KATE (Manago et al.)
SUBDUE (Holder et al.)
INLEN (Ribeiro et al.)
KBG (Bisson et al.), KLUSTER (Kietz et al.)

Research Objectives for Database Clustering
To alleviate the representational gap between databases on the one hand and input formats of clustering algorithms on the other hand
To design and implement semi-automatic tools to facilitate
database clustering
To generalize clustering algorithms
Generating Extended Data Sets From a Structured Database
Database (d1, d2, ..., dn) and the user's interests and objectives -> Extended data set generator -> Extended data set 1
A Unified Similarity Measure for Clustering Extended Data Sets
Group Similarity Measures
Mixed Types: qualitative, quantitative types.
Qualitative type: Tversky's set-theoretical similarity models.
Contrast model
$S(a,b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A)$,
where a and b are two objects, A and B denote their sets of features, for some $\theta, \alpha, \beta \geq 0$; f is the cardinality of the set
Ratio model (e.g., normalized similarity)
$S(a,b) = f(A \cap B) / [f(A \cap B) + \alpha f(A - B) + \beta f(B - A)]$, $\alpha, \beta \geq 0$
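A minimal Python sketch of the contrast and ratio models follows, taking f to be set cardinality as stated above. The function names and the default weights of 1.0 are assumptions made for illustration; with both weights equal to 1, the ratio model reduces to the Jaccard coefficient.

```python
def contrast_similarity(A, B, theta=1.0, alpha=1.0, beta=1.0):
    """Tversky's contrast model with f = cardinality.
    Common features add to the score, distinctive features subtract."""
    A, B = set(A), set(B)
    return theta * len(A & B) - alpha * len(A - B) - beta * len(B - A)


def ratio_similarity(A, B, alpha=1.0, beta=1.0):
    """Tversky's ratio model with f = cardinality; the result lies in [0, 1]."""
    A, B = set(A), set(B)
    common = len(A & B)
    denominator = common + alpha * len(A - B) + beta * len(B - A)
    # Assumption: two empty feature sets are treated as identical.
    return common / denominator if denominator else 1.0


johny_locations = {"Mall", "Grocery", "Warehouse"}
andy_locations = {"Mall", "Grocery"}
print(contrast_similarity(johny_locations, andy_locations))  # 2 - 1 - 0 = 1.0
print(ratio_similarity(johny_locations, andy_locations))     # 2 / 3
```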
Group Similarity Measures... continued
Quantitative type: group average

Group average between group A and B
$d(A,B) = \sum_{i=1}^{n} d(a,b)_i / n$,
where n is the total number of object pairs and $d(a,b)_i$ is the dissimilarity measure for the ith pair of objects a and b, with $a \in A$, $b \in B$.
The group average is the mean of the inter-object measures over all pairs of objects in which the two objects belong to different groups.
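The group average can be sketched as follows. The absolute difference used as the default inter-object dissimilarity is an assumption chosen for illustration; any dissimilarity function on pairs of values could be passed instead.

```python
from itertools import product


def group_average(A, B, d=lambda a, b: abs(a - b)):
    """Mean dissimilarity over all pairs (a, b) with a in A and b in B."""
    pairs = list(product(A, B))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(d(a, b) for a, b in pairs) / len(pairs)


johny_amounts = [400, 70, 200]
andy_amounts = [100, 100]
# Six pairs with differences 300, 300, 30, 30, 100, 100, so the mean is 860 / 6.
print(group_average(johny_amounts, andy_amounts))
```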
A Framework for Mixed Type Similarity Measures for Extended Data Sets
Gower's similarity measure for data sets with mixed types.
$S(a,b) = \sum_{i=1}^{m} w_i s_i(a_i, b_i) / \sum_{i=1}^{m} w_i$
Extended similarity measure for multi-valued data sets with mixed types.
$S(a,b) = [\sum_{i=1}^{l} w_i s_l(a_i, b_i) + \sum_{j=1}^{q} w_j s_q(a_j, b_j)] / (\sum_{i=1}^{l} w_i + \sum_{j=1}^{q} w_j)$
where m = l + q. The functions $s_l(a,b)$ and $s_q(a,b)$ are similarity functions for qualitative attributes and quantitative attributes, respectively.
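The weighted combination can be illustrated with a short Python sketch. Several choices here are assumptions rather than the authors' definitions: the qualitative similarity is the ratio model with both weights set to 1, the quantitative similarity is one minus the group average divided by a user-supplied value range, and the names `extended_similarity` and `attrs` are hypothetical.

```python
from itertools import product


def s_qualitative(A, B):
    """Ratio model with alpha = beta = 1, i.e. |A ∩ B| / |A ∪ B|."""
    A, B = set(A), set(B)
    union = A | B
    return len(A & B) / len(union) if union else 1.0


def s_quantitative(A, B, value_range):
    """Assumed conversion: 1 - (group average of |x - y|) / value_range.
    Both bags must be non-empty and value_range must bound the differences."""
    pairs = list(product(A, B))
    average = sum(abs(x - y) for x, y in pairs) / len(pairs)
    return 1.0 - average / value_range


def extended_similarity(a, b, attrs):
    """Weighted mean of per-attribute similarities.
    attrs holds one (kind, weight, value_range) triple per attribute."""
    weighted_sum = total_weight = 0.0
    for (kind, weight, value_range), x, y in zip(attrs, a, b):
        if kind == "qual":
            s = s_qualitative(x, y)
        else:
            s = s_quantitative(x, y, value_range)
        weighted_sum += weight * s
        total_weight += weight
    return weighted_sum / total_weight


# Attributes: p.ptype (qualitative), p.amount (quantitative), p.location (qualitative).
attrs = [("qual", 1.0, None), ("quant", 1.0, 400.0), ("qual", 1.0, None)]
johny = ([1, 2, 3], [400, 70, 200], ["Mall", "Grocery", "Warehouse"])
andy = ([2, 3], [100, 100], ["Mall", "Grocery"])
similarity = extended_similarity(johny, andy, attrs)
print(round(similarity, 4))
```

Single-valued attributes such as age fit the same scheme when wrapped in a one-element bag.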
Clustering Algorithms for Extended Data Sets
Nearest-neighbor clustering
DBSCAN
Leader algorithm
Hierarchical clustering
Database Clustering Environment
Figure: database clustering environment
Components: DBMS, Data Extraction Tool, User Interface, Similarity Measure Tool, Clustering Tool, library of clustering algorithms, library of similarity measures
Labeled data: extended data set, similarity measure, a set of clusters, type and weight information, default choice and domain information
A More Detailed Tool Architecture
Figure: detailed tool architecture
Components and data: DBMS, query, query result, pre-processor, extended data set, data translator, processed flat file data, our data mining tools, other data mining tools
User's interests and objectives (query form): database name, join form, relationship definitions, data set of interest, selected attributes, other information
A Join Template Form
Begin-spec
Database-name: DB;
Link-definitions: Link-list;
Begin-join
Dataset-of-interest: Dsetintrest;
Selected-attributes: Attr-list;
Objective-attributes: Obj-attr-list;
Extended-data-set: E;
End-join
End-spec
An Example of the Interface of the Extended Data Set Generation Tool
Begin-spec
DB-name: Company
Link-definitions:
superv(Employee.ssn, Employee.superssn),
husband(Employee.ssn, Marriage.hssn),
wife(Employee.ssn,Marriage.wssn),
ehusband(Marriage.hssn, Employee.ssn),
ewife(Marriage.wssn, Employee.ssn),
works_on(Employee.ssn, Works_on.essn),
project(Works_on.pno, Project.pnum),
works_for(Employee.dno, Department.dnum),
works_loc(Department.dnum, Dept_loc.dnum)
Begin-join
Dataset-of-interest: Employee
Selected-attributes: ssn, sex, salary,
superv.salary, wife.ewife.salary,
works_on.hours, works_on.project.pname,
works_for.works_loc.dloc
Objective-attributes: ssn
Output-data-set: E1
End-join
End-spec
Algorithm to Generate Extended Data Sets
Project the Data Set of Interest by Primary key and Selected Attributes
Join the Data Set of Interest and related data sets to get all related attributes for each join-path
Group attributes together that describe the same object
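The three steps above can be sketched with Python's built-in `sqlite3` module. This is a simplified illustration, not the authors' tool: it follows only the works_on join path from the example specification, the rows are made-up sample data, and the function name `generate_extended` is hypothetical.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (ssn TEXT PRIMARY KEY, sex TEXT, salary INTEGER);
    CREATE TABLE Works_on (essn TEXT, pno INTEGER, hours REAL);
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO Employee VALUES ('1', 'M', 30000), ('2', 'F', 40000), ('3', 'M', 25000);
    INSERT INTO Works_on VALUES ('1', 10, 20.0), ('1', 20, 15.0), ('2', 10, 40.0);
""")


def generate_extended(conn):
    # Step 1: project the data set of interest by primary key and selected attributes.
    base = conn.execute(
        "SELECT ssn, sex, salary FROM Employee ORDER BY ssn").fetchall()
    # Step 2: join along the path works_on(Employee.ssn, Works_on.essn).
    joined = conn.execute(
        "SELECT e.ssn, w.hours FROM Employee e "
        "JOIN Works_on w ON e.ssn = w.essn ORDER BY e.ssn, w.pno")
    # Step 3: group the related values that describe the same object into a bag.
    hours = defaultdict(list)
    for ssn, value in joined:
        hours[ssn].append(value)
    return [(ssn, sex, salary, hours.get(ssn, [])) for ssn, sex, salary in base]


extended_set = generate_extended(conn)
for row in extended_set:
    print(row)
```

An employee with no related Works_on rows receives an empty bag here rather than a null, which is one possible design choice for objects without related information.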
Summary Representation
Our approach uses database queries as our summary
representation language.
Queries that compute the objects belonging to a cluster and no
other objects are considered to be perfect summaries for a cluster.
An example query for a cluster
(SELECT ssn, name, address
FROM person, purchase
WHERE (amount_spent > 1000) and
(payment_type = 'cash') and
(store_name = 'flea market'))
Typically, members of the cluster have spent more than $1,000 in cash shopping at a flea market
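Whether a query is a perfect summary can be checked by comparing its result with the cluster members. The sketch below makes several assumptions: the table layout, the column names, and the sample rows are invented for illustration, and the precision and recall values are an added way to grade imperfect summaries that the slides do not prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (ssn TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE purchase (ssn TEXT, amount_spent INTEGER,
                           payment_type TEXT, store_name TEXT);
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO person VALUES ('1', 'Ann'), ('2', 'Bob'), ('3', 'Cho');
    INSERT INTO purchase VALUES
        ('1', 1500, 'cash', 'flea market'),
        ('2', 1200, 'cash', 'flea market'),
        ('3', 900, 'credit', 'mall');
""")


def summary_quality(conn, query, cluster):
    """Return (precision, recall, is_perfect) for a query whose first
    result column is the object key. A perfect summary selects exactly
    the cluster members and no other objects."""
    selected = {row[0] for row in conn.execute(query)}
    hits = len(selected & cluster)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(cluster) if cluster else 0.0
    return precision, recall, selected == cluster


cluster = {"1", "2"}
perfect_query = (
    "SELECT DISTINCT p.ssn FROM person p JOIN purchase u ON p.ssn = u.ssn "
    "WHERE u.amount_spent > 1000 AND u.payment_type = 'cash' "
    "AND u.store_name = 'flea market'")
loose_query = (
    "SELECT DISTINCT p.ssn FROM person p JOIN purchase u ON p.ssn = u.ssn "
    "WHERE u.amount_spent > 500")
print(summary_quality(conn, perfect_query, cluster))
print(summary_quality(conn, loose_query, cluster))
```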
Summary and Contributions
Discussed the data model discrepancy between database storage
format and input data format for traditional clustering algorithms
Discussed the problems of dealing with relationship information in
database clustering
Presented a different way of representing related information
using extended data sets
Introduced the design and architecture of an automatic tool to
generate extended data sets from databases
Generalized the traditional similarity measures and presented a
framework to cope with extended data sets in similarity-based
clustering
Architecture of MASSON
Figure: architecture of MASSON
Inputs: object set and schema information (system input), user input through the user interface
Clustering module: produces clusters g1, g2, ..., gk
GP based discovery system: GP engine, KB (domain knowledge), query set, DBMS interface, DB
Operations shown: generate, apply, select, evaluate, return (query result)
Output: discovered query set

Evolution Process
Initial generation -> generation 2 -> ... -> generation n
Initial population: q11, q12, ..., q1m
Evolved population (generation 2): q21, q22, ..., q2m
Evolved population (generation n): qn1, qn2, ..., qnm
Each generation evolves into the next through selection, crossover, and mutation; the final generation yields the solution Q.
n: number of generations
m: the size of the population
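The generational loop can be illustrated with a toy Python sketch. It is deliberately simpler than genetic programming: each individual is a fixed-shape query "amount > t and ptype = p" rather than an evolving query tree, selection is plain truncation with the best individual carried over, and the objects, cluster, and fitness function (overlap between the selected objects and the cluster) are all hypothetical.

```python
import random

# Hypothetical objects (id, amount, ptype); the target cluster holds ids 0..3.
values = [(1500, 1), (1200, 1), (1100, 1), (2000, 1),
          (900, 1), (1300, 2), (300, 3), (50, 1)]
objects = [(i, amount, ptype) for i, (amount, ptype) in enumerate(values)]
cluster = {0, 1, 2, 3}


def select(query):
    """Objects satisfying the query (t, p): amount > t and ptype == p."""
    t, p = query
    return {i for i, amount, ptype in objects if amount > t and ptype == p}


def fitness(query):
    """Overlap between selected objects and the cluster; 1.0 is a perfect summary."""
    chosen = select(query)
    return len(chosen & cluster) / len(chosen | cluster)


def evolve(n=30, m=20, seed=0):
    """Run n generations with a population of m queries."""
    rng = random.Random(seed)
    population = [(rng.randrange(0, 2001, 100), rng.choice([1, 2, 3]))
                  for _ in range(m)]
    history = []
    for _ in range(n):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: m // 2]           # selection (truncation)
        children = [population[0]]               # keep the best query unchanged
        while len(children) < m:
            a, b = rng.sample(parents, 2)
            child = (a[0], b[1])                 # crossover: threshold from a, ptype from b
            if rng.random() < 0.2:               # mutation: shift the threshold
                child = (max(0, child[0] + rng.choice([-100, 100])), child[1])
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, fitness(best), history


best_query, best_score, history = evolve()
print(best_query, best_score)
```

Because the best query of each generation is copied into the next, the best fitness never decreases from one generation to the next.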