Attribution Non-Commercial (BY-NC)

686 views

Attribution Non-Commercial (BY-NC)

- DecTree Supplement
- Data Mining
- Data Mining: Concepts and Techniques
- Data Mining
- Introduction to Database
- IIJIT-2014-04-03-021
- Process for Predicting Earthquakes Through Data Mining
- Powerpoint presentation on somlething
- Op Research Notes
- Buffer Management in Distributed Database Systems-A Data Mining-Based Approach
- Chapter 6 - Supervised Machine Learning—Classification
- Efficient Algorithm for Mining Frequent Patterns Java Project
- Max Closed Patterns
- Data Mining - Lecture 1
- AI-chapter6_机器学习_-_决策树_0[1].3
- remotesensing-08-00666
- A New Model for Profitable Pattern Mining
- IJMER-47020105.pdf
- Improved FP Tree_Report
- AP 26261267

You are on page 1of 57

on Parallel and Distributed Computing and Networks (PDCN’98)

14 December 1998

Copyright c 1998

ACSys

– ANU, CSIRO, (Digital), Fujitsu, Sun, SGI

– Five programs: one is Data Mining

– Aim to work with collaborators to solve real problems and

feed research problems to the scientists

– Brings together expertise in Machine Learning, Statistics,

Numerical Algorithms, Databases, Virtual Environments

1

ACSys

About Us

Machine Learning

Numerical Methods

Numerical Methods

2

ACSys

Outline

– History

– Motivation

– Link Analysis: Association Rules

– Predictive Modeling: Classification

– Predictive Modeling: Regression

– Data Base Segmentation: Clustering

3

ACSys

knowledge from large datasets.

– Extremely large datasets

– Discovery of the non-obvious

– Useful knowledge that can improve processes

– Can not be done manually

data visualisation of very large databases at a high level of

abstraction, without a specific hypothesis in mind.

4

ACSys

Database

Machine

Learning

Visualisation

Performance Applied

Statistics

Computers

Parallel Pattern

Algorithms Recognition

5

ACSys

– data warehousing,

– data selection,

– data preprocessing,

– data transformation,

– data mining,

– interpretation/evaluation

6

ACSys

7

ACSys

Link Analysis

association rules, sequential patterns, time sequences

Predictive Modelling

tree induction, neural nets, regression

Database Segmentation

clustering, k-means,

Deviation Detection

visualisation, statistics

8

ACSys

9

ACSys

Sales/Marketing

– Provide better customer service

– Improve cross-selling opportunities (beer and nappies)

– Increase direct mail response rates

Customer Retention

– Identify patterns of defection

– Predict likely defections

– Identify inappropriate or unusual behaviour

10

ACSys

Mt Stromlo Observatory

11

ACSys

Some Research

Virtual Environments

Feature Selection

12

ACSys

Outline

– History

– Motivation

– Link Analysis: Association Rules

– Predictive Modeling: Classification

– Predictive Modeling: Regression

– Data Base Segmentation: Clustering

13

ACSys

– Customers becoming more demanding

– Markets are saturated

Drivers

– Focus on the customer, competition, and data assets

Enablers

– Increased data hoarding

– Cheaper and faster hardware

14

ACSys

Research Community

– KDD Workshops 1989, 1991, 1993, 1994

– KDD Conference annually since 1995

– KDD Journal since 1997

– ACM SIGKDD http://www.acm.org/sigkdd

Commercially

– Research: IBM, Amex, NAB, AT&T, HIC, NRMA

– Services: ACSys, IBM, MIP, NCR, Magnify

– Tools: TMC, IBM, ISL, SGI, SAS, Magnify

15

ACSys

Outline

– History

– Motivation

– Link Analysis: Association Rules

– Predictive Modeling: Classification

– Predictive Modeling: Regression

– Data Base Segmentation: Clustering

16

ACSys

– Offers many challenging problems

– Enormous databases now exist and readily available

– Statistics limited computationally

– Relevance of statistics if we do not sample

– There are not enough statisticians to go around!

– Limited computationally, useful on toy problems, but . . .

17

ACSys

– More than 1,000,000 entities/records/rows

– From 10 to 10,000 fields/attributes/variables

– Giga-bytes and tera-bytes

– Decisions must be made rapidly

– Decisions must be made with maximum knowledge

18

ACSys

– Add value to the data holding

– Competitive advantage

– More effective decision making

– Support high level and long term decision making

– Fundamental move in use of Databases

19

ACSys

Information overload

Internet navigation

20

ACSys

Outline

– History

– Motivation

– Predictive Modeling: Classification

– Predictive Modeling: Regression

– Data Base Segmentation: Clustering

21

ACSys

Link Analysis

links between individuals rather than characterising whole

use observations to learn to predict

partition data into similar groups

22

ACSys

Outline

– History

– Motivation

– Predictive Modeling: Classification

– Predictive Modeling: Regression

– Data Base Segmentation: Clustering

23

ACSys

– Given

A dataset of customer transactions

A transaction is a collection of items

– Find

Correlations between items as rules

Examples

– Supermarket baskets

– Attached mailing in direct marketing

24

ACSys

– IF x and y THEN z with confidence c

if x and y are in the basket, then so is z in c% of cases

– IF x and y THEN z with support s

the rule holds in s% of all transactions

25

ACSys

Example

Transaction Items

12345 ABC

12346 AC

12347 AD

12348 BEF

Input Parameters: confidence = 50%; support = 50%

26

ACSys

Typical Application

Millions of transactions

27

ACSys

12345 ABC A 75%

12346 AC B 50%

12347 AD C 50%

12348 BEF A; C 50%

Rule A ) C

s = s(A; C ) = 50%

28

ACSys

Algorithm Outline

– sets of items with at least minimum support

– Apriori and AprioriTid and newer algorithms

– For ABCD and AB in large itemset the rule AB ) CD

holds if ratio s(ABCD)/s(AB) is large enough

– This ratio is the confidence of the rule

29

ACSys

HIC Example

– 6.8 million records X 120 attributes (3.5GB)

– 15 months preprocessing then 2 weeks data mining

– cmin = 50% and smin = 1%, 0.5%, 0.25%

(1% of 6.8 million = 68,000)

– Unexpected/unnecessary combination of services

– Refuse cover saves $550,000 per year

30

ACSys

Psuedo Algorithm

(2) for (k = 2; Fk,1 6= ;; k + +) do begin

(3) Ck = apriori gen(Fk,1)

(4) for all transactions t 2 T

(5) subset(Ck; t)

(6) Fk = fC 2 Ck jc.count minsupg

(7) end S

(8) Answer = Fk

31

ACSys

Candidates and produces local support counts for local

transactions

operation

candidates (large support), but unable to deal with large

number of candidates (small support).

[Agrawal & Shafer 96]

32

ACSys

candidates. Need to move transaction data between

processors via all to all communication

not as good as Count Distribution Algorithm for large

transaction data size

33

ACSys

point to point

candidate itemsets fall below a given threshold

processors so that each processor gets itemsets that begin

only with a subset of all possible items

34

ACSys

Data Distribution Algorithm

Distribution Algorithm

[Han, Karypis & Kumar 97]

35

ACSys

Outline

– History

– Motivation

– Link Analysis: Association Rules

– Predictive Modeling: Classification

– Predictive Modeling: Regression

– Data Base Segmentation: Clustering

36

ACSys

of past decisions that can be used to make decisions for

unseen cases.

37

ACSys

Classification: C5.0

Started as Decision Tree Induction, now Rule Induction, also

Salford Systems http://www.salford-systems.com

38

ACSys

partitioning algorithm

Basic motivation:

– A dataset contains a certain amount of information

– A random dataset has high entropy

– Work towards reducing the amount of entropy in the data

– Alternatively, increase the amount of information exhibited

by the data

39

ACSys

Algorithm

40

ACSys

Algorithm

Select best S* in S

– true: form a leaf node

– false: recurse with Ci as new training set

41

ACSys

Discriminating Descriptions

categorical attributes

– define a disjoint cell for each possible value: sex = “male”

– can be grouped: transport 2 (car, bike)

continuous attributes

– define many possible binary partitions

– Split A < 24 and A 24

– Or split A < 28 and A 28

42

ACSys

Information Measure

of the dataset

P

The information content is mj =1 ,pj log(pj )

– pj is the probability of making a particular decision

– there are m possible decisions

P

Same as entropy: mj =1 pj log(1=pj ).

43

ACSys

Information Measure

P

info(T ) = mj =1 ,pj log(pj )

is the amount of information

needed to identify class of an object in T

P

Expected information requirement is then the weighted sum:

infox(T ) = ni=1 jjTTijj info(Ti)

44

ACSys

Information Measure

which maximises information gain

avoid various biases: Gini Index of Diversity

45

ACSys

End Result

A

b c

E Y

<63 >=63

Y N

46

ACSys

Types of Parallelism

same time

attributes in a node are distibuted among processors

for a single attribute is distributed between processors

47

ACSys

Example: ScalParC

Data Structures

– Attribute Lists: separate lists of all attributes, distributed

across processors

– Node Table: Stores node information for each record id

– Count Matrices: stored for each attribute, for all nodes at

a given level

48

ACSys

(2) do while (there are nodes requiring splitting at current level)

(3) Compute count matrices

(4) Compute best index for nodes requiring split

(5) Partition splitting attributes and update node table

(6) Partition non-splitting attributes

(7) end do

49

ACSys

Pruning

reflects the data

called overfitting

generalising a decision tree

50

ACSys

Error-Based Pruning

– prune subtrees which result in lower predicted error rate

51

ACSys

Pruning

– Error rate on training set (resubstitution error) not useful

because pruning will always increase error

– Two common techniques are cost-complexity pruning and

reduced-error pruning

weighted sum of complexity and error on training set—the

test cases used to determine weighting

directly

52

ACSys

Issues

significant changes in decision trees

understandable (thousands of nodes)

53

ACSys

Classification Rules

A

b c

E Y

<63 >=63

Y N

– A=c)Y

– A = b ^ E < 63 ) Y

– A = b ^ E 63 ) N

– Rule Pruning: Perhaps E 63 ) N

54

ACSys

Pros

– Greedy Search = Fast Execution

– High dimensionality not a problem

– Selects important variables

– Creates symbolic descriptions

Cons

– Search space is huge

– Interaction terms not considered

– Parallel axis tests only (A = v )

55

ACSys

Recent Research

Bagging

– Sample with resubstitution from training set

– Build multiple decision trees from different samples

– Use a voting method to classify new objects

Boosting

– Build multiple trees from all training data

– Maintain a weight for each instance in the training set that

reflects its importance

– Use a voting method to classify new objects

56

- DecTree SupplementUploaded byss
- Data MiningUploaded byRiteshgs
- Data Mining: Concepts and TechniquesUploaded bymahendirana
- Data MiningUploaded byani7890
- Introduction to DatabaseUploaded byKhaled Ismaiil
- IIJIT-2014-04-03-021Uploaded byAnonymous vQrJlEN
- Process for Predicting Earthquakes Through Data MiningUploaded byAnusha Saranam
- Powerpoint presentation on somlethingUploaded byAshik Ahmed
- Op Research NotesUploaded bysamaresh chhotray
- Buffer Management in Distributed Database Systems-A Data Mining-Based ApproachUploaded bybadday0202
- Chapter 6 - Supervised Machine Learning—ClassificationUploaded byAhsan Iqbal
- Efficient Algorithm for Mining Frequent Patterns Java ProjectUploaded byKavya Sree
- Max Closed PatternsUploaded byAJ Sharma
- Data Mining - Lecture 1Uploaded byAnkush Jindal
- AI-chapter6_机器学习_-_决策树_0[1].3Uploaded bydavidwang8512
- remotesensing-08-00666Uploaded bymk@metack.com
- A New Model for Profitable Pattern MiningUploaded byInnovative Research Publications
- IJMER-47020105.pdfUploaded byIJMER
- Improved FP Tree_ReportUploaded byChetan Ghodke
- AP 26261267Uploaded byAnonymous 7VPPkWS8O
- drUploaded byapi-356030343
- DataStreams-14Uploaded byharsha
- Doc in Data MiningUploaded byMahadev Aarathri
- Data MiningUploaded bySeafood Ng Tsz Foon
- UNSW Master of Data ScienceUploaded bynanaa02
- A Comprehensive Study of Major Techniques of Multi Level Frequent Pattern Mining- A SurveyUploaded byesatjournals
- IT444_Mid1 [MA]Uploaded byMB
- lesson5-PatternMiningUploaded byreski86
- data (1)Uploaded byAmit Vashisth
- Unit 7finalUploaded bymack11mack

- LufthansaUploaded byPrudhvinadh Kopparapu
- KLP-1000_Uploaded byLuis Augusto Rueda
- NX7 for Engineers and DesignersUploaded byDreamtech Press
- Python Easy Script_ManualUploaded bymarpoma
- The Network Topology of the Interbank Market (M. Boss, H. Elsinger, M. Summer, S. Thurner)Uploaded byAlizada Huseynov
- Pattern SagaUploaded byAnonymous Gm4HXi9I9V
- 4. 360 PAUploaded byJalaludeen Mohamed Akbar
- Ee2401 Power System Operation and Control l t p cUploaded byMarishaak
- IBM DS6000 _Storage_ManagerUploaded byaksmsaid
- Bearing Point - Bringing Product Life Cycle Management Into FocusUploaded byMehmet Subasi
- 129474215 Cics Interview Questions Rpgle Dml Rpg As400 Mainframe DeveloperUploaded byKankani Tejal
- Prediction of Tides Using Neural Networks at Karwar, West Coast of IndiaUploaded bySEP-Publisher
- Dieu Khien Toc Do Dong CoUploaded byhoanglongvnu
- htkbookUploaded byvpalazon
- River Basin Simulation in WEAPUploaded byAlfredo Hijar
- Prog. Theor. Exp. Phys. 2016 Itou EntanglementUploaded byjohn_k7408
- The 74HC164 Shift Register and your ArduinoUploaded bygeniunet
- Seculabs eBook - SQL Injection Error Based - ManuallyUploaded byRifqi Multazam
- 1997-02 HP JournalUploaded byElizabeth Williams
- CS2251-QBUploaded bySelva612
- 593.Mobile Programming Notes-Unit-1 Oct 2017Uploaded byBeatrice Mutapi
- TroubleshootingUploaded bytarafinou3312
- Introduction to NLPUploaded byFerooz Khan
- FEA REPORT ON 1D ANALYSIS OF STEPBARUploaded byKaran Pannu
- Automated Detection of White Matter Lesions in MRI Brain Images Using Spatiofuzzy and Spatio-Possibilistic Clustering ModelsUploaded bycseij
- Yamaha Song Filer ManualUploaded byaligator98
- Projetos SAPUploaded bycarlosnucci
- Financial_Secretary_Job_DescriptionUploaded bystjohnscongchurch
- APC _ EventUploaded byVicthor Condor
- Faci FrontgvhbjnUploaded byJielyn San Miguel