
Universitas Indonesia

2012

Data Mining

**More data is generated:**
- Bank, telecom, other business transactions ...
- Scientific data: astronomy, biology, etc.
- Web, text, and e-commerce

**More data is captured:**
- Storage technology is faster and cheaper
- DBMS are capable of handling bigger databases

We have large data stored in one or more databases. We are starving to find new information within those data (for research usage, competitive edge, etc). We want to identify patterns or rules (trends and relationships) in those data. We know that certain data exist inside a database, but what are the consequences of that data's existence?

There is often information "hidden" in the data that is not readily evident. Human analysts may take weeks to discover useful information. Much of the data is never analyzed at all.

[Chart: "The Data Gap" — total new disk (TB) since 1995 grows steeply from about 500,000 to 3,500,000 over 1995–1999, while the number of analysts stays flat.]

From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications".

Data mining: discovering interesting patterns from large amounts of data (Han and Kamber, 2001). Data mining is a process of extracting previously unknown, valid and actionable information from large databases and then using the information to make crucial business decisions (Cabena et al., 1998).

Definition from [Connolly, 2005]: the process of extracting valid, previously unknown, comprehensive, and actionable information from large databases and using it to make crucial business decisions. The thin red line of data mining: it is all about finding patterns or rules by extracting data from large databases in order to find new information that could lead to new knowledge.

What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about "Amazon"

What is Data Mining?
– Certain names are more prevalent in certain US locations (O'Brien, O'Rurke, O'Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon.com, Amazon rainforest)

Data Mining in the Knowledge Discovery Process

[Figure: Raw Data → (Selection & Cleaning) → Target Data → (Integration) → Data Warehouse → (Transformation) → Transformed Data → (Data Mining) → Patterns and Rules → (Interpretation & Evaluation) → Knowledge. Understanding increases along the process.]

Data mining methods:
- Clustering
- Classification
- Association rules
- Other methods: outlier detection, sequential patterns, prediction, trends and analysis of changes, and methods for special data types (e.g. spatial data mining, web mining, …)

Association rules try to find associations between items in a set of transactions. For example, in the case of associations between items bought by customers in a supermarket: 90% of transactions that purchase bread and butter also purchase milk.
- Antecedent: bread and butter
- Consequent: milk
- Confidence factor: 90%

A transaction is a set of items: T = {ia, ib, …, it}, T ⊆ I, where I is the set of all possible items {i1, i2, …, in}. D, the task-relevant data, is a set of transactions (a database of transactions).
Example: items sold by a supermarket (I, the itemset): {sugar, parsley, onion, tomato, salt, bread, olives, cheese, butter, …}
A transaction by a customer: T = {sugar, onion, butter}
A database: D = {T1 = {salt, onion, bread}, T2 = {sugar, tomato, olives}, T3 = {bread}, T4 = {cheese, olives, onion, salt}, T5 = {tomato}, …}

An association rule is of the form P ⇒ Q, where P ⊆ I, Q ⊆ I, and P ∩ Q = ∅. Examples: {bread} ⇒ {butter, cheese}; {onion, tomato} ⇒ {salt}.

Support of a rule P ⇒ Q: sD(P ⇒ Q) = sD(P ∪ Q), the percentage of transactions in D containing both P and Q (the number of transactions containing P and Q divided by the cardinality of D).
Confidence of a rule P ⇒ Q: cD(P ⇒ Q) = sD(P ∪ Q) / sD(P), the percentage of transactions that contain both P and Q within the subset of transactions that already contain P.

Thresholds:
- minimum support: minsup
- minimum confidence: minconf
Frequent itemset P: the support of P is larger than the minimum support.
Strong rule P ⇒ Q (c%): (P ∪ Q) is frequent and c is larger than the minimum confidence.

Transaction ID   Items Bought
1000             A, B, C
2000             A, C
4000             A, D
5000             B, E, F

Min. support 50%, min. confidence 50%.

Frequent itemsets: {A} 75%, {B} 50%, {C} 50%, {A, C} 50%

For rule {A} ⇒ {C}:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

For rule {C} ⇒ {A}:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({C}) = 100.0%
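The support and confidence figures above can be recomputed mechanically. Below is a minimal Python sketch (the variable and function names are mine, not from the slides) that encodes the four transactions and checks both rules:

```python
# The four transactions from the example above.
D = {
    1000: {"A", "B", "C"},
    2000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions in D that contain every item of `itemset`."""
    return sum(itemset <= t for t in D.values()) / len(D)

def confidence(p, q):
    """Confidence of the rule P => Q: support(P union Q) / support(P)."""
    return support(p | q) / support(p)

print(support({"A", "C"}))        # 0.5      -> {A, C} has 50% support
print(confidence({"A"}, {"C"}))   # 0.666... -> rule {A} => {C}
print(confidence({"C"}, {"A"}))   # 1.0      -> rule {C} => {A}
```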

Input: a database of transactions. Each transaction is a list of items (e.g., items purchased by a customer in a visit).
Find all strong rules that associate the presence of one set of items with that of another set of items. Example: 98% of people who purchase tires and auto accessories also get automotive services done.
There are no restrictions on the number of items in the head or body of the rule.
The most famous algorithm is APRIORI.

Find the frequent itemsets: the sets of items that have minimum support.
A subset of a frequent itemset must also be a frequent itemset; i.e., if {A B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.
Source: [Sunysb, 2009]

Consider a database D consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 = 22%) and the minimum confidence required is 70%. We first find the frequent itemsets using the Apriori algorithm; then the association rules are generated using min. support and min. confidence.

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Step 1: Generating the 1-itemset frequent pattern. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. Scan D for the count of each candidate, then compare each candidate's support count with the minimum support count:

C1 itemset   Sup. count        L1 itemset   Sup. count
{I1}         6                 {I1}         6
{I2}         7                 {I2}         7
{I3}         6                 {I3}         6
{I4}         2                 {I4}         2
{I5}         2                 {I5}         2

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here all of them qualify.

Step 2: Generating the 2-itemset frequent pattern. Generate candidates C2 from L1, scan D for the count of each candidate, and compare the candidate support counts with the minimum support count:

C2 itemset   Sup. count        L2 itemset   Sup. count
{I1, I2}     4                 {I1, I2}     4
{I1, I3}     4                 {I1, I3}     4
{I1, I4}     1                 {I1, I5}     2
{I1, I5}     2                 {I2, I3}     4
{I2, I3}     4                 {I2, I4}     2
{I2, I4}     2                 {I2, I5}     2
{I2, I5}     2
{I3, I4}     0
{I3, I5}     1
{I4, I5}     0

To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the middle table). The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support. Note: we haven't used the Apriori property yet.

Step 3: Generating the 3-itemset frequent pattern. Generate candidates C3 from L2, scan D for the count of each candidate, and compare with the minimum support count:

C3 itemset      Sup. count        L3 itemset      Sup. count
{I1, I2, I3}    2                 {I1, I2, I3}    2
{I1, I2, I5}    2                 {I1, I2, I5}    2

**The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.**

In order to find C3, we compute L2 Join L2.

C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent.

For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} & {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.

Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}. But {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3.

Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the Join operation for pruning.

Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

Step 4: Generating the 4-itemset frequent pattern. The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent. Thus C4 = φ, and the algorithm terminates, having found all of the frequent itemsets.
What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence). This completes our Apriori algorithm.
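The level-wise loop traced in Steps 1–4 can be sketched in a few lines of Python. This is an illustrative implementation (function and variable names are mine, not from the slides), run on the 9-transaction database with a minimum support count of 2:

```python
from itertools import combinations

# The 9-transaction database from the worked example.
DB = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def apriori(db, min_sup):
    """Return the levels L1, L2, ... of frequent itemsets."""
    def count(itemset):
        return sum(itemset <= t for t in db)

    items = sorted({i for t in db for i in t})
    levels = [{frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}]
    k = 2
    while levels[-1]:
        prev = levels[-1]
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        cand = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        cand = {c for c in cand
                if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        levels.append({c for c in cand if count(c) >= min_sup})
        k += 1
    return [lev for lev in levels if lev]

L = apriori(DB, min_sup=2)
print([len(lev) for lev in L])   # [5, 6, 2]: the sizes of L1, L2, L3 above
```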

Step 5: Generating association rules from frequent itemsets. Procedure: for each frequent itemset I, generate all nonempty subsets of I. For every nonempty subset s of I, output the rule "s ⇒ (I − s)" if support_count(I) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

In our example we had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I1, I2, I3}, {I1, I2, I5}}.
Let's take I = {I1, I2, I5}. All its nonempty subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.

Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.
R1: {I1, I2} ⇒ {I5}
  Confidence = sc{I1, I2, I5} / sc{I1, I2} = 2/4 = 50%. R1 is rejected.
R2: {I1, I5} ⇒ {I2}
  Confidence = sc{I1, I2, I5} / sc{I1, I5} = 2/2 = 100%. R2 is selected.
R3: {I2, I5} ⇒ {I1}
  Confidence = sc{I1, I2, I5} / sc{I2, I5} = 2/2 = 100%. R3 is selected.

R4: {I1} ⇒ {I2, I5}
  Confidence = sc{I1, I2, I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: {I2} ⇒ {I1, I5}
  Confidence = sc{I1, I2, I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: {I5} ⇒ {I1, I2}
  Confidence = sc{I1, I2, I5} / sc{I5} = 2/2 = 100%. R6 is selected.
In this way, we have found three strong association rules.
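Step 5 for the itemset {I1, I2, I5} can be replayed with a short sketch (again assuming the 9-transaction database of the worked example; the helper names are mine):

```python
from itertools import combinations

# The 9-transaction database from the worked example.
DB = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def sc(itemset):
    """Support count: number of transactions containing `itemset`."""
    return sum(itemset <= t for t in DB)

I = frozenset({"I1", "I2", "I5"})
MIN_CONF = 0.70

strong = []
for r in range(1, len(I)):                       # nonempty proper subsets s
    for s in map(frozenset, combinations(sorted(I), r)):
        conf = sc(I) / sc(s)                     # confidence of s => (I - s)
        if conf >= MIN_CONF:
            strong.append((sorted(s), sorted(I - s), conf))

for lhs, rhs, conf in strong:
    print(lhs, "=>", rhs, f"{conf:.0%}")
print(len(strong))                               # 3 strong rules (R2, R3, R6)
```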

Learn a method for predicting the instance class from pre-labeled (classified) instances. Many approaches: statistics, decision trees, neural networks, ...

Prepare a collection of records (the training set). Each record contains a set of attributes; usually, one of the attributes is the class. Find a model for the class attribute as a function of the values of the other attributes (decision tree, neural network, etc.). Prepare a test set to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. Once happy with the accuracy, use your model to classify new instances.
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

[Figure: a training set (Tid 1–10 with categorical attributes Refund and Marital Status, continuous attribute Taxable Income, and class attribute Cheat) is fed to a learner to induce a model; the model is then applied to a test set whose Cheat values are unknown.]
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Training data:

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: a decision tree with splitting attributes:
Refund? Yes → NO; No → MarSt
MarSt? Married → NO; Single, Divorced → TaxInc
TaxInc? < 80K → NO; > 80K → YES

An alternative tree for the same training data:
MarSt? Married → NO; Single, Divorced → Refund
Refund? Yes → NO; No → TaxInc
TaxInc? < 80K → NO; > 80K → YES
There could be more than one tree that fits the same data!

Applying the model to test data. Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree: Refund is No, so follow the No branch to MarSt; Marital Status is Married, so the record is assigned class NO.
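The walk from the root can be written directly as nested conditions. A small sketch encoding the first tree above (the function name and record layout are mine, not from the slides):

```python
def classify(record):
    """Apply the decision tree: Refund -> MarSt -> TaxInc (income in K)."""
    if record["Refund"] == "Yes":
        return "No"                   # Refund = Yes            -> class NO
    if record["MarSt"] == "Married":
        return "No"                   # Married                 -> class NO
    return "No" if record["TaxInc"] < 80 else "Yes"   # Single/Divorced split

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))          # No -- the record stops at the MarSt node
```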

Direct marketing. Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. Approach:
▪ Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
▪ Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
▪ Use this information as input attributes to learn a classifier model.
From [Berry & Linoff], Data Mining Techniques, 1997.

Fraud detection. Goal: predict fraudulent cases in credit card transactions. Approach:
▪ Use credit card transactions and the information on the account-holder as attributes: when does the customer buy, what does he buy, how often he pays on time, etc.
▪ Label past transactions as fraud or fair transactions. This forms the class attribute.
▪ Learn a model for the class of the transactions.
▪ Use this model to detect fraud by observing credit card transactions on an account.
From: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Customer attrition/churn. Goal: predict whether a customer is likely to be lost to a competitor. Approach:
▪ Use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
▪ Label the customers as loyal or disloyal.
▪ Find a model for loyalty.
From [Berry & Linoff], Data Mining Techniques, 1997.

Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters. A cluster is a collection of data objects that are "similar" to one another and thus can be treated collectively as one group. Clustering helps users understand the natural grouping or structure in a data set. Clustering is unsupervised classification: there are no predefined classes.

Find “natural” grouping of instances given un-labeled data .

A good clustering method will produce high quality clusters in which:
- the intra-class similarity (that is, within a cluster) is high;
- the inter-class similarity (that is, between clusters) is low.
The quality of a clustering result also depends on both the similarity measure used by the method and its implementation, and on the definition and representation of cluster chosen. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Partitioning algorithms: construct various partitions and then evaluate them by some criterion. Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion; there is an agglomerative approach and a divisive approach.

Partitioning method: given a number k, partition a database D of n objects into a set of k clusters so that a chosen objective function is minimized (e.g., the sum of distances to the centers of the clusters). Finding the global optimum by exhaustively enumerating all partitions is too expensive! Instead, heuristic methods based on iterative refinement of an initial partition are used.
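k-means is the standard example of such iterative refinement. The sketch below is a toy 1-D version (the function name and data are mine, not from the slides) that alternates an assignment step and an update step:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy 1-D k-means: iteratively refine k cluster centers."""
    centers = random.Random(seed).sample(points, k)   # initial guess
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
print(kmeans(data, k=2))   # two centers, near 1.0 and 8.0
```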

Hierarchical decomposition of the data set (with respect to a given similarity measure) into a set of nested clusters. The result is represented by a so-called dendrogram. Nodes in the dendrogram represent possible clusters. It can be constructed bottom-up (agglomerative approach) or top-down (divisive approach). A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Single link: cluster similarity = similarity of the two most similar members.
+ Fast
− Potentially long and skinny clusters

Single link example. Initial distance matrix:

     1   2   3   4   5
1    0
2    2   0
3    6   3   0
4   10   9   7   0
5    9   8   5   4   0

The closest pair is (1,2) at distance 2; merge them and update:

d((1,2),3) = min{d(1,3), d(2,3)} = min{6, 3} = 3
d((1,2),4) = min{d(1,4), d(2,4)} = min{10, 9} = 9
d((1,2),5) = min{d(1,5), d(2,5)} = min{9, 8} = 8

       (1,2)   3   4   5
(1,2)    0
3        3     0
4        9     7   0
5        8     5   4   0

Next, (1,2) and 3 are closest (distance 3); merge and update:

d((1,2,3),4) = min{d((1,2),4), d(3,4)} = min{9, 7} = 7
d((1,2,3),5) = min{d((1,2),5), d(3,5)} = min{8, 5} = 5

         (1,2,3)   4   5
(1,2,3)     0
4           7      0
5           5      4   0

Now 4 and 5 are closest (distance 4); merge them, and finally:

d((1,2,3),(4,5)) = min{d((1,2,3),4), d((1,2,3),5)} = min{7, 5} = 5

The last merge joins (1,2,3) and (4,5) at distance 5, so the single-link merge heights are 2, 3, 4, 5.

Complete link: cluster similarity = similarity of the two least similar members.
+ Tight clusters
− Slow

Complete link example, on the same initial distance matrix. The closest pair is again (1,2) at distance 2; merge and update, now with the maximum:

d((1,2),3) = max{d(1,3), d(2,3)} = max{6, 3} = 6
d((1,2),4) = max{d(1,4), d(2,4)} = max{10, 9} = 10
d((1,2),5) = max{d(1,5), d(2,5)} = max{9, 8} = 9

       (1,2)   3    4   5
(1,2)    0
3        6     0
4       10     7    0
5        9     5    4   0

Now 4 and 5 are closest (distance 4); merge and update:

d(3,(4,5)) = max{d(3,4), d(3,5)} = max{7, 5} = 7
d((1,2),(4,5)) = max{d(1,4), d(1,5), d(2,4), d(2,5)} = max{10, 9} = 10

       (1,2)   3   (4,5)
(1,2)    0
3        6     0
(4,5)   10     7     0

Next, (1,2) and 3 are closest (distance 6); merge them, and finally:

d((1,2,3),(4,5)) = max{d((1,2),(4,5)), d(3,(4,5))} = max{10, 7} = 10

The last merge joins (1,2,3) and (4,5) at distance 10, so the complete-link merge heights are 2, 4, 6, 10.
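Both traces can be reproduced with one generic agglomerative loop in which only the linkage function changes: min for single link, max for complete link. A hedged sketch (names are mine) using the 5-point distance matrix above:

```python
DIST = {  # the symmetric distance matrix used in both examples
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 3, (2, 4): 9, (2, 5): 8,
    (3, 4): 7, (3, 5): 5,
    (4, 5): 4,
}

def d(a, b):
    return DIST[min(a, b), max(a, b)]

def merge_heights(linkage):
    """Agglomerate 5 singleton clusters; linkage is min or max."""
    clusters = [frozenset([i]) for i in range(1, 6)]
    heights = []
    while len(clusters) > 1:
        # Cluster distance = linkage over all cross-cluster point pairs.
        def cdist(i, j):
            return linkage(d(a, b) for a in clusters[i] for b in clusters[j])
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cdist(*ij))
        heights.append(cdist(i, j))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return heights

print(merge_heights(min))   # [2, 3, 4, 5]  single link, as in the first trace
print(merge_heights(max))   # [2, 4, 6, 10] complete link, as in the second
```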

Dendrogram: hierarchical clustering. A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

- Understanding the data
- Data cleaning: missing values, noisy values, outliers, dates, nominal/numeric
- Discretization
- Normalization
- Smoothing
- Transformation
- Attribute selection

What data is available? What available data is actually relevant or useful? Can the data be enriched from other sources? Are there historical datasets available? Who is the real expert to ask questions of? (Are the results at all sensible? Or are they completely obvious?) Answers to these questions before embarking on a data mining project are invaluable later on. You can't be expected to be an expert in all fields, but understanding the data can be extremely useful for data mining.

Number of instances available: 5000 or more for reliable results. Number of attributes: depends on the data set, but any attribute with fewer than 10 instances is typically not worth including. Number of instances per class: more than 100 per class; if very unbalanced, consider stratified sampling.

Goal: maximizing data quality. Assess data quality and correct noted deficiencies. To assess data quality:
- Analyze the data distribution: is there any strange data distribution?
- Analyze data elements: check for inconsistencies, redundancy, missing values, outliers, etc.
- Conduct a physical audit: ensure the data is recorded properly, for example by cross-checking the data with the customer.
- Analyze business rules: check whether the data violates business rules.

Options for handling missing values:
- Exclude the attribute for which data is frequently missing
- Exclude records that have missing data
- Extrapolate missing values from other known values
- Use a predictive model to determine a value
- For quantitative values, use a generic figure, such as the average
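The last option is easy to show concretely. A minimal sketch (the column and its values are mine, not from the slides) that fills missing quantitative values with the attribute's average:

```python
# A column with two missing values (None marks "missing").
ages = [23, 35, None, 41, None, 29]

known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)      # (23 + 35 + 41 + 29) / 4 = 32.0

filled = [mean_age if a is None else a for a in ages]
print(filled)                           # [23, 35, 32.0, 41, 32.0, 29]
```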

We want all dates to be the same. YYYY-MM-DD is an ISO standard, with or without the time. BUT it has some issues for data mining:
- Year 10,000 AD! (we only have 4 digits)
- Dates BC[E]: e.g. -0300-02-03 is not a valid YYYY-MM-DD date
- Most importantly: it does not preserve intervals
Other representations: Posix/Unix system date (number of seconds since 1970), etc.

Nominal data is data without ordering, e.g.: Sex, Country, etc. Some algorithms can't deal with nominal (or numeric) attributes: e.g., decision trees deal best with nominal attributes, but neural networks and many clustering algorithms require only numeric attribute values. In case the algorithm requires converting nominal to numeric:
- Binary field: one value is 0, the other value is 1 (e.g. gender)
- Ordered fields: convert to numbers to preserve order (e.g. an A vs a C grade becomes 4 and 2 respectively)
- Few values: convert each value into a new binary attribute. For example, if the possible values for attribute AT are A, B, C, D, then you can create 4 new attributes ATa, ATb, ATc, ATd, each with value either 0 or 1
- Many values: convert into groups of values, each with its own (binary) attribute, e.g. group the states in the US into 5 groups of 10
- Unique values: ignore identifier-like attributes (drop the attribute)
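The "few values" case is the familiar one-of-N (one-hot) encoding. A sketch for the attribute AT above (the helper name is mine, not from the slides):

```python
VALUES = ["A", "B", "C", "D"]   # possible values of attribute AT

def one_hot(v):
    """Replace AT by four binary attributes ATa, ATb, ATc, ATd."""
    return {"AT" + u.lower(): int(v == u) for u in VALUES}

print(one_hot("C"))   # {'ATa': 0, 'ATb': 0, 'ATc': 1, 'ATd': 0}
```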

Some algorithms require nominal or discrete values. How can we turn a numeric value into a nominal value, or into a numeric value with a smaller range of values? This is often called 'binning'. Several discretization techniques exist:
- Equal width
- Equal depth
- Class-dependent
- Entropy
- Fuzzy (allow some fuzziness as to the edges of the bin)
- Non-disjoint (allow overlapping intervals)
- ChiMerge (use the chi-squared test in the same way as entropy)
- Iterative (use some technique, then minimise classifier error)
- Lazy (only discretize during classification: VERY lazy!)
- Proportional k-interval (number of bins = square root of the number of instances)
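Equal width, the first technique listed, is the simplest to sketch: split the observed range into k equally wide intervals (the function name and data below are mine, not from the slides):

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 over k equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp so the maximum lands in the last bin rather than bin k.
    return [min(int((v - lo) / width), k - 1) for v in values]

data = [1, 3, 5, 7, 9, 11, 13]
print(equal_width_bins(data, k=3))   # [0, 0, 1, 1, 2, 2, 2]
```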

We might want to normalise our data such that two numeric values are comparable, for example to compare age and income.
- Decimal scaling: v' = v / 10^k for the smallest k such that max(|v'|) < 1. E.g.: for -991 and 99, k is 3, and -991 becomes -0.991.
- Min/max normalisation: v' = (v - min)/(max - min) * (newMax - newMin) + newMin. E.g.: 73600 in [12000, 98000] scaled to [0, 1] is 0.716.
- Zero-mean normalization: v' = (v - mean)/stddev. E.g.: with mean 54000 and stddev 16000, v = 73600 gives v' = 1.225.
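The three formulas can be checked against the slide's numbers directly; a minimal sketch (the function names are mine):

```python
def decimal_scale(v, k):
    """v' = v / 10^k, for the smallest k with max(|v'|) < 1."""
    return v / 10 ** k

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """v' = (v - min)/(max - min) * (newMax - newMin) + newMin."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, stddev):
    """v' = (v - mean) / stddev."""
    return (v - mean) / stddev

print(decimal_scale(-991, 3))                    # -0.991
print(round(min_max(73600, 12000, 98000), 3))    # 0.716
print(z_score(73600, 54000, 16000))              # 1.225
```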

In the case of noisy data, we might want to smooth the data so that it is more uniform. Some possible techniques:
- Regression: find a function for the data, and move each value some amount closer to what the function predicts (see classification).
- Clustering: some clustering techniques remove outliers; we could cluster the data to remove these noisy values.
- Binning: we could use some technique to discretize the data, and then smooth based on those 'bins'.

Transform data to a more meaningful form. For example:
- Birth date is transformed to age
- The date of the first transaction is transformed to the number of days since the customer became a member
- The grades of each course are transformed to a cumulative GPA

Before getting to the data mining, we may want to either remove instances or select only a portion of the complete data set to work with. Why? Perhaps our algorithms don't scale well to the amount of data we have. Techniques:
Record selection:
▪ Partitioning: split the database into sections and work with each in turn. Often not appropriate unless the algorithm is designed for it.
▪ Sampling: select a random subset of the data, which is hopefully representative.
Attribute selection:
▪ Stepwise forward selection: find the best attribute and add it.
▪ Stepwise backward elimination: find the worst attribute and remove it.
▪ Genetic algorithms: use a 'survival of the fittest' along with a random cross-breeding approach.
▪ etc.

What technique would you use to solve each of these problems?
- Given a set of applicant attributes (name, age, salary, etc.), you want to decide whether or not to approve a customer's credit card application.
- Given national examination scores, you want to group Kabupatens into three educational levels: Good, Average, Poor.
- You want to suggest a suitable pair of pants to your customer given his/her choice of shirt.
- You want to estimate the economic growth of Indonesia given some data (GNP, GDP, etc.).

Pak Bedu is a doctor who wants to use IT to help him decide whether or not a patient has cancer. To support that decision, Pak Bedu already has data on every patient, covering the results of 5 laboratory tests and the decision of whether the patient had cancer. Here is a sample of the data:

ID   T1   T2   T3   T4   T5   Cancer?
P1   1    3    4    2    3    Yes
P2   0    5    1    1         No
P3   1    2    4    2    2    No
P4   2    2    3    1    2    Yes
P5   2    2    3    1    2    No

What problems can you find in Pak Bedu's data?

When Pak Bedu examines a single attribute (say T1), he finds 5 distinct values with the following counts:
1 with count 1234
2 with count 2037
3 with count 1659
4 with count 1901
11 with count 1
What can you conclude from looking at this data?

Pak Bedu has expanded his clinic business to 3 locations, leaving the management of patient data entirely to each clinic. Pak Bedu wants to learn the characteristics of his patients by collecting data from all three clinics. The problem is that Pak Bedu is confused, because each clinic has a different data schema. What should Pak Bedu do?
Clinic 1 schema: Pasien(Nama, TglLahir, JenisKelamin (L/P), Berat(kg), Tinggi (meter), Alamat, Kota, Provinsi)
Clinic 2 schema: Pasien(Nama, Umur, JenisKelamin (P/W), Berat(kg), Tinggi (centimeter), Alamat, Kota)
Clinic 3 schema: Pasien(Nama, TglLahir, JenisKelamin (L/P), Berat(kg), Tinggi (meter), Alamat, Kota, Provinsi)

It turns out Pak Bedu also runs a health college (Sekolah Tinggi Kesehatan). Being a generous person, Pak Bedu wants to give scholarships to his students who are working on their final thesis (skripsi). However, he wants to select students who have the potential to finish the thesis within one semester. Pak Bedu has the grade data of students who have already graduated, along with how long it took them to complete their studies. The schemas Pak Bedu has include:
Tabel Mahasiswa(NPM, Nama, AsalDaerah, Umur, LamaStudi, LamaSkripsi)
Tabel MataKuliah(Kode, Nama, Kelompok)
Tabel Nilai(NPM, KodeMK, Nilai)
Discuss what you could do to help Pak Bedu.

Pak Bedu has expanded his clinics to 100 branches. He has data about each clinic and the total number of patient visits every month. Pak Bedu wants to see patterns in patient visits: when many people get sick, which regions have many sick people, and so on. What can you do?
