
UNESCO courses: Module on Knowledge Discovery and Data Mining

Prof. Ho Tu Bao Prof. Bach Hung Khang

Institute of Information Technology
Japan Advanced Institute of Science and Technology

Outline of the presentation
• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion

This presentation summarizes the content and organization of the lectures in the module “Knowledge Discovery and Data Mining”.

Objectives
This course provides:
• fundamental techniques of knowledge discovery and data mining (KDD)
• issues in KDD practical use and tools
• case-studies of KDD application

Prerequisite for the course
Nothing special, but the following are expected:
• experience using computers
• basics of databases and statistics
• programming skills (for advanced levels)

Content of the course
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge


Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge

Lecture 1: Overview of KDD
1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD

KDD: A Definition
KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
• 10^6 to 10^12 bytes: we never see the whole data set or put it in the memory of computers
• Data mining algorithms?
• What knowledge? How to represent and use it?

Data, Information, Knowledge
We often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily.
Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data.
Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”.
Knowledge can be considered data at a high level of abstraction and generalization.

From Data to Knowledge
Medical data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes
[Figure: sample patient records, with callouts marking a numerical attribute, a categorical attribute, missing values (shown as "?") and class labels (such as VIRUS and BACTERIA)]
IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15
THEN Prediction = VIRUS [87.5%]
[confidence, predictive accuracy]
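To make the IF ... THEN rule on this slide concrete, here is a minimal sketch of how such a rule could be applied to records. The attribute names (cell_poly, Risk, Loc_dat, Nausea) are the ones that appear in the rule; the two records are hypothetical and are not taken from the medical data set.

```python
# Sketch only: the rule's conditions are copied from the slide,
# but the records below are invented for illustration.

def rule_predicts_virus(record):
    """Return True when every condition in the IF part of the rule holds."""
    return (
        record["cell_poly"] <= 220
        and record["Risk"] == "n"
        and record["Loc_dat"] == "+"
        and record["Nausea"] > 15
    )

records = [
    {"cell_poly": 150, "Risk": "n", "Loc_dat": "+", "Nausea": 20},  # hypothetical
    {"cell_poly": 900, "Risk": "n", "Loc_dat": "-", "Nausea": 3},   # hypothetical
]

# The rule only says something about records it covers; others get None.
predictions = ["VIRUS" if rule_predicts_virus(r) else None for r in records]
print(predictions)
```

A record that fails any one condition is simply not covered by this rule; a full rule set would need further rules, or a default class, to label it.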

Data Rich Knowledge Poor
People gathered and stored so much data because they think some valuable assets are implicitly coded within it. Raw data is rarely of direct benefit. Its true value depends on the ability to extract information useful for decision support.
How to acquire knowledge for knowledge-based systems remains the main difficult and crucial problem.
[Figure labels: ?, knowledge base, inference engine]
• Tradition: via knowledge engineers
• New trend: via automatic programs
Manual data analysis: impractical

Benefits of Knowledge Discovery
[Figure: value plotted against volume, with stages EDP, MIS and DSS and labels Generate, Disseminate and Rapid Response]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems


The KDD process
The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, Smyth, 1996)
• non-trivial process: Multiple process
• valid: Justified patterns/models
• novel: Previously unknown
• useful: Can be used
• understandable: by human and machine

The Knowledge Discovery Process
1. Understand the domain and define problems
2. Collect and preprocess data
3. Data mining: extract patterns/models
4. Interpret and evaluate discovered knowledge
5. Put the results to practical use
Data mining: a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.
KDD is inherently interactive and iterative.

The KDD Process
[Flow diagram with stages marked 1 to 5; labels grouped in the order they appear]
• Data organized by function; create/select target database; data warehousing
• Select sampling technique and sample data; supply missing values; eliminate noisy data
• Find important attributes & value ranges; normalize values; transform values; create derived attributes
• Select DM task(s); select DM method(s); extract knowledge; test knowledge; refine knowledge
• Transform to different representation; query & report generation; aggregation & sequences; advanced methods
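Two of the preprocessing steps named on this slide, "supply missing values" and "normalize values", can be sketched as follows. This is only one possible choice for each step (mean imputation and min-max scaling are assumptions, the slide does not prescribe a method), and the sample column is hypothetical.

```python
# Sketch only: mean imputation and min-max scaling are illustrative choices,
# not methods specified by the lecture.

def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw = [10.0, None, 30.0, 20.0]          # hypothetical attribute column
filled = fill_missing(raw)              # missing entry becomes 20.0
normalized = min_max_normalize(filled)  # values now lie between 0 and 1
print(filled, normalized)
```

Other strategies (for example, filling with the most frequent value for categorical attributes) follow the same pattern of one small function per step.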

Main Contributing Areas of KDD
• Statistics: infer info from data (deduction & induction, mainly numeric data)
• Databases: store, access, search, update data (deduction) [data warehouses: integrated data] [OLAP: On-Line Analytical Processing]
• Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)


Potential Applications
Business information
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.
Manufacturing information
- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.
Scientific information
- Sky survey cataloging
- Biosequence Databases
- Geosciences: Quakefinder
- etc.
Personal information

KDD: Opportunity and Challenges
[Figure: forces converging on KDD]
• Competitive Pressure
• Data Rich Knowledge Poor (the resource)
• Data Mining Technology Mature
• Enabling Technology (Interactive MIS, OLAP, parallel computing, Web, etc.)

KDD: A New and Fast Growing Area
• KDD workshops: 1989, 1991, 1993, 1994
• International conferences: KDD’95, 96, 97, 98, 99 (USA); PAKDD’97, 98, 99 (Asia); PKDD’97, 98, 99 (Europe); PAKDD’00 (Kyoto, 2000.4.18-20, deadline 99.10.10)
• Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
• 80% of the Fortune 500 companies are currently involved in data mining pilot projects or using data mining systems.
• JAPAN: FGCS Project (logic programming and reasoning; recently more attention on knowledge acquisition and machine learning). Interests in KDD: Special Issue on KDD of JSAI, July 1997.
“Knowledge Discovery is the most desirable end-product of computing” (Wiederhold, Stanford Univ.)


Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Regression: maps a data item to a real-valued prediction variable.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Dependency Modeling: finding a model which describes significant dependencies between variables.
• Deviation and change detection: discovering the most significant changes in the data.
• Summarization: finding a compact description for a subset of data.
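As an illustration of the regression task listed above (mapping a data item to a real-valued prediction variable), here is a minimal sketch that fits a straight line by ordinary least squares. The choice of a linear model and the sample points are assumptions made for the example, not part of the lecture.

```python
# Sketch only: simple one-variable least squares on hypothetical data.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Map a data item x to its real-valued prediction."""
    return slope * x + intercept

xs = [1.0, 2.0, 3.0, 4.0]  # hypothetical inputs
ys = [3.0, 5.0, 7.0, 9.0]  # hypothetical targets, generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(5.0, slope, intercept))
```

Classification differs only in the kind of output: a class label from a predefined set instead of a real number.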

Classification
“What factors determine cancerous cells?”
Examples (Cancerous Cell Data) -> Data Mining Algorithm (Classification Algorithm) -> General patterns
• Rule Induction
• Decision tree
• Neural Network

Classification: Rule Induction
“What factors determine a cell is cancerous?”
If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
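The two rules on this slide can be written down as data and applied to cells, as in the sketch below. The rule conditions come from the slide. The `certainty` function is an assumption about what the percentage means (the share of covered examples that carry the predicted class), and the labeled examples are hypothetical.

```python
# Sketch only: rule conditions are from the slide; the certainty definition
# and the example cells are assumptions made for illustration.

RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy"),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous"),
]

def matches(conditions, cell):
    """True when the cell satisfies every attribute = value condition."""
    return all(cell.get(attr) == value for attr, value in conditions.items())

def classify(cell):
    """Return the class of the first matching rule, or None if no rule fires."""
    for conditions, label in RULES:
        if matches(conditions, cell):
            return label
    return None

def certainty(conditions, label, examples):
    """Fraction of the examples covered by the rule that have the rule's class."""
    covered = [c for c in examples if matches(conditions, c)]
    if not covered:
        return None
    return sum(1 for c in covered if c["label"] == label) / len(covered)

examples = [  # hypothetical labeled cells
    {"color": "dark", "tails": 2, "nuclei": 2, "label": "cancerous"},
    {"color": "dark", "tails": 2, "nuclei": 2, "label": "cancerous"},
    {"color": "dark", "tails": 2, "nuclei": 2, "label": "cancerous"},
    {"color": "dark", "tails": 2, "nuclei": 2, "label": "healthy"},
    {"color": "light", "tails": 1, "nuclei": 2, "label": "healthy"},
]

dark_rule_certainty = certainty(RULES[1][0], RULES[1][1], examples)
print(classify({"color": "dark", "tails": 2, "nuclei": 2}), dark_rule_certainty)
```

Keeping rules as plain data rather than hard-coded conditions makes it easy to score each rule separately against a set of examples.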

Classification: Decision Trees
[Figure: decision tree that tests Color (dark / light) at the root, then #nuclei (1 / 2), then #tails (1 / 2), with leaves labeled healthy or cancerous]
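A decision tree of this shape can be represented as nested (attribute, branches) pairs and walked from the root to a leaf. In the sketch below the attribute order (color, then nuclei, then tails) follows the slide, but the placement of the healthy and cancerous leaves is hypothetical: it was chosen only to agree with the two rules on the rule induction slide and is not a reconstruction of the original figure.

```python
# Sketch only: the leaf labels below are hypothetical, not read off the figure.
# An internal node is (attribute, {value: subtree}); a leaf is a class label.

TREE = ("color", {
    "dark": ("nuclei", {
        1: "cancerous",
        2: ("tails", {1: "healthy", 2: "cancerous"}),
    }),
    "light": ("nuclei", {
        1: "healthy",
        2: ("tails", {1: "healthy", 2: "cancerous"}),
    }),
})

def classify(tree, cell):
    """Follow the branch that matches the cell at each test until a leaf."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[cell[attribute]]
    return tree

cell = {"color": "dark", "nuclei": 2, "tails": 2}  # hypothetical cell
print(classify(TREE, cell))
```

Each root-to-leaf path reads as one IF ... THEN rule, which is why trees and rule sets are often converted into one another.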

Classification: Neural Networks
“What factors determine a cell is cancerous?”
[Figure: neural network with inputs Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy and Cancerous]