
Data Mining: 8. Text Mining

Romi Satria Wahono
romi@romisatriawahono.net
http://romisatriawahono.net/dm
WA/SMS: +6281586220090


Romi Satria Wahono
• SD Sompok Semarang (1987)
• SMPN 8 Semarang (1990)
• SMA Taruna Nusantara Magelang (1993)
• B.Eng, M.Eng and Ph.D in Software Engineering from Saitama University, Japan (1994-2004) and Universiti Teknikal Malaysia Melaka (2014)
• Research Interests: Software Engineering, Machine Learning
• Founder and Coordinator of IlmuKomputer.Com
• Researcher at LIPI (2004-2007)
• Founder and CEO of PT Brainmatics Cipta Informatika


Course Outline
1. Introduction to Data Mining
2. The Data Mining Process
3. Data Preparation
4. Classification Algorithms
5. Clustering Algorithms
6. Association Algorithms
7. Estimation and Forecasting Algorithms
8. Text Mining

8. Text Mining
7.1 Text Mining Concepts
7.2 Text Clustering
7.3 Text Classification

7.1 Text Mining Concepts

How Text Mining Works
• The fundamental step in text mining involves converting text into semi-structured data
• Once you convert the unstructured text into semi-structured data, there is nothing to stop you from applying any of the analytics techniques to classify, cluster, and predict
• The unstructured text needs to be converted into a semi-structured dataset so that you can find patterns and, even better, train models to detect patterns in new and unseen text

Text Processing

The Data Mining Process
1. Dataset (understanding and processing the data)
2. Data mining method (choose a method that suits the character of the data)
3. Knowledge (pattern/model/formula/tree/rule/cluster)
4. Evaluation (accuracy, AUC, RMSE, lift ratio, …)
• Data pre-processing: data cleaning, data integration, data reduction, data transformation, text processing
• Methods: estimation, prediction, classification, clustering, association

Token and Tokenization
• Words are separated by a special character: a blank space
• Each word is called a token
• The process of discretizing words within a document is called tokenization
• For our purpose here, each sentence can be considered a separate document, although what is considered an individual document may depend upon the context
• For now, a document here is simply a sequential collection of tokens
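Tokenization as described above can be sketched in a few lines of Python. This is a minimal illustration, assuming tokens are separated by blank spaces and that each sentence is one document; the `tokenize` helper and the sample sentence are hypothetical and not taken from the slides.

```python
def tokenize(document):
    """Split a document on whitespace into lowercase tokens.

    Punctuation attached to the start or end of a word is stripped,
    which is an extra assumption beyond plain blank-space splitting.
    """
    tokens = []
    for raw in document.split():
        token = raw.strip(".,;:!?\"'()").lower()
        if token:
            tokens.append(token)
    return tokens


# Hypothetical sample document (one sentence = one document).
sample = "Text mining converts text into tokens."
print(tokenize(sample))
```

Lowercasing is a design choice here so that "Text" and "text" count as the same token; some tasks may prefer to keep case.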

Matrix of Terms
• We can impose some form of structure on this raw data by creating a matrix, where:
  • the columns consist of all the tokens found in the two documents
  • the cells of the matrix are the counts of the number of times a token appears
• Each token is now an attribute in standard data mining parlance and each document is an example

Term Document Matrix (TDM)
• Basically, unstructured raw data is now transformed into a format that is recognized, not only by the human users as a data table, but more importantly by all the machine learning algorithms which require such tables for training
• This table is called a document vector or term document matrix (TDM) and is the cornerstone of the preprocessing required for text mining
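A term document matrix of this kind can be built with the standard library alone. The sketch below assumes one row per document and one column per token, with raw counts in the cells; the two sample sentences and the `term_document_matrix` helper are illustrative assumptions.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, rows), with one row of token counts per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Columns: every distinct token found in any document, in sorted order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical sample documents.
docs = [
    "this is a book on data mining",
    "this book describes data mining and text mining",
]
vocabulary, rows = term_document_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is the document vector for one document; a learning algorithm would consume `rows` as its training table.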

TF–IDF
• We could have also chosen to use the TF–IDF scores for each term to create the document vector
• N is the number of documents that we are trying to mine
• Nk is the number of documents that contain the keyword, k
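One common way to compute the score, assumed here rather than taken from the slide, is TF-IDF = (n_k / n) × log2(N / N_k), where n_k is the number of times keyword k occurs in a document and n is the total number of terms in that document. The sketch below applies that assumed variant; other variants (different log base, smoothing) exist, and the sample documents are hypothetical.

```python
import math


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Assumed formula: (n_k / n) * log2(N / N_k).
    """
    tokenized = [doc.lower().split() for doc in documents]
    N = len(tokenized)  # number of documents being mined

    # N_k: number of documents that contain each term.
    doc_freq = {}
    for tokens in tokenized:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    scores = []
    for tokens in tokenized:
        n = len(tokens)  # total terms in this document
        doc_scores = {}
        for term in set(tokens):
            tf = tokens.count(term) / n
            idf = math.log2(N / doc_freq[term])
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores


# Hypothetical sample documents.
docs = [
    "this is a book on data mining",
    "this book describes data mining and text mining",
]
scores = tf_idf(docs)
print(scores[1]["text"], scores[1]["mining"])
```

Under this variant a term that appears in every document, such as "mining" here, scores zero, while a term unique to one document is weighted up.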

Stopwords
• The two sample text documents contain common words such as “a,” “this,” “and,” and other similar terms
• Clearly in larger documents we would expect a larger number of such terms that do not really convey specific meaning
• Most grammatical necessities such as articles, conjunctions, prepositions, and pronouns may need to be filtered before we perform additional analysis
• Such terms are called stopwords and usually include most articles, conjunctions, pronouns, and prepositions
• Stopword filtering is usually the second step that follows immediately after tokenization
• Notice that our document vector has a significantly reduced size after applying standard English stopword filtering
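Stopword filtering can be sketched as a simple set lookup applied right after tokenization. The stopword list below is a tiny hypothetical sample for illustration only; real English lists contain far more entries.

```python
# A tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "is", "on", "the", "this"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]


# Hypothetical sample document, tokenized on blank spaces.
tokens = "this is a book on data mining".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

Seven tokens shrink to three, which is the kind of reduction in document vector size the slide refers to.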

Indonesian Stopwords
• Search Google with the keyword: stopwords bahasa Indonesia
• Download an Indonesian stopword list and use it in RapidMiner

Stemming
• Words such as “recognized,” “recognizable,” or “recognition” may appear in different usages, but contextually they may all imply the same meaning, for example:
  • “Einstein is a well-recognized name in physics”
  • “The physicist went by the easily recognizable name of Einstein”
  • “Few other physicists have the kind of name recognition that Einstein has”
• The so-called root of all these highlighted words is “recognize”
• By reducing terms in a document to their basic stems, we can simplify the conversion of unstructured text to structured data because we now only take into account the occurrence of the root terms
• This process is called stemming. The most common stemming technique for text mining in English is the Porter method (Porter, 1980)
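The idea of collapsing related words to one stem can be illustrated with a deliberately crude suffix stripper. This is not the Porter algorithm, which applies a much larger set of ordered rules; the suffix list and the `crude_stem` helper are assumptions chosen only to show the effect on the example words.

```python
# Hypothetical suffix list, checked in the order given.
SUFFIXES = ("izable", "ition", "ized", "ize", "ing", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["recognized", "recognizable", "recognition", "name"]:
    print(word, "->", crude_stem(word))
```

All three "recogn-" variants end up on the same stem, so they would occupy a single column in the document vector instead of three.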

A Typical Sequence of Preprocessing Steps to Use in Text Mining

N-Grams
• There are families of words in the spoken and written language that typically go together
• The word “Good” is usually followed by either “Morning,” “Afternoon,” “Evening,” “Night,” or in Australia, “Day”
• Grouping such terms, called n-grams, and analyzing them statistically can present new insights
• Search engines use word n-gram models for a variety of applications, such as:
  • Automatic translation, identifying speech patterns, checking misspelling, entity detection, information extraction, among many different use cases
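Extracting and counting word n-grams takes only a sliding window over the token list. The sketch below is a minimal illustration; the `ngrams` helper and the sample text are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text.
tokens = "good morning good night good morning".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```

Counting the tuples shows which word pairs occur together most often, which is the statistical grouping the slide describes.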

RapidMiner Process of Text Mining

7.2 Text Clustering

Exercise
• Carry out the experiment following Matthew North's book (Data Mining for the Masses), Chapter 12 (Text Mining), pp. 189-215
• Dataset: Federalist Papers
• Understand the text mining workflow used and relate it to the concepts already studied

1. Business Understanding
• Gillian is a historian and archivist, and she has recently curated an exhibit on the Federalist Papers, the essays that were written and published in the late 1700s
• The essays were published anonymously under the author name ‘Publius’, and no one really knew at the time if ‘Publius’ was one individual or many
• Years later, after Alexander Hamilton died in the year 1804, some notes were discovered that revealed that he (Hamilton), James Madison and John Jay had been the authors of the papers
• The notes indicated specific authors for some papers, but not for others. Specifically, John Jay was revealed to be the author for papers 3, 4 and 5; Madison for paper 14; and Hamilton for paper 17. Paper 18 had no author named, but there was evidence that Hamilton and Madison worked on that one together
• Gillian would like to analyze paper 18’s content in the context of the other papers with known authors, to see if she can generate some evidence that the suspected collaboration between Hamilton and Madison is in fact a likely scenario
• Having studied all of the Federalist Papers and other writings by the three statesmen who wrote them, Gillian feels confident that paper 18 is a collaboration that John Jay did not contribute to, since his vocabulary and grammatical structure were quite different from those of Hamilton and Madison

2. Data Understanding


Exercise
• Carry out the experiment following Vijay Kotu's book (Predictive Analytics and Data Mining), Chapter 9 (Text Mining), Case Study 1: Keyword Clustering, pp. 284-287
• Datasets:
  1. http://travel.detik.com
  2. http://sport.detik.com
• Use an Indonesian stopword list (download from the Internet)


7.3 Text Classification

Exercise
• Carry out the experiment following Vijay Kotu's book (Predictive Analytics and Data Mining), Chapter 9 (Text Mining), Case Study 2: Predicting the Gender of Blog Authors, pp. 287-301
• Dataset: blog-gender-dataset.xslx
• Split the data: 50% training data and 50% testing data
• Use the Naïve Bayes algorithm
• Apply the resulting model to the testing data
• Measure its performance
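The train, apply, and measure steps above can be sketched outside RapidMiner with a small multinomial Naive Bayes classifier. This is a minimal illustration under stated assumptions: the four labelled sentences are made up and stand in for the blog dataset, the `NaiveBayesText` class is hypothetical, and add-one (Laplace) smoothing is a choice made for this sketch rather than something the exercise specifies.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for doc, label in zip(documents, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, document):
        tokens = document.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # skip words never seen during training
                numerator = self.word_counts[label][token] + 1
                denominator = total_words + len(self.vocabulary)
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up labelled data, split 50% training and 50% testing.
labelled = [
    ("great fun movie", "pos"),
    ("boring dull plot", "neg"),
    ("fun great cast", "pos"),
    ("dull boring movie", "neg"),
]
half = len(labelled) // 2
train, test = labelled[:half], labelled[half:]

model = NaiveBayesText().fit([d for d, _ in train], [l for _, l in train])
correct = sum(model.predict(doc) == label for doc, label in test)
accuracy = correct / len(test)
print(accuracy)
```

Accuracy on the held-out half is the simplest performance measure; on a real dataset the split would normally be randomized rather than taken in file order.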

Exercise
• Using the concepts and techniques you have mastered, perform text classification on the polarity data - small dataset
• Take one article from the pos folder and test whether it is classified as negative or positive sentiment


Exercise
• Using the concepts and techniques you have mastered, perform text classification on the polarity data dataset
• Apply several feature selection methods, both filter and wrapper
• Compare various classification algorithms, and choose the best one

