DATA MINING

Kusrini

Definition of Data Mining

Gartner Group:
“Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories,
using pattern recognition technologies as well as
statistical and mathematical techniques.”

Hand et al.:
“Data mining is the analysis of (often large)
observational data sets to find unsuspected
relationships and to summarize the data in novel
ways that are both understandable and useful to the
data owner”

Evangelos Simoudis in Cabena et al:
“Data mining is an interdisciplinary field bringing
together techniques from machine learning,
pattern recognition, statistics, databases, and
visualization to address the issue of information
extraction from large data bases”

Berry and Linoff:
“Data mining is the process of exploration and
analysis, by automatic or semi-automatic means,
of large quantities of data in order to discover
meaningful patterns and rules”

Why data mining?

- The explosive growth in data collection
- The storing of data in data warehouses
- The availability of increased access to data from Web navigation and intranets

We have to find a more effective way to use these data in the decision support process than just using traditional query languages.

On what kind of data?

- Data warehouses
- Transactional databases
- Advanced database systems:
  - Spatial and temporal
  - Time-series
  - Multimedia
  - Text
  - WWW
  - …

Knowledge Discovery in Databases

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
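The seven steps can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two sources, their records, the function names, and the 0.5 support threshold are all hypothetical and do not come from the slides.

```python
from collections import Counter

# Two hypothetical sources with made-up purchase records.
source_a = [
    {"customer": "c1", "item": "bread", "qty": 2},
    {"customer": "c2", "item": "milk", "qty": None},  # incomplete record
    {"customer": "c1", "item": "milk", "qty": 1},
]
source_b = [
    {"customer": "c3", "item": "bread", "qty": 1},
    {"customer": "c4", "item": "bread", "qty": 3},
]

def clean(records):
    """Step 1: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Step 3: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate records into one set of items per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support):
    """Step 6: keep items whose support reaches the threshold."""
    supports = {item: c / n_baskets for item, c in counts.items()}
    return {item: s for item, s in supports.items() if s >= min_support}

records = select(integrate(clean(source_a), clean(source_b)), ["customer", "item"])
baskets = transform(records)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)

# Step 7: present the retained patterns to the user.
for item, support in sorted(patterns.items()):
    print(f"{item}: support {support:.2f}")
```

In practice each step is far more involved, and the process is usually iterative rather than a single pass.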

Tasks in Data Mining

- Classification
- Estimation
- Prediction
- Clustering
- Association

Classification

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).

In classification, there is a target categorical variable, such as income bracket, which, for example, could be partitioned into three classes or categories: high income, middle income, and low income. The data mining model examines a large set of records, each record containing information on the target variable as well as a set of input or predictor variables.


Algorithms:

- k-nearest neighbor classification
- Decision tree
- Naïve Bayesian classification
- Support vector machines

Nearest Neighbor (KNN)

Nearest Neighbor is an approach for retrieving cases by computing the similarity between a new case and old cases, based on matching the weights of a number of features. Suppose we want to find a solution for a new patient by reusing the solutions of earlier patients. To decide which earlier case to use, we compute the similarity between the new patient's case and every old patient's case. The old case with the greatest similarity is the one whose solution is applied to the new patient's case.

As shown in the figure, there are two old patients, A and B. When a new patient arrives, the solution adopted is that of the old patient nearest to the new patient. Suppose d1 is the closeness between the new patient and patient A, and d2 is the closeness between the new patient and patient B. Because d2 indicates a closer match than d1, patient B's solution is the one used for the new patient.

Similarity formula

Similarity usually takes a value between 0 and 1. A value of 0 means the two cases are entirely dissimilar; a value of 1 means the two cases are identical.

- Weights between variables
- Similarity for gender
- Similarity for education
- Similarity for religion
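A minimal sketch of how weighted feature matching and nearest-case retrieval could fit together. It assumes a common weighted form, similarity = sum(w_i * s_i) / sum(w_i), where s_i is the per-attribute similarity and w_i the attribute weight; the slides do not state the formula here. The weights, the lookup tables, the category labels, and the two old cases are all hypothetical stand-ins for the tables named above.

```python
# Hypothetical attribute weights (not the values from the original table).
WEIGHTS = {"gender": 0.5, "education": 1.0, "religion": 0.75}

# Hypothetical similarity between differing attribute values, in [0, 1].
# Identical values always score 1.0; unlisted pairs score 0.0.
ATTRIBUTE_SIMILARITY = {
    "gender": {("male", "female"): 0.5},
    "education": {
        ("high school", "bachelor"): 0.5,
        ("bachelor", "master"): 0.75,
        ("high school", "master"): 0.25,
    },
    "religion": {},
}

def attribute_similarity(attribute, a, b):
    """Look up the similarity of two values of one attribute."""
    if a == b:
        return 1.0
    table = ATTRIBUTE_SIMILARITY[attribute]
    return table.get((a, b), table.get((b, a), 0.0))

def case_similarity(new_case, old_case):
    """Weighted average of the per-attribute similarities."""
    total = sum(
        weight * attribute_similarity(attr, new_case[attr], old_case[attr])
        for attr, weight in WEIGHTS.items()
    )
    return total / sum(WEIGHTS.values())

# Hypothetical cases: one new patient and two old patients, A and B.
new_patient = {"gender": "male", "education": "bachelor", "religion": "a"}
old_patients = {
    "A": {"gender": "female", "education": "high school", "religion": "b"},
    "B": {"gender": "male", "education": "master", "religion": "a"},
}

scores = {name: case_similarity(new_patient, case) for name, case in old_patients.items()}
best = max(scores, key=scores.get)  # old case with the greatest similarity
print(scores, "-> reuse the solution of patient", best)
```

Dividing by the sum of the weights keeps the result between 0 and 1, which matches the range described for the similarity value.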

System Design

Decision Tree

A decision tree is a structure that can be used to divide a large data set into smaller sets of records by applying a series of decision rules. With each successive division, the members of the resulting sets become more similar to one another (Berry, Michael J.A., Linoff, Gordon S., 2004).

A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to the target variable.

- Choose an attribute as the root
- Create a branch for each value
- Divide the cases among the branches
- Repeat the process for each branch until all cases in a branch have the same class
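The four steps can be sketched as a short recursive function. The slides do not say how the root attribute is chosen, so this sketch assumes information gain (as in ID3); the training rows, the attribute names, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(rows, target):
    """Shannon entropy of the class labels in rows."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain (assumed criterion)."""
    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset, target)
        return entropy(rows, target) - remainder
    return max(attributes, key=gain)

def build_tree(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:            # all cases share one class: stop
        return classes.pop()
    if not attributes:               # nothing left to split on: majority class
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    root = best_attribute(rows, attributes, target)     # choose the root
    remaining = [a for a in attributes if a != root]
    tree = {root: {}}
    for value in sorted({r[root] for r in rows}):       # one branch per value
        branch = [r for r in rows if r[root] == value]  # divide the cases
        tree[root][value] = build_tree(branch, remaining, target)  # repeat
    return tree

# Hypothetical training data.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
```

The result is a nested dictionary in which each inner key is an attribute value and each leaf is a class label.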


Bayesian Classification

Bayesian classification is a statistical classifier that can be used to predict the probability of class membership. It is based on Bayes' theorem and has classification ability comparable to that of decision trees and neural networks. Bayesian classification has been shown to achieve high accuracy and speed when applied to databases.


X = (age = “<=30”, income = “medium”, student = “yes”, credit_rating = “fair”)

Class buys_computer?

For X = (age = “<=30”, income = “medium”, student = “yes”, credit_rating = “fair”), we need to maximize P(X|Ci) P(Ci) for i = 1, 2.

P(Ci) is the prior probability of each class, based on the sample data:

- P(buys_computer = “yes”) = 9/14 = 0.643
- P(buys_computer = “no”) = 5/14 = 0.357

Compute P(X|Ci) for i = 1, 2:

- P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
- P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.600
- P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
- P(income = “medium” | buys_computer = “no”) = 2/5 = 0.400
- P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
- P(student = “yes” | buys_computer = “no”) = 1/5 = 0.200
- P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
- P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.400

- P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
- P(X | buys_computer = “no”) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019

- P(X | buys_computer = “yes”) P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028
- P(X | buys_computer = “no”) P(buys_computer = “no”) = 0.019 x 0.357 = 0.007

Conclusion: buys_computer = “yes”
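The arithmetic of this worked example can be reproduced in a few lines. The sketch below uses only the fractions stated in the example; the variable names and dictionary layout are illustrative choices, and the underlying 14-record training table is not reconstructed.

```python
import math

# Prior probabilities P(Ci), taken from the example (9 "yes" and 5 "no" out of 14).
prior = {"yes": 9 / 14, "no": 5 / 14}

# Conditional probabilities P(x_k | Ci) for the attribute values of X.
likelihood = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit_rating=fair": 6 / 9},
    "no": {"age<=30": 3 / 5, "income=medium": 2 / 5,
           "student=yes": 1 / 5, "credit_rating=fair": 2 / 5},
}

def score(label):
    """P(X|Ci) * P(Ci), assuming the attributes are conditionally independent."""
    return prior[label] * math.prod(likelihood[label].values())

scores = {label: score(label) for label in prior}
prediction = max(scores, key=scores.get)

for label, value in scores.items():
    print(f'buys_computer="{label}": {value:.3f}')
print("prediction:", prediction)
```

The product over the conditional probabilities is where the naïve independence assumption enters: each attribute is treated as independent of the others given the class.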

Department: systems, age: 26-30, salary: 46-50K. Status?

Estimation

Estimation is similar to classification except that the target variable is numerical rather than categorical. Models are built using “complete” records, which provide the value of the target variable as well as the predictors.

Example:

- Estimating the systolic blood pressure reading of a hospital patient, based on the patient’s age, gender, body-mass index, and blood sodium levels
- Estimating the grade-point average (GPA) of a graduate student, based on that student’s undergraduate GPA

Methods:

- Simple linear regression and correlation
- Multiple regression
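As an illustration of simple linear regression applied to the GPA example, the sketch below fits a least-squares line and uses it to estimate a graduate GPA. The five data points are hypothetical (chosen to lie exactly on a line so the result is easy to check), and the function name is an illustrative choice.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical "complete" records: undergraduate GPA (predictor)
# and graduate GPA (numerical target).
undergrad_gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
graduate_gpa = [2.5, 2.75, 3.0, 3.25, 3.5]

slope, intercept = fit_simple_linear_regression(undergrad_gpa, graduate_gpa)

# Estimate the target for a new student whose graduate GPA is unknown.
new_undergrad_gpa = 3.2
estimate = intercept + slope * new_undergrad_gpa
print(f"slope={slope:.2f}, intercept={intercept:.2f}, estimate={estimate:.2f}")
```

Multiple regression follows the same idea with several predictors, as in the blood pressure example, and is normally fitted with a numerical library rather than by hand.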

Prediction

Prediction is similar to classification and estimation, except that for prediction, the results lie in the future.

Example:

- Predicting the price of a stock three months into the future
- Predicting the percentage increase in traffic deaths next year if the speed limit is increased
- Predicting whether a particular molecule in drug discovery will lead to a profitable new drug for a pharmaceutical company

Methods:

- Simple linear regression and correlation
- Multiple regression
- k-nearest neighbor classification
- Decision tree
- Naïve Bayesian classification
- Support vector machines