
Introduction to Data Mining

Lab 6: Putting it all together


Phạm Đông Quân – ITDSIU19015

5.1. The data mining process


In the fifth class, we look at some broader issues in the data mining process (see the Class 5
lectures by Ian H. Witten [1]). We go through four lessons: the data mining process, pitfalls and
pratfalls, data mining and ethics, and association-rule learners.
According to [1], the data mining process includes the following steps: ask a question, gather data,
clean the data, define new features, and deploy the result. Each step is summarized briefly below:
- Ask a question: Find the right kind of question, one whose answer brings a benefit. The question
should state what we want to know before any data is collected. Once the right question has been
asked, we collect data to answer it.
- Gather/Collect data: Obtain data that gives us a chance of answering the question with data
mining techniques. This can take a long time, because the attributes needed for the answer are
rarely all obtained in one go. We may also have to collect several datasets and merge them to
find the attributes we want to use.
- Clean data: Real-world data is always messy, and processing it is painstaking. We need to
investigate the data, get a feel for it, detect any anomalies, and decide whether to remove them.
- Define new features: This is the key to successful data mining. Feature engineering may be
repeated several times, together with optimization and the choice of components for the
classifier, in order to obtain a good classification model.
- Deploy the result: After the model performs well in testing and evaluation, the last step is to
deploy it in the real world.
Alternatively, according to (Han and Kamber, 2011), the data mining process is treated as a knowledge
discovery from data (KDD) process consisting of an iterative sequence of seven steps, listed below:
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: where multiple data sources may be combined
3. Data selection: where data relevant to the analysis task are retrieved from the database
4. Data transformation: where data are transformed and consolidated into forms appropriate for mining
by performing summary or aggregation operations
5. Data mining: an essential process where intelligent methods are applied to extract data patterns
6. Pattern evaluation: to identify the truly interesting patterns representing knowledge based on
interestingness measures
7. Knowledge presentation: where visualization and knowledge representation techniques are used to
present mined knowledge to users
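
The seven steps above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the toy purchase records, the customer-to-region lookup, and the function name kdd are all hypothetical and do not come from Han and Kamber. The "mining" step is reduced to counting item pairs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: two sources keyed by transaction id.
purchases = [
    {"id": 1, "items": ["bread", "milk"]},
    {"id": 2, "items": ["bread", "butter", "milk"]},
    {"id": 3, "items": None},  # noisy record with no item list
    {"id": 4, "items": ["bread", "milk"]},
]
customers = {1: "north", 2: "north", 3: "south", 4: "south"}


def kdd(purchases, customers, region=None, min_support=2):
    # 1. Data cleaning: drop records with a missing or empty item list.
    clean = [p for p in purchases if p["items"]]
    # 2. Data integration: attach the region from the second source.
    merged = [{**p, "region": customers.get(p["id"])} for p in clean]
    # 3. Data selection: keep only records relevant to the task.
    selected = [m for m in merged if region is None or m["region"] == region]
    # 4. Data transformation: turn each record into a sorted item tuple.
    baskets = [tuple(sorted(set(m["items"]))) for m in selected]
    # 5. Data mining: count co-occurring item pairs.
    pair_counts = Counter(pair for b in baskets for pair in combinations(b, 2))
    # 6. Pattern evaluation: keep pairs that meet the support threshold.
    interesting = {p: c for p, c in pair_counts.items() if c >= min_support}
    # 7. Knowledge presentation: readable lines, most frequent first.
    ranked = sorted(interesting.items(), key=lambda kv: -kv[1])
    return [f"{a} & {b}: {c}" for (a, b), c in ranked]


print(kdd(purchases, customers))
```

In practice the steps are iterative: a poor result at step 6 usually sends the analyst back to cleaning, selection, or transformation.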

[1] http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

5.2. Pitfalls and pratfalls
Follow the lecture in [1] to learn about pitfalls and pratfalls in data mining.

Do experiments to investigate how OneR and J48 deal with missing values.

Write down the results in the following table:

Dataset                  OneR's classifier model and performance            J48's classifier model and performance

weather-nominal.arff     Correctly Classified Instances: 6 - 42.857%        Correctly Classified Instances: 7 - 50%
(original)               Incorrectly Classified Instances: 8 - 57.143%      Incorrectly Classified Instances: 7 - 50%

weather-nominal.arff     Correctly Classified Instances: 5 - 35.714%        Correctly Classified Instances: 5 - 35.714%
(with missing values)    Incorrectly Classified Instances: 9 - 64.286%      Incorrectly Classified Instances: 9 - 64.286%

Remark: how do OneR and J48 deal with missing values?


J48 ignores missing values when building the decision tree, whereas OneR treats a missing value as a
separate attribute value and uses it when classifying. In this experiment, both algorithms performed
worse when missing values were present in the dataset. However, a single dataset is not enough to
decide whether ignoring missing values makes an algorithm perform worse. There are also other ways of
handling missing values, and each can perform well or badly depending on the dataset.
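
The OneR behaviour described above can be sketched in a few lines. This is a simplified illustration, not Weka's implementation: it builds a one-attribute rule on the standard 14-instance weather data and counts errors on the training data rather than by cross-validation, so its numbers do not match the table. The "?" label, the helper names, and the choice of which outlook values to blank out are assumptions made for this example.

```python
from collections import Counter, defaultdict

MISSING = "?"  # label for a missing value (an assumption of this sketch)

# Weather data: outlook, temperature, humidity, windy, play
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def one_r_rule(rows, attr):
    """Build a rule on one attribute: each value predicts its majority class.

    A missing value (None) is counted as its own value, MISSING.
    Returns the rule and the number of training errors it makes.
    """
    by_value = defaultdict(Counter)
    for row in rows:
        value = MISSING if row[attr] is None else row[attr]
        by_value[value][row[-1]] += 1
    rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in by_value.values())
    return rule, errors


def with_missing(rows, attr, indices):
    """Copy the rows, blanking attribute `attr` in the rows listed in `indices`."""
    return [
        tuple(None if (i in indices and j == attr) else v for j, v in enumerate(row))
        for i, row in enumerate(rows)
    ]


rule, errors = one_r_rule(WEATHER, 0)          # rule on outlook, full data
damaged = with_missing(WEATHER, 0, {0, 1, 2, 3})
rule_m, errors_m = one_r_rule(damaged, 0)      # same rule with four values missing
print(rule, errors)
print(rule_m, errors_m)
```

With the first four outlook values blanked, the rule gains an extra "?" branch and its training error count rises from 4 to 5, which is consistent with the drop in accuracy observed in the experiment.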

5.3. Data mining and ethics
Reading

5.4. Association-rule learners


Do experiments to investigate how Apriori and FP-Growth generate association rules for the dataset
vote.arff.

Dataset: vote.arff

Apriori-based association rules:
1. adoption-of-the-budget-resolution=y physician-fee-freeze=n (219) ==> Class=democrat (219) - conf = 1
2. adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-to-nicaraguan-contras=y (198) ==> Class=democrat (198) - conf = 1
3. physician-fee-freeze=n aid-to-nicaraguan-contras=y (211) ==> Class=democrat (210) - conf = 1
4. physician-fee-freeze=n education-spending=n (202) ==> Class=democrat (201) - conf = 1
5. physician-fee-freeze=n (247) ==> Class=democrat (245) - conf = 0.99
6. el-salvador-aid=n Class=democrat (200) ==> aid-to-nicaraguan-contras=y (197) - conf = 0.98
7. el-salvador-aid=n (208) ==> aid-to-nicaraguan-contras=y (204) - conf = 0.98
8. adoption-of-the-budget-resolution=y aid-to-nicaraguan-contras=y Class=democrat (203) ==> physician-fee-freeze=n (198) - conf = 0.98
9. el-salvador-aid=n aid-to-nicaraguan-contras=y (204) ==> Class=democrat (197) - conf = 0.97
10. aid-to-nicaraguan-contras=y Class=democrat (218) ==> physician-fee-freeze=n (210) - conf = 0.96

FP-Growth-based association rules:
1. [el-salvador-aid=y, Class=republican]: (157) ==> [physician-fee-freeze=y]: (156) - conf = 0.99
2. [crime=y, Class=republican]: (158) ==> [physician-fee-freeze=y]: (155) - conf = 0.98
3. [religious-groups-in-schools=y, physician-fee-freeze=y]: (160) ==> [el-salvador-aid=y]: (156) - conf = 0.97
4. [Class=republican]: (168) ==> [physician-fee-freeze=y]: (163) - conf = 0.97
5. [adoption-of-the-budget-resolution=y, anti-satellite-test-ban=y, mx-missile=y]: (161) ==> [aid-to-nicaraguan-contras=y]: (155) - conf = 0.96
6. [physician-fee-freeze=y, Class=republican]: (163) ==> [el-salvador-aid=y]: (156) - conf = 0.96
7. [religious-groups-in-schools=y, el-salvador-aid=y, superfund-right-to-sue=y]: (160) ==> [crime=y]: (153) - conf = 0.96
8. [el-salvador-aid=y, superfund-right-to-sue=y]: (170) ==> [crime=y]: (162) - conf = 0.95
9. [crime=y, physician-fee-freeze=y]: (168) ==> [el-salvador-aid=y]: (160) - conf = 0.95
10. [el-salvador-aid=y, physician-fee-freeze=y]: (168) ==> [crime=y]: (160) - conf = 0.95
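
The confidence values in the rules above can be checked by hand. In each rule, the first count is the number of instances that match the premise and the second is the number that match both premise and consequent, so confidence is the second count divided by the first. The short sketch below recomputes a few of them; the function name and the dictionary labels are chosen for this example only.

```python
def confidence(premise_count, both_count):
    """Confidence of A ==> B: instances matching A and B over instances matching A."""
    return both_count / premise_count


# (premise count, premise-and-consequent count) copied from the rules listed above
examples = {
    "Apriori rule 1": (219, 219),
    "Apriori rule 3": (211, 210),
    "Apriori rule 5": (247, 245),
    "Apriori rule 10": (218, 210),
    "FP-Growth rule 1": (157, 156),
}

for name, (premise, both) in examples.items():
    print(f"{name}: conf = {confidence(premise, both):.2f}")
```

Apriori rule 3 is reported with conf = 1 even though one of its 211 instances does not satisfy the consequent: 210/211 is about 0.995, which rounds to 1 at two decimal places.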
