Data Mining and Warehousing: Project 1
Project 1
A. Objectives
I selected the flag data set from the UCI Machine Learning Repository. The benefit of association
rule mining is that it finds all co-occurrence relationships (associations) in the data. Such
patterns can help predict data, for example the religion of a country from its size and the
number of colors in its flag. I chose this data set because it is easy to read and understand and
well suited to applying several filters. I also found initial interesting rules to extract from it.
In later steps I made several choices; the reasoning for each is given where it is made.
Data Cleaning:
1. Missing values
In this step, missing values, noisy data, and inconsistencies should be resolved. My data set is
complete in the original file flag-new, so I deleted some values from two records (language,
religion) to create the missing-value problem, and then applied the following method in Weka to
resolve it:
Open the file > from Choose button > weka > Filters > unsupervised > attribute > replace
missing values > apply button > save
This replaced the missing values in my dataset with the modes and means from the training data.
The missing fields were filled with (5.298429, 2.172775) respectively. The new file name: flag-newReplace missing value
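The mean/mode replacement described above can be sketched in plain Python. This is only an illustration under assumptions: the `replace_missing` helper and the four sample rows are hypothetical and are not taken from the flag file or from Weka's implementation. The fractional fill values reported above suggest that language and religion were treated as numeric columns, so the sketch fills a numeric column with its mean and a nominal column with its mode.

```python
from statistics import mean, mode

MISSING = "?"  # ARFF files mark a missing value with "?"


def replace_missing(rows, numeric_cols, nominal_cols):
    """Return a copy of rows with missing cells filled by mean (numeric) or mode (nominal)."""
    filled = [dict(row) for row in rows]
    for col in numeric_cols:
        # Mean over the values that are actually present
        col_mean = mean(float(r[col]) for r in rows if r[col] != MISSING)
        for row in filled:
            row[col] = col_mean if row[col] == MISSING else float(row[col])
    for col in nominal_cols:
        # Most frequent value among those present
        col_mode = mode(r[col] for r in rows if r[col] != MISSING)
        for row in filled:
            if row[col] == MISSING:
                row[col] = col_mode
    return filled


# Hypothetical rows, for illustration only
rows = [
    {"name": "A", "language": "1", "mainhue": "red"},
    {"name": "B", "language": "?", "mainhue": "red"},
    {"name": "C", "language": "6", "mainhue": "?"},
    {"name": "D", "language": "8", "mainhue": "green"},
]
filled = replace_missing(rows, numeric_cols=["language"], nominal_cols=["mainhue"])
print(filled[1]["language"], filled[2]["mainhue"])
```

Replacing with a user constant instead would only change the two fill values (a fixed string for nominal columns and a fixed number for numeric ones).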
I also used a constant value (e.g. Anfal for nominal attributes and 0 for numeric attributes) as a
replacement, but this time I deleted the first value in the first record for the mainhue
attribute, using the following method:
Open the file > from Choose button > weka > Filters > unsupervised > attribute > replace
missing values with user constant > click the filter field to specify the values > in the
nominal string replacement value field write Anfal > in the numeric replacement value field
write 0 > ok > apply button > save
The new file name: flag-new-Replace-Anfal.arff
2. Noisy data
I used a filter that removes incorrectly classified instances, using the following method:
Open the file > from Choose button > weka > Filters > unsupervised > instance >
RemoveMisclassified > ok > apply button > save.
Figure 1: data before applying the function
3. Outlier detection
After removing the noise from my dataset, 63 rows remain. To find outliers in my dataset I used the
IQR (interquartile range) filter. Before, I had only 30 attributes, but after applying the filter I had
two new attributes, Outlier and ExtremeValue. Figure 6 shows that 5 instances are outliers and 58 are
not, which is good: the fewer the better. The IQR filter labels an instance YES in the Outlier
attribute if it has an outlier and NO if it does not, checking each attribute. Similarly, for the
ExtremeValue attribute, if the IQR filter finds that the instance represents an extreme value it writes
YES, otherwise NO. Figure 7 shows how many extreme values I have.
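A rough sketch of how IQR-based Outlier and ExtremeValue labels like the ones above can be computed. The factors 3.0 (outlier) and 6.0 (extreme value) are assumptions for illustration, the sample values are made up, and Python's `statistics.quantiles` may place the quartiles slightly differently than Weka does, so this is not Weka's exact procedure.

```python
from statistics import quantiles


def iqr_flags(values, outlier_factor=3.0, extreme_factor=6.0):
    """Return one (outlier, extreme) pair of "yes"/"no" labels per value."""
    q1, _, q3 = quantiles(values, n=4)  # first and third quartiles
    iqr = q3 - q1
    flags = []
    for x in values:
        # Extreme: beyond the wider fence on either side
        extreme = x > q3 + extreme_factor * iqr or x < q1 - extreme_factor * iqr
        # Outlier: beyond the narrower fence but not extreme
        beyond = x > q3 + outlier_factor * iqr or x < q1 - outlier_factor * iqr
        outlier = beyond and not extreme
        flags.append(("yes" if outlier else "no", "yes" if extreme else "no"))
    return flags


# Hypothetical numeric attribute, for illustration only
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 40, 100]
flags = iqr_flags(values)
print(sum(f[0] == "yes" for f in flags), "outlier(s),",
      sum(f[1] == "yes" for f in flags), "extreme value(s)")
```

Rows flagged "yes" in either column are the ones a filter such as remove with values would then drop.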
Figure 3: Outlier
Open the file > from Choose button > weka > Filters > unsupervised > instance > remove with values >
click on the filter field (to adjust the properties)
Figure 7: No outlier
After applying the filter, the data is cleaned of outliers; the same is done for extreme values.
Now, after removing the noise from my dataset, 16 rows remain.
Integration:
Integration means merging two dataset files. Each record in my data set has a different country
name, so I treated this column as the ID of the record. First I divided my data into Part1 and
Part2. Part1 contains the first 16 attributes, the 16th being (black). Part2 contains the
attributes from attribute number 16 (black2) to the last one. I repeated attribute number 16 in
both files to create a redundancy problem, but I needed to change its name in Part2 to make it
work. After that I ran WEKA and clicked Simple CLI.
Figure 8: Step 1
Figure 9: Step 2
Result:
Finished redirecting output to 'C: \Users\Anna\Desktop\backupDM\Merge.csv'. This way I
created a file called Merge that merges both part files. Now, how can I remove the redundant
attribute?
Remove redundant attributes:
Because the Merge file does not open in WEKA, I made another version of type arff (Merge2), so:
Open Merge2 file > from Choose button > weka > Filters > unsupervised > attribute > remove >
click the field next to the Choose button and specify the index of the desired attribute > ok >
apply > save
The new file: mergeAndremove
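The integration and redundancy-removal steps above can be sketched in plain Python. This is one way to express the idea, assuming the country name is the record ID; it is not the command that was run in Weka. The helpers `merge_on_key` and `drop_attribute` and the sample rows are hypothetical.

```python
def merge_on_key(part1, part2, key="name"):
    """Join two lists of records that share the same key column."""
    by_key = {row[key]: row for row in part2}
    merged = []
    for row in part1:
        combined = dict(row)
        # Append the attributes of the matching Part2 record, skipping the key itself
        combined.update({k: v for k, v in by_key[row[key]].items() if k != key})
        merged.append(combined)
    return merged


def drop_attribute(rows, attr):
    """Remove one attribute (column) from every record."""
    return [{k: v for k, v in row.items() if k != attr} for row in rows]


# Hypothetical records; "black2" duplicates "black" under a different name
part1 = [{"name": "Spain", "colours": "2", "black": "0"},
         {"name": "Poland", "colours": "2", "black": "0"}]
part2 = [{"name": "Poland", "black2": "0", "orange": "0"},
         {"name": "Spain", "black2": "0", "orange": "0"}]

merged = merge_on_key(part1, part2)
cleaned = drop_attribute(merged, "black2")
print(list(cleaned[0].keys()))
```

Matching on the key rather than on row position keeps the merge correct even if the two parts are sorted differently.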
Data reduction:
The idea behind this step is to reduce the size of the dataset further. There are two types of
reduction: parametric (I will apply sampling) and non-parametric (I will apply Principal
Component Analysis, PCA).
First the Sampling method:
This extracts a certain specified percentage of a given dataset and returns the reduced dataset.
Open mergeAndremove file > from Choose button > weka > Filters > unsupervised >
Instance> resample > apply > save.
noReplacement means the data is only reduced, without duplicating other records.
sampleSizePercent specifies the size of the sample as a percentage of the dataset. I chose 50 to
cut the dataset in half.
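A minimal sketch of sampling without replacement at 50 percent, as configured above. The function name, the seed, and the placeholder records are assumptions for illustration; Weka's own sampling code is not reproduced here.

```python
import random


def resample_without_replacement(rows, sample_size_percent=50.0, seed=1):
    """Return a random subset of rows; no record is picked twice."""
    k = round(len(rows) * sample_size_percent / 100.0)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(rows, k)


# Hypothetical dataset of 16 record IDs
rows = [f"record-{i}" for i in range(16)]
sample = resample_without_replacement(rows, sample_size_percent=50.0)
print(len(sample))
```

With replacement, the same record could appear several times in the output, which is what the noReplacement option avoids.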
After applying discretization to the color attribute, only 3 colors were visualized although the
records have 4 colors. Even when I set the bin value to 4, one of the colors was completely
removed along with the redundant colors. The data was therefore misleading, so I canceled this
filter.
Second the PCA method:
The purpose of principal components analysis is to:
Figure 15 : PCA
PCA does not work properly with my dataset: it caused many attributes to be deleted.
[Histogram bar labels: Catholic, Muslim, Buddhist, Ethnic, Marxist]
Figure 17: Histogram for language, using single buckets, each of which represents one value shared by several countries
[Histogram bar labels: Spanish, German, Slavic, Arabic, Others]
Transformation
In data transformation I will apply discretization to several attributes: colors, religion,
language, and area.
Discretization
Open file > from Choose button > weka > Filters > unsupervised > Attribute > Discretize
> apply > save.
To turn some numeric attributes into nominal ones, I applied it to the attributes color, area,
religion, and language. Then I replaced the encoded values with nominal values in the Word
file, as shown in figure 21 for religion. The completed file name: DisLang. After that I
eliminated all other attributes, except the name of course, and implemented association rules,
described later.
The bin value depends on how many distinct values the attribute has in my dataset; for example,
the color attribute in my dataset currently has 4 colors.
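To make the role of the bin value concrete, here is a small equal-width binning sketch. It assumes equal-width bins between the minimum and maximum value; the `equal_width_bins` helper and the sample color counts are hypothetical, and the cut points Weka chooses may differ.

```python
def equal_width_bins(values, bins):
    """Map each numeric value to a bin index in 0 .. bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        labels.append(min(idx, bins - 1))  # the maximum value falls into the last bin
    return labels


# Hypothetical number-of-colors values with 4 distinct counts (2, 3, 4, 5)
colours = [2, 3, 5, 3, 4, 2]
labels = equal_width_bins(colours, bins=4)
print(labels)
```

Whether every distinct value lands in its own bin depends on where the cut points fall, so it is worth checking the bin counts after applying the filter.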
Figure 19: Properties of color
Name           Language  Color  Area   Class: Religion
Austria        German    Two    Below  Catholic
Bahrain        Arabic    Two    Below  Muslim
Bulgaria       Slavic    Five   Below  Marxist
Colombia       Spanish   Three  Below  Marxist
Congo          Others    Three  Below  Ethnic
Ecuador        Spanish   Three  Below  Catholic
Ethiopia       Others    Three  Below  Catholic
Gibraltar      Spanish   Three  Below  Catholic
Kampuchea      Others    Two    Below  Buddhist
Liechtenstein  German    Three  Below  Catholic
Morocco        Arabic    Two    Below  Muslim
Poland         Slavic    Two    Below  Marxist
Spain          Spanish   Two    Above  Catholic
Thailand       Others    Three  Below  Buddhist
Vietnam        Others    Two    Below  Catholic
Yugoslavia     Slavic    Four   Below  Marxist
1) Using Weka:
I copied this table to an Excel file, saved it as CSV, and opened it in Weka. Go to the Classify tab >
from Choose button > weka > classifiers > trees > J48
Then, from test options, choose Use training set > start
In the result list, right-click and choose Visualize tree
Results:
The highest gain is for Language, so it is the root.
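The choice of root can be checked by hand with a short information-gain calculation over the 16 rows of the table above (country names omitted). This is a sketch using plain information gain; J48 is based on C4.5, which ranks splits by gain ratio, so the numbers here are not the ones J48 computes internally. The variable and function names are chosen for illustration.

```python
from collections import Counter
from math import log2

# (language, color, area, religion) for the 16 rows of the table
ROWS = [
    ("German", "Two", "Below", "Catholic"), ("Arabic", "Two", "Below", "Muslim"),
    ("Slavic", "Five", "Below", "Marxist"), ("Spanish", "Three", "Below", "Marxist"),
    ("Others", "Three", "Below", "Ethnic"), ("Spanish", "Three", "Below", "Catholic"),
    ("Others", "Three", "Below", "Catholic"), ("Spanish", "Three", "Below", "Catholic"),
    ("Others", "Two", "Below", "Buddhist"), ("German", "Three", "Below", "Catholic"),
    ("Arabic", "Two", "Below", "Muslim"), ("Slavic", "Two", "Below", "Marxist"),
    ("Spanish", "Two", "Above", "Catholic"), ("Others", "Three", "Below", "Buddhist"),
    ("Others", "Two", "Below", "Catholic"), ("Slavic", "Four", "Below", "Marxist"),
]
ATTRS = {"Language": 0, "Color": 1, "Area": 2}
CLASS_INDEX = 3  # religion is the class attribute


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    base = entropy([r[CLASS_INDEX] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr_index], []).append(r[CLASS_INDEX])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


gains = {name: information_gain(ROWS, i) for name, i in ATTRS.items()}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("root:", best)
```

On these rows Language separates the religions far better than Color or Area, which agrees with the tree having Language at the root.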
D. Resulting rules
I found an interesting pattern among (language, religion, colors). File name: RULL1; 34 rules
were found.
General description:
If we have a certain religion, we can tell which language the people of that religion speak (Rule # 8).
If the number of colors in their flag is 2, then we can tell which language they speak (Rule
# 20).
Of course, the more of the instances matching the first parameter that also match the resulting
parameter, the higher the confidence of that rule.
For example, (Rule # 14) found 7 instances for the first parameter, 4 of which confirm the rule
in the resulting parameter, language Spanish; it has a
confidence of 0.5
While (Rule # 23) found 2 instances for the first and second parameters, 2 of which (all of them,
actually) confirm the rule in the resulting parameter, color
2; it has a confidence of 1.0
This gives clients a confidence rate and helps them choose the rule that suits them based on
their decision, for example how confident they want the rule to be for the specific attributes
given as input.
Interesting paths:
1) language=Spanish religion=Catholic 4 ==> colours=Three 3
2) colours=Three 7 ==> religion=Catholic 5
Then I applied the same process to the file RULL2 and found an interesting pattern among
(religion, color, area), ignoring the remaining attributes, which are not as important. Result:
18 rules were found. Interesting path:
1) area=Below religion=Catholic 6 ==> colours=Three 4
The area value was divided into two parts: Above = more than 1000 and Below = less than 1000
thousand square km.
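The confidence of the rules listed above can be recomputed from the two counts printed with each rule: the number of instances matching the left-hand side and the number that also match the right-hand side. A minimal sketch follows; the `confidence` helper is named here for illustration.

```python
def confidence(antecedent_count, rule_count):
    """Share of instances matching the left-hand side that also match the right-hand side."""
    return rule_count / antecedent_count


# (rule text, left-hand-side count, count matching the whole rule), as listed above
rules = [
    ("language=Spanish religion=Catholic ==> colours=Three", 4, 3),
    ("colours=Three ==> religion=Catholic", 7, 5),
    ("area=Below religion=Catholic ==> colours=Three", 6, 4),
]
for text, lhs, both in rules:
    print(f"{text}: {confidence(lhs, both):.2f}")
```

A client who wants only rules above a given confidence can filter the list on this value.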
The selection shown to the client depends on the interesting rules and on the client's request. An
example is shown in the next section.
E. Recommendations
The client can use the discovered rules in education, research, or tourism information. It
depends on the goal the rules are used for. Let's say the client has a program to travel around
the world to spread the Islamic religion, and has few details about the countries he intends to
visit. He needs statistics to check the environment and beliefs of such countries in order to
prepare himself for the community. As a small example, let's narrow the range to the attributes I
have (religion, area, language). A program can be built to work with Weka, which takes the
client's details as attributes and calculates statistics using Weka to give him results back; for
example, take a look at the snapshots below of the program that I built using ASP.net.
Due to lack of time, and just to show a representative idea, the program I built is not connected
to Weka; the results were extracted from Weka beforehand.
The button shows the client the results that exactly match his search, as well as similar
results that match some of his search data.
Experiments
Extra work: my own test, based on my understanding, applied to the flag-for-test dataset.
Second problem: I created noise and resolved it with clustering in Weka as follows:
Open the file > from Choose button > weka > Filters > unsupervised > attribute > Add noise >
in the window that appears I specified 50% noise to be applied to the last attribute > click
apply button > save.
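What adding 50% noise to a nominal attribute amounts to can be sketched as follows. This is an illustration under assumptions, not Weka's implementation: the `add_noise` helper, the seed, and the sample records are hypothetical, and each selected record simply receives a different value drawn from the values already present in that attribute.

```python
import random


def add_noise(rows, attr, percent=50, seed=1):
    """Return a copy of rows where `percent` of the records get a different value for attr."""
    rng = random.Random(seed)
    values = sorted({r[attr] for r in rows})
    noisy = [dict(r) for r in rows]
    k = round(len(rows) * percent / 100)
    for i in rng.sample(range(len(rows)), k):
        # Pick any value other than the current one
        alternatives = [v for v in values if v != noisy[i][attr]]
        if alternatives:
            noisy[i][attr] = rng.choice(alternatives)
    return noisy


# Hypothetical records with a two-valued last attribute
rows = [{"name": f"c{i}", "orange": "yes" if i % 2 else "no"} for i in range(8)]
noisy = add_noise(rows, "orange", percent=50)
changed = sum(a["orange"] != b["orange"] for a, b in zip(rows, noisy))
print(changed, "of", len(rows), "values changed")
```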
Here you can see the data before the noise affected it, where I have a unique value for the
orange attribute.
In order to minimize the noise, go to the Cluster tab > under cluster mode > select the radio
button Classes to clusters evaluation, then choose the attribute in which you created the noise
> start button.
Figure 31 : Cluster