
MIDDLE EAST TECHNICAL UNIVERSITY

IE4903 - INTRODUCTION TO DATA MINING

FINAL PROJECT
Fato lbi Onur Ylmaz

January 16, 2012

Table of Contents

Part A (FlightDelays.xls)
    Part I
    Part II
Part B (Sun-Xpress.xls)
    Part IV
    Part V
    Part VI
Appendices
    Appendix A.1 - A.9
    Appendix B.1 - B.8

PART A (FlightDelays.xls)

Initialization:
The Flight Status attribute is mapped to binary, with ontime coded as 1 and delayed as 0. Using the SplitTrainingAndValidation.m code given in Appendix A.1, the data is divided into training and validation sets and saved in the SplittedSets.mat file for future use. Applying the Naïve Rule with the main.m code segment provided in Appendix A.2 yields the following confusion matrices (a condensed sketch of the rule itself is given after the tables):

Confusion Matrix (Training)
                 Predicted Class
Actual Class     0        1
0                0        333
1                0        1428

Error Report (Training)
Class      # Cases    # Error    % Error
0          333        333        100
1          1428       0          0
Overall    1761       333        18,910

Confusion Matrix (Validation)
                 Predicted Class
Actual Class     0        1
0                0        95
1                0        345

Error Report (Validation)
Class      # Cases    # Error    % Error
0          95         95         100
1          345        0          0
Overall    440        95         21,591
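The Naïve Rule simply assigns every record to the majority class of the training set. A condensed sketch of that computation (the full script, including the splitting and the Excel reading, is in Appendix A.2) is:

% Condensed sketch of the Naive Rule of Appendix A.2: every record is assigned
% the majority class of the training set (the 0/1 label is in the last column).
classCol       = min(size(TrainingData));                    % index of the class column
PredictedClass = round(mean(TrainingData(:, classCol)));     % majority class (1 = ontime)
PredictTraining(1:max(size(TrainingData)))     = PredictedClass;
PredictValidation(1:max(size(ValidationData))) = PredictedClass;
[ConfTraining, order]   = confusionmat(TrainingData(:, classCol), PredictTraining)
[ConfValidation, order] = confusionmat(ValidationData(:, classCol), PredictValidation)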

Part I: a)
Categorical attributes: Carrier, Day of the Week, Destination, Origin, and Weather. Scheduled Departure Time is binned into categories of equal range. Since departure times range from 06:00 to 22:00, a one-hour range is selected to capture the importance of rush hours, which yields 16 categories. In this question, the randomly selected training and validation sets are gathered from MATLAB (as string and numerical tables) and saved into an Excel file named CombineConvert.xlsx, where the categorical attributes are converted to numerical codes with the help of macros. The Actual Departure Time attribute is eliminated since it cannot be known prior to departure, and Distance is eliminated since it is determined by Destination and Origin. Flight Number is eliminated with the same reasoning, and finally Tail Number is eliminated on the assumption that being delayed is not related to the specific aircraft. Details of the mapping of categorical values are provided in Appendix A.3.
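As an illustration, the one-hour binning can be reproduced directly from the HHMM-coded scheduled departure times; this is only a sketch of the idea, since in the actual workflow the binning is done by the Excel macros mentioned above:

% Minimal sketch of the one-hour binning of Scheduled Departure Time.
% Assumes times are stored as HHMM integers (e.g., 1730 for 17:30), as in Appendix A.5.
% The arithmetic works because the minute part never exceeds 59.
crsDepTime = [1730; 1300; 830; 600; 2200];          % example values
bin = max(1, ceil((crsDepTime - 600) / 100))        % 06:00->1, 07:01-08:00->2, ..., 21:01-22:00->16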

b)
Using the NaiveBayes class of MATLAB, the training set is fitted and the validation set is predicted. The code segment, named main.m, is provided in Appendix A.4. The first 10 predictions for the validation set are provided in Appendix A.5.
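Stripped of the Excel round-trip, the core of that workflow is only a few calls; a condensed sketch (attributes and classes denote the numerically coded training columns, attributesTest and classesTest the coded validation columns, as in Appendix A.4):

% Condensed sketch of the Naive Bayes step of Appendix A.4 ('mvmn' treats all
% predictors as categorical, i.e. multivariate multinomial distributions).
O1 = NaiveBayes.fit(attributes, classes, 'dist', 'mvmn');   % fit on coded training data
C1 = O1.predict(attributesTest);                            % predict validation classes
[cMat, order] = confusionmat(classesTest, C1)               % confusion matrix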

c)
The confusion matrices and error reports for the training and validation data are as follows:

Confusion Matrix (Training)
                 Predicted Class
Actual Class     0        1
0                49       284
1                32       1396

Error Report (Training)
Class      # Cases    # Error    % Error
0          333        284        85,285
1          1428       32         2,241
Overall    1761       316        17,944

Confusion Matrix (Validation)
                 Predicted Class
Actual Class     0        1
0                20       75
1                12       333

Error Report (Validation)
Class      # Cases    # Error    % Error
0          95         75         78,947
1          345        12         3,478
Overall    440        87         19,772

d)

In this part, 5-fold cross-validation is performed for the Naïve Bayes classifier. Using the main.m file provided in Appendix A.6, the following confusion matrices are constructed over 5 runs:

First Run:

Confusion Matrix (Training)
                 Predicted Class
Actual Class     0        1
0                67       279
1                46       1396

Error Report (Training)
Class      # Cases    # Error    % Error
0          346        279        80,635
1          1415       46         3,251
Overall    1761       325        18,455

Confusion Matrix (Validation)
                 Predicted Class
Actual Class     0        1
0                14       68
1                13       345

Error Report (Validation)
Class      # Cases    # Error    % Error
0          82         68         82,927
1          358        13         3,631
Overall    440        81         18,409

Second Run:

Confusion Matrix (Training)
                 Predicted Class
Actual Class     0        1
0                71       270
1                47       1373

Error Report (Training)
Class      # Cases    # Error    % Error
0          341        270        79,179
1          1420       47         3,310
Overall    1761       317        18,001

Confusion Matrix (Validation)
                 Predicted Class
Actual Class     0        1
0                9        78
1                20       333

Error Report (Validation)
Class      # Cases    # Error    % Error
0          87         78         89,655
1          353        20         5,666
Overall    440        98         22,272

Third Run:

Confusion Matrix (Training)
                 Predicted Class
Actual Class     0        1
0                59       290
1                40       1372

Error Report (Training)
Class      # Cases    # Error    % Error
0          349        290        83,095
1          1412       40         2,833
Overall    1761       330        18,739

Confusion Matrix (Validation)
                 Predicted Class
Actual Class     0        1
0                12       67
1                6        355

Error Report (Validation)
Class      # Cases    # Error    % Error
0          79         67         84,810
1          361        6          1,662
Overall    440        73         16,591

Fourth Run:

Confusion Matrix (Training)
                 Predicted Class
Actual Class     0        1
0                55       279
1                36       1391

Error Report (Training)
Class      # Cases    # Error    % Error
0          334        279        83,533
1          1427       36         2,523
Overall    1761       315        17,888

Confusion Matrix (Validation)
                 Predicted Class
Actual Class     0        1
0                11       83
1                8        338

Error Report (Validation)
Class      # Cases    # Error    % Error
0          94         83         88,298
1          346        8          2,312
Overall    440        91         20,682

Fifth Run:

Confusion Matrix (Training)
                 Predicted Class
Actual Class     0        1
0                49       284
1                32       1396

Error Report (Training)
Class      # Cases    # Error    % Error
0          333        284        85,285
1          1428       32         2,241
Overall    1761       316        17,944

Confusion Matrix (Validation)
                 Predicted Class
Actual Class     0        1
0                20       75
1                12       333

Error Report (Validation)
Class      # Cases    # Error    % Error
0          95         75         78,947
1          345        12         3,478
Overall    440        87         19,772

Averages:

Error Report (Training)
Class      # Cases    # Error    % Error
0          1703       1402       82,32531
1          7102       201        2,830189
Overall    8805       1603       18,20557

Error Report (Validation)
Class      # Cases    # Error    % Error
0          437        371        84,89703
1          1763       59         3,346568
Overall    2200       430        19,54545

An average total classification error of 18,47 % is obtained after the 5-fold validation, computed over the entire data set, i.e. both training and validation sets. Taking the Naïve Rule as a benchmark, the training and validation errors of Naïve Bayes are valid; however, they are very high, in other words very close to the Naïve Rule values. For the Type I and Type II error calculations, it is assumed that finding the delayed flights is what matters. Therefore, a Type I error is defined as predicting an ontime flight as delayed, and a Type II error as predicting a delayed flight as ontime (a sketch of how these are read from a confusion matrix is given after the tables below).

Type-I Errors (%) for 5-fold cross validation
              Run #1    Run #2    Run #3    Run #4    Run #5    Average
Training      3,25      3,31      2,83      2,52      2,24      2,83
Validation    3,63      5,67      1,66      2,31      3,48      3,35

Type-II Errors (%) for 5-fold cross validation
              Run #1    Run #2    Run #3    Run #4    Run #5    Average
Training      80,64     79,18     83,09     83,53     85,29     82,35
Validation    82,93     89,66     84,81     88,30     78,95     84,93

As can be seen from the tables above, this classification model performs well in terms of Type-I error, which means it predicts ontime flights well. However, the high Type-II error percentages show that the model is not good at identifying delayed flights.
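These percentages can be read directly from each run's confusion matrix; a minimal sketch, using the 2-by-2 matrix cMat returned by confusionmat with classes ordered [0, 1] as in the appendix code:

% Minimal sketch: Type-I and Type-II error percentages from a 2x2 confusion
% matrix cMat whose rows are actual classes and columns predicted classes,
% both ordered [0 (delayed), 1 (ontime)].
typeI  = cMat(2,1) / sum(cMat(2,:)) * 100;   % ontime flights predicted as delayed (%)
typeII = cMat(1,2) / sum(cMat(1,:)) * 100;   % delayed flights predicted as ontime (%)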

Part II: a)
For the classification tree, the categorical attributes, and the attributes that can be treated as categorical, are selected as follows: Carrier, Destination, Origin, Weather, Day of the Week, and Day of the Month. Flight Date is eliminated since Day of the Month already reflects it, all records being from January. Actual Departure Time is eliminated since it cannot be known prior to realization and therefore does not fit the nature of prediction. Distance is eliminated because it is determined by Destination and Origin. Flight Number is eliminated because it depends on destination, origin, day of month, and carrier, which are already considered. Finally, Tail Number is eliminated since it refers only to a specific airplane, and it is assumed that being ontime is not related to the particular plane. Scheduled Departure Time is binned into one-hour equal ranges, and all mappings of the categories are given in Appendix A.3. Scheduled Departure Time is binned because being delayed or ontime is thought to be related not to the exact scheduled time but to time intervals, such as rush hours. These mapping and binning operations are again made in the Excel file named CombineConvert.xls with the help of macros. Using MATLAB's ClassificationTree class, the training set is used to construct the classification tree and the validation set is predicted, as can be seen from the main.m provided in Appendix A.7. The rules generated by this tree are provided in Appendix A.8. Using these rules, prediction of the classes in the validation data yields the following confusion matrix:

Confusion Matrix (Validation)
                 Predicted Class
Actual Class     0        1
0                26       69
1                38       307

Error Report (Validation)
Class      # Cases    # Error    % Error
0          95         69         72,63158
1          345        38         11,01449
Overall    440        87         19,77273

The classification tree yielded an error of 12,21 % for the training set; however, the error rate increased to 19,77 % for the validation set, as can be seen above. Compared to Naive Bayes, which yielded 19,54 % on the validation data, Naive Bayes predicts better when it comes to new data. Regarding the practical usage of this tree: departure time, carrier, origin and destination, weather, and day of week can give some clues about whether a flight will be delayed. However, since the data covers only one month, the Day of Month attribute does not provide useful information. Therefore, it could be said that this tree is not practical for predicting whether a flight will be delayed.

When the rules are investigated, it is firstly seen that weather is an important determinant of whether a flight is delayed. Secondly, some rules of thumb can be derived from the tree. For instance, when the weather is fine, carriers UA and US are always ontime when the departure time is 14:00 or later. Likewise, when the departure time is before 14:00, flights going to EWR and JFK are always ontime. Thirdly, the large number of branches on Day of Month and Day of Week stands out in the tree; these branches seem to provide little useful information beyond fitting the training data.

b)
When the mapped values are checked:
Destination: 1
Departure Time: 1
Origin: 2
Day of Week: 1

Firstly, the weather situation should be known. Although it cannot be completely predicted, weather forecasts can be used. Secondly, according to the tree, the carrier should be known; however, in this particular situation the absence of this information does not create a problem, since there are two carriers, CO and RU, which are not checked on any branch related to this question. If the weather is not fine, all such flights are delayed; on the other hand, if the weather is fine, day of month information is needed by this tree. When the tree is further analyzed, it is found that these flights are ontime when Day of Month is between 1 and 25; after the 25th of the month, they become delayed.

Although carrier information is important in predicting whether a flight will be delayed, for this example it becomes redundant. In practice, the percentage of ontime flights is an important quality factor for carriers.
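As an illustration of how such a query could be scored programmatically, a sketch with assumed values for the unknown fields (the predictor order follows the PredictorNames list of Appendix A.7, and the codings follow Appendix A.3):

% Minimal sketch: scoring the flight from part (b) with the tree of Appendix A.7.
% Predictor order: DepartureTime, Carrier, Destination, Origin, Weather, DayOfWeek, DayOfMonth.
load('ClassificationTree.mat');      % loads tc, saved by the code in Appendix A.7
weather    = 0;                      % assumption: fine weather, e.g. taken from a forecast
carrier    = 1;                      % assumption: CO (coding per Appendix A.3)
dayOfMonth = 10;                     % assumption: an example day before the 25th
newFlight  = [1 carrier 1 2 weather 1 dayOfMonth];   % DepTime bin=1, Dest=1, Origin=2, DayOfWeek=1
label      = predict(tc, newFlight)  % predicted Flight Status (1 = ontime, 0 = delayed)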

c)
In this question, the Day of Month attribute is first eliminated and a full tree is constructed. In addition, pruned trees at different levels are examined with the help of main.m, which is provided in Appendix A.9. When the tree is pruned to the maximum level (the maximum of FullTree.PruneList), it resembles classification by the Naïve Rule: pruning eliminates leaves and turns branches into leaves, which merges nodes and removes predictors, so the fully pruned tree behaves like the Naïve Rule.

The top three predictors of the full tree are:
- Weather
- Departure Time (before 14:00 vs. after 14:00)
- Carrier (UA or US vs. others)

The pruned tree has a single node as its terminal point because, in order to obtain a completely pruned tree, branches must be eliminated and leaves merged. Doing this repeatedly, the completely pruned tree becomes a single-node tree, in other words just a root.

Using only the top three predictors of the full tree instead of the best pruned tree would not take into consideration day of week and the other carriers, as well as the destination and origin points. Although it would yield a faster classification, it would increase the classification error since it oversimplifies the problem.
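One way to compare prune levels programmatically is sketched below; this builds on the objects of Appendix A.9 and uses resubstitution error on the coded training data (Appendix A.9 itself only contrasts the full tree with the maximally pruned tree):

% Minimal sketch: misclassification rate of the tree pruned to each level.
maxLevel = max(fullTree.PruneList);
err = zeros(maxLevel + 1, 1);
for level = 0:maxLevel
    pt = prune(fullTree, 'level', level);
    err(level + 1) = mean(predict(pt, attributes) ~= classes);   % resubstitution error
end
[bestErr, idx] = min(err);
bestLevel = idx - 1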

PART B (Sun-Xpress.xls)

Initialization:
In this step, the data is randomly divided into training and validation sets using the SplitTrainingAndValidation code segment provided in Appendix A.1. Then the ID column is deleted; the attributes and classes of each set are separated and saved in SplittedSets.mat after normalization, for future use. Normalization is applied by subtracting the minimum of the related attribute and dividing by its maximum. The normalization is carried out in Excel, and all other operations are done using the main file provided in Appendix B.1.
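A minimal MATLAB equivalent of the normalization described above is sketched below (in the actual workflow it was done in Excel before SplittedSets.mat was saved, so this is only illustrative):

% Minimal sketch of the normalization: each attribute column has its minimum
% subtracted and is then divided by its maximum. Dividing by (max - min)
% instead would give standard min-max scaling to [0, 1].
mins = min(TrainingAttributes, [], 1);
maxs = max(TrainingAttributes, [], 1);
TrainingAttributesNorm   = bsxfun(@rdivide, bsxfun(@minus, TrainingAttributes, mins), maxs);
% Assumption: the validation set is scaled with the training minima and maxima.
ValidationAttributesNorm = bsxfun(@rdivide, bsxfun(@minus, ValidationAttributes, mins), maxs);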

Part IV: a)
In this step, MATLAB's knnclassify function is used to predict the classes of the validation set from the training set. The code segment, named main.m, is provided in Appendix B.2.
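The essential computation is a single call per k; a condensed sketch (Appendix B.2 repeats this for k = 2 to 5), with the overall error rate extracted directly from the confusion matrix:

% Minimal sketch of one k-NN run on the normalized data (k = 5 as an example).
load SplittedSets.mat;
k = 5;
Prediction = knnclassify(ValidationAttributes, TrainingAttributes, TrainingClasses, k);
[cMat, order] = confusionmat(ValidationClasses, Prediction);
errorPct = (cMat(1,2) + cMat(2,1)) / sum(cMat(:)) * 100   % overall misclassification (%)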

b)
Using the MATLAB code mentioned in the previous section, the confusion matrices and error reports for k = 2, 3, 4, and 5 are as follows:


Confusion Matrix (Validation, k=2)
                 Predicted Class
Actual Class     0        1
0                763      103
1                115      16

Error Report (Validation, k=2)
Class      # Cases    # Error    % Error
0          866        103        11,89376
1          131        115        87,78626
Overall    997        218        21,8656

Confusion Matrix (Validation, k=3)
                 Predicted Class
Actual Class     0        1
0                825      41
1                114      17

Error Report (Validation, k=3)
Class      # Cases    # Error    % Error
0          866        41         4,734411
1          131        114        87,0229
Overall    997        155        15,54664

Confusion Matrix (Validation, k=4)
                 Predicted Class
Actual Class     0        1
0                824      42
1                118      13

Error Report (Validation, k=4)
Class      # Cases    # Error    % Error
0          866        42         4,849885
1          131        118        90,07634
Overall    997        160        16,04814

Confusion Matrix (Validation, k=5)
                 Predicted Class
Actual Class     0        1
0                850      16
1                123      8

Error Report (Validation, k=5)
Class      # Cases    # Error    % Error
0          866        16         1,847575
1          131        123        93,89313
Overall    997        139        13,94183


When the validation errors are summarized, the following table is constructed:

k value    Validation Classification Error (%)
2          21,8656
3          15,54664
4          16,04814
5          13,94183

As can be seen, for this situation the minimum validation classification error occurs at k=5.

c)
In this part, the misclassification error for the training set is tracked from k=2 to k=50. Using the code provided in Appendix B.3, the error percentages are stored in arrays and exported to Excel. The calculated error percentages are then plotted, producing the following chart:

[Figure: Misclassification Error for Training Set - error (%) vs. k, for k = 2 to 50]


The best k value, which minimizes the classification error on the training set, is found to be k=2, where the error percentage is 0,05 %. However, when the same analysis is made on the classification errors for the validation set, the following chart is obtained:

[Figure: Misclassification Error for Validation Set - error (%) vs. k, for k = 2 to 50]

For the validation set, the best k value is found to be 19 (and also 20), where the misclassification error is 12,94 %.


Part V: a)
In this step, the normalized and split data from the initialization stage is used. First, using the main.m file provided in Appendix B.4, the data are loaded and transposed in order to be used with the neural network functions. Then, using MATLAB's network training functions, the code segment provided in Appendix B.5 is run for 3000 epochs. The structure of the network can be summarized as follows:

Input Layer: 14 inputs
Hidden Layer: 10 neurons
Output Layer: 1 output

Data Division: Random (dividerand)
Training: Levenberg-Marquardt (trainlm)
Performance: SSE
Derivative: Default (defaultderiv)

Visually, the network used can be seen below: [Figure: network diagram - 14 inputs, one hidden layer with 10 neurons, 1 output]

Using the network trained for 3000 epochs with nntool, predictions are made, and a cut-off value of 0,5 is used to assign class labels to the predictions. The following confusion matrices are then constructed with the help of the code provided in Appendix B.6.
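A condensed sketch of that step, assuming the saved network from Appendix B.5 and the transposed matrices from Appendix B.4 are available, replaces the element-wise loop of Appendix B.6 with a vectorized threshold:

% Minimal sketch: simulate the trained network and apply the 0,5 cut-off.
load('Network3000.mat');                                   % loads net (Appendix B.5)
outputsValidation     = net(ValidationAttributes);         % 1-by-N raw network outputs
PredictionsValidation = double(outputsValidation >= 0.5);  % cut-off at 0,5 -> class labels
[cMat, order] = confusionmat(ValidationClasses, PredictionsValidation)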


Confusion Matrix (Training)
                 Predicted Class
Actual Class     0        1
0                3448     16
1                462      62

Error Report (Training)
Class      # Cases    # Error    % Error
0          3464       16         0,46189
1          524        462        88,1679
Overall    3988       478        11,9859

Confusion Matrix (Validation)
                 Predicted Class
Actual Class     0        1
0                851      15
1                117      14

Error Report (Validation)
Class      # Cases    # Error    % Error
0          866        15         1,732
1          131        117        89,312
Overall    997        132        13,239

b)

In this part, the code segment provided in Appendix B.5 is used with 100 epochs instead of 3000. Following the same procedure and using the same code mentioned above, the following matrices are obtained:

Confusion Matrix (Training)
                 Predicted Class
Actual Class     0        1
0                3464     0
1                524      0

Error Report (Training)
Class      # Cases    # Error    % Error
0          3464       0          0,00
1          524        524        100,00
Overall    3988       524        13,1394

Confusion Matrix (Validation)
                 Predicted Class
Actual Class     0        1
0                866      0
1                131      0

Error Report (Validation)
Class      # Cases    # Error    % Error
0          866        0          0,00
1          131        131        100,00
Overall    997        131        13,1394

When the error percentages are compared, as shown in the table below, the neural network represents the training set better as the number of epochs increases. However, as the network overfits the training data, it becomes less powerful at predicting the validation set. Therefore, it can be said that 3000 epochs create an overfitting problem for this dataset when compared to 100 epochs.

Number of Epochs    Training Error (%)    Validation Error (%)
100                 13,1394               13,1394
3000                11,9859               13,239

c)

When the validation error rates for the Sun-Xpress question are compared, the best model, which minimizes the validation classification error, is the k-Nearest Neighbor method with k = 19, yielding a validation error of 12,94 %. Since all candidate k values are checked, this is the best k-Nearest Neighbor model; however, epoch counts between 100 and 3000 are not checked for the neural networks. Therefore, it should be noted that there could be a better neural network classifier within that range.


Part VI:

a)
The first code segment, main.m, which is provided in Appendix B.7, performs the clustering and labels each cluster with the majority of the class labels in that cluster. It then uses the ErrorMatrix sub-function, provided in Appendix B.8, to calculate the error percentages.

b)
In this part, the training set created in the initialization step is clustered using the kmeans function. Two different distance measures are used: Euclidean distance (the default) and L1 (city-block) distance. The data set is clustered into k = 2, 3, ..., 25 clusters. Each cluster is then labeled with the majority of the class labels of its data points, i.e. a cluster is labeled 0 if the average of its class labels is below 0,5, and 1 otherwise. The data points in a cluster inherit the cluster's label, and the total classification error on the training set is calculated from these labels. The results are as follows: with both Euclidean distance and L1 distance, the misclassification error is constant at 13,14 %. A condensed sketch of this labeling procedure, for a single k, is given after the chart below.
[Figure: Misclassification error (%) in the training set vs. number of clusters, k = 2 to 25 - constant at 13,14 % for both distance measures]
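The vectorized sketch below shows the same procedure for a single k with Euclidean distance (Appendix B.7 loops over k = 2, ..., 25 and both distance measures); the variable names here are illustrative only:

% Minimal sketch: cluster the training set, label each cluster by majority class,
% and compute the resulting training misclassification error.
load SplittedSets.mat;
k = 5;                                                      % example cluster count
[idx, ctrs] = kmeans(TrainingAttributes, k, 'distance', 'sqEuclidean', 'emptyaction', 'drop');
clusterLabel = zeros(k, 1);
for t = 1:k
    clusterLabel(t) = mean(TrainingClasses(idx == t)) >= 0.5;   % majority vote (0/1)
end
rowLabel   = clusterLabel(idx);                             % each point inherits its cluster's label
trainError = mean(rowLabel ~= TrainingClasses) * 100        % misclassification error (%)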


c)
As can be seen from part (b), the misclassification error remains the same regardless of the distance measure and the number of clusters. Hence, any of the configurations can be chosen; for example, two clusters with Euclidean distance.

d)
The centers of the clusters are held in ctrs. First, the distances between the centers and the test data are calculated using a distance measure (Euclidean or L1). Then, the minimum of these distances and the corresponding index are found, which gives the cluster to which the test data point belongs.
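A sketch of this assignment is given below; it assumes ctrs and clusterLabel from the sketch in part (b), and TestAttributes is a hypothetical matrix of normalized test points with the same columns as the training attributes:

% Minimal sketch: assign each test point to the nearest cluster center and
% inherit that cluster's majority label.
D = pdist2(TestAttributes, ctrs, 'euclidean');    % use 'cityblock' for the L1 distance
[~, nearestCluster] = min(D, [], 2);              % index of the closest center for each row
predictedLabel = clusterLabel(nearestCluster);    % predicted class of each test point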

e)
Each cluster is labeled 0 when L1 distance is used. Thus, every test data point will be labeled 0, which implies that the misclassification error is simply the number of 1s divided by the total number of test points, i.e. (131/997)*100 = 13,14 %. When Euclidean distance is used, for almost every k value all clusters are also labeled 0, so the misclassification error is again 13,14 %.


Appendix A.1) SplitTrainingAndValidation.m


function [TrainingData, ValidationData, TrainingStr, ValidationStr] = SplitTrainingAndValidation(data, str)
% Gather size of the data
sizeTemp = size(data);
numberOfRows = sizeTemp(1);
numberOfColumns = sizeTemp(2);
% Determine the limit
limit = round(0.8 * numberOfRows);
% Random numbers
randomNumber = randperm(numberOfRows);
% Training set
for i = 1:limit,
    TrainingData(i,:) = data(randomNumber(i),:);
    TrainingStr(i,:) = str(randomNumber(i),:);
end
% Validation set
counter = 1;
for i = limit+1:numberOfRows,
    ValidationData(counter,:) = data(randomNumber(i),:);
    ValidationStr(counter,:) = str(randomNumber(i),:);
    counter = counter + 1;
end
end


Appendix A.2) main.m


%Reading Excel Sheet
[data, strTemp] = xlsread('FlightDelays.xls');
str(:,:) = strTemp(2:max(size(strTemp)),:);
%Splitting
[TrainingData, ValidationData, TrainingStr, ValidationStr] = SplitTrainingAndValidation(data, str);
clear data,strTemp;
save SplittedSets.mat;
load SplittedSets.mat;
PredictedClass = round(mean(TrainingData(:,(min(size(TrainingData))))));
PredictTraining(1:max(size(TrainingData))) = PredictedClass;
PredictValidation(1:max(size(ValidationData))) = PredictedClass;
[ConfTraining,order] = confusionmat(TrainingData(:,(min(size(TrainingData)))),PredictTraining)
[ConfValidation,order] = confusionmat(ValidationData(:,(min(size(ValidationData)))),PredictValidation)


Appendix A.3)
Binned categories of Scheduled Departure Time and mapping of class attributes:

Dept. Time
06:00 to 07:00    1
07:01 to 08:00    2
08:01 to 09:00    3
09:01 to 10:00    4
10:01 to 11:00    5
11:01 to 12:00    6
12:01 to 13:00    7
13:01 to 14:00    8
14:01 to 15:00    9
15:01 to 16:00    10
16:01 to 17:00    11
17:01 to 18:00    12
18:01 to 19:00    13
19:01 to 20:00    14
20:01 to 21:00    15
21:01 to 22:00    16

Carrier
CO    1
DH    2
DL    3
MQ    4
OH    5
RU    6
UA    7
US    8

Destination
EWR    1
JFK    2
LGA    3

Origin
BWI    1
DCA    2
IAD    3


Appendix A.4) main.m


load SplittedSets.mat;
xlswrite('CombineConvert.xls',TrainingData, 'Sheet3');
xlswrite('CombineConvert.xls',TrainingStr, 'Sheet2');
display('Open CombineConvert.xls and run macros...');
display('If this is a test run, just press Enter!');
reply = input('Otherwise, press Enter when it is done!', 's');
data = xlsread('CombineConvert.xls','Ready');
temp = min(size(data));
temp2 = temp-1;
attributes(:,1:temp2) = data(:,1:temp2);
classes = data(:,temp);
O1 = NaiveBayes.fit(attributes(:,1:temp2),classes,'dist','mvmn');
display('');
display('Open CombineConvert.xls and clear sheets!');
reply = input('Press Enter when it is done!', 's');
display('');
xlswrite('CombineConvert.xls',ValidationData, 'Sheet3');
xlswrite('CombineConvert.xls',ValidationStr, 'Sheet2');
display('Open CombineConvert.xls and run macros...');
display('If this is a test run, just press Enter!');
reply = input('Otherwise, press Enter when it is done!', 's');
data2 = xlsread('CombineConvert.xls','Ready');
temp = min(size(data2));
temp2 = temp-1;
attributesTest(:,1:temp2) = data2(:,1:temp2);
classesTest = data2(:,temp);
C1 = O1.predict(attributesTest);
[cMat1,order] = confusionmat(classesTest,C1)


Appendix A.5)
The first 10 predictions in the validation set are as follows:

CRS_DEP_TIME  CARRIER  DEP_TIME  DEST  DISTANCE  FL_DATE     FL_NUM  ORIGIN  Weather  DAY_WEEK  DAY_OF_MONTH  TAIL_NUM  Flight Status  Prediction
1730          RU       1723      EWR   199       11.01.2004  2097    DCA     0        7         11            N16976    ontime         ontime
1300          MQ       1254      LGA   214       31.01.2004  4964    DCA     0        6         31            N710MQ    ontime         ontime
1300          RU       1252      EWR   213       23.01.2004  2692    IAD     0        5         23            N16502    ontime         ontime
830           DL       826       LGA   214       25.01.2004  1744    DCA     0        7         25            N231DN    ontime         ontime
1700          US       1657      LGA   214       09.01.2004  2180    DCA     0        5         9             N750UW    ontime         ontime
1700          RU       1650      EWR   213       02.01.2004  2497    IAD     0        5         2             N12528    ontime         ontime
1630          CO       1620      EWR   199       25.01.2004  810     DCA     0        7         25            N33608    ontime         ontime
930           RU       925       EWR   199       18.01.2004  2582    DCA     0        7         18            N27962    ontime         ontime
700           MQ       654       LGA   214       12.01.2004  4952    DCA     0        1         12            N801MQ    delayed        ontime
1610          DH       1607      JFK   228       22.01.2004  7816    IAD     0        4         22            N324UE    ontime         ontime


Appendix A.6) main.m


%Reading Excel Sheet
[data, strTemp] = xlsread('FlightDelays.xls');
str(:,:) = strTemp(2:max(size(strTemp)),:);
%Splitting
[TrainingData, ValidationData, TrainingStr, ValidationStr] = SplitTrainingAndValidation(data, str);
clear data,strTemp;
save SplittedSets.mat;
load SplittedSets.mat;
xlswrite('CombineConvert.xls',TrainingData, 'Sheet3');
xlswrite('CombineConvert.xls',TrainingStr, 'Sheet2');
display('This is for training set:');
display('Open CombineConvert.xls and run macros...');
display('If this is a test run, just press Enter!');
reply = input('Otherwise, press Enter when it is done!', 's');
data = xlsread('CombineConvert.xls','Ready');
temp = min(size(data));
temp2 = temp-1;
attributes(:,1:temp2) = data(:,1:temp2);
classes = data(:,temp);
O1 = NaiveBayes.fit(attributes(:,1:temp2),classes,'dist','mvmn');
C1 = O1.predict(attributes);
[cMat1,order] = confusionmat(classes,C1)
display('This is for cleaning:');
display('Open CombineConvert.xls and clear sheets!');
reply = input('Press Enter when it is done!', 's');
display('This is for validation set:');
xlswrite('CombineConvert.xls',ValidationData, 'Sheet3');
xlswrite('CombineConvert.xls',ValidationStr, 'Sheet2');
display('Open CombineConvert.xls and run macros...');
display('If this is a test run, just press Enter!');
reply = input('Otherwise, press Enter when it is done!', 's');
data2 = xlsread('CombineConvert.xls','Ready');
temp = min(size(data2));
temp2 = temp-1;
attributesTest(:,1:temp2) = data2(:,1:temp2);
classesTest = data2(:,temp);
C2 = O1.predict(attributesTest);
[cMat2,order] = confusionmat(classesTest,C2)


Appendix A.7) main.m


load SplittedSets.mat;
xlswrite('CombineConvert.xls',TrainingData, 'Sheet3');
xlswrite('CombineConvert.xls',TrainingStr, 'Sheet2');
display('Open CombineConvert.xls and run macros...');
display('If this is a test run, just press Enter!');
reply = input('Otherwise, press Enter when it is done!', 's');
data = xlsread('CombineConvert.xls','Ready');
temp = min(size(data));
temp2 = temp-1;
attributes(:,1:temp2) = data(:,1:temp2);
classes = data(:,temp);
save ReadyForTree.mat;
load ReadyForTree.mat;
tc = ClassificationTree.fit(attributes, classes, 'PredictorNames', ...
    {'DepartureTime', 'Carrier', 'Destination', 'Origin', 'Weather', 'DayOfWeek', 'DayOfMonth'});
save('ClassificationTree','tc');
display('');
display('Open CombineConvert.xls and clear sheets!');
reply = input('Press Enter when it is done!', 's');
display('');
xlswrite('CombineConvert.xls',ValidationData, 'Sheet3');
xlswrite('CombineConvert.xls',ValidationStr, 'Sheet2');
display('This is for validation set:');
display('Open CombineConvert.xls and run macros...');
display('If this is a test run, just press Enter!');
reply = input('Otherwise, press Enter when it is done!', 's');
data2 = xlsread('CombineConvert.xls','Ready');
temp = min(size(data2));
temp2 = temp-1;
attributesTest(:,1:temp2) = data2(:,1:temp2);
classesTest = data2(:,temp);
predicted = predict(tc, attributesTest);
display('Confusion matrix for validation set: ');
[cMat2,order] = confusionmat(classesTest,predicted)


Appendix A.8) Rules of Classification Tree


Decision tree for classification
  1  if Weather<0.5 then node 2 elseif Weather>=0.5 then node 3 else 1
  2  if DepartureTime<8.5 then node 4 elseif DepartureTime>=8.5 then node 5 else 1
  3  class = 0
  4  if Origin<2.5 then node 6 elseif Origin>=2.5 then node 7 else 1
  5  if Carrier<7 then node 8 elseif Carrier>=7 then node 9 else 1
  6  if Origin<1.5 then node 10 elseif Origin>=1.5 then node 11 else 1
  7  if DayOfMonth<25.5 then node 12 elseif DayOfMonth>=25.5 then node 13 else 1
  8  if DayOfWeek<6.5 then node 14 elseif DayOfWeek>=6.5 then node 15 else 1
  9  if DayOfMonth<24.5 then node 16 elseif DayOfMonth>=24.5 then node 17 else 1
 10  if DayOfWeek<6.5 then node 18 elseif DayOfWeek>=6.5 then node 19 else 1
 11  if DayOfMonth<13.5 then node 20 elseif DayOfMonth>=13.5 then node 21 else 1
 12  if DayOfMonth<5.5 then node 22 elseif DayOfMonth>=5.5 then node 23 else 1
 13  if DayOfWeek<2.5 then node 24 elseif DayOfWeek>=2.5 then node 25 else 1
 14  if DayOfMonth<1.5 then node 26 elseif DayOfMonth>=1.5 then node 27 else 1
 15  if Destination<2.5 then node 28 elseif Destination>=2.5 then node 29 else 1
 16  if DayOfWeek<2.5 then node 30 elseif DayOfWeek>=2.5 then node 31 else 1
 17  if DepartureTime<14.5 then node 32 elseif DepartureTime>=14.5 then node 33 else 1
 18  class = 1
 19  class = 0
 20  class = 1
 21  if DayOfMonth<16.5 then node 34 elseif DayOfMonth>=16.5 then node 35 else 1
 22  if Carrier<4 then node 36 elseif Carrier>=4 then node 37 else 1
 23  if DayOfMonth<8.5 then node 38 elseif DayOfMonth>=8.5 then node 39 else 1
 24  class = 0
 25  class = 1
 26  class = 1
 27  if DayOfMonth<30.5 then node 40 elseif DayOfMonth>=30.5 then node 41 else 1
 28  if Carrier<2.5 then node 42 elseif Carrier>=2.5 then node 43 else 0
 29  class = 1
 30  class = 1
 31  if DayOfWeek<4.5 then node 44 elseif DayOfWeek>=4.5 then node 45 else 1
 32  if DayOfWeek<3.5 then node 46 elseif DayOfWeek>=3.5 then node 47 else 1
 33  class = 1
 34  if Carrier<5 then node 48 elseif Carrier>=5 then node 49 else 1
 35  if DayOfMonth<25.5 then node 50 elseif DayOfMonth>=25.5 then node 51 else 1
 36  if DayOfWeek<2.5 then node 52 elseif DayOfWeek>=2.5 then node 53 else 1
 37  class = 1
 38  class = 1
 39  if DayOfMonth<10.5 then node 54 elseif DayOfMonth>=10.5 then node 55 else 1
 40  if DayOfMonth<25 then node 56 elseif DayOfMonth>=25 then node 57 else 1
 41  class = 1
 42  if DayOfMonth<14.5 then node 58 elseif DayOfMonth>=14.5 then node 59 else 1
 43  if DayOfMonth<7.5 then node 60 elseif DayOfMonth>=7.5 then node 61 else 0
 44  if DayOfMonth<11 then node 62 elseif DayOfMonth>=11 then node 63 else 1
 45  class = 1
 46  if DayOfWeek<2.5 then node 64 elseif DayOfWeek>=2.5 then node 65 else 1
 47  class = 1
 48  if Carrier<3.5 then node 66 elseif Carrier>=3.5 then node 67 else 1
 49  class = 1
 50  class = 1
 51  if DepartureTime<3.5 then node 68 elseif DepartureTime>=3.5 then node 69 else 1
 52  class = 0
 53  if Destination<1.5 then node 70 elseif Destination>=1.5 then node 71 else 1
 54  if Destination<1.5 then node 72 elseif Destination>=1.5 then node 73 else 1
 55  if Carrier<4 then node 74 elseif Carrier>=4 then node 75 else 1
 56  if Carrier<1.5 then node 76 elseif Carrier>=1.5 then node 77 else 1
 57  if DayOfWeek<2.5 then node 78 elseif DayOfWeek>=2.5 then node 79 else 1
 58  class = 1
 59  if Destination<1.5 then node 80 elseif Destination>=1.5 then node 81 else 0
 60  class = 0
 61  if Carrier<4.5 then node 82 elseif Carrier>=4.5 then node 83 else 1
 62  if DepartureTime<11.5 then node 84 elseif DepartureTime>=11.5 then node 85 else 1
 63  class = 1
 64  class = 1
 65  class = 0
 66  class = 1
 67  if DepartureTime<0.5 then node 86 elseif DepartureTime>=0.5 then node 87 else 0
 68  if DayOfWeek<3.5 then node 88 elseif DayOfWeek>=3.5 then node 89 else 1
 69  class = 1
 70  class = 0
 71  class = 1
 72  class = 1
 73  class = 0
 74  class = 1
 75  if DayOfMonth<16.5 then node 90 elseif DayOfMonth>=16.5 then node 91 else 1
 76  if DepartureTime<12.5 then node 92 elseif DepartureTime>=12.5 then node 93 else 1
 77  if Carrier<3.5 then node 94 elseif Carrier>=3.5 then node 95 else 1
 78  if Origin<2.5 then node 96 elseif Origin>=2.5 then node 97 else 0
 79  if Carrier<3.5 then node 98 elseif Carrier>=3.5 then node 99 else 1
 80  class = 0
 81  class = 1
 82  class = 0
 83  if DepartureTime<14.5 then node 100 elseif DepartureTime>=14.5 then node 101 else 1
 84  class = 0
 85  class = 1
 86  class = 1
 87  class = 0
 88  if DepartureTime<2.5 then node 102 elseif DepartureTime>=2.5 then node 103 else 1
 89  class = 1
 90  if DayOfWeek<3.5 then node 104 elseif DayOfWeek>=3.5 then node 105 else 1
 91  class = 1
 92  if DayOfWeek<3.5 then node 106 elseif DayOfWeek>=3.5 then node 107 else 0
 93  if DayOfMonth<7.5 then node 108 elseif DayOfMonth>=7.5 then node 109 else 1
 94  if DepartureTime<9.5 then node 110 elseif DepartureTime>=9.5 then node 111 else 1
 95  if DayOfWeek<5.5 then node 112 elseif DayOfWeek>=5.5 then node 113 else 1
 96  if Carrier<4.5 then node 114 elseif Carrier>=4.5 then node 115 else 1
 97  class = 0
 98  if Origin<2.5 then node 116 elseif Origin>=2.5 then node 117 else 1
 99  if Destination<2.5 then node 118 elseif Destination>=2.5 then node 119 else 1
100  class = 1
101  class = 0
102  if DayOfWeek<1.5 then node 120 elseif DayOfWeek>=1.5 then node 121 else 1
103  class = 0
104  class = 1
105  class = 0
106  class = 0
107  class = 1
108  class = 0
109  class = 1
110  if DayOfWeek<3.5 then node 122 elseif DayOfWeek>=3.5 then node 123 else 1
111  if DepartureTime<15.5 then node 124 elseif DepartureTime>=15.5 then node 125 else 1
112  if DepartureTime<11.5 then node 126 elseif DepartureTime>=11.5 then node 127 else 1
113  class = 1
114  if Destination<2.5 then node 128 elseif Destination>=2.5 then node 129 else 1
115  class = 1
116  class = 1
117  if Destination<2.5 then node 130 elseif Destination>=2.5 then node 131 else 1
118  class = 1
119  class = 0
120  class = 0
121  class = 1
122  class = 1
123  if DayOfMonth<22.5 then node 132 elseif DayOfMonth>=22.5 then node 133 else 1
124  if DayOfMonth<12.5 then node 134 elseif DayOfMonth>=12.5 then node 135 else 1
125  if DayOfWeek<1.5 then node 136 elseif DayOfWeek>=1.5 then node 137 else 1
126  if DayOfMonth<5.5 then node 138 elseif DayOfMonth>=5.5 then node 139 else 1
127  if DayOfMonth<21.5 then node 140 elseif DayOfMonth>=21.5 then node 141 else 1
128  class = 0
129  class = 1
130  class = 1
131  class = 0
132  if Origin<2.5 then node 142 elseif Origin>=2.5 then node 143 else 1
133  class = 1
134  class = 1
135  if DayOfMonth<18 then node 144 elseif DayOfMonth>=18 then node 145 else 1
136  class = 0
137  if DayOfWeek<5.5 then node 146 elseif DayOfWeek>=5.5 then node 147 else 1
138  if DayOfWeek<3 then node 148 elseif DayOfWeek>=3 then node 149 else 0
139  if DayOfWeek<4.5 then node 150 elseif DayOfWeek>=4.5 then node 151 else 1
140  if DepartureTime<14 then node 152 elseif DepartureTime>=14 then node 153 else 1
141  class = 1
142  if Destination<2.5 then node 154 elseif Destination>=2.5 then node 155 else 1
143  if Destination<2.5 then node 156 elseif Destination>=2.5 then node 157 else 0
144  if DepartureTime<13.5 then node 158 elseif DepartureTime>=13.5 then node 159 else 1
145  class = 1
146  class = 1
147  if DayOfMonth<13.5 then node 160 elseif DayOfMonth>=13.5 then node 161 else 1
148  class = 0
149  class = 1
150  if Destination<2.5 then node 162 elseif Destination>=2.5 then node 163 else 1
151  if DayOfMonth<12.5 then node 164 elseif DayOfMonth>=12.5 then node 165 else 1
152  if DayOfMonth<3.5 then node 166 elseif DayOfMonth>=3.5 then node 167 else 0
153  class = 1
154  class = 0
155  class = 1
156  class = 1
157  class = 0
158  class = 1
159  class = 0
160  class = 1
161  class = 0
162  class = 1
163  if DayOfMonth<7.5 then node 168 elseif DayOfMonth>=7.5 then node 169 else 1
164  class = 1
165  if DepartureTime<10.5 then node 170 elseif DepartureTime>=10.5 then node 171 else 0
166  class = 1
167  if DayOfWeek<4.5 then node 172 elseif DayOfWeek>=4.5 then node 173 else 0
168  class = 0
169  class = 1
170  class = 0
171  class = 1
172  if DayOfMonth<7.5 then node 174 elseif DayOfMonth>=7.5 then node 175 else 0
173  class = 0
174  class = 0
175  if DayOfMonth<12.5 then node 176 elseif DayOfMonth>=12.5 then node 177 else 1
176  class = 1
177  if DayOfMonth<13.5 then node 178 elseif DayOfMonth>=13.5 then node 179 else 0
178  class = 0
179  if DayOfMonth<14.5 then node 180 elseif DayOfMonth>=14.5 then node 181 else 1
180  class = 1
181  if Carrier<5 then node 182 elseif Carrier>=5 then node 183 else 0
182  class = 1
183  class = 0


Appendix A.9) main.m


load SplittedSets.mat;
xlswrite('CombineConvert.xls',TrainingData, 'Sheet3');
xlswrite('CombineConvert.xls',TrainingStr, 'Sheet2');
display('Open CombineConvert.xls and run macros...');
display('If this is a test run, just press Enter!');
reply = input('Otherwise, press Enter when it is done!', 's');
data = xlsread('CombineConvert.xls','Ready');
temp = min(size(data));
temp2 = temp-1;
attributes(:,1:temp2) = data(:,1:temp2);
classes = data(:,temp);
save ReadyForTree.mat;
load ReadyForTree.mat;
attributes(:,7) = [];
fullTree = ClassificationTree.fit(attributes, classes, 'PredictorNames', ...
    {'DepartureTime', 'Carrier', 'Destination', 'Origin', 'Weather', 'DayOfWeek'}, 'Prune', 'off');
predicted2 = predict(fullTree, attributes);
display('Confusion matrix for full tree: ');
[cMat3,order] = confusionmat(classes,predicted2)
level1pruned = prune(fullTree, 'level', max(fullTree.PruneList));
predicted3 = predict(level1pruned, attributes);
display('Confusion matrix for pruned tree: ');
[cMat2,order] = confusionmat(classes,predicted3)


Appendix B.1) main.m


%Reading Excel Sheet
data = xlsread('SunXpress.xls','data');
%Splitting
[TrainingData, ValidationData] = SplitTrainingAndValidation(data);
% delete ID's
TrainingData(:,1) = [];
ValidationData(:,1) = [];
% Divide classes column
TrainingClasses = TrainingData(:,min(size(TrainingData)));
ValidationClasses = ValidationData(:,min(size(ValidationData)));
% Delete classes
TrainingData(:,min(size(TrainingData))) = [];
ValidationData(:,min(size(ValidationData))) = [];
% Divide attributes
TrainingAttributes = TrainingData;
ValidationAttributes = ValidationData;
save('SplittedSets.mat','TrainingAttributes','TrainingClasses','ValidationAttributes','ValidationClasses')
clear;
load SplittedSets.mat;


Appendix B.2) main.m


load SplittedSets.mat;
display('Confusion matrix for k=2 : ');
Prediction2 = knnclassify(ValidationAttributes, TrainingAttributes, TrainingClasses, 2);
[cMat2,order] = confusionmat(ValidationClasses,Prediction2)
display('Confusion matrix for k=3 : ');
Prediction3 = knnclassify(ValidationAttributes, TrainingAttributes, TrainingClasses, 3);
[cMat3,order] = confusionmat(ValidationClasses,Prediction3)
display('Confusion matrix for k=4 : ');
Prediction4 = knnclassify(ValidationAttributes, TrainingAttributes, TrainingClasses, 4);
[cMat4,order] = confusionmat(ValidationClasses,Prediction4)
display('Confusion matrix for k=5 : ');
Prediction5 = knnclassify(ValidationAttributes, TrainingAttributes, TrainingClasses, 5);
[cMat5,order] = confusionmat(ValidationClasses,Prediction5)


Appendix B.3) main.m


load SplittedSets.mat;
% For training set
for k=1:50;
    Prediction = knnclassify(TrainingAttributes, TrainingAttributes, TrainingClasses, k);
    [cMat,order] = confusionmat(TrainingClasses,Prediction);
    toplam = sum(sum(cMat));                         % toplam = total number of records
    toplamMisclassified = cMat(1,2) + cMat(2,1);     % misclassified count
    PercentageError = (toplamMisclassified / toplam)*100;
    ErrorMatrix(k) = PercentageError;
end
% For validation set
for k=1:50;
    Prediction = knnclassify(ValidationAttributes, TrainingAttributes, TrainingClasses, k);
    [cMat,order] = confusionmat(ValidationClasses,Prediction);
    toplam = sum(sum(cMat));
    toplamMisclassified = cMat(1,2) + cMat(2,1);
    PercentageError = (toplamMisclassified / toplam)*100;
    ErrorMatrix2(k) = PercentageError;
end


Appendix B.4) main.m


load SplittedSets.mat;
TrainingAttributes = transpose(TrainingAttributes);
TrainingClasses = transpose(TrainingClasses);
ValidationAttributes = transpose(ValidationAttributes);
ValidationClasses = transpose(ValidationClasses);


Appendix B.5) ann3000.m


inputs = TrainingAttributes;
targets = TrainingClasses;
hiddenLayerSize = 10;
net = fitnet(hiddenLayerSize);
net.divideParam.trainRatio = 100/100;
net.divideParam.valRatio = 0/100;
net.divideParam.testRatio = 0/100;
net.performFcn = 'sse';
net.trainParam.epochs = 3000;
[net,tr] = train(net,inputs,targets);
outputs = net(inputs);
errors = gsubtract(targets,outputs);
performance = perform(net,targets,outputs);
save('Network3000.mat','net');


Appendix B.6) confusionMatricesFor3000.m


PredictionsTraining = round(net_outputs_Training);
for j=1:max(size(net_outputs_Validation))
    if net_outputs_Validation(j)>0.5
        PredictionsValidation(j)=1;
    else
        PredictionsValidation(j)=0;
    end
end
display('Confusion matrix for training set: ');
[cMat2,order] = confusionmat(TrainingClasses,PredictionsTraining)
display('Confusion matrix for validation set: ');
[cMat3,order] = confusionmat(ValidationClasses,PredictionsValidation)


Appendix B.7) main.m


load SplittedSets.mat;
for k=2:25
    indexEuc(:,k) = kmeans(TrainingAttributes,k,'distance','sqEuclidean','emptyaction','drop');
end
for k=2:25
    indexCity(:,k) = kmeans(TrainingAttributes,k,'distance','cityblock','emptyaction','drop');
end

%Labeling the clusters and data points for Euclidean distance case
AvgEuc=zeros(25,25);
SumEuc=zeros(25,25);
CountEuc=zeros(25,25);
LabelEuc(25,25)=5;
LabelEuc(:,:)=5;
for i=2:25
    for j=1:3988
        for t=1:i
            if indexEuc(j,i)==t
                SumEuc(i,t)=SumEuc(i,t)+ TrainingClasses(j,1);
                CountEuc(i,t)=CountEuc(i,t)+1;
            end
        end
    end
end
for i=2:25
    for t=1:i
        AvgEuc(i,t)=SumEuc(i,t)/CountEuc(i,t);
        if AvgEuc(i,t)<0.5
            LabelEuc(i,t)=0;
        else
            LabelEuc(i,t)=1;
        end
    end
end
for i=2:25
    for j=1:3988
        for t=1:i
            if indexEuc(j,i)==t
                RowLabelEuc(j,i)=LabelEuc(i,t);
            end
        end
    end
end

%Labeling the clusters and data points for L1 distance case
AvgL1=zeros(25,25);
SumL1=zeros(25,25);
CountL1=zeros(25,25);
LabelL1(25,25)=5;
LabelL1(:,:)=5;
for i=2:25
    for j=1:3988
        for t=1:i
            if indexCity(j,i)==t
                SumL1(i,t)=SumL1(i,t)+ TrainingClasses(j,1);
                CountL1(i,t)=CountL1(i,t)+1;
            end
        end
    end
end
for i=2:25
    for t=1:i
        AvgL1(i,t)=SumL1(i,t)/CountL1(i,t);
        if AvgL1(i,t)<0.5
            LabelL1(i,t)=0;
        else
            LabelL1(i,t)=1;
        end
    end
end
for i=2:25
    for j=1:3988
        for t=1:i
            if indexCity(j,i)==t
                RowLabelL1(j,i)=LabelL1(i,t);
            end
        end
    end
end

%Error Matrices
[ErrEuc2,ClassErr2]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,2));
[ErrEuc3,ClassErr3]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,3));
[ErrEuc4,ClassErr4]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,4));
[ErrEuc5,ClassErr5]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,5));
[ErrEuc6,ClassErr6]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,6));
[ErrEuc7,ClassErr7]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,7));
[ErrEuc8,ClassErr8]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,8));
[ErrEuc9,ClassErr9]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,9));
[ErrEuc10,ClassErr10]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,10));
[ErrEuc11,ClassErr11]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,11));
[ErrEuc12,ClassErr12]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,12));
[ErrEuc13,ClassErr13]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,13));
[ErrEuc14,ClassErr14]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,14));
[ErrEuc15,ClassErr15]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,15));
[ErrEuc16,ClassErr16]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,16));
[ErrEuc17,ClassErr17]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,17));
[ErrEuc18,ClassErr18]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,18));
[ErrEuc19,ClassErr19]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,19));
[ErrEuc20,ClassErr20]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,20));
[ErrEuc21,ClassErr21]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,21));
[ErrEuc22,ClassErr22]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,22));
[ErrEuc23,ClassErr23]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,23));
[ErrEuc24,ClassErr24]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,24));
[ErrEuc25,ClassErr25]=ErrorMatrix(TrainingClasses,RowLabelEuc(:,25));

Appendix B.8) ErrorMatrix.m


function [Errors, ClassificationError] = ErrorMatrix(M1,M2)
Errors = [0 0 ; 0 0];
sz = size(M1);
A = M1;
P = M2;
for k = 1:sz(1)
    Errors(A(k)+1,P(k)+1) = Errors(A(k)+1,P(k)+1)+1;
end
ClassificationError = (Errors(1,2)+Errors(2,1))/3988;
end

