STAT8017 Assignment 1

1.

a)
Addressing method: Unsupervised data mining
Data mining problem: Exploration
Input attributes: age, duration of being infected with swine flu, frequency of being sick in a year, gender

b)
Addressing method: Database query

c)
Addressing method: Unsupervised data mining
Data mining problem: Similarity
Input attributes: height, weight, age, spectator sport

d)
Addressing method: Supervised data mining
Data mining problem: Classification
Input attributes: season, age, gender

2.

a) The appropriate data type of the variable Age is numeric.

b) The distribution of the variable Age is illustrated in the diagram below.

The variable AGE shows a right-skewed distribution, with outliers found on the right side of the histogram.

c) After testing several different transformations, the log transformation was found to be the best power transformation for maximizing normality. The distribution of the variable AGE after the transformation is shown below.

d) Using a binning transformation, the scatter plot of the three groups is shown below.

Age is divided into three age groups:
i) <= 22
ii) > 22 and < 35
iii) >= 35
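The log transformation in c) and the binning in d) can be sketched in a few lines of Python. This is only an illustration: the `ages` list is made-up sample data (the assignment's data set is not reproduced here), and the helper names `skewness` and `age_group` are hypothetical. Skewness is used as a rough stand-in for "closeness to normality"; the original answer does not say which test was used.

```python
import math

# Hypothetical ages, chosen only to give a right-skewed sample.
ages = [18, 19, 20, 21, 22, 23, 25, 27, 30, 34, 35, 41, 58, 76]


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def age_group(age):
    """Bin an age into the three groups used in d)."""
    if age <= 22:
        return "<= 22"
    if age < 35:
        return "> 22 and < 35"
    return ">= 35"


# Compare skewness before and after the log transformation.
log_ages = [math.log(a) for a in ages]
print(f"skewness before log: {skewness(ages):.3f}")
print(f"skewness after log:  {skewness(log_ages):.3f}")

# Count how many ages fall into each bin.
counts = {}
for a in ages:
    group = age_group(a)
    counts[group] = counts.get(group, 0) + 1
print(counts)
```

On a right-skewed sample like this one, the skewness after the log transformation should be closer to zero than before, which is the effect described in c).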

3.

a)
                      Predicted
                      Non-fraud    Fraud
Actual   Non-fraud    270          220-n
         Fraud        n            310

where n is any non-negative integer not larger than 220.

b) Suppose the number of true negatives is unchanged, i.e. 270.
Let x be the number of true positives such that 1% of the claims are true frauds:
x / (800 - 270) = 0.01
x = 5.3
The nearest integer to x is 5.
Number of true positives = 5
Number of false positives = 5/0.01 - 5 = 495
Number of false negatives = 800 - 270 - 495 - 5 = 30

Corrected matrix:

                      Predicted
                      Non-fraud    Fraud
Actual   Non-fraud    270          495
         Fraud        30           5

c) Misclassification rate = (495 + 30)/800 = 65.6%
Percentage of new records expected to be classified as frauds: 500/800 = 62.5%
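The arithmetic in c) can be checked with a short Python sketch. The dictionary layout and variable names are illustrative choices; the four counts are the true negatives, false positives, false negatives and true positives from the corrected matrix in b).

```python
# Corrected confusion matrix from b); keys are (actual, predicted).
matrix = {
    ("non-fraud", "non-fraud"): 270,  # true negatives
    ("non-fraud", "fraud"): 495,      # false positives
    ("fraud", "non-fraud"): 30,       # false negatives
    ("fraud", "fraud"): 5,            # true positives
}

total = sum(matrix.values())

# Misclassified records are those whose predicted label differs from the actual one.
misclassified = sum(n for (actual, predicted), n in matrix.items() if actual != predicted)

# Records classified as fraud, whether correctly or not.
predicted_fraud = sum(n for (actual, predicted), n in matrix.items() if predicted == "fraud")

misclassification_rate = misclassified / total
predicted_fraud_rate = predicted_fraud / total

print(f"Misclassification rate: {misclassification_rate:.1%}")
print(f"Classified as fraud:    {predicted_fraud_rate:.1%}")
```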

4.

a) Possible splits are as follows:
i) Home owner: Yes vs No
ii) Marital Status: Single vs Not Single, Married vs Not Married, Divorced vs Not Divorced
iii) Annual Income: <=60K, <=70K, <=75K, <=80K, <=90K, <=95K, <=100K, <=120K, <=125K

b) Root: Defaulted Borrower

               Yes     No      Total
Frequency      6       9       15
Probability    6/15    9/15

Entropy = -(6/15)log2(6/15) - (9/15)log2(9/15) = 0.971

c) Observed frequency

               Defaulted Borrower
Home owner     Yes     No      Total
Yes            1       3       4
No             5       6       11
Total          6       9       15

Expected frequency (row total x column total / 15)

               Defaulted Borrower
Home owner     Yes     No      Total
Yes            1.6     2.4     4
No             4.4     6.6     11
Total          6       9       15

Chi-square test statistic:
X^2 = (1-1.6)^2/1.6 + (5-4.4)^2/4.4 + (3-2.4)^2/2.4 + (6-6.6)^2/6.6 = 0.511

d) Suppose the target of Leaves 1, 2 and 4 is No and that of Leaf 3 is Yes.

Confusion matrix

                                   Predicted
                                   Non-defaulted borrower    Defaulted borrower
Actual   Non-defaulted borrower    7                         2
         Defaulted borrower        3                         3

Sensitivity = 3/6 = 0.5
Specificity = 7/9 = 0.7778
Misclassification rate = (3 + 2)/15 = 0.333
Response rate = (3 + 3)/15 = 0.4
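The calculations in b) to d) can be reproduced with a small Python sketch. The function names `entropy` and `chi_square` are illustrative helpers, not part of any library. The expected frequencies are computed from the row and column totals of the observed table rather than typed in by hand, which makes the sketch a useful cross-check on the figures in c).

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat


# b) Root node: 6 defaulted, 9 not defaulted.
root_entropy = entropy([6, 9])

# c) Rows: home owner Yes / No; columns: defaulted Yes / No.
x2 = chi_square([[1, 3], [5, 6]])

# d) Confusion matrix counts, with "defaulted" as the positive class.
tn, fp, fn, tp = 7, 2, 3, 3
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
misclassification_rate = (fp + fn) / (tn + fp + fn + tp)

print(f"Entropy at root:        {root_entropy:.3f}")
print(f"Chi-square statistic:   {x2:.3f}")
print(f"Sensitivity:            {sensitivity:.4f}")
print(f"Specificity:            {specificity:.4f}")
print(f"Misclassification rate: {misclassification_rate:.3f}")
```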

5. Summary of "Comparison of decision tree methods for finding active objects"

A large amount of astronomical data is collected from a wide range of large surveys such as 2MASS (the Two Micron All Sky Survey), SDSS (the Sloan Digital Sky Survey), DENIS (the Deep Near Infrared Survey), DIVA and GAIA. Scientists need sophisticated and intelligent methods to separate active objects (e.g. quasars, BL Lac objects and active galaxies) from non-active objects such as stars and galaxies. Decision trees were selected as the method for building an online system that classifies X-ray sources and separates stars from galaxies automatically. The scientists employed decision trees from WEKA (the Waikato Environment for Knowledge Analysis), which comprises tools for data pre-processing, classification, regression, clustering and association rules.

The scientists' concerns about the data and decision trees are the n-dimensional parameter space, the long training process of the tree and the difficulty of interpretation. A decision tree, however, can overcome these concerns, as it can classify both categorical and numerical data, such as the types of stars and galaxies and solar wavelength in different seasons.

To find the best decision tree in WEKA, the scientists evaluated over sixty methods, and there are seven types of decision tree to which they applied the data: REPTree, RandomTree, J48, DecisionStump, Random Forest, NBTree and ADTree. REPTree is a quick method that builds a decision or regression tree by sorting numeric attributes. RandomTree is generated randomly from a set of possible trees and has been used extensively in machine learning. J48 is a more advanced decision tree, developed using a depth-first strategy and recursive data partitioning. Decision Stump is a rather simple decision tree, as it has only one level. Random Forest combines a set of simple decision trees to improve on a single classifier such as C4.5. NBTree is a decision tree that uses a Bayesian method in classification and learning: it sorts incoming data to a leaf and applies naive Bayes in that leaf to assign a class or categorical label. Lastly, ADTree, also known as the alternating decision tree, is a generalization of the decision tree.

Gathering specific astronomical data, such as multi-wavelength data for stars, galaxies, quasars and other active objects from the optical, X-ray and infrared bands, the scientists applied the various types of decision tree and measured their accuracy using 10-fold cross-validation to avoid over-fitting. With the same default parameters throughout, the data are split into ten subsets; in each run nine subsets are used as the training set and the remaining one as the testing set for validation, and the accuracy is calculated as the average over the ten runs.

Based on the classification results in the article, J48 is the best decision tree for classifying both active and non-active objects, whereas REPTree is the worst for classifying active objects and Decision Stump is the worst for classifying non-active objects. Besides accuracy, the time taken to build a decision tree or model is also critical. After running tests on the decision trees, it was found that Decision Stump is the fastest to build. Nevertheless, considering both classification accuracy and performance, ADTree and J48 are the best choices.

To further enhance decision trees in terms of performance and accuracy, extra features of stars and galaxies as well as additional attributes are required. Other forms of data, such as images and spectra, would be needed to classify additional astronomical objects such as nebulae and clusters. Owing to the high volume and complexity of astronomical data, data preprocessing should be performed before the training set is applied. Unsupervised methods and outlier detection should be used beforehand so that the decision tree is more practical and accurate in helping astronomers identify active objects in a short time.
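The 10-fold cross-validation procedure mentioned in this summary can be illustrated with a minimal Python sketch. It is not the article's WEKA setup: the data are synthetic, the classifier is a hand-written one-level tree (a decision stump) on a single feature, and the function names `k_fold_indices`, `train_stump` and `cross_validate` are hypothetical. The point is only to show how each fold serves as the test set exactly once while the other nine folds are used for training.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_stump(xs, ys):
    """Fit a one-level tree: choose the threshold and leaf labels with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def cross_validate(xs, ys, k=10):
    """Average test accuracy over k folds; each fold is held out exactly once."""
    folds = k_fold_indices(len(xs), k)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = train_stump([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(model(xs[i]) == ys[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Synthetic one-feature data: class 1 ("active") tends to have larger x than class 0.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(3, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100

accuracy = cross_validate(xs, ys)
print(f"10-fold cross-validated accuracy: {accuracy:.3f}")
```

Because every record is tested by a model that never saw it during training, the averaged accuracy is a fairer estimate of performance on new data than the accuracy on the training set itself.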