
Suppose a new species is discovered by scientists.

We can use a classification model built from the training data set shown in Table 2 to determine the class to which the creature belongs, either mammals or non-mammals.

Instance | Body temperature | Gives birth | Four-legged | Class label
Human | Warm-blooded | Yes | No | Mammals
Pigeon | Warm-blooded | No | No | Non-mammals
Elephant | Warm-blooded | Yes | Yes | Mammals
Leopard shark | Cold-blooded | Yes | No | Non-mammals
Turtle | Cold-blooded | No | Yes | Non-mammals
Penguin | Cold-blooded | No | No | Non-mammals
Eel | Cold-blooded | No | No | Non-mammals
Dolphin | Warm-blooded | Yes | No | Mammals
Spiny anteater | Warm-blooded | No | Yes | Mammals
Porcupine | Cold-blooded | No | Yes | Non-mammals

Table 2

As one of the scientists, you have been asked to construct a decision tree that perfectly fits the training data shown in Table 2. Answer the following questions.

a) Describe the classification process in data mining. (1 mark)

Classification assigns data records to predefined, well-known classes or groups. It consists of two steps: constructing a model from training data, and evaluating that model using testing data.

b) State the use of the training data set and the testing data set. (2 marks)

The training data set is used to build the model. Each record contains a set of attributes, one of which is the class.

The testing data set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

c) Why must the training data set be larger than the testing data set? (2 marks)

Separating data into training and testing sets is an important part of evaluating data mining models. Typically, when you separate a data set into a training set and a testing set, most of the data is used for training and a smaller portion is used for testing. The model can only learn patterns that appear in its training records, so the larger share goes to training, while a smaller sample is enough to estimate accuracy.

Analysis Services randomly samples the data to help ensure that the testing and training sets are similar. By using similar data for training and testing, you can minimize the effects of data discrepancies and better understand the characteristics of the model.

After a model has been processed by using the training set, you test the model by making predictions against the test set. Because the data in the testing set already contains known values for the attribute that you want to predict, it is easy to determine whether the model's guesses are correct.
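The random split described above can be sketched in a few lines of Python. This is only an illustration: the helper name `split_train_test`, the 70/30 ratio and the fixed seed are assumptions made for the example, not part of the question or of any particular tool.

```python
import random

# Instance names taken from Table 2.
instances = [
    "Human", "Pigeon", "Elephant", "Leopard shark", "Turtle",
    "Penguin", "Eel", "Dolphin", "Spiny anteater", "Porcupine",
]


def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into (train, test).

    train_fraction and seed are illustrative defaults (assumptions).
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # random sampling, reproducible via seed
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(instances)
print(len(train), "training records,", len(test), "testing records")
```

With ten records and a 0.7 fraction, seven records go to training and three to testing; every record lands in exactly one of the two sets.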

d) Based on Table 2, which are the input and target variables? (4 marks)

Input: Instance, Body temperature, Gives birth and Four-legged
Target variable: Class label

e) Decision trees are constructed using the goodness score of the input attributes. Compute the goodness score of each input attribute based on Table 2. (12 marks)
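The question does not say which goodness measure is intended. As one possible reading, the sketch below assumes the score is information gain (entropy reduction), a common choice for decision trees. Other measures, such as the Gini index, could give different numbers. The function names are hypothetical and the rows are copied from Table 2.

```python
from collections import Counter
from math import log2

# Rows of Table 2: (body temperature, gives birth, four-legged, class label).
ROWS = [
    ("Warm-blooded", "Yes", "No", "Mammals"),       # Human
    ("Warm-blooded", "No", "No", "Non-mammals"),    # Pigeon
    ("Warm-blooded", "Yes", "Yes", "Mammals"),      # Elephant
    ("Cold-blooded", "Yes", "No", "Non-mammals"),   # Leopard shark
    ("Cold-blooded", "No", "Yes", "Non-mammals"),   # Turtle
    ("Cold-blooded", "No", "No", "Non-mammals"),    # Penguin
    ("Cold-blooded", "No", "No", "Non-mammals"),    # Eel
    ("Warm-blooded", "Yes", "No", "Mammals"),       # Dolphin
    ("Warm-blooded", "No", "Yes", "Mammals"),       # Spiny anteater
    ("Cold-blooded", "No", "Yes", "Non-mammals"),   # Porcupine
]
ATTRIBUTES = ["Body temperature", "Gives birth", "Four-legged"]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, index):
    """Entropy of the class label minus the weighted entropy after
    splitting the rows on the attribute at the given column index."""
    parent = entropy([row[-1] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


# Assumption: "goodness score" is taken to mean information gain.
gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
# Body temperature: 0.610
# Gives birth: 0.256
# Four-legged: 0.020
```

Under this assumption, Body temperature has the highest score, so it would be the first attribute to split on.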

f) Based on the answer in (e), identify the child node(s) and the parent node. (3 marks)

Parent node: Class label
Child node(s): Body temperature, Gives birth, Four-legged

g) Based on the answer in (f), draw a decision tree that represents the training data set. (5 marks)

CLASS LABEL
  MAMMALS (4)
    BODY TEMPERATURE
      WARM BLOODED (4)
        GIVES BIRTH
          NO (1)
            FOUR LEGGED
              YES (1)
          YES (3)
            FOUR LEGGED
              NO (2)
              YES (1)
  NON-MAMMALS (6)
    BODY TEMPERATURE
      COLD BLOODED (5)
        GIVES BIRTH
          NO (4)
            FOUR LEGGED
              YES (2)
              NO (2)
          YES (1)
            FOUR LEGGED
              NO (1)
      WARM BLOODED (1)
        GIVES BIRTH
          NO (1)
            FOUR LEGGED
              NO (1)
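The counts on the branches of the tree can be cross-checked against Table 2 by counting how many records follow each path. The sketch below is only a verification aid. It assumes the same split order as the drawn tree (class label, then body temperature, gives birth, four-legged), and the variable names are illustrative.

```python
from collections import Counter

# One tuple per Table 2 record, ordered like the levels of the tree:
# (class label, body temperature, gives birth, four-legged).
ROWS = [
    ("Mammals", "Warm-blooded", "Yes", "No"),       # Human
    ("Non-mammals", "Warm-blooded", "No", "No"),    # Pigeon
    ("Mammals", "Warm-blooded", "Yes", "Yes"),      # Elephant
    ("Non-mammals", "Cold-blooded", "Yes", "No"),   # Leopard shark
    ("Non-mammals", "Cold-blooded", "No", "Yes"),   # Turtle
    ("Non-mammals", "Cold-blooded", "No", "No"),    # Penguin
    ("Non-mammals", "Cold-blooded", "No", "No"),    # Eel
    ("Mammals", "Warm-blooded", "Yes", "No"),       # Dolphin
    ("Mammals", "Warm-blooded", "No", "Yes"),       # Spiny anteater
    ("Non-mammals", "Cold-blooded", "No", "Yes"),   # Porcupine
]

# Each distinct tuple is one root-to-leaf path; its count is the leaf size.
paths = Counter(ROWS)
for path, count in sorted(paths.items()):
    print(" -> ".join(path), f"({count})")
```

The ten records fall into seven distinct paths, and the count printed for each path should match the number on the corresponding leaf.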
