
University of Gondar

Faculty of Informatics
Department of Information Technology for PG program
Year-2nd
Assignment 1 for Data Warehousing and Data Mining
(IT 617)

Set by:
Tesfaye Ashebr Sec. 3

Submitted to: Instructor Mengistu Belete (PhD Candidate)

August 2011 E.C

Gondar, Ethiopia
ACKNOWLEDGEMENT

First of all, we would like to thank GOD, who helped us through all the ups and downs and enabled us to complete this assignment successfully, beyond what we had anticipated. Next, we would like to express our deepest gratitude to our instructor, Mengistu Belete, who contributed his ideas to make this assignment as good as we needed it to be.
Question 1.1. Consider the following tree split and the two questions below.

[29+,35-] A=?

True False

[21+,5-] [8+,30-]

a) A statement says: "to decide the best attribute in each split of decision tree induction,
instead of computing information gain, you just need to compute the expected (average)
entropy in the lower level". Do you think this is correct or not?
b) The same question is asked again, but the criterion "information gain" is substituted by
"gain ratio". What is your answer then?

Solution

P=29 and N=35

Entropy(class) = -(29/64) log2(29/64) - (35/64) log2(35/64)
               = 0.993651

- To calculate the entropy of the True branch, P = 21 and N = 5:

Entropy(True) = -(21/26) log2(21/26) - (5/26) log2(5/26)
              = 0.706274
- To calculate the entropy of the False branch, P = 8 and N = 30:

Entropy(False) = -(8/38) log2(8/38) - (30/38) log2(30/38)
               = 0.742488
Branch    P    N    I(P, N)
True     21    5    0.706274
False     8   30    0.742488

Expected (average) entropy of the split on A = (26/64)(0.706274) + (38/64)(0.742488)
                                             = 0.2869238125 + 0.44085225
                                             = 0.7277760625

Information gain = Entropy(class) - expected entropy of the split
                 = 0.993651 - 0.7277760625
                 = 0.2658749375
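The calculation above can be checked with a short Python sketch; it is only an illustration and uses the counts from the tree (29/35 at the parent, 21/5 and 8/30 at the branches):

from math import log2

def entropy(p, n):
    # I(p, n): entropy of a node with p positive and n negative examples
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            fraction = count / total
            result -= fraction * log2(fraction)
    return result

parent = entropy(29, 35)                                            # ~0.9937
expected = (26 / 64) * entropy(21, 5) + (38 / 64) * entropy(8, 30)  # ~0.7278
information_gain = parent - expected                                # ~0.2659
print(parent, expected, information_gain)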
A. The statement is correct.

Information gain is the class entropy minus the expected (average) entropy of the lower level. Because the class entropy is the same for every candidate attribute at a given node, the attribute that minimizes the expected entropy is exactly the attribute that maximizes the information gain. So computing only the expected entropy in the lower level selects the same best attribute, and the result is exactly the same.

B. To calculate the gain ratio


Information gain= 0.2658749375
Split info(A) = -(26/64) log2(26/64) - (38/64) log2(38/64)
              = 0.974489

Gain ratio = Information gain / Split info = 0.2658749375 / 0.974489

           = 0.272835
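The split information and gain ratio can be checked the same way, reusing the entropy helper from the sketch above (the branch sizes 26 and 38 play the role of the split counts):

# Split information over the branch sizes (26 True, 38 False) and the gain ratio.
split_info = entropy(26, 38)                 # ~0.9745
gain_ratio = information_gain / split_info   # ~0.2728
print(split_info, gain_ratio)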

In this case the answer is no longer correct in general: the gain ratio divides the information gain by the split information, and the split information differs from attribute to attribute. Comparing only the expected entropy of the lower level is therefore not enough to find the attribute with the highest gain ratio.

Question 1.2.

A. If the middle histogram is the true distribution, using the top or the bottom histogram as the
model implies what?

Answer

 The top histogram shows the under-fitting problem. Its bin count is only 10, so each bin is
too wide: the histogram over-smooths the 751-point sample and hides the detail of the
underlying pattern (frequency distribution) of the data. A model this coarse will have both
high training error and high testing error.

 On the other hand, the bottom histogram shows the over-fitting problem. Its bin count is
100, so each bin is too narrow and holds only a few points: the histogram follows the noise
in the sample instead of the underlying trend in the data. In the bottom histogram there is
low bias, high variance, and high complexity. We want to watch out for the cases where our
training error is significantly lower than our validation error, which indicates severe
over-fitting. The bottom histogram will have an extremely low training error but a high
testing error.
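A rough matplotlib sketch of the three cases follows; the normally distributed sample is synthetic and only stands in for the 751-point sample described in the question:

# Sketch of the three histograms: too few bins over-smooths, too many follows noise.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=751)

fig, axes = plt.subplots(3, 1, figsize=(6, 9))
for ax, bins, label in zip(axes, (10, 25, 100), ("under-fit", "balanced", "over-fit")):
    ax.hist(sample, bins=bins)
    ax.set_title(f"{bins} bins ({label})")
plt.tight_layout()
plt.show()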

B. Argue why the middle histogram could be the best choice for visualization. In fact, the middle
one is also the one most likely to be the best model among the three. Why?

Answer

I argue this because the middle histogram uses equal-width bins, its bin count of 25 is moderate,
and its bars are easier to read than those of the other two. The middle histogram could be the
best choice for visualization because, for a roughly normal distribution, points on one side of
the average are as likely to occur as points on the other side, and this shape is clearly visible
at this bin width.

The middle one is also the one most likely to be the best model among the three because:

 Its bin width is neither too small nor too large, so this choice of bins is beneficial
 The bins are of equal size
 It has a lower capacity to memorize the individual samples
 It maintains the balance between bias and variance
 It summarizes the discrepancy between observed values without hiding the overall shape

Why is the bottom one unlikely to be the true model?

Because a model that is under-fit will have high training and high testing error, while an over-fit
model will have an extremely low training error but a high testing error. As the flexibility of the
model increases, the training error continually decreases. However, the error on the testing set
only decreases as we add flexibility up to a certain point; beyond that point the testing error
increases, because the model has memorized the training data and its noise. The over-fit bottom
histogram is therefore unlikely to reflect the true distribution.

Cross-validation may select only the second-best model on a particular testing set, since the exact
metrics depend on that set, but on average the model chosen by cross-validation will outperform all
the other models. In general, this answer rests on the basics of modeling: overfitting vs.
underfitting, bias vs. variance, and model selection with cross-validation.
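A hedged sketch of the cross-validation idea mentioned above, using scikit-learn; the polynomial-regression setup, the synthetic data, and the degree range are assumptions for illustration only, not the data of the question:

# Sketch: choosing model flexibility (polynomial degree) by cross-validation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 100)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 100)

for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(degree, -score)   # the degree with the lowest CV error balances bias and variance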

Question 1.3. Explain the pros and cons of using k-nearest neighbors as a model for
prediction

K-Nearest Neighbors, also known as K-NN, belongs to the family of supervised machine
learning algorithms, which means we use a labeled (target variable) dataset to predict the class of
a new data point.

The K-NN algorithm is a robust classifier which is often used as a benchmark for more complex
classifiers such as Artificial Neural Network (ANN) or Support vector machine (SVM).

Some of the pros of using K-NN:

A. K-NN is pretty intuitive and simple: the K-NN algorithm is very simple to understand and
equally easy to implement. To classify a new data point, the K-NN algorithm reads through
the whole dataset to find its K nearest neighbors.
B. K-NN has no assumptions: K-NN is a non-parametric algorithm, which means there are no
assumptions to be met before implementing K-NN. Parametric models like linear regression have
lots of assumptions that the data must meet before they can be applied, which is not the case
with K-NN.

C. No Training Step: K-NN does not explicitly build any model; it simply tags a new data
entry based on what it learns from the historical data. A new data entry is tagged with the
majority class among its nearest neighbors.
D. It constantly evolves: since it is an instance-based, memory-based learner, the classifier
adapts immediately as we collect new training data. This allows the algorithm to respond quickly
to changes in the input during real-time use.
E. Very easy to implement for multi-class problems: most classifier algorithms are
easy to implement for binary problems and need extra effort to handle multiple classes,
whereas K-NN adjusts to multi-class problems without any extra effort.
F. Can be used both for Classification and Regression: one of the biggest advantages of
K-NN is that it can be used for both classification and regression problems (a short sketch
follows this list).
G. One Hyper Parameter: K-NN has essentially one hyper-parameter, the number of neighbors K;
choosing it may take some time, but once K is fixed there is little else to tune.
H. Variety of distance criteria to choose from: the K-NN algorithm gives the user the
flexibility to choose the distance metric while building the K-NN model.
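A minimal scikit-learn sketch of point F, showing K-NN used for both classification and regression; the toy data are assumptions for illustration only:

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[0], [1], [2], [3], [4], [5]]
y_class = [0, 0, 0, 1, 1, 1]
y_value = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3]

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)
print(clf.predict([[1.5]]), reg.predict([[1.5]]))   # majority vote vs. neighbour average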

Even though K-NN has several advantages, there are certain very important disadvantages or
constraints. A few cons of K-NN are listed below.

I. K-NN is a slow algorithm: K-NN may be very easy to implement, but as the dataset grows the
efficiency or speed of the algorithm declines very fast.
II. Curse of Dimensionality: KNN works well with a small number of input variables, but as the
number of variables grows the K-NN algorithm struggles to predict the output for a new data point.
III. K-NN needs homogeneous features: if you decide to build K-NN using a common distance,
like the Euclidean or Manhattan distance, it is essential that the features have the same
scale, since absolute differences in the features are weighted equally, i.e., a given distance
along feature 1 must mean the same as the same distance along feature 2 (see the scaling sketch
after this list).
IV. Optimal number of neighbors: one of the biggest issues with K-NN is choosing the
optimal number of neighbors to consider when classifying a new data entry.
V. Imbalanced data causes problems: k-NN doesn’t perform well on imbalanced data. If we
consider two classes, A and B, and the majority of the training data is labeled as A, then the
model will ultimately give a lot of preference to A. This may result in the less common class B
being wrongly classified.
VI. Outlier sensitivity: the K-NN algorithm is very sensitive to outliers, as it simply chooses
the neighbors based on the distance criterion.
VII. Missing value treatment: K-NN inherently has no capability of dealing with the missing value
problem.
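A sketch of point III, putting features on a common scale before K-NN so that no single feature dominates the distance; the pipeline and the toy values are an illustration, not a prescribed solution:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = [[25, 50_000], [32, 64_000], [47, 120_000], [51, 58_000]]   # e.g. age, income
y = [0, 0, 1, 1]

# Standardize each feature, then classify with K-NN on the scaled values.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[30, 70_000]]))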
Question 1.4. In real-world data, tuples with missing values for some attributes are a common
occurrence. Describe various methods for handling this problem.

One of the important stages of data mining is preprocessing, where we prepare the data for
mining. Real-world data tends to be incomplete, noisy, and inconsistent and an important task
when preprocessing the data is to fill in missing values, smooth out noise and correct
inconsistencies.

If we specifically look at dealing with missing data, there are several techniques that can be used.
Choosing the right technique is a choice that depends on the problem domain — the data’s
domain (sales data? CRM data? …) and our goal for the data mining process.

So, how can you handle missing values? The common methods are listed below.

1. Ignore the data row

This is usually done when the class label is missing (assuming your data mining goal is
classification), or many attributes are missing from the row (not just one). However, you’ll
obviously get poor performance if the percentage of such rows is high.

For example, let’s say we have a database of student enrolment data (age, SAT score, state of
residence, etc.) and a column classifying their success in college as "Low", "Medium" or
"High". Let’s say our goal is to build a model predicting a student’s success in college. Data
rows that are missing the success column are not useful in predicting success, so they could very
well be ignored and removed before running the algorithm.
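A small pandas sketch of this method; the tiny DataFrame and its column names are made up for illustration:

# Method 1 sketch: drop rows whose class label ("success") is missing.
import numpy as np
import pandas as pd

students = pd.DataFrame({
    "age": [17, 18, 19, 18],
    "sat_score": [1200, 1350, np.nan, 1100],
    "success": ["High", np.nan, "Low", "Medium"],
})
students = students.dropna(subset=["success"])   # keep only labelled rows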

2. Use a global constant to fill in for missing values

Decide on a new global constant value, like "unknown", "N/A" or minus infinity, that will be used
to fill all the missing values. This technique is used because sometimes it just doesn’t make sense
to try to predict the missing value.

For example, let’s look at the student enrollment database again. Assume the state of
residence attribute is missing for some students. Filling it in with some actual state doesn’t
really make sense, as opposed to using something like "N/A".
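A pandas sketch of this method; the DataFrame and its values are illustrative assumptions:

# Method 2 sketch: fill a missing categorical attribute with a global constant.
import pandas as pd

enrolment = pd.DataFrame({"name": ["Abel", "Sara", "Lia"],
                          "state": ["Amhara", None, "Oromia"]})
enrolment["state"] = enrolment["state"].fillna("N/A")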

3. Use attribute mean

Replace missing values of an attribute with the mean (or the median, if the distribution is skewed)
value for that attribute in the database.

For example, in a database of US family incomes, if the average income of a US family is X you
can use that value to replace missing income values.
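A pandas sketch of this method; the income values are made up for illustration:

# Method 3 sketch: replace missing incomes with the overall mean income.
import pandas as pd

incomes = pd.DataFrame({"income": [40_000, None, 55_000, 62_000]})
incomes["income"] = incomes["income"].fillna(incomes["income"].mean())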

4. Use attribute mean for all samples belonging to the same class

Instead of using the mean (or median) of a certain attribute calculated by looking at all the rows
in a database, we can limit the calculations to the relevant class to make the value more relevant
to the row we’re looking at.

Let’s say you have a car pricing database that, among other things, classifies cars as “Luxury”
and “Low budget”, and you’re dealing with missing values in the cost field. Replacing the missing
cost of a luxury car with the average cost of all luxury cars is probably more accurate than the
value you’d get if you factored in the low budget cars.
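A pandas sketch of the class-wise mean; the DataFrame and its values are illustrative assumptions:

# Method 4 sketch: fill a missing cost with the mean cost of cars in the same class.
import pandas as pd

cars = pd.DataFrame({"class": ["Luxury", "Luxury", "Low budget", "Low budget"],
                     "cost": [90_000, None, 12_000, 15_000]})
cars["cost"] = cars["cost"].fillna(cars.groupby("class")["cost"].transform("mean"))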

5. Use a data mining algorithm to predict the most probable value

The value can be determined using regression, inference-based tools using a Bayesian formalism,
decision trees, or clustering algorithms (K-Means, K-Median, etc.).

For example, we could use clustering algorithms to create clusters of rows which will then be
used for calculating an attribute mean or median as specified in technique #3.

Another example could be using a decision tree to try and predict the probable value in the
missing attribute, according to other attributes in the data.

I’d suggest looking into regression and decision trees first (ID3 tree generation) as they’re
relatively easy and there are plenty of examples on the net…
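One possible sketch of model-based imputation, using scikit-learn's KNNImputer (one choice among many, not the only way to implement this method); the numeric array is made up for illustration:

# Method 5 sketch: fill each gap from the two most similar rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled)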

Additional Notes

 Note that methods 2–5 bias the data as the filled-in value may not be correct.
 Method 5 uses the most information available in the present data to predict the missing
value so it has a better chance for generating less bias.
 Missing values may not necessarily imply an error in the data! Forms may contain
optional fields, and certain attributes may be in the database for future use.

In addition to handling missing values, data cleaning also uses binning, regression, clustering,
and combined human and computer inspection to smooth noisy data and correct inconsistencies.

