
Exercises

Data Mining
Part 1: Data Understanding

Exercise 1:
Is there a difference between the following 2 situations? For each one, decide whether it is
a data mining task or not.
Situation 1: We need to divide the customers of a company according to their profitability.
Situation 2: We need to forecast the profitability of new customers of the same company.
Answer:
Situation 1 assumes that customer profitability is measurable, so it is not a data mining
task: it is only a sort on profitability followed by setting a minimum value of
profitability as a threshold. Situation 2, by contrast, is a data mining task since it
involves prediction.
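
To illustrate why Situation 1 needs no learning step, here is a minimal sketch of "sort, then threshold". The customer names, profitability values and the threshold are hypothetical and are not part of the exercise.

```python
# Hypothetical customers and profitability values (not from the exercise).
customers = {"C1": 1200.0, "C2": 300.0, "C3": 850.0, "C4": 50.0}

# Assumed minimum profitability for the "profitable" group.
threshold = 500.0

# Sort customers by profitability, highest first.
ranked = sorted(customers.items(), key=lambda item: item[1], reverse=True)

# Split the sorted list at the threshold: no model is trained anywhere.
profitable = [name for name, value in ranked if value >= threshold]
unprofitable = [name for name, value in ranked if value < threshold]

print("profitable:", profitable)
print("unprofitable:", unprofitable)
```

Situation 2 cannot be handled this way because the profitability of a new customer is not yet known and has to be predicted from other attributes.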
Exercise 2:
Consider the following table of patient medical records:
MedicalRecord (Pat-Id, Weight, diabeteMeasure, NbOfSurgeries, HeartMeasure).
1. Using the following sample data values for 3 patients, are both the Weight and
diabeteMeasure columns necessary for analyzing the medical data? Yes/No, and justify.
2. How reliable is your conclusion? Explain.

Answer:
The ratio Weight/diabeteMeasure (≈ 7) is almost the same for the 3 rows. The 2
columns seem to be correlated (they contain the same information), so one of them could be
dropped. However, this conclusion is not reliable because it is based on a very small sample.
To apply feature selection in this case, we need a significantly bigger ground-truth sample
as our base (a sample of reasonable size compared to the initial population).

Exercise 3:
Assume that we need to analyze customer satisfaction using “the number of customer
complaints for each product”. The company faces the following situation: the best-selling
product has the most complaints and therefore it has the worst customer satisfaction.
a) Is this measure reliable? Yes/No. Justify. If not, what do you suggest to fix it?
Explain.
Answer:
It is NOT reliable because the raw number of complaints is not sufficient to indicate how
well a product satisfies customers (the situation facing the company is an example: a product
that sells more also collects more complaints). A more reliable measure is:
number of complaints / number of products sold.
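
The difference between the two measures can be sketched as follows. The product names and all figures are hypothetical, chosen only so that the best seller has the most complaints in absolute terms.

```python
# Hypothetical figures (not from the exercise).
products = {
    "best_seller": {"complaints": 50, "units_sold": 10_000},
    "niche_item": {"complaints": 10, "units_sold": 200},
}


def complaint_rate(complaints, units_sold):
    """Complaints per unit sold, the normalized measure from the answer."""
    if units_sold == 0:
        raise ValueError("units_sold must be positive")
    return complaints / units_sold


# Raw measure: the product with the most complaints.
most_complaints = max(products, key=lambda name: products[name]["complaints"])

# Normalized measure: the product with the highest complaints per unit sold.
rates = {name: complaint_rate(**figures) for name, figures in products.items()}
highest_rate = max(rates, key=rates.get)

print("most complaints:", most_complaints)
print("highest complaint rate:", highest_rate, rates)
```

With these assumed numbers the raw count points at the best seller, while the normalized rate points at the other product.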
Suppose now that we need to analyze customer preference among products. Assume that each
product has several similar variations. We want to analyze the extent to which a customer
prefers one product over other similar products and evaluate which one customers prefer. To
prepare for this analysis, a pairwise technique is suggested. Example: for 3 variations, we
have the customers compare variations 1 and 2, then 1 and 3, and finally 2 and 3. We then
rank the product variations in order of customer preference.
a. Show how this technique can be inconsistent for ordinal ranking of product
variations in terms of customer preference.
b. Suggest a better approach for this situation.
Answer:
a. Suppose 1 is better than 2, 2 is better than 3, and 3 is better than 1. These preferences
form a cycle, so no consistent ranking exists.
b. Compute the average score of each product variation over all customers, then make an
overall ranking from these averages.
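
Both points of this answer can be checked with a short sketch. The function names and the customer scores below are hypothetical. The first part searches for an ordering that agrees with every pairwise preference, and the second part ranks variations by average score.

```python
from itertools import permutations
from statistics import mean


def consistent_rankings(preferences, items):
    """Return every ordering of items that agrees with all (winner, loser) pairs."""
    found = []
    for order in permutations(items):
        position = {item: index for index, item in enumerate(order)}
        if all(position[w] < position[l] for w, l in preferences):
            found.append(order)
    return found


# The cyclic preferences from the answer: 1 > 2, 2 > 3, 3 > 1.
cyclic = [(1, 2), (2, 3), (3, 1)]
print(consistent_rankings(cyclic, [1, 2, 3]))  # empty list: no ranking fits

# Hypothetical scores given by three customers to each variation.
scores = {1: [4, 5, 3], 2: [3, 4, 4], 3: [2, 3, 5]}
averages = {variation: mean(values) for variation, values in scores.items()}
ranking = sorted(averages, key=averages.get, reverse=True)
print(ranking)
```

Averaging always yields a ranking because each variation is reduced to a single number, although variations with equal averages would still tie.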

NB: Other data preprocessing tasks are addressed in the final project.

Part 2: NB Classification for Text Documents

Exercise 1:
Using the following training document collection, use a Naïve Bayes (NB) classifier to assign
a class to the document ‘Chennai Hyderabad’. There are 2 classes: India and England.

Doc. No.   Document                   Class
1          Chennai Mumbai             India
2          Delhi London Hyderabad     England
3          Chennai Kolkata            India
4          Delhi Hyderabad Pune       India
5          London Bristol Chennai     England

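For a document of n words, the posterior score of a class cj is calculated as follows (this
product is proportional to the posterior probability of cj given the document):
P(w1, w2, …, wn|cj) = P(cj) * P(w1|cj) * P(w2|cj) * … * P(wn|cj)

The class prior is
P(cj) = Ncj / Nc
where
Ncj: is the total number of documents in class cj
Nc: is the total number of documents
and each word likelihood is
P(wi|cj) = Nwi,cj / Ncj
where
Nwi,cj: is the number of times word wi appears in documents of class cj
Ncj: is, in this formula, the count of words appearing in all documents of class cj (a word
count, not the document count used in the prior).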

P(‘Chennai Hyderabad’ | India) = P(India) * P(Chennai | India) * P(Hyderabad | India)
and
P(‘Chennai Hyderabad’ | England) = P(England) * P(Chennai | England) * P(Hyderabad | England)

Word likelihoods, P(word | class):
P(Chennai | India) = 2/7   // 2 occurrences of “Chennai” in documents of class India,
                           // and 7 words in all documents of class India
P(Hyderabad | India) = 1/7
P(Chennai | England) = 1/6
P(Hyderabad | England) = 1/6

Class priors:
P(India) = 3/5   // out of 5 training documents, 3 are listed under the class 'India'
P(England) = 2/5
Putting these together:
P(‘Chennai Hyderabad’ | India) = P(India) * P(Chennai | India) * P(Hyderabad | India)
= 3/5 * 2/7 * 1/7
= 0.6 * 0.286 * 0.143
≈ 0.0245
P(‘Chennai Hyderabad’ | England) = P(England) * P(Chennai | England) * P(Hyderabad |
England)
= 2/5 * 1/6 * 1/6
= 0.4 * 0.167 * 0.167
≈ 0.0111
After the calculation, we found that P(‘Chennai Hyderabad’ | India) > P(‘Chennai
Hyderabad’ | England).

Conclusion: the assigned class of the document is “India”.
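
The calculation above can be reproduced with a minimal sketch. The function names `train` and `score` are illustrative only. The sketch uses the same unsmoothed counts as the worked answer and exact fractions, so it does not handle words that never occur in a class.

```python
from collections import Counter
from fractions import Fraction

# Training collection from the exercise table.
training = [
    ("Chennai Mumbai", "India"),
    ("Delhi London Hyderabad", "England"),
    ("Chennai Kolkata", "India"),
    ("Delhi Hyderabad Pune", "India"),
    ("London Bristol Chennai", "England"),
]


def train(documents):
    """Count documents per class and word occurrences per class."""
    doc_counts = Counter(label for _, label in documents)
    word_counts = {label: Counter() for label in doc_counts}
    for text, label in documents:
        word_counts[label].update(text.split())
    return doc_counts, word_counts


def score(text, label, doc_counts, word_counts):
    """P(class) times the product of P(word | class), without smoothing."""
    total_docs = sum(doc_counts.values())
    total_words = sum(word_counts[label].values())
    result = Fraction(doc_counts[label], total_docs)
    for word in text.split():
        result *= Fraction(word_counts[label][word], total_words)
    return result


doc_counts, word_counts = train(training)
scores = {
    label: score("Chennai Hyderabad", label, doc_counts, word_counts)
    for label in doc_counts
}
predicted = max(scores, key=scores.get)

for label, value in scores.items():
    print(label, value, round(float(value), 4))
print("predicted class:", predicted)
```

Exact fractions avoid the small rounding differences that appear when each factor is rounded to three decimals first.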

Exercise 2: Naïve Bayes with numerical attributes and Laplace smoothing (zero probability)
Given the training data in the table below (Tennis data with some numerical
attributes), predict the class of the following new example using Naïve Bayes
classification: (outlook=overcast, temperature=60, humidity=62, windy=false).

Solution: Note that there is an error in the calculation: P(outlook=overcast | Yes) = 4/14
is wrong; the correct value is 4/9.

Laplace smoothing: add alpha to the numerator and (n * alpha) to the denominator.
Here, alpha = 1 and n = 3 (equiprobability over the 3 possible values of outlook, hence the
division by 3).
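
A minimal sketch of this smoothing rule follows. The helper name `laplace` is illustrative. The count 4 out of 9 for class Yes comes from the note above. The count 0 out of 5 for class No is an assumption: it supposes 14 training examples in total, 9 of them Yes, and no overcast example in class No, since the training table is not reproduced here.

```python
from fractions import Fraction


def laplace(count, total, n_values, alpha=1):
    """Smoothed estimate: (count + alpha) / (total + n_values * alpha)."""
    return Fraction(count + alpha, total + n_values * alpha)


# P(outlook=overcast | Yes): 4 of the 9 Yes examples, 3 possible outlook values.
unsmoothed_yes = Fraction(4, 9)
smoothed_yes = laplace(4, 9, n_values=3)

# ASSUMPTION: 0 overcast examples among 5 No examples (table not shown here).
smoothed_no = laplace(0, 5, n_values=3)

print("unsmoothed P(overcast | Yes):", unsmoothed_yes)
print("smoothed   P(overcast | Yes):", smoothed_yes)
print("smoothed   P(overcast | No): ", smoothed_no)
```

The point of the smoothing is that a zero count no longer forces the whole product of likelihoods to zero.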
Part 3: K-Means Clustering
Exercise 1:
Consider the following 10 data points in a 2-dimensional feature space:
a(1,7) ; b(2,7) ; c(6,6); d(3,5); e(4,5); f(3,4); g(7,3); h(1,2); i(6,2); j(3,1)
Your task is the execution of the K-Means algorithm with k = 3 based on these data points.
Assume the Euclidean distance as the distance measure. The initial centroids of the
three clusters are: centroid1(2,3); centroid2(4,7); centroid3(5,4)
A visualization of the data points and centroids can be seen in the figure below:
Round the centroids to integer positions. For instance, if the new position centroid1new would
be found at coordinates 2.5 and 5.3, assign centroid1rounded(3; 5) to be the new position of
the centroid. If a point has the same minimum distance to more than one centroid you can
assign the point randomly to one of the centroid candidates.
1. Run K-Means for 2 iterations and show every step of your calculation
2. The following table shows the ground truth labels of data points for the K-Means
clustering task:

Cluster I   Cluster II   Cluster III   Cluster IV
a           d            h             c
b           e            j             g
            f            i
Compute the purity score of your K-Means clustering results of question 1.
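
The two questions can be worked through with the following sketch, which is one possible run and not the reference solution. It makes three assumptions. Ties are broken by taking the first nearest centroid, whereas the exercise allows a random choice, so other valid runs may give different clusters. Rounding is half up, as in the exercise's example (2.5 becomes 3). The ground-truth labels assume that f belongs to Cluster II and i to Cluster III.

```python
import math
from collections import Counter

points = {
    "a": (1, 7), "b": (2, 7), "c": (6, 6), "d": (3, 5), "e": (4, 5),
    "f": (3, 4), "g": (7, 3), "h": (1, 2), "i": (6, 2), "j": (3, 1),
}

# ASSUMPTION: reading of the ground-truth table (f in II, i in III).
ground_truth = {
    "a": "I", "b": "I", "d": "II", "e": "II", "f": "II",
    "h": "III", "j": "III", "i": "III", "c": "IV", "g": "IV",
}


def round_half_up(value):
    """Round to the nearest integer, with .5 going up (2.5 -> 3)."""
    return math.floor(value + 0.5)


def assign(points, centroids):
    """Assign each point to the nearest centroid (first one wins on ties)."""
    labels = {}
    for name, (x, y) in points.items():
        # Squared Euclidean distance gives the same ordering as the distance.
        squared = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
        labels[name] = squared.index(min(squared))
    return labels


def update(points, labels, centroids):
    """Move each centroid to the rounded mean of its assigned points."""
    moved = []
    for k, old in enumerate(centroids):
        members = [points[n] for n, label in labels.items() if label == k]
        if not members:
            moved.append(old)  # keep an empty cluster's centroid in place
            continue
        mean_x = sum(x for x, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        moved.append((round_half_up(mean_x), round_half_up(mean_y)))
    return moved


def purity(labels, truth):
    """Sum of the majority-class counts of each cluster, divided by N."""
    majority_total = 0
    for k in set(labels.values()):
        classes = Counter(truth[n] for n, label in labels.items() if label == k)
        majority_total += max(classes.values())
    return majority_total / len(labels)


centroids = [(2, 3), (4, 7), (5, 4)]
for iteration in (1, 2):
    labels = assign(points, centroids)
    centroids = update(points, labels, centroids)
    print("iteration", iteration, "labels:", labels, "centroids:", centroids)

print("purity:", purity(labels, ground_truth))
```

Purity here counts, for each cluster found, the size of its largest ground-truth class, then divides the total by the number of points.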
