Data Mining
Part 1: Data Understanding
Exercise 1:
Is there any difference between the 2 situations? Is each one a data mining task or not?
Situation 1: We need to divide the customers of a company according to their profitability.
Situation 2: We need to forecast the profitability of new customers of the same company.
Answer:
Situation 1 assumes that customer profitability is measurable. It is therefore not a data
mining task: it is only a sorting task on profitability, followed by setting a minimum
profitability value as a threshold. Situation 2, in contrast, is a data mining task because it
involves prediction.
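To make the point about Situation 1 concrete, here is a minimal sketch of "sort, then apply a threshold". The customer records, the `THRESHOLD` value, and the function name `split_by_profitability` are all hypothetical and chosen only for illustration; no learning or prediction is involved.

```python
# Hypothetical customer records: (customer id, measured profitability).
customers = [("C1", 1200.0), ("C2", 150.0), ("C3", 830.0), ("C4", 40.0)]

# Hypothetical cut-off chosen by the analyst; not taken from the exercise.
THRESHOLD = 500.0


def split_by_profitability(records, threshold):
    """Sort customers by profitability (descending) and split at the threshold."""
    ranked = sorted(records, key=lambda record: record[1], reverse=True)
    high = [cid for cid, profit in ranked if profit >= threshold]
    low = [cid for cid, profit in ranked if profit < threshold]
    return high, low


high, low = split_by_profitability(customers, THRESHOLD)
print("High profitability:", high)
print("Low profitability:", low)
```

Situation 2 cannot be handled this way, because the profitability of a new customer is not yet observed and has to be estimated from other attributes.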
Exercise 2:
Let's consider the following table of patient medical records:
MedicalRecord ( Pat-Id, Weight, diabeteMeasure, NbOfSurgeries, HeartMeasure).
1. Using the following sample data values for 3 patients, are both the Weight and
diabeteMeasure columns mandatory for analyzing the medical data? Yes/No and justify.
2. How reliable is your conclusion? Explain.
Answer:
The ratio Weight/diabeteMeasure ( ≈ 7) is almost the same for the 3 rows. The 2
columns therefore seem to be correlated (they carry the same information), which suggests
that one of them could be dropped. However, this conclusion is not reliable because it is
based on a very small sample. To apply feature selection in this case, we need a
significantly larger ground-truth sample as our base (a sample of reasonably large size
compared to the initial population).
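The redundancy check described in this answer can be sketched as follows. The three rows are made-up values chosen only so that the ratio is close to 7; they are not the sample table from the exercise, and the helper name `pearson` is an assumption of this sketch.

```python
from math import sqrt

# Hypothetical values for 3 patients (NOT the exercise's sample table).
weight = [70.0, 85.0, 62.0]
diabete_measure = [10.0, 12.0, 9.0]

# Row-wise ratio Weight / diabeteMeasure.
ratios = [w / d for w, d in zip(weight, diabete_measure)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(weight, diabete_measure)
print("Ratios:", [round(ratio, 2) for ratio in ratios])
print("Pearson r:", round(r, 4))
# With only 3 rows, a high r can easily arise by chance, so it should not
# be used on its own to drop a column.
```

A near-constant ratio and a correlation close to 1 are what one would look for on a much larger sample before treating one of the two columns as redundant.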
Exercise 3:
Assume that we need to analyze customer satisfaction using “the number of customer
complaints for each product”. The company faces the following situation: the best-selling
product has the most complaints and therefore it has the worst customer satisfaction.
a) Is this measure reliable? Yes/No. Justify. If No, what do you suggest to fix it?
Explain.
Answer:
It is NOT reliable because the number of complaints alone is not sufficient to indicate how well a
product satisfies customers (the situation facing the company is an example: a product that sells
more units has more opportunities to receive complaints). A more reliable
measure is: number of complaints / number of products sold.
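The effect of normalizing by sales can be illustrated with a short sketch. The product names and all figures below are hypothetical and serve only to show how the raw count and the complaint rate can disagree.

```python
# Hypothetical figures: product -> (number of complaints, units sold).
products = {
    "A": (50, 10_000),  # best seller with the most complaints
    "B": (20, 1_000),
}

# Raw measure: number of complaints only.
worst_by_count = max(products, key=lambda name: products[name][0])

# Normalized measure: complaints per unit sold.
complaint_rate = {
    name: complaints / sold for name, (complaints, sold) in products.items()
}
worst_by_rate = max(complaint_rate, key=complaint_rate.get)

print("Worst by raw complaint count:", worst_by_count)
print("Complaint rates:", complaint_rate)
print("Worst by complaint rate:", worst_by_rate)
```

In this made-up case the best seller looks worst on the raw count, while the rate points to the other product.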
Now assume that we need to analyze customer preferences among products, and that each product
has several similar variations. We want to analyze the extent to which a customer prefers one
variation over other similar ones and evaluate which one customers prefer. To prepare
for this analysis, a pairwise comparison technique is suggested. Example: for 3 variations, we have
the customers compare variations 1 and 2, then 1 and 3, and finally 2 and 3. Finally, we rank
the product variations in order of customer preference.
a. Show how this technique can be inconsistent for the ordinal ranking of product
variations in terms of customer preference.
b. Suggest a better approach for this situation.
Answer:
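a. Pairwise comparisons can produce a cycle: if 1 is better than 2, 2 is better than 3, and 3 is
better than 1, then no consistent ordinal ranking of the 3 variations exists.
b. Compute the average score of each product variation over all customers, then make an overall
ranking based on these averages.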
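Both parts of this answer can be checked with a small sketch. The pairwise outcomes reproduce the cycle described in the answer; the per-customer scores, the customer names, and the function names are hypothetical and only illustrate the averaging approach.

```python
from itertools import permutations

# Pairwise outcomes as (preferred, less preferred): 1 > 2, 2 > 3, 3 > 1.
pairwise = [(1, 2), (2, 3), (3, 1)]


def consistent_orderings(items, results):
    """Return every total ordering that agrees with all pairwise results."""
    valid = []
    for order in permutations(items):
        position = {item: index for index, item in enumerate(order)}
        if all(position[win] < position[lose] for win, lose in results):
            valid.append(order)
    return valid


# Hypothetical scores given by each customer to each variation (higher is better).
scores = {
    "cust_a": {1: 5, 2: 3, 3: 4},
    "cust_b": {1: 4, 2: 2, 3: 5},
    "cust_c": {1: 5, 2: 4, 3: 3},
}


def rank_by_mean(all_scores):
    """Average each variation's score over all customers and rank by that mean."""
    items = sorted({item for per_customer in all_scores.values() for item in per_customer})
    means = {
        item: sum(per_customer[item] for per_customer in all_scores.values()) / len(all_scores)
        for item in items
    }
    ranking = sorted(items, key=lambda item: means[item], reverse=True)
    return ranking, means


orderings = consistent_orderings([1, 2, 3], pairwise)
ranking, means = rank_by_mean(scores)
print("Orderings consistent with the pairwise results:", orderings)
print("Ranking by mean score:", ranking)
```

An empty list of consistent orderings is exactly the inconsistency in part a, whereas ranking by a per-variation average always yields an ordering (ties aside).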
Exercise 1:
Using the following training document collection, use the Naive Bayes (NB) classifier to assign a
class to the document ‘Chennai Hyderabad’. We have 2 classes: India and England.
The posterior probability of a class cj for an n-word document is calculated, up to a
normalizing constant, as follows:
P(cj | w1, w2, …, wn) ∝ P(cj) * P(w1|cj) * P(w2|cj) * … * P(wn|cj)
The class prior is estimated as
P(cj) = Ncj / Nc
Where
Ncj: is the total number of documents in class cj
Nc: is the total number of documents
And the probability of a word given a class is estimated as
P(wi|cj) = Nwi,cj / Nwords,cj
Where
Nwi,cj: is the number of times word wi appears in documents of class cj
Nwords,cj: is the total count of words that appear in all documents of class cj.
P(India) = 3/5, because 3 of the 5 training documents are listed under the class 'India'.
P(England) = 2/5, because the remaining 2 documents are listed under the class 'England'.
In the same way:
P(India | ‘Chennai Hyderabad’) ∝ P(India) * P(Chennai | India) * P(Hyderabad | India)
= 3/5 * 2/7 * 1/7
= 0.6 * 0.286 * 0.143
= 0.0245
P(England | ‘Chennai Hyderabad’) ∝ P(England) * P(Chennai | England) * P(Hyderabad |
England)
= 2/5 * 1/6 * 1/6
= 0.4 * 0.167 * 0.167
= 0.0111
After the calculation, we find that the score for India is larger than the score for England
(0.0245 > 0.0111), so the document ‘Chennai Hyderabad’ is assigned to the class India.
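The arithmetic above can be reproduced with exact fractions. This is a minimal sketch that takes the priors and word probabilities directly from the worked answer (3/5, 2/5, 2/7, 1/7, 1/6, 1/6) rather than recomputing them from the training collection; the function name `nb_scores` is an assumption of this sketch.

```python
from fractions import Fraction as F

# Priors and per-class word probabilities as given in the worked answer.
priors = {"India": F(3, 5), "England": F(2, 5)}
likelihoods = {
    "India": {"Chennai": F(2, 7), "Hyderabad": F(1, 7)},
    "England": {"Chennai": F(1, 6), "Hyderabad": F(1, 6)},
}


def nb_scores(words, class_priors, word_likelihoods):
    """Unnormalized Naive Bayes score per class: prior times product of word probabilities."""
    result = {}
    for cls, prior in class_priors.items():
        score = prior
        for word in words:
            score *= word_likelihoods[cls][word]
        result[cls] = score
    return result


scores = nb_scores(["Chennai", "Hyderabad"], priors, likelihoods)
best = max(scores, key=scores.get)

for cls, score in scores.items():
    print(f"{cls}: {score} = {float(score):.4f}")
print("Predicted class:", best)
```

Working with `Fraction` avoids the rounding introduced by multiplying three-decimal approximations, so the comparison between the two classes is exact.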