
Data Mining Process - Part 3

Data Mining Process


Data Transformation
 Converting data from one shape into another.
 Data transformation includes various steps, such as:
 Discretization: converting continuous data into a set of intervals.
 Gold prices change every second, but we cannot use all of the data, so we take the gold price only once every hour.
 We define specific values (bins) and use the values of the continuous (numeric) field at these bins only.
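The idea of mapping a continuous field onto a small set of intervals can be sketched in Python. This is only an illustration: the bin edges, the labels, and the price values are hypothetical and do not come from the slides.

```python
from bisect import bisect_right

def discretize(value, edges, labels):
    """Return the label of the interval (bin) that contains value.

    edges must be sorted; len(labels) must equal len(edges) + 1.
    """
    return labels[bisect_right(edges, value)]

# Hypothetical bin edges and labels, chosen only for illustration.
edges = [1800, 2000]
labels = ["low", "medium", "high"]

# Hypothetical gold prices (continuous values).
prices = [1750.5, 1830.0, 1999.9, 2000.0, 2100.25]

binned = [discretize(p, edges, labels) for p in prices]
print(binned)  # ['low', 'medium', 'medium', 'high', 'high']
```

A value that is exactly equal to an edge falls into the higher interval here; that is a design choice of this sketch, not a rule.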
 Aggregation: grouping the values of many rows together.
 The original data contains the daily sales of a store, but we want to group them and use the monthly sales of the store instead.
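A minimal sketch of this aggregation in Python, assuming the daily sales are stored as (date, amount) pairs with dates written as YYYY-MM-DD. The records and the function name `monthly_totals` are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily sales records: (date as YYYY-MM-DD, amount).
daily_sales = [
    ("2024-01-01", 120),
    ("2024-01-02", 80),
    ("2024-02-01", 200),
    ("2024-02-15", 50),
]

def monthly_totals(records):
    """Group daily rows by month (YYYY-MM) and sum their amounts."""
    totals = defaultdict(int)
    for date, amount in records:
        month = date[:7]  # keep only the YYYY-MM part of the date
        totals[month] += amount
    return dict(totals)

monthly_sales = monthly_totals(daily_sales)
print(monthly_sales)  # {'2024-01': 200, '2024-02': 250}
```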
 Attribute Construction: generating a new field from existing field(s) (grouping columns together).

Name    Salary  Commission
Ahmad   700     150
Rana    800     100

Will be transformed into (Total Salary = Salary + Commission):

Name    Total Salary
Ahmad   850
Rana    900
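The same transformation can be sketched in Python using the two rows from the table. Representing each row as a dictionary and the function name `add_total_salary` are assumptions made for this illustration.

```python
# The two rows from the table, one dictionary per row.
rows = [
    {"Name": "Ahmad", "Salary": 700, "Commission": 150},
    {"Name": "Rana", "Salary": 800, "Commission": 100},
]

def add_total_salary(records):
    """Build a new field, Total Salary, from Salary and Commission."""
    return [
        {"Name": r["Name"], "Total Salary": r["Salary"] + r["Commission"]}
        for r in records
    ]

transformed = add_total_salary(rows)
for row in transformed:
    print(row["Name"], row["Total Salary"])
# Ahmad 850
# Rana 900
```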
Data Mining
 Before talking about the data mining phase, we will describe
the concepts of similarity and distance measurements.
 Similarity Measurement is used to find the degree of similarity
between two items.
 Distance, in contrast, is used to find the degree of dissimilarity
between them.
 Commonly, we apply similarity / distance measurements on
vectors.
Cosine Similarity
 One of the similarity measurements.
 The input is two items (documents for example), represented
as vectors, and the output is the degree of similarity between
them.
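 If the two items are identical, the similarity will be 1.
 If the two items are totally different (they share no terms), the similarity will be 0.
 Otherwise, the similarity will range between 0 and 1 (for vectors with non-negative values, such as term counts).
 Cos_Sim(D1, D2) = (D1 . D2) / (||D1|| * ||D2||), where D1 . D2 is the dot product and ||D|| is the 2-norm of D.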
Cosine Similarity Example
 We have two training documents
 D1: Team, Team, win (sport class)
 D2: Stock, Market (Eco class)
 We have one testing document
 D3: Team
 Find the cosine similarity between D3 and D1
 Answer: The training documents can be represented as:
      Team  Win  Stock  Market
D1    2     1    0      0
D2    0     0    1      1
 The testing document can be represented as:
      Team  Win  Stock  Market
D3    1     0    0      0
 Cos_Sim(1,3) = (2*1 + 1*0 + 0*0 + 0*0) / (sqrt(2^2 + 1^2) * sqrt(1^2)) = 2/sqrt(5) = 0.89
 Find Cos_Sim(2,3).
 Cos_Sim(2,3) = (0*1 + 0*0 + 1*0 + 1*0) / (sqrt(1^2 + 1^2) * sqrt(1^2)) = 0/sqrt(2) = 0
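The two calculations can be checked with a short Python sketch. The function name `cos_sim` is chosen for this illustration, and the sketch assumes neither vector is all zeros (otherwise the division would fail).

```python
import math

def cos_sim(a, b):
    """Cosine similarity: dot product divided by the product of the 2-norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # assumes both norms are non-zero

# Term counts over (Team, Win, Stock, Market), as in the example.
d1 = [2, 1, 0, 0]
d2 = [0, 0, 1, 1]
d3 = [1, 0, 0, 0]

print(round(cos_sim(d1, d3), 2))  # 0.89
print(round(cos_sim(d2, d3), 2))  # 0.0
```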
Euclidean Distance
 One of the distance measurements.
 The input is two items (documents for example), represented
as vectors, and the output is the distance between them.
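 If the two items are identical, the distance will be 0.
 There is no upper limit for the distance.
 Distance(D1, D2) = sqrt((D1[1] - D2[1])^2 + ... + (D1[n] - D2[n])^2), where n is the number of vector components.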
Distance Example
 We use the same documents as in the cosine similarity example: D1 = (2, 1, 0, 0), D2 = (0, 0, 1, 1), and D3 = (1, 0, 0, 0) over the terms (Team, Win, Stock, Market).
 Find the distance between D3 and D1.
 Distance(1,3) = sqrt((2-1)^2 + (1-0)^2 + (0-0)^2 + (0-0)^2) = sqrt(1+1+0+0) = sqrt(2) = 1.41
 Find Distance(2,3).
 Distance(2,3) = sqrt((0-1)^2 + (0-0)^2 + (1-0)^2 + (1-0)^2) = sqrt(1+0+1+1) = sqrt(3) = 1.73
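A short Python sketch can confirm both distances. The function name `distance` is chosen for this illustration.

```python
import math

def distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Term counts over (Team, Win, Stock, Market), as in the example.
d1 = [2, 1, 0, 0]
d2 = [0, 0, 1, 1]
d3 = [1, 0, 0, 0]

print(round(distance(d1, d3), 2))  # 1.41, which is sqrt(2)
print(round(distance(d2, d3), 2))  # 1.73, which is sqrt(3)
```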
Our First Data Mining Algorithm!
 Back to the previous example: two training documents, D1 (sport class) and D2 (Eco class), and one testing document, D3.
 We want to classify the testing document.
 One classifier is called K Nearest Neighbor (kNN).
 In its simplest form (k = 1), it classifies the testing document with the label of the closest training document.
 Closest = maximum similarity or minimum distance.
Cosine Similarity Example
 Cos_Sim(1,3) = (2*1 + 1*0 + 0*0 + 0*0) / (sqrt(2^2 + 1^2) * sqrt(1^2)) = 2/sqrt(5) = 0.89
 Cos_Sim(2,3) = 0
→ Document 1 is more similar to document 3 than document 2 is.
→ kNN will predict the label of document 3 to be the same as the label of document 1.
→ The label of document 1 is sport, so the label of document 3 will be sport.
 We now classify the same testing document using distance instead of similarity.
Distance Example
 Distance(1,3) = sqrt((2-1)^2 + (1-0)^2 + (0-0)^2 + (0-0)^2) = sqrt(1+1+0+0) = sqrt(2) = 1.41
 Distance(2,3) = sqrt(3) = 1.73
→ Document 1 is closer to document 3 than document 2 is.
→ Document 3 will have the label of document 1, which is sport.
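The classification rule "closest = maximum similarity or minimum distance" can be sketched in Python for the case of a single nearest neighbor (k = 1). The function `nearest_label` and its `use_similarity` parameter are hypothetical names introduced for this illustration.

```python
import math

def cos_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_label(training, test_vector, use_similarity=True):
    """Return the label of the single closest training item (k = 1)."""
    if use_similarity:
        # Closest means the highest cosine similarity.
        best = max(training, key=lambda item: cos_sim(item[0], test_vector))
    else:
        # Closest means the smallest Euclidean distance.
        best = min(training, key=lambda item: distance(item[0], test_vector))
    return best[1]

# (vector, label) pairs over the terms (Team, Win, Stock, Market).
training = [([2, 1, 0, 0], "sport"), ([0, 0, 1, 1], "Eco")]
d3 = [1, 0, 0, 0]

print(nearest_label(training, d3, use_similarity=True))   # sport
print(nearest_label(training, d3, use_similarity=False))  # sport
```

For k greater than 1, a classifier would collect the k closest items and take a majority vote over their labels; that step is left out of this sketch.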
Example 2
 We have two training documents
 D1: Team, win (sport class)
 D2: Stock, Market, Stock (Eco class)
 We have one testing document
 D3: Team, Stock
1. Represent the three documents using a document-term matrix.
2. What label would kNN predict for D3 if we use cosine similarity?
3. What label would kNN predict for D3 if we use Euclidean distance?
Answer
 1. The training documents can be represented as:
      Team  Win  Stock  Market
D1    1     1    0      0
D2    0     0    2      1
 The testing document can be represented as:
      Team  Win  Stock  Market
D3    1     0    1      0
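Building these rows amounts to counting how often each term appears in each document. A minimal Python sketch, assuming each document is given as a comma-separated string and that matching ignores upper and lower case; the function name `term_vector` is hypothetical.

```python
def term_vector(document, vocabulary):
    """Count how many times each vocabulary term occurs in the document."""
    words = [w.strip().lower() for w in document.split(",")]
    return [words.count(term) for term in vocabulary]

# Column order of the document-term matrix.
vocabulary = ["team", "win", "stock", "market"]

d1 = term_vector("Team, win", vocabulary)
d2 = term_vector("Stock, Market, Stock", vocabulary)
d3 = term_vector("Team, Stock", vocabulary)

print(d1)  # [1, 1, 0, 0]
print(d2)  # [0, 0, 2, 1]
print(d3)  # [1, 0, 1, 0]
```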
 2. If kNN is used with cosine similarity:
Cos_Sim(1,3) = (1*1 + 1*0 + 0*1 + 0*0) / (sqrt(1^2 + 1^2) * sqrt(1^2 + 1^2))
= 1 / (sqrt(2) * sqrt(2))
= 1/2 = 0.5

Cos_Sim(2,3) = (0*1 + 0*0 + 2*1 + 1*0) / (sqrt(2^2 + 1^2) * sqrt(1^2 + 1^2))
= 2 / (sqrt(5) * sqrt(2))
= 0.63
 D2 is more similar to D3, so the predicted label of D3 will be Eco (economy).

 3. If kNN is used with Euclidean distance:
Distance(1,3) = sqrt((1-1)^2 + (1-0)^2 + (0-1)^2 + (0-0)^2) = sqrt(0+1+1+0) = sqrt(2) = 1.41
Distance(2,3) = sqrt((0-1)^2 + (0-0)^2 + (2-1)^2 + (1-0)^2) = sqrt(1+0+1+1) = sqrt(3) = 1.73
 D1 is closer to D3, so the predicted label of D3 using distance will be sport.

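 Note that cosine similarity gave a different result from Euclidean distance: Eco with similarity, sport with distance.
 For text documents, cosine similarity is commonly more accurate than Euclidean distance, because it compares the direction of the term vectors and is not affected by document length.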
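The disagreement between the two measures in Example 2 can be reproduced with a short Python sketch. The helper names and the dictionaries `sims` and `dists` are chosen for this illustration.

```python
import math

def cos_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Example 2 vectors over the terms (Team, Win, Stock, Market).
d1 = [1, 1, 0, 0]  # sport
d2 = [0, 0, 2, 1]  # Eco
d3 = [1, 0, 1, 0]  # testing document

sims = {"sport": cos_sim(d1, d3), "Eco": cos_sim(d2, d3)}
dists = {"sport": distance(d1, d3), "Eco": distance(d2, d3)}

by_similarity = max(sims, key=sims.get)  # label with the highest similarity
by_distance = min(dists, key=dists.get)  # label with the smallest distance

print(by_similarity)  # Eco
print(by_distance)    # sport
```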
