Data Mining Process

Data Mining Process - Part3

Data Transformation
 Converting data from one form into another.
 Data transformation includes various steps, such as:
 Discretization: converting continuous data into a set of
intervals.
 Gold prices change every second, but we cannot use all
the data, so we take the gold price only once every hour.
 We define specific values (bins) and use the values of
the continuous (numeric) field at these bins only.
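As a minimal sketch of discretization into intervals, the following Python snippet maps each continuous value to the index of the bin that contains it. The bin boundaries and the prices are made-up illustrative values, and `discretize` is a hypothetical helper name, not something defined in these slides.

```python
from bisect import bisect_right

def discretize(value, edges):
    """Return the index of the interval that contains `value`.

    `edges` must be sorted. Index 0 means value < edges[0], index i means
    edges[i-1] <= value < edges[i], and len(edges) means value >= edges[-1].
    """
    return bisect_right(edges, value)

# Hypothetical bin boundaries and gold prices (illustrative values only).
edges = [1800.0, 1900.0, 2000.0]
prices = [1795.2, 1850.0, 1900.0, 1999.9, 2050.5]

bins = [discretize(p, edges) for p in prices]
print(bins)
```

Each price is replaced by a small integer bin index, so downstream steps work with a handful of intervals instead of every distinct price.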
Data Transformation
 Aggregation : Grouping values of many rows together.
 The original data contains the daily sales of a store, but
we want to group them and use the monthly sales of
the store.
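The daily-to-monthly grouping can be sketched in a few lines of Python. The dates and amounts below are invented for illustration, and `monthly_totals` is a hypothetical helper; the sketch assumes dates are ISO strings of the form YYYY-MM-DD.

```python
from collections import defaultdict

def monthly_totals(daily_rows):
    """Sum daily sales amounts per month.

    Assumes each row is (date, amount) with date formatted as 'YYYY-MM-DD',
    so the first seven characters identify the month.
    """
    totals = defaultdict(float)
    for date, amount in daily_rows:
        totals[date[:7]] += amount
    return dict(totals)

# Illustrative daily sales of one store (made-up numbers).
daily_sales = [
    ("2024-01-05", 120.0),
    ("2024-01-20", 80.0),
    ("2024-02-03", 200.0),
]

monthly = monthly_totals(daily_sales)
print(monthly)
```

Three daily rows collapse into two monthly rows, which is the row-wise grouping described above.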
Data Transformation
 Attribute Construction: Generating a new field from
existing field(s). (Grouping columns together).

Name    Salary  Commission
Ahmad   700     150
Rana    800     100
…

Will be transformed into:

Name    Total Salary
Ahmad   850
Rana    900
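The same transformation can be sketched in Python using the two rows shown in the table. Representing each row as a dictionary and the helper name `add_total_salary` are assumptions made for this illustration.

```python
def add_total_salary(row):
    """Build a new record whose 'Total Salary' combines two existing fields."""
    return {
        "Name": row["Name"],
        "Total Salary": row["Salary"] + row["Commission"],
    }

# The two rows from the table above.
rows = [
    {"Name": "Ahmad", "Salary": 700, "Commission": 150},
    {"Name": "Rana", "Salary": 800, "Commission": 100},
]

transformed = [add_total_salary(r) for r in rows]
for record in transformed:
    print(record["Name"], record["Total Salary"])
```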
Data Mining
 Before talking about the data mining phase, we will describe
the concepts of similarity and distance measurements.
 Similarity Measurement is used to find the degree of similarity
between two items.
 Distance, in contrast, is used to find the degree of dissimilarity
between them.
 Commonly, we apply similarity / distance measurements on
vectors.
Cosine Similarity
 One of the similarity measurements.
 The input is two items (documents, for example), represented
as vectors, and the output is the degree of similarity between
them.
 If the two items are identical, the similarity will be 1.
 If the two items are totally different (for documents, no shared
terms), the similarity will be 0.
 Otherwise, for vectors with non-negative values such as term
counts, the similarity will range between 0 and 1.
 Cos_Sim(D1, D2) = (D1 . D2) / (||D1|| * ||D2||), where D1 . D2 is
the dot product and ||D|| is the 2-norm (Euclidean length) of D.
Cosine Similarity Example
 We have two training documents
 D1: Team, Team, win (sport class)
 D2: Stock, Market (Eco class)
 We have one testing document
 D3: Team
 Find the cosine similarity between D3 and D1
 Answer: Training can be represented as:

     Team  Win  Stock  Market
D1   2     1    0      0
D2   0     0    1      1

 Testing can be represented as:

     Team  Win  Stock  Market
D3   1     0    0      0

 Cos_Sim(1,3) = (2*1+1*0+0*0+0*0)/(sqrt(2^2+1^2)*sqrt(1^2))
= 2/sqrt(5) = 0.89
 Find Cos_Sim(2,3) ?
 Cos_Sim(2,3) = (0*1 + 0*0 + 1*0 + 1*0) / (sqrt(1^2 + 1^2)*sqrt(1^2))
= 0.
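The hand calculation above can be checked with a short Python sketch. The function name `cosine_similarity` is our own choice, and returning 0 for an all-zero vector is an assumption made here to avoid division by zero; the slides do not cover that case.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their 2-norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)

# Term counts over (Team, Win, Stock, Market) from the example.
d1 = [2, 1, 0, 0]
d2 = [0, 0, 1, 1]
d3 = [1, 0, 0, 0]

print(round(cosine_similarity(d1, d3), 2))  # 0.89
print(cosine_similarity(d2, d3))            # 0.0
```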
Euclidean Distance
 One of the distance measurements.
 The input is two items (documents, for example), represented
as vectors, and the output is the distance between them.
 Distance(D1, D2) = sqrt of the sum, over all vector positions i,
of (D1[i] - D2[i])^2.
 If the two items are identical, the distance will be 0.
 There is no upper limit for the distance.
Distance Example
 D3: Team
 Find the distance between D3 and D1
 Answer: Training can be represented as:

     Team  Win  Stock  Market
D1   2     1    0      0
D2   0     0    1      1

 Testing can be represented as:

     Team  Win  Stock  Market
D3   1     0    0      0

 Distance(1,3) = sqrt((2-1)^2 + (1-0)^2 + (0-0)^2 + (0-0)^2)
= sqrt(1+1+0+0)
= sqrt(2).
 Find Distance(2,3) ?
 Distance(2,3) = sqrt((0-1)^2 + (0-0)^2 + (1-0)^2 + (1-0)^2)
= sqrt(1+0+1+1)
= sqrt(3).
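A minimal Python sketch of the same distance calculation follows. The function name `euclidean_distance` is our own; Python 3.8+ also provides `math.dist`, which computes the same quantity.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Term counts over (Team, Win, Stock, Market) from the example.
d1 = [2, 1, 0, 0]
d2 = [0, 0, 1, 1]
d3 = [1, 0, 0, 0]

dist_13 = euclidean_distance(d1, d3)  # sqrt(2)
dist_23 = euclidean_distance(d2, d3)  # sqrt(3)
print(dist_13, dist_23)
```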
Our First Data Mining Algorithm!
 Back to the previous example:
 D3: Team
 We want to classify the testing document
 One classifier is called K Nearest Neighbor (kNN)
 In its simplest form (k = 1), it classifies the testing document based on
the closest training document.
 Closest = max similarity or min distance
 Cos_Sim(1,3) = (2*1+1*0+0*0+0*0)/(sqrt(2^2+1^2)*sqrt(1^2))
= 2/sqrt(5) = 0.89
 Cos_Sim(2,3) = 0
 → Document 1 is more similar to document 3 than document 2 is.
 → kNN will predict the label of document 3 to be the same as the
label of document 1.
 → The label of document 1 is sport, so the label of document 3 will be
sport.
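The prediction above can be sketched as a nearest-neighbor lookup with k = 1 in Python. The names `cosine_similarity` and `predict_1nn`, and the representation of the training set as (vector, label) pairs, are assumptions for this illustration; a full kNN would take a vote among the k closest documents.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the 2-norms (0 if a norm is 0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: all-zero vectors are similar to nothing
    return dot / (norm_a * norm_b)

def predict_1nn(training, query):
    """Return the label of the training vector most similar to `query`."""
    _, best_label = max(
        training, key=lambda item: cosine_similarity(item[0], query)
    )
    return best_label

# Training documents over (Team, Win, Stock, Market) with their classes.
training = [
    ([2, 1, 0, 0], "sport"),  # D1
    ([0, 0, 1, 1], "eco"),    # D2
]
d3 = [1, 0, 0, 0]

prediction = predict_1nn(training, d3)
print(prediction)  # sport
```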
Our First Data Mining Algorithm!
 Back to the previous example:
 D3: Team
 We want to classify the testing document using distance
 Distance(1,3) = sqrt((2-1)^2 + (1-0)^2 + (0-0)^2 + (0-0)^2)
= sqrt(1+1+0+0)
= sqrt(2).
 Distance(2,3) = sqrt(3)
 → Document 1 is closer to document 3 than document 2 is.
 → Document 3 will have the label of document 1, which is
sport.
Example 2
 D1: Team, win (sport class)
 D2: Stock, Market, Stock (Eco class)
 D3: Team, Stock
1. Represent the three documents using a document-term matrix.
2. What would be the predicted label of D3 if kNN is used with
cosine similarity?
3. What would be the predicted label of D3 if kNN is used with
Euclidean distance?
Answer
 D1: Team, win (sport class)
 D2: Stock, Market, Stock (Eco class)
 D3: Team, Stock
 Answer: 1. Training can be represented as:

     Team  Win  Stock  Market
D1   1     1    0      0
D2   0     0    2      1

 Testing can be represented as:

     Team  Win  Stock  Market
D3   1     0    1      0
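Building the document-term matrix can be sketched in Python as counting how often each vocabulary term occurs in each document. The helper name `term_vector`, the fixed vocabulary order (Team, Win, Stock, Market), and the case-insensitive matching are assumptions made for this illustration.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Count how many times each vocabulary term occurs in the document."""
    counts = Counter(word.lower() for word in document)
    return [counts[term] for term in vocabulary]

# Column order of the matrix (lower-cased so "win" and "Win" match).
vocabulary = ["team", "win", "stock", "market"]

d1 = term_vector(["Team", "win"], vocabulary)
d2 = term_vector(["Stock", "Market", "Stock"], vocabulary)
d3 = term_vector(["Team", "Stock"], vocabulary)

for name, row in (("D1", d1), ("D2", d2), ("D3", d3)):
    print(name, row)
```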
 2. If kNN is used with cosine similarity:
Cos_Sim(1,3) = (1*1+1*0+0*1+0*0)/(sqrt(1^2+1^2)*sqrt(1^2+1^2))
= 1/(sqrt(2)*sqrt(2))
= 1/2 = 0.5
Cos_Sim(2,3) = (0*1+0*0+2*1+1*0)/(sqrt(2^2+1^2)*sqrt(1^2+1^2))
= 2/(sqrt(5)*sqrt(2))
= 0.63
 → D3 is more similar to D2, so the prediction for D3 will be economy (Eco class).
 3. If kNN is used with Euclidean distance:

Distance(1,3) = sqrt((1-1)^2 + (1-0)^2 + (0-1)^2 + (0-0)^2)
= sqrt(0+1+1+0)
= sqrt(2)
Distance(2,3) = sqrt((0-1)^2 + (0-0)^2 + (2-1)^2 + (1-0)^2)
= sqrt(1+0+1+1)
= sqrt(3)
 → D3 is closer to D1, so the label of D3 using distance will be sport.
 Note that using cosine similarity gave a different result from
using distance.
 For text documents, cosine similarity is commonly preferred over the
Euclidean distance, because it compares the direction of the vectors and
is not affected by document length.
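The disagreement between the two measures in Example 2 can be reproduced with a short Python sketch. The function names and the (vector, label) layout of the training set are assumptions for this illustration; both predictions use the single closest training document (k = 1).

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the 2-norms (0 if a norm is 0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: all-zero vectors are similar to nothing
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Example 2 over (Team, Win, Stock, Market).
training = [
    ([1, 1, 0, 0], "sport"),  # D1
    ([0, 0, 2, 1], "eco"),    # D2
]
d3 = [1, 0, 1, 0]

# Closest = maximum similarity, or minimum distance.
by_cosine = max(training, key=lambda item: cosine_similarity(item[0], d3))[1]
by_distance = min(training, key=lambda item: euclidean_distance(item[0], d3))[1]

print(by_cosine)    # eco
print(by_distance)  # sport
```

D2 repeats the term Stock, which makes its vector longer; that moves it further from D3 in Euclidean terms even though the two share a term, while the cosine measure only looks at direction.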
