
Machine Learning

Lecture 1:

1. Rules:
• Mean (average): $\bar{x} = \frac{\sum f x}{n}$, where $f$ is the frequency of each value $x$ and $n$ is the total number of values.
• Median: the middle value of the numbers arranged from smallest to largest, if the count of numbers is odd.
If the count of numbers is even, the median is the sum of the two middle values of the arranged numbers divided by 2.
• Mode: the value that occurs most frequently.
• IQR: $Q_3 - Q_1$
• Outliers: any value lower than $Q_1 - 1.5\,(IQR)$ or higher than $Q_3 + 1.5\,(IQR)$.
• Variance: $S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
• Standard deviation: $S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
• z-score: $z = \frac{x - \bar{x}}{S}$
2. Examples:
1) Given the following numerical dataset:
244 191 160 187 180 176 174 205 211 183 211 180 194 200
i. Calculate the mean, median, the first quartile, the third quartile and the
IQR
ii. Are there any outliers in this data? If so, identify them.
iii. Construct the box plot for this data.
Solution:
The first step is to sort the values from smallest to largest:
160 174 176 180 180 183 187 191 194 200 205 211 211 244

I. Mean: $\bar{x} = \frac{160 + 174 + 176 + 180 + 180 + 183 + 187 + 191 + 194 + 200 + 205 + 211 + 211 + 244}{14} = \frac{2696}{14} \approx 192.57$
Median (Q2): $\frac{187 + 191}{2} = 189$
To find Q1 and Q3, place the median in the middle of the sorted dataset:
160 174 176 180 180 183 187 189 191 194 200 205 211 211 244
Then split the values into two halves, one before the median and one after it.
Lower half: 160 174 176 180 180 183 187
The median of the lower half is 180, so Q1 = 180.
Upper half: 191 194 200 205 211 211 244
The median of the upper half is 205, so Q3 = 205.
IQR = Q3 - Q1 = 205 - 180 = 25
II. Outliers: Q1 - 1.5 (IQR) = 180 - 1.5 (25) = 142.5
Q3 + 1.5 (IQR) = 205 + 1.5 (25) = 242.5
The outliers are the values smaller than 142.5 or greater than 242.5.
244 is an outlier.
III. [Box plot of the data, with Min, Q1, Q2, Q3 and Max marked along an axis running from 0 to 250.]
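The steps of this example can be checked with a short Python sketch. It assumes the same quartile method the solution uses (Q1 and Q3 are the medians of the lower and upper halves of the sorted data); the function name `summarize` and the dictionary keys are only illustrative.

```python
from statistics import mean, median

def summarize(values):
    """Mean, quartiles, IQR and outliers using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count the median itself is left out of both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    q1, q2, q3 = median(lower), median(data), median(upper)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "mean": mean(data),
        "median": q2,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "fences": (low_fence, high_fence),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

data = [244, 191, 160, 187, 180, 176, 174, 205, 211, 183, 211, 180, 194, 200]
summary = summarize(data)
for name, value in summary.items():
    print(name, value)
```

Other quartile conventions exist and can give slightly different Q1 and Q3, so the sketch only mirrors the method used in these notes.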
2) Ali and Hany took different exams (A and B). Ali's grade was 86 and Hany's grade was 82. If we want to compare their test scores, who did better, Ali or Hany?

          Mean    Standard deviation (S)
Exam A    79      3.8
Exam B    77      2.5

Solution:
First, why do we use the z-score? When two students take two different exams and we want to know who did better, a higher raw grade does not by itself mean a better result. Each student has to be compared against the grades of his own exam, and this is what the z-score does.

$z_{\text{exam A}} = \frac{x - \bar{x}}{S} = \frac{86 - 79}{3.8} = 1.842$ (for Ali)

$z_{\text{exam B}} = \frac{x - \bar{x}}{S} = \frac{82 - 77}{2.5} = 2$ (for Hany)

Hany has the higher z-score, so Hany did better than Ali relative to his own exam.
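The comparison can be reproduced with a few lines of Python. This is a minimal sketch of the z-score rule from Lecture 1; the helper name `z_score` is an illustrative choice.

```python
def z_score(x, mean, std):
    """How many standard deviations x lies above (or below) the mean."""
    return (x - mean) / std

ali = z_score(86, mean=79, std=3.8)    # exam A
hany = z_score(82, mean=77, std=2.5)   # exam B
better = "Hany" if hany > ali else "Ali"
print(round(ali, 3), round(hany, 3), better)
```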


Lecture 2:

1. Rules:
1. z-score: $z = \frac{x - \bar{x}}{S}$, where $\bar{x}$ is the mean of $x$ and $S$ is the standard deviation.
2. $r = \frac{\sum z_x z_y}{n - 1}$ (correlation coefficient rule)
3. $b_1 = r \left( \frac{s_y}{s_x} \right)$
4. $b_0 = \bar{y} - b_1 \bar{x}$
5. $\hat{y} = b_1 x + b_0$ (predicted value)
6. Error $= |\hat{y} - y|$, where $\hat{y}$ is the predicted value and $y$ is the real value.

2. How to determine whether a correlation is strong or weak
If the correlation is:
• Equal to 1: perfect positive (direct relation)
• Equal to -1: perfect negative (inverse relation)
• Between 0.6 and 0.99: strong positive (direct relation)
• Between -0.6 and -0.99: strong negative (inverse relation)
• Between 0.3 and 0.6: moderate positive (moderate relation)
• Between -0.3 and -0.6: moderate negative (moderate relation)
• Between 0 and 0.3: weak positive (weak relationship)
• Between 0 and -0.3: weak negative (weak relationship)
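The ranges above can be turned into a small lookup function. This is only a sketch: the notes list 0.3 and 0.6 in two ranges each and do not label r = 0, so the code assumes that 0.6 counts as strong, 0.3 counts as moderate, and 0 is reported as no linear correlation. The function name is illustrative.

```python
def correlation_strength(r):
    """Label a correlation coefficient using the ranges from the notes."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"  # assumption: not covered in the notes
    sign = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        level = "perfect"
    elif size >= 0.6:   # assumption: the boundary 0.6 counts as strong
        level = "strong"
    elif size >= 0.3:   # assumption: the boundary 0.3 counts as moderate
        level = "moderate"
    else:
        level = "weak"
    return f"{level} {sign}"

print(correlation_strength(-0.7257))
print(correlation_strength(0.25))
```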
3. Examples:
A high school has a room set aside after school for students to play games. The attendance data are summarized in the following table:

Week (x)         1    2    3    4    5    6    7    8
Attendance (y)   73   78   84   88   29   35   39   44

I. Calculate the correlation coefficient between the attendance of students and the semester weeks. Comment on its value. ($\bar{x}$ (mean of weeks) = 4.50, $\bar{y}$ (mean of attendance) = 58.75, $s_x$ = 2.449489743, $s_y$ = 24.27079374)
II. How many students would attend the games room in week ten?
III. What is the error in the predicted value at week 4?

Solution:
I. Here we compute r. To get it, we extend the table above with three rows: the first is the z-score of x, the second is the z-score of y, and the third is the product of the two z-score rows. Most importantly, do not forget to keep the digits after the decimal point while computing them (the table below keeps four). Then we compute r.
$z_x = \frac{x - \bar{x}}{s_x}$, $z_y = \frac{y - \bar{y}}{s_y}$

Week (x)         1        2        3        4        5        6        7        8
Attendance (y)   73       78       84       88       29       35       39       44
$z_x$            -1.4289  -1.0206  -0.6124  -0.2041  0.2041   0.6124   1.0206   1.4289
$z_y$            0.5871   0.7931   1.0403   1.2052   -1.2258  -0.9785  -0.8137  -0.6077
$z_x z_y$        -0.8389  -0.8095  -0.6371  -0.2460  -0.2502  -0.5992  -0.8305  -0.8684

Sum the last row to get r:
$\sum z_x z_y = -5.0798$
$r = \frac{\sum z_x z_y}{n - 1} = \frac{-5.0798}{8 - 1} = -0.7257$

Then we look up the result: it lies between -0.6 and -0.99, so the correlation is strong negative (inverse relation).
II. How many students would attend the games room in week ten?
The question is clearly asking for the predicted value $\hat{y}$, because the table has no week number 10. To get it, we first need $b_0$ and $b_1$.
$b_1 = r \left( \frac{s_y}{s_x} \right) = -0.7257 \left( \frac{24.27079374}{2.449489743} \right) = -7.19$
$b_0 = \bar{y} - b_1 \bar{x} = 58.75 - (-7.19)(4.5) = 91.1$
$\hat{y} = b_1 x + b_0 = 91.1 - 7.19 x$
At $x = 10$: $\hat{y} = 91.1 - 7.19 (10) = 19.2$
III. What is the error in the predicted value at week 4?
Here we need the error, so we first compute the predicted value $\hat{y}$ at x = 4. Do not forget that y = 88 at week 4 in the original table.
$\hat{y} = b_1 x + b_0 = 91.1 - 7.19 x$
At $x = 4$: $\hat{y} = 91.1 - 7.19 (4) = 62.34 \approx 62$
Error $= |\hat{y} - y| = |62 - 88| = 26$
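The whole attendance example (r, then b1 and b0, then the prediction and its error) can be reproduced with the sketch below. It applies the Lecture 2 rules directly; the names `fit_line` and `predict` are illustrative. Because the code keeps full precision, its results differ slightly from the hand calculation, which rounds at each step (for example, the error at week 4 comes out near 25.65 rather than exactly 26).

```python
from statistics import mean, stdev

weeks = [1, 2, 3, 4, 5, 6, 7, 8]
attendance = [73, 78, 84, 88, 29, 35, 39, 44]

def fit_line(xs, ys):
    """Return (r, b1, b0) using r = sum(zx * zy) / (n - 1) and b1 = r * sy / sx."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return r, b1, b0

r, b1, b0 = fit_line(weeks, attendance)

def predict(x):
    """Predicted attendance for week x."""
    return b1 * x + b0

week10 = predict(10)
error_week4 = abs(predict(4) - attendance[3])  # the real value at week 4 is 88
print(round(r, 4), round(b1, 2), round(b0, 1))
print(round(week10, 1), round(error_week4, 2))
```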
Another Example:

Solution:
i. Sales = 332.0269 + 3.1924 (Advertising cost)
If you look closely at this equation, it has the same form as $\hat{y} = b_1 x + b_0$, which means $b_1 = 3.1924$.
The standard deviation of x and the standard deviation of y are given, so:
$b_1 = r \left( \frac{s_y}{s_x} \right)$
$3.1924 = r \left( \frac{61.9476}{12.5277} \right)$, so $r = 0.6456$
This lies between 0.6 and 0.99, so the correlation is strong positive (direct relation).
ii. Calculate the real and estimated values of sales at week 30
The real value is missing, so we recover it from the mean (average):
$\bar{y} = \frac{\sum y}{n}$
$436.6667 = \frac{385 + 400 + 359 + 365 + y + 440 + 490 + 420 + 560}{9} = \frac{3455 + y}{9}$, so $y = 475$
To calculate the estimated value we use the regression equation:
Sales = 332.0269 + 3.1924 (Advertising cost) = 332.0269 + 3.1924 (30) = 427.7989
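Part i and the estimate in part ii are short enough to check in Python. The sketch rearranges $b_1 = r (s_y / s_x)$ to solve for r; the variable names are illustrative, and the numbers are the ones quoted in the solution.

```python
b0 = 332.0269          # intercept of the sales equation
b1 = 3.1924            # slope of the sales equation
s_x = 12.5277          # standard deviation of advertising cost
s_y = 61.9476          # standard deviation of sales

# b1 = r * (s_y / s_x)  =>  r = b1 * s_x / s_y
r = b1 * s_x / s_y

# Estimated sales when the advertising cost is 30
estimated_sales = b0 + b1 * 30
print(round(r, 4), round(estimated_sales, 4))
```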
Lecture 4:

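1. Rules:
The Bayes rule is a way of going from P(X|Y), known from the training dataset, to P(Y|X).

Bayes Rule: $P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)}$

Naïve Bayes: the attributes $x_1, \dots, x_n$ of $X$ are assumed to be independent given the class $C_i$, so $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$, and $X$ is assigned to the class with the largest $P(X|C_i)\,P(C_i)$.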
2. Example:
Age Income Student Credit rating Buys Computer
<=30 High No Fair No
<=30 High No Excellent No
31..40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31..40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31..40 Medium No Excellent Yes
31..40 High Yes Fair Yes
>40 Medium No Excellent No
Given the training dataset above, use the naive Bayesian classifier to classify this instance:
X = (age <=30, Income = medium, Student = yes, Credit rating = Fair)
Will this person buy a computer or not?
Solution:
The first step is to compute the probability of each class.
Class:
• C1: buys_computer = ‘yes’ C2: buys_computer = ‘no’
◼ Compute P(Ci) for each class:
◼ P(C1) = P(buys computer = “yes”) = 9/14 = 0.643
◼ P(C2) = P(buys computer = “no”) = 5/14 = 0.357
The second step is to take each column (attribute) of the new instance and compute its probability given each class.
Data to be classified:
• X = (age <=30, Income = medium, Student = yes, Credit rating = Fair)

Age      Buys Computer   Count   Total   Conditional Probability
<= 30    Yes             2       9       2/9 = 0.222
<= 30    No              3       5       3/5 = 0.6

Income   Buys Computer   Count   Total   Conditional Probability
Medium   Yes             4       9       4/9 = 0.444
Medium   No              2       5       2/5 = 0.4

Student  Buys Computer   Count   Total   Conditional Probability
Yes      Yes             6       9       6/9 = 0.667
Yes      No              1       5       1/5 = 0.2

Credit Rating   Buys Computer   Count   Total   Conditional Probability
Fair            Yes             6       9       6/9 = 0.667
Fair            No              2       5       2/5 = 0.4

The same probabilities can also be written in the following form.
◼ Compute P(X|Ci) for each class
P(age = “<=30” | buys computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys computer = “no”) = 1/5 = 0.2
P(credit rating = “fair” | buys computer = “yes”) = 6/9 = 0.667
P(credit rating = “fair” | buys computer = “no”) = 2/5 = 0.4
Do not forget these two steps:
Multiply all the “yes” probabilities together.
Multiply all the “no” probabilities together.
◼ X = (age <= 30, income = medium, student = yes, credit rating = fair)
P(X|Ci): P(X | buys computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
Then multiply the “yes” product by P(yes), and multiply the “no” product by P(no):

P(X|Ci)*P(Ci): P(X | buys computer = “yes”) * P(buys computer = “yes”) = 0.044 x 0.643 = 0.028

P(X | buys computer = “no”) * P(buys computer = “no”) = 0.019 x 0.357 = 0.007

Since 0.028 > 0.007, X belongs to the class “buys computer = yes”.
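The counting steps of this solution can be written as a short Python sketch. It uses plain frequency counts with no smoothing, which matches the hand calculation but would give a probability of zero for any attribute value never seen with a class. The function name `naive_bayes_scores` and the lowercase labels are illustrative choices.

```python
from collections import Counter

# (age, income, student, credit rating, buys computer) from the training table
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_scores(rows, x):
    """Return P(X|Ci) * P(Ci) for every class Ci, using raw counts."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)                  # P(Ci)
        for i, value in enumerate(x):
            matches = sum(1 for row in rows
                          if row[-1] == label and row[i] == value)
            score *= matches / count               # P(x_i | Ci)
        scores[label] = score
    return scores

x = ("<=30", "medium", "yes", "fair")
scores = naive_bayes_scores(rows, x)
prediction = max(scores, key=scores.get)
print({label: round(s, 3) for label, s in scores.items()}, prediction)
```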


Lecture 5

• Manhattan distance $= |x_1 - x_2| + |y_1 - y_2|$
• Accuracy $= \frac{TP + TN}{ALL}$
• Error rate $= 1 - \text{accuracy}$, or $\frac{FP + FN}{ALL}$
• Sensitivity (recall) $= \frac{TP}{P} = \frac{TP}{TP + FN}$
• Specificity $= \frac{TN}{N} = \frac{TN}{FP + TN}$
• Precision $= \frac{TP}{P'} = \frac{TP}{TP + FP}$

Suppose we have this table


X1 X2 Class
7 7 Bad
7 4 Bad
3 4 Good
1 4 Good
◼ Given the data instance X1 = 3 and X2 = 7, determine the suitable class of this instance using the KNN algorithm with the Manhattan distance as the similarity measure (use K = 3 as the number of neighbors to be considered).
Solution:
The first step is to compute the Manhattan distance between the new point and every point in the table.

X1   X2   Distance                  Class
7    7    |3 - 7| + |7 - 7| = 4     Bad
7    4    |3 - 7| + |7 - 4| = 7     Bad
3    4    |3 - 3| + |7 - 4| = 3     Good
1    4    |3 - 1| + |7 - 4| = 5     Good

Next, rank the distances from smallest to largest. Since K = 3, take the first three ranks and leave out the rest.

X1   X2   Distance   Rank   In the 3 nearest neighbors?
7    7    4          2      Yes
7    4    7          4      No
3    4    3          1      Yes
1    4    5          3      Yes

Then check whether the three selected neighbors are Good or Bad.

X1   X2   Distance   Rank   In the 3 nearest neighbors?   Class
7    7    4          2      Yes                           Bad
7    4    7          4      No
3    4    3          1      Yes                           Good
1    4    5          3      Yes                           Good

◼ We have 2 Good and 1 Bad.
◼ We conclude that the data instance X1 = 3 and X2 = 7 belongs to the Good category.
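The three steps (distances, ranking, majority vote) can be sketched in Python as follows. The helper names are illustrative, and the sketch does not handle tied votes or tied distances, which do not occur in this example.

```python
from collections import Counter

def manhattan(p, q):
    """Manhattan distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

def knn_classify(train, query, k):
    """Majority class among the k training points closest to query."""
    nearest = sorted(train, key=lambda item: manhattan(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
distances = [manhattan(point, (3, 7)) for point, _ in train]
label = knn_classify(train, (3, 7), k=3)
print(distances, label)
```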
Another Example (very important):
Consider the following table. It consists of the height, age and weight (target) values for 10 people.

Use the K-Nearest Neighbor (KNN) algorithm with the Manhattan distance as the similarity measure to predict the weight of the person with ID number 11, whose height is 5.5 and age is 38. (Use K = 5 as the number of neighbors to be considered.)
Solution:
ID   Height   Age   Distance                            Rank   Weight
1    5        45    |5 - 5.5| + |45 - 38| = 7.5         5      77
2    5.11     26    |5.11 - 5.5| + |26 - 38| = 12.39    8
3    5.6      30    |5.6 - 5.5| + |30 - 38| = 8.1       6
4    5.9      34    |5.9 - 5.5| + |34 - 38| = 4.4       3      59
5    4.8      40    |4.8 - 5.5| + |40 - 38| = 2.7       2      72
6    5.8      36    |5.8 - 5.5| + |36 - 38| = 2.3       1      60
7    5.3      19    |5.3 - 5.5| + |19 - 38| = 19.2      10
8    5.8      28    |5.8 - 5.5| + |28 - 38| = 10.3      7
9    5.5      23    |5.5 - 5.5| + |23 - 38| = 15        9
10   5.6      32    |5.6 - 5.5| + |32 - 38| = 6.1       4      58

ID 11 $= \frac{77 + 59 + 72 + 60 + 58}{5} = 65.2$ kg
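When the target is a number, KNN averages the targets of the nearest neighbors instead of voting. The sketch below repeats the calculation in Python. The solution table lists the weight only for the five nearest people, so the other weights are stored as `None` here (an assumption made for the sketch), and the age of ID 7 is taken from the Age column (19). The function name `knn_predict` is illustrative.

```python
# id: (height, age, weight); weights not shown in the solution table are None
people = {
    1: (5.0, 45, 77), 2: (5.11, 26, None), 3: (5.6, 30, None),
    4: (5.9, 34, 59), 5: (4.8, 40, 72), 6: (5.8, 36, 60),
    7: (5.3, 19, None), 8: (5.8, 28, None), 9: (5.5, 23, None),
    10: (5.6, 32, 58),
}

def knn_predict(people, height, age, k):
    """Average weight of the k people closest in Manhattan distance."""
    ranked = sorted(
        people.items(),
        key=lambda item: abs(item[1][0] - height) + abs(item[1][1] - age),
    )
    nearest = ranked[:k]
    weights = [weight for _, (_, _, weight) in nearest]
    if any(w is None for w in weights):
        raise ValueError("a selected neighbor has no recorded weight")
    return [pid for pid, _ in nearest], sum(weights) / k

nearest_ids, predicted_weight = knn_predict(people, height=5.5, age=38, k=5)
print(nearest_ids, predicted_weight)
```

Height and age are on very different scales here, so age dominates the distance; the notes use the raw values, and the sketch does the same.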
3. Example on Confusion Matrix:
Actual class \ Predicted class   cancer = yes   cancer = no   Total
cancer = yes                     90 (TP)        210 (FN)      300
cancer = no                      140 (FP)       9560 (TN)     9700
Total                            230            9770          10000

◼ Given m classes, an entry $CM_{i,j}$ in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j.
◼ Calculate accuracy, error rate, sensitivity, specificity and precision.
Solution:
How do we know which cell is TP, which is TN, and so on?
A cell in the cancer = yes row and the cancer = yes column is TP (true positive).
A cell in the cancer = yes row and the cancer = no column is FN (false negative).
A cell in the cancer = no row and the cancer = yes column is FP (false positive).
A cell in the cancer = no row and the cancer = no column is TN (true negative).
o Precision = TP/(TP+FP) = 90/230 = 39.13%
o Sensitivity (recall) = TP/P = 90/300 = 30.00%
o Specificity = TN/N = 9560/9700 = 98.56%
o Accuracy (recognition rate) = (TP+TN)/ALL = (90+9560)/10000 = 96.50%
o Error = 1 - Accuracy = 1 - 96.5% = 3.5%
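These five measures depend only on the four cells of the matrix, so they are easy to check in Python. A minimal sketch, with an illustrative function name:

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, error rate, sensitivity, specificity and precision."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # recall: TP / P
        "specificity": tn / (tn + fp),   # TN / N
        "precision": tp / (tp + fp),     # TP / P'
    }

metrics = binary_metrics(tp=90, fn=210, fp=140, tn=9560)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

The example also shows why accuracy alone can mislead: accuracy is 96.5% even though only 30% of the actual cancer cases are detected.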
i. This part asks for the proportion of correctly classified instances, so it is clearly the accuracy.
Accuracy (recognition rate) = (TP+TN)/ALL = $\frac{10 + 15 + 5}{50} = \frac{30}{50} = 60\%$
ii. This part asks for the number of instances in class 2, so we take the row of class 2 and add up its entries.
Class 2 instances = 5 + 15 + 3 = 23
iii. This part asks for the instances that were wrongly placed in class 3, that is, instances that are not actually in class 3. We take the column of class 3, leave out the cell that lies in both the row and the column of class 3, and add up the rest:
3 + 3 = 6
Actual / Predicted   Class A   Class B   Class C   Class D
Class A              2417      124       2         0
Class B              117       2279      102       0
Class C              1         131       2236      105
Class D              0         2         116       2368
i. Calculate the error rate of this classifier.
ii. How many class B instances are in the dataset?
iii. How many instances were incorrectly classified as class A?
iv. Find the precision of class B.
v. Find the recall of class D.
Solution:
I. Error $= \frac{FN + FP}{ALL} = \frac{124+2+0+117+102+0+1+131+105+0+2+116}{10000} = \frac{700}{10000} = 7\%$, where ALL = 10000 is the sum of all entries of the matrix.
OR
$1 - \text{accuracy} = 1 - \frac{2417+2279+2236+2368}{10000} = 1 - \frac{9300}{10000} = 7\%$
II. The number of instances in class B is 2498 (117+2279+102+0).
III. The number of instances incorrectly classified as class A is 118 (117+1+0).
IV. The precision of class B is: $\frac{TP}{TP + FP} = \frac{2279}{2279 + (124+131+2)} = \frac{2279}{2536} = 89.87\%$
V. The recall (sensitivity) of class D is: $\frac{TP}{TP + FN} = \frac{2368}{2368 + (0+2+116)} = \frac{2368}{2486} = 95.25\%$
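For a matrix with more than two classes, each answer is a sum over a row, a column, or the diagonal. The sketch below assumes rows are actual classes and columns are predicted classes, as in the table; the function names are illustrative.

```python
labels = ["A", "B", "C", "D"]
cm = [
    [2417, 124, 2, 0],      # actual A
    [117, 2279, 102, 0],    # actual B
    [1, 131, 2236, 105],    # actual C
    [0, 2, 116, 2368],      # actual D
]

def error_rate(cm):
    """1 - accuracy, where accuracy = diagonal sum / sum of all entries."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return 1 - correct / total

def class_total(cm, i):
    """Number of instances whose actual class is i (row sum)."""
    return sum(cm[i])

def wrongly_predicted_as(cm, j):
    """Instances predicted as class j that belong to another class."""
    return sum(cm[i][j] for i in range(len(cm)) if i != j)

def precision(cm, j):
    """TP / (TP + FP): diagonal cell over its column sum."""
    return cm[j][j] / sum(row[j] for row in cm)

def recall(cm, i):
    """TP / (TP + FN): diagonal cell over its row sum."""
    return cm[i][i] / sum(cm[i])

b, d = labels.index("B"), labels.index("D")
print(round(error_rate(cm) * 100, 2))
print(class_total(cm, b), wrongly_predicted_as(cm, labels.index("A")))
print(round(precision(cm, b) * 100, 2), round(recall(cm, d) * 100, 2))
```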
Lecture 6

a) Given the following distance matrix, use the DBSCAN algorithm to find the final clusters. Determine for each point whether it is a core, border, or noise point.
Use the following parameters as inputs for DBSCAN
(EPS=2, Minpts=2)

The first step is to find, for each point, the points that lie within EPS of it, and list them as its neighborhood:
N (A1) = {} noise
N (A2) = {} noise
N (A3) = {A5, A6} core
N (A4) = {A8} core
N (A5) = {A3, A6} core
N (A6) = {A3, A5} core
N (A7) = {} noise
N (A8) = {A4} core
Next we label each point as core, border, or noise. How do we tell them apart?
◼ A core point has a neighborhood whose size is greater than or equal to Minpts. Note that, for example, N (A4) = {A8} counts as 2, because the point itself is included: N (A4) = {A8, A4}.
◼ A border point does not satisfy the Minpts condition, but its neighborhood contains a core point, as we will see in the next example.
◼ A noise point does not satisfy the Minpts condition and has no core point in its neighborhood.
The last step is to group the points into clusters:

Cluster (1) = {A3, A5, A6}

Cluster (2) = {A4, A8}
Outliers: A1, A2, A7
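The labelling and clustering steps can be sketched in Python. Because the distance matrix is not reproduced in these notes, the sketch starts from the neighborhoods N(Ai) listed in the solution rather than from distances; the function name `dbscan_from_neighbors` is illustrative, and each point is counted as a member of its own neighborhood, as the solution describes.

```python
def dbscan_from_neighbors(neighbors, min_pts):
    """Label points and build clusters from precomputed EPS-neighborhoods."""
    # A point is core if its neighborhood, counting itself, has >= min_pts points.
    core = {p for p, nb in neighbors.items() if len(nb) + 1 >= min_pts}
    labels = {}
    for p, nb in neighbors.items():
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in nb):
            labels[p] = "border"
        else:
            labels[p] = "noise"

    clusters, assigned = [], set()
    for seed in sorted(core):
        if seed in assigned:
            continue
        cluster, stack = set(), [seed]
        while stack:
            q = stack.pop()
            if q in cluster or q in assigned:
                continue
            cluster.add(q)
            if q in core:  # only core points expand the cluster
                stack.extend(n for n in neighbors[q] if n not in cluster)
        assigned |= cluster
        clusters.append(sorted(cluster))
    return labels, clusters

neighbors = {
    "A1": [], "A2": [], "A3": ["A5", "A6"], "A4": ["A8"],
    "A5": ["A3", "A6"], "A6": ["A3", "A5"], "A7": [], "A8": ["A4"],
}
labels, clusters = dbscan_from_neighbors(neighbors, min_pts=2)
noise = sorted(p for p, kind in labels.items() if kind == "noise")
print(clusters, noise)
```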
b) Given the following distance matrix, use the DBSCAN algorithm to find
the final clusters. Determine for each point whether it is core, border, or
a noise point.
Use the following parameters as inputs for DBSCAN
(EPS=1.9, Minpts=4)

Solution:

With my best wishes
