A.K. Swain
IIM Kozhikode
ML: Quotes
If I had an hour to solve a problem I’d spend 55 minutes
thinking about the problem and 5 minutes thinking
about solutions.
Albert Einstein
ML Requirements: Modeling
Statistical Modelling:
• A formalization of the relationships between variables in the data, expressed as mathematical equations.
Machine Learning:
• Algorithms that learn from data without relying on rules-based programming.
Modeling
All models are wrong, but some are useful.
George Edward Pelham Box
Predictive Models
Tree-Based Methods:
Classification
Disease Diagnosis
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep Throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep Throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep Throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold
Decision tree for the disease diagnosis data:
Swollen Glands
  YES: Strep Throat
  NO: Fever
    NO: Allergy
    YES: Cold
Data Instances with an Unknown Classification
Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11          No           No     Yes             Yes         Yes       ?
12          Yes          Yes    No              No          Yes       ?
13          No           No     No              No          Yes       ?
[Figure: decision tree rooted at Swollen Glands, with NO/YES branches and the leaves Allergy and Cold, applied to the instances above]
Production Rules
[Figure: tree branch with NO leading to Allergy and YES leading to Cold]
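A decision tree can be read off as production rules, one rule per path from the root to a leaf. The sketch below writes the disease diagnosis tree as rules in Python. It assumes the tree that is consistent with the ten training rows (Swollen Glands at the root, Fever on the NO branch, Strep Throat on the YES branch); the function name `diagnose` and the `unknown` dictionary are illustrative, not part of the slides.

```python
def diagnose(swollen_glands: str, fever: str) -> str:
    """Apply the production rules; both arguments are "Yes" or "No"."""
    # Rule 1: IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
    if swollen_glands == "Yes":
        return "Strep Throat"
    # Rule 2: IF Swollen Glands = No AND Fever = Yes THEN Diagnosis = Cold
    if fever == "Yes":
        return "Cold"
    # Rule 3: IF Swollen Glands = No AND Fever = No THEN Diagnosis = Allergy
    return "Allergy"


# Unknown instances 11-13 from the table, stored as (fever, swollen glands).
unknown = {11: ("No", "Yes"), 12: ("Yes", "No"), 13: ("No", "No")}

for patient_id, (fever, swollen_glands) in unknown.items():
    print(patient_id, diagnose(swollen_glands, fever))
```

Under these assumed rules, patient 11 is classified as Strep Throat, patient 12 as Cold, and patient 13 as Allergy. Sore Throat, Congestion, and Headache never appear in a rule because the tree does not test them.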
Credit Card Promotion Database
Income Range  Magazine Promo  Watch Promo  Life Ins Promo  Credit Card Ins.  Gender  Age
40-50,000     Yes             No           No              No                Male    45
30-40,000     Yes             Yes          Yes             No                Female  40
40-50,000     No              No           No              No                Male    42
30-40,000     Yes             Yes          Yes             Yes               Male    43
50-60,000     Yes             No           Yes             No                Female  38
20-30,000     No              No           No              No                Female  55
30-40,000     Yes             No           Yes             Yes               Male    35
20-30,000     No              Yes          No              No                Male    27
30-40,000     Yes             No           No              No                Male    43
30-40,000     Yes             Yes          Yes             No                Female  41
40-50,000     No              Yes          Yes             No                Female  43
20-30,000     No              Yes          Yes             No                Male    29
50-60,000     Yes             Yes          Yes             No                Female  39
40-50,000     No              Yes          No              No                Male    55
20-30,000     No              No           Yes             Yes               Female  19
Credit Card Promotion Database: Income Range
[Figure: decision tree with root node Income Range and four branches: 20-30K, 30-40K, 40-50K, 50-60K]
Calculations:
Correct Classification = 11/15 = 0.733
Goodness of Score = 0.733/4 = 0.183
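The calculation above can be reproduced with a short sketch. It assumes that Life Ins Promo is the output attribute, that each branch predicts its majority class, and that the goodness score is the classification accuracy divided by the number of branches. The names `ROWS`, `COLUMNS`, and `goodness_score` are illustrative.

```python
from collections import Counter, defaultdict

COLUMNS = ["Income Range", "Magazine Promo", "Watch Promo", "Life Ins Promo",
           "Credit Card Ins.", "Gender", "Age"]

# Rows copied from the Credit Card Promotion Database table.
ROWS = [
    ("40-50,000", "Yes", "No", "No", "No", "Male", 45),
    ("30-40,000", "Yes", "Yes", "Yes", "No", "Female", 40),
    ("40-50,000", "No", "No", "No", "No", "Male", 42),
    ("30-40,000", "Yes", "Yes", "Yes", "Yes", "Male", 43),
    ("50-60,000", "Yes", "No", "Yes", "No", "Female", 38),
    ("20-30,000", "No", "No", "No", "No", "Female", 55),
    ("30-40,000", "Yes", "No", "Yes", "Yes", "Male", 35),
    ("20-30,000", "No", "Yes", "No", "No", "Male", 27),
    ("30-40,000", "Yes", "No", "No", "No", "Male", 43),
    ("30-40,000", "Yes", "Yes", "Yes", "No", "Female", 41),
    ("40-50,000", "No", "Yes", "Yes", "No", "Female", 43),
    ("20-30,000", "No", "Yes", "Yes", "No", "Male", 29),
    ("50-60,000", "Yes", "Yes", "Yes", "No", "Female", 39),
    ("40-50,000", "No", "Yes", "No", "No", "Male", 55),
    ("20-30,000", "No", "No", "Yes", "Yes", "Female", 19),
]


def goodness_score(rows, attribute, target="Life Ins Promo"):
    """Return (correct, accuracy, accuracy / number of branches)."""
    a, t = COLUMNS.index(attribute), COLUMNS.index(target)
    branches = defaultdict(Counter)
    for row in rows:
        branches[row[a]][row[t]] += 1  # class counts per branch value
    # Each branch predicts its majority class, so it gets that many right.
    correct = sum(max(counts.values()) for counts in branches.values())
    accuracy = correct / len(rows)
    return correct, accuracy, accuracy / len(branches)


correct, accuracy, score = goodness_score(ROWS, "Income Range")
print(correct, round(accuracy, 3), round(score, 3))  # 11 0.733 0.183
```

Dividing by the number of branches penalizes attributes that split the data into many small groups, which is why a four-way split with 11 correct can score lower than a two-way split with fewer correct.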
[Figure: decision tree with a two-way NO/YES split, consistent with Credit Card Ins. as the root node; both leaves are labeled YES]
Calculations:
Correct Classification = 9/15 = 0.6
Goodness of Score = 0.6/2 = 0.3
Age
[Figure: decision tree with root node Age; the branch <=43 leads to YES and the branch >43 leads to NO]
Calculations:
Correct Classification = 12/15 = 0.8
Goodness of Score = 0.8/2 = 0.4
[Figure: decision tree with root node Age and a Gender node below it. Leaf counts shown: 0 (Yes), 3 (No); 1 (Yes), 3 (No); 2 (Yes), 0 (No)]
Efficient Node Selection
Information Theory
Gini Index
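As a minimal sketch of the second criterion, the Gini index of a set of class labels is one minus the sum of the squared class proportions. The function name `gini_index` is illustrative, and the example labels use the class frequencies of the Risk Class column in the training data (A = 3, B = 3, C = 4).

```python
from collections import Counter


def gini_index(labels):
    """Gini index: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


risk_class = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
print(round(gini_index(risk_class), 2))  # 0.66
print(gini_index(["B", "B", "B"]))       # 0.0 for a pure node
```

A node whose samples all share one class has a Gini index of 0, so a split is preferred when it produces branches with a low weighted Gini index.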
Training Data
Owns Home  Married  Gender  Employed  Credit Ratings  Risk Class
Yes        Yes      Male    Yes       A               B
No         No       Female  Yes       A               A
Yes        Yes      Female  Yes       B               C
Yes        No       Male    No        B               B
No         Yes      Female  Yes       B               C
No         No       Female  Yes       B               A
No         No       Male    No        B               B
Yes        No       Female  Yes       A               A
No         Yes      Female  Yes       A               C
Yes        Yes      Female  Yes       A               C
Information Theory
No of Samples = 10, No of Classes = 3
Frequency of the classes: A = ?, B = ?, C = ?
A = 3, B = 3, C = 4
$$I = -\frac{3}{10}\log_2\frac{3}{10} - \frac{3}{10}\log_2\frac{3}{10} - \frac{4}{10}\log_2\frac{4}{10} = 1.57$$
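The information value of the Risk Class column can be checked with a few lines of Python. This is a sketch that assumes base-2 logarithms, which is what yields 1.57; the function name `entropy` is illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


# Risk Class column of the training data, in row order.
risk_class = ["B", "A", "C", "B", "C", "A", "B", "A", "C", "C"]
print(Counter(risk_class))            # class frequencies: A = 3, B = 3, C = 4
print(round(entropy(risk_class), 2))  # 1.57
```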
Gender
  Female: ?
  Male: Class B
Removing the Gender attribute and the Class B (Male) samples
Gender
  Female: Married
    Yes: Class C
    No: Class A
  Male: Class B
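The choice of Gender at the root and Married below it can be reproduced by picking, at each step, the attribute with the largest information gain (the entropy before the split minus the weighted entropy after it). The sketch below assumes this ID3-style selection rule; the names `information_gain` and `best_attribute` are illustrative.

```python
import math
from collections import Counter, defaultdict

COLUMNS = ["Owns Home", "Married", "Gender", "Employed", "Credit Ratings", "Risk Class"]

# Rows copied from the training data table; the last column is the class.
ROWS = [
    ("Yes", "Yes", "Male", "Yes", "A", "B"),
    ("No", "No", "Female", "Yes", "A", "A"),
    ("Yes", "Yes", "Female", "Yes", "B", "C"),
    ("Yes", "No", "Male", "No", "B", "B"),
    ("No", "Yes", "Female", "Yes", "B", "C"),
    ("No", "No", "Female", "Yes", "B", "A"),
    ("No", "No", "Male", "No", "B", "B"),
    ("Yes", "No", "Female", "Yes", "A", "A"),
    ("No", "Yes", "Female", "Yes", "A", "C"),
    ("Yes", "Yes", "Female", "Yes", "A", "C"),
]


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute):
    """Entropy of the class column minus the weighted entropy after the split."""
    a = COLUMNS.index(attribute)
    groups = defaultdict(list)
    for row in rows:
        groups[row[a]].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


def best_attribute(rows, candidates):
    return max(candidates, key=lambda name: information_gain(rows, name))


attributes = COLUMNS[:-1]
root = best_attribute(ROWS, attributes)
# Every Male row is Class B, so only the Female rows need a further split.
females = [row for row in ROWS if row[COLUMNS.index("Gender")] == "Female"]
next_node = best_attribute(females, [a for a in attributes if a != root])
print(root, next_node)  # Gender Married
```

On the Female rows, Married separates Class C (Yes) from Class A (No) completely, so no further node is needed.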
Unsupervised Learning:
Clustering
Distance Measure
Euclidean distance
Manhattan distance
Hamming distance
Maximum norm
Mahalanobis distance
Minkowski distance (higher-dimensional data)
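Most of the measures listed above fit in a few lines each. The sketch below uses plain Python sequences; all function names are illustrative. Mahalanobis distance is left out because it also needs the inverse covariance matrix of the data.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum_norm(x, y):
    # Also called Chebyshev distance: the largest coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p):
    # Generalizes Manhattan (p = 1) and Euclidean (p = 2).
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def hamming(x, y):
    # Number of positions at which two equal-length sequences differ.
    return sum(a != b for a, b in zip(x, y))


p, q = (0, 0), (3, 4)
print(euclidean(p, q), manhattan(p, q), maximum_norm(p, q))  # 5.0 7 4

# Symptom vectors of patients 1 and 2 from the disease diagnosis table.
patient_1 = ("Yes", "Yes", "Yes", "Yes", "Yes")
patient_2 = ("No", "No", "No", "Yes", "Yes")
print(hamming(patient_1, patient_2))  # 3
```

Hamming distance suits categorical attributes such as the Yes/No symptoms, while the other measures assume numeric coordinates.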
Clustering: Categories
Categories
• Exclusive Clustering
• Overlapping Clustering
• Hierarchical Clustering
• Probabilistic Clustering
The END