
Artificial Intelligence for Business

A.K. Swain

IIM Kozhikode
ML: Quotes
If I had an hour to solve a problem I’d spend 55 minutes
thinking about the problem and 5 minutes thinking
about solutions.
Albert Einstein

ML Requirements: Modeling
Statistical Modeling:
• A formalization of relationships between variables
in the data, expressed as mathematical equations.
Machine Learning:
• An algorithm that can learn from data without
relying on rules-based programming.

Modeling
All models are wrong, but some are useful.
George Edward Pelham Box
Predictive Models
Tree-Based Methods: Classification
Disease Diagnosis
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep Throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep Throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep Throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold

Swollen Glands
    NO: Fever
        NO: Allergy
        YES: Cold
    YES: Strep Throat

Data Instances with an Unknown Classification
Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11          No           No     Yes             Yes         Yes       ?
12          Yes          Yes    No              No          Yes       ?
13          No           No     No              No          Yes       ?
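
The tree can be applied to the three unlabeled patients with a few lines of code. The sketch below is illustrative only: the function name `diagnose` and the tuple layout are assumptions made for this example, not part of the slides. It tests Swollen Glands first and Fever second, as the tree does.

```python
def diagnose(swollen_glands: str, fever: str) -> str:
    """Follow the decision tree: Swollen Glands at the root, Fever below it."""
    if swollen_glands == "Yes":
        return "Strep Throat"
    if fever == "Yes":
        return "Cold"
    return "Allergy"


# Patient ID -> (Sore Throat, Fever, Swollen Glands, Congestion, Headache)
unknown = {
    11: ("No", "No", "Yes", "Yes", "Yes"),
    12: ("Yes", "Yes", "No", "No", "Yes"),
    13: ("No", "No", "No", "No", "Yes"),
}

# Only Fever (index 1) and Swollen Glands (index 2) are used by the tree.
predictions = {
    pid: diagnose(swollen_glands=row[2], fever=row[1])
    for pid, row in unknown.items()
}
print(predictions)
```

Under this tree, patient 11 is classified as Strep Throat, patient 12 as Cold, and patient 13 as Allergy. Sore Throat, Congestion, and Headache play no part in the decision.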
Production Rules

IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
Credit Card Promotion Database
Income Range  Magazine Promo  Watch Promo  Life Ins Promo  Credit Card Ins.  Gender  Age
40-50,000     Yes             No           No              No                Male    45
30-40,000     Yes             Yes          Yes             No                Female  40
40-50,000     No              No           No              No                Male    42
30-40,000     Yes             Yes          Yes             Yes               Male    43
50-60,000     Yes             No           Yes             No                Female  38
20-30,000     No              No           No              No                Female  55
30-40,000     Yes             No           Yes             Yes               Male    35
20-30,000     No              Yes          No              No                Male    27
30-40,000     Yes             No           No              No                Male    43
30-40,000     Yes             Yes          Yes             No                Female  41
40-50,000     No              Yes          Yes             No                Female  43
20-30,000     No              Yes          Yes             No                Male    29
50-60,000     Yes             Yes          Yes             No                Female  39
40-50,000     No              Yes          No              No                Male    55
20-30,000     No              No           Yes             Yes               Female  19
Attributes used to predict Life Ins Promo:
Income Range  Life Ins Promo  Credit Card Ins.  Gender  Age
40-50,000     No              No                Male    45
30-40,000     Yes             No                Female  40
40-50,000     No              No                Male    42
30-40,000     Yes             Yes               Male    43
50-60,000     Yes             No                Female  38
20-30,000     No              No                Female  55
30-40,000     Yes             Yes               Male    35
20-30,000     No              No                Male    27
30-40,000     No              No                Male    43
30-40,000     Yes             No                Female  41
40-50,000     Yes             No                Female  43
20-30,000     Yes             No                Male    29
50-60,000     Yes             No                Female  39
40-50,000     No              No                Male    55
20-30,000     Yes             Yes               Female  19

Income Range (counts of Life Ins Promo in each branch)
    20-30K: 2 (Yes), 2 (No)
    30-40K: 4 (Yes), 1 (No)
    40-50K: 1 (Yes), 3 (No)
    50-60K: 2 (Yes), 0 (No)

Income Range (each branch labeled with its majority class)
    20-30K: Yes
    30-40K: Yes
    40-50K: No
    50-60K: Yes

Calculations:
Correct Classification = (2 + 4 + 3 + 2)/15 = 11/15 = 0.733
Goodness Score = Correct Classification / No. of Branches = 0.733/4 = 0.183

Credit Card Insurance (counts of Life Ins Promo in each branch)
    NO: 6 (Yes), 6 (No)
    YES: 3 (Yes), 0 (No)

Credit Card Insurance (each branch labeled with its majority class)
    NO: Yes
    YES: Yes

Calculations:
Correct Classification = (6 + 3)/15 = 9/15 = 0.6
Goodness Score = 0.6/2 = 0.3

Age (counts of Life Ins Promo in each branch)
    <=43: 9 (Yes), 3 (No)
    >43: 0 (Yes), 3 (No)

Age (each branch labeled with its majority class)
    <=43: Yes
    >43: No

Calculations:
Correct Classification = (9 + 3)/15 = 12/15 = 0.8
Goodness Score = 0.8/2 = 0.4

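
The three candidate top-level nodes can be compared by recomputing their scores from the table. The following is a minimal sketch, not the author's code: `score_split` and the tuple layout are names assumed for this example, and the goodness score is taken to be classification accuracy divided by the number of branches, as in the calculations on the slides. Each branch predicts its majority class, so a tied branch contributes the same number of correct rows whichever label it gets.

```python
from collections import Counter, defaultdict

# (Income Range, Life Ins Promo, Credit Card Ins., Gender, Age)
ROWS = [
    ("40-50K", "No", "No", "Male", 45),
    ("30-40K", "Yes", "No", "Female", 40),
    ("40-50K", "No", "No", "Male", 42),
    ("30-40K", "Yes", "Yes", "Male", 43),
    ("50-60K", "Yes", "No", "Female", 38),
    ("20-30K", "No", "No", "Female", 55),
    ("30-40K", "Yes", "Yes", "Male", 35),
    ("20-30K", "No", "No", "Male", 27),
    ("30-40K", "No", "No", "Male", 43),
    ("30-40K", "Yes", "No", "Female", 41),
    ("40-50K", "Yes", "No", "Female", 43),
    ("20-30K", "Yes", "No", "Male", 29),
    ("50-60K", "Yes", "No", "Female", 39),
    ("40-50K", "No", "No", "Male", 55),
    ("20-30K", "Yes", "Yes", "Female", 19),
]


def score_split(rows, branch_of):
    """Return (correct, branches, accuracy, goodness score) for one split."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[branch_of(row)][row[1]] += 1  # row[1] is Life Ins Promo
    # Each branch predicts its majority class.
    correct = sum(max(c.values()) for c in counts.values())
    accuracy = correct / len(rows)
    return correct, len(counts), accuracy, accuracy / len(counts)


candidates = {
    "Income Range": lambda r: r[0],
    "Credit Card Ins.": lambda r: r[2],
    "Age (split at 43)": lambda r: "<=43" if r[4] <= 43 else ">43",
}
results = {name: score_split(ROWS, fn) for name, fn in candidates.items()}
for name, (correct, branches, accuracy, score) in results.items():
    print(f"{name}: {correct}/{len(ROWS)} correct, "
          f"{branches} branches, goodness score {score:.3f}")
```

Counted directly from the table, Income Range classifies 11 of 15 rows correctly over 4 branches, Credit Card Ins. 9 of 15 over 2 branches, and Age 12 of 15 over 2 branches, so Age has the highest goodness score of the three.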
Age (three-level tree, counts of Life Ins Promo at each leaf)
    <=43: Gender
        Female: 6 (Yes), 0 (No)
        Male: Credit Card Insurance
            NO: 1 (Yes), 3 (No)
            YES: 2 (Yes), 0 (No)
    >43: 0 (Yes), 3 (No)
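
Read as a classifier, the three-level tree tests Age, then Gender, then Credit Card Insurance. The sketch below assumes each leaf predicts its majority class. The function name `predict_life_ins_promo` is hypothetical, and the four sample rows are taken from the table above.

```python
def predict_life_ins_promo(age: int, gender: str, credit_card_ins: str) -> str:
    """Walk the three-level tree; each leaf returns its majority class."""
    if age > 43:
        return "No"   # leaf: 0 (Yes), 3 (No)
    if gender == "Female":
        return "Yes"  # leaf: 6 (Yes), 0 (No)
    # Male and age <= 43: decide on credit card insurance.
    return "Yes" if credit_card_ins == "Yes" else "No"


# (Age, Gender, Credit Card Ins., actual Life Ins Promo)
samples = [
    (45, "Male", "No", "No"),
    (40, "Female", "No", "Yes"),
    (43, "Male", "Yes", "Yes"),
    (29, "Male", "No", "Yes"),  # falls in the 1 (Yes), 3 (No) leaf
]
for age, gender, cc_ins, actual in samples:
    predicted = predict_life_ins_promo(age, gender, cc_ins)
    print(f"{age} {gender} CC={cc_ins}: predicted {predicted}, actual {actual}")
```

The last sample is the single training row this tree misclassifies, which follows from the leaf counts: only the Male / Credit Card Insurance = NO leaf is mixed.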
Efficient Node Selection
Information Theory
Gini Index
Training Data
Owns Home  Married  Gender  Employed  Credit Ratings  Risk Class
Yes        Yes      Male    Yes       A               B
No         No       Female  Yes       A               A
Yes        Yes      Female  Yes       B               C
Yes        No       Male    No        B               B
No         Yes      Female  Yes       B               C
No         No       Female  Yes       B               A
No         No       Male    No        B               B
Yes        No       Female  Yes       A               A
No         Yes      Female  Yes       A               C
Yes        Yes      Female  Yes       A               C
Information Theory
No. of Samples = 10, No. of Classes = 3
Frequency of the classes: A = 3, B = 3, C = 4
I = -(3/10)log2(3/10) - (3/10)log2(3/10) - (4/10)log2(4/10) = 1.57
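
The information value above can be checked numerically. This is a small sketch assuming base-2 logarithms, which is the base that reproduces 1.57. The helper name `information` is chosen for this example only.

```python
import math


def information(class_counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(class_counts)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in class_counts
        if count > 0  # an empty class contributes nothing
    )


# Risk Class frequencies in the training data: A = 3, B = 3, C = 4
print(round(information([3, 3, 4]), 2))
```

A node whose samples all belong to one class has information 0, which is why a pure branch needs no further splitting.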

Attribute OWNS HOME

Value = Yes: No. of Samples = 5, No. of Classes = 3
Frequency of the classes: A = 1, B = 2, C = 2
I(Yes) = -(1/5)log2(1/5) - (2/5)log2(2/5) - (2/5)log2(2/5) = 1.52

Value = No: No. of Samples = 5, No. of Classes = 3
Frequency of the classes: A = 2, B = 1, C = 2
I(No) = -(2/5)log2(2/5) - (1/5)log2(1/5) - (2/5)log2(2/5) = 1.52

Total info of subtree = 0.5 I(Yes) + 0.5 I(No) = 1.52


Information Gain

Potential Split Attribute  Information before split  Information after split  Information gain
Owns Home                  1.57                      1.52                     0.05
Married                    1.57                      0.85                     0.72
Gender                     1.57                      0.69                     0.88
Employed                   1.57                      1.12                     0.45
Credit Rating              1.57                      1.52                     0.05
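
The gain for every attribute can be reproduced from the training data. The sketch below is one way to do it, with base-2 logarithms assumed. The information after a split is the sample-weighted average of the information in each branch, and the gain is the drop from the value before the split. The names `entropy` and `info_after_split` are chosen for this example.

```python
import math
from collections import Counter

ATTRS = ["Owns Home", "Married", "Gender", "Employed", "Credit Rating"]
# One tuple per training row; the last field is Risk Class.
ROWS = [
    ("Yes", "Yes", "Male", "Yes", "A", "B"),
    ("No", "No", "Female", "Yes", "A", "A"),
    ("Yes", "Yes", "Female", "Yes", "B", "C"),
    ("Yes", "No", "Male", "No", "B", "B"),
    ("No", "Yes", "Female", "Yes", "B", "C"),
    ("No", "No", "Female", "Yes", "B", "A"),
    ("No", "No", "Male", "No", "B", "B"),
    ("Yes", "No", "Female", "Yes", "A", "A"),
    ("No", "Yes", "Female", "Yes", "A", "C"),
    ("Yes", "Yes", "Female", "Yes", "A", "C"),
]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_after_split(rows, index):
    """Sample-weighted entropy of Risk Class after splitting on one column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


before = entropy([row[-1] for row in ROWS])
gains = {
    attr: before - info_after_split(ROWS, i) for i, attr in enumerate(ATTRS)
}
for attr, gain in gains.items():
    print(f"{attr}: gain {gain:.2f}")
print("Best split:", max(gains, key=gains.get))
```

Gender has the largest gain, so it is the attribute chosen for the first split.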
1st Iteration

Gender
    Female: ?
    Male: Class B
Remaining data after removing the Gender = Male rows (all Class B)

Owns Home  Married  Gender  Employed  Credit Ratings  Risk Class
No         No       Female  Yes       A               A
Yes        Yes      Female  Yes       B               C
No         Yes      Female  Yes       B               C
No         No       Female  Yes       B               A
Yes        No       Female  Yes       A               A
No         Yes      Female  Yes       A               C
Yes        Yes      Female  Yes       A               C
Decision Tree

Gender
    Female: Married
        Yes: Class C
        No: Class A
    Male: Class B
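
The two iterations can be written as a single recursive procedure: pick the attribute with the highest information gain, split on it, and repeat on every branch that still mixes classes. The sketch below follows the style of ID3. The slides do not name the algorithm, so that label and the helper names (`entropy`, `gain`, `build_tree`) are assumptions.

```python
import math
from collections import Counter

ATTRS = ["Owns Home", "Married", "Gender", "Employed", "Credit Rating"]
RAW = [
    ("Yes", "Yes", "Male", "Yes", "A", "B"),
    ("No", "No", "Female", "Yes", "A", "A"),
    ("Yes", "Yes", "Female", "Yes", "B", "C"),
    ("Yes", "No", "Male", "No", "B", "B"),
    ("No", "Yes", "Female", "Yes", "B", "C"),
    ("No", "No", "Female", "Yes", "B", "A"),
    ("No", "No", "Male", "No", "B", "B"),
    ("Yes", "No", "Female", "Yes", "A", "A"),
    ("No", "Yes", "Female", "Yes", "A", "C"),
    ("Yes", "Yes", "Female", "Yes", "A", "C"),
]
TARGET = "Risk Class"
ROWS = [dict(zip(ATTRS + [TARGET], raw)) for raw in RAW]


def entropy(rows):
    """Entropy in bits of the Risk Class labels in rows."""
    counts = Counter(row[TARGET] for row in rows)
    return -sum((c / len(rows)) * math.log2(c / len(rows))
                for c in counts.values())


def gain(rows, attr):
    """Information gain from splitting rows on attr."""
    after = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after


def build_tree(rows, attrs):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [row[TARGET] for row in rows]
    if len(set(labels)) == 1:   # pure node: stop
        return labels[0]
    if not attrs:               # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    remaining = [a for a in attrs if a != best]
    return {best: {
        value: build_tree([r for r in rows if r[best] == value], remaining)
        for value in sorted({row[best] for row in rows})
    }}


tree = build_tree(ROWS, ATTRS)
print(tree)
```

On this training data the procedure selects Gender at the root and Married on the Female branch, giving the same structure as the tree on the slide.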
Unsupervised Learning: Clustering
Distance Measure

• Euclidean distance
• Manhattan distance
• Hamming distance
• Maximum norm
• Mahalanobis distance
• Minkowski distance (higher-dimensional data)
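
Most of the measures listed above fit in one line each. The sketch below uses the standard textbook definitions in plain Python. The function names are chosen for this example. Mahalanobis distance is left out because it also needs the covariance matrix of the data.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (p = 1: Manhattan, p = 2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def euclidean(x, y):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum_norm(x, y):
    """Largest absolute coordinate difference (Chebyshev distance)."""
    return max(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))


p, q = (0, 0), (3, 4)
print(euclidean(p, q), manhattan(p, q), maximum_norm(p, q))
print(hamming(("Yes", "No", "Yes"), ("Yes", "Yes", "No")))
```

Hamming distance suits categorical records such as the Yes/No tables earlier in the deck, while the other measures assume numeric attributes.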
Clustering: Categories

• Exclusive Clustering
• Overlapping Clustering
• Hierarchical Clustering
• Probabilistic Clustering
The END
