(Figure: types of analytics questions, "Why is it happening?", "What is likely to happen?", "What should I do?", plotted by complexity against value added to the company.)
• Statistics is a branch of mathematics dealing with data collection, organization, analysis, and interpretation
• It is used to find trends and changes in data
• Analysts read the data through statistical measures to arrive at a conclusion
https://www.lynda.com/Excel-tutorials/Excel-Statistics-Essential-Training-1/5026557-2.html
https://www.springboard.com/blog/data-cleaning/
https://www.tehrantimes.com/news/438777/Iran-develops-first-integrated-health-data-visualization-system
Traditional programming: Input (Data) + Program → Output (Data)
Machine learning: Input (Data) + Output (Data) → Program
• Helps in the planning of operations and the setting up of standards
• Mode: selects the most common value

Data item:  1  2  3  4  5  6  7  8  9
Frequency:  1  1  3  2  1  1  1  1  1

Mean: x̄ = (1 + 2 + 3 + 3 + 3 + 4 + 4 + 5 + 6 + 7 + 8 + 9) / 12 = 55/12 ≈ 4.583
Mode: x = 3
Person:           P1  P2  P3  P4  P5  P6  P7
Income (Million):  1   1   1   2   2   3  11

Mean = 3: every person could make 3M.
Median = 2: the poorer half of the population makes 2M or less.

The mean is sensitive to outliers. With 99 values of −1 and one value of 1,000,000:
• Median = −1
• Mean = ((−1) + (−1) + ⋯ + (−1) + 1000000)/100 = some positive number
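A quick numeric check of both examples, a minimal sketch using Python's statistics module (the data values come from the tables above):

```python
import statistics

incomes = [1, 1, 1, 2, 2, 3, 11]       # P1..P7, in millions
print(statistics.mean(incomes))         # 3   -> "every person could make 3M"
print(statistics.median(incomes))       # 2   -> poorer half makes 2M or less

skewed = [-1] * 99 + [1_000_000]        # 99 values of -1 and one huge outlier
print(statistics.median(skewed))        # -1
print(statistics.mean(skewed))          # 9999.01, "some positive number"
```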
https://www.selecthub.com/business-intelligence/statistical-software/
A model relates variables through parameters:
y = f(x1, x2, x3)
For example, in the model y = mx + c, x and y are the variables, while m and c are the parameters.
Regression Analysis
y = β0 + β1 x1 + β2 x2 + ⋯ + βk xk
If ∂y/∂(parameters) is independent of the parameters, then the model is linear.
y* = β0* + β1 x*

y_i = β0 + β1 x_i; i = 1, 2, ⋯, n

This model does not represent the true phenomenon exactly, so a random error term is added:
y_i = β0 + β1 x_i + ε_i; i = 1, 2, ⋯, n
where ε_i is the random error, the vertical distance between the observed point (x_i, y_i) and the line y = β0 + β1 x.
(Figure: the data points (x_1, y_1), (x_2, y_2), …, (x_n, y_n) scattered around the line, with errors ε_1, ε_2, …, ε_n.)
y_i = β0 + β1 x_i + ε_i

Least squares: choose β0 and β1 to minimize the sum of squared errors
SSE = Σ_{i=1}^n (y_i − β0 − β1 x_i)²
(Figure: SSE as a function of β0 and of β1, minimized at β0* and β1*.)

Setting ∂SSE/∂β0 = 0:
−2 Σ_{i=1}^n (y_i − β0 − β1 x_i) = 0
Σ y_i − nβ0 − β1 Σ x_i = 0
Dividing by n, with ȳ = (1/n) Σ y_i and x̄ = (1/n) Σ x_i:
ȳ − β0 − β1 x̄ = 0, so β0 = ȳ − β1 x̄

Setting ∂SSE/∂β1 = 0:
−2 Σ_{i=1}^n x_i (y_i − β0 − β1 x_i) = 0
Σ x_i y_i − β0 Σ x_i − β1 Σ x_i² = 0
Substituting β0 = ȳ − β1 x̄ and rearranging:
Σ x_i y_i − ȳ × n × (1/n) Σ x_i = β1 ( −x̄ × n × (1/n) Σ x_i + Σ x_i² )

β1 = ( Σ_{i=1}^n x_i y_i − n x̄ ȳ ) / ( Σ_{i=1}^n x_i² − n x̄² )
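A minimal sketch of these closed-form estimates in Python with NumPy (the helper name is illustrative; the example reuses the data from practice question 6 later in this document):

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (beta0, beta1) minimizing the sum of squared errors."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x**2) - n * x_bar**2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Example with the data from practice question 6:
x = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 12])
y = np.array([7, 9, 10, 13, 15, 18, 19, 24, 25, 29])
print(fit_simple_ols(x, y))  # approximately (1.71, 2.30)
```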
These slides use content from the Machine Learning course on Coursera:
https://www.coursera.org/learn/machine-learning/home/
(Figure: Housing prices (Trichy, TN), price plotted against size (feet²).)

Training set → Learning Algorithm → hypothesis h
h maps the size of a house to an estimated price.
β's: the parameters. How to choose the β's?
(Figure: three example hypothesis lines for different parameter choices.)

Parameters: β0, β1
Cost Function: J(β0, β1) = (1/2m) Σ_{i=1}^m (h_β(x_i) − y_i)²
Goal: minimize J(β0, β1) over β0, β1
(Figure: for several values of β1, the hypothesis h_β(x) plotted against x (left) and the corresponding cost J(β1) plotted against β1 (right).)
(Figure: housing data, price (₹, in 100,000s) against size in feet² (x).)
Want: minimize J(β0, β1) over β0, β1.

Outline:
• Start with some β0, β1
• Keep changing β0, β1 to reduce J(β0, β1), until we hopefully end up at a minimum

Gradient descent, repeat until convergence:
βj := βj − α ∂J(β0, β1)/∂βj (for j = 0 and j = 1)
Update β0 and β1 simultaneously: compute both new values from the current β0, β1, then assign.

As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease α over time.
Test of Significance
Example: fitting a quadratic model by least squares.

x:  30    35    40    50    60    65    70    75    80    90
y:  1.65  1.55  1.48  1.40  1.30  1.26  1.24  1.21  1.20  1.18

(Figure: y plotted against x, with the fitted curve.)

y is the vector of responses and X has rows [1, x_i, x_i²]; β = [β0, β1, β2]' solves the normal equations
X'Xβ = X'y

ŷ = 2.19826629 − 0.02252236x + 0.00012507x²
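A sketch of solving these normal equations with NumPy, using the slide's data (np.linalg.solve is used rather than an explicit matrix inverse):

```python
import numpy as np

x = np.array([30, 35, 40, 50, 60, 65, 70, 75, 80, 90], dtype=float)
y = np.array([1.65, 1.55, 1.48, 1.40, 1.30, 1.26, 1.24, 1.21, 1.20, 1.18])

X = np.column_stack([np.ones_like(x), x, x**2])   # rows [1, x_i, x_i^2]
beta = np.linalg.solve(X.T @ X, X.T @ y)          # X'X beta = X'y
print(beta)  # approximately [2.198, -0.0225, 0.000125], matching the slide
```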
Indicator (dummy) variables: y = β0 + β1 x1 + β2 x2 + ε, where x2 ∈ {0, 1}.
If x2 = 0: y = β0 + β1 x1 + ε
If x2 = 1: y = β0 + β1 x1 + β2·1 + ε = (β0 + β2) + β1 x1 + ε
so β2 shifts the intercept between the two groups.
𝑅𝑗2 is the coefficient of multiple determination resulting from regressing xj on the other k-1
regressor variables
(Figure: malignant? ((No) 0 vs (Yes) 1) plotted against tumor size.)

h_β(x) = g(β^T x), where g is the sigmoid (logistic) function
g(z) = 1 / (1 + e^(−z))
h_β(x) = 1 / (1 + e^(−β^T x))

Example: if x = [x0; x1] = [1; tumorSize] and h_β(x) = 0.7, the model estimates a 70% probability that the tumor is malignant.
Training set with m examples, h_β(x) = 1 / (1 + e^(−β^T x)).
If we reuse the squared-error cost
Cost(h_β(x^(i)), y^(i)) = ½ (h_β(x^(i)) − y^(i))²
then J(β) is non-convex, with many local optima; we need a cost that makes J(β) convex.
(Figure: a non-convex J(β) against β versus a convex J(β).)
Cost(h_β(x), y) = −log(h_β(x)) if y = 1; −log(1 − h_β(x)) if y = 0

If y = 1 (plot of −log(z) for z = h_β(x) ∈ [0, 1]):
Cost = 0 if y = 1 and h_β(x) = 1, but as h_β(x) → 0, Cost → ∞.
This captures the intuition that if h_β(x) = 0 (predict P(y = 1 | x; β) = 0) but actually y = 1, we penalize the learning algorithm by a very large cost.

If y = 0 (plot of −log(1 − h_β(x))): symmetrically, Cost = 0 at h_β(x) = 0 and Cost → ∞ as h_β(x) → 1.

Overall cost over the training set:
J(β) = −(1/m) Σ_{i=1}^m [ y^(i) log h_β(x^(i)) + (1 − y^(i)) log(1 − h_β(x^(i))) ]
Want min_β J(β). Gradient descent, repeat:
βj := βj − α ∂J(β)/∂βj

Working out the derivative, repeat:
βj := βj − α Σ_{i=1}^m (h_β(x^(i)) − y^(i)) x_j^(i)
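A minimal sketch of this update in Python (X is assumed to be an m × n design matrix whose first column is all ones, y a 0/1 label vector; the 1/m factor is folded into the step here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, iters=1000):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ beta)                       # h_beta(x^(i)) for all i
        beta -= (alpha / len(y)) * (X.T @ (h - y))  # sum_i (h - y^(i)) x_j^(i)
    return beta
```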
Multiclass classification
(Figure: binary versus multiclass data in the (x1, x2) plane; one-vs-all reduces the multiclass problem to one binary boundary per class.)

Class 1, Class 2, Class 3:
h_β^(i)(x) = P(y = i | x; β)

One-vs-all: train a logistic regression classifier h_β^(i)(x) for each class i to predict the probability that y = i.
https://imjitendra.wordpress.com/
https://www.linkedin.com/in/dr-jitendra/
CAMI16: Data Analytics (Practice Questions)
2. A company is engaged in the packaging of superior quality tea in jars of 500 gm each.
The company is of the view that as long as jars contain 500 gm of tea, the process is in
control. The standard deviation is 50 gm. A sample of 225 jars is taken at random and
the sample average is found to be 510 gm. Has the process gone out of control?
3. A company manufacturing light bulbs is using two different processes, A and B. The life of light bulbs from process A has a normal distribution with mean µ1 and standard deviation σ1; similarly, for process B, it is µ2 and σ2. The data pertaining to the two processes are as follows:
Sample A Sample B
n1 = 16 n2 = 21
x̄1 = 1200hr x̄2 = 1300hr
σ1 = 60hr σ2 = 50hr
Verify that the variability of the two processes is the same. (Hint: use the F-statistic.)
4. Examine the claim of a battery producer that the batteries will last for 100 days, given that a sample study of 200 batteries showed a mean life of 90 days with a standard deviation of 15 days. Assume a normal distribution and test at the 5% level of significance.
5. A company has appointed four salesmen, S_A, S_B, S_C, and S_D, and observed their sales in three seasons: summer, winter, and monsoon. The figures (in Rs lakh) are given in the following table:
Using 5% level of significance, perform an analysis of variance on the above data and
interpret the results.
6. Find the regression equation using least-squares estimation for the data below:
X 2 3 4 5 6 7 8 9 10 12
Y 7 9 10 13 15 18 19 24 25 29
Principal Component Analysis
(PCA)
CAMI16: Data Analytics
(Figure: data compression, reducing 2D measurements (cm) to 1D.)
Country   | GDP (trillions of US$) | Per-capita GDP (thousands of intl. $) | Human Development Index | Life expectancy | Poverty Index (Gini as percentage) | Mean household income (thousands of US$) | …
Canada    | 1.577  | 39.17 | 0.908 | 80.7 | 32.6 | 67.293 | …
China     | 5.878  | 7.54  | 0.687 | 73   | 46.9 | 10.22  | …
India     | 1.632  | 3.41  | 0.547 | 64.7 | 36.8 | 0.735  | …
Russia    | 1.48   | 19.84 | 0.755 | 65.5 | 39.9 | 0.72   | …
Singapore | 0.223  | 56.69 | 0.866 | 80   | 42.5 | 67.1   | …
USA       | 14.527 | 46.86 | 0.91  | 78.3 | 40.8 | 84.3   | …
[resources from en.wikipedia.org]
Reduced to two features per country (call them z1 and z2):

Country   | z1  | z2
Canada    | 1.6 | 1.2
China     | 1.7 | 0.3
India     | 1.6 | 0.2
Russia    | 1.4 | 0.5
Singapore | 0.5 | 1.7
USA       | 2   | 1.5
…         | …   | …
Training set: x^(1), x^(2), …, x^(m)
Preprocessing (feature scaling/mean normalization): compute μ_j = (1/m) Σ_i x_j^(i) and replace each x_j^(i) with x_j^(i) − μ_j; scale features to comparable ranges if needed.

Compute the covariance matrix:
Sigma = (1/m) Σ_{i=1}^m x^(i) (x^(i))'

[U,S,V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce'*x;

Reconstruction from the compressed representation: x_approx = Ureduce · z ≈ x.

Choosing k: check if the retained variance
( Σ_{i=1}^k S_ii ) / ( Σ_{i=1}^n S_ii ) ≥ 0.99
so that at most 1% of the variance is lost.

Extract inputs: from the unlabeled dataset x^(1), …, x^(m) ∈ ℝ^n, obtain z^(1), …, z^(m) ∈ ℝ^k.

Uses:
- Compression
  - Reduce memory/disk needed to store data
  - Speed up learning algorithm
- Visualization
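A NumPy sketch of the same steps (the slide's snippets use Octave's svd; names here are illustrative and X is a placeholder m × n data matrix):

```python
import numpy as np

def pca(X, k):
    X = X - X.mean(axis=0)            # mean normalization
    Sigma = (X.T @ X) / len(X)        # covariance matrix, n x n
    U, S, Vt = np.linalg.svd(Sigma)
    U_reduce = U[:, :k]               # Ureduce = U(:, 1:k)
    Z = X @ U_reduce                  # z = Ureduce' * x for every example
    X_approx = Z @ U_reduce.T         # reconstruction x_approx
    retained = S[:k].sum() / S.sum()  # variance retained; want >= 0.99
    return Z, X_approx, retained
```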
Bayes' theorem: P(A|B) = P(A) P(B|A) / P(B)
P(A) = m/n (favorable outcomes over total outcomes)
For mutually exclusive events: P(A ∪ B) = P(A) + P(B)
In general: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Company |    |    |    | Total
TCS     | 22 | 28 | 18 | 68
L&T     | 34 | 25 | 30 | 89
IBM     | 19 | 32 | 21 | 72
Total   | 75 | 85 | 69 | 229
Conditional probability: P(B|A) = P(A ∩ B) / P(A)
Since P(A ∩ B) = P(B|A) P(A) = P(A|B) P(B), it follows that
P(A|B) = P(B|A) P(A) / P(B)

Example:
P(Fire|Smoke) = P(Fire) P(Smoke|Fire) / P(Smoke) = 9%?
Example
• You are planning a picnic today
• but the morning is cloudy
• Oh no! 50% of all rainy days start off cloudy!
• But cloudy mornings are common (about 40% of days
start cloudy)
• And this is usually a dry month (only 3 of 30 days tend
to be rainy, or 10%)
P(Rain|Cloud) = P(Rain) P(Cloud|Rain) / P(Cloud)
P(Rain|Cloud) = (0.1 × 0.5) / 0.4 = 0.125
A 12.5% chance of rain. Not too bad; you may have your picnic.
      | Blue | Not blue | Total
Man   | 5    | 35       | 40
Woman | 20   | 40       | 60
Total | 25   | 75       | 100

P(Man) = 40/100 = 0.4
P(Blue) = 25/100 = 0.25
P(Blue|Man) = 5/40 = 0.125

P(Man|Blue) = ?
P(Man|Blue) = P(Man) P(Blue|Man) / P(Blue) = (0.4 × 0.125) / 0.25 = 0.2
Example: predicting Play from weather data.

Outlook  | Yes | No | P(Yes) | P(No)
Sunny    | 3   | 2  | 3/9    | 2/5
Overcast | 4   | 0  | 4/9    | 0/5
Rainy    | 2   | 3  | 2/9    | 3/5
Total    | 9   | 5  | 100%   | 100%

Temp  | Yes | No | P(Yes) | P(No)
Hot   | 2   | 2  | 2/9    | 2/5
Mild  | 4   | 2  | 4/9    | 2/5
Cold  | 3   | 1  | 3/9    | 1/5
Total | 9   | 5  | 100%   | 100%

Play  | Count | P(Yes) or P(No)
Yes   | 9     | 9/14
No    | 5     | 5/14
Total | 14    | 100%
P(Yes|Today) ∝ P(Today's outlook | Yes) × P(Today's temp | Yes) × P(Yes)
P(No|Today) ∝ P(Today's outlook | No) × P(Today's temp | No) × P(No)
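A small Python sketch of this computation; the query day (Outlook = Sunny, Temp = Mild) is an illustrative assumption, not from the slide:

```python
p_yes, p_no = 9 / 14, 5 / 14
p_sunny_yes, p_sunny_no = 3 / 9, 2 / 5         # from the Outlook table
p_mild_yes, p_mild_no = 4 / 9, 2 / 5           # from the Temp table

score_yes = p_sunny_yes * p_mild_yes * p_yes   # proportional to P(Yes | Today)
score_no = p_sunny_no * p_mild_no * p_no       # proportional to P(No | Today)
print("Play" if score_yes > score_no else "Don't play")  # -> Play
```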
Machine learning: a machine that learns from data and learns from mistakes.

Definition: "changes in [a] system that ... enable [it] to do the same task or tasks drawn from the same population more efficiently and more effectively the next time." (Simon, 1983)
Traditional computing: Data + Program (the rules) → Output
Machine learning: Data + Output → Program (the learned rules)
Types of Machine Learning
Regression vs Classification
Example (regression): housing prices (Trichy, TN).

Size (feet²) | Price (₹, in 100,000s)
1416         | 232
1534         | 315
…            | …

(Figure: price (₹, in 100,000s) plotted against size (feet²).)
Training set → Learning Algorithm → Model (h)
h maps the size of a house to an estimated price.

h_β(x) = β0 + β1 x

Identify β0 and β1 so that h_β(x) is close to y.
(Figure: the housing data with a candidate line, price (₹, in 100,000s) against size (feet²).)
(Figure: three example lines: β0 = 1.5, β1 = 0; β0 = 0, β1 = 0.5; β0 = 1, β1 = 0.5.)
How to define closeness?

h_β(x_i) = β0 + β1 x_i; i = 1, 2, ⋯, m
ε_i = h_β(x_i) − y_i; i = 1, 2, ⋯, m
(Figure: each ε_i is the vertical distance between the line y = β0 + β1 x and the point (x_i, y_i).)

How to compute the total error?
a) Σ_{i=1}^m ε_i
b) Σ_{i=1}^m ε_i²
Option (a) lets positive and negative errors cancel, so option (b), the sum of squared errors, is used.

Cost function: J(β0, β1) = (1/2m) Σ_{i=1}^m (h_β(x_i) − y_i)²
Goal: min over β0, β1 of J(β0, β1)
(Figure: left, h_β(x) for fixed β1, a function of x, drawn for β1 = 1, β1 = 0.5, and β1 = 0; right, the cost J(β1) as a function of the parameter β1.)
Repeat {
  β1 := β1 − α (1/m) Σ_{i=1}^m (h_β(x_i) − y_i) x_i
}

Outline:
• Start with some β0, β1
• Keep changing β0, β1 to reduce J(β0, β1), until we hopefully end up at a minimum
(Figure: successive gradient-descent steps in the (β0, β1) plane, moving down the contours of J(β0, β1) toward a minimum.)
Classification example: malignant? ((No) 0 vs (Yes) 1) against tumor size.
Goal: 0 ≤ h_β(x_i) ≤ 1, so that an output such as h_β(x_i) = 0.7 can be read as a probability.
(Figure: two-class and multiclass data in the (x1, x2) plane; one-vs-all fits one binary boundary per class.)

Class 1, Class 2, Class 3:
h_β^(i)(x) = P(y = i | x; β), i = 1, 2, 3
Unsupervised Learning: find structure in unlabeled data.

Reinforcement Learning: an agent takes actions in an environment (possibly facing an opponent) and receives the resulting state and a reward as feedback.
Example: eight rooms/states R0 through R7; the agent moves between connected rooms. Allowed moves have reward 0, moves into the goal room have reward 100, and −1 marks moves that are not possible.

(Figure: the room graph R0 to R7, with edges labeled 0 and edges into the goal labeled 100.)

Q is initialized as an 8 × 8 matrix of zeros. The reward matrix R (rows are the current state, columns the next state) is:

R0: −1  −1  −1  −1   0  −1  −1   −1
R1: −1  −1  −1   0  −1  −1  −1  100
R2: −1  −1  −1   0  −1   0  −1   −1
R3: −1   0   0  −1   0  −1  −1   −1
R4:  0  −1  −1   0  −1  −1   0   −1
R5: −1  −1   0  −1  −1  −1  −1   −1
R6: −1  −1  −1  −1   0  −1  −1  100
R7: −1   0  −1  −1  −1  −1   0   −1
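A minimal Q-learning sketch for this example, assuming the standard update Q(s, a) = R(s, a) + γ·max Q(s', ·) with room 7 as the goal; the discount factor γ = 0.8 and the episode count are assumptions, not from the slides:

```python
import random
import numpy as np

R = np.array([
    [-1, -1, -1, -1,  0, -1, -1,  -1],   # R0
    [-1, -1, -1,  0, -1, -1, -1, 100],   # R1
    [-1, -1, -1,  0, -1,  0, -1,  -1],   # R2
    [-1,  0,  0, -1,  0, -1, -1,  -1],   # R3
    [ 0, -1, -1,  0, -1, -1,  0,  -1],   # R4
    [-1, -1,  0, -1, -1, -1, -1,  -1],   # R5
    [-1, -1, -1, -1,  0, -1, -1, 100],   # R6
    [-1,  0, -1, -1, -1, -1,  0,  -1],   # R7
])
Q = np.zeros_like(R, dtype=float)
gamma = 0.8                               # assumed discount factor

for _ in range(500):                      # training episodes
    s = random.randrange(8)               # start in a random room
    while s != 7:                         # walk until the goal room
        a = int(random.choice(np.where(R[s] >= 0)[0]))  # a valid next room
        Q[s, a] = R[s, a] + gamma * Q[a].max()          # Q-learning update
        s = a
```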
Machine learning pipeline:
Step 1: Data Collection
Step 2: Data Preparation
Step 3: Feature Extraction
Step 4: Model Training
Step 5: Model Testing
Step 6: Model Evaluation
jitendra@nitt.edu
https://imjitendra.wordpress.com/
I want to buy a new house!

A loan application is scored on several features, each with an importance rating:
• Credit History ★★★★
• Income ★★★
• Term ★★★★★
• Personal Info ★★★

Credit history: the applicant's record of repaying past credit.
Income: what's my income? Example: $80K per year.
Loan terms: how soon do I need to pay the loan? Example: 3 years, 5 years, …
Personal information: age, reason for the loan, marital status, … Example: home loan for a married couple.

Intelligent application: classify loan applications as Safe ✓ or Risky ✘.
Classifier review: input x_i → classifier → output ŷ, the predicted class, with ŷ_i = +1 (Safe) or ŷ_i = −1 (Risky).
Decision Tree: Intuitions

What does a decision tree represent?

Start
└─ Credit?
   ├─ excellent → Safe
   ├─ fair → Term?
   │    ├─ 3 years → Risky
   │    └─ 5 years → Safe
   └─ poor → Income?
        ├─ high → Term?
        │    ├─ 3 years → Risky
        │    └─ 5 years → Safe
        └─ low → Risky

For example, this tree says that 3-year loans with high income and poor credit history are risky.
Scoring a loan application: x_i = (Credit = poor, Income = high, Term = 5 years).
Traverse the tree: Credit = poor → Income?; Income = high → Term?; Term = 5 years → ŷ_i = Safe.
Decision tree learning task

(Figure: training pipeline; features x go through feature extraction h(x) into the ML model, which outputs ŷ; the ML algorithm adjusts the tree T(x) using a quality metric that compares ŷ against the true labels y.)

How do we learn a decision tree from data?
Decision tree learning problem: find the tree that best separates Safe from Risky loans; a branch where all the data agree (e.g., all Safe) needs nothing more done to it.

Compact visual notation: each node shows its counts of safe and risky loans; the root node holds all the data, N = 40 examples.
Decision stump: single-level tree. Loan status at the root: all the data, with Safe and Risky counts. Splitting on Credit divides the data among the intermediate nodes excellent, fair, and poor.

Making predictions with a decision stump (split on Credit?, or alternatively Term?): each intermediate node predicts the majority class of its data.

Error = # mistakes / # data points
Calculating classification error
• Step 1: ŷ = class of the majority of data in the node
• Step 2: calculate the classification error of predicting ŷ for this data

Loan status at the root: 22 Safe, 18 Risky. Predicting the majority class ŷ = Safe gives 22 correct and 18 mistakes:
Error = 18/40 = 0.45

Tree   | Classification error
(root) | 0.45
Choice 1: Split on credit history
Step 1: for each intermediate node, set ŷ = majority value.
Credit?: excellent (9 Safe, 0 Risky → Safe), fair (9, 4 → Safe), poor (4, 14 → Risky)
Mistakes: 0 + 4 + 4 = 8, so Error = 8/40 = 0.2

Choice 2: Split on term
Term?: 3 years (16 Safe, 4 Risky → Safe, 4 mistakes), 5 years (6, 14 → Risky, 6 mistakes)
Error = (4 + 6)/40 = 0.25

Tree            | Classification error
(root)          | 0.45
Split on credit | 0.2
Split on term   | 0.25
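A small Python check of these numbers (the helper name is illustrative); each node predicts its majority class, so its mistakes are the minority count:

```python
def split_error(nodes):
    """nodes: list of (num_safe, num_risky) pairs, one per child node."""
    mistakes = sum(min(s, r) for s, r in nodes)
    total = sum(s + r for s, r in nodes)
    return mistakes / total

print(split_error([(22, 18)]))                 # root: 0.45
print(split_error([(9, 0), (9, 4), (4, 14)]))  # split on credit: 0.2
print(split_error([(16, 4), (6, 14)]))         # split on term: 0.25
```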
Choice 1 vs Choice 2: splitting on credit (error 0.2) beats splitting on term (error 0.25), so the root is split on Credit.

Recurse: build a decision stump with the subset of data where Credit = fair, and another with the subset of data where Credit = poor.
Second level:

Credit?
├─ excellent (9 Safe, 0 Risky) → Safe
├─ fair (9, 4) → Term?
│    ├─ 3 years (0, 4) → Risky
│    └─ 5 years (9, 0) → Safe
└─ poor (4, 14) → Income?
     ├─ high (4, 5) → Term?
     │    ├─ 3 years (0, 2) → Risky
     │    └─ 5 years (4, 3) → Safe
     └─ low (0, 9) → Risky
Simple greedy decision tree learning
When do we stop???
Stopping condition 1: all data agrees on y. When all data points in a node have the same y value (for example the excellent node with 9 Safe, 0 Risky, or the low-income node with 0 Safe, 9 Risky), there is nothing more to do: the node becomes a leaf.
Stopping condition 2: already split on all features. When a branch has already split on every possible feature, there is nothing more to do, even if the node is still impure (for example the 5-years node under poor credit and high income, with 4 Safe, 3 Risky).
Greedy decision tree learning
(Figure: the ML algorithm produces the tree T(x) by optimizing the quality metric against the labels y.)
Decision tree model: a loan application x_i enters at the root and follows the Credit?/Income?/Term? splits down to a leaf, whose label (Safe or Risky) is the prediction ŷ_i.
Traversing a decision tree: for x_i = (Credit = poor, Income = high, Term = 5 years), start at the root, take the poor branch to Income?, the high branch to Term?, and the 5-years branch to a Safe leaf.
predict(tree_node, x), as runnable Python:

from dataclasses import dataclass, field

@dataclass
class TreeNode:
    feature: str = None                           # splitting feature; None for a leaf
    children: dict = field(default_factory=dict)  # branch value -> child node
    majority_class: str = None                    # prediction stored at a leaf

def predict(tree_node, x):
    # If the current tree_node is a leaf, return the majority class
    # of the data points in that leaf.
    if tree_node.feature is None:
        return tree_node.majority_class
    # Otherwise, descend into the child node whose branch value
    # agrees with the input, and recurse.
    next_node = tree_node.children[x[tree_node.feature]]
    return predict(next_node, x)
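Hypothetical usage, wiring up the loan tree from these slides (the same Term stump is reused under fair credit and under poor credit with high income):

```python
safe = TreeNode(majority_class="Safe")
risky = TreeNode(majority_class="Risky")
term = TreeNode("Term", {"3 years": risky, "5 years": safe})
income = TreeNode("Income", {"high": term, "low": risky})
root = TreeNode("Credit", {"excellent": safe, "fair": term, "poor": income})

x_i = {"Credit": "poor", "Income": "high", "Term": "5 years"}
print(predict(root, x_i))  # -> Safe
```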
Multiclass classification

Multiclass prediction: input x_i → classifier model → output ŷ_i, the predicted class, now one of Safe, Risky, or Danger.
Multiclass decision stump: N = 40, 1 feature (Credit), 3 classes.

Training rows (Credit → y): excellent → safe, fair → risky, fair → safe, poor → danger, excellent → risky, fair → safe, poor → danger, poor → safe, fair → safe, …

Loan status at the root: 18 Safe, 12 Risky, 10 Danger.
Split on Credit?:
excellent: 9 Safe, 2 Risky, 1 Danger
fair:      6 Safe, 9 Risky, 2 Danger
poor:      3 Safe, 1 Risky, 7 Danger
Decision tree learning: real-valued features

How do we use real-valued inputs? A numeric feature such as Income (values ranging from $10K to $120K) is split with a threshold.
Visualizing the threshold split: the threshold split is the line Age = 38.
(Figure: Income ($0K to $80K) against Age; a vertical line at Age = 38 splits the data.)
Split on Age >= 38.
(Figure: the same plane split at Age = 38, with one side predicting Safe.)
Depth 2: split on Income >= $60K.
(Figure: within the Age >= 38 region, a second split at Income = $60K.)
Each split partitions the 2-D space into regions:
• Age < 38
• Age >= 38, Income >= $60K
• Age >= 38, Income < $60K
Summary of decision trees
What you can do now
Unsupervised Learning
Training set: {x^(1), x^(2), …, x^(m)} (no labels y)
K-means algorithm

Input:
- K (number of clusters)
- Training set {x^(1), x^(2), …, x^(m)}, x^(i) ∈ ℝ^n (drop the x_0 = 1 convention)

Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K ∈ ℝ^n
Repeat {
  for i = 1 to m:
    c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
  for k = 1 to K:
    μ_k := average (mean) of the points assigned to cluster k
}
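A minimal NumPy sketch of this loop (X is a placeholder m × n data matrix, and the fixed iteration count stands in for a convergence check):

```python
import numpy as np

def kmeans(X, K, iters=100):
    rng = np.random.default_rng(0)
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)  # random init
    for _ in range(iters):
        # Cluster assignment step: c_i = index of the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Move centroid step: mu_k = mean of the points assigned to k.
        for k in range(K):
            if (c == k).any():
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```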
K-means for non-separated clusters: e.g., T-shirt sizing.
(Figure: customer weight against height; even without well-separated clusters, K-means partitions the customers into size groups.)
K-means optimization objective
c^(i) = index of the cluster (1, 2, …, K) to which example x^(i) is currently assigned
μ_k = cluster centroid k (μ_k ∈ ℝ^n)
μ_{c^(i)} = cluster centroid of the cluster to which example x^(i) has been assigned

Optimization objective:
J(c^(1), …, c^(m), μ_1, …, μ_K) = (1/m) Σ_{i=1}^m ‖x^(i) − μ_{c^(i)}‖²
minimized over c^(1), …, c^(m) and μ_1, …, μ_K
Random initialization: should have K < m. Randomly pick K training examples and set μ_1, …, μ_K equal to these K examples.

Choosing K: one option is to plot the cost function J against the number of clusters K = 1, 2, …, 8 and look for an elbow. E.g., for T-shirt sizing on the weight/height data, K can also be chosen for the downstream purpose (a small number of sizes versus a larger one).
Thank You!
Random Forest
CAMI16: Data Analytics
Recap: a loan application is scored on Credit (★★★★), Income (★★★), Term (★★★★★), and Personal Info (★★★); a single decision tree maps the application through Credit?/Income?/Term? splits to ŷ_i ∈ {Safe, Risky}.
Step 1: Create multiple data sets D_1, D_2, …, D_{t−1}, D_t
Step 2: Build multiple classifiers C_1, C_2, …, C_{t−1}, C_t
Step 3: Combine the classifiers into C*
Bootstrapping: resampling the observed dataset to produce new datasets (each of equal size to the observed dataset), each obtained by random sampling with replacement from the original dataset.
(Figure: the training data (N examples, M features) is bootstrapped into several datasets of the same shape; a tree is trained on each, and the forest takes the majority vote of their predictions.)
For prediction:
Regression: average all k predictions from all k trees
Classification: majority vote among all k trees

• All bagged trees will look similar; hence all the predictions from the bagged trees will be highly correlated.
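A short scikit-learn sketch; random forests address the correlation issue above by also considering a random subset of features at each split (the dataset here is synthetic and purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)  # k = 100 trees
model.fit(X, y)
print(model.predict(X[:5]))  # majority vote across the 100 trees
```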
Eager:
• Model is computed before classification
• Model is independent of the test instance
• Test instance is not included in the training data
• Avoids too much work at classification time
• Model is not accurate for each instance

Lazy:
• Model is computed during classification
• Model is dependent on the test instance
• Test instance is included in the training data
• High accuracy at each instance level
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."

Step 1: Initialization: define k.
Step 2: Compute the distance from the query point to every training point, e.g.
• Euclidean distance: D(X, Y) = sqrt( Σ_{i=1}^n (x_i − y_i)² )
• Manhattan distance: D(X, Y) = Σ_{i=1}^n |x_i − y_i|
Step 3: The majority class among the k nearest neighbours indicates the prediction (e.g., GOOD).

Choosing k:
• Small k? Captures fine structures, but is influenced by noise.
• Larger k? Less precise, higher bias.
• A common rule of thumb: k = √n.
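A minimal k-NN sketch in Python using the Euclidean distance above (the function name and default k are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # k closest training points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                  # majority vote
```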
Computer vision example: car detection.

(Figure: a learning algorithm is trained on raw images of cars and "non"-cars; plotting two pixel intensities, pixel 1 against pixel 2, shows the two classes as point clouds.)

For 50 × 50 pixel images there are 2500 pixels (7500 if RGB), so the feature vector x = [pixel 1 intensity, pixel 2 intensity, …] is very large.
Logical NOT: y = g(10 − 20x1):
x1 | y
0  | 1
1  | 0

Logical OR: x1, x2 ∈ {0, 1}, y = x1 OR x2, computed by y = g(−10 + 20x1 + 20x2):
x1 x2 | y
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 1
Logical AND: x1, x2 ∈ {0, 1}, y = x1 AND x2, computed by y = g(−30 + 20x1 + 20x2):
x1 x2 | y
0  0  | 0
0  1  | 0
1  0  | 0
1  1  | 1
How does the perceptron learn its
classification tasks?
• This is done by making small adjustments in the
weights to reduce the difference between the predicted
and desired outputs of the perceptron.
• The initial weights are randomly assigned, usually in the
range [-0.5, 0.5], and then updated to obtain the output
consistent with the training examples.
• If at iteration p the predicted output is Y(p) and the desired output is Y_d(p), then the error is given by:
e(p) = Y_d(p) − Y(p), where p = 1, 2, 3, …

The perceptron output is
Y(p) = step( Σ_{i=1}^n x_i(p) × w_i(p) − θ )
where n is the number of the perceptron inputs, and step is a step activation function.
• Step 4: Iteration
• Increase iteration p by one, go back to Step 2, and repeat the process until convergence.
A perceptron can learn basic operations like AND, OR, and NOT, but it cannot learn more complex functions such as XOR.
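A sketch of this learning rule in Python: step activation, weights initialized in [−0.5, 0.5], and updates proportional to the error e(p). The learning rate and the delta-rule update w_i := w_i + α·e·x_i are assumptions in the spirit of the text; the task here is AND (on XOR this loop would never converge):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_d = np.array([0, 0, 0, 1])                 # desired outputs: AND
rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, size=2)           # initial weights in [-0.5, 0.5]
theta = rng.uniform(-0.5, 0.5)               # threshold
alpha = 0.1                                  # assumed learning rate

for p in range(100):                         # iterations p = 1, 2, 3, ...
    errors = 0
    for x, yd in zip(X, y_d):
        y = int(x @ w - theta >= 0)          # Y(p) = step(sum x_i w_i - theta)
        e = yd - y                           # e(p) = Yd(p) - Y(p)
        w = w + alpha * e * x                # weight update
        theta = theta - alpha * e            # threshold update
        errors += abs(e)
    if errors == 0:                          # outputs match all examples
        break
```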
A multi-layer network overcomes this limitation. Forward propagation computes z^(2) = W^(1)x, applies the activation, and so on layer by layer.

Example: computing x1 XNOR x2 with one hidden layer:
• hidden unit a1 = x1 AND x2: weights (−30, 20, 20)
• hidden unit a2 = (NOT x1) AND (NOT x2): weights (10, −20, −20)
• output y = a1 OR a2: weights (−10, 20, 20)

x1 x2 | a1 a2 | y
0  0  | 0  1  | 1
0  1  | 0  0  | 0
1  0  | 0  0  | 0
1  1  | 1  0  | 1
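A small Python check of this construction with sigmoid units; it prints the XNOR truth table above:

```python
import numpy as np

def g(z):
    return 1 / (1 + np.exp(-z))

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = g(-30 + 20 * x1 + 20 * x2)   # x1 AND x2
        a2 = g(10 - 20 * x1 - 20 * x2)    # (NOT x1) AND (NOT x2)
        y = g(-10 + 20 * a1 + 20 * a2)    # a1 OR a2
        print(x1, x2, round(y))           # x1, x2, x1 XNOR x2
```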