
Pattern Recognition and

Anomaly Detection

Student Guide
Course code GAI10SG194
V10.1
Student Notebook

Contents
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

Course description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

Unit 1. Pattern and Anomaly Detection Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
What is a pattern? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
What is pattern recognition? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
Pattern recognition techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-6
Training and learning in pattern recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-8
Pattern recognition applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-9
Pattern recognition use cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-11
What is anomaly detection? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-14
What are some other practical uses for anomaly detection? . . . . . . . . . . . . . . . . . . . . . . . 1-15
How is anomaly detection calculated over time? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-16
Self evaluation: Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-17
Key point for AI and ML-anomaly detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-18
Tasks for artificial intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-20
AI system learning process (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-21
AI system learning process (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-22
Self evaluation: Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-23
Fitting algebraic curves to geometric requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-24
Curves matched to data points (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-25
Curves matched to data points (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-27
Case study: Anomaly detection with IBM Watson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-28
Self evaluation: Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-30
Probability theory (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-31
Probability theory (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-32
Maximum likelihood theory and estimation (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-33
Maximum likelihood theory and estimation (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-34
Self evaluation: Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-37
Model selection (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-38
Model selection (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-40
Matrices of uncertainty (confusion matrices) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-43
Loss of logging (log-loss) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-44
Rate for F1 (F1 score) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-45
Metric selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-46
Hyperparameter selection (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-47
Hyperparameter selection (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-48
The problem with high dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-49
Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-50
Self evaluation: Exercise 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-51
Checkpoint (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-52
Checkpoint (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-53
Question bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-54
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-55


Unit 2. Statistical Approaches for Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-2
Understanding statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-3
T-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-6
Z-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-11
Self evaluation: Exercise 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-14
Z-test and t-test difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-15
P-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-16
Descriptive statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-18
Self evaluation: Exercise 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-20
Type I error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-21
Type II error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-23
Differences between type I and type II errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-24
Null hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-25
Statistical significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-27
Self evaluation: Exercise 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-29
Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-30
Four steps of hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-31
Real-world example of hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-32
Bonferroni test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-34
Check of one-tailed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-35
Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-38
Self evaluation: Exercise 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-41
Types of distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-42
Regression models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-44
Self evaluation: Exercise 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-45
Types of regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-46
How to select the best model for regression? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-52
Common questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-53
Linear models for classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-55
Example of positive linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-57
Checkpoint (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-58
Checkpoint (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-59
Question bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-60
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-61

Unit 3. Machine Learning Approaches for Pattern Recognition . . . . . . . . . . . . . . . . . . . . . .3-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-2
Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-3
How neural networks learn? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-4
Neural networks examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-5
Neural networks use cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-20
Kernel methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-22
Self evaluation: Exercise 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-30
Sparse kernel machines use cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-31
Graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-33
Mixture models and EM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-34
Bayesian networks: Directed graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-35
Conditional probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-37
Potential functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-38
Self evaluation: Exercise 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-44


Conditional independences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45


Sampling methods for pattern recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Continuous latent variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-59
Self evaluation: Exercise 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-69
Combining models for pattern recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-70
Markov chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-74
The K-means algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-75
Applications of K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-76
Checkpoint (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-78
Checkpoint (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-79
Question bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-80
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-81

Unit 4. Anomaly Detection & Anomaly Detection Approaches . . . . . . . . . . . . . . . . . . . . . . . 4-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
What are anomalies? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Applications of anomaly detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
Related use cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
Types of input data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
Types of anomalies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10
Evaluation of an anomaly detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-11
Taxonomy of approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14
Classification based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-15
Classification use cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-16
Supervised classification techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-17
Self evaluation: Exercise 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-26
Nearest neighbor based techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-27
Self evaluation: Exercise 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-35
Self evaluation: Exercise 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-36
Other model techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-37
Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-44
Contextual anomaly based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-45
Self evaluation: Exercise 17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-47
Collective anomaly detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-48
On-line based model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-50
Distributed anomaly detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-53
Self evaluation: Exercise 18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-54
IDS analysis strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-55
Self evaluation: Exercise 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-59
Checkpoint (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-60
Checkpoint (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-61
Question bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-62
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-63

Unit 5. Real-world problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Network intrusion detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Understanding of IDS core operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
How an IDS works? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Types of intrusion detection systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6
Self evaluation: Exercise 20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-8


Fundamental concerns of intrusion detection systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-9


Intrusion detection vs. intrusion prevention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-11
The future of IDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-12
Anomaly detection in big data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-13
Key attributes of advanced anomaly detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-15
Self evaluation: Exercise 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-17
The real-world impact of anomaly detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
Anomaly detection on 5G: Possibilities and opportunities . . . . . . . . . . . . . . . . . . . . . . . . .5-20
Real time anomaly detection in docker, Hadoop cluster . . . . . . . . . . . . . . . . . . . . . . . . . . .5-23
Anomaly detection in IoT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-24
Detection of deviations in deep learning time series results . . . . . . . . . . . . . . . . . . . . . . . .5-26
Self evaluation: Exercise 22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-28
Anomaly detection use cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-29
Anomaly detection with time series forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-31
Self evaluation: Exercise 23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-33
What is time series analysis? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-34
Time series data models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-36
Self evaluation: Exercise 24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-38
How to find anomaly in time series data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-39
Anomaly detection using machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-41
Anomaly detection using deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-42
Anomaly detection for an e-commerce pricing system . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-43
IBM’s Watson AIOps automates IT anomaly detection and remediation . . . . . . . . . . . . . .5-45
Self evaluation: Exercise 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-47
Checkpoint (1 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-48
Checkpoint (2 of 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-49
Question bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-50
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-51

Unit 6. Lab Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-2
Lab specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-3
Exercise 1: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-4
Exercise 2: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-7
Exercise 3: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-17
Exercise 4: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-19
Exercise 5: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-22
Exercise 6: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-25
Exercise 7: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-28
Exercise 8: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-35
Exercise 9: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-38
Exercise 10: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-41
Exercise 11: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-46
Exercise 12: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-49
Exercise 13: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-52
Exercise 14: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-58
Exercise 15: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-62
Exercise 16: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-67
Exercise 17: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-72
Exercise 18: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-77
Exercise 19: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-80


Exercise 20: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-83


Exercise 21: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-86
Exercise 22: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-91
Exercise 23: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-97
Exercise 24: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-100
Exercise 25: Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-105
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-114

Appendix A. Checkpoint solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1

Trademarks
The reader should recognize that the following terms, which appear in the content of this training document,
are official trademarks of IBM or other companies:
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corp., registered in many jurisdictions worldwide.
The following are trademarks of International Business Machines Corporation, registered in many
jurisdictions worldwide:
DB2® HACMP™ System i™
System p™ System x™ System z™
Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other product and service names might be trademarks of IBM or other companies.


Course description
Pattern Recognition and Anomaly Detection

Purpose
This course describes how to classify a given pattern in a dataset into one of several pre-specified classes, develops
skills in using pattern recognition and anomaly detection techniques with AI algorithms, and provides experience in
independent study and research across different anomaly detection areas.

Audience
The audience of this course is Bachelor of Technology (B.Tech) students.

Prerequisites
A basic overview of pattern recognition and anomaly detection.

Objectives
After completing this course, you should be able to:
• Understand the concept of pattern recognition and anomaly detection
• Learn about linear models for classification
• Gain insight into clustering-based methods
• Learn about examples of sparse kernel machines and graphical models
• Gain knowledge of anomaly detection in big data

References
• https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Gohana_inverted_S-curve.png/560px-Gohana_inverted_S-curve.png
• https://lh3.googleusercontent.com/HkRug5Yd6SlGy0AkSgLZ9FYwrq3Os5jeSoEiHqg5ft1se9C8uSUcXjY9p3yfYfhg13eyUA=s86
• https://images.app.goo.gl/j5A2Q22DcLvTFt5K8
• www.IBM.com


Unit 1. Pattern and Anomaly Detection
Introduction

What this unit is about


This unit explains the concepts of pattern recognition and anomaly detection, an example of polynomial curve
fitting, probability theory, model selection, the problem of high dimensionality, and information theory, with
real-time examples to support pattern understanding.

What you should be able to do


After completing this unit, you should be able to:
• Understand the concept of pattern recognition and anomaly detection
• Gain knowledge of polynomial curve fitting through an example
• Learn about the probability theory architecture and working model
• Understand information theory

How you will check your progress


• Checkpoint

References
IBM Knowledge Center


Unit objectives

After completing this unit, you should be able to:

• Understand the concept of pattern recognition and anomaly detection

• Gain knowledge of polynomial curve fitting through an example

• Learn about the probability theory architecture and working model

• Understand information theory

Figure 1-1. Unit objectives PAD011.0

Notes:
Unit objectives are stated above.


What is a pattern?

• In this digital age, patterns are all around us.

• A pattern can be either visually identified or mathematically detected through the implementation of algorithms.

Figure: Pattern definition


Source: https://images.app.goo.gl/NmRdihnyFymA23uRA

Figure 1-2. What is a pattern? PAD011.0

Notes:
Examples: dress colors, voice styles, and so on. In computer science, a pattern is described by the values of a
feature vector.


What is pattern recognition?

• As per Wikipedia, pattern recognition is the automated recognition of patterns and regularities in data.

• It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics, and machine learning.

Figure 1-3. What is pattern recognition? PAD011.0

Notes:
What is pattern recognition?
Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern
recognition use machine learning, owing to the increased availability of big data and an abundance of
processing power. These activities can be viewed as two facets of the same field of application, and together
they have undergone substantial development over the past few decades.
Pattern recognition is the process of using machine learning algorithms to recognize patterns. It is the
classification of information based on statistical evidence and on knowledge gained from experience or from
previously learned models. Broad applicability is one of the essential facets of pattern recognition.
Examples: voice recognition, speaker detection, Multimedia Document Recognition (MDR), and automated
medical diagnosis. In a basic pattern recognition program, the raw data is interpreted and translated into a
form the computer can use. Pattern recognition involves identifying groupings and trends.
Clustering partitions the data, which is a basic decision-making process of interest in its own right; clustering
is a form of unsupervised learning. A pattern recognition system has the following characteristics:
• It must recognize a familiar pattern quickly and reliably.
• It can distinguish and classify unfamiliar objects.
• It can distinguish shapes and structures from various perspectives.


• It detects patterns and features even when they are partly obscured.

• It recognizes patterns quickly, conveniently, and automatically.
What is a pattern class?
• A set of patterns that share similar features is called a pattern class.
• It is a set of items that are "related" but not identical.
• During recognition, items are assigned to these specified classes.


Pattern recognition techniques

Figure: Pattern recognition process


Source: https://images.app.goo.gl/3x6gZVVao9u3vV8w7

Figure 1-4. Pattern recognition techniques PAD011.0

Notes:
Pattern recognition techniques
There are three primary pattern recognition models:
Statistical: Determines which class an individual item belongs to (for example, whether an object is a cake).
This approach uses supervised machine learning.
Syntactic/structural: Describes a more complex relationship among components (for example, parts of
speech). This model uses semi-supervised learning.
Template matching: Matches the characteristics of the item against a predefined prototype and recognizes
the item by proxy through that model. One application of this approach is plagiarism checking.
Pattern recognition algorithms have two key components:
• Explorative: Used to discover commonalities in the data.
• Descriptive: Used to describe those commonalities in a definite way.
The combination of these two components is used to derive information from the data for use in big data
analytics. Studying the prevailing variables and their interactions provides information that can be important
in interpreting the subject matter.


The process flow is as follows (a minimal Python sketch of this pipeline is shown after the list):

• Data is collected continuously from different sources.
• The data is cleaned of noise.
• The data is examined for relevant characteristics or common elements.
• The identified elements are then grouped into specific segments.
• The segments are analyzed for insights into the datasets.
• The insights gained are applied in the business process.
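The following is a minimal sketch of this flow in Python. It assumes a small numeric dataset and uses scikit-learn's KMeans for the segmentation step; the feature values and the choice of two clusters are purely illustrative and are not part of the course material.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1. Collect data (here: a small, made-up array of [weight, diameter] values)
data = np.array([[25.0, 1.0], [24.5, 1.1], [7.0, 3.0], [6.5, 3.2], [25.5, 0.9]])

# 2. Clean / scale the data so that no single feature dominates
scaled = StandardScaler().fit_transform(data)

# 3-4. Group the observations into segments (clusters)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# 5-6. Analyze each segment for insights that can feed the business process
for label in np.unique(kmeans.labels_):
    segment = data[kmeans.labels_ == label]
    print(f"Segment {label}: {len(segment)} items, mean = {segment.mean(axis=0)}")
```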


Training and learning in pattern recognition

Figure: Training and testing dataset


Source: https://images.app.goo.gl/vetePfKYS2t7vmX99

Figure 1-5. Training and learning in pattern recognition PAD011.0

Notes:
Training and learning in pattern recognition: Training and learning is the process of teaching a computer
system so that it produces reliable results. Training is an important step because how the software performs
on the given data depends on which algorithm is applied to the data. The entire dataset is divided into two
parts: one is used to build the model during the training process, and the other is used to evaluate the model
after training.
Training set: The training set is used for model construction. It consists of example patterns that are used to
train the program. Training rules and algorithms provide valuable information on how input data should be
mapped to output decisions. Specific knowledge is derived from the data and from the results produced by
applying these algorithms to the training dataset. Typically, 80% of the dataset is used for training.
Test set: The system is checked using test data. This is a collection of data used during testing to verify that
the system generates the correct results. Usually, 20 percent of the dataset is used for testing. The test
results are used to assess the model's accuracy. For example, if a program that recognizes which class a
flower belongs to correctly classifies seven flowers out of ten and misclassifies the others, its accuracy is
seventy percent.
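As a minimal illustration of the 80/20 split described above, the following sketch uses scikit-learn's train_test_split on the classic iris flower dataset; the dataset and the choice of a k-nearest-neighbors classifier are illustrative assumptions, not part of the course material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labeled dataset (flower measurements and species labels)
X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples as the test set; train on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train a simple classifier on the training set only
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Accuracy on the held-out test set estimates how well the model generalizes
print("Test accuracy:", model.score(X_test, y_test))
```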



Pattern recognition applications

• A pattern can be a concrete physical entity or an abstract idea. When thinking about classes of animals, the description of an animal is an example.

• The description of a ball is a pattern when speaking about different kinds of balls.

• In the ball example, the classes can be baseball, cricket ball, or table tennis ball.

• Before a new example can be assigned to a class, the classes themselves have to be established.

• Choosing attributes and representing patterns is a very critical phase in designing the classifier.

• Effective representation requires the use of attributes that discriminate between classes while keeping the computational burden of classification manageable.

Figure 1-6. Pattern recognition applications PAD011.0

Notes:
Real-time examples and explanations: A pattern is represented as a vector. Each element of the vector
reflects one attribute of the pattern. In the example being discussed, the first element of the vector holds the
value of the first feature.
Illustration: When describing a spherical object, (25, 1) may be interpreted as a round ball with a weight of 25
units and a diameter of 1 unit. The class label can be carried as part of, or alongside, the vector.
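A minimal sketch of this representation in Python follows; the feature values and the class name are invented for illustration only.

```python
import numpy as np

# Feature vector: [weight, diameter] for one spherical object
pattern = np.array([25.0, 1.0])

# The class label is carried alongside (or appended to) the feature vector
label = "cricket ball"  # hypothetical class name

print("Features:", pattern, "-> class:", label)
```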
Advantages:
• Pattern recognition solves classification problems.
• Pattern recognition addresses the problem of fraudulent biometric identification.
• It is very useful for textile pattern recognition for visually impaired people.
• It helps in speaker diarization ("the process of partitioning an input audio stream into homogeneous
segments according to the speaker identity").
• The same object can be recognized from various angles.


Disadvantages:
• The syntactic pattern recognition approach is complex and very time-consuming to implement.
• Larger datasets are sometimes necessary to obtain improved accuracy.
• It cannot explain why a particular object is recognized, for instance my face versus my friend's face.
Applications:
• Image analysis, segmentation, and processing: Pattern recognition is used to provide the machine with
the knowledge needed to process images, for example in human identification.
• Computer vision: Pattern recognition is used to extract relevant attributes from a given image or video
and is applied for numerous purposes, such as biological and biomedical imaging in machine vision.
• Seismic analysis: Pattern recognition systems are used with seismic records to identify, image, and
describe temporal patterns. Predictive models are defined and applied in different areas of seismic
research.
• Radar signal analysis and classification: Pattern recognition and signal processing methods are used to
identify and interpret anti-personnel (AP) mines in specific radar signal recognition applications.
• Speech recognition: Significant progress has been made in understanding speech using pattern
recognition methods. It is used in numerous speech recognition algorithms that try to avoid the problems
of representing individual phonemes by modeling larger objects such as words and phrases.
• Fingerprint identification: Fingerprint identification technology is a significant part of the biometrics
industry. Various matching approaches have been used for fingerprints, among which pattern recognition
tools are commonly applied.



Pattern recognition use cases

• Customer research and stock market analysis.

• Chat bots, NLP with text generation, text analysis, text translation.

• Optical Character Recognition (OCR), document classification and signature verification.

• Image recognition, visual search, face recognition.

• Voice recognition and AI assistants.

• Recommendations, sentiment analysis, audience research.

Figure 1-7. Pattern recognition use cases PAD011.0

Notes:
Pattern recognition use cases
Customer research and stock market analysis: In equity market forecasting, trend identification is used for
quantitative analysis of market prices and for forecasting probable outcomes. Stock charts use this kind of
analysis for pattern detection. Audience research and audience selection involve reviewing selected features
of accessible consumer data and classifying the audience. Applications such as Google Analytics support
these use cases.
Chat bots, NLP with text generation, text analysis, text translation: Natural Language Processing (NLP) is a
machine learning field that focuses on training computers to comprehend human language and generate
responses. This may sound like science fiction, yet it is not really about the deeper meaning of a
conversation; it is only about what is conveyed explicitly in the text. NLP breaks text into fragments, looks for
links, and draws distinctions. The cycle starts by splitting the sentences; it identifies the words and phrases,
then describes how those terms fit together in a paragraph. In doing so, NLP uses a mix of techniques such
as filtering, segmentation, and tagging to establish a model for the process. Supervised and unsupervised
machine learning algorithms are involved at many stages of the method.


Areas of NLP include:

• Text processing in applications: Used for language classification, content creation, and curation (online
content management applications such as BuzzSumo use this technology).
• Plagiarism detection: A type of text analysis based on a web crawler's analysis of the document. The
words are divided into tokens that are checked for matches elsewhere; the proportion of copied text is a
practical way to measure this.
• Text summarization and contextual extraction: Finding the core meaning of the text. There are many online
tools for this task, for example text summarizers.
• Text generation: Used for chat bots and AI assistants, or for automatic content creation (for example,
automatic communications or Twitter bot alerts).
• Translation: The system uses a mix of meaning and sentiment analysis, in addition to data interpretation
and word replacement, to render the message in another language. The main illustration of this is Google
Translate.
• Text clarification and matching: This method can be used to clean up language, from formatting to word
use, in addition to fixing grammatical and structural syntax flaws. Grammarly, a startup founded by two
Ukrainians, is a popular illustration of this kind of NLP application.
• Optical Character Recognition (OCR), document classification and signature verification: Optical
Character Recognition (OCR) is the detection of alphanumeric text in images and its conversion into
machine-encoded text.
The most common sources of optical characters are scanned documents or photographs, but the technique
can also be used on computer-generated, unlabeled images. The OCR algorithm applies a library of character
patterns and compares them with the input document to mark up the text and reconstruct it. These matches
are then assessed with the assistance of a language corpus, which performs the recognition itself. A mixture
of pattern detection and comparison algorithms, connected to the reference set, lies at the core of OCR.
The most popular OCR uses include:
• Document reproduction is the most important use. The text is recognized character by character, labeled,
and moved to a digital environment. Text transcription tools are well represented on the market; ABBYY
FineReader is a good illustration of this.
• Handwriting recognition is a variant of text recognition in which the visual component matters more. The
OCR algorithm uses a comparison engine to analyze the patterns in the script. Google Handwriting Input
is a good example. While this methodology is primarily applied to documents, it can also be used to verify
signatures and other handwritten samples.
• Document preparation and restoration requires in-depth processing of the paper source, with a focus on
form and layout. This method is used to digitize paper records and to restore fragmented items from
damaged records (for example, where the content is torn or the ink is partly blurred). Parascript is a
product that provides these services for document recognition.
Image recognition, visual search, face recognition
Image recognition is related to OCR but is designed to recognize what is shown in the frame. In contrast to
OCR, pattern recognition is used here during image processing to identify what is represented in the input
pictures. Essentially, the picture is "described" rather than "read", so that it can be viewed and compared with
other pictures. The core algorithms for image recognition are a mixture of unsupervised and supervised
learning. The model is first trained on labeled datasets using a supervised approach, that is, on examples of
the objects of interest. An unsupervised algorithm is then used to analyze an input image, and a supervised
algorithm goes through patterns of the same type of objects and categorizes them.


Image recognition and face identification have two primary use cases:

• Visual search is commonly used by search engines and e-commerce marketplaces. It handles pictures the
same way an ordinary search handles an alphanumeric query. Image recognition is part of the computation
in both situations; the other components are the file's metadata and additional textual information.
Examples: Google search and Amazon.
• Facial recognition is commonly used in social network applications such as Facebook and Instagram. Law
enforcement uses the same techniques to locate a person of interest or suspects on the run. The
technological mechanism behind face identification is more complicated than pure object recognition. To
detect a specific person's face, the algorithm must have a specifically labeled collection of samples.
Furthermore, these features are typically opt-in because of privacy restrictions, which require user
approval. VeriLook SDK is one of the best-known implementations of the technology.
Voice recognition and AI assistants: Audio is just as important a source of information as any other. The
explosive growth of machine learning algorithms has enabled it to be used in the delivery of essential
services. Voice recognition essentially operates on the same concepts as OCR; the only distinction is the type
of data that flows through it.
Speech and sound recognition is used for the following purposes:
• AI assistant applications use natural language analysis to transcribe a message and an external sound
library to carry out a request. Google Assistant is an example.
• Sound-based diagnostics uses a reference archive of sounds to identify disturbances and to recommend
likely explanations and repair methods. It is frequently used for testing the condition of the engine or other
car components in the automotive sector.
• Speech-to-text and text-to-speech conversion uses a comparable sample library, a recognition system, and
a voice generation system. Beyond AI assistants, it is often used to read written material aloud.
• Intelligent caption generation uses speech-to-text recognition and an accompanying overlay to display
the text on screen (for example, the automated subtitling functionality on YouTube or Facebook).
Recommendations, sentiment analysis, audience research
Audience polling, customer care, and recommendations all rely on sentiment analysis. Sentiment analysis is a
subset of pattern recognition that takes a further step in determining the meaning and intent of text. In other
terms, it tries to clarify what lies behind the mood of sentences: an insight and, above all, a goal. It is one of
the most common forms of pattern recognition. In market applications, sentiment analysis can be used to
analyze the range of responses to different kinds of content. In addition to the basic recognition process, the
program employs unsupervised deep learning. Sentiment analysis conclusions are typically based on
reference resources such as dictionaries, but more personalized datasets can also be used depending on the
project background.
Use cases for sentiment analysis include:
• Market analysis, content management, and consumer engagement tools help identify user groups,
connect them with relevant content, and evaluate their attitudes toward it. This also leads to better
optimized material. Salesforce's Einstein application tools are an example of these applications.
• Customer support uses it to identify the nature of a problem (whether positive or negative, combative or
poorly defined). It is widely seen in AI assistants such as Alexa, Siri, and Cortana.
• Recommendation and diagnosis are used to assess an individual user's content of interest. The
recommendation can be complemented by questions and data about previous service use. The closest
cases are Netflix with its "you may like it too" and Amazon's "people purchase it too".


What is anomaly detection?

Figure: Anomaly detection example


Source: https://images.app.goo.gl/WpR4Rk1Xth1Fj4rt6

Figure 1-8. What is anomaly detection? PAD011.0

Notes:
Anomaly detection: An anomaly is a divergence from ordinary, natural, or anticipated values. Anomaly
detection finds atypical patterns within the data; such rare occurrences are often called outliers. You can
define the report metrics that anomaly detection should track. For the defined metrics, anomaly detection
takes the following actions:
• Historical data is analyzed.
• A model is built and anomalous points are determined.
• The contributing factors are listed.
• The information is shown visually in a separate anomaly detection view.
Now assume that you have a report that monitors successful checkouts. You want to learn whether the
number of checkouts diverges from the norm, so you pick the checkout metric for anomaly detection.
Anomaly detection tracks the metric across historical records and detects and quantifies deviations in the
metric. It notices, for example, that the average number of successful checkouts at the weekend is 15 percent
lower than during the week. It then flags an anomaly when checkouts abruptly decrease much more than
that. Anomaly detection flags the data deviation and lists the reasons leading to the decrease in successful
checkouts.
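A minimal sketch of this kind of metric monitoring in Python follows. It assumes daily checkout counts in a pandas Series and uses a simple rolling-mean rule; the 14-day window, the 3-sigma threshold, and the simulated data are illustrative choices, not the actual algorithm used by any product.

```python
import numpy as np
import pandas as pd

# Hypothetical daily checkout counts, with one sudden drop at the end
dates = pd.date_range("2020-01-01", periods=30, freq="D")
rng = np.random.default_rng(0)
checkouts = pd.Series(100 + rng.normal(0, 5, 30), index=dates)
checkouts.iloc[-1] = 40  # simulated anomaly: checkouts drop sharply

# Rolling mean and standard deviation over the preceding 14 days
# (shift(1) so that a day is not included in its own baseline)
mean = checkouts.shift(1).rolling(window=14).mean()
std = checkouts.shift(1).rolling(window=14).std()

# Flag days that deviate from the baseline by more than 3 standard deviations
anomalies = checkouts[(checkouts - mean).abs() > 3 * std]
print(anomalies)
```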


What are some other practical uses for anomaly detection?

• Traffic dropped or spiked.

• Transactions or revenue dropped.

• Traffic from social media increased or decreased.

• Traffic from organic search increased or decreased.

Figure 1-9. What are some other practical uses for anomaly detection? PAD011.0

Notes:
What are some other practical uses for anomaly detection?
Anomaly detection can be used for the following business cases:
Traffic dropped or spiked: Regular traffic follows patterns that may not be constant over the year. Anomaly
detection identifies the usual traffic trend and flags a deviation when traffic diverges from that trend.
Transactions or revenue dropped: Anomaly detection identifies the usual trend for purchases or revenue and
flags an anomaly if they dip outside the norm.
Traffic from social media increased or decreased: Marketers are warned if the flow of social network traffic
changes all of a sudden. The change may be attributed to their mass tweeting, or to their Facebook page
being penalized for sending too many promotional messages.
Traffic from organic search increased or decreased: For SEO, if the volume of traffic from search engines
decreases, it may be an indication that the ranking algorithms have changed. The site content then needs to
be revised in order to achieve a better rank under the current evaluation criteria.
What are the data requirements for anomaly detection, in terms of the number of days of data and data density?
Anomaly detection requires data measured for at least fourteen days, not counting initial consecutive zero
values, and the proportion of missing data must not exceed fifty percent.


How is anomaly detection calculated over time?

Figure: CPU time frame for anomaly detection


Source: https://images.app.goo.gl/a5wk76ZUmKZLj63v6

Figure 1-10. How is anomaly detection calculated over time? PAD011.0

Notes:
How is anomaly detection calculated over time?
For instance, suppose that in August an iOS app was launched that generated a huge spike in iOS traffic. The
increase in the total session count was associated with that anomaly. An Android app was released in
September, which generated its own increase in traffic.
Are the contributing factors identified for every anomaly?
For the session count metric at the time of the iOS launch, the highest contributing element is the platform
dimension with the value iOS. The second contributing element is the platform dimension with the value
Android. However, if you select the revenue metric instead, the session count contributor list still shows a
platform dimension with iOS as the primary contributor and Android as the secondary contributor.


Self evaluation: Exercise 1

• To continue with the training, after learning the various steps involved in pattern recognition and anomaly detection, apply these concepts to perform the following activity.

• You are instructed to complete the following activity using Python code.

• Exercise 1: Polynomial curve fitting.

Figure 1-11. Self evaluation: Exercise 1 PAD011.0

Notes:


Key point for AI and ML anomaly detection IBM ICE (Innovation Centre for Education)
IBM Power Systems

Figure: Anomaly detection flow with Machine learning


Source: https://images.app.goo.gl/ewiV18h8cpAkEyCL9

Figure 1-12. Key point for AI and ML-anomaly detection PAD011.0

Notes:
Key point for AI and ML anomaly detection
Anomaly detection is a way to identify unexpected occurrences, adjustments, or shifts in datasets more quickly and effectively. It refers to detecting items or events that do not adhere to an expected trend, or items in a dataset that are typically impossible for a human specialist to spot. Anomaly detection is therefore one of the key goals of Industrial IoT: a system that relies on artificial intelligence to identify unusual activity in the reported data collection. Typically these irregularities translate into issues such as design faults, defects, or theft. Instances of potential anomalies:
• A leaking connection pipe that forces the entire manufacturing line to shut down.
• Several unsuccessful authentication attempts indicating possible phishing or other cyber-attack activity.
• Fraud identification in financial transactions.


Why is it significant?

More companies are starting to realize the value of integrated processes in order to get a complete view of their business. In addition, they must react promptly to rapidly changing data, particularly in the context of cyber-security challenges. Anomaly detection can be key to addressing these threats, because disruptions in usual activity signal the presence of deliberate or accidental attacks, errors, flaws, and so on. Unfortunately, there is no efficient way to manually sift through ever-growing datasets. With complex systems whose many moving parts continuously redefine "normal" behavior, a modern, proactive strategy is required to detect anomalous conduct.
SPC (Statistical Process Control): Statistical Process Control, or SPC, is a technique for quality assessment and monitoring during the manufacturing phase. During processing, quality data are gathered and plotted on a chart with preset control limits that reflect the capability of the process. Data that falls within the control limits indicates that the process is running as planned; any variability within those limits is due to the normal variance anticipated as part of the operation, a so-called common cause. When data falls outside the control limits, the source of the variation is likely an assignable cause, which must be identified and the process corrected before defects arise.
SPC is an important tool for driving continuous improvement in this manner. The technique was introduced in 1924 and remains at the core of industrial quality control. Incorporating artificial intelligence technologies, however, can make it more reliable and precise and offer further insight into the production process and the presence of anomalies.
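As a rough illustration of the control-limit idea (a sketch, not taken from the course material), the code below computes three-sigma control limits from a sample of measurements and flags out-of-control points; the function name, variable names, and the three-sigma choice are assumptions.

import numpy as np

def control_limits(measurements, sigmas=3.0):
    # Centre line and control limits estimated from the sample of quality data.
    x = np.asarray(measurements, dtype=float)
    centre = x.mean()
    spread = x.std(ddof=1)
    ucl = centre + sigmas * spread   # upper control limit
    lcl = centre - sigmas * spread   # lower control limit
    out_of_control = (x > ucl) | (x < lcl)   # points that suggest an assignable cause
    return centre, lcl, ucl, out_of_control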


Tasks for artificial intelligence IBM ICE (Innovation Centre for Education)
IBM Power Systems

Figure: AI Task
Source: https://images.app.goo.gl/F9Z664sSWR53yGqLA

Figure 1-13. Tasks for artificial intelligence PAD011.0

Notes:
Tasks for artificial intelligence
• Automation: Driven by AI detection algorithms, data sets are constantly analyzed, the normal
behavioral parameters are precisely specified, and deviations from the pattern are recognized.
• Real-time analysis: AI applications observe the behavior of the data live. The moment the
system does not recognize a pattern, it issues a warning.
• Scrupulousness: Anomaly monitoring systems provide end-to-end, gap-free surveillance to track data in
depth and detect the slightest irregularities that people might not find.
• Accuracy: AI increases anomaly identification performance, eliminating noisy alerts and the
false positives/negatives caused by static thresholds.
• Self-learning: AI-driven, self-learning algorithms are the backbone of such systems: they learn from data
patterns and provide predictions or answers as appropriate.

AI system learning process (1 of 2) IBM ICE (Innovation Centre for Education)
IBM Power Systems

Figure: AI Learning Process


Source: https://images.app.goo.gl/dBtnk7CPRtqe7xM59

Figure 1-14. AI system learning process (1 of 2) PAD011.0

Notes:
AI system learning process
If a curve runs through two points A and B, it would be expected that the curve also passes reasonably close to the midpoint of A and B. This does not necessarily happen with high-order polynomial curves; they can take values of very large positive or negative magnitude between the points. With polynomials of low order, the curve is more likely to fall near the midpoint (a first-degree polynomial is even guaranteed to pass exactly through the midpoint). Low-order polynomials tend to be smooth, while high-order polynomial curves tend to be "lumpy". To define this more precisely, the maximum possible number of inflection points in a polynomial curve is n-2, where n is the order of the polynomial equation. An inflection point is a location on the curve where it switches from a positive radius of curvature to a negative one, which is also where the curve changes from "holding water" to "shedding water".
Be mindful that high-order polynomials are only "likely" to be lumpy; they can also be smooth, but unlike low-order polynomial curves, there is no guarantee of it. A fifteenth-degree polynomial may have, at most, thirteen inflection points, but may also have twelve, eleven, or any number down to zero. Letting the degree of the polynomial curve be higher than needed for an exact fit is undesirable for all the reasons previously given for high-order polynomials, but it also leads to a case where there are infinitely many solutions. For instance, a first-degree polynomial (a line), constrained by only a single point instead of the usual two, would give an infinite number of solutions. This raises the problem of how to compare and select one solution, which can be a problem both for software and for people. For this reason, it is generally best to choose a degree as low as possible that meets all the constraints, and perhaps even lower if an approximate fit is acceptable.
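A minimal sketch of polynomial curve fitting with NumPy is shown below; the synthetic data and the choice of degrees 1 and 9 are illustrative assumptions, meant only to contrast a low-order and a high-order fit.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)   # noisy observations

low = np.polyfit(x, y, 1)    # low-order fit: smooth, but may underfit
high = np.polyfit(x, y, 9)   # high-order fit: can oscillate ("lumpy") between the points

x_fine = np.linspace(0, 1, 200)
y_low = np.polyval(low, x_fine)    # evaluate each fitted curve on a fine grid
y_high = np.polyval(high, x_fine)
print(y_low[:3], y_high[:3])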


AI system learning process (2 of 2) IBM ICE (Innovation Centre for Education)
IBM Power Systems

Figure: Relation between wheat yield and soil salinity


Source: https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Gohana_inverted_S-curve.png/560px-Gohana_inverted_S-curve.png

Figure 1-15. AI system learning process (2 of 2) PAD011.0

Notes:
Fitting other functions to data points
Other types of curves can also be used in some cases, for example trigonometric functions (such as sine and cosine). In spectroscopy, data may be fitted with Gaussian, Lorentzian, and Voigt functions. In agriculture, the inverted logistic sigmoid function (S-curve) describes the relationship between crop yield and growth factors. The blue curve in the figure was produced by a sigmoid regression of data measured on agricultural land. It shows that the crop yield declines only gradually at first, i.e., at low soil salinity, while the decline then progresses more rapidly.
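The sketch below shows how such an inverted S-curve could be fitted with SciPy; the synthetic salinity/yield values and the parameterization of the sigmoid are assumptions for illustration, not the data behind the figure.

import numpy as np
from scipy.optimize import curve_fit

def inverted_sigmoid(x, ymax, k, x0):
    # Yield stays near ymax at low salinity and falls off around x0 at rate k.
    return ymax / (1.0 + np.exp(k * (x - x0)))

salinity = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
yield_t = np.array([4.9, 4.8, 4.7, 4.4, 3.8, 2.9, 1.9, 1.2, 0.8, 0.6])

params, _ = curve_fit(inverted_sigmoid, salinity, yield_t, p0=[5.0, 1.0, 6.0])
print(params)   # fitted (ymax, k, x0)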


Self evaluation: Exercise 2 IBM ICE (Innovation Centre for Education)


IBM Power Systems

• To continue with the training, after learning the various steps involved in pattern recognition and anomaly detection, you are instructed to apply those concepts to perform the following activity.

• You are instructed to complete the following exercise using Python code.

• Exercise 2: Probability and distribution.

Figure 1-16. Self evaluation: Exercise 2 PAD011.0

Notes:


Test to geometric requirements for curves algebraic IBM ICE (Innovation Centre for Education)
IBM Power Systems

Figure: Algebraic curves in a parametrical form used for creation of the forming line in the edge
section of different rated surfaces
Source: https://images.app.goo.gl/Rn75dzFLCv9y6nu5A

Figure 1-17. Test to geometric requirements for curves algebraic PAD011.0

Notes:
Test to geometric requirements for curves algebraic
For algebraic analysis of data, "fitting" usually means trying to find the curve that minimizes the vertical (y-axis) displacement of a point from the curve (as in ordinary least squares). For graphical and image applications, geometric fitting instead seeks to provide the best visual fit, which typically means trying to minimize the orthogonal distance to the curve (as in total least squares), or to otherwise include both axes of displacement of a point from the curve. Geometric fits are not popular because they usually require non-linear or iterative calculations, although the result is more aesthetically pleasing and geometrically accurate.
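A small sketch of the difference, under the assumption of fitting a straight line to synthetic 2D points: the algebraic fit minimizes vertical distances with np.polyfit, while the geometric (total least squares) fit minimizes orthogonal distances via the principal direction of the point cloud.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Algebraic fit: minimize vertical (y-axis) distances, as in ordinary least squares.
m_alg, b_alg = np.polyfit(x, y, 1)

# Geometric fit: minimize orthogonal distances (total least squares) via the principal direction.
pts = np.column_stack([x, y])
centre = pts.mean(axis=0)
_, _, vt = np.linalg.svd(pts - centre)
direction = vt[0]                      # direction of largest spread of the point cloud
m_geo = direction[1] / direction[0]
b_geo = centre[1] - m_geo * centre[0]
print((m_alg, b_alg), (m_geo, b_geo))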

Curves matched to data points (1 of 2) IBM ICE (Innovation Centre for Education)
IBM Power Systems

Figure: Different models of ellipse fitting


Source: https://lh3.googleusercontent.com/HkRug5Yd6SlGy0AkSgLZ9FYwrq3Os5jeSoEiHqg5ft1se9C8uSUcXjY9p3yfYfhg13eyUA=s86

Figure 1-18. Curves matched to data points (1 of 2) PAD011.0

Notes:
Curves matched to data points
Even if a function of the form y = f(x) cannot be postulated, a plane curve can still be fitted. In some cases particular types of curves can be used, such as conic sections (circular, elliptical, parabolic, or hyperbolic arcs) or trigonometric functions (sine, cosine). For example, trajectories of objects under gravity follow a parabola when air resistance is ignored, so a parabolic curve is a sensible choice for fitting trajectory data points. Tides follow sinusoidal patterns, so tidal data points should be matched to a sine wave, or to the sum of two sine waves of different periods if the effects of the Moon and the Sun are both considered. For a parametric curve, it is effective to fit each of its coordinates as a separate function of arc length; provided the data points can be ordered, the chord distance may be used instead.
A geometrically fitting circle: Coope approaches the problem of finding the best visual-fit circle to a set of 2D data points. The technique elegantly transforms the ordinarily non-linear problem into a linear problem that can be solved without using iterative numerical methods.
The geometric fit of an ellipse: The "geometrically fitting circle" technique is extended to general ellipses by adding a non-linear step, resulting in a method that is fast yet finds visually pleasing ellipses of arbitrary orientation and displacement.
Application to surfaces: Much of this discussion of 2D curves also applies to 3D surfaces, each patch of which is defined by a net of curves in two parametric directions, usually called u and v. A surface may be composed of one or more surface patches in each direction.
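The following is a sketch of the linearization idea attributed to Coope, under the assumption of noisy points sampled from a known circle: rewriting (x - a)^2 + (y - b)^2 = r^2 as 2a*x + 2b*y + c = x^2 + y^2, with c = r^2 - a^2 - b^2, makes the problem linear in (a, b, c).

import numpy as np

rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, 60)
x = 2 + 5 * np.cos(t) + rng.normal(scale=0.1, size=t.size)   # noisy points on a circle
y = -1 + 5 * np.sin(t) + rng.normal(scale=0.1, size=t.size)  # centred at (2, -1), radius 5

# Linear system: each point gives one row of 2a*x + 2b*y + c = x^2 + y^2.
A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
rhs = x ** 2 + y ** 2
(a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
r = np.sqrt(c + a ** 2 + b ** 2)
print(a, b, r)   # should recover roughly (2, -1, 5)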


Software: Many statistical packages and numerical applications, such as gnuplot, MLAB, Maple, MATLAB, GNU Octave, SciPy, and R, provide commands for performing curve fits in a range of different scenarios.
Case study
Example 1: A group of senior citizens who have never used the Internet before will receive training. As shown in the table at the top left of figure 1, a random sample of 5 people is followed for 6 months, and their hours of Internet use are recorded. Determine whether the data fits a quadratic regression model. First, to the right of this data, we create a table (shown in the figure above) with a second variable, MonSq, the square of the month. We then use the right-hand table, columns I, J, and K, to run the regression data analysis tool with the quadratic model. The result is shown in the second figure.

Figure: Data for polynomial regression

Curves matched to data points (2 of 2) IBM ICE (Innovation Centre for Education)
IBM Power Systems

Figure: Linear regression output

Figure: Quadratic regression output

Figure 1-19. Curves matched to data points (2 of 2) PAD011.0

Notes:
The R-square value of about 95 percent and the p-value (significance F) near 0 indicate that the model fits the data very well. This is also confirmed by the significance of the quadratic term: the p-value of the MonSq variable is likewise close to 0. It is further shown by the scatter diagram in figure 1, which demonstrates that the quadratic trend follows the points better than the linear trend.
The figure shows the quadratic regression that best fits the data:
Usage hours = 21.92 - 24.55*month + 8.06*month^2
Using this model (or the TREND function), we can predict, for example, that after three months a person will use the Internet for about 20.8 hours. For comparison with a linear model, the regression data analysis tool used in the previous tests is also available; the linear model is produced from columns I and K of figure 1 only, and its output appears in figure 2. The fact that the quadratic model has a higher R-square value (95.2% versus 83.5%) and a lower standard error (13.2 versus 24.5) reflects the fact that the quadratic model matches the data more precisely.
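A minimal sketch of the same comparison in Python is shown below; the month/hours values are hypothetical stand-ins (the course's actual data appears only in the figure), so the printed R-square values will not match the 95.2%/83.5% quoted above.

import numpy as np

# Hypothetical monthly Internet-usage hours.
month = np.array([1, 2, 3, 4, 5, 6], dtype=float)
hours = np.array([6.0, 11.0, 21.0, 45.0, 74.0, 108.0])

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

lin = np.polyfit(month, hours, 1)    # linear model (month only)
quad = np.polyfit(month, hours, 2)   # quadratic model (adds the MonSq term)
print(r_squared(hours, np.polyval(lin, month)))
print(r_squared(hours, np.polyval(quad, month)))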


Case study: Anomaly detection with IBM Watson IBM ICE (Innovation Centre for Education)
IBM Power Systems

Figure: Anomaly detection workflow engine


Source: https://images.app.goo.gl/ukrNQnHbjnP5XXKf6

Figure 1-20. Case study: Anomaly detection with IBM Watson PAD011.0

Notes:
Case study: Anomaly detection with IBM Watson
https://dataplatform.cloud.ibm.com/docs/content/wsd/nodes/anomalydetection.html
Anomaly detection models are used to identify outliers, or unusual cases, in the data. Unlike other forms of modeling, which store rules about abnormal cases, anomaly detection models store information about normal behavior. This allows outliers to be detected even when they do not conform to any known pattern, and it can be especially useful in applications where new patterns are continually developing, such as fraud detection. Anomaly detection is an unsupervised method, meaning that a training dataset containing known fraud cases is not required.
While conventional methods of identifying outliers generally look at one or two variables at a time, anomaly detection can analyze large numbers of fields to identify clusters, or peer groups, into which similar records fall. Each record can then be compared to the others in its peer group to identify possible anomalies. The further a case is from the normal center, the more likely it is to be unusual. For example, the algorithm might group records into three distinct clusters and flag those that fall far from the center of any cluster.
Each record is assigned an anomaly index, which is the ratio of the group deviation index to its average over the cluster that the case belongs to. The larger this index value, the greater the deviation of the case from the average. Under usual circumstances, cases with anomaly index values of less than 1, or even 1.5, should not be regarded as anomalies, because the deviation is about the same as, or only slightly larger than, the average. However, cases with an index value above 2 are good anomaly candidates, because the deviation is at least twice the average.
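The sketch below illustrates the peer-group idea in a simplified form (it is not the exact algorithm behind the IBM node): records in a 2-D NumPy array X are clustered with k-means, each record's distance to its cluster center is compared with the cluster's average distance, and a ratio above 2 is flagged. The function name and threshold are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def anomaly_index(X, n_clusters=3, threshold=2.0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    centres = km.cluster_centers_[km.labels_]
    dist = np.linalg.norm(X - centres, axis=1)    # deviation of each record from its peer group
    cluster_mean = np.array([dist[km.labels_ == c].mean() for c in range(n_clusters)])
    index = dist / cluster_mean[km.labels_]       # ratio of deviation to the cluster average
    return index, index > threshold               # flag records at least twice the norm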

Anomaly detection is a form of exploratory analysis designed to rapidly identify unusual cases or records that are candidates for further study. These should be regarded as suspected anomalies which, on closer examination, may or may not turn out to be real. You may find that a record is perfectly valid but choose to screen it from the data for model-building purposes. Alternatively, if the algorithm repeatedly turns up false anomalies, this may point to an error or artifact in the data collection process.
Note that anomaly detection identifies unusual records or cases through cluster analysis based on the set of fields selected in the model, regardless of whether those fields are relevant to the pattern you are looking for. For this reason, you may want to use anomaly detection in combination with feature selection or another tool for screening and ranking fields. For example, feature selection can be used to identify the fields most relevant to a particular target, and then anomaly detection can be used to find the records that are most unusual with respect to those fields. (An alternative would be to build a decision tree model and then examine any misclassified records as potential anomalies, but that process would be harder to replicate or automate on a large scale.)
For instance, anomaly detection could be used to screen agricultural development grant applications for potential fraud, identifying records that deviate from the norm and are worth investigating further. Of particular interest are grant applications that seem to claim too much (or too little) money for the type and size of the farm.
Requirements: One or more input fields. Note that only fields with their role set to Input, using a source or Type node, can be used as inputs; target fields (role set to Target or Both) are ignored. Because an anomaly detection model flags cases that do not conform to normal behavior rather than to a known set of rules, it detects unusual cases even when they do not match any previously known pattern. Used in conjunction with feature selection, anomaly detection makes it possible to screen large volumes of data quickly and identify the records of greatest interest.
Why does a Brazilian bank give each of its 65 million clients personal attention?
Watson is IBM's AI, which integrates seamlessly into existing workflows and into the leading platforms and tools a business already uses. Putting AI to work means letting your employees focus on what they do best. Bradesco is one of Brazil's biggest banks, with over 5,200 branches. In a sector as competitive as banking, customers who do not have a pleasant experience might not remain customers for long. Bradesco therefore started to search for ways to speed up service and boost the level of personalization for each customer, and deployed IBM Watson across its services.
In five stages, how Watson learned:
• A dedicated team trained Watson in Portuguese and in banking, using 10,000 customer questions.
• Watson was tested in a small number of branches until the bank was pleased with its responses.
• Watson was rolled out nationally to all 5,200 branches.
• Response times fell from 10 minutes to a few seconds as staff started to trust Watson.
• With feedback from over 10 million interactions, Watson continues to learn and improve.


Self evaluation: Exercise 3 IBM ICE (Innovation Centre for Education)


IBM Power Systems

• To continue with the training, after learning the various steps involved in pattern recognition and anomaly detection, you are instructed to apply those concepts to perform the following activity.

• You are instructed to complete the following exercise using Python code.

• Exercise 3: Simple linear regression.

Figure 1-21. Self evaluation: Exercise 3 PAD011.0

Notes:


Probability theory (1 of 2) IBM ICE (Innovation Centre for Education)


IBM Power Systems

Figure: Probability Theory


Source: https://images.app.goo.gl/DKudJzQZCPZEQyvt6

Figure 1-22. Probability theory (1 of 2) PAD011.0

Notes:
Probability theory: Probability describes how likely an event is to occur. Let us go through some probability terminology:
• Trial or experiment: An act whose outcome is uncertain, with each outcome occurring with a certain probability.
• Sample space: The collection of all possible outcomes of the experiment.
• Event: A non-empty subset of the sample space is referred to as an event.
In mathematical terminology, probability is a measure of how likely an event is to occur when an experiment is carried out.


Probability theory (2 of 2) IBM ICE (Innovation Centre for Education)


IBM Power Systems

• Sample space: 12. There are 12 marbles in total (4+5+1+2 = 12).

Probability = (Number of favorable outcomes) / (Total possible outcomes)

• P(black) = 2/12 = 1/6. There are 2 black marbles in the bag; 12 is your sample space.

• P(blue) = 4/12 = 1/3. There are 4 blue marbles in the bag; 12 is your sample space.

• P(blue or black) = 6/12 = 1/2. 4 blue + 2 black = 6; 12 is your sample space.

• P(not green) = 11/12. There is 1 green marble, so 12 - 1 = 11 marbles are not green; 12 is your sample
space.

• P(not purple) = 1

• Any marble selected will not be purple, because there are no purple marbles in the bag.
Whenever an outcome is certain to occur, its probability is 1.

Figure 1-23. Probability theory (2 of 2) PAD011.0

Notes:
Probability with marbles: In a bag there are four blue marbles, five red marbles, one green marble, and two black marbles. Suppose you randomly select one marble. Find each probability:
• P(black).
• P(blue).
• P(blue or black).
• P(not green).
• P(not purple).
The solutions are given above.

Maximum likelihood theory and estimation (1 of 2) IBM ICE (Innovation Centre for Education)
IBM Power Systems

• Density estimation is the problem of estimating the probability distribution for a sample of
observations from a problem domain.

• Two Important concepts:


– Probability density estimation problem.
– Maximum likelihood calculation.

Figure 1-24. Maximum likelihood theory and estimation (1 of 2) PAD011.0

Notes:
Maximum likelihood theory and estimation: Although estimating probabilities is a basic task throughout machine learning, there are many methods for tackling the estimation problem. Maximum likelihood estimation provides a mathematical framework for calculating the likelihood of observing a data set in terms of a chosen probability distribution and its parameters. The framework can then be used to search a space of potential distributions and parameters. This probabilistic framework forms the foundation of many machine learning algorithms, including important approaches to estimating numeric values and class labels, such as linear regression and logistic regression, as well as deep artificial neural networks.
Probability density estimation problem: A common modeling problem involves estimating a joint probability distribution for a data set. For example, given a sample of observations (X) from a domain (x1, x2, x3, ..., xn), each observation is drawn independently from the same probability distribution over the domain (so-called independent and identically distributed, i.i.d., or close to it). Density estimation consists of choosing a probability distribution function, and the parameters of that distribution, that best explain the observed data (X) as a joint probability distribution.
How do you choose the probability distribution function?
This question is complicated by the fact that the sample (X) drawn from the population is small and noisy, so any chosen density and its estimated parameters will contain some error. There are several techniques for solving the problem, but two common approaches are:
• Maximum a Posteriori probability (MAP).
• Maximum Likelihood Estimation (MLE).


Maximum likelihood theory and estimation (2 of 2) IBM ICE (Innovation Centre for Education)
IBM Power Systems

• Suppose that we are given a sequence (x1, ..., xn) of IID normally distributed random variables, and that
a normal prior distribution is given for their mean. We wish to find the MAP estimate of the mean. Note
that the normal distribution is its own conjugate prior, so we will be able to find a closed-form solution
analytically.

• The function to be maximized is the posterior, which is proportional to the prior density multiplied by the
likelihood of the observations.

• This is equivalent to minimizing a weighted sum of squared deviations from the prior mean and from the
observations.

• The resulting MAP estimator for the mean turns out to be a linear interpolation between the prior mean
and the sample mean, weighted by their respective variances. The limiting case of an infinitely wide prior
is called a non-informative prior and leads to an ill-defined a priori probability distribution.

Figure 1-25. Maximum likelihood theory and estimation (2 of 2) PAD011.0

Notes:
Maximum a posteriori probability (MAP)
In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is an estimate of an unknown
quantity that equals the mode of the posterior distribution. The MAP can be used to obtain a point estimate of
an unobserved quantity based on empirical data. It is closely related to the method of maximum likelihood
(ML) estimation but employs an augmented optimization objective which incorporates a prior distribution (that
quantifies the additional information available through prior knowledge of a related event) over the quantity
one wants to estimate. MAP estimation can therefore be a regularization of ML estimation.
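The formulas for the slide's worked example did not survive extraction; the following is a reconstruction of the standard normal-prior case it appears to describe, assuming observations x1, ..., xn drawn from N(mu, sigma_v^2) and a prior mu ~ N(mu_0, sigma_m^2):

mu_MAP = (sigma_m^2 * (x1 + ... + xn) + sigma_v^2 * mu_0) / (n * sigma_m^2 + sigma_v^2)

A two-line Python sketch of the same estimator (the function name and arguments are illustrative):

def map_estimate_mean(x, mu_0, sigma_m2, sigma_v2):
    # MAP estimate of a normal mean under a normal prior (conjugate case): a variance-weighted
    # interpolation between the prior mean mu_0 and the sample mean of the observations x.
    n = len(x)
    return (sigma_m2 * sum(x) + sigma_v2 * mu_0) / (n * sigma_m2 + sigma_v2)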
Maximum Likelihood Estimation (MLE): Maximum likelihood estimation, or MLE, is a method for estimating a probability density. The overall approach is to treat the problem as an optimization or search problem, in which we look for the set of parameters that best fits the joint likelihood of the observed data (X). First, a parameter vector, named theta, is defined; it determines both the choice of probability density function and the parameters of that distribution. It may be a vector of numeric values that can vary smoothly and represent different probability distributions and their parameters.
Suppose one wishes to determine just how biased an unfair coin is. Call the probability of tossing a ‘head’ p.
The goal then becomes to determine p.
Suppose the coin is tossed 80 times: I.e., the sample might be something like x1 = H, x2 = T, ..., x80 = T, and
the count of the number of heads "H" is observed.

The probability of tossing tails is 1 − p (so here p is θ above). Suppose the outcome is 49 heads and 31 tails,
and suppose the coin was taken from a box containing three coins: one which gives heads with probability p
= 1⁄3, one which gives heads with probability p = 1⁄2 and another which gives heads with probability p = 2⁄3.
The coins have lost their labels, so which one it was is unknown. Using maximum likelihood estimation, the
coin that has the largest likelihood can be found, given the data that were observed. By using the probability
mass function of the binomial distribution with sample size equal to 80 and number of successes equal to 49, but for
different values of p (the "probability of success"), the likelihood function (defined below) takes one of three
values:

The likelihood is maximized when p = 2⁄3, and so this is the maximum likelihood estimate for p.
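The three likelihood values referred to above can be computed directly; a small sketch using SciPy's binomial probability mass function is:

from scipy.stats import binom

# Likelihood of observing 49 heads in 80 tosses under each candidate coin.
for p in (1/3, 1/2, 2/3):
    print(p, binom.pmf(49, 80, p))
# The values come out to roughly 0.000, 0.012 and 0.054, so p = 2/3 maximizes the likelihood.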
Validation and testing
Validation is the process in which the model and its hyperparameters are tuned. Testing then checks the model on a new collection of data (that is, data not used for training, cross-validation, bootstrapping, or whatever other tuning approach you have used). This simulates the performance of the model on completely new data, which is the key property you want to measure.
Assessing models of regression
The core techniques for evaluating models of regression are:
• Mean absolute error.
• Median absolute error.
• (root) mean squared error.
• Coefficient of determination (R2).
Residuals: The difference between an observed value and the predicted value is the residual (ei).

It can be regarded as the vertical distance between the observed data point and the regression line. The best-fit line minimizes the sum of the squared residuals.

That is, it minimizes the mean squared error (MSE) between the line and the data.
Residual (error) variation
The residual variance measures how well the data points match the regression line.

The average of the squared residuals is the same as the mean squared error; however, you are more likely to see a version that divides by the degrees of freedom to make the estimator unbiased:

This means taking into account the degrees of freedom used (here the intercept and the slope, both of which must be estimated).
The square root of this variance is the root mean squared error (RMSE).
Coefficient of determination
The total variation is the residual variation (the variation left after the predictor is removed) plus the systematic/regression variation:

Where:
r = correlation coefficient.
x = values in the first set of data.
y = values in the second set of data.
n = total number of values.
For a line y = mx + b, the error at a point (xn, yn) is:

Intuitively, this is the difference between the observed value at xn and the value the line predicts at xn.
The squared error of the line is the sum of the squares of all these errors:

The best-fit line is the one that minimizes this squared error.
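The formulas referenced in these notes did not survive extraction; in the notation used above, the standard versions (a reconstruction, not copied from the course) are:

ei = yi - y_hat_i                                   (residual)
MSE = (1/n) * sum(ei^2)                             (divide by n - 2 for the unbiased version)
RMSE = sqrt(MSE)
R^2 = 1 - sum(ei^2) / sum((yi - y_bar)^2)           (coefficient of determination)
r = (n*sum(x*y) - sum(x)*sum(y)) / sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
Error at (xn, yn) for the line y = mx + b: en = yn - (m*xn + b)
Squared error of the line: SE = sum((yi - (m*xi + b))^2)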
Evaluating classification models
Significant quantities:


Self evaluation: Exercise 4 IBM ICE (Innovation Centre for Education)


IBM Power Systems

• To continue with the training, after learning the various steps involved in pattern recognition and anomaly detection, you are instructed to apply those concepts to perform the following activity.

• You are instructed to complete the following exercise using Python code.

• Exercise 4: Multiple linear regression.

Figure 1-26. Self evaluation: Exercise 4 PAD011.0

Notes:


Model selection (1 of 2) IBM ICE (Innovation Centre for Education)


IBM Power Systems

• The MDL (Minimum Description Length) statistic is calculated as follows:

MDL = L(h) + L(D | h)

• Where h is the model, D is the predictions made by the model, L(h) is the number of bits
required to represent the model, and L(D | h) is the number of bits required to represent the
predictions from the model on the training dataset.

• The score as defined above is minimized, i.e., the model with the lowest MDL is selected.

• The number of bits required to encode (D | h) and the number of bits required to encode (h)
can be calculated as the negative log-likelihood. For example:

MDL = -log(P(theta)) – log(P(y | X, theta))

• Or the negative log-likelihood of the model parameters (theta) and the negative log-likelihood
of the target values (y) given the input values (X) and the model parameters (theta).

Figure 1-27. Model selection (1 of 2) PAD011.0

Notes:
Model selection: We need to assess the performance of the candidate models and choose the best one based on various factors. We cannot simply pick the hypothesis that minimizes the cost function on the training data, because doing so can lead to overfitting. A good approach is to split the data into a training set and a test set (for example, a 70%/30% split). You then train your model on the training set and evaluate it on the test set to see how it performs.
You should also calculate a validation error, not only a test error. Validation is mostly done in order to tune hyperparameters: you do not tune them on the training set, because that may result in overfitting, and you should not tune them on the test set either, since that produces an overly optimistic estimate of generalization. Therefore, we keep a separate data set, the validation set, for tuning hyperparameters.
• If your model does not fit well, you can use these errors to determine what kind of problem you have.
• If your training error is high, and the validation/test error is also high, you have a high-bias (underfitting)
problem.
• If your training error is small but the validation/test error is large, you have a high-variance (overfitting)
problem.
• k-fold cross-validation (better for small datasets):
- The training set is divided into k folds.
- Iteratively train on k−1 folds and validate on the remaining fold.

- The average performance over the folds is reported. Leave-one-out cross-validation is the special case of k-fold cross-validation where k = n (n is the number of data points).
- Bootstrapping:
• New data sets are generated from the original dataset by sampling with replacement (uniformly at random).
• Train on the bootstrapped dataset and validate on the data points that were not selected.
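A minimal k-fold cross-validation sketch with scikit-learn is shown below; the synthetic data and the linear-regression model are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))   # R^2 on the held-out fold
print(np.mean(scores))   # average performance over the k folds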


Model selection (2 of 2) IBM ICE (Innovation Centre for Education)


IBM Power Systems

Figure: AU ROC
Source: https://images.app.goo.gl/iMXwj7jLYADko1uCA

Figure 1-28. Model selection (2 of 2) PAD011.0

Notes:
Area under the curve (AUC)
This metric applies to binary and multi-label classification. In binary classification you choose a cut-off above which a sample is assigned to one class and below which it is assigned to the other class. Depending on the cut-off you can achieve different results: there is a trade-off between the true positive and false positive rates.
You can draw the receiver operating characteristic (ROC) curve with P(TP) on its y-axis and P(FP) on its x-axis. Every point on the curve corresponds to one cut-off value. That is, the ROC curve shows the classifier's performance across the entire range of cut-offs, whereas other metrics (the F-score, for example) only report the result for a single cut-off. Because the ROC curve sweeps over all cut-off thresholds, displaying all thresholds at once gives a more complete and more truthful summary of how well the classifier works. It is also relatively insensitive to class imbalance in the data.
The area under the curve (AUC) is used to measure how effective the classification algorithm is. An AUC of more than 0.8 is commonly regarded as good. An AUC of 0.5 is equivalent to random guessing, i.e., a straight diagonal line.

Matrices of uncertainty (confusion matrices): This method is suitable for binary or multi-class classification. Evaluation is often presented as a classification confusion matrix. The core values are:
• True positives (TP): Positive samples that were classified as positive.
• True negatives (TN): Negative samples that were classified as negative.
• False positives (FP): Negative samples that were classified as positive.
• False negatives (FN): Positive samples that were classified as negative.
Some additional values:
• Positive predictive value (PPV): Like precision, but taking the prevalence into account. For a perfectly
balanced data set (equal positive and negative cases, prevalence 0.5), the PPV equals the precision.
• Null error rate: How often you would be wrong if you predicted every example to be positive.
This is a useful baseline against which to compare the classifier.
• F-score: The weighted average (harmonic mean) of precision and recall.
• Cohen's kappa: Gives a high score only if the classifier's accuracy differs greatly from the null error
rate.
Remember the convention that the rare class is labeled class 1 and the common class is labeled class 0; that is,
we seek to predict the uncommon class.
You may want to use precision and recall as your evaluation metrics instead.

Here 1T/0T indicate the true class and 1P/0P indicate the predicted class.
Precision is the number of true positives over the total number of samples predicted positive. That is, what fraction of the examples labeled positive actually are positive?

Recall is the number of true positives over the number of actual positives. In other words, what fraction of the positive examples in the data were found?

In the previous example, a naive classifier that never predicts the rare class would have a recall of 0.


There is a trade-off between precision and recall.
In logistic regression the default classification threshold is 0.5; above it, class 1 is predicted.

However, you may want to label an example as 1 only when you are very confident. You can then move the threshold up to 0.9 to make the classification stricter. In this scenario you become more precise, but recall drops, because some of the less certain positive examples no longer pass the threshold.
On the other hand, to avoid false negatives you may want to lower the threshold, in which case recall increases but precision decreases.
So what is the most efficient way to compare precision and recall values between algorithms? You can condense precision and recall into one metric: the F1 score (the harmonic mean of precision and recall, also known as the F-score):
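The formula image is not reproduced here; the standard definitions (a reconstruction, not copied from the slide) are:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 * (precision * recall) / (precision + recall)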

More data does not always help, but it generally does. Many algorithms perform much better as more and more data is collected, and some more complex (yet conceptually simple) approaches only become practical with more training data.
Here are some things to try if the algorithm is not performing well:
• Obtain more training examples (can help with high-variance problems).
• Try smaller feature sets (can help with high-variance problems).
• Try adding polynomial features (can help with high-bias problems).
• Try decreasing the regularization parameter (can help with high-bias problems).
• Try additional features (can help with high-bias problems).
• Try increasing the regularization parameter (can help with high-variance problems).

Matrices of uncertainty (confusion matrices) IBM ICE (Innovation Centre for Education)
IBM Power Systems

• A few other metrics are computed from these values:

• Accuracy: How often is the classifier correct?


• Misclassification rate (or "error rate"): How often is the classifier wrong?
• Recall (or "sensitivity" or "true positive rate"): How often are positive-labeled samples
predicted as positive?

• False positive rate: How often are negative-labeled samples predicted as positive?

• Specificity (or "true negative rate"): How often are negative-labeled samples predicted as
negative?

• Precision: How many of the predicted positive samples are correctly predicted?
• Prevalence: How many labeled-positive samples are there in the data?

Figure 1-29. Matrices of uncertainty (confusion matrices) PAD011.0

Notes:


Loss of logging (log-loss) IBM ICE (Innovation Centre for Education)


IBM Power Systems

Figure: Log Loss formula


Source: https://images.app.goo.gl/UpFaWENnNrSm935R9

Figure 1-30. Loss of logging (log-loss) PAD011.0

Notes:
Loss of logging (log-loss): This approach is suitable for binary, multi-class, and multi-label classification. Log-loss is an accuracy metric that can be used when the classifier outputs a probability rather than a class label. For instance, predicting class 1 with probability 0.51 when the true class is 1 is less "correct" than predicting class 1 with probability 0.95; the prediction is penalized according to its distance from the true label.
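The log-loss formula shown in the figure is not reproduced in the text; the standard binary form (a reconstruction) and a small NumPy sketch of it are:

log_loss = -(1/N) * sum( y_i * log(p_i) + (1 - y_i) * log(1 - p_i) )

import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)   # avoid log(0)
    y = np.asarray(y_true, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_loss([1, 0, 1], [0.95, 0.10, 0.51]))   # confident correct predictions give a lower loss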


Rate for F1 (F1 score) IBM ICE (Innovation Centre for Education)
IBM Power Systems

• The F1 score is the weighted average of precision and recall, also known as the balanced F-score or
F-measure.

Figure: F1 score

Figure 1-31. Rate for F1 (F1 score) PAD011.0

Notes:
Rate for F1 (F1 score): The best possible score is 1 and the worst is 0. It can be used for binary,
multi-class, and multi-label classification (the latter two use the weighted average of the per-class F1 scores).


Metric selection IBM ICE (Innovation Centre for Education)


IBM Power Systems

• Metric selection is more complex for imbalanced classes (strongly skewed data).

• For example, you have a dataset with only 0.5% of the data in category 1.

• You run your experiment and observe that 99.5 percent of the samples are correctly classified.

Figure 1-32. Metric selection PAD011.0

Notes:
Metric selection
A model can achieve this accuracy simply by exploiting the skew in the data and classifying every example as
category 0, so plain accuracy is a misleading metric here.

Hyperparameter selection (1 of 2) IBM ICE (Innovation Centre for Education)
IBM Power Systems

Figure: Hyperparameter selection


Source: https://images.app.goo.gl/o3DaxpbGLPVRxKnPA

Figure 1-33. Hyperparameter selection (1 of 2) PAD011.0

Notes:
Hyperparameter selection: Hyperparameter tuning is considered something of an art, because there is no exact, principled optimization procedure for it. There are, however, several automated methods, including:
• Grid search.
• Random search.
• Evolutionary algorithms.
• Bayesian optimization.
Grid search: Simply evaluate many different hyperparameter combinations to see which combination works best. In general, each hyperparameter is checked over a specific interval or scale appropriate to that parameter; this might be 10, 20, 30, ... or 1e-5, 1e-4, 1e-3, and so on. Grid search parallelizes easily, but it is brute force.
Random search: Surprisingly, sampling combinations at random from the whole grid works about as well as scanning the entire grid, in far less time.
Intuitively: suppose we want a hyperparameter combination in the top 5 percent of all combinations. A single random combination has a 5 percent probability of landing in that top 5 percent. To succeed 95% of the time, we need to try many random combinations. If we take n random combinations, the probability that none of them is in the top 5% is (1 - 0.05)^n, so the probability that at least one of them is in the top 5% is 1 - (1 - 0.05)^n. Requiring this to be at least 95% means setting 1 - (1 - 0.05)^n >= 0.95, which gives n of about 60. So roughly 60 random hyperparameter combinations are enough to have a 95% chance that at least one of them lies in the top 5% of all combinations.
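The required n can be checked with a couple of lines of Python (a sketch of the calculation above):

import math

# Smallest n with 1 - (1 - 0.05)**n >= 0.95, i.e. n >= log(0.05) / log(0.95).
n = math.ceil(math.log(1 - 0.95) / math.log(1 - 0.05))
print(n)   # 59, in line with the "roughly 60" rule of thumb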


Hyperparameter selection (2 of 2) IBM ICE (Innovation Centre for Education)


IBM Power Systems

• Bayesian hyperparameter optimization.

• There are two parts:

– Exploration: Evaluate the objective at hyperparameter settings whose outcome is most uncertain.
– Exploitation: Evaluate the objective at hyperparameter settings that are expected to give high value.

Figure 1-34. Hyperparameter selection (2 of 2) PAD011.0

Notes:
Bayesian hyperparameter optimization
We can use Bayesian optimization to choose good hyperparameters. We model the objective with a Gaussian process, treating the hyperparameter settings evaluated so far as observations used to update a posterior distribution. We then pick the next hyperparameters to try by maximizing an acquisition function built from that posterior, such as the expected improvement over the best result so far or a Gaussian-process upper confidence bound (UCB). In other words, we construct a utility function from the posterior model, and this is what tells us which hyperparameters to evaluate next.
Basic idea: Model the output of the algorithm as a smooth function of the hyperparameters in order to find the hyperparameters that maximize it. This is faster than grid search, because it infers where the ideal set of hyperparameters is likely to be rather than brute-force searching the entire space. One motivation is that evaluating a single hyperparameter sample can be very costly (for example, training a large neural network). We use a Gaussian process because its marginal and conditional distributions can be computed in closed form.

The problem with high dimensionality IBM ICE (Innovation Centre for Education)
IBM Power Systems

• The dimension of a problem refers to the number of input variables (actually, degrees of
freedom).

• The exponential increase in data required to densely populate space as the dimension
increases.

Figure: The Problem with High Dimensionality


Source: https://images.app.goo.gl/T6QLvXses34dSeXXA

Figure 1-35. The problem with high dimensionality PAD011.0

Notes:
The problem with high dimensionality
Choosing one particular machine learning algorithm, for example logistic regression or a support vector machine (SVM), is less useful than it might seem: it may be the best algorithm for a specific problem, yet its average performance can still be poor when the data is overfitted or underfitted. High dimensionality itself becomes a disadvantage when you try to analyze the data. With a higher-dimensional probability distribution, locating a good enveloping distribution is more complicated, because the acceptance probability diminishes with dimensionality.
Suppose you have a straight path 100 yards long and you have lost a penny somewhere along it. Finding it would not be that hard: you walk along the path, and it takes a couple of minutes. Now suppose you have a square 100 yards on each side and you have lost a penny somewhere on it. Finding it is already hard, like searching across two football fields stuck together; it could take days. Now make it a cube 100 yards on a side: that is like searching a 30-story building the size of a football stadium. As you add more dimensions, the challenge of searching through the space gets harder and harder. You do not notice this intuitively, because it only shows up in the quantitative formulas, where all the sides still have the same length. This is the curse of dimensionality. The name has stuck because the problem is clumsy to handle, important, and yet easy to state.


Information theory IBM ICE (Innovation Centre for Education)


IBM Power Systems

Figure: Information theory Formulas


Source: https://images.app.goo.gl/DnbEL1fyXk23ory49

Figure 1-36. Information theory PAD011.0

Notes:
Information theory: Information theory is a subfield of mathematics concerned with communicating data over a noisy channel. A cornerstone of information theory is quantifying exactly how much information a message contains. More generally, the information in an event, and in a random variable, can be quantified using probability, leading to the notion of entropy. Calculating information and entropy is an important tool in machine learning and serves as the basis for techniques such as feature selection, building decision trees, and, more broadly, fitting classification models. A machine learning practitioner therefore needs a good understanding of information and entropy.
Calculate the information for an event: Quantifying information is the foundation of the field of information theory. The intuition behind quantifying information is to measure how much surprise there is in an event. Events that are rare (low probability) are more surprising and carry more information than events that are common (high probability).
• Low probability event: High information (surprising).
• High probability event: Low information (unsurprising).
We can see the expected trend: low-probability events are more surprising and carry more information, while high-probability events carry less. We can also see that this relationship is not linear but sub-linear, which makes sense when the information is expressed using the logarithm.
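In the notation usually used for these quantities (a reconstruction of the formulas in the figure, not copied from it), the information of an event and the entropy of a random variable are:

information(x) = -log2(P(x))
entropy H(X) = -sum over x of P(x) * log2(P(x))

A two-line check in Python:

import math

print(-math.log2(0.1), -math.log2(0.9))   # the rarer event (p = 0.1) carries more information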


Self evaluation: Exercise 5 IBM ICE (Innovation Centre for Education)


IBM Power Systems

• To continue with the training, after learning the various steps involved in pattern recognition and anomaly detection, you are instructed to apply those concepts to perform the following activity.

• You are instructed to complete the following exercise using Python code.

• Exercise 5: Logistic regression model.

Figure 1-37. Self evaluation: Exercise 5 PAD011.0

Notes:


Checkpoint (1 of 2) IBM ICE (Innovation Centre for Education)


IBM Power Systems
Multiple choice questions:

1. The recalled output in pattern association problem depends on?


a) Nature of input-output
b) Design of network
c) Both input & design
d) None of the mentioned

2. What is the objective of feature maps?


a) To capture the features in space of input patterns
b) To capture just the input patterns
c) Update weights
d) To capture output patterns

3. Use of nonlinear units in the feedback layer of competitive network leads to concept of?
a) Feature mapping
b) Pattern storage
c) Pattern classification
d) None of the mentioned

Figure 1-38. Checkpoint (1 of 2) PAD011.0

Notes:
Write your answers here:
1.
2.
3.


Checkpoint (2 of 2) IBM ICE (Innovation Centre for Education)


IBM Power Systems

Fill in the blanks:

1. __________learning is involved in pattern clustering task.


2. If the weight matrix stores the given patterns, then the network becomes _________.
3. Activation models are _______.
4. Information theory is used in __________ detection.

True or False:

1. From given input-output pairs pattern recognition model should capture characteristics of
the system? True/False
2. Can system be both interpolative & accretive at same time? True/False
3. Does pattern classification belong to category of non-supervised learning? True/False

Figure 1-39. Checkpoint (2 of 2) PAD011.0

Notes:
Write your answers here:
Fill in the blanks:
1.
2.
3.
4.
True or false:
1.
2.
3.


Question bank IBM ICE (Innovation Centre for Education)


IBM Power Systems
Two mark questions:
1. What is pattern detection?
2. What is information theory?
3. What is linear regression model?
4. What is the math formula for curve designing?

Four mark questions:


1. What is the difference between pattern and anomaly detection?
2. What is polynomial curve fitting?
3. Describe high dimensionality problems.
4. Describe information theory components.

Eight mark questions:


1. Explain model selection techniques.
2. Explain probability theory in details.

Figure 1-40. Question bank PAD011.0

Notes:


Unit summary IBM ICE (Innovation Centre for Education)


IBM Power Systems

After completing this unit, you should be able to:

• Understand the concept of pattern recognition and anomaly detection

• Gain knowledge on example of polynomial curve fitting

• Learn about probability theory architecture and working model

• Understand Information theory

Figure 1-41. Unit summary PAD011.0

Notes:
Unit summary is as stated above.
