
Pattern Recognition and Machine Learning A

Advanced Pattern Recognition & Machine Learning

郭玉柱
yuzhuguo@buaa.edu.cn

School of Automation Science and Electrical Engineering
Sunday, March 31, 2024
Start from ChatGPT

RLHF (Reinforcement Learning from Human Feedback)
Sora

Embodied AI
Minds live in bodies, and bodies move through a changing world. The goal of embodied
artificial intelligence is to create agents, such as robots, that learn to creatively solve
challenging tasks requiring interaction with the environment. Fantastic advances in deep
learning have enabled superhuman performance on a variety of AI tasks previously thought
intractable. Computer vision, speech recognition, and natural language processing have
experienced transformative revolutions at passive input-output tasks like language translation
and image processing, and reinforcement learning has similarly achieved world-class
performance at interactive tasks like games. These advances have supercharged embodied AI agents,
which can:
• See: perceive their environment through vision or other senses.
• Talk: hold a natural language dialog grounded in their environment.
• Listen: understand and react to audio input anywhere in a scene.
• Act: navigate and interact with their environment to accomplish goals.
• Reason: consider and plan for the long-term consequences of their actions.

AI4Science

Simpson’s Paradox



Correlation vs Causation

Monthly ice cream production in the United States and drowning deaths in Florida.
Out-of-distribution
Catastrophic Forgetting
Adversarial Attack
Small Data Learning

Simple network
Small data training
Adaptive

Big Data is all about finding correlations, but Small Data is all about finding the causation, the reason why.

"If one takes the top 100 biggest innovations of our time, perhaps around 60% to 65% are really based on Small Data."
– Martin Lindstrom, Small Data: The Tiny Clues that Uncover Huge Trends
AGI

"Honeybees are excellent navigators and explorers, using vision extensively in these tasks, despite having a brain of only one million neurons."

The team aims to create the first flying robot able to sense and act as autonomously as a bee, rather than just carry out a pre-programmed set of instructions.
Probabilistic AI

1. Probabilistic Computing
2. Third wave of AI
3. Ex: Driving a car
4. Role in Explainable AI (XAI)
5. Role of probability in machine learning
Role of Probability in AI

Prediction → Inference
• Probabilistic computing allows us to
1. Deal with uncertainty in natural data around us
2. Predict events in the world with an understanding
of data and model uncertainty
• Predicting what will happen next in a scenario, as well as the effects of our actions, can only be done if we know how to model the world around us with probability distributions
Role in XAI

• Augmenting deep learning with probabilistic methods opens the door to understanding why AI systems make the decisions they make.
• This will help with issues like tackling bias in AI systems.
• Research into probabilistic computing is really
about establishing a new way to evaluate the
performance of the next wave of AI — one
that requires real-time assessment of “noisy”
data.
Current AI Models

Four model families, from left to right (shaded boxes indicate components that can learn from data):
• Rule-based systems (pre-programmed logic): input → hand-designed program → output
• Classic machine learning: input → hand-designed features → mapping from features → output
• Representation learning (sense and perceive): input → learned features → mapping from features → output
• Deep learning: input → simple features → additional layers of more abstract features → mapping from features → output


Next step for AI

• First AI systems focused on logic:
  – Pre-programmed rules.
• The second wave of AI concerns the ability to sense and perceive information:
  – Leveraging neural networks to learn over time.
• But neither solution can do things that human beings do naturally as we navigate the world:
  – They can't think through multiple potential scenarios based on data that you have on hand while staying conscious of potential data that you don't have.
Driving a Car and Soccer Ball

• If you are driving a car and see a soccer ball roll into the street,
• your immediate and natural reaction is to stop the car, since we can assume a child is running after the ball and isn't far behind.
Role of Probabilistic System

• The driver reaches the decision to stop the car based on experience of natural data and assumptions about human behavior.
  – But a traditional computer likely wouldn't reach the same conclusion in real time, because today's systems are not programmed to mine noisy data efficiently and to make decisions based on environmental awareness.
  – You would want a probabilistic system calling the shots, one that could quickly assess the situation and act (stop the car) immediately.
Explainable AI

Why did you do that?
Why not something else?
When do you succeed?
When do you fail?
When can I trust you?
How do I correct an error?

Anecdote: Medical AI. Decisions can be worse with AI, e.g., patient discharge to a nursing home.
Role of Probability in ML
• In neural networks (discriminative models)
1. Output is a probability distribution over y
2. Instead of error as loss function we use a
surrogate loss function, viz., log-likelihood, so that
it is differentiable (which is necessary for gradient
descent)
• In probabilistic AI (generative models)
– We learn a distribution over observed and latent
variables whose parameters are determined by
gradient descent as well
$$p(x;\theta) = \frac{1}{Z(\theta)}\,\tilde{p}(x,\theta), \qquad Z(\theta) = \sum_x \tilde{p}(x,\theta)$$
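A minimal sketch (synthetic numbers, not from the lecture) of the first point above: the negative log-likelihood as a differentiable surrogate loss over a predicted distribution:

```python
import numpy as np

# Negative log-likelihood of the true labels under softmax(logits):
# a differentiable surrogate for classification error.
def nll_loss(logits, y):
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

logits = np.array([[2.0, 0.5], [0.1, 1.5]])               # two samples, two classes
y = np.array([0, 1])                                      # true labels
print(nll_loss(logits, y))                                # small: predictions agree
```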

Introduction
What is Artificial Intelligence?

A brief history of AI

Course logistics



Outline

What is Artificial Intelligence?

A brief history of AI

Course logistics



What is “AI”?



Some classic definitions

Building computers that...

Think like a human
- Cognitive science / neuroscience
- Can't there be intelligence without humans?

Think rationally
- Logic and automated reasoning
- But not all problems can be solved just by reasoning

Act like a human
- Turing test; ELIZA, Loebner prize
- "What is 1228 x 5873?" … "I don't know, I'm just a human"

Act rationally
- Basis for the intelligent-agents framework
- Unclear if this captures the current scope of AI research



Symbolicism AI

a.k.a “classical AI,” “rule-based AI,” and “good old-fashioned AI.”

Symbolic AI involves the explicit embedding of human knowledge and behavior rules into computer programs. The practice showed a lot of promise in the early decades of AI research. But in recent years, as neural networks, also known as connectionist AI, gained traction, symbolic AI has fallen by the wayside.

Symbolic AI programs are based on creating explicit structures and behavior rules.

An example of symbolic AI tools is object-oriented programming.



Connectionism AI
What is connectionism?
Connectionism is based on the idea that the brain is made up of a large number of simple
processing units, or neurons, that are interconnected. These neurons are able to learn by
adjusting the strength of the connections between them.
What are the benefits of connectionism?
Flexible, scalable, and effective at learning from data.
What are the limitations of connectionism?
• Critics argue it is too simplistic and does not capture the true nature of intelligence.
• It can be too reliant on data and unable to generalize well to new situations.
Connectionism is well suited to problems where data is abundant and fast, accurate predictions are needed; it struggles with tasks that require reasoning and is difficult to interpret.
Caenorhabditis elegans, a much-studied worm, has
approximately 300 neurons whose pattern of interconnections
is perfectly known. Yet connectionist models have failed to
mimic even this worm.



Actionism AI (Dynamic)

Reinforcement learning is a type of machine learning that is concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The agent learns by interacting with its environment, and through trial and error discovers which actions yield the most reward.

Reinforcement learning is an important area of machine learning because it is able to deal with problems that are too difficult for traditional supervised learning methods. Additionally, reinforcement learning can be used to solve problems that do not have a clear set of training data, as is the case with many real-world problems.



Outline

What is Artificial Intelligence?

A brief history of AI

Course logistics



(Some) history of AI



Prehistory (400 B.C. –)

Philosophy: mind/body dualism, materialism

Mathematics: logic, probability, decision theory, game theory

Cognitive psychology

Computer engineering

[Image: Aristotle]



Birth of AI (1943 – 1956)

1943 – McCulloch and Pitts: simple neural networks

1950 – Turing test

1955-56 – Newell and Simon: Logic Theorist

1956 – Dartmouth workshop, organized by John McCarthy, Marvin Minsky, Nathaniel Rochester, Claude Shannon

"The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. … We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer."
Early successes (1950s – 1960s)

1952 – Arthur Samuel develops a checkers program that learns via self-play

1958 – McCarthy: LISP, advice taker, time sharing

1958 – Rosenblatt's Perceptron algorithm learns to recognize letters

1968-72 – Shakey the robot

1971-74 – Blocksworld planning and reasoning domain



First “AI Winter” (Later 1970s)

Many early promises of AI fall short

1969 – Minsky and Papert's "Perceptrons" book shows that a single-layer neural network cannot represent the XOR function

1973 – Lighthill report effectively ends AI funding in the U.K.

1970s – DARPA cuts funding for several AI projects



Expert systems(1970s – 1980s)

Move towards encoding domain expert knowledge as logical rules

1971-74 – Feigenbaum's DENDRAL (molecular structure prediction) and MYCIN (medical diagnosis)

1981 – Japan's "fifth generation" computer project: intelligent computers running Prolog

1982 – R1, expert system for configuring computer orders, deployed at DEC



Second “AI Winter” (Late 1980s – Early 1990s)
As with past AI methods, expert systems fail to deliver on promises

Complexity of expert systems made them difficult to develop/maintain

1987 – DARPA cuts AI funding for expert systems

1991 – Japan's 5th generation project fails to meet goals



Splintering of AI (1980s – 2000s)

[Diagram: a simple neural network with input, hidden, and output layers]

Much of AI focus shifts to subfields: machine learning, multiagent systems, computer vision, natural language processing, robotics, etc.

1986 – Backpropagation for training neural networks popularized by Rumelhart, Hinton, and Williams (amongst many others)

1988 – Judea Pearl's work on Bayesian networks

1995 – NavLab5 automobile drives across the country, steering itself 98% of the time



Focus on applications
(1990s – Early 2010s)

Meanwhile, AI (sometimes under a subfield) achieves some notable milestones

1997 – Deep Blue beats Garry Kasparov

2005, 2007 – Stanford and CMU respectively win the DARPA grand challenges in autonomous driving

2000s – Ad placement and prediction for internet companies becomes largely AI-based

2011 – IBM's Watson defeats human Jeopardy! opponents



“AI” Renaissance (2010s – ??)

"AI" is a buzzword again; Google, Facebook, Apple, Amazon, Microsoft, etc., all have large "AI labs"

2012 – Deep neural network wins image classification contest

2013 – Superhuman performance on most Atari games via a single RL algorithm

2016 – DeepMind's AlphaGo beats one of the top human Go players

2017 – CMU's Libratus defeats top pro players at No-limit Texas Hold'em



AI is all around us

Face detection
Personal assistants
Machine translation
Logistics planning



A broader definition

Artificial intelligence is the development and study of computer systems to address problems typically associated with some form of intelligence



Turing Test




The Chinese Room



Deep Learning



AI Safety



AI Ethics



Singularity



Some parting thoughts

"Computers in the future may have only 1,000 vacuum tubes and weigh only 1.5 tons."
– Popular Mechanics, 1949

"Machines will be capable, within twenty years, of doing any work a man can do."
– Herbert Simon, 1965



Outline

What is Artificial Intelligence?

A brief history of AI

Course logistics



Organization of course

• Undergrad AI: broad introduction to a wide range of topics
• Grad AI: more focused on a few topics, leaving out others

The goal of this course is to introduce you to some of the topics and techniques that are at the forefront of modern AI research:
• Probabilistic reasoning
• Graphical models
• Feature engineering
• Learning theory
• Model assessment and selection
• Machine learning and deep learning

Grading

Students taking this course should have experience with: mathematical proofs, linear algebra, calculus, probability, Matlab/Python programming

Grading breakdown for the course:
10% class performance
30% project
60% exams



Academic integrity

Homework/project policy:
• You may discuss homework problems with other
students, but you need to specify all students you
discuss with in your writeup
• Your writeup and code must be written entirely
on your own, without reference to notes that
you took during any group discussion

All code and written material that you submit must be entirely your own unless specifically cited (in quotes for text, or within a comment block for code) from third-party sources



Pattern Recognition References

• Pattern Recognition and Machine Learning by Christopher M. Bishop
• Probabilistic Machine Learning: An Introduction by Kevin P. Murphy
• Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Why Learn Learning?



Motivation

• "We are drowning in information, but we are starved for knowledge"
– John Naisbitt, Megatrends

• Data = raw information
• Knowledge = patterns or models behind the data
Solution: Machine Learning

• Hypothesis: pre-existing data repositories contain a lot of potentially valuable knowledge
• Mission of learning: find it
• Definition of learning: (semi-)automatic extraction of valid, novel, useful and comprehensible knowledge – in the form of rules, regularities, patterns, constraints or models – from arbitrary sets of data



Applications of ML are Deep and Prevalent
• Online ad selection and placement
• Risk management in finance, insurance, security
• High-frequency trading
• Medical diagnosis
• Mining and natural resources
• Malware analysis
• Drug discovery
• Search engines

Draws on Many Disciplines

• Artificial Intelligence
• Statistics
• Continuous Optimisation
• Databases
• Information Retrieval
• Communications/Information Theory
• Signal Processing
• Computer Science Theory
• Philosophy
• Psychology and Neurobiology



Terminology

• Input to a machine learning system can consist of:
 Instance: measurements about individual entities/objects, e.g., a loan application
 Attribute (aka feature, explanatory variable): a component of the instances, e.g., the applicant's salary, number of dependents, etc.
 Label (aka response, dependent variable): an outcome that is categorical, numeric, etc., e.g., forfeit vs. paid off
 Example: an instance coupled with a label, e.g., <(100k, 3), "forfeit">, as in the sketch below
 Model: a discovered relationship between attributes and/or the label
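A minimal sketch (the names are illustrative, not from the course) of this terminology in code:

```python
from typing import NamedTuple

class LoanInstance(NamedTuple):        # instance: attributes of one applicant
    salary: float                      # attribute 1
    num_dependents: int                # attribute 2

Example = tuple[LoanInstance, str]     # example: instance coupled with a label

ex: Example = (LoanInstance(salary=100_000, num_dependents=3), "forfeit")
print(ex)
```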
Human Perception

• Humans have developed highly sophisticated skills for sensing their environment and taking actions according to what they observe, e.g.,
– Recognizing a face.
– Understanding spoken words.
– Reading handwriting.
– Distinguishing fresh food from its smell.
– ...



Why is Pattern Recognition important?

Kurzweil describes a series of thought experiments which suggest to him that the brain contains a hierarchy of pattern recognizers. Based on this he introduces his Pattern Recognition Theory of Mind. He says the neocortex contains 300 million very general pattern recognition circuits and argues that they are responsible for most aspects of human thought.



Pattern Recognition (PR)

• Pattern Recognition is the study of how machines can:
  – observe the environment,
  – learn to distinguish patterns of interest,
  – make sound and reasonable decisions about the categories of the patterns.



What is a Pattern

• What is a Pattern?
  – an abstraction, represented by a set of measurements describing a "physical" object
• Many types of patterns exist:
  – visual, temporal, sonic, logical, ...



What is a Pattern

"A pattern is the opposite of a chaos; it is an entity vaguely defined, that could be given a name." (Watanabe)



Pattern Recognition (PR)

• What is a Pattern Class (or category)?
  – a set of patterns sharing common attributes
  – a collection of "similar", not necessarily identical, objects
  – during recognition, given objects are assigned to a prescribed class



Recognition

Identification of a pattern as a member of a category (class) we already know, or we are familiar with:
 Classification (known categories)
 Clustering (learning categories)

[Figure: samples assigned to Category "A" and Category "B", illustrating classification vs. clustering]


Pattern Recognition (PR)

• No single theory of Pattern Recognition can possibly cope with such a broad range of problems...
• However, there are several standard models, including:
  – Statistical or fuzzy pattern recognition
  – Syntactic or structural pattern recognition
  – Knowledge-based pattern recognition



PR Systems
Pattern recognition systems have four major components:
 data acquisition and collection
 feature extraction and representation
 similarity detection and pattern classifier design
 performance evaluation



Pattern Recognition

• Two-phase process:
1. Training/Learning
   Learning is hard and time consuming
   The system must be exposed to several examples of each class
   Creates a "model" for each class
2. Detecting/Classifying
   Once learned, classification becomes natural


Methodology



Development of PR

 1929: Tauschek invented the first OCR (Optical Character Recognition) machine, called the Reading Machine, which could read the digits 0-9.

 1930s: Fisher proposed the theory of statistical classification, which was the foundation for statistical pattern recognition.

 1950s: Noam Chomsky proposed formal language theory; King-Sun Fu proposed the syntactic/structural pattern recognition theory.

Development of PR

 1960s: L. A. Zadeh proposed fuzzy set theory; fuzzy pattern recognition methods were developed and applied.

 1980s: The neural network models represented by the Hopfield network and the BP network led to the revival of artificial neural networks, widely used in pattern recognition.

 1990s: Small-sample-size learning and the Support Vector Machine (SVM) attracted much attention.

 2006: Deep Learning
Applications



Main Contents



Mathematical Foundations



Main Contents



From Evidence-Based Medicine
to Personalised Precision
Medicine

 Complex physiological and pathological processes
 Data-driven machine learning
 Personalised in silico medicine



Prediction of risk of hip fracture

Clinical gold standard: Body + Organ + Tissue
 Misses 30-50% of hip fractures

Personalised FE simulation + SVM
 AUC increased from 50% to 92%
Multi-Modal Fusion

Integrated intelligent sensing technology for detection – diagnosis – intervention:
EEG, skin conductance, EMG, inertial sensors, muscle electrical stimulation, plantar pressure distribution

Improve patients' lives; enable unattended nursing care and relieve the pressure on healthcare
FOG Detection with Wearable Sensors








[Figure: time-frequency analyses of the sensor signal: STFT, CWT, TV-ARMA with RLS, and TV-ARMA with LROFR]

Identify the nonlinear function with an LSTM:

$$y(k) = f_{\mathrm{lstm}}\big(x(k),\, x(k-1),\, \dots,\, x(k-n_u),\, e(k)\big)$$
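A minimal sketch of such an LSTM mapping in PyTorch; the layer sizes, window length, and channel count are assumptions for illustration, not from the slides:

```python
import torch
import torch.nn as nn

# Map a window of past sensor inputs x(k), ..., x(k - nu) to a score y(k).
class FOGDetector(nn.Module):
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)     # binary FOG / no-FOG score

    def forward(self, x):                    # x: (batch, window, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])      # use the last time step

model = FOGDetector(n_features=6)            # e.g., 6 sensor channels
window = torch.randn(4, 50, 6)               # 4 windows of 50 samples each
print(model(window).shape)                   # torch.Size([4, 1])
```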



Computer Aided Diagnosis
LPRD (laryngopharyngeal reflux disease)
Sensitivity: 24.1% → 82.7%
Deployed at the PLA 306 Hospital; applied in more than 1,500 clinical LPR diagnoses



MCI and AD brain disease

Complex networks
[Figure: average connectivity matrices for a normal control vs. an MCI patient]

Mechanism

• Playing an important role in analysing the trigger mechanism of MCI pathology
  – Providing more accurate tools for revealing brain activity
  – Providing a new approach for challenging brain research



Brain Computer Interface



Brain Mode Decomposition



Brain Inspired Intelligence

Science and Technology Innovation 2030 Major Project "New Generation Artificial Intelligence"
--- Theory and methods of human-in-the-loop hybrid-augmented intelligence



Important Conceptions in PR

Pattern, Feature, Algorithm, Model, Machine Learning, Optimization, Validation, Over-fitting, Regularization, Cross-Validation

Feature Selection – Feature Extraction
Classification – Regression
Supervised – Unsupervised – Reinforcement
Syntactic – Statistical – ANN
Generative – Discriminative
Linear – Nonlinear



A Case Study: Fish Classification

• Problem:
  – sort incoming fish on a conveyor belt according to species
  – assume only two classes exist: sea bass and salmon

[Photo: salmon vs. sea bass]



A Case Study: Fish Classification

• What kind of information can distinguish one species from the other?
  – length, width, weight, number and shape of fins, tail shape, etc.
• What can cause problems during sensing?
  – lighting conditions, position of fish on the conveyor belt, camera noise, etc.
• What are the steps in the process?
  1. Capture image. (Pre-processing)
  2. Isolate fish.
  3. Take measurements. (Feature extraction)
  4. Make decision. (Classification: "sea bass" vs. "salmon")


A Case Study: Fish Classification

• Selecting Features
  – Assume a fisherman told us that a sea bass is generally longer than a salmon.
  – We can use length as a feature and decide between sea bass and salmon according to a threshold on length.
  – How can we choose this threshold? A sketch of such a threshold rule follows below.

[Figure: histograms of the length feature for the two types of fish in the training samples. How can we choose the threshold to make a reliable decision? Even though sea bass are longer than salmon on average, there are many examples of fish where this observation does not hold...]
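A minimal sketch (synthetic lengths, not the textbook's data) that picks the threshold minimizing training error:

```python
import numpy as np

rng = np.random.default_rng(0)
salmon_len = rng.normal(30, 4, 200)        # hypothetical lengths (cm)
seabass_len = rng.normal(40, 4, 200)

lengths = np.concatenate([salmon_len, seabass_len])
labels = np.array(["salmon"] * 200 + ["sea bass"] * 200)

def classify(length, theta):
    return "sea bass" if length > theta else "salmon"

# Scan candidate thresholds; keep the one with the lowest training error.
thetas = np.linspace(lengths.min(), lengths.max(), 200)
errors = [np.mean(np.array([classify(x, t) for x in lengths]) != labels)
          for t in thetas]
best = thetas[int(np.argmin(errors))]
print(f"best threshold ~ {best:.1f} cm, training error {min(errors):.2%}")
```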



A Case Study: Fish Classification

• Selecting Features
  – Let's try another feature and see if we get better discrimination:
  ➡ Average lightness of the fish scales

[Figure: histograms of the lightness feature for the two types of fish in the training samples. It looks easier to choose the threshold x, but we still cannot make a perfect decision.]


A Case Study: Fish Classification

• Multiple Features
  – Single features might not yield the best performance.
  – To improve recognition, we might have to use more than one feature at a time.
  – Combinations of features might yield better performance.
  – Assume we also observed that sea bass are typically wider than salmon.

x1: lightness, x2: width. Each fish image is now represented by a point in this 2D feature space.

[Figure: scatter plot of the lightness and width features for the training samples. We can draw a decision boundary to divide the feature space into two regions. Does it look better than using only lightness?]
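A minimal sketch (synthetic data; the least-squares fit is one simple choice, not the method the slides prescribe) of a linear decision boundary in this 2D feature space:

```python
import numpy as np

rng = np.random.default_rng(1)
salmon = rng.normal([3.0, 4.0], 0.5, size=(100, 2))    # (lightness, width)
seabass = rng.normal([5.0, 6.0], 0.5, size=(100, 2))
X = np.vstack([salmon, seabass])
y = np.array([-1.0] * 100 + [+1.0] * 100)               # -1 salmon, +1 sea bass

# Fit w so that sign(w . [x, 1]) separates the classes.
Xb = np.hstack([X, np.ones((len(X), 1))])               # append bias column
w = np.linalg.lstsq(Xb, y, rcond=None)[0]
pred = np.sign(Xb @ w)
print(f"training accuracy: {np.mean(pred == y):.2%}")
```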



A Case Study: Fish Classification

• Designing a Classifier
• Can we do better with another decision rule?
• More complex models result in more complex boundaries.

We may distinguish training samples perfectly, but how can we predict how well we will generalize to unknown samples? DANGER OF OVERFITTING: the classifier will fail to generalize to new data...


A Case Study: Fish Classification

• Designing a Classifier
• How can we manage the tradeoff between the complexity of decision rules and their performance on unknown samples?

Different criteria lead to different decision boundaries



Architecture of a BCI



Example Calibration Problem

• Task: A person is presented with a sequence of 300 images (one every 2 seconds). Half of the images are exciting, the other half are not. One channel of EEG (at the Cz location) is recorded.
• Question: How to design a BCI that can determine whether a person is shown an exciting or a non-exciting image?
• Approach: For each trial k, cut out an epoch Xk of 1 s length, extract a short vector of features fk, and assign a label yk in {E, NE}. Use machine learning to find an optimal statistical mapping from fk onto yk.
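A minimal sketch of this recipe on simulated data; the sampling rate and the feature set inside extract_features() are illustrative assumptions, not part of the lecture:

```python
import numpy as np

fs = 250                                   # assumed sampling rate (Hz)
eeg = np.random.randn(fs * 600)            # one fake Cz channel, 10 minutes
onsets = np.arange(300) * 2 * fs           # one stimulus every 2 s
labels = np.tile([1, 0], 150)              # 1 = exciting (E), 0 = NE

def extract_features(epoch):
    # placeholder feature vector: mean, std, peak amplitude
    return np.array([epoch.mean(), epoch.std(), epoch.max()])

# Cut a 1 s epoch per trial, extract features, pair with the label.
F = np.array([extract_features(eeg[t : t + fs]) for t in onsets])
y = labels
print(F.shape, y.shape)                    # (300, 3) (300,)
```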
Extracting Features of a Peak

• A supposed characteristic peak in a time window (relative to an event) could be characterized by three parameters:
   Latency
   Height
   Width
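A minimal sketch (with simple, assumed definitions of the three parameters) of extracting them from a 1 s window:

```python
import numpy as np

# Latency = time of the maximum; height = its amplitude;
# width = duration the signal stays above half the height.
def peak_features(x, fs):
    k = int(np.argmax(x))
    latency = k / fs                          # seconds after window start
    height = x[k]
    width = np.count_nonzero(x >= height / 2) / fs   # crude half-height width
    return np.array([latency, height, width])

fs = 250
t = np.arange(fs) / fs                        # a 1 s window
x = np.exp(-((t - 0.3) ** 2) / 0.005)         # synthetic peak at 300 ms
print(peak_features(x, fs))                   # ~[0.30, 1.0, 0.12]
```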



Resulting Feature Space

• Plotting the 3-element feature vectors for all exciting trials in red, and all non-exciting trials in green, we obtain two distributions in a 3D space:



ML with Feature Extraction

• Including the feature extraction, the analysis process is as follows:

[Diagram: raw trials S1, S2, … pass through the feature-extraction step (here: calc_peak()) to produce feature vectors f1, f2, …, which together with the labels form (X, y) for the training function that outputs the model parameters θ]
Pattern Class

 A collection of similar (not necessarily identical) objects
 A class is defined by class samples (exemplars, prototypes)
 Intra-class variability
 Inter-class similarity
 How to define similarity?



Intra-Class Variability

Handwritten numerals



Inter-class Similarity

Characters that look similar

Identical twins



Feature Extraction
• Designing a Feature Extractor
• Its design is problem specific (e.g. features to
extract from graphic objects may be quite different
from sound events...)
• The ideal feature extractor would produce the
same feature vector X for all patterns in the same
class, and different feature vectors for patterns in
different classes.
• In practice, different inputs to the feature extractor
will always produce different feature vectors, but
we hope that the within-class variability is small
relative to the between-class variability.
• Designing a good set of features is sometimes “more
of an art than a science”...
Curse of Dimensionality

Does adding more features always improve the results?
No! So we must:
 Avoid unreliable features.
 Be careful about correlations with existing features.
 Be careful about measurement costs.
 Be careful about noise in the measurements.

Is there some curse for working in very high dimensions?
YES THERE IS! ==> CURSE OF DIMENSIONALITY

➡ Rule of thumb: n >= d(d-1)/2, where
n = number of examples in the training dataset
d = number of features
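A quick, illustrative check of the rule of thumb: the required training-set size grows quadratically with d:

```python
# n >= d(d-1)/2 examples needed for d features (rule of thumb above).
def min_examples(d: int) -> int:
    return d * (d - 1) // 2

for d in (5, 10, 50, 100):
    print(f"d = {d:>3} features -> need n >= {min_examples(d)} examples")
# d =   5 -> 10,  d =  10 -> 45,  d =  50 -> 1225,  d = 100 -> 4950
```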



Inadequate Features

• Problem: Inadequate Features
  – If the features simply do not contain the information needed to separate the classes, it doesn't matter how much effort you put into designing the classifier.
  – Solution: go back and design better features.

[Figure: "good" features vs. "bad" features]



Correlated Features

– It often happens that two features that were meant to measure different characteristics are influenced by some common mechanism and tend to vary together.
  • E.g., the perimeter and the maximum width of a figure will both vary with scale; larger figures will have both larger perimeters and larger maximum widths.
– This degrades the performance of a classifier based on Euclidean distance to a template.
  • A pattern at the extreme of one class can be closer to the template for another class than to its own template. A similar problem occurs if features are badly scaled, for example, by measuring one feature in microns and another in kilometers.
– Solution: use other metrics (e.g., Mahalanobis...) or extract features known to be uncorrelated!
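A minimal sketch (synthetic data) of the Mahalanobis fix: rescaling by the covariance keeps correlated or badly scaled features from dominating the distance:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    d = x - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

rng = np.random.default_rng(2)
z = rng.normal(size=(500, 2))
A = np.array([[1.0, 0.0], [0.9, 0.4]])       # induce strong correlation
X = z @ A.T
mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)

x = np.array([2.0, 2.0])
print("Euclidean:  ", np.linalg.norm(x - mean))
print("Mahalanobis:", mahalanobis(x, mean, cov))
```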



Designing a Classifier

• Model selection:
– Domain dependence and prior information.
– Definition of design criteria.
– Parametric vs. non-parametric models.
– Handling of missing features.
– Computational complexity.
– Types of models: templates, decision-theoretic or
statistical, syntactic or structural, neural, and
hybrid.
– How can we know how close we are to the true
model underlying the patterns?





Curved Boundaries
– Linear boundaries produced by a minimum-Euclidean-distance classifier may not be flexible enough.
  • For example, if x1 is the perimeter and x2 is the area of a figure, x1 will grow linearly with scale, while x2 will grow quadratically. This will "warp" the feature space and prevent a linear discriminant function from performing well.
– Solutions:
  • Redesign the feature set (e.g., let x2 be the square root of the area)
  • Try using Mahalanobis distance, which can produce quadratic decision boundaries
  • Try using a neural network (beyond the scope of these notes; see Haykin)



Designing a Classifier

• Problem: Subclasses in the dataset
  – It frequently happens that the classes defined by the end user are not the "natural" classes...
  – Solution: use CLUSTERING.



Machine Learning

• How can a machine learn the rule from data?
  – Supervised learning: a teacher provides a category label or cost for each pattern in the training set. ➡ Classification
  – Unsupervised learning: the system forms clusters or natural groupings of the input patterns (based on some similarity criteria). ➡ Clustering
  – Reinforcement learning: no desired category is given, but the teacher provides feedback to the system, such as whether the decision is right or wrong.
Supervised Learning

• Supervised Training/Learning
  – a "teacher" provides labeled training sets, used to train a classifier

[Figure: two toy examples. (1) Learn about shape from a training set of triangles, then classify a new object: "It's a Triangle!" (2) Learn about color from training sets of blue and yellow objects, then classify a new circle: "It's Yellow!"]


Unsupervised Training/Learning

– No labeled training sets are provided
– The system applies a specified clustering/grouping criterion to an unlabeled dataset
– It clusters/groups together the "most similar" objects (according to the given criteria)

[Figure: an unlabeled training set is partitioned into groups; clustering criterion = some similarity measure]


Evaluating a Classifier

• Training Set
– used for training the classifier
• Testing Set
– examples not used for training
– avoids overfitting to the data
– tests generalization abilities of the trained classifiers
• Data sets are usually hard to obtain...
– Labeling examples is time and effort consuming
– Large labeled datasets usually not widely available
– Requirement of separate training and testing datasets
imposes higher difficulties...
– Use Cross-Validation techniques!
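A minimal sketch (synthetic data) of that last point, using k-fold cross-validation so every labeled example serves in both training and testing:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(clf, X, y, cv=5)    # 5 train/test splits
print(scores.round(2), "mean:", scores.mean().round(2))
```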



Evaluating a Classifier

• Costs of Error
  – We should also consider the costs of the different errors we make in our decisions. For example, if the fish packing company knows that:
    • Customers who buy salmon will object vigorously if they see sea bass in their cans.
    • Customers who buy sea bass will not be unhappy if they occasionally see some expensive salmon in their cans.
  • How does this knowledge affect our decision?
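A minimal sketch (the cost numbers are made up) of how such asymmetric costs change the decision: choose the class with the lowest expected cost rather than the highest posterior:

```python
import numpy as np

# cost[truth, action]: calling a sea bass "salmon" is very costly (10),
# calling a salmon "sea bass" is mildly costly (1).
cost = np.array([[0.0, 10.0],     # truth = sea bass: [say bass, say salmon]
                 [1.0,  0.0]])    # truth = salmon:   [say bass, say salmon]
p = np.array([0.4, 0.6])          # posterior: P(bass | x), P(salmon | x)

expected_cost = p @ cost          # expected cost of each action
print(expected_cost)              # [0.6, 4.0]
print("decide:", ["sea bass", "salmon"][int(np.argmin(expected_cost))])
```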



Evaluating a Classifier

• Confusion Matrix



Simple Classifiers

• Minimum-distance Classifiers
  – based on some specified "metric" ||x - m||
  – e.g., Template Matching


Simple Classifiers

• Template Matching

[Figure: class templates and several noisy examples]

  – To classify one of the noisy examples, simply compare it to the two templates. This can be done in a couple of equivalent ways:
    1. Count the number of agreements. Pick the class that has the maximum number of agreements. This is a maximum-correlation approach.
    2. Count the number of disagreements. Pick the class with the minimum number of disagreements. This is a minimum-error approach.
  • Works well when the variations within a class are due to "additive noise", and there are no other distortions of the characters -- translation, rotation, shearing, warping, expansion, contraction or occlusion.
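A minimal sketch of the maximum-correlation rule on made-up 3x3 binary templates:

```python
import numpy as np

templates = {
    "X": np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]]),
    "O": np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]]),
}

def classify(pattern):
    # count pixel agreements with each template; pick the maximum
    agreements = {c: int(np.sum(pattern == t)) for c, t in templates.items()}
    return max(agreements, key=agreements.get)

noisy_x = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 1]])   # one flipped pixel
print(classify(noisy_x))    # "X"
```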
Simple Classifiers

• Metrics: different ways of measuring distance:
  • Euclidean metric: $\|u\| = \sqrt{u_1^2 + u_2^2 + \cdots + u_d^2}$
  • Manhattan (or taxicab) metric: $\|u\| = |u_1| + |u_2| + \cdots + |u_d|$
  • Contours of constant...
    – Euclidean distance are circles (or spheres)
    – Manhattan distance are squares (or boxes)
    – Mahalanobis distance are ellipses (or ellipsoids)



Classifiers: Neural Networks



Gaussian Modeling

[Figure: a 2D Gaussian density p(x1, x2), shown as a 3D surface and as elliptical contours in the (x1, x2) plane]


Gaussian Mixture Models

• Use multiple Gaussians to model the data

[Figure: a Gaussian mixture density shown as a 3D surface and as contours with several components]
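A minimal sketch (synthetic data) of fitting a two-component mixture with scikit-learn:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1, (200, 2)),
               rng.normal([5, 5], 1, (200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_.round(1))      # close to [0, 0] and [5, 5]
print(gmm.predict(X[:3]))       # component assignments for a few points
```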



Classifiers: kNN

• k-Nearest Neighbours Classifier
  – a lazy classifier: no training is actually performed (hence, lazy ;-))
  – an example of instance-based learning

[Figure: training patterns from categories 1, 2 and 3 scattered in feature space, with a query pattern X to be classified. With k = 8, its neighbours contain four patterns of category 1, two of category 2, and two of category 3; the plurality are in category 1, so decide X is in category 1.]
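A minimal from-scratch sketch of the k-NN plurality vote on synthetic data:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=8):
    dists = np.linalg.norm(X_train - x, axis=1)       # distance to every pattern
    nearest = y_train[np.argsort(dists)[:k]]          # labels of the k closest
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]                  # plurality class

rng = np.random.default_rng(5)
X_train = np.vstack([rng.normal(c, 0.8, (20, 2))
                     for c in ([0, 0], [3, 0], [0, 3])])
y_train = np.repeat([1, 2, 3], 20)
print(knn_predict(X_train, y_train, np.array([0.5, 0.4])))   # likely 1
```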



Decision Trees

• Learn rules from data
• Apply each rule at each node
• Classification is at the leaves of the tree

[Figure: a decision tree testing x3, x2, x4, x1 at its internal nodes; the leaves implement f = x3x2 + x3x4x1]
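A minimal sketch (synthetic boolean data) of learning such rules with a scikit-learn tree and printing the induced rule set; the boolean target mirrors the function in the figure:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(6)
X = rng.integers(0, 2, size=(200, 4))           # boolean attributes x1..x4
x1, x2, x3, x4 = X.T
y = (x3 & x2) | (x3 & x4 & x1)                  # target: f = x3x2 + x3x4x1

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["x1", "x2", "x3", "x4"]))
```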



Clustering: k-means



Model Training



Q&A
