
Deep Continual Learning

Jawad Tariq

Supervised by: Dr. Mohsen Ali


Continual Learning: Motivation

• Deep neural networks are extremely powerful function approximators.


• Vision (Recognition, Synthesis)
• Language (Machine translation, language models)
• Complex structures (graph neural networks)

• They require offline training on huge datasets.


• If data comes as a stream, the standard recipe is to store and shuffle it,
• and then perform offline training.

• Typically, they are extremely proficient in very narrow tasks.


• We need to build more general AI models that are proficient at many different tasks.

Continual learning: algorithms that keep learning new tasks online, adding new knowledge to the
model without sacrificing previously acquired knowledge.



What is Continual Learning (CL)?
aka lifelong learning, incremental learning
• The ability of a model…
• … to learn continually from a stream of data.
• … to learn multiple tasks sequentially.

[Diagram comparing three settings:
• Multi-task learning: shared layers with task-specific layers for Task 1, Task 2, Task 3, all kept and trained jointly.
• Fine-tuning: pre-trained layers; the old task head is discarded and only the new task head is kept.
• Continual learning: shared layers; both the old task and the new task heads are kept.]


Tasks

• Usually, each task is in the form of a classification dataset.


[Figure: Incremental MNIST, Incremental SVHN, and Incremental CIFAR-10, each split into five sequential tasks T1–T5.]



Challenge: Catastrophic Forgetting

• First observed in McCloskey and Cohen, 1989, “Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem”:
New learning may interfere catastrophically with old learning when networks are trained sequentially. The analysis of
the causes of interference implies that at least some interference will occur whenever new learning may alter weights
involved in representing old learning, and the simulation results demonstrate only that interference is catastrophic in
some specific networks.

• “Catastrophic” means that the loss of old knowledge is sudden, rather than gradual.

• It is a fundamental problem introduced by the local nature of gradient-based optimization!



CL: more definitions
Task-incremental test setting:
“What digits are these, given that they all belong to task T4?”
The model estimates p(y | x, t).

Class-incremental test setting:
“What digits are these?”
The model estimates p(y | x).

[Diagram: multi-head architecture (a shared deep NN with a separate head for each of T1, T2, T3) vs. single-head architecture (a shared deep NN with a single head covering T1, T2, T3).]
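For illustration only, a minimal PyTorch sketch of the two test settings; the backbone, layer sizes, and the 5-task / 10-class split are assumptions, not taken from the slides:

```python
import torch
import torch.nn as nn

# Shared feature extractor (an assumed MLP for flattened 28x28 MNIST-like inputs).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())

# Multi-head (task-incremental): one head per task, selected with the task label t.
heads = nn.ModuleList([nn.Linear(256, 2) for _ in range(5)])  # 5 tasks, 2 classes each

def task_incremental_logits(x, t):
    return heads[t](backbone(x))        # models p(y | x, t)

# Single-head (class-incremental): one head over every class seen so far.
single_head = nn.Linear(256, 10)        # 10 classes in total

def class_incremental_logits(x):
    return single_head(backbone(x))     # models p(y | x)
```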
Continual learning: Timeline

[Timeline figure: methods from 2016 to 2020.]

DIFFERENT STRATEGIES
• Model growing
• Replay
• Knowledge Distillation
• Regularization
• Parameter isolation

…but the same goal: protect (either explicitly or implicitly) important parameters for prior tasks!
Continual Learning Desiderata

1. Avoid forgetting
• Performance over previous tasks should not decrease
2. Fixed memory and compute
• If not possible, grow sub-linearly with tasks
3. Enable forward transfer
• Knowledge acquired over previous tasks should help learning future tasks
4. Enable backward transfer
• While learning the current task, performance in previous tasks may also increase
5. Do not store examples
• Or store as few as possible



Related Works

• MODEL GROWING: increase the model capacity for every new task.
• KNOWLEDGE DISTILLATION: use the model in a previous training state as a teacher.
• REGULARIZATION: penalize (some) parameter variations.
• REHEARSAL: store old inputs and replay them to the model.
• HYBRID APPROACH: use multiple approaches at the same time to implement the algorithm.


Progressive Neural Networks
MODEL GROWING

• The model is organized in columns (sequences of fully connected layers).
• Starts with a single column for the first task.
• For a generic task K:
  • The weight matrices U of past columns are kept fixed.
  • Adapter functions connect the frozen columns to the new one.

Experiments were run on a sequence of reinforcement learning tasks.

Rusu, Andrei A., et al. "Progressive neural networks." arXiv preprint arXiv:1606.04671 (2016).
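A minimal sketch of how a new column could attach to frozen ones (class and argument names are assumptions; the paper also uses non-linear, dimensionality-reducing adapters, omitted here):

```python
import torch.nn as nn
import torch.nn.functional as F

class ColumnLayer(nn.Module):
    """One fully connected layer of a new PNN column (simplified sketch of
    Rusu et al., 2016). Lateral adapters (the U matrices) read the activations
    of previously trained, frozen columns."""
    def __init__(self, in_dim, hidden_dim, num_prev_columns):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden_dim)
        self.laterals = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_prev_columns)]
        )

    def forward(self, x, prev_activations):
        h = self.fc(x)
        # Previous columns are frozen: detach so no gradient flows into them.
        for adapter, h_prev in zip(self.laterals, prev_activations):
            h = h + adapter(h_prev.detach())
        return F.relu(h)
```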
Progressive Neural Networks MODEL GROWING

• Drawbacks:
  1. The model grows linearly with the number of trained tasks
  2. Need to know task labels at test time

1. Avoid forgetting
2. Fixed memory and compute
3. Enable forward transfer
4. Enable backward transfer
5. Do not store examples



Learning Without Forgetting
KNOWLEDGE DISTILLATION

• Multi-head architecture:
  • Each task has its own classification head.
  • The feature extractor is shared.

• During training of task t, apply a distillation objective for all past task heads: the model's current output for task i is matched against the output for task i recorded at the beginning of training on task t (a sketch follows below).

• In the meantime, optimize the new head with the true labels (standard cross-entropy objective).

Another example of a distillation-based model: E2EIL.

Li, Z., & Hoiem, D. (2017). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2935-2947.
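A minimal sketch of the distillation term described above; the temperature T, the T² scaling, and lambda_old are standard distillation choices and assumptions here, not values taken from the slides:

```python
import torch.nn.functional as F

def lwf_distillation_loss(current_logits, recorded_logits, T=2.0):
    """Distillation term of LwF (sketch): keep an old task head close to the
    outputs recorded before training on the new task started."""
    soft_targets = F.softmax(recorded_logits / T, dim=1)    # frozen "teacher" outputs
    log_probs = F.log_softmax(current_logits / T, dim=1)    # current model, same head
    return -(soft_targets * log_probs).sum(dim=1).mean() * (T * T)

# Total loss on task t (sketch):
#   cross_entropy(new_head_logits, labels)
#   + lambda_old * sum(lwf_distillation_loss(h_now, h_recorded) over all past heads)
```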



Incremental Classifier and Representation Learning (iCaRL) KNOWLEDGE DISTILLATION

• The classifier used is nearest-mean-of-exemplars.

  A nearest-mean-of-exemplars classifier is robust against changes in the data representation.

• Prioritized exemplar selection strategy.

  A subset of training samples (the exemplar set) from previous classes is stored.
  The size of the exemplar set is kept constant: as new classes arrive, some examples from old classes are removed.

• A representation learning step that uses the exemplars in combination with distillation to avoid catastrophic forgetting.

  A combination of a classification loss for new samples and a distillation loss for old samples is used.

Rebuffi, S. A., Kolesnikov, A., Sperl, G., & Lampert, C. H. (2017). iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2001-2010).
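A minimal sketch of the nearest-mean-of-exemplars rule; it assumes features are already extracted and L2-normalized, and the function and argument names are hypothetical:

```python
import torch

def nearest_mean_of_exemplars(features, exemplar_features_per_class):
    """iCaRL-style classification rule (sketch): assign each feature vector to the
    class whose exemplar mean is closest."""
    means = torch.stack([f.mean(dim=0) for f in exemplar_features_per_class])
    means = means / means.norm(dim=1, keepdim=True)   # normalize the class means
    distances = torch.cdist(features, means)          # Euclidean distance to every class mean
    return distances.argmin(dim=1)                    # predicted class indices
```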



Knowledge Distillation: Drawbacks KNOWLEDGE DISTILLATION

• Drawbacks:
  1. These methods store examples.

1. Avoid forgetting
2. Fixed memory and compute
3. Enable forward transfer
4. Enable backward transfer
5. Do not store examples



Elastic Weight Consolidation REGULARIZATION

• Bayesian perspective: treat the dataset and the shared weights as probability distributions.

  The posterior over the weights after the first task becomes the prior over the weights during the second task.

• Assuming the distribution over the weights is Gaussian, this translates to a quadratic penalty for changing each weight w.r.t. its value after task A, modulated by F, the Fisher Information Matrix.


Kirkpatrick, James, et al. "Overcoming catastrophic forgetting in neural networks." Proceedings of the national academy of sciences 114.13 (2017): 3521-3526.
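The penalty referred to above is, in the form given by Kirkpatrick et al., L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{A,i})². A minimal sketch of the penalty term (dictionary layout and names are assumptions):

```python
def ewc_penalty(model, fisher, params_after_task_A, lam):
    """EWC regularizer (sketch): moving a weight away from its value after task A
    is penalized in proportion to its (diagonal) Fisher information."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - params_after_task_A[name]) ** 2).sum()
    return 0.5 * lam * penalty
```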



Elastic Weight Consolidation: Variants
REGULARIZATION

Drawback of EWC: for each task, the latest MAP parameters and a Fisher information matrix need to be stored.

Online EWC
• Same objective as EWC, but:
  • Keeps only the latest parameters.
  • Keeps only one Fisher matrix, obtained by accumulating them over tasks.

Progress and Compress
• Alternates between 2 phases:
  1. PROGRESS: add a column and learn the new task.
  2. COMPRESS: distill the new column into the knowledge base, protecting parameters with EWC.

Schwarz, Jonathan, et al. "Progress & compress: A scalable framework for continual learning." International Conference on
Machine Learning (2018).
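A minimal sketch of the "one Fisher matrix accumulated over tasks" idea in Online EWC; the decay factor gamma and the dictionary layout are assumptions:

```python
def accumulate_fisher(running_fisher, task_fisher, gamma=0.95):
    """Online EWC keeps a single running Fisher estimate: decay the old one and
    add the Fisher computed on the task that just finished."""
    return {name: gamma * running_fisher[name] + task_fisher[name]
            for name in running_fisher}
```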
Regularization Approaches
REGULARIZATION

• Other variants have been proposed, changing the way in which penalties for
each parameter are computed
• Zenke, Friedemann, Ben Poole, and Surya Ganguli. "Continual learning through synaptic intelligence." International
Conference on Machine Learning (2017).
• Aljundi, Rahaf, et al. "Memory aware synapses: Learning what (not) to forget." Proceedings of the European
Conference on Computer Vision (ECCV). 2018.
• Chaudhry, Arslan, et al. "Riemannian walk for incremental learning: Understanding forgetting and
intransigence." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

• Drawbacks:
  1. Hard to scale to many tasks

1. Avoid forgetting
2. Fixed memory and compute
3. Enable forward transfer
4. Enable backward transfer
5. Do not store examples



Gradient Episodic Memory (GEM)
REHEARSAL

• Keep a buffer of examples for each task.

• When observing new examples from the stream, solve a constrained optimization problem: minimize the loss on the new examples without increasing the loss on the stored examples of any previous task.

• Under mild assumptions, the constraints can be rewritten as inner-product constraints between the current gradient and the gradient computed on each task's buffer.

• Solve with quadratic programming (QP); see the reconstruction below.


Lopez-Paz, David, and Marc'Aurelio Ranzato. "Gradient episodic memory for continual learning." Advances in neural information processing systems. 2017.
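The missing formulas can be reconstructed from Lopez-Paz & Ranzato (2017); they are not copied from the slide. Writing $g$ for the gradient of the current batch and $g_k$ for the gradient on the memory of task $k$, the rewritten constraints are $\langle g, g_k \rangle \ge 0$ for all $k < t$, and when violated the update is obtained from the projection

$$\min_{\tilde{g}} \; \tfrac{1}{2}\lVert g - \tilde{g} \rVert_2^2 \quad \text{s.t.} \quad \langle \tilde{g}, g_k \rangle \ge 0 \;\; \forall k < t,$$

which is the quadratic program solved (in its dual form) at every training step.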



Average GEM (A-GEM)
REHEARSAL

• Solving GEM’s constrained optimization for each batch is cumbersome:
  • One constraint per task, so training becomes extremely slow as more tasks are encountered.

• Chaudhry et al. (ICLR 2019) proposed an alternative gradient projection: a single constraint is built from the gradient of a random batch drawn from the memory buffer (one constraint only!), and if that constraint is violated, the current gradient is projected onto it (a sketch follows below).

Lopez-Paz, David, and Marc'Aurelio Ranzato. "Gradient episodic memory for continual learning." Advances in neural information processing systems. 2017.
Chaudhry, Arslan, et al. "Efficient lifelong learning with A-GEM." International Conference on Learning Representations (2019).
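A minimal sketch of the A-GEM projection on flattened gradients (tensor shapes and names are assumptions):

```python
import torch

def agem_project(g, g_ref):
    """A-GEM update rule (sketch): if the proposed gradient g conflicts with the
    gradient g_ref of a random memory batch, remove the conflicting component."""
    dot = torch.dot(g, g_ref)
    if dot < 0:                                    # constraint <g, g_ref> >= 0 violated
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref
    return g
```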



GEM, A-GEM
REHEARSAL

• Other gradient projection methods:
  • Meta Experience Replay (Riemer et al., ICLR 2019)

• Drawbacks:
• Experimented with small tasks only
• They outperform simple rehearsal only in some peculiar experimental settings
• Training typically very slow (hard to make fast)

1. Avoid forgetting
2. Fixed memory and compute
3. Enable forward transfer
4. Enable backward transfer
5. Do not store examples



Adversarial Continual Learning
HYBRID

• The proposed methodology disentangles task-invariant and task-specific features.

• For each task, there is a task-specific private module and a task-independent shared module.
  The shared module is trained adversarially (it plays the role of the generator against a task discriminator) so that it learns a task-invariant representation.
  The shared and task-specific private features are concatenated and passed through a task-specific head.

• Task labels are required at test time to select the task-specific private module and the task-specific head.

• The discriminator is no longer used at test time.

Ebrahimi, S., Meier, F., Calandra, R., Darrell, T., & Rohrbach, M. (2020). Adversarial continual learning.
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part XI 16 (pp. 386-402). Springer International Publishing.
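A simplified sketch of the architecture described above at test time; module names are assumptions and the training-time discriminator is omitted (this is not the authors' code):

```python
import torch
import torch.nn as nn

class ACLModel(nn.Module):
    """Sketch of Adversarial Continual Learning at test time: a shared
    (task-invariant) module plus one private module and one head per task;
    shared and private features are concatenated before the task head."""
    def __init__(self, shared, private_modules, task_heads):
        super().__init__()
        self.shared = shared                           # trained adversarially during training
        self.privates = nn.ModuleList(private_modules) # one per task
        self.heads = nn.ModuleList(task_heads)         # one per task

    def forward(self, x, t):                           # the task label t is needed at test time
        features = torch.cat([self.shared(x), self.privates[t](x)], dim=1)
        return self.heads[t](features)
```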



Datasets
• Split MNIST is used as the dataset for the continual learning experiments.
  A benchmark dataset used to evaluate different methods.
  The dataset is divided into a total of 5 tasks, where every task consists of 2 classes.

[Figure: the ten MNIST digits split into five tasks T1–T5, two digit classes per task.]
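A minimal sketch of how such a split can be built with torchvision; pairing consecutive digits is the usual convention and an assumption here:

```python
from torch.utils.data import Subset
from torchvision import datasets, transforms

mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())

# Five tasks T1..T5, two digit classes each: {0,1}, {2,3}, ..., {8,9}.
tasks = []
for t in range(5):
    task_classes = {2 * t, 2 * t + 1}
    indices = [i for i, y in enumerate(mnist.targets.tolist()) if y in task_classes]
    tasks.append(Subset(mnist, indices))
```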



Experiments
• Experiments are performed using the Split MNIST dataset.
  The evaluation metric used is Average Accuracy.

Method                           Average Accuracy (%)
Learning without Forgetting      86.18
iCaRL                            89.29
Progressive Neural Networks      94.1
Elastic Weight Consolidation     84.2
Adversarial Continual Learning   96.03
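For reference, Average Accuracy is usually computed as the mean test accuracy over all tasks measured after training on the last task; a minimal sketch (the matrix layout is an assumption):

```python
def average_accuracy(acc_matrix):
    """acc_matrix[t][i] = test accuracy on task i after training on tasks 0..t.
    Average Accuracy is the mean over all tasks after the final task."""
    final_row = acc_matrix[-1]
    return sum(final_row) / len(final_row)
```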



Conclusion

• Continual Learning is currently a very active area of Machine Learning research.
  It is strongly motivated and has many practical applications.

• So far, a comprehensive literature review has been performed, which includes:


Regularization Methods
Rehearsal based Methods
Model Growing Methods
Hybrid Methods

• Experiments have been performed for all baseline methods.

  Adversarial Continual Learning outperforms all other methods.



Next Semester Goals

• Propose an architecture/model along the lines of hybrid models to improve their performance.
  Improve the architecture of ACL for different benchmark datasets.

• Study continual learning in graph neural network settings and see how existing methods behave.
  To our knowledge, no previous work has been done in this direction.
  The goal is to see how to set up continual learning on graph datasets.

• Use multiple benchmarks with various other evaluation metrics, including backward and forward transfer.

