
Deep Continual Learning

Jawad Tariq

Supervised by: Dr. Mohsen Ali


Continual Learning: Motivation

• Deep neural networks are extremely powerful function approximators.


• Vision (Recognition, Synthesis)
• Language (Machine translation, language models)
• Complex structures (graph neural networks)

• They require offline training on huge datasets.


• If data comes as a stream, the standard recipe is to store and shuffle it,
• and then perform offline training.

• Typically, they are extremely proficient in very narrow tasks.


• We need to build more general AI models that are proficient at many different tasks.

Continual learning: algorithms that keep learning new tasks online, adding new knowledge to the
model without sacrificing previously acquired knowledge.



What is Continual Learning (CL)?
aka lifelong learning, incremental learning
• The ability of a model…
• … to learn continually from a stream of data.
• … to learn multiple tasks sequentially.

[Diagram comparing three settings:
• Multi-task learning: shared layers with task-specific layers for Task 1, Task 2, Task 3, all kept and trained jointly.
• Fine-tuning: pre-trained layers; the old task head is discarded and only the new task head is kept.
• Continual learning: shared layers; both the old task and the new task heads are kept.]


Tasks

• Usually, each task is in the form of a classification dataset.


[Figure: Incremental MNIST, Incremental SVHN, and Incremental CIFAR-10, each split into five sequential tasks T1–T5.]



Challenge: Catastrophic Forgetting

• First observed in McCloskey and Cohen, 1989, “Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem”:
New learning may interfere catastrophically with old learning when networks are trained sequentially. The analysis of
the causes of interference implies that at least some interference will occur whenever new learning may alter weights
involved in representing old learning, and the simulation results demonstrate only that interference is catastrophic in
some specific networks.

• “Catastrophic” means that the loss of old knowledge is sudden, rather than gradual.

• It is a fundamental problem introduced by the local nature of gradient-based optimization!



CL: more definitions
Task-incremental test setting:
“What digits are these, given that they all belong to task T4?”
The model estimates p(y | x, t).

Class-incremental test setting:
“What digits are these?”
The model estimates p(y | x).

[Diagram: multi-head architecture (a shared deep NN with a separate head for each of T1, T2, T3) vs. single-head architecture (a shared deep NN with a single head covering T1, T2, T3).]
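For illustration only, a minimal PyTorch sketch of the two test settings; the backbone, layer sizes, and the 5-task / 10-class split are assumptions, not taken from the slides:

```python
import torch
import torch.nn as nn

# Shared feature extractor (an assumed MLP for flattened 28x28 MNIST-like inputs).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())

# Multi-head (task-incremental): one head per task, selected with the task label t.
heads = nn.ModuleList([nn.Linear(256, 2) for _ in range(5)])  # 5 tasks, 2 classes each

def task_incremental_logits(x, t):
    return heads[t](backbone(x))        # models p(y | x, t)

# Single-head (class-incremental): one head over every class seen so far.
single_head = nn.Linear(256, 10)        # 10 classes in total

def class_incremental_logits(x):
    return single_head(backbone(x))     # models p(y | x)
```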
Continual learning: Timeline

[Timeline figure: methods from 2016 to 2020.]

DIFFERENT STRATEGIES
• Model growing
• Replay
• Knowledge Distillation
• Regularization
• Parameter isolation

…but the same goal: protect (either explicitly or implicitly) important parameters for prior tasks!
Continual Learning Desiderata

1. Avoid forgetting
• Performance over previous tasks should not decrease
2. Fixed memory and compute
• If not possible, grow sub-linearly with tasks
3. Enable forward transfer
• Knowledge acquired over previous tasks should help learning future tasks
4. Enable backward transfer
• While learning the current task, performance in previous tasks may also increase
5. Do not store examples
• Or store as few as possible



Related Works

• MODEL GROWING: increase the model capacity for every new task.
• KNOWLEDGE DISTILLATION: use the model in a previous training state as a teacher.
• REGULARIZATION: penalize (some) parameter variations.
• REHEARSAL: store old inputs and replay them to the model.
• HYBRID APPROACH: use multiple approaches at the same time to implement the algorithm.


Progressive Neural Networks
MODEL GROWING

• The model is organized in columns (sequences of fully connected layers).
• Starts with a single column for the first task.
• For a generic task K:
  • The weight matrices U of past columns are kept fixed.
  • Adapter functions connect the frozen columns to the new one.

Experiments were run on a sequence of reinforcement learning tasks.

Rusu, Andrei A., et al. "Progressive neural networks." arXiv preprint arXiv:1606.04671 (2016).
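A minimal sketch of how a new column could attach to frozen ones (class and argument names are assumptions; the paper also uses non-linear, dimensionality-reducing adapters, omitted here):

```python
import torch.nn as nn
import torch.nn.functional as F

class ColumnLayer(nn.Module):
    """One fully connected layer of a new PNN column (simplified sketch of
    Rusu et al., 2016). Lateral adapters (the U matrices) read the activations
    of previously trained, frozen columns."""
    def __init__(self, in_dim, hidden_dim, num_prev_columns):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden_dim)
        self.laterals = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_prev_columns)]
        )

    def forward(self, x, prev_activations):
        h = self.fc(x)
        # Previous columns are frozen: detach so no gradient flows into them.
        for adapter, h_prev in zip(self.laterals, prev_activations):
            h = h + adapter(h_prev.detach())
        return F.relu(h)
```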
Progressive Neural Networks MODEL GROWING

• Drawbacks:
  1. The model grows linearly with the number of trained tasks
  2. Need to know task labels at test time

1. Avoid forgetting
2. Fixed memory and compute
3. Enable forward transfer
4. Enable backward transfer
5. Do not store examples



Learning Without Forgetting
KNOWLEDGE DISTILLATION

• Multi-head architecture:
  • Each task has its own classification head.
  • The feature extractor is shared.

• During training of task t, apply a distillation objective for all past task heads: the model's current output for task i is matched against the output for task i recorded at the beginning of training on task t (a sketch follows below).

• In the meantime, optimize the new head with the true labels (standard cross-entropy objective).

Another example of a distillation-based model: E2EIL.

Li, Z., & Hoiem, D. (2017). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2935-2947.
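A minimal sketch of the distillation term described above; the temperature T, the T² scaling, and lambda_old are standard distillation choices and assumptions here, not values taken from the slides:

```python
import torch.nn.functional as F

def lwf_distillation_loss(current_logits, recorded_logits, T=2.0):
    """Distillation term of LwF (sketch): keep an old task head close to the
    outputs recorded before training on the new task started."""
    soft_targets = F.softmax(recorded_logits / T, dim=1)    # frozen "teacher" outputs
    log_probs = F.log_softmax(current_logits / T, dim=1)    # current model, same head
    return -(soft_targets * log_probs).sum(dim=1).mean() * (T * T)

# Total loss on task t (sketch):
#   cross_entropy(new_head_logits, labels)
#   + lambda_old * sum(lwf_distillation_loss(h_now, h_recorded) over all past heads)
```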



Incremental Classifier and Representation Learning (iCaRL) KNOWLEDGE DISTILLATION

• The classifier used is nearest-mean-of-exemplars.

  A nearest-mean-of-exemplars classifier is robust against changes in the data representation.

• Prioritized exemplar selection strategy.

  A subset of training samples (the exemplar set) from previous classes is stored.
  The size of the exemplar set is kept constant: as new classes arrive, some examples from old classes are removed.

• A representation learning step that uses the exemplars in combination with distillation to avoid catastrophic forgetting.

  A combination of a classification loss for new samples and a distillation loss for old samples is used.

Rebuffi, S. A., Kolesnikov, A., Sperl, G., & Lampert, C. H. (2017). iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2001-2010).
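A minimal sketch of the nearest-mean-of-exemplars rule; it assumes features are already extracted and L2-normalized, and the function and argument names are hypothetical:

```python
import torch

def nearest_mean_of_exemplars(features, exemplar_features_per_class):
    """iCaRL-style classification rule (sketch): assign each feature vector to the
    class whose exemplar mean is closest."""
    means = torch.stack([f.mean(dim=0) for f in exemplar_features_per_class])
    means = means / means.norm(dim=1, keepdim=True)   # normalize the class means
    distances = torch.cdist(features, means)          # Euclidean distance to every class mean
    return distances.argmin(dim=1)                    # predicted class indices
```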



Knowledge Distillation: Drawbacks KNOWLEDGE DISTILLATION

• Drawbacks:
  1. These methods store examples.

1. Avoid forgetting
2. Fixed memory and compute
3. Enable forward transfer
4. Enable backward transfer
5. Do not store examples



Elastic Weight Consolidation REGULARIZATION

• Bayesian perspective: treat the dataset and the shared weights as probability distributions.

  The posterior over the weights after the first task becomes the prior over the weights during the second task.

• Assuming the distribution over the weights is Gaussian, this translates to a quadratic penalty for changing each weight w.r.t. its value after task A, modulated by F, the Fisher Information Matrix.


Kirkpatrick, James, et al. "Overcoming catastrophic forgetting in neural networks." Proceedings of the national academy of sciences 114.13 (2017): 3521-3526.
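The penalty referred to above is, in the form given by Kirkpatrick et al., L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{A,i})². A minimal sketch of the penalty term (dictionary layout and names are assumptions):

```python
def ewc_penalty(model, fisher, params_after_task_A, lam):
    """EWC regularizer (sketch): moving a weight away from its value after task A
    is penalized in proportion to its (diagonal) Fisher information."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - params_after_task_A[name]) ** 2).sum()
    return 0.5 * lam * penalty
```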



Elastic Weight Consolidation: Variants
REGULARIZATION

Drawback of EWC: for each task, the latest MAP parameters and a Fisher information matrix need to be stored.

Online EWC
• Same objective as EWC, but:
  • Keeps only the latest parameters.
  • Keeps only one Fisher matrix, obtained by accumulating them over tasks.

Progress and Compress
• Alternates between 2 phases:
  1. PROGRESS: add a column and learn the new task.
  2. COMPRESS: distill the new column into the knowledge base, protecting parameters with EWC.

Schwarz, Jonathan, et al. "Progress & compress: A scalable framework for continual learning." International Conference on
Machine Learning (2018).
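A minimal sketch of the "one Fisher matrix accumulated over tasks" idea in Online EWC; the decay factor gamma and the dictionary layout are assumptions:

```python
def accumulate_fisher(running_fisher, task_fisher, gamma=0.95):
    """Online EWC keeps a single running Fisher estimate: decay the old one and
    add the Fisher computed on the task that just finished."""
    return {name: gamma * running_fisher[name] + task_fisher[name]
            for name in running_fisher}
```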
Regularization Approaches
REGULARIZATION

• Other variants have been proposed, changing the way in which penalties for
each parameter are computed
• Zenke, Friedemann, Ben Poole, and Surya Ganguli. "Continual learning through synaptic intelligence." International
Conference on Machine Learning (2017).
• Aljundi, Rahaf, et al. "Memory aware synapses: Learning what (not) to forget." Proceedings of the European
Conference on Computer Vision (ECCV). 2018.
• Chaudhry, Arslan, et al. "Riemannian walk for incremental learning: Understanding forgetting and
intransigence." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

• Drawbacks:
  1. Hard to scale to many tasks

1. Avoid forgetting
2. Fixed memory and compute
3. Enable forward transfer
4. Enable backward transfer
5. Do not store examples



Gradient Episodic Memory (GEM)
REHEARSAL

• Keep a buffer of examples for each task.

• When observing new examples from the stream, solve a constrained optimization problem: minimize the loss on the new examples without increasing the loss on the stored examples of any previous task.

• Under mild assumptions, the constraints can be rewritten as inner-product constraints between the current gradient and the gradient computed on each task's buffer.

• Solve with quadratic programming (QP); see the reconstruction below.


Lopez-Paz, David, and Marc'Aurelio Ranzato. "Gradient episodic memory for continual learning." Advances in neural information processing systems. 2017.
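The missing formulas can be reconstructed from Lopez-Paz & Ranzato (2017); they are not copied from the slide. Writing $g$ for the gradient of the current batch and $g_k$ for the gradient on the memory of task $k$, the rewritten constraints are $\langle g, g_k \rangle \ge 0$ for all $k < t$, and when violated the update is obtained from the projection

$$\min_{\tilde{g}} \; \tfrac{1}{2}\lVert g - \tilde{g} \rVert_2^2 \quad \text{s.t.} \quad \langle \tilde{g}, g_k \rangle \ge 0 \;\; \forall k < t,$$

which is the quadratic program solved (in its dual form) at every training step.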



Average GEM (A-GEM)
REHEARSAL

• Solving GEM’s constrained optimization for each batch is cumbersome:
  • One constraint per task, so training becomes extremely slow as more tasks are encountered.

• Chaudhry et al. (ICLR 2019) proposed an alternative gradient projection: a single constraint is built from the gradient of a random batch drawn from the memory buffer (one constraint only!), and if that constraint is violated, the current gradient is projected onto it (a sketch follows below).

Lopez-Paz, David, and Marc'Aurelio Ranzato. "Gradient episodic memory for continual learning." Advances in neural information processing systems. 2017.
Chaudhry, Arslan, et al. "Efficient lifelong learning with A-GEM." International Conference on Learning Representations (2019).
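A minimal sketch of the A-GEM projection on flattened gradients (tensor shapes and names are assumptions):

```python
import torch

def agem_project(g, g_ref):
    """A-GEM update rule (sketch): if the proposed gradient g conflicts with the
    gradient g_ref of a random memory batch, remove the conflicting component."""
    dot = torch.dot(g, g_ref)
    if dot < 0:                                    # constraint <g, g_ref> >= 0 violated
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref
    return g
```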



GEM, A-GEM
REHEARSAL

• Other gradient projection methods:
  • Meta Experience Replay (Riemer et al., ICLR 2019)

• Drawbacks:
• Experimented with small tasks only
• They outperform simple rehearsal only in some peculiar experimental settings
• Training typically very slow (hard to make fast)

1. Avoid forgetting
2. Fixed memory and compute
3. Enable forward transfer
4. Enable backward transfer
5. Do not store examples



Adversarial Continual Learning
HYBRID

• The proposed methodology disentangles task-invariant and task-specific features.

• For each task, there is a task-specific private module and a task-independent shared module.
  The shared module is trained adversarially (it plays the role of the generator against a task discriminator) so that it learns a task-invariant representation.
  The shared and task-specific private features are concatenated and passed through a task-specific head.

• Task labels are required at test time to select the task-specific private module and the task-specific head.

• The discriminator is no longer used at test time.

Ebrahimi, S., Meier, F., Calandra, R., Darrell, T., & Rohrbach, M. (2020). Adversarial continual learning.
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part XI 16 (pp. 386-402). Springer International Publishing.
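A simplified sketch of the architecture described above at test time; module names are assumptions and the training-time discriminator is omitted (this is not the authors' code):

```python
import torch
import torch.nn as nn

class ACLModel(nn.Module):
    """Sketch of Adversarial Continual Learning at test time: a shared
    (task-invariant) module plus one private module and one head per task;
    shared and private features are concatenated before the task head."""
    def __init__(self, shared, private_modules, task_heads):
        super().__init__()
        self.shared = shared                           # trained adversarially during training
        self.privates = nn.ModuleList(private_modules) # one per task
        self.heads = nn.ModuleList(task_heads)         # one per task

    def forward(self, x, t):                           # the task label t is needed at test time
        features = torch.cat([self.shared(x), self.privates[t](x)], dim=1)
        return self.heads[t](features)
```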



Datasets
• Split MNIST is used as the dataset for the continual learning experiments.
  A benchmark dataset used to evaluate different methods.
  The dataset is divided into a total of 5 tasks, where every task consists of 2 classes.

[Figure: the ten MNIST digits split into five tasks T1–T5, two digit classes per task.]
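A minimal sketch of how such a split can be built with torchvision; pairing consecutive digits is the usual convention and an assumption here:

```python
from torch.utils.data import Subset
from torchvision import datasets, transforms

mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())

# Five tasks T1..T5, two digit classes each: {0,1}, {2,3}, ..., {8,9}.
tasks = []
for t in range(5):
    task_classes = {2 * t, 2 * t + 1}
    indices = [i for i, y in enumerate(mnist.targets.tolist()) if y in task_classes]
    tasks.append(Subset(mnist, indices))
```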



Experiments
• Experiments are performed using the Split MNIST dataset.
  The evaluation metric used is Average Accuracy.

Method                           Average Accuracy (%)
Learning without Forgetting      86.18
iCaRL                            89.29
Progressive Neural Networks      94.1
Elastic Weight Consolidation     84.2
Adversarial Continual Learning   96.03
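For reference, Average Accuracy is usually computed as the mean test accuracy over all tasks measured after training on the last task; a minimal sketch (the matrix layout is an assumption):

```python
def average_accuracy(acc_matrix):
    """acc_matrix[t][i] = test accuracy on task i after training on tasks 0..t.
    Average Accuracy is the mean over all tasks after the final task."""
    final_row = acc_matrix[-1]
    return sum(final_row) / len(final_row)
```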



Conclusion

• Continual Learning is currently a very active area of Machine Learning research.
  It is strongly motivated and has many practical applications.

• So far, a comprehensive literature review has been performed, which includes:


Regularization Methods
Rehearsal based Methods
Model Growing Methods
Hybrid Methods

• Experiments have been performed for all baseline methods.

  Adversarial Continual Learning outperforms all other methods.



Next Semester Goals

• Propose an architecture/model along the lines of hybrid models to improve their performance.
  Improve the architecture of ACL for different benchmark datasets.

• Study continual learning in graph neural network settings and see how existing methods behave.
  To our knowledge, no previous work has been done in this direction.
  The goal is to see how to set up continual learning on graph datasets.

• Use multiple benchmarks with various other evaluation metrics, including backward and forward transfer.

