
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, Jeff Dean

Distilling the Knowledge in a Neural Network
• “Distillation”: transfer knowledge from a large model (or an ensemble of models) to a small model
• The small model is more suitable to run on mobile devices or for deployment
• Use the class probability distribution (or the mean of the ensemble's distributions) produced by the larger model(s) as “soft targets” for training the smaller model (see the sketch below)
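A minimal sketch of this transfer setup, assuming hypothetical toy teacher and student networks in PyTorch; the teacher's predicted class probabilities serve as the soft targets in the student's loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical models: a large trained teacher and a much smaller student.
teacher = nn.Sequential(nn.Linear(784, 1200), nn.ReLU(), nn.Linear(1200, 10))
student = nn.Sequential(nn.Linear(784, 30), nn.ReLU(), nn.Linear(30, 10))
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

def distill_step(x):
    """One training step on a batch x: the student is fit to the teacher's
    class probabilities (soft targets) rather than to one-hot hard labels."""
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x), dim=1)       # teacher's probabilities
    log_probs = F.log_softmax(student(x), dim=1)          # student's log-probabilities
    loss = -(soft_targets * log_probs).sum(dim=1).mean()  # cross-entropy vs. soft targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```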

Why Soft Targets? Hard targets miss a lot of the valuable information contained in the large model's outputs.
ex (hard): 2 => P(2) = 1, P(3) = 0, P(7) = 0 (MNIST, a database of handwritten digits)
ex (soft): 2 => P(2) = 0.9, P(3) = 10⁻⁶, P(7) = 10⁻⁹
How to produce a softer probability distribution over classes? Raise the temperature T of the final softmax.

Softmax with temperature T:  q_i = exp(z_i / T) / Σ_j exp(z_j / T)
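A small numerical illustration of the effect of T (a NumPy sketch; the logits are made up):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # q_i = exp(z_i / T) / sum_j exp(z_j / T); higher T gives a softer distribution
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [10.0, 1.0, 0.1]     # hypothetical logits for three classes
print(softmax_with_temperature(logits, T=1.0))   # sharp: ~[0.9998, 0.0001, 0.00005]
print(softmax_with_temperature(logits, T=20.0))  # soft:  ~[0.44, 0.28, 0.27]
```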

• Use the same high temperature when training the smaller model to match the soft targets (a combined objective is sketched below).
• Preliminary experiments on MNIST
• Experiments on speech recognition
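A sketch of that training objective, following the paper's description of a weighted average of two terms: a soft-target term computed at high temperature (scaled by T² so its gradient magnitude stays comparable as T changes) and an ordinary cross-entropy term on the true labels; the weight alpha and T = 4 here are hypothetical choices:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL divergence between the softened teacher and student
    # distributions, multiplied by T^2 to keep its gradient scale comparable
    # to the hard-target term.
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Hard-target term: standard cross-entropy against the true labels at T = 1.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Example usage with random tensors standing in for a batch of 32 examples, 10 classes.
student_logits = torch.randn(32, 10, requires_grad=True)
teacher_logits = torch.randn(32, 10)
labels = torch.randint(0, 10, (32,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```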
Future Research Directions
Using soft targets to prevent specialists from overfitting
What is good about the paper
• Addresses a real problem.
• Shows that a simple idea can help the technology advance.
• Provides plenty of experiments on the idea.
• Backs up the idea with experimental results.
• Good introduction with real-world examples.
What is bad about the paper
• Requires a lot of effort to fully understand
• Would benefit from a more extended literature review
Distilling the Knowledge in a Neural Network
Thank You
