Activations
ReLU
PROs:
• Solves the vanishing gradient problem
• Computationally efficient
• Faster convergence
• Default choice for hidden layers
CONs:
• Neurons with negative input die ("dying ReLU")
• Sensitive to initialization
• Not differentiable at 0
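To make the CONs concrete, here is a minimal NumPy sketch (not from the original slides): negative inputs are zeroed, so those neurons also get zero gradient, and the gradient at exactly 0 is a convention rather than a true derivative.

```python
import numpy as np

def relu(x):
    # max(0, x): negatives are zeroed, positives pass through unchanged
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 0 for x < 0 and 1 for x > 0. ReLU is not differentiable
    # at x = 0; returning 0 there is a common convention, not a derivative.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0. 0. 0. 0.5 2.]
print(relu_grad(x))  # [0. 0. 0. 1. 1.] -- negative side is "dead"
```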
Tanh
PROs:
• Zero-centered range (better for optimization)
• Smooth gradient
CONs:
• Vanishing gradient problem
• Computationally expensive
• Not suitable for deeper networks
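A quick sketch (my addition, not from the slides) of both properties: tanh is zero-centered with outputs in (-1, 1), and its derivative 1 - tanh(x)^2 decays toward 0 for large |x|, which is exactly where the vanishing gradient comes from.

```python
import numpy as np

def tanh_act(x):
    # Zero-centered output in (-1, 1)
    return np.tanh(x)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2: equals 1 at x = 0 and shrinks toward 0
    # as |x| grows -- saturated units pass almost no gradient back.
    return 1.0 - np.tanh(x) ** 2

print(tanh_act(np.array([0.0])))   # [0.]
print(tanh_grad(np.array([0.0])))  # [1.]
print(tanh_grad(np.array([3.0])))  # already below 0.01 -- near-vanished
```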
Sigmoid
PROs:
• Output suitable for binary classification
• Used for multi-label classification
• Smooth gradient
CONs:
• Vanishing gradient problem
• Computationally expensive
• Compresses large input ranges into a narrow output band
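A minimal sketch (my addition): the sigmoid gradient s(x)(1 - s(x)) peaks at only 0.25, so stacking many sigmoid layers multiplies gradients that are at most 0.25 each, which is one way to see the vanishing-gradient CON.

```python
import numpy as np

def sigmoid(x):
    # Maps any real input into (0, 1) -- usable as a binary-class probability
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # s'(x) = s(x) * (1 - s(x)); its maximum value is 0.25 at x = 0,
    # so gradients shrink by at least 4x per sigmoid layer.
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([0.0])))       # [0.5]
print(sigmoid_grad(np.array([0.0])))  # [0.25] -- the maximum possible
```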
Softmax
Range: (0, 1)
PROs:
• Interpretable as the likelihood of a class
• Works well with categorical cross-entropy (CCE) loss
• Optimal for multi-class classification
CONs:
• Doesn't work for multi-label classification
• Vulnerable to imbalanced datasets
• Numerically unstable for large input values (overflow errors)
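The overflow CON is commonly handled by subtracting the maximum logit before exponentiating; softmax is shift-invariant, so the result is unchanged. A minimal sketch (my addition, not from the slides):

```python
import numpy as np

def softmax(z):
    # Subtracting z.max() before exp() avoids overflow for large logits
    # without changing the output (softmax is invariant to constant shifts).
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Naive exp(1000.0) overflows to inf; the shifted version stays finite.
print(softmax([1000.0, 1000.0]))  # [0.5 0.5]
print(softmax([1.0, 2.0, 3.0]))   # sums to 1, largest logit gets most mass
```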
GeLU
PROs:
• Smooth gradient
• Dynamic gating makes the network adaptable
• Used in SoTA transformer models (GPT, BERT, SAM)
CONs:
• More computationally expensive than ReLU
• Reduced interpretability
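For illustration (my addition), here is the widely used tanh approximation of GELU, the variant found in common BERT/GPT implementations; the extra tanh and cubic term are also why it costs more than a plain ReLU:

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    # Smoothly gates x by (approximately) the probability that a standard
    # normal variable is below x, instead of ReLU's hard 0/1 cutoff.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(gelu(x))  # small negatives are slightly negative, not hard-zeroed
```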
Rules of thumb