Softmax Activation Function
What is an Activation Function?
An activation (or transfer) function maps a neuron’s weighted inputs plus bias to its output, adding non-linearity so the model can learn complex patterns beyond simple linear ones.
In the context of neural networks, activation functions are also known as transfer functions.
A mathematical function applied to the weighted sum of a neuron’s inputs plus its bias; it makes the neuron’s output non-linear.
Decides whether a neuron should be activated (“fired”) or not.
This helps the neural network use important information and suppress less useful data points.
Adds non-linearity to the neural network so it can tackle complex problems.
Real-world problems are non-linear, for example recognizing cats vs. dogs.
Without activation functions, each neuron computes f(z) = z, like a linear regression model; stacking multiple linear layers collapses into one big linear equation, which is useless for non-linear problems.

What are linear and non-linear problems?
A linear pattern is like a straight-line rule of thumb.
If you study twice as long, you score twice as high in your exams.
Simple analogy, neat and slightly predictable.
A non-linear (complex) pattern is more like real life.
Starting to study a little earlier before exams could help a lot at first; then extra hours give smaller dopamine boosts, and after a point you may burn out, your exam does not go well, and your scores end up average.
The scenario bends, twists, and changes depending on real-life events; it does not follow a straight line.
That’s why neural networks need non-linearity: life isn't straight-line simple.

Softmax Function
Non-linear; a generalization of the sigmoid to more than two classes.
Softmax converts a vector of raw scores (logits), which can be any real numbers, positive or negative, into a probability distribution.
Used in the last layer (output layer) of a neural network for multi-class classification problems.
Each output lies in the range (0, 1), and the outputs are normalized so that they sum to 1.
Especially suited to selecting one class out of many.
Outputs a vector of probabilities: the class with the highest probability is chosen as the prediction, and that probability serves as its confidence.

Softmax - Mathematical Derivation
Can be viewed as a multi-class generalization of the sigmoid/logistic function.
Calculates the probability of each class relative to the others.
The numerator exponentiates the input.
The denominator makes all outputs sum to 1.
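The link between softmax and the sigmoid noted above can be checked numerically. The sketch below is only illustrative: `sigmoid` and `two_class_softmax` are helper names introduced here (not library functions), and fixing the second logit at 0 is an assumption made for the comparison. Under that assumption, the first softmax output should equal the sigmoid of the first logit.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def two_class_softmax(z):
    # Softmax over the logit pair [z, 0]: exponentiate, then normalize.
    pair = np.array([z, 0.0])
    exps = np.exp(pair)
    return exps / exps.sum()

for z in [-3.0, 0.0, 1.5]:
    probs = two_class_softmax(z)
    # The first softmax output matches sigmoid(z); the two outputs sum to 1.
    print(z, probs[0], sigmoid(z), probs.sum())
```

The match follows from e^z / (e^z + 1) = 1 / (1 + e^(-z)), so with two classes softmax reduces to the sigmoid.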
Given logits $z = (z_1, \dots, z_K)$ for $K$ classes, the softmax function for class $i$ is:

$$\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}$$

$z_i$ = logit (raw score) of output neuron $i$
$\exp(z_i)$ = exponential of $z_i$
$\sum_{j=1}^{K} \exp(z_j)$ = sum of $\exp(z_j)$ over all output neurons $j$

How to apply Softmax?
Assume 3 classes, i.e. 3 neurons in the output layer. Suppose the output from the neurons is [3.2, 1.2, 0.5].

Applying Softmax function
Input: [3.2, 1.2, 0.5] - logits
Step 1: Subtract the max from all
Max value is 3.2, so subtract it from each: [3.2 - 3.2, 1.2 - 3.2, 0.5 - 3.2] = [0, -2.0, -2.7]
Subtracting the same constant from every logit does not change the softmax output; it only keeps the exponentials from overflowing.
Step 2: Exponentiate
$e^{0} = 1.0$, $e^{-2} = 0.1353$, $e^{-2.7} = 0.0672$
Step 3: Sum of exponentials
1.0 + 0.1353 + 0.0672 = 1.2025
Step 4: Divide each exponential by the sum
p1 = 1.0 / 1.2025 = 0.8316
p2 = 0.1353 / 1.2025 = 0.1125
p3 = 0.0672 / 1.2025 = 0.0559

For example, to classify an image into one of three classes: [bird, fruit, flower]
If softmax(output):
[0.8316, 0.1125, 0.0559]
Class 1: 83.16% probability
Class 2: 11.25% probability
Class 3: 5.59% probability
Show the algorithm an image of a bird.
The algorithm assigns about 83% probability to bird, 11% to fruit and 6% to flower.
The algorithm will predict bird.
Now, let’s try this example in Python code with NumPy, PyTorch and TensorFlow.

How to implement Softmax Function in Python?
We will write simple code implementing the softmax activation function in three popular libraries: NumPy, PyTorch and TensorFlow.
All code samples can be run in Google Colab.

Softmax in NumPy
import numpy as np
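def softmax(x):
    # Subtract the max logit for numerical stability; the result is unchanged.
    e_x = np.exp(x - np.max(x))
    # Normalize so the outputs sum to 1.
    return e_x / e_x.sum(axis=0)

logits = np.array([3.2, 1.2, 0.5])
probabilities = softmax(logits)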
print(probabilities)

Softmax in PyTorch
import torch
import torch.nn.functional as F
logits = torch.tensor([3.2, 1.2, 0.5])
# dim=0 normalizes over the only axis of this 1-D tensor.
probabilities = F.softmax(logits, dim=0)
print(probabilities)

Softmax in TensorFlow
import tensorflow as tf
logits = tf.constant([3.2, 1.2, 0.5])
# tf.nn.softmax normalizes over the last axis by default.
probabilities = tf.nn.softmax(logits)
print(probabilities)

Applications of Softmax
Multi-class classification problems
NLP - next word prediction
Reinforcement learning (e.g. training a robot)
Knowledge distillation - teaching smaller models
Sentiment analysis (positive, negative, neutral)
A classic use case for softmax is the MNIST dataset: 70,000 grayscale images of handwritten digits (0-9).
[Figures: MNIST Dataset Sample; 3-4-2 Neural Network; Softmax Neural Network]

Advancements in Softmax Function
Adaptive Softmax
Faster and more memory-efficient when the number of classes is large.
For example, instead of treating all words equally, it treats frequent and rare words differently.
Candidate Sampling
Samples a few positive and negative examples (called candidates) during training.
Computes the softmax over a small, randomly sampled set of candidate classes instead of all classes.
Sparsemax
Produces sparse outputs
Cuts off small values to...