# Activation Functions
> [!IMPORTANT]
> Without activation functions, a neural network—no matter how many layers it has—would just be a big linear regression model. Activation functions introduce non-linearity, allowing the network to learn complex patterns.
## 1. Why Non-Linearity?
If we only use linear operations (weighted sums), the entire network collapses into a single linear transformation.
Output = W2(W1(x)) = (W2 * W1)x = W_new * x
To approximate any function (Universal Approximation Theorem), we need to bend and twist the decision boundaries. This is what activation functions do.
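The collapse of stacked linear layers is easy to verify numerically. A minimal sketch (the weight shapes here are arbitrary, chosen just for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))  # first linear layer
W2 = rng.standard_normal((2, 4))  # second linear layer
x = rng.standard_normal(3)

# Two stacked linear layers...
deep = W2 @ (W1 @ x)
# ...are exactly one linear layer with W_new = W2 @ W1.
shallow = (W2 @ W1) @ x

print(np.allclose(deep, shallow))  # True
```

No matter how many linear layers you stack, the result is always a single matrix multiply; an activation between the layers is what breaks this equivalence.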
## 2. Common Activation Functions
*(Interactive visualizer omitted: it plotted each activation function's shape in blue and its derivative in red. The derivative is crucial for backpropagation.)*
### 2.1 Sigmoid
- Formula: σ(x) = 1 / (1 + e⁻ˣ)
- Range: (0, 1)
- Pros: Smooth gradient; output can be interpreted as a probability.
- Cons:
  - Vanishing Gradient: Gradients become very small at the tails (x → ±∞), which stalls learning.
  - Not Zero-Centered: Outputs are always positive, so a layer's weight gradients all share the same sign and the optimization path zigzags.
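The vanishing gradient is easy to see numerically: σ′(x) = σ(x)(1 − σ(x)) peaks at 0.25 at x = 0 and collapses in the tails. A quick sketch:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # derivative of the sigmoid

# The gradient peaks at 0.25 (at x = 0) and collapses in the tails.
for x in [0.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  grad = {sigmoid_grad(x):.2e}")
```

Even at the peak the gradient is only 0.25, so every sigmoid layer shrinks the backpropagated signal by at least a factor of 4; deep stacks of sigmoids therefore train very slowly.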
### 2.2 Tanh (Hyperbolic Tangent)
- Formula: tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
- Range: (-1, 1)
- Pros: Zero-centered (stronger gradients than Sigmoid).
- Cons: Still suffers from the vanishing gradient problem.
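Both claims can be checked with the derivatives, tanh′(x) = 1 − tanh²(x) and σ′(x) = σ(x)(1 − σ(x)). A small sketch comparing the two:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2

# Peak gradient: tanh reaches 1.0 at x = 0, sigmoid only 0.25...
print(tanh_grad(0.0), sigmoid_grad(0.0))  # 1.0 0.25
# ...but tanh's gradient still vanishes in the tails.
print(tanh_grad(10.0))
```

So tanh is the stronger of the two, but it only delays the vanishing-gradient problem rather than solving it.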
### 2.3 ReLU (Rectified Linear Unit)
- Formula: f(x) = max(0, x)
- Range: [0, ∞)
- Pros:
  - Computationally Efficient: Just a threshold at zero.
  - Mitigates Vanishing Gradient: The gradient is exactly 1 for positive inputs, so it neither shrinks nor saturates there.
  - Sparsity: Outputs a true 0 for negative inputs.
- Cons:
  - Dead ReLU: A neuron can "die" if its weights update such that its pre-activation is negative for every input; its gradient is then 0 forever.
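To make the failure mode concrete, here is a minimal sketch (the `relu_grad` helper is written just for this illustration):

```python
import numpy as np

def relu_grad(x):
    # Subgradient of max(0, x): 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

# A neuron whose pre-activation is negative on every training example
# receives zero gradient on every example -- it can never recover.
pre_activations = np.array([-3.2, -0.5, -7.1, -1.0])
print(relu_grad(pre_activations))  # [0. 0. 0. 0.]
```

Since weight updates flow through this gradient, an all-zero row means the neuron's weights never change again, which is exactly the "dead" state described above.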
### 2.4 Softmax
Used exclusively in the output layer for multi-class classification. It converts raw logits into probabilities that sum to 1.
P(y = j) = e^(z_j) / Σₖ e^(z_k)
## 3. Implementation in Python
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    # Subtract max for numerical stability (prevents overflow)
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

# Test
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f"Sigmoid: {sigmoid(x)}")
print(f"ReLU: {relu(x)}")
```
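One detail worth checking is the max-subtraction trick in `softmax`: without it, `np.exp(1000.0)` overflows to `inf`. A standalone sanity check (re-stating `softmax` so the snippet runs on its own):

```python
import numpy as np

def softmax(x):
    # Subtract max for numerical stability (prevents overflow)
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

# Huge logits would overflow a naive exp; the shifted version is fine.
logits = np.array([1000.0, 1001.0, 1002.0])
probs = softmax(logits)
print(probs)        # finite probabilities, largest logit wins
print(probs.sum())  # sums to 1
```

Shifting all logits by a constant leaves the softmax output unchanged mathematically, so the trick costs nothing in accuracy.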
## 4. Which One to Use?
> [!TIP]
> Rule of Thumb:
> - Start with ReLU for hidden layers.
> - If you face Dead ReLU issues, try Leaky ReLU.
> - Use Sigmoid for binary classification output.
> - Use Softmax for multi-class classification output.
> - Avoid Sigmoid/Tanh in hidden layers for deep networks.
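Leaky ReLU, mentioned in the tip above, is a one-line variant; a minimal sketch (the slope `alpha=0.01` is a common default, not a fixed standard):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small negative slope keeps a nonzero gradient for x < 0,
    # so neurons cannot die the way plain ReLU neurons can.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(leaky_relu(x))  # negative inputs are scaled by alpha instead of zeroed
```

Because the negative side still has slope `alpha`, gradient descent can always pull a "dead" neuron back toward useful pre-activations.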