# Deep Learning Glossary
A comprehensive guide to terms used in Deep Learning.
### Activation Function
A mathematical function applied to the output of a neuron. It introduces non-linearity into the network, allowing it to learn complex patterns. Common examples include ReLU, Sigmoid, and Tanh.
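For illustration, the three activations named above can be written as scalar functions. This is a minimal standard-library sketch, not how frameworks implement them (which operate on tensors):

```python
import math

def relu(x):
    # Passes positive values through, clips negatives to zero.
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real value into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Zero-centered squashing into (-1, 1).
    return math.tanh(x)
```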
### Adam
Adaptive Moment Estimation. An optimization algorithm that combines the advantages of **Momentum** and **RMSprop**. It computes adaptive learning rates for each parameter.
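The update rule can be sketched for a single scalar parameter. The function name and argument layout below are illustrative; the default hyperparameters (`lr=0.001`, `b1=0.9`, `b2=0.999`, `eps=1e-8`) are the ones proposed in the original Adam paper:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch)."""
    m = b1 * m + (1 - b1) * grad        # first moment: momentum-like average
    v = b2 * v + (1 - b2) * grad ** 2   # second moment: RMSprop-like average
    m_hat = m / (1 - b1 ** t)           # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

On the first step (`t=1`) the bias correction exactly cancels the `(1 - b)` factors, so the effective step size is close to `lr` regardless of the gradient's scale.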
### Artificial Neural Network (ANN)
A computational model inspired by the human brain, consisting of interconnected nodes (neurons) that process information.
### Attention
A mechanism that allows a model to focus on specific parts of the input sequence when processing data, assigning different weights to different elements.
### Backpropagation
The primary algorithm for training neural networks. It computes the gradient of the loss function with respect to each weight by applying the chain rule, moving backward from the output layer to the input layer.
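The chain rule can be traced by hand for a single sigmoid neuron with a squared-error loss. The function and variable names here are illustrative, not from any library:

```python
import math

def forward_backward(x, w, b, target):
    """Forward pass, then gradients via the chain rule (output -> input)."""
    z = w * x + b                     # pre-activation
    y = 1.0 / (1.0 + math.exp(-z))    # sigmoid activation
    loss = (y - target) ** 2
    # Backward pass: multiply local derivatives along the chain.
    dL_dy = 2.0 * (y - target)
    dy_dz = y * (1.0 - y)             # derivative of the sigmoid
    dL_dw = dL_dy * dy_dz * x         # since dz/dw = x
    dL_db = dL_dy * dy_dz             # since dz/db = 1
    return loss, dL_dw, dL_db
```

A quick sanity check is to compare `dL_dw` against a finite-difference estimate of the same derivative.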
### Batch (Mini-Batch)
A subset of the training dataset used in one iteration of model training. Small batch sizes can provide a regularizing effect, while larger batch sizes provide more accurate gradient estimates.
### Batch Normalization
A technique to stabilize and accelerate training by normalizing the inputs to a layer over each mini-batch to zero mean and unit variance, followed by a learnable scale and shift. It was introduced to address **Internal Covariate Shift**.
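The normalization step for a single feature can be sketched as follows (`gamma` and `beta` stand in for the learnable scale and shift; real implementations also track running statistics for inference):

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # eps guards against division by zero when the variance is tiny.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
```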
### BERT
Bidirectional Encoder Representations from Transformers. An Encoder-only model pre-trained using Masked Language Modeling, excellent for natural language understanding tasks.
### Bias
A learnable parameter in a neuron that allows the activation function to be shifted left or right.
### Convolution
A mathematical operation where a **Kernel** (filter) slides over an input (like an image) to produce a feature map. It preserves spatial relationships and is the core building block of CNNs.
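A minimal "valid" 2-D convolution over nested lists might look like the sketch below. Note that, as in most deep learning libraries, this is technically cross-correlation (the kernel is not flipped):

```python
def conv2d(image, kernel, stride=1):
    """'Valid' 2D convolution: the kernel slides fully inside the input."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Elementwise product of the kernel with the window, summed.
            row.append(sum(image[i * stride + a][j * stride + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out
```

With **Padding** the input would be surrounded by zeros first, and a **Stride** greater than 1 skips positions, downsampling the feature map.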
### Cost Function
See [Loss Function](#loss-function).
### Decoder
The component of a Transformer that generates the output sequence token by token, attending to the Encoder's output and its own previous outputs.
### Deep Learning
A subset of machine learning based on artificial neural networks with representation learning. It allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification.
### Dropout
A regularization technique where randomly selected neurons are ignored during training. This helps prevent overfitting.
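A common formulation is "inverted" dropout, sketched below: each unit is zeroed with probability `p` during training, and survivors are scaled by `1/(1-p)` so the expected activation is unchanged at inference time:

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout over a list of activations (illustrative sketch)."""
    if not training or p == 0.0:
        return list(activations)           # identity at inference time
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]
```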
### Encoder
The component of a Transformer that processes the input sequence into a contextualized representation.
### Epoch
One complete pass through the entire training dataset. Training usually consists of multiple epochs.
### Feature Extraction
A **Transfer Learning** strategy where the weights of a pretrained model are frozen (treated as a fixed feature extractor), and only the final classifier layer is trained on the new dataset.
### Feedforward Neural Network
A type of neural network where connections between the nodes do not form a cycle. Information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any), and to the output nodes.
### Fine-Tuning
A **Transfer Learning** strategy where some or all of the layers of a pretrained model are unfrozen and retrained (usually with a low learning rate) on a new dataset to adapt specific features.
### GPT
Generative Pre-trained Transformer. A Decoder-only model pre-trained using Causal Language Modeling, excellent for text generation.
### Gradient Descent
An optimization algorithm used to minimize the loss function by iteratively moving in the direction of the steepest descent (negative gradient).
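The update rule is simply "step opposite the gradient, scaled by the learning rate." A one-dimensional sketch, minimizing the illustrative function `f(x) = (x - 3)^2`:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a 1-D function given its gradient function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)   # move in the direction of steepest descent
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```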
### Hidden Layer
A layer of neurons between the input and output layers. Hidden layers perform nonlinear transformations of the inputs entered into the network.
### Internal Covariate Shift
The phenomenon where the distribution of network activations changes as the parameters of the previous layers change during training. **Batch Normalization** is designed to address this.
### Kernel (Filter)
A small matrix of learnable weights in a **Convolution** layer. It slides over the input to detect specific features like edges, textures, or patterns.
### Learning Rate
A hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.
### Loss Function
A function that measures how well the model's predictions match the targets; the worse the predictions, the higher the loss. Examples include Mean Squared Error (MSE) and Cross-Entropy Loss.
### Momentum
An extension to **Gradient Descent** that accelerates convergence by accumulating a velocity vector in directions of persistent reduction in the loss function.
### Multi-Head Attention
An extension of Self-Attention where multiple attention heads run in parallel to capture different types of relationships (e.g., syntactic vs. semantic).
### Multi-Layer Perceptron (MLP)
A class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer.
### Neuron
The basic unit of a neural network. It computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function.
### Overfitting
A modeling error that occurs when a function is fit too closely to a limited set of data points, capturing noise rather than the underlying pattern, so the model generalizes poorly to unseen data.
### Padding
The process of adding extra pixels (usually zeros) around the border of an input image before convolution. This allows the output size to match the input size ("Same" padding) and preserves edge information.
### Perceptron
The simplest type of artificial neural network, consisting of a single layer of linear threshold units.
### Pooling
A downsampling operation (e.g., Max Pooling, Average Pooling) that reduces the spatial dimensions of the feature map, decreasing computation and providing translation invariance.
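Non-overlapping max pooling over a 2-D feature map can be sketched in a few lines (the function name is illustrative):

```python
def max_pool2d(fmap, size=2):
    """Take the max over each non-overlapping size x size window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]
```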
### Positional Encoding
A technique to inject information about the position of tokens in a sequence, as Transformers (unlike RNNs) process all tokens simultaneously.
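The sinusoidal scheme from the original Transformer paper uses `sin` for even dimensions and `cos` for odd ones, at geometrically spaced frequencies. A sketch for a single position:

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe
```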
### ReLU (Rectified Linear Unit)
An activation function defined as the positive part of its argument: `f(x) = max(0, x)`. It is the most popular activation function for deep neural networks.
### Residual Block
The core component of **ResNet**. It introduces a "skip connection" that adds the input of the block to its output (F(x) + x), allowing gradients to flow through very deep networks without vanishing.
### Self-Attention
A mechanism where each position in the sequence attends to all other positions to compute a representation of the sequence. Defined by Query, Key, and Value vectors.
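Scaled dot-product attention, `softmax(QK^T / sqrt(d_k)) V`, can be sketched over lists of vectors (here `Q`, `K`, `V` are assumed to be already-projected query, key, and value rows):

```python
import math

def softmax(xs):
    m = max(xs)                         # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]           # similarity of this query to every key
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])   # weighted average of values
    return out
```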
### Sigmoid
An activation function that maps any real value into the range (0, 1). Often used in the output layer for binary classification.
### Softmax
A function that provides a probability distribution over a set of classes. Often used in the final layer of a neural network-based classifier.
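A numerically stable sketch subtracts the maximum logit before exponentiating, which leaves the result unchanged but avoids overflow:

```python
import math

def softmax(logits):
    """Map raw scores to a probability distribution that sums to 1."""
    m = max(logits)                       # stability: exp of large values overflows
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```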
### Stochastic Gradient Descent (SGD)
A variant of **Gradient Descent** where the gradient is estimated using a single training example (or a small batch) at a time, introducing noise that can help escape local minima.
### Stride
The number of pixels the **Kernel** moves at each step during convolution. A stride greater than 1 results in downsampling the output.
### Tanh (Hyperbolic Tangent)
An activation function that maps any real value into the range (-1, 1). It is zero-centered, which can make optimization easier compared to sigmoid.
### Tokenization
The process of breaking text into smaller units (tokens), such as words or subwords, for processing by a model.
### Transfer Learning
A machine learning technique where a model developed for a task is reused as the starting point for a model on a second task. It leverages knowledge (features) learned from massive datasets (like ImageNet).
### Transformer
A deep learning architecture based entirely on attention mechanisms, dispensing with recurrence and convolutions.
### Underfitting
A modeling error that occurs when a model cannot adequately capture the underlying structure of the data.
### Weight
A learnable parameter that scales the strength of a connection between neurons. Weights are adjusted during training to minimize the loss function.