Regularization Techniques to Improve Deep Learning Performance
Overview
When building and training deep learning models, we often face the problem of overfitting. Overfitting can happen when the model is too complex or has "memorized" the training dataset instead of learning generalizable patterns. This is a problem because we want the training and validation performance to be as close as possible.
To tackle this problem we can use regularization, which introduces constraints or penalties on the model's complexity, encouraging the model to generalize better and reducing its tendency to overfit. We will take a deep dive into these techniques in this section.
Methods
There are many methods we can use to regularize our models. These techniques may look simple, but they can significantly affect how well a model generalizes.
1. Lasso Regularization (L1)
LASSO stands for Least Absolute Shrinkage and Selection Operator, and is commonly known as L1 regularization. This technique penalizes the absolute values of the model weights by adding the following term to the loss:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{original}} + \lambda \sum_{i=1}^n |w_i|$$
Where:
λ (lambda) : Regularization strength
Impact
L1 regularization promotes sparsity in the model weights, meaning that many weights become exactly zero.
Sparse weights can lead to simpler models that are more interpretable and less prone to overfitting.
L1 regularization is useful in feature selection tasks, as it effectively eliminates irrelevant features by setting their weights to zero.
Common values: between 0.01 and 1 (depending on the dataset)
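The article is framework-agnostic, so the following is a minimal sketch in PyTorch (an assumed choice); the toy model, loss, and the l1_regularized_loss helper are illustrative names, not a fixed recipe.
```python
import torch
import torch.nn as nn

# Minimal sketch (assumed setup): a toy linear model with MSE loss.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
lambda_l1 = 0.01  # regularization strength (λ)

def l1_regularized_loss(outputs, targets):
    # L_total = L_original + λ * Σ |w_i|
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    return criterion(outputs, targets) + lambda_l1 * l1_penalty

# Example usage with random data
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = l1_regularized_loss(model(x), y)
loss.backward()
```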
2. Ridge Regularization (L2)
Ridge regularization, known as L2, is similar to L1 but instead penalizes the squared magnitude of the weights. It follows this formula:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{original}} + \lambda \sum_{i=1}^n w_i^2$$
Impact
Encourages smaller weights, making the model more stable and less sensitive to noise
Reduces overfitting by preventing any single feature or neuron from dominating the model
This technique is suitable for reducing overfitting without forcing complete sparsity in the model weights.
Common values: between 0.01 and 1 (depending on the dataset)
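As a rough PyTorch sketch (again an assumption about the framework), L2 regularization is typically applied through the optimizer's weight_decay argument, which adds an L2-style penalty on the weights during the update.
```python
import torch
import torch.nn as nn

# Minimal sketch (assumed setup): weight_decay plays the role of λ for the L2 penalty.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.01)
```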
3. ElasticNet Regularization
This regularization combines both L1 and L2 regularization into a single loss function.
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{original}} + \lambda_1 \sum_{i=1}^n |w_i| + \lambda_2 \sum_{i=1}^n w_i^2$$
Where:
λ1 : Coefficient for L1 regularization
λ2 : Coefficient for L2 regularization
Impact
Combines the benefits of L1 and L2 regularization
Can handle correlated features better than L1, since it retains some correlated weights instead of forcing all but one to zero
This technique is useful for datasets with many features, especially when some of them are correlated
Common values: between 0.01 and 1 (depending on the dataset)
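A minimal PyTorch sketch (assumed setup and names) that combines both penalties in a single loss, mirroring the formula above:
```python
import torch
import torch.nn as nn

# Minimal sketch (assumed setup): toy model plus combined L1 + L2 penalties.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
lambda_1, lambda_2 = 0.01, 0.01  # λ1 (L1) and λ2 (L2) strengths

def elastic_net_loss(outputs, targets):
    l1 = sum(p.abs().sum() for p in model.parameters())
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return criterion(outputs, targets) + lambda_1 * l1 + lambda_2 * l2
```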
4. Dropout
Dropout is an important regularization technique for neural networks. It works by randomly dropping neurons (setting their activations to zero) with probability p at each training step. This encourages the network to be robust and not depend heavily on any particular neuron. The (inverted) dropout formula is:
$$y_i = \begin{cases} 0 & \text{with probability } p \\ \frac{a_i}{1-p} & \text{with probability } 1-p \end{cases}$$
Impact
Forces the network to learn more robust features by not relying too much on any particular neuron
Reduces overfitting significantly
This method is widely used in deep learning models
Common values for p: between 0.3 and 0.5
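In PyTorch (one possible framework, assumed here), dropout is a built-in layer that implements the inverted-dropout formula above. A minimal sketch:
```python
import torch.nn as nn

# nn.Dropout zeroes activations with probability p during training and
# rescales the survivors by 1 / (1 - p); it is disabled in eval mode.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 10),
)
```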
5. Early Stopping
This technique is a simple way to prevent overfitting. It monitors the validation loss during training and stops training when the validation performance starts to degrade (a sign of overfitting). The model is set to train for a fixed number of epochs, but training halts early if the validation loss does not improve for a certain number of epochs, which is controlled by the patience parameter.
Impact
Prevents wasting time and computational resources on overfitting
Often combined with other regularization methods
Commonly used for training large models
Common patience values: between 3 and 10 epochs
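A minimal hand-rolled sketch of early stopping; train_one_epoch() and evaluate() are hypothetical helpers standing in for your training step and validation-loss computation.
```python
# Hypothetical helpers: train_one_epoch() runs one epoch of training,
# evaluate() returns the current validation loss.
best_val_loss = float("inf")
patience = 5           # stop after this many epochs without improvement
epochs_without_improvement = 0

for epoch in range(100):
    train_one_epoch()
    val_loss = evaluate()
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}")
            break
```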
6. Data Augmentation
Data augmentation generates additional data by applying transformations to the original dataset, helping the model generalize better. In computer vision, common transformations include:
Random Brightness
Random Crop
Random Flip
Random Hue
Impact
Simulates variation in the data, making the model more robust to changes in the input
Reduces overfitting by effectively increasing the dataset size
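As one concrete example (assuming a torchvision-based image pipeline), the transformations listed above can be composed like this; the parameter values are illustrative:
```python
import torchvision.transforms as T

train_transforms = T.Compose([
    T.RandomResizedCrop(224),                 # Random Crop
    T.RandomHorizontalFlip(p=0.5),            # Random Flip
    T.ColorJitter(brightness=0.2, hue=0.1),   # Random Brightness / Hue
    T.ToTensor(),
])
```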
7. Batch Normalization
Batch normalization normalizes the inputs to each layer during training so that they have zero mean and unit variance. The normalized output is then scaled and shifted using the learnable parameters gamma and beta.
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y = \gamma \hat{x} + \beta$$
Here, ϵ is a small constant added to the variance for numerical stability, and γ (scale) and β (shift) are learnable parameters.
Impact
Reduces internal covariate shift, speeding up convergence
Acts as a regularizer, reducing the need for dropout in some cases
Widely used in modern architectures
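A minimal PyTorch sketch (assumed architecture): nn.BatchNorm1d normalizes each feature over the batch and then applies the learnable γ (scale) and β (shift).
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),  # normalize, then scale by γ and shift by β
    nn.ReLU(),
    nn.Linear(64, 10),
)
```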
Summary
Regularization helps deep learning models generalize well by preventing them from overfitting to the training data. By incorporating these techniques, you can build models that perform more robustly on unseen datasets.