Abstract
Recurrent Neural Networks (RNNs) are widely used for modeling sequential data. However, simple (vanilla) RNNs suffer from well-known training difficulties that motivated the development of more advanced architectures such as Long Short-Term Memory (LSTM) networks. This article summarizes core theoretical questions about RNNs and LSTMs, explains gradient-related problems, and presents a worked numerical example of a vanilla RNN forward pass.
1. Parameter Comparison: Vanilla RNN vs. LSTM
A vanilla RNN neuron computes its hidden state using a single transformation of the input and the previous hidden state. It therefore has one set of parameters:
- Input-to-hidden weights
- Hidden-to-hidden weights
- Bias
An LSTM neuron, in contrast, is composed of four internal components:
- Input gate
- Forget gate
- Output gate
- Candidate (cell) state
Each of these components has its own weights and bias terms.
Result: An LSTM neuron has four times as many parameters as a vanilla RNN neuron (assuming the same input and hidden dimensions).
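The four-to-one ratio can be checked with a quick count. This is a minimal sketch; the dimensions `d` and `h` are illustrative, not taken from the article:

```python
def rnn_params(d, h):
    # One set of parameters: input-to-hidden weights (h*d),
    # hidden-to-hidden weights (h*h), and a bias (h).
    return h * d + h * h + h

def lstm_params(d, h):
    # Four components (input, forget, and output gates plus the
    # candidate state), each with its own weights and bias.
    return 4 * rnn_params(d, h)

d, h = 64, 128
print(rnn_params(d, h))                        # vanilla RNN layer
print(lstm_params(d, h) // rnn_params(d, h))   # ratio is exactly 4
```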
2. The Vanishing Gradient Problem in Vanilla RNNs
During training, RNNs use Backpropagation Through Time (BPTT). Because the same weights are applied repeatedly across many time steps, the gradient is a product of many repeated factors (the recurrent weight multiplied by the activation's derivative at each step).
If the magnitude of these factors is smaller than 1, the gradients shrink exponentially as they propagate backward.
Result: The recurrent structure of the vanilla RNN leads to vanishing gradients during BPTT, making it difficult to learn long-term dependencies.
This problem explains why vanilla RNNs struggle to remember information from far in the past.
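The exponential shrinkage can be seen numerically. The sketch below assumes an illustrative recurrent weight of 0.9 and uses the fact that the sigmoid's derivative is at most 0.25, so each backward step multiplies the gradient by at most 0.225:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = 0.9   # illustrative recurrent weight (assumption)
a = 0.0   # pre-activation where sigmoid'(a) is largest (0.25)

# sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a))
factor = w * sigmoid(a) * (1.0 - sigmoid(a))   # 0.9 * 0.25 = 0.225

grad = 1.0
for _ in range(50):   # propagate back through 50 time steps
    grad *= factor
print(grad)           # vanishingly small after 50 steps
```

Even with the most favorable activation derivative, fifty steps are enough to drive the gradient toward zero, which is why distant time steps contribute almost nothing to the weight update.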
3. Why Gradients Are Clipped in RNN Training
In some situations, especially when weights or inputs are large, gradients can grow exponentially instead of shrinking.
This phenomenon is known as exploding gradients and can cause:
- Extremely large weight updates
- Numerical instability (NaNs or infinities)
- Training divergence
To address this, practitioners apply gradient clipping, which limits the magnitude of gradients during training.
Result: Gradients are clipped to prevent exploding gradients, not to solve vanishing gradients.
4. Meaning of “Gradients Vanish”
The phrase “gradients vanish” means that gradient values become exceedingly small (close to zero).
Practical consequences:
- Weight updates become negligible
- Earlier layers or time steps stop learning
- Long-term dependencies cannot be captured
This is a fundamental limitation of vanilla RNNs and a primary motivation for gated architectures such as LSTMs and GRUs.
Vanilla RNN Forward Pass
Network architecture
- One input unit
- One hidden unit with sigmoid activation
- One output unit with linear activation
Parameters
- Weights: W_xh (input-to-hidden), W_hh (hidden-to-hidden), W_hy (hidden-to-output)
- Biases: b_h (hidden), b_y (output)
- Inputs: x_1, x_2
- Initial hidden state: h_0
Compute the hidden state h_1
General formula: h_t = sigmoid(W_xh * x_t + W_hh * h_{t-1} + b_h)
For t = 1: h_1 = sigmoid(W_xh * x_1 + W_hh * h_0 + b_h)
Sigmoid: sigmoid(z) = 1 / (1 + e^{-z})
Compute the hidden state h_2
For t = 2: h_2 = sigmoid(W_xh * x_2 + W_hh * h_1 + b_h)
Compute the output
Output unit is linear: y_2 = W_hy * h_2 + b_y
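The forward pass above can be run end to end in a few lines. The numeric values below are assumptions chosen for illustration (the example's original numbers were not preserved); any small values behave the same way:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed illustrative parameters and inputs (not from the article)
W_xh, W_hh, W_hy = 0.5, 0.3, 1.0
b_h, b_y = 0.1, 0.2
x1, x2 = 1.0, 0.5
h0 = 0.0

# h_t = sigmoid(W_xh * x_t + W_hh * h_{t-1} + b_h)
h1 = sigmoid(W_xh * x1 + W_hh * h0 + b_h)
h2 = sigmoid(W_xh * x2 + W_hh * h1 + b_h)

# Linear output unit: y_t = W_hy * h_t + b_y
y2 = W_hy * h2 + b_y
print(h1, h2, y2)
```

Note how h_2 depends on h_1, which in turn depends on x_1: this chaining is exactly what lets the network carry past information forward, and also what makes gradients shrink when backpropagated through many such steps.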

ANNs and CNNs are examples of feed-forward neural networks, whereas in RNNs information can flow back through internal loops, allowing the network to consider past information when processing the current input.
©image. Sachin Soni
