Abstract
Recurrent Neural Networks (RNNs) are widely used for modeling sequential data. However, simple (vanilla) RNNs suffer from well-known training difficulties that motivated the development of more advanced architectures such as Long Short-Term Memory (LSTM) networks. This article summarizes core theoretical questions about RNNs and LSTMs, explains gradient-related problems, and presents a worked numerical example of a vanilla RNN forward pass.
1. Parameter Comparison: Vanilla RNN vs. LSTM
A vanilla RNN neuron computes its hidden state using a single transformation of the input and the previous hidden state. It therefore has one set of parameters:
- Input-to-hidden weights
- Hidden-to-hidden weights
- Bias
An LSTM neuron, in contrast, is composed of four internal components:
- Input gate
- Forget gate
- Output gate
- Candidate (cell) state
Each of these components has its own weights and bias terms.
Result: An LSTM neuron has four times as many parameters as a vanilla RNN neuron (assuming the same input and hidden dimensions).
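The four-to-one ratio can be checked with a quick count. This is a minimal sketch; the dimensions `d` and `h` are illustrative, not taken from the article:

```python
def rnn_params(d, h):
    # One set of parameters: input-to-hidden weights (h*d),
    # hidden-to-hidden weights (h*h), and a bias (h).
    return h * d + h * h + h

def lstm_params(d, h):
    # Four components (input, forget, and output gates plus the
    # candidate state), each with its own weights and bias.
    return 4 * rnn_params(d, h)

d, h = 64, 128
print(rnn_params(d, h))                        # vanilla RNN layer
print(lstm_params(d, h) // rnn_params(d, h))   # ratio is exactly 4
```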
2. The Vanishing Gradient Problem in Vanilla RNNs
During training, RNNs use Backpropagation Through Time (BPTT). Because the same weights are applied repeatedly across many time steps, the gradient is a product of many repeated factors (the recurrent weight multiplied by the activation's derivative at each step).
If the magnitude of these factors is smaller than 1, the gradients shrink exponentially as they propagate backward.
Result: The recurrent structure of the vanilla RNN leads to vanishing gradients during BPTT, making it difficult to learn long-term dependencies.
This problem explains why vanilla RNNs struggle to remember information from far in the past.
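The exponential shrinkage can be seen numerically. The sketch below assumes an illustrative recurrent weight of 0.9 and uses the fact that the sigmoid's derivative is at most 0.25, so each backward step multiplies the gradient by at most 0.225:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = 0.9   # illustrative recurrent weight (assumption)
a = 0.0   # pre-activation where sigmoid'(a) is largest (0.25)

# sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a))
factor = w * sigmoid(a) * (1.0 - sigmoid(a))   # 0.9 * 0.25 = 0.225

grad = 1.0
for _ in range(50):   # propagate back through 50 time steps
    grad *= factor
print(grad)           # vanishingly small after 50 steps
```

Even with the most favorable activation derivative, fifty steps are enough to drive the gradient toward zero, which is why distant time steps contribute almost nothing to the weight update.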
3. Why Gradients Are Clipped in RNN Training
In some situations, especially when weights or inputs are large, gradients can grow exponentially instead of shrinking.
This phenomenon is known as exploding gradients and can cause:
- Extremely large weight updates
- Numerical instability (NaNs or infinities)
- Training divergence
To address this, practitioners apply gradient clipping, which limits the magnitude of gradients during training.
Result: Gradients are clipped to prevent exploding gradients, not to solve vanishing gradients.
4. Meaning of “Gradients Vanish”
The phrase “gradients vanish” means that gradient values become exceedingly small (close to zero).
Practical consequences:
- Weight updates become negligible
- Earlier layers or time steps stop learning
- Long-term dependencies cannot be captured
This is a fundamental limitation of vanilla RNNs and a primary motivation for gated architectures such as LSTMs and GRUs.
Vanilla RNN Forward Pass
Network architecture
- One input unit
- One hidden unit with sigmoid activation
- One output unit with linear activation
Parameters
- Weights: W_xh (input-to-hidden), W_hh (hidden-to-hidden), W_hy (hidden-to-output)
- Biases: b_h (hidden), b_y (output)
- Inputs: x_1, x_2
- Initial hidden state: h_0
Compute the hidden state h_1
General formula: h_t = sigmoid(W_xh * x_t + W_hh * h_{t-1} + b_h)
For t = 1: h_1 = sigmoid(W_xh * x_1 + W_hh * h_0 + b_h)
Sigmoid: sigmoid(z) = 1 / (1 + e^{-z})
Compute the hidden state h_2
For t = 2: h_2 = sigmoid(W_xh * x_2 + W_hh * h_1 + b_h)
Compute the output
Output unit is linear: y_2 = W_hy * h_2 + b_y
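The forward pass above can be run end to end in a few lines. The numeric values below are assumptions chosen for illustration (the example's original numbers were not preserved); any small values behave the same way:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed illustrative parameters and inputs (not from the article)
W_xh, W_hh, W_hy = 0.5, 0.3, 1.0
b_h, b_y = 0.1, 0.2
x1, x2 = 1.0, 0.5
h0 = 0.0

# h_t = sigmoid(W_xh * x_t + W_hh * h_{t-1} + b_h)
h1 = sigmoid(W_xh * x1 + W_hh * h0 + b_h)
h2 = sigmoid(W_xh * x2 + W_hh * h1 + b_h)

# Linear output unit: y_t = W_hy * h_t + b_y
y2 = W_hy * h2 + b_y
print(h1, h2, y2)
```

Note how h_2 depends on h_1, which in turn depends on x_1: this chaining is exactly what lets the network carry past information forward, and also what makes gradients shrink when backpropagated through many such steps.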

ANNs and CNNs are examples of feed-forward neural networks, whereas in RNNs information can flow back through internal loops, allowing the network to consider past information when processing the current input.
©image. Sachin Soni
