ARTIFICIAL INTELLIGENCE (15) – Deep learning (13): Understanding Backpropagation

The issue with backpropagation is that it acts as a “leaky abstraction.”
Calling backpropagation a leaky abstraction means that, although it is meant to hide the mathematical complexity of gradient computations, in practice you cannot fully ignore what is happening underneath. To use it effectively, developers and researchers still need to understand details such as how gradients flow, how numerical instability can appear, or how layer interactions affect learning. In other words, backpropagation tries to simplify things, but the lower‑level mechanics still “leak through,” requiring deeper understanding.
Sigmoid function

In early neural networks, it was common to use sigmoid or tanh activation functions in fully connected layers. The subtle problem is that these activations can easily saturate if your weight initialization or data preprocessing is poorly chosen.

When a sigmoid saturates, its output becomes very close to 0 or 1. In those regions, the derivative of the sigmoid is almost zero. During backpropagation this means the gradients vanish, so the network effectively stops learning: your training loss flattens out and refuses to decrease.
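A quick numeric sketch makes the saturation concrete (the sample points below are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_grad(u):
    # derivative of the sigmoid: sigma(u) * (1 - sigma(u))
    s = sigmoid(u)
    return s * (1.0 - s)

# the derivative peaks at u = 0 and collapses as |u| grows
for u in [0.0, 2.0, 5.0, 10.0]:
    print(f"u={u:5.1f}  sigma={sigmoid(u):.6f}  dsigma/du={sigmoid_grad(u):.2e}")
```

At u = 10 the derivative is on the order of 10⁻⁵, so almost no gradient signal survives the unit.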

For example, a simple fully connected layer with a sigmoid activation in raw NumPy would compute something like:
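The three lines discussed in the rest of this section, gathered into one runnable NumPy sketch (the sizes and random seed are illustrative choices, not part of the original):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2                        # n inputs, m neurons (small sizes for illustration)
x = rng.standard_normal(n)         # input vector, shape (n,)
W = rng.standard_normal((m, n))    # weight matrix, shape (m, n)

# forward pass: sigmoid of the linear combination W x
z = 1 / (1 + np.exp(-np.dot(W, x)))      # shape (m,)

# backward pass (local gradients, taking the upstream gradient to be all ones)
dx = np.dot(W.T, z * (1 - z))            # gradient w.r.t. the input x, shape (n,)
dW = np.outer(z * (1 - z), x)            # gradient w.r.t. the weights W, shape (m, n)

print(z.shape, dx.shape, dW.shape)
```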

The objects

  • x: the input column vector (a list of numbers stacked vertically). Shape: (n,) or (n×1).
  • W: the weight matrix that multiplies x. Shape: (m×n) (m neurons, n inputs).
  • z: the output of applying the sigmoid activation to the linear combination W x. Shape: (m,).
Intuition: each neuron computes a weighted sum of the inputs (via W x), then the sigmoid squashes that sum to a value between 0 and 1.
Line 1 — Forward pass
z = 1/(1 + np.exp(-np.dot(W, x)))
np.dot(W, x) computes the linear combination: multiply weights by inputs (this is the “pre‑activation”).
np.exp(- … ) computes the exponential of the negative of that value.
1/(1 + exp(-u)) is the sigmoid function σ(u).  It turns any real number into a number between 0 and 1.
Take the inputs, combine them with the weights, and pass the result through a sigmoid to get the neuron outputs z (each between 0 and 1).
Line 2 — Backward pass: gradient w.r.t. input x
dx = np.dot(W.T, z*(1-z))

In backpropagation, we need derivatives. For the sigmoid, the derivative is σ(u)·(1−σ(u)). Since z = σ(u), the derivative is z*(1-z).

W.T is the transpose of W (rows/columns swapped).

Multiplying W.T by z*(1-z) gives how changes in x would change the output (the local gradient wrt x).

Compute how sensitive the outputs are (using z*(1-z)), then map that sensitivity back to each input using the weights. This tells us how a small change in each input x would affect the outputs after the sigmoid.
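This interpretation can be checked numerically with a finite-difference test. Note the check treats dx as the gradient of sum(z), i.e. it assumes an upstream gradient of all ones, since the snippet omits the rest of the network; the sizes and seed are arbitrary:

```python
import numpy as np

def forward(W, x):
    return 1 / (1 + np.exp(-np.dot(W, x)))

rng = np.random.default_rng(1)
W = rng.standard_normal((2, 3))
x = rng.standard_normal(3)
z = forward(W, x)

# analytic gradient of sum(z) with respect to x
dx = np.dot(W.T, z * (1 - z))

# numerical check: perturb each x_i and see how sum(z) changes
eps = 1e-6
dx_num = np.zeros_like(x)
for i in range(len(x)):
    x_plus = x.copy();  x_plus[i] += eps
    x_minus = x.copy(); x_minus[i] -= eps
    dx_num[i] = (forward(W, x_plus).sum() - forward(W, x_minus).sum()) / (2 * eps)

print(np.max(np.abs(dx - dx_num)))   # should be tiny
```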
Line 3 — Backward pass: gradient w.r.t. weights W
dW = np.outer(z*(1-z), x)

np.outer(a, b) creates a matrix where each element is a_i * b_j.

Here, a = z*(1-z) (the sigmoid derivative per neuron) and b = x (each input value).

The result is an (m×n) matrix: for each neuron (m rows) and each input feature (n columns), it tells how much changing that weight would change the output—the local gradient of the output w.r.t. each weight.
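The same finite-difference idea verifies one entry of dW: each dW[i, j] should match how z_i moves when W[i, j] is nudged (sizes, seed, and the probed entry are arbitrary choices):

```python
import numpy as np

def forward(W, x):
    return 1 / (1 + np.exp(-np.dot(W, x)))

rng = np.random.default_rng(2)
m, n = 2, 3
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
z = forward(W, x)

dW = np.outer(z * (1 - z), x)       # shape (m, n), same shape as W

# numerical check of a single entry: d z_i / d W[i, j]
eps = 1e-6
i, j = 1, 2
W_plus = W.copy();  W_plus[i, j] += eps
W_minus = W.copy(); W_minus[i, j] -= eps
num = (forward(W_plus, x)[i] - forward(W_minus, x)[i]) / (2 * eps)

print(dW[i, j], num)                # the two values should agree closely
```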

To summarize:

  • Forward: compute outputs z = sigmoid(Wx).
  • Backward (w.r.t. inputs): dx = Wᵀ · (z·(1−z)) says how changes in inputs would affect outputs after the sigmoid.
  • Backward (w.r.t. weights): dW = outer(z·(1−z), x) says how each weight affects the outputs and how it should be updated.
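As a sketch of how these pieces plug into an actual update, here is a toy SGD loop. The squared-error loss, targets, learning rate, and sizes are all assumptions made for the example, not part of the original snippet; the chain rule multiplies the upstream gradient (z − t) by the local sigmoid derivative:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.array([1.0, -0.5, 0.3])    # fixed toy input (assumed)
t = np.array([0.2, 0.8])          # assumed target outputs
W = rng.standard_normal((2, 3))   # 2 neurons, 3 inputs
lr = 1.0                          # learning rate, an illustrative choice

for step in range(1000):
    z = 1 / (1 + np.exp(-np.dot(W, x)))        # forward pass
    upstream = z - t                            # dL/dz for L = 0.5 * sum((z - t)**2)
    dW = np.outer(upstream * z * (1 - z), x)    # chain rule: upstream times local grad
    W -= lr * dW                                # plain SGD step

z = 1 / (1 + np.exp(-np.dot(W, x)))
loss = 0.5 * np.sum((z - t) ** 2)
print(loss)    # should be close to zero
```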
When weights are initialized with overly large values, the inputs to the sigmoid become extreme, causing the activation to saturate near 0 or 1. In saturation, the sigmoid’s derivative becomes almost zero, so the gradients computed during backpropagation vanish. Because the chain rule multiplies these tiny values across layers, the entire backward pass collapses to zero, stopping learning completely.
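A small experiment illustrates the effect of initialization scale on the local sigmoid gradient (the scales, layer sizes, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(100)          # a reasonably scaled input vector

means = {}
for scale in [0.1, 10.0]:             # modest vs. overly large weight initialization
    W = scale * rng.standard_normal((50, 100))
    z = 1 / (1 + np.exp(-np.dot(W, x)))
    means[scale] = (z * (1 - z)).mean()   # average local sigmoid derivative
    print(f"init scale {scale:5.1f}: mean sigmoid derivative = {means[scale]:.5f}")
```

With the large initialization, almost every neuron saturates and the average derivative collapses toward zero.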
There is another subtle issue with using the sigmoid activation function.
The derivative of the sigmoid—its local gradient—is:
z(1 − z)
This value reaches its maximum when z = 0.5, and at that point the derivative is

0.5 × 0.5 = 0.25

In other words: a sigmoid can never pass a gradient larger than 0.25.

So every time a gradient flows backward through a sigmoid, the signal shrinks to at most one quarter of its original size (and even less when z is not exactly 0.5).

If your network has multiple sigmoid layers stacked, this repeated shrinking makes the lower layers receive extremely tiny gradients.
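The best-case shrinkage compounds multiplicatively. A tiny sketch, assuming z = 0.5 at every layer (the most favorable case, so these are upper bounds):

```python
# Each sigmoid multiplies the backward signal by at most 0.25.
grad = 1.0
for layer in range(6):                 # six stacked sigmoid layers
    grad *= 0.25                       # best case: z = 0.5 at every layer
    print(f"after layer {layer + 1}: gradient <= {grad:.8f}")
```

After just six sigmoid layers, less than 0.03% of the original gradient magnitude can reach the bottom of the network.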

If you are training with plain stochastic gradient descent (SGD), this means:

  • the top layers (near the output) get decent gradients and learn faster
  • the bottom layers (near the input) get very small gradients and learn painfully slowly

This is one of the classic reasons that deep networks were hard to train before the introduction of ReLU and better initialization techniques.

Backpropagation is not a clean, perfectly hidden abstraction. If you try to ignore how it actually works because “TensorFlow magically makes my network train,” as Andrej Karpathy puts it, you’ll run into problems that you won’t know how to diagnose, and you won’t be as effective when building or debugging neural networks.

The positive news is that backpropagation becomes quite approachable when it’s explained the right way.
RESOURCES
Andrej Karpathy recommends the CS231n lecture on backpropagation, which focuses on intuition rather than algebra. And if you have the time, the CS231n assignments let you implement backpropagation yourself, which helps you truly understand how it works.

References

Technical article:

Images © https://cs231n.github.io/

Image caption: Loss function landscape for the Multiclass SVM (without regularization) for one single example (left, middle) and for a hundred examples (right) in CIFAR-10. Left: one-dimensional loss by only varying a. Middle, right: two-dimensional loss slices; blue = low loss, red = high loss. Notice the piecewise-linear structure of the loss function. The losses for multiple examples are combined with an average, so the bowl shape on the right is the average of many piecewise-linear bowls (such as the one in the middle).

Creative Commons License © Yolanda Muriel: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
