ARTIFICIAL INTELLIGENCE (16) – Deep learning (14) Understanding Backpropagation (2)

Another interesting type of non-linearity is the ReLU function, which sets any negative neuron output to zero. In a fully connected layer that uses ReLU, the essential operations in both the forward and backward passes include the following:

z = np.maximum(0, np.dot(W, x))     # forward pass

dW = np.outer(z > 0, x)             # backward pass: local gradient for W

x: the input to the layer, a vector of n features.

W: the weights of a fully connected (dense) layer.
This is a grid (matrix) of numbers that transforms input features into outputs.
If the layer has m outputs and the input has n features, W has shape (m × n).

z: the output after applying the ReLU function.
z has length m (one value per output neuron).

dW: the “local gradient” of the loss with respect to the weights W for this layer, given the ReLU gate.
dW has the same shape as W (m × n).

Line 1: Forward pass with ReLU

z = np.maximum(0, np.dot(W, x))

np.dot(W, x)

  • This multiplies the weight matrix W (size m×n) by the input vector x (size n).
  • The result is a vector of length m: one raw score per output neuron.
  • Conceptually: each output neuron computes a weighted sum of the inputs.

np.maximum(0, …)

  • This applies the ReLU function (Rectified Linear Unit): it replaces negative values with 0 and keeps positive values unchanged.
  • So if a raw score is negative, it becomes 0; if it’s positive, it stays as is.
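The forward pass can be traced with a tiny numeric sketch (the weights and inputs below are made-up values, just to make the arithmetic visible):

```python
import numpy as np

# Toy layer: 3 output neurons, 2 input features (made-up numbers).
W = np.array([[ 1.0, -1.0],
              [ 0.5,  0.5],
              [-2.0,  1.0]])
x = np.array([2.0, 3.0])

raw = np.dot(W, x)        # raw scores: one weighted sum per neuron
z = np.maximum(0, raw)    # ReLU: negative scores are clipped to 0

print(raw)  # [-1.   2.5 -1. ]
print(z)    # [0.   2.5  0. ]
```

Neurons 0 and 2 produced negative raw scores, so ReLU silenced them; only neuron 1 survives with its value unchanged.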

Line 2: Backward pass (local gradient for W)

dW = np.outer(z > 0, x)

  • During training, the model learns by adjusting weights to reduce errors.
  • To adjust correctly, it computes gradients (sensitivities) that say how much a small change in each weight would change the error.
  • The full gradient uses the chain rule (combining several pieces). This line computes the local piece for this layer, using the ReLU’s “on/off” behavior.
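To see how the local piece combines with the rest of the chain rule, here is a hedged sketch: `dz` below stands for the upstream gradient dL/dz arriving from later layers (its values are made up), and the ReLU gate zeroes it wherever the neuron was off before the outer product with x:

```python
import numpy as np

# Toy layer: 2 output neurons, 2 input features (made-up numbers).
W = np.array([[1.0, -1.0],
              [0.5,  0.5]])
x = np.array([2.0, 3.0])
z = np.maximum(0, np.dot(W, x))   # forward: neuron 0 is off, neuron 1 outputs 2.5

dz = np.array([0.7, -0.2])        # hypothetical upstream gradient dL/dz

# Full chain rule: gate the upstream gradient through ReLU (z > 0),
# then take the outer product with the input, as in the local-gradient line.
dW_full = np.outer(dz * (z > 0), x)
print(dW_full)
# Row 0 is all zeros (neuron off); row 1 is -0.2 * x.
```

The structure matches the local gradient exactly: rows belonging to inactive neurons are wiped out, and active rows are scaled copies of x.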

z > 0

  • This checks which outputs after ReLU were positive.
  • It returns an array of True/False values of length m (one per output neuron).
  • Think of it as a set of gates:
    • True (or 1): this neuron was “on” (positive output), so it can pass gradient.
    • False (or 0): this neuron was “off” (output 0), so it blocks gradient (ReLU derivative is zero there).

np.outer(z > 0, x)

  • The outer product takes a column (length m) and a row (length n) and produces an m × n matrix.
  • Here, the column is the on/off mask (z > 0), and the row is the input vector x.
  • The result is a matrix where:
    • Each row i is either all zeros (if neuron i was off) or a copy of x (if neuron i was on).
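This row structure is easy to verify directly (reusing the made-up forward-pass values, where only neuron 1 was on):

```python
import numpy as np

z = np.array([0.0, 2.5, 0.0])   # ReLU outputs: only neuron 1 was "on"
x = np.array([2.0, 3.0])

mask = z > 0                    # [False, True, False]
dW = np.outer(mask, x)          # booleans are promoted to 0/1 in the product

print(dW)
# [[0. 0.]
#  [2. 3.]
#  [0. 0.]]
```

Rows 0 and 2 are all zeros because those neurons were off; row 1 is an exact copy of x.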

When a neuron gets pushed down to zero during the forward pass (meaning ReLU outputs z = 0, so the neuron does not «activate»), then that neuron receives no gradient for its weights. This can cause the so‑called “dead ReLU” problem: if a ReLU neuron is initialized in a way that makes it never activate, or if during training a big weight update pushes it into this non‑active region, the neuron can stay permanently inactive. It’s similar to irreversible brain damage—once dead, it never recovers. In fact, if you run the full training dataset through a trained network, you might find that a large portion of neurons (e.g., 40%) were zero the entire time.
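One way to check for this in practice is to run data through a layer and record which neurons were ever positive. The sketch below uses a synthetic layer and dataset (random weights, with a few biases forced far negative to simulate dead units); the names and numbers are illustrative, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trained layer: 8 neurons, 4 input features.
W = rng.normal(size=(8, 4))
# Three neurons get a bias so negative they can never activate (simulated "dead" units).
b = np.array([0.1, -100.0, 0.2, -100.0, 0.0, -100.0, 0.3, 0.1])
X = rng.normal(size=(100, 4))          # stand-in "training dataset"

Z = np.maximum(0, X @ W.T + b)         # forward pass for all samples at once
ever_active = (Z > 0).any(axis=0)      # was each neuron on for at least one sample?
dead_fraction = 1.0 - ever_active.mean()
print(f"dead neurons: {dead_fraction:.0%}")
```

A neuron whose row of Z is zero for every sample in the dataset is exactly the kind of permanently silent unit described above.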

Image © https://karpathy.medium.com/

To summarize:

  • A ReLU neuron outputs zero whenever its input is negative.
  • If a neuron outputs zero, it also receives zero gradient during training.
  • No gradient means its weights will not be updated.
  • If this happens repeatedly, the neuron never “wakes up” again.

The «dead ReLU» problem:

A neuron becomes “dead” when:

  1. At initialization, its weights produce negative raw scores for every input: it never activates.
  2. During training, a large update pushes it into a region where it always produces negative values: it permanently stops firing.

Once the neuron stops activating:

  • It always outputs 0.
  • Its gradient is always 0.
  • Its weights never change again.

So the neuron becomes useless, like a switch that can no longer turn on.
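This lock-in can be demonstrated with a minimal sketch (made-up weights, inputs, and learning rate): once the neuron's output is zero for every input, its gradient is zero, so gradient-descent updates leave the weights exactly where they are.

```python
import numpy as np

# A single neuron whose weights are negative for all the (positive) inputs below.
w = np.array([-1.0, -1.0])
lr = 0.1

for x in [np.array([1.0, 2.0]), np.array([3.0, 0.5])]:
    z = max(0.0, np.dot(w, x))    # ReLU output: always 0 here
    grad_w = (z > 0) * x          # local gradient: zero whenever z == 0
    w -= lr * grad_w              # the update never changes w

print(w)  # still [-1. -1.]: the neuron never wakes up
```

No matter how many such updates run, w stays put, which is exactly the "switch that can no longer turn on."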

Dead neurons:

  • Don’t learn anything.
  • Don’t contribute to the model’s predictions.
  • Reduce the network’s capacity.

If you know how backpropagation works and you’re using ReLU activations in your network, you’re always concerned about dead ReLUs. These are neurons that never activate for any sample in the entire training set, and once they stop activating, they stay dead forever. Neurons can also die during training, often because the learning rate is too high.

 

RESOURCES
See a longer explanation in the CS231n lecture video.
Andrej Karpathy recommends the CS231n lecture on backpropagation, which focuses on intuition rather than algebra. And if you have the time, the CS231n assignments let you implement backpropagation yourself, helping you truly understand how it works.


Creative Commons License © Yolanda Muriel. Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
