ARTIFICIAL INTELLIGENCE (43) – Natural Language Processing (21) PPO Proximal Policy Optimization

Instead of directly asking humans for preferences on every output, model their preferences as a separate (NLP) problem. This means:

  • Humans label or rank a limited number of outputs.
  • A reward model learns to imitate those human judgments.
  • The language model is optimized against the reward model, not directly against humans.

This idea is inspired by earlier work (e.g., Knox and Stone, 2009) on learning from human preferences.

The Example

A single prompt (x) with two possible model outputs:

Prompt:

Something related to an earthquake in San Francisco.

Output y₁:

“An earthquake hit San Francisco. There was minor property damage, but no injuries.”

Output y₂:

“The Bay Area has good weather but is prone to earthquakes and wildfires.”

Humans clearly prefer y₁, because it:

  • directly answers the prompt,
  • is specific and factual, and
  • is relevant to the event described.

Human Reward Signal

Humans assign preference-based scores:

  • R(x, y₁) = 8.0
  • R(x, y₂) = 1.2

Important points:

  • These numbers do not come from humans directly during training.
  • Humans typically rank outputs rather than assign exact numbers.
  • The numeric reward is produced by a trained reward model.
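Since humans rank outputs rather than score them, each ranking is usually expanded into (preferred, rejected) pairs before training the reward model. A minimal sketch of that expansion; the `ranking_to_pairs` helper and the placeholder output names are hypothetical, not from the original text:

```python
def ranking_to_pairs(outputs_ranked_best_first):
    """Expand a human ranking into (winner, loser) pairs.
    Each pair becomes one training example for the reward model."""
    pairs = []
    for i, winner in enumerate(outputs_ranked_best_first):
        for loser in outputs_ranked_best_first[i + 1:]:
            pairs.append((winner, loser))
    return pairs

# Humans ranked y1 > y2 > y3 for some prompt:
print(ranking_to_pairs(["y1", "y2", "y3"]))
# [('y1', 'y2'), ('y1', 'y3'), ('y2', 'y3')]
```

A ranking of n outputs yields n·(n−1)/2 pairs, which is why even a modest amount of human labeling produces a usable training set.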

The Reward Model

Train a reward model RMφ(x, y) to predict human preference scores from an annotated dataset, then optimize the language model against RMφ instead of querying humans.

What RMφ means:

  • RM = Reward Model
  • φ (phi) = model parameters
  • Inputs: prompt x and output y
  • Output: predicted human preference score

This reward model is a neural network trained on human-labeled comparisons.
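Reward models are commonly trained with a pairwise (Bradley–Terry style) loss: the model is penalized when it scores the rejected output higher than the preferred one. A minimal numeric sketch, assuming the scores are scalars like the example values above:

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_preferred - r_rejected)).
    Small when the reward model agrees with the human preference,
    large when it disagrees."""
    diff = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Using the example scores from the text (y1 preferred over y2):
agree = pairwise_loss(8.0, 1.2)     # RM agrees with humans -> near zero
disagree = pairwise_loss(1.2, 8.0)  # RM disagrees -> large loss
print(agree, disagree)
```

In practice the scores come from a neural network and this loss is backpropagated through its parameters φ; the arithmetic above only illustrates the objective itself.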

PPO

Once we have a reward model, we want to update the language model so that:

  • it produces outputs with higher predicted reward, and
  • it does not change too abruptly, which could harm language quality.

This is where Proximal Policy Optimization (PPO) comes in.

Proximal Policy Optimization (PPO)

PPO is a reinforcement learning algorithm designed to optimize a policy (here, the language model) while keeping updates small and stable.

Key properties:

  • Prevents the model from exploiting reward model weaknesses.
  • Maintains fluency and coherence.
  • Balances exploration and safety.

In simple terms: PPO nudges the model toward better answers without breaking it.
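One common way RLHF implementations keep the model from changing too abruptly is to subtract a KL penalty from the reward model's score, discouraging the policy from drifting far from the original (reference) model. A minimal sketch, assuming scalar log-probabilities and a hypothetical penalty coefficient `beta`:

```python
def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Shaped reward: r = RM(x, y) - beta * (log pi(y|x) - log pi_ref(y|x)).
    If the policy assigns much higher probability than the reference
    model does, the penalty term reduces the effective reward."""
    return rm_score - beta * (logp_policy - logp_ref)

# Policy has drifted (assigns higher log-prob than the reference):
print(kl_shaped_reward(8.0, -2.0, -5.0, beta=0.1))  # -> 7.7
# Policy matches the reference: no penalty.
print(kl_shaped_reward(8.0, -2.0, -2.0, beta=0.1))  # -> 8.0
```

The coefficient `beta` trades off reward maximization against staying close to the reference model; the value 0.1 here is illustrative only.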

How PPO Works

  • The model generates several candidate answers.
  • The reward model scores them.
  • PPO adjusts the model’s parameters to:
    1. Increase the probability of high-reward outputs (like y₁).
    2. Decrease the probability of low-reward outputs (like y₂).
  • Constraints ensure the new model stays close to the previous version.

This “proximal” constraint is the reason for the algorithm’s name.
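That proximal constraint is implemented in PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped to a small interval around 1, so no single update can move the model too far. A minimal sketch of the per-sample objective, with an illustrative clip range `eps`:

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
    'ratio' is pi_new(y|x) / pi_old(y|x); 'advantage' is how much better
    the output was than expected. The clip is the 'proximal' constraint."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good output (positive advantage): gains are capped once ratio > 1 + eps,
# so the model cannot over-commit to it in one step.
print(ppo_clipped_objective(1.5, 2.0))   # 2.4, not 3.0
# Bad output (negative advantage): the penalty is not clipped away.
print(ppo_clipped_objective(1.5, -2.0))  # -3.0
```

In real training this objective is averaged over many sampled tokens and maximized by gradient ascent; the scalar version above only shows the clipping behavior that gives PPO its name.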

[Image: Proximal Policy Optimization (PPO) architecture. Source: “Financial News-Driven LLM Reinforcement Learning for Portfolio Management.”]

Why PPO Is Especially Important for Language Models

Without PPO:

  • The model may “game” the reward model.
  • Outputs may become repetitive, exaggerated, or unsafe.
  • Language quality can degrade.

With PPO:

  • Training remains stable.
  • Responses improve gradually.
  • Human alignment is preserved.

This is why PPO is the standard optimization algorithm for RLHF in large language models.

In summary:

  • Human feedback is expensive and limited.
  • A reward model is trained to approximate human preferences.
  • The language model is optimized against this reward model.
  • PPO is used to make stable, controlled improvements.
  • The example shows how better, more relevant answers receive higher reward.
  • This process is central to how ChatGPT learns to produce helpful responses.

 

Bonus

Write down your ideas on paper.

 

Creative Commons License © Yolanda Muriel – Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
