
Verifier-Calibrated On-Policy Distillation: A Practical Algorithm for Teaching Models Without Making Them Forget

A concrete post-training algorithm that combines on-policy sampling, verifier rewards, teacher logits, clipping, and replay to help language models learn new capabilities without catastrophic forgetting.

By Marvin B. Freedman | Jun 16, 2026 | 18 min read

Key Takeaways
  • Sample from the student, score with a verifier, use teacher logits as dense hints, clip dangerous token updates, and replay old skills on-policy to reduce forgetting.
  • The model should learn from the states it actually visits, not only from teacher-generated trajectories or a fixed external dataset.
  • Verifier-calibrated teacher guidance aims to combine RL locality, distillation density, and replay-based retention.

The Distribution Question

Most post-training methods can be understood through one simple question:

Core question

What distribution are we moving the model toward?

A language model is a probability distribution over sequences. Every post-training method reshapes that distribution. Supervised fine-tuning, reinforcement learning, and on-policy distillation all move probability mass in different ways.

Supervised fine-tuning pulls the model toward a fixed external dataset. Reinforcement learning updates the model based on samples generated by the model itself. On-policy distillation also trains on student-generated samples, but adds dense teacher guidance at the token level.

That difference matters. When we fine-tune a model on a fixed dataset, the model can be pulled far away from its original behavior. This is one reason supervised fine-tuning can cause catastrophic forgetting.

Reinforcement learning often forgets less because the data comes from the current model. The model samples from itself, receives a reward, and then updates toward higher-reward behavior. This naturally keeps the model closer to its own distribution.

But RL has a different problem: the reward signal is usually sparse. A whole answer may receive one score. The model knows whether the final result was good, but it does not always know which tokens mattered.

On-policy distillation sits between the two. It keeps the important on-policy property of RL, while giving the student dense token-level supervision from a teacher. The student generates the trajectory, and the teacher gives guidance on the states the student actually visits.

This suggests a practical next step: use the student's own rollouts, score them with a verifier, and use teacher logits only when they are calibrated by task success.

Proposed Algorithm

Verifier-Calibrated On-Policy Distillation

Sample from the student, score with a verifier, use teacher logits as dense hints, clip dangerous token updates, and replay old skills on-policy to reduce forgetting.

The model should learn from the states it actually visits.

Why SFT Forgets

Supervised fine-tuning is extremely useful. It is often the right first step when a model needs to learn a new output format, instruction style, or task structure.

But SFT has a natural forgetting failure mode. In SFT, every token in the dataset becomes a target. The model is trained to increase the probability of the demonstrated token, whether that token is task-critical or incidental.

A mathematical operator, a code edit, a variable name, a formatting choice, and a phrase like "therefore" can all receive direct gradient pressure. The loss does not know which tokens matter.

This creates broad updates. If the dataset distribution is far from the model's original distribution, the model may learn the new behavior while damaging older capabilities.

That does not mean SFT is bad. It means SFT should be used carefully. SFT is best for cold-start behavior. It teaches the model the shape of the task. It should not always be the main engine for teaching the deeper capability.

Why RL Forgets Less

Reinforcement learning behaves differently because the model trains on its own samples. The model generates outputs from its current policy. Those outputs are scored. Then the model shifts probability mass toward the outputs that received higher reward.

This means RL is not pulling the model toward an arbitrary external dataset. It is improving behavior in regions the model already visits.

Among all possible task-solving policies, on-policy training tends to find one that is close to the current model. This helps explain why RL can improve a capability without causing the same level of broad forgetting often seen with aggressive SFT.

But RL is expensive and inefficient. If a whole code solution receives one reward, the model may not know exactly which token fixed the bug. If a math answer is correct, the model may not know which step mattered most. Outcome rewards are honest, but sparse.

This is the credit assignment problem. RL gives us locality. Distillation gives us density. The goal is to combine them without inheriting the worst parts of either.

Why Plain Distillation Is Not Enough

Distillation gives the student a dense learning signal. Instead of receiving one reward for a full answer, the student can learn from the teacher's probability distribution at every token.

But teacher logits are not always a measure of task importance. Sometimes the largest teacher-student differences happen on style tokens. The teacher may strongly prefer "Wait", "Let's think", "Therefore", or a certain formatting pattern. Those tokens may have high KL divergence, but they may not be the tokens that actually solve the task.

If we blindly train on teacher logits, the model can over-optimize style. That is why distillation needs calibration. A verifier should decide whether the trajectory is actually good. The teacher should only provide dense guidance inside that verified frame.

The Core Algorithm

Let:

π0 = the original base model
πθ = the trainable student model
T  = the teacher model
V  = the verifier or reward function

The verifier takes a prompt and a completion:

V(x, y) -> reward

The teacher can be:

1. a stronger model,
2. a specialist model trained with SFT or RL,
3. the same model with privileged information added to the context.

The third option is especially interesting. For example, in a code repair task, the student sees only the buggy function. The teacher pass may also see the reference solution, hidden test result, or verified patch. The student and teacher are the same base model, but the teacher has privileged information.

The student learns from its own rollout. The teacher gives guidance on that rollout. The verifier decides whether that rollout deserves reinforcement.
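
As a minimal sketch of the privileged-information setup, the student and teacher passes can share one prompt template, with the reference fix appended only for the teacher. The template wording, the `build_context` helper, and the example functions are illustrative assumptions, not part of the proposal itself.

```python
from typing import Optional

# Hypothetical prompt template for the code repair task.
STUDENT_TEMPLATE = (
    "You are given a buggy function.\n"
    "Fix the bug and return only the corrected function.\n\n"
    "{buggy}\n"
)


def build_context(buggy: str, reference: Optional[str] = None) -> str:
    """Return the student context, or the teacher context if a reference is given."""
    context = STUDENT_TEMPLATE.format(buggy=buggy)
    if reference is not None:
        # Privileged information: only the teacher pass ever sees this part.
        context += "\nReference fix (hidden from the student):\n" + reference + "\n"
    return context


buggy = "def add(a, b):\n    return a - b"
fixed = "def add(a, b):\n    return a + b"

student_ctx = build_context(buggy)                   # what the student conditions on
teacher_ctx = build_context(buggy, reference=fixed)  # same model, extra information
```

In this sketch the teacher context is the student context plus the hidden reference, so the same base model can score the student's tokens with better information than the student had when it sampled them.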

Step 1: Use Small SFT Only for Format

Start with SFT only if the model cannot follow the task format. For example, in a minimal code editing task, the model may need to learn:

Input:
A buggy function

Output:
Only the corrected function

A good starting SFT setup:

1,000 to 10,000 examples
1 epoch
low learning rate
small adapter
format-focused examples

The goal is not to force the model to imitate every detail of a specialist dataset. The goal is only to teach the model the task interface. After that, the real learning should happen on-policy.

Step 2: Generate Student Rollouts

For each training prompt, sample multiple outputs from the current student:

y1, y2, ..., yK ~ πθ(. | x)

Good defaults:

K = 4
temperature = 0.7 to 1.0
top_p = 0.95

For expensive code tasks, use:

K = 2

For cheaper math or synthetic verifier tasks, use:

K = 8

This is the most important part of the algorithm. Do not train only on teacher outputs. Train on the states the student actually visits.

Autoregressive models create their own state distribution. A single wrong token can move the model into a prefix the teacher would never have written. If the student only trains on teacher-generated trajectories, it may not learn how to recover from its own mistakes.

On-policy sampling fixes that mismatch.
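
To make the sampling defaults concrete, here is a toy sketch of drawing K rollouts with temperature scaling and top-p (nucleus) filtering. The `next_logits` callable stands in for the student model and is purely hypothetical; a real run would use the sampling routine of whatever inference stack is in use.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample one token index using temperature scaling and nucleus filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose total mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


def sample_rollouts(next_logits, k=4, max_len=8, eos=0, seed=0):
    """Draw k completions from a toy policy `next_logits(prefix) -> list[float]`."""
    rng = random.Random(seed)
    rollouts = []
    for _ in range(k):
        prefix = []
        while len(prefix) < max_len:
            tok = sample_token(next_logits(prefix), rng=rng)
            prefix.append(tok)  # the next state depends on the student's own token
            if tok == eos:
                break
        rollouts.append(prefix)
    return rollouts
```

Each prefix is built from the student's own previous choices, which is what makes the visited states on-policy.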

Step 3: Score Each Rollout With a Verifier

Each student output receives a reward:

R_i = V(x, y_i)

For a minimal code editing task, the verifier should reward more than just passing tests. A model should not get full credit for rewriting an entire function when only one line needed to change.

A concrete reward could be:

if tests fail:
    reward = -1.0
else:
    reward = 1.0
             - 0.5 * normalized_extra_levenshtein
             - 0.3 * added_cognitive_complexity
             - 0.2 * unrelated_changed_lines_ratio

This reward encourages the model to:

pass the tests
make the smallest necessary edit
avoid unnecessary rewrites
preserve structure
avoid adding complexity
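
The reward above can be written as a small function. This is a sketch under one added assumption: each penalty term is already normalized to the range 0 to 1 (the clamp below enforces that), which the essay does not state explicitly. How the three penalty inputs are measured is left to the verifier.

```python
def _clamp01(value: float) -> float:
    """Assumption: penalty inputs are normalized to [0, 1]."""
    return max(0.0, min(1.0, value))


def minimal_edit_reward(
    tests_pass: bool,
    normalized_extra_levenshtein: float,
    added_cognitive_complexity: float,
    unrelated_changed_lines_ratio: float,
) -> float:
    """Reward correctness first, then penalize edits larger than necessary."""
    if not tests_pass:
        return -1.0
    return (
        1.0
        - 0.5 * _clamp01(normalized_extra_levenshtein)
        - 0.3 * _clamp01(added_cognitive_complexity)
        - 0.2 * _clamp01(unrelated_changed_lines_ratio)
    )
```

Because the penalty weights sum to 1.0, a passing rollout scores between 0.0 and 1.0 under this normalization, so any passing edit still outranks a failing one.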

Then normalize rewards within each prompt group:

A_i = (R_i - mean(R_1...R_K)) / (std(R_1...R_K) + ε)

This gives each rollout a sequence-level advantage. Higher advantage means the rollout was better than the other attempts for the same prompt. Lower advantage means it was worse.
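
A minimal sketch of the group normalization, assuming the population standard deviation (the essay does not say whether the population or sample form is intended) and a hypothetical `eps` of 1e-6:

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize the K rewards of one prompt into sequence-level advantages."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Two passing and two failing rollouts for the same prompt.
advantages = group_advantages([1.0, -1.0, 0.8, -1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and that prompt contributes no sequence-level signal for the step.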

Step 4: Add Verifier-Calibrated Teacher Guidance

For each token in a sampled completion:

s_t = prompt + previous tokens
a_t = current token

Compute the old student log-probability:

logp_student = log πθ_old(a_t | s_t)

Compute the teacher log-probability:

logp_teacher = log T(a_t | s_t, privileged_info)

Then compute the teacher-student difference:

D_t = logp_teacher - logp_student

But do not trust this difference blindly. Clip it:

D_t_clipped = clip(D_t, -0.5, 0.5)

Then gate it by the verifier advantage:

G_t = sigmoid(A_i)

The final dense token advantage becomes:

A_dense_t = A_i + λ_teacher * G_t * D_t_clipped

A good default is:

λ_teacher = 0.2

This is the key move. The verifier decides whether the trajectory is good. The teacher provides dense token-level hints. The clip prevents high-KL style tokens from dominating training. The gate prevents the teacher from strongly reinforcing trajectories that did not actually solve the task.

This makes the teacher useful without making the student blindly imitative.
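
The clip, gate, and combination steps fit in a few lines. The sketch below operates on plain lists of per-token log-probabilities for one rollout; the function and argument names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dense_token_advantages(a_seq, logp_teacher, logp_student, lam=0.2, clip=0.5):
    """Combine a sequence-level advantage with clipped, gated teacher hints.

    a_seq:        normalized verifier advantage A_i for the whole rollout
    logp_teacher: teacher log-probabilities of the sampled tokens
    logp_student: old-student log-probabilities of the same tokens
    """
    gate = sigmoid(a_seq)  # same gate for every token in the rollout
    dense = []
    for lt, ls in zip(logp_teacher, logp_student):
        d = max(-clip, min(clip, lt - ls))  # clipped teacher-student difference
        dense.append(a_seq + lam * gate * d)
    return dense


# Token 0: teacher much more confident; token 1: teacher much less confident.
out = dense_token_advantages(0.0, [-0.1, -3.0], [-2.0, -0.5])
```

With the defaults, the teacher term is bounded by 0.2 * 0.5 = 0.1 in magnitude, so it reshapes credit within a rollout but cannot flip the sign of any sequence advantage larger than 0.1 in absolute value.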

Step 5: Update With Clipped Policy Optimization

Use a PPO-style or GRPO-style clipped objective.

For each token:

ρ_t = πθ(a_t | s_t) / πθ_old(a_t | s_t)

Policy loss:

L_policy = -mean(
    min(
        ρ_t * A_dense_t,
        clip(ρ_t, 1 - ε, 1 + ε) * A_dense_t
    )
)

A good default:

ε = 0.2
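
As a framework-free sketch, the clipped objective can be computed from per-token log-probabilities. A real implementation would use tensors and automatic differentiation; this version only shows the arithmetic.

```python
import math


def clipped_policy_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped surrogate loss averaged over tokens."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # ρ_t = πθ(a_t | s_t) / πθ_old(a_t | s_t)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # Take the more pessimistic of the unclipped and clipped objectives.
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)
```

For a positive advantage, the objective stops growing once the ratio exceeds 1 + ε. For a negative advantage it stops improving once the ratio falls below 1 - ε, so a single batch cannot push any token's probability arbitrarily far.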

Add a small adaptive KL guard against the original base model:

L_kl = β * KL(πθ(. | s_t) || π0(. | s_t))

Make the KL coefficient adaptive:

if measured_KL > target_KL:
    β *= 1.2
else:
    β *= 0.95

Good defaults:

target_KL = 0.02 to 0.05 nats/token
β initial = 0.01
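
A small sketch of the KL guard: a per-state KL between two categorical next-token distributions, plus the multiplicative coefficient update. It assumes the reference distribution gives nonzero probability to every token the student can produce; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def update_beta(beta, measured_kl, target_kl=0.03, up=1.2, down=0.95):
    """Tighten the KL penalty when over budget, relax it slowly otherwise."""
    return beta * up if measured_kl > target_kl else beta * down


student = [0.7, 0.2, 0.1]  # πθ(. | s_t)
base = [0.5, 0.3, 0.2]     # π0(. | s_t)

beta = 0.01
beta = update_beta(beta, kl_divergence(student, base))
```

The asymmetric factors mean the coefficient rises faster than it decays, which biases the run toward staying inside the KL budget.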

Then add a small entropy bonus:

L = L_policy + L_kl - η * entropy_bonus

Good default:

η = 0.001

The entropy bonus matters because distillation can collapse the model's distribution quickly. Some entropy reduction is useful. It means the model is becoming more decisive. But too much entropy collapse can make the model brittle, repetitive, or overfit to narrow patterns.
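
Entropy is cheap to monitor. A minimal sketch, assuming access to the full next-token distribution at each sampled position:

```python
import math


def token_entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(per_token_probs):
    """Average entropy over the positions of a rollout or batch."""
    return sum(token_entropy(p) for p in per_token_probs) / len(per_token_probs)
```

Logging this value per step alongside KL makes a collapse visible as a steady slide toward zero rather than a surprise at evaluation time.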

Step 6: Preserve Old Skills With On-Policy Replay

Every few hundred training steps, sample from the current model on general anchor prompts:

y_old ~ πθ(. | x_old)

These anchor prompts should cover capabilities you do not want to lose:

general coding
reasoning
instruction following
summarization
tool-use format
chat helpfulness
domain knowledge

Filter the outputs by verifier score, self-confidence, or quality estimate. For example:

keep the top 20%

Then mix them into training:

current task batch: 80% to 90%
on-policy replay batch: 10% to 20%

This is not ordinary replay from a static old dataset. It is replay from the model's current distribution. That matters because the model is reminded of old capabilities without being yanked toward an unrelated external distribution.

On-policy replay gives the model a soft anchor. It helps preserve older behavior while the main training loop improves the new capability.
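
The filtering and mixing steps can be sketched as follows. Both helpers are hypothetical; they assume the scores come from whichever verifier or quality estimate is in use and that the task pool is at least as large as the task share of the batch.

```python
import random


def keep_top_quantile(samples, scores, quantile=0.2):
    """Keep the best-scoring fraction of freshly sampled anchor outputs."""
    n_keep = max(1, int(len(samples) * quantile))
    ranked = sorted(zip(scores, range(len(samples))), reverse=True)
    return [samples[i] for _, i in ranked[:n_keep]]


def mix_batch(task_items, replay_items, batch_size, replay_ratio=0.1, seed=0):
    """Build one training batch with a fixed share of replay items."""
    rng = random.Random(seed)
    n_replay = min(len(replay_items), round(batch_size * replay_ratio))
    n_task = batch_size - n_replay  # assumes len(task_items) >= n_task
    batch = rng.sample(task_items, n_task) + rng.sample(replay_items, n_replay)
    rng.shuffle(batch)
    return batch
```

Because the buffer is rebuilt from the current model at each refresh, the replay items track the student as it changes instead of anchoring it to a snapshot of old data.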

Full Pseudocode

# Verifier-Calibrated On-Policy Distillation
initialize student πθ from base π0
freeze reference πref = π0
optional: initialize teacher T

for step in range(num_steps):

    # Snapshot the sampling policy; ρ_t is measured against this frozen copy.
    πθ_old = frozen_copy(πθ)

    batch = sample_prompts(train_prompts)

    rollouts = []

    for x in batch:
        ys = sample_from_student(
            model=πθ,
            prompt=x,
            K=4,
            temperature=0.8,
            top_p=0.95
        )

        for y in ys:
            R = verifier(x, y)
            rollouts.append((x, y, R))

    rollouts = group_normalize_rewards(rollouts)

    token_records = []

    for x, y, A_seq in rollouts:
        for t, token in enumerate(y.tokens):

            s_t = make_prefix_state(x, y, t)

            logp_old = logprob(πθ_old, token, s_t)
            logp_new = logprob(πθ, token, s_t)
            logp_ref = logprob(πref, token, s_t)

            if teacher_available:
                logp_teacher = logprob(
                    T,
                    token,
                    s_t,
                    privileged_info=x.reference
                )

                D = clip(logp_teacher - logp_old, -0.5, 0.5)
                G = sigmoid(A_seq)

                A_dense = A_seq + 0.2 * G * D

            else:
                A_dense = A_seq

            token_records.append(
                (logp_new, logp_old, logp_ref, A_dense, s_t)
            )

    L_policy = clipped_policy_gradient(
        token_records,
        eps=0.2
    )

    L_kl = adaptive_kl_to_reference(
        model=πθ,
        reference=πref,
        token_records=token_records,
        target_kl=0.03
    )

    L_entropy = entropy_bonus(
        model=πθ,
        token_records=token_records
    )

    L = L_policy + L_kl - 0.001 * L_entropy

    update(πθ, L)

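    # Refresh the on-policy replay buffer periodically (Step 6).
    if step % replay_refresh_steps == 0:
        replay_buffer = build_on_policy_replay(
            model=πθ,
            prompts=anchor_prompts,
            scorer=verifier_or_self_confidence,
            keep_top_quantile=0.2
        )

    # Mix replay samples into training: about 90% current task, 10% replay.
    train_with_mixed_batches(
        replay_buffer=replay_buffer,
        current_task_ratio=0.9,
        replay_ratio=0.1
    )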

Why This Should Forget Less

The central hypothesis is: on-policy data constrains the model toward nearby improvements.

SFT can pull the model toward a fixed external dataset. That dataset may solve the task, but it may be far from the model's original behavior.

RL samples from the current model. Because of that, it tends to find task-solving behavior near the model's existing distribution. On-policy distillation inherits this benefit. Even if the teacher was trained with SFT, the student does not train on the teacher's state distribution. The student trains on its own state distribution.

The teacher gives advice. The student chooses the states. The verifier decides what counts as success. That combination is what makes the algorithm interesting.

Why the Student Can Beat the Teacher

A student can outperform its teacher when it receives guidance on the states it actually visits.

Traditional distillation often trains on teacher-generated trajectories. But the student's mistakes are not always the teacher's mistakes. If the student rarely visits the teacher's prefixes at inference time, some of that supervision is wasted.

In on-policy distillation, the teacher gives advice on the student's own prefixes. That can be more useful than copying the teacher's final answers.

The teacher does not need to be perfect. It only needs to provide useful local information when the student reaches a state where guidance matters. Verifier calibration improves this further. The verifier anchors training to real task success. The teacher fills in dense token-level structure.

The result is not pure imitation. It is guided distributional shaping.

Example: Minimal Code Editing

A minimal code editing task is a strong testbed because it measures two skills at the same time:

1. Can the model fix the bug?
2. Can the model avoid unnecessary rewriting?

A normal code benchmark usually rewards correctness. But real coding agents need more than correctness. They need restraint.

A good coding assistant should not rewrite an entire function to fix one wrong operator. It should make the smallest safe change.

A minimal editing verifier should reward:

passing tests
small diffs
preserved names
preserved formatting
preserved structure
low added complexity
no unrelated changes

The training prompt can be simple:

You are given a buggy function.

Fix the bug.

Preserve the original structure, names, formatting, and logic.

Change only what is necessary.

Return only the corrected function.

Evaluation should include:

Pass@1
Normalized Levenshtein distance
Added cognitive complexity
Unrelated changed lines
General coding benchmark score
KL from the base model
Entropy per token

The model should only be considered improved if:

task pass rate improves
extra diff decreases
general coding ability does not degrade
KL stays under budget
entropy does not collapse

This matters because a model can look better on a narrow task while becoming worse overall. The goal is not just task improvement. The goal is task improvement with controlled distribution movement.
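
The acceptance criteria can be encoded as a single gate so that a checkpoint is never promoted on task score alone. The field names and thresholds below are hypothetical placeholders; only the five conditions come from the list above.

```python
from dataclasses import dataclass


@dataclass
class EvalSnapshot:
    pass_at_1: float      # task pass rate
    extra_diff: float     # e.g. normalized Levenshtein beyond the minimal fix
    general_score: float  # general coding benchmark score
    kl_from_base: float   # nats/token
    entropy: float        # nats/token


def is_improved(base, cand, kl_budget=0.05, max_general_drop=0.0,
                min_entropy_frac=0.5):
    """Return True only if every acceptance condition holds at once."""
    return (
        cand.pass_at_1 > base.pass_at_1
        and cand.extra_diff < base.extra_diff
        and cand.general_score >= base.general_score - max_general_drop
        and cand.kl_from_base <= kl_budget
        and cand.entropy >= min_entropy_frac * base.entropy
    )


base = EvalSnapshot(0.40, 0.30, 0.60, 0.00, 1.0)
good = EvalSnapshot(0.55, 0.20, 0.60, 0.03, 0.8)
collapsed = EvalSnapshot(0.70, 0.10, 0.60, 0.03, 0.2)
```

Here `good` passes the gate, while `collapsed` is rejected despite its higher pass rate because its entropy fell below half of the base model's.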

Practical Defaults

For a first run:

Base model: Qwen, Llama, DeepSeek, or another coding-capable instruct model
Training method: LoRA or QLoRA
LoRA rank: 32 or 64
Learning rate: 5e-6 to 1e-5
Rollouts per prompt: 4
Batch prompts: 16 to 64
Teacher coefficient: 0.2
Teacher logit clip: ±0.5
PPO clip: 0.2
KL target: 0.03 nats/token
Entropy bonus: 0.001
Replay ratio: 10%
Replay refresh: every 200 to 500 optimizer steps
Eval cadence: every 100 to 250 steps

For expensive code tasks:

K = 2 rollouts per prompt

For cheaper verifier tasks:

K = 8 rollouts per prompt

Start small. Track KL, entropy, task reward, and general benchmark retention from the beginning. Do not wait until the end to discover that the model collapsed.

The Bigger Point

The next generation of post-training algorithms should not be framed as a simple fight between SFT, RL, and distillation.

The deeper question is: where does the training distribution come from?

If the data comes from a fixed external dataset, the model can be dragged toward behavior far from its original distribution. If the data comes from the student itself, the update is naturally constrained to the regions the model already visits.

That is why on-policy methods are so important. But on-policy data alone is not enough. RL can be too sparse. Distillation can be too biased. Teacher logits can overweight style. Reward models can be gamed. Entropy can collapse.

So the practical answer is not to choose one method blindly.

The practical answer is to combine the right parts:

On-policy sampling gives locality.
Verifier rewards give truth.
Teacher logits give dense credit.
Clipping keeps style tokens from dominating.
Replay preserves old skills.

That is Verifier-Calibrated On-Policy Distillation.

It is a simple idea: let the model learn from where it actually is, not from where a dataset wishes it were.

If we want models to gain new capabilities without destroying old ones, this is the direction post-training should move.

Further Reading

This essay builds on recent work and discussion around supervised fine-tuning, reinforcement learning, on-policy distillation, catastrophic forgetting, reverse-KL distillation, and on-policy replay.

The source essay that inspired this proposal frames SFT, RL, and OPD as different ways of reshaping a model's distribution.

SFT, RL, and On-Policy Distillation Through a Distributional Lens

Thinking Machines Lab explains dense on-policy supervision, where a student samples its own trajectories while a teacher provides token-level feedback.

On-Policy Distillation

RL's Razor argues that online RL tends to find task-solving policies that stay closer in KL to the original model than SFT.

RL's Razor: Why Online Reinforcement Learning Forgets Less

Self-Distilled Reasoner covers privileged self-distillation, where the same model acts as both student and teacher with extra information available to the teacher pass.

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

MiniLLM is useful background on reverse-KL distillation and mode-seeking behavior in language model distillation.

MiniLLM: Knowledge Distillation of Large Language Models

Additional related concepts worth searching:

DAgger: Dataset Aggregation
On-Policy Replay
GRPO
PPO clipping
RLVR
Catastrophic forgetting in LLM post-training
Minimal code editing benchmarks

Closing Note

This is a proposed algorithm, not a finished empirical result.

The next step is to test it. The right experiment is straightforward:

Train three models on the same verifiable task:

1. SFT baseline
2. RL baseline
3. Verifier-Calibrated On-Policy Distillation

Then compare:

task success
KL from base model
entropy collapse
general benchmark retention
minimal edit behavior
teacher dependence

If the hypothesis is right, Verifier-Calibrated On-Policy Distillation should land in the best part of the tradeoff curve: stronger than plain SFT, denser than plain RL, and safer than blind distillation.

Published as a research proposal and technical essay.

© 2026 Marvin B. Freedman.
