Verifier-Calibrated On-Policy Distillation: A Practical Algorithm for Teaching Models Without Making Them Forget
A concrete post-training algorithm that combines on-policy sampling, verifier rewards, teacher logits, clipping, and replay to help language models learn new capabilities without catastrophic forgetting.
- Sample from the student, score with a verifier, use teacher logits as dense hints, clip dangerous token updates, and replay old skills on-policy to reduce forgetting.
- The model should learn from the states it actually visits, not only from teacher-generated trajectories or a fixed external dataset.
- Verifier-calibrated teacher guidance aims to combine RL locality, distillation density, and replay-based retention.
Linked Credits
This proposal builds on the distributional view of post-training, dense on-policy supervision, KL-minimal forgetting arguments, privileged self-distillation, and reverse-KL distillation background.
SFT, RL, and On-Policy Distillation Through a Distributional Lens
On-Policy Distillation
RL's Razor: Why Online Reinforcement Learning Forgets Less
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
MiniLLM: Knowledge Distillation of Large Language Models
The Distribution Question
Most post-training methods can be understood through one simple question:
What distribution are we moving the model toward?
A language model is a probability distribution over sequences. Every post-training method reshapes that distribution. Supervised fine-tuning, reinforcement learning, and on-policy distillation all move probability mass in different ways.
Supervised fine-tuning pulls the model toward a fixed external dataset. Reinforcement learning updates the model based on samples generated by the model itself. On-policy distillation also trains on student-generated samples, but adds dense teacher guidance at the token level.
That difference matters. When we fine-tune a model on a fixed dataset, the model can be pulled far away from its original behavior. This is one reason supervised fine-tuning can cause catastrophic forgetting.
Reinforcement learning often forgets less because the data comes from the current model. The model samples from itself, receives a reward, and then updates toward higher-reward behavior. This naturally keeps the model closer to its own distribution.
But RL has a different problem: the reward signal is usually sparse. A whole answer may receive one score. The model knows whether the final result was good, but it does not always know which tokens mattered.
On-policy distillation sits between the two. It keeps the important on-policy property of RL, while giving the student dense token-level supervision from a teacher. The student generates the trajectory, and the teacher gives guidance on the states the student actually visits.
This suggests a practical next step: use the student's own rollouts, score them with a verifier, and use teacher logits only when they are calibrated by task success.
Proposed Algorithm
Sample from the student, score with a verifier, use teacher logits as dense hints, clip dangerous token updates, and replay old skills on-policy to reduce forgetting.
The model should learn from the states it actually visits.
Why SFT Forgets
Supervised fine-tuning is extremely useful. It is often the right first step when a model needs to learn a new output format, instruction style, or task structure.
But SFT has a natural forgetting failure mode. In SFT, every token in the dataset becomes a target. The model is trained to increase the probability of the demonstrated token, whether that token is task-critical or incidental.
A mathematical operator, a code edit, a variable name, a formatting choice, and a filler word like "therefore" all receive direct gradient pressure. The loss does not know which tokens matter.
This creates broad updates. If the dataset distribution is far from the model's original distribution, the model may learn the new behavior while damaging older capabilities.
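A toy illustration of this point, using made-up tokens and probabilities rather than a real model: the SFT objective assigns a negative log-likelihood term to every demonstrated token, so an incidental word the model finds unlikely can attract more gradient pressure than the token that actually fixes the task.

```python
import math

def sft_token_losses(tokens, probs):
    """Per-token negative log-likelihood, the quantity SFT sums over every target token."""
    return [(tok, -math.log(p)) for tok, p in zip(tokens, probs)]

# Hypothetical demonstration. probs[i] is the model's current probability of tokens[i].
tokens = ["therefore", "x", ">=", "0"]
probs = [0.05, 0.90, 0.30, 0.80]

losses = sft_token_losses(tokens, probs)
heaviest = max(losses, key=lambda pair: pair[1])[0]

print(losses)
print("largest loss on:", heaviest)
```

In this made-up case the filler word carries the largest loss, even though the operator is the task-critical token.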
That does not mean SFT is bad. It means SFT should be used carefully. SFT is best for cold-start behavior. It teaches the model the shape of the task. It should not always be the main engine for teaching the deeper capability.
Why RL Forgets Less
Reinforcement learning behaves differently because the model trains on its own samples. The model generates outputs from its current policy. Those outputs are scored. Then the model shifts probability mass toward the outputs that received higher reward.
This means RL is not pulling the model toward an arbitrary external dataset. It is improving behavior in regions the model already visits.
Among all policies that solve the task, on-policy training tends to find one that stays close to the current model, which in the RL's Razor framing means a small KL divergence from the original model. This helps explain why RL can improve a capability without causing the same level of broad forgetting often seen with aggressive SFT.
But RL is expensive and sample-inefficient, because each rollout yields only a sparse signal. If a whole code solution receives one reward, the model may not know exactly which token fixed the bug. If a math answer is correct, the model may not know which step mattered most. Outcome rewards are honest, but sparse.
This is the credit assignment problem. RL gives us locality. Distillation gives us density. The goal is to combine them without inheriting the worst parts of either.
Why Plain Distillation Is Not Enough
Distillation gives the student a dense learning signal. Instead of receiving one reward for a full answer, the student can learn from the teacher's probability distribution at every token.
But teacher logits do not always reflect task importance. Sometimes the largest teacher-student differences fall on style tokens. The teacher may strongly prefer "Wait", "Let's think", "Therefore", or a certain formatting pattern. Those tokens may have high KL divergence, but they may not be the tokens that actually solve the task.
If we blindly train on teacher logits, the model can over-optimize style. That is why distillation needs calibration. A verifier should decide whether the trajectory is actually good. The teacher should only provide dense guidance inside that verified frame.
The Core Algorithm
Let:
π0 = the original base model
πθ = the trainable student model
T = the teacher model
V = the verifier or reward function
The verifier takes a prompt and a completion:
V(x, y) -> reward
The teacher can be:
1. a stronger model,
2. a specialist model trained with SFT or RL,
3. the same model with privileged information added to the context.
The third option is especially interesting. For example, in a code repair task, the student sees only the buggy function. The teacher pass may also see the reference solution, hidden test result, or verified patch. The student and teacher are the same base model, but the teacher has privileged information.
The student learns from its own rollout. The teacher gives guidance on that rollout. The verifier decides whether that rollout deserves reinforcement.
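One way to picture the privileged-information teacher is as two context builders over the same example. This is a minimal sketch: the `RepairExample` class, the field names, and the prompt wording are all hypothetical, and where the privileged text is placed in the context is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RepairExample:
    buggy_function: str
    # Privileged field: only the teacher pass may read it (hypothetical name).
    reference_patch: Optional[str] = None

def student_context(ex: RepairExample) -> str:
    """What the student sees: the task and the buggy function only."""
    return f"Fix the bug.\n\n{ex.buggy_function}"

def teacher_context(ex: RepairExample) -> str:
    """Same model, same task, plus the privileged reference when available."""
    ctx = student_context(ex)
    if ex.reference_patch is not None:
        ctx += f"\n\nReference solution (privileged):\n{ex.reference_patch}"
    return ctx

ex = RepairExample(
    buggy_function="def is_adult(age):\n    return age > 18",
    reference_patch="def is_adult(age):\n    return age >= 18",
)
print(teacher_context(ex))
```

The student's sampled tokens would then be scored by the same weights under both contexts, so any difference in log-probability comes from the privileged information alone.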
Step 1: Use Small SFT Only for Format
Start with SFT only if the model cannot follow the task format. For example, in a minimal code editing task, the model may need to learn:
Input:
A buggy function
Output:
Only the corrected function
A good starting SFT setup:
1,000 to 10,000 examples
1 epoch
low learning rate
small adapter
format-focused examples
The goal is not to force the model to imitate every detail of a specialist dataset. The goal is only to teach the model the task interface. After that, the real learning should happen on-policy.
Step 2: Generate Student Rollouts
For each training prompt, sample multiple outputs from the current student:
y1, y2, ..., yK ~ πθ(. | x)
Good defaults:
K = 4
temperature = 0.7 to 1.0
top_p = 0.95
For expensive code tasks, use:
K = 2
For cheaper math or synthetic verifier tasks, use:
K = 8
This is the most important part of the algorithm. Do not train only on teacher outputs. Train on the states the student actually visits.
Autoregressive models create their own state distribution. A single wrong token can move the model into a prefix the teacher would never have written. If the student only trains on teacher-generated trajectories, it may not learn how to recover from its own mistakes.
On-policy sampling fixes that mismatch.
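The sampling step can be sketched without any model library. In the sketch below, `next_logits` is a hypothetical stand-in for the student's forward pass, token ids are plain integers, and `eos=0` is an arbitrary choice; only the temperature and top_p handling is meant to be illustrative.

```python
import math
import random

def sample_token(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample one token id using temperature scaling and nucleus (top-p) filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest high-probability set whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def sample_rollouts(next_logits, prompt, K=4, max_len=8, eos=0, seed=0):
    """Draw K completions from the student; each next step conditions on its own prefix."""
    rng = random.Random(seed)
    rollouts = []
    for _ in range(K):
        tokens = []
        for _ in range(max_len):
            tok = sample_token(next_logits(prompt, tokens), rng=rng)
            tokens.append(tok)
            if tok == eos:
                break
        rollouts.append(tokens)
    return rollouts

def toy_logits(prompt, tokens):
    # Hypothetical 4-token vocabulary; token 3 is almost never plausible.
    return [0.0, 1.0, 2.0, -50.0]

rollouts = sample_rollouts(toy_logits, "fix the bug", K=4, seed=1)
print(rollouts)
```

Because each next-token call receives the tokens the student has already produced, any prefix the student wanders into becomes a state that later steps can supervise.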
Step 3: Score Each Rollout With a Verifier
Each student output receives a reward:
R_i = V(x, y_i)
For a minimal code editing task, the verifier should reward more than just passing tests. A model should not get full credit for rewriting an entire function when only one line needed to change.
A concrete reward could be:
if tests fail:
    reward = -1.0
else:
    reward = 1.0
        - 0.5 * normalized_extra_levenshtein
        - 0.3 * added_cognitive_complexity
        - 0.2 * unrelated_changed_lines_ratio
This reward encourages the model to:
pass the tests
make the smallest necessary edit
avoid unnecessary rewrites
preserve structure
avoid adding complexity
Then normalize rewards within each prompt group:
A_i = (R_i - mean(R_1...R_K)) / (std(R_1...R_K) + ε)
This gives each rollout a sequence-level advantage. Higher advantage means the rollout was better than the other attempts for the same prompt. Lower advantage means it was worse.
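The reward and the group normalization above can be written as two small functions. This is a sketch under assumptions: the three penalty inputs are taken to be already computed and scaled to [0, 1], and the population standard deviation is used because the text does not say which estimator is intended.

```python
from statistics import mean, pstdev

def minimal_edit_reward(tests_pass, extra_levenshtein, added_complexity, unrelated_ratio):
    """Reward from the text. Penalty inputs are assumed pre-normalized to [0, 1]."""
    if not tests_pass:
        return -1.0
    return 1.0 - 0.5 * extra_levenshtein - 0.3 * added_complexity - 0.2 * unrelated_ratio

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt group: (R_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the K rollouts
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical rollouts for one prompt.
rewards = [
    minimal_edit_reward(True, 0.0, 0.0, 0.0),   # passes with a minimal edit
    minimal_edit_reward(False, 0.0, 0.0, 0.0),  # fails the tests
    minimal_edit_reward(True, 0.2, 0.0, 0.5),   # passes but touches unrelated lines
    minimal_edit_reward(False, 0.9, 0.4, 0.8),  # fails after a large rewrite
]
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

One consequence worth noting: if every rollout in a group receives the same reward, all advantages are zero and that prompt contributes no sequence-level signal on that step.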
Step 4: Add Verifier-Calibrated Teacher Guidance
For each token in a sampled completion:
s_t = prompt + previous tokens
a_t = current token
Compute the old student log-probability:
logp_student = log πθ_old(a_t | s_t)
Compute the teacher log-probability:
logp_teacher = log T(a_t | s_t, privileged_info)
Then compute the teacher-student difference:
D_t = logp_teacher - logp_student
But do not trust this difference blindly. Clip it:
D_t_clipped = clip(D_t, -0.5, 0.5)
Then gate it by the verifier advantage:
G_t = sigmoid(A_i)
The final dense token advantage becomes:
A_dense_t = A_i + λ_teacher * G_t * D_t_clipped
A good default is:
λ_teacher = 0.2
This is the key move. The verifier decides whether the trajectory is good. The teacher provides dense token-level hints. The clip prevents high-KL style tokens from dominating training. The gate prevents the teacher from strongly reinforcing trajectories that did not actually solve the task.
This makes the teacher useful without making the student blindly imitative.
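The Step 4 arithmetic is small enough to check by hand. The sketch below applies the clip, the sigmoid gate, and the λ_teacher weighting to plain lists of log-probabilities; the input numbers are invented for illustration.

```python
import math

def dense_token_advantages(seq_advantage, logp_teacher, logp_student_old,
                           lam_teacher=0.2, clip_value=0.5):
    """A_dense_t = A_i + lam_teacher * sigmoid(A_i) * clip(logp_teacher - logp_student)."""
    gate = 1.0 / (1.0 + math.exp(-seq_advantage))
    dense = []
    for lt, ls in zip(logp_teacher, logp_student_old):
        d = max(-clip_value, min(clip_value, lt - ls))
        dense.append(seq_advantage + lam_teacher * gate * d)
    return dense

# Hypothetical two-token completion from a rollout with sequence advantage 1.0.
# Token 0: the teacher likes it far more than the student did (raw difference +1.9).
# Token 1: the teacher likes it far less (raw difference -2.0).
dense = dense_token_advantages(
    seq_advantage=1.0,
    logp_teacher=[-0.1, -3.0],
    logp_student_old=[-2.0, -1.0],
)
print(dense)
```

With these defaults the teacher term can shift a token's advantage by at most λ_teacher times the clip value, that is 0.2 × 0.5 = 0.1, because the sigmoid gate is always below 1. The verifier's sequence-level advantage therefore stays the dominant term whenever its magnitude exceeds 0.1.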
Step 5: Update With Clipped Policy Optimization
Use a PPO-style or GRPO-style clipped objective.
For each token:
ρ_t = πθ(a_t | s_t) / πθ_old(a_t | s_t)
Policy loss:
L_policy = -mean(
    min(
        ρ_t * A_dense_t,
        clip(ρ_t, 1 - ε, 1 + ε) * A_dense_t
    )
)
A good default:
ε = 0.2
Add a small adaptive KL guard against the original base model:
L_kl = β * KL(πθ(. | s_t) || π0(. | s_t))
Make the KL coefficient adaptive:
if measured_KL > target_KL:
    β *= 1.2
else:
    β *= 0.95
Good defaults:
target_KL = 0.02 to 0.05 nats/token
β initial = 0.01
Then add a small entropy bonus:
L = L_policy + L_kl - η * entropy_bonus
Good default:
η = 0.001
The entropy bonus matters because distillation can collapse the model's distribution quickly. Some entropy reduction is useful. It means the model is becoming more decisive. But too much entropy collapse can make the model brittle, repetitive, or overfit to narrow patterns.
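Two pieces of Step 5 are easy to sanity-check in isolation: the clipped surrogate and the multiplicative β controller. The sketch below works on plain floats rather than tensors, so it shows the arithmetic only and is not a training-ready loss.

```python
import math

def clipped_policy_loss(logp_new, logp_old, advantages, eps=0.2):
    """-mean(min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)) over tokens."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        rho = math.exp(lp_new - lp_old)
        rho_clipped = max(1.0 - eps, min(1.0 + eps, rho))
        terms.append(min(rho * adv, rho_clipped * adv))
    return -sum(terms) / len(terms)

def update_beta(beta, measured_kl, target_kl=0.03):
    """Raise the KL coefficient when over target, relax it slowly otherwise."""
    return beta * 1.2 if measured_kl > target_kl else beta * 0.95

# The token became more likely (rho about 1.65) with a positive advantage:
# the clip caps the surrogate at 1.2 * A, so pushing further earns nothing.
loss_capped = clipped_policy_loss([-0.5], [-1.0], [1.0])

# Same ratio with a negative advantage: the unclipped term is the minimum,
# so the penalty for raising a bad token's probability is not capped.
loss_uncapped = clipped_policy_loss([-0.5], [-1.0], [-1.0])

print(loss_capped, loss_uncapped)
print(update_beta(0.01, measured_kl=0.05), update_beta(0.01, measured_kl=0.01))
```

The asymmetry in the two example calls is the reason for taking the minimum: the clip limits how much a single batch can reward a move away from the old policy, but it does not limit the correction when a move was in the wrong direction.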
Step 6: Preserve Old Skills With On-Policy Replay
Every few hundred training steps, sample from the current model on general anchor prompts:
y_old ~ πθ(. | x_old)
These anchor prompts should cover capabilities you do not want to lose:
general coding
reasoning
instruction following
summarization
tool-use format
chat helpfulness
domain knowledge
Filter the outputs by verifier score, self-confidence, or quality estimate. For example:
keep the top 20%
Then mix them into training:
current task batch: 80% to 90%
on-policy replay batch: 10% to 20%
This is not ordinary replay from a static old dataset. It is replay from the model's current distribution. That matters because the model is reminded of old capabilities without being yanked toward an unrelated external distribution.
On-policy replay gives the model a soft anchor. It helps preserve older behavior while the main training loop improves the new capability.
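The filtering and mixing can be sketched as follows. The tuple layout `(prompt, completion, score)` and both function signatures are assumptions made for illustration; the scores would come from whichever verifier or quality estimate is in use.

```python
import random

def build_on_policy_replay(samples, keep_top_quantile=0.2):
    """Keep the best-scoring fraction of completions sampled from the current student.

    samples: list of (prompt, completion, score) tuples (hypothetical layout).
    """
    ranked = sorted(samples, key=lambda s: s[2], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_top_quantile))
    return ranked[:n_keep]

def mix_batch(task_items, replay_items, batch_size=10, replay_ratio=0.1, seed=0):
    """Build one training batch with roughly replay_ratio of its items from replay."""
    rng = random.Random(seed)
    n_replay = min(len(replay_items), round(batch_size * replay_ratio))
    batch = rng.sample(replay_items, n_replay) + rng.sample(task_items, batch_size - n_replay)
    rng.shuffle(batch)
    return batch

# Ten hypothetical anchor-prompt completions with scores 0.0 .. 0.9.
anchor_samples = [("anchor prompt", f"completion {i}", i / 10) for i in range(10)]
replay_buffer = build_on_policy_replay(anchor_samples, keep_top_quantile=0.2)

task_items = list(range(100))  # stand-ins for current-task rollouts
batch = mix_batch(task_items, replay_buffer, batch_size=10, replay_ratio=0.2)
print(replay_buffer)
print(batch)
```

The buffer has to be rebuilt from fresh samples on each refresh. If it were built once and reused, it would drift into ordinary static replay as the student moves away from the policy that produced it.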
Full Pseudocode
# Verifier-Calibrated On-Policy Distillation
initialize student πθ from base π0
freeze reference πref = π0
optional: initialize teacher T
for step in range(num_steps):
    # Snapshot the sampling policy; importance ratios are taken against it.
    πθ_old = frozen_copy(πθ)
    batch = sample_prompts(train_prompts)
    rollouts = []
    for x in batch:
        ys = sample_from_student(
            model=πθ,
            prompt=x,
            K=4,
            temperature=0.8,
            top_p=0.95
        )
        for y in ys:
            R = verifier(x, y)
            rollouts.append((x, y, R))
    # Replace each raw reward R with its group-normalized advantage A_seq.
    rollouts = group_normalize_rewards(rollouts)
    token_records = []
    for x, y, A_seq in rollouts:
        for t, token in enumerate(y.tokens):
            s_t = make_prefix_state(x, y, t)
            logp_old = logprob(πθ_old, token, s_t)
            logp_new = logprob(πθ, token, s_t)
            logp_ref = logprob(πref, token, s_t)
            if teacher_available:
                logp_teacher = logprob(
                    T,
                    token,
                    s_t,
                    privileged_info=x.reference
                )
                D = clip(logp_teacher - logp_old, -0.5, 0.5)
                G = sigmoid(A_seq)
                A_dense = A_seq + 0.2 * G * D
            else:
                A_dense = A_seq
            token_records.append(
                (logp_new, logp_old, logp_ref, A_dense, s_t)
            )
    L_policy = clipped_policy_gradient(
        token_records,
        eps=0.2
    )
    L_kl = adaptive_kl_to_reference(
        model=πθ,
        reference=πref,
        token_records=token_records,
        target_kl=0.03
    )
    L_entropy = entropy_bonus(
        model=πθ,
        token_records=token_records
    )
    L = L_policy + L_kl - 0.001 * L_entropy
    update(πθ, L)
    # Periodically rebuild the replay buffer from the current student.
    if step % replay_refresh_steps == 0:
        replay_buffer = build_on_policy_replay(
            model=πθ,
            prompts=anchor_prompts,
            scorer=verifier_or_self_confidence,
            keep_top_quantile=0.2
        )
    train_with_mixed_batches(
        current_task_ratio=0.9,
        replay_ratio=0.1
    )
Why This Should Forget Less
The central hypothesis is: on-policy data constrains the model toward nearby improvements.
SFT can pull the model toward a fixed external dataset. That dataset may solve the task, but it may be far from the model's original behavior.
RL samples from the current model. Because of that, it tends to find task-solving behavior near the model's existing distribution. On-policy distillation inherits this benefit. Even if the teacher was trained with SFT, the student does not train on the teacher's state distribution. The student trains on its own state distribution.
The teacher gives advice. The student chooses the states. The verifier decides what counts as success. That combination is what makes the algorithm interesting.
Why the Student Can Beat the Teacher
A student can outperform its teacher when it receives guidance on the states it actually visits.
Traditional distillation often trains on teacher-generated trajectories. But the student's mistakes are not always the teacher's mistakes. If the student rarely visits the teacher's prefixes at inference time, some of that supervision is wasted.
In on-policy distillation, the teacher gives advice on the student's own prefixes. That can be more useful than copying the teacher's final answers.
The teacher does not need to be perfect. It only needs to provide useful local information when the student reaches a state where guidance matters. Verifier calibration improves this further. The verifier anchors training to real task success. The teacher fills in dense token-level structure.
The result is not pure imitation. It is guided distributional shaping.
Example: Minimal Code Editing
A minimal code editing task is a strong testbed because it measures two skills at the same time:
1. Can the model fix the bug?
2. Can the model avoid unnecessary rewriting?
A normal code benchmark usually rewards correctness. But real coding agents need more than correctness. They need restraint.
A good coding assistant should not rewrite an entire function to fix one wrong operator. It should make the smallest safe change.
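A rough way to quantify "smallest safe change" is to count how many original lines an edit touches. The sketch below uses Python's standard `difflib` as a line-level proxy; it is not the normalized Levenshtein metric named elsewhere in this essay, and the function name is hypothetical.

```python
import difflib

def changed_line_ratio(original: str, edited: str) -> float:
    """Fraction of original lines that were replaced or deleted (line-level proxy)."""
    a, b = original.splitlines(), edited.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changed = sum(
        i2 - i1
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag in ("replace", "delete")
    )
    return changed / max(1, len(a))

original = (
    "def larger(a, b):\n"
    "    if a > b:\n"
    "        return a\n"
    "    return b\n"
)
# Minimal fix: only the comparison operator changes.
minimal_fix = original.replace("a > b", "a >= b")
# Full rewrite: also correct, but discards the original structure.
full_rewrite = "def larger(a, b):\n    return max(a, b)\n"

print(changed_line_ratio(original, minimal_fix))
print(changed_line_ratio(original, full_rewrite))
```

Both edits could pass the same tests, so a verifier that only checks correctness cannot tell them apart. A line-level or character-level edit measure is what separates them.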
A minimal editing verifier should reward:
passing tests
small diffs
preserved names
preserved formatting
preserved structure
low added complexity
no unrelated changes
The training prompt can be simple:
You are given a buggy function.
Fix the bug.
Preserve the original structure, names, formatting, and logic.
Change only what is necessary.
Return only the corrected function.
Evaluation should include:
Pass@1
Normalized Levenshtein distance
Added cognitive complexity
Unrelated changed lines
General coding benchmark score
KL from the base model
Entropy per token
The model should only be considered improved if:
task pass rate improves
extra diff decreases
general coding ability does not degrade
KL stays under budget
entropy does not collapse
This matters because a model can look better on a narrow task while becoming worse overall. The goal is not just task improvement. The goal is task improvement with controlled distribution movement.
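The five acceptance conditions can be encoded as a single check so that a narrow task gain cannot hide a regression. Everything numeric below is a placeholder: the KL budget reuses the upper end of the target range suggested in Step 5, while the tolerated benchmark drop and the entropy floor are hypothetical thresholds that would need to be chosen per project.

```python
from dataclasses import dataclass

@dataclass
class EvalSnapshot:
    pass_at_1: float          # task pass rate
    extra_diff: float         # e.g. normalized extra edit distance, lower is better
    general_score: float      # general coding benchmark score
    kl_from_base: float       # nats/token against the base model
    entropy_per_token: float  # mean policy entropy

def is_improved(before, after, kl_budget=0.05, max_general_drop=0.01, min_entropy_frac=0.5):
    """True only if all five conditions hold. Thresholds are illustrative assumptions."""
    return (
        after.pass_at_1 > before.pass_at_1
        and after.extra_diff < before.extra_diff
        and after.general_score >= before.general_score - max_general_drop
        and after.kl_from_base <= kl_budget
        and after.entropy_per_token >= min_entropy_frac * before.entropy_per_token
    )

# Made-up numbers for illustration only.
base = EvalSnapshot(0.40, 0.30, 0.60, 0.00, 2.0)
healthy = EvalSnapshot(0.55, 0.20, 0.60, 0.03, 1.6)
collapsed = EvalSnapshot(0.70, 0.10, 0.45, 0.20, 0.4)

print(is_improved(base, healthy))
print(is_improved(base, collapsed))
```

The second checkpoint has the better task numbers and still fails the check, which is the situation the paragraph above warns about.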
Practical Defaults
For a first run:
Base model: Qwen, Llama, DeepSeek, or another coding-capable instruct model
Training method: LoRA or QLoRA
LoRA rank: 32 or 64
Learning rate: 5e-6 to 1e-5
Rollouts per prompt: 4
Batch prompts: 16 to 64
Teacher coefficient: 0.2
Teacher logit clip: ±0.5
PPO clip: 0.2
KL target: 0.03 nats/token
Entropy bonus: 0.001
Replay ratio: 10%
Replay refresh: every 200 to 500 optimizer steps
Eval cadence: every 100 to 250 steps
For expensive code tasks:
K = 2 rollouts per prompt
For cheaper verifier tasks:
K = 8 rollouts per prompt
Start small. Track KL, entropy, task reward, and general benchmark retention from the beginning. Do not wait until the end to discover that the model collapsed.
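For bookkeeping, the defaults above can live in one frozen config object. The class and field names are hypothetical, and where the text gives a range the sketch picks the low end as a starting value.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    lora_rank: int = 32
    learning_rate: float = 5e-6
    rollouts_per_prompt: int = 4
    prompts_per_batch: int = 16
    teacher_coef: float = 0.2
    teacher_clip: float = 0.5
    ppo_clip: float = 0.2
    kl_target: float = 0.03       # nats/token
    entropy_coef: float = 0.001
    replay_ratio: float = 0.10
    replay_refresh_steps: int = 200
    eval_every: int = 100

    def rollouts_per_step(self) -> int:
        """Verifier calls needed per optimizer step."""
        return self.rollouts_per_prompt * self.prompts_per_batch

default_cfg = TrainConfig()
expensive_code_cfg = replace(default_cfg, rollouts_per_prompt=2)
cheap_verifier_cfg = replace(default_cfg, rollouts_per_prompt=8)

for cfg in (default_cfg, expensive_code_cfg, cheap_verifier_cfg):
    print(cfg.rollouts_per_prompt, cfg.rollouts_per_step())
```

Computing the number of verifier calls per step up front makes the cost of changing K visible before a run starts.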
The Bigger Point
The next generation of post-training algorithms should not be framed as a simple fight between SFT, RL, and distillation.
The deeper question is: where does the training distribution come from?
If the data comes from a fixed external dataset, the model can be dragged toward behavior far from its original distribution. If the data comes from the student itself, the update is naturally constrained to the regions the model already visits.
That is why on-policy methods are so important. But on-policy data alone is not enough. RL can be too sparse. Distillation can be too biased. Teacher logits can overweight style. Reward models can be gamed. Entropy can collapse.
So the practical answer is not to choose one method blindly.
The practical answer is to combine the right parts:
On-policy sampling gives locality.
Verifier rewards give truth.
Teacher logits give dense credit.
Clipping prevents style-token collapse.
Replay preserves old skills.
That is Verifier-Calibrated On-Policy Distillation.
It is a simple idea: let the model learn from where it actually is, not from where a dataset wishes it were.
If we want models to gain new capabilities without destroying old ones, this is the direction post-training should move.
Further Reading
This essay builds on recent work and discussion around supervised fine-tuning, reinforcement learning, on-policy distillation, catastrophic forgetting, reverse-KL distillation, and on-policy replay.
The source essay that inspired this proposal frames SFT, RL, and OPD as different ways of reshaping a model's distribution.
SFT, RL, and On-Policy Distillation Through a Distributional Lens
Thinking Machines Lab explains dense on-policy supervision, where a student samples its own trajectories while a teacher provides token-level feedback.
On-Policy Distillation
RL's Razor argues that online RL tends to find task-solving policies that stay closer in KL to the original model than SFT.
RL's Razor: Why Online Reinforcement Learning Forgets Less
Self-Distilled Reasoner covers privileged self-distillation, where the same model acts as both student and teacher with extra information available to the teacher pass.
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
MiniLLM is useful background on reverse-KL distillation and mode-seeking behavior in language model distillation.
MiniLLM: Knowledge Distillation of Large Language Models
Additional related concepts worth searching:
DAgger: Dataset Aggregation
On-Policy Replay
GRPO
PPO clipping
RLVR
Catastrophic forgetting in LLM post-training
Minimal code editing benchmarks
Closing Note
This is a proposed algorithm, not a finished empirical result.
The next step is to test it. The right experiment is straightforward:
Train three models on the same verifiable task:
1. SFT baseline
2. RL baseline
3. Verifier-Calibrated On-Policy Distillation
Then compare:
task success
KL from base model
entropy collapse
general benchmark retention
minimal edit behavior
teacher dependence
If the hypothesis is right, Verifier-Calibrated On-Policy Distillation should land in the best part of the tradeoff curve: stronger than plain SFT, denser than plain RL, and safer than blind distillation.
Published as a research proposal and technical essay.
© 2026 Marvin B. Freedman.