Note: This post is a work-in-progress and may contain mistakes or miss recent work. If you spot an error or know of something I should cover, please send me an email. If you find this post useful, please cite it using the BibTeX at the end. This post was written with assistance from Claude Code, Codex, and GPT-5.2 Pro.

Updates in v2 (2026-05-18). Corrected method descriptions for R3, IcePop, QUATRO, MinPRO, SAPO (gate vs. shape), GSPO stop-gradient scoping, OAPL denominator and critic, KL-cov formula, R²VPO, and DISPO. Replaced the stale "13 advantage estimators" claim with the current 14 (added gdpo). Added the two missing *_token_icepop presets to the rollout-correction table. Pinned all verl-source claims to verl-project/verl@70744059 (the repo migrated from volcengine/verl in early 2026). Re-wrote the DeepSeek-V3.2 and Ring-1T production-recipe paragraphs against the actual paper Sections 3 and 2.3, and added a CompassMax reference (corrected to arXiv:2512.07710 — the original prompt had a different paper's ID). Added a Mar–May 2026 paper round-up, a diagnostic decision tree, a comparison map of every method on the same axes, and a notation map clarifying $\rho_t$ vs. $r_t$. Open-problems section rewritten as five falsifiable conjectures.

Modern LLM reinforcement learning is secretly off-policy (Yao et al. 2025). Even frameworks advertised as "on-policy" have a gap between the policy that sampled the data and the policy that computes gradients on it. The causes are many: separate engines for inference and training running at different precisions, different attention kernels, different tensor-parallel configurations, MoE routing decisions that flip between forward passes, async pipelines where the policy moves before stale rollouts are consumed, and multi-epoch training over the same batch. Each source independently shifts the log-probabilities that the gradient depends on, and they compound. The result is corrupted gradients that can collapse training.

This post walks through how mismatch arises, how to detect it, how to correct it with importance sampling and rejection sampling, and how policy loss functions interact with staleness. The treatment follows the rollout correction framework in verl, with references to the relevant source files throughout.

Why LLM RL training is secretly off-policy

The two sources of staleness

In a standard LLM RL pipeline, the full importance weight between the current training policy $\pi_\theta$ and the rollout (behavior) policy $\mu_{\theta_{\text{old}}}$ decomposes into two factors (Zheng et al. 2025):

$$ \frac{\pi_\theta(y_t \mid x, y_{\lt t})}{\mu_{\theta_{\text{old}}}(y_t \mid x, y_{\lt t})} = \underbrace{\frac{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{\lt t})}{\mu_{\theta_{\text{old}}}(y_t \mid x, y_{\lt t})}}_{\text{training-inference mismatch}} \times \underbrace{\frac{\pi_\theta(y_t \mid x, y_{\lt t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{\lt t})}}_{\text{policy staleness}} $$

The first factor is training-inference mismatch: even when the weights are identical ($\theta = \theta_{\text{old}}$), the rollout engine $\mu$ and the training engine $\pi$ produce different log-probabilities. The causes include precision differences (FP8 vs BF16), different attention backends (FlashAttention vs PagedAttention), and different parallelism strategies. This factor should be 1 everywhere. In practice it is not.

The second factor is policy staleness: the weights have moved since the rollout was generated. This happens in multi-epoch PPO (the policy updates between epochs but trains on the same rollout data), in one-step async pipelines (generation of batch $N$ overlaps with training on batch $N-1$), and in fully async pipelines (multiple training steps may occur before fresh rollouts arrive).

Both factors corrupt the gradient. Policy gradient methods assume the data was sampled from the current policy. When that assumption breaks, the gradient points in the wrong direction, and the error compounds across training steps. This is the mechanism behind RL collapse (Li & Liu 2025; Yao et al. 2025).

Notation map

Three policies show up everywhere in this post. The literature is not consistent about which symbol means which, so it is worth pinning notation explicitly:

  • $\mu_{\text{rollout}}$ — the rollout engine's distribution over tokens (vLLM, SGLang, etc.) at the rollout-time weights.
  • $\pi_{\text{old}}$ — the training engine's distribution over tokens at the same rollout-time weights $\theta_{\text{old}}$ (i.e., the "actor forward pass" recomputed log-probs).
  • $\pi_\theta$ — the training engine's distribution at the current weights $\theta$.

From these we get three ratios:

  • Engine-mismatch ratio: $\rho_t^{\text{infer}} = \pi_{\text{old}}(y_t)/\mu_{\text{rollout}}(y_t)$ — same weights, different engines. Should be 1; in practice is not.
  • Policy-staleness ratio: $r_t(\theta) = \pi_\theta(y_t)/\pi_{\text{old}}(y_t)$ — same engine, different weights. This is the standard PPO ratio.
  • Full importance weight: $w_t(\theta) = \pi_\theta(y_t)/\mu_{\text{rollout}}(y_t) = \rho_t^{\text{infer}} \cdot r_t(\theta)$ — current policy on training engine vs. behavior policy on rollout engine.

In the rollout-correction sections below, a bare $\rho_t$ means the engine-mismatch ratio $\rho_t^{\text{infer}}$. In the policy-loss sections, a bare $r_t$ means the staleness ratio. Wherever the post uses $\rho_t$ without qualifier in equations imported from a particular paper, the convention follows that paper.
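As a minimal sketch of the notation above, the three ratios can be computed in log space from three per-token log-probability lists. The function name, argument names, and numbers are illustrative assumptions, not verl identifiers.

```python
import math

def decompose_ratios(logp_rollout, logp_old, logp_current):
    """Return per-token (rho_infer, r, w) tuples.

    logp_rollout: log mu_rollout(y_t), recorded by the rollout engine
    logp_old:     log pi_old(y_t), training engine at the rollout-time weights
    logp_current: log pi_theta(y_t), training engine at the current weights
    """
    out = []
    for lr, lo, lc in zip(logp_rollout, logp_old, logp_current):
        rho_infer = math.exp(lo - lr)  # engine mismatch: same weights, different engines
        r = math.exp(lc - lo)          # policy staleness: the standard PPO ratio
        w = math.exp(lc - lr)          # full importance weight
        out.append((rho_infer, r, w))
    return out

# Made-up log-probs for three tokens.
ratios = decompose_ratios(
    logp_rollout=[-1.20, -0.35, -2.10],
    logp_old=[-1.25, -0.35, -2.00],
    logp_current=[-1.10, -0.40, -2.05],
)
for rho_infer, r, w in ratios:
    print(f"rho_infer={rho_infer:.4f}  r={r:.4f}  w={w:.4f}")
```

Working in log space keeps the factorization $w_t = \rho_t^{\text{infer}} \cdot r_t$ numerically clean: the two log-differences simply add.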

Root causes in detail

The two-factor decomposition above identifies two abstract sources of drift. In practice, six concrete root causes produce this drift, and understanding each determines whether the right fix is a systems change, an algorithmic correction, or both.

Floating-point precision: BF16 vs FP16 vs FP8

BF16 is the default training dtype for most modern LLM stacks, and for good reason: its 8-bit exponent matches FP32's dynamic range, which simplifies mixed-precision recipes. But BF16 has only 7 bits of mantissa (versus FP16's 10), and that coarser rounding is the single largest contributor to training-inference mismatch in many production pipelines. When the inference engine and the training engine both operate in BF16 but use different fused kernels, different reduction orders, or different intermediate accumulation widths, the rounding errors do not cancel — they compound through the softmax into systematically different token probabilities.

Qi et al. (2025) make the sharpest version of this claim: switching both the rollout engine and the training engine to FP16 can nearly eliminate the mismatch with minimal code changes. FP16's extra 3 mantissa bits are enough to make the residual rounding error negligible relative to other noise sources. The cost is a narrower dynamic range (5-bit exponent), which requires careful loss scaling — but that is a solved problem in most frameworks.
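To make the mantissa argument concrete, here is a small standard-library sketch that rounds a value to FP16 and to BF16 and compares the rounding error. The BF16 path is an emulation (round the float32 bit pattern to its top 16 bits with round-to-nearest-even), which is an assumption about how a given kernel rounds; the helper names are hypothetical.

```python
import struct

def round_to_fp16(x):
    """Round to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_to_bf16(x):
    """Emulate bfloat16 (8 exponent bits, 7 mantissa bits).

    bfloat16 is the top 16 bits of a float32, so we round the float32
    bit pattern to nearest-even at bit 16 and zero the low half.
    Intended for finite, normal inputs only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for x in (0.1, 2.7182818, 11.3):
    err_fp16 = abs(round_to_fp16(x) - x)
    err_bf16 = abs(round_to_bf16(x) - x)
    print(f"x={x}: fp16 error {err_fp16:.2e}, bf16 error {err_bf16:.2e}")
```

For 0.1 the BF16 grid spacing is $2^{-11}$ while the FP16 spacing is $2^{-14}$, so the BF16 rounding error is about four times larger. Individual values can round to the same number in both formats; the gap is in the worst case and on average.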

At even lower precisions the picture is worse. FP8 and INT8 quantized rollouts are attractive for throughput (roughly 2x over BF16 on modern accelerators), but they introduce a structured distribution shift: the quantized actor is not just a noisy version of the full-precision policy but a systematically different one. FP8-RL (Qiu et al. 2026) and Jet-RL (Xi et al. 2026) both demonstrate that a mixed BF16-train / FP8-rollout stack is not truly on-policy, and that the resulting mismatch must either be corrected with importance sampling or eliminated by unifying precision across both sides. QuRL (Li et al. 2026) goes further, framing the quantized actor as a permanently distinct behavior policy and designing adaptive clipping ranges around the full-precision/quantized ratio.

Zhang et al. (2026) add a nuance: the failure is not purely about arithmetic precision. A fixed numerical discrepancy between rollout and training can be tolerable early in training and then suddenly become destabilizing later, because the optimizer moves into a region of parameter space where that discrepancy aligns with a steep loss landscape. In other words, mismatch is coupled to optimization state. This is why purely static precision alignment (switching to FP16 once at the start) works most of the time but occasionally fails in ways that a reactive intervention, such as learning-rate scheduling triggered by response-length surges, can catch.

Kernel and framework divergence

Even with identical dtypes, rollout engines and training engines typically use different software paths: different fused attention kernels, different reduction orders in layernorm and softmax, different batch-scheduling policies, and different memory layouts. These differences are invisible at the API level (both engines accept the same weights and produce "the same" outputs) but visible at the floating-point level, where they produce systematically different log-probabilities.

He et al. (2025) argue that what practitioners call "nondeterminism" in LLM inference is mostly a misnomer. The real culprit is batch dependence and floating-point non-associativity: the same prompt produces different logits depending on what other prompts share its batch, because fused kernels reduce across the batch dimension in an order that varies with batch composition. This is deterministic given the batch but not batch-invariant, and since RL training and RL rollout almost never use the same batching strategy, the two engines diverge even when everything else matches.

Yuan et al. (2025) provide a broader catalog of numerical nondeterminism sources in LLM inference, including thread-scheduling in CUDA reductions, non-deterministic cuBLAS GEMM algorithms, and radix-cache-induced prefill chunking differences. While their focus is inference rather than RL, the root causes they identify are exactly the ones that create rollout-training mismatch: every source of numerical variation in the inference engine is a source of off-policy drift in the RL pipeline.

The systems community's response was a sequence of increasingly strict equivalence projects. The vLLM + TorchTitan bitwise consistency article (2025) audited every kernel invocation in the forward pass until rollout and training produced bitwise-identical outputs. The SGLang deterministic inference post (2025) achieved reproducible inference across requests by fixing chunked prefill, CUDA graph replay, and radix cache behavior. Both demonstrate that eliminating kernel divergence is possible but expensive: the constrained kernels are slower, and the restrictions (fixed batch sizes, no dynamic batching, deterministic CUDA graphs) limit the serving optimizations that make rollout fast in the first place.

For practitioners, the implication is pragmatic: full kernel-level equivalence is achievable but usually not worth the throughput cost. The more common approach is to accept some kernel divergence and correct it algorithmically — which requires measuring the divergence first (see the diagnostic toolkit below) and then applying IS or RS correction proportional to the observed gap.

Parallelism strategy mismatch

A particularly common and often overlooked divergence source is that the rollout engine and the training engine use different parallelism strategies. The rollout engine typically uses multi-GPU tensor parallelism (TP) for low-latency generation, while the training engine runs with TP=1 under Fully Sharded Data Parallel (FSDP) or a different TP degree under Megatron. Even with identical kernels and identical dtypes, different TP configurations produce different outputs because the all-reduce operations that merge partial results across GPUs use floating-point addition, which is not associative.

Concretely, a TP=4 rollout partitions each matrix multiply across 4 GPUs, computes partial results, and reduces them with an all-reduce. The training engine at TP=1 computes the same matrix multiply on a single GPU with a different reduction tree. The two results differ in the low bits of the mantissa, and these differences propagate through layernorm, softmax, and the autoregressive chain to produce measurably different token probabilities. The gap is typically smaller than the BF16/FP16 precision gap but large enough to matter over long sequences.

Zhang et al. (2025) isolate this problem and propose Tree-Based Invariant Kernels (TBIK): custom reduction kernels whose output is bitwise identical regardless of TP size. TBIK achieves this by fixing a canonical reduction tree (independent of the number of GPUs) and padding to ensure identical accumulation order. Integrated into both vLLM and FSDP, TBIK eliminates the TP-induced component of mismatch entirely. The throughput cost is modest (single-digit percent) because only the reduction kernels are constrained, not the matrix multiplies themselves.
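The non-associativity argument can be reproduced in a few lines. The sketch below is a toy float64 illustration, not the TBIK implementation: `sharded_sum` mimics a per-rank partial sum followed by an in-order all-reduce, and `sharded_tree_sum` uses a fixed pairwise tree at both levels. All function names are hypothetical, and the input is deliberately exaggerated so the effect is visible in a four-element vector.

```python
def left_to_right_sum(values):
    """Plain sequential accumulation (built-in sum() may use compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total

def sharded_sum(values, tp):
    """Each of `tp` ranks sums a contiguous shard; partials are then added in rank order."""
    shard = len(values) // tp
    partials = [left_to_right_sum(values[i * shard:(i + 1) * shard]) for i in range(tp)]
    return left_to_right_sum(partials)

def tree_sum(values):
    """Fixed pairwise reduction tree; assumes len(values) is a power of two."""
    level = list(values)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def sharded_tree_sum(values, tp):
    """Same sharding, but ranks and the cross-rank step both follow the canonical tree."""
    shard = len(values) // tp
    partials = [tree_sum(values[i * shard:(i + 1) * shard]) for i in range(tp)]
    return tree_sum(partials)

values = [1e16, 1.0, -1e16, 1.0]  # exact sum is 2.0
naive = {tp: sharded_sum(values, tp) for tp in (1, 2, 4)}
canonical = {tp: sharded_tree_sum(values, tp) for tp in (1, 2, 4)}
print("rank-order reduction:", naive)
print("fixed-tree reduction:", canonical)
```

The rank-order reduction returns different values at TP=1 and TP=2, while the fixed tree returns the same value at every TP degree. Note that the fixed tree is invariant, not more accurate: here it is consistently wrong about the exact sum. Invariance across engines is what removes the mismatch.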

For pipelines that cannot adopt TBIK, the alternative is to match TP configurations across rollout and training — but this is often impractical, since the optimal TP for generation throughput (high TP, low latency) differs from the optimal TP for training throughput (low TP, high batch size under FSDP). The mismatch is a direct consequence of the engineering reality that serving and training have different parallelism sweet spots.

MoE routing discrepancies

For Mixture-of-Experts models, the "policy" is not just a probability distribution over tokens; it includes the router's discrete decisions about which experts process each token. A small numerical perturbation in the router logits can flip the top-$k$ expert selection, replacing one expert's parameters with another's. Unlike the continuous perturbations from precision or kernel differences, this is a discontinuous change. The model suddenly computes with a different subnetwork, and the resulting log-probabilities can shift by much more than the initial numerical perturbation would suggest.

This makes MoE routing mismatch qualitatively different from dense-model mismatch. In a dense model, a small logit perturbation produces a small output perturbation (the computation is Lipschitz). In an MoE model, the same small perturbation can cross a routing boundary and produce an arbitrarily large output change. Standard importance sampling can correct for the resulting ratio, but the ratio itself can be extreme, leading to high variance or outright rejection of the affected sequences.

R3 / Routing Replay (Ma et al. 2025) addresses this by recording the routing distributions produced by the inference engine and replaying them during training. Instead of letting the training engine's router make fresh decisions (which may differ due to numerical divergence), R3 forces the training engine to use the same expert assignments that the rollout engine used while still letting the softmax over routing logits keep its gradient flow. This eliminates the discontinuity: the training forward pass now uses the same computational pathway as the rollout, so the remaining mismatch is only the continuous kernel/precision component.

IcePop (introduced inside the Ring-1T report (Ling Team 2025), §2.3.2) takes a lighter-weight approach: it operates on the probability ratio $k = \pi_{\text{train}}(y_t;\theta_{\text{old}})/\pi_{\text{infer}}(y_t;\theta_{\text{old}})$ between the training and inference engines (same weights) for the sampled token, and applies a double-sided mask: $M(k) = k$ if $k \in [\alpha, \beta]$, otherwise $0$. Tokens outside the band contribute zero gradient; tokens inside the band retain the calibration coefficient. The default reasoning-RL band is $[\alpha, \beta] = [0.5, 5.0]$, and the paper reports that only about 1–2 per mille of training tokens get masked in this configuration. Routing flips are one cause of the ratio tail (especially in MoE), but the mask criterion is the probability ratio itself, not router top-k disagreement.
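The double-sided mask is simple enough to write down directly. This is a minimal sketch of $M(k)$ as described above, using the $[0.5, 5.0]$ band as defaults; the function name and the example log-probs are assumptions for illustration.

```python
import math

def icepop_mask(logp_train, logp_infer, alpha=0.5, beta=5.0):
    """Per-token coefficient M(k): k inside [alpha, beta], 0 outside.

    k is the train/infer probability ratio of the sampled token,
    both evaluated at the same (rollout-time) weights.
    """
    coeffs = []
    for lt, li in zip(logp_train, logp_infer):
        k = math.exp(lt - li)
        coeffs.append(k if alpha <= k <= beta else 0.0)
    return coeffs

coeffs = icepop_mask(
    logp_train=[-1.0, -1.0, -3.0, -0.9],
    logp_infer=[-1.0, -3.0, -1.0, -1.0],
)
print(coeffs)  # second and third tokens fall outside the band and are zeroed
```

Tokens inside the band keep $k$ itself as a multiplicative calibration coefficient, so the mask doubles as a truncated importance weight rather than a pure 0/1 filter.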

Industrial deployments confirm the severity of the problem. DeepSeek-V3.2 (§3.1) introduced Keep Routing — "preserve the expert routing paths used during sampling in the inference framework and enforce the same routing paths during training" — and reports the technique has been in their RL pipeline since DeepSeek-V3-0324. Note that Ring-1T itself does not use routing replay: instead, it freezes the MoE router bias during RL (§2.3.4) and relies on IcePop's per-token probability-ratio mask to catch the residual divergence. For MoE models, some form of routing alignment is not optional. Generic IS correction without routing awareness is insufficient.

Asynchronous training lag

Policy staleness from asynchronous pipelines is the most classical form of off-policy drift: the rollout policy is literally an older checkpoint. In one-step async (overlap generation of batch $N$ with training on batch $N-1$), the lag is exactly one gradient step — usually mild. In fully async pipelines like AReaL (Fu et al. 2025), the lag can accumulate across many training steps, producing importance ratios that are heavy-tailed and gradient estimates that are high-variance or biased.

But there is a subtler form of staleness that operates even within a single synchronous batch: minibatch staleness. When a rollout batch is split into multiple minibatches, the policy updates after each minibatch, so by the time the trainer reaches minibatch $k$, the current policy has moved $k-1$ gradient steps from the policy that generated the data. This happens in two common settings. Multi-step PPO uses a single epoch with multiple minibatches — off-policy from the second minibatch onward. Multi-epoch PPO makes multiple passes over the same batch — off-policy from the second epoch onward, and compounding with minibatch splits within each epoch. Both are the same mathematical problem as async staleness, just at a finer timescale. DeepSeek-V3.2 formalizes this with off-policy sequence masking: it computes the KL divergence between the current policy and the sampling policy for each sequence, and masks out negative-advantage sequences whose divergence exceeds a threshold. Critically, the divergence is computed using the sampling probabilities returned by the inference engine, so it captures both engine mismatch and minibatch staleness.
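A hedged sketch of the sequence-masking rule described above: estimate a per-sequence divergence from the sampled tokens and drop negative-advantage sequences that exceed a threshold. The direction of the KL estimate, the per-token averaging, the threshold value `delta`, and the function name are all assumptions made for illustration; consult the DeepSeek-V3.2 report for the exact form.

```python
def offpolicy_sequence_mask(logp_sampling, logp_current, advantage, delta=0.1):
    """Return 1.0 to keep the sequence, 0.0 to mask it.

    logp_sampling: per-token log-probs returned by the inference engine
    logp_current:  per-token log-probs under the current training policy
    delta:         hypothetical divergence threshold
    """
    n = len(logp_sampling)
    # k1-style estimate on the sampled tokens: mean of log(sampling / current).
    kl_est = sum(ls - lc for ls, lc in zip(logp_sampling, logp_current)) / n
    if advantage < 0 and kl_est > delta:
        return 0.0
    return 1.0

drifted = ([-1.0, -1.0], [-1.5, -1.5])  # mean log-ratio 0.5
close = ([-1.0, -1.0], [-1.01, -1.01])  # mean log-ratio 0.01
print(offpolicy_sequence_mask(*drifted, advantage=-1.0))
print(offpolicy_sequence_mask(*drifted, advantage=+1.0))
print(offpolicy_sequence_mask(*close, advantage=-1.0))
```

Because the reference log-probs come from the inference engine, the same estimate absorbs both engine mismatch and the drift accumulated over earlier minibatches.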

The distinction between these two forms of lag matters for the choice of correction. Classical async staleness is predictable: you know how many gradient steps have elapsed since the rollout, and you can bound the expected drift. Minibatch-reuse staleness is data-dependent: the drift depends on the gradient magnitude of the intervening updates, which depends on the advantage signal, which depends on the reward distribution. This is why per-sequence masking (as in DeepSeek-V3.2) or per-sequence adaptive clipping (as in QUATRO (Lee et al. 2026)) tends to outperform global staleness bounds: the amount of drift varies across sequences within the same batch.

Quantized inference as automatic off-policy

Quantized rollout deserves its own discussion because it is simultaneously the most common mismatch source and the one most often overlooked. When the rollout engine generates with INT8, FP8, or W8A8 (8-bit weights, 8-bit activations) kernels and the training engine computes gradients from full-precision log-probabilities, the pipeline is off-policy by construction. The quantized actor is not an approximation of the training-time policy. It is a different policy, one whose softmax outputs are shifted in structured ways by the quantization grid.

FlashRL (Yao et al. 2025) was among the first to frame this explicitly: 8-bit rollouts can dramatically cut generation cost, but the resulting mismatch must be treated as off-policy data, not ignored. FP8-RL (Qiu et al. 2026) provides a complete production stack — W8A8 inference components, FP8 KV cache, and explicit IS correction integrated into the verl ecosystem — and demonstrates that low-precision RL can match BF16 baselines, but only when the correction mechanism is part of the design from day one. Jet-RL (Xi et al. 2026) makes the stronger architectural claim: the right solution is a unified precision flow across rollout and learning, not post-hoc correction of a precision gap.

The tradeoff is throughput versus correction overhead. Quantized rollout with IS correction is typically faster than full-precision rollout without correction, because the throughput gain from quantization (1.5-2x) exceeds the cost of the extra forward pass needed for IS ratio computation. But the correction adds engineering complexity and monitoring burden: you need to track the quantization-induced mismatch separately from other mismatch sources to know whether your IS weights are doing useful work or just adding noise. QuRL (Li et al. 2026) proposes a cleaner abstraction: treat the quantized actor as a permanent, first-class behavior policy with its own trust region, rather than as a defective copy of the training policy that needs patching.

These sources compound. A typical setup with FP8 rollouts, BF16 training, TP=4 inference, TP=1 training, and 4 PPO epochs experiences the precision gap, parallelism mismatch, and multi-epoch staleness simultaneously. The rollout correction framework treats them uniformly through the importance weight decomposition above.

Identifying mismatch — a diagnostic toolkit

Before correcting mismatch, you need to measure it. verl provides a hierarchy of metrics for this, from per-token signals to aggregate health indicators. The off-policy metrics live in rollout_corr_helper.py; debug rollout-vs-actor probability metrics are in verl/utils/debug/metrics.py; gradient-variance proxies are in verl/trainer/ppo/metric_utils.py. Where this section names a verl function or preset, the claim is verified against verl-project/verl@70744059 (2026-05-18).

The per-token log-importance ratio

The basic signal is the per-token log importance ratio:

$$ \log\rho_t = \log\pi_{\text{old}}(y_t \mid x, y_{\lt t}) - \log\pi_{\text{rollout}}(y_t \mid x, y_{\lt t}) $$

Here, $\log\pi_{\text{old}}$ is the log-probability recomputed by the training engine (the "actor forward pass"), and $\log\pi_{\text{rollout}}$ is the log-probability recorded during rollout generation (the rollout-engine distribution written $\mu_{\text{rollout}}$ in the notation map). If these two quantities agree everywhere, that is, if $\log\rho_t \approx 0$ for all tokens, then the policies match and no correction is needed.

To measure this signal without applying any correction, configure verl's rollout correction in metrics-only mode:

from verl.trainer.config.algorithm import RolloutCorrectionConfig

# Metrics only, no correction applied
rollout_corr_config = RolloutCorrectionConfig.disabled()

Or equivalently via YAML config (which is already the default in rollout_correction.yaml):

algorithm:
  rollout_correction:
    rollout_is: null
    rollout_rs: null

This logs all diagnostic metrics without modifying any gradients or masks, giving you a clean baseline measurement.

Off-policy diagnostic metrics

The raw per-token $\log\rho_t$ is useful but hard to aggregate. The metrics below (computed in compute_offpolicy_metrics() in rollout_corr_helper.py) summarize the distribution shift into scalars you can track on a dashboard.

KL divergence (k1 estimator). The simplest divergence measure:

$$ \text{KL}_{k1} = \mathbb{E}[\log\pi_{\text{rollout}}(y_t \mid x, y_{\lt t}) - \log\pi_{\text{old}}(y_t \mid x, y_{\lt t})] $$

This is the Monte Carlo estimate of $D_{\text{KL}}(\pi_{\text{rollout}} \| \pi_{\text{old}})$. Unbiased but can have high variance.

K3 KL. A lower-variance alternative from Schulman (2020):

$$ \text{KL}_{k3} = \mathbb{E}[\rho_t - \log\rho_t - 1] $$

where $\rho_t = \exp(\log\rho_t)$. Always non-negative, with better numerical properties than k1.
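The difference between the two estimators shows up clearly on a tiny example. The sketch below implements both formulas directly from a list of per-token $\log\rho_t$ values; it is not verl's `compute_offpolicy_metrics()`, and the function names are illustrative.

```python
import math

def kl_k1(log_rho):
    """k1: mean of -log(rho_t) = E[log pi_rollout - log pi_old]."""
    return sum(-lr for lr in log_rho) / len(log_rho)

def kl_k3(log_rho):
    """k3: mean of rho_t - log(rho_t) - 1; every term is non-negative."""
    return sum(math.exp(lr) - lr - 1.0 for lr in log_rho) / len(log_rho)

# Two tokens whose ratios are 2 and 1/2: clearly mismatched, in opposite directions.
log_rho = [math.log(2.0), -math.log(2.0)]
print("k1:", kl_k1(log_rho))
print("k3:", kl_k3(log_rho))
```

On this batch the k1 terms cancel to zero even though both tokens are off by a factor of two, while k3 reports 0.25. That sign cancellation in small batches is the practical face of k1's higher variance.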

Chi-squared divergence. More sensitive to tail behavior. At the token level:

$$ \chi^2_{\text{token}} = \mathbb{E}[\rho_t^2] - 1 $$

And at the sequence level, using the product of per-token ratios:

$$ \chi^2_{\text{seq}} = \mathbb{E}\left[\left(\prod_t \rho_t\right)^2\right] - 1 $$

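Chi-squared divergence directly relates to the variance of importance-weighted estimators, because $\mathbb{E}[\rho_t^2] = 1 + \chi^2_{\text{token}}$ is the second moment of the weights. When $\chi^2_{\text{token}} \gt 1.0$, that second moment exceeds 2: the IS weight distribution is heavy-tailed enough to substantially inflate gradient variance, and for weights with mean 1 the effective sample size $1/\mathbb{E}[\rho_t^2]$ falls below half the batch.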
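The gap between the token-level and sequence-level versions is easiest to see on a synthetic batch with mild but consistent per-token drift. This is a sketch of the two formulas above with illustrative function names and made-up data, not verl's implementation.

```python
import math

def chi2_token(log_rho_per_seq):
    """E[rho_t^2] - 1, pooled over every token in every sequence."""
    tokens = [lr for seq in log_rho_per_seq for lr in seq]
    return sum(math.exp(2.0 * lr) for lr in tokens) / len(tokens) - 1.0

def chi2_seq(log_rho_per_seq):
    """E[(prod_t rho_t)^2] - 1, with the product taken in log space."""
    seq_log = [sum(seq) for seq in log_rho_per_seq]
    return sum(math.exp(2.0 * s) for s in seq_log) / len(seq_log) - 1.0

# Two 100-token sequences: every token is off by +1% or -1% in log space.
batch = [[0.01] * 100, [-0.01] * 100]
print("token-level chi2:", chi2_token(batch))
print("sequence-level chi2:", chi2_seq(batch))
```

A 1% per-token drift gives a token-level value near $2 \times 10^{-4}$ but a sequence-level value above 2, which is why the two should be tracked separately for long responses.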

Perplexity gap and PPL ratio. The absolute and relative differences in perplexity between rollout and training engines:

$$ \text{PPL ratio} = \frac{\text{PPL}_{\text{old}}}{\text{PPL}_{\text{rollout}}} $$

A PPL ratio above 1.0 means the training engine is less confident than the rollout engine (higher perplexity); below 1.0 means the opposite. A ratio slightly above 1.0 is the expected default even when both engines implement the same policy, because $\text{PPL}_\text{rollout}$ on its own samples measures the entropy $H(\pi_\text{rollout})$, while $\text{PPL}_\text{old}$ on those same samples measures the cross-entropy $H(\pi_\text{rollout}, \pi_\text{old}) = H(\pi_\text{rollout}) + D_\text{KL}(\pi_\text{rollout} \| \pi_\text{old})$. By Gibbs' inequality the KL term is non-negative, so any numerical discrepancy between the two engines pushes the ratio above 1.0. A ratio substantially above 1.0, or any ratio below 1.0, therefore signals a systematic gap (precision differences, kernel divergence, TP mismatch, or routing discrepancies) beyond the baseline information-theoretic floor.
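A small sketch of the ratio, assuming perplexity is computed by pooling all tokens into a single mean log-probability (verl's exact aggregation across sequences may differ). Function names and numbers are illustrative.

```python
import math

def perplexity(logprobs):
    """exp of the negative mean log-probability over the given tokens."""
    return math.exp(-sum(logprobs) / len(logprobs))

def ppl_ratio(logp_old, logp_rollout):
    """PPL_old / PPL_rollout on the same sampled tokens."""
    return perplexity(logp_old) / perplexity(logp_rollout)

logp_rollout = [-1.0, -2.0]
logp_old = [-1.5, -2.5]  # training engine is less confident on every token
ratio = ppl_ratio(logp_old, logp_rollout)
k1 = sum(lr - lo for lr, lo in zip(logp_rollout, logp_old)) / len(logp_old)
print("PPL ratio:", ratio, " exp(k1):", math.exp(k1))
```

Under this pooling the PPL ratio is exactly $\exp(\text{KL}_{k1})$, which is the algebraic form of the entropy-plus-KL argument above.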

IS weight health and rejection sampling metrics

Once you enable IS or RS correction, additional metrics (from compute_is_metrics() in rollout_corr_helper.py) track the health of the correction itself.

Effective Sample Size (ESS). The fraction of samples carrying meaningful weight:

$$ \text{ESS} = \frac{1}{\mathbb{E}[\tilde{w}^2]} $$

where $\tilde{w}$ denotes the normalized IS weights (mean 1). An ESS of 0.5 means your effective batch size is half the actual batch size. Below 0.3, most of your compute is wasted on samples that contribute almost nothing.
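As a quick illustration of the ESS formula, the sketch below normalizes a list of raw weights to mean 1 and returns the ESS as a fraction of the batch. The function name is an assumption, not the verl helper.

```python
def effective_sample_size(weights):
    """ESS as a fraction of the batch: 1 / E[w_tilde^2], with w_tilde normalized to mean 1."""
    mean_w = sum(weights) / len(weights)
    normalized = [w / mean_w for w in weights]
    second_moment = sum(w * w for w in normalized) / len(normalized)
    return 1.0 / second_moment

print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))  # uniform weights
print(effective_sample_size([4.0, 0.0, 0.0, 0.0]))  # one sample carries everything
```

Uniform weights give an ESS of 1.0; a batch of four where one sample holds all the weight gives 0.25, i.e. one effective sample out of four.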

IS weight statistics. Mean, standard deviation, min, and max of the IS weights. Healthy weights have mean near 1.0 and low std. A mean far from 1.0 suggests systematic bias; high std means a few samples dominate.

Truncation fractions. The fraction of IS weights clipped by the truncation threshold. High truncation fractions mean severe mismatch that IS alone cannot handle.

RS masked fraction. For each rejection sampling option (and overall), the fraction of sequences whose response masks were zeroed out. This tells you how much data you are throwing away.

Debug probability metrics

For deeper investigation, verl provides debug metrics (in verl/utils/debug/metrics.py) comparing raw probability distributions, all logged under the training/ namespace:

  • training/rollout_actor_probs_pearson_corr: Pearson correlation between rollout and actor probabilities (exponentiated log-probs) across all tokens. On-policy training should see correlations above 0.99. Below 0.95 means something is seriously off in the backend.
  • training/rollout_probs_diff_mean and training/rollout_probs_diff_max: mean and maximum absolute difference in probabilities. The max catches rare but extreme discrepancies.
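For readers who want to reproduce these numbers offline, here is a standard-library sketch of the same three quantities computed from two log-prob lists. The dictionary keys are shortened illustrative names, not the exact verl metric keys, and the code is not taken from verl/utils/debug/metrics.py.

```python
import math

def prob_agreement(logp_rollout, logp_actor):
    """Pearson correlation and absolute differences between exponentiated log-probs."""
    p_r = [math.exp(v) for v in logp_rollout]
    p_a = [math.exp(v) for v in logp_actor]
    n = len(p_r)
    mean_r, mean_a = sum(p_r) / n, sum(p_a) / n
    cov = sum((r - mean_r) * (a - mean_a) for r, a in zip(p_r, p_a))
    var_r = sum((r - mean_r) ** 2 for r in p_r)
    var_a = sum((a - mean_a) ** 2 for a in p_a)
    diffs = [abs(r - a) for r, a in zip(p_r, p_a)]
    return {
        "pearson_corr": cov / math.sqrt(var_r * var_a),
        "diff_mean": sum(diffs) / n,
        "diff_max": max(diffs),
    }

stats = prob_agreement(
    logp_rollout=[-0.10, -1.00, -2.50, -0.70],
    logp_actor=[-0.11, -0.98, -2.60, -0.70],
)
print(stats)
```

The correlation is undefined when either side is constant across tokens, so a production version would need to guard against a zero denominator.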

Gradient variance proxy metrics

When IS weights are applied, they can inflate gradient variance beyond what the advantage signal warrants. verl tracks this via compute_variance_proxy_metrics() in metric_utils.py:

  • proxy1_signal_strength: $\lVert\bar{g}\rVert^2$ (squared mean gradient) — measures gradient signal strength.
  • proxy2_total_power: $\mathbb{E}[\lVert\hat{g}\rVert^2]$ — total power (signal + noise).
  • proxy3_pure_noise: $\frac{1}{N-1}(\text{proxy2} - \text{proxy1})$ — pure variance in gradient estimates.

If proxy3_pure_noise inflates over training, IS correction is adding unacceptable noise and you should tighten the IS threshold or switch to RS.
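The three proxies follow directly from the formulas in the list. How verl derives them from quantities available in the batch is not reproduced here; this sketch substitutes explicit toy per-sample gradient vectors, and the function name is hypothetical.

```python
def variance_proxies(per_sample_grads):
    """Return (proxy1, proxy2, proxy3) from a list of equal-length gradient vectors."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    mean_grad = [sum(g[d] for g in per_sample_grads) / n for d in range(dim)]
    proxy1 = sum(m * m for m in mean_grad)                            # ||g_bar||^2
    proxy2 = sum(sum(x * x for x in g) for g in per_sample_grads) / n  # E[||g_hat||^2]
    proxy3 = (proxy2 - proxy1) / (n - 1)                               # noise term
    return proxy1, proxy2, proxy3

# Two samples pulling in opposite directions: all power, no signal.
print(variance_proxies([[1.0, 0.0], [-1.0, 0.0]]))
# Two identical samples: all signal, no noise.
print(variance_proxies([[1.0, 2.0], [1.0, 2.0]]))
```

The two toy batches bracket the range: opposing gradients give zero signal strength with positive noise, and identical gradients give zero noise.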

Training-time loss diagnostics

Standard PPO health metrics are logged even without rollout correction and interact with off-policy drift:

  • actor/pg_clipfrac: fraction of tokens where PPO clipping activated. Rising clip fraction across epochs in multi-epoch PPO signals excessive divergence from the sampling policy.
  • critic/vf_explained_var: $1 - \text{Var}(G - V) / \text{Var}(G)$. Low explained variance means a stale value function that is no longer tracking actual returns.
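Both metrics are short enough to sketch. The clip-fraction function follows the usual PPO-clip definition (a token counts as clipped when the clipped surrogate loss is the larger of the two) and does not model dual-clip variants; the function names and the `clip_eps` default are assumptions.

```python
def pg_clipfrac(ratios, advantages, clip_eps=0.2):
    """Fraction of tokens where the clipped PPO loss exceeds the unclipped one."""
    clipped = 0
    for r, adv in zip(ratios, advantages):
        r_clipped = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        if -adv * r_clipped > -adv * r:
            clipped += 1
    return clipped / len(ratios)

def explained_variance(returns, values):
    """1 - Var(G - V) / Var(G), using population variances."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    residuals = [g - v for g, v in zip(returns, values)]
    return 1.0 - var(residuals) / var(returns)

print(pg_clipfrac([1.5, 1.5, 0.5, 1.0], [1.0, -1.0, -1.0, 1.0]))
print(explained_variance([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]))
print(explained_variance([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5]))
```

Clipping is sign-dependent: a ratio of 1.5 is clipped only when the advantage is positive, and a ratio of 0.5 only when it is negative. A critic that predicts the returns exactly scores 1.0 on explained variance, and one that predicts a constant scores 0.0.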

The SGA bias-variance framework

The clearest analytic framework for off-policy drift in LLM RL comes from the When Speed Kills Stability blog series (Liu, Li et al. Sep 2025). The usual question, "is this pipeline on-policy or off-policy?", turns out to be less useful than a more precise one: "what is the bias-variance profile of the gradient estimator under the actual distributional drift?"

The framework starts from the observation that policy gradient methods in LLM RL are instances of stochastic gradient ascent (SGA) on the expected reward $J(\theta) = \mathbb{E}_{\pi_\theta}[R(y)]$. When data is sampled from a behavior policy $\mu \neq \pi_\theta$, the gradient estimator becomes an importance-weighted surrogate. The key question is what happens at the token level versus the sequence level, because LLM RL computes token-level losses but assigns sequence-level rewards.

Token-level corrections optimize a surrogate. Standard practice in PPO and GRPO is to correct each token's contribution independently: the per-token ratio $\rho_t = \pi_\theta(y_t \mid x, y_{\lt t}) / \mu(y_t \mid x, y_{\lt t})$ appears in the clipped objective or as an IS weight. But this token-level correction optimizes a surrogate objective that differs from the true sequence-level expected reward. The surrogate is exact only when $\mu = \pi_\theta$; under distributional drift, it incurs a bias that grows with both the sequence length $T$ and the magnitude of the drift. Intuitively, the token-level ratio corrects each conditional $\pi_\theta(y_t \mid y_{\lt t})$ but does not account for the fact that the prefix $y_{\lt t}$ was also sampled from the wrong distribution. The accumulated prefix error is the bias.

Exact sequence-level IS is unbiased but exponentially variable. The theoretically correct correction uses the full sequence-level importance ratio:

$$ \rho_{1:T} = \prod_{t=1}^{T} \frac{\pi_\theta(y_t \mid x, y_{\lt t})}{\mu(y_t \mid x, y_{\lt t})} $$

This yields an unbiased estimator of the on-policy gradient, regardless of how different $\mu$ and $\pi_\theta$ are. The problem is variance: the product of $T$ per-token ratios can be astronomically large or small. Even with moderate per-token ratios (say, each in $[0.9, 1.1]$), a 500-token sequence can produce a sequence ratio anywhere from $\sim 10^{-23}$ to $\sim 10^{21}$. The resulting gradient estimate has variance that grows exponentially with horizon, making it useless in practice without aggressive truncation — which reintroduces bias.
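The range quoted above is a two-line calculation in log space. The sketch only checks the arithmetic for the extreme case where every one of the 500 per-token ratios sits at the same end of the band.

```python
import math

T = 500
log10_low = T * math.log10(0.9)   # every token at the lower end of [0.9, 1.1]
log10_high = T * math.log10(1.1)  # every token at the upper end

print(f"0.9^{T} is about 10^{log10_low:.1f}")
print(f"1.1^{T} is about 10^{log10_high:.1f}")
```

The exponents come out near $-22.9$ and $+20.7$, so the sequence ratio spans more than 40 orders of magnitude even though no single token ratio is off by more than 10%.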

Practical methods live on a bias-variance continuum. The SGA framework places every off-policy correction method on a spectrum:

  • At one extreme, pure token-level correction (PPO-clip, token-level TIS) has low variance but bias that grows with drift and length.
  • At the other extreme, exact sequence-level IS has zero bias but variance that can make learning impossible.
  • In between, methods like geometric-mean ratios (GSPO, GMPO), prefix ratios (MinPRO), turn-level IS, and second-moment-constrained objectives (M2PO, R$^2$VPO) trade bias for variance in different ways.

The framework's concrete contribution to diagnostics is identifying the right divergence measures for each failure mode:

  • Total variation (TV) distance between $\mu$ and $\pi_\theta$ bounds the bias of token-level surrogates. When $\text{TV}(\mu, \pi_\theta)$ is small, the surrogate is close to the true objective, and token-level corrections are safe.
  • Chi-squared divergence $\chi^2(\pi_\theta \| \mu) = \mathbb{E}_\mu[(\pi_\theta / \mu)^2] - 1$ bounds the variance of importance-weighted estimators. When $\chi^2$ is large, even unbiased sequence-level corrections are too noisy to be useful.

A healthy off-policy pipeline needs both metrics to be moderate. High TV means the bias of cheap corrections is large. High $\chi^2$ means the variance of exact corrections is large. When both are high, no simple correction works well, and the pipeline needs either tighter synchronization or a fundamentally different objective.

For practitioners, the SGA framework suggests a concrete diagnostic protocol: track both $\text{TV}$ (or its proxy, the per-token KL) and $\chi^2$ (at both token and sequence level) across training, then act on the combination:

  • TV low, $\chi^2$ low: you are effectively on-policy and no correction is needed.
  • TV moderate, $\chi^2$ low: token-level IS or geometric-mean RS suffices.
  • TV low, $\chi^2$ high (rare, but possible with long sequences and mild per-token drift): sequence-level truncation or masking is the right tool.
  • Both high: the mismatch is too severe for algorithmic correction alone. You need to reduce the system-level gap (precision alignment, weight synchronization, routing replay) before the algorithm can do its job.
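Both divergences can be estimated from log-probabilities that a rollout batch already carries, because $\text{TV}(\mu, \pi_\theta) = \tfrac{1}{2}\mathbb{E}_\mu[|\rho - 1|]$ and $\chi^2(\pi_\theta \| \mu) = \mathbb{E}_\mu[\rho^2] - 1$ with $\rho = \pi_\theta / \mu$. The following is a minimal token-level sketch, assuming the tokens were sampled from the rollout policy; `mismatch_diagnostics` is a hypothetical name and plain lists stand in for tensors.

```python
import math

def mismatch_diagnostics(logp_train, logp_rollout):
    """Monte Carlo estimates of TV and chi-squared from rollout-sampled tokens.

    Both inputs are flat lists of log-probabilities for the same sampled tokens.
    """
    ratios = [math.exp(lt - lr) for lt, lr in zip(logp_train, logp_rollout)]
    n = len(ratios)
    tv_hat = 0.5 * sum(abs(r - 1.0) for r in ratios) / n   # 0.5 * E_mu|rho - 1|
    chi2_hat = sum(r * r for r in ratios) / n - 1.0        # E_mu[rho^2] - 1
    return tv_hat, chi2_hat

# Identical engines: both estimates are exactly zero.
print(mismatch_diagnostics([-1.0, -2.0], [-1.0, -2.0]))
# A few percent of drift on every token.
print(mismatch_diagnostics([-1.05, -1.95, -0.52], [-1.0, -2.0, -0.5]))
```

For the sequence-level version, sum the log-probabilities per response before exponentiating. On small batches the $\chi^2$ estimate is noisy and can even come out slightly negative, so it is best read as a trend over training steps.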

Correcting mismatch — the rollout correction framework

Architecture overview

verl's rollout correction has two orthogonal tools that can be combined:

  • Importance sampling (IS) weights: reweight each sample's gradient contribution by the likelihood ratio between training and rollout policies. Makes the gradient unbiased (up to truncation) at the cost of higher variance.
  • Rejection sampling (RS) masks: zero out the response masks of sequences where the divergence exceeds a threshold. Removes high-divergence samples entirely, reducing variance but discarding data.

Configuration lives in RolloutCorrectionConfig (in verl/trainer/config/algorithm.py). The framework has two entry points: compute_rollout_correction_and_rejection_mask() is the atomic function that returns IS weights and the rejection mask; compute_rollout_correction_and_add_to_batch() is the batch-pipeline wrapper called from ray_trainer.py that attaches both to the batch. Both live in rollout_corr_helper.py.

Importance sampling correction

IS correction reweights the gradient so that training on off-policy data approximates the on-policy gradient. Two aggregation levels are available:

Token-level truncated IS (Token TIS). Each token gets its own weight, clipped to prevent extreme values:

$$ w_t = \min(\rho_t, C) $$

where $\rho_t = \exp(\log\pi_{\text{old}} - \log\pi_{\text{rollout}})$ and $C$ is the truncation threshold (typical range: 1.5–5.0). Biased (clipping introduces bias) but low variance, so it works well as a default.
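As a minimal sketch of the formula above (not verl's implementation), the function below computes token-level truncated weights for one response. The function name and the list-based inputs are illustrative assumptions; the default threshold of 2.0 mirrors the value shown for the token-TIS presets.

```python
import math

def token_tis_weights(logp_old, logp_rollout, threshold=2.0):
    """Token-level truncated IS weights w_t = min(rho_t, C) for one response."""
    weights = []
    for lo, lr in zip(logp_old, logp_rollout):
        rho = math.exp(lo - lr)          # rho_t = pi_old / pi_rollout
        weights.append(min(rho, threshold))
    return weights

w = token_tis_weights([-0.1, -3.0, -0.5], [-0.1, -4.0, -0.6], threshold=2.0)
print(w)  # the second token has rho = e^1 (about 2.72) and is truncated to 2.0
```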

Sequence-level truncated IS (Sequence TIS). The entire sequence gets a single weight based on the product of token ratios:

$$ w = \min\left(\prod_t \rho_t, C\right) $$

with a typical threshold range of 2.0–10.0. Before truncation, this is closer to the unbiased sequence-level estimator than token TIS. In practice, however, the product of ratios is almost always truncated for long sequences (see the length trap below), and aggressive truncation reintroduces bias — potentially more than token TIS for long outputs. The main advantage of sequence TIS is that it preserves the correct correction unit (the sequence) at the cost of higher variance.

All computations happen in log-space with a safety bound of $\pm 20$ (since $e^{20} \approx 4.85 \times 10^8$) to prevent numerical overflow. After truncation, IS weights are optionally batch-normalized to mean 1.0, which prevents the effective learning rate from drifting.
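The log-space bound, truncation, and batch normalization can be sketched together. This is an illustration under stated assumptions, not verl's code: the clamp is applied to the summed log-ratio, the names `LOG_BOUND` and `seq_tis_weights` are hypothetical, and nested lists stand in for padded tensors.

```python
import math

LOG_BOUND = 20.0  # safety bound on log-ratios, matching the +/-20 in the text

def seq_tis_weights(batch_logp_old, batch_logp_rollout, threshold=2.0, normalize=True):
    """One truncated IS weight per sequence, computed in log-space."""
    weights = []
    for logp_old, logp_rollout in zip(batch_logp_old, batch_logp_rollout):
        log_ratio = sum(lo - lr for lo, lr in zip(logp_old, logp_rollout))
        log_ratio = max(-LOG_BOUND, min(LOG_BOUND, log_ratio))  # clamp before exp
        weights.append(min(math.exp(log_ratio), threshold))
    if normalize:
        mean_w = sum(weights) / len(weights)
        weights = [w / mean_w for w in weights]  # rescale so the batch mean is 1.0
    return weights

batch_old = [[-0.1, -0.2], [-1.0] * 400]
batch_rollout = [[-0.1, -0.2], [-1.1] * 400]
print(seq_tis_weights(batch_old, batch_rollout, normalize=False))  # -> [1.0, 2.0]
print(seq_tis_weights(batch_old, batch_rollout))
```

The second sequence drifts by only 0.1 nats per token, yet its summed log-ratio of 40 hits the clamp and then the truncation threshold, which is the long-sequence behavior the paragraph above warns about.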

To enable token-level IS correction:

# Enable token-level IS correction
from verl.trainer.config.algorithm import RolloutCorrectionConfig

rollout_corr_config = RolloutCorrectionConfig.decoupled_token_is()

Rejection sampling correction

RS correction identifies and excludes sequences where the training and rollout policies disagree too strongly. Three KL estimators are available:

K1 (ratio-based). Computes the raw log-ratio at each token:

$$ d_{k1}(t) = -\log\rho_t = \log\pi_{\text{rollout}} - \log\pi_{\text{old}} $$

With K1, thresholds are specified as ratio bounds (lower and upper), rejecting sequences where the ratio falls outside $[\ell, u]$.

K2 (MSE-based). A squared divergence measure:

$$ d_{k2}(t) = \frac{1}{2}(\log\rho_t)^2 $$

K2 is always non-negative. The threshold is an upper bound on the divergence.

K3 (low-variance KL). Schulman's estimator:

$$ d_{k3}(t) = \rho_t - \log\rho_t - 1 $$

Also always non-negative, with lower variance than K1.
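The three estimators are easy to compare side by side. The sketch below evaluates them for a single token, using the same convention as above ($\rho_t = \pi_{\text{old}} / \pi_{\text{rollout}}$); `kl_estimators` is a hypothetical helper name.

```python
import math

def kl_estimators(logp_old: float, logp_rollout: float):
    """Per-token K1, K2, K3 divergences with rho = pi_old / pi_rollout."""
    log_rho = logp_old - logp_rollout
    k1 = -log_rho                          # signed; can be negative
    k2 = 0.5 * log_rho ** 2                # always >= 0
    k3 = math.exp(log_rho) - log_rho - 1   # always >= 0
    return k1, k2, k3

for log_rho in (-0.5, 0.0, 0.5):
    k1, k2, k3 = kl_estimators(log_rho, 0.0)
    print(f"log_rho={log_rho:+.1f}  k1={k1:+.4f}  k2={k2:.4f}  k3={k3:.4f}")
```

K1 changes sign with the direction of the drift, which is why it is thresholded with a two-sided ratio band, while K2 and K3 only need an upper bound.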

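These token-level divergences can be thresholded per token or aggregated to the sequence level by sum, mean, or max. Crossing the four schemes with the three estimators yields 11 modes in total (there is no seq_max_k1 mode):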

| Aggregation | K1 | K2 | K3 |
|---|---|---|---|
| Token (per-token threshold) | token_k1 | token_k2 | token_k3 |
| Seq sum ($\sum_t d_t$) | seq_sum_k1 | seq_sum_k2 | seq_sum_k3 |
| Seq mean ($\frac{1}{T}\sum_t d_t$) | seq_mean_k1 | seq_mean_k2 | seq_mean_k3 |
| Seq max ($\max_t d_t$) | - | seq_max_k2 | seq_max_k3 |

The sequence-mean modes (seq_mean_*) solve the length trap (detailed in the failure modes section below). They are called geometric-mean modes because, for K1, the mean log-ratio is the log of the geometric mean of the token ratios. With sequence-sum aggregation, long sequences accumulate higher total divergence even if per-token divergence is small, so they get preferentially rejected. For chain-of-thought (CoT) and agent tasks where response lengths vary dramatically, this creates a systematic bias against long (and often correct) reasoning traces. Mean normalization divides by sequence length, making the criterion length-invariant (Li & Liu 2025). See also kalomaze (2026) for a broader argument against length-dependent rollout exclusion.
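A small experiment makes the length effect concrete. The sketch below gives a short and a long response the same mild per-token divergence and compares sum and mean aggregation; the threshold of 1.0 and the per-token value of 0.002 are hypothetical numbers chosen for illustration.

```python
def aggregate(divs, mode):
    """Aggregate per-token divergences d_t into one sequence-level score."""
    if mode == "seq_sum":
        return sum(divs)
    if mode == "seq_mean":
        return sum(divs) / len(divs)
    if mode == "seq_max":
        return max(divs)
    raise ValueError(f"unknown mode: {mode}")

per_token_div = 0.002               # identical mild drift on every token
short, long_ = [per_token_div] * 50, [per_token_div] * 5000
threshold = 1.0                     # hypothetical rejection threshold

for name, seq in (("short", short), ("long", long_)):
    s, m = aggregate(seq, "seq_sum"), aggregate(seq, "seq_mean")
    print(name, "sum rejected:", s > threshold, "mean rejected:", m > threshold)
```

Only the long response is rejected under sum aggregation, even though both responses are equally close to the rollout policy per token.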

Bypass vs decoupled mode

The framework operates in one of two modes, differing in how many distinct policies are involved.

Bypass mode (2 policies). Sets $\pi_{\text{old}} = \pi_{\text{rollout}}$, skipping the expensive actor forward pass that recomputes log-probabilities on the training engine. The IS ratio is computed directly between the current policy $\pi_\theta$ and the rollout policy. This is faster (one fewer forward pass per batch) but less accurate: it conflates training-inference mismatch with policy staleness into a single ratio. Implemented in apply_bypass_mode() in rollout_corr_helper.py.

Decoupled mode (3 policies). Recomputes $\pi_{\text{old}}$ on the training engine, giving three distinct sets of log-probabilities: $\pi_{\text{rollout}}$ (from the rollout engine), $\pi_{\text{old}}$ (from the training engine at the rollout weights), and $\pi_\theta$ (from the training engine at the current weights). This lets you measure and correct the two factors of the importance weight decomposition independently. More accurate, but costs an additional forward pass.

When to use which: Bypass when the mismatch is small (synchronized pipelines, same precision, or when speed matters more than correction accuracy). Decoupled when mismatch is large (async pipelines, precision gaps, multi-epoch training where you need precise staleness measurement).
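The difference between the two modes comes down to which factors of the ratio are observable. The sketch below, with hypothetical log-probabilities for a single token, shows that the full ratio seen in bypass mode is the product of the staleness factor and the engine-mismatch factor that decoupled mode measures separately.

```python
import math

def decompose_ratio(logp_theta: float, logp_old: float, logp_rollout: float):
    """Split the full ratio pi_theta / pi_rollout into its two factors."""
    staleness = math.exp(logp_theta - logp_old)     # pi_theta / pi_old (policy update)
    mismatch = math.exp(logp_old - logp_rollout)    # pi_old / pi_rollout (engine gap)
    full = math.exp(logp_theta - logp_rollout)      # the single ratio bypass mode sees
    return staleness, mismatch, full

staleness, mismatch, full = decompose_ratio(-1.00, -1.10, -1.25)
print(staleness, mismatch, full)
```

Bypass mode never computes `logp_old`, so it can only observe `full` and cannot tell a large policy update from a large engine gap.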

The full set of configuration presets:

| Preset | Mode | IS | RS | Loss |
|---|---|---|---|---|
| bypass_ppo_clip | Bypass | - | - | PPO-clip |
| bypass_ppo_clip_geo_rs | Bypass | - | seq_mean_k1 | PPO-clip |
| bypass_ppo_clip_k3_rs | Bypass | - | seq_mean_k3 | PPO-clip |
| bypass_pg_is | Bypass | Seq TIS (2.0) | - | REINFORCE |
| bypass_pg_geo_rs | Bypass | - | seq_mean_k1 | REINFORCE |
| bypass_pg_geo_rs_seq_tis | Bypass | Seq TIS (2.0) | seq_mean_k1 | REINFORCE |
| bypass_pg_geo_rs_token_tis | Bypass | Token TIS (2.0) | seq_mean_k1 | REINFORCE |
| decoupled_token_is | Decoupled | Token TIS (2.0) | - | (any) |
| decoupled_seq_is | Decoupled | Seq TIS (2.0) | - | (any) |
| decoupled_seq_is_rs | Decoupled | Seq TIS (2.0) | seq_sum_k1 | (any) |
| decoupled_geo_rs | Decoupled | - | seq_mean_k1 | (any) |
| decoupled_geo_rs_seq_tis | Decoupled | Seq TIS (2.0) | seq_mean_k1 | (any) |
| decoupled_geo_rs_token_tis | Decoupled | Token TIS (2.0) | seq_mean_k1 | (any) |
| decoupled_k3_rs | Decoupled | - | seq_mean_k3 | (any) |
| decoupled_k3_rs_seq_tis | Decoupled | Seq TIS (2.0) | seq_mean_k3 | (any) |
| decoupled_k3_rs_token_tis | Decoupled | Token TIS (2.0) | seq_mean_k3 | (any) |
| decoupled_token_icepop | Decoupled | Token TIS w/ band [0.5, 5.0] | - (mask via IS band) | (any) |
| bypass_pg_token_icepop | Bypass | Token TIS w/ band [0.5, 5.0] | - (mask via IS band) | REINFORCE |
| disabled | - | - | - | (any) |

The two *_token_icepop presets implement IcePop-style double-sided masking: rollout_is_threshold is set to a lower_upper string like "0.5_5.0", which zeroes the IS weight (and therefore the token's gradient contribution) whenever the engine-mismatch ratio falls outside the band. They are a verl translation of the Ring-1T mechanism described above.
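The band rule itself fits in a few lines. The sketch below is an illustration of double-sided masking, not verl's parser or kernel: `icepop_band_weight` and `parse_band` are hypothetical helpers, and the only thing taken from the text is the "lower_upper" string format and the [0.5, 5.0] band.

```python
def icepop_band_weight(rho: float, lower: float = 0.5, upper: float = 5.0) -> float:
    """Keep the IS weight inside [lower, upper]; zero it (mask the token) outside."""
    return rho if lower <= rho <= upper else 0.0

def parse_band(spec: str):
    """Parse a 'lower_upper' string such as '0.5_5.0' into two floats."""
    lower, upper = spec.split("_")
    return float(lower), float(upper)

lower, upper = parse_band("0.5_5.0")
print([icepop_band_weight(r, lower, upper) for r in (0.3, 0.9, 1.0, 4.0, 7.5)])
# -> [0.0, 0.9, 1.0, 4.0, 0.0]
```

Unlike one-sided truncation, which caps a large ratio at the threshold, the band drops out-of-range tokens entirely, including tokens whose ratio is too small.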

Policy loss functions as staleness mechanisms

Your choice of policy loss function determines how the training objective responds to the gap between the current policy and the sampling policy. Each loss implicitly handles (or fails to handle) off-policy drift through its trust region.

Vanilla PPO clip

The standard PPO clipped objective (Schulman et al. 2017) is the most widely used:

$$ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[\min\left(r_t(\theta)\hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] $$

where $r_t(\theta) = \pi_\theta(y_t \mid x, y_{\lt t}) / \pi_{\text{old}}(y_t \mid x, y_{\lt t})$.

PPO clip works well on-policy but breaks down under distribution shift. It over-penalizes tokens where the old policy assigned low probability (small absolute changes produce large ratios) and under-penalizes tokens where the old policy assigned high probability (large absolute changes produce small ratios). Under off-policy drift, this asymmetry compounds: $r_t$ reflects both legitimate policy updates and spurious mismatch, and the clipping mechanism cannot tell them apart (Qi et al. 2026). verl's vanilla loss also implements dual-clip PPO (Ye et al. 2020) with a configurable constant clip_ratio_c (default 3.0; the dual-clip paper takes $c > 1$). On negative-advantage tokens the per-token objective is floored at $c\,\hat{A}_t$, so a ratio that grows far above 1 cannot produce an unboundedly large penalty. Standard PPO clip has no such bound on that side.
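The per-token behavior of the clipped objective, with and without the dual-clip bound, can be sketched as follows. This is a scalar illustration rather than verl's tensor code; the function name and the default $\epsilon = 0.2$ are assumptions, while `clip_ratio_c=3.0` follows the default quoted above.

```python
def ppo_clip_objective(ratio, adv, eps=0.2, clip_ratio_c=None):
    """Per-token PPO-clip objective (to be maximized); optional dual-clip bound."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    obj = min(ratio * adv, clipped * adv)
    if clip_ratio_c is not None and adv < 0:
        # Dual-clip: never let a negative-advantage token score below c * A.
        obj = max(obj, clip_ratio_c * adv)
    return obj

print(ppo_clip_objective(1.5, +1.0))                     # 1.2   (clipped at 1 + eps)
print(ppo_clip_objective(10.0, -1.0))                    # -10.0 (penalty grows with the ratio)
print(ppo_clip_objective(10.0, -1.0, clip_ratio_c=3.0))  # -3.0  (bounded by c * A)
```

The second call shows the case the dual-clip bound targets: with a negative advantage, the pessimistic `min` selects the unclipped term, so the penalty scales with the ratio unless a bound is added.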

DPPO-TV / DPPO-KL

DPPO (Qi et al. 2026) replaces ratio-based clipping with per-token divergence bounds:

  • DPPO-TV: constrains total variation distance between $\pi_\theta$ and $\pi_{\text{old}}$ at each token position.
  • DPPO-KL: constrains binary KL divergence at each token position.

These bound the actual distributional distance rather than relying on a ratio proxy, so the trust region size is invariant to the base probability level. More stable under distribution shift. Both use truncated IS (threshold 20.0) internally to stabilize the ratio computation.

GSPO — geometric sequence ratios

GSPO (Zheng et al. 2025) operates at the sequence level using the geometric mean of token ratios:

$$ \bar{r} = \exp\left(\frac{1}{T}\sum_{t=1}^{T} \log r_t\right) $$

This geometric mean is then clipped and used in a PPO-style objective. Dividing by $T$ in the exponent prevents long sequences from producing extreme ratios, making GSPO naturally resistant to the length trap. The vanilla GSPO objective uses ordinary gradient flow through $\log\bar{r}$ — only the GSPO-token variant (§4.3 of the paper) introduces a stop-gradient on $\bar{r}$ so that gradients reach $\log\pi_\theta$ only through the per-token factors, which is what enables per-token credit assignment in long CoT.
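The length-invariance of the geometric-mean ratio is easy to verify numerically. The sketch below compares it with the plain product ratio for a short and a long response that share the same per-token ratio; both helper names are hypothetical.

```python
import math

def gspo_sequence_ratio(logp_theta, logp_old):
    """Geometric mean of per-token ratios: exp(mean_t log r_t)."""
    log_r = [lt - lo for lt, lo in zip(logp_theta, logp_old)]
    return math.exp(sum(log_r) / len(log_r))

def product_ratio(logp_theta, logp_old):
    """Plain sequence ratio prod_t r_t, for comparison."""
    return math.exp(sum(lt - lo for lt, lo in zip(logp_theta, logp_old)))

for T in (10, 1000):
    theta, old = [-1.00] * T, [-1.05] * T   # every token has r_t = e^0.05
    print(T, round(gspo_sequence_ratio(theta, old), 4), product_ratio(theta, old))
```

The geometric mean stays near 1.05 at both lengths, so a fixed clip range means the same thing for every response, while the product grows from about 1.6 to beyond $10^{21}$.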

SAPO — smooth objective shaping

SAPO (Gao et al. 2025) replaces PPO's hard clip with a smooth surrogate objective whose shaping function is

$$ g(r, \tau) = \sigma(\tau(r - 1)) \cdot \frac{4}{\tau} $$

where $\tau$ is a temperature parameter. The function itself is monotone in $r$ and asymptotes at $4/\tau$ — it is the objective shape, not a gradient gate. The actual gradient weight on $\log\pi_\theta$ is its derivative, $g'(r,\tau) = 4\,\sigma(\tau(r-1))(1-\sigma(\tau(r-1)))$, which peaks at $r=1$ and decays smoothly on both sides. Separate temperatures $\tau_{\text{pos}}$ and $\tau_{\text{neg}}$ for positive and negative advantages give asymmetric trust regions. The smooth shape avoids the discontinuous gradients of PPO's hard clip while still bounding the effective step size near $r=1$.
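The distinction between the shape $g$ and its derivative can be checked numerically. The sketch below implements both formulas as written above and compares the analytic derivative with a central finite difference; the temperature $\tau = 2$ is an arbitrary illustrative value, not a recommended setting.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sapo_shape(r: float, tau: float) -> float:
    """g(r, tau) = sigmoid(tau * (r - 1)) * 4 / tau."""
    return sigmoid(tau * (r - 1.0)) * 4.0 / tau

def sapo_gate(r: float, tau: float) -> float:
    """Analytic derivative dg/dr = 4 * s * (1 - s), with s = sigmoid(tau * (r - 1))."""
    s = sigmoid(tau * (r - 1.0))
    return 4.0 * s * (1.0 - s)

tau, h = 2.0, 1e-6
for r in (0.5, 1.0, 1.5):
    numeric = (sapo_shape(r + h, tau) - sapo_shape(r - h, tau)) / (2 * h)
    print(r, round(sapo_gate(r, tau), 6), round(numeric, 6))
```

The derivative equals 1 at $r = 1$ and falls off symmetrically on both sides, while the shape itself keeps increasing toward $4/\tau$.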

GMPO — geometric mean clipping

GMPO (Zhao et al. 2025) applies sign-aware geometric-mean clipping at the sequence level. It clamps the per-token log-ratio $\log\pi_\theta - \log\pi_{\text{old}}$ within $[-\epsilon_{\text{low}}, \epsilon_{\text{high}}]$ based on the sign of the advantage, then exponentiates the sequence mean.

CISPO — stop-gradient clipping

CISPO (MiniMax 2025) applies a stop-gradient to the clipped ratio:

$$ \mathcal{J}_{\text{CISPO}} = \mathbb{E}\left[\frac{1}{\sum_{i} |o_i|} \sum_{i} \sum_{t} \text{sg}\left(\hat{r}_{i,t}(\theta)\right) \hat{A}_{i,t} \log \pi_\theta(o_{i,t} \mid q, o_{i,\lt t})\right] $$

where $\hat{r}_{i,t}(\theta) = \text{clip}(r_{i,t}(\theta), 1 - \epsilon_{\text{low}}, 1 + \epsilon_{\text{high}})$ and $\text{sg}(\cdot)$ is the stop-gradient operator. Gradients flow only through $\log\pi_\theta$, not through the ratio. This preserves gradients for reflective reasoning tokens that PPO clip would suppress (because their ratios hit the clip boundary).

One caveat: the DPPO analysis shows CISPO lacks an explicit trust region. The stop-gradient removes the mechanism by which the clipped ratio constrains update size. Under growing mismatch, this can lead to progressively larger updates, which increase mismatch further, a positive feedback loop.
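Both the benefit and the caveat show up in the coefficient that multiplies $\nabla\log\pi_\theta$ for a single token. The sketch below derives that coefficient by hand for PPO-clip and for CISPO, ignoring the length normalization; the function names and the default $\epsilon$ values are illustrative assumptions.

```python
def ppo_clip_grad_coef(ratio, adv, eps_low=0.2, eps_high=0.2):
    """Coefficient on grad log pi_theta under PPO-clip; zero once the clip binds."""
    clipped = max(1.0 - eps_low, min(1.0 + eps_high, ratio))
    unclipped_active = ratio * adv <= clipped * adv   # min() picks the unclipped term
    return ratio * adv if unclipped_active else 0.0

def cispo_grad_coef(ratio, adv, eps_low=0.2, eps_high=0.2):
    """Coefficient under CISPO: sg(clip(r)) * A, which is never zeroed by clipping."""
    clipped = max(1.0 - eps_low, min(1.0 + eps_high, ratio))
    return clipped * adv

for ratio in (1.0, 1.5, 3.0):
    print(ratio, ppo_clip_grad_coef(ratio, 1.0), cispo_grad_coef(ratio, 1.0))
```

For a positive-advantage token whose ratio has already reached 1.5 or 3.0, PPO-clip contributes no gradient, while CISPO keeps pushing with a capped weight. That retained push is what preserves the signal, and it is also why nothing in the objective stops the ratio from moving further.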

GPG (group policy gradient)

GPG (gpg) is a REINFORCE-style loss: it uses the raw policy-gradient surrogate $-\log\pi_\theta \cdot \hat{A}$ without ratio clipping, combined with group-normalized advantages (similar to GRPO). When rollout IS weights are provided, they are applied as explicit multipliers. GPG is useful as a simpler baseline that avoids the ratio-clipping artifacts of PPO while still benefiting from the rollout correction framework.

Covariance-based token filtering

verl also provides two covariance-based losses that take a different approach to staleness: instead of constraining the ratio or divergence, they identify and suppress individual tokens where the advantage estimate is unreliable.

Clip-cov (clip_cov) computes the covariance $\text{Cov}(\hat{A}, \log\pi)$ per token. Tokens with covariance in a specified range are randomly sampled and their loss contribution is zeroed out, reducing the influence of tokens where the advantage and log-probability are strongly coupled (a sign of noisy gradient signal under off-policy data).

KL-cov (kl_cov) takes a softer approach: instead of zeroing high-covariance tokens, it adds a KL penalty $\beta \cdot D_{\text{KL}}(\pi_{\theta_{\text{old}}} \| \pi_\theta)$ on those tokens, discouraging large policy updates where the gradient signal is unreliable. (The paper's eq. 14 uses formal KL; some implementations approximate it with $\beta|\log\pi_\theta - \log\pi_{\text{old}}|$ for efficiency.) Both clip_cov and kl_cov were introduced in Cui et al. 2025 (arXiv:2505.22617).

Bypass mode loss

The bypass mode loss (compute_policy_loss_bypass_mode() in core_algos.py) dispatches to either PPO-clip or REINFORCE while layering the full rollout correction framework:

  • For REINFORCE: IS weights are applied explicitly as multipliers on the policy gradient.
  • For PPO-clip: IS weights are not applied (in bypass mode, $r_t(\theta) = \pi_\theta / \pi_{\text{rollout}}$ already incorporates the full policy shift, so an additional IS multiplier would double-count). RS masks are applied in both cases.

This is verl's primary anti-staleness mechanism. It combines a policy loss with IS reweighting and RS filtering in one place, so correction applies regardless of which underlying loss you pick.

The broader algorithmic landscape

The previous sections covered the core rollout correction framework: IS and RS correction, trust-region losses, and IS-aware baselines. But these are pieces of a larger algorithmic conversation that has unfolded rapidly since mid-2025. Once the community recognized that most "on-policy" LLM RL is secretly off-policy, a wave of methods appeared that redesign how importance ratios are constructed, how trust regions are enforced, and whether off-policyness should be corrected at all or simply accepted as the default operating condition.

This section surveys those methods, organized by design philosophy rather than chronology.

An organizing principle that helps when reading the methods below: the correction unit must match the source of mismatch. Token-level clipping is the right tool when the mismatch is per-token numerical noise, but it is the wrong tool when the mismatch is prefix-state drift, a quantized rollout policy, KV-cache eviction, an async snapshot whose old logits were never saved, a routed MoE feedback path, or a speculative-decoding window. Each cell in the comparison table below picks a different unit at which behavior-policy provenance is preserved or reconstructed; the rest is implementation.

A comparison map of all methods

Before walking through the design families, here is a single dense view of every method this post discusses, aligned on the axes that distinguish them.

| Method | Aggregation | Trust-region mechanism | IS treatment | Target failure mode | Model class | Stop-gradient? | Scale demonstrated | Reference |
|---|---|---|---|---|---|---|---|---|
| Vanilla PPO clip | Token | Symmetric ratio clip | Implicit | General stability | Both | No | 100B–1T | 1707.06347 |
| Dual-clip PPO | Token | Asymmetric clip + floor on neg-A | Implicit | Length blow-up on neg-A | Dense | No | up to 70B | 1912.09729 |
| TOPR | Trajectory | None on +A; truncated IS on −A | Sign-asymmetric trajectory IS | Negative-trajectory over-suppression | Dense | No | 7B | 2503.14286 |
| ASPO | Token | Symmetric clip + ratio-flip on +A | Token, sign-asymmetric | Symmetric clip wastes +A signal | Dense | No | 7–30B | 2510.06062 |
| MinPRO | Strict prefix | Implicit (prefix-min ratio is bounded) | $\bar\rho_t \cdot \rho_t$, prefix surrogate | Length trap, accumulated prefix bias | Dense | No | 7B | 2601.22718 |
| CTPO | Cumulative prefix | $\sqrt{t}$-scaled log clip | Unbiased prefix IS | Same as MinPRO, with theory | Dense | No | 7B | 2605.07331 |
| M2PO | Token (seq constraint) | Second-moment bound on $\log\rho$ | Token IS under χ² budget | Heavy-tailed χ²; prosperity-before-collapse | Both | No | 7–30B | 2510.01161 |
| GSPO | Sequence (geo-mean) | Clip on $\bar r$ | Geo-mean of token ratios | Length trap; MoE | Both | No | 100B–1T | 2507.18071 |
| GSPO-token | Sequence (geo-mean) | Clip on $\bar r$ + stop-grad | Geo-mean of token ratios | Length trap; per-token credit | Both | Yes | 100B–1T | 2507.18071 |
| GMPO | Sequence (geo-mean) | Sign-aware log-ratio clamp | Geo-mean of token ratios | Length trap; sign asymmetry | Dense | No | 7B | 2507.20673 |
| GPG | Token | None (REINFORCE) | External IS multiplier | Baseline for IS-aware experiments | Dense | No | 7–30B | verl impl |
| CISPO | Token | Clipped ratio with stop-grad on ratio | Stop-gradient IS | Reflective-token suppression | Dense | Yes | 100B+ | 2506.13585 |
| SAPO | Token | Smooth sigmoid-shape objective | Token gate via derivative | Hard-clip discontinuity | Dense | No | 7B | 2511.20347 |
| DPPO-TV | Token | TV-distance bound | Truncated IS (20) | Ratio-proxy failure | Both | No | 7–30B | 2602.04879 |
| DPPO-KL | Token | Binary-KL bound | Truncated IS (20) | Ratio-proxy failure | Both | No | 7–30B | 2602.04879 |
| VESPO | Sequence (response) | Variational reshaping kernel | Sequence variational IS | High variance of exact seq-IS | Both | No | 7–30B, lag 64× | 2602.10693 |
| VCPO | Token / sample | ESS-guided LR scale | OPOB variance-optimal baseline | ESS collapse under async | Dense | No | 7–30B | 2602.17616 |
| OAPL | Sequence | Regression on log-ratio to $\pi_{\text{inf}}$ | None (built-in) | Persistent policy lag | Dense | No | 7–30B, lag 400+ | 2602.19362 |
| R²VPO | Group | Variance-aware clip range from group $\text{Var}(\rho)$ | Token IS with adaptive clip | Hard-clip wastefulness | Dense | No | 7–30B | 2601.03320 |
| DISPO | Token, sign-conditioned | Four clip regimes (±A × correct/incorrect) | Decoupled token IS | Sign-conditioned failure modes | Dense | No | 7–30B | 2602.00983 |
| QUATRO | Token, per-query | Per-query Lagrange $\lambda_q^*$ from reward variance | Token IS with query-adaptive radius | Per-prompt drift sensitivity | Both | No | 7–30B | 2602.04620 |
| TRM | Sequence | $\max_t D_{\text{KL}}(c_t) \le \delta$ admission test | None (pure mask) | One-bad-token-corrupts-sequence | Both | No | 7–30B | 2512.23075 |
| Turn-level ST-PPO | Turn | Clip + within-turn renorm | Turn-level IS | Multi-turn agentic lag | Dense | No | 7–30B | 2511.20718 |
| OTB | Token (advantage) | N/A | IS-aware variance-min baseline | Off-policy advantage bias | Dense | No | 7–30B | 2602.07078 |
| KL-cov | Token | Formal $\beta \cdot D_{\text{KL}}$ on high-cov tokens | None | Noisy advantage on coupled tokens | Dense | No | 7B | 2505.22617 |
| Clip-cov | Token | Zero-out high-cov tokens | None | Same | Dense | No | 7B | 2505.22617 |
| R3 (Routing Replay) | Token (router level) | Replay rollout routing during training | None | MoE routing flip | MoE | n/a | 100B+ | 2510.11370 |
| IcePop | Token | Probability-ratio band mask $[\alpha,\beta]$ | $M(k)=k$ in-band, 0 outside | Train/infer log-prob discrepancy | Both (originated in MoE) | n/a | 1T (Ring-1T) | 2510.18855 |
| Keep Routing | Token (router level) | Enforce sampling-time expert paths | None | MoE routing flip in production | MoE | n/a | 1T+ | 2512.02556 |
| Keep Sampling Mask | Token | Replay top-p/top-k truncation mask | None | Action-subspace mismatch | Both | n/a | 1T+ | 2512.02556 |
| ESPO | Entropy-grouped sequence | Length-normalized group ratio + entropy-adaptive clip | Entropy-bucketed IS | Production MoE GRPO variance | MoE | Partial | 100B+ | 2512.07710 |
| FlashRL | Token | None | IS over 8-bit rollout | Quantization mismatch | Both | n/a | 7–30B | FlashRL repo |
| FP8-RL | Token | None | Token TIS + FP8 KV correction | Quantization (production stack) | Both | n/a | 7–100B | 2601.18150 |
| Jet-RL | Token | None | Unified FP8 flow (eliminate IS need) | Mixed-precision drift | Dense | n/a | 7–30B | 2601.14243 |
| QuRL | Token | Adaptive clip on FP/quant ratio | Quantized actor as 1st-class | Quantization-as-entropy | Dense | n/a | 7B | 2602.13953 |
| QeRL | Token | None | Quantization noise as exploration | Quantization as exploration bonus | Dense | n/a | 32B on H100 | 2510.11696 |
| QaRL / TBPO | Sequence + quant fwd | Trust-band dual clipping for neg samples | Train fwd aligned to quant rollout | Quantized rollout TIM; long-form artifacts | MoE | n/a | Qwen3-30B-A3B | 2604.07853 |
| A-3PO | Token | Logprob interpolation for proximal anchor | Implicit | Async forward-pass cost | Both | n/a | 7–100B | 2512.06547 |
| ECHO-2 | Token | Bounded-staleness as user knob | Staleness-aware IS | Async lag as exposed parameter | Both | n/a | 7–100B | 2602.02192 |
| ROLL Flash | Token | Async-ratio bounds policy-version gap | Token IS | Per-sample staleness | Both | n/a | 7–30B | 2510.11345 |
| AReaL | Batch | Hard-clip + staleness-aware PPO | Token TIS | Async lag | Dense | Partial | up to 70B | 2505.24298 |
| Periodic Asynchrony | Batch | None (sync after each batch) | None | Async without staleness | Both | n/a | 8B (3×) | 2511.18871 |
| Seer | Batch | None (system) | None | Sync-rollout long-tail latency | Both | n/a | up to 70B | 2511.14617 |
| DORA | Trajectory | Multi-version streaming rollout | Bounded staleness | Async bubbles + long-tail trajectories | Both | n/a | 30–100B | 2604.26256 |
| Logit Fusion | Token | Single-side cap (no lower clip) | Mixed-policy IS, capped | Deliberate teacher mixing | Dense | n/a | 7B | Notion |
| LR scheduling (Zhang 2026) | n/a | None (optimizer layer) | None | Optimizer-amplified mismatch | Both | n/a | 7–100B | 2602.01826 |
| TBIK | n/a | TP-invariant reduction kernels | None | TP-induced kernel divergence | Both | n/a | up to 70B | 2511.17826 |
| FP16 alignment | n/a | Switch both engines to FP16 | None | BF16 precision mismatch | Both | n/a | up to 30B | 2510.26788 |

Reading the table. Three clusters dominate. Token-clip variants (PPO, ASPO, DISPO, R²VPO, CISPO, SAPO, KL-cov, clip-cov) keep the per-token loss structure and modify the trust region — the progression is hard-clip → soft-gate → divergence-bounded → second-moment-bounded → variance-aware → query-adaptive. Aggregation-shifters (TOPR, MinPRO, CTPO, GSPO, GMPO, VESPO, TRM, turn-level ST-PPO) carve out a bias-variance Pareto frontier for sequence-correct objectives at different aggregation units. Systems-aligned correctors (R3, IcePop, Keep Routing, Keep Sampling Mask, ESPO, TBIK, FP16/FP8/QuRL/QaRL/QeRL, A-3PO, ECHO-2, ROLL Flash, AReaL, Periodic Async, Seer, DORA) attack the engineering root cause and let standard PPO/GRPO ride on top.

A heuristic for picking a row: (i) if your pipeline is synchronous and dense, start with vanilla PPO-clip + decoupled_geo_rs and read the decision tree for any symptom you observe; (ii) if you are async, layer ECHO-2-style staleness bounds and consider VCPO's ESS-driven LR scaling; (iii) if you are MoE, routing alignment is non-negotiable — pick IcePop (lightweight masking) or Keep Routing (production-grade) before tuning the loss; (iv) if you are quantized, treat the quantization gap as a first-class behavior policy (QuRL/QaRL) rather than a defect to be patched.

Of the nearly 50 methods in the table, six are the practical defaults a 2026 practitioner should know by heart: PPO-clip, GSPO, M2PO, an MoE routing aligner (IcePop or Keep Routing), FP8-RL, and VCPO. The rest are mapped here so you can locate them when you encounter them in the literature, not so you must learn all of them.

Importance sampling design

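Standard IS applies a single ratio $\rho_t = \pi_\theta(y_t \mid x, y_{\lt t}) / \pi_{\text{old}}(y_t \mid x, y_{\lt t})$ symmetrically to all tokens regardless of advantage sign, aggregation level, or causal structure. (In this section $\rho_t$ denotes this current-to-behavior ratio, which the PPO section writes as $r_t$; it is not the engine-mismatch ratio $\pi_{\text{old}} / \pi_{\text{rollout}}$ of the rollout correction section.) The methods below challenge each of those defaults.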

Asymmetric trajectory-level treatment. The earliest departure from symmetric IS in this wave is TOPR (Tapered Off-Policy REINFORCE, Le Roux et al. 2025), which operates at the trajectory level: each complete response gets a single importance weight $\pi(\tau)/\mu(\tau)$, and the asymmetry is applied to the trajectory-level ratio based on the sign of the trajectory reward. For positive-reward trajectories, TOPR uses an SFT-style gradient (no importance weighting), since a large ratio on a good trajectory reflects desirable policy improvement. For negative-reward trajectories, it applies truncated importance weighting to prevent over-suppression: without truncation, the gradient pushes probability mass away from bad trajectories faster than the reward signal warrants. This trajectory-level asymmetric treatment foreshadows the sign-dependent clipping that reappears in several later methods.

ASPO (Wang et al. 2025) makes the asymmetry more explicit. For positive-advantage tokens, ASPO "flips" the ratio so that the correction favors exploration of beneficial actions rather than conservative clamping. For negative-advantage tokens, it retains standard clipping. The intuition is that PPO's symmetric clip suppresses both upside and downside equally, but under off-policy drift the downside suppression matters more (it prevents collapse), while the upside suppression just wastes signal. ASPO keeps the safety guardrail on the negative side while removing it on the positive side. This matters for mismatch because off-policy drift inflates ratios in both directions, and symmetric treatment discards gradient signal that is actually informative.

Prefix-level correction. A deeper question is whether the token-level ratio is even the correct correction object. MinPRO (Lei et al. 2026) argues it is not. In autoregressive generation, the probability of token $t$ depends on all preceding tokens. The causally correct importance weight for correcting the distribution at position $t$ is therefore the prefix ratio $\prod_{s=1}^{t} \rho_s$, not the token ratio $\rho_t$ alone. But cumulative products of ratios explode with sequence length (the length trap from earlier). MinPRO proposes a non-cumulative surrogate: instead of the full product, it tracks the minimum token-level ratio over the strictly preceding prefix, then reintroduces the current token's ratio as a separate factor:

$$ \bar{\rho}_t = \min_{s < t} \rho_s, \qquad w_t^{\text{MinPRO}} = \bar{\rho}_t \cdot \rho_t $$

(with the convention $\bar{\rho}_1 \triangleq 1$ for the empty preceding prefix.)

When all per-token ratios exceed 1 (the common case under policy improvement), $\bar{\rho}_t$ underestimates the true prefix product and is therefore conservative. When some ratios are below 1, $\bar{\rho}_t$ can overestimate the shrinking product, but the overall effect is still stabilizing: it avoids the exponential blowup of the full product while preserving the causal structure that pure token-level IS ignores. This matters most in long-horizon tasks where token-level IS accumulates systematic bias across hundreds of tokens because it fails to account for the distributional shift introduced at earlier positions. A concurrent paper, CTPO (Zhang et al. 2026), formalizes the same intuition differently: it uses the full cumulative prefix ratio $\rho_{1:t} = \prod_{s\le t}\rho_s$ but clips in log-space with bounds scaling as $\sqrt{t}$, giving a position-adaptive trust region that preserves unbiasedness in expectation.
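The contrast between the prefix-min surrogate and the full prefix product is easy to see on a synthetic sequence. The sketch below implements the formula given above, including the empty-prefix convention; the helper names are hypothetical and the constant ratio of 1.1 is chosen only to make the blowup visible.

```python
def minpro_weights(rhos):
    """w_t = (min over s < t of rho_s) * rho_t, with the empty-prefix min set to 1."""
    weights, prefix_min = [], 1.0
    for t, rho in enumerate(rhos):
        bar_rho = 1.0 if t == 0 else prefix_min
        weights.append(bar_rho * rho)
        prefix_min = rho if t == 0 else min(prefix_min, rho)
    return weights

def cumulative_weights(rhos):
    """Full prefix product prod_{s <= t} rho_s, for comparison."""
    out, acc = [], 1.0
    for rho in rhos:
        acc *= rho
        out.append(acc)
    return out

rhos = [1.1] * 200
print(minpro_weights(rhos)[-1], cumulative_weights(rhos)[-1])
```

After 200 tokens the surrogate weight is about 1.21 while the cumulative product exceeds $10^8$, so the surrogate stays usable without truncation.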

Turn-level correction for agentic settings. When the "response" spans multiple dialogue turns with tool calls and environment feedback, even prefix-level correction operates at the wrong granularity. Li et al. 2025 argue that in multi-turn agentic RL, the natural correction unit is the turn, not the token: each turn is a coherent action whose importance weight should reflect the policy's probability of that entire action given the conversation history. They construct turn-level ratios and pair them with a clipping-triggered normalization scheme that re-normalizes weights within each turn when clipping activates, preventing the bias that standard PPO clipping introduces in long-horizon settings. This is important because agentic tasks are where staleness hits hardest: environment latency means rollouts are old by the time the learner processes them, and the trajectory length means token-level corrections compound their errors.

Clipping, masking, and trust-region methods

Standard PPO clips the ratio $r_t(\theta)$ to $[1-\epsilon, 1+\epsilon]$. This is a blunt instrument: it applies the same trust region to every token in every sequence for every prompt, and it only constrains a ratio proxy rather than the actual distributional divergence. The methods below refine each of these dimensions.

Second-moment trust regions. M2PO (Zheng et al. 2025) starts from an empirical observation: stale data can help early in training (the "prosperity" phase), because the diversity of off-policy samples provides useful exploration. But as the policy moves further from the behavior policy, the second moment of the log importance ratio grows, the chi-squared divergence between policies increases, and training collapses. M2PO replaces PPO's first-moment clip with a second-moment trust constraint on the log-ratio:

$$ \mathbb{E}[(\log \rho_t)^2] \leq C $$

This bounds the Pearson chi-squared divergence between the current and behavior policies (the paper shows $\chi^2(\pi_\theta \| \pi_{\text{behav}}) \leq R^2 \cdot M_2$ where $M_2 = \mathbb{E}[(\log \rho)^2]$ and $R \geq 1$ is an assumed upper bound on the ratio $\rho$; the inequality follows from $|\rho - 1| \leq \max(1, \rho)\,|\log \rho|$), providing a tighter control on distributional shift than bounding the ratio itself. In practice, M2PO tolerates larger mean ratio deviations (which carry gradient signal) while strictly controlling the tail (which carries noise). The "prosperity before collapse" framing is itself a diagnostic contribution: if you see improving metrics under stale data, do not assume stability. Watch the second moment of the log-ratio.
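The second moment is cheap to monitor, and its relation to the chi-squared divergence can be sanity-checked on a toy batch. In the sketch below the constant is taken to be $\max(1, \max_t \rho_t)$, for which the inequality holds token by token; whether that matches the paper's exact constant is an assumption of this example, and the log-ratios are made-up values.

```python
import math

def second_moment(log_rhos):
    """M2 = mean of (log rho_t)^2 over the batch."""
    return sum(x * x for x in log_rhos) / len(log_rhos)

def chi2_estimate(log_rhos):
    """Sample estimate of E[(rho - 1)^2] on tokens drawn from the behavior policy."""
    return sum((math.exp(x) - 1.0) ** 2 for x in log_rhos) / len(log_rhos)

log_rhos = [0.02, -0.01, 0.03, -0.02, 0.9]   # one heavy-tailed token
m2 = second_moment(log_rhos)
r_max = max(1.0, max(math.exp(x) for x in log_rhos))
print(m2, chi2_estimate(log_rhos), r_max ** 2 * m2)
```

A single outlier token dominates both quantities here, which is the tail behavior the second-moment constraint is meant to control.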

Sequence-validity screening. Trust Region Masking (TRM, Li et al. 2025) takes a different approach: rather than clipping individual token ratios, it evaluates whether a sequence is trustworthy as a whole using the maximum per-token KL divergence. A sequence is accepted only if $\max_t D_{\text{KL}}(c_t) \leq \delta$; otherwise it is masked entirely. This criterion is explicitly length-invariant (unlike ratio-based bounds, which accumulate with sequence length) and converts local divergence checks to a sequence-level admission test. TRM is most relevant for long-horizon tasks (code generation, multi-step reasoning, agentic trajectories) where a single high-divergence token can corrupt the gradient for the entire sequence, yet token-level clipping only addresses that one position while leaving the rest of the sequence's gradient intact. By masking entire sequences with trust-region violations, TRM trades data efficiency for gradient cleanliness.
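The admission test reduces to a max over per-token divergences. The sketch below assumes the per-token KL values have already been computed and uses a hypothetical $\delta = 0.05$; the helper names are illustrative, not taken from the paper's code.

```python
def trm_accept(per_token_kl, delta):
    """Accept a sequence only if its largest per-token KL is at most delta."""
    return max(per_token_kl) <= delta

def trm_response_mask(batch_per_token_kl, delta):
    """Zero the whole response mask of any rejected sequence."""
    return [[1 if trm_accept(kls, delta) else 0] * len(kls) for kls in batch_per_token_kl]

batch = [
    [0.001] * 2000,            # long but uniformly close: accepted
    [0.001, 0.8, 0.001],       # short with one bad token: rejected
]
print([sum(m) for m in trm_response_mask(batch, delta=0.05)])  # -> [2000, 0]
```

The 2000-token response passes because the criterion never sums over positions, while the three-token response is dropped in full because of its single outlier.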

Query-adaptive trust regions. QUATRO (Lee et al. 2026) observes that different prompts have different drift sensitivity. A math problem with a unique solution path is fragile: small policy changes can flip the answer. An open-ended creative prompt is robust: the policy can shift substantially without degrading output quality. A globally fixed $\epsilon$ treats both identically, which is either too conservative for robust prompts (wasting gradient signal) or too permissive for fragile ones (allowing collapse). QUATRO solves the trust-region-constrained problem per query, deriving a per-query Lagrange multiplier $\lambda_q^*$ from each prompt's reward-distribution variance. Queries with diffuse, high-variance rewards get larger $\lambda_q^*$ (more conservative updates); queries with concentrated rewards get smaller $\lambda_q^*$ (sharper updates). The trust-region radius itself remains a global hyperparameter; what is adapted per-query is the multiplier on the constraint. This is especially relevant under non-uniform staleness, where some prompts in a batch may be far more off-policy than others.

Industrial adoption: DeepSeek-V3.2's stack. The DeepSeek-V3.2 technical report (arXiv:2512.02556, §3.1) consolidates four explicit training-inference mismatch techniques into one production GRPO recipe. Keep Sampling Mask records the top-p / top-k truncation mask used by the inference engine during sampling and re-applies the same mask to the training engine's distribution during loss computation, so the two policies share an identical action subspace and importance weights are well-defined. It is not a probability-comparison or KL-divergence test. The companion off-policy sequence masking (Eq. 8–9) is the divergence-based gate: it computes per-sequence KL between the sampling distribution and the current policy using the inference engine's returned probabilities, and masks only negative-advantage sequences whose divergence exceeds a threshold. Using inference-returned probabilities is the load-bearing detail: the same gate captures both engine mismatch and minibatch staleness. The third technique, Keep Routing, is discussed below under MoE alignment; the fourth, an unbiased KL estimator that IS-corrects Schulman's k3, is discussed under KL estimator variants.
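The off-policy sequence masking rule can be sketched as below. This is an illustration under assumptions: the per-sequence divergence is taken to be the mean of $\log \pi_{\text{sampler}} - \log \pi_{\text{current}}$ over the sampled tokens, and the names and threshold are hypothetical. Eq. 8–9 of the report remain the authoritative definition.

```python
from typing import List


def keep_sequence(logp_sampler: List[float], logp_current: List[float],
                  advantage: float, threshold: float) -> bool:
    """Return False only for negative-advantage sequences that drifted too far."""
    # Sample-based divergence estimate: mean of log pi_sampler - log pi_current
    # over the sampled tokens (assumed form, see the lead-in).
    n = len(logp_sampler)
    divergence = sum(s - c for s, c in zip(logp_sampler, logp_current)) / n
    return not (advantage < 0 and divergence > threshold)


# Same drift, opposite advantage signs: only the negative one is masked.
drifted_sampler = [-1.0, -1.0]
drifted_current = [-3.0, -3.0]
kept_positive = keep_sequence(drifted_sampler, drifted_current, advantage=1.0, threshold=0.5)
kept_negative = keep_sequence(drifted_sampler, drifted_current, advantage=-1.0, threshold=0.5)
```

Feeding in the log-probabilities returned by the inference engine, rather than recomputed ones, is what lets a single gate see both engine mismatch and staleness.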

Off-policy-native algorithms

The methods above all share a premise: off-policyness is a problem to be corrected or constrained. A parallel line of work in early 2026 takes the opposite stance: off-policyness is the natural operating condition for large-scale LLM RL, and the objective should be designed for it from the start rather than patched after the fact.

Variational sequence-level correction. VESPO (Shen et al. 2026) derives a sequence-level reshaping kernel from a variational objective. The standard approach to sequence-level IS uses the raw product $\prod_t \rho_t$, which has correct expectation but catastrophic variance. VESPO instead optimizes a variational bound that allows the reshaping kernel to trade off a small amount of bias for a large reduction in variance. The resulting kernel keeps the semantics of sequence-level correction (matching the unit of reward) while reducing the variance enough for practical async training, with reported stability under staleness up to 64× the synchronous baseline. VESPO parameterizes the kernel so that the bias is bounded and controllable, giving practitioners an explicit bias-variance knob rather than truncation heuristics.

ESS-guided variance control. VCPO (Huang et al. 2026) operationalizes the effective sample size (ESS) as a runtime control signal rather than just a diagnostic. When ESS drops (indicating that a few samples dominate the gradient), VCPO scales down the learning rate proportionally, preventing the optimizer from amplifying noisy gradients. It also derives a minimum-variance off-policy baseline (OPOB) that weights each sample's contribution by both its squared importance weight and its per-sample gradient norm:

$$ b^*_{\text{OPOB}} = \frac{\sum_i w_i^2 \lVert g_i \rVert^2 R_i}{\sum_i w_i^2 \lVert g_i \rVert^2} $$

This accounts for how much each sample actually influences the parameter update, not just its importance weight. The combination of ESS-guided step-size scaling and variance-optimal baselines makes VCPO one of the most complete "accept and control" approaches. Its diagnostic contribution is equally important: VCPO shows that ESS collapse and gradient-norm spikes are reliable indicators of impending training failure, giving practitioners an actionable early warning signal.
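A small sketch of the two ingredients, ESS and the OPOB baseline. The baseline follows the formula above; the learning-rate rule (scaling by the normalized ESS, $\text{ESS}/N$) is an assumption made for illustration, since the text only says the scaling is proportional. All function names are hypothetical.

```python
from typing import List


def effective_sample_size(weights: List[float]) -> float:
    """ESS = (sum w)^2 / sum w^2; equals N when all weights are equal."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2


def opob_baseline(weights: List[float], grad_norms: List[float],
                  rewards: List[float]) -> float:
    """Reward average weighted by w_i^2 * ||g_i||^2, as in the OPOB formula."""
    coef = [w * w * g * g for w, g in zip(weights, grad_norms)]
    return sum(c * r for c, r in zip(coef, rewards)) / sum(coef)


def ess_scaled_lr(base_lr: float, weights: List[float]) -> float:
    """Assumed rule: shrink the step by the normalized ESS (ESS / N)."""
    return base_lr * effective_sample_size(weights) / len(weights)


uniform = [1.0, 1.0, 1.0, 1.0]
skewed = [10.0, 0.1, 0.1, 0.1]  # one sample dominates the batch
lr_uniform = ess_scaled_lr(1e-6, uniform)
lr_skewed = ess_scaled_lr(1e-6, skewed)
```

With uniform weights and equal gradient norms the baseline reduces to the plain reward mean; with skewed weights the ESS falls toward 1 and the step size shrinks with it.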

Optimal off-policy objectives. OAPL (Ritter et al. 2026) goes further and asks: if the inference policy is always lagged, what is the optimal objective for that setting? Rather than starting from an on-policy objective and adding IS corrections, OAPL derives a regression-based objective that minimizes the squared discrepancy between scaled log-probability ratios — taken against the lagged inference policy — and optimal advantage estimates:

$$ \min_\pi \sum_x \sum_{i=1}^{G} \left(\beta \ln \frac{\pi(y_i \mid x)}{\pi_{\text{inf}}(y_i \mid x)} - (r(x, y_i) - \hat{V}^*(x))\right)^2 $$

This is qualitatively different from standard policy gradient objectives: instead of weighting gradients by advantages, OAPL regresses the log-ratio toward the target advantage. The denominator $\pi_{\text{inf}}$ is the lagged inference/rollout policy (using log-probabilities returned by the rollout engine directly), not the previous-step PPO snapshot. Critically, OAPL does not require a learned critic — $\hat{V}^*(x)$ is estimated from groups of rollouts under $\pi_{\text{inf}}$, the same way GRPO computes group-relative baselines. The practical tradeoff is dependence on enough grouped samples per prompt and on periodic buffer clearing when sync occurs, not on a critic head whose accuracy degrades with lag. The paper reports stable training with policy lags above 400 gradient updates.
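The regression objective for a single prompt can be written out directly. In this sketch the value estimate `v_hat` is passed in, and the example uses the group mean reward as a stand-in; that stand-in is an assumption, and the paper's estimator of $\hat{V}^*(x)$ may differ. Sequence log-probabilities are assumed to be summed over tokens already.

```python
from typing import List


def oapl_group_loss(logp_policy: List[float], logp_inf: List[float],
                    rewards: List[float], v_hat: float, beta: float) -> float:
    """Sum over the group of (beta * log-ratio - (r - v_hat))^2."""
    loss = 0.0
    for lp, li, r in zip(logp_policy, logp_inf, rewards):
        residual = beta * (lp - li) - (r - v_hat)
        loss += residual ** 2
    return loss


rewards = [1.0, 0.0]
v_hat = sum(rewards) / len(rewards)  # stand-in value estimate (assumption)
logp_inf = [-1.0, -2.0]              # log-probs returned by the rollout engine

# Policy identical to the lagged inference policy: the residuals are the advantages.
loss_at_start = oapl_group_loss(logp_inf, logp_inf, rewards, v_hat, beta=1.0)
# Policy whose scaled log-ratios exactly match the advantages: zero loss.
loss_at_fit = oapl_group_loss([-0.5, -2.5], logp_inf, rewards, v_hat, beta=1.0)
```

The minimizer raises the log-probability of above-baseline responses and lowers it for below-baseline ones by an amount set by $\beta$, with no importance ratio appearing anywhere in the loss.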

Ratio-variance regularization. R$^2$VPO (Luo et al. 2026) replaces hard ratio constraints (clipping at a fixed threshold) with a variance-aware adaptive clip range. For each GRPO group, R$^2$VPO computes the group-level variance of the importance ratio $\rho$, then sets the clip range proportionally: high-variance groups receive tighter clips and low-variance groups receive looser ones. The motivation is that a fixed clip wastes gradient signal on groups whose ratios are concentrated and over-extends on groups whose ratios are dispersed; adapting the boundary to the local distribution recovers the wasted signal while still bounding the worst case. This is closely related to M2PO's second-moment view of the trust region but applied as an adaptive boundary rather than a global constraint.

Decoupled clipping for correct vs. incorrect responses. DISPO (Karaman et al. 2026) observes that the appropriate clip range depends on two signs jointly: the sign of the advantage (positive vs. negative) and the sign of the response label (correct vs. incorrect). The four combinations produce four distinct regimes, each with a different failure mode under symmetric clipping. DISPO uses four independently-tuned clip boundaries — one for each regime — so that, for example, the upper clip on correct responses can be wider (preserving informative positive updates) while the lower clip on incorrect responses is tighter (preventing over-suppression that collapses entropy). This is complementary to ASPO's ratio-flipping: ASPO changes the form of the ratio for positive advantages, while DISPO changes the bounds of the clip across four conditioned regimes.

Deliberate off-policy mixing via logit fusion

The methods above all treat off-policyness as an accident — a gap between the intended on-policy distribution and the actual data distribution, caused by system-level mismatch or async lag. A different line of work deliberately constructs an off-policy behavior policy by fusing a teacher model's distribution into the rollout, turning off-policyness into a feature rather than a bug.

Logit fusion as a behavior policy. Zhang et al. (2026) (code) propose blending teacher and student logits at every decoding step during rollout generation. At token position $i$, the fused logits are:

$$ \ell_{\text{mix}}(y_i \mid x, y_{<i}) = \alpha \, \ell_T(y_i \mid x, y_{<i}) + (1 - \alpha) \, \ell_S(y_i \mid x, y_{<i}) $$

where $\ell_T$ and $\ell_S$ are the teacher and student logits and $\alpha \in [0,1]$ controls the interpolation strength. The next token is then sampled from $\pi_{\text{mix}} = \text{softmax}(\ell_{\text{mix}})$. This linear interpolation in logit space is equivalent to a geometric mean of the teacher and student distributions: $\pi_{\text{mix}}(y_i) \propto \pi_T(y_i)^\alpha \cdot \pi_S(y_i)^{1-\alpha}$. The geometric mean produces sharper distributions than direct probability mixing $\alpha \pi_T + (1-\alpha)\pi_S$, which tends to over-smooth.
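The equivalence between logit interpolation and the geometric mean is easy to check numerically. The sketch below uses a three-token vocabulary with made-up logits; the helper names are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def fuse_logits(teacher: List[float], student: List[float], alpha: float) -> List[float]:
    """Linear interpolation in logit space."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def geometric_mixture(pi_t: List[float], pi_s: List[float], alpha: float) -> List[float]:
    """Normalized pi_T^alpha * pi_S^(1 - alpha)."""
    unnorm = [pt ** alpha * ps ** (1.0 - alpha) for pt, ps in zip(pi_t, pi_s)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


teacher_logits = [2.0, 0.5, -1.0]
student_logits = [0.0, 1.0, 0.5]
alpha = 0.3

pi_from_logits = softmax(fuse_logits(teacher_logits, student_logits, alpha))
pi_from_probs = geometric_mixture(softmax(teacher_logits), softmax(student_logits), alpha)
max_gap = max(abs(a - b) for a, b in zip(pi_from_logits, pi_from_probs))
```

The two distributions agree up to floating-point error because the softmax normalizers of teacher and student only contribute a constant factor, which the final normalization removes.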

The motivation is that pure on-policy RL reinforces what the student already knows but struggles on hard problems where all rollouts receive zero reward, while pure SFT from teacher data locks the model into imitative behavior and is prone to catastrophic forgetting. Logit fusion occupies a middle ground: the teacher steers rollouts toward successful trajectories on hard prompts, while the student's contribution preserves exploratory diversity.

IS correction for fused rollouts. Since $\pi_{\text{mix}} \neq \pi_S$, training on fused rollouts is off-policy by construction. The per-token IS ratio is $r_i(\theta) = \pi_S(y_i; \theta) / \pi_{{\text{mix}}_{\text{old}}}(y_i)$. Zhang et al. compare three ratio choices: (1) the on-policy ratio $\pi_S / \pi_{S_{\text{old}}}$ (low variance, high bias from ignoring the true behavior policy), (2) the mixed ratio $\pi_S / \pi_{{\text{mix}}_{\text{old}}}$ (lower bias, higher variance), and (3) a shaped ratio $f(r_i) = r_i/(r_i + \gamma)$ that amplifies low-probability tokens. The mixed ratio performs best in evaluation reward — the on-policy ratio consistently underperforms due to its bias, and the shaped ratio is unstable.

A key finding is that standard PPO-style symmetric clipping is poorly suited to fused rollouts. Because the behavior policy $\pi_{\text{mix}}$ can differ substantially from $\pi_S$, the IS ratio is not centered near 1 even when $\theta = \theta_{\text{old}}$, causing asymmetric clipping where the low side triggers far more frequently than the high side. This discards gradient signal from tokens that surprise the student (low $\pi_S$, high $\pi_{\text{mix}}$) — precisely the most informative tokens for learning from the teacher. Their solution is a simple cap on extreme ratios ($\tilde{r}_i = \min(r_i, C_{\text{cap}})$ with $C_{\text{cap}} = 3$) without a lower clip, preserving informative signal while preventing high-variance updates.
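To see the difference on a single token, the sketch below compares the standard PPO clipped surrogate with the upper-cap-only variant. It is a simplified illustration: the helper names are hypothetical, $\epsilon = 0.2$ is an assumed default, and the example token has ratio 0.3 with a negative advantage.

```python
def ppo_clip_passes_gradient(ratio: float, advantage: float, eps: float = 0.2) -> bool:
    """True if the unclipped branch of min(r*A, clip(r)*A) is the active one."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return ratio * advantage <= clipped * advantage


def capped_ratio(ratio: float, cap: float = 3.0) -> float:
    """Upper cap only: min(r, C_cap), with no lower clip."""
    return min(ratio, cap)


def cap_passes_gradient(ratio: float, cap: float = 3.0) -> bool:
    """The capped ratio still depends on r whenever r is below the cap."""
    return ratio < cap


# A token the mixture liked far more than the student does (ratio well below 1).
low_ratio = 0.3
ppo_keeps = ppo_clip_passes_gradient(low_ratio, advantage=-1.0)  # clipped away
cap_keeps = cap_passes_gradient(low_ratio)                       # still trainable
```

Only the extreme high-ratio tail is touched by the cap, so the low-ratio tokens keep contributing gradient.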

They also test decoupled PPO (separating the IS correction weight $w_i = \pi_{S_{\text{old}}} / \pi_{{\text{mix}}_{\text{old}}}$ from the trust-region ratio $r_i^{\text{on}} = \pi_S / \pi_{S_{\text{old}}}$), but find the simpler single-ratio formulation outperforms it — the additional truncation in the decomposition introduces approximation error that compounds under the substantial distribution shift created by teacher mixing.

Decaying $\alpha$ as natural regularization. Rather than a fixed $\alpha$, Zhang et al. decay the mixing coefficient linearly from $\alpha_{\text{init}}$ to 0 over $K$ training steps, with per-prompt scaling by difficulty:

$$ \alpha(x) = \alpha_{\text{init}} \cdot \max\!\big(0, 1 - \tfrac{k}{K}\big) \cdot s(d(x)) $$

where $s(d(x)) \in [0,1]$ is a normalized difficulty score for prompt $x$. Hard prompts receive more teacher guidance early in training (when the student is weakest), while the schedule ensures the rollout policy converges to the student's own distribution as training progresses. This annealing serves as a natural regularizer: it removes the need for ratio clipping entirely, since the source of off-policyness itself is being attenuated. The scheduled-$\alpha$ + no-clipping configuration outperforms all alternatives in their experiments.
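The schedule above is a one-liner. In this sketch `difficulty` stands for the already-normalized score $s(d(x))$; how that score is computed is not specified here, so it is passed in as a plain number.

```python
def alpha_schedule(step: int, total_steps: int, alpha_init: float,
                   difficulty: float) -> float:
    """alpha(x) = alpha_init * max(0, 1 - k/K) * s(d(x))."""
    decay = max(0.0, 1.0 - step / total_steps)
    return alpha_init * decay * difficulty


# A hard prompt (score 1.0) and an easier one (score 0.5) over a 100-step schedule.
hard_start = alpha_schedule(0, 100, alpha_init=0.5, difficulty=1.0)
hard_mid = alpha_schedule(50, 100, alpha_init=0.5, difficulty=1.0)
easy_start = alpha_schedule(0, 100, alpha_init=0.5, difficulty=0.5)
after_end = alpha_schedule(150, 100, alpha_init=0.5, difficulty=1.0)
```

Once `step` passes `total_steps` the coefficient is exactly zero, so the rollout policy is the student alone and the data is on-policy again up to engine mismatch.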

The practical limitation is inference cost: every decoding step requires forward passes through both teacher and student in lockstep, and the two-model decoding is incompatible with high-throughput engines like vLLM, forcing a fallback to slower HuggingFace-based generation. This throughput penalty is the direct cost of deliberately constructing the off-policy mixture rather than accepting whatever mismatch the system produces.

New: Mar-May 2026 papers

Between the v1 publication of this post and the v2 update, several papers have appeared that either fill a gap or correct an existing entry. The full enumeration covers ~25 submissions; the entries below are the ones that change how a practitioner should think about a category, rather than incremental clip-coefficient ablations.

Adaptive Layerwise Perturbation (ALP, Ye et al. 2026). Injects small learnable perturbations into each layer's input hidden state during updates, then uses the perturbed policy as the numerator of the IS ratio. The effect is to flatten sharp local distributions and reduce heavy-tailed ratios at the source, without changing the rollout side. This is a representation-level correction — neither a ratio-side fix nor a systems-side fix — and slots between the IS-design subsection and the optimizer-layer LR scheduling work.

Cumulative Token Policy Optimization (CTPO, Zhang et al. 2026). A formal alternative to MinPRO's prefix-min surrogate: CTPO uses the full cumulative prefix ratio $\rho_{1:t} = \prod_{s\le t}\rho_s$ and clips in log-space with bounds scaling as $\sqrt{t}$. Unbiased prefix correction with lower variance than full-sequence IS and a clean theoretical justification for the position-adaptive clip schedule.

QaRL / TBPO (Gu et al. 2026). The clearest production-grade entry in the quantized-rollout family: aligns the training-side forward pass with the quantized rollout policy and adds TBPO, a sequence-level trust-band objective with dual clipping for negative samples. Evaluated on Qwen3-30B-A3B MoE and surfaces a concrete long-form failure mode (repetitive "error tokens") that earlier quantized-RL papers did not name.

Diagnosing Training Inference Mismatch (VeXact, Zhong et al. 2026). Builds a zero-staleness diagnostic harness — same checkpoint, same prompt, identical inputs to rollout and training — and shows that small token-probability disagreements can independently cause RL collapse. This converts training-inference mismatch from "implementation noise" to a first-order optimization perturbation and motivates a required zero-update TIM test before enabling async or multi-epoch training.

Missing Old Logits (Guan et al. 2026). Identifies a specific failure mode of decoupled correction: in async/partial-rollout systems, the training-side old logits $\pi_{\text{old}}$ for the exact behavior-policy version may simply be unavailable, entangling the engine-mismatch and policy-staleness factors of the IS weight. Proposes three exact-recovery routes (snapshot version tracking, a dedicated old-logit model, partial-rollout interruption) and a PPO-EWMA-style approximation. The takeaway: decoupled mode's three-policy decomposition is only as clean as your old-logit recovery path.

DORA (Hu et al. 2026). Multi-version streaming rollout that removes async bubbles while enforcing intra-trajectory policy consistency. Sharpens the async-systems point that bounded staleness alone is not enough; cross-version trajectories within a single response are their own failure mode.

GAC (Xu et al. 2026). Argues that async gradients can become dynamically aligned across stale steps in ways that KL, $\chi^2$, and ESS do not detect, and proposes projecting away the stale-aligned component. The orthogonal-to-distribution diagnostic — gradient cosine across versions — is a concrete addition to the diagnostic toolkit.

Freshness-aware PER (Ma et al. 2026) and Experience Replay for LLMs (Arnal et al. 2026). Replay buffers in LLM RL had been a niche topic; both papers establish them as a legitimate design point when generation dominates cost. Arnal et al. formalize the staleness-induced variance vs. diversity vs. generation-cost tradeoff; Ma et al. show that ordinary PER fails because priorities go stale at LLM update rates and propose ESS-grounded exponential age decay.

Shadow Mask Distillation (Zhu et al. 2026). A new failure mode: KV-cache compression at rollout time creates an off-policy bias because the sampler generates under sparse context while the learner updates under dense context. Records a per-position retention mask during rollout and replays it during training rather than reconstructing missing context.

The unifying theme of the Mar–May 2026 wave is typed off-policyness: rollouts now carry version IDs, old logits, KV masks, quantizer states, prefix products, router decisions, and speculative-window indices. The implicit lesson is the one stated at the top of this section — the correction unit must match the source of mismatch.

Learning rate as a correction layer

The methods above all operate on the loss function or the importance weights. Zhang et al. (2026) argue that the optimizer itself is a critical but overlooked part of the mismatch story. As noted in the precision section, a fixed numerical discrepancy can be tolerable early in training and then suddenly become destabilizing as the optimizer enters regions with steep curvature. The same discrepancy that was benign at step 1000 can trigger collapse at step 5000.

The diagnostic signal they identify is response-length surge: a sudden increase in mean response length that precedes collapse by tens of steps. The causal chain is: mismatch introduces a small bias toward longer responses (which have more tokens to accumulate ratio errors), the optimizer amplifies this bias into a length increase, longer responses produce worse ratios, and the loop feeds back. They propose a reactive learning rate scheduler that monitors response length and decays the learning rate when a surge is detected. This does not fix the mismatch itself but prevents the optimizer from amplifying it into collapse.
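A reactive scheduler of this kind might look like the following. This is a hypothetical sketch, not the authors' implementation: the surge test (current mean length against a moving-average baseline), the window size, the surge factor, and the decay rate are all assumptions chosen for illustration.

```python
from collections import deque


class LengthSurgeLRScheduler:
    """Decay the learning rate when mean response length jumps above its recent average."""

    def __init__(self, base_lr: float, window: int = 20,
                 surge_factor: float = 1.5, decay: float = 0.5, min_lr: float = 0.0):
        self.lr = base_lr
        self.surge_factor = surge_factor
        self.decay = decay
        self.min_lr = min_lr
        self.history = deque(maxlen=window)  # recent mean response lengths

    def step(self, mean_response_length: float) -> float:
        # Only test for a surge once the window is full, to avoid early false alarms.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if mean_response_length > self.surge_factor * baseline:
                self.lr = max(self.min_lr, self.lr * self.decay)
        self.history.append(mean_response_length)
        return self.lr


sched = LengthSurgeLRScheduler(base_lr=1e-6, window=3)
lrs = [sched.step(length) for length in (100, 100, 100, 200)]
```

The first three steps only fill the window; the jump to 200 tokens exceeds 1.5 times the running mean and halves the learning rate.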

Mismatch tolerance is therefore not a fixed property of the algorithm: it depends on the optimizer state, the loss landscape, and the training phase. Learning rate scheduling adds an orthogonal safety layer that the loss-level corrections do not provide.

Advantage estimation and KL regularization

IS-aware advantage estimators

Standard advantage estimators like GAE (Schulman et al. 2016), GRPO (Shao et al. 2024), and RLOO (Ahmadian et al. 2024) assume on-policy data. When the data is off-policy, they produce biased advantage estimates because the baseline statistics (group mean, leave-one-out mean, etc.) are computed under the wrong distribution.

OTB (Optimal Token Baseline, Li et al. 2026) fixes this by computing a variance-optimal per-token baseline under importance weighting:

$$ B_t^* = \frac{\sum_i G_{i,t} W_{i,t}}{\sum_i W_{i,t}} $$

where $G_{i,t}$ is the return and $W_{i,t}$ is a cumulative path-variance proxy that captures the variance contribution of each token position. When IS weights are provided, $W_{i,t}$ is scaled by $\bar{\rho}^2(t)$ (the squared truncated IS ratio) to minimize the MSE of advantage estimates under importance weighting. The advantage is then $\hat{A}_{i,t} = G_{i,t} - B_t^*$, with the IS correction already incorporated into the baseline computation through the variance proxy. OTB is implemented in compute_optimal_token_baseline_advantage() in core_algos.py. The TIR-OTB variant (compute_multi_turn_optimal_token_baseline_advantage()) extends this to multi-turn settings where IS weights must be tracked across conversation turns.
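The baseline formula can be sketched without any framework. This is not verl's `compute_optimal_token_baseline_advantage()`: it assumes equal-length responses, takes the proxy $W_{i,t}$ as a given input (already scaled by the squared IS ratio if one is used), and uses hypothetical function names.

```python
from typing import List


def token_baselines(returns: List[List[float]], weights: List[List[float]]) -> List[float]:
    """B_t = sum_i G[i][t] * W[i][t] / sum_i W[i][t], one baseline per position."""
    n_tokens = len(returns[0])
    baselines = []
    for t in range(n_tokens):
        num = sum(g[t] * w[t] for g, w in zip(returns, weights))
        den = sum(w[t] for w in weights)
        baselines.append(num / den)
    return baselines


def token_advantages(returns: List[List[float]], baselines: List[float]) -> List[List[float]]:
    """A[i][t] = G[i][t] - B_t."""
    return [[g_t - b_t for g_t, b_t in zip(g, baselines)] for g in returns]


# Two responses, two token positions; the second position weights response 0 more.
G = [[1.0, 1.0], [0.0, 0.0]]
W = [[1.0, 3.0], [1.0, 1.0]]
B = token_baselines(G, W)
A = token_advantages(G, B)
```

At the first position the weights are equal and the baseline is the plain mean; at the second the higher-weight response pulls the baseline toward its own return.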

verl provides 14 advantage estimators in the ADV_ESTIMATOR_REGISTRY (as of verl-project/verl@70744059): gae, grpo, grpo_vectorized, grpo_passk, rloo, rloo_vectorized, reinforce_plus_plus, reinforce_plus_plus_baseline, remax, opo, gpg, gdpo, optimal_token_baseline, and tir_optimal_token_baseline. Of these, only OTB and TIR-OTB are IS-aware.

KL estimator variants

KL regularization appears both as a loss penalty and as a reward shaping term. The choice of KL estimator matters because different estimators have different bias-variance tradeoffs and, more subtly, different gradient properties. Six variants are available:

| Estimator | Formula | Expectation bias | Gradient bias | Variance |
| --- | --- | --- | --- | --- |
| k1 | $\mathbb{E}_q[-\log\rho]$ | Unbiased | Biased | High |
| abs | $\mathbb{E}_q[\lvert\log\rho\rvert]$ | Biased | Biased | Low |
| k2 | $\mathbb{E}_q[\frac{1}{2}(\log \rho)^2]$ | Biased | Unbiased | Low |
| k3 | $\mathbb{E}_q[\rho - \log\rho - 1]$ | Unbiased | Biased | Low |
| k1+ | Straight-through corrected k1 | Unbiased | Unbiased | High |
| k3+ | Straight-through corrected k3 | Unbiased | Unbiased | Low |

The correct estimator depends on where the KL term appears:

  • KL in the loss, naive on-policy: Use k2. Its gradient produces $(\log\rho) \cdot \nabla_\theta \log\pi_\theta$, which is the correct REINFORCE gradient of the KL.
  • KL in the loss, IS-weighted: Use k3. Under importance weighting, k3's gradient yields the exact KL gradient.
  • KL in the reward, with stop-gradient: Use k1. The stop-gradient means the score function picks up k1 as a multiplicative weight.

See Schulman (2020) for the full analysis.
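The expectation-bias column can be checked on a small discrete example. The sketch below computes the exact expectations under $q$ with $\rho = p/q$ (the orientation is assumed from the formulas in the table) and compares them with $\text{KL}(q \,\|\, p)$; the distributions are made up.

```python
import math
from typing import List, Tuple


def true_kl(q: List[float], p: List[float]) -> float:
    """KL(q || p) for distributions with full, shared support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))


def expected_estimators(q: List[float], p: List[float]) -> Tuple[float, float, float]:
    """Exact expectations under q of the k1, k2, and k3 per-sample estimators."""
    k1 = k2 = k3 = 0.0
    for qi, pi in zip(q, p):
        rho = pi / qi
        log_rho = math.log(rho)
        k1 += qi * (-log_rho)
        k2 += qi * 0.5 * log_rho ** 2
        k3 += qi * (rho - log_rho - 1.0)  # every term is non-negative
    return k1, k2, k3


q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
kl = true_kl(q, p)
k1, k2, k3 = expected_estimators(q, p)
```

k1 and k3 match the true KL in expectation because $\mathbb{E}_q[\rho] = 1$, while k2 is close but not equal. The per-sample k3 term is never negative, which is where its lower variance relative to k1 comes from.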

The k1+ and k3+ variants use the straight-through trick to combine the forward-pass value of one estimator with the backward-pass gradient of another:

# Straight-through: backward gradient from `backward_score`,
# forward value from `forward_score`
kl = backward_score - backward_score.detach() + forward_score.detach()

You can, for example, use k3's low-variance forward value for loss computation while retaining k2's unbiased gradient for optimization.

KL regularization appears in two independent paths in verl: as a reward shaping term (subtracted from per-token rewards before advantage estimation, controlled by use_kl_in_reward) and as a loss penalty (added directly to the policy loss, controlled by use_kl_loss).

Note: The k1+ and k3+ variants are broken on verl-project/verl@70744059. kl_penalty() calls kl_penalty_forward(logprob, ref_logprob, kl_penalty) with the raw suffixed string ("k1+"/"k3+"), and kl_penalty_forward() does not strip the +, so the call falls through every branch and raises NotImplementedError rather than returning a silent None. The fix is a one-line kl_penalty.rstrip("+") before dispatch. Use plain k1 or k3 until this lands upstream.

A related production datapoint: DeepSeek-V3.2 (§3.1, Eq. 7) corrects the same k3 estimator differently — by IS-reweighting against the rollout policy. They observe that k3 is only unbiased under sampling from the current policy; under any off-policy data their corrected estimator is what makes the KL gradient unbiased. The verl straight-through trick is one way to achieve a similar correction; DeepSeek's explicit IS correction is another.

KL controllers

Two strategies for setting the KL penalty coefficient $\beta$:

Adaptive controller (Ziegler et al. 2019). Adjusts $\beta$ based on the ratio of observed KL to a target:

$$ \beta \leftarrow \beta \cdot \left(1 + \text{clip}\left(\frac{\text{KL}}{\text{target}} - 1, \; -0.2, \; 0.2\right) \cdot \frac{n}{H}\right) $$

where $n$ is the number of samples and $H$ is the horizon. $\beta$ goes up when KL exceeds the target, down when it falls below.
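The update rule translates directly into code. This is a minimal sketch of the formula above with hypothetical class and argument names, not a copy of any framework's controller.

```python
class AdaptiveKLController:
    """Proportional controller for the KL coefficient beta."""

    def __init__(self, init_beta: float, target_kl: float, horizon: int):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl: float, n_samples: int) -> float:
        # Relative error, clipped to [-0.2, 0.2] so one bad batch cannot swing beta.
        error = min(max(current_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.beta *= 1.0 + error * n_samples / self.horizon
        return self.beta


ctrl = AdaptiveKLController(init_beta=0.1, target_kl=0.01, horizon=1000)
beta_up = ctrl.update(current_kl=0.02, n_samples=100)     # KL above target
beta_down = ctrl.update(current_kl=0.005, n_samples=100)  # KL below target
```

With 100 samples against a horizon of 1000, a single update can move $\beta$ by at most 2% in either direction.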

Fixed controller. Sets $\beta = \text{const}$. Simpler to tune and avoids instabilities from the adaptive feedback loop, but needs manual adjustment if training dynamics change.

Systems-level solutions

The previous sections treated mismatch as a statistical problem: measure the divergence, then correct with IS weights or RS masks. But the divergence itself has engineering causes (different kernels, different precisions, different parallelism strategies) and engineering solutions exist that attack the root cause rather than compensating after the fact. The recurring tradeoff: the more aggressively you eliminate mismatch at the systems level, the more you constrain throughput, hardware flexibility, and parallelism. Exact restoration is possible but expensive. Approximate methods are cheaper but need algorithmic correction on top. Production systems land somewhere in between.

Making rollout and training equivalent

The ideal fix is to make the rollout engine and the training engine produce identical outputs for the same weights and inputs, eliminating the first factor of the importance weight decomposition entirely. Five projects attacked this from different angles in 2025.

Bitwise consistency via kernel auditing. Wasti et al. (2025) audited every kernel invocation in the forward pass to achieve bitwise equivalence between TorchTitan (training) and vLLM (inference). In their demo setup, this achieves zero KL between rollout and training log-probabilities, with the mismatch factor literally 1.0 everywhere. The caveat is cost. The bitwise-consistent configuration is slower and more constrained than standard production deployments: you lose the freedom to use different TP sizes, different batch schedulers, or different fused kernels between rollout and training.

Batch-invariant inference. He et al. (2025) from Thinking Machines Lab reframed the problem: most "nondeterminism" in LLM inference is not random GPU concurrency but deterministic dependence on batch composition. When you change the batch size, the reduction order changes, floating-point non-associativity kicks in, and you get different logits for the same input. The fix is batch-invariant kernels that guarantee the same output regardless of what else is in the batch. Batch invariance is the cheapest individual fix: you constrain one dimension (reduction order) rather than the entire engine stack.
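The reduction-order effect can be reproduced in plain Python, with no GPU involved. The sketch below sums the same four numbers with one accumulator and with two partial sums; the values are chosen to make the rounding visible and the function name is illustrative.

```python
from typing import List


def sharded_sum(values: List[float], n_shards: int) -> float:
    """Sum each contiguous shard left to right, then add the partial sums."""
    size = len(values) // n_shards
    partials = []
    for k in range(n_shards):
        acc = 0.0
        for v in values[k * size:(k + 1) * size]:
            acc += v
        partials.append(acc)
    total = 0.0
    for part in partials:
        total += part
    return total


values = [1e16, 1.0, -1e16, 1.0]      # exact sum is 2.0
one_pass = sharded_sum(values, 1)     # single accumulator
two_shards = sharded_sum(values, 2)   # two partial sums, then combine
```

Neither result equals the exact sum, and they also disagree with each other: the same inputs and the same arithmetic give a different answer purely because the grouping changed. A batch-invariant kernel pins the grouping so that this cannot depend on what else is in the batch.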

FP16 alignment. Qi et al. (2025) make a direct claim: BF16 is the main culprit, and moving both sides to FP16 can nearly eliminate the gap. FP16 carries 10 mantissa bits against BF16's 7, so accumulation differences between engines produce smaller rounding errors, small enough that the resulting log-probability differences fall below training-relevant thresholds. This is the cleanest "one-line systems fix": change the dtype and the mismatch largely vanishes. The constraint is FP16's narrower dynamic range (5 exponent bits against BF16's 8), which requires loss scaling and careful attention to overflow.
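The precision gap can be seen by rounding a single log-probability to each format. The sketch below uses only the standard library: `struct`'s half-precision format for FP16, and a bit-level emulation of BF16 (keep the top 16 bits of the float32 encoding, rounding to nearest even). It shows rounding granularity for one value, not end-to-end engine mismatch.

```python
import struct


def round_fp16(x: float) -> float:
    """Round to IEEE half precision (10 mantissa bits) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_bf16(x: float) -> float:
    """Emulate bfloat16 (7 mantissa bits) by rounding the float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


logp = -2.3456789  # a made-up token log-probability
fp16_value = round_fp16(logp)
bf16_value = round_bf16(logp)
fp16_error = abs(fp16_value - logp)
bf16_error = abs(bf16_value - logp)
```

For values inside FP16's normal range, every BF16-representable number is also FP16-representable, so the FP16 rounding error can never exceed the BF16 one. Here the gap is nearly two orders of magnitude.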

TP-invariant kernels (TBIK). Zhang et al. (2025) targeted the specific case of different tensor-parallel sizes between rollout and training. Their Tree-Based Invariant Kernels fix the reduction tree structure so that partial sums are always computed in the same order regardless of how many GPUs participate. Integrated into both vLLM and FSDP, TBIK eliminates the TP-induced component of mismatch entirely, with modest throughput cost (single-digit percent).
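The idea of fixing the reduction tree can be illustrated with a toy pairwise sum. This is not the TBIK kernel code; it is a sketch that assumes power-of-two lengths and shard counts, with hypothetical function names, to show why a fixed tree gives the same bits for any shard count.

```python
from typing import List


def tree_sum(values: List[float]) -> float:
    """Pairwise reduction with a fixed binary-tree shape (power-of-two length)."""
    level = list(values)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def sharded_tree_sum(values: List[float], n_shards: int) -> float:
    """Each shard reduces its aligned subtree; the partials are reduced the same way."""
    size = len(values) // n_shards
    partials = [tree_sum(values[k * size:(k + 1) * size]) for k in range(n_shards)]
    return tree_sum(partials)


values = [1e16, 1.0, -1e16, 1.0, 0.1, 0.2, 0.3, 0.4]
results = [sharded_tree_sum(values, n) for n in (1, 2, 4, 8)]
```

Every shard count performs exactly the same additions in the same pairing, so the results are bitwise identical even on inputs where a plain left-to-right sharded sum would disagree.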

Deterministic inference in SGLang. The SGLang team (2025) operationalized deterministic inference in another major serving stack, covering chunked prefill, CUDA graphs, radix cache, and non-greedy sampling. Their claim is that reproducible RL training can be recovered with modest overhead rather than catastrophic slowdown.

These projects establish a spectrum: at one end, full bitwise consistency (zero mismatch, maximum constraint); at the other, targeted fixes for the worst offenders. Most production systems pick a point in the middle: fix what is cheap to fix, then use algorithmic correction for the residual.

MoE-specific alignment

As discussed in the root causes section, MoE routing mismatch is qualitatively different from dense-model mismatch because expert routing can change which function is computed, producing discontinuous output changes that generic IS cannot cleanly correct.

Routing Replay (R3). Ma et al. (2025) proposed recording the routing distributions used during inference and replaying them during training. This aligns the computational pathway, not just the output probabilities. The cost is storage (the routing distributions must be saved and transmitted alongside rollout data) and some loss of training-time router adaptivity.

IcePop. Ling Team (2025) (project page) took a lighter-weight approach: mask token positions whose engine-mismatch probability ratio $k = \pi_{\text{train}}/\pi_{\text{infer}}$ falls outside a calibrated band $[\alpha, \beta]$, and use $M(k)=k$ inside the band as the gradient calibration coefficient. Routing flips are one cause of the ratio tail (especially in MoE) but not the criterion — the mask catches any source of train/inference probability discrepancy and only triggers on ~1–2‰ of tokens at the default $[0.5, 5.0]$ band, removing corrupted gradient signal without the overhead of full routing replay.
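The mask-and-calibrate rule fits in a few lines. This sketch uses the default band quoted above; the function and argument names are illustrative, and the band endpoints are called `low` and `high` here to avoid clashing with other uses of $\alpha$ and $\beta$ in this post.

```python
import math


def icepop_coefficient(logp_train: float, logp_infer: float,
                       low: float = 0.5, high: float = 5.0) -> float:
    """Return k = pi_train / pi_infer inside [low, high], and 0 (masked) outside."""
    k = math.exp(logp_train - logp_infer)
    return k if low <= k <= high else 0.0


# Engines agree: the coefficient is 1 and the token is untouched.
agree = icepop_coefficient(math.log(0.5), math.log(0.5))
# Training engine assigns far less probability than the sampler did: masked.
masked_low = icepop_coefficient(math.log(0.1), math.log(0.9))
# Moderate disagreement inside the band: kept and reweighted by k.
reweighted = icepop_coefficient(math.log(0.6), math.log(0.3))
```

Inside the band the token's gradient is scaled by $k$; outside it the token contributes nothing.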

Keep Routing (DeepSeek-V3.2). DeepSeek-V3.2 took routing alignment to production scale, preserving inference-time routing paths and enforcing them during training. Shipping this in a production model confirms that MoE routing mismatch was a real scaling bottleneck.

Industrial adoption. By late 2025, these techniques were deployed at scale. The Ring-1T report (2025) uses IcePop's per-token probability-ratio mask in trillion-scale MoE RL. The CompassMax-V3-Thinking report (Zeng et al. 2025, arXiv:2512.07710) consolidates four mechanisms:

  • Router Replay, which records vLLM router decisions per token and replays them in Megatron log-prob recomputation; the paper reports that the resulting train/infer log-prob discrepancy drops from order $10^{-3}$ to $10^{-4}$.
  • ESPO (Entropy Importance Sampling Policy Optimization), an entropy-grouped sequence ratio with stop-gradient and entropy-adaptive clipping that occupies the middle ground between token-level and sequence-level IS.
  • Multi-Stage Zero-Variance Elimination: filter trivial-pass prompts, expand exploration size $N$, apply length/repetition penalties, and use RL-ZVP-style advantage reshaping for residual zero-variance groups.
  • GenRM with ternary labels, a generative reward model trained to compare against a reference and emit better/similar/worse, preventing advantage-sign flipping near the group mean.

These are layered with FP8 rollouts, length-aware load balancing, and overlapped reward computation. Routing alignment for MoE is not optional, and at this scale no single mechanism suffices: stability comes from the layered combination.

Quantized rollout stacks

Quantized rollout is the most explicit form of the mismatch problem: you are deliberately using a cheaper representation for inference than for training. The question is not whether this creates off-policy data — it does, by construction — but whether the throughput gain is worth the correction cost.

FlashRL (Yao et al. 2025) established the framing: 8-bit rollouts should be treated as an off-policy transformation that requires explicit correction, not as an approximation that can be ignored. FP8-RL (Qiu et al. 2026) provided a complete production stack — W8A8 inference with FP8 KV cache, paired with IS correction in verl — and showed that low-precision RL can match BF16 baselines when correction is designed in from day one. Jet-RL (Xi et al. 2026) argued for a unified precision flow: either both sides use FP8 or both use BF16, because mixed precision between the two sides of the RL loop is a hidden source of drift. QuRL (Li et al. 2026) treated quantized rollout as a permanent architectural feature and designed adaptive clipping ranges based on the measured ratio between full-precision and quantized actors.

The quantization literature converges on a single lesson: cheap rollouts are off-policy rollouts, and the cost of correction must be budgeted alongside the throughput gain.

Speculative decoding staleness

Speculative decoding introduces a subtler form of mismatch. In standard speculative decoding, a small "drafter" model proposes candidate tokens and a larger "verifier" model accepts or rejects them. In an RL training loop, both models are being updated — but at different rates and on different schedules. ReSpec (Chen et al. 2025) identified the resulting problem: the drafter can become stale relative to the actor as training proceeds. This means the acceptance rate drops, throughput degrades, and the effective sampling distribution shifts because the drafter's staleness biases which tokens get proposed and verified. This expands the mismatch literature from "one policy versus another" to "multiple coupled policies inside the serving algorithm," each with its own staleness trajectory.

The systems-algorithm interface

The systems and algorithmic approaches to mismatch are complements, not alternatives. Systems solutions reduce the magnitude of the mismatch. Algorithmic solutions manage whatever mismatch remains. Full bitwise consistency eliminates the need for algorithmic correction of the mismatch factor, but it constrains throughput and hardware flexibility. Ignoring systems alignment entirely and relying purely on IS/RS works only when the mismatch is mild — severe mismatch produces IS weights with such high variance that the effective sample size collapses.

The likely equilibrium is hybrid: keep system discrepancy small and measurable (FP16 alignment, batch-invariant kernels, TP-invariant reductions where feasible), then use lightweight algorithmic correction (geometric-mean RS, token-level TIS with modest thresholds) for the residual. For MoE models, add routing replay or masking as a separate layer. For quantized rollouts, budget the IS correction cost alongside the throughput gain. The goal is not zero mismatch — it is controlled, measured, and budgeted mismatch with correction that is cheap enough to be worth applying.

Async training — speed vs staleness

One-step off-policy

The simplest form of async training overlaps the generation of batch $N$ with the training on batch $N-1$:

Step N:   [---- train(N-1) ----] [sync]
          [---- gen(N) ---------------]
Step N+1: [---- train(N) ------] [sync]
          [---- gen(N+1) -------------]

Weight synchronization uses NCCL broadcast, which completes in under 300ms for 7B-scale models. The rollout weights lag by exactly one training step, mild enough that bypass mode handles it. This gives a 1.2–1.4x speedup over synchronous training (1.24x on FSDP2 + vLLM, up to 1.40x on Megatron + vLLM).

The configuration enables bypass mode and keeps the vLLM cache engine alive between steps:

rollout:
  free_cache_engine: false
  checkpoint_engine:
    backend: "nccl"
# Python API: select bypass mode preset
rollout_corr_config = RolloutCorrectionConfig.bypass_ppo_clip()  # or .bypass_pg_geo_rs() for RS

The implementation lives in verl/experimental/one_step_off_policy/.

Fully async pipeline

For maximum throughput, verl supports a fully decoupled pipeline where the rollout worker and trainer are independent Ray actors communicating via a message queue. Each runs on its own clock: the rollout worker continuously generates samples, the trainer continuously consumes them.

Four operating modes, controlled by three parameters:

| Mode | trigger_parameter_sync_step | staleness_threshold | partial_rollout | Behavior |
|---|---|---|---|---|
| On-policy | 1 | 0 | | Sync after every batch; equivalent to synchronous |
| Stream off-policy | >1 | 0 | | Multiple local updates before sync |
| Async stale | $\geq$1 | >0 | false | Allows stale samples; waits for active rollouts to finish |
| Async partial | $\geq$1 | >0 | true | Interrupts active rollouts on sync (best throughput) |
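The mapping from parameter values to modes can be sketched as a small helper. `pipeline_mode` is a hypothetical function written for this post, not part of verl; it only encodes the rows of the table.

```python
def pipeline_mode(trigger_parameter_sync_step, staleness_threshold, partial_rollout=False):
    """Return the operating mode implied by the three parameters (illustrative only)."""
    if trigger_parameter_sync_step < 1:
        raise ValueError("trigger_parameter_sync_step must be >= 1")
    if staleness_threshold < 0:
        raise ValueError("staleness_threshold must be >= 0")

    if staleness_threshold == 0:
        # No stale samples allowed: the sync interval decides the mode.
        return "on-policy" if trigger_parameter_sync_step == 1 else "stream off-policy"
    # Stale samples allowed: partial_rollout decides whether active rollouts are interrupted.
    return "async partial" if partial_rollout else "async stale"

print(pipeline_mode(1, 0))
print(pipeline_mode(4, 0.5, partial_rollout=True))
```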

"Async partial" gets the highest throughput by interrupting in-progress rollouts when a weight sync triggers, rather than waiting for them to finish. Interrupted rollouts are saved and resumed after synchronization, so no generated samples are discarded — the pipeline never stalls and no compute is wasted.

Performance benchmarks on Qwen2.5-Math-7B with H20 GPUs:

| Setup | Sync time | Async time | Speedup |
|---|---|---|---|
| 128 GPU (64+64) | 1d 17h | 17h 22m | 2.35x |
| 64 GPU (32+32) | 1d 18h | 21h 40m | 1.92x |
| 32 GPU (16+16) | 3d 17h | 1d 9h | 2.66x |

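The largest speedup (2.66x) appears on the smallest configuration, 32 GPUs, which is consistent with the synchronous pipeline having the most idle time there. The trend is not monotonic in GPU count, though: 64 GPUs reach 1.92x and 128 GPUs reach 2.35x. Staleness is tracked via dedicated metrics (fully_async/count/stale_samples_processed, rollouter/idle_ratio, etc.) and bounded by staleness_threshold. Implementation is in verl/experimental/fully_async_policy/.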

The wider async RL landscape

verl's one-step overlap and fully async pipeline sit within a fast-moving ecosystem of async RL systems for LLMs. The design space has crystallized around three competing philosophies: tolerate staleness with algorithmic correction, reduce staleness through better systems design, or eliminate the need for asynchrony altogether by making synchronous rollout fast enough.

Tolerating staleness. The intellectual starting point is Noukhovitch et al. (2024), which showed that some RLHF objectives remain surprisingly robust under moderate asynchrony. Online DPO proved especially tolerant of stale data, establishing a key principle: the right objective can buy you headroom against policy lag that no amount of system engineering can. AReaL (Fu et al. 2025) built on this by making staleness a first-class system metric rather than a nuisance variable, pairing a staleness-aware PPO variant with worker-level load balancing that tracks how old each sample is. LlamaRL (Wu et al. 2025) arrived at roughly the same time but emphasized infrastructure over algorithms, providing distributed weight synchronization and off-policy-capable training loops.

The late-2025 wave pushed further. ROLL Flash (Lu et al. 2025) introduced an "asynchronous ratio" constraining the policy version gap per sample — an explicit staleness guardrail baked into the system scheduler. A-3PO (Li et al. 2025) tackled a specific algorithmic bottleneck: the proximal policy anchor normally requires a full forward pass, which is expensive for large models. A-3PO approximates it through log-probability interpolation weighted by staleness, eliminating a forward pass while preserving the stabilizing effect. ECHO-2 (Xiao et al. 2026) represents the clearest sign that stale-data management has matured into an exposed engineering knob: it treats bounded policy staleness as a user-controlled parameter with provisioning rules that let operators trade staleness for cost efficiency.

Reducing staleness through better async topology. AsyncFlow (Han et al. 2025) uses streaming queues and overlapped transport so that training can begin consuming partial batches before the full rollout completes, reducing effective lag. DistFlow (Wang et al. 2025) replaces the single-controller architecture with a fully distributed multi-controller paradigm, eliminating the central-node bottleneck that forces some workers to wait (and become stale) while others are busy. Laminar (Sheng et al. 2025) introduced a relay-worker design with a distributed parameter service allowing rollout workers to pull latest weights at any time without stalling the learner, shifting from monolithic sync points to a fluid topology where staleness is minimized per-worker rather than bounded globally.

Eliminating the need for asynchrony. Seer (Qin et al. 2025) focuses on making synchronous rollout fast enough that you do not need to tolerate policy lag at all, using divided rollout, context-aware scheduling, and adaptive grouped speculative decoding to cut long-tail latency by 72–94% and deliver up to 2.04× end-to-end rollout throughput. Periodic Asynchrony (Lu 2025) takes a middle path: it decouples inference and training for throughput gains but synchronizes weights after each full batch, preserving on-policy semantics while recovering most of the throughput benefits of async — over 3× on 8B models.

These camps are not mutually exclusive. Production systems increasingly combine elements from each: fast synchronous rollout to minimize baseline lag, async overlap where synchronous execution leaves GPU cycles on the table, and staleness-aware objectives as a safety net for the residual policy gap.

Failure modes and practical guidance

Failure mode catalog

The length trap. Sequence-level IS weights are products of per-token ratios. Even a small per-token bias compounds exponentially with length. A per-token ratio of $\rho_t = 1.1$ (10% discrepancy, which is mild):

  • 10-token sequence: $\prod_t \rho_t = 1.1^{10} \approx 2.6$ — within typical truncation thresholds, kept
  • 50-token sequence: $\prod_t \rho_t = 1.1^{50} \approx 117$ — truncated or rejected
  • 100-token sequence: $\prod_t \rho_t = 1.1^{100} \approx 13{,}780$ — massively rejected

Sequence-level IS and RS disproportionately penalize long sequences. In CoT and agent tasks where correct responses tend to be longer, this biases toward shorter (often wrong) outputs. The fix is geometric-mean RS (seq_mean_k1 or seq_mean_k3), which normalizes by length:

$$ d_{\text{geo}} = \frac{1}{T}\sum_{t=1}^{T} d(t) $$

This makes the rejection criterion independent of sequence length.
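The arithmetic above can be reproduced in a few lines. This sketch assumes $d(t)$ is the per-token log ratio (a k1-style statistic); the helper names are hypothetical and chosen only for illustration.

```python
import math

def seq_log_ratio(token_log_ratios):
    """Sequence-level log IS weight: the sum of per-token log ratios."""
    return sum(token_log_ratios)

def geo_mean_log_ratio(token_log_ratios):
    """Length-normalized statistic: the mean per-token log ratio."""
    return sum(token_log_ratios) / len(token_log_ratios)

per_token = math.log(1.1)  # every token is off by the same 10%
for T in (10, 50, 100):
    tokens = [per_token] * T
    seq_w = math.exp(seq_log_ratio(tokens))       # grows exponentially with T
    geo_w = math.exp(geo_mean_log_ratio(tokens))  # independent of T
    print(f"T={T:3d}  sequence weight={seq_w:10.1f}  geometric mean={geo_w:.3f}")
```

The sequence weight grows with length while the geometric mean stays at the per-token ratio, so a threshold on the geometric mean treats a 10-token and a 100-token response identically.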

Toxic tails. Under severe mismatch (e.g., $\chi^2_{\text{token}} > 2.0$), the samples with the highest IS weights are often not the most informative. They are artifacts of the ratio computation: tokens where the rollout engine happened to assign much lower probability than the training engine for numerical rather than semantic reasons. Upweighting them amplifies noise, not signal.

In this regime, filtering (RS) is safer than reweighting (IS). Sequence-level RS (masking to exclude outliers) removes the toxic tail entirely, while sequence-level TIS (clipping) still lets corrupted samples influence the gradient, just with reduced weight. When mismatch is severe, prefer RS over IS, or combine both: RS to remove the worst outliers, IS to correct the remaining mild shift.
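The difference between clipping and masking is easy to see on a toy batch. The sketch below uses hypothetical helper names and thresholds (2.0 for truncation, a [0.5, 2.0] acceptance band); it illustrates the two operations, not verl's code path.

```python
def truncated_is_weights(ratios, threshold=2.0):
    """TIS: clip each weight from above; every sample still contributes."""
    return [min(r, threshold) for r in ratios]

def rejection_mask(ratios, lower=0.5, upper=2.0):
    """RS: keep a sample only if its ratio lies inside [lower, upper]."""
    return [1.0 if lower <= r <= upper else 0.0 for r in ratios]

# The last entry mimics a numerical-artifact outlier in the toxic tail.
ratios = [0.9, 1.05, 1.2, 40.0]

tis = truncated_is_weights(ratios)             # outlier kept at weight 2.0
mask = rejection_mask(ratios)                  # outlier removed entirely
combined = [w * m for w, m in zip(tis, mask)]  # RS removes outliers, IS reweights survivors

print("TIS only:", tis)
print("RS mask :", mask)
print("RS + TIS:", combined)
```

Under TIS the outlier still carries the largest weight in the batch; under RS it contributes nothing.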

Practical recommendations

  1. Start with metrics-only mode. Run with RolloutCorrectionConfig.disabled() for the first few hundred steps. Measure the baseline gap: kl, chi2_token, ppl_ratio, and Pearson correlation. If kl < 0.05 and chi2_token < 0.3, you may not need any correction.

  2. Enable RS if seeing outliers. If chi2_token exceeds 0.3 or you observe occasional training spikes, enable geometric-mean RS (bypass_ppo_clip_geo_rs or decoupled_geo_rs). This removes the worst outliers without changing the gradient computation for the remaining samples.

  3. Add IS when comfortable with metrics. If RS alone is insufficient (masked fraction stays high, or KL remains elevated after filtering), add IS weights. Token TIS is the safer default; sequence TIS is more principled but higher variance.

  4. Use bypass for small mismatch, decoupled for significant staleness. If your pipeline is synchronous and the only mismatch source is the precision gap, bypass mode saves a forward pass without losing much accuracy. If you are running async or multi-epoch training where policy staleness is the dominant factor, decoupled mode gives you the three-policy decomposition needed for precise correction.

  5. Watch for the length trap. If you are training on tasks with variable-length outputs (CoT, agents, code generation), always use geometric-mean aggregation for RS. Sequence-sum aggregation will silently filter out your longest — and often best — responses.

Diagnosing and fixing off-policy drift: a decision tree

The catalog of metrics and the catalog of fixes are connected, but the connection has been left to the reader. The tree below makes it explicit. Each branch starts from an observed metric, names the most likely root cause, and prescribes the smallest fix that addresses it. The thresholds are operational defaults, not laws — calibrate them on a metrics-only run for your exact rollout engine, training engine, precision, TP degree, model family, and reward mix.

Step 1: classify the symptom

  • Start with rollout/actor agreement.
    • training/rollout_actor_probs_pearson_corr ≥ 0.99, ppl_ratio ≈ 1.0, kl < 0.02 → no meaningful drift. This is fine — continue training.
    • training/rollout_actor_probs_pearson_corr < 0.95, or rollout_probs_diff_max is large, or ppl_ratio is far from 1.0 while rollout and training claim to use the same weights → Cause A: precision/kernel/engine mismatch.
  • If agreement is sane, inspect drift magnitude.
    • kl > 0.05 on the same checkpoint → Cause A; algorithmic correction will hide, not remove, this bug.
    • kl rises only after async lag, replay, or multi-epoch reuse → Cause B: policy staleness.
  • If drift exists, inspect correction variance.
    • chi2_token > 0.3 but ess ≥ 0.5 → Cause C: moderate token drift.
    • chi2_token > 1.0, chi2_seq explodes, or ess < 0.5 → Cause D: IS variance blow-up.
    • proxy3_pure_noise rises after enabling IS → Cause D (IS is making gradients worse).
  • If loss mechanics are the visible problem:
    • pg_clipfrac > 0.2 and rising → Cause E: trust region too tight or drift too high.
    • Response length surges by >20% within 100 steps → Cause F: optimizer amplification.
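The branches above can be sketched as a single function that returns every cause consistent with the logged metrics. The dictionary keys are hypothetical names chosen for this example, and only branches with an explicit numeric threshold are encoded; qualitative checks such as "ppl_ratio far from 1.0" or "and rising" are left to the operator.

```python
def classify_symptoms(m):
    """Map a metrics dict to a list of candidate causes ("A" to "F").

    Keys are hypothetical names for illustration; thresholds are the
    operational defaults from the tree, not calibrated values.
    """
    pearson = m.get("pearson", 1.0)
    kl = m.get("kl", 0.0)
    chi2_token = m.get("chi2_token", 0.0)
    ess = m.get("ess", 1.0)

    causes = []
    # Poor agreement, or drift on the same checkpoint, points at the engine.
    if pearson < 0.95 or (m.get("same_checkpoint", False) and kl > 0.05):
        causes.append("A")
    elif m.get("kl_rises_with_lag", False):
        causes.append("B")

    # Correction variance: check the blow-up case before the moderate case.
    if ess < 0.5 or chi2_token > 1.0 or m.get("noise_rising_after_is", False):
        causes.append("D")
    elif chi2_token > 0.3:
        causes.append("C")

    # Loss mechanics.
    if m.get("pg_clipfrac", 0.0) > 0.2:
        causes.append("E")
    if m.get("length_growth_100_steps", 0.0) > 0.20:
        causes.append("F")
    return causes

print(classify_symptoms({"pearson": 0.99, "kl_rises_with_lag": True,
                         "chi2_token": 0.5, "ess": 0.7}))
```

Returning a list rather than a single label reflects that staleness and variance problems often co-occur; an empty list corresponds to the "no meaningful drift" branch.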

Step 2: classify the root cause and apply the fix

Cause A — Precision / kernel / engine mismatch. Diagnostic confirmation: low Pearson, ppl_ratio away from 1.0, high rollout_probs_diff_max, or kl > 0.05 even on the same checkpoint. Fix order: systems first, correction second, escalation third. Align rollout/train precision (FP16 alignment per Qi et al. 2025 is the single cheapest fix), then TP degree, then kernels (batch-invariant, TBIK, or deterministic SGLang). For MoE, add a routing-specific layer: discrepancy masking via decoupled_token_icepop is the cheapest first try, with routing replay (R3 or DeepSeek's Keep Routing) for production. Keep RolloutCorrectionConfig.disabled() while bisecting:

algorithm:
  rollout_correction: { rollout_is: null, rollout_rs: null }
actor:
  model: { dtype: float16 }
rollout: { dtype: float16 }

If Pearson stays below 0.95 after dtype + TP alignment, rebuild the rollout/training engine path. Algorithmic correction cannot rescue this.

Cause B — Policy staleness. Confirmation: agreement metrics are clean on same-checkpoint tests, but kl, chi2_token, and chi2_seq grow with queue age or epoch count. Fix order: reduce staleness, then correct the residual. Use decoupled_token_is for mild lag, decoupled_geo_rs_seq_tis for queue lag or long responses:

rollout_corr_config = RolloutCorrectionConfig.decoupled_geo_rs_seq_tis(
    is_threshold=2.0, rs_threshold="0.99_1.01"
)

If ess ≥ 0.5 and reward is stable, continue. If ess < 0.5, escalate to Cause D.

Cause C — Moderate token drift. Confirmation: chi2_token > 0.3, ess ≥ 0.5, chi2_seq not exploding. Use geometric-mean RS — it strips outliers without distorting the gradient on retained samples:

rollout_corr_config = RolloutCorrectionConfig.decoupled_geo_rs(rs_threshold="0.99_1.01")

If chi2_token falls below 0.3, continue training.

Cause D — IS variance blow-up. Confirmation: ess < 0.5, chi2_token > 1.0, or proxy3_pure_noise rises after enabling IS. Filter outliers before reweighting them. If a small token set dominates, use IcePop-style band masking:

rollout_corr_config = RolloutCorrectionConfig.decoupled_token_icepop(
    threshold_lower=0.5, threshold=5.0
)

If IcePop's mask rate exceeds 10–25%, the algorithmic correction has hit its limit. Reduce staleness, sync more often, or fix the precision/routing gap before retrying.
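As a rough illustration of band masking and the mask-rate check, the sketch below keeps the ratio as the weight inside the band and zeroes it outside, following the IcePop description in the Ring-1T report. The helper names and the toy ratios are assumptions; the bounds mirror the 0.5 and 5.0 values in the preset call above.

```python
def icepop_weights(token_ratios, lower=0.5, upper=5.0):
    """Band masking: the ratio is the weight inside [lower, upper], zero outside."""
    return [r if lower <= r <= upper else 0.0 for r in token_ratios]

def mask_rate(token_ratios, lower=0.5, upper=5.0):
    """Fraction of tokens that fall outside the band and are masked."""
    masked = sum(1 for r in token_ratios if not (lower <= r <= upper))
    return masked / len(token_ratios)

# Hypothetical per-token train/rollout probability ratios.
ratios = [1.0, 0.98, 1.03, 0.2, 7.5, 1.1, 0.95, 1.0, 1.02, 0.99]

weights = icepop_weights(ratios)
rate = mask_rate(ratios)
print("weights  :", weights)
print(f"mask rate: {rate:.0%}")
if rate > 0.10:
    print("mask rate above 10%: look at staleness, sync cadence, or the precision/routing gap")
```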

Cause E — PPO clip saturation. Confirmation: pg_clipfrac > 0.2, advantage signal looks clipped flat. Reduce update pressure, then change objective. Drop ppo_epochs to 1, halve the LR, or switch from token PPO-clip to a length-invariant geometric objective (GSPO/GMPO). In verl: bypass_pg_geo_rs_token_tis is the closest preset to "tight RS + REINFORCE-style loss."

Cause F — Optimizer amplification / response-length surge. Confirmation: response length rises >20% in 100 steps, usually with rising kl and pg_clipfrac. Per Zhang et al. 2026, this precedes collapse by tens of steps. Reactive fix:

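# Reactive guard: on a >20% length surge, halve the LR and tighten correction.
if mean_response_len / len_100_steps_ago > 1.20:
    for group in optimizer.param_groups:  # every param group, not only the first
        group["lr"] *= 0.5
    rollout_corr_config = RolloutCorrectionConfig.bypass_pg_geo_rs()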

If length keeps surging after LR decay, the problem is upstream (reward hacking, EOS handling, length bonus). Audit the reward function — no amount of off-policy correction will fix a broken reward.

When to escalate from RS → IS → systems fix

A rule of thumb that has held up in practice (no formal proof; calibrate on your stack):

  • RS only: works while chi2_token < 2.0 and RS masked fraction stays under ~10%.
  • RS + token TIS: needed when chi2_token is in (2.0, 4.0] or RS alone reaches 10–25% mask rate.
  • Systems fix: needed when RS would discard >25% of the batch, when ESS stays below 0.3 even after RS, or when Pearson correlation is below 0.95. At this point no algorithmic correction is cheaper than fixing the underlying divergence.
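The rule of thumb can be written down as a small function. `escalation_level` is a hypothetical helper; the text does not say what to do when chi2_token exceeds 4.0, so the sketch treats that case as needing a systems fix, which is an assumption.

```python
def escalation_level(chi2_token, rs_mask_frac, ess_after_rs, pearson):
    """Pick the lightest intervention consistent with the rule of thumb (illustrative)."""
    # Systems fix: correction would discard too much or the engines disagree.
    if pearson < 0.95 or rs_mask_frac > 0.25 or ess_after_rs < 0.3:
        return "systems fix"
    # Assumption: chi2_token above the (2.0, 4.0] band is also treated as a systems problem.
    if chi2_token > 4.0:
        return "systems fix"
    # RS + token TIS: elevated token drift or a 10-25% mask rate.
    if chi2_token >= 2.0 or rs_mask_frac >= 0.10:
        return "rs + token tis"
    return "rs only"

print(escalation_level(chi2_token=1.0, rs_mask_frac=0.05, ess_after_rs=0.8, pearson=0.99))
print(escalation_level(chi2_token=3.0, rs_mask_frac=0.05, ess_after_rs=0.8, pearson=0.99))
print(escalation_level(chi2_token=1.0, rs_mask_frac=0.30, ess_after_rs=0.8, pearson=0.99))
```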

Anti-pattern sidebar: five common diagnostic mistakes

  1. Treating kl as "policy learning" before checking same-weight rollout/train agreement.
  2. Using sequence TIS because it sounds more principled, then ignoring ess < 0.5 and rising proxy3_pure_noise.
  3. Widening PPO clip when pg_clipfrac > 0.2. Reduce drift or update pressure first.
  4. Debugging a response-length surge as a reward issue before applying the LR-decay guard.
  5. Averaging away training/rollout_probs_diff_max. The max spike is often the engine, routing, or precision bug that ends up killing the run.

Open problems and future directions

The previous version of this section listed seven directions in prose. After several rounds of revision the five most generative have been sharpened into falsifiable conjectures. Each has a candidate experiment or theorem that would settle it.

Problem 1: The sequence-correct low-variance frontier

Token-level IS optimizes a biased surrogate of the sequence-level reward; sequence-level IS is unbiased but its variance grows exponentially with horizon. The SGA framing makes this precise: TV controls bias of token surrogates, $\chi^2(\pi_\theta \| \mu)$ controls variance of unbiased estimators, and these do not trade off monotonically. Methods that occupy the middle (prefix, turn-level, geometric-mean, variational reshaping) are heuristically motivated; the shape of the frontier is unknown.

Conjecture. For long-horizon LLM RL with bounded reward, horizon $T \ge 200$, and nontrivial off-policy mismatch, the optimal SGA progress proxy

$$P(\hat g) = \frac{\langle \nabla J, \mathbb{E}[\hat g] \rangle^2}{\operatorname{tr}\operatorname{Var}(\hat g) + \lVert \operatorname{Bias}(\hat g) \rVert^2}$$

is maximized by an intermediate-aggregation estimator (prefix, geometric-mean, turn-level, or variational), not by pure token-level or pure sequence-level IS.

Falsification. Run a controlled length sweep over $T \in \{64,128,256,512,1024\}$ at fixed model, prompts, reward, optimizer, and behavior-policy lag. Estimate the oracle $\nabla J$ from large on-policy batches and plot $(\lVert\operatorname{Bias}\rVert, \operatorname{Var})$ for each estimator family. The conjecture fails if either extreme is optimal for $P$ across $T \ge 200$.
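For readers who want to run the sweep, the proxy $P(\hat g)$ can be estimated from repeated gradient estimates once an oracle gradient is available. The sketch below is pure Python on small vectors and assumes the oracle $\nabla J$ is given; `progress_proxy` is a hypothetical name, and a real experiment would use flattened parameter gradients.

```python
def progress_proxy(true_grad, grad_samples):
    """Estimate <grad J, E[g]>^2 / (tr Var(g) + ||Bias(g)||^2) from samples of g."""
    n, d = len(grad_samples), len(true_grad)
    if n < 2:
        raise ValueError("need at least two gradient samples to estimate variance")
    mean = [sum(g[i] for g in grad_samples) / n for i in range(d)]
    inner = sum(t * m for t, m in zip(true_grad, mean))
    bias_sq = sum((m - t) ** 2 for m, t in zip(mean, true_grad))
    # Trace of the covariance: sum of per-coordinate unbiased sample variances.
    tr_var = sum(
        sum((g[i] - mean[i]) ** 2 for g in grad_samples) / (n - 1) for i in range(d)
    )
    return inner ** 2 / (tr_var + bias_sq)

oracle = [1.0, 0.0]
unbiased_noisy = [[1.0, 1.0], [1.0, -1.0]]   # correct mean, high variance
biased_quiet = [[0.5, 0.0], [0.5, 0.1]]      # low variance, biased toward zero

print(progress_proxy(oracle, unbiased_noisy))
print(progress_proxy(oracle, biased_quiet))
```

The two toy estimators show the two failure directions the conjecture is about: one pays in variance, the other in bias, and the proxy puts both on a single scale.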

Problem 2: A minimum sufficient collapse diagnostic

No accepted online diagnostic tells you when an off-policy LLM RL run is about to collapse. Papers report KL, gradient spikes, ESS, clip fraction, length drift, router disagreement, or precision gaps, but these conventions are framework-specific and often measured after failure.

Conjecture. Let collapse mean a $\ge 30\%$ reward drop that does not recover within 100 updates, or a gradient-norm spike $>10\times$ the rolling median followed by reward degradation. A rolling feature set

$$z_t = \{\widehat{\chi^2}_{\text{tok},95},\ \text{ESS}_{\text{seq}}/n,\ \text{pg\_clipfrac},\ \bar{T}_{\text{response}}\}$$

is sufficient to predict collapse within the next 200 optimizer updates with AUROC $\ge 0.90$ across 7B–100B dense and MoE runs.

Falsification. Build a retrospective and prospective benchmark of logged RL runs — successful, collapsed, and manually stopped. The conjecture fails if a calibrated predictor using only $z_t$ cannot reach AUROC 0.90, or if adding stack-specific signals improves AUROC by $>0.03$.
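Labeling logged runs for such a benchmark requires an operational version of the collapse definition. The sketch below is one possible reading: it measures the reward drop relative to the reward at a chosen step and uses a trailing window for the rolling median. Both choices, the window length, and the function names are assumptions.

```python
from statistics import median

def grad_spike_steps(grad_norms, window=50, factor=10.0):
    """Steps whose gradient norm exceeds `factor` times the trailing-window median."""
    spikes = []
    for t in range(window, len(grad_norms)):
        if grad_norms[t] > factor * median(grad_norms[t - window:t]):
            spikes.append(t)
    return spikes

def unrecovered_drop(rewards, t, drop=0.30, horizon=100):
    """True if reward stays at least `drop` below rewards[t] for the next `horizon` updates."""
    future = rewards[t + 1:t + 1 + horizon]
    if len(future) < horizon:
        return False  # not enough logged steps to decide
    return all(r <= (1.0 - drop) * rewards[t] for r in future)

# Toy logs: a single gradient spike at step 55, and a reward that falls from 1.0 to 0.6.
norms = [1.0] * 60
norms[55] = 11.0
rewards = [1.0] * 10 + [0.6] * 100

print(grad_spike_steps(norms))
print(unrecovered_drop(rewards, t=9))
```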

Problem 3: The MoE routing discreteness penalty

Dense-model mismatch can be approximated as smooth logit drift. MoE mismatch is different: top-$k$ routing introduces discrete expert-selection changes, so tiny numerical differences can flip the computation graph. Router replay (R3, Keep Routing) and discrepancy masking (IcePop) reduce observed instability, but no scaling law connects route-flip rate to gradient bias.

Conjecture. For a sparse top-$k$ MoE with $K$ experts, let $\delta = \frac{1}{K}\sum_{i=1}^K \Pr[m_i^{\text{roll}}(s_t) \ne m_i^{\text{train}}(s_t)]$ be the per-expert router flip rate. Holding dense-backbone token KL fixed, token-level IS has an additional router-induced gradient bias

$$\lVert \operatorname{Bias}_{\text{MoE}} \rVert = \Theta(T K \delta)$$

until saturation at $O(1)$.

Falsification. In a controlled MoE testbed, vary $K$ and inject router perturbations to force $\delta$ while holding token KL and reward distribution fixed. Compare token-IS gradients against an on-policy or router-replayed oracle. The conjecture fails if bias is sublinear in $K\delta$, or if token $\chi^2$/KL alone predicts the oracle gap.
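The flip rate $\delta$ can be estimated directly from logged routing masks, replacing each probability with its empirical frequency over tokens. The sketch assumes both engines log a 0/1 selection mask per token and per expert for a single MoE layer; the function name and the toy masks are hypothetical.

```python
def router_flip_rate(roll_masks, train_masks):
    """Empirical delta: mean over experts of the fraction of tokens whose selection differs.

    roll_masks[t][i] is 1 if expert i was selected for token t in the rollout engine,
    and train_masks[t][i] is the same quantity from the training engine.
    """
    num_tokens = len(roll_masks)
    num_experts = len(roll_masks[0])
    flips = sum(
        r[i] != q[i]
        for r, q in zip(roll_masks, train_masks)
        for i in range(num_experts)
    )
    return flips / (num_tokens * num_experts)

# Two tokens, K = 4 experts, top-2 routing; the second token swaps one expert.
roll = [[1, 1, 0, 0], [1, 0, 1, 0]]
train = [[1, 1, 0, 0], [1, 0, 0, 1]]
print(router_flip_rate(roll, train))
```

A single swapped expert under top-$k$ routing counts as two flips (one expert dropped, one added), which is worth keeping in mind when comparing $\delta$ across values of $k$.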

Problem 4: A mismatch budget for system repair vs. algorithmic correction

Production stacks mix two instincts: make engines identical (FP16/FP8 alignment, bitwise consistency, deterministic kernels, router replay) and correct mismatch in the loss (TIS, AIS, OAPL). No theorem or benchmark says where the crossover lies, and quantized rollout makes the question sharper: low precision accelerates generation while creating a fresh mismatch channel.

Conjecture. There exists a practical mismatch budget $\epsilon^\star$ such that if

$$D^{95}_{TV,\text{tok}} < 10^{-4},\quad \chi^2_{\text{seq}} < 0.1,\quad \delta_{\text{router}} < 10^{-3},$$

then adding IS, rejection, masking, or adaptive correction improves final benchmark quality by $<0.3\%$ at fixed rollout compute.

Falsification. Factorial comparison: system-only repair, algorithm-only correction, both, and neither, across dense, MoE, BF16, FP16, FP8, and INT8 rollout stacks. The conjecture fails if algorithmic correction gives $\ge 0.3\%$ gain after the measured budget is satisfied.

Problem 5: A law for asynchronous lag tolerance

Async RL increases throughput by letting rollouts lag the learner. The unsolved question is not whether some lag works — OAPL shows lags of 400+ updates can be stable — but how much lag is safe as a function of a measurable mismatch quantity.

Conjecture. For off-policy-native KL-regularized objectives that use the lagged rollout policy $\mu$ as an explicit reference, stability is governed by sequence ESS:

$$L_{\max} \le c \cdot \text{ESS}_{\text{seq}}^\alpha,$$

where $L_{\max}$ is optimizer-step lag, $c \approx 50$, and $\alpha \approx 1$ for current 4B–70B reasoning and coding models.

Falsification. Sweep $L \in \{1, 10, 50, 100, 200, 400, 800\}$ at fixed compute and batch size, measuring sequence ESS on a held-out probe batch at every sync. The conjecture fails if stable and unstable regimes are not separated by any polynomial ESS law, or if $\alpha$ differs by more than $0.5$ across task families.

Cross-cutting note

Recent papers in the Mar–May 2026 wave sharpen what the correction unit should be: not just token, prefix, sequence, or turn, but also policy version (DORA), old-logit snapshot (Missing-Old-Logits), KV-cache mask (Shadow Mask Distillation), quantizer state (QaRL), router decision (Keep Routing, ESPO), and speculative-drafting cycle (ReSpec). The progression is from "make on-policy work" toward making off-policyness typed, versioned, and locally correctable. Whichever of the five conjectures above is settled first, the answer will likely be specific to one of those units rather than to a universal estimator.

References

[1] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

[2] Li, Y. & Liu, J. (2025). RL Collapse blog series. Part 1, Part 2, Part 3.

[3] Yao, F., et al. (2025). Your Efficient RL Framework Secretly Brings You Off-Policy RL Training. Notion.

[4] Qi, P., et al. (2026). Rethinking the Trust Region in LLM Reinforcement Learning. arXiv:2602.04879.

[5] MiniMax (2025). MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv:2506.13585.

[6] Zheng, C., et al. (2025). GSPO: Group Sequence Policy Optimization. arXiv:2507.18071.

[7] Gao, C., et al. (2025). SAPO: Soft Adaptive Policy Optimization. arXiv:2511.20347.

[8] Zhao, Y., et al. (2025). GMPO: Geometric Mean Policy Optimization. arXiv:2507.20673.

[9] Schulman, J. (2020). Approximating KL Divergence. Blog post.

[10] Ziegler, D. M., et al. (2019). Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593.

[11] Shao, Z., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300.

[12] kalomaze (2026). Don't Exclude Rollouts. Blog post.

[13] verl: An open-source framework for LLM RL. GitHub.

[14] Li, Y., et al. (2026). The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL. arXiv:2602.07078.

[15] Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.

[16] Ahmadian, A., et al. (2024). Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. arXiv:2402.14740.

[17] Zheng, C., et al. (2025). Stabilizing Reinforcement Learning with LLMs: Formulation and Practices. arXiv:2512.01374.

[18] Li, Y., Liu, J., et al. (2025). When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. Notion.

[19] He, H., et al. (2025). Defeating Nondeterminism in LLM Inference. Thinking Machines Lab.

[20] Yuan, J., et al. (2025). Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference. arXiv:2506.09501.

[21] Zhang, Z., et al. (2025). Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch. arXiv:2511.17826.

[22] Ma, W., et al. (2025). Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers (R3). arXiv:2510.11370.

[23] DeepSeek (2025). DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. arXiv:2512.02556.

[24] Qiu, Z., et al. (2026). FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning. arXiv:2601.18150.

[25] Xi, H., et al. (2026). Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow. arXiv:2601.14243.

[26] Li, Y., et al. (2026). QuRL: Efficient Reinforcement Learning with Quantized Rollout. arXiv:2602.13953.

[27] Le Roux, N., et al. (2025). Tapered Off-Policy REINFORCE (TOPR). arXiv:2503.14286.

[28] Wang, J., et al. (2025). When Importance Sampling Misallocates Credit: Asymmetric Ratios for Outcome-Supervised RL (ASPO). arXiv:2510.06062.

[29] Lei, S., et al. (2026). A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization (MinPRO). arXiv:2601.22718.

[30] Zheng, H., et al. (2025). Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs? (M2PO). arXiv:2510.01161.

[31] Li, Y., et al. (2025). Trust Region Masking for Long-Horizon LLM Reinforcement Learning (TRM). arXiv:2512.23075.

[32] Lee, D., et al. (2026). Query-Adaptive Trust Region Policy Optimization (QUATRO). arXiv:2602.04620.

[33] Shen, G., et al. (2026). Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training (VESPO). arXiv:2602.10693.

[34] Huang, L., et al. (2026). Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs (VCPO). arXiv:2602.17616.

[35] Ritter, D., et al. (2026). LLMs Can Learn to Reason Via Off-Policy Reinforcement Learning (OAPL). arXiv:2602.19362.

[36] Luo, Y., et al. (2026). Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning (R²VPO). arXiv:2601.03320.

[37] Karaman, B. K., et al. (2026). DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning. arXiv:2602.00983.

[38] Zhang, Y., et al. (2026). Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It. arXiv:2602.01826.

[39] Noukhovitch, M., et al. (2024, revised 2025). Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models. arXiv:2410.18252. ICLR 2025.

[40] Fu, W., et al. (2025). AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. arXiv:2505.24298.

[41] Wu, B., et al. (2025). LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training. arXiv:2505.24034.

[42] Chen, Q., et al. (2025). ReSpec: Towards Optimizing Speculative Decoding in RL Systems. arXiv:2510.26475.

[43] Li, C., et al. (2025). Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level IS. arXiv:2511.20718.

[44] Wasti, B., et al. (2025). No More Train-Inference Mismatch: Bitwise Consistent On-Policy RL with vLLM and TorchTitan. vLLM Blog.

[45] SGLang Team (2025). Towards Deterministic Inference in SGLang and Reproducible RL Training. LMSYS Blog.

[46] Yao, F., Liu, L., et al. (2025). FlashRL: 8Bit Rollouts, Full Power RL. GitHub.

[47] Han, Z., et al. (2025). AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training. arXiv:2507.01663.

[48] Wang, Z., et al. (2025). DistFlow: A Fully Distributed RL Framework for Scalable LLM Post-Training. arXiv:2507.13833.

[49] Sheng, G., et al. (2025). Laminar: A Scalable Asynchronous RL Post-Training Framework. arXiv:2510.12633.

[50] Lu, H., et al. (2025). Part II: ROLL Flash — Accelerating RLVR and Agentic Training with Asynchrony. arXiv:2510.11345.

[51] Li, X., et al. (2025). A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation. arXiv:2512.06547.

[52] Xiao, J., et al. (2026). Cost-Efficient Large Language Model Serving for Multi-turn RL Framework (ECHO-2). arXiv:2602.02192.

[53] Qin, R., et al. (2025). Seer: Online Context Learning for Fast Synchronous LLM RL. arXiv:2511.14617.

[54] Lu, J. (2025). Periodic Asynchrony: An On-Policy Approach for Accelerating LLM Reinforcement Learning. arXiv:2511.18871.

[55] Ling Team / Ring-1T (2025). Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model. arXiv:2510.18855. (IcePop is introduced in §2.3.2 of this report.)

[56] Zhang, J., Hans, A., Kirchenbauer, J., Goldblum, M., Panda, A., & Goldstein, T. (2026). Learning from Mixed Rollouts: Logit Fusion as a Bridge Between Imitation and Exploration. Notion.

[57] Qi, P., et al. (2025). Defeating the Training-Inference Mismatch via FP16. arXiv:2510.26788.

[58] Huang, W., et al. (2025). QeRL: Beyond Efficiency — Quantization-enhanced Reinforcement Learning for LLMs. arXiv:2510.11696.

[59] CompassMax Team (2025). Each Prompt Matters: Scaling Reinforcement Learning Without Wasting Rollouts on Hundred-Billion-Scale MoE (CompassMax-V3-Thinking). arXiv:2512.07710.

[60] Cui, G., et al. (2025). The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (clip-cov / KL-cov). arXiv:2505.22617.

[61] Ye, C., et al. (2026). Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL (ALP). arXiv:2603.19470.

[62] Zhang, Y., et al. (2026). Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective (CTPO). arXiv:2605.07331.

[63] Gu, H., et al. (2026). QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training–Inference Mismatch. arXiv:2604.07853.

[64] Zhong, T., et al. (2026). Diagnosing Training Inference Mismatch in LLM Reinforcement Learning (VeXact). arXiv:2605.14220.

[65] Guan, Z., et al. (2026). Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction. arXiv:2605.12070.

[66] Hu, T., et al. (2026). DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training. arXiv:2604.26256.

[67] Xu, H., et al. (2026). GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control. arXiv:2603.01501.

[68] Ma, W., et al. (2026). Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning. arXiv:2604.16918.

[69] Arnal, C., Cabannes, V., Cohen, T., Kempe, J., & Munos, R. (2026). Efficient RL Training for LLMs with Experience Replay. arXiv:2604.08706.

[70] Zhu, et al. (2026). How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment. arXiv:2605.06850.

BibTeX

@article{liu2026offpolicydrift,
  title   = {Off-Policy Drift in LLM RL},
  author  = {Liu, Chris Yuhao},
  year    = {2026},
  month   = {March},
  url     = {https://chrisliu298.ai/posts/off-policy-drift-in-llm-rl/},
  note    = {Blog post; v2 updated May 2026}
}