Note: This post is a work-in-progress and may contain mistakes or miss recent work. If you spot an error or know of something I should cover, please send me an email. If you find this post useful, please cite it using the BibTeX at the end. This post was written with assistance from Claude Code, Codex, and GPT-5.2 Pro.

Modern LLM reinforcement learning is secretly off-policy (Yao et al. 2025). Even frameworks advertised as "on-policy" have a gap between the policy that sampled the data and the policy that computes gradients on it. The causes are many: separate engines for inference and training running at different precisions, different attention kernels, different tensor-parallel configurations, MoE routing decisions that flip between forward passes, async pipelines where the policy moves before stale rollouts are consumed, and multi-epoch training over the same batch. Each source independently shifts the log-probabilities that the gradient depends on, and they compound. The result is corrupted gradients that can collapse training.

This post walks through how mismatch arises, how to detect it, how to correct it with importance sampling and rejection sampling, and how policy loss functions interact with staleness. The treatment follows the rollout correction framework in verl, with references to the relevant source files throughout.

Why LLM RL training is secretly off-policy

The two sources of staleness

In a standard LLM RL pipeline, the full importance weight between the current training policy $\pi_\theta$ and the rollout (behavior) policy $\mu_{\theta_{\text{old}}}$ decomposes into two factors (Zheng et al. 2025):

$$ \frac{\pi_\theta(y_t \mid x, y_{\lt t})}{\mu_{\theta_{\text{old}}}(y_t \mid x, y_{\lt t})} = \underbrace{\frac{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{\lt t})}{\mu_{\theta_{\text{old}}}(y_t \mid x, y_{\lt t})}}_{\text{training-inference mismatch}} \times \underbrace{\frac{\pi_\theta(y_t \mid x, y_{\lt t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{\lt t})}}_{\text{policy staleness}} $$
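The decomposition is a telescoping product; a few lines of Python make the identity concrete (the probabilities below are made-up illustrative values, not measurements from any engine):

```python
import math

# Hypothetical per-token probabilities at one position (illustrative values only)
mu_old = 0.40    # rollout engine, old weights (behavior policy)
pi_old = 0.41    # training engine, same old weights
pi_cur = 0.45    # training engine, current weights

mismatch = pi_old / mu_old    # training-inference mismatch (same weights, different engines)
staleness = pi_cur / pi_old   # policy staleness (same engine, weights have moved)

# The two factors telescope into the full importance weight
assert math.isclose(mismatch * staleness, pi_cur / mu_old)
```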

The first factor is training-inference mismatch: even when the weights are identical ($\theta = \theta_{\text{old}}$), the rollout engine $\mu$ and the training engine $\pi$ produce different log-probabilities, due to precision differences (FP8 vs BF16), different attention backends (FlashAttention vs PagedAttention), and different parallelism strategies. This factor should be 1 everywhere. In practice it is not.

The second factor is policy staleness: the weights have moved since the rollout was generated. This happens in multi-epoch PPO (the policy updates between epochs but trains on the same rollout data), in one-step async pipelines (generation of batch $N$ overlaps with training on batch $N-1$), and in fully async pipelines (multiple training steps may occur before fresh rollouts arrive).

Both factors corrupt the gradient. Policy gradient methods assume the data was sampled from the current policy. When that assumption breaks, the gradient points in the wrong direction, and the error compounds across training steps. This is the mechanism behind RL collapse (Li & Liu 2025; Yao et al. 2025).

Root causes in detail

The two-factor decomposition above identifies two abstract sources of drift. In practice, six concrete root causes produce this drift, and understanding each determines whether the right fix is a systems change, an algorithmic correction, or both.

Floating-point precision: BF16 vs FP16 vs FP8

BF16 is the default training dtype for most modern LLM stacks, and for good reason: its 8-bit exponent matches FP32's dynamic range, which simplifies mixed-precision recipes. But BF16 has only 7 bits of mantissa (versus FP16's 10), and that coarser rounding is the single largest contributor to training-inference mismatch in many production pipelines. When the inference engine and the training engine both operate in BF16 but use different fused kernels, different reduction orders, or different intermediate accumulation widths, the rounding errors do not cancel — they compound through the softmax into systematically different token probabilities.

Qi et al. (2025) make the sharpest version of this claim: switching both the rollout engine and the training engine to FP16 can nearly eliminate the mismatch with minimal code changes. FP16's extra 3 mantissa bits are enough to make the residual rounding error negligible relative to other noise sources. The cost is a narrower dynamic range (5-bit exponent), which requires careful loss scaling — but that is a solved problem in most frameworks.
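The mantissa gap is easy to see directly. A minimal sketch in pure Python, simulating BF16 by truncating an FP32 bit pattern to its top 16 bits (a simplification of real round-to-nearest conversion) and FP16 via the struct module's half-precision format:

```python
import struct

def to_bf16(x: float) -> float:
    """Keep the top 16 bits of the FP32 pattern (sign, 8-bit exponent, 7-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision (10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

logit = 1.239  # an arbitrary logit value
bf16_err = abs(to_bf16(logit) - logit)
fp16_err = abs(to_fp16(logit) - logit)

# Three extra mantissa bits buy an 8x finer rounding grid near any given magnitude
assert bf16_err > fp16_err
```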

At even lower precisions the picture is worse. FP8 and INT8 quantized rollouts are attractive for throughput (roughly 2x over BF16 on modern accelerators), but they introduce a structured distribution shift: the quantized actor is not just a noisy version of the full-precision policy but a systematically different one. FP8-RL (Qiu et al. 2026) and Jet-RL (Xi et al. 2026) both demonstrate that a mixed BF16-train / FP8-rollout stack is not truly on-policy, and that the resulting mismatch must either be corrected with importance sampling or eliminated by unifying precision across both sides. QuRL (Li et al. 2026) goes further, framing the quantized actor as a permanently distinct behavior policy and designing adaptive clipping ranges around the full-precision/quantized ratio.

Zhang et al. (2026) add a nuance: the failure is not purely about arithmetic precision. A fixed numerical discrepancy between rollout and training can be tolerable early in training and then suddenly become destabilizing later, because the optimizer moves into a region of parameter space where that discrepancy aligns with a steep loss landscape. In other words, mismatch is coupled to optimization state. This is why purely static precision alignment (switching to FP16 once at the start) works most of the time, but occasionally fails in ways that a reactive intervention — like learning-rate scheduling triggered by response-length surges — can catch.

Kernel and framework divergence

Even with identical dtypes, rollout engines and training engines typically use different software paths: different fused attention kernels, different reduction orders in layernorm and softmax, different batch-scheduling policies, and different memory layouts. These differences are invisible at the API level (both engines accept the same weights and produce "the same" outputs) but visible at the floating-point level, where they produce systematically different log-probabilities.

He et al. (2025) argue that what practitioners call "nondeterminism" in LLM inference is mostly a misnomer. The real culprit is batch dependence and floating-point non-associativity: the same prompt produces different logits depending on what other prompts share its batch, because fused kernels reduce across the batch dimension in an order that varies with batch composition. This is deterministic given the batch but not batch-invariant, and since RL training and RL rollout almost never use the same batching strategy, the two engines diverge even when everything else matches.
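The underlying float non-associativity takes only a few lines to demonstrate: the same terms reduced in a different order give a different sum, which is what a change in batch composition does to a fused kernel's reduction order. A toy float64 example:

```python
# The same four terms, reduced in two different orders
terms = [1e16, 1.0, -1e16, 1.0]

left_to_right = 0.0
for t in terms:
    left_to_right += t          # (1e16 + 1.0) rounds back to 1e16

reordered = (terms[0] + terms[2]) + (terms[1] + terms[3])  # cancel the big terms first

print(left_to_right, reordered)  # 1.0 vs 2.0: same terms, different sums
assert left_to_right != reordered
```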

Yuan et al. (2025) provide a broader catalog of numerical nondeterminism sources in LLM inference, including thread-scheduling in CUDA reductions, non-deterministic cuBLAS GEMM algorithms, and radix-cache-induced prefill chunking differences. While their focus is inference rather than RL, the root causes they identify are exactly the ones that create rollout-training mismatch: every source of numerical variation in the inference engine is a source of off-policy drift in the RL pipeline.

The systems community's response was a sequence of increasingly strict equivalence projects. The vLLM + TorchTitan bitwise consistency article (2025) audited every kernel invocation in the forward pass until rollout and training produced bitwise-identical outputs. The SGLang deterministic inference post (2025) achieved reproducible inference across requests by fixing chunked prefill, CUDA graph replay, and radix cache behavior. Both demonstrate that eliminating kernel divergence is possible but expensive: the constrained kernels are slower, and the restrictions (fixed batch sizes, no dynamic batching, deterministic CUDA graphs) limit the serving optimizations that make rollout fast in the first place.

For practitioners, the implication is pragmatic: full kernel-level equivalence is achievable but usually not worth the throughput cost. The more common approach is to accept some kernel divergence and correct it algorithmically — which requires measuring the divergence first (see the diagnostic toolkit below) and then applying IS or RS correction proportional to the observed gap.

Parallelism strategy mismatch

A particularly common and often overlooked divergence source is that the rollout engine and the training engine use different parallelism strategies. The rollout engine typically uses multi-GPU tensor parallelism (TP) for low-latency generation, while the training engine runs with TP=1 under Fully Sharded Data Parallel (FSDP) or a different TP degree under Megatron. Even with identical kernels and identical dtypes, different TP configurations produce different outputs because the all-reduce operations that merge partial results across GPUs use floating-point addition, which is not associative.

Concretely, a TP=4 rollout partitions each matrix multiply across 4 GPUs, computes partial results, and reduces them with an all-reduce. The training engine at TP=1 computes the same matrix multiply on a single GPU with a different reduction tree. The two results differ in the low bits of the mantissa, and these differences propagate through layernorm, softmax, and the autoregressive chain to produce measurably different token probabilities. The gap is typically smaller than the BF16/FP16 precision gap but large enough to matter over long sequences.

Zhang et al. (2025) isolate this problem and propose Tree-Based Invariant Kernels (TBIK): custom reduction kernels whose output is bitwise identical regardless of TP size. TBIK achieves this by fixing a canonical reduction tree (independent of the number of GPUs) and padding to ensure identical accumulation order. Integrated into both vLLM and FSDP, TBIK eliminates the TP-induced component of mismatch entirely. The throughput cost is modest (single-digit percent) because only the reduction kernels are constrained, not the matrix multiplies themselves.
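The invariant-tree idea can be sketched in a few lines. This is a toy illustration of the concept, not TBIK's actual kernel: a naive sharded sum changes with the shard count, while one fixed pairwise tree over the full term list cannot.

```python
def naive_sharded_sum(terms, tp):
    """Shard, sum each shard left-to-right, then sum partials: order depends on tp."""
    n = len(terms) // tp
    partials = [sum(terms[i * n:(i + 1) * n]) for i in range(tp)]
    return sum(partials)

def canonical_tree_sum(terms):
    """Pairwise reduction over one fixed tree: the sharding never enters the picture."""
    xs = list(terms)
    while len(xs) > 1:
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)] \
             + ([xs[-1]] if len(xs) % 2 else [])
    return xs[0]

terms = [1e16, 1.0, -1e16, 1.0]
print(naive_sharded_sum(terms, 1))   # 1.0
print(naive_sharded_sum(terms, 2))   # 0.0: a different answer from a different sharding
print(canonical_tree_sum(terms))     # one well-defined answer, by construction
```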

For pipelines that cannot adopt TBIK, the alternative is to match TP configurations across rollout and training — but this is often impractical, since the optimal TP for generation throughput (high TP, low latency) differs from the optimal TP for training throughput (low TP, high batch size under FSDP). The mismatch is a direct consequence of the engineering reality that serving and training have different parallelism sweet spots.

MoE routing discrepancies

For Mixture-of-Experts models, the "policy" is not just a probability distribution over tokens; it includes the router's discrete decisions about which experts process each token. A small numerical perturbation in the router logits can flip the top-$k$ expert selection, replacing one expert's parameters with another's. Unlike the continuous perturbations from precision or kernel differences, this is a discontinuous change. The model suddenly computes with a different subnetwork, and the resulting log-probabilities can shift by much more than the initial numerical perturbation would suggest.

This makes MoE routing mismatch qualitatively different from dense-model mismatch. In a dense model, a small logit perturbation produces a small output perturbation (the computation is Lipschitz). In an MoE model, the same small perturbation can cross a routing boundary and produce an arbitrarily large output change. Standard importance sampling can correct for the resulting ratio, but the ratio itself can be extreme, leading to high variance or outright rejection of the affected sequences.
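The discontinuity is easy to reproduce with a toy top-1 router (the two "experts" below are hypothetical stand-in functions):

```python
# Two hypothetical "experts" computing very different functions of the same input
experts = [lambda x: 2.0 * x, lambda x: -3.0 * x]

def moe_forward(x, router_logits):
    top = max(range(len(router_logits)), key=lambda i: router_logits[i])  # top-1 routing
    return experts[top](x)

x = 1.0
out_rollout = moe_forward(x, [0.5000, 0.4995])  # expert 0 barely wins: output 2.0
out_train   = moe_forward(x, [0.5000, 0.5005])  # 1e-3 logit shift flips to expert 1: output -3.0

# A tiny, continuous perturbation in the router produced a discontinuous output jump
assert abs(out_rollout - out_train) == 5.0
```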

R3 / Routing Replay (Ma et al. 2025) addresses this by recording the routing decisions made during rollout and replaying them during training. Instead of letting the training engine's router make fresh decisions (which may differ due to numerical divergence), R3 forces the training engine to use the same expert assignments that the rollout engine used. This eliminates the discontinuity: the training forward pass now uses the same computational pathway as the rollout, so the remaining mismatch is only the continuous kernel/precision component.

IcePop (Ling Team 2025) takes a lighter-weight approach: instead of replaying all routing decisions, it identifies token positions where the router's top-$k$ selection disagrees between rollout and training and masks those positions out of the loss. This is cheaper than full replay (no need to store and load routing tables) and effective when disagreement is sparse, which is the common case for well-trained routers with clear expert specialization.
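A minimal sketch of the masking idea, assuming per-token top-$k$ expert sets are available from both engines (illustrative, not the Ling Team's implementation):

```python
# Per-token top-k expert sets recorded at rollout vs recomputed at training
rollout_experts = [{0, 3}, {1, 2}, {0, 1}, {2, 3}]
train_experts   = [{0, 3}, {1, 2}, {0, 2}, {2, 3}]  # token 2's routing flipped

# Zero the loss mask wherever the two engines disagree; keep everything else
loss_mask = [1.0 if r == t else 0.0 for r, t in zip(rollout_experts, train_experts)]
assert loss_mask == [1.0, 1.0, 0.0, 1.0]
```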

Industrial deployments confirm the severity of the problem. DeepSeek-V3.2 introduced Keep Routing — preserving inference-time routing paths and enforcing them during training — and reports it was necessary for MoE RL stability. The Ring-1T report (2025) lists train-infer routing misalignment as one of the top scaling barriers for trillion-parameter MoE RL and uses IcePop-style masking as part of its stabilization recipe. For MoE models, routing alignment is not optional. Generic IS correction without routing awareness is insufficient.

Asynchronous training lag

Policy staleness from asynchronous pipelines is the most classical form of off-policy drift: the rollout policy is literally an older checkpoint. In one-step async (overlap generation of batch $N$ with training on batch $N-1$), the lag is exactly one gradient step — usually mild. In fully async pipelines like AReaL (Fu et al. 2025), the lag can accumulate across many training steps, producing importance ratios that are heavy-tailed and gradient estimates that are high-variance or biased.

But there is a subtler form of staleness that operates even within a single synchronous batch: minibatch staleness. When a rollout batch is split into multiple minibatches, the policy updates after each minibatch, so by the time the trainer reaches minibatch $k$, the current policy has moved $k-1$ gradient steps from the policy that generated the data. This happens in two common settings. Multi-step PPO uses a single epoch with multiple minibatches — off-policy from the second minibatch onward. Multi-epoch PPO makes multiple passes over the same batch — off-policy from the second epoch onward, and compounding with minibatch splits within each epoch. Both are the same mathematical problem as async staleness, just at a finer timescale. DeepSeek-V3.2 formalizes this with off-policy sequence masking: it computes the KL divergence between the current policy and the sampling policy for each sequence, and masks out negative-advantage sequences whose divergence exceeds a threshold. Critically, the divergence is computed using the sampling probabilities returned by the inference engine, so it captures both engine mismatch and minibatch staleness.
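The mechanism can be sketched as follows; the function name, threshold value, and data layout are hypothetical, chosen only to illustrate the rule:

```python
def offpolicy_sequence_mask(logp_current, logp_sampling, advantages, kl_threshold=0.02):
    """Mask negative-advantage sequences whose estimated KL from the sampling
    policy is too large. kl_threshold is a hypothetical value, not a published one."""
    mask = []
    for cur, samp, adv in zip(logp_current, logp_sampling, advantages):
        # k1 estimate of the divergence from per-token log-probs, averaged over tokens
        kl = sum(s - c for c, s in zip(cur, samp)) / len(cur)
        keep = not (adv < 0 and kl > kl_threshold)
        mask.append(1.0 if keep else 0.0)
    return mask

logp_sampling = [[-1.0, -0.5], [-1.0, -0.5], [-1.0, -0.5]]    # from the inference engine
logp_current  = [[-1.01, -0.51], [-1.2, -0.7], [-1.2, -0.7]]  # sequences 1-2 have drifted
advantages    = [-0.8, -0.8, 0.9]

mask = offpolicy_sequence_mask(logp_current, logp_sampling, advantages)
assert mask == [1.0, 0.0, 1.0]  # drifted + negative advantage -> masked
```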

The distinction between these two forms of lag matters for the choice of correction. Classical async staleness is predictable: you know how many gradient steps have elapsed since the rollout, and you can bound the expected drift. Minibatch-reuse staleness is data-dependent: the drift depends on the gradient magnitude of the intervening updates, which depends on the advantage signal, which depends on the reward distribution. This is why per-sequence masking (as in DeepSeek-V3.2) or per-sequence adaptive clipping (as in QUATRO (Lee et al. 2026)) tends to outperform global staleness bounds: the amount of drift varies across sequences within the same batch.

Quantized inference as automatic off-policy

Quantized rollout deserves its own discussion because it is simultaneously the most common mismatch source and the one most often overlooked. When the rollout engine generates with INT8, FP8, or W8A8 (8-bit weights, 8-bit activations) kernels and the training engine computes gradients from full-precision log-probabilities, the pipeline is off-policy by construction. The quantized actor is not an approximation of the training-time policy. It is a different policy, one whose softmax outputs are shifted in structured ways by the quantization grid.

FlashRL (Yao et al. 2025) was among the first to frame this explicitly: 8-bit rollouts can dramatically cut generation cost, but the resulting mismatch must be treated as off-policy data, not ignored. FP8-RL (Qiu et al. 2026) provides a complete production stack — W8A8 inference components, FP8 KV cache, and explicit IS correction integrated into the verl ecosystem — and demonstrates that low-precision RL can match BF16 baselines, but only when the correction mechanism is part of the design from day one. Jet-RL (Xi et al. 2026) makes the stronger architectural claim: the right solution is a unified precision flow across rollout and learning, not post-hoc correction of a precision gap.

The tradeoff is throughput versus correction overhead. Quantized rollout with IS correction is typically faster than full-precision rollout without correction, because the throughput gain from quantization (1.5-2x) exceeds the cost of the extra forward pass needed for IS ratio computation. But the correction adds engineering complexity and monitoring burden: you need to track the quantization-induced mismatch separately from other mismatch sources to know whether your IS weights are doing useful work or just adding noise. QuRL (Li et al. 2026) proposes a cleaner abstraction: treat the quantized actor as a permanent, first-class behavior policy with its own trust region, rather than as a defective copy of the training policy that needs patching.

These sources compound. A typical setup with FP8 rollouts, BF16 training, TP=4 inference, TP=1 training, and 4 PPO epochs experiences the precision gap, parallelism mismatch, and multi-epoch staleness simultaneously. The rollout correction framework treats them uniformly through the importance weight decomposition above.

Identifying mismatch — a diagnostic toolkit

Before correcting mismatch, you need to measure it. verl provides a hierarchy of metrics for this, from per-token signals to aggregate health indicators. All are implemented in rollout_corr_helper.py.

The per-token log-importance ratio

The basic signal is the per-token log importance ratio:

$$ \log\rho_t = \log\pi_{\text{old}}(y_t \mid x, y_{\lt t}) - \log\pi_{\text{rollout}}(y_t \mid x, y_{\lt t}) $$

Here, $\pi_{\text{old}}$ denotes the log-probabilities recomputed by the training engine (the "actor forward pass"), and $\pi_{\text{rollout}}$ denotes the log-probabilities recorded during rollout generation. If these two quantities agree everywhere — if $\log\rho_t \approx 0$ for all tokens — then the policies match and no correction is needed.
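In code, the signal is a subtraction of two log-prob vectors; a sketch with plain Python lists standing in for the per-token tensors:

```python
# Per-token log-probs for one response (illustrative numbers)
logp_old     = [-0.693, -1.204, -0.105, -2.303]  # recomputed by the training engine
logp_rollout = [-0.693, -1.200, -0.110, -2.310]  # recorded by the rollout engine

log_rho = [a - b for a, b in zip(logp_old, logp_rollout)]
max_abs = max(abs(r) for r in log_rho)

# Near-zero everywhere: on this sequence the two engines agree closely
assert max_abs < 0.05
```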

To measure this signal without applying any correction, configure verl's rollout correction in metrics-only mode:

# Metrics only, no correction applied
rollout_corr_config = RolloutCorrectionConfig.disabled()

This logs all diagnostic metrics without modifying any gradients or masks, giving you a clean baseline measurement.

Off-policy diagnostic metrics

The raw per-token $\log\rho_t$ is useful but hard to aggregate. The metrics below (computed in compute_offpolicy_metrics() in rollout_corr_helper.py) summarize the distribution shift into scalars you can track on a dashboard.

KL divergence (k1 estimator). The simplest divergence measure:

$$ \text{KL}_{k1} = \mathbb{E}[\log\pi_{\text{rollout}}(y_t \mid x, y_{\lt t}) - \log\pi_{\text{old}}(y_t \mid x, y_{\lt t})] $$

This is the Monte Carlo estimate of $D_{\text{KL}}(\pi_{\text{rollout}} \| \pi_{\text{old}})$. Unbiased but can have high variance.

K3 KL. A lower-variance alternative from Schulman (2020):

$$ \text{KL}_{k3} = \mathbb{E}[\rho_t - \log\rho_t - 1] $$

where $\rho_t = \exp(\log\rho_t)$. Always non-negative, with better numerical properties than k1.

Chi-squared divergence. More sensitive to tail behavior. At the token level:

$$ \chi^2_{\text{token}} = \mathbb{E}[\rho_t^2] - 1 $$

And at the sequence level, using the product of per-token ratios:

$$ \chi^2_{\text{seq}} = \mathbb{E}\left[\left(\prod_t \rho_t\right)^2\right] - 1 $$

Chi-squared divergence directly relates to the variance of importance-weighted estimators. When $\chi^2_{\text{token}} \gt 1.0$, the IS weight distribution is heavy-tailed enough to substantially inflate gradient variance: even for estimating a constant, the second moment of the importance-weighted estimator, $\mathbb{E}[\rho_t^2] = \chi^2_{\text{token}} + 1$, is more than double its on-policy value.
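All of the estimators above are simple functions of the per-token log-ratios. A self-contained sketch mirroring the formulas (not verl's implementation), evaluated on a single sequence so each expectation is a single-sample estimate:

```python
import math

def offpolicy_metrics(log_rho):
    """k1 KL, k3 KL, and chi-squared estimates from one sequence's log-ratios."""
    rho = [math.exp(r) for r in log_rho]
    n = len(log_rho)
    kl_k1 = -sum(log_rho) / n                      # E[log pi_rollout - log pi_old]
    kl_k3 = sum(r - lr - 1.0 for r, lr in zip(rho, log_rho)) / n
    chi2_token = sum(r * r for r in rho) / n - 1.0
    chi2_seq = math.exp(2.0 * sum(log_rho)) - 1.0  # (prod_t rho_t)^2 - 1
    return kl_k1, kl_k3, chi2_token, chi2_seq

k1, k3, chi2_tok, chi2_seq = offpolicy_metrics([0.01, -0.02, 0.015, 0.0])
assert k3 >= 0.0        # k3 is non-negative by construction
assert chi2_seq > -1.0  # a squared ratio can never push chi-squared below -1
```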

Perplexity gap and PPL ratio. The absolute and relative differences in perplexity between rollout and training engines:

$$ \text{PPL ratio} = \frac{\text{PPL}_{\text{old}}}{\text{PPL}_{\text{rollout}}} $$

A PPL ratio above 1.0 means the training engine is less confident than the rollout engine (higher perplexity); below 1.0 means the opposite. A ratio slightly above 1.0 is the expected default even when both engines implement the same policy, because $\text{PPL}_\text{rollout}$ on its own samples measures the entropy $H(\pi_\text{rollout})$, while $\text{PPL}_\text{old}$ on those same samples measures the cross-entropy $H(\pi_\text{rollout}, \pi_\text{old}) = H(\pi_\text{rollout}) + D_\text{KL}(\pi_\text{rollout} \| \pi_\text{old})$. By Gibbs' inequality the KL term is non-negative, so any numerical discrepancy between the two engines pushes the ratio above 1.0. A ratio substantially above 1.0, or any ratio below 1.0, therefore signals a systematic gap (precision differences, kernel divergence, TP mismatch, or routing discrepancies) beyond the baseline information-theoretic floor.

IS weight health and rejection sampling metrics

Once you enable IS or RS correction, additional metrics (from compute_is_metrics() in rollout_corr_helper.py) track the health of the correction itself.

Effective Sample Size (ESS). The fraction of samples carrying meaningful weight:

$$ \text{ESS} = \frac{1}{\mathbb{E}[\tilde{w}^2]} $$

where $\tilde{w}$ denotes the normalized IS weights (mean 1). An ESS of 0.5 means your effective batch size is half the actual batch size. Below 0.3, most of your compute is wasted on samples that contribute almost nothing.

IS weight statistics. Mean, standard deviation, min, and max of the IS weights. Healthy weights have mean near 1.0 and low std. A mean far from 1.0 suggests systematic bias; high std means a few samples dominate.

Truncation fractions. The fraction of IS weights clipped by the truncation threshold. High truncation fractions mean severe mismatch that IS alone cannot handle.
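These health metrics are cheap to compute from the raw weights. A sketch assuming a flat list of per-sequence IS weights and a truncation threshold $C$:

```python
def is_weight_health(weights, C=2.0):
    """ESS (as a fraction), mean/std, and truncated fraction for a batch of IS weights."""
    n = len(weights)
    mean = sum(weights) / n
    normalized = [w / mean for w in weights]           # mean-1 weights, as ESS expects
    ess = 1.0 / (sum(w * w for w in normalized) / n)   # 1 / E[w~^2]
    std = (sum((w - mean) ** 2 for w in weights) / n) ** 0.5
    trunc_frac = sum(w >= C for w in weights) / n      # fraction at or above the cap
    return ess, mean, std, trunc_frac

# Healthy batch: weights near 1.0, ESS near 1.0, nothing truncated
ess, mean, std, tf = is_weight_health([0.9, 1.1, 1.0, 0.95, 1.05])
assert ess > 0.95 and tf == 0.0

# One dominating weight craters the effective sample size
ess_bad, _, _, _ = is_weight_health([0.1, 0.1, 0.1, 0.1, 10.0])
assert ess_bad < 0.3
```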

RS masked fraction. For each rejection sampling option (and overall), the fraction of sequences whose response masks were zeroed out. This tells you how much data you are throwing away.

Debug probability metrics

For deeper investigation, verl provides debug metrics (in verl/utils/debug/metrics.py) comparing raw probability distributions:

  • Pearson correlation between rollout and actor probabilities (exponentiated log-probs) across all tokens. On-policy training should see correlations above 0.99. Below 0.95 means something is seriously off in the backend.
  • rollout_probs_diff_mean and rollout_probs_diff_max: mean and maximum absolute difference in probabilities. The max catches rare but extreme discrepancies.

Gradient variance proxy metrics

When IS weights are applied, they can inflate gradient variance beyond what the advantage signal warrants. verl tracks this via compute_variance_proxy_metrics() in metric_utils.py:

  • proxy1_signal_strength: $\lVert\bar{g}\rVert^2$ (squared mean gradient) — measures gradient signal strength.
  • proxy2_total_power: $\mathbb{E}[\lVert\hat{g}\rVert^2]$ — total power (signal + noise).
  • proxy3_pure_noise: $\frac{1}{N-1}(\text{proxy2} - \text{proxy1})$ — pure variance in gradient estimates.

If proxy3_pure_noise inflates over training, IS correction is adding unacceptable noise and you should tighten the IS threshold or switch to RS.

Training-time loss diagnostics

Standard PPO health metrics are logged even without rollout correction and interact with off-policy drift:

  • actor/pg_clipfrac: fraction of tokens where PPO clipping activated. Rising clip fraction across epochs in multi-epoch PPO signals excessive divergence from the sampling policy.
  • critic/vf_explained_var: $1 - \text{Var}(G - V) / \text{Var}(G)$. Low explained variance means a stale value function that is no longer tracking actual returns.

The SGA bias-variance framework

The clearest analytic framework for off-policy drift in LLM RL comes from the When Speed Kills Stability blog series (Liu, Li et al. Sep 2025). The usual question, "is this pipeline on-policy or off-policy?", turns out to be less useful than a more precise one: "what is the bias-variance profile of the gradient estimator under the actual distributional drift?"

The framework starts from the observation that policy gradient methods in LLM RL are instances of stochastic gradient ascent (SGA) on the expected reward $J(\theta) = \mathbb{E}_{\pi_\theta}[R(y)]$. When data is sampled from a behavior policy $\mu \neq \pi_\theta$, the gradient estimator becomes an importance-weighted surrogate. The key question is what happens at the token level versus the sequence level, because LLM RL computes token-level losses but assigns sequence-level rewards.

Token-level corrections optimize a surrogate. Standard practice in PPO and GRPO is to correct each token's contribution independently: the per-token ratio $\rho_t = \pi_\theta(y_t \mid x, y_{\lt t}) / \mu(y_t \mid x, y_{\lt t})$ appears in the clipped objective or as an IS weight. But this token-level correction optimizes a surrogate objective that differs from the true sequence-level expected reward. The surrogate is exact only when $\mu = \pi_\theta$; under distributional drift, it incurs a bias that grows with both the sequence length $T$ and the magnitude of the drift. Intuitively, the token-level ratio corrects each conditional $\pi_\theta(y_t \mid y_{\lt t})$ but does not account for the fact that the prefix $y_{\lt t}$ was also sampled from the wrong distribution. The accumulated prefix error is the bias.

Exact sequence-level IS is unbiased but exponentially variable. The theoretically correct correction uses the full sequence-level importance ratio:

$$ \rho_{1:T} = \prod_{t=1}^{T} \frac{\pi_\theta(y_t \mid x, y_{\lt t})}{\mu(y_t \mid x, y_{\lt t})} $$

This yields an unbiased estimator of the on-policy gradient, regardless of how different $\mu$ and $\pi_\theta$ are. The problem is variance: the product of $T$ per-token ratios can be astronomically large or small. Even with moderate per-token ratios (say, each in $[0.9, 1.1]$), a 500-token sequence can produce a sequence ratio anywhere from $\sim 10^{-23}$ to $\sim 10^{21}$. The resulting gradient estimate has variance that grows exponentially with horizon, making it useless in practice without aggressive truncation — which reintroduces bias.
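These numbers are easy to reproduce:

```python
# Mild per-token drift: every token's ratio is 0.9 or 1.1, compounded over 500 tokens
low = 0.9 ** 500
high = 1.1 ** 500

assert low < 1e-22 and high > 1e20  # roughly 1.3e-23 and 5.0e20
```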

Practical methods live on a bias-variance continuum. The SGA framework places every off-policy correction method on a spectrum:

  • At one extreme, pure token-level correction (PPO-clip, token-level TIS) has low variance but bias that grows with drift and length.
  • At the other extreme, exact sequence-level IS has zero bias but variance that can make learning impossible.
  • In between, methods like geometric-mean ratios (GSPO, GMPO), prefix ratios (MinPRO), turn-level IS, and second-moment-constrained objectives (M2PO, R$^2$VPO) trade bias for variance in different ways.

The framework's concrete contribution to diagnostics is identifying the right divergence measures for each failure mode:

  • Total variation (TV) distance between $\mu$ and $\pi_\theta$ bounds the bias of token-level surrogates. When $\text{TV}(\mu, \pi_\theta)$ is small, the surrogate is close to the true objective, and token-level corrections are safe.
  • Chi-squared divergence $\chi^2(\pi_\theta \| \mu) = \mathbb{E}_\mu[(\pi_\theta / \mu)^2] - 1$ bounds the variance of importance-weighted estimators. When $\chi^2$ is large, even unbiased sequence-level corrections are too noisy to be useful.

A healthy off-policy pipeline needs both metrics to be moderate. High TV means the bias of cheap corrections is large. High $\chi^2$ means the variance of exact corrections is large. When both are high, no simple correction works well, and the pipeline needs either tighter synchronization or a fundamentally different objective.

For practitioners, the SGA framework suggests a concrete diagnostic protocol: track both $\text{TV}$ (or its proxy, the per-token KL) and $\chi^2$ (at both token and sequence level) across training. When TV is low and $\chi^2$ is low, you are effectively on-policy and no correction is needed. When TV is moderate and $\chi^2$ is low, token-level IS or geometric-mean RS suffices. When TV is low but $\chi^2$ is high (rare, but possible with long sequences and mild per-token drift), sequence-level truncation or masking is the right tool. When both are high, the mismatch is too severe for algorithmic correction alone — you need to reduce the system-level gap (precision alignment, weight synchronization, routing replay) before the algorithm can do its job.

Correcting mismatch — the rollout correction framework

Architecture overview

verl's rollout correction has two orthogonal tools that can be combined:

  • Importance sampling (IS) weights: reweight each sample's gradient contribution by the likelihood ratio between training and rollout policies. Makes the gradient unbiased (up to truncation) at the cost of higher variance.
  • Rejection sampling (RS) masks: zero out the response masks of sequences where the divergence exceeds a threshold. Removes high-divergence samples entirely, reducing variance but discarding data.

Configuration lives in RolloutCorrectionConfig (in verl/trainer/config/algorithm.py), and the entry point is compute_rollout_correction_and_add_to_batch() in rollout_corr_helper.py.

Importance sampling correction

IS correction reweights the gradient so that training on off-policy data approximates the on-policy gradient. Two aggregation levels are available:

Token-level truncated IS (Token TIS). Each token gets its own weight, clipped to prevent extreme values:

$$ w_t = \min(\rho_t, C) $$

where $\rho_t = \exp(\log\pi_{\text{old}} - \log\pi_{\text{rollout}})$ and $C$ is the truncation threshold (typical range: 1.5–5.0). Biased (clipping introduces bias) but low variance, so it works well as a default.

Sequence-level truncated IS (Sequence TIS). The entire sequence gets a single weight based on the product of token ratios:

$$ w = \min\left(\prod_t \rho_t, C\right) $$

with a typical threshold range of 2.0–10.0. Before truncation, this is closer to the unbiased sequence-level estimator than token TIS. In practice, however, the product of ratios is almost always truncated for long sequences (see the length trap below), and aggressive truncation reintroduces bias — potentially more than token TIS for long outputs. The main advantage of sequence TIS is that it preserves the correct correction unit (the sequence) at the cost of higher variance.

All computations happen in log-space with a safety bound of $\pm 20$ (since $e^{20} \approx 4.85 \times 10^8$) to prevent numerical overflow. After truncation, IS weights are optionally batch-normalized to mean 1.0, which prevents the effective learning rate from drifting.
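Both aggregation levels reduce to a few log-space operations. A sketch (illustrative, not verl's code), with thresholds drawn from the typical ranges above:

```python
import math

LOG_BOUND = 20.0  # clamp log-ratios so exp() cannot overflow (e^20 ~ 4.85e8)

def clamp(x, b=LOG_BOUND):
    return max(-b, min(b, x))

def token_tis(log_rho, C=3.0):
    """Per-token truncated IS: w_t = min(rho_t, C)."""
    return [min(math.exp(clamp(r)), C) for r in log_rho]

def seq_tis(log_rho, C=5.0):
    """Sequence-level truncated IS: w = min(prod_t rho_t, C), computed in log space."""
    return min(math.exp(clamp(sum(log_rho))), C)

def normalize_mean_one(weights):
    """Optional batch normalization to mean 1.0, keeping the effective LR stable."""
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]

mild = [0.05, -0.02, 0.1, 0.03]      # mild per-token drift: weights stay near 1
assert all(w < 3.0 for w in token_tis(mild))
assert seq_tis([0.5] * 100) == 5.0   # the product explodes and hits the cap
```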

To enable token-level IS correction:

# Enable token-level IS correction
rollout_corr_config = RolloutCorrectionConfig.decoupled_token_is()

Rejection sampling correction

RS correction identifies and excludes sequences where the training and rollout policies disagree too strongly. Three KL estimators are available:

K1 (ratio-based). Computes the raw log-ratio at each token:

$$ d_{k1}(t) = -\log\rho_t = \log\pi_{\text{rollout}} - \log\pi_{\text{old}} $$

With K1, thresholds are specified as ratio bounds (lower and upper), rejecting sequences where the ratio falls outside $[\ell, u]$.

K2 (MSE-based). A squared divergence measure:

$$ d_{k2}(t) = \frac{1}{2}(\log\rho_t)^2 $$

K2 is always non-negative. The threshold is an upper bound on the divergence.

K3 (low-variance KL). Schulman's estimator:

$$ d_{k3}(t) = \rho_t - \log\rho_t - 1 $$

Also always non-negative, with lower variance than K1.
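The three estimators can be computed side by side from the same per-token log-ratio; a minimal sketch (hypothetical helper):

```python
import math

def kl_divergences(logp_old, logp_rollout):
    """Per-token (k1, k2, k3) divergence estimates from the formulas above."""
    out = []
    for lp_old, lp_roll in zip(logp_old, logp_rollout):
        log_rho = lp_old - lp_roll          # log pi_old - log pi_rollout
        rho = math.exp(log_rho)
        k1 = -log_rho                       # raw log-ratio, can be negative
        k2 = 0.5 * log_rho ** 2             # squared measure, always >= 0
        k3 = rho - log_rho - 1.0            # Schulman's low-variance estimator, >= 0
        out.append((k1, k2, k3))
    return out
```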

These token-level divergences can be aggregated to the sequence level in four ways, yielding 11 total modes:

| Aggregation | K1 | K2 | K3 |
| --- | --- | --- | --- |
| Token (per-token threshold) | token_k1 | token_k2 | token_k3 |
| Seq sum ($\sum_t d_t$) | seq_sum_k1 | seq_sum_k2 | seq_sum_k3 |
| Seq mean ($\frac{1}{T}\sum_t d_t$) | seq_mean_k1 | seq_mean_k2 | seq_mean_k3 |
| Seq max ($\max_t d_t$) | – | seq_max_k2 | seq_max_k3 |

The geometric-mean modes (seq_mean_*) solve the length trap (detailed in the failure modes section below). With sequence-sum aggregation, long sequences accumulate higher total divergence even if per-token divergence is small, so they get preferentially rejected. For chain-of-thought (CoT) and agent tasks where response lengths vary dramatically, this creates a systematic bias against long (and often correct) reasoning traces. Geometric-mean normalization divides by sequence length, making the criterion length-invariant (Li & Liu 2025). See also kalomaze (2026) for a broader argument against length-dependent rollout exclusion.
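A length-invariant seq_mean_k1 admission test can be sketched as follows (hypothetical helper; the ratio bounds are illustrative assumptions):

```python
import math

def seq_mean_k1_reject(logp_old, logp_rollout, ratio_low=0.5, ratio_high=2.0):
    """Keep a sequence only if its geometric-mean token ratio lies in [low, high].

    Dividing the summed log-ratio by length makes the criterion
    independent of sequence length, avoiding the length trap.
    """
    mean_log_rho = sum(o - r for o, r in zip(logp_old, logp_rollout)) / len(logp_old)
    geo_ratio = math.exp(mean_log_rho)
    keep = ratio_low <= geo_ratio <= ratio_high
    return keep, geo_ratio
```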

Bypass vs decoupled mode

The framework operates in one of two modes, differing in how many distinct policies are involved.

Bypass mode (2 policies). Sets $\pi_{\text{old}} = \pi_{\text{rollout}}$, skipping the expensive actor forward pass that recomputes log-probabilities on the training engine. The IS ratio is computed directly between the current policy $\pi_\theta$ and the rollout policy. This is faster (one fewer forward pass per batch) but less accurate: it conflates training-inference mismatch with policy staleness into a single ratio. Implemented in apply_bypass_mode() in rollout_corr_helper.py.

Decoupled mode (3 policies). Recomputes $\pi_{\text{old}}$ on the training engine, giving three distinct sets of log-probabilities: $\pi_{\text{rollout}}$ (from the rollout engine), $\pi_{\text{old}}$ (from the training engine at the rollout weights), and $\pi_\theta$ (from the training engine at the current weights). This lets you measure and correct the two factors of the importance weight decomposition independently. More accurate, but costs an additional forward pass.

When to use which: Bypass when the mismatch is small (synchronized pipelines, same precision, or when speed matters more than correction accuracy). Decoupled when mismatch is large (async pipelines, precision gaps, multi-epoch training where you need precise staleness measurement).

The full set of configuration presets:

| Preset | Mode | IS | RS | Loss |
| --- | --- | --- | --- | --- |
| bypass_ppo_clip | Bypass | – | – | PPO-clip |
| bypass_ppo_clip_geo_rs | Bypass | – | seq_mean_k1 | PPO-clip |
| bypass_ppo_clip_k3_rs | Bypass | – | seq_mean_k3 | PPO-clip |
| bypass_pg_is | Bypass | Seq TIS (2.0) | – | REINFORCE |
| bypass_pg_geo_rs | Bypass | – | seq_mean_k1 | REINFORCE |
| bypass_pg_geo_rs_seq_tis | Bypass | Seq TIS (2.0) | seq_mean_k1 | REINFORCE |
| bypass_pg_geo_rs_token_tis | Bypass | Token TIS (2.0) | seq_mean_k1 | REINFORCE |
| decoupled_token_is | Decoupled | Token TIS (2.0) | – | (any) |
| decoupled_seq_is | Decoupled | Seq TIS (2.0) | – | (any) |
| decoupled_seq_is_rs | Decoupled | Seq TIS (2.0) | seq_sum_k1 | (any) |
| decoupled_geo_rs | Decoupled | – | seq_mean_k1 | (any) |
| decoupled_geo_rs_seq_tis | Decoupled | Seq TIS (2.0) | seq_mean_k1 | (any) |
| decoupled_geo_rs_token_tis | Decoupled | Token TIS (2.0) | seq_mean_k1 | (any) |
| decoupled_k3_rs | Decoupled | – | seq_mean_k3 | (any) |
| decoupled_k3_rs_seq_tis | Decoupled | Seq TIS (2.0) | seq_mean_k3 | (any) |
| decoupled_k3_rs_token_tis | Decoupled | Token TIS (2.0) | seq_mean_k3 | (any) |
| disabled | – | – | – | (any) |

Policy loss functions as staleness mechanisms

Your choice of policy loss function determines how the training objective responds to the gap between the current policy and the sampling policy. Each loss implicitly handles (or fails to handle) off-policy drift through its trust region.

Vanilla PPO clip

The standard PPO clipped objective (Schulman et al. 2017) is the most widely used:

$$ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[\min\left(r_t(\theta)\hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] $$

where $r_t(\theta) = \pi_\theta(y_t \mid x, y_{\lt t}) / \pi_{\text{old}}(y_t \mid x, y_{\lt t})$.

PPO clip works well on-policy but breaks down under distribution shift. It over-penalizes tokens where the old policy assigned low probability (small absolute changes produce large ratios) and under-penalizes tokens where the old policy assigned high probability (large absolute changes produce small ratios). Under off-policy drift, this asymmetry compounds: $r_t$ reflects both legitimate policy updates and spurious mismatch, and the clipping mechanism cannot tell them apart (Qi et al. 2026). verl's vanilla loss also implements dual-clip PPO (Ye et al. 2020) with a configurable constant clip_ratio_c (default 3.0): on negative-advantage tokens, the objective is floored at $c \cdot \hat{A}_t$, bounding the penalty when the ratio grows large, a bound that standard PPO clip lacks on that side.
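A per-token sketch of the clipped objective with the dual-clip floor (hypothetical helper; verl computes this over batched tensors):

```python
def ppo_clip_term(ratio, adv, eps=0.2, clip_ratio_c=3.0):
    """One token's contribution to the (maximized) PPO-clip objective.

    Implements min(r*A, clip(r)*A), plus the dual-clip floor c*A for
    negative advantages (Ye et al. 2020).
    """
    unclipped = ratio * adv
    clipped = max(1.0 - eps, min(1.0 + eps, ratio)) * adv
    obj = min(unclipped, clipped)
    if adv < 0:
        # Dual clip: bound the penalty when the ratio is large on bad tokens.
        obj = max(obj, clip_ratio_c * adv)
    return obj
```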

DPPO-TV / DPPO-KL

DPPO (Qi et al. 2026) replaces ratio-based clipping with per-token divergence bounds:

  • DPPO-TV: constrains total variation distance between $\pi_\theta$ and $\pi_{\text{old}}$ at each token position.
  • DPPO-KL: constrains binary KL divergence at each token position.

These bound the actual distributional distance rather than relying on a ratio proxy, so the trust region size is invariant to the base probability level. More stable under distribution shift. Both use truncated IS (threshold 20.0) internally to stabilize the ratio computation.

GSPO — geometric sequence ratios

GSPO (Zheng et al. 2025) operates at the sequence level using the geometric mean of token ratios:

$$ \bar{r} = \exp\left(\frac{1}{T}\sum_{t=1}^{T} \log r_t\right) $$

This geometric mean is then clipped and used in a PPO-style objective, with a stop-gradient so that only $\log\pi_\theta$ receives gradients. Dividing by $T$ in the exponent prevents long sequences from producing extreme ratios, making GSPO naturally resistant to the length trap.
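The clipped geometric-mean sequence ratio can be sketched as below (autograd and the stop-gradient are omitted; hypothetical helper):

```python
import math

def gspo_seq_ratio(logp_cur, logp_old, eps=0.2):
    """Clipped geometric-mean sequence ratio used in a GSPO-style objective.

    Averaging log-ratios over T keeps the ratio bounded regardless of length.
    """
    mean_log = sum(c - o for c, o in zip(logp_cur, logp_old)) / len(logp_cur)
    r_bar = math.exp(mean_log)
    return max(1.0 - eps, min(1.0 + eps, r_bar))
```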

SAPO — smooth gating

SAPO (Gao et al. 2025) replaces hard clipping with a smooth sigmoid gate:

$$ g(r, \tau) = \sigma(\tau(r - 1)) \cdot \frac{4}{\tau} $$

where $\tau$ is a temperature parameter. Separate temperatures $\tau_{\text{pos}}$ and $\tau_{\text{neg}}$ for positive and negative advantages give asymmetric trust regions. The smooth gate avoids the discontinuous gradients of PPO's hard clip while still bounding the step size.
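The gate itself is a one-liner; a sketch (hypothetical helper name). Note that $g(1, \tau) = 2/\tau$, so at $\tau = 2$ the gate passes the on-policy ratio through at value 1, and the $4/\tau$ factor makes the slope at $r = 1$ equal to 1 for any $\tau$:

```python
import math

def sapo_gate(ratio, tau):
    """Smooth sigmoid gate g(r, tau) = sigmoid(tau * (r - 1)) * 4 / tau.

    Saturates at 4/tau for large ratios instead of clipping discontinuously.
    """
    return 4.0 / tau * (1.0 / (1.0 + math.exp(-tau * (ratio - 1.0))))
```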

GMPO — geometric mean clipping

GMPO (Zhao et al. 2025) applies sign-aware geometric-mean clipping at the sequence level. It clamps the per-token log-ratio $\log\pi_\theta - \log\pi_{\text{old}}$ within $[-\epsilon_{\text{low}}, \epsilon_{\text{high}}]$ based on the sign of the advantage, then exponentiates the sequence mean.

CISPO — stop-gradient clipping

CISPO (MiniMax 2025) applies a stop-gradient to the clipped ratio:

$$ \mathcal{J}_{\text{CISPO}} = \mathbb{E}\left[\frac{1}{\sum_{i} |o_i|} \sum_{i} \sum_{t} \text{sg}\left(\hat{r}_{i,t}(\theta)\right) \hat{A}_{i,t} \log \pi_\theta(o_{i,t} \mid q, o_{i,\lt t})\right] $$

where $\hat{r}_{i,t}(\theta) = \text{clip}(r_{i,t}(\theta), 1 - \epsilon_{\text{low}}, 1 + \epsilon_{\text{high}})$ and $\text{sg}[\cdot]$ is the stop-gradient operator. Gradients flow only through $\log\pi_\theta$, not through the ratio. This preserves gradients for reflective reasoning tokens that PPO clip would suppress (because their ratios hit the clip boundary).

One caveat: the DPPO analysis shows that CISPO lacks an explicit trust region. The stop-gradient removes the mechanism by which the clipped ratio constrains update size, so under growing mismatch the updates can grow progressively larger, which increases mismatch further: a positive feedback loop.

GPG (group policy gradient)

GPG (gpg) is a REINFORCE-style loss that applies the raw policy gradient $-\log\pi_\theta \cdot \hat{A}$ without ratio clipping, combined with group-normalized advantages (similar to GRPO). When rollout IS weights are provided, they are applied as explicit multipliers. GPG is useful as a simpler baseline that avoids the ratio-clipping artifacts of PPO while still benefiting from the rollout correction framework.

Covariance-based token filtering

verl also provides two covariance-based losses that take a different approach to staleness: instead of constraining the ratio or divergence, they identify and suppress individual tokens where the advantage estimate is unreliable.

Clip-cov (clip_cov) computes the covariance $\text{Cov}(\hat{A}, \log\pi)$ per token. Tokens with covariance in a specified range are randomly sampled and their loss contribution is zeroed out, reducing the influence of tokens where the advantage and log-probability are strongly coupled (a sign of noisy gradient signal under off-policy data).

KL-cov (kl_cov) takes a softer approach: instead of zeroing high-covariance tokens, it adds a KL penalty $\beta|\log\pi_\theta - \log\pi_{\text{old}}|$ on those tokens, discouraging large policy updates where the gradient signal is unreliable.

Bypass mode loss

The bypass mode loss (compute_policy_loss_bypass_mode() in core_algos.py) dispatches to either PPO-clip or REINFORCE while layering the full rollout correction framework:

  • For REINFORCE: IS weights are applied explicitly as multipliers on the policy gradient.
  • For PPO-clip: IS weights are not applied (in bypass mode, $r_t(\theta) = \pi_\theta / \pi_{\text{rollout}}$ already incorporates the full policy shift, so an additional IS multiplier would double-count). RS masks are applied in both cases.

This is verl's primary anti-staleness mechanism. It combines a policy loss with IS reweighting and RS filtering in one place, so correction applies regardless of which underlying loss you pick.

The broader algorithmic landscape

The previous sections covered the core rollout correction framework: IS and RS correction, trust-region losses, and IS-aware baselines. But these are pieces of a larger algorithmic conversation that has unfolded rapidly since mid-2025. Once the community recognized that most "on-policy" LLM RL is secretly off-policy, a wave of methods appeared that redesign how importance ratios are constructed, how trust regions are enforced, and whether off-policyness should be corrected at all or simply accepted as the default operating condition.

This section surveys those methods, organized by design philosophy rather than chronology.

Importance sampling design

Standard IS applies a single ratio $\rho_t = \pi_\theta(y_t \mid x, y_{\lt t}) / \pi_{\text{old}}(y_t \mid x, y_{\lt t})$ symmetrically to all tokens regardless of advantage sign, aggregation level, or causal structure. The methods below challenge each of those defaults.

Asymmetric trajectory-level treatment. The earliest departure from symmetric IS in this wave is TOPR (Tapered Off-Policy REINFORCE, Le Roux et al. 2025), which operates at the trajectory level: each complete response gets a single importance weight $\pi(\tau)/\mu(\tau)$, and the asymmetry is applied to the trajectory-level ratio based on the sign of the trajectory reward. For positive-reward trajectories, TOPR uses an SFT-style gradient (no importance weighting), since a large ratio on a good trajectory reflects desirable policy improvement. For negative-reward trajectories, it applies truncated importance weighting to prevent over-suppression: without truncation, the gradient pushes probability mass away from bad trajectories faster than the reward signal warrants. This trajectory-level asymmetric treatment foreshadows the sign-dependent clipping that reappears in several later methods.

ASPO (Wang et al. 2025) makes the asymmetry more explicit. For positive-advantage tokens, ASPO "flips" the ratio so that the correction favors exploration of beneficial actions rather than conservative clamping. For negative-advantage tokens, it retains standard clipping. The intuition is that PPO's symmetric clip suppresses both upside and downside equally, but under off-policy drift the downside suppression matters more (it prevents collapse), while the upside suppression just wastes signal. ASPO keeps the safety guardrail on the negative side while removing it on the positive side. This matters for mismatch because off-policy drift inflates ratios in both directions, and symmetric treatment discards gradient signal that is actually informative.

Prefix-level correction. A deeper question is whether the token-level ratio is even the correct correction object. MinPRO (Lei et al. 2026) argues it is not. In autoregressive generation, the probability of token $t$ depends on all preceding tokens. The causally correct importance weight for correcting the distribution at position $t$ is therefore the prefix ratio $\prod_{s=1}^{t} \rho_s$, not the token ratio $\rho_t$ alone. But cumulative products of ratios explode with sequence length (the length trap from earlier). MinPRO proposes a non-cumulative surrogate: instead of the full product, it uses the minimum token-level ratio within the preceding prefix:

$$ w_t^{\text{MinPRO}} = \min_{s \leq t} \rho_s $$

When all per-token ratios exceed 1 (the common case under policy improvement), this underestimates the true prefix ratio and is therefore conservative. When some ratios are below 1, the minimum can overestimate the shrinking product, but the overall effect is still stabilizing: it avoids the exponential blowup of the full product while preserving the causal structure that pure token-level IS ignores. This matters most in long-horizon tasks where token-level IS accumulates systematic bias across hundreds of tokens because it fails to account for the distributional shift introduced at earlier positions.
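The non-cumulative surrogate is a running minimum over token ratios; a minimal sketch (hypothetical helper):

```python
def minpro_weights(rhos):
    """MinPRO-style prefix weights: w_t = min over s <= t of rho_s.

    Avoids the exponential blowup of the full prefix product while
    remaining sensitive to earlier distributional shift.
    """
    weights, cur = [], float("inf")
    for rho in rhos:
        cur = min(cur, rho)
        weights.append(cur)
    return weights
```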

Turn-level correction for agentic settings. When the "response" spans multiple dialogue turns with tool calls and environment feedback, even prefix-level correction operates at the wrong granularity. Li et al. 2025 argue that in multi-turn agentic RL, the natural correction unit is the turn, not the token: each turn is a coherent action whose importance weight should reflect the policy's probability of that entire action given the conversation history. They construct turn-level ratios and pair them with a clipping-triggered normalization scheme that re-normalizes weights within each turn when clipping activates, preventing the bias that standard PPO clipping introduces in long-horizon settings. This is important because agentic tasks are where staleness hits hardest: environment latency means rollouts are old by the time the learner processes them, and the trajectory length means token-level corrections compound their errors.

Clipping, masking, and trust-region methods

Standard PPO clips the ratio $r_t(\theta)$ to $[1-\epsilon, 1+\epsilon]$. This is a blunt instrument: it applies the same trust region to every token in every sequence for every prompt, and it only constrains a ratio proxy rather than the actual distributional divergence. The methods below refine each of these dimensions.

Second-moment trust regions. M2PO (Zheng et al. 2025) starts from an empirical observation: stale data can help early in training (the "prosperity" phase), because the diversity of off-policy samples provides useful exploration. But as the policy moves further from the behavior policy, the second moment of the log importance ratio grows, the chi-squared divergence between policies increases, and training collapses. M2PO replaces PPO's first-moment clip with a second-moment trust constraint on the log-ratio:

$$ \mathbb{E}[(\log \rho_t)^2] \leq C $$

This bounds the Pearson chi-squared divergence between the current and behavior policies (the paper shows $\chi^2(\pi_\theta \| \pi_{\text{behav}}) \leq R^2 \cdot M_2$ where $M_2 = \mathbb{E}[(\log \rho)^2]$), providing a tighter control on distributional shift than bounding the ratio itself. In practice, M2PO tolerates larger mean ratio deviations (which carry gradient signal) while strictly controlling the tail (which carries noise). The "prosperity before collapse" framing is itself a diagnostic contribution: if you see improving metrics under stale data, do not assume stability. Watch the second moment of the log-ratio.
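As a diagnostic, the second-moment check is a one-line statistic over a batch of per-token log-ratios; a sketch (hypothetical helper; the threshold is an illustrative assumption):

```python
def second_moment_ok(log_rhos, c=0.04):
    """M2PO-style constraint check: E[(log rho)^2] <= C over a batch.

    Returns (satisfied, m2) so the second moment can also be logged as
    an early-warning metric.
    """
    m2 = sum(x * x for x in log_rhos) / len(log_rhos)
    return m2 <= c, m2
```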

Sequence-validity screening. Trust Region Masking (TRM, Li et al. 2025) takes a different approach: rather than clipping individual token ratios, it evaluates whether a sequence is trustworthy as a whole using the maximum per-token KL divergence. A sequence is accepted only if $\max_t D_{\text{KL}}(c_t) \leq \delta$; otherwise it is masked entirely. This criterion is explicitly length-invariant (unlike ratio-based bounds, which accumulate with sequence length) and converts local divergence checks to a sequence-level admission test. TRM is most relevant for long-horizon tasks (code generation, multi-step reasoning, agentic trajectories) where a single high-divergence token can corrupt the gradient for the entire sequence, yet token-level clipping only addresses that one position while leaving the rest of the sequence's gradient intact. By masking entire sequences with trust-region violations, TRM trades data efficiency for gradient cleanliness.

Query-adaptive trust regions. QUATRO (Lee et al. 2026) observes that different prompts have different drift sensitivity. A math problem with a unique solution path is fragile: small policy changes can flip the answer. An open-ended creative prompt is robust: the policy can shift substantially without degrading output quality. A globally fixed $\epsilon$ treats both identically, which is either too conservative for robust prompts (wasting gradient signal) or too permissive for fragile ones (allowing collapse). QUATRO makes the trust region adaptive at the query level, sizing $\epsilon$ based on the local curvature or drift characteristics of each prompt. This is especially relevant under non-uniform staleness, where some prompts in a batch may be far more off-policy than others due to response length, difficulty, or sampling temperature differences.

Industrial adoption: Keep Sampling Mask. The DeepSeek-V3.2 technical report provides evidence that masking-based correction has moved from research into production. Their Keep Sampling Mask maintains a shared action subspace across policy variants during RL training: sequences are masked if the sampling probabilities returned by the inference engine diverge too far from what the training engine computes, with sign-dependent treatment (negative-advantage sequences are masked more aggressively). The system uses inference-returned probabilities so the divergence measure captures both numerical mismatch and policy staleness simultaneously.

Off-policy-native algorithms

The methods above all share a premise: off-policyness is a problem to be corrected or constrained. A parallel line of work in early 2026 takes the opposite stance: off-policyness is the natural operating condition for large-scale LLM RL, and the objective should be designed for it from the start rather than patched after the fact.

Variational sequence-level correction. VESPO (Shen et al. 2026) derives a sequence-level reshaping kernel from a variational objective. The standard approach to sequence-level IS uses the raw product $\prod_t \rho_t$, which has correct expectation but catastrophic variance. VESPO instead optimizes a variational bound that allows the reshaping kernel to trade off a small amount of bias for a large reduction in variance. The resulting kernel keeps the semantics of sequence-level correction (matching the unit of reward) while reducing the variance enough for practical async training. VESPO parameterizes the kernel so that the bias is bounded and controllable, giving practitioners an explicit bias-variance knob rather than truncation heuristics.

ESS-guided variance control. VCPO (Huang et al. 2026) operationalizes the effective sample size (ESS) as a runtime control signal rather than just a diagnostic. When ESS drops (indicating that a few samples dominate the gradient), VCPO scales down the learning rate proportionally, preventing the optimizer from amplifying noisy gradients. It also derives a minimum-variance off-policy baseline (OPOB) that weights each sample's contribution by both its squared importance weight and its per-sample gradient norm:

$$ b^*_{\text{OPOB}} = \frac{\sum_i w_i^2 \lVert g_i \rVert^2 R_i}{\sum_i w_i^2 \lVert g_i \rVert^2} $$

This accounts for how much each sample actually influences the parameter update, not just its importance weight. The combination of ESS-guided step-size scaling and variance-optimal baselines makes VCPO one of the most complete "accept and control" approaches. Its diagnostic contribution is equally important: VCPO shows that ESS collapse and gradient-norm spikes are reliable indicators of impending training failure, giving practitioners an actionable early warning signal.
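A sketch of the two quantities involved, the ESS diagnostic and the OPOB baseline from the formula above (hypothetical helpers operating on per-sample scalars):

```python
def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2; collapses toward 1 when a few weights dominate."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2

def opob_baseline(weights, grad_norms, rewards):
    """Minimum-variance off-policy baseline: weights each sample by w_i^2 * ||g_i||^2."""
    num = sum(w * w * g * g * r for w, g, r in zip(weights, grad_norms, rewards))
    den = sum(w * w * g * g for w, g in zip(weights, grad_norms))
    return num / den
```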

Optimal off-policy objectives. OAPL (Ritter et al. 2026) goes further and asks: if the inference policy is always lagged, what is the optimal objective for that setting? Rather than starting from an on-policy objective and adding IS corrections, OAPL derives a regression-based objective that minimizes the squared discrepancy between scaled log-probability ratios and optimal advantage estimates:

$$ \min_\pi \sum_x \sum_{i=1}^{G} \left(\beta \ln \frac{\pi(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)} - (r(x, y_i) - \hat{V}^*(x))\right)^2 $$

This is qualitatively different from standard policy gradient objectives: instead of weighting gradients by advantages, OAPL regresses the log-ratio toward the target advantage. When the lag is structural (the inference engine will always be behind), post-hoc IS corrections to an on-policy objective are suboptimal. OAPL demonstrates that large policy lag can be tolerated if the objective is designed for the off-policy setting from the start. The tradeoff is that computing the target advantage $\hat{V}^*(x)$ on off-policy samples requires a learned critic whose accuracy degrades with the policy gap.

Ratio-variance regularization. R$^2$VPO (Luo et al. 2026) replaces hard ratio constraints (clipping, masking) with a soft regularizer on the variance of the policy ratio:

$$ \mathcal{L}_{\text{R}^2\text{VPO}} = \mathcal{L}_{\text{PG}} + \lambda \cdot \text{Var}(\rho) $$

The motivation is that clipping throws away gradient information (any ratio beyond the clip boundary contributes zero gradient), while a variance penalty allows all ratios to contribute but penalizes configurations where a few extreme ratios dominate. This is a softer version of M2PO's second-moment constraint: instead of hard-bounding $\mathbb{E}[(\log\rho)^2]$, it penalizes the variance of the ratio. The advantage is smoother optimization dynamics, since the constraint never activates discontinuously.

Decoupled positive/negative clipping. DISPO (Karaman et al. 2026) observes that the optimal clip range for positive-advantage updates differs from the optimal range for negative-advantage updates. Positive updates (reinforcing good tokens) benefit from wider clips because the signal is informative, while negative updates (suppressing bad tokens) need tighter clips because over-suppression can collapse entropy. DISPO decouples the two clip boundaries, allowing independent tuning. This is complementary to ASPO's ratio-flipping: ASPO changes the form of the ratio for positive advantages, while DISPO changes the bounds of the clip for each sign.

Learning rate as a correction layer

The methods above all operate on the loss function or the importance weights. Zhang et al. (2026) argue that the optimizer itself is a critical but overlooked part of the mismatch story. As noted in the precision section, a fixed numerical discrepancy can be tolerable early in training and then suddenly become destabilizing as the optimizer enters regions with steep curvature. The same discrepancy that was benign at step 1000 can trigger collapse at step 5000.

The diagnostic signal they identify is response-length surge: a sudden increase in mean response length that precedes collapse by tens of steps. The causal chain is: mismatch introduces a small bias toward longer responses (which have more tokens to accumulate ratio errors), the optimizer amplifies this bias into a length increase, longer responses produce worse ratios, and the loop feeds back. They propose a reactive learning rate scheduler that monitors response length and decays the learning rate when a surge is detected. This does not fix the mismatch itself but prevents the optimizer from amplifying it into collapse.

Mismatch tolerance is not a fixed property of the algorithm. It depends on the optimizer state, the loss landscape, and the training phase. A pipeline that is stable at step 1000 can become unstable at step 5000 with the same mismatch magnitude, simply because the model has moved into a region where the same numerical noise has larger consequences. Learning rate scheduling adds an orthogonal safety layer that the loss-level corrections do not provide.

Advantage estimation and KL regularization

IS-aware advantage estimators

Standard advantage estimators like GAE (Schulman et al. 2016), GRPO (Shao et al. 2024), and RLOO (Ahmadian et al. 2024) assume on-policy data. When the data is off-policy, they produce biased advantage estimates because the baseline statistics (group mean, leave-one-out mean, etc.) are computed under the wrong distribution.

OTB (Optimal Token Baseline, Li et al. 2026) fixes this by computing a variance-optimal per-token baseline under importance weighting:

$$ B_t^* = \frac{\sum_i G_{i,t} W_{i,t}}{\sum_i W_{i,t}} $$

where $G_{i,t}$ is the return and $W_{i,t}$ is a cumulative path-variance proxy that captures the variance contribution of each token position. When IS weights are provided, $W_{i,t}$ is scaled by $\bar{\rho}^2(t)$ (the squared truncated IS ratio) to minimize the MSE of advantage estimates under importance weighting. The advantage is then $\hat{A}_{i,t} = G_{i,t} - B_t^*$, with the IS correction already incorporated into the baseline computation through the variance proxy. OTB is implemented in compute_optimal_token_baseline_advantage() in core_algos.py. The TIR-OTB variant (compute_multi_turn_optimal_token_baseline_advantage()) extends this to multi-turn settings where IS weights must be tracked across conversation turns.
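At a single token position, the baseline is a weighted mean of returns over the group; a minimal sketch (hypothetical helper; the variance proxies stand in for verl's $W_{i,t}$):

```python
def optimal_token_baseline(returns, var_proxies):
    """Variance-optimal baseline B* = sum_i G_i W_i / sum_i W_i at one position.

    With uniform proxies this reduces to the plain group mean; non-uniform
    proxies (e.g. scaled by squared truncated IS ratios) shift the baseline
    toward samples whose gradient contribution is larger.
    """
    num = sum(g * w for g, w in zip(returns, var_proxies))
    den = sum(var_proxies)
    return num / den
```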

verl provides a total of 13 advantage estimators in the ADV_ESTIMATOR_REGISTRY: gae, grpo, grpo_vectorized, grpo_passk, rloo, rloo_vectorized, reinforce_plus_plus, reinforce_plus_plus_baseline, remax, opo, gpg, optimal_token_baseline, and tir_optimal_token_baseline. Of these, only OTB and TIR-OTB are IS-aware.

KL estimator variants

KL regularization appears both as a loss penalty and as a reward shaping term. The choice of KL estimator matters because different estimators have different bias-variance tradeoffs and, more subtly, different gradient properties. Six variants are available:

| Estimator | Formula | Expectation bias | Gradient bias | Variance |
| --- | --- | --- | --- | --- |
| k1 | $\mathbb{E}_q[-\log\rho]$ | Unbiased | Biased | High |
| abs | $\mathbb{E}_q[\lvert\log\rho\rvert]$ | Biased | Biased | Low |
| k2 | $\mathbb{E}_q[\frac{1}{2}(\log \rho)^2]$ | Biased | Unbiased | Low |
| k3 | $\mathbb{E}_q[\rho - \log\rho - 1]$ | Unbiased | Biased | Low |
| k1+ | Straight-through corrected k1 | Unbiased | Unbiased | High |
| k3+ | Straight-through corrected k3 | Unbiased | Unbiased | Low |

The correct estimator depends on where the KL term appears:

  • KL in the loss, naive on-policy: Use k2. Its gradient produces $(\log\rho) \cdot \nabla_\theta \log\pi_\theta$, which is the correct REINFORCE gradient of the KL.
  • KL in the loss, IS-weighted: Use k3. Under importance weighting, k3's gradient yields the exact KL gradient.
  • KL in the reward, with stop-gradient: Use k1. The stop-gradient means the score function picks up k1 as a multiplicative weight.

See Schulman (2020) for the full analysis.

The k1+ and k3+ variants use the straight-through trick to combine the forward-pass value of one estimator with the backward-pass gradient of another:

# Straight-through: backward gradient from `backward_score`,
# forward value from `forward_score`
kl = backward_score - backward_score.detach() + forward_score.detach()

You can, for example, use k3's low-variance forward value for loss computation while retaining k2's unbiased gradient for optimization.

KL regularization appears in two independent paths in verl: as a reward shaping term (subtracted from per-token rewards before advantage estimation, controlled by use_kl_in_reward) and as a loss penalty (added directly to the policy loss, controlled by use_kl_loss). Note: The k1+ and k3+ variants are currently broken in verl. The kl_penalty() function dispatches to kl_penalty_forward() without stripping the + suffix, so kl_penalty_forward("k1+") matches none of its branches and returns None, crashing the straight-through computation. Use plain k1 or k3 until this is fixed.

KL controllers

Two strategies for setting the KL penalty coefficient $\beta$:

Adaptive controller (Ziegler et al. 2019). Adjusts $\beta$ based on the ratio of observed KL to a target:

$$ \beta \leftarrow \beta \cdot \left(1 + \text{clip}\left(\frac{\text{KL}}{\text{target}} - 1, \; -0.2, \; 0.2\right) \cdot \frac{n}{H}\right) $$

where $n$ is the number of samples and $H$ is the horizon. $\beta$ goes up when KL exceeds the target, down when it falls below.
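One update step of the controller, directly from the formula above (hypothetical helper):

```python
def adaptive_kl_update(beta, observed_kl, target, n, horizon):
    """One Ziegler-style adaptive update of the KL coefficient beta.

    The proportional error is clipped to ±0.2 and scaled by n / horizon,
    so beta rises when KL overshoots the target and falls when it undershoots.
    """
    err = observed_kl / target - 1.0
    err = max(-0.2, min(0.2, err))
    return beta * (1.0 + err * n / horizon)
```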

Fixed controller. Sets $\beta = \text{const}$. Simpler to tune and avoids instabilities from the adaptive feedback loop, but needs manual adjustment if training dynamics change.

Systems-level solutions

The previous sections treated mismatch as a statistical problem: measure the divergence, then correct with IS weights or RS masks. But the divergence itself has engineering causes (different kernels, different precisions, different parallelism strategies) and engineering solutions exist that attack the root cause rather than compensating after the fact. The recurring tradeoff: the more aggressively you eliminate mismatch at the systems level, the more you constrain throughput, hardware flexibility, and parallelism. Exact restoration is possible but expensive. Approximate methods are cheaper but need algorithmic correction on top. Production systems land somewhere in between.

Making rollout and training equivalent

The ideal fix is to make the rollout engine and the training engine produce identical outputs for the same weights and inputs, eliminating the first factor of the importance weight decomposition entirely. Four projects attacked this from different angles in 2025.

Bitwise consistency via kernel auditing. Wasti et al. (2025) audited every kernel invocation in the forward pass to achieve bitwise equivalence between TorchTitan (training) and vLLM (inference). In their demo setup, this achieves zero KL between rollout and training log-probabilities, with the mismatch factor literally 1.0 everywhere. The caveat is cost. The bitwise-consistent configuration is slower and more constrained than standard production deployments: you lose the freedom to use different TP sizes, different batch schedulers, or different fused kernels between rollout and training.

Batch-invariant inference. He et al. (2025) from Thinking Machines Lab reframed the problem: most "nondeterminism" in LLM inference is not random GPU concurrency but deterministic dependence on batch composition. When you change the batch size, the reduction order changes, floating-point non-associativity kicks in, and you get different logits for the same input. The fix is batch-invariant kernels that guarantee the same output regardless of what else is in the batch. Batch invariance is the cheapest individual fix: you constrain one dimension (reduction order) rather than the entire engine stack.

FP16 alignment. Qi et al. (2025) make a direct claim: BF16 is the main culprit, and moving both sides to FP16 can nearly eliminate the gap. The extra mantissa precision in FP16 means that accumulation differences between engines produce smaller rounding errors, small enough that the resulting log-probability differences fall below training-relevant thresholds. This is the cleanest "one-line systems fix" — change the dtype and the mismatch largely vanishes. The constraint is FP16's narrower dynamic range, which requires loss scaling and careful attention to overflow.
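The precision argument can be checked without GPU kernels. The sketch below emulates bfloat16 by truncating a float32 bit pattern to its top 16 bits (sign, 8 exponent bits, 7 mantissa bits) and compares it against IEEE half precision via `struct`'s native `e` format. The emulation is illustrative, not a kernel-accurate model of either dtype:

```python
import struct

def to_bf16(x: float) -> float:
    # Emulate bfloat16 (sign + 8 exponent + 7 mantissa bits) by keeping
    # only the top 16 bits of the float32 pattern. Truncation rather than
    # round-to-nearest, to keep the sketch simple.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def to_fp16(x: float) -> float:
    # IEEE half precision (sign + 5 exponent + 10 mantissa bits),
    # via Python's built-in half-float struct format.
    return struct.unpack("<e", struct.pack("<e", x))[0]

x = 1.0 + 2**-8            # needs 8 fractional mantissa bits
print(to_fp16(x) == x)     # True: fp16's 10 mantissa bits represent it exactly
print(to_bf16(x) == x)     # False: bf16's 7 mantissa bits round it to 1.0
```

Near 1.0, fp16 resolves increments of $2^{-10}$ while bf16 resolves only $2^{-7}$, so the same cross-engine accumulation difference lands below fp16's rounding threshold far more often.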

TP-invariant kernels (TBIK). Zhang et al. (2025) targeted the specific case of different tensor-parallel sizes between rollout and training. Their Tree-Based Invariant Kernels fix the reduction tree structure so that partial sums are always computed in the same order regardless of how many GPUs participate. Integrated into both vLLM and FSDP, TBIK eliminates the TP-induced component of mismatch entirely, with modest throughput cost (single-digit percent).

Deterministic inference in SGLang. The SGLang team (2025) operationalized deterministic inference in another major serving stack, covering chunked prefill, CUDA graphs, radix cache, and non-greedy sampling. Their claim is that reproducible RL training can be recovered with modest overhead rather than catastrophic slowdown.

These projects establish a spectrum: at one end, full bitwise consistency (zero mismatch, maximum constraint); at the other, targeted fixes for the worst offenders. Most production systems pick a point in the middle: fix what is cheap to fix, then use algorithmic correction for the residual.

MoE-specific alignment

As discussed in the root causes section, MoE routing mismatch is qualitatively different from dense-model mismatch because expert routing can change which function is computed, producing discontinuous output changes that generic IS cannot cleanly correct.

Routing Replay (R3). Ma et al. (2025) proposed recording the routing distributions used during inference and replaying them during training. This aligns the computational pathway, not just the output probabilities. The cost is storage (routing decisions must be saved and transmitted alongside rollout data) and some loss of training-time router adaptivity.

IcePop. Ling Team (2025) (project page) took a lighter-weight approach: mask or suppress token positions where routing mismatch makes the importance ratio unreliable. Since disagreement is typically sparse (most tokens route consistently), this removes corrupted gradient signal without the overhead of full routing replay.
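A minimal sketch of the masking idea, with an illustrative threshold (the function name and the 0.5-nat cutoff are this post's inventions, not IcePop's published recipe): positions where training and rollout log-probs disagree sharply, typically because a router flipped, are simply dropped from the gradient.

```python
def icepop_style_mask(logp_train, logp_rollout, delta=0.5):
    # Return a 0/1 mask per token: drop positions where training and
    # rollout log-probs disagree by more than `delta` nats. Since routing
    # disagreement is typically sparse, most tokens survive.
    return [1.0 if abs(lt - lr) <= delta else 0.0
            for lt, lr in zip(logp_train, logp_rollout)]

logp_train   = [-0.1, -2.0, -0.5, -4.0]
logp_rollout = [-0.1, -0.3, -0.6, -3.9]   # token 1: a routing flip blew up the gap
print(icepop_style_mask(logp_train, logp_rollout))  # [1.0, 0.0, 1.0, 1.0]
```

Compared with routing replay, nothing extra is stored or transmitted; the cost is discarding the (sparse) corrupted positions rather than repairing them.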

Keep Routing (DeepSeek-V3.2). DeepSeek-V3.2 took routing alignment to production scale, preserving inference-time routing paths and enforcing them during training. Shipping this in a production model confirms that MoE routing mismatch was a real scaling bottleneck.

Industrial adoption. By late 2025, these techniques were deployed at scale. The Ring-1T report (2025) uses IcePop-style masking in trillion-scale MoE RL. The CompassMax-V3-Thinking report consolidates router replay, FP8 rollouts, entropy-group IS (ESPO), and additional filtering heuristics into a single recipe. Routing alignment for MoE is not optional.

Quantized rollout stacks

Quantized rollout is the most explicit form of the mismatch problem: you are deliberately using a cheaper representation for inference than for training. The question is not whether this creates off-policy data — it does, by construction — but whether the throughput gain is worth the correction cost.

FlashRL (Yao et al. 2025) established the framing: 8-bit rollouts should be treated as an off-policy transformation that requires explicit correction, not as an approximation that can be ignored. FP8-RL (Qiu et al. 2026) provided a complete production stack — W8A8 inference with FP8 KV cache, paired with IS correction in verl — and showed that low-precision RL can match BF16 baselines when correction is designed in from day one. Jet-RL (Xi et al. 2026) argued for a unified precision flow: either both sides use FP8 or both use BF16, because mixed precision between the two sides of the RL loop is a hidden source of drift. QuRL (Li et al. 2026) treated quantized rollout as a permanent architectural feature and designed adaptive clipping ranges based on the measured ratio between full-precision and quantized actors.

The quantization literature converges on a single lesson: cheap rollouts are off-policy rollouts, and the cost of correction must be budgeted alongside the throughput gain.

Speculative decoding staleness

Speculative decoding introduces a subtler form of mismatch. In standard speculative decoding, a small "drafter" model proposes candidate tokens and a larger "verifier" model accepts or rejects them. In an RL training loop, both models are being updated — but at different rates and on different schedules. ReSpec (Chen et al. 2025) identified the resulting problem: the drafter can become stale relative to the actor as training proceeds. This means the acceptance rate drops, throughput degrades, and the effective sampling distribution shifts because the drafter's staleness biases which tokens get proposed and verified. This expands the mismatch literature from "one policy versus another" to "multiple coupled policies inside the serving algorithm," each with its own staleness trajectory.

The systems-algorithm interface

The systems and algorithmic approaches to mismatch are complements, not alternatives. Systems solutions reduce the magnitude of the mismatch. Algorithmic solutions manage whatever mismatch remains. Full bitwise consistency eliminates the need for algorithmic correction of the mismatch factor, but it constrains throughput and hardware flexibility. Ignoring systems alignment entirely and relying purely on IS/RS works only when the mismatch is mild — severe mismatch produces IS weights with such high variance that the effective sample size collapses.

The likely equilibrium is hybrid: keep system discrepancy small and measurable (FP16 alignment, batch-invariant kernels, TP-invariant reductions where feasible), then use lightweight algorithmic correction (geometric-mean RS, token-level TIS with modest thresholds) for the residual. For MoE models, add routing replay or masking as a separate layer. For quantized rollouts, budget the IS correction cost alongside the throughput gain. The goal is not zero mismatch — it is controlled, measured, and budgeted mismatch with correction that is cheap enough to be worth applying.
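The effective-sample-size collapse mentioned above is easy to quantify with the Kish formula $\text{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$; a sketch:

```python
def effective_sample_size(weights):
    # Kish effective sample size of a set of importance weights:
    # (sum w)^2 / sum w^2. Equals n for uniform weights and approaches 1
    # when a single weight dominates the batch.
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2

mild   = [0.9, 1.1, 1.0, 1.05]       # near-on-policy: ESS close to n
severe = [0.01, 0.02, 0.01, 40.0]    # one toxic weight dominates: ESS close to 1
print(effective_sample_size(mild) / len(mild))      # ~0.99
print(effective_sample_size(severe) / len(severe))  # ~0.25
```

A batch whose normalized ESS has collapsed toward $1/n$ is effectively a batch of one sample, no matter how many rollouts it contains.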

Async training — speed vs staleness

One-step off-policy

The simplest form of async training overlaps the generation of batch $N$ with the training on batch $N-1$:

Step N:   [---- train(N-1) ----] [sync]
          [---- gen(N) ---------------]
Step N+1: [---- train(N) ------] [sync]
          [---- gen(N+1) -------------]

Weight synchronization uses NCCL broadcast, which completes in under 300ms for 7B-scale models. The rollout weights lag by exactly one training step, mild enough that bypass mode handles it. This gives a 1.2–1.4x speedup over synchronous training (1.24x on FSDP2 + vLLM, up to 1.40x on Megatron + vLLM).

The configuration enables bypass mode and keeps the vLLM cache engine alive between steps:

rollout:
  free_cache_engine: false
  checkpoint_engine:
    backend: "nccl"

The bypass-mode preset is selected through the Python API:

rollout_corr_config = RolloutCorrectionConfig.bypass_ppo_clip()  # or .bypass_pg_geo_rs() for RS

The implementation lives in verl/experimental/one_step_off_policy/.

Fully async pipeline

For maximum throughput, verl supports a fully decoupled pipeline where the rollout worker and trainer are independent Ray actors communicating via a message queue. Each runs on its own clock: the rollout worker continuously generates samples, the trainer continuously consumes them.

Four operating modes, controlled by three parameters:

| Mode | `trigger_parameter_sync_step` | `staleness_threshold` | `partial_rollout` | Behavior |
|---|---|---|---|---|
| On-policy | 1 | 0 | n/a | Sync after every batch; equivalent to synchronous |
| Stream off-policy | >1 | 0 | n/a | Multiple local updates before sync |
| Async stale | $\geq$1 | >0 | false | Allows stale samples; waits for active rollouts to finish |
| Async partial | $\geq$1 | >0 | true | Interrupts active rollouts on sync (best throughput) |

"Async partial" gets the highest throughput by interrupting in-progress rollouts when a weight sync triggers, rather than waiting for them to finish. Interrupted rollouts are saved and resumed after synchronization, so no generated samples are discarded — the pipeline never stalls and no compute is wasted.
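A sketch of the staleness guardrail implied by the table, with hypothetical names (the actual verl bookkeeping differs): each sample records the policy version that generated it, and the trainer admits it only if the version gap is within budget.

```python
def admit_sample(sample_version: int, trainer_version: int,
                 staleness_threshold: int) -> bool:
    # A rollout sample generated under policy version `sample_version` is
    # admitted only if it lags the trainer by at most `staleness_threshold`
    # versions. threshold=0 recovers the on-policy / stream modes;
    # threshold>0 enables the async modes.
    return trainer_version - sample_version <= staleness_threshold

print(admit_sample(5, 5, 0))  # True: generated by the current policy
print(admit_sample(4, 5, 0))  # False: one version stale, zero budget
print(admit_sample(4, 5, 1))  # True: one version stale, within budget
```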

Performance benchmarks on Qwen2.5-Math-7B with H20 GPUs:

| Setup | Sync time | Async time | Speedup |
|---|---|---|---|
| 128 GPU (64+64) | 1d 17h | 17h 22m | 2.35x |
| 64 GPU (32+32) | 1d 18h | 21h 40m | 1.92x |
| 32 GPU (16+16) | 3d 17h | 1d 9h | 2.66x |

Speedup is largest on smaller GPU counts where the synchronous pipeline has the most idle time. Staleness is tracked via dedicated metrics (fully_async/count/stale_samples_processed, rollouter/idle_ratio, etc.) and bounded by staleness_threshold. Implementation is in verl/experimental/fully_async_policy/.

The wider async RL landscape

verl's one-step overlap and fully async pipeline sit within a fast-moving ecosystem of async RL systems for LLMs. The design space has crystallized around three competing philosophies: tolerate staleness with algorithmic correction, reduce staleness through better systems design, or eliminate the need for asynchrony altogether by making synchronous rollout fast enough.

Tolerating staleness. The intellectual starting point is Noukhovitch et al. (2025), which showed that some RLHF objectives remain surprisingly robust under moderate asynchrony. Online DPO proved especially tolerant of stale data, establishing a key principle: the right objective can buy you headroom against policy lag that no amount of system engineering can. AReaL (Fu et al. 2025) built on this by making staleness a first-class system metric rather than a nuisance variable, pairing a staleness-aware PPO variant with worker-level load balancing that tracks how old each sample is. LlamaRL (Wu et al. 2025) arrived at roughly the same time but emphasized infrastructure over algorithms, providing distributed weight synchronization and off-policy-capable training loops.

The late-2025 wave pushed further. ROLL Flash (Lu et al. 2025) introduced an "asynchronous ratio" constraining the policy version gap per sample — an explicit staleness guardrail baked into the system scheduler. A-3PO (Li et al. 2025) tackled a specific algorithmic bottleneck: the proximal policy anchor normally requires a full forward pass, which is expensive for large models. A-3PO approximates it through log-probability interpolation weighted by staleness, eliminating a forward pass while preserving the stabilizing effect. ECHO-2 (Xiao et al. 2026) represents the clearest sign that stale-data management has matured into an exposed engineering knob: it treats bounded policy staleness as a user-controlled parameter with provisioning rules that let operators trade staleness for cost efficiency.

Reducing staleness through better async topology. AsyncFlow (Han et al. 2025) uses streaming queues and overlapped transport so that training can begin consuming partial batches before the full rollout completes, reducing effective lag. DistFlow (Wang et al. 2025) replaces the single-controller architecture with a fully distributed multi-controller paradigm, eliminating the central-node bottleneck that forces some workers to wait (and become stale) while others are busy. Laminar (Sheng et al. 2025) introduced a relay-worker design with a distributed parameter service allowing rollout workers to pull latest weights at any time without stalling the learner, shifting from monolithic sync points to a fluid topology where staleness is minimized per-worker rather than bounded globally.

Eliminating the need for asynchrony. Seer (Qin et al. 2025) focuses on making synchronous rollout fast enough that you do not need to tolerate policy lag at all, using divided rollout, context-aware scheduling, and adaptive grouped speculative decoding to cut long-tail latency by 75-93% and improve throughput by 74-97%. Periodic Asynchrony (Lu 2025) takes a middle path: it decouples inference and training for throughput gains but synchronizes weights after each full batch, preserving on-policy semantics while recovering most of the throughput benefits of async — over 3x on 8B models.

These camps are not mutually exclusive. Production systems increasingly combine elements from each: fast synchronous rollout to minimize baseline lag, async overlap where synchronous execution leaves GPU cycles on the table, and staleness-aware objectives as a safety net for the residual policy gap.

Failure modes and practical guidance

Failure mode catalog

The length trap. Sequence-level IS weights are products of per-token ratios. Even a small per-token bias compounds exponentially with length. A per-token ratio of $\rho_t = 1.1$ (10% discrepancy, which is mild):

  • 10-token sequence: $\prod_t \rho_t = 1.1^{10} \approx 2.6$ — within typical truncation thresholds, kept
  • 50-token sequence: $\prod_t \rho_t = 1.1^{50} \approx 117$ — truncated or rejected
  • 100-token sequence: $\prod_t \rho_t = 1.1^{100} \approx 13{,}780$ — rejected outright

Sequence-level IS and RS disproportionately penalize long sequences. In CoT and agent tasks where correct responses tend to be longer, this biases toward shorter (often wrong) outputs. The fix is geometric-mean RS (seq_mean_k1 or seq_mean_k3), which normalizes by length:

$$ d_{\text{geo}} = \frac{1}{T}\sum_{t=1}^{T} d(t) $$

This makes the rejection criterion independent of sequence length.
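The arithmetic behind these numbers, and the geometric-mean fix, fit in a few lines:

```python
import math

def seq_sum_log_ratio(ratios):
    # Sequence-level log importance weight: the sum of per-token log ratios.
    # Grows linearly in length for any constant per-token bias, so the
    # weight itself grows exponentially.
    return sum(math.log(r) for r in ratios)

def seq_mean_log_ratio(ratios):
    # Geometric-mean variant: the average per-token log ratio,
    # independent of sequence length.
    return seq_sum_log_ratio(ratios) / len(ratios)

for T in (10, 50, 100):
    ratios = [1.1] * T  # mild 10% per-token discrepancy
    print(T, math.exp(seq_sum_log_ratio(ratios)), seq_mean_log_ratio(ratios))
# the product explodes (~2.6, ~117, ~13,780) while the mean stays ~0.095
```

Any length-independent threshold on the mean treats a 10-token and a 100-token sequence with the same per-token discrepancy identically, which is exactly the property the product lacks.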

Toxic tails. Under severe mismatch (e.g., $\chi^2_{\text{token}} \gt 2.0$), the samples with the highest IS weights are often not the most informative. They are artifacts of the ratio computation: tokens where the rollout engine happened to assign much lower probability than the training engine for numerical rather than semantic reasons. Upweighting them amplifies noise, not signal.

In this regime, filtering (RS) is safer than reweighting (IS). Sequence-level RS (masking to exclude outliers) removes the toxic tail entirely, while sequence-level TIS (clipping) still lets corrupted samples influence the gradient, just with reduced weight. When mismatch is severe, prefer RS over IS, or combine both: RS to remove the worst outliers, IS to correct the remaining mild shift.
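A sketch of the difference, with an illustrative cap of 2.0 (not a recommended value): clipping bounds the outlier's influence, masking removes it.

```python
def tis_clip(weights, cap=2.0):
    # Truncated IS: clip each weight at `cap`. Corrupted samples still
    # contribute to the gradient, just with bounded weight.
    return [min(w, cap) for w in weights]

def rs_mask(weights, cap=2.0):
    # Rejection-style masking: zero out samples whose weight exceeds `cap`.
    # The toxic tail is removed from the gradient entirely.
    return [w if w <= cap else 0.0 for w in weights]

weights = [0.8, 1.1, 0.9, 37.0]  # one toxic outlier
print(tis_clip(weights))  # [0.8, 1.1, 0.9, 2.0] -- outlier still in the batch
print(rs_mask(weights))   # [0.8, 1.1, 0.9, 0.0] -- outlier excluded
```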

Practical recommendations

  1. Start with metrics-only mode. Run with RolloutCorrectionConfig.disabled() for the first few hundred steps. Measure the baseline gap: kl, chi2_token, ppl_ratio, and Pearson correlation. If kl < 0.05 and chi2_token < 0.3, you may not need any correction.

  2. Enable RS if seeing outliers. If chi2_token exceeds 0.3 or you observe occasional training spikes, enable geometric-mean RS (bypass_ppo_clip_geo_rs or decoupled_geo_rs). This removes the worst outliers without changing the gradient computation for the remaining samples.

  3. Add IS when comfortable with metrics. If RS alone is insufficient (masked fraction stays high, or KL remains elevated after filtering), add IS weights. Token TIS is the safer default; sequence TIS is more principled but higher variance.

  4. Use bypass for small mismatch, decoupled for significant staleness. If your pipeline is synchronous and the only mismatch source is the precision gap, bypass mode saves a forward pass without losing much accuracy. If you are running async or multi-epoch training where policy staleness is the dominant factor, decoupled mode gives you the three-policy decomposition needed for precise correction.

  5. Watch for the length trap. If you are training on tasks with variable-length outputs (CoT, agents, code generation), always use geometric-mean aggregation for RS. Sequence-sum aggregation will silently filter out your longest — and often best — responses.

Open problems and future directions

A unified objective that is both sequence-correct and low variance. Tokenwise PPO and GRPO optimize a surrogate that does not directly correspond to the sequence-level reward. Replacing them with exact sequence-level importance sampling recovers the correct objective but introduces variance that scales exponentially with sequence length. GSPO, GMPO, VESPO, MinPRO, and TRM each carve out a different point on the bias-variance Pareto frontier — geometric-mean aggregation, prefix ratios, variational reshaping kernels, sequence-level masking — but none has emerged as a consensus replacement. The field needs a principled framework for choosing among these tradeoffs given a task's horizon, reward sparsity, and tolerable staleness level.

Standardized mismatch diagnostics. There is no universally adopted reporting suite for off-policy drift. Some papers report KL divergence, some report importance-ratio moments, some track effective sample size, some flag entropy collapse, and some use response-length surges as an early warning signal. VCPO demonstrated that ESS and gradient-norm tracking reliably predict collapse, while verl and NeMo RL have converged on logging rollout-vs-training log-probs, corrected ratios, and truncation fractions. But these remain framework-specific conventions, not a community standard. A "mismatch card" — analogous to a model card but reporting the off-policy diagnostics of each training run — would make it far easier to compare results across papers and reproduce stability claims.

MoE-specific theory behind practice. Router replay (R3), IcePop, and production keep-routing heuristics (DeepSeek-V3.2) stabilize MoE RL training in practice, but the theoretical story remains thin. MoE introduces discrete computational branching: a small logit perturbation can flip expert selection, producing a discontinuous change in the active parameter subspace. Standard importance-sampling theory assumes continuous density ratios and does not cleanly account for this discrete switching. A theory that bridges IS-based off-policy correction and the discrete routing dynamics of MoE models would put current practice on firmer ground.

Principled approximate equivalence. Bitwise consistency between rollout and training engines is achievable but the resulting systems are slower, more constrained, and harder to maintain than production stacks that tolerate small numerical discrepancies. The realistic target is not zero mismatch but controlled, measured, budgeted mismatch with algorithmic correction layered on top. What the field lacks is a principled framework for setting that budget: given a model size, precision configuration, parallelism strategy, and staleness level, how much distributional discrepancy can the downstream objective absorb before correction overhead exceeds the cost of eliminating the discrepancy at the system level?

Async agent RL is under-theorized. Multi-turn tool use, environment interaction, speculative decoding, and remote execution all create policy lag at multiple timescales. A single agentic trajectory may involve LLM generation (token-level), tool calls (turn-level), and environment feedback (episode-level), each with different latency and staleness characteristics. Turn-level importance sampling (Li et al. 2025) and ReSpec are early attempts, but the theory for agentic stale-policy correction is still thin relative to the needs of production systems deploying RL-trained agents at scale.

Objective design vs system repair. A major fault line runs through the literature. One camp argues that mismatch should be solved by making the system exact: FP16 alignment, bitwise consistency, deterministic kernels, router replay. The other argues that the objective should be redesigned to tolerate drift: OAPL, VCPO, TRM. The evidence so far suggests that large-scale deployments need both — system-level work to keep the discrepancy small and measurable, plus objective-level work to remain stable when it is nonzero. But the interaction between these two layers is poorly understood. Does tighter system consistency make algorithmic correction unnecessary, or does it merely shift the failure mode?

Quantization as mismatch source AND exploration tool. Quantized rollout is the most common form of deliberate training-inference mismatch. QuRL and QeRL show that this quantization gap can both destabilize training (through ratio inflation and long-term drift) and increase entropy and exploration (through quantization noise that broadens the sampling distribution). Principled quantization-noise schedules — analogous to learning rate schedules but controlling the precision-induced exploration bonus — are still in their infancy.

References

[1] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

[2] Li, Y. & Liu, J. (2025). RL Collapse blog series. Part 1, Part 2, Part 3.

[3] Yao, F., et al. (2025). Your Efficient RL Framework Secretly Brings You Off-Policy RL Training. Notion.

[4] Qi, P., et al. (2026). Rethinking the Trust Region in LLM Reinforcement Learning. arXiv:2602.04879.

[5] MiniMax (2025). MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv:2506.13585.

[6] Zheng, C., et al. (2025). GSPO: Group Sequence Policy Optimization. arXiv:2507.18071.

[7] Gao, C., et al. (2025). SAPO: Soft Adaptive Policy Optimization. arXiv:2511.20347.

[8] Zhao, Y., et al. (2025). GMPO: Geometric Mean Policy Optimization. arXiv:2507.20673.

[9] Schulman, J. (2020). Approximating KL Divergence. Blog post.

[10] Ziegler, D. M., et al. (2019). Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593.

[11] Shao, Z., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300.

[12] kalomaze (2026). Don't Exclude Rollouts. Blog post.

[13] verl: An open-source framework for LLM RL. GitHub.

[14] Li, Y., et al. (2026). The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL. arXiv:2602.07078.

[15] Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.

[16] Ahmadian, A., et al. (2024). Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. arXiv:2402.14740.

[17] Zheng, C., et al. (2025). Stabilizing Reinforcement Learning with LLMs: Formulation and Practices. arXiv:2512.01374.

[18] Li, Y., Liu, J., et al. (2025). When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. Notion.

[19] He, H., et al. (2025). Defeating Nondeterminism in LLM Inference. Thinking Machines Lab.

[20] Yuan, J., et al. (2025). Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference. arXiv:2506.09501.

[21] Zhang, Z., et al. (2025). Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch. arXiv:2511.17826.

[22] Ma, W., et al. (2025). Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers (R3). arXiv:2510.11370.

[23] DeepSeek (2025). DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. arXiv:2512.02556.

[24] Qiu, Z., et al. (2026). FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning. arXiv:2601.18150.

[25] Xi, H., et al. (2026). Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow. arXiv:2601.14243.

[26] Li, Y., et al. (2026). QuRL: Efficient Reinforcement Learning with Quantized Rollout. arXiv:2602.13953.

[27] Le Roux, N., et al. (2025). Tapered Off-Policy REINFORCE (TOPR). arXiv:2503.14286.

[28] Wang, J., et al. (2025). ASPO: Asymmetric Importance Sampling Policy Optimization. arXiv:2510.06062.

[29] Lei, S., et al. (2026). MinPRO: A Step Back — Prefix Importance Ratio Stabilizes Policy Optimization. arXiv:2601.22718.

[30] Zheng, H., et al. (2025). Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs? (M2PO). arXiv:2510.01161.

[31] Li, Y., et al. (2025). Trust Region Masking for Long-Horizon LLM Reinforcement Learning (TRM). arXiv:2512.23075.

[32] Lee, D., et al. (2026). QUATRO: Query-Adaptive Trust Region Policy Optimization. arXiv:2602.04620.

[33] Shen, G., et al. (2026). VESPO: Variational Sequence-Level Soft Policy Optimization. arXiv:2602.10693.

[34] Huang, L., et al. (2026). VCPO: Stable Asynchrony — Variance-Controlled Off-Policy RL for LLMs. arXiv:2602.17616.

[35] Ritter, D., et al. (2026). OAPL: LLMs Can Learn to Reason Via Off-Policy RL. arXiv:2602.19362.

[36] Luo, Y., et al. (2026). R²VPO: Ratio-Variance Regularized Policy Optimization. arXiv:2601.03320.

[37] Karaman, B. K., et al. (2026). DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning. arXiv:2602.00983.

[38] Zhang, Y., et al. (2026). Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It. arXiv:2602.01826.

[39] Noukhovitch, M., et al. (2025). Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models. arXiv:2410.18252.

[40] Fu, W., et al. (2025). AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. arXiv:2505.24298.

[41] Wu, B., et al. (2025). LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training. arXiv:2505.24034.

[42] Chen, Q., et al. (2025). ReSpec: Towards Optimizing Speculative Decoding in RL Systems. arXiv:2510.26475.

[43] Li, C., et al. (2025). Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level IS. arXiv:2511.20718.

[44] Wasti, B., et al. (2025). No More Train-Inference Mismatch: Bitwise Consistent On-Policy RL with vLLM and TorchTitan. vLLM Blog.

[45] SGLang Team (2025). Towards Deterministic Inference in SGLang and Reproducible RL Training. LMSYS Blog.

[46] Yao, F., Liu, L., et al. (2025). FlashRL: 8Bit Rollouts, Full Power RL. GitHub.

[47] Han, Z., et al. (2025). AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training. arXiv:2507.01663.

[48] Wang, Z., et al. (2025). DistFlow: A Fully Distributed RL Framework for Scalable LLM Post-Training. arXiv:2507.13833.

[49] Sheng, G., et al. (2025). Laminar: A Scalable Asynchronous RL Post-Training Framework. arXiv:2510.12633.

[50] Lu, H., et al. (2025). Part II: ROLL Flash — Accelerating RLVR and Agentic Training with Asynchrony. arXiv:2510.11345.

[51] Li, X., et al. (2025). A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation. arXiv:2512.06547.

[52] Xiao, J., et al. (2026). ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient RL. arXiv:2602.02192.

[53] Qin, R., et al. (2025). Seer: Online Context Learning for Fast Synchronous LLM RL. arXiv:2511.14617.

[54] Lu, J. & Luo, Y. (2025). Periodic Asynchrony: An On-Policy Approach for Accelerating LLM Reinforcement Learning. arXiv:2511.18871.

[55] Ring-1T (2025). Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model. arXiv:2510.18855.

BibTeX

@article{liu2026sameweights,
  title   = {Off-Policy Drift in LLM RL},
  author  = {Liu, Chris Yuhao},
  year    = {2026},
  month   = {March},
  url     = {https://chrisliu298.ai/posts/off-policy-drift-in-llm-rl/},
  note    = {Blog post}
}