1Northeastern University · 2Independent Researcher · 3Harvard & Kempner · 4MIT · †Co-advisors
Dead end in a maze?
Go back to the last fork, not the entrance.
TBS applies this to generative video world models. When a rollout fails, it backtracks to the exact failure point and branches from there, instead of discarding the whole plan and starting over.
Generative video models are emerging as world simulators for sequential decision-making: by extracting action plans from generated rollouts, agents can plan and solve complex tasks entirely in imagination. In this setting, the relevant axis for scaling test-time compute is not aesthetic quality but trajectory correctness.
We identify two empirical bottlenecks that make existing test-time scaling paradigms fail. First, video models commit to a high-level spatial trajectory within the earliest denoising steps — adding compute to later steps cannot rescue a wrong plan. Second, reasoning failures cluster heavily in the early temporal stages — root-level Best-of-N (BoN) wastes its budget by repeatedly re-failing at the same bottleneck, discarding valid upstream progress each time.
We introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis via an iterative generate–verify–restart loop: verify the rollout, isolate the first failure, cache the valid prefix, and generate only the remaining suffix. Across algorithmic reasoning, continuous navigation, and robotic manipulation, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where single-shot generation collapses (0.7% for BoN-20), TBS achieves 22.7% — every solved episode from a restarted branch.
Denoising-step search
The spatial trajectory is committed at the very first denoising step — S1, S5, and S40 all produce the same plan. There is nothing to search over.
✗ Wrong axis entirelyRoot-level Best-of-N
72–78% of failures occur in the first 20% of a rollout. Every independent resample replays the same early bottleneck — discarding the verified-correct remainder each time.
✗ Wastes verified upstream progressOne-shot generation, long horizons
Correctness must hold at every frame simultaneously. On 2D Maze paths of 41+ steps, single-shot BoN-1 reaches only 1.9% — errors compound multiplicatively as horizon grows.
✗ Collapses with horizon length
(a) The same noisy latent at S1, S5, S40: the motion plan is identical throughout — the trajectory is committed immediately. (b) Denoising-step search offers no trajectory exploration. (c) Temporal-step search does, by branching from verified prefixes.
(a) BoN exact match collapses past 40 steps; TBS holds (+42.9 pp at length 41+). (b) On strict OOD Keys-and-Doors, BoN-20 reaches 0.7% while TBS-20 reaches 22.7%. (c) 72–78% of BoN failures occur in the first 20% of the trajectory — root resampling re-fails at the same bottleneck every time.
Cache the valid prefix. Branch from there. Repeat.
BoN (top): one error discards everything; restart from frame 1. TBS (bottom): localize the first failure, cache the verified prefix as a restart anchor in a priority frontier, regenerate only the invalid suffix. Full-prefix conditioning (variable-K) preserves the motion context that a single-frame state reset loses entirely.
Three regimes spanning exact symbolic verification, learned noisy supervision, and closed-loop physics simulation.
| Baseline (BoN) | Ours (TBS) | Gain | ||||
|---|---|---|---|---|---|---|
| Domain | Split | BoN‑1 | BoN‑20 | TBS‑10 | TBS‑20 | TBS‑20 − BoN‑20 |
| 2D Maze | Overall | 51.0 | 79.0 | 84.0 | 90.3 | +11.3 |
| Hard | 18.3 | 60.0 | 68.7 | 81.0 | +21.0 | |
| Sokoban | Overall | 41.5 | 82.0 | 77.7 | 82.7 | +0.7 |
| Hard | 37.0 | 78.3 | 73.0 | 79.7 | +1.4 | |
| Sliding Puzzle | Overall | 39.5 | 82.2 | 80.8 | 86.5 | +4.3 |
| Hard | 41.3 | 81.3 | 81.3 | 88.7 | +7.4 | |
| Keys-and-Doors | ID Overall | 52.3 | 97.7 | 96.5 | 99.2 | +1.5 |
| Strict OOD | 0.0 | 0.7 | 8.3 | 22.7 | +22.0 | |
| Method | ADE ↓ | FDE ↓ | MR (%) ↓ | SE ↑ | AC ↑ | Overall ↑ |
|---|---|---|---|---|---|---|
| BoN-1 | 0.541 | 1.074 | 16.70 | 0.268 | 0.730 | 0.323 |
| BoN-3 | 0.300 | 0.765 | 4.05 | 0.407 | 0.817 | 0.424 |
| TBS-3 | 0.276 | 0.640 | 1.73 | 0.496 | 0.936 | 0.508 |
| BoN-5 | 0.265 | 0.680 | 2.70 | 0.466 | 0.872 | 0.473 |
| TBS-5 | 0.263 | 0.622 | 1.56 | 0.508 | 0.936 | 0.511 |
Verifier noise robustness. TBS holds up under increasing reference-path corruption. It converges toward BoN only under the strongest perturbation (~0.7 m ADE noise), showing the mechanism is noise-tolerant.
Robotic Manipulation (RoboTwin). TBS adds a consistent +4% over matched-budget BoN in clean scenes. When BoN approaches correctly but fails the final grasp (gripper closed, contact mismatch), TBS reuses the verified pre-grasp prefix and regenerates only the contact window (gripper open, contact match).
Each clip shows BoN exhausting its budget and failing, followed by TBS backtracking to the failure point and succeeding from a restarted branch.
BibTeX will be added after the arXiv upload.