Temporal Backtracking Search for Test-time Generative Video Reasoning

Preprint

Sejoon Jun1, Zheng Ding2, Chloe H. Su3, Weirui Ye4,†, Yilun Du3,†

1Northeastern University  ·  2Independent Researcher  ·  3Harvard & Kempner  ·  4MIT  ·  Co-advisors

Dead end in a maze?
Go back to the last fork, not the entrance.

TBS applies this to generative video world models. When a rollout fails, it backtracks to the exact failure point and branches from there, instead of discarding the whole plan and starting over.


Abstract

Generative video models are emerging as world simulators for sequential decision-making: by extracting action plans from generated rollouts, agents can plan and solve complex tasks entirely in imagination. In this setting, the relevant axis for scaling test-time compute is not aesthetic quality but trajectory correctness.

We identify two empirical bottlenecks that make existing test-time scaling paradigms fail. First, video models commit to a high-level spatial trajectory within the earliest denoising steps — adding compute to later steps cannot rescue a wrong plan. Second, reasoning failures cluster heavily in the early temporal stages — root-level Best-of-N (BoN) wastes its budget by repeatedly re-failing at the same bottleneck, discarding valid upstream progress each time.

We introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis via an iterative generate–verify–restart loop: verify the rollout, isolate the first failure, cache the valid prefix, and generate only the remaining suffix. Across algorithmic reasoning, continuous navigation, and robotic manipulation, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where single-shot generation collapses (0.7% for BoN-20), TBS achieves 22.7% — every solved episode from a restarted branch.


Why existing test-time scaling fails

Denoising-step search

The spatial trajectory is committed at the very first denoising step — S1, S5, and S40 all produce the same plan. There is nothing to search over.

✗ Wrong axis entirely

Root-level Best-of-N

72–78% of failures occur in the first 20% of a rollout. Every independent resample replays the same early bottleneck — discarding the verified-correct remainder each time.

✗ Wastes verified upstream progress

One-shot generation, long horizons

Correctness must hold at every frame simultaneously. On 2D Maze paths of 41+ steps, single-shot BoN-1 reaches only 1.9% — errors compound multiplicatively as horizon grows.

✗ Collapses with horizon length
Trajectories are determined by the earliest denoising steps.

(a) The same noisy latent at S1, S5, S40: the motion plan is identical throughout — the trajectory is committed immediately. (b) Denoising-step search offers no trajectory exploration. (c) Temporal-step search does, by branching from verified prefixes.

BoN degrades on long horizons; failures cluster at the start.

(a) BoN exact match collapses past 40 steps; TBS holds (+42.9 pp at length 41+). (b) On strict OOD Keys-and-Doors, BoN-20 reaches 0.7% while TBS-20 reaches 22.7%. (c) 72–78% of BoN failures occur in the first 20% of the trajectory — root resampling re-fails at the same bottleneck every time.


Temporal Backtracking Search

Cache the valid prefix. Branch from there. Repeat.

TBS vs BoN: generate–verify–restart.

BoN (top): one error discards everything; restart from frame 1. TBS (bottom): localize the first failure, cache the verified prefix as a restart anchor in a priority frontier, regenerate only the invalid suffix. Full-prefix conditioning (variable-K) preserves the motion context that a single-frame state reset loses entirely.


Results

Three regimes spanning exact symbolic verification, learned noisy supervision, and closed-loop physics simulation.

Symbolic Reasoning — Exact Match (%)
Baseline (BoN) Ours (TBS) Gain
DomainSplit BoN‑1BoN‑20 TBS‑10TBS‑20 TBS‑20 − BoN‑20
2D Maze Overall 51.0 79.0 84.0 90.3 +11.3
Hard 18.3 60.0 68.7 81.0 +21.0
Sokoban Overall 41.5 82.0 77.7 82.7 +0.7
Hard 37.0 78.3 73.0 79.7 +1.4
Sliding Puzzle Overall 39.5 82.2 80.8 86.5 +4.3
Hard 41.3 81.3 81.3 88.7 +7.4
Keys-and-Doors ID Overall 52.3 97.7 96.5 99.2 +1.5
Strict OOD 0.0 0.7 8.3 22.7 +22.0
Do the gains hold when verification is noisy and learned rather than exact symbolic rules?
Continuous Navigation — Target-Bench
Method ADE ↓FDE ↓MR (%) ↓SE ↑AC ↑Overall ↑
BoN-1 0.5411.07416.700.2680.7300.323
BoN-3 0.3000.7654.050.4070.8170.424
TBS-3 0.2760.6401.730.4960.9360.508
BoN-5 0.2650.6802.700.4660.8720.473
TBS-5 0.2630.6221.560.5080.9360.511
Verifier noise robustness: TBS matches or exceeds BoN at every noise level.

Verifier noise robustness. TBS holds up under increasing reference-path corruption. It converges toward BoN only under the strongest perturbation (~0.7 m ADE noise), showing the mechanism is noise-tolerant.

And when verification is a closed-loop physics simulator — not learned predictions, but ground-truth task success?
RoboTwin: TBS adds +4% in clean scenes by reusing the pre-grasp prefix.

Robotic Manipulation (RoboTwin). TBS adds a consistent +4% over matched-budget BoN in clean scenes. When BoN approaches correctly but fails the final grasp (gripper closed, contact mismatch), TBS reuses the verified pre-grasp prefix and regenerates only the contact window (gripper open, contact match).


Qualitative examples

Each clip shows BoN exhausting its budget and failing, followed by TBS backtracking to the failure point and succeeding from a restarted branch.

2D Maze

Grid-world BoN-20 ✗ TBS-20 ✓

3D Maze

Grid-world BoN-20 ✗ TBS-20 ✓

Sliding Puzzle

Grid-world BoN-20 ✗ TBS-20 ✓

Sokoban

Grid-world BoN-20 ✗ TBS-20 ✓

Keys-and-Doors — Strict OOD

Grid-world BoN-20 = 0.7% TBS-20 ✓ depth 3+

Navigation — Target-Bench

Continuous BoN-5 ✗ TBS-5 ✓

Robotic Manipulation — RoboTwin

Physics sim BoN-5 ✗ TBS-5 ✓

BibTeX

BibTeX will be added after the arXiv upload.