WorldForge

Unlocking Emergent 3D/4D Generation in
Video Diffusion Model via Training-Free Guidance

1 AGI Lab, Westlake University   2 The College of Computing and Data Science, Nanyang Technological University
Email: {songchenxi, yangyanming, zhaotong68, chizhang}@westlake.edu.cn, ruibo001@e.ntu.edu.sg

We propose WorldForge, a training-free framework that unlocks the world-modeling potential of video diffusion models,
delivering controllable 3D/4D generation with unprecedented realism.

More Details

We showcase additional cases to demonstrate the capabilities of our project. For 3D scene generation, we include immersive roaming experiences across diverse scenes, including artworks, AIGC content, portrait photography, city walks, and more. For 4D video re-cam, we demonstrate camera arc rotation, local close-ups, outpainting, viewpoint transfer, and video stabilization. Additionally, we perform video editing tasks including object removal, object addition, face swapping, subject transformation, and try-on applications.

Abstract

Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, their lack of controllability and geometric consistency often necessitates retraining or fine-tuning for 3D/4D tasks, which risks degrading pretrained knowledge and incurs high computational costs. To resolve the tension between generalization and controllability, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement (IRR) introduces a predict–correct loop within each denoising step, creating injection points for trajectory control. Flow-Gated Latent Fusion (FLF) decouples motion from appearance via optical flow similarity, enabling precise and localized control signal injection. Dual-Path Self-Corrective Guidance (DSG) compares guided and unguided denoising paths to adaptively correct trajectory drift caused by structural noise. Together, these components inject fine-grained, trajectory-aligned guidance without retraining, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate our method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.

Method

We propose a general inference-time guidance paradigm that leverages the rich priors of large-scale VDMs in spatial intelligence tasks, such as geometry-aware 3D scene generation and video trajectory control. Our method adopts a warping-and-repainting pipeline, in which input frames are warped along a reference trajectory and then used as conditional inputs in the repainting stage. Building on this, we develop a unified, training-free framework composed of three complementary mechanisms, each designed to address a specific challenge in trajectory-controlled generation.
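To make the warp-and-repaint idea concrete, the sketch below forward-warps a single frame into a novel viewpoint from estimated depth, camera intrinsics, and a relative pose, returning both the warped image and a validity mask whose holes the repainting stage would fill. This is a minimal illustration under a pinhole-camera assumption, not our implementation; the names warp_frame, K, and rel_pose are introduced here for exposition.

```python
# Minimal forward-warping sketch (pinhole camera, nearest-pixel splatting, no z-buffer).
import numpy as np

def warp_frame(frame, depth, K, rel_pose):
    """frame: (H, W, 3) source image; depth: (H, W) per-pixel depth;
    K: (3, 3) intrinsics; rel_pose: (4, 4) transform from source to target camera."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).astype(np.float64)

    # Unproject source pixels into 3D points in the source camera frame.
    pts_src = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)

    # Transform the points into the target camera frame.
    pts_tgt = (rel_pose @ np.vstack([pts_src, np.ones((1, pts_src.shape[1]))]))[:3]

    # Project into the target view and splat colors.
    proj = K @ pts_tgt
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    ok = (proj[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    warped = np.zeros_like(frame)
    mask = np.zeros((H, W), dtype=bool)      # False = hole to be repainted by the VDM
    warped[v[ok], u[ok]] = frame.reshape(-1, 3)[ok]
    mask[v[ok], u[ok]] = True
    return warped, mask
```

The warped frames rendered this way serve as the conditional inputs for the repainting stage, where the video diffusion model restores occluded and missing regions along the target trajectory.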

Method Pipeline

Given a single image or a set of video frames, a vision foundation model reconstructs a scene point cloud, which is warped and rendered along a user-specified trajectory to produce a guidance video. The input image (or first frame) is also converted into a textual prompt and a latent representation for an image-to-video diffusion model. Trajectory control is injected through a training-free strategy comprising IRR, FLF, and DSG (detailed in Sec. 3.2–3.4), enabling precise control and high-quality synthesis without additional training.
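The guidance strategy can be summarized schematically. The sketch below is a hypothetical, simplified rendering of the sampling loop, assuming an x0-prediction denoiser and an EDM-style noise schedule: IRR repeats a predict-correct cycle within each step, FLF gates where the warped guidance latent is injected using a flow-similarity map, and DSG combines guided and unguided predictions (shown here as a fixed-weight blend standing in for the adaptive correction of Sec. 3.4). The names denoise, flow_sim, w_dsg, and n_inner are illustrative, not part of our released code.

```python
# Schematic sketch of the training-free guidance loop (not the actual implementation).
import torch

def guided_sampling(denoise, z_T, z_guide, flow_sim, sigmas, w_dsg=1.5, n_inner=2):
    """denoise(z, sigma) -> predicted clean latent x0.
    z_guide : latent of the warped guidance video.
    flow_sim: per-location motion similarity in [0, 1] used by FLF to gate injection."""
    z = z_T
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]

        for _ in range(n_inner):                          # IRR: predict-correct loop
            x0_free = denoise(z, sigma)                   # unguided prediction
            # FLF: inject warped content only where motion agrees with the trajectory.
            x0_guided = flow_sim * z_guide + (1.0 - flow_sim) * x0_free
            # DSG (schematic): compare guided and unguided paths; the fixed weight here
            # stands in for the adaptive drift correction described in the paper.
            x0 = x0_free + w_dsg * (x0_guided - x0_free)
            # Re-noise to the current level, creating the next injection point.
            z = x0 + sigma * torch.randn_like(z)

        z = x0 + sigma_next * torch.randn_like(z)         # advance to next noise level
    return z

# Toy usage with a dummy denoiser, just to show the call signature.
denoise = lambda z, sigma: z / (1.0 + sigma)
z_T = torch.randn(1, 4, 8, 8)
out = guided_sampling(denoise, z_T, torch.zeros_like(z_T),
                      torch.rand_like(z_T), torch.linspace(1.0, 0.0, 10))
```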

Comparisons

3D Scene Generation from a Single View: Compared with existing SOTA methods, our approach produces more consistent scene content under novel viewpoints, with improved image detail, trajectory accuracy, and structural plausibility.

Dynamic 4D Video Re-Cam: Comparison of 4D trajectory-controlled re-rendering. Baselines often produce implausible artifacts (e.g., flattened faces, floating heads), reflecting limited use of pretrained priors. Our inference-time guidance leverages these latent world priors to re-render realistic, high-quality content along the target trajectory. We compare against state-of-the-art baselines under identical inputs; for ReCamMaster (text-controlled), parameters are adjusted to match the target path.

Ablation Experiments

Ablation of the proposed components. IRR enables trajectory injection; without it, the model defaults to prompt-only free generation, and FLF/DSG cannot be applied. FLF decouples trajectory cues from noisy content; removing it introduces noise from warped frames. DSG guides sampling toward high-quality, trajectory-consistent results; without it, detail and plausibility drop. The full model achieves the best fidelity and control, demonstrating their complementary effects.
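The dependency described above, where IRR provides the injection points that FLF and DSG rely on, can be expressed as a small configuration sketch. The GuidanceConfig class and its fields below are hypothetical names used only for illustration.

```python
# Hypothetical ablation switches reflecting the dependency stated above:
# disabling IRR removes the injection points, so FLF and DSG are disabled with it.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class GuidanceConfig:
    use_irr: bool = True
    use_flf: bool = True
    use_dsg: bool = True

def resolve(cfg: GuidanceConfig) -> GuidanceConfig:
    if not cfg.use_irr:
        # Without IRR the sampler falls back to prompt-only free generation.
        return replace(cfg, use_flf=False, use_dsg=False)
    return cfg

print(resolve(GuidanceConfig(use_irr=False)))
# -> GuidanceConfig(use_irr=False, use_flf=False, use_dsg=False)
```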