We present STANCE, a method for generating motion-coherent videos from a single frame with controllable object motion. Our approach addresses the challenge of producing physically plausible motion sequences by introducing a sparse-to-dense anchored encoding mechanism that enables precise control over object dynamics while maintaining visual consistency. STANCE uses a trainable MLP to tokenize instance cues and employs Dense RoPE for stronger spatial control, letting users specify motion through 2D arrows, mass properties, and depth deltas. The method generates realistic motion trajectories, handles multi-object scenarios, and maintains physical coherence across diverse scenes. Through extensive experiments on both synthetic and real-world data, we show that STANCE achieves state-of-the-art results in motion-controlled video generation while providing an intuitive user control interface.
Pipeline of STANCE. Left: we extend the input of the DiT to include new alpha tokens and use a trainable MLP to tokenize instance cues. Right: the modality embeddings are added to the auxiliary tokens, and the instance cue tokens are paired with Dense RoPE.
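To make the instance-cue pathway concrete, here is a minimal PyTorch sketch of what such a cue tokenizer could look like. The cue layout (anchor position, arrow components, mass, depth delta) and all dimensions are our assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class InstanceCueTokenizer(nn.Module):
    """Hypothetical sketch: maps per-instance cues to DiT-width tokens.
    The cue packing and sizes below are assumptions for illustration."""
    def __init__(self, cue_dim: int = 6, hidden_dim: int = 256, token_dim: int = 1152):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cue_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, cues: torch.Tensor) -> torch.Tensor:
        # cues: (batch, num_instances, cue_dim), where each cue might pack
        # [anchor_x, anchor_y, arrow_dx, arrow_dy, mass, depth_delta]
        return self.mlp(cues)  # (batch, num_instances, token_dim)
```

A tokenizer of this shape lets each instance contribute one token to the DiT input sequence, alongside the alpha tokens described above.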
Left: 2D Map vs. Dense RoPE. When the 2D control map is downsampled, many tokens inside the window become zeros, yielding a sparse signal that weakens control. Dense RoPE performs non-zero token extraction over the target region (colored), preserves/assigns positional embeddings (e₁,...,eₙ), and feeds a compact, dense sequence to the model—resulting in stronger, spatially focused control.
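The sparse-to-dense step can be illustrated with a short sketch: keep only the non-zero entries of the downsampled control map and carry their grid coordinates along, so Dense RoPE can assign each surviving token its original positional embedding. The function below is a hypothetical illustration, not the actual implementation.

```python
import torch

def extract_dense_tokens(control_map: torch.Tensor):
    """Hypothetical sketch of non-zero token extraction: return the active
    entries of a (H, W, C) control map together with their grid positions,
    which Dense RoPE can use as positional indices (e_1, ..., e_n)."""
    # A token is "active" if any of its channels is non-zero.
    mask = control_map.abs().sum(dim=-1) > 0           # (H, W)
    ys, xs = torch.nonzero(mask, as_tuple=True)        # coordinates of active tokens
    tokens = control_map[ys, xs]                       # (N, C), compact dense sequence
    positions = torch.stack([ys, xs], dim=-1)          # (N, 2), fed to Dense RoPE
    return tokens, positions
```

Because the zero tokens never enter the sequence, the control signal stays dense and spatially focused even after aggressive downsampling.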
Right: Depth control. The upward black arrow is the user-drawn 2D arrow; by manipulating a scalar depth delta (Δ depth on the horizontal axis, [-1,+1]), the user specifies out-of-plane motion: Δ>0 (red) points into the screen (away from the camera), while Δ<0 (blue) points out of the screen (toward the camera).
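As a rough illustration of how the depth delta could lift a user-drawn 2D arrow into a 3D motion cue, consider the sketch below; the scaling of the out-of-plane component is an assumption made for clarity, not the paper's exact parameterization.

```python
import math

def motion_vector(dx: float, dy: float, depth_delta: float):
    """Hypothetical sketch: combine a 2D arrow (dx, dy) with a scalar
    depth delta in [-1, +1]. Positive delta points into the screen
    (away from the camera), negative toward it."""
    norm = math.hypot(dx, dy)
    # Scale the out-of-plane component by the in-plane arrow length
    # (fall back to unit scale for a pure-depth motion cue).
    dz = depth_delta * (norm if norm > 0 else 1.0)
    return (dx, dy, dz)
```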
Demonstrating how STANCE can generate different motion patterns from the same initial frame.
Initial condition
Vary Case 1-1
Initial condition
Vary Case 1-2
Initial condition
Vary Case 2-1
Initial condition
Vary Case 2-2
Initial condition
Vary Case 2-3
Initial condition
Vary Case 2-4
Initial condition
Vary Case 2-5
Initial condition
Vary Case 2-6
STANCE's capability to handle complex scenes with multiple interacting objects.
Initial condition
Multi-Object Scene 1
Initial condition
Multi-Object Scene 2
Initial condition
Multi-Object Scene 3
Initial condition
Multi-Object Scene 4
Initial condition
Multi-Object Scene 5
Initial condition
Multi-Object Scene 6
Object interactions in composite scenarios with precise control over motion dynamics.
Initial condition
Keep Green Static 1
Initial condition
Keep Green Static 2
Initial condition
Slow Collision 1
Initial condition
Slow Collision 2
Initial condition
Direction Change 1
Initial condition
Direction Change 2
STANCE applied to real-world captures, demonstrating the model's ability to handle authentic natural scenes.
Initial condition
Real World Scene 1
Initial condition
Real World Scene 2
Initial condition
Real World Scene 3
Initial condition
Real World Scene 4
Qualitative comparison between STANCE and other video generation methods.
Initial
Ours
DragAny
SGI2V
MotionPro
MoFA
Initial
Ours
DragAny
SGI2V
MotionPro
MoFA
Initial
Ours
DragAny
SGI2V
MotionPro
MoFA
@article{stance2024,
title={STANCE: Motion Coherent Video Generation via Sparse-To-dense ANChored Encoding},
author={Author Name 1 and Author Name 2 and Author Name 3 and Author Name 4 and Author Name 5},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2024}
}