STANCE: Motion Coherent Video Generation Via Sparse-To-dense ANChored Encoding

Zhifei Chen1,*, Tianshuo Xu1,*, Leyi Wu1,*, Luozhou Wang1, Dongyu Yan1, Zihan You3, Wenting Luo3, Guo Zhang4, Ying-Cong Chen1,2,†
1HKUST(GZ), 2HKUST, 3XMU, 4MIT

* Equal contribution † Corresponding author

Abstract

We present STANCE, a novel approach for generating motion-coherent videos from a single frame with controllable object motion. Our method addresses the challenge of creating physically plausible motion sequences by introducing a sparse-to-dense anchored encoding mechanism that enables precise control over object dynamics while maintaining visual consistency. STANCE tokenizes instance cues with a trainable MLP and employs Dense RoPE for enhanced spatial control, letting users specify motion through 2D arrows, mass properties, and depth deltas. The method generates realistic motion trajectories, handles multi-object scenarios, and maintains physical coherence across diverse scenes. Through extensive experiments on both synthetic and real-world data, we show that STANCE achieves state-of-the-art results in motion-controlled video generation while providing an intuitive user control interface.

Pipeline & Method

STANCE Pipeline

Pipeline of STANCE. Our method is organized as follows: (1) Left: we extend the DiT input with new alpha tokens and use a trainable MLP to tokenize instance cues. (2) Right: the modality embeddings are added to the auxiliary tokens, and the instance-cue tokens are paired with Dense RoPE.
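As a rough illustration of the left half of the pipeline, the sketch below tokenizes per-instance cue vectors with a small trainable MLP and tags the resulting tokens with a modality embedding. The dimensions, the cue layout (arrow dx/dy, mass, depth delta), and all names are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 4-D instance cue (arrow dx, dy, mass, depth delta)
# is mapped into the DiT's token width D_MODEL.
CUE_DIM, HIDDEN, D_MODEL = 4, 64, 128

# Trainable MLP weights (randomly initialized here; learned in practice).
W1 = rng.standard_normal((CUE_DIM, HIDDEN)) * 0.02
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, D_MODEL)) * 0.02
b2 = np.zeros(D_MODEL)

# One learned embedding per auxiliary modality (e.g. alpha vs. instance cue).
modality_emb = rng.standard_normal((2, D_MODEL)) * 0.02

def tokenize_cues(cues, modality_id):
    """Map per-instance cue vectors (N, CUE_DIM) to tokens (N, D_MODEL)."""
    h = np.maximum(cues @ W1 + b1, 0.0)        # hidden layer with ReLU
    tokens = h @ W2 + b2                       # project to model width
    return tokens + modality_emb[modality_id]  # tag with modality embedding

# Two instances: (dx, dy, mass, depth delta).
cues = np.array([[0.3, -0.1, 1.0, 0.5],
                 [-0.2, 0.4, 2.0, -0.3]])
tokens = tokenize_cues(cues, modality_id=1)
print(tokens.shape)  # (2, 128)
```

The cue tokens produced this way would then be concatenated with the DiT's video tokens, which is where the Dense RoPE pairing described on the right of the figure comes in.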

Dense RoPE + Depth

Left: 2D Map vs. Dense RoPE. When the 2D control map is downsampled, many tokens inside the window become zeros, yielding a sparse signal that weakens control. Dense RoPE performs non-zero token extraction over the target region (colored), preserves/assigns positional embeddings (e₁,...,eₙ), and feeds a compact, dense sequence to the model—resulting in stronger, spatially focused control.
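A minimal NumPy sketch of this contrast, under assumed sizes (a 16×16 control map, 4× average pooling): downsampling dilutes the sparse control signal into a few weak, mostly zero tokens, whereas extracting only the non-zero entries together with their original positions yields the compact, dense sequence that Dense RoPE feeds to the model.

```python
import numpy as np

# A sparse 2D control map: zeros everywhere except a small target region.
H = W = 16
control = np.zeros((H, W))
control[3:6, 4:7] = 1.0  # user-specified region: 9 non-zero cells

# Naive 4x average pooling (as when a 2D map is downsampled to token
# resolution): the signal inside each window is diluted toward zero.
pooled = control.reshape(H // 4, 4, W // 4, 4).mean(axis=(1, 3))

# Dense-RoPE-style extraction instead keeps only the non-zero tokens,
# each paired with its original (row, col) position for the rotary
# positional embedding (the e_1..e_n of the figure).
rows, cols = np.nonzero(control)
dense_tokens = control[rows, cols]            # compact value sequence
dense_positions = np.stack([rows, cols], 1)   # one 2D position per token

print(dense_tokens.size, np.all(dense_tokens != 0))
```

Note how the pooled map retains only a couple of weak non-zero cells, while the extracted sequence keeps every control token at full strength along with where it came from.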

Right: Depth control. The upward black arrow is the user-drawn 2D arrow; by manipulating a scalar depth delta (Δ depth on the horizontal axis, [-1,+1]), the user specifies out-of-plane motion: Δ>0 (red) points into the screen (away from the camera), while Δ<0 (blue) points out of the screen (toward the camera).
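A toy sketch of how the user's 2D arrow and the scalar depth delta could combine into a single 3D motion direction. The camera convention (+z pointing into the screen) and the function itself are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def motion_direction(arrow_xy, depth_delta):
    """Combine a screen-space arrow with a depth delta in [-1, +1].

    Assumed convention: +z points into the screen, so depth_delta > 0
    means motion away from the camera and depth_delta < 0 means motion
    toward it. Returns a unit 3D direction (x, y, z).
    """
    d = np.array([arrow_xy[0], arrow_xy[1], depth_delta], dtype=float)
    n = np.linalg.norm(d)
    return d / n if n > 0 else d

# The same upward arrow, pushed into vs. out of the screen:
into = motion_direction((0.0, 1.0), +0.5)   # red case in the figure
outof = motion_direction((0.0, 1.0), -0.5)  # blue case in the figure
print(into[2] > 0, outof[2] < 0)  # True True
```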

Velocity and Mass Variations

These examples show how STANCE generates different motion patterns from the same initial frame by varying the velocity and mass cues.

Each example pairs the initial frame (initial condition) with a generated result: Vary Case 1-1, Vary Case 1-2, and Vary Case 2-1 through Vary Case 2-6.

Multi-Object Scenarios

STANCE's capability in handling complex scenes with multiple interacting objects.

Each example pairs the initial frame (initial condition) with a generated result: Multi-Object Scene 1 through Multi-Object Scene 6.

Composite Scenes

Controllable object interactions in composite scenarios with precise motion dynamics control.

Each example pairs the initial frame (initial condition) with a generated result: Keep Green Static 1 and 2, Slow Collision 1 and 2, and Direction Change 1 and 2.

Real World Captures

STANCE applied to real-world scenarios captured from actual environments, demonstrating the model's capability to handle authentic natural scenes.

Each example pairs the initial frame (initial condition) with a generated result: Real World Scene 1 through Real World Scene 4.

Comparison

Comparative analysis between STANCE and other video generation methods.

Case 1, Case 2, and Case 3: each shows the initial frame, followed by results from Ours, DragAny, SGI2V, MotionPro, and MoFA.

BibTeX

@article{stance2024,
  title={STANCE: Motion Coherent Video Generation Via Sparse-To-dense ANChored Encoding},
  author={Chen, Zhifei and Xu, Tianshuo and Wu, Leyi and Wang, Luozhou and Yan, Dongyu and You, Zihan and Luo, Wenting and Zhang, Guo and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}