We present STANCE, a method for generating motion-coherent videos from a single frame with controllable object motion. Our approach addresses the challenge of producing physically plausible motion sequences by introducing a sparse-to-dense anchored encoding mechanism that enables precise control over object dynamics while maintaining visual consistency. STANCE uses a trainable MLP to tokenize instance cues and employs Dense RoPE for stronger spatial control, letting users specify motion through 2D arrows, mass properties, and depth deltas. The method generates realistic motion trajectories, handles multi-object scenarios, and maintains physical coherence across diverse scenes. Through extensive experiments on both synthetic and real-world data, we show that STANCE achieves state-of-the-art results in motion-controlled video generation while providing an intuitive user control interface.
Pipeline of STANCE. Left: we extend the input of the DiT to include new alpha tokens and use a trainable MLP to tokenize instance cues. Right: the modality embeddings are added to the auxiliary tokens, and the instance cue tokens are paired with Dense RoPE.
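To make the instance-cue pathway concrete, here is a minimal PyTorch sketch of what such a cue tokenizer could look like. The cue layout (anchor position, arrow components, mass, depth delta) and all dimensions are our assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class InstanceCueTokenizer(nn.Module):
    """Hypothetical sketch: maps per-instance cues to DiT-width tokens.
    The cue packing and sizes below are assumptions for illustration."""
    def __init__(self, cue_dim: int = 6, hidden_dim: int = 256, token_dim: int = 1152):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cue_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, cues: torch.Tensor) -> torch.Tensor:
        # cues: (batch, num_instances, cue_dim), where each cue might pack
        # [anchor_x, anchor_y, arrow_dx, arrow_dy, mass, depth_delta]
        return self.mlp(cues)  # (batch, num_instances, token_dim)
```

A tokenizer of this shape lets each instance contribute one token to the DiT input sequence, alongside the alpha tokens described above.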
Left: 2D Map vs. Dense RoPE. When the 2D control map is downsampled, many tokens inside the window become zeros, yielding a sparse signal that weakens control. Dense RoPE performs non-zero token extraction over the target region (colored), preserves/assigns positional embeddings (e₁,...,eₙ), and feeds a compact, dense sequence to the model—resulting in stronger, spatially focused control.
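The sparse-to-dense step can be illustrated with a short sketch: keep only the non-zero entries of the downsampled control map and carry their grid coordinates along, so Dense RoPE can assign each surviving token its original positional embedding. The function below is a hypothetical illustration, not the actual implementation.

```python
import torch

def extract_dense_tokens(control_map: torch.Tensor):
    """Hypothetical sketch of non-zero token extraction: return the active
    entries of a (H, W, C) control map together with their grid positions,
    which Dense RoPE can use as positional indices (e_1, ..., e_n)."""
    # A token is "active" if any of its channels is non-zero.
    mask = control_map.abs().sum(dim=-1) > 0           # (H, W)
    ys, xs = torch.nonzero(mask, as_tuple=True)        # coordinates of active tokens
    tokens = control_map[ys, xs]                       # (N, C), compact dense sequence
    positions = torch.stack([ys, xs], dim=-1)          # (N, 2), fed to Dense RoPE
    return tokens, positions
```

Because the zero tokens never enter the sequence, the control signal stays dense and spatially focused even after aggressive downsampling.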
Right: Depth control. The upward black arrow is the user-drawn 2D arrow; by manipulating a scalar depth delta (Δ depth on the horizontal axis, [-1,+1]), the user specifies out-of-plane motion: Δ>0 (red) points into the screen (away from the camera), while Δ<0 (blue) points out of the screen (toward the camera).
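As a rough illustration of how the depth delta could lift a user-drawn 2D arrow into a 3D motion cue, consider the sketch below; the scaling of the out-of-plane component is an assumption made for clarity, not the paper's exact parameterization.

```python
import math

def motion_vector(dx: float, dy: float, depth_delta: float):
    """Hypothetical sketch: combine a 2D arrow (dx, dy) with a scalar
    depth delta in [-1, +1]. Positive delta points into the screen
    (away from the camera), negative toward it."""
    norm = math.hypot(dx, dy)
    # Scale the out-of-plane component by the in-plane arrow length
    # (fall back to unit scale for a pure-depth motion cue).
    dz = depth_delta * (norm if norm > 0 else 1.0)
    return (dx, dy, dz)
```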
Demonstrating how STANCE can generate different motion patterns from the same initial frame.
Initial condition
Vary Case 1-1
Initial condition
Vary Case 1-2
Initial condition
Vary Case 2-1
Initial condition
Vary Case 2-2
Initial condition
Vary Case 2-3
Initial condition
Vary Case 2-4
Initial condition
Vary Case 2-5
Initial condition
Vary Case 2-6
STANCE's capability to handle complex scenes with multiple interacting objects.
Initial condition
Multi-Object Scene 1
Initial condition
Multi-Object Scene 2
Initial condition
Multi-Object Scene 3
Initial condition
Multi-Object Scene 4
Initial condition
Multi-Object Scene 5
Initial condition
Multi-Object Scene 6
Object interactions in composite scenarios with precise control over motion dynamics.
Initial condition
Keep Green Static 1
Initial condition
Keep Green Static 2
Initial condition
Slow Collision 1
Initial condition
Slow Collision 2
Initial condition
Direction Change 1
Initial condition
Direction Change 2
STANCE applied to real-world captures, demonstrating the model's ability to handle authentic natural scenes.
Initial condition
Real World Scene 1
Initial condition
Real World Scene 2
Initial condition
Real World Scene 3
Initial condition
Real World Scene 4
Qualitative comparison between STANCE and other video generation methods.
Initial
Ours
DragAny
SGI2V
MotionPro
MoFA
Initial
Ours
DragAny
SGI2V
MotionPro
MoFA
Initial
Ours
DragAny
SGI2V
MotionPro
MoFA
@article{stance2024,
title={STANCE: Motion Coherent Video Generation via Sparse-To-dense ANChored Encoding},
author={Author Name 1 and Author Name 2 and Author Name 3 and Author Name 4 and Author Name 5},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2024}
}