Abstract

Customized video generation aims to produce videos that faithfully preserve the subject's appearance from reference images while maintaining temporally consistent motion from reference videos. Existing methods still struggle to ensure both subject appearance similarity and motion pattern consistency. To address this, we propose SMRABooth, which leverages a DINO encoder and an optical flow encoder to provide object-level subject appearance and motion representations; these representations are aligned with the model during LoRA fine-tuning. Our approach is structured in three core stages: First, we exploit spatial representations extracted from reference images with a self-supervised vision encoder to guide spatial alignment, enabling the model to capture the subject's overall structure and improve high-level semantic consistency. Second, we utilize temporal representations of optical flow from reference videos to capture structurally coherent, object-level motion trajectories that are independent of appearance. Third, we propose a subject-motion association decoupling strategy that applies sparse LoRA injection across both injection locations and timesteps, effectively reducing interference between the subject and motion LoRAs. Extensive experiments show that SMRABooth excels at subject and motion customization, maintaining consistent subject appearance and motion patterns, and demonstrating its effectiveness for controllable text-to-video generation.
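The spatial-alignment stage can be illustrated with a minimal sketch: object-level features of the generated frames (e.g., pooled DINO patch embeddings) are pulled toward the features of the reference image via a cosine-distance loss. The function name and the use of pooled feature vectors here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def alignment_loss(gen_feats, ref_feats):
    """Cosine-distance alignment loss between generated and reference
    object-level features (hypothetical sketch; real features would come
    from a vision encoder such as DINO).

    gen_feats, ref_feats: (N, D) arrays of N feature vectors.
    Returns the mean (1 - cosine similarity) over the N pairs.
    """
    gen = gen_feats / np.linalg.norm(gen_feats, axis=-1, keepdims=True)
    ref = ref_feats / np.linalg.norm(ref_feats, axis=-1, keepdims=True)
    cos = np.sum(gen * ref, axis=-1)
    return float(np.mean(1.0 - cos))

# Identical features give zero loss; orthogonal features give loss 1.
f = np.array([[1.0, 0.0], [0.0, 1.0]])
print(alignment_loss(f, f))        # 0.0
print(alignment_loss(f, f[::-1]))  # 1.0
```

The same loss shape applies to the temporal stage, with optical-flow features in place of image features.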

Overall Framework of SMRABooth

SMRABooth Framework

Overview of SMRABooth. Our method splits customized video generation into two stages: subject learning and motion learning. Subject learning aligns global spatial features from the vision encoder to enhance fidelity, while motion learning utilizes temporal motion representations from the optical flow encoder to guide motion generation. The pretrained video diffusion model remains frozen during training, and LoRAs are merged at inference to generate customized videos. For simplicity, text input is omitted from the figure.
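Because the pretrained video diffusion model stays frozen, each LoRA only stores a low-rank delta, and merging at inference reduces to adding those deltas onto the base weights. The sketch below assumes the standard LoRA parameterization (delta = B @ A); the function and variable names are hypothetical, not SMRABooth's actual code.

```python
import numpy as np

def merge_loras(W, loras, scale=1.0):
    """Merge low-rank LoRA deltas into a frozen base weight at inference.

    W:     (out, in) frozen base weight.
    loras: list of (B, A) pairs with B: (out, r) and A: (r, in).
    Returns W + scale * sum(B @ A); the base weight is left untouched.
    """
    W_merged = W.copy()
    for B, A in loras:
        W_merged += scale * (B @ A)
    return W_merged

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
# One subject LoRA and one motion LoRA, trained in separate stages.
subject_lora = (rng.standard_normal((4, 2)), rng.standard_normal((2, 4)))
motion_lora = (rng.standard_normal((4, 1)), rng.standard_normal((1, 4)))
W_inf = merge_loras(W, [subject_lora, motion_lora])
```

Training the two LoRAs in separate stages (and injecting them sparsely) is what keeps the subject and motion deltas from interfering when they are summed here.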

Customized Video Generation with Both Subject and Motion

DiT-based methods

You can generate videos flexibly with any subject and any motion.

All generated videos are at a resolution of 832 × 480.

Comparisons with baselines

More DiT-based Results

Customized Video Generation with Both Subject and Motion

U-Net-based methods

You can generate videos flexibly with any subject and any motion.

All generated videos are at a resolution of 384 × 384.

Comparisons with baselines

More U-Net-based Results

More diverse video generation results with different subjects and motions.

Subject Customization

You can generate videos flexibly with only subject control.

All generated videos are at a resolution of 832 × 480.

Motion Customization

You can generate videos flexibly with only motion control.

All generated videos are at a resolution of 832 × 480.
