ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

1 Department of Computer Science and Engineering, Southeast University, Nanjing, China
2 Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications
3 Singapore Management University
* Equal contribution

TL;DR: We introduce the first bilingual text-to-motion dataset, a corresponding model for bilingual text-to-motion generation, and a plug-and-play reward-guided alignment method that further enhances generation quality.

Bilingual text-to-motion generation, which synthesizes 3D human motions from bilingual text inputs, holds immense potential for cross-linguistic applications in gaming, film, and robotics. However, this task faces two critical challenges: the absence of bilingual motion-language datasets, and the misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address these challenges, we propose BiHumanML3D, a novel bilingual human motion dataset that establishes a crucial benchmark for bilingual text-to-motion generation models. Furthermore, we propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingually aligned representations to capture semantics, thereby achieving a unified bilingual model. Building upon this, we propose a Reward-guided sampling Alignment (ReAlign) method, comprising a step-aware reward model that assesses alignment quality during sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. The reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency with a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Experiments demonstrate that our approach significantly improves text-motion alignment and motion quality compared with existing state-of-the-art methods.
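To make the reward-guided sampling concrete, below is a minimal PyTorch sketch of one guided reverse-diffusion step. The `diffusion.p_mean_variance` call, the `reward_model` signature, and the guidance `scale` are illustrative assumptions rather than the paper's actual API; the shift of the predicted mean by the reward gradient follows the standard classifier-guidance recipe.

```python
import torch

def reward_guided_step(diffusion, reward_model, x_t, t, text_emb, scale=1.0):
    """One reverse-diffusion step nudged by a step-aware reward.

    Hypothetical sketch: `diffusion.p_mean_variance` and `reward_model`
    are stand-ins, not the released implementation.
    """
    # Standard reverse step: predict the mean/variance of p(x_{t-1} | x_t).
    mean, var = diffusion.p_mean_variance(x_t, t, text_emb)

    # Step-aware reward: scores text alignment and realism of the noisy
    # motion at this timestep; we take its gradient w.r.t. the motion.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        r = reward_model(x_in, t, text_emb)            # scalar per sample
        grad = torch.autograd.grad(r.sum(), x_in)[0]

    # Classifier-guidance-style shift: move the mean toward higher reward.
    guided_mean = mean + scale * var * grad
    return guided_mean + var.sqrt() * torch.randn_like(x_t)
```

Scaling the gradient by the step variance lets early, high-noise steps move further toward the reward while late steps make only small refinements, which is one way to realize the density-alignment balance described above.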

Visual comparison of bilingual & monolingual motion generation results.


Two observations stand out: (1) in bilingual motion generation, MDM and MLD struggle to process bilingual inputs; (2) in monolingual motion generation, the generated motions remain misaligned with the input texts. The left figure shows that our BiMD successfully generates motion from both English and Chinese inputs, while the right figure highlights that BiMD, integrated with our ReAlign, mitigates the misalignment issue.

Reward-Guided Sampling in Diffusion-Based Motion Generation.


Illustration of the sampling process in diffusion-based motion generation frameworks. The blue region represents the sampling distribution \( p_t(\cdot) \) learned by the diffusion model, while the green region depicts the ideal sampling distribution \( p_t^I(\cdot) \) obtained by combining our reward-guided sampling strategy with the learned distribution \( p_t(\cdot) \).
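One common way to formalize such an ideal distribution (a sketch in our own notation, not taken from the paper: \( r_t \) is the step-aware reward, \( c \) the text condition, and \( \lambda \) a temperature) is to tilt the learned distribution by the exponentiated reward:

\[
p_t^I(x_t \mid c) \;\propto\; p_t(x_t \mid c)\,\exp\!\bigl(r_t(x_t, c)/\lambda\bigr).
\]

Sampling from this tilted distribution trades off probability density under the diffusion model against the alignment score, matching the balance between density and alignment that the reward model is designed to strike.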

More visual comparison of monolingual motion generation results.


Visual comparison of motion generation results on the HumanML3D dataset. Our BiMD with ReAlign integration improves alignment between text descriptions and generated motions and enhances overall motion quality compared with MLD. Red text denotes descriptions inconsistent with the generated motions.

Main Results

Visualizations on Chinese Text to Motion

Visualizations on English Text to Motion