TL;DR: We introduce the first bilingual text-to-motion dataset, a corresponding model for bilingual text-to-motion generation, and a plug-and-play reward-guided alignment method that further improves generation quality.
Bilingual text-to-motion generation, which synthesizes 3D human motions from bilingual text inputs, holds immense potential for cross-linguistic applications in gaming, film, and robotics. However, this task faces two critical challenges: the absence of bilingual motion-language datasets, and the misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address these challenges, we propose BiHumanML3D, a novel bilingual human motion dataset that establishes a crucial benchmark for bilingual text-to-motion generation models. Furthermore, we propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics, yielding a single unified model for both languages. Building upon this, we propose the Reward-guided sampling Alignment (ReAlign) method, comprising a step-aware reward model that assesses alignment quality during sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. The reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency with a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Experiments demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
 
           
Illustration of the sampling process in diffusion-based motion generation frameworks. The blue region represents the sampling distribution \( p_t(\cdot) \) learned by the diffusion model, while the green region depicts the ideal sampling distribution \( p_t^I(\cdot) \) achieved by incorporating our proposed reward-guided sampling strategy with the sampling distribution \( p_t(\cdot) \).
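
To make the idea of shifting a learned sampling distribution toward a reward-preferred one concrete, the following is a minimal toy sketch and not the ReAlign implementation. It assumes a 1D standard normal as a stand-in for the learned distribution, a hand-written quadratic reward in place of the step-aware reward model, and plain Langevin dynamics in place of the diffusion sampler. The names `base_score`, `reward_grad`, `guided_langevin`, `weight`, and `target` are all hypothetical.

```python
import numpy as np


def base_score(x):
    # Score (gradient of the log-density) of a standard normal.
    # Assumption: this stands in for the score learned by a diffusion model.
    return -x


def reward_grad(x, target):
    # Gradient of the toy reward r(x) = -0.5 * (x - target)**2.
    # Assumption: a real reward model would be a trained network.
    return -(x - target)


def guided_langevin(n_samples=5000, n_steps=2000, step=0.01,
                    weight=1.0, target=2.0, seed=0):
    """Langevin sampling whose drift adds a weighted reward gradient."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    for _ in range(n_steps):
        # Guided drift: base score plus the weighted reward gradient.
        drift = base_score(x) + weight * reward_grad(x, target)
        noise = rng.standard_normal(n_samples)
        x = x + step * drift + np.sqrt(2.0 * step) * noise
    return x


guided = guided_langevin()              # samples pulled toward the reward
unguided = guided_langevin(weight=0.0)  # samples from the base distribution
print("guided mean/var:  ", guided.mean(), guided.var())
print("unguided mean/var:", unguided.mean(), unguided.var())
```

In this toy setting the guided sampler targets a density proportional to \( \exp(-x^2/2) \exp(w\, r(x)) \). For \( w = 1 \) and a target of 2, that density is a Gaussian with mean 1 and variance 1/2, so the guided samples sit between the base distribution and the reward optimum, loosely mirroring the balance between probability density and alignment described above.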