TL;DR: We present BiHumanML3D, the first bilingual text-to-motion dataset, and BiMD, a model with cross-lingual alignment for effective bilingual motion generation.
Text-to-motion generation, which synthesizes 3D human motions from text descriptions, holds significant promise for cross-linguistic and multicultural applications in gaming, filmmaking, and robotics, particularly for Chinese-language use cases. However, this task faces two critical challenges: (1) the lack of publicly available bilingual text-motion datasets, and (2) the limited ability of existing language models to understand and encode motion semantics across languages, owing to highly imbalanced pretraining corpora. To bridge these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed through multi-stage annotation assisted by large language models and dedicated manual correction. Moreover, we propose a simple yet effective baseline model, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA) and bilingual diffusion training. CLA explicitly aligns semantic representations between languages, resulting in a robust conditional space. Consequently, BiMD efficiently generates high-quality motions accurately aligned with bilingual text inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and an R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D. These results underscore the necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are publicly available at https://github.com/wengwanjiang/BilingualT2M.
Framework for training the bilingual motion diffusion model.
We align English and Chinese text embeddings in a shared latent space by freezing the teacher model \( E^t_{\Phi} \) and fine-tuning the student model \( E^s_{\phi} \) with the cross-lingual alignment loss \( \mathcal{L}_{\text{CLA}} \) (Eq. 1). The aligned student model then provides the text conditions for the diffusion model \( \epsilon_\theta \), which is trained by minimizing \( \mathcal{L}_{\text{BiMD}} \) (Eq. 6), enabling bilingual motion generation.
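The alignment stage in the caption can be illustrated with a small, self-contained sketch. Eq. 1 is not reproduced in this section, so the loss below is an assumption: a common teacher-student form in which the student's embeddings of an English sentence and of its Chinese counterpart are both pulled toward the frozen teacher's embedding of the English sentence via mean squared error. The names `cla_loss`, `ToyStudent`, and `TEACHER_TABLE`, the toy sentence pairs, and the 3-dimensional embeddings are all hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[str], Vector]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two embeddings of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions must match")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cla_loss(teacher: Encoder, student: Encoder,
             pairs: Sequence[Tuple[str, str]]) -> float:
    """Assumed alignment loss (NOT necessarily the paper's Eq. 1).

    For each (English, Chinese) pair, the frozen teacher's English embedding
    is the target for the student's English and Chinese embeddings.
    """
    total = 0.0
    for en, zh in pairs:
        target = teacher(en)  # teacher is frozen: used only as a fixed target
        total += mse(target, student(en)) + mse(target, student(zh))
    return total / len(pairs)


class ToyStudent:
    """Hypothetical student: one trainable vector per sentence (a lookup table)."""

    def __init__(self, sentences: Sequence[str], dim: int) -> None:
        self.table: Dict[str, Vector] = {s: [0.0] * dim for s in sentences}

    def __call__(self, text: str) -> Vector:
        return self.table[text]

    def step(self, text: str, target: Sequence[float], lr: float) -> None:
        """One gradient-descent step on mse(target, student(text))."""
        vec = self.table[text]
        dim = len(vec)
        # d/dv of mean((v - t)^2) is 2 * (v - t) / dim
        self.table[text] = [v - lr * 2.0 * (v - t) / dim
                            for v, t in zip(vec, target)]


# Illustrative frozen teacher embeddings (made-up values).
TEACHER_TABLE: Dict[str, Vector] = {
    "a person walks forward": [1.0, 0.0, -1.0],
    "a person jumps": [0.0, 2.0, 1.0],
}


def teacher(text: str) -> Vector:
    return TEACHER_TABLE[text]


# Illustrative parallel captions (English, Chinese).
pairs = [
    ("a person walks forward", "一个人向前走"),
    ("a person jumps", "一个人跳跃"),
]

student = ToyStudent([s for pair in pairs for s in pair], dim=3)
loss_before = cla_loss(teacher, student, pairs)

for _ in range(50):
    for en, zh in pairs:
        target = teacher(en)
        student.step(en, target, lr=0.5)  # only the student is updated
        student.step(zh, target, lr=0.5)

loss_after = cla_loss(teacher, student, pairs)
print(f"alignment loss: {loss_before:.4f} -> {loss_after:.2e}")
```

In this toy setting the loss shrinks toward zero, so English and Chinese captions of the same motion end up with nearly identical student embeddings. That is the property a shared conditional space needs before the diffusion model is trained on it. A real student would be a neural text encoder updated by backpropagation rather than a lookup table.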