Bilingual Text-to-Motion Generation: A New Benchmark and Baselines

Wanjiang Weng1,*, Xiaofeng Tan1,*, Xiangbo Shu2, Guo-Sen Xie2, Pan Zhou3, Hongsong Wang1
1Southeast University   2Nanjing University of Science and Technology   3Singapore Management University
*Equal Contribution

TL;DR: We present BiHumanML3D, the first bilingual text-to-motion dataset, and BiMD, a model with cross-lingual alignment for effective bilingual motion generation.

Text-to-motion generation, which synthesizes 3D human motions from text descriptions, holds significant promise for cross-linguistic and multicultural applications in gaming, filmmaking, and robotics, particularly for Chinese-language applications. However, the task faces two critical challenges: (1) the lack of publicly available bilingual text-motion datasets, and (2) the limited ability of existing language models to understand and encode motion semantics across languages, owing to highly imbalanced pretraining corpora. To bridge these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed through multi-stage annotation assisted by large language models and followed by dedicated manual correction. We also propose a simple yet effective baseline model, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA) and bilingual diffusion training. CLA explicitly aligns semantic representations between languages, yielding a robust conditional space. As a result, BiMD generates high-quality motions that are accurately aligned with bilingual text inputs, including zero-shot code-switching scenarios. Extensive experiments show that BiMD with CLA achieves an FID of 0.045 (vs. 0.169) and an R@3 of 82.8% (vs. 80.8%), significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D. These results underscore the necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are publicly available at https://github.com/wengwanjiang/BilingualT2M.

Visual comparison of bilingual & monolingual motion generation results.


Visual comparison of bilingual text-to-motion generation. This figure presents motions generated by existing methods, such as MDM and MLD, alongside our Bilingual Motion Diffusion model (BiMD). Observations indicate that existing methods exhibit limitations in bilingual processing, and directly using pretrained multilingual encoders results in imbalanced performance across different languages. Our BiMD with Cross-Lingual Alignment (CLA) successfully generates motion from both English and Chinese textual descriptions.

Framework for training the bilingual motion diffusion model.

We align English and Chinese text embeddings in a shared latent space by freezing the teacher model \( E^t_{\Phi} \) and fine-tuning the student model \( E^s_{\phi} \) with the cross-lingual alignment loss \( \mathcal{L}_{\text{CLA}} \) (Eq. 1). The aligned student model then provides the text conditions for training the diffusion model \( \epsilon_\theta \), which is optimized by minimizing \( \mathcal{L}_{\text{BiMD}} \) (Eq. 6) and thereby learns to generate motion from either language.
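
The alignment step can be illustrated with a toy sketch. Eq. 1 is not reproduced on this page, so the loss below is an assumption: a distillation-style mean squared error that pulls the student's English and Chinese embeddings toward the frozen teacher's English embedding. The function names (`cla_loss`, `sgd_step`, `align`) and the vectors are hypothetical, and the student embeddings are treated as free parameters rather than as outputs of a real text encoder.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cla_loss(teacher_en, student_en, student_zh):
    """Assumed alignment objective (not the paper's Eq. 1 verbatim):
    both student embeddings are pulled toward the teacher's English one."""
    return mse(teacher_en, student_en) + mse(teacher_en, student_zh)


def sgd_step(target, vec, lr):
    """One gradient step on mse(target, vec) with respect to vec only.
    The target (teacher) is never modified, which mirrors freezing it."""
    n = len(vec)
    return [v - lr * 2.0 * (v - t) / n for t, v in zip(target, vec)]


def align(teacher_en, student_en, student_zh, lr=0.5, steps=50):
    """Fine-tune the toy student embeddings against the frozen teacher."""
    for _ in range(steps):
        student_en = sgd_step(teacher_en, student_en, lr)
        student_zh = sgd_step(teacher_en, student_zh, lr)
    return student_en, student_zh


# Toy 4-dimensional embeddings of one sentence pair (made-up values).
teacher_en = [1.0, 0.0, -1.0, 0.5]   # frozen teacher, English input
student_en = [0.8, 0.1, -0.7, 0.4]   # student, English input
student_zh = [-0.3, 0.9, 0.2, -0.6]  # student, Chinese translation

before = cla_loss(teacher_en, student_en, student_zh)
aligned_en, aligned_zh = align(teacher_en, student_en, student_zh)
after = cla_loss(teacher_en, aligned_en, aligned_zh)
print(f"alignment loss before: {before:.4f}, after: {after:.2e}")
```

After alignment, the English and Chinese embeddings of the same description sit close together, so a diffusion model conditioned on either one sees nearly the same condition. In a real setup the gradient would flow into the student encoder's weights rather than into the embeddings directly.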

More visual comparisons of monolingual motion generation results.


Visual comparisons of bilingual text-to-motion generation results. We compare our method with Cross-Lingual Alignment (CLA) against a baseline without CLA, using English and Chinese descriptions. Red text highlights phrases where the baseline's motions deviate from the description, illustrating CLA's effectiveness in bridging the cross-lingual gap. Each caption gives the English description with its Chinese translation in parentheses.

Main Results

Visualizations of Code-Switched Text-to-Motion Generation
(Trained on English, Tested with English-Chinese Code-Switching)

Zero-Shot Visualizations of Text-to-Motion Generation using Chinese Inputs
(Trained on English Data)

Visualizations on Chinese Text to Motion

Visualizations on English Text to Motion