TL;DR: We present BiHumanML3D, the first bilingual text-to-motion dataset, and BiMD, a model with cross-lingual alignment for effective bilingual motion generation.

Text-to-motion generation, which synthesizes 3D human motions from text descriptions, holds significant promise for cross-linguistic and multicultural applications in gaming, filmmaking, and robotics, particularly for Chinese-language applications. However, this task faces two critical challenges: (1) the lack of publicly available bilingual text-motion datasets, and (2) the limited ability of existing language models to understand and encode motion semantics across languages, owing to highly imbalanced pretraining corpora. To bridge these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed through multi-stage annotation assisted by large language models and dedicated manual correction. Moreover, we propose a simple yet effective baseline model, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA) and bilingual diffusion training. CLA explicitly aligns semantic representations between languages, resulting in a robust conditional space. Consequently, BiMD efficiently generates high-quality motions accurately aligned with bilingual text inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and an R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D. These results underscore the necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are publicly available at https://github.com/wengwanjiang/BilingualT2M.

Visual comparison of bilingual & monolingual motion generation results.

Visual comparison of bilingual text-to-motion generation. This figure presents motions generated by existing methods, such as MDM and MLD, alongside our Bilingual Motion Diffusion model (BiMD). Observations indicate that existing methods exhibit limitations in bilingual processing, and directly using pretrained multilingual encoders results in imbalanced performance across different languages. Our BiMD with Cross-Lingual Alignment (CLA) successfully generates motion from both English and Chinese textual descriptions.

Framework for training the bilingual motion diffusion model.

Framework for training the bilingual motion diffusion model.
We align English and Chinese text embeddings in a shared latent space by freezing the teacher model \( E^t_{\Phi} \) and fine-tuning the student model \( E^s_{\phi} \) with the cross-lingual alignment loss \( \mathcal{L}_{\text{CLA}} \) (Eq. 1). The aligned student model is then used to provide text conditions for training the diffusion model \( \epsilon_\theta \), enabling bilingual motion generation while minimizing \( \mathcal{L}_{\text{BiMD}} \) (Eq. 6).
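The teacher-student setup described in this caption can be illustrated with a small numerical sketch. Eq. 1 is not reproduced on this page, so the loss below is an assumption: a distillation-style mean-squared error that pulls the student's English and Chinese embeddings toward the frozen teacher's English embedding. The linear "encoders", the feature matrices, and the names `cla_loss`, `W_teacher`, and `W_student` are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def cla_loss(teacher_en, student_en, student_zh):
    """Assumed alignment loss: MSE of both student outputs against the teacher's English embedding."""
    en_term = np.mean((student_en - teacher_en) ** 2)
    zh_term = np.mean((student_zh - teacher_en) ** 2)
    return en_term + zh_term

rng = np.random.default_rng(0)

# Hypothetical inputs: feature vectors for 4 paired English/Chinese captions.
feats_en = rng.normal(size=(4, 8))
feats_zh = rng.normal(size=(4, 8))

# Toy linear encoders; the teacher is frozen, the student is trainable.
W_teacher = rng.normal(size=(8, 5))
W_student = rng.normal(size=(8, 5))
W_teacher_before = W_teacher.copy()

# The teacher only ever provides targets and is never updated.
target = feats_en @ W_teacher

history = []
for _ in range(200):
    out_en = feats_en @ W_student
    out_zh = feats_zh @ W_student
    history.append(cla_loss(target, out_en, out_zh))
    # Analytic gradient of the two MSE terms with respect to the student weights.
    n = target.size
    grad = (2.0 / n) * (feats_en.T @ (out_en - target) + feats_zh.T @ (out_zh - target))
    W_student -= 0.05 * grad

print(f"alignment loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

In this toy setting only `W_student` changes, so the shared embedding space is defined entirely by the frozen teacher; the aligned student would then supply the text condition for the diffusion model.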

More visual comparisons of monolingual motion generation results.

Visual comparisons of bilingual text-to-motion generation results, evaluating our method with Cross-Lingual Alignment (CLA) against a baseline without CLA on English and Chinese descriptions. Red text highlights phrases where the baseline motions deviate from the descriptions, underscoring CLA's effectiveness in bridging cross-lingual gaps. Captions give English descriptions with Chinese translations in parentheses.

Main Results

Comparison of text-to-motion generation performance on the HumanML3D dataset

Comparison of text-to-motion generation performance on the HumanML3D dataset. Percentages in subscripts indicate improvements over the respective baselines. Our BiMD adopts a backbone similar to MLD's and surpasses it on all metrics.
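As a rough illustration of how such improvement percentages can be read, the sketch below computes a relative improvement over a baseline value. The relative-change formula is an assumption (the page does not state how the subscripts are derived), and the example numbers are the BiHumanML3D figures quoted in the abstract, not entries from this table. The helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline, ours, lower_is_better):
    """Relative change versus the baseline, signed so that positive means better."""
    if lower_is_better:
        return (baseline - ours) / baseline
    return (ours - baseline) / baseline

# FID: lower is better (0.169 -> 0.045, values from the abstract).
fid_gain = relative_improvement(0.169, 0.045, lower_is_better=True)

# R@3: higher is better (80.8% -> 82.8%, values from the abstract).
r3_gain = relative_improvement(80.8, 82.8, lower_is_better=False)

print(f"FID improvement: {fid_gain * 100:.1f}%")
print(f"R@3 improvement: {r3_gain * 100:.1f}%")
```

Under this assumed definition, the FID reduction corresponds to roughly a 73% relative improvement and the R@3 gain to roughly 2.5%.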

Comparison of text-to-motion generation performance on the KIT-ML dataset

Comparison of text-to-motion generation performance on the BiHumanML3D dataset.
"Lang." indicates the evaluated language. English and Chinese results are evaluated using the T2MT evaluator [Guo2022] and our proposed evaluator, respectively. Bold highlights the best results.
A "✓" under "CLA" denotes methods that employ our cross-lingual alignment representation, meaning a single unified model is used for generation, whereas "✗" denotes methods retrained with XLM as the text encoder.
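The R@3 values reported for this benchmark are retrieval accuracies computed by a learned evaluator. The evaluator itself is not reproduced on this page, so the following is only a minimal sketch of a top-k retrieval score, assuming text and motion embeddings are already available and that matching is done by Euclidean distance within a batch. The function name `r_precision` and the toy embeddings are hypothetical.

```python
import numpy as np

def r_precision(text_emb, motion_emb, k=3):
    """Fraction of texts whose paired motion (same row index) ranks in the top k by Euclidean distance."""
    # dists[i, j] = distance between text i and motion j
    dists = np.linalg.norm(text_emb[:, None, :] - motion_emb[None, :, :], axis=-1)
    top_k = np.argsort(dists, axis=1)[:, :k]
    truth = np.arange(len(text_emb))[:, None]
    hits = (top_k == truth).any(axis=1)
    return float(hits.mean())

# Toy check: perfectly aligned embeddings retrieve every pair.
text = np.eye(4)
aligned_score = r_precision(text, np.eye(4), k=3)

# Shifting the motions by one row makes every nearest neighbor a wrong pair.
shifted = np.roll(np.eye(4), 1, axis=0)
shifted_top1 = r_precision(text, shifted, k=1)

print(aligned_score, shifted_top1)
```

A better-aligned conditional space should raise this score for both languages, which is the kind of gap the "Lang." rows are meant to expose.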

Pipeline for Constructing our Bilingual HumanML3D dataset.

Pipeline for constructing our bilingual HumanML3D dataset. The data collection and filtering process removes unsuitable motions, ensuring high-quality motion-text pairs for annotation. The annotation pipeline begins with an initial translation stage, followed by a refinement stage that addresses translation issues. Finally, human annotators, assisted by an LLM, manually verify and correct the translations, ensuring linguistic and contextual accuracy.

Bilingual Text-to-Motion Generation: A New Benchmark and Baselines

Visualizations of Code-Switched Text-to-Motion Generation
(Trained on English, Tested with English-Chinese Code-Switching)

Zero-Shot Visualizations of Text-to-Motion Generation using Chinese Inputs
(Trained on English Data)

Visualizations on Chinese Text to Motion

Visualizations on English Text to Motion
