ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

Wanjiang Weng1,2*, Xiaofeng Tan1,2*, Junbo Wang3, Guo-Sen Xie4, Pan Zhou5, Hongsong Wang1,2
1Southeast University 2PALM Lab 3Northwestern Polytechnical University
4Nanjing University of Science and Technology 5Singapore Management University
* Equal Contribution. † Corresponding Author.

TL;DR: We propose ReAlign, a plug-and-play reward-guided alignment strategy for text-to-motion generation and retrieval, which explicitly enhances both semantic consistency and motion realism throughout the denoising process.

Abstract

Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diverse and realistic motions. However, a misalignment between text and motion distributions in diffusion models leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model that assesses alignment quality during denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. The reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency with a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments on both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
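To make the sampling strategy concrete, below is a minimal PyTorch sketch of reward-guided denoising under standard DDPM assumptions. It is an illustration only: the names (denoiser, reward_model, guidance_scale) and the linear noise schedule are our placeholders rather than the paper's actual interfaces, and the reward gradient is applied as a classifier-guidance-style shift of the posterior mean.

# Hypothetical sketch of reward-guided sampling; interfaces are illustrative.
import torch

@torch.no_grad()
def reward_guided_sample(denoiser, reward_model, text_emb, shape,
                         num_steps=50, guidance_scale=1.0, device="cpu"):
    """DDPM-style ancestral sampling with an additive reward-gradient term."""
    x = torch.randn(shape, device=device)            # x_T ~ N(0, I)
    betas = torch.linspace(1e-4, 2e-2, num_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    for t in reversed(range(num_steps)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = denoiser(x, t_batch, text_emb)         # predicted noise

        # The step-aware reward needs autograd, so re-enable it locally.
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            r = reward_model(x_in, t_batch, text_emb)   # reward R_t(x_t, c)
            grad = torch.autograd.grad(r.sum(), x_in)[0]

        # Standard DDPM posterior mean, shifted along the reward gradient.
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        mean = mean + guidance_scale * betas[t] * grad

        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

Here guidance_scale plays the role of the density-alignment trade-off described above: larger values push samples further toward high-reward (well-aligned, realistic) motions at the cost of moving away from the model's learned density.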

Poster


Click the poster to view the PDF version

Toy Example

Reward-Guided Sampling

Reward-guided sampling in diffusion-based motion generation. The blue region represents the sampling distribution \( p_t(\cdot) \) learned by the diffusion model, while the green region depicts the ideal sampling distribution \( p_t^I(\cdot) \) achieved by incorporating our proposed reward-guided sampling strategy.
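In a standard reward-guidance formulation (our reading of the figure, not necessarily the paper's exact derivation), the ideal distribution tilts the learned one by a step-aware reward \( R_t \):

\[
p_t^I(x_t \mid c) \;\propto\; p_t(x_t \mid c)\,\exp\!\big(\lambda\, R_t(x_t, c)\big),
\qquad
\nabla_{x_t} \log p_t^I = \nabla_{x_t} \log p_t + \lambda\, \nabla_{x_t} R_t,
\]

so each denoising step follows the reward-shifted score, and \( \lambda \) trades off probability density against text-motion alignment.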

Main Results

Video Visualizations

More Visual Comparisons


Additional visual comparisons on HumanML3D. ReAlign can be integrated into MLD to improve text-motion alignment and enhance motion quality. Red text denotes descriptions inconsistent with the generated motions.