Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production

CVPR 2026
Zhejiang University of Technology
*Indicates corresponding author

Overview


Overall pipeline of FGDM during training and inference. In training, the noisy sequence $X_t$ is generated by adding noise to the target sequence $X_0$ for $t$ steps. FGDM denoises $X_t$ under gloss guidance to predict the initial sequence $\hat{X}_0$. The model comprises two stages: a Focal stage for joint-level dependency modeling and a General stage for global sequence modeling. Beyond the standard regression loss, Semantic Consistent Guidance (SCG) provides auxiliary supervision: the V2S Adapter and Semantic Decoder project visual features into the semantic space, and the SCG loss enforces tighter consistency between the generated sequence and the gloss sequence. During inference, FGDM iteratively denoises random noise to generate sign sequences.
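The training and inference loop described in this caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the FGDM implementation: the cosine noise schedule, the use of mean squared error as the regression loss, the `scg_weight` factor, the deterministic DDIM-style update, and the names `alpha_bar`, `add_noise`, `training_loss`, `sample`, and `denoiser` are all hypothetical choices made for the example. The `denoiser` callable stands in for the Focal and General stages and is expected to return a prediction of $X_0$.

```python
import math
import random

def alpha_bar(t, T, s=0.008):
    """Cumulative signal level at step t under a cosine schedule (an assumption)."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def add_noise(x0, t, T, rng):
    """Draw X_t ~ N(sqrt(a) * X_0, (1 - a) * I) with a = alpha_bar(t, T)."""
    a = alpha_bar(t, T)
    return [[math.sqrt(a) * v + math.sqrt(1 - a) * rng.gauss(0.0, 1.0) for v in frame]
            for frame in x0]

def training_loss(x0, x0_hat, scg_loss, scg_weight=1.0):
    """MSE on the predicted X_0 plus a weighted auxiliary semantic term.

    Both MSE and the weighting are assumptions; the source only says
    "standard regression loss" plus an SCG loss.
    """
    count = sum(len(frame) for frame in x0)
    mse = sum((a - b) ** 2 for fa, fb in zip(x0, x0_hat) for a, b in zip(fa, fb)) / count
    return mse + scg_weight * scg_loss

def sample(denoiser, glosses, shape, T, rng):
    """Start from Gaussian noise and apply T deterministic (DDIM-style) steps."""
    frames, dims = shape
    x = [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(frames)]
    for t in range(T, 0, -1):
        a_t, a_prev = alpha_bar(t, T), alpha_bar(t - 1, T)
        x0_hat = denoiser(x, t, glosses)  # hypothetical stand-in for Focal + General stages
        # Re-noise the predicted X_0 to level t-1, reusing the noise implied by x and x0_hat.
        x = [[math.sqrt(a_prev) * p
              + math.sqrt(1 - a_prev) * (v - math.sqrt(a_t) * p) / math.sqrt(1 - a_t)
              for v, p in zip(fx, fp)]
             for fx, fp in zip(x, x0_hat)]
    return x
```

In this sketch a pose sequence is a list of frames, each a flat list of joint coordinates. Predicting $X_0$ directly, rather than the noise, matches the caption's description of the network output; the sampler then only needs the schedule to move from step $t$ to step $t-1$.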

Abstract

Sign Language Production (SLP) aims to translate spoken language into sign sequences, where the main challenge lies in generating coherent and natural poses from discrete glosses (gloss-to-pose, G2P). Existing G2P methods typically treat each pose as an indivisible unit, limiting their ability to capture fine-grained joint-level dependencies and thus degrading pose quality. To address this, we propose the Focal–General Diffusion Model (FGDM), characterized by a pioneering two-stage denoising framework that harmonizes local joint-level dependencies and global coherence. Specifically, in the Focal stage, a novel Adaptive Sign GCN (ASGCN) adaptively models each pose based on contextual correlations, skeletal topology, and semantic conditions, ensuring precise generation of local details. In the General stage, a Transformer-based module refines the entire pose sequence to enhance global coherence and naturalness. Moreover, we introduce a Semantic Consistent Guidance (SCG) mechanism that integrates semantic supervision into diffusion training, enforcing tighter alignment between generated pose sequences and their intended gloss semantics. Extensive experiments on PHOENIX14T and USTC-CSL demonstrate that FGDM achieves state-of-the-art performance.
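To make "joint-level dependencies" concrete, the snippet below shows one plain graph-convolution step over a skeleton graph. It is a generic GCN layer, not ASGCN: the adaptive, context-dependent and gloss-conditioned parts described in the abstract are not modeled here, and the function names, the three-joint toy skeleton, and the symmetric normalization are assumptions made for illustration.

```python
import math

def normalized_adjacency(num_joints, bones):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
    for i, j in bones:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]

def graph_conv(adj, feats, weight):
    """One propagation step, adj @ feats @ weight, without a nonlinearity."""
    # Mix each joint's features with those of its skeletal neighbours.
    mixed = [[sum(adj[i][k] * feats[k][c] for k in range(len(feats)))
              for c in range(len(feats[0]))]
             for i in range(len(adj))]
    # Apply a linear map shared by all joints.
    return [[sum(row[c] * weight[c][d] for c in range(len(weight)))
             for d in range(len(weight[0]))]
            for row in mixed]

# Toy example: a shoulder-elbow-wrist chain with 2-D joint features.
adj = normalized_adjacency(3, [(0, 1), (1, 2)])
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = graph_conv(adj, feats, identity)
```

Each output row is a weighted blend of a joint and its direct neighbours, so information flows along bones within a single pose. A sequence-level module such as a Transformer would then operate across frames, which is the division of labour the Focal and General stages describe.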

Qualitative Comparison

Comparison of FGDM with state-of-the-art methods on benchmark datasets

Quantitative and qualitative comparison of FGDM with existing state-of-the-art methods on PHOENIX14T and USTC-CSL datasets. Our proposed method consistently outperforms prior approaches across multiple evaluation metrics, demonstrating the effectiveness of the two-stage denoising framework and Semantic Consistent Guidance.

Interactive Sign Language System

Demonstration of our interactive sign language system. The system supports real-time sign language recognition from camera input and sign language production from text/gloss input, enabling bidirectional communication between deaf and hearing users.

Demonstration Videos

Sign language production results on benchmark datasets.

BibTeX

@inproceedings{yu2026focal,
  title={Focal-General Diffusion Model with Semantic Consistent Guidance for Sign Language Production},
  author={Yu, Yiheng and Liu, Sheng and Feng, Yuan and Jin, Zhelun and Jiang, Yining and Xu, Min},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={35915--35925},
  year={2026}
}