Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion

Constructor University, Constructor Tech
* Equal contribution
AAAI 2026

TL;DR: We introduce a mixed diffusion-autoregressive framework that transforms existing diffusion-based sequence generation models into real-time streaming methods, achieving up to a 4× speedup and superior temporal coherence without sacrificing motion quality or diversity.

Abstract

Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce a novel framework for streaming gesture generation that extends Rolling Diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. Our framework is universally compatible with existing diffusion-based gesture generation models, transforming them into streaming methods capable of continuous generation without requiring post-processing. We evaluate our framework on ZEGGS and BEAT, strong benchmarks for real-world applicability. Applied to state-of-the-art baselines on both datasets, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time co-speech gesture synthesis. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that employs a ladder-based noise scheduling strategy to simultaneously denoise multiple frames. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 4× speedup with high visual fidelity and temporal coherence in our experiments. Comprehensive user studies further validate our framework’s ability to generate realistic, diverse gestures closely synchronized with the audio input.

Framework

Framework visualization

In our work, we adapt rolling diffusion models for co-speech gesture generation, introducing a novel framework that transforms any diffusion-based architecture into an autoregressive streaming model. Our approach enables seamless and continuous gesture generation of arbitrary length by modifying the model architecture and integrating a structured noise scheduling mechanism, which, combined with the rolling denoising process, ensures smooth temporal transitions and prevents abrupt motion discontinuities. The model generates a new clean frame in each s-step and shifts the generation window forward to include the new frame at the end.
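
The rolling window described above can be illustrated with a small toy sketch. It is not the model from the paper: `toy_target` and `denoise_step` are hypothetical stand-ins for a learned, audio-conditioned denoiser, the noise levels are assumed to grow linearly across the window, and the warm-up phase is skipped by initializing the window directly at those levels. The sketch only shows the bookkeeping: each step lowers every frame's noise level by one notch, emits the now-clean first frame, and appends a pure-noise frame at the end.

```python
import numpy as np

def toy_target(cond):
    # Hypothetical "clean pose" for each conditioning vector; a real model learns this.
    return np.tanh(cond)

def denoise_step(x, target, t_from, t_to):
    # Toy deterministic update: move each frame from noise level t_from to t_to
    # along the straight line between its current value and its clean target.
    return target + (t_to / t_from)[:, None] * (x - target)

def rolling_generate(cond, window=4, seed=0):
    rng = np.random.default_rng(seed)
    n, dim = cond.shape
    t_from = np.arange(1, window + 1) / window   # assumed levels: 1/W, ..., 1
    t_to = t_from - 1.0 / window                 # levels after one step: 0, ..., (W-1)/W
    # Pad the conditioning so the window can slide past the last real frame.
    targets = toy_target(np.concatenate([cond, np.zeros((window, dim))]))
    # Assumption: skip warm-up and start with frames already at progressive levels.
    noise = rng.standard_normal((window, dim))
    x = (1 - t_from)[:, None] * targets[:window] + t_from[:, None] * noise
    out = []
    for i in range(n):
        x = denoise_step(x, targets[i:i + window], t_from, t_to)
        out.append(x[0].copy())                  # first frame reached level 0: emit it
        fresh = rng.standard_normal((1, dim))    # new frame enters as pure noise
        x = np.concatenate([x[1:], fresh])       # shift the window forward by one
    return np.stack(out)
```

After the shift, the frame at position j holds noise level (j+1)/W again, so the same per-position levels can be reused at every step; this is what lets generation continue for an arbitrary number of frames.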

Rolling Diffusion Ladder Acceleration

Rolling Diffusion Ladder Acceleration visualization

We introduce Rolling Diffusion Ladder Acceleration (RDLA), a novel approach that transforms the original noise schedule into a ladder with step size l, enabling the simultaneous denoising of l frames from the same noise level. This modification allows multiple frames to be jointly denoised in each iteration, accelerating the process. By implementing RDLA, we achieve a substantial reduction in inference time while maintaining high visual fidelity and temporal consistency. Empirical results demonstrate that RDLA accelerates gesture synthesis by up to 4× compared to standard rolling diffusion.
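
As a rough illustration of the ladder idea, the sketch below builds per-frame noise levels in which each block of l consecutive frames shares one level. The function names, the linear spacing of the levels, and the requirement that the window length be divisible by l are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def ladder_levels(window, step):
    # Assumed ladder: blocks of `step` frames share a level; levels rise linearly to 1.
    if window % step != 0:
        raise ValueError("window must be divisible by step in this sketch")
    n_blocks = window // step
    block_levels = np.arange(1, n_blocks + 1) / n_blocks
    return np.repeat(block_levels, step)

def denoiser_calls(n_frames, step):
    # Each denoiser call finishes `step` frames, so count calls with a ceiling division.
    return -(-n_frames // step)

print(ladder_levels(8, 1))   # plain rolling schedule: 8 distinct levels
print(ladder_levels(8, 2))   # ladder with l = 2: 4 levels, 2 frames per level
print(denoiser_calls(100, 1), denoiser_calls(100, 4))
```

In this toy accounting, the number of denoiser calls per generated frame falls by a factor of l, because each call clears a whole block and the window then shifts by l frames. Measured speedups depend on the model and hardware.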

Quantitative Comparison

ZEGGS results
BEAT results

To thoroughly examine the impact of our method, we integrate our progressive noise scheduling technique into multiple baseline models and conduct comparisons across two datasets: ZEGGS and BEAT (Tables 1 and 2). As the primary baselines for our work, we selected state-of-the-art diffusion-based models for gesture generation: Taming Diffusion, DiffuseStyleGesture (DSG), PersonaGestor, and DiffSHEG. To evaluate the quality of our generated gestures, we use the Fréchet distance (FD) and Diversity (Div) metrics. RDLA 2 and RDLA 4 denote our method with ladder step sizes of 2 and 4, respectively.
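
For readers unfamiliar with these metrics, the following sketch shows one common way to compute them from feature vectors. The exact feature extractor and the precise Diversity definition used in the evaluation are not stated on this page, so both functions are assumptions: `frechet_distance` fits a Gaussian to each feature set and applies the standard closed form, and `diversity` is taken here to be the mean pairwise Euclidean distance between features.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    # Frechet distance between Gaussians fitted to two (n_samples, dim) feature sets.
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr(sqrt(cov_a cov_b)) equals the sum of square roots of the product's eigenvalues.
    eigs = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigs, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def diversity(feats):
    # Assumed definition: mean Euclidean distance over all ordered pairs of distinct samples.
    diffs = feats[:, None, :] - feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    n = len(feats)
    return float(dists.sum() / (n * (n - 1)))
```

Lower FD indicates that generated features are distributed more like the reference features, while higher Div indicates more varied output.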

User Study

User Study results

To assess the quality of our generated co-speech gestures, we conducted a user study using pairwise comparisons between our model and a baseline. We selected the ZEGGS dataset for its clear and expressive gestures, which allow for a precise evaluation of movement quality, stylistic consistency, and synchronization with speech. We used the DSG model as a baseline for comparison.

Our rolling modification of DSG significantly outperforms the original DSG, in line with the quantitative evaluation results (left chart). To compare RDLA with our original method, we conducted a second user study between rolling DSG and RDLA with l = 2; the distribution of its results is shown in the right chart. RDLA is only slightly inferior to our original method, which is consistent with the quantitative findings.

Acceleration Findings

On an NVIDIA A40 (48 GB), our rolling DSG and RDLA variants greatly reduce latency without sacrificing throughput, as summarized in Table 3.

Qualitative Comparison

Qualitative Comparison 1 Qualitative Comparison 2 Qualitative Comparison 3

We present a qualitative comparison of representative keyframes and spoken phrases, comparing our method against three baselines: DSG, Taming, and PersonaGestor (columns a–c). In each column, our method is shown in the top row and the baseline in the bottom row. As illustrated, our approach produces more expressive and semantically aligned gestures, such as distinct withdrawal motions in response to negations.

The video visualizations at the top of this page are generated by applying our framework to DSG using audio from the ZEGGS dataset.

BibTeX

@inproceedings{vu2026streaming,
  title={Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion},
  author={Vu, Evgeniia and Boiarov, Andrei and Vetrov, Dmitry},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={40},
  number={31},
  pages={26054--26061},
  year={2026}
}