We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for video diffusion transformer (vDiT)-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine the optimal token reduction for each timestep, we further design a search framework that uses a classic evolutionary algorithm to automatically distribute the token budget across timesteps. Together, these components let ASTRAEA achieve up to 2.4x inference speedup on a single GPU and scale well (up to 13.2x speedup on 8 GPUs), while retaining better video quality than state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).
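The idea of distributing a fixed token budget across timesteps with an evolutionary algorithm can be illustrated with a small sketch. This is a minimal, mutation-only evolutionary loop under stated assumptions, not ASTRAEA's actual search procedure: the names `evolve_budget` and `toy_quality`, the per-step weights, and every parameter value are hypothetical, and a real fitness function would score generated video quality rather than a closed-form proxy.

```python
import math
import random


def evolve_budget(num_steps, total_tokens, fitness, min_tokens, max_tokens,
                  pop_size=16, generations=200, seed=0):
    """Search for a per-timestep token schedule that sums to total_tokens.

    Hypothetical helper: assumes the uniform schedule already lies within
    [min_tokens, max_tokens] for every step.
    """
    rng = random.Random(seed)

    # Start from the uniform schedule so the result is never worse than it.
    base, extra = divmod(total_tokens, num_steps)
    uniform = [base + (1 if t < extra else 0) for t in range(num_steps)]

    def mutate(schedule):
        # Move a few tokens from one timestep to another; the total is
        # preserved and both steps stay within their bounds.
        child = list(schedule)
        src, dst = rng.sample(range(num_steps), 2)
        delta = min(rng.randint(1, 8),
                    child[src] - min_tokens,
                    max_tokens - child[dst])
        if delta > 0:
            child[src] -= delta
            child[dst] += delta
        return child

    population = [uniform] + [mutate(uniform) for _ in range(pop_size - 1)]
    for _ in range(generations):
        # Keep the better half (elitism), then refill with mutated survivors.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = [mutate(rng.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)


# Toy stand-in for generation quality (an assumption, not a real metric):
# later timesteps are weighted more, with diminishing returns per token.
weights = [1.0 + 0.2 * t for t in range(10)]


def toy_quality(schedule):
    return sum(w * math.sqrt(s) for w, s in zip(weights, schedule))


best = evolve_budget(num_steps=10, total_tokens=500, fitness=toy_quality,
                     min_tokens=10, max_tokens=100)
```

Because each mutation only moves tokens between two timesteps, every candidate respects the total budget by construction, so no repair or penalty term is needed in this sketch. A crossover operator is omitted for the same reason: naive crossover would break the sum constraint.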
PSNR: 29.21 Speedup: 1.36x
PSNR: 22.45 Speedup: 1.86x
PSNR: 22.35 Speedup: 2.29x
PSNR: 18.21 Speedup: 1.37x
PSNR: 19.77 Speedup: 1.60x
Prompt: a dog running happily
PSNR: 29.37 Speedup: 1.36x
PSNR: 24.95 Speedup: 1.86x
PSNR: 24.29 Speedup: 2.29x
PSNR: 18.55 Speedup: 1.37x
PSNR: 17.26 Speedup: 1.60x
Prompt: A jellyfish floating through the ocean, with bioluminescent tentacles
PSNR: 27.58 Speedup: 1.36x
PSNR: 23.51 Speedup: 1.86x
PSNR: 22.33 Speedup: 2.29x
PSNR: 16.37 Speedup: 1.37x
PSNR: 18.87 Speedup: 1.60x
Prompt: A robot DJ is playing the turntable, in heavy raining futuristic tokyo rooftop cyberpunk night, sci-fi, fantasy
PSNR: 35.00 Speedup: 1.36x
PSNR: 25.76 Speedup: 1.86x
PSNR: 25.28 Speedup: 2.29x
PSNR: 18.28 Speedup: 1.37x
PSNR: 24.08 Speedup: 1.60x
Prompt: A raccoon dressed in suit playing the trumpet, stage background
PSNR: 29.67 Speedup: 1.39x
PSNR: 25.33 Speedup: 1.93x
PSNR: 24.14 Speedup: 2.38x
PSNR: 23.19 Speedup: 1.17x
PSNR: 19.33 Speedup: 1.69x
Prompt: The bund Shanghai, tilt down
PSNR: 30.69 Speedup: 1.39x
PSNR: 25.65 Speedup: 1.93x
PSNR: 24.14 Speedup: 2.38x
PSNR: 23.67 Speedup: 1.17x
PSNR: 22.06 Speedup: 1.69x
Prompt: Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, tilt down
PSNR: 29.14 Speedup: 1.39x
PSNR: 26.07 Speedup: 1.93x
PSNR: 24.46 Speedup: 2.38x
PSNR: 23.18 Speedup: 1.17x
PSNR: 22.68 Speedup: 1.69x
Prompt: a zebra running to join a herd of its kind
PSNR: 24.11 Speedup: 1.46x
PSNR: 23.12 Speedup: 1.91x
PSNR: 18.14 Speedup: 2.35x
PSNR: 18.57 Speedup: 1.23x
PSNR: 16.18 Speedup: 1.69x
Prompt: A couple in formal evening wear going home get caught in a heavy downpour with umbrellas by Hokusai, in the style of Ukiyo
PSNR: 24.03 Speedup: 1.46x
PSNR: 22.46 Speedup: 1.91x
PSNR: 20.74 Speedup: 2.35x
PSNR: 19.48 Speedup: 1.23x
PSNR: 15.69 Speedup: 1.69x
Prompt: an elephant running to join a herd of its kind
PSNR: 22.14 Speedup: 1.46x
PSNR: 20.74 Speedup: 1.91x
PSNR: 18.89 Speedup: 2.35x
PSNR: 18.64 Speedup: 1.23x
PSNR: 12.37 Speedup: 1.69x
Prompt: A happy fuzzy panda playing guitar nearby a campfire, snow mountain in the background
PSNR: 22.89 Speedup: 1.46x
PSNR: 21.67 Speedup: 1.91x
PSNR: 19.77 Speedup: 2.35x
PSNR: 18.24 Speedup: 1.23x
PSNR: 17.56 Speedup: 1.69x
Prompt: botanical garden
@inProceedings{liu2026astraea,
title={Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers},
author={Haosong Liu and Yuge Cheng and Wenxuan Miao and Zihan Liu and Aiyue Chen and Jing Lin and Yiwu Yao and Chen Chen and Jingwen Leng and Minyi Guo and Yu Feng},
year={2026},
booktitle={The Fourteenth International Conference on Learning Representations (ICLR 2026)}
}