Diffusion-Based Action Recognition Generalizes to Untrained Domains
Published as an arXiv preprint, 2025
We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across domain shifts. We find that conditioning the model on earlier timesteps of the diffusion process improves generalization by emphasizing semantic information over pixel-level details in the extracted features. Our model sets a new state of the art across three generalization benchmarks, bringing machine action recognition closer to human-like robustness.
Recommended citation: Rogerio Guimaraes, Frank Xiao, Pietro Perona & Markus Marks. (2025). Diffusion-Based Action Recognition Generalizes to Untrained Domains. https://arxiv.org/abs/2509.08908