Text-image Alignment for Diffusion-based Perception

Published in arXiv preprint, 2023

We use automatically generated captions to improve the text-image alignment of a diffusion backbone in downstream visual tasks such as semantic segmentation, depth estimation, and object detection. Our method also improves the state of the art (SOTA) in both single-domain and cross-domain tasks.

