Improving the Parameter and Data Efficiency of Text-to-Image
Priors for unCLIP Family Models.
Abstract
Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g., DALL-E-2), achieve state-of-the-art
(SOTA) performance on various compositional T2I benchmarks, at the cost of significant computational
resources.
The unCLIP stack comprises a T2I prior and a diffusion image decoder.
The T2I prior model alone adds a billion parameters compared to Latent Diffusion Models, which
increases the computational and high-quality data requirements.
We introduce ECLIPSE, a novel contrastive
learning method that is both parameter- and data-efficient.
ECLIPSE leverages pre-trained vision-language models (e.g., CLIP)
to distill knowledge into the prior
model.
We demonstrate that the ECLIPSE-trained prior, with only 3.3% of
the parameters and trained on a mere
2.8% of the data, surpasses the baseline T2I priors with an average preference score of 71.6% in the
resource-limited setting.
It also attains performance on par with larger SOTA models, achieving an average preference score of
63.36% in terms of the ability to follow text compositions.
Extensive experiments on two unCLIP diffusion image decoders, Karlo and Kandinsky,
affirm that ECLIPSE consistently delivers high performance while
significantly reducing resource
dependency.
Figure: The ECLIPSE prior compared with its colossal counterparts.
Method
CLIP contrastive learning is enough to achieve a SOTA text-to-image prior without a diffusion process.
This allows us to train a SOTA model with only 33M parameters and 0.6M image-text pairs.
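For illustration, the sketch below shows one way such a non-diffusion prior can be trained: a small network maps pre-computed CLIP text embeddings to predicted CLIP image embeddings, supervised by a projection term plus a CLIP-style contrastive term. The architecture, loss weighting, and embedding dimensions here are illustrative assumptions, not the exact ECLIPSE configuration.

# Minimal PyTorch sketch of a non-diffusion T2I prior trained with a
# projection + CLIP-style contrastive objective (sizes are placeholders).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPrior(nn.Module):
    """Maps CLIP text embeddings to predicted CLIP image embeddings."""
    def __init__(self, dim=768, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, text_emb):
        return self.net(text_emb)

def eclipse_style_loss(pred_img_emb, img_emb, text_emb, temperature=0.07, lam=0.2):
    # Projection: pull the predicted embedding toward the ground-truth CLIP image embedding.
    proj = F.mse_loss(pred_img_emb, img_emb)
    # Contrastive: predicted image embeddings should align with their own captions
    # and repel the other captions in the batch (standard InfoNCE-style loss).
    p = F.normalize(pred_img_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = p @ t.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
    return proj + lam * contrastive

# Usage with pre-computed CLIP features (batch of 8 image-text pairs):
prior = TinyPrior()
text_emb = torch.randn(8, 768)   # placeholder CLIP text embeddings
img_emb = torch.randn(8, 768)    # placeholder CLIP image embeddings
loss = eclipse_style_loss(prior(text_emb), img_emb, text_emb)
loss.backward()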
ECLIPSE Demo
Examples
ECLIPSE (with the Kandinsky v2.2 diffusion image decoder) trained on 5M image-text pairs using only 200 GPU
hours.
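As a rough illustration of how the prior-plus-decoder stack is used at inference time, the sketch below calls the diffusers Kandinsky v2.2 prior and decoder pipelines; an ECLIPSE prior slots into the prior-pipeline position, but the specific ECLIPSE checkpoint id is not given here and the stock Kandinsky prior repo is used as a stand-in.

# Inference sketch with the Kandinsky v2.2 stack from diffusers.
# The prior repo id below is the stock Kandinsky prior, used as a stand-in
# for the released ECLIPSE prior weights.
import torch
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline

prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
decoder = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")

prompt = "a red car to the right of a blue elephant"
prior_out = prior(prompt)  # yields image_embeds / negative_image_embeds
image = decoder(
    image_embeds=prior_out.image_embeds,
    negative_image_embeds=prior_out.negative_image_embeds,
    height=512, width=512,
).images[0]
image.save("eclipse_kandinsky_sample.png")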
BibTeX
@article{patel2023eclipse,
title={ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations},
author={Patel, Maitreya and Kim, Changhoon and Cheng, Sheng and Baral, Chitta and Yang, Yezhou},
journal={arXiv preprint arXiv:2312.04655},
year={2023}
}