A text-to-image diffusion model with an unprecedented degree of photorealism
Google researchers are giving DALL-E a run for its money:
Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model.