The AI imagery competition is getting tough. Google has unveiled a new challenger to OpenAI’s vaunted DALL-E 2 text-to-image generator, and has taken shots at its rival’s efforts. Both models convert text prompts into pictures.

Google’s researchers claim their system, Imagen, provides “unprecedented photorealism and deep language understanding.”

Figure: Qualitative comparisons between Imagen and DALL-E 2 on DrawBench prompts from the Conflicting category.

Both technologies, in simple terms, let you generate images from text. For this to happen, the AI needs a deep understanding of language… similar to Ultron (just kidding :P). OpenAI was the first to lead this research, with its initial iteration called DALL-E, a name that combines the artist Salvador Dalí and the robot WALL·E from the Pixar movie.

Google Research has developed a competitor for OpenAI’s text-to-image system, with its own AI model that can create artworks using a similar method. Text-to-image AI models are able to understand the relationship between an image and the words used to describe it. 

Once a description is added, a system can generate images based on how it interprets the text, combining different concepts, attributes, and styles. For example, if the description is ‘a photo of a dog’, the system can create an image that looks like a photograph of a dog. But if this description is altered to ‘an oil painting of a dog’, the image generated would look more like a painting. Imagen’s team has shared a number of example images that the AI model has created, ranging from a cute corgi in a house made from sushi to an alien octopus reading a newspaper.

OpenAI created the first version of its text-to-image model, called DALL-E, last year. It unveiled an improved model called DALL-E 2 last month, which it said “generates more realistic and accurate images with four times greater resolution”.

The AI company explained that the model uses a process called diffusion, “which starts with a pattern of random dots and gradually alters that pattern towards an image when it recognises specific aspects of that image”. 
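To make the idea of "starting from random dots and gradually altering them" concrete, here is a toy sketch in Python. It is an illustration under heavy assumptions, not how DALL-E 2 or Imagen actually work. A real diffusion model uses a trained neural network to predict what to remove at each step, guided by the text prompt. This toy version cheats by already knowing the target "image" (a short list of pixel values). The function names `toy_reverse_diffusion` and `mean_abs_error` are made up for this example.

```python
import random


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def toy_reverse_diffusion(target, steps=50, seed=0):
    """Start from random noise and nudge it toward `target` step by step.

    ASSUMPTION: the "denoiser" here simply knows the target. A real model
    would instead use a learned network conditioned on the text prompt.
    """
    rng = random.Random(seed)
    # Start from pure noise: one random value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for t in range(steps, 0, -1):
        step = 1.0 / t                  # cover 1/t of the remaining distance
        noise_scale = (t - 1) / steps   # injected noise shrinks to 0 at the last step
        x = [
            xi + step * (ti - xi) + 0.1 * noise_scale * rng.gauss(0.0, 1.0)
            for xi, ti in zip(x, target)
        ]
    return x


if __name__ == "__main__":
    target = [0.0, 0.5, 1.0, 0.25]      # a hypothetical 4-pixel "image"
    result = toy_reverse_diffusion(target)
    print("error after denoising:", mean_abs_error(result, target))
```

Each pass removes a little of the gap between the noisy pattern and the picture while adding back a smaller and smaller amount of randomness, which is the "gradual" part of the description above.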

In a newly published research paper, the team behind Imagen claims to have made several advances in terms of image generation. 

It says large frozen language models trained only on text data are “surprisingly very effective text encoders” for text-to-image generation. 

It also suggests that scaling up the pretrained text encoder improves sample quality more than scaling up the image diffusion model. To assess and compare different text-to-image models, Google’s research team also created a benchmark called DrawBench.

Using DrawBench, Google’s team said human raters preferred Imagen over other models such as DALL-E 2 in side-by-side comparisons “both in terms of sample quality and image-text alignment”.