
Releases: wandb/Hemm

Leaderboard: Rendering prompts with Complex Actions

03 Oct 16:46

The Evaluation Dataset

This leaderboard demonstrates the capability of text-to-image generation models to render prompts that describe complex actions. Each model is evaluated on a set of 716 prompts with complex actions and interactions between objects, such as "The rectangular mirror was hung above the marble sink" or "The brown cat was lying on the blue blanket". The dataset is compiled from the complex_train_action and complex_val_action subsets of the T2I-CompBench dataset. You can find the evaluation dataset here, published as a Weave dataset.
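As a quick reference, the published dataset can be pulled with the Weave Python client roughly as sketched below; the project and dataset names are placeholders rather than the actual published reference:

```python
import weave

# Placeholder entity/project and dataset names; substitute the actual
# Weave reference linked above.
weave.init("my-entity/hemm-leaderboards")

# Fetch the published evaluation dataset and inspect a prompt.
dataset = weave.ref("t2i_compbench_complex_action").get()
print(len(dataset.rows))  # expected: 716 prompts
print(dataset.rows[0])    # e.g. {"prompt": "The brown cat was lying on the blue blanket"}
```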

The Metric for Evaluation

We use a multi-modal LLM-based evaluation metric inspired by Section IV.D of T2I-CompBench++. The metric uses a two-stage prompting strategy with a powerful multi-modal LLM (GPT-4-Turbo).

In the first stage, the MLLM is prompted to describe the generated image with the following system prompt:

You are a helpful assistant meant to describe images in detail. You should pay special attention to the actions, events, objects and their relationships in the image.
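For illustration, this first stage could be implemented with the OpenAI Python client roughly as follows; the helper name and user message are illustrative and may differ from Hemm's actual implementation:

```python
import base64
from openai import OpenAI

client = OpenAI()

STAGE_1_SYSTEM_PROMPT = (
    "You are a helpful assistant meant to describe images in detail. "
    "You should pay special attention to the actions, events, objects "
    "and their relationships in the image."
)

def describe_image(image_path: str) -> str:
    # Encode the generated image so it can be sent inline to the MLLM.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": STAGE_1_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the image in detail."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            },
        ],
    )
    return response.choices[0].message.content
```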

In the second stage, the MLLM is prompted to judge the image concerning the prompt with the following system prompt:

You are a helpful assistant meant to identify the actions, events, objects and their relationships in the image. You have to extract the question, the score, and the explanation from the user's response.

In the user prompt for the second stage, we ask the MLLM to evaluate the image using a comprehensive scoring strategy, also including the description generated in the first stage:

Looking at the image and given a detailed description of the image, evaluate if the text "<IMAGE-GENERATION-PROMPT>" is correctly portrayed in the image.
Give a score from 1 to 5, according to the following criteria:


5: the image accurately portrayed the actions, events and relationships between objects described in the text.
4: the image portrayed most of the actions, events and relationships but with minor discrepancies.
3: the image depicted some elements, but action relationships between objects are not correct.
2: the image failed to convey the full scope of the text.
1: the image did not depict any actions or events that match the text.


Here are some more rules for scoring that you should follow:
1. The shapes, layouts, orientations, and placements of the objects in the image should be realistic and adhere to physical constraints.
    You should deduct 1 point from the score if there are any deformations with respect to the shapes, layouts, orientations, and
    placements of the objects in the image.
2. The anatomy of characters, humans, and animals should also be realistic and adhere to realistic constraints, shapes, and proportions.
    You should deduct 1 point from the score if there are any deformations with respect to the anatomy of characters, humans, and animals
    in the image.
3. The spatial layout of the objects in the image should be consistent with the text prompt. You should deduct 1 point from the score if the
    spatial layout of the objects in the image is not consistent with the text prompt.

Here is a detailed explanation of the image:
---
<IMAGE-DESCRIPTION-FROM-STAGE-1>
---


Provide your analysis and explanation to justify the score.
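Continuing the sketch above, the second stage could look roughly like this, reusing the `client` from the first stage. `JUDGE_PROMPT_TEMPLATE` is a hypothetical constant holding the scoring template shown above with `{prompt}` and `{description}` placeholders, and the score extraction here is deliberately naive:

```python
import re

STAGE_2_SYSTEM_PROMPT = (
    "You are a helpful assistant meant to identify the actions, events, objects "
    "and their relationships in the image. You have to extract the question, "
    "the score, and the explanation from the user's response."
)

def judge_image(image_b64: str, generation_prompt: str, description: str) -> int:
    # Fill the scoring template with the generation prompt and the
    # stage-1 description, then ask the MLLM for a 1-5 score.
    user_prompt = JUDGE_PROMPT_TEMPLATE.format(
        prompt=generation_prompt, description=description
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": STAGE_2_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            },
        ],
    )
    answer = response.choices[0].message.content
    # Naive extraction of the 1-5 score; the actual metric likely uses a
    # structured output schema or a stricter parser.
    match = re.search(r"\b[1-5]\b", answer)
    return int(match.group()) if match else 1
```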

The Leaderboard

| Model | Score |
| --- | --- |
| FLUX.1-schnell | 0.8193 |
| FLUX.1-dev | 0.8061 |
| Stable Diffusion 3 Medium | 0.8061 |
| PixArt Sigma | 0.8011 |
| PixArt Alpha | 0.7606 |
| SDXL-1.0 | 0.748 |
| SDXL-Turbo | 0.7453 |
| Stable Diffusion 2.1 | 0.6961 |
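The per-image judgments described above are on a 1-to-5 scale while the leaderboard values fall in [0, 1], which suggests the scores are normalized and averaged across the 716 prompts. A plausible aggregation, not spelled out in this release, would be:

```python
def leaderboard_score(per_image_scores: list[int]) -> float:
    # Assumed aggregation: map each 1-5 judgment to [0, 1] and average over
    # all evaluation prompts. Inferred from the score ranges, not documented
    # in the release itself.
    return sum(score / 5 for score in per_image_scores) / len(per_image_scores)
```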