Prompts are invisible
Most showcases hide how much prompting, cherry-picking, and human repair went into the result.
New BoilingFrogs lab
CreativeBench will score AI-generated work the way people actually judge it: does it feel fresh, does it solve the brief, can it survive revision, and would you put it in front of a customer, reader, or board?
Why this exists
Creative work fails in the details: cliché, tone drift, visual clutter, weak story logic, or unusable output.
Leaders need to know where creative AI is genuinely useful today, not just where it looks impressive for ten seconds.
First tracks