GPT-Image-2 · OpenAI · Design

GPT-Image-2 is the first image model I'd put near a PRD

For most of the last two years, image generation has been a parlour trick. You typed a prompt, you got a pretty picture, and if the picture had any text in it the text was nonsense. The image-model demo was the thing you showed at a conference. It wasn't the thing you put inside a real workflow.

GPT-Image-2 changed that. The text rendering works. The instruction following works. The edits work. Three things that were each almost-good in different models all crossed the threshold at the same time. The result is the first image model I'd put near a product requirements doc.

I'm a week or two into actually using it. These are early notes, not a finished take.

A quick mental model of how this works

If you've never looked under the hood, image generation in 2026 is two architectural ideas competing:

  • Diffusion models (Stable Diffusion, and reportedly Midjourney through v6): start from pure random noise and gradually denoise it over roughly 30 steps, conditioned on your prompt. Each step nudges the whole canvas toward something that matches the text. Diffusion is great at texture, lighting, and "vibes" because it's literally sculpting from noise, but it has historically been bad at exact structure. In this simplified picture, each step mostly refines local detail everywhere at once rather than committing to one shape at a time. A glyph like the letter "e" is a precise structural commitment, and that is the kind of thing gradual denoising tends to smear.
  • Autoregressive image models (GPT-Image-2, Google's nano-banana): treat the image as a sequence of tokens and predict them one at a time, the same way a language model predicts the next word. Because the model has the full sequence so far when picking the next token, it can plan ahead — "the next 40 tokens need to spell W-E-L-C-O-M-E" — in a way diffusion structurally can't.
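The control-flow difference between the two bullets can be sketched in a few lines. This is a toy, not either real architecture: the "denoiser" and the "predictor" are stand-in arithmetic, and the function names and defaults are made up for illustration. The only thing it is meant to show is where the loop runs: over a fixed number of steps for diffusion, and over every token for the autoregressive case.

```python
import random


def diffusion_generate(steps=30, size=16, seed=0):
    """Toy diffusion loop: each step updates the whole canvas in one pass."""
    rng = random.Random(seed)
    canvas = [rng.gauss(0.0, 1.0) for _ in range(size)]  # start from noise
    model_calls = 0
    for _ in range(steps):
        # Stand-in "denoiser": shrink every value a little, all at once.
        canvas = [0.8 * value for value in canvas]
        model_calls += 1
    return canvas, model_calls


def autoregressive_generate(num_tokens=16, vocab=256, seed=0):
    """Toy autoregressive loop: one call per token, each seeing the prefix."""
    rng = random.Random(seed)
    tokens = []
    model_calls = 0
    for _ in range(num_tokens):
        # Stand-in "predictor": the next token depends on everything so far.
        context = sum(tokens)
        tokens.append((context + rng.randrange(vocab)) % vocab)
        model_calls += 1
    return tokens, model_calls


if __name__ == "__main__":
    _, diffusion_calls = diffusion_generate(steps=30, size=1024)
    _, ar_calls = autoregressive_generate(num_tokens=1024)
    print("diffusion model calls:", diffusion_calls)
    print("autoregressive model calls:", ar_calls)
```

In the toy, the diffusion call count stays at the step count however large the canvas gets, while the autoregressive call count grows with the number of tokens. That is also the shape of the latency trade-off.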

That difference is why the text-rendering problem cracked at the same time the architecture changed. It wasn't a fix bolted onto diffusion; it's a different model class that happens to be good at the thing diffusion was bad at.

The text-rendering problem was the bottleneck

Older image models couldn't write. If you asked for a sign that said Welcome, you got a sign that said WelconE or Wlecome, with a half-formed letter floating next to it. This sounds like a small thing. It's not. It meant you couldn't generate any image that contained UI, logos, signage, screenshots, or anything text-shaped — which excluded most of the images a software company actually needs.

GPT-Image-2 writes. Not perfectly, but reliably enough that I trust it for short strings: button labels, headlines, brand names, room signs. The first time I generated a mock landing page where the headline read exactly the headline I'd asked for, I sat there for a minute. That had never been possible before.

The cost of that capability is latency. A 1024×1024 image takes me 18 to 25 seconds end-to-end on the API, versus 3 to 5 seconds for a comparable SDXL diffusion call. Predicting tokens one at a time is slower than running a fixed number of denoising passes that each update the whole image at once. That's the trade. For my workflows (mocks and assets, not real-time generation) it's the right side of the trade. For something like a live avatar generator, diffusion still wins.
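To put those numbers in workflow terms, here is a back-of-envelope calculation. The latency ranges are the ones I measured above; the batch size of twenty and the assumption that images are generated strictly one after another are illustrative choices, not anything the API imposes.

```python
def minutes_for_batch(n_images, seconds_per_image):
    """Wall-clock minutes to generate n_images one after another."""
    return n_images * seconds_per_image / 60


def images_per_hour(seconds_per_image):
    """Sequential throughput at a given per-image latency."""
    return 3600 / seconds_per_image


# Per-image latency ranges in seconds, taken from the paragraph above.
LATENCY_RANGES = {
    "autoregressive": (18, 25),
    "diffusion": (3, 5),
}
BATCH_SIZE = 20  # assumed batch size, for illustration only

if __name__ == "__main__":
    for label, (fast, slow) in LATENCY_RANGES.items():
        best = minutes_for_batch(BATCH_SIZE, fast)
        worst = minutes_for_batch(BATCH_SIZE, slow)
        print(f"{label}: {best:.1f} to {worst:.1f} min for {BATCH_SIZE} images")
```

Twenty sequential images come out at roughly six to eight minutes at 18 to 25 seconds each, against one to two minutes at 3 to 5 seconds. That is slow for anything interactive and perfectly fine for a design session.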

Three workflows are different now

Mockups before code. I can describe a feature — "a chat sidebar with a list of past documents on the left, a streaming answer on the right, citations at the bottom" — and get a usable visual mock in two minutes. Not pixel-perfect. Good enough that a stakeholder can react to a real picture instead of a verbal description. Replaces about half of what I was using Figma for at the early concept stage.

Marketing-asset iteration. Hero images, social posts, App Store screenshots. The previous workflow was: describe the idea to a designer, wait three days, iterate twice. The new workflow is: iterate twenty times in an hour, pick one, hand it to the designer for polish. The designer's job becomes the last twenty percent (consistency, brand, polish), not the first eighty percent (composition, layout, idea).

Inpainting as design conversation. "Keep this layout, change the colour palette to warmer tones." "Same image, but the person is older." "Add a small floating menu in the corner." These edits used to be either impossible or a manual rebuild. Now they're a sentence. The editing model is what makes the tool feel collaborative instead of like a slot machine.

What it doesn't do yet

The failure modes that remain are real. Long text strings still break — anything over about a sentence. Specific brand fonts are out of reach; you can ask for "in the style of Helvetica" and get something that looks vaguely Helvetica-adjacent but isn't. Compositional precision is a step better but not surgical: ask for exactly seven elements in the foreground and you'll get six or eight. Generating the same character consistently across multiple images is still hard, though improving.
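One cheap way to work around the long-string limit is to check a prompt before spending a generation on it. The sketch below is a hypothetical helper, not part of any SDK: it assumes the text you want rendered is wrapped in straight double quotes inside the prompt, and the 12-word cutoff is my own rough stand-in for "about a sentence".

```python
import re

MAX_WORDS = 12  # assumed cutoff for "about a sentence"; tune to taste


def risky_text_strings(prompt, max_words=MAX_WORDS):
    """Return quoted strings in the prompt that are likely too long to render."""
    quoted = re.findall(r'"([^"]+)"', prompt)
    return [text for text in quoted if len(text.split()) > max_words]


if __name__ == "__main__":
    prompt = (
        'A landing page with the headline "Ship faster" and a banner reading '
        '"Our platform helps every team plan, build, test, review, release, '
        'and monitor software without ever leaving the editor"'
    )
    for text in risky_text_strings(prompt):
        print("probably too long to render cleanly:", text)
```

It only counts words, so it says nothing about fonts, counts of elements, or character consistency. Those still need a human looking at the output.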

These limits mean the model isn't a replacement for a designer. It's a faster first draft.

What I'm rebuilding

I'm experimenting with a workflow where the PRD itself includes generated mockups inline. The doc has the user story, the tech notes, and a rough visual. The visual is wrong in detail but right in shape. It pulls the conversation off "what should this look like in the abstract" and onto "what's wrong with this specific picture."
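A minimal sketch of what that could look like, under several assumptions: the PRD is a Markdown file, the mockup is saved next to it as a PNG, and image generation is passed in as a plain function from prompt to bytes. `PrdSection`, `render_prd`, and `fake_generate_image` are all names invented for this example. In practice the stub would be replaced by a call to whichever image API you use.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class PrdSection:
    title: str
    user_story: str
    tech_notes: str
    mockup_prompt: str


def render_prd(
    section: PrdSection,
    generate_image: Callable[[str], bytes],
    out_dir: Path,
) -> str:
    """Save a generated mockup and return PRD text that embeds it inline."""
    out_dir.mkdir(parents=True, exist_ok=True)
    image_path = out_dir / "mockup.png"
    image_path.write_bytes(generate_image(section.mockup_prompt))
    return "\n".join(
        [
            f"# {section.title}",
            "",
            "## User story",
            section.user_story,
            "",
            "## Tech notes",
            section.tech_notes,
            "",
            "## Rough visual (generated; wrong in detail, right in shape)",
            f"![mockup]({image_path.name})",
            "",
            f"Prompt used: {section.mockup_prompt}",
        ]
    )


def fake_generate_image(prompt: str) -> bytes:
    """Stub generator: returns placeholder bytes, not a valid image."""
    return b"placeholder image for: " + prompt.encode("utf-8")


if __name__ == "__main__":
    section = PrdSection(
        title="Chat sidebar",
        user_story="As a user, I want to reopen past documents from a sidebar.",
        tech_notes="Streaming answer on the right, citations at the bottom.",
        mockup_prompt="A chat sidebar with past documents on the left",
    )
    print(render_prd(section, fake_generate_image, Path("prd_out")))
```

Keeping the prompt in the doc next to the image is deliberate: when someone says what's wrong with the picture, the fix is usually an edit to that sentence.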

That shift — from abstract to concrete — is what makes design conversations productive. Faster shift means faster feedback. Faster feedback means the design that ships is closer to the design that was right.

The bigger pattern

Image generation has followed the same arc text generation followed two years earlier. For a while it was a fun toy. Then one threshold flipped — instruction following, in the case of language; text rendering, here — and suddenly the toy was tooling. The interesting question isn't "is the output perfect" (it isn't). It's "is the output good enough that the workflow around it is faster end-to-end." For most of what I do visually now, the answer is yes.

I'll come back to this in a few months when I've actually shipped something with it.