
Comments (53)

  • danpalmer
    I'm glad that we're making progress towards a deeper understanding of what LLMs are inherently good at and what they're inherently bad at (not to say incapable of doing, but stuff that is less likely to work due to fundamental limitations).

    There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.

    What I'd really like to see is a more well-defined taxonomy of work and studies on which bits work well with LLMs and which don't. I understand some of this intuitively, but am still building my intuition, and I see people tripping up on this all the time.
  • samcollins
    I found a simple technique to get reliable text and numbers in AI-generated images.

    I'm surprised the image models aren't already doing this, so I wanted to share, since I'm finding it so useful.
  • elil17
    I wonder whether this could be used to fine-tune image models to provide better outputs. Something like this:

    1. Algorithmically generate an underdrawing (e.g. place numbers and shapes randomly in the underdrawing).

    2. Algorithmically generate a description of the underdrawing (e.g. for each shape, output text like "there is a square with the number three in the top left corner"). You might fuzz this by having an LLM rewrite the descriptions in a variety of ways.

    3. Generate a "ground truth" image using the underdrawing and an image+text-to-image model.

    4. Use the generated description and the generated "ground truth" image as training data for a text-to-image model.
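Steps 1 and 2 of the pipeline above can be sketched in a few lines of Python. This is a minimal illustration using SVG squares as the underdrawing; the shape type, sizes, and description template are my own assumptions, not from the comment:

```python
import random

def make_underdrawing(n_shapes=5, size=256, seed=0):
    """Step 1: randomly place numbered squares in an SVG underdrawing.
    Step 2: emit one plain-text description per shape."""
    rng = random.Random(seed)
    shapes, descriptions = [], []
    for i in range(n_shapes):
        x, y = rng.randint(0, size - 40), rng.randint(0, size - 40)
        shapes.append(
            f'<rect x="{x}" y="{y}" width="40" height="40" '
            f'fill="none" stroke="black"/>'
            f'<text x="{x + 20}" y="{y + 25}" text-anchor="middle">{i}</text>'
        )
        descriptions.append(
            f"there is a square with the number {i} near ({x}, {y})"
        )
    svg = (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{size}" height="{size}">' + "".join(shapes) + "</svg>"
    )
    return svg, descriptions

svg, captions = make_underdrawing()
print(len(captions))  # 5
```

Steps 3 and 4 would then feed each (svg, caption) pair through an image model and into a fine-tuning set; an LLM could paraphrase the captions for variety.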
  • smusamashah
    This is just img2img, where the first image with the correct structure was generated by code.
  • dllu
    I was thinking about doing the opposite for the common task of "SVG of a pelican riding a bike". Obviously, directly spitting out the SVG is gonna be bad. But image gen can produce a really stunning photorealistic image easily. Probably a good way to get an LLM to produce a decent bike-pelican SVG is to generate an image first and then get the model to trace it into an SVG. After all, few human beings can generate SVG works of art by just typing out numbers into Notepad. At the core of it, we still rely on looking at it and thinking about it as an image.
  • cheekyant
    Has anyone built a platform which has image to image pipelines and lets you use prompt to SVG generation from SOTA LLMs?
  • sparuchuri
    This hack definitely falls in the “duh, why didn’t I think of that” category of tricks, but glad to now have it next time imagegen comes up short
  • xigoi
    The standard objection: if the LLM is supposedly intelligent, why can’t it figure out on its own that this two-step process would achieve a better result?
  • nine_k
    It's normal to first create a plan, then allow agents to write code. But it seems to be surprising for many to first create a draft / outline of a picture, then go for a final render.
  • nottorp
    LLMs are like a box of chocolates...
  • BobbyTables2
    How is it that LLMs aren’t good at rendering the sequence of numbers but can reliably put the supplied pieces all in the right order?
  • choeger
    Transformers are great translators. So, yeah, starting with structured output like SVG is probably the best way to start.

    It should be fairly trivial to fix any logic errors in the structured output, too.
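Because SVG is structured, "fixing logic errors" can at least start with a deterministic well-formedness check rather than another model call. A minimal sketch, where the function name and the single root-tag rule are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

def svg_errors(svg_text: str):
    """Return a list of problems found in an SVG string, or [] if none.
    Only checks XML well-formedness and the root tag; real validation
    would also check the SVG schema."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError as exc:
        return [f"malformed XML: {exc}"]
    errors = []
    tag = root.tag.split("}")[-1]  # strip the namespace prefix if present
    if tag != "svg":
        errors.append(f"root element is <{tag}>, expected <svg>")
    return errors

print(svg_errors('<svg xmlns="http://www.w3.org/2000/svg"><rect/></svg>'))  # []
```

Anything this check flags can be sent back to the model with the error message attached.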
  • globular-toast
    Wait, where did it get the "Sweet Path//Trail of treats" thing from in the SVG? It wasn't about sweets at that point. Something missing here, I think.
  • SomaticPirate
    inb4 this technique is subsumed into the next MoE model release.

    LLMs are evolving so fast I wouldn’t be surprised if this technique was not needed in <6 months.
  • wg0
    Has anyone had good luck with making consistent game art and assets?
  • Melamune
    I wondered why I was losing all passion for creating. These tips and tricks are part of the answer.
  • foxes
    I feel sorry for the recipient.
  • tracerbulletx
    I've been doing charts for slides like this for a while. I noticed HTML viz was super reliable, but I could style it with a diffusion model. It's very useful for data viz.
  • jeffrallen
    I wish the opposite were true: that when I tell Gemini I want "a diagram of X", it immediately breaks out Python and matplotlib, instead of wasting my time with Nano Banana.
  • nullc
    Inpainting/guiding from a sketch is how I've always used diffusion models. I thought everyone did that, or at least everyone who wasn't just trying to get some arbitrary filler material without much care of what the output looked like.
  • psychoslave
    A few months ago I tried to make Le Chat (Mistral) output French poetry in alexandrines (12 syllables per line). It was disastrous at first. Then, after adding to the specification that each line also had to be transcribed in IPA and each syllable counted, it went better.

    Still emotionally unrelatable, but it definitely produced something that matched the specification once the constraints were explicit and systematically enforced through deterministic means. For now my takeaway is that LLM limitations are such that they can't seize the ineffable, and they are untrustworthy enough that they can only be employed under very clear and inescapable constraints, or they will go awry just as surely as water is wet.
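The "each syllable counted" constraint above is easy to enforce outside the model. A minimal sketch, assuming one vowel nucleus per syllable; the vowel inventory and helper names are my own simplifications (diphthongs and nasal-vowel diacritics are not handled):

```python
# Vowel symbols commonly used in French IPA transcriptions (simplified set).
FRENCH_IPA_VOWELS = set("aeiouyøœəɛɔɑ")

def count_syllables(ipa: str) -> int:
    """Approximate the syllable count of an IPA line by counting
    vowel nuclei (assumes one vowel per syllable)."""
    return sum(1 for ch in ipa if ch in FRENCH_IPA_VOWELS)

def is_alexandrine(ipa: str) -> bool:
    """An alexandrine has exactly 12 syllables."""
    return count_syllables(ipa) == 12

print(count_syllables("paʁi"))  # 2
```

Lines that fail `is_alexandrine` can be rejected and regenerated, which is exactly the kind of inescapable constraint the comment describes.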
  • gwern
    tl;dr: do a standard img2img workflow where you lay out a skeleton or low-res version, and then turn it into the final high-quality photorealistic version, instead of trying to zero-shot it purely from a text prompt.