Beyond prompt engineering

Thinking outside the (text) box to find better interfaces for generative models

A Reddit user shares a hack for generating realistic hands in Stable Diffusion photos

Generative models are transforming how we create. Their output is now, in many cases, indistinguishable from human-generated content. Text prompts have emerged as the primary interface to these models. Yet text is a lossy medium for harnessing their power, and non-obvious prompt engineering is required to get high-quality, aesthetically pleasing output. Next-generation models, and the applications built on top of them, will have to improve usability by expanding beyond text interfaces.

Prompt engineering is an awful tax on creativity

Language is a vague abstraction of our imagination. Therefore, we compress details when we use language to describe our thoughts. This compression is desirable when talking to someone with a lot of shared context. However, when we try to use language to describe precise mental images, it becomes a burden. Text-to-image interfaces fall into this trap. They fail to capture exact aesthetics. The imprecise nature of their output can be novel and thrilling. But, if you have something specific in mind, it can be painful to articulate in a text prompt.

The overall response to this rather frustrating limitation has been to resort to prompt engineering. Like modern-day spellcasters, we must all painstakingly learn the right incantations to coax models to do our bidding. If we fail to do so, we must approach a new class of anointed prompt engineers as the divine intermediaries between our imagination and desired output.

Prompt engineering is a suboptimal tax on creativity, and we can imagine a more inspiring future.

Nonlinguistic interfaces

The missing piece to overcoming the limitations of text is to create interfaces that embed non-verbal context and reduce the iteration steps required to get to your desired output — that is, nonlinguistic interfaces. The central insight is that we capture ideas in linguistic and non-linguistic forms.

Non-linguistic forms encode information that is lost in language, such as imagery, smell, and taste.

I am particularly excited by emerging interfaces that augment text prompts by embedding nonlinguistic context.

Image to image: capturing composition

Stable Diffusion's image-to-image implementation is one of the most popular projects in this direction. How does it work? You upload an initial image, which can be anything from a high-fidelity photo to a rough sketch. That image then conditions the output the model generates from your text prompt, serving as both a template and a constraint.
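
Here's roughly what that looks like in code. This is a minimal sketch using Hugging Face's diffusers library; the model ID, file names, and strength value are illustrative, and parameter names can vary between library versions.

```python
# A minimal sketch of image-to-image generation with diffusers.
# Model ID, file names, and strength are illustrative, not prescriptive.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A rough sketch works fine as the starting point.
init_image = Image.open("rough_sketch.png").convert("RGB").resize((768, 512))

prompt = "a futuristic city inside a glass dome, in a barren desert, cinematic"

# `strength` controls how far the model may drift from the initial image:
# low values stay close to your composition, high values follow the prompt more.
result = pipe(
    prompt=prompt, image=init_image, strength=0.6, guidance_scale=7.5
).images[0]
result.save("dome_city.png")
```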

Image-to-image is a way to specify composition. The initial image encodes nonlinguistic information about how the objects in a scene relate to each other. Where should each element sit relative to the others? How big or small should it be? Spelling this out in a text prompt is cumbersome for a scene with more than a few components. Abstracting composition away from the prompt, however, lets us create impressive results like this one from a Reddit user:

Before: simple sketch
After, with prompt: “A distant futuristic city full of tall buildings inside a huge transparent glass dome, In the middle of a barren desert full of large dunes, Sun rays, Artstation, Dark sky full of stars with a shiny sun, Massive scale, Fog, Highly detailed, Cinematic, Colorful.”

You’ll notice that while this user still had to specify several stylistic aspects in their prompt, the original input image guides the overall composition.

Principal object to image: subject-driven generation with DreamBooth

Another area where text alone falls short as an interface is generating images of the same subject in various contexts. Say you want to create multiple renditions of your dog in different environments, poses, angles, or lighting conditions. A text prompt is not subject-aware: there is no obvious way to capture your dog's features in words, and no obvious way to tell a generative model to preserve those features across renditions. DreamBooth, a project from a team of Google researchers, shows how you can imitate the appearance of a given subject and create new versions of it in different contexts. The paper includes some striking canonical examples. Starting from a few input images of your dog, you can create subject-aware renditions that preserve its appearance wherever you place it.
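
To make that concrete, here's a hedged sketch of what generation looks like after DreamBooth fine-tuning, assuming you've already trained a checkpoint bound to a rare identifier token. The checkpoint path and the "sks" token below are placeholders for whatever you used during fine-tuning.

```python
# A sketch of subject-driven generation with a DreamBooth-fine-tuned model.
# "path/to/my-dreambooth-dog" and the "sks" identifier are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/my-dreambooth-dog", torch_dtype=torch.float16
).to("cuda")

# The identifier token now "means" your dog, so ordinary prompts
# become subject-aware.
prompts = [
    "a photo of sks dog swimming in a pool",
    "a photo of sks dog in the snow, golden hour lighting",
    "an oil painting of sks dog wearing a crown",
]
for i, prompt in enumerate(prompts):
    pipe(prompt).images[0].save(f"dog_{i}.png")
```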


Aesthetics to image: style-driven generation with textual inversion

How do you teach a generative model to render your prompts in a specific artistic style? With textual inversion, you can learn a pseudo-word that matches your desired aesthetic by feeding a handful of example images to the model. You can then use that pseudo-word in your prompts to achieve the style. No more guessing whether the right hack is to append “trending on artstation.”
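
In practice, with the diffusers library this looks something like the sketch below. The embedding file and the "<my-style>" pseudo-word are placeholders for whatever you train or download, and the exact loading call may differ across library versions.

```python
# A sketch of using a learned textual-inversion embedding with diffusers.
# The embedding path and the "<my-style>" token are hypothetical placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the pseudo-word learned from a handful of style reference images.
pipe.load_textual_inversion("path/to/learned_embeds.bin", token="<my-style>")

# Use the pseudo-word exactly like any other word in the prompt.
image = pipe("a lighthouse on a cliff in the style of <my-style>").images[0]
image.save("lighthouse.png")
```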

Textual inversion solves the vocabulary problem. What word should I use to represent a specific style? Vocabulary is a bottleneck in human-to-human conversation. In prompt engineering, it is arguably worse.

Side note: textual inversion and DreamBooth can each be used for both style transfer and subject-driven generation, but my own explorations, and others I've seen online, have led me to believe that textual inversion is better for style transfer and DreamBooth is better for subject-driven generation.

Exploration to image: latent space exploration

Another area where text prompts are suboptimal is in exploring images similar to an initial starting point. Prompts are static, and it is not always clear what to tweak to alter the current image slightly.

Latent space exploration is a technique that hints at an elegant way to do this. We can think of the visual space of images produced by a generative model as a vector space (a cluster of points) where each point represents an image. Moving a small distance in any direction yields a similar image with a slight tweak. So, given a simple prompt that produces something less refined than you would like, you can explore the area around your generated image until you find one that you do, in fact, like. In this approach the creative process is visual; the prompt only anchors you to a starting point. That is a much better way to iterate on an image than tweaking prompts.
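
One simple way to play with this idea is to hold the prompt fixed and nudge the initial latent noise that seeds the generation. The sketch below uses diffusers; the step size, seed, and latent shape are illustrative assumptions rather than tuned values.

```python
# A sketch of exploring the neighborhood of one generation by nudging
# its starting latent noise. Prompt, seed, and step size are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a cozy cabin in a pine forest at dusk"
shape = (1, pipe.unet.config.in_channels, 64, 64)  # latent shape for 512x512 output

generator = torch.Generator("cuda").manual_seed(42)
base_latents = torch.randn(
    shape, generator=generator, device="cuda", dtype=torch.float16
)

# The anchor image: the prompt only fixes the starting point.
pipe(prompt, latents=base_latents).images[0].save("anchor.png")

# Walk a short distance in random directions around the anchor; each step
# yields a slightly tweaked version of the same scene.
for i in range(4):
    direction = torch.randn_like(base_latents)
    nearby = base_latents + 0.1 * direction
    pipe(prompt, latents=nearby).images[0].save(f"nearby_{i}.png")
```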

The original paper that sparked this idea has a neat visualization of how far you can take this concept:

Here’s another terrific example with a slightly different technique (complete with trippy AGI music).

Conclusion: Native interfaces will free us from the tyranny of prompt engineering

I have chosen in this post to focus on image models because it is easier to see where text prompts fall short. However, the general point holds for other modalities such as audio and animations.

All the techniques discussed above try to increase our degrees of freedom for interfacing with generative models. This feels to me like the right direction to go in. Text prompts have a role to play but not a primary one.

A few people have pointed out that as models get better, they’ll understand prompts better, and we’ll have to do less prompt engineering. But linguistic context barriers will remain as long as text is the sole or primary interface. The evolution beyond text, I suspect, will be easier to deliver via native interfaces: that is, interfaces that live directly in the context of a user’s workflow. RunwayML’s video editing tool is an excellent example of this.

I, for one, am spending little time getting better at writing text prompts. I’m betting on the builders to free us from their tyranny.
