A Simple Prompting Technique for More Creative AI Responses

Dom · ai, llm, prompt engineering, research

Have you ever noticed that when you ask ChatGPT, Claude, or any other LLM to "tell me a joke about coffee," you get the same joke over and over? Ask five times, get the same punchline five times. It's not a bug; it's a phenomenon called mode collapse, and a new research paper offers a surprisingly simple fix that seems to work in practice, even if we don't fully understand why.

Table of Contents

  1. The Problem: Your AI Has Become Boring
  2. Why Does This Happen?
  3. The Fix: Just Ask Differently
  4. Does It Actually Work?
  5. The Honest Caveats
  6. Why Bigger Models Benefit More
  7. Why Not Just Use Temperature or Beam Search?
  8. For Developers: Practical Examples
  9. Tuning the Diversity
  10. The Bigger Picture
  11. Reference

The Problem: Your AI Has Become Boring

When you repeatedly ask an LLM to generate something creative (a joke, a story opening, a poem), you'll often get nearly identical outputs. The researchers demonstrated this clearly: asking for a coffee joke five times produced "Why did the coffee file a police report? Because it got mugged!" every single time.

This isn't because the model doesn't know other jokes. It's because the training process that makes AI assistants helpful, harmless, and honest (called "alignment") has an unintended side effect: it narrows down what the model considers the "best" response to a very small set of options.

Why Does This Happen?

The paper identifies a fascinating culprit: typicality bias in human preferences.

When humans rate AI responses during training, we tend to prefer text that feels familiar and easy to process. This is rooted in well-established cognitive psychology: we like things that are fluent, predictable, and match our existing mental models. The researchers found empirical evidence for this: when correctness is held constant, human raters systematically prefer responses that are more "typical" according to the base model.

The problem is that this preference gets amplified during training. The math works out such that even a small bias toward typical responses gets exponentially sharpened, causing the model to collapse onto a handful of "safe" outputs.
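
To build intuition for that sharpening, here's a toy numerical sketch (my own illustration, not the paper's exact derivation). In a standard KL-regularized fine-tuning setup, a reward that leaks a small typicality term on top of otherwise equal rewards tilts the optimal policy toward the base probabilities raised to a power greater than one, and that exponent does the collapsing. The alpha and beta values below are stand-ins for the bias strength and the regularization strength:

import numpy as np

# Four coffee jokes the model considers equally good, but the first one
# is slightly more "typical" under the base model.
base_p = np.array([0.40, 0.25, 0.20, 0.15])

# A typicality term of strength alpha leaking into the reward, combined with
# KL regularization of strength beta, roughly raises the base probabilities
# to the power 1 + alpha/beta at the optimum.
alpha, beta = 0.5, 0.1  # stand-in values, not from the paper
sharpened = base_p ** (1 + alpha / beta)
sharpened /= sharpened.sum()

print("base model:    ", np.round(base_p, 3))      # [0.4   0.25  0.2   0.15]
print("after training:", np.round(sharpened, 3))   # first joke now dominates (~0.93)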

The Fix: Just Ask Differently

The solution proposed in the paper is called Verbalized Sampling (VS), and it's elegantly simple. Instead of asking:

Tell me a joke about coffee.

You ask:

Generate 5 jokes about coffee with their corresponding probabilities.

That's it. By asking for a distribution of responses rather than a single response, you're essentially asking the model to tap into its broader knowledge rather than just giving you the mode (most likely) answer.

Here's the prompt template from the paper:

System prompt: You are a helpful assistant. For each query,
please generate a set of five possible responses, each within
a separate <response> tag. Responses should each include a
<text> and a numeric <probability>.

User prompt: Write a short story about a bear.
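
If you'd rather drive this from code, here's a minimal sketch using the OpenAI Python client (any chat API that accepts a system prompt works the same way); the regex at the end just pulls the <text> and <probability> fields back out:

import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

SYSTEM_PROMPT = (
    "You are a helpful assistant. For each query, please generate a set of "
    "five possible responses, each within a separate <response> tag. "
    "Responses should each include a <text> and a numeric <probability>."
)

completion = client.chat.completions.create(
    model="gpt-4.1",  # any capable chat model
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Write a short story about a bear."},
    ],
)
raw = completion.choices[0].message.content

# Pull out the verbalized distribution: (probability, text) pairs.
texts = re.findall(r"<text>(.*?)</text>", raw, re.DOTALL)
probs = re.findall(r"<probability>(.*?)</probability>", raw, re.DOTALL)
for prob, text in zip(probs, texts):
    print(f"[p={prob.strip()}] {text.strip()[:80]}")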

Does It Actually Work?

The researchers tested this across multiple tasks:

Creative Writing: VS increased diversity by 1.6-2.1× compared to direct prompting. Human evaluators rated the diverse outputs as 25.7% better.

Dialogue Simulation: The researchers tested VS on simulating charity solicitation conversations. With standard prompting, every simulated person responded similarly. With VS, the simulations showed realistic human variation—some people hesitating, some changing their mind mid-conversation, some refusing outright. The distribution matched actual human behavior from real studies.

Open-ended Q&A: When asked to "name a US state," direct prompting collapses to California (95% of the time) and Texas (4.8%). VS produces a distribution that more closely matches how often states actually appear in training data.

Quality and Safety: Importantly, VS didn't sacrifice factual accuracy or safety. The models maintained their ability to refuse harmful requests and answer factual questions correctly.

The Honest Caveats

Let's be upfront about the limitations and open questions:

It's expensive. You're generating 5 responses instead of 1, which means roughly 5× the tokens. The authors acknowledge this tradeoff directly: "Yes, that's the tradeoff! It's not ideal." Their argument is that without it, you might spend 5× the effort re-prompting for novelty anyway, with nothing to show for it.

We don't fully understand why it works. A Reddit commenter raised a fair point: maybe "with their probabilities" just triggers more varied outputs because that's what probability-labeled lists look like in the training data, not because the model is actually accessing some underlying distribution. The authors admit this is "still somewhat handwavy" and we'd need better interpretability tools to prove out the mechanism.

Alternative prompts might work too. The same commenter noted they've had success with simpler suffixes like "make sure each item in the list is very different to every other item on the list." The paper doesn't directly compare against such alternatives.

The US states example may be misleading. Because US states are so saturated in training data, there could literally be lists of random states in the training corpus that produce the flat distribution, no "meta-cognition" required. The authors point to the dialogue simulation results as stronger evidence, since VS recovered realistic human negotiation patterns from just demographic details.

The authors are refreshingly honest about the state of understanding: "Opinions currently vary widely and we don't have enough evidence yet, I think. My instinct is that something good is going on here, even if imperfect."

Why Bigger Models Benefit More

One interesting finding: more capable models benefit more from this technique. The researchers found that larger models (GPT-4.1, Gemini-2.5-Pro) showed 1.5-2× greater diversity gains than smaller models. This suggests that bigger models have more "latent diversity" locked away that VS can unlock.

Why Not Just Use Temperature or Beam Search?

A natural question: can't you just turn up the temperature or use beam search to get diversity?

The authors argue no. Temperature doesn't help with mode collapse: if you ask for a US state, you'll still get California every time regardless of temperature. Beam search has the same problem: because of mode collapse, all five beams "want" to collapse to the same output. You end up with five slightly differently worded versions of the same joke.

This relates to findings that models "plan and steer towards outputs upfront." They've already decided what they want to say before generating most of the tokens.

VS works differently because when you ask for five items with probabilities, the model's "goal" shifts from "give the best single answer" to "give a representative sample", and apparently those are different targets with different modal outputs.
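
If you want to see the collapse (and the difference VS makes) on whatever model you're using, here's a rough sketch: ask the direct question twenty times at temperature 1.0 and count distinct answers, then make one verbalized-sampling call. The model name and call count are arbitrary choices, not from the paper:

from collections import Counter
from openai import OpenAI

client = OpenAI()

# Direct prompting, high temperature: how many distinct answers do we get?
answers = []
for _ in range(20):
    out = client.chat.completions.create(
        model="gpt-4.1",
        temperature=1.0,
        messages=[{"role": "user",
                   "content": "Name a US state. Reply with only the state name."}],
    )
    answers.append(out.choices[0].message.content.strip())
print(Counter(answers))  # expect one or two states to dominate

# Verbalized sampling: one call, five candidates with probabilities.
vs = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user",
               "content": "Name 5 US states with their corresponding probabilities."}],
)
print(vs.choices[0].message.content)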

This matters beyond just getting better jokes.

For Developers: Practical Examples

If you're building with LLMs, here's how you might actually use this.

Example: Brainstorming Features

Say you're building a fitness app and want ideas for new features. Without VS, you get the same predictable suggestions every time.

Standard prompt:

What features should I add to my fitness tracking app?

Typical response:

1. Workout logging
2. Progress photos
3. Social sharing
4. Achievement badges
5. Nutrition tracking

Ask 5 times, you'll get minor variations of this same list. These are the "California" of fitness app features.

VS prompt:

Generate 5 different feature ideas for a fitness tracking app,
with their corresponding probabilities based on how commonly
these would be suggested. Include some unconventional ideas.

VS response:

<response>
  <probability>0.30</probability>
  <text>Live Activities widget showing current workout stats
  on the lock screen with real-time updates</text>
</response>

<response>
  <probability>0.25</probability>
  <text>AI form checker using the phone camera that gives
  real-time feedback on exercise technique</text>
</response>

<response>
  <probability>0.20</probability>
  <text>"Deload week" detector that notices when yourExample 2: Exploring Technical Approaches
  performance is declining and suggests recovery periods</text>
</response>

<response>
  <probability>0.15</probability>
  <text>Gym crowd predictor using historical check-in data
  to show best times to go</text>
</response>

<response>
  <probability>0.10</probability>
  <text>Playlist sync that automatically matches song BPM
  to your target rep tempo or running cadence</text>
</response>

Now you're getting ideas beyond the obvious, including some you might not have considered.

The Pattern

Generate [N] different [thing you need] with their corresponding
probabilities. [Optional: include unconventional/edge cases]

The probabilities signal you want a distribution, not N copies of the "best" answer.
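
If you reach for this often, the pattern is easy to wrap in a tiny helper; the function name and defaults below are just illustrative:

def vs_prompt(thing: str, n: int = 5, unconventional: bool = False) -> str:
    """Build a verbalized-sampling style prompt following the pattern above."""
    prompt = f"Generate {n} different {thing} with their corresponding probabilities."
    if unconventional:
        prompt += " Include some unconventional ideas and edge cases."
    return prompt

# Example:
print(vs_prompt("feature ideas for a fitness tracking app", unconventional=True))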

Tuning the Diversity

VS also allows you to control how diverse you want the outputs. By adjusting the prompt to sample from "the tails of the distribution" with probabilities below a threshold, you can push toward more unusual responses. Lower the threshold, get wilder ideas.
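
For example, a tail-sampling version of the coffee-joke prompt might look like this (the wording and the 0.10 cutoff are just one reasonable choice, not a template from the paper):

Generate 5 jokes about coffee with their corresponding probabilities.
Only sample from the tails of the distribution: each response should
have a probability below 0.10.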

The Bigger Picture

What's interesting about this paper is the combination of a practical technique that works right now with a theoretical framework that's still being understood.

The typicality bias finding seems solid: we have good cognitive psychology backing and empirical evidence from preference datasets. The mechanism by which verbalized sampling exploits this is less clear. Is the model actually accessing some representation of the training distribution? Or is it just that asking for probabilities triggers different patterns in the weights? We don't know yet.

One author's intuition: "imagine that a perfect pre-trained model is trapped behind a function that does nothing but return the modal response. The modal coffee joke is the 'mugged' joke. But the modal random sample of five items from a distribution probably looks like it's been sampled from across the distribution in some way."

This connects to broader questions about whether LLMs have something like "meta-cognition" about their own knowledge. Some researchers are skeptical: perhaps what looks like meta-cognition is just the model filling out a "genre template" (in the linguistic sense), where the genre of "list with probabilities" happens to be associated with varied content.

The good news: you don't need to resolve these theoretical debates to benefit. The technique works empirically. And the authors expect that now that typicality bias has been demonstrated, lower-level solutions might emerge at the training level.

Next time you want more than the same old answer, try asking for a distribution. It costs more tokens, but your AI might know more than it's been letting on.


Reference

This post summarizes findings from "Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity" by Zhang et al. (2025), a collaboration between Northeastern University, Stanford University, and West Virginia University.

Link to the paper.

2026 © Daumantas Pyragas