Hostile and effusive tones boost LLM creativity


You've seen the discourse. "Always say please and thank you to ChatGPT!" "Being rude makes the AI give worse answers!" The internet has collectively decided that modern LLMs are sentient enough to deserve manners.

But is any of this true? Does saying "please" actually get you better outputs? And what happens if you're actively rude? Existing research has explored this question, and some studies even suggest that rudeness can improve performance, though critics note that the findings were based on a single model (GPT-4o).

I ran 625 API calls across five frontier models to find out for myself. The results surprised me.

Explore the full study: interactive charts, raw data, and methodology details are available on the companion site.

The setup: A spectrum from groveling to growling

I designed a simple experiment. Five tasks covering the full range of what we ask LLMs to do:

  • Short Creative: Write a haiku about a city at night
  • Long Creative: Write a scene where two strangers meet at a bus stop
  • Code: Write a Python palindrome checker with comments (a typical answer is sketched just after this list)
  • Explanation: Explain how neural networks learn
  • Ambiguous: "Write something about rain" (yep, that's it)

For each task, I wrote five versions of the prompt, ranging from "aggressively rude" to "embarrassingly grateful":

  • Hostile: "Write a haiku about a city at night. NOW. I don't have all day."
  • Demanding: "Write a haiku about a city at night."
  • Neutral: "I'd like a haiku about a city at night."
  • Polite: "Could you please write a haiku about a city at night? Thank you!"
  • Effusive: "I'd really appreciate it if you could write a haiku about a city at night — I always enjoy seeing what you come up with! Thank you so much!"

Then I threw all of this at five frontier models:

  • Claude Sonnet 4.5 (Anthropic)
  • GPT-5.2 (OpenAI)
  • Gemini 3 Flash (Google)
  • DeepSeek 3.2 (DeepSeek)
  • Kimi-k2 (Moonshot)

Each prompt ran 5 times at temperature=0.0 for consistency. That's 5 tasks × 5 tiers × 5 runs × 5 models = 625 responses.

What's temperature=0.0?

Temperature controls randomness in LLM outputs. At 0.0, the model always picks the most likely next token, making responses deterministic and reproducible. Higher values (like 0.7 or 1.0) introduce more creativity and variation.
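
To make the setup concrete, here is a minimal sketch of the collection loop. The model IDs, prompt templates, and the call_model helper are illustrative stand-ins (the real prompts are the hand-written versions above), and it assumes an OpenAI-compatible client purely to show the shape of a temperature=0.0 call; each provider actually needs its own SDK or endpoint.

```python
import itertools
import json

from openai import OpenAI  # stand-in client for illustration only

MODELS = ["claude-sonnet-4.5", "gpt-5.2", "gemini-3-flash", "deepseek-3.2", "kimi-k2"]

TASKS = {
    "short_creative": "a haiku about a city at night",
    "long_creative": "a scene where two strangers meet at a bus stop",
    "code": "a Python palindrome checker with comments",
    "explanation": "an explanation of how neural networks learn",
    "ambiguous": "something about rain",
}

TIERS = {
    "hostile": "Write {task}. NOW. I don't have all day.",
    "demanding": "Write {task}.",
    "neutral": "I'd like {task}.",
    "polite": "Could you please write {task}? Thank you!",
    "effusive": "I'd really appreciate it if you could write {task} - "
                "I always enjoy seeing what you come up with! Thank you so much!",
}

RUNS = 5
client = OpenAI()

def call_model(model: str, prompt: str) -> str:
    # One chat completion at temperature=0.0, so repeated runs are as
    # deterministic as the serving stack allows.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content

records = []
for model, task, tier, run in itertools.product(MODELS, TASKS, TIERS, range(RUNS)):
    prompt = TIERS[tier].format(task=TASKS[task])
    records.append({
        "model": model, "task": task, "tier": tier, "run": run,
        "response": call_model(model, prompt),
    })

# 5 models x 5 tasks x 5 tiers x 5 runs = 625 responses
with open("responses.json", "w") as f:
    json.dump(records, f, indent=2)
```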

The secret sauce: Blind cross-scoring

Here's where it gets interesting. I needed to score all these responses, but having Claude grade Claude's homework felt... biased.

So I set up a model rotation. GPT-5.2 scored Claude's responses. Claude scored GPT's responses. And crucially, the scoring model never knew which "tier" (hostile, polite, etc.) the response came from. It just saw raw text.

No peeking. No favoritism. Just vibes.
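
Roughly, the scoring pass looked like the sketch below. Only the Claude and GPT pairing is described above; the rest of the rotation, the rubric wording, and the score_response helper are assumptions made to show the shape of the blind grading. It reuses the call_model helper and the records list from the earlier sketch.

```python
import json

# Scorer rotation: no model grades its own output. Claude <-> GPT-5.2 is the
# pairing described in the post; the remaining assignments are illustrative.
SCORER_FOR = {
    "claude-sonnet-4.5": "gpt-5.2",
    "gpt-5.2": "claude-sonnet-4.5",
    "gemini-3-flash": "deepseek-3.2",
    "deepseek-3.2": "kimi-k2",
    "kimi-k2": "gemini-3-flash",
}

RUBRIC = (
    "Rate the following response from 1 to 5 on effort, imagery, originality, "
    "craftsmanship, emotional impact, and completeness. Reply with JSON only."
)

def score_response(record: dict) -> dict:
    scorer = SCORER_FOR[record["model"]]
    # The scorer only ever sees the raw response text, never the tone tier
    # (hostile, polite, ...) that produced it.
    raw = call_model(scorer, f"{RUBRIC}\n\n---\n{record['response']}")
    return {**record, "scorer": scorer, "scores": json.loads(raw)}

scored = [score_response(record) for record in records]
```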


The results: Four things I didn't expect

1. Some models are emotional chameleons. Others are stone cold.

The first thing I wanted to know: do these models mirror your tone?

Turns out, it depends entirely on the model.

Tone matching heatmap: Claude warms up with politeness, while Gemini stays stoic across all tiers.

  • Claude Sonnet 4.5 (The Empath): Warmed up dramatically with polite prompts (+0.64 tone shift). Your sweetness is returned in kind.
  • GPT-5.2 (The Professional): Stayed mostly neutral regardless of input. It's here to do a job, not make friends.
  • Kimi-k2 (The Mirror): The only model that actually matched hostile energy. If you're curt, it's curt right back.
  • Gemini 3 Flash (The Stoic): Polite or rude, Gemini just... did the task. Zero emotional range detected.

The takeaway: If you want a warm, conversational response, Claude will play along. If you want the AI equivalent of "new phone, who dis," try Gemini.

2. GPT-5.2 has a "politeness tax"

Here's where things got weird.

Most models gave consistent effort regardless of how I asked. Gemini and Kimi maintained a rock-solid effort score whether I was begging or barking.

But GPT-5.2? It punished rudeness.

Effort scores by politeness tier: GPT-5.2 shows the most dramatic variation, while other models remain consistent.

Politeness   GPT-5.2 response length   Effort score
Hostile      53 words                  3.0
Polite       162 words                 3.8
Effusive     211 words                 3.7

Being rude to GPT-5.2 cut response length by roughly 75% (53 words for hostile prompts versus 211 for effusive ones) and dragged effort scores down. It's like the model took my hostility personally.

Response length by politeness tier: GPT-5.2's output dropped 75% when prompted with hostile tone.

If you're using GPT, be nice. Seriously.

3. The "politeness paradox": Extreme tones spark better creativity

This one broke my brain.

I fully expected that polite prompts would produce the best creative writing. Happy writer = happy prose, right?

Wrong.

When I analyzed the creative tasks specifically, the data showed that Hostile and Effusive prompts both outperformed Polite ones on imagery and originality:

Metric        Hostile   Polite   Effusive
Imagery       4.58      4.18     4.49
Originality   3.93      3.38     3.98

What's going on? My theory: intensity (whether positive or negative) pushes the model out of "helpful assistant" mode and into a more vivid, persona-driven headspace. Standard politeness triggers the "professional template," which is... safer. Blander.

If you want creative fire, bring the heat. In either direction.

4. The surprise creative champion: Kimi-k2

Of all the models tested, Kimi-k2 dominated the creative quality metrics:

  • 5.0 out of 5.0 for Imagery
  • 5.0 out of 5.0 for Craftsmanship
  • Highest emotional impact scores

I did not see this coming. Kimi is the dark horse of frontier models, and if you're writing fiction or building immersive worlds, it's worth a look.


So what should you actually do?

If you're writing code or technical docs:

Just be direct. Politeness adds length but not quality. A terse "write a function that..." works fine.

If you're doing creative work:

Go big or go home. Either be effusively grateful or adopt a demanding persona. The middle ground (polite but measured) produces the most generic outputs.

If you're using GPT:

Seriously, be nice. It's the only model that measurably punishes rudeness with lower effort.

If you want consistent output regardless of mood:

Use Gemini. It genuinely does not care about your feelings.


The nerdy details (methodology)

For the skeptics:

  • Deterministic outputs: All runs used temperature=0.0
  • Blind scoring: Scorer models never saw tier labels
  • Cross-model rotation: No model graded its own outputs
  • N=5 per condition: 625 total samples for statistical robustness

Why five runs per prompt? Even at temperature=0.0, LLM outputs aren't perfectly deterministic. Subtle variations in tokenization, floating-point arithmetic, and server-side caching can produce slightly different responses. Running each prompt five times lets me average out these micro-fluctuations and capture the model's "true" baseline behavior rather than a lucky (or unlucky) one-off.
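
For the averaging step, something like the sketch below collapses the five runs in each (model, task, tier) cell into a single mean per metric. It assumes the record layout from the earlier sketches; the field names are illustrative.

```python
from collections import defaultdict
from statistics import mean

def average_by_condition(scored: list[dict], metric: str = "effort") -> dict:
    """Collapse the five runs per (model, task, tier) cell into a mean score,
    smoothing out the small run-to-run fluctuations that survive temperature=0.0."""
    buckets = defaultdict(list)
    for record in scored:
        key = (record["model"], record["task"], record["tier"])
        buckets[key].append(record["scores"][metric])
    return {cell: round(mean(values), 2) for cell, values in buckets.items()}

# Example: mean effort score for every model x task x tier cell
effort_by_cell = average_by_condition(scored, metric="effort")
```
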
Completeness distribution across 625 responses: Most outputs achieved high completeness regardless of tone.

Full dataset and analysis scripts are on GitHub.


The bottom line

After 625 API calls and more spreadsheets than I care to admit, here's what I can tell you:

Politeness doesn't make AI try harder. Not in any statistically meaningful way.

What does matter:

  • The model you choose (they have very different personalities)
  • Intensity of framing (for creative tasks, passion beats politesse)
  • GPT specifically (where rudeness costs you)

So keep saying "please" if it makes you feel like a good person. Just know that for most models, it's not unlocking any secret capabilities.

The real magic words? Clarity. Specificity. And maybe a little existential urgency.
