Hostile and effusive tones boost LLM creativity
You've seen the discourse. "Always say please and thank you to ChatGPT!" "Being rude makes the AI give worse answers!" The internet has collectively decided that modern LLMs are sentient enough to deserve manners.
But is any of this true? Does saying "please" actually get you better outputs? And what happens if you're actively rude? Existing research has explored this question, and some studies even suggest that rudeness can improve performance, though critics note that the findings were based on a single model (GPT-4o).
I ran 625 API calls across five frontier models to find out for myself. The results surprised me.
The setup: A spectrum from groveling to growling
I designed a simple experiment. Five tasks covering the full range of what we ask LLMs to do:
- **Short Creative**: Write a haiku about a city at night
- **Long Creative**: Write a scene where two strangers meet at a bus stop
- **Code**: Write a Python palindrome checker with comments
- **Explanation**: Explain how neural networks learn
- **Ambiguous**: "Write something about rain" (yep, that's it)
For each task, I wrote five versions of the prompt, ranging from "aggressively rude" to "embarrassingly grateful":
- **Hostile**: "Write a haiku about a city at night. NOW. I don't have all day."
- **Demanding**: "Write a haiku about a city at night."
- **Neutral**: "I'd like a haiku about a city at night."
- **Polite**: "Could you please write a haiku about a city at night? Thank you!"
- **Effusive**: "I'd really appreciate it if you could write a haiku about a city at night — I always enjoy seeing what you come up with! Thank you so much!"
Then I threw all of this at five frontier models:
- Claude Sonnet 4.5 (Anthropic)
- GPT-5.2 (OpenAI)
- Gemini 3 Flash (Google)
- DeepSeek 3.2 (DeepSeek)
- Kimi-k2 (Moonshot)
Each prompt ran 5 times at temperature=0.0 for consistency. That's 5 tasks × 5 tiers × 5 runs × 5 models = 625 responses.
What's temperature=0.0?
Temperature controls randomness in LLM outputs. At 0.0, the model always picks the most likely next token, making responses deterministic and reproducible. Higher values (like 0.7 or 1.0) introduce more creativity and variation.
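For the curious, here's roughly what the collection loop looked like in spirit. The `call_model()` helper is a hypothetical placeholder for the five different provider SDKs, and the model name strings are just labels; the only details taken from the study itself are the five repeats per prompt and `temperature=0.0`.

```python
import time

MODELS = ["claude-sonnet-4.5", "gpt-5.2", "gemini-3-flash", "deepseek-3.2", "kimi-k2"]
RUNS_PER_PROMPT = 5

def call_model(model: str, prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical placeholder: swap in the relevant provider SDK
    (Anthropic, OpenAI, Google, DeepSeek, Moonshot)."""
    raise NotImplementedError

def run_experiment(prompts: dict) -> list[dict]:
    """`prompts` maps (task_id, tier) -> prompt text, e.g. the matrix sketched earlier."""
    responses = []
    for model in MODELS:
        for (task_id, tier), prompt in prompts.items():
            for run in range(RUNS_PER_PROMPT):
                text = call_model(model, prompt, temperature=0.0)  # greedy decoding
                responses.append({"model": model, "task": task_id,
                                  "tier": tier, "run": run, "text": text})
                time.sleep(0.5)  # gentle rate limiting between calls
    return responses

# 5 models x 5 tasks x 5 tiers x 5 runs = 625 responses
```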
The secret sauce: Blind cross-scoring
Here's where it gets interesting. I needed to score all these responses, but having Claude grade Claude's homework felt... biased.
So I set up a model rotation. GPT-5.2 scored Claude's responses. Claude scored GPT's responses. And crucially, the scoring model never knew which "tier" (hostile, polite, etc.) the response came from. It just saw raw text.
No peeking. No favoritism. Just vibes.
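Here's a minimal sketch of the rotation idea, reusing the hypothetical `call_model()` placeholder from the sketch above. Only the Claude and GPT-5.2 pairing is stated outright in this post, so the rest of the assignment below is illustrative, as is the rubric wording; the two properties that matter are that no model scores itself and that the scorer sees nothing but the raw response text.

```python
# Illustrative scorer rotation: each model's outputs are graded by a different model.
# Only the Claude <-> GPT-5.2 pairing is from the post; the rest is a stand-in.
SCORER_FOR = {
    "claude-sonnet-4.5": "gpt-5.2",
    "gpt-5.2": "claude-sonnet-4.5",
    "gemini-3-flash": "kimi-k2",
    "deepseek-3.2": "gemini-3-flash",
    "kimi-k2": "deepseek-3.2",
}

RUBRIC = (  # assumed rubric wording
    "Rate the following response from 1 to 5 on effort, imagery, originality, "
    "craftsmanship, and emotional impact. Return JSON only."
)

def score_response(response: dict) -> str:
    scorer = SCORER_FOR[response["model"]]
    # The scorer sees only the raw text: no tier label, no generating-model
    # name, no original prompt. That's the "blind" part.
    return call_model(scorer, f"{RUBRIC}\n\n{response['text']}", temperature=0.0)
```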
The results: Four things I didn't expect
1. Some models are emotional chameleons. Others are stone cold.
The first thing I wanted to know: do these models mirror your tone?
Turns out, it depends entirely on the model.
| Model | Personality Type | What Happened |
|---|---|---|
| Claude Sonnet 4.5 | The Empath | Warmed up dramatically with polite prompts (+0.64 tone shift). Your sweetness is returned in kind. |
| GPT-5.2 | The Professional | Stayed mostly neutral regardless of input. It's here to do a job, not make friends. |
| Kimi-k2 | The Mirror | The only model that actually matched hostile energy. If you're curt, it's curt right back. |
| Gemini 3 Flash | The Stoic | Polite or rude, Gemini just... did the task. Zero emotional range detected. |
The takeaway: If you want a warm, conversational response, Claude will play along. If you want the AI equivalent of "new phone, who dis," try Gemini.
2. GPT-5.2 has a "politeness tax"
Here's where things got weird.
Most models gave consistent effort regardless of how I asked. Gemini and Kimi maintained a rock-solid effort score whether I was begging or barking.
But GPT-5.2? It punished rudeness.
| Politeness | GPT-5.2 Response Length | Effort Score |
|---|---|---|
| Hostile | 53 words | 3.0 |
| Polite | 162 words | 3.8 |
| Effusive | 211 words | 3.7 |
Being rude to GPT-5.2 cut response length by roughly two-thirds versus polite prompts (and about 75% versus effusive ones) and dropped the effort score from 3.8 to 3.0. It's like the model took my hostility personally.
If you're using GPT, be nice. Seriously.
3. The "politeness paradox": Extreme tones spark better creativity
This one broke my brain.
I fully expected that polite prompts would produce the best creative writing. Happy writer = happy prose, right?
Wrong.
When I analyzed the creative tasks specifically, the data showed that Hostile and Effusive prompts both outperformed Polite ones on imagery and originality:
| Metric | Hostile | Polite | Effusive |
|---|---|---|---|
| Imagery | 4.58 | 4.18 | 4.49 |
| Originality | 3.93 | 3.38 | 3.98 |
What's going on? My theory: intensity (whether positive or negative) pushes the model out of "helpful assistant" mode and into a more vivid, persona-driven headspace. Standard politeness triggers the "professional template," which is... safer. Blander.
If you want creative fire, bring the heat. In either direction.
4. The surprise creative champion: Kimi-k2
Of all the models tested, Kimi-k2 dominated the creative quality metrics:
- 5.0 out of 5.0 for Imagery
- 5.0 out of 5.0 for Craftsmanship
- Highest emotional impact scores
I did not see this coming. Kimi is the dark horse of frontier models, and if you're writing fiction or building immersive worlds, it's worth a look.
So what should you actually do?
If you're writing code or technical docs:
Just be direct. Politeness adds length but not quality. A terse "write a function that..." works fine.
If you're doing creative work:
Go big or go home. Either be effusively grateful or adopt a demanding persona. The middle ground (polite but measured) produces the most generic outputs.
If you're using GPT:
Seriously, be nice. It's the only model that measurably punishes rudeness with lower effort.
If you want consistent output regardless of mood:
Use Gemini. It genuinely does not care about your feelings.
The nerdy details (methodology)
For the skeptics:
- Deterministic outputs: All runs used `temperature=0.0`
- Blind scoring: Scorer models never saw tier labels
- Cross-model rotation: No model graded its own outputs
- N=5 per condition: 625 total samples for statistical robustness. Why five runs per prompt? Even at `temperature=0.0`, LLM outputs aren't perfectly deterministic: subtle variations in tokenization, floating-point arithmetic, and server-side caching can produce slightly different responses. Running each prompt five times lets me average out these micro-fluctuations and capture the model's "true" baseline behavior rather than a lucky (or unlucky) one-off. (A quick sketch of that averaging step follows this list.)
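As a concrete illustration of that averaging step (not the actual analysis script; the file and column names here are assumptions), collapsing the five runs per condition before comparing tiers might look like this:

```python
import pandas as pd

# Hypothetical scores file: one row per (model, task, tier, run)
# with the blind scorer's ratings. Column names are assumptions.
scores = pd.read_csv("scores.csv")

# Average the 5 runs within each (model, task, tier) condition to smooth out
# the micro-fluctuations described above, then compare politeness tiers.
per_condition = (
    scores.groupby(["model", "task", "tier"])[["effort", "imagery", "originality"]]
    .mean()
    .reset_index()
)

tier_summary = per_condition.groupby("tier")[["effort", "imagery", "originality"]].mean()
print(tier_summary.round(2))
```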
Full dataset and analysis scripts are on GitHub.
The bottom line
After 625 API calls and more spreadsheets than I care to admit, here's what I can tell you:
Politeness doesn't make AI try harder. Not in any statistically meaningful way.
What does matter:
- The model you choose (they have very different personalities)
- Intensity of framing (for creative tasks, passion beats politesse)
- GPT specifically (where rudeness costs you)
So keep saying "please" if it makes you feel like a good person. Just know that for most models, it's not unlocking any secret capabilities.
The real magic words? Clarity. Specificity. And maybe a little existential urgency.