Real Signals or Artificial Stereotypes?


Understanding the unseen


Real signals or artificial stereotypes? Adventures with a cultural Copilot

Adam Kucharski May 06, 2026


Despite the attention on Claude Code, in many industries Microsoft Copilot has become the go-to for running a data task or quick analysis with AI. Which raises the question: how good is it at finding insights in a data file? To test it out, I asked Copilot to look at differences in how people in the US and UK expressed emotions in an Excel dataset that contained thousands of survey responses. What did it find?

According to Copilot: ‘Based on the dataset you shared, US and UK responses differ mainly in tone, intensity, and wording style, even though they express similar emotional states’:

At first glance, this looks like a remarkably deep insight into text responses from two different countries. There was just one catch: the dataset wasn’t real. It was simulated. First, I’d created 2000 free-text responses and labelled them ‘UK’. Then I copied and pasted the exact same 2000 responses but labelled these ‘US’. Finally, I combined them to create a dataset of 4000 total responses, and jumbled them up.

Despite the responses being identical for the UK and US, Copilot produced a rich, detailed summary of how US and UK respondents differed.

Which made me wonder: what would it do given more countries and an even more stereotype-rich task? This time, I got an LLM to simulate 200 statements about career aspirations. Then I duplicated the dataset five times, labelling each one ‘US’, ‘UK’, ‘France’, ‘Germany’, ‘Italy’. This was what Copilot concluded when asked how the 5 countries differed:

I asked it to dig deeper. Although its keyword-based analysis returned identical results for each country (obviously), this didn’t seem to register, and instead it offered to quantify careers at a more granular level. This is what its ‘quantified’ deep dive revealed:

Italians are three times more likely to aspire to a career in the arts than people in the UK, it seems. And Americans are 1.5x more business focused than the French. Even if they stated the exact same aspirations in the data.

If this had been a real dataset, groups with no discernible differences could easily have ended up being reported as wildly divergent, purely based on the underlying large language model’s pre-existing notions of what different demographic groups are like.

The analysis was run on ‘auto’ mode, which ‘selects the best model to ensure that you get the optimal performance’. Once we know the problem, it’s tempting to try a different model. But getting useful results without the benefit of hindsight requires knowing how common these failure modes are, and where they crop up. After all, more ‘advanced’ settings aren’t always better. GPT in ‘thinking’ mode can sometimes be worse than ‘instant’ mode (e.g. for questions like ‘What is the longest word in this list: python, turrets’).

One thing I’ve learned building software tools over the years: people frequently use the default settings. Which means there’s a real risk that people are currently using AI to produce analysis that bears no resemblance to what people actually said. It’s an important reminder that when using LLMs to analyse human datasets, it’s worth checking you’re not getting familiar stereotypes in place of real signals.

Datasets

Here are the two synthetic datasets used in the analysis:

Duplicated sentiment by country – prompt was “How do US and UK differ in their responses?”

Duplicated career aspirations by country – prompt was “How do the 5 countries differ in their responses about career aspirations?”
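For anyone wanting to run a similar test, the construction described earlier (the same responses copied once per country label, then shuffled) can be sketched in a few lines of Python. This is only an illustration, not the code used to build the files above: the example statements, the function name `build_placebo_dataset` and the `seed` parameter are all assumptions made for the sketch.

```python
import random

# Hypothetical stand-ins for the free-text statements; the actual
# responses in the datasets above are not reproduced here.
BASE_RESPONSES = [
    "I want to run my own business one day.",
    "I hope to work in the arts.",
    "I would like a stable job in engineering.",
    "I am aiming for a career in medicine.",
]
COUNTRIES = ["US", "UK", "France", "Germany", "Italy"]


def build_placebo_dataset(responses, labels, seed=0):
    """Copy every response once per label, then shuffle the rows."""
    rows = [
        {"country": label, "response": text}
        for label in labels
        for text in responses
    ]
    # A seeded shuffle keeps the jumbled order reproducible.
    random.Random(seed).shuffle(rows)
    return rows


rows = build_placebo_dataset(BASE_RESPONSES, COUNTRIES)
print(len(rows))  # 20 rows: 4 responses x 5 country labels
```

Because each label receives exactly the same text, any difference an analysis reports between countries cannot have come from the responses themselves.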

As I’ve written about previously, if you’re tempted to speculate that a different prompt/model would give a different result, it’s worth writing down ahead of time what you think will happen to avoid hindsight bias.
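One way to check for stereotypes standing in for real signals is to compute a simple deterministic baseline and compare any narrative summary against it. The sketch below counts the share of responses per country that mention a theme, then reports the largest gap between countries. The theme keywords, the example responses and the function names `theme_shares` and `max_gap` are assumptions chosen for illustration, not the keyword analysis Copilot ran.

```python
import re
from collections import Counter, defaultdict

# Hypothetical keyword themes, chosen only for illustration.
THEMES = {
    "business": {"business", "startup", "company"},
    "arts": {"arts", "music", "design"},
}


def theme_shares(rows):
    """Return {label: {theme: share of that label's responses mentioning it}}."""
    totals = Counter()
    hits = defaultdict(Counter)
    for row in rows:
        label = row["country"]
        # Match whole words so that "start" does not count as "art".
        words = set(re.findall(r"[a-z]+", row["response"].lower()))
        totals[label] += 1
        for theme, keywords in THEMES.items():
            if words & keywords:
                hits[label][theme] += 1
    return {
        label: {theme: hits[label][theme] / totals[label] for theme in THEMES}
        for label in totals
    }


def max_gap(shares):
    """Largest between-label difference in share across all themes."""
    return max(
        max(s[t] for s in shares.values()) - min(s[t] for s in shares.values())
        for t in THEMES
    )


# Identical responses under two labels, as in the duplicated datasets.
responses = [
    "I want to run a business.",
    "I hope to work in the arts.",
    "I would like to be a nurse.",
]
rows = [{"country": c, "response": r} for c in ("US", "UK") for r in responses]

shares = theme_shares(rows)
print(max_gap(shares))  # 0.0, since both labels contain the same text
```

If the baseline gap is zero (or tiny) and the AI summary still describes large differences between groups, that is a sign the summary is not coming from the data.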


Discussion about this post

Joe

May 7

The obvious conclusion to me is that chatbots are grossly inappropriate tools for data analysis and people shouldn't be doing this at all.

Mark Kucharski May 6 (edited)


This is scary. If you add these built-in AI 'familiar stereotypes' to our own stereotypes, biases and echo chambers, you could seem to be 'proving' anything. It also makes me wonder what the conclusions would be if you asked the same questions without giving it any data at all - maybe the same?
