We tested 6 AI assistants on the same solar data. The results surprised us | HelioPeak Blog

We are building an "Export for AI Analysis" feature for HelioPeak. The idea is simple: tap a button in the app, get a Markdown file with your solar production data plus detailed instructions for an AI assistant, paste it into the chatbot of your choice, and receive an analysis worth more than the sum of its parts. No HelioPeak servers in the loop, no recurring fees, no privacy theatre, just your data and your AI of choice.

Before writing a single line of Swift code for this feature, we wanted to validate the concept on real chatbots. So we built a Python prototype that generates three versions of the same export file, each with progressively more refined instructions, and we tested each version on six AI assistants. The export contained two years of daily production data, three full years of yearly aggregates, system metadata, user notes, and a detailed prompt asking the AI to produce a structured 14-section analysis with answers to 39 specific questions.

What we found, frankly, embarrassed us. Not because the AI assistants are bad (some of them are remarkable), but because the gap between the best and the worst output is so wide that two users with the same solar system could come away with completely different conclusions depending on which chatbot they happened to use. Some assistants invented numbers that were not in the data. Others claimed the file was truncated when it was not. One promised a PDF report and never delivered it. Another delivered a PDF but stripped out every trace of design.

This article is the story of that test. It is partly a benchmark, partly a confession about how naively we wrote our first prompt, and partly, we hope, useful to anyone else who is trying to get reliable analysis out of an AI assistant on a non-trivial dataset.

The setup

The dataset under test was a synthetic-but-realistic Belgian 5.7 kWp installation with an east/west panel split and a 5 kW Fronius inverter, operating since April 2018. Daily, monthly, and yearly production records from January 2023 through 23 May 2026 were embedded as JSON blocks inside a Markdown file, along with consumption and grid import/export data, a few user notes, and a handful of Solar Moments achievements. The total file size was approximately 220 kB in the largest tier (roughly 55,000 tokens), well within the context window of any modern frontier model.
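To make the file layout concrete, here is a minimal sketch of how such an export could be assembled. It is not the prototype's actual code: the section titles, the field names (`peak_power_kwp`, `kwh`), the sample daily value, and the use of fenced JSON blocks are all assumptions made for illustration. Only the "## End of export" marker and the 5.7 kWp figure come from the test file. The token estimate uses the rough rule of thumb of four bytes per token, which is how 220 kB comes out at about 55,000 tokens.

```python
import json

# Built at runtime so a literal fence never appears inside this example.
FENCE = "`" * 3


def json_block(title, payload):
    """Render one Markdown section containing a fenced JSON block (assumed layout)."""
    body = json.dumps(payload, indent=2, ensure_ascii=False)
    return f"## {title}\n\n{FENCE}json\n{body}\n{FENCE}\n"


def build_export(metadata, daily, instructions):
    """Assemble instructions, data sections, and the closing marker into one file."""
    sections = [
        "# Solar production export\n",          # hypothetical title
        f"## Instructions\n\n{instructions}\n",  # hypothetical section name
        json_block("System metadata", metadata),
        json_block("Daily production", daily),
        "## End of export\n",                    # marker mentioned in the article
    ]
    return "\n".join(sections)


def estimate_tokens(text, bytes_per_token=4.0):
    """Rough token estimate; 4 bytes per token is a heuristic, not a tokenizer."""
    return round(len(text.encode("utf-8")) / bytes_per_token)


# Illustrative input only: the daily value below is made up.
export = build_export(
    {"peak_power_kwp": 5.7, "orientation": "east/west"},
    [{"date": "2023-01-01", "kwh": 1.8}],
    "Answer in Dutch. Do not invent values.",
)
print(estimate_tokens(export))
```

A real tokenizer will give a different count depending on the model, so the estimate is only useful for checking that a file is comfortably below a model's limit.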

The prompt itself was extensive. It asked the AI to produce thirteen analytical sections in a specific order, answer thirty-nine specific questions ranging from "what is the lifetime energy production" to "what would happen to the self-consumption ratio if the household added an EV charging 5 kWh per day", and optionally generate a branded PDF report at the end. The instructions specified the response language (Dutch in our tests), the currency, and explicit rules against fabricating values or extrapolating beyond the data.

We tested six AI assistants on this same file: Anthropic's Claude (via claude.ai), OpenAI's ChatGPT (Plus tier with Code Interpreter), Google's Gemini (Pro tier), Google AI Studio (with code execution enabled), xAI's Grok, and Microsoft's Copilot. In each case the user prompt was identical: a single sentence in Dutch asking the assistant to read the file and follow the instructions inside.

What follows is what each one did. We have organized them from worst to best, because the failure modes are more instructive than the successes.

Copilot: the fabricated error

Microsoft Copilot's response was, by any reasonable measure, a complete failure. But it failed in an interesting way that turned out to be the most useful single data point of the whole experiment.

When given the file, Copilot returned a long, polite paragraph explaining that the export was marked as IsTruncated="true" and that it could only see a small portion of the data. It listed which sections it could see and which it could not, helpfully offered to do a partial analysis with what was available, and asked the user to send the rest of the data in multiple parts.

The problem with this response is that none of it is true. The file is not marked truncated. There is no IsTruncated attribute anywhere in the export. The full file was provided, complete with the explicit ## End of export marker at the bottom. Copilot fabricated the limitation, then fabricated the truncation marker to support its fabrication, then offered a workflow to address the fabricated problem.
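Claims like this are cheap to verify before trusting the chatbot. The sketch below checks an export for the closing marker, searches for the string IsTruncated, and confirms that every embedded JSON block parses. The function name `check_export` and the assumption that JSON is embedded in fenced blocks tagged `json` are ours, chosen for illustration; adjust the pattern to whatever layout the real file uses.

```python
import json
import re

# Built at runtime so a literal fence never appears inside this example.
FENCE = "`" * 3
# Assumed layout: each JSON payload sits in a fenced block tagged "json".
JSON_BLOCK = re.compile(FENCE + r"json\n(.*?)\n" + FENCE, re.DOTALL)


def check_export(text, end_marker="## End of export"):
    """Report whether an export looks complete and whether it mentions truncation."""
    blocks = JSON_BLOCK.findall(text)
    valid = 0
    for block in blocks:
        try:
            json.loads(block)
            valid += 1
        except json.JSONDecodeError:
            pass  # counted as found but not valid
    return {
        "ends_with_marker": text.rstrip().endswith(end_marker),
        "mentions_is_truncated": "IsTruncated" in text,
        "json_blocks_found": len(blocks),
        "json_blocks_valid": valid,
    }


# Tiny stand-in for a real export; the kWh value is made up.
sample = (
    "# Solar production export\n\n"
    + FENCE + 'json\n{"kwh": 12.5}\n' + FENCE
    + "\n\n## End of export\n"
)
print(check_export(sample))
```

If the report shows the marker present, no truncation string, and all JSON blocks valid, then an assistant that says the file was cut off is describing its own limits and not the file.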

This is a textbook example of what researchers call confabulation: an AI generating a plausible-sounding excuse for its own inability to handle a task, dressing the excuse up in technical detail to make it seem authoritative. Copilot did not know how to digest a 220 kB Markdown file with embedded JSON, and rather than say so, it pretended the file was the problem.

What is dangerous about this failure mode is how convincing it sounds. A non-technical...
