Lessons from a weekend building local AI workflows

stefanopetrilli

Like everyone and their grandmother, these days I am into agents! I finally got to spend some time learning more about multi-agent workflows: I came up with a simple use case, built a first iteration, and watched it shatter against messy reality. Then I learned a few things.

This post shares three lessons: lost-in-the-middle, the compound bias problem, and the fact that Whisper isn’t a silver bullet.

The tool I built sort of works and is available on GitHub. It is a multi-agent video editor that takes a video and outputs a shortened version, removing the fluff so that only the juicy parts remain.

Don’t expect production-ready magic, but I find it pretty entertaining :).

Naive solution

The first naive solution that came to mind was the following:

graph TD
    A[Initial Video] -->|"raw video"| B[Speech To Text]
    B -->|"full transcript"| C[Editor Agent]
    B -->|"full transcript"| D[Reviewer Agent]
    C -->|"proposed cuts"| D
    D -.->|"❌ Rejected: retry"| C
    D -->|"✅ Accepted: cut list"| E[Video Editing Agent]
    E -->|"stitched video"| F[Final Video]

The plan: take a video, run it through a speech-to-text model to get the transcript, and feed the full transcript into an editor agent that decides which segments are the most important. Then feed the full transcript and the selected segments to a reviewer agent tasked with deciding whether the selected sections of the video actually preserve the message.

In this plan, the editor agent and the reviewer agent go back and forth until the reviewer agrees with the editor's selection.

Finally, FFmpeg stitches the final video together.
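To make the back-and-forth concrete, here is a minimal Python sketch of the editor/reviewer loop. It is not the code from the repository: `negotiate_cuts`, `Review`, and the `propose` and `review` callables are hypothetical names standing in for the two LLM calls, and the `max_rounds` cap is an added assumption so the loop cannot run forever.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A cut is a (start_seconds, end_seconds) segment to keep in the final video.
Cut = Tuple[float, float]


@dataclass
class Review:
    """Verdict returned by the (hypothetical) reviewer agent."""
    accepted: bool
    feedback: str = ""


def negotiate_cuts(
    transcript: str,
    propose: Callable[[str, str], List[Cut]],
    review: Callable[[str, List[Cut]], Review],
    max_rounds: int = 5,
) -> Optional[List[Cut]]:
    """Loop editor -> reviewer until the reviewer accepts or rounds run out.

    `propose(transcript, feedback)` plays the editor agent and
    `review(transcript, cuts)` plays the reviewer agent.
    Returns the accepted cut list, or None if no proposal was accepted.
    """
    feedback = ""  # the first proposal is made without reviewer feedback
    for _ in range(max_rounds):
        cuts = propose(transcript, feedback)
        verdict = review(transcript, cuts)
        if verdict.accepted:
            return cuts
        # Hand the rejection reason back to the editor for the next attempt.
        feedback = verdict.feedback
    return None
```

Passing the agents in as plain callables keeps the control flow testable without a model: stubs can stand in for both agents while the loop logic is being debugged.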

On paper? Flawless.

In reality? The output looked terrible 🥹.

You can look at it yourself:

Original

First iteration version

The rest of the post is about what went wrong and what I learned.

Lessons learned:

Lost-in-the-middle

A 2024 paper, Lost in the Middle: How Language Models Use Long Contexts, documents that models oversample the beginning and the end of their context window and are less effective at retrieving information from the middle of it.

What this paper shows won’t surprise the OG ChatGPT 3.5 users who, in one way or another, already experienced it firsthand.

In the LLM world, 2026 is a different geological era compared to 2024, and this defect has become much less noticeable as models have improved and learned to juggle longer context windows. Still, lost-in-the-middle appears to be inherent to transformer architectures, so the problem remains.

It’s also difficult to report on more recent literature on this topic. LLMs aren’t a moving target, they’re a running target. Any finding might be obsolete the moment a new model generation comes out.

The most recent work I found on the topic is the paper LongFuncEval: Measuring the effectiveness of long context models for function calling, where Appendix F is entirely dedicated to measuring this effect on the SOTA models of May 2025. Empirically, lost-in-the-middle is still alive and kicking, at least with the model families tested on this project: DeepSeek V4, Qwen 3.7, and GLM 5.
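For readers who want to check position sensitivity on their own model, one common approach is to hide a known fact at different depths of a long filler text and ask the model to retrieve it. The sketch below only builds the prompts. It is not how this post or either paper ran its measurements, and `build_probe` and `probes_by_depth` are hypothetical helper names.

```python
from typing import Dict, List


def build_probe(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` into `filler` lines at a relative depth.

    depth=0.0 puts the needle first, 1.0 puts it last, 0.5 in the middle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    lines = filler[:index] + [needle] + filler[index:]
    return "\n".join(lines)


def probes_by_depth(filler: List[str], needle: str, steps: int = 5) -> Dict[float, str]:
    """Build one probe per evenly spaced depth (steps must be at least 2)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    depths = [i / (steps - 1) for i in range(steps)]
    return {d: build_probe(filler, needle, d) for d in depths}
```

Each probe can then be sent to the model with the same question about the needle. If answers are reliably worse for the middle depths than for 0.0 and 1.0, the model shows the effect on that input length.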

The editor agent in this workflow is the perfect storm for lost-in-the-middle. The videos I tested are quite long, and the real theme often hides under a pile of fluff, exactly where the models are least sensitive: around the middle.

Creators often give a short summary of the content at the beginning of the video. The LLM, which by design oversamples that part, then easily decides that the introductory summary is everything the user needs to know.

Often the opposite is true: the initial summary brings very little value, and the middle is the juicy part that interests the user.

As a result, the editor agent always oversampled the introduction or the end of the video.

The solution was to modify the architecture and add one more node to the workflow. The new agent receives the whole transcript and extracts the core message from it. It then passes that along to the editor and to the reviewer in the format [core message] + [full transcript] + [core message]. This idea came from reading the original lost-in-the-middle paper.

I had zero expectations that it would work, but surprisingly the agents stopped oversampling the beginning of the videos.
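The [core message] + [full transcript] + [core message] layout can be sketched as a small prompt builder. This is an illustration under assumptions rather than the code from the repository: the function name `build_sandwich_prompt` and the section labels are hypothetical, and the real prompts may be worded differently.

```python
def build_sandwich_prompt(core_message: str, transcript: str) -> str:
    """Place the core message both before and after the full transcript.

    The repetition puts the key information in the two regions the model
    attends to most: the start and the end of the context.
    """
    core = core_message.strip()
    sections = [
        f"CORE MESSAGE:\n{core}",
        f"FULL TRANSCRIPT:\n{transcript.strip()}",
        f"CORE MESSAGE (repeated):\n{core}",
    ]
    return "\n\n".join(sections)
```

The same string would be handed to both the editor and the reviewer, so that both agents judge the transcript against the same statement of what the video is about.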

The compound bias problem:

The initial assumption for the workflow was that the editor and the reviewer would debate and iterate before coming to an agreement. What really happened is that the reviewer agent acted as a rubber stamp: it approved the editor's proposals almost every time.

I peeked at the literature and what I discovered is elegantly summarized by this quote: “LLMs’ inherent sycophancy can collapse debates into premature consensus, potentially undermining the benefits of multi-agent debate. Sycophancy is a core failure mode that amplifies disagreement collapse before...
