Toward More Controllable AI Video Editing: Research Exploration at Netflix

MattSayar1 pts0 comments

Toward More Controllable AI Video Editing: An Early Research Exploration at Netflix | by Netflix Technology Blog | Jun, 2026 | Netflix TechBlogSitemapOpen in appSign up<br>Sign in

Medium Logo

Get app<br>Write

Search

Sign up<br>Sign in

Netflix TechBlog

Learn about Netflix’s world class engineering efforts, company culture, product developments and more.

Toward More Controllable AI Video Editing: An Early Research Exploration at Netflix

Netflix Technology Blog

10 min read·<br>15 hours ago

Listen

Share

By Zhuoning Yuan, Ta-Ying Cheng, Benjamin Klein, Bahareh Azarnoush<br>Introduction<br>At Netflix, we build technology to help storytellers bring their creative visions to life and to help members discover the stories they love.<br>To connect stories with diverse audiences around the world, we produce promotional assets, including trailers, teasers, and social short‑form videos, that build on and elevate the original footage. Through close collaboration with the teams crafting these assets, we identified a recurring gap in current tools. Transforming raw footage into a polished final asset often requires complex edits like seamlessly adding new visual elements, patching or replacing backgrounds, or removing unwanted objects without breaking the scene’s physical continuity. These tasks typically demand hours of specialized manual editing work. While recent generative video editing models show promise, they often struggle to preserve the integrity of the source footage. Many methods regenerate every pixel to make an edit, which can fail to isolate changes and inadvertently alter elements that should remain untouched. To execute these tasks effectively, artists need tools that empower them to dictate exactly what changes and how it changes.<br>Get Netflix Technology Blog’s stories in your inbox

Join Medium for free to get updates from this writer.

Subscribe

Subscribe

Remember me for faster sign in

Our research goal is to make this process easier for artists. We’re deliberate about where and how AI is applied, ensuring that the technology always serves the creative intent. That principle drives our recent work: exploring the benefits of generative AI in ways that protect and expand creative choice, and keeping artists in precise control of their final vision. Recent advancements in AI video editing have demonstrated impressive capabilities in streamlining complex manual editing workflows, but key challenges remain before they can reliably support professional use:<br>Unintended edits: When editing a specific element in a video clip, many methods regenerate the entire video, which can inadvertently alter identity, performance, and other elements like objects, backgrounds, or critical scene details.<br>Press enter or click to view image in full size

Left: input video. Right: output from Ditto using the prompt “change the background to a winding coastal highway in California,” which completely changes the scene.Unnatural physics : When removing objects, many methods focus only on erasing the target while ignoring the scene’s physical continuity. This can lead to inconsistent motion and implausible interactions, making the results look unnatural.<br>Press enter or click to view image in full size

Left: the green mask denotes the target to be removed. Right: output from Gen-Omnimatte where the target was removed, but the physical continuity of the scene was ignored — the pool float shouldn’t move if there’s no interaction with it.Today, we’re sharing two research explorations that aim to address these challenges. We believe this work can help advance the field in a way that’s both meaningful and responsible:<br>Vera : a layered video diffusion model. Vera generates only what needs to change as separate edit layers while leaving the rest of the video untouched, preserving the identities, performances, and other details from the source footage exactly as filmed.<br>VOID : a video inpainting model for video object and interaction deletion. VOID performs physically plausible inpainting in complex scenes: it doesn’t just remove an object, but also reconstructs the scene as if the object was never there.<br>Along with this blog post, we’re also publicly releasing the research papers that detail the algorithmic innovations behind Vera and VOID. We hope these publications will enable other researchers to experiment with these ideas, build upon our findings, and further advance the field.<br>Vera: A Layered Video Diffusion Model<br>Existing video editing models regenerate the entire clip, coupling the intended edit with regions that should remain unchanged. This increases the risk of altering details of the original footage. To tackle this challenge, we introduce Vera , a novel layered video diffusion framework for content-preserving video editing.

Teaser for Vera (disclaimer: This is a research prototype, not an official product).Inference Pipeline<br>Given a source video and a text editing instruction, Vera jointly generates an edit layer and an alpha matte....

video editing netflix research vera scene

Related Articles