ByteDance Open-Sources Lance, a 3B Multimodal Model for Images and Video

ByteDance Just Open-Sourced a 3B Model for Images, Video, Editing, and Reasoning

By Mohit Geryani

May 21, 2026

Last updated: May 21, 2026

Most multimodal AI systems today are still collections of separate tools pretending to be one product.

One model generates images. Another edits them. A different one handles video. The entire stack works, but it often feels stitched together behind the scenes.

ByteDance is taking a different approach. The company has released Lance, a new open multimodal model that tries to handle image generation, video generation, editing, and visual reasoning inside one native framework. The surprising part is not just the scope but the size: Lance runs with only 3 billion active parameters while still posting competitive numbers across image, video, and editing benchmarks.

The industry has spent the last two years building specialized AI systems for every separate media task imaginable. Lance is part of a growing push in the opposite direction: fewer models, more unified behavior, and systems that can move between understanding and generation.

The problem with multimodal models

A lot of multimodal AI products still work like several different systems hiding behind one interface.

If you want the AI to explain what happened in a video or answer questions about it, that usually becomes a separate pipeline entirely.

The result is that many multimodal products are really collections of specialized models passing information back and forth behind the scenes.

That setup works, but it becomes complex the moment you try building longer AI workflows. Context gets lost between systems. Outputs become inconsistent. One model may generate something another model struggles to understand later.

ByteDance is trying to simplify that with Lance. Instead of separating generation, editing, and reasoning into different stacks, the company trained one framework to handle all of them together. The same model can generate images, create video, edit both, and answer questions about visual content.

AI companies are slowly moving toward agents and autonomous workflows instead of single prompts. A system that can create and understand visual content inside the same model is much easier to plug into those workflows than a chain of disconnected tools.
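The difference between a stitched pipeline and a single model can be sketched in a few lines of Python. This is a toy illustration only: ToyModel, stitched, unified, and the task names are hypothetical and do not reflect Lance's actual interface. The sketch just shows how a chain of separate models starts each step with no prior context, while one shared model sees the whole history.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for a multimodal model. Not Lance's real API."""
    history: list = field(default_factory=list)

    def run(self, task: str, prompt: str) -> str:
        # Record the request, then report how much context is visible so far.
        self.history.append((task, prompt))
        return f"{task}: {len(self.history)} step(s) of context"


def stitched(steps):
    # One fresh model per task, so no step sees what came before it.
    return [ToyModel().run(task, prompt) for task, prompt in steps]


def unified(steps):
    # A single model handles every task and keeps the running history.
    model = ToyModel()
    return [model.run(task, prompt) for task, prompt in steps]


# Hypothetical three-step workflow: generate, edit, then ask about the result.
STEPS = [
    ("generate_video", "a cat walking across a table"),
    ("edit_video", "change the scene to night"),
    ("answer_question", "how many cats appear?"),
]

if __name__ == "__main__":
    print(stitched(STEPS))
    print(unified(STEPS))
```

In the stitched version every step reports one step of context, while in the unified version the final question is answered with all three steps visible. Real systems pass far richer state than a list of prompts, but the bookkeeping problem is the same.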

What Lance actually is

Lance is a native multimodal model from ByteDance.

The model supports text-to-image generation, text-to-video generation, image editing, video editing, image understanding, and video understanding inside one framework. ByteDance says the model was trained from scratch using a staged multi-task setup.

Lance runs with 3B active parameters, which is relatively small compared to many recent multimodal systems pushing into video generation. Despite that, ByteDance is positioning it directly against larger unified models like BAGEL, TUNA, and InternVL-U across image generation, editing, and video benchmarks.

Some demos show standard text-to-video clips; others lean into multi-turn editing and visual reasoning. In one example, the model edits a video while preserving consistency across multiple changes.

Question: How many times did the person launch objects on the table? Options: (A) 3 (B) 2 (C) 4 Response: (A) 3

In another, it answers questions about object movement and repeated actions inside short clips. The model can also describe images, read charts, recognize license plates, and handle basic visual reasoning tasks.

That combination is really the point of Lance. ByteDance is not treating generation and understanding as separate products anymore. It wants one system moving between both naturally.

What the Benchmarks Show

Benchmark | Lance (3B) | Notable Comparison
VBench (video generation) | 85.11 | Higher than Wan 2.1...
