Lance: Unified Multimodal Modeling by Multi-Task Synergy
Fengyi Fu*,<br>Mengqi Huang*,✉,<br>Shaojin Wu*,<br>Yunsheng Jiang*,<br>Yufei Huo,<br>Jianzhu Guo✉,§
Hao Li,<br>Yinghang Song,<br>Fei Ding,<br>Qian He,<br>Zheren Fu,<br>Zhendong Mao,<br>Yongdong Zhang
ByteDance
* Equal contribution<br>✉ Corresponding authors<br>§ Project lead
Note: Lance is a research project rather than a polished product model. The released checkpoint was trained on up to 128 A100 GPUs, at resolutions up to 768x768 for image generation and 480p at 12 FPS for video generation. Our goal is to share a research artifact for studying unified image/video understanding, generation, and editing with a relatively small model and a limited compute budget. Output quality may vary across prompts, resolutions, durations, motion complexity, and editing scenarios, and we see further opportunities to improve the post-training recipe. We appreciate constructive feedback from the community as we continue improving the project.
🔥 Updates
2026/05/26: 🎨 The Gradio interface now supports image and video generation, editing, and understanding. Try it out!
2026/05/25: ✨ The Hugging Face Space is now live, thanks to the HF team!
2026/05/19: 🤗 The technical report is now available on arXiv.
2026/05/18: 🔥 We launched the project homepage and released the initial inference code and model weights on GitHub and Hugging Face.
🌟 Highlights
Lance is a 3B native unified multimodal model that supports image and video understanding, generation, and editing within a single framework.
Efficient at 3B scale. With only 3B active parameters, Lance achieves competitive performance across image generation, image editing, and video generation benchmarks.
Training from scratch. Lance is trained from scratch with a staged multi-task recipe, within a budget of up to 128 A100 GPUs.
We are actively updating and improving this repository. If you find any bugs or have suggestions, please feel free to open an issue or submit a pull request (PR) 💖.
📅 Roadmap
Release the fine-tuning code.
Release the image-to-video generation code.
🎨 Demo
🔥 We recommend visiting our homepage for more visual results. 🔥
Text-to-Video
Video Editing
Multi-turn Consistency Editing
Intelligent Video Generation
🚀 Installation
Recommended Environment
Software: Python 3.10+, CUDA 12.4+ (required)
Hardware: A GPU with at least 40GB VRAM is required for inference
We have tested the following dependency combinations on NVIDIA A100:
PyTorch 2.8.0 + cu126 + flash-attn 2.8.3
PyTorch 2.5.1 + cu124 + flash-attn 2.6.3
The default installation commands use the PyTorch 2.8.0 + cu126 setup. For other GPU models, please choose and validate a PyTorch build and a matching flash-attn version according to your driver, CUDA runtime, Python version, and GPU architecture.
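Before running inference, it can help to compare the local setup against the tested combinations listed above. The following is a minimal sketch, not part of the Lance repository: `describe_environment` is a hypothetical helper name, and it only reports what it finds rather than deciding whether a combination is supported.

```python
import importlib
import sys


def describe_environment():
    """Collect version info to compare against the tested combinations."""
    info = {"python": ".".join(map(str, sys.version_info[:3]))}
    for name in ("torch", "flash_attn"):
        try:
            module = importlib.import_module(name)
            info[name] = getattr(module, "__version__", "unknown")
        except Exception:
            # Not installed, or installed but failing to load (for example a
            # flash-attn build that does not match the installed PyTorch).
            info[name] = None

    torch = sys.modules.get("torch")
    if info["torch"] is not None and torch is not None:
        info["cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built with
        info["cuda_available"] = torch.cuda.is_available()
        if info["cuda_available"]:
            props = torch.cuda.get_device_properties(0)
            info["gpu"] = props.name
            info["vram_gb"] = round(props.total_memory / 1024**3, 1)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

A `None` entry for `flash_attn` alongside a valid `torch` version usually points to a missing or mismatched flash-attn build, which is the case the prebuilt-wheel note in the installation steps addresses.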
Installation Steps
First, clone the repository:
git clone https://github.com/bytedance/Lance.git<br>cd Lance
Then, set up the environment:
conda create -n Lance python=3.11 -y<br>conda activate Lance<br>pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126<br>pip install -r requirements.txt<br>pip install flash-attn==2.8.3 --no-build-isolation
Note: If installing flash-attn from source fails, you can install a prebuilt wheel instead. The wheelhouse below is from a third-party repository and is provided for reference only; please verify that any wheel you install matches your Python, PyTorch, and CUDA versions.
pip install --no-cache-dir --no-deps --force-reinstall \<br>"https://huggingface.co/strangertoolshf/flash_attention_2_wheelhouse/resolve/main/wheelhouse-flash_attn-2.8.3/linux_x86_64/torch2.8/cu12/abiTRUE/cp311/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl"
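The wheel filename itself encodes the versions it was built for, so the check suggested in the note can be partly automated. The sketch below is illustrative only: the regular expression is inferred from the single filename shown above (an assumption, other wheelhouses may name files differently), and `parse_flash_attn_wheel` and `find_mismatches` are hypothetical helper names.

```python
import re

# Pattern inferred from the one wheel filename in this README (assumption).
WHEEL_RE = re.compile(
    r"^(?P<name>[^-]+)-(?P<version>[^-+]+)\+"
    r"cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)cxx11abi(?P<cxx11abi>TRUE|FALSE)"
    r"-(?P<python>cp\d+)-(?P<abi_tag>cp\d+)-(?P<platform>.+)\.whl$"
)


def parse_flash_attn_wheel(filename):
    """Split a flash-attn wheel filename into its version tags."""
    match = WHEEL_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized wheel filename: {filename}")
    return match.groupdict()


def find_mismatches(tags, python_tag, torch_version, cuda_version):
    """Return human-readable mismatches between a wheel and an environment."""
    problems = []
    if tags["python"] != python_tag:
        problems.append(f"python: wheel {tags['python']}, env {python_tag}")
    # Compare major.minor only, ignoring the local suffix such as "+cu126".
    torch_mm = ".".join(torch_version.split("+")[0].split(".")[:2])
    if tags["torch"] != torch_mm:
        problems.append(f"torch: wheel {tags['torch']}, env {torch_mm}")
    cuda_major = cuda_version.split(".")[0]
    if tags["cuda"] != cuda_major:
        problems.append(f"cuda: wheel {tags['cuda']}, env {cuda_major}")
    return problems


wheel = "flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl"
tags = parse_flash_attn_wheel(wheel)
print(find_mismatches(tags, "cp311", "2.8.0+cu126", "12.6"))
```

For the default setup in this README (Python 3.11, PyTorch 2.8.0 + cu126) the list of mismatches is empty. In practice the three environment values would come from `sys.version_info`, `torch.__version__`, and `torch.version.cuda`.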
Then, download the model weights from Lance-3B on Hugging Face and place them in the downloads/ directory:
from huggingface_hub import snapshot_download
save_dir = "./downloads/"<br>repo_id = "bytedance-research/Lance"<br>cache_dir = save_dir + "/cache"
snapshot_download(cache_dir=cache_dir,<br>local_dir=save_dir,<br>repo_id=repo_id,<br>local_dir_use_symlinks=False,<br>resume_download=True,<br>allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt", "*.pth"],<br>)
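After the download finishes, a quick inventory of the downloads/ directory can confirm that weight files actually arrived. This is a minimal sketch under stated assumptions: `summarize_download` is a hypothetical helper, the pattern list simply mirrors the `allow_patterns` used above, and the `cache` and `.cache` subdirectories are skipped on the assumption that they hold only download bookkeeping.

```python
from fnmatch import fnmatch
from pathlib import Path

# Mirrors the allow_patterns passed to snapshot_download above.
ALLOW_PATTERNS = ["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt", "*.pth"]


def summarize_download(root, patterns=ALLOW_PATTERNS, skip_dirs=("cache", ".cache")):
    """Return {pattern: (file_count, total_bytes)} for files under root."""
    root = Path(root)
    summary = {pattern: [0, 0] for pattern in patterns}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Ignore anything that lives inside a cache directory.
        if any(part in skip_dirs for part in path.relative_to(root).parts[:-1]):
            continue
        for pattern in patterns:
            if fnmatch(path.name, pattern):
                summary[pattern][0] += 1
                summary[pattern][1] += path.stat().st_size
                break
    return {pattern: tuple(counts) for pattern, counts in summary.items()}


if __name__ == "__main__":
    for pattern, (count, size) in summarize_download("./downloads").items():
        print(f"{pattern}: {count} files, {size / 1024**2:.1f} MiB")
```

A count of zero for every weight pattern (`*.safetensors`, `*.bin`, `*.pth`) suggests the download did not complete or was written to a different directory.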
📚 Usage
Inference
Basic Usage
bash inference_lance.sh
Before running, please configure the inference parameters at the top of inference_lance.sh.
Supported tasks: t2i, t2v, image_edit, video_edit, x2t_image, and x2t_video. You can modify TASK_DEFAULT_CONFIGS in inference_lance.py to customize the default data samples for each task.
Note: For all tasks, we recommend following the prompt format used in the provided examples when writing input prompts, as this typically leads to better generation quality.
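When launching several runs from Python, it can be convenient to assemble the command line programmatically and reject unsupported task names early. The sketch below is not part of the repository: `build_inference_command` is a hypothetical helper, and the flag names (`--TASK_NAME`, `--MODEL_PATH`, `--RESOLUTION`, `--NUM_FRAMES`) are taken from the text-to-video example in this README. Whether every flag applies to every task is an assumption to verify against inference_lance.sh.

```python
import shlex

# Task names listed as supported in this README.
SUPPORTED_TASKS = ("t2i", "t2v", "image_edit", "video_edit", "x2t_image", "x2t_video")


def build_inference_command(task, model_path, resolution=None, num_frames=None,
                            script="inference_lance.sh"):
    """Build an argument list for the inference script (hypothetical helper)."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task {task!r}; expected one of {SUPPORTED_TASKS}")
    cmd = ["bash", script, "--TASK_NAME", task, "--MODEL_PATH", model_path]
    if resolution is not None:
        cmd += ["--RESOLUTION", resolution]
    if num_frames is not None:
        cmd += ["--NUM_FRAMES", str(num_frames)]
    return cmd


if __name__ == "__main__":
    cmd = build_inference_command("t2v", "downloads/Lance_3B_Video", "video_480p", 121)
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.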
Task Examples
Text-to-Video
bash inference_lance.sh \<br>--TASK_NAME t2v \<br>--MODEL_PATH downloads/Lance_3B_Video \<br>--RESOLUTION video_480p \<br>--NUM_FRAMES 121...