LocateAnything: Fast Vision-Language Grounding with Parallel Box Decoding

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Detection Showcase. LocateAnything performs diverse localization tasks under a unified vision-language model, including document understanding, GUI grounding, dense object detection, and OCR localization.

Decoding Speed Comparison. Parallel Box Decoding (PBD) vs. Quantized Coordinate Decoding: PBD predicts each bounding box atomically in a single forward pass, which yields significantly higher decoding throughput.

Highlights

Advancing the Speed-Accuracy Frontier

Top: LocateAnything supports diverse localization tasks under a unified vision-language model. Bottom: Textual digit decoding and quantized coordinate decoding predict coordinate tokens sequentially. In contrast, Parallel Box Decoding predicts each geometric unit (e.g., a bounding box) in a single forward pass.

Parallel Box Decoding (PBD)

Treats each bounding box (or point) as an atomic unit and learns to predict the complete coordinate set simultaneously. PBD preserves intra-box geometric coherence and prevents the generation of irregular structural tokens.

Hybrid Inference Mode

Uses Fast Mode (MTP) by default and seamlessly falls back to Slow Mode (NTP) when parallel outputs are unreliable, e.g., due to format irregularity or spatial ambiguity. Preserves most of the speed gains while maintaining robust outputs.

LocateAnything-Data

A massive, diverse training corpus with 138M language queries and 785M bounding boxes covering general object detection, GUI grounding, referring comprehension, text localization, and point-based tasks.

State-of-the-Art Performance

Improves throughput by up to 2.5× while surpassing prior VLM grounding models in localization quality across challenging benchmarks like LVIS, M6Doc, and ScreenSpot-Pro.

Abstract

Overcoming Autoregressive Bottlenecks in VLM Grounding

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation.
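To make the bottleneck concrete, here is a minimal sketch of how a quantized coordinate scheme might serialize one box into four discrete tokens, and how the number of sequential decoding steps grows with the number of boxes. The bin count of 1000, the function names, and the omission of separator or label tokens are assumptions made for illustration; none of them come from the source.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

NUM_BINS = 1000  # assumed bin count; the source does not state one


def quantize_box(box: Box, width: int, height: int,
                 num_bins: int = NUM_BINS) -> List[int]:
    """Map one box to four integer coordinate tokens in [0, num_bins - 1]."""
    x1, y1, x2, y2 = box
    scale = num_bins - 1
    normalized = (x1 / width, y1 / height, x2 / width, y2 / height)
    # Clamp so that out-of-image coordinates still map to a valid bin.
    return [min(scale, max(0, round(v * scale))) for v in normalized]


def sequential_decode_steps(num_boxes: int, tokens_per_box: int = 4) -> int:
    """Decoding steps when every coordinate token is generated one at a time."""
    return num_boxes * tokens_per_box


tokens = quantize_box((0, 0, 640, 480), width=640, height=480)
steps = sequential_decode_steps(num_boxes=10)
```

Under these assumptions, ten boxes cost forty strictly sequential steps for coordinates alone, which is the cost that decoding a box as one unit is meant to avoid.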

We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy.

We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed–accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.

Method

LocateAnything: Parallel Box Decoding

To reconcile high-throughput decoding with reliable localization, we propose LocateAnything, a unified framework for VLM-based visual detection and grounding built upon Parallel Box Decoding (PBD).

Comparison of standard token decoding methods vs. Parallel Box Decoding (PBD).

Box-Aligned Atomic Units

Input: An image and a natural language text query. The vision encoder extracts visual tokens at native resolution, preserving fine-grained spatial details for high-precision localization.

Parallel Decoding: LocateAnything treats each bounding box (or point) as an atomic unit of constant length and predicts the full coordinate set (x1, y1, x2, y2) in one parallel step, avoiding arbitrary chunking of coordinate tokens.

Architecture: Built upon a Moon-ViT vision encoder and a Qwen2.5 language decoder, bridged by an MLP projector, the model directly converts visual tokens into a sequence of box-aligned, block-level predictions.
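The idea of a box-aligned atomic unit can be sketched in a few lines. The snippet below groups coordinates into fixed-length units of four and runs a toy decoding loop in which each step emits a whole box. The `predict_box` callable is a hypothetical stand-in for one forward pass of the model, and passing `num_boxes` explicitly is a simplification, since the source does not describe how decoding terminates.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, ...]  # four coordinates: (x1, y1, x2, y2)
COORDS_PER_BOX = 4     # constant unit length for a bounding box


def split_into_box_units(coords: Sequence[int]) -> List[Box]:
    """Group a flat coordinate sequence into box-aligned units of four."""
    if len(coords) % COORDS_PER_BOX != 0:
        raise ValueError("coordinate sequence is not box-aligned")
    return [
        tuple(coords[i:i + COORDS_PER_BOX])
        for i in range(0, len(coords), COORDS_PER_BOX)
    ]


def decode_boxes_in_parallel(
    predict_box: Callable[[List[Box]], Box],  # hypothetical model call
    num_boxes: int,
) -> Tuple[List[Box], int]:
    """Emit one complete box per step; return the boxes and the step count."""
    boxes: List[Box] = []
    steps = 0
    for _ in range(num_boxes):
        # One call stands in for one forward pass yielding all four coordinates.
        boxes.append(predict_box(boxes))
        steps += 1
    return boxes, steps


units = split_into_box_units([1, 2, 3, 4, 5, 6, 7, 8])
boxes, steps = decode_boxes_in_parallel(lambda prev: (0, 0, 10, 10), num_boxes=3)
```

In this toy setting three boxes take three steps rather than twelve, and a sequence that is not a multiple of four is rejected outright instead of being chunked arbitrarily.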

Flexible Inference Modes

Fast Mode (MTP): Predicts full boxes in parallel for maximum throughput, suitable for latency- and compute-constrained settings such as on-device robotics and embodied agents.

Slow Mode (NTP): Decodes coordinate tokens autoregressively for maximum stability, appropriate for high-precision labeling, dataset curation, and accuracy-oriented offline evaluation.

Hybrid Mode: Uses Fast Mode by default and falls back to Slow Mode when format irregularity or spatial ambiguity is detected, preserving most speed gains while maintaining robust outputs.
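The control flow of Hybrid Mode might look roughly like the sketch below: try the fast path, validate the result, and re-decode with the slow path only if validation fails. The validity check shown here (four in-range coordinates with x1 < x2 and y1 < y2) is an assumed proxy for "format irregularity"; the source does not specify how irregularity or spatial ambiguity is detected. `fast_decode` and `slow_decode` are hypothetical callables standing in for the two model modes.

```python
from typing import Callable, Sequence, Tuple

Box = Tuple[int, ...]  # four coordinates: (x1, y1, x2, y2)


def is_well_formed(raw: Sequence[int], num_bins: int = 1000) -> bool:
    """Assumed check: four in-range coordinates forming a non-empty box."""
    if len(raw) != 4:
        return False
    if any(not 0 <= v < num_bins for v in raw):
        return False
    x1, y1, x2, y2 = raw
    return x1 < x2 and y1 < y2


def hybrid_decode(
    fast_decode: Callable[[], Sequence[int]],  # hypothetical Fast Mode (MTP)
    slow_decode: Callable[[], Sequence[int]],  # hypothetical Slow Mode (NTP)
) -> Tuple[Box, str]:
    """Return the decoded box and which mode produced it."""
    raw = fast_decode()
    if is_well_formed(raw):
        return tuple(raw), "fast"
    # Fall back to token-by-token decoding only when the fast output fails.
    return tuple(slow_decode()), "slow"


ok = hybrid_decode(lambda: [10, 20, 30, 40], lambda: [0, 0, 1, 1])
fallback = hybrid_decode(lambda: [30, 20, 10, 40], lambda: [10, 20, 30, 40])
```

Because the slow path runs only on the outputs that fail the check, the overall cost stays close to Fast Mode whenever most boxes are well formed.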

Architecture overview of LocateAnything using Parallel Box Decoding.

On-Demand Inference: Corrected NTP Re-decoding

When parallel decoding encounters Format Irregularity (malformed syntax at...
