LocateAnything: Fast Vision-Language Grounding with Parallel Box Decoding

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Detection Showcase. LocateAnything performs diverse localization tasks under a unified vision-language model, including document understanding, GUI grounding, dense object detection, and OCR localization.

Decoding Speed Comparison. Parallel Box Decoding (PBD) vs. Quantized Coordinate Decoding. PBD predicts each bounding box atomically in a single forward pass, which yields significantly higher decoding throughput.

Highlights

Advancing the Speed-Accuracy Frontier

Top: LocateAnything supports diverse localization tasks under a unified vision-language model. Bottom: Textual digit decoding and quantized coordinate decoding predict coordinate tokens sequentially. In contrast, Parallel Box Decoding predicts each geometric unit (e.g., a bounding box) in a single forward pass.

Parallel Box Decoding (PBD)

Treats each bounding box (or point) as an atomic unit and learns to predict the complete coordinate set simultaneously. PBD preserves intra-box geometric coherence and avoids generating irregular structural tokens.

Hybrid Inference Mode

Uses Fast Mode (MTP, multi-token prediction) by default and falls back to Slow Mode (NTP, next-token prediction) when parallel outputs are unreliable, e.g., due to format irregularity or spatial ambiguity. This preserves most of the speed gains while keeping outputs robust.

LocateAnything-Data

A large, diverse training corpus with 138M language queries and 785M bounding boxes, covering general object detection, GUI grounding, referring comprehension, text localization, and point-based tasks.

State-of-the-Art Performance

Improves throughput by up to 2.5× while surpassing prior VLM grounding models in localization quality across challenging benchmarks such as LVIS, M6Doc, and ScreenSpot-Pro.

Abstract

Overcoming Autoregressive Bottlenecks in VLM Grounding

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding is mismatched with the coupled structure of box geometry, and its strictly sequential generation creates a practical inference bottleneck.

We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy.

We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed–accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
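For readers unfamiliar with the metric, "high-IoU" refers to intersection over union between a predicted and a ground-truth box, evaluated at strict overlap thresholds. The following is a minimal sketch of the standard IoU computation for axis-aligned boxes in (x1, y1, x2, y2) format. The function name is illustrative, and the specific thresholds used in the evaluations are not stated in this text.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0
```

A prediction counts as correct at a given threshold only if its IoU with the ground truth meets that threshold, so stricter thresholds reward tighter boxes.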

Method

LocateAnything: Parallel Box Decoding

To reconcile high-throughput decoding with reliable localization, we propose LocateAnything, a unified framework for VLM-based visual detection and grounding built upon Parallel Box Decoding (PBD).

Comparison of standard token decoding methods vs Parallel Box Decoding (PBD).

Box-Aligned Atomic Units

Input: An image and a natural language text query. The vision encoder extracts visual tokens at native resolution, preserving fine-grained spatial details for high-precision localization.

Parallel Decoding: LocateAnything treats each bounding box (or point) as an atomic unit of constant length and predicts the full coordinate set (x1, y1, x2, y2) in one parallel step, avoiding arbitrary chunking of coordinate tokens.

Architecture: Built upon a Moon-ViT vision encoder and a Qwen2.5 language decoder, bridged by an MLP projector, the model directly converts visual tokens into a sequence of box-aligned block-level predictions.
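To make the throughput argument concrete, the sketch below counts decoder forward passes for a given number of boxes under the two schemes. It is a simplified model, not the LocateAnything implementation: it assumes exactly four coordinate tokens per box for sequential decoding and one step per box for parallel decoding, and it ignores separators, labels, and other structural tokens. All function names are illustrative.

```python
COORDS_PER_BOX = 4  # (x1, y1, x2, y2)


def sequential_steps(num_boxes: int, coords_per_box: int = COORDS_PER_BOX) -> int:
    """Forward passes when every coordinate token is decoded one at a time."""
    return num_boxes * coords_per_box


def parallel_steps(num_boxes: int) -> int:
    """Forward passes when each box is emitted as a single atomic unit."""
    return num_boxes


def step_ratio(num_boxes: int) -> float:
    """Idealized ratio of sequential to parallel decoding steps."""
    return sequential_steps(num_boxes) / parallel_steps(num_boxes)


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, sequential_steps(n), parallel_steps(n), step_ratio(n))
```

Under these assumptions the step count drops by a constant factor equal to the number of coordinates per box. This is an idealized step count rather than a measured speedup; the throughput improvement reported on this page is up to 2.5×.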

Flexible Inference Modes

Fast Mode (MTP): Predicts full boxes in parallel for maximum throughput, suitable for latency- and compute-constrained settings such as on-device robotics and embodied agents.

Slow Mode (NTP): Decodes coordinate tokens autoregressively for maximum stability, appropriate for high-precision labeling, dataset curation, and accuracy-oriented offline evaluation.

Hybrid Mode: Uses Fast Mode by default and falls back to Slow Mode when format irregularity or spatial ambiguity is detected, preserving most speed gains while maintaining robust outputs.
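The control flow of Hybrid Mode can be sketched as a validate-then-fall-back wrapper. This is a hedged illustration only: the actual reliability criteria are not specified in this text, so the check below is an assumed stand-in that tests only whether a box has four ordered, in-image coordinates, and it does not model spatial-ambiguity detection. The names `is_well_formed`, `hybrid_decode`, `fast`, and `slow` are hypothetical.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format


def is_well_formed(box: Optional[Box], width: float, height: float) -> bool:
    """Assumed validity check: four ordered coordinates inside the image."""
    if box is None or len(box) != 4:
        return False
    x1, y1, x2, y2 = box
    # Chained comparisons also reject NaN, since any NaN comparison is False.
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


def hybrid_decode(
    fast: Callable[[], Optional[Box]],
    slow: Callable[[], Optional[Box]],
    width: float,
    height: float,
) -> Tuple[Optional[Box], str]:
    """Try the parallel decoder first; re-decode sequentially if it fails the check."""
    box = fast()
    if is_well_formed(box, width, height):
        return box, "fast"
    return slow(), "slow"


if __name__ == "__main__":
    # Stand-in decoders: the fast one returns an inverted box, forcing a fallback.
    result = hybrid_decode(lambda: (50, 10, 10, 50), lambda: (10, 10, 50, 50), 100, 100)
    print(result)
```

Because the slow path runs only when the check fails, the average cost stays close to Fast Mode as long as most parallel outputs pass validation.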

Architecture overview of LocateAnything using Parallel Box Decoding.

On-Demand Inference: Corrected NTP Re-decoding

When parallel decoding encounters Format Irregularity (malformed syntax at...
