SAM 2 deep dive: why its FIFO memory eviction bothers me

chizkidd1 pts0 comments

SAM 2: Segment Anything in Images & Videos


Meta’s unified model for promptable image and video segmentation.

A foundation model for solving promptable visual segmentation in images & videos.

Built a data engine to collect the largest video segmentation dataset to date.

Model: Simple transformer architecture with streaming memory for real-time video processing.

Trained on a wide range of tasks: video segmentation and image segmentation.

The paper can be found here.

Table of Contents

1. Introduction

2. Related Work

3. Task: Promptable Visual Segmentation (PVS)

4. Model

5. Data

6. Zero-Shot Experiments

7. Comparison to SOTA in Semi-Supervised VOS

8. Conclusion

9. Discussion

Appendix

Citation

References

1. Introduction

Why video and not image?

An image is only a static snapshot of the real world; it lacks temporal (motion) information.

Video captures temporal information.

Many vital applications (robotics, AR/VR, autonomous vehicles) require temporal localization beyond image-level segmentation.

A universal visual segmentation system should be applicable to both images & videos.

Video segmentation aims to determine the spatio-temporal extent of entities, which presents unique challenges beyond those in images.

Entities can change significantly in appearance over time, and videos are often of lower quality than images; both make video segmentation harder.

SAM successfully solves image segmentation, but existing video segmentation models & datasets fall short in providing a comparable capability to “segment anything in videos.”

SAM 2: A unified model for video & image segmentation.

Promptable Visual Segmentation (PVS): Task that generalizes image segmentation to the video domain.

A data engine that generates training data by using the model in the loop with annotators, producing the Segment Anything Video (SA-V) dataset.

2. Related Work

Video Object Segmentation (VOS)

Video segmentation datasets

Interactive Video Object Segmentation (iVOS)

Image Segmentation task, model and dataset

Research Paper: Segment Anything (SA)

Segment Anything (Adapted from the Paper)

We aim to build a foundation model for segmentation by introducing three interconnected components: a promptable segmentation task, a segmentation model (SAM) that powers data annotation and enables zero-shot transfer to a range of tasks via prompt engineering, and a data engine for collecting SA-1B, our dataset of over 1 billion masks.

3. Task: Promptable Visual Segmentation (PVS)

\[\text{PVS} \longrightarrow \text{SAM 2} \longrightarrow \text{SA-V dataset}\]

PVS task allows providing prompts to the model on any frame of a video.

Interactive segmentation with SAM 2 involves the steps below:

SAM 2 is prompted on a single frame and responds instantly with a valid segmentation mask of the target object on this frame.

SAM 2 then propagates the target object’s segment to multiple frames to form a masklet.

The model receives one or more initial prompts and propagates them to obtain the masklet of the object across the entire video, which localizes the target’s segmentation mask on every video frame.

Additional prompts on any frame can be added to SAM 2 for segmentation mask refinement.

SAM 2 is applied as a data collection tool to the PVS task for building the SA-V dataset.

The model is evaluated in three ways: by simulating interactive video segmentation scenarios across multiple frames, in the conventional semi-supervised VOS setting where annotations are limited to the first frame, and for image segmentation on the SA benchmarks.
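The interactive steps above can be sketched as a small toy loop. This is only an illustration of the PVS interaction pattern (prompt a frame, get a mask, propagate, refine), not the SAM 2 API: `PointPrompt`, `ToyPromptableSegmenter`, `add_prompt` and `propagate` are hypothetical names, and a "mask" here is just the set of clicked points carried forward in time. Frames before the first prompt get an empty mask, which is a simplification of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointPrompt:
    """Hypothetical click prompt on one frame (not a SAM 2 type)."""
    frame_idx: int
    x: int
    y: int
    positive: bool = True  # False means "this point is NOT the object"


class ToyPromptableSegmenter:
    """Toy stand-in for a PVS model: a mask is a set of (x, y) points."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.prompts = {}  # frame_idx -> list of PointPrompt

    def add_prompt(self, prompt):
        """Register a prompt and return the mask on the prompted frame."""
        if not 0 <= prompt.frame_idx < self.num_frames:
            raise IndexError("prompt frame is outside the video")
        self.prompts.setdefault(prompt.frame_idx, []).append(prompt)
        return self.propagate()[prompt.frame_idx]

    def propagate(self):
        """Carry the mask forward through the video to build a masklet."""
        masklet = {}
        current = set()
        for t in range(self.num_frames):
            # Apply any prompts given on this frame before emitting its mask.
            for p in self.prompts.get(t, []):
                if p.positive:
                    current.add((p.x, p.y))
                else:
                    current.discard((p.x, p.y))
            masklet[t] = frozenset(current)
        return masklet


seg = ToyPromptableSegmenter(num_frames=5)
seg.add_prompt(PointPrompt(frame_idx=0, x=10, y=10))   # initial prompt
seg.add_prompt(PointPrompt(frame_idx=3, x=20, y=20))   # refinement on a later frame
masklet = seg.propagate()                              # one mask per frame
```

The point of the sketch is the shape of the interaction: prompts can land on any frame, and every new prompt changes the masklet from that frame onward.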

4. Model

SAM 2 is a generalization of SAM to the video (& image) domain. Essentially, it takes point, box & mask prompts on individual frames to define the spatial extent of the object to be segmented spatio-temporally.

Figure 1: The SAM 2 architecture. For a given frame, the segmentation prediction is conditioned on the current prompt and/or on previously observed memories. Video frames are processed in a streaming fashion by the image encoder, cross-attended to memories of the target object from previous frames stored in the memory bank, and decoded via the mask decoder (optionally prompted by the prompt encoder) to predict the segmentation mask for that frame. Finally, a memory encoder transforms the prediction and image encoder embeddings for use in future frames.
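The data flow in the caption can be sketched as a streaming loop. This is a minimal toy, not the real model: every function below is a hypothetical stand-in (averaging instead of cross-attention, a threshold instead of a mask decoder), a frame is a flat list of numbers, and a prompt is simply a pixel index forced to foreground. The FIFO eviction policy comes from this post's title; the capacity `max_memories` and the threshold value are assumptions made for illustration.

```python
from collections import deque


def image_encoder(frame):
    # Stand-in for the image encoder: features are the raw pixel values.
    return [float(v) for v in frame]


def memory_attention(features, memories):
    # Stand-in for cross-attention: blend features with the mean memory.
    if not memories:
        return features
    mean_mem = [sum(col) / len(memories) for col in zip(*memories)]
    return [0.5 * f + 0.5 * m for f, m in zip(features, mean_mem)]


def mask_decoder(features, prompt=None, threshold=0.2):
    # Stand-in for the mask decoder: threshold the conditioned features.
    mask = [1 if f > threshold else 0 for f in features]
    if prompt is not None:
        mask[prompt] = 1  # hypothetical prompt: force one pixel to foreground
    return mask


def memory_encoder(features, mask):
    # Combine the prediction with the (unconditioned) image features.
    return [f * m for f, m in zip(features, mask)]


def segment_stream(frames, prompts, max_memories=2):
    # deque(maxlen=...) drops the oldest entry on overflow, i.e. FIFO eviction.
    bank = deque(maxlen=max_memories)
    masks = []
    for t, frame in enumerate(frames):
        feats = image_encoder(frame)
        conditioned = memory_attention(feats, list(bank))
        mask = mask_decoder(conditioned, prompts.get(t))
        bank.append(memory_encoder(feats, mask))
        masks.append(mask)
    return masks, bank


frames = [[1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
masks, bank = segment_stream(frames, prompts={}, max_memories=2)
print(masks)
```

In this toy input the object is visible only in the first frame, so the prediction survives exactly as long as that frame's memory stays in the bank. Once FIFO eviction pushes it out, the mask goes empty, which is the kind of behavior the title of this post is worried about.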

SAM 2 Architecture


Components:

Image encoder: For real-time processing of arbitrarily...
