SilverTorch: Index as Model — A New Retrieval Paradigm for Recommendation Systems - Engineering at Meta
By Lei Chen, Yiyi Pan, Ivy Sun, Sha Meng, Cornelia Carapcea, Shilin Ding, Ram Ramanathan, Nipun Mathur, Hong Yan, Lars Backstrom
We’re introducing SilverTorch, a reimagining of recommendation systems that brings all retrieval components for user-generated content under a single architecture.
SilverTorch delivers up to 23.7x higher throughput than state-of-the-art approaches and 20.9x better compute cost efficiency than a CPU-based solution, while also improving accuracy.
Our research paper, “SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs,” accepted to the full paper track at SIGIR 2026, contains full technical details.
Retrieval in industry recommendation systems has traditionally consisted of microservices stitched together, with neural networks inconsistently integrated. Our recommendation systems scale to serve people across multiple platforms. Retrieval is responsible for narrowing millions of pieces of content (e.g., reels and photos) down to thousands before passing them to ranking systems, all in less than 100 milliseconds.
However, the microservice-based design imposed hard constraints on model complexity and the number of candidates evaluated, ultimately creating a ceiling on the quality of recommendations that people on our platforms see.
To break through this ceiling, we’ve fully reimagined our retrieval ecosystem into a unified model-based system – SilverTorch.
SilverTorch operates under a new paradigm we call Index as Model. We’ve built our retrieval system as a single neural network and now express the different microservices as model modules within this integrated network. Under Index as Model, the item indices that microservices previously maintained for retrieval become a tensor inside the model. As a user opens up their app, one request flows through a SilverTorch model, completes all critical retrieval functions (searching for items similar to the user’s interests, filtering for eligibility, reranking, and scoring engagement likelihood against multiple user engagement actions), and returns a list of high-quality content candidates to ranking. This new design allows us to increase modeling complexity and the number of candidates evaluated without breaking the sub-100-millisecond bar.
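To make the idea concrete, here is a minimal sketch of what "the index is part of the model" could look like. It is not the SilverTorch implementation: the class name `IndexAsModelSketch`, the language-based eligibility rule, and the toy data are all hypothetical, and plain Python lists stand in for the GPU tensors a real system would use. Neural reranking and multi-task scoring are left out; a top-k selection stands in for them.

```python
import heapq


class IndexAsModelSketch:
    """Toy single-pass retrieval where the item index lives inside the model."""

    def __init__(self, item_embeddings, item_languages):
        # Assumption: in a real system these would be tensors held by the model
        # on the GPU. Here they are ordinary lists indexed by item id.
        self.item_embeddings = item_embeddings
        self.item_languages = item_languages

    def forward(self, user_embedding, user_language, k):
        # 1) Similarity search: dot product of the user vector with every item row.
        scores = [
            sum(u * v for u, v in zip(user_embedding, row))
            for row in self.item_embeddings
        ]
        # 2) Eligibility filtering: keep only items in the user's language
        #    (a hypothetical rule chosen for illustration).
        eligible = [
            (score, item_id)
            for item_id, score in enumerate(scores)
            if self.item_languages[item_id] == user_language
        ]
        # 3) Top-k selection, standing in for reranking and multi-task scoring.
        top = heapq.nlargest(k, eligible)
        return [item_id for _, item_id in top]


model = IndexAsModelSketch(
    item_embeddings=[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]],
    item_languages=["en", "fr", "en", "en"],
)
candidates = model.forward(user_embedding=[1.0, 0.0], user_language="en", k=2)
print(candidates)
```

The point of the sketch is structural: search, filtering, and selection happen inside one `forward` call on data the model already holds, so no intermediate result has to leave the process.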
SilverTorch makes retrieval significantly more efficient, runs at scale, and enables better recommendations.
Higher throughput, lower total cost of ownership (TCO). In an 80M-item end-to-end evaluation, SilverTorch served 23.7× more requests per second than a strong traditional multi-service baseline built on the same model architecture, while improving estimated TCO efficiency by 20.9×.
Proven at scale. Results show SilverTorch can scale across a family of apps as the major retrieval system behind the feed and video content people see.
Better recommendations. By making neural reranking and multi-task scoring practical within tight latency budgets, SilverTorch has consistently enabled retrieval quality improvements that would have been impractical under a microservices architecture.
Moving From Microservice Mesh to One Integrated Neural Network
The Microservice Paradigm We Replaced
Traditional recommendation retrieval is built as a mesh of microservices. When a user opens a social media platform, the request hits an orchestrator, which fans out to a user-tower model service (which computes a vector representation of the user’s interests, called a "user embedding"), a combined retrieval service (which finds and filters candidate items based on similarity to the user vector and eligibility rules like language and geography), and a scoring service (which ranks the survivors). The orchestrator merges results and hands them downstream. Each service has its own codebase, often in a different programming language, with its own deployment lifecycle.
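The request path described above can be sketched as a toy orchestrator. Everything here is hypothetical and for illustration only: the service functions, the payload fields, and the three-item catalog are invented, JSON round-trips stand in for real network serialization, and the calls run sequentially for simplicity because the retrieval step needs the user embedding first.

```python
import json


def call_service(service, payload, hop_log):
    """Simulate one hop: serialize the request, run the service, deserialize the reply."""
    request = json.loads(json.dumps(payload))
    hop_log.append(service.__name__)
    return json.loads(json.dumps(service(request)))


def user_tower(request):
    # Hypothetical user tower: returns a fixed user embedding.
    return {"user_embedding": [1.0, 0.0]}


def retrieval(request):
    # Hypothetical combined retrieval: similarity search plus a language filter.
    items = {"a": ([1.0, 0.0], "en"), "b": ([0.9, 0.1], "fr"), "c": ([0.0, 1.0], "en")}
    user = request["user_embedding"]
    hits = [
        {"id": item_id, "similarity": sum(x * y for x, y in zip(user, vector))}
        for item_id, (vector, language) in items.items()
        if language == request["language"]
    ]
    return {"candidates": hits}


def scoring(request):
    # Hypothetical scoring service: ranks the surviving candidates by similarity.
    ranked = sorted(request["candidates"], key=lambda c: c["similarity"], reverse=True)
    return {"ranked_ids": [c["id"] for c in ranked]}


def orchestrate(language):
    hop_log = []
    user = call_service(user_tower, {"language": language}, hop_log)
    found = call_service(retrieval, {**user, "language": language}, hop_log)
    ranked = call_service(scoring, found, hop_log)
    return ranked["ranked_ids"], hop_log


ranked_ids, hops = orchestrate("en")
print(ranked_ids, hops)
```

Even in this toy version, a single request crosses three service boundaries, and each crossing pays for serialization in both directions before any useful computation happens.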
This worked well in the CPU era. But as retrieval systems grew in scale and sophistication, three problems compounded into structural limits that no component-level optimization can fix:
Latency lost to data movement. Every hop between services costs network round-trip time and serialization overhead, eating into our sub-100-millisecond retrieval budget that should fund actual computation. And because filtering, search, and scoring are designed independently, they cannot be jointly optimized.
Version inconsistency. The user-tower model, the item index, and the filtering rules each update on their own cadence. When the user model ships v2 but the item index is still on v1, the system queries v1 embeddings with v2 user representations — creating quality gaps no downstream ranking can recover.
Siloed development environments. Machine learning (ML) engineers write PyTorch. Infrastructure engineers write C++. Different release cycles, different testing...