How we run Gemini at scale across billions of posts


Engineering blog May 29, 2026

Post author: Iván, Senior Data Engineer

Using LLMs with billions of inputs in a multi-cloud setup

At Modash we sit on top of a creator-discovery dataset that grows by millions of posts every day. A growing slice of that pipeline now runs through LLMs.

This volume of inference adds to both our cloud bills and our operational complexity. In this article you will learn how we actually run an LLM against billions of inputs without going broke.

Why We Use LLMs

Is the AI hype worth it? Do LLMs have any real use beyond being a 24/7 chatbot? We think so, and over the last year we’ve shipped several production pipelines where LLMs are visibly improving the data we deliver to our customers.

Many of those pipelines exist to extract structured meaning from messy, multilingual, multimodal social content. Historically these were patchworks of regex rules, keyword lists, and hand-coded extractors. They scaled in lines of code, not in coverage.

Take a caption like “This is not a sponsored post” or “I’m not being paid for this promotion”: it contains every keyword the rules were looking for, but means the exact opposite. The only way to handle it correctly is to actually understand the content. Those false positives are unacceptable because they erode the perceived quality of a product customers are paying for.

LLMs reframe these as language and vision tasks instead of pattern-matching ones. The tradeoff is cost, throughput, and validation, which is what the rest of this article is about.

Our solution

Our upstream data lives in Iceberg tables on S3. Each LLM use case has a corresponding Airflow DAG that triggers PySpark ETLs, which read our curated tables and extract the rows that need inference. The AWS Batch jobs generate the JSONL files (with the Gemini prompts) and store them in different GCP buckets (one per region, to use as much compute capacity as possible; more on that below). Pub/Sub detects the event and sends the JSONL file to Gemini Enterprise Agent Platform (using the Batch API, as it’s 50% cheaper). Gemini Enterprise (formerly Vertex) will read the model to be used from the path of the file and will store the output using the same partitioning strategy.

From there, we run a periodic sync job that pulls those output JSONLs into S3 and lands them as Parquet. Each input row is identified by a unique ID that is also present in the LLM output. Finally, the data is ready to be used, and our scheduled EMR jobs generate the data that we produce for our customers. Check this link if you want to learn more about how we optimize our EMR jobs.

What each Batch job does

For each Parquet file, Airflow triggers one AWS Batch job that, at a high level, prepares our raw platform data so Gemini can digest it:

- Reads the necessary post data and handles heavy I/O tasks (like downloading and encoding media). This work has to run in parallel, or the job’s resources will be underutilized.
- Encapsulates the data into Gemini requests: each post (or batch of posts) is packaged into a single, self-contained request payload along with its prompt instructions and structured output schemas.
- Aggregates these request payloads and writes them into a large JSONL file that rotates when it hits ~900 MB.
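The aggregation step, where request payloads are appended to a JSONL file that rotates near 900 MB, can be sketched roughly as follows. This is a minimal illustration and not the code we run: the `RotatingJsonlWriter` class, the `requests-00000.jsonl` file naming, and writing to a local directory are all assumptions made for the example. Only the ~900 MB threshold and the `key`/`request` line shape come from the article.

```python
import json
import os
import tempfile

# ~900 MB rotation threshold mentioned in the text.
MAX_BYTES = 900 * 1024 * 1024


class RotatingJsonlWriter:
    """Hypothetical helper: writes one request per line and starts a new
    file whenever the next line would push the current file past max_bytes."""

    def __init__(self, directory, prefix="requests", max_bytes=MAX_BYTES):
        self.directory = directory
        self.prefix = prefix
        self.max_bytes = max_bytes
        self.paths = []      # every file opened so far, in order
        self._file = None
        self._size = 0       # bytes written to the current file

    def _open_next(self):
        self.close()
        name = f"{self.prefix}-{len(self.paths):05d}.jsonl"  # assumed naming scheme
        path = os.path.join(self.directory, name)
        self._file = open(path, "wb")
        self._size = 0
        self.paths.append(path)

    def write(self, key, request):
        record = {"key": key, "request": request}
        line = (json.dumps(record, ensure_ascii=False) + "\n").encode("utf-8")
        # Rotate before writing so a file never exceeds the cap
        # (unless a single line is itself larger than the cap).
        needs_rotation = self._size > 0 and self._size + len(line) > self.max_bytes
        if self._file is None or needs_rotation:
            self._open_next()
        self._file.write(line)
        self._size += len(line)

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()


if __name__ == "__main__":
    # Tiny cap so the rotation is visible without writing gigabytes.
    with tempfile.TemporaryDirectory() as tmp:
        with RotatingJsonlWriter(tmp, max_bytes=200) as writer:
            for i in range(10):
                payload = {"contents": [{"role": "user", "parts": [{"text": f"caption {i}"}]}]}
                writer.write(f"post-{i}", payload)
        print(len(writer.paths), "files written")
```

Checking the size before each write, rather than after, keeps every file under the threshold, which matters when the threshold sits just below a hard upload limit.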
JSONL structure (one request per line in the file; expanded here for readability):

    {
      "key": "XYZ",
      "request": {
        "contents": [
          {
            "parts": [
              { "text": "POST DATA" }
            ],
            "role": "user"
          }
        ],
        "generationConfig": {
          "temperature": 0,
          "maxOutputTokens": 8192,
          "responseMimeType": "application/json",
          "responseSchema": {
            "type": "object",
            "title": "SponsoredPostBatchResponse",
            "properties": {
              "results": {
                "type": "array",
                "title": "Results",
                "description": "List of analysis results. Must contain exactly one result per input post.",
                "items": {
                  "type": "object",
                  "title": "SponsorIdentification",
                  "properties": { "PYDANTIC OBJECT SCHEMA" }
                }
              }
            }
          },
          "thinkingConfig": { "thinkingBudget": 0 }
        },
        "systemInstruction": {
          "parts": [
            { "text": "OUR SPONSORED POST DETECTION PROMPT" }
          ]
        },
        ...
      }
    }

Why JSONLs are big

The 900 MB number is deliberate: Gemini’s hard input cap is 1 GB. But packing requests into large JSONLs does not save us token money. What large JSONLs do save:

- Concurrent-batch-job quota. Gemini Enterprise enforces a cap on how many batch jobs we can have running per region simultaneously. A 1 KB file and a 900 MB file each consume one quota slot.
- GCS object operations. Gemini Enterprise writes results back as files mirroring the input shape. Thousands of micro-JSONLs become thousands of micro-output-files, and those object operations cost far more on GCS at billion scale.
- Downstream simplicity. Fewer files for our ingestion job to enumerate, glob, and merge.

Spreading load across GCP regions

Gemini Enterprise AI Batch quotas are per-region, not per-project. If we...
