What does it cost to process an image with a vision model?


Vision Token Counts: What does it cost to process an image with a frontier vision model?

Trevor Lynn

Published May 4, 2026 (6 min read)

A reproducible breakdown of GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro. Updated May 2026.

Why VLM pricing is harder than LLM pricing

Estimating the cost of an LLM call is mostly arithmetic: count the input tokens, count the output tokens, and multiply by the rate card. Vision-language models break that habit. The same phone photo can become 2,451 tokens on one provider and 6,636 on another, before the model has generated a single word of output. If you are sizing a workload, the question of how much it costs to process an image only has an answer once you specify the image, the provider, and what you want back.

This piece walks through the cost equation, the per-provider tokenization rules as of May 2026, and a worked grid across five image sizes. The goal is to give you something you can plug your own numbers into.

The VLM cost equation

Cost per image = (image input tokens + text input tokens) × input price + output tokens × output price

Everything in that equation except the image input token count behaves like normal LLM math. Image input tokens are where the providers diverge, so the rest of this post focuses there: it is the hardest part to budget for.

For the comparisons below, we hold text input and output constant (a 100-token instruction, a 500-token JSON response) and vary the image. That isolates the variable that vision pricing actually depends on.

How each provider turns pixels into tokens

OpenAI GPT-5.5

GPT-5.5 uses patch-based image tokenization. Images are covered by 32 by 32 pixel patches, and the image token count is the number of patches after any model resizing. In `high` detail mode, GPT-5.5 allows up to 2,500 patches and a 2,048-pixel maximum dimension. If either limit is exceeded, the image is resized while preserving aspect ratio.

In `original` detail mode, GPT-5.5 allows up to 10,000 patches and a 6,000-pixel maximum dimension.

One important gotcha: on GPT-5.5, an omitted `detail` setting and `auto` both behave like `original`, not `high`. For the comparison grid below, we use `detail: "high"`.

Input price: $5.00 per million tokens for GPT-5.5 standard input.

Anthropic Claude Opus 4.7

Anthropic uses an area-based formula. Image tokens are approximately (width × height) / 750. The long edge is capped at 2,576 pixels in Opus 4.7, up from 1,568 in prior Claude models. Anything larger is resized down before tokenization.

There is one wrinkle worth knowing about. Opus 4.7 ships with a new tokenizer that produces 1.0x to 1.35x more tokens for the same input compared to Opus 4.6. Image tokens are affected too, so a phone photo that cost X on Opus 4.6 can cost noticeably more on Opus 4.7 even at the same nominal price per token.

Input price: $5.00 per million tokens.

Google Gemini 3.1 Pro

Gemini has the simplest rule. Images where both dimensions are 384 pixels or smaller cost a flat 258 tokens. Anything larger is cropped and scaled as needed into 768 by 768 tiles, and each tile costs 258 tokens.

Input price: $2.00 per million tokens (standard context). The lower per-token price partially offsets the higher tile count on big images.

VLM pricing comparison grid

Five representative image sizes, run through each provider's rule. Image input tokens only.

| Image | GPT-5.5 tokens | Claude tokens | Gemini tokens |
|---|---|---|---|
| Thumbnail (256×256) | 64 | 87 | 258 |
| Web image (1024×1024) | 1,024 | 1,398 | 1,032 |
| Phone photo (4032×3024) | 2,451 | 6,636 | 6,192 |
| Document scan (2480×3508) | 2,478 | 6,255 | 5,160 |
| 4K video frame (3840×2160) | 2,304 | 4,977 | 3,870 |
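The rules described earlier can be sketched in a few lines of Python. This is a rough sketch under stated assumptions, not any provider's official tokenizer: the function names are made up for illustration, the Claude result is rounded to the nearest token, and the Gemini tile count is assumed to be a simple ceiling grid of 768-pixel tiles (the actual crop-and-scale logic may differ). The GPT-5.5 helper only covers images that already fit inside the `high` detail limits, because the exact resize step is not spelled out here.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def claude_image_tokens(width: int, height: int, max_long_edge: int = 2576) -> int:
    """Area rule: (width * height) / 750 after capping the long edge.

    Assumption: fractional pixel sizes are kept after the resize and the
    final token count is rounded to the nearest integer.
    """
    long_edge = max(width, height)
    w, h = float(width), float(height)
    if long_edge > max_long_edge:
        scale = max_long_edge / long_edge
        w, h = w * scale, h * scale
    return round(w * h / 750)


def gemini_image_tokens(width: int, height: int,
                        tile: int = 768, tokens_per_tile: int = 258) -> int:
    """Flat 258 tokens for small images, else 258 tokens per 768x768 tile.

    Assumption: tiles are counted as a ceiling grid over the image.
    """
    if width <= 384 and height <= 384:
        return tokens_per_tile
    return tokens_per_tile * ceil_div(width, tile) * ceil_div(height, tile)


def gpt_patch_tokens_unresized(width: int, height: int, patch: int = 32,
                               max_patches: int = 2500, max_dim: int = 2048) -> int:
    """Count 32x32 patches for an image that needs no resizing in `high` mode."""
    patches = ceil_div(width, patch) * ceil_div(height, patch)
    if patches > max_patches or max(width, height) > max_dim:
        raise ValueError("image exceeds the `high` limits; resize step not modeled")
    return patches


if __name__ == "__main__":
    sizes = {
        "Thumbnail": (256, 256),
        "Web image": (1024, 1024),
        "Phone photo": (4032, 3024),
        "Document scan": (2480, 3508),
        "4K video frame": (3840, 2160),
    }
    for name, (w, h) in sizes.items():
        print(name, claude_image_tokens(w, h), gemini_image_tokens(w, h))
```

Under these assumptions the sketch reproduces the Claude and Gemini columns of the grid, and the GPT-5.5 values for the two smallest images.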

Translating to dollars at current input prices:

| Image | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|
| Thumbnail (256×256) | $0.00032 | $0.00044 | $0.00052 |
| Web image (1024×1024) | $0.00512 | $0.00699 | $0.00206 |
| Phone photo (4032×3024) | $0.01226 | $0.03318 | $0.01238 |
| Document scan (2480×3508) | $0.01239 | $0.03127 | $0.01032 |
| 4K video frame (3840×2160) | $0.01152 | $0.02489 | $0.00774 |

The same grid at one million images, roughly the volume of a real-world inspection line, content moderation pipeline, or document processing workload:

| Image | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|
| Thumbnail (256×256) | $320 | $435 | $516 |
| Web image (1024×1024) | $5,120 | $6,990 | $2,064 |
| Phone photo (4032×3024) | $12,255 | $33,180 | $12,384 |
| Document scan (2480×3508) | $12,390 | $31,275 | $10,320 |
| 4K video frame (3840×2160) | $11,520 | $24,885 | $7,740 |
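Both dollar grids come from the same arithmetic: tokens times the input price per million tokens, optionally scaled by image volume. A minimal sketch follows, using the input prices quoted in this post; the function and dictionary names are illustrative only.

```python
# Input prices in dollars per million tokens, as quoted in this post (May 2026).
INPUT_PRICE_PER_MTOK = {
    "GPT-5.5": 5.00,
    "Claude Opus 4.7": 5.00,
    "Gemini 3.1 Pro": 2.00,
}


def image_input_cost(image_tokens: int, price_per_mtok: float, images: int = 1) -> float:
    """Dollar cost of the image input tokens alone, for `images` identical images."""
    return image_tokens * price_per_mtok / 1_000_000 * images


if __name__ == "__main__":
    # Phone photo (4032x3024) token counts taken from the token grid.
    phone_photo_tokens = {"GPT-5.5": 2451, "Claude Opus 4.7": 6636, "Gemini 3.1 Pro": 6192}
    for model, tokens in phone_photo_tokens.items():
        price = INPUT_PRICE_PER_MTOK[model]
        per_image = image_input_cost(tokens, price)
        per_million = image_input_cost(tokens, price, images=1_000_000)
        print(f"{model}: ${per_image:.5f} per image, ${per_million:,.0f} per million images")
```

Swapping in your own token counts and image volume is usually the quickest way to size a workload before running anything against a live API.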

These numbers are image-input only. Add 100 input tokens for the instruction and 500 output tokens for a JSON response and the total per call goes up by roughly $0.0130 on Claude, $0.0155 on GPT-5.5, and $0.0062 on Gemini, depending on output rates. For binary classification (one-token outputs), output cost is negligible. For long-form analysis (2,000+ output tokens), output cost can dominate the image cost...
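To see how the text and output terms stack on top of the image cost, here is a sketch of the full cost equation. One loud assumption: this post does not state output prices, so the output rates below are back-solved from the quoted per-call overheads ($0.0130, $0.0155, $0.0062) and should be checked against the current rate cards before use. The names are illustrative.

```python
# Dollars per million tokens. Input prices are quoted in this post.
# ASSUMPTION: output prices are back-solved from the quoted overheads,
# not copied from any rate card.
PRICES = {
    "Claude Opus 4.7": {"input": 5.00, "output": 25.00},
    "GPT-5.5": {"input": 5.00, "output": 30.00},
    "Gemini 3.1 Pro": {"input": 2.00, "output": 12.00},
}


def cost_per_call(image_tokens: int, text_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """(image + text input tokens) x input price + output tokens x output price."""
    input_cost = (image_tokens + text_tokens) * input_price / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    for model, p in PRICES.items():
        # Overhead of a 100-token instruction plus a 500-token JSON response.
        overhead = cost_per_call(0, 100, 500, p["input"], p["output"])
        # Same call with a one-token answer, as in binary classification.
        binary = cost_per_call(0, 100, 1, p["input"], p["output"])
        print(f"{model}: JSON overhead ${overhead:.4f}, one-token overhead ${binary:.6f}")
```

Passing a real image token count as the first argument gives the total for a full call, which makes it easy to see where output cost starts to outweigh image cost.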

