How to reduce the cost of generating an Image

How to reduce the cost of generating an Image | 🌊 Ritesh Pallod

07 Feb, 2026

For the sake of this discussion, we’ll skip the personal cost I had to pay and keep this technical.

Think of reducing cost as increasing the throughput of the model on given hardware. Sometimes improving a model’s latency - running it faster - also correlates with higher throughput and thus lower cost. Wherever that correlation is implied, note that the primary objective of the approaches listed here is to improve throughput (and cost), not latency.
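To make the throughput framing concrete, here is a minimal sketch of the arithmetic. The function name, the $2.00/hour GPU price and the throughput figures are all made up for illustration; plug in your own numbers.

```python
def cost_per_image(gpu_hourly_usd: float, images_per_second: float) -> float:
    """Dollar cost of one image on a GPU that is billed by the hour."""
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    images_per_hour = images_per_second * 3600
    return gpu_hourly_usd / images_per_hour


# Hypothetical numbers: the same $2.00/hour GPU before and after an optimization
# that doubles throughput.
baseline = cost_per_image(2.00, images_per_second=0.5)
doubled = cost_per_image(2.00, images_per_second=1.0)
print(f"baseline: ${baseline:.5f} per image, doubled throughput: ${doubled:.5f} per image")
```

Doubling throughput on the same hardware halves the cost per image, which is why the rest of this post optimizes for throughput first.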

Different model architectures can solve the task of generating an image, and each supports different mechanisms for reducing cost. There are also a lot of common approaches that apply regardless of the model architecture.

decide a few models

Comparing SD (2022), SDXL (2023), Flux Dev (2024), Qwen Image (2025), Nano Banana Pro (2025) and Z-Image Turbo (late 2025) on a few use cases shows how far we have come in the last 3 years. Our definition of good output has changed with every massive release, and the frontier models have become bigger and bigger.

What this translates to is that the minimum machine size (or rather GPU) required to run a model has changed as well. Stable Diffusion could run on a CPU. SDXL worked on an L4 GPU. Flux and Qwen need a bigger machine: the latest RTX 6000 Pro / G4, or an A100 at least. There are inference frameworks that support quantized versions of those on an L4. What this shows is that you get what you pay for. Each iteration needs more GPU to run than the previous one, while also being that much better in image quality and model intelligence.
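A rough way to see why bigger models need bigger GPUs, and why quantized versions fit on smaller ones, is to estimate the memory taken by the weights alone. This is a back-of-the-envelope sketch: the 12-billion-parameter figure is a hypothetical model size chosen for illustration, and the estimate ignores activations, text encoders, the VAE and framework overhead, all of which add on top.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


# Hypothetical 12-billion-parameter image model at different precisions.
HYPOTHETICAL_PARAMS = 12e9
for bits in (32, 16, 8, 4):
    gib = weight_memory_gib(HYPOTHETICAL_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gib:6.1f} GiB")
```

At 16 bits the weights of such a model already take up roughly the whole of a 24 GB card like the L4, which leaves no room for anything else; at 8 or 4 bits they fit with headroom.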

The choice can never be to run the best & biggest model for every task, even though it would perform exceptionally well; if you care about cost, the model has to be relevant to your task. You can get almost the same level of realism by hosting Qwen Image & Z-Image Turbo than by paying 4 c / image for Nano Banana Pro.
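A quick break-even check helps decide between hosting and paying per image. The sketch below assumes the 4 c / image price quoted above and a hypothetical $2.00/hour GPU; the function names are mine, and the calculation ignores idle time, engineering effort and storage.

```python
API_PRICE_USD = 0.04  # the 4 c / image quoted above


def break_even_images_per_hour(gpu_hourly_usd: float,
                               api_price_usd: float = API_PRICE_USD) -> float:
    """Images per billed GPU hour at which self-hosting costs the same as the API."""
    if api_price_usd <= 0:
        raise ValueError("api_price_usd must be positive")
    return gpu_hourly_usd / api_price_usd


def seconds_per_image_budget(gpu_hourly_usd: float,
                             api_price_usd: float = API_PRICE_USD) -> float:
    """Slowest average generation time that still matches the API price."""
    return 3600 / break_even_images_per_hour(gpu_hourly_usd, api_price_usd)


# Hypothetical GPU price; replace with what your provider actually charges.
gpu_price = 2.00
print(f"break-even: {break_even_images_per_hour(gpu_price):.0f} images per hour")
print(f"time budget: {seconds_per_image_budget(gpu_price):.0f} s per image")
```

If your self-hosted model produces images faster than that budget and the GPU stays busy, hosting wins on cost; if the GPU sits idle most of the hour, the per-image API may still be cheaper.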

Every individual has their own take on what level of accuracy or realism they want from an image. It is often clear that a frontier proprietary model from a closed-source lab looks better than the open-source one that is catching up to the frontier. I haven’t found a single set of metrics that quantifies which model works best for a task given the cost tradeoffs. A few things that help in deciding are the Mean Opinion Score and how much your users are impacted if you move them from Model 1 to a slightly cheaper Model 2.
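As a sketch of how a Mean Opinion Score comparison could look: collect 1-to-5 ratings from people for the same prompts on both models, then compare the means. The ratings below are invented for illustration, and the interval is a rough normal approximation, not a rigorous test.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings: list[int]) -> tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = mean(ratings)
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Invented ratings on a 1 (bad) to 5 (excellent) scale.
model_1_ratings = [5, 4, 4, 5, 3, 4, 5, 4]
model_2_ratings = [4, 4, 3, 5, 3, 4, 4, 4]  # the slightly cheaper model

for name, ratings in (("Model 1", model_1_ratings), ("Model 2", model_2_ratings)):
    mos, hw = mean_opinion_score(ratings)
    print(f"{name}: MOS {mos:.2f} +/- {hw:.2f}")
```

If the intervals overlap heavily, your users may not notice the move to the cheaper model; with this few ratings you would want more data before deciding.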

building on top of a model<br>While picking the vanilla model is one option, another is to get what you expect from a model by trickier means. For example, the realism you chase can be brought about by clever preprocessing of images, or by improving a cheaper model iteration with a ControlNet or LoRA.

If you truly care about reducing cost, the problem statement goes deeper than just “pick this model, apply the same optimizations and run on this GPU”.

Every model released as open source gives its users the option to extend it: fine-tune it on your datasets, condition it to your liking, or even make it faster at a tiny drop in accuracy metrics. A vanilla Qwen Image lacks the realism of Nano Banana Pro & Z-Image Turbo. Everyone’s aware of that, and that’s why you’d see someone on the internet train a realism LoRA for it.

This is not new. In fact, it started with the OG Stable Diffusion series. The original Stable Diffusion models were too generalized and thus lacking. Developers and passionate hobbyists built an anime SD and multiple realistic SDs; someone released weights for a model that did better Wild West images, while others did the same for futuristic cyberpunk.

The issue of deformities on humans was solved to an extent by not making the model imagine human anatomy, but instead providing a pose or an existing image for it to be conditioned on. This was done via ControlNets & IP-Adapters.

There are other preprocessing tricks. If you want to generate the most realistic-looking photo of 200 cars driving in Georgia (the country), you can start from a few background images of places in Georgia, either real or imagined by the most realistic generative models, and use a cheaper model to place the cars there.

now come the optimizations<br>We’ll go into quantization, tricks like “torch.compile”, detaching CPU work from GPU, choosing a GPU by its cost-to-performance numbers, step sizes, batch sizes and, most importantly, the choice of inference framework.

quantization<br>Quantization is the process of mapping values from a large, high-precision set to a smaller, discrete, low-precision set. In our ML models, this typically means converting 32-bit floating-point weights and activations into 16-bit floats, or into 8-bit or even 4-bit integers.
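To show what that mapping looks like, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization. This is one common scheme picked for illustration, not necessarily what any particular inference framework does, and the weight values are made up.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


# Made-up weights; the largest magnitude decides the scale.
weights = [0.5, -1.27, 0.003, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, f"scale={scale:.4f}", f"max error={max_error:.4f}")
```

Each weight now takes 8 bits instead of 32, at the price of a rounding error of at most half the scale per value. Small values like 0.003 collapse to zero here, which hints at where the precision goes.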

Quantization’s obvious concern is the...
