Serving Large Language Models with a Minimalist Python CLI


Serving LLMs with the Flama CLI

Flama 2.0 brings first-class support for generative AI: downloading, packaging, and serving large language models (LLMs) is now as simple as running a few commands in your terminal. No boilerplate code, no custom serving infrastructure, no configuration files. Just the CLI and a model.

In this post, we walk through the entire workflow: fetching a model from HuggingFace, interacting with it locally in your terminal, and serving it over HTTP with a production-ready API and a built-in chat interface. We will also show how a locally served model can power agentic workflows, using Claude CLI as a practical example.

Before we dive into the details, we recommend having the following resources at hand:

Official Flama documentation: Flama documentation

Generative AI section: Generative AI docs

Flama GitHub repository: Flama on GitHub

Table of contents

Fetching a model with flama get

What happens under the hood

Interacting with the model locally

One-shot queries with flama model run

Streaming responses with flama model stream

Serving the model over HTTP

The flama serve command

The built-in chat interface

Powering agentic workflows

Using Claude CLI with a local model

Conclusions

References

Support our work

About the authors

Fetching a model with flama get

The first step in serving an LLM with Flama is downloading and packaging a model into a .flm artifact (a Flama Lightweight Model file). The flama get command handles this in a single step: it downloads the model weights and configuration from a supported source and serialises them into the portable .flm format.

All examples in this post assume Flama has been installed with the LLM extras via uv:

uv pip install "flama[llm,pydantic]"

Alternatively, you can run any command without a prior install by using uvx --from "flama[llm,pydantic]" flama ..., but for brevity we assume Flama is already installed throughout. Let us fetch a quantised version of Google's Gemma 4 model, optimised for Apple Silicon via the MLX Community:

flama get --family llm --source huggingface mlx-community/gemma-4-E2B-it-qat-4bit
Downloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2.3 GB 28.7 MB/s 0:00:00
Packaging...
Model saved to mlx-community_gemma-4-E2B-it-qat-4bit.flm

Two options are required: --source tells Flama where to download from (currently HuggingFace), and --family declares whether the artifact is a traditional machine-learning model (ml) or a generative model (llm). For large language models, you always pass --family llm.

The output path defaults to the model identifier with slashes replaced by underscores, followed by the .flm extension (here, mlx-community_gemma-4-E2B-it-qat-4bit.flm). If you prefer a custom path, pass --output:
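If you script around flama get, it can help to predict the default filename ahead of time. The following is a minimal Python sketch of the naming rule as described in this post and as seen in the sample output; the helper name default_output_path is hypothetical and is not part of Flama.

```python
def default_output_path(model_id: str) -> str:
    """Mirror the naming rule described above (illustrative helper, not a Flama API).

    Slashes in the model identifier become underscores and the
    .flm extension is appended.
    """
    return model_id.replace("/", "_") + ".flm"


print(default_output_path("mlx-community/gemma-4-E2B-it-qat-4bit"))
# prints: mlx-community_gemma-4-E2B-it-qat-4bit.flm
```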

flama get --family llm --source huggingface mlx-community/gemma-4-E2B-it-qat-4bit --output models/gemma.flm

What happens under the hood

When you run flama get, the following happens:

Flama resolves the model identifier against the HuggingFace Hub and discovers the files that make up the model (weights, tokenizer, configuration).

Files are downloaded concurrently (up to 8 parallel downloads by default, configurable with --max-concurrent).

Once all files are on disk, Flama packages them into a single .flm archive alongside a manifest that records the model family, the originating library, and metadata such as the model name and creation timestamp.

The result is a self-contained, portable artifact. The .flm format is framework-agnostic: the same file runs on vLLM (Linux with CUDA) or MLX (Apple Silicon), with Flama selecting the appropriate backend at load time based on what is available in the environment.
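To make the idea of bounded parallel downloads concrete, here is a small Python sketch of the general technique using an asyncio semaphore. It is not Flama's implementation: the file names are made up, the network transfer is simulated with a short sleep, and the limit of 8 simply mirrors the default quoted above.

```python
import asyncio

MAX_CONCURRENT = 8  # mirrors the default mentioned above; purely illustrative


async def fetch(name, limiter, stats):
    """Simulate one file download while holding a slot in the limiter."""
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the real network transfer
        stats["active"] -= 1
    return name


async def download_all(names, max_concurrent=MAX_CONCURRENT):
    """Run all downloads, never allowing more than max_concurrent at once."""
    limiter = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}
    done = await asyncio.gather(*(fetch(n, limiter, stats) for n in names))
    return list(done), stats["peak"]


# Hypothetical file list; a real model has weights, tokenizer, and config files.
files = [f"shard-{i:02d}.safetensors" for i in range(20)]
downloaded, peak = asyncio.run(download_all(files))
print(f"{len(downloaded)} files fetched, peak concurrency {peak}")
```

The semaphore is what keeps the number of in-flight transfers at or below the limit, regardless of how many files the model contains.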

Interacting with the model locally

Once you have a packaged .flm artifact, you can interact with it directly from your terminal using the flama model command. No server, no HTTP, no code. This is invaluable for quick testing, prompt experimentation, and pipeline scripting.

One-shot queries with flama model run

The run sub-command sends a prompt to the model, waits for the full response, and prints it:

echo "What is Flama?" | flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm run --system "Be concise."
Flama is a Python framework for building production-ready APIs with a focus on machine learning
and generative AI, enabling one-line model serving behind HTTP endpoints.

You can tune generation with --param flags:

echo "Explain dependency injection in three sentences." | \
  flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm run \
  --system "You are a software engineering instructor." \
  --param temperature=0.7 \
  --param max_tokens=256
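For pipeline scripting from Python, one option is to assemble the same command line programmatically. The sketch below only uses the flags shown in this post (--system and --param); the helper name build_run_command is hypothetical, and the block prints the command rather than executing it, so it runs even where Flama is not installed.

```python
import shlex


def build_run_command(artifact, system=None, params=None):
    """Assemble the argv list for `flama model <artifact> run` (illustrative helper)."""
    argv = ["flama", "model", artifact, "run"]
    if system is not None:
        argv += ["--system", system]
    # Each generation parameter becomes its own --param key=value pair.
    for key, value in (params or {}).items():
        argv += ["--param", f"{key}={value}"]
    return argv


cmd = build_run_command(
    "mlx-community_gemma-4-E2B-it-qat-4bit.flm",
    system="You are a software engineering instructor.",
    params={"temperature": 0.7, "max_tokens": 256},
)
print(shlex.join(cmd))
```

On a machine with Flama installed, the resulting list could be passed to subprocess.run(cmd, input=prompt, text=True, capture_output=True), with the prompt supplied on standard input just as echo does in the shell examples.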

For multi-turn conversations, use the --transport conversation flag and pass a JSON message list:

echo '[{"role": "user", "content": "Hi!"}, {"role": "assistant", "content": "Hello! How can I help?"}, {"role": "user", "content": "What is an API?"}]' | \
  flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm run --transport conversation
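Hand-writing the JSON message list gets error-prone as a conversation grows. A minimal Python sketch for generating it is shown below, assuming the payload is a JSON array of objects with role and content keys as in the example above. The helper name conversation_payload is hypothetical, and the set of accepted roles is an assumption limited to the two roles that appear in this post.

```python
import json

# Assumption: only the roles that appear in the example above are accepted here.
ALLOWED_ROLES = {"user", "assistant"}


def conversation_payload(turns):
    """Serialise (role, content) pairs into a JSON message list (illustrative helper)."""
    messages = []
    for role, content in turns:
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {role!r}")
        messages.append({"role": role, "content": content})
    return json.dumps(messages)


payload = conversation_payload([
    ("user", "Hi!"),
    ("assistant", "Hello! How can I help?"),
    ("user", "What is an API?"),
])
print(payload)
```

The printed string can then be piped into flama model ... run --transport conversation in place of the hand-written JSON.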

Streaming responses with flama model stream

For an...
