Serving Large Language Models with a Minimalist Python CLI

Serving LLMs with the Flama CLI

Flama 2.0 brings first-class support for generative AI: downloading, packaging, and serving large language models (LLMs) is now as simple as running a few commands in your terminal. No boilerplate code, no custom serving infrastructure, no configuration files. Just the CLI and a model.

In this post, we walk through the entire workflow: fetching a model from HuggingFace, interacting with it locally in your terminal, and serving it over HTTP with a production-ready API and a built-in chat interface. We will also show how a locally served model can power agentic workflows, using Claude CLI as a practical example.

Before we dive into the details, we recommend having the following resources at hand:

Official Flama documentation: Flama documentation

Generative AI section: Generative AI docs

Flama GitHub repository: Flama on GitHub

Table of contents

Fetching a model with flama get

What happens under the hood

Interacting with the model locally

One-shot queries with flama model run

Streaming responses with flama model stream

Serving the model over HTTP

The flama serve command

The built-in chat interface

Powering agentic workflows

Using Claude CLI with a local model

Conclusions

References

Support our work

About the authors

Fetching a model with flama get

The first step in serving an LLM with Flama is downloading and packaging a model into a .flm artifact (a Flama Lightweight Model file). The flama get command handles this in a single step: it downloads the model weights and configuration from a supported source and serialises them into the portable .flm format.

All examples in this post assume Flama has been installed with the LLM extras via uv:

uv pip install "flama[llm,pydantic]"

Alternatively, you can run any command without a prior install by using uvx --from "flama[llm,pydantic]" flama ..., but for brevity we assume Flama is already installed throughout.

Let us fetch a quantised version of Google's Gemma 4 model, optimised for Apple Silicon via the MLX Community:

flama get --family llm --source huggingface mlx-community/gemma-4-E2B-it-qat-4bit
Downloading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2.3 GB 28.7 MB/s 0:00:00
Packaging...
Model saved to mlx-community_gemma-4-E2B-it-qat-4bit.flm

Two options are required: --source tells Flama where to download from (currently HuggingFace), and --family declares whether the artifact is a traditional machine-learning model (ml) or a generative model (llm). For large language models, always pass --family llm.

The output path defaults to the model identifier with slashes replaced by underscores, followed by the .flm extension (mlx-community_gemma-4-E2B-it-qat-4bit.flm in the example above). If you prefer a custom path, pass --output:

flama get --family llm --source huggingface mlx-community/gemma-4-E2B-it-qat-4bit --output models/gemma.flm
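If a script needs to predict where flama get will write the artifact when --output is omitted, the naming rule can be approximated in a few lines of Python. This is a minimal sketch inferred from the "Model saved to" line in the example output; the helper name default_output_path is hypothetical and is not part of Flama's API.

```python
def default_output_path(model_id: str) -> str:
    """Approximate the default artifact name used when --output is omitted.

    Assumption: the rule is "replace slashes with underscores, append .flm",
    as suggested by the example output in this post.
    """
    return model_id.replace("/", "_") + ".flm"


if __name__ == "__main__":
    print(default_output_path("mlx-community/gemma-4-E2B-it-qat-4bit"))
```

The function is pure string manipulation, so it can be used in build scripts without importing Flama.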

What happens under the hood

When you run flama get, the following happens:

Flama resolves the model identifier against the HuggingFace Hub and discovers the files that make up the model (weights, tokenizer, configuration).

Files are downloaded concurrently (up to 8 parallel downloads by default, configurable with --max-concurrent).

Once all files are on disk, Flama packages them into a single .flm archive alongside a manifest that records the model family, the originating library, and metadata such as the model name and creation timestamp.
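To make the steps above concrete, here is a rough Python sketch of the same shape: bounded concurrent downloads followed by packaging into a single archive with a manifest. It is not Flama's implementation. The downloads are simulated from an in-memory dictionary, and the tar layout, the manifest.json file name, and the manifest keys are all assumptions, since the post does not describe the internal layout of a .flm file.

```python
import asyncio
import io
import json
import tarfile
from datetime import datetime, timezone

MAX_CONCURRENT = 8  # mirrors the default described for --max-concurrent


async def fetch(name: str, remote: dict[str, bytes], gate: asyncio.Semaphore) -> tuple[str, bytes]:
    # Stand-in for an HTTP download; the semaphore caps how many run at once.
    async with gate:
        await asyncio.sleep(0)  # yield control, as real network I/O would
        return name, remote[name]


async def download_all(remote: dict[str, bytes], max_concurrent: int = MAX_CONCURRENT) -> dict[str, bytes]:
    gate = asyncio.Semaphore(max_concurrent)
    pairs = await asyncio.gather(*(fetch(name, remote, gate) for name in remote))
    return dict(pairs)


def package(files: dict[str, bytes], model_name: str, library: str) -> bytes:
    # Hypothetical layout: a tar archive with manifest.json next to the model files.
    manifest = {
        "family": "llm",
        "library": library,
        "name": model_name,
        "created": datetime.now(timezone.utc).isoformat(),
    }
    members = {"manifest.json": json.dumps(manifest).encode(), **files}
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


if __name__ == "__main__":
    fake_remote = {"config.json": b"{}", "tokenizer.json": b"[]", "weights.bin": b"\x00" * 16}
    downloaded = asyncio.run(download_all(fake_remote))
    print(len(package(downloaded, "demo/model", "mlx")), "bytes packaged")
```

A semaphore is used here because it limits in-flight transfers without forcing the files into fixed batches: as soon as one download finishes, the next one starts.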

The result is a self-contained, portable artifact. The .flm format is framework-agnostic: the same file runs on vLLM (Linux with CUDA) or MLX (Apple Silicon), with Flama selecting the appropriate backend at load time based on what is available in the environment.
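Choosing a backend "based on what is available" usually comes down to probing for installed packages. The sketch below illustrates that idea only; it is not Flama's actual selection logic. The import names vllm and mlx, the preference order, and the function names are assumptions made for illustration.

```python
import importlib.util
from typing import Callable, Optional

# Assumed import names and preference order; not taken from Flama's source.
CANDIDATE_BACKENDS = ("vllm", "mlx")


def is_installed(module: str) -> bool:
    # find_spec returns None when a top-level module cannot be imported.
    return importlib.util.find_spec(module) is not None


def pick_backend(available: Callable[[str], bool] = is_installed) -> Optional[str]:
    """Return the first candidate backend that is available, or None."""
    for name in CANDIDATE_BACKENDS:
        if available(name):
            return name
    return None


if __name__ == "__main__":
    print(pick_backend())
```

Passing the availability check as a parameter keeps the function testable on machines that have neither backend installed.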

Interacting with the model locally

Once you have a packaged .flm artifact, you can interact with it directly from your terminal using the flama model command. No server, no HTTP, no code. This is invaluable for quick testing, prompt experimentation, and pipeline scripting.

One-shot queries with flama model run

The run sub-command sends a prompt to the model, waits for the full response, and prints it:

echo "What is Flama?" | flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm run --system "Be concise."
Flama is a Python framework for building production-ready APIs with a focus on machine learning and generative AI, enabling one-line model serving behind HTTP endpoints.

You can tune generation with --param flags:

echo "Explain dependency injection in three sentences." | \
  flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm run \
  --system "You are a software engineering instructor." \
  --param temperature=0.7 \
  --param max_tokens=256
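For pipeline scripting from Python, the same invocation can be assembled as an argument list and run with subprocess. The sketch below uses only the flags shown in this post (--system and --param); the helper names build_run_command and run_prompt are hypothetical, and run_prompt assumes the flama executable is on PATH.

```python
import subprocess
from typing import Optional


def build_run_command(model_path: str, system: Optional[str] = None, params: Optional[dict] = None) -> list[str]:
    """Build the argv for `flama model <path> run` using flags shown in the post."""
    command = ["flama", "model", model_path, "run"]
    if system is not None:
        command += ["--system", system]
    for key, value in (params or {}).items():
        command += ["--param", f"{key}={value}"]
    return command


def run_prompt(prompt: str, model_path: str, **options) -> str:
    # The prompt goes to stdin, like the `echo ... |` pipe in the shell examples
    # (a trailing newline is added to mirror what echo sends).
    result = subprocess.run(
        build_run_command(model_path, **options),
        input=prompt + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Building the command as a list rather than a shell string avoids quoting problems when the system prompt contains spaces or quotes.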

For multi-turn conversations, use the --transport conversation flag and pass a JSON message list:

echo '[{"role": "user", "content": "Hi!"}, {"role": "assistant", "content": "Hello! How can I help?"}, {"role": "user", "content": "What is an API?"}]' | \
  flama model mlx-community_gemma-4-E2B-it-qat-4bit.flm run --transport conversation
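Hand-writing the JSON message list gets error-prone as a conversation grows, so it can help to generate it. The following sketch builds the same payload shape as the example above from (role, content) pairs. The helper name conversation_payload is hypothetical, and the check that the last turn comes from the user is a sanity check added for this sketch, not a documented Flama requirement.

```python
import json


def conversation_payload(turns: list[tuple[str, str]]) -> str:
    """Serialise (role, content) pairs into a JSON message list."""
    messages = [{"role": role, "content": content} for role, content in turns]
    # Sketch-level sanity check (an assumption): the model is expected to
    # answer a user message, so the conversation should end with one.
    if not messages or messages[-1]["role"] != "user":
        raise ValueError("the conversation should end with a user message")
    return json.dumps(messages)


if __name__ == "__main__":
    print(conversation_payload([
        ("user", "Hi!"),
        ("assistant", "Hello! How can I help?"),
        ("user", "What is an API?"),
    ]))
```

The printed string can be piped to flama model ... run --transport conversation in place of the inline echo argument.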

Streaming responses with flama model stream

For an...
