Mochallama


mochallama: a local, tool-calling LLM inside your JVM

The only in-process, tool-calling local LLM for the JVM: Spring-first, OpenAI-compatible, and llama.cpp-backed via Project Panama FFM. No JNI, no daemon, no native-install dance. Requires JDK 22+.

In-process: no daemon, no network hop

The model runs inside your application's own process. There is no Ollama-style sidecar to install and supervise, no HTTP round-trip, and no idle resource drain. Inference is stateful and rides your app's lifecycle, Actuator health, and Micrometer metrics.

No JNI: all Panama FFM

Java talks to llama.cpp through the JDK 22 Foreign Function & Memory API (GA, not incubator), over a thin ~700-LOC extern-C bridge on llama.cpp's common_chat. There is no hand-written JNI glue, which leaves far fewer crash vectors.
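To make the FFM idea concrete, here is a minimal sketch of a downcall from Java into native code using only the JDK 22 `java.lang.foreign` API. It calls the C library's `strlen`, not mochallama's bridge; the class name `FfmStrlen` and the method `nativeStrlen` are hypothetical names chosen for illustration.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Illustrative only: calls libc strlen, not the mochallama bridge.
public class FfmStrlen {

    /** Returns the byte length of the UTF-8 encoded text, as computed by native strlen. */
    public static long nativeStrlen(String text) {
        Linker linker = Linker.nativeLinker();
        // Look up strlen in the default (C library) lookup and describe its signature:
        // size_t strlen(const char*), modelled as long(ADDRESS) on 64-bit platforms.
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

        // The confined arena frees the native string when the try block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(text); // NUL-terminated UTF-8 copy
            return (long) strlen.invokeExact(cString);
        } catch (Throwable t) {
            throw new IllegalStateException("native strlen call failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(nativeStrlen("Project Panama"));
    }
}
```

Running it as `java --enable-native-access=ALL-UNNAMED FfmStrlen.java` on JDK 22+ avoids the restricted-method warning. No C glue is compiled for this call, which is the property the JNI comparison above relies on.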

Prebuilt llama.cpp, 5 platforms, zero native-install

Consumes upstream's official prebuilt llama.cpp release libraries (tag b9371) and compiles only the bridge (~2–11 s, not a 95-minute from-source build). Per-platform classifier jars auto-load the right native library: macOS Intel and Apple Silicon, Linux x86-64 and ARM64, Windows x86-64.
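As a rough illustration of what selecting a native library per platform involves, the sketch below maps the JVM's `os.name` and `os.arch` values onto the five platforms listed above. It is not mochallama's loader: the class `NativeClassifier`, the method `classifierFor`, and the classifier strings such as `macos-aarch64` are all hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch: the classifier names below are invented for illustration
// and are not taken from mochallama's published artifacts.
public class NativeClassifier {

    /** Maps os.name / os.arch style values onto one of the five supported platforms. */
    public static String classifierFor(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String osPart;
        // Check macOS first: "darwin" also contains the substring "win".
        if (os.contains("mac") || os.contains("darwin")) {
            osPart = "macos";
        } else if (os.contains("win")) {
            osPart = "windows";
        } else if (os.contains("linux")) {
            osPart = "linux";
        } else {
            throw new UnsupportedOperationException("Unsupported OS: " + osName);
        }

        String archPart;
        if (arch.equals("aarch64") || arch.equals("arm64")) {
            archPart = "aarch64";
        } else if (arch.equals("x86_64") || arch.equals("amd64")) {
            archPart = "x86_64";
        } else {
            throw new UnsupportedOperationException("Unsupported architecture: " + osArch);
        }

        // Windows ARM64 is not among the five platforms listed above.
        if (osPart.equals("windows") && archPart.equals("aarch64")) {
            throw new UnsupportedOperationException("Unsupported platform: windows-aarch64");
        }
        return osPart + "-" + archPart;
    }

    public static void main(String[] args) {
        System.out.println(classifierFor(
                System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

Failing loudly on an unknown platform, rather than guessing, matches the fail-fast stance the project takes elsewhere.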

Spring autoconfig, OpenAI-compatible

One @AutoConfiguration dependency exposes POST /v1/chat/completions (with SSE when stream:true) and GET /v1/models. Tools and streaming work together. Drop-in for code already written against OpenAI or Ollama.
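Because the endpoint follows the OpenAI wire format, a plain JDK `HttpClient` is enough to talk to it. The sketch below builds a chat request and extracts the payloads from SSE lines. It assumes the app listens on `http://localhost:8080` and that the stream ends with OpenAI's conventional `data: [DONE]` line; `LocalChatClient`, `buildChatRequest`, and `ssePayloads` are hypothetical names, not part of mochallama.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

// Hypothetical client sketch; only JDK classes are used.
public class LocalChatClient {

    /** Builds a POST to /v1/chat/completions with a single user message. */
    public static HttpRequest buildChatRequest(String baseUrl, String userMessage, boolean stream) {
        String body = "{\"stream\":" + stream
                + ",\"messages\":[{\"role\":\"user\",\"content\":\""
                + escapeJson(userMessage) + "\"}]}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // Minimal escaping for backslashes and double quotes only; use a JSON library
    // for real input, which may contain newlines or control characters.
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Collects the payload of each "data:" line, stopping at the assumed [DONE] marker. */
    public static List<String> ssePayloads(List<String> lines) {
        List<String> payloads = new ArrayList<>();
        for (String line : lines) {
            if (!line.startsWith("data:")) {
                continue; // skip blank separators and SSE comment lines
            }
            String payload = line.substring("data:".length()).strip();
            if (payload.equals("[DONE]")) {
                break;
            }
            payloads.add(payload);
        }
        return payloads;
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request =
                buildChatRequest("http://localhost:8080", "Hello from local llama.cpp", false);
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        // Prints the HTTP status code followed by the raw response body.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

For a streaming call, the same request with `stream` set to `true` could be sent with `HttpResponse.BodyHandlers.ofLines()` and the resulting lines passed to `ssePayloads`.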

Tool-calling-only, fail-fast

Built for agentic / function-calling work. Non-tool-capable models are rejected at load with MODEL_NOT_TOOL_CAPABLE: an explicit contract instead of silent degradation on small models.

Metrics via Actuator

The starter registers inference meters (timer, token distributions, tool-call counter, tokens/sec) and a model health indicator through Actuator + Micrometer. Prometheus is opt-in.

The 10-second hook

No Java install, no daemon, no native build. Use npx to run a tool-calling local LLM and start chatting:

```bash
npx @deemwario/mochallama chat -m qwen2.5-1.5b
```

The CLI ships its own jlink JDK-22 runtime image via npm, so this needs no JDK on the host. qwen2.5-1.5b is the default tool-capable preset; the model downloads on first run into ~/.chatbot_models.

Embed it: the smallest plain-Java snippet

Two dependencies: the Java jar plus the platform aggregator that resolves the right native classifier jar for your host.

build.gradle.kts:

```kotlin
implementation("io.github.deemwario:mochallama-core:0.1.6")
runtimeOnly("io.github.deemwario:mochallama-core-platform:0.1.6")
```

pom.xml:

```xml
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core</artifactId>
  <version>0.1.6</version>
</dependency>
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core-platform</artifactId>
  <version>0.1.6</version>
  <scope>runtime</scope>
</dependency>
```

```java
import tools.deemwar.mochallama.panama.ChatEngine;
import java.nio.file.Path;

var engine = ChatEngine.load(Path.of("/path/to/model.gguf"));
String reply = engine.chat("Write a haiku about Project Panama.", 128, 0.7);
System.out.println(reply);
```

JVM flags

JDK 22+ is required (FFM is GA there). Run with --enable-native-access=ALL-UNNAMED.

Or one Spring dependency

The starter autoconfigures a local model service and the OpenAI-compatible endpoints. No spring-ai dependency is required.

build.gradle.kts:

```kotlin
implementation("io.github.deemwario:mochallama-spring-boot-starter:0.1.6")
runtimeOnly("io.github.deemwario:mochallama-core-platform:0.1.6")
```

pom.xml:

```xml
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-spring-boot-starter</artifactId>
  <version>0.1.6</version>
</dependency>
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core-platform</artifactId>
  <version>0.1.6</version>
  <scope>runtime</scope>
</dependency>
```

Tell it which model to load. A Hugging Face id is the simplest option (it resolves and caches the GGUF on first start). In src/main/resources/application.properties:

```properties
llamacpp.model.hf-id=Qwen/Qwen2.5-1.5B-Instruct-GGUF
# or an explicit url + filename:
# llamacpp.model.url=https://.../qwen2.5-1.5b-instruct-q4_k_m.gguf
# llamacpp.model.filename=qwen2.5-1.5b-instruct-q4_k_m.gguf
```

Start the app, then point any OpenAI client at it. The model loads asynchronously, so endpoints return 503 until state: READY. POST /v1/chat/completions handles non-streaming, stream:true SSE, and tools / tool_choice; GET /v1/models lists the loaded model.

```bash
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello from local llama.cpp"}]}'
```

A real multi-turn CLI

mochallama chat is a stateful REPL: it keeps the full conversation history, not amnesiac single turns.

```bash
# List the tool-capable presets / loaded models
npx @deemwario/mochallama models

# Start a multi-turn chat; the conversation is saved as a session
npx @deemwario/mochallama chat -m qwen2.5-1.5b

# List past sessions (id, model, turns, last-updated)
npx @deemwario/mochallama sessions

# Continue a prior...
```
