Mochallama

deemwar · 1 pts · 0 comments


mochallama
A local, tool-calling LLM inside your JVM
The only in-process, tool-calling local LLM for the JVM: Spring-first, OpenAI-compatible, llama.cpp-backed via Project Panama FFM. No JNI, no daemon, no native-install dance. Requires JDK 22+.

In-process: no daemon, no network hop
The model runs inside your application's own process. No Ollama-style sidecar to install and supervise, no HTTP round-trip, no idle resource drain. Inference is stateful and rides your app's lifecycle, Actuator health, and Micrometer metrics.

No JNI: all Panama FFM
Java talks to llama.cpp through the JDK 22 Foreign Function & Memory API (GA, not incubator), over a thin ~700-LOC extern-C bridge on llama.cpp's common_chat. No hand-written JNI glue, far fewer crash vectors.

Prebuilt llama.cpp, 5 platforms, zero native-install
Consumes upstream's official prebuilt llama.cpp release libs (tag b9371) and compiles only the bridge (~2–11s, not a 95-minute from-source build). Per-platform classifier jars auto-load the right native library for macOS Intel and Apple Silicon, Linux x86-64 and ARM64, and Windows x86-64.

Spring autoconfig, OpenAI-compatible
One @AutoConfiguration dependency exposes POST /v1/chat/completions (with SSE when stream:true) and GET /v1/models. Tools and streaming work together. Drop-in for code already written against OpenAI or Ollama.
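Because the endpoint follows the OpenAI wire format, a plain JDK HTTP client is enough to talk to it. The sketch below is only an illustration under a few assumptions: the app listens on http://localhost:8080, the class and method names are made up for this example, and the hand-rolled JSON escaping stands in for whatever JSON library your project already uses.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatCompletionClientSketch {

    // Assumption: a Spring Boot app with the starter, running on the default port.
    static final String BASE_URL = "http://localhost:8080";

    // Minimal JSON string escaping (quotes, backslashes, control characters).
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
                }
            }
        }
        return sb.toString();
    }

    // Builds an OpenAI-style chat body with a single user message.
    static String buildBody(String userMessage, boolean stream) {
        return "{\"stream\":" + stream
                + ",\"messages\":[{\"role\":\"user\",\"content\":\""
                + jsonEscape(userMessage) + "\"}]}";
    }

    // Builds the POST request without sending it.
    static HttpRequest buildRequest(String userMessage, boolean stream) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildBody(userMessage, stream)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = buildRequest("Hello from local llama.cpp", false);
        System.out.println(request.method() + " " + request.uri());

        // Only hit the network when asked to, so the sketch runs without a server.
        if (args.length > 0 && args[0].equals("--send")) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }
}
```

Run it with `--send` once the app is up. With stream set to true the response arrives as SSE, so a line-oriented handler such as `HttpResponse.BodyHandlers.ofLines()` would be the more natural fit than `ofString()`.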

Tool-calling-only: fail-fast
Built for agentic / function-calling work. Non-tool-capable models are rejected at load with MODEL_NOT_TOOL_CAPABLE, an explicit contract instead of silent degradation on small models.
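To make "tool-calling" concrete, here is roughly what an OpenAI-style request with a tool definition looks like when built from Java. This is a sketch of the standard OpenAI `tools` / `tool_choice` request shape, not something taken from mochallama's own docs; the `get_weather` tool and the class name are hypothetical.

```java
public class ToolCallRequestSketch {

    // Builds a chat request body that offers the model one callable function.
    // Hypothetical tool: "get_weather" is an illustration only.
    // Note: userMessage is inserted verbatim, so it must already be JSON-safe.
    static String toolRequestBody(String userMessage) {
        return """
            {
              "messages": [{"role": "user", "content": "%s"}],
              "tools": [{
                "type": "function",
                "function": {
                  "name": "get_weather",
                  "description": "Look up the current weather for a city",
                  "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]
                  }
                }
              }],
              "tool_choice": "auto"
            }
            """.formatted(userMessage);
    }

    public static void main(String[] args) {
        // Print the body; POST it to /v1/chat/completions with any HTTP client.
        System.out.println(toolRequestBody("What is the weather in Oslo?"));
    }
}
```

In the OpenAI format, a model that decides to use the tool answers with a tool call naming the function and its JSON arguments; the application runs the function and sends the result back as a follow-up message.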

Metrics via Actuator
The starter registers inference meters (timer, token distributions, tool-call counter, tokens/sec) and a model health indicator through Actuator + Micrometer. Prometheus is opt-in.

The 10-second hook
No Java install, no daemon, no native build: npx a tool-calling local LLM and start chatting:

```bash
npx @deemwario/mochallama chat -m qwen2.5-1.5b
```

The CLI ships its own jlink JDK-22 runtime image via npm, so this needs no JDK on the host. qwen2.5-1.5b is the default tool-capable preset; the model downloads on first run into ~/.chatbot_models.

Embed it: the smallest plain-Java snippet
Two dependencies: the Java jar plus the platform aggregator that resolves the right native classifier jar for your host.

build.gradle.kts

```kotlin
implementation("io.github.deemwario:mochallama-core:0.1.6")
runtimeOnly("io.github.deemwario:mochallama-core-platform:0.1.6")
```

pom.xml

```xml
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core</artifactId>
  <version>0.1.6</version>
</dependency>
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core-platform</artifactId>
  <version>0.1.6</version>
  <scope>runtime</scope>
</dependency>
```

```java
import tools.deemwar.mochallama.panama.ChatEngine;
import java.nio.file.Path;

var engine = ChatEngine.load(Path.of("/path/to/model.gguf"));
String reply = engine.chat("Write a haiku about Project Panama.", 128, 0.7);
System.out.println(reply);
```

JVM flags
JDK 22+ is required (FFM is GA there). Run with --enable-native-access=ALL-UNNAMED.
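For a sense of what that flag permits, the sketch below makes a single FFM downcall into the C standard library's `strlen`. It only illustrates the JDK 22 `java.lang.foreign` mechanism and is not mochallama's bridge, whose native symbols are not shown here; the class and method names are made up for the example.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.invoke.MethodHandle;

import static java.lang.foreign.ValueLayout.ADDRESS;
import static java.lang.foreign.ValueLayout.JAVA_LONG;

public class FfmStrlenSketch {

    // Calls the C library's strlen(const char*) through FFM, with no JNI glue.
    static long strlen(String text) {
        Linker linker = Linker.nativeLinker();
        // Look up the symbol in the default (C standard library) lookup.
        MemorySegment symbol = linker.defaultLookup().find("strlen").orElseThrow();
        // size_t strlen(const char*): modeled as long(address) on 64-bit platforms.
        MethodHandle handle = linker.downcallHandle(
                symbol, FunctionDescriptor.of(JAVA_LONG, ADDRESS));
        // The arena frees the native copy of the string when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(text); // NUL-terminated UTF-8
            return (long) handle.invokeExact(cString);
        } catch (Throwable t) {
            throw new RuntimeException("strlen downcall failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(strlen("Project Panama")); // prints 14
    }
}
```

`Linker.downcallHandle` is one of the restricted methods that --enable-native-access grants access to, which is why the flag appears in the run instructions.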

Or one Spring dependency
The starter autoconfigures a local model service and the OpenAI-compatible endpoints, with no spring-ai dependency required.

build.gradle.kts

```kotlin
implementation("io.github.deemwario:mochallama-spring-boot-starter:0.1.6")
runtimeOnly("io.github.deemwario:mochallama-core-platform:0.1.6")
```

pom.xml

```xml
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-spring-boot-starter</artifactId>
  <version>0.1.6</version>
</dependency>
<dependency>
  <groupId>io.github.deemwario</groupId>
  <artifactId>mochallama-core-platform</artifactId>
  <version>0.1.6</version>
  <scope>runtime</scope>
</dependency>
```

Tell it which model to load. A Hugging Face id is the simplest option (it resolves and caches the GGUF on first start). In src/main/resources/application.properties:

```properties
llamacpp.model.hf-id=Qwen/Qwen2.5-1.5B-Instruct-GGUF
# or an explicit url + filename:
# llamacpp.model.url=https://.../qwen2.5-1.5b-instruct-q4_k_m.gguf
# llamacpp.model.filename=qwen2.5-1.5b-instruct-q4_k_m.gguf
```

Start the app (the model loads asynchronously; endpoints return 503 until state: READY), then point any OpenAI client at it. POST /v1/chat/completions handles non-streaming, stream:true SSE, and tools / tool_choice; GET /v1/models lists the loaded model.

```bash
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello from local llama.cpp"}]}'
```

A real multi-turn CLI
mochallama chat is a stateful REPL: it keeps the full conversation history, not amnesiac single turns.

```bash
# List the tool-capable presets / loaded models
npx @deemwario/mochallama models

# Start a multi-turn chat; the conversation is saved as a session
npx @deemwario/mochallama chat -m qwen2.5-1.5b

# List past sessions (id, model, turns, last-updated)
npx @deemwario/mochallama sessions

# Continue a prior...
```
