Native Inference Engine for macOS 14 or newer

tictacguy/embershard: Native LLM inference engine and chat app for macOS / Apple Silicon.

Embershard

⬇️ Download the latest .dmg

Latest release: v0.1.1

Grab the signed .dmg from the latest release, drag Embershard into Applications, and right-click → Open the first time. You'll need to approve the launch in System Settings → Privacy & Security: scroll down to the "Security" section and click "Open Anyway". No clone, no toolchain. Apple Silicon, macOS 14 or newer.

Embershard is a macOS chat app with its own LLM inference engine underneath. The interesting part is what isn't there: at inference time the chat path never calls into llama.cpp. Embershard opens the GGUF on its own, pushes the weights to Metal, assembles its transformer compute graph directly on ggml, keeps the KV cache resident across turns, and runs its own byte-level BPE / SentencePiece tokenizer. ggml is used purely as a bag of tensor kernels.

That independence is deliberate, and so is the narrow scope: Embershard runs the llama and qwen2 families (Llama 3.x, Mistral, Qwen 2.5, and anything that reports those architectures in its GGUF) and nothing else. It is a focused engine checked for numerical parity against llama.cpp as the reference, not a drop-in GGUF runner.

Embershard grew out of ds4, which set the template for a small, honest, self-contained native project. ds4 was the inspiration; the engine, app, and writing here are their own thing.

Orchestrated by me, written with Claude Code. App icon by DinosoftLabs.

Why bother re-implementing the forward pass

Wrapping libllama is the easy path, and it is the one most apps take. Embershard takes the harder one so the whole hot loop — graph construction, the KV-cache layout, the sampler, tokenization — lives in code we own and can reason about. llama.cpp and ggml still made it possible: their kernels, the GGUF format and its tooling, the quant formats, and a great deal of hard-won engineering were the map we followed while building everything above the tensor ops. We link ggml for those ops and the Metal backend, and keep llama.cpp around only for the experimental multi-agent orchestrator. Thanks to Georgi Gerganov and the contributors.

What's proven

Beta, but the core is measured rather than asserted:

Logit parity. The llama and qwen2 forward pass matches llama.cpp to a cosine similarity of 0.999999; greedy continuations come out token-for-token identical.
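
As a rough illustration of what a logit-parity check involves, here is a minimal Python sketch. It is not Embershard's test harness: the function names are hypothetical, and using 0.999999 as a pass threshold is an assumption based on the figure quoted above. It compares two logit vectors for one position by cosine similarity and confirms that greedy decoding would pick the same token from each.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length logit vectors."""
    if len(a) != len(b):
        raise ValueError("logit vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def greedy_token(logits):
    """Index of the largest logit, i.e. the token greedy decoding picks."""
    return max(range(len(logits)), key=lambda i: logits[i])


def check_parity(engine_logits, reference_logits, min_cosine=0.999999):
    """Return (passed, cosine) for one position.

    min_cosine is an assumed threshold, not a documented Embershard setting.
    """
    cos = cosine_similarity(engine_logits, reference_logits)
    same_token = greedy_token(engine_logits) == greedy_token(reference_logits)
    return (cos >= min_cosine and same_token), cos
```

In practice a check like this would run over every position of a test prompt, with one vector produced by each engine for the same input tokens.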

Resident KV cache in F16 / Q8_0 / Q4_0, incremental O(n) decode, reused across turns. When the context fills, a sliding window evicts the oldest tokens while keeping absolute RoPE positions intact (no re-roping), so long conversations keep going.
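
The sliding-window behaviour can be pictured with a toy model. The sketch below is an assumption about the bookkeeping only, not Embershard's implementation (the real cache lives in ggml tensors, and the class and method names here are hypothetical). Each cached entry carries the absolute position it was given when it was written, so evicting the oldest entry never changes the position of the ones that remain.

```python
from collections import deque


class SlidingKVCache:
    """Toy sliding-window cache: a fixed number of slots, oldest evicted first.

    Each entry keeps its absolute position, so surviving entries never need
    their rotary embedding recomputed after an eviction.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.entries = deque()   # (absolute_position, kv_payload), oldest first
        self.next_position = 0   # grows monotonically, never reset on eviction

    def append(self, kv_payload):
        """Store one token's K/V, evicting the oldest entry if the cache is full."""
        if len(self.entries) == self.capacity:
            self.entries.popleft()
        position = self.next_position
        self.entries.append((position, kv_payload))
        self.next_position += 1
        return position

    def positions(self):
        """Absolute positions currently resident, oldest first."""
        return [pos for pos, _ in self.entries]


cache = SlidingKVCache(capacity=3)
for token in "abcde":
    cache.append(token)
print(cache.positions())  # [2, 3, 4]
```

The point of the example is that the next token is written at position 5, not at slot index 2: positions keep counting up while the window slides.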

Throughput at parity with llama.cpp on the same model — both are memory-bandwidth bound on identical ggml kernels, so there is nothing to win or lose here.
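
The memory-bandwidth argument can be made concrete with a back-of-the-envelope bound. The sketch below assumes a dense model where every decoded token reads all weights once plus whatever part of the KV cache attention touches. The function name is hypothetical and the numbers in the usage line are made up for illustration, not measurements of Embershard or any particular machine.

```python
def decode_tokens_per_second_bound(weight_bytes, kv_bytes_read, bandwidth_bytes_per_s):
    """Upper bound on decode speed when each token must stream
    weight_bytes + kv_bytes_read from memory at the given bandwidth."""
    bytes_per_token = weight_bytes + kv_bytes_read
    if bytes_per_token <= 0:
        raise ValueError("bytes per token must be positive")
    return bandwidth_bytes_per_s / bytes_per_token


# Illustrative only: 4 GB of weights, an empty KV cache, 100 GB/s of bandwidth.
print(decode_tokens_per_second_bound(4_000_000_000, 0, 100_000_000_000))  # 25.0
```

Two engines that issue the same kernels over the same weights hit the same bound, which is why the graph-building code above the kernels has little effect on decode throughput.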

One engine for everything. Plain chat and the planner → executor agent pipeline both run on es_gx; llama.cpp is not in the inference path.

Tokenizer parity. Token IDs match llama.cpp across the test corpus, with two backends: byte-level BPE (gpt2: llama-bpe, qwen2) and SentencePiece (llama/SPM: Llama 2, Mistral v0.1/v0.2, TinyLlama).
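
For readers unfamiliar with BPE, the core of the algorithm is a loop that repeatedly applies the highest-priority merge. The following is a generic sketch of that loop, not Embershard's tokenizer: it works on characters rather than bytes, skips pre-tokenization and special tokens, and the merge table in the usage example is invented.

```python
def bpe_merge(symbols, merge_ranks):
    """Apply BPE merges to a sequence of symbols.

    merge_ranks maps a pair of adjacent symbols to its priority;
    a lower rank means the pair is merged earlier.
    """
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no remaining pair is in the merge table
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Invented merge table for illustration.
ranks = {("l", "o"): 0, ("lo", "w"): 1}
print(bpe_merge("lower", ranks))  # ['low', 'e', 'r']
```

Token-ID parity then amounts to mapping each resulting symbol through the vocabulary and comparing the ID list with the one the reference tokenizer produces for the same text.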

Sharded GGUFs (-00001-of-N) load by following the split metadata.
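
The README says shards are located by following the split metadata, which this sketch does not do. As a simpler illustration of the naming pattern alone, the Python below expands a first-shard filename into the full list of sibling names. It assumes both the index and the count are zero-padded to five digits; the helper name and the file name "model" are hypothetical.

```python
import re

# Assumed naming pattern: <prefix>-<5-digit index>-of-<5-digit count>.gguf
SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<count>\d{5})\.gguf$")


def sibling_shards(shard_name):
    """Return every shard filename implied by one shard's name.

    A name that does not match the pattern is treated as a single-file model.
    """
    match = SHARD_RE.match(shard_name)
    if match is None:
        return [shard_name]
    prefix = match.group("prefix")
    count = int(match.group("count"))
    return [f"{prefix}-{i:05d}-of-{count:05d}.gguf" for i in range(1, count + 1)]


for name in sibling_shards("model-00001-of-00003.gguf"):
    print(name)
```

Reading the shard count from metadata, as the README describes, is more robust than trusting filenames, because a renamed file still identifies its own place in the set.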

Limits

Architectures stop at llama / qwen2. Gemma, Phi, and MoE models (gpt-oss, Mixtral, …) are unsupported and filtered out of the browser.

A model that exceeds the GPU working set is not streamed from SSD — loading it fails cleanly, and the browser filters by available RAM up front. SSD streaming is future work.

No bespoke Metal kernels yet: prefill uses ggml_flash_attn_ext, decode a manual ggml path.

Tokenizers past...
