Native Inference Engine for macOS 14 or newer


tictacguy/embershard: Native LLM inference engine and chat app for macOS / Apple Silicon.


Embershard

⬇️ Download the latest .dmg

Latest release: v0.1.1

Grab the signed .dmg from the latest release, drag Embershard into Applications, and right-click → Open the first time. You will need to approve its opening: go to System Settings → Privacy & Security, scroll down to the "Security" section, and click "Open Anyway". No clone, no toolchain. Apple Silicon, macOS 14 or newer.

Embershard is a macOS chat app with its own LLM inference engine underneath. The interesting part is what isn't there: at inference time the chat path never calls into llama.cpp. Embershard opens the GGUF on its own, pushes the weights to Metal, assembles its transformer compute graph directly on ggml, keeps the KV cache resident across turns, and runs its own byte-level BPE / SentencePiece tokenizer. ggml is used purely as a bag of tensor kernels.
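As a rough illustration of what "opening the GGUF on its own" involves at the very first step, here is a minimal Python sketch that reads the fixed-size GGUF header. It assumes the version 2/3 layout (little-endian: 4-byte magic `GGUF`, `uint32` version, `uint64` tensor count, `uint64` metadata key/value count). The names `GGUFHeader` and `read_gguf_header` are hypothetical and are not taken from Embershard's source, which is a native engine, not Python.

```python
import io
import struct
from typing import BinaryIO, NamedTuple

GGUF_MAGIC = b"GGUF"


class GGUFHeader(NamedTuple):
    """Fixed-size fields at the start of a GGUF file (hypothetical type name)."""
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(f: BinaryIO) -> GGUFHeader:
    """Read the fixed header, assuming the GGUF v2/v3 little-endian layout."""
    magic = f.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    raw_version = f.read(4)
    if len(raw_version) != 4:
        raise ValueError("truncated GGUF header")
    (version,) = struct.unpack("<I", raw_version)
    if version < 2:
        # Assumption: v1 used 32-bit counts, so this sketch does not handle it.
        raise ValueError(f"unsupported GGUF version {version}")
    raw_counts = f.read(16)
    if len(raw_counts) != 16:
        raise ValueError("truncated GGUF header")
    tensor_count, kv_count = struct.unpack("<QQ", raw_counts)
    return GGUFHeader(version, tensor_count, kv_count)


# Usage with an in-memory header; the counts are made-up illustrative values.
example = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24))
header = read_gguf_header(example)
```

The metadata key/value pairs and tensor descriptors follow this header; parsing them is where an engine would learn the architecture string, the tokenizer data, and the tensor offsets.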

That independence is deliberate, and so is the narrow scope: Embershard runs the llama and qwen2 families (Llama 3.x, Mistral, Qwen 2.5, and anything that reports those architectures in its GGUF) and nothing else. It is a focused engine checked for numerical parity against the reference, not a drop-in GGUF runner.

Embershard grew out of ds4, which set the template for a small, honest, self-contained native project. ds4 was the inspiration; the engine, app, and writing here are their own thing.

Orchestrated by me, written with ClaudeCode. App icon by DinosoftLabs.

Why bother re-implementing the forward pass

Wrapping libllama is the easy path, and it is the one most apps take. Embershard takes the harder one so the whole hot loop (graph construction, the KV-cache layout, the sampler, tokenization) lives in code we own and can reason about. llama.cpp and ggml still made it possible: their kernels, the GGUF format and its tooling, the quant formats, and a great deal of hard-won engineering were the map we followed while building everything above the tensor ops. We link ggml for those ops and the Metal backend, and keep llama.cpp around only for the experimental multi-agent orchestrator. Thanks to Georgi Gerganov and the contributors.

What's proven

Beta, but the core is measured rather than asserted:

Logit parity. The llama and qwen2 forward pass matches llama.cpp to a cosine similarity of 0.999999; greedy continuations come out token-for-token identical.
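To make the parity criterion concrete, the following is a small Python sketch of such a check: the cosine similarity between two logit vectors, plus agreement of the greedy (argmax) token. It is an illustration only. The function names and the `min_cosine` parameter are hypothetical, the threshold simply reuses the 0.999999 figure quoted above, and Embershard's actual test harness is not shown in this README.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def greedy_token(logits: Sequence[float]) -> int:
    """Index of the largest logit, i.e. the token greedy decoding would pick."""
    return max(range(len(logits)), key=logits.__getitem__)


def logits_match(ours: Sequence[float], reference: Sequence[float],
                 min_cosine: float = 0.999999) -> bool:
    """True if both the direction of the logits and the greedy token agree."""
    return (cosine_similarity(ours, reference) >= min_cosine
            and greedy_token(ours) == greedy_token(reference))


# Usage with made-up logits: a tiny perturbation should still pass.
reference_logits = [2.0, -1.0, 0.5, 3.25]
our_logits = [2.00001, -1.00001, 0.5, 3.25001]
parity_ok = logits_match(our_logits, reference_logits)
```

Checking the argmax alongside the cosine matters because a very high cosine does not by itself guarantee the same greedy token when two logits are nearly tied.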

Resident KV cache in F16 / Q8_0 / Q4_0, incremental O(n) decode, reused across turns. When the context fills, a sliding window evicts the oldest tokens while keeping absolute RoPE positions intact (no re-roping), so long conversations keep going.
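The eviction policy described above can be sketched in a few lines of Python. This is a toy model, not Embershard's cache layout: the class name `SlidingKVCache` and its fields are hypothetical, and keys are assumed to be stored already rotated for their absolute position, which is why dropping old entries requires no change to the ones that remain.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

Entry = Tuple[int, List[float], List[float]]  # (absolute position, key, value)


@dataclass
class SlidingKVCache:
    """Toy KV cache: fixed capacity, evicts the oldest token, never renumbers."""
    capacity: int
    next_pos: int = 0
    entries: Deque[Entry] = field(default_factory=deque)

    def __post_init__(self) -> None:
        if self.capacity < 1:
            raise ValueError("capacity must be at least 1")

    def append(self, key: List[float], value: List[float]) -> int:
        """Store one token's K/V and return the absolute position it was given."""
        if len(self.entries) == self.capacity:
            self.entries.popleft()  # evict the oldest token
        pos = self.next_pos  # positions keep counting up after eviction
        self.entries.append((pos, key, value))
        self.next_pos += 1
        return pos

    def positions(self) -> List[int]:
        """Absolute positions of the tokens currently resident."""
        return [pos for pos, _, _ in self.entries]


# Usage: a window of 3 tokens fed 5 tokens keeps positions 2, 3, 4.
cache = SlidingKVCache(capacity=3)
for token_index in range(5):
    cache.append([float(token_index)], [float(token_index)])
resident = cache.positions()
```

Because the surviving entries keep their original positions, the next token is simply assigned position 5 and attends over positions 2 through 4 with unchanged relative offsets.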

Throughput at parity with llama.cpp on the same model: both are memory-bandwidth bound on identical ggml kernels, so there is nothing to win or lose here.
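The memory-bandwidth argument can be written as a back-of-the-envelope estimate: if every decoded token has to read all the weights once, the decode rate is bounded by bandwidth divided by bytes read per token. The Python sketch below encodes that bound. The function name is hypothetical and the numbers in the usage line are purely illustrative round figures, not measurements of Embershard or of any particular Mac.

```python
def decode_tokens_per_second_bound(bandwidth_bytes_per_s: float,
                                   bytes_read_per_token: float) -> float:
    """Upper bound on decode speed when each token streams the weights once."""
    if bandwidth_bytes_per_s <= 0 or bytes_read_per_token <= 0:
        raise ValueError("both arguments must be positive")
    return bandwidth_bytes_per_s / bytes_read_per_token


# Illustrative only: a 4 GB quantized model on a 100 GB/s memory system.
bound = decode_tokens_per_second_bound(100e9, 4e9)
```

Two engines that run the same kernels over the same weights share this bound, which is the reason given above for expecting equal throughput.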

One engine for everything. Plain chat and the planner → executor agent pipeline both run on es_gx; llama.cpp is not in the inference path.

Tokenizer parity. Token IDs match llama.cpp across the test corpus, with two backends: byte-level BPE (gpt2: llama-bpe, qwen2) and SentencePiece (llama/SPM: Llama 2, Mistral v0.1/v0.2, TinyLlama).

Sharded GGUFs (-00001-of-N) load by following the split metadata.
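For readers unfamiliar with the shard suffix, here is a short Python sketch of the file-naming side only. It assumes the llama.cpp split convention `<prefix>-NNNNN-of-NNNNN.gguf` with five-digit, one-based indices. The function name `shard_paths` is hypothetical. Embershard is stated above to follow the split metadata stored inside the files; this sketch does not read that metadata.

```python
import re
from typing import List

# Assumed naming convention: <prefix>-00001-of-00003.gguf
_SHARD_RE = re.compile(r"^(?P<prefix>.*)-(?P<index>\d{5})-of-(?P<count>\d{5})\.gguf$")


def shard_paths(first_path: str) -> List[str]:
    """List every shard path implied by the name of one shard.

    A path that does not match the shard pattern is treated as a single file.
    """
    match = _SHARD_RE.match(first_path)
    if match is None:
        return [first_path]
    prefix = match.group("prefix")
    count = int(match.group("count"))
    return [f"{prefix}-{i:05d}-of-{count:05d}.gguf" for i in range(1, count + 1)]


# Usage with a made-up model name.
paths = shard_paths("models/example-00001-of-00003.gguf")
```

A loader would still want to confirm the shard count against the metadata in the first file instead of trusting the file name alone.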

Limits

Architectures stop at llama / qwen2. Gemma, Phi, and MoE models (gpt-oss, Mixtral, …) are unsupported and filtered out of the browser.

A model that exceeds the GPU working set is not streamed from SSD: loading it fails cleanly, and the browser filters by available RAM up front. SSD streaming is future work.

No bespoke Metal kernels yet: prefill uses ggml_flash_attn_ext, and decode uses a manual ggml path.

Tokenizers past...
