tictacguy/embershard: Native LLM inference engine and chat app for macOS / Apple Silicon.
Embershard
⬇️ Download the latest .dmg
Latest release: v0.1.1
Grab the signed .dmg from the latest release, drag Embershard into Applications, and right-click → Open the first time. You will need to approve the launch: go to System Settings → Privacy & Security, scroll down to the "Security" section, and click "Open Anyway". No clone, no toolchain. Requires Apple Silicon and macOS 14 or newer.
Embershard is a macOS chat app with its own LLM inference engine underneath. The interesting part is what isn't there: at inference time the chat path never calls into llama.cpp. Embershard opens the GGUF on its own, pushes the weights to Metal, assembles its transformer compute graph directly on ggml, keeps the KV cache resident across turns, and runs its own byte-level BPE / SentencePiece tokenizer. ggml is used purely as a bag of tensor kernels.
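To make "opens the GGUF on its own" a little more concrete, here is a minimal sketch of reading the fixed-size GGUF header. It is written in Python purely for illustration and is not Embershard's code; the function name `read_gguf_header` is hypothetical. It assumes the little-endian version 2/3 header layout: a 4-byte magic, a `uint32` version, then `uint64` tensor and metadata key/value counts.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (KV count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the start of a byte buffer.

    Assumption: little-endian file using the version 2 or 3 layout.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic")
    # "<IQQ" = little-endian uint32, uint64, uint64, read right after the magic.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    if version not in (2, 3):
        raise ValueError(f"unsupported GGUF version: {version}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

The metadata key/value pairs and tensor descriptors follow this header; a real loader would continue parsing from byte 24 onward before mapping the weights.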
That independence is deliberate, and so is the narrow scope: Embershard runs the llama and qwen2 families (Llama 3.x, Mistral, Qwen 2.5, and anything that reports those architectures in its GGUF) and nothing else. It is a focused engine checked for numerical parity against the reference, not a drop-in GGUF runner.
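The architecture gate described above can be sketched as a lookup on the GGUF metadata. This is an illustrative Python sketch, not the engine's actual check: `is_supported` and the `metadata` dict are hypothetical, and it assumes the architecture is read from the standard `general.architecture` key.

```python
# Architectures named in this README; everything else is rejected.
SUPPORTED_ARCHITECTURES = frozenset({"llama", "qwen2"})


def is_supported(metadata: dict) -> bool:
    """Return True if the model reports a supported architecture.

    `metadata` is assumed to be the already-parsed GGUF key/value table.
    """
    arch = metadata.get("general.architecture")
    return arch in SUPPORTED_ARCHITECTURES
```

A production filter would need more than this one key, for example to detect MoE variants; that part is left out of the sketch.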
Embershard grew out of ds4, which set the template for a small, honest, self-contained native project. ds4 was the inspiration; the engine, app, and writing here are their own thing.
Orchestrated by me, written with Claude Code. App icon by DinosoftLabs.
Why bother re-implementing the forward pass
Wrapping libllama is the easy path, and it is the one most apps take. Embershard takes the harder one so the whole hot loop (graph construction, the KV-cache layout, the sampler, tokenization) lives in code we own and can reason about. llama.cpp and ggml still made it possible: their kernels, the GGUF format and its tooling, the quant formats, and a great deal of hard-won engineering were the map we followed while building everything above the tensor ops. We link ggml for those ops and the Metal backend, and keep llama.cpp around only for the experimental multi-agent orchestrator. Thanks to Georgi Gerganov and the contributors.
What's proven
Beta, but the core is measured rather than asserted:
Logit parity. The llama and qwen2 forward pass matches llama.cpp to a cosine of 0.999999; greedy continuations come out token-for-token identical.
Resident KV cache in F16 / Q8_0 / Q4_0, incremental O(n) decode, reused across turns. When the context fills, a sliding window evicts the oldest tokens while keeping absolute RoPE positions intact (no re-roping), so long conversations keep going.
Throughput at parity with llama.cpp on the same model: both are memory-bandwidth bound on identical ggml kernels, so there is nothing to win or lose here.
One engine for everything. Plain chat and the planner → executor agent pipeline both run on es_gx; llama.cpp is not in the inference path.
Tokenizer parity. Token IDs match llama.cpp across the test corpus, with two backends: byte-level BPE (gpt2: llama-bpe, qwen2) and SentencePiece (llama/SPM: Llama 2, Mistral v0.1/v0.2, TinyLlama).
Sharded GGUFs (-00001-of-N) load by following the split metadata.
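For readers who want to see what the logit-parity claim above measures, here is a minimal Python sketch of a parity check between two logit vectors. It is not the project's test harness: `check_parity` is a hypothetical helper, and using the 0.999999 figure as a pass threshold is an assumption made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def argmax(values):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def check_parity(ours, reference, min_cosine=0.999999):
    """True if the logits agree in direction and pick the same greedy token."""
    same_direction = cosine_similarity(ours, reference) >= min_cosine
    same_greedy_token = argmax(ours) == argmax(reference)
    return same_direction and same_greedy_token
```

Running a check like this at every decode step covers both halves of the claim: the cosine compares the full distribution, and the argmax comparison is what makes greedy continuations come out identical.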
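The sliding-window eviction mentioned for the KV cache can be illustrated with a toy model. This Python sketch stores `(absolute_position, token)` pairs in place of real key/value tensors; the class `SlidingKVCache` and its methods are hypothetical and say nothing about how es_gx lays out memory. The point it shows is that eviction drops the oldest entries while the position counter keeps counting up, so surviving entries keep the positions they were encoded with.

```python
from collections import deque


class SlidingKVCache:
    """Toy cache holding (absolute_position, token) pairs, oldest first."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.entries = deque()
        self.next_position = 0  # absolute position; never reset on eviction

    def append(self, token):
        """Store one token, evicting the oldest entry if the cache is full."""
        if len(self.entries) == self.capacity:
            self.entries.popleft()
        position = self.next_position
        self.entries.append((position, token))
        self.next_position += 1
        return position

    def positions(self):
        """Absolute positions of the entries currently resident."""
        return [position for position, _ in self.entries]
```

Because RoPE attention scores depend on the difference between query and key positions, keys that were rotated at their original absolute positions stay usable after older neighbours are evicted, which is why no re-roping pass is needed.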
Limits
Architectures stop at llama / qwen2. Gemma, Phi, and MoE models (gpt-oss, Mixtral, …) are unsupported and filtered out of the browser.
A model that exceeds the GPU working set is not streamed from SSD: loading it fails cleanly, and the browser filters by available RAM up front. SSD streaming is future work.
No bespoke Metal kernels yet: prefill uses ggml_flash_attn_ext, decode a manual ggml path.
Tokenizers past...