Show HN: We built an LLM inference engine in pure Python

Release v2.0.0 — The Own Everything Release · Zyora-Dev/zse · GitHub

//releases/show" data-turbo-transient="true" />

Search or jump to...

Search code, repositories, users, issues, pull requests...

-->

Clear

Search syntax tips

Provide feedback

--> We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Cancel

Submit feedback

Saved searches

Use saved searches to filter your results more quickly

-->

Name

Query

To see all available qualifiers, see our documentation.

Cancel

Create saved search

//releases/show;ref_cta:Sign up;ref_loc:header logged out"}" Sign up

Appearance settings

Resetting focus

You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session.

Dismiss alert

Zyora-Dev

zse

Public

Notifications You must be signed in to change notification settings

Fork

Star 151

v2.0.0 — The Own Everything Release

Latest

Compare

Choose a tag to compare

Sorry, something went wrong.

Filter

Sorry, something went wrong.

Uh oh!

There was an error while loading. Please reload this page.

No results found

View all tags

zyoraclub

released this

22 May 11:47

4 commits

to main since this release

v2.0.0

55a9742

ZSE v2.0.0 — The "Own Everything" Release

A complete rewrite. Zero third-party dependencies. No PyTorch, no Triton, no transformers, no bitsandbytes. Pure-Python kernel compiler emits CUDA C, HIP C, and Metal Shading Language directly.

Install size: ~3 GB → ~5 MB .

Headline numbers (Qwen2.5-14B INT4 vs vLLM AWQ INT4 on A100-80GB)

Metric ZSE vLLM

Cold start 6.29s 127.02s 20.2×

VRAM used 12.28 GB 71.45 GB 5.82× less

Single-seq tok/s 37.0 26.5 1.40×

Validated on 6 platforms

GPU Cold start vs vLLM AWQ INT4 cold

NVIDIA T4 (sm_75) 7.25s 30.2× faster

NVIDIA L4 (sm_89) 5.58s 26.0× faster

NVIDIA A10G (sm_86) 6.01s 32.1× faster

NVIDIA A100-80GB 6.29s 20.2× faster

AMD MI300X 3.14s 13.6× faster (vs vLLM-ROCm FP16)

Apple M1 E2E vector_add validated, full inference pending

Install

pip install zse-engine zse serve model.zse --port 8000

Or run the kernel compiler standalone:

pip install zse-compiler

What's in this release

ZSE Kernel Compiler — @zse.kernel Python DSL → CUDA / HIP / Metal. Warp primitives, vectorized memory, block reductions, tiling, fusion, WMMA, CDNA3 MFMA matrix cores, auto-tuning.

.zse model format v2 — pre-quantized INT4/INT8/FP16, mmap-friendly, C-accelerated quantization (~600× faster). Adapters for Llama / Mistral / Qwen2 / Gemma2 / Phi3.

Own PagedAttention — adaptive block sizing, token-level eviction, FNV-1a dedup, COW forking.

ZStreamer — continuous batching, disaggregated prefill/decode, chunked prefill, speculative decoding (n-gram + self-draft).

Orchestrator — unified VRAM allocator, 29 GPU kernels on MI300X, CUDA Graphs + HIP Graphs, LoRA hot-swap.

Server — OpenAI-compatible API, API key auth, rate limiting, SQLite store, built-in RAG (/v1/rag/*), web dashboard.

RAG — BM25 + TF-IDF + dense embeddings (via the loaded LLM, zero extra deps) + Reciprocal Rank Fusion + LLM cross-encoder rerank.

Tensor Parallelism — pure-ctypes NCCL/RCCL wrapper, multi-process workers.

Breaking changes

Package rename: zllm-zse → zse-engine on PyPI

Module rename: zse → zse_engine

.zse format v2 is incompatible with 1.x — re-convert with zse convert

bnb / bitsandbytes backend removed

PyTorch / Triton / transformers dependencies removed

Full migration guide and detailed change log: CHANGELOG.md

Acknowledgments

AMD MI300X validation, 32B-parameter benchmarks, and our ROCm wave-64 kernel development were made possible by DigitalOcean's Open Source Sponsorship Program .

447 tests passing. Zero dependencies. Three GPU backends. One package.

Assets

Uh oh!

There was an error while loading. Please reload this page.

-->

All reactions

You can’t perform that action at this time.

Show HN: We built an LLM inference engine in pure Python – no PyTorch, no Triton

Related Articles

Show HN: We built an LLM inference engine in pure Python – no PyTorch, no Triton

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

It's Not Just X. It's Y

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy