Show HN: 150M Mandarin transcription model with real-time metadata detection

WhissleAI/STT-meta-ZH-150m · Hugging Face

STT-meta-ZH-100m

A dual-head Mandarin Chinese ASR model that simultaneously performs speech-to-text transcription and speaker attribute classification (age, gender, dialect) in a single forward pass.

Built on NVIDIA Citrinet-1024 with language-specific bottleneck adapters and a trailing tag classifier head, fine-tuned on 60 hours of meta-annotated Mandarin speech data using PromptingNemo.

| Metric | Value |
|---|---|
| Parameters | 157.7M |
| WER | 19.22% |
| Tag Accuracy | 94.2% |
| Language | Mandarin Chinese (zh) |
| Audio | 16kHz mono |

Architecture

    Audio (16kHz) ──▶ Mel Spectrogram (80-dim) ──▶ Citrinet-1024 Encoder (23 blocks)
                                                         ┌─────────┴─────────┐
                                                         ▼                   ▼
                                                    CTC Decoder       Tag Classifier
                                                   (5001 vocab)      (3 linear heads)
                                                         │                   │
                                                         ▼                   ▼
                                                  Transcription +     AGE / GENDER /
                                                    Entity Tags       DIALECT labels
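The tag branch of the diagram can be sketched in NumPy as follows. This is a minimal illustration, not the checkpoint's code: it assumes the encoder emits 1024-dimensional frames laid out as (batch, time, dim), that padding frames are masked out before averaging, and that each head is a single affine layer. The function names, the random weights, and the shapes are all hypothetical.

```python
import numpy as np

ENCODER_DIM = 1024  # assumed encoder width, matching the 1024 -> 5001 CTC projection
NUM_CLASSES = {"AGE": 5, "GENDER": 3, "DIALECT": 4}  # label counts per category


def masked_mean_pool_sketch(frames, lengths):
    """Average encoder frames over time, ignoring padded positions.

    frames:  (batch, time, dim) encoder output (layout is an assumption)
    lengths: (batch,) number of valid frames per utterance
    """
    time_steps = frames.shape[1]
    mask = (np.arange(time_steps)[None, :] < lengths[:, None]).astype(frames.dtype)
    summed = (frames * mask[:, :, None]).sum(axis=1)
    return summed / np.maximum(lengths, 1)[:, None]


def init_heads(rng):
    """Hypothetical randomly initialised heads: one (weight, bias) pair per category."""
    return {
        name: (rng.standard_normal((ENCODER_DIM, n)) * 0.01, np.zeros(n))
        for name, n in NUM_CLASSES.items()
    }


def classify(pooled, heads):
    """Apply each linear head and return the argmax class index per utterance."""
    return {name: (pooled @ w + b).argmax(axis=-1) for name, (w, b) in heads.items()}


rng = np.random.default_rng(0)
frames = rng.standard_normal((2, 50, ENCODER_DIM)).astype(np.float32)
lengths = np.array([50, 30])  # second utterance has 20 padded frames
pooled = masked_mean_pool_sketch(frames, lengths)
predictions = classify(pooled, init_heads(rng))
```

Pooling once per utterance means the attribute heads add almost no compute on top of the encoder pass, which is consistent with the single-forward-pass design described above.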

Parameter Breakdown

| Component | Parameters | Description |
|---|---|---|
| Citrinet-1024 Encoder | 140.4M | 23 Jasper-style blocks with squeeze-excitation |
| Language Adapter | 12.1M | Bottleneck adapters (dim=256) in each encoder block |
| CTC Decoder | 5.1M | Conv1d projecting 1024 → 5001 (BPE vocab + blank) |
| Tag Classifier | 12.3K | 3 linear heads on mean-pooled encoder output |
| Total | 157.7M | |
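The two small heads in the breakdown above can be checked with simple arithmetic. The sketch below assumes the CTC projection is a kernel-size-1 convolution with bias and that each tag head is one linear layer with bias on a 1024-dimensional input; neither detail is stated in the card, so treat both as assumptions. The encoder and adapter figures are taken from the table as rounded values.

```python
ENCODER_DIM = 1024
CTC_VOCAB = 5001  # 5000 BPE tokens + 1 blank
TAG_CLASSES = {"AGE": 5, "GENDER": 3, "DIALECT": 4}

# Conv1d with kernel size 1 (assumed): weights plus one bias per output channel.
ctc_params = ENCODER_DIM * CTC_VOCAB + CTC_VOCAB

# One linear layer with bias per tag category (assumed).
tag_params = sum((ENCODER_DIM + 1) * n for n in TAG_CLASSES.values())

components_m = {
    "encoder": 140.4,           # rounded value from the table
    "adapter": 12.1,            # rounded value from the table
    "ctc": ctc_params / 1e6,
    "tag": tag_params / 1e6,
}
total_m = sum(components_m.values())

print(f"CTC decoder:    {ctc_params:,} parameters")
print(f"Tag classifier: {tag_params:,} parameters")
print(f"Total:          {total_m:.2f}M")
```

Under these assumptions the heads come to 5,126,025 and 12,300 parameters, matching the 5.1M and 12.3K rows. The sum lands slightly below 157.7M, a gap small enough to be explained by the encoder and adapter rows being rounded to one decimal place.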

Tag Categories

| Category | Classes | Labels |
|---|---|---|
| AGE | 5 | NONE, AGE_14_25, AGE_26_40, AGE_, AGE_>41 |
| GENDER | 3 | NONE, GENDER_FEMALE, GENDER_MALE |
| DIALECT | 4 | NONE, DIALECT_NORTH, DIALECT_OTHERS, DIALECT_SOUTH |

The CTC head also outputs inline entity tags (e.g., ENTITY_PERSON_NAME ... END, ENTITY_TEMPERATURE ... END) as part of the transcription vocabulary.
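Downstream code usually needs the plain text, the entities, and the speaker attributes separately. The sketch below shows one way to split a tagged transcript, assuming tags are whitespace-separated, entities follow the `ENTITY_<TYPE> ... END` pattern, and attribute tags use the `AGE_`, `GENDER_`, and `DIALECT_` prefixes. The function name and the regular expressions are illustrative and not part of the model's tooling.

```python
import re

# Inline entity span: ENTITY_<TYPE> <text> END
ENTITY_PATTERN = re.compile(r"ENTITY_([A-Z_]+)\s+(.*?)\s+END\b")
# Speaker attribute tags such as AGE_26_40 or GENDER_MALE
ATTRIBUTE_PATTERN = re.compile(r"\b(AGE|GENDER|DIALECT)_(\S+)")


def parse_tagged_transcript(text):
    """Split a tagged transcript into (plain_text, entities, attributes)."""
    entities = [(m.group(1), m.group(2)) for m in ENTITY_PATTERN.finditer(text)]
    attributes = {m.group(1): m.group(2) for m in ATTRIBUTE_PATTERN.finditer(text)}
    # Keep the entity text but drop the surrounding markers.
    plain = ENTITY_PATTERN.sub(lambda m: m.group(2), text)
    plain = ATTRIBUTE_PATTERN.sub("", plain)
    plain = re.sub(r"\s+", " ", plain).strip()
    return plain, entities, attributes


sample = "你好世界。 AGE_26_40 GENDER_MALE ENTITY_PERSON_NAME 张三 END"
plain, entities, attributes = parse_tagged_transcript(sample)
print(plain)       # 你好世界。 张三
print(entities)    # [('PERSON_NAME', '张三')]
print(attributes)  # {'AGE': '26_40', 'GENDER': 'MALE'}
```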

Files

| File | Description |
|---|---|
| zh-citrinet-meta-v11.nemo | Full NeMo checkpoint (encoder + decoder + adapter + tag classifier) |
| onnx/model.onnx | ONNX model with dual outputs: logprobs (CTC) + encoder_output |
| onnx/tag_classifier.onnx | Standalone tag classifier (input: pooled encoder features) |
| onnx/tag_classifier.json | Tag classifier metadata (labels, class counts) |
| onnx/config.json | Preprocessor configuration (mel spectrogram parameters) |
| onnx/tokenizer.model | SentencePiece BPE tokenizer (5000 tokens) |
| onnx/vocabulary.json | Full vocabulary list with token mappings |

Usage

NeMo Inference

    import nemo.collections.asr as nemo_asr

    # Standard NeMo transcription (CTC head only). The tag classifier weights
    # are stored in the checkpoint, but EncDecCTCModelBPE does not load them
    # by default. For full dual-head inference, use ONNX or PromptingNemo.
    asr_model = nemo_asr.models.ASRModel.from_pretrained(
        "WhissleAI/STT-meta-ZH-100m"
    )

    transcriptions = asr_model.transcribe(["audio.wav"])
    print(transcriptions[0])
    # Output includes inline tags:
    # "你好世界。 AGE_26_40 GENDER_MALE ENTITY_PERSON_NAME 张三 END"

PromptingNemo Inference (Full Dual-Head)

For full dual-head inference with the tag classifier, use the PromptingNemo training framework:

    # Clone PromptingNemo
    # git clone https://github.com/WhissleAI/PromptingNemo.git

    import torch
    from huggingface_hub import hf_hub_download

    # Download the .nemo checkpoint
    nemo_path = hf_hub_download(
        repo_id="WhissleAI/STT-meta-ZH-100m",
        filename="zh-citrinet-meta-v11.nemo",
    )

    # Load with PromptingNemo's custom model class that includes the tag classifier
    # See: https://github.com/WhissleAI/PromptingNemo/blob/main/scripts/asr/meta-asr
    from scripts.asr.meta_asr.tag_classifier import (
        TrailingTagClassifier,
        build_trailing_tag_maps,
        masked_mean_pool,
    )

    # The tag_classifier weights are stored inside the .nemo archive.
    # PromptingNemo's training script loads them automatically.

ONNX Inference (Production — Recommended)

Self-contained inference using only onnxruntime, numpy, soundfile, and sentencepiece:

    import json
    import numpy as np
    import onnxruntime as ort
    import soundfile as sf
    import sentencepiece as spm
    from huggingface_hub import hf_hub_download

    # Download model files
    repo = "WhissleAI/STT-meta-ZH-100m"
    model_path = hf_hub_download(repo, "onnx/model.onnx")
    cls_path = hf_hub_download(repo, "onnx/tag_classifier.onnx")
    cls_meta_path = hf_hub_download(repo, "onnx/tag_classifier.json")
    tok_path = hf_hub_download(repo, "onnx/tokenizer.model")
    vocab_path = hf_hub_download(repo, "onnx/vocabulary.json")
    config_path = hf_hub_download(repo, "onnx/config.json")

    # Load config and vocabulary
    with open(config_path) as f:
        config = json.load(f)
    with open(vocab_path) as f:
        vocab_data = json.load(f)
    with open(cls_meta_path) as f:
        cls_meta = json.load(f)

    vocabulary = vocab_data["vocabulary"]
    blank_id = vocab_data.get("blank_id", len(vocabulary))

    # Load tokenizer
    sp = spm.SentencePieceProcessor()
    sp.Load(tok_path)

    # Load ONNX sessions
    asr_session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    cls_session = ort.InferenceSession(cls_path, providers=["CPUExecutionProvider"])

    # --- Preprocessing ---
    def preprocess_audio(audio_path, config):
        """Convert audio to log-mel spectrogram features."""
        audio, sr = sf.read(audio_path, dtype="float32")
        if sr != 16000:
            raise ValueError(f"Expected 16kHz audio, got {sr}Hz")
        if audio.ndim > 1:
            audio = audio.mean(axis=1)

        # Preemphasis
        preemph =...
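The listing is cut off before the decoding step. For orientation, a CTC head's `logprobs` output is commonly turned into token ids with greedy decoding: take the best class per frame, collapse consecutive repeats, and drop blanks. The sketch below illustrates that general technique on a toy matrix. It is not the model card's own decoding code, and the function name and the (time, vocab) layout are assumptions.

```python
import numpy as np


def ctc_greedy_decode(logprobs, blank_id):
    """Collapse a (time, vocab) log-probability matrix into token ids."""
    best = logprobs.argmax(axis=-1)
    tokens = []
    previous = blank_id
    for idx in best:
        idx = int(idx)
        # Emit a token only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank_id:
            tokens.append(idx)
        previous = idx
    return tokens


# Toy example: 4-symbol vocabulary where id 3 plays the role of the blank.
best_path = [0, 0, 3, 0, 1, 1, 3, 2]
toy_logprobs = np.full((len(best_path), 4), -10.0)
toy_logprobs[np.arange(len(best_path)), best_path] = -0.1

decoded = ctc_greedy_decode(toy_logprobs, blank_id=3)
print(decoded)  # [0, 0, 1, 2]
```

The blank between the two runs of `0` is what allows a repeated token to survive the collapse. With the real model, the resulting ids could then be handed to the SentencePiece tokenizer loaded earlier (for example via `sp.decode(ids)`) to recover text.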
