Show HN: 150M Mandarin transcription model with real-time metadata detection


WhissleAI/STT-meta-ZH-150m · Hugging Face


STT-meta-ZH-100m

A dual-head Mandarin Chinese ASR model that simultaneously performs speech-to-text transcription and speaker attribute classification (age, gender, dialect) in a single forward pass.

Built on NVIDIA Citrinet-1024 with language-specific bottleneck adapters and a trailing tag classifier head, fine-tuned on 60 hours of meta-annotated Mandarin speech data using PromptingNemo.

| Metric | Value |
|---|---|
| Parameters | 157.7M |
| WER | 19.22% |
| Tag Accuracy | 94.2% |
| Language | Mandarin Chinese (zh) |
| Audio | 16kHz mono |

Architecture

Audio (16kHz) ──▶ Mel Spectrogram (80-dim) ──▶ Citrinet-1024 Encoder (23 blocks)
                                                     ┌─────────┴─────────┐
                                                     ▼                   ▼
                                                CTC Decoder       Tag Classifier
                                               (5001 vocab)      (3 linear heads)
                                                     │                   │
                                                     ▼                   ▼
                                              Transcription +     AGE / GENDER /
                                                Entity Tags       DIALECT labels

Parameter Breakdown

| Component | Parameters | Description |
|---|---|---|
| Citrinet-1024 Encoder | 140.4M | 23 Jasper-style blocks with squeeze-excitation |
| Language Adapter | 12.1M | Bottleneck adapters (dim=256) in each encoder block |
| CTC Decoder | 5.1M | Conv1d projecting 1024 → 5001 (BPE vocab + blank) |
| Tag Classifier | 12.3K | 3 linear heads on mean-pooled encoder output |
| Total | 157.7M | |
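The CTC decoder and tag classifier figures can be reproduced with a little arithmetic. The sketch below assumes the decoder is a kernel-size-1 Conv1d with bias and that each tag head is a biased linear layer over the 1024-dim pooled feature; the head sizes (5, 3, 4) are taken from the label lists under Tag Categories. These layer details are assumptions, not something the card states.

```python
# Rough parameter-count check for the two output heads.
# Assumptions: kernel-size-1 Conv1d with bias for CTC, biased Linear per tag head.
ENC_DIM = 1024
VOCAB_WITH_BLANK = 5001  # 5000 BPE tokens + blank


def linear_params(in_dim, out_dim, bias=True):
    """Parameters of a dense projection (same count as a kernel-size-1 Conv1d)."""
    return in_dim * out_dim + (out_dim if bias else 0)


ctc_decoder = linear_params(ENC_DIM, VOCAB_WITH_BLANK)

# Number of classes per head, counted from the Tag Categories label lists.
tag_heads = {"AGE": 5, "GENDER": 3, "DIALECT": 4}
tag_classifier = sum(linear_params(ENC_DIM, n) for n in tag_heads.values())

print(f"CTC decoder:    {ctc_decoder:,}")     # about 5.1M
print(f"Tag classifier: {tag_classifier:,}")  # about 12.3K
```

Under these assumptions both counts round to the values in the table, which suggests the tag classifier adds almost nothing to the model size.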

Tag Categories

| Category | Classes | Labels |
|---|---|---|
| AGE | 5 | NONE, AGE_14_25, AGE_26_40, AGE_, AGE_>41 |
| GENDER | 3 | NONE, GENDER_FEMALE, GENDER_MALE |
| DIALECT | 4 | NONE, DIALECT_NORTH, DIALECT_OTHERS, DIALECT_SOUTH |

The CTC head also outputs inline entity tags (e.g., ENTITY_PERSON_NAME ... END, ENTITY_TEMPERATURE ... END) as part of the transcription vocabulary.

Files

| File | Description |
|---|---|
| zh-citrinet-meta-v11.nemo | Full NeMo checkpoint (encoder + decoder + adapter + tag classifier) |
| onnx/model.onnx | ONNX model with dual outputs: logprobs (CTC) + encoder_output |
| onnx/tag_classifier.onnx | Standalone tag classifier (input: pooled encoder features) |
| onnx/tag_classifier.json | Tag classifier metadata (labels, class counts) |
| onnx/config.json | Preprocessor configuration (mel spectrogram parameters) |
| onnx/tokenizer.model | SentencePiece BPE tokenizer (5000 tokens) |
| onnx/vocabulary.json | Full vocabulary list with token mappings |
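To make the standalone tag classifier more concrete, here is a minimal NumPy sketch of independent linear heads applied to one pooled encoder vector. The weights are random stand-ins (the real ones live in the checkpoint and in `onnx/tag_classifier.onnx`), the function names are hypothetical, and only the GENDER and DIALECT label sets are shown; AGE would work the same way.

```python
import numpy as np

# Label sets copied from the Tag Categories table.
LABELS = {
    "GENDER": ["NONE", "GENDER_FEMALE", "GENDER_MALE"],
    "DIALECT": ["NONE", "DIALECT_NORTH", "DIALECT_OTHERS", "DIALECT_SOUTH"],
}


def init_heads(enc_dim, labels, seed=0):
    """Create random stand-in (weight, bias) pairs, one per tag category."""
    rng = np.random.default_rng(seed)
    return {
        name: (rng.normal(0.0, 0.02, (enc_dim, len(classes))), np.zeros(len(classes)))
        for name, classes in labels.items()
    }


def classify(pooled, heads, labels):
    """pooled: (enc_dim,) mean-pooled encoder output -> one label per category."""
    result = {}
    for name, (weight, bias) in heads.items():
        logits = pooled @ weight + bias  # shape: (num_classes,)
        result[name] = labels[name][int(np.argmax(logits))]
    return result


heads = init_heads(1024, LABELS)
pooled = np.random.default_rng(1).normal(size=1024)
predicted = classify(pooled, heads, LABELS)
print(predicted)
```

Each head only sees the pooled vector, so the classifier can run as a separate, very cheap step after the encoder.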

Usage

NeMo Inference

import nemo.collections.asr as nemo_asr

# Standard NeMo transcription (CTC head only; tag classifier weights
# are stored in the checkpoint but EncDecCTCModelBPE does not load them
# by default). For full dual-head inference, use ONNX or PromptingNemo.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "WhissleAI/STT-meta-ZH-100m"
)

transcriptions = asr_model.transcribe(["audio.wav"])
print(transcriptions[0])
# Output includes inline tags:
# "你好世界。 AGE_26_40 GENDER_MALE ENTITY_PERSON_NAME 张三 END"
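Because the tags arrive inline in the transcript string, downstream code usually needs to separate them from the spoken text. The following is a small sketch that does this with regular expressions, assuming the tag layout matches the sample output above (`ENTITY_<TYPE> ... END` spans plus standalone `AGE_`, `GENDER_`, and `DIALECT_` tokens). The helper name `parse_tagged` is hypothetical and not part of NeMo or PromptingNemo.

```python
import re

# ENTITY_<TYPE> <surface text> END
ENTITY_RE = re.compile(r"ENTITY_([A-Z_]+)\s+(.+?)\s+END\b")
# Standalone speaker-attribute tokens such as AGE_26_40 or GENDER_MALE
META_RE = re.compile(r"\b(AGE|GENDER|DIALECT)_(\S+)")


def parse_tagged(text):
    """Split a tagged transcript into (plain_text, meta_tags, entities)."""
    entities = [(m.group(1), m.group(2)) for m in ENTITY_RE.finditer(text)]
    # Keep the entity surface text, drop the ENTITY_*/END markers.
    plain = ENTITY_RE.sub(lambda m: m.group(2), text)
    meta = {m.group(1): m.group(0) for m in META_RE.finditer(plain)}
    plain = META_RE.sub("", plain)
    plain = " ".join(plain.split())  # normalise leftover whitespace
    return plain, meta, entities


sample = "你好世界。 AGE_26_40 GENDER_MALE ENTITY_PERSON_NAME 张三 END"
plain, meta, entities = parse_tagged(sample)
print(plain)
print(meta)
print(entities)
```

Entity spans are handled first so that a tag name nested inside an entity type cannot be mistaken for a speaker attribute.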

PromptingNemo Inference (Full Dual-Head)

For full dual-head inference with the tag classifier, use the PromptingNemo training framework:

# Clone PromptingNemo
# git clone https://github.com/WhissleAI/PromptingNemo.git

import torch
from huggingface_hub import hf_hub_download

# Download the .nemo checkpoint
nemo_path = hf_hub_download(
    repo_id="WhissleAI/STT-meta-ZH-100m",
    filename="zh-citrinet-meta-v11.nemo",
)

# Load with PromptingNemo's custom model class that includes the tag classifier
# See: https://github.com/WhissleAI/PromptingNemo/blob/main/scripts/asr/meta-asr
from scripts.asr.meta_asr.tag_classifier import (
    TrailingTagClassifier,
    build_trailing_tag_maps,
    masked_mean_pool,
)

# The tag_classifier weights are stored inside the .nemo archive.
# PromptingNemo's training script loads them automatically.
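The card does not show what `masked_mean_pool` does internally. As a rough illustration of masked mean pooling in general, the NumPy sketch below averages encoder frames over time while ignoring padding. It assumes encoder output shaped (batch, channels, time) with per-utterance valid lengths; the function name `masked_mean_pool_np` is hypothetical, and this is not PromptingNemo's implementation.

```python
import numpy as np


def masked_mean_pool_np(encoded, lengths):
    """Average over valid time steps only.

    encoded: (B, C, T) encoder output, lengths: (B,) valid frame counts.
    Returns an array of shape (B, C).
    """
    _, _, time = encoded.shape
    # mask[b, t] is True for frames that belong to utterance b (not padding).
    mask = np.arange(time)[None, :] < np.asarray(lengths)[:, None]  # (B, T)
    mask = mask[:, None, :].astype(encoded.dtype)                   # (B, 1, T)
    summed = (encoded * mask).sum(axis=2)                           # (B, C)
    counts = np.maximum(mask.sum(axis=2), 1.0)                      # (B, 1), avoids divide-by-zero
    return summed / counts


encoded = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
lengths = [4, 2]  # second utterance has two padded frames
pooled = masked_mean_pool_np(encoded, lengths)
print(pooled.shape)
```

Pooling with a mask keeps padded frames in a batch from diluting the features of shorter utterances before they reach the tag heads.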

ONNX Inference (Production — Recommended)

Self-contained inference using only onnxruntime, numpy, soundfile, and sentencepiece:

import json
import numpy as np
import onnxruntime as ort
import soundfile as sf
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download model files
repo = "WhissleAI/STT-meta-ZH-100m"
model_path = hf_hub_download(repo, "onnx/model.onnx")
cls_path = hf_hub_download(repo, "onnx/tag_classifier.onnx")
cls_meta_path = hf_hub_download(repo, "onnx/tag_classifier.json")
tok_path = hf_hub_download(repo, "onnx/tokenizer.model")
vocab_path = hf_hub_download(repo, "onnx/vocabulary.json")
config_path = hf_hub_download(repo, "onnx/config.json")

# Load config and vocabulary
with open(config_path) as f:
    config = json.load(f)
with open(vocab_path) as f:
    vocab_data = json.load(f)
with open(cls_meta_path) as f:
    cls_meta = json.load(f)

vocabulary = vocab_data["vocabulary"]
blank_id = vocab_data.get("blank_id", len(vocabulary))

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.Load(tok_path)

# Load ONNX sessions
asr_session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
cls_session = ort.InferenceSession(cls_path, providers=["CPUExecutionProvider"])

# --- Preprocessing ---
def preprocess_audio(audio_path, config):
    """Convert audio to log-mel spectrogram features."""
    audio, sr = sf.read(audio_path, dtype="float32")
    if sr != 16000:
        raise ValueError(f"Expected 16kHz audio, got {sr}Hz")
    if audio.ndim > 1:
        audio = audio.mean(axis=1)

    # Preemphasis
    preemph =...
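For the decoding side of this pipeline, a common way to turn the `logprobs` output of a CTC model into token ids is greedy decoding: take the argmax per frame, collapse consecutive repeats, and drop blanks. The sketch below shows that technique on toy data. It assumes logprobs shaped (time, vocab + 1) for a single utterance and a blank index supplied by the caller, in line with the `blank_id` default in the listing above; it is an illustration, not the card's own decoding code.

```python
import numpy as np


def ctc_greedy_decode(logprobs, blank_id):
    """Greedy CTC decoding for one utterance.

    logprobs: (T, V + 1) per-frame log-probabilities.
    Returns the list of token ids after collapsing repeats and removing blanks.
    """
    best = np.argmax(logprobs, axis=-1)
    token_ids = []
    prev = blank_id
    for idx in best:
        idx = int(idx)
        # Emit a token only when it differs from the previous frame and is not blank.
        if idx != prev and idx != blank_id:
            token_ids.append(idx)
        prev = idx
    return token_ids


# Toy example: 3 tokens plus a blank at index 3.
toy_blank = 3
frames = [0, 0, 3, 0, 1, 1, 3, 2]
toy_logprobs = np.log(np.eye(4)[frames] + 1e-9)
decoded = ctc_greedy_decode(toy_logprobs, toy_blank)
print(decoded)
```

In the toy run, the blank between the two runs of token 0 keeps them as separate emissions, which is how CTC represents repeated tokens. The resulting ids would then be mapped back to text with the tokenizer or vocabulary list.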

