WhissleAI/STT-meta-ZH-150m · Hugging Face
STT-meta-ZH-100m
A dual-head Mandarin Chinese ASR model that simultaneously performs speech-to-text transcription and speaker attribute classification (age, gender, dialect) in a single forward pass.
Built on NVIDIA Citrinet-1024 with language-specific bottleneck adapters and a trailing tag classifier head, fine-tuned on 60 hours of meta-annotated Mandarin speech data using PromptingNemo.
| Metric | Value |
| --- | --- |
| Parameters | 157.7M |
| WER | 19.22% |
| Tag Accuracy | 94.2% |
| Language | Mandarin Chinese (zh) |
| Audio | 16 kHz mono |
Architecture
    Audio (16kHz) ──▶ Mel Spectrogram (80-dim) ──▶ Citrinet-1024 Encoder (23 blocks)
                                                         ┌─────────┴─────────┐
                                                         ▼                   ▼
                                                    CTC Decoder        Tag Classifier
                                                    (5001 vocab)       (3 linear heads)
                                                         │                   │
                                                         ▼                   ▼
                                                  Transcription +      AGE / GENDER /
                                                  Entity Tags          DIALECT labels
Parameter Breakdown
| Component | Parameters | Description |
| --- | --- | --- |
| Citrinet-1024 Encoder | 140.4M | 23 Jasper-style blocks with squeeze-excitation |
| Language Adapter | 12.1M | Bottleneck adapters (dim=256) in each encoder block |
| CTC Decoder | 5.1M | Conv1d projecting 1024 → 5001 (BPE vocab + blank) |
| Tag Classifier | 12.3K | 3 linear heads on mean-pooled encoder output |
| Total | 157.7M | |
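The two head sizes in the table can be sanity-checked with a few lines of arithmetic. This is only a sketch under assumptions the card does not state explicitly: the encoder output width is 1024, the CTC projection is a kernel-size-1 Conv1d with bias, and the tag classifier is three biased linear layers with 5, 3, and 4 classes (the label counts listed under Tag Categories).

```python
# Assumed shapes (not confirmed by the card): 1024-dim encoder output,
# kernel-size-1 Conv1d with bias for CTC, biased linear tag heads.
ENCODER_DIM = 1024
VOCAB_WITH_BLANK = 5001
TAG_CLASSES = {"AGE": 5, "GENDER": 3, "DIALECT": 4}


def dense_params(in_dim: int, out_dim: int) -> int:
    """Weights plus biases of a linear layer (or a kernel-size-1 Conv1d)."""
    return in_dim * out_dim + out_dim


ctc_params = dense_params(ENCODER_DIM, VOCAB_WITH_BLANK)
tag_params = sum(dense_params(ENCODER_DIM, n) for n in TAG_CLASSES.values())

print(f"CTC decoder:    {ctc_params:,} (~{ctc_params / 1e6:.1f}M)")
print(f"Tag classifier: {tag_params:,} (~{tag_params / 1e3:.1f}K)")
```

Under these assumptions the counts come out to 5,126,025 and 12,300, which round to the 5.1M and 12.3K reported above.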
Tag Categories
| Category | Classes | Labels |
| --- | --- | --- |
| AGE | 5 | NONE, AGE_14_25, AGE_26_40, AGE_, AGE_>41 |
| GENDER | 3 | NONE, GENDER_FEMALE, GENDER_MALE |
| DIALECT | 4 | NONE, DIALECT_NORTH, DIALECT_OTHERS, DIALECT_SOUTH |
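To make the "linear heads on mean-pooled encoder output" idea concrete, here is a minimal NumPy sketch. It is not the PromptingNemo implementation: the `(batch, dim, time)` layout, the function bodies, and the random weights are all assumptions for illustration, and only the GENDER and DIALECT heads are shown (an AGE head would work the same way).

```python
import numpy as np

# Label lists copied from the table above; weights below are random stand-ins.
TAG_LABELS = {
    "GENDER": ["NONE", "GENDER_FEMALE", "GENDER_MALE"],
    "DIALECT": ["NONE", "DIALECT_NORTH", "DIALECT_OTHERS", "DIALECT_SOUTH"],
}


def masked_mean_pool(encoded, length):
    """Average frames over time, ignoring padding.

    encoded: (batch, dim, time) array (assumed layout).
    length:  (batch,) number of valid frames per utterance.
    """
    time = encoded.shape[-1]
    mask = np.arange(time)[None, :] < length[:, None]   # (batch, time)
    summed = (encoded * mask[:, None, :]).sum(axis=-1)  # (batch, dim)
    return summed / np.maximum(length, 1)[:, None]


def classify(pooled, heads, labels):
    """Apply one linear head per category and pick the arg-max label."""
    result = {}
    for name, (weight, bias) in heads.items():
        logits = pooled @ weight.T + bias               # (batch, n_classes)
        result[name] = [labels[name][i] for i in logits.argmax(axis=-1)]
    return result


rng = np.random.default_rng(0)
encoded = rng.standard_normal((2, 1024, 50)).astype(np.float32)
lengths = np.array([50, 30])
heads = {
    name: (rng.standard_normal((len(lbls), 1024)).astype(np.float32),
           np.zeros(len(lbls), dtype=np.float32))
    for name, lbls in TAG_LABELS.items()
}
print(classify(masked_mean_pool(encoded, lengths), heads, TAG_LABELS))
```

Masking before averaging matters for batched inference: without it, zero-padded frames of shorter utterances would dilute the pooled vector.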
The CTC head also outputs inline entity tags (e.g., ENTITY_PERSON_NAME ... END, ENTITY_TEMPERATURE ... END) as part of the transcription vocabulary.
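Downstream code usually wants the plain text separated from these inline tags. The following is a rough sketch of one way to do that; the function name `parse_tagged_transcript` and the regular expressions are illustrative assumptions based on the tag shapes shown in this card (`ENTITY_* ... END` spans and `AGE_*` / `GENDER_*` / `DIALECT_*` tokens), not part of the model's tooling.

```python
import re

# Assumed tag shapes: "ENTITY_<TYPE> <surface text> END" and single-token
# speaker attributes prefixed with AGE_, GENDER_ or DIALECT_.
ENTITY_SPAN = re.compile(r"(ENTITY_[A-Z0-9_]+)\s+(.*?)\s+END\b")
ATTRIBUTE = re.compile(r"\b(?:AGE|GENDER|DIALECT)_\S+")


def parse_tagged_transcript(text):
    """Split a tagged transcript into plain text, attributes, and entities."""
    entities = []

    def keep_surface(match):
        # Record (type, surface text) and keep only the surface text in place.
        entities.append((match.group(1), match.group(2)))
        return match.group(2)

    text = ENTITY_SPAN.sub(keep_surface, text)
    attributes = ATTRIBUTE.findall(text)
    text = ATTRIBUTE.sub("", text)
    return {
        "text": " ".join(text.split()),
        "attributes": attributes,
        "entities": entities,
    }


parsed = parse_tagged_transcript(
    "你好世界。 AGE_26_40 GENDER_MALE ENTITY_PERSON_NAME 张三 END"
)
print(parsed)
# {'text': '你好世界。 张三', 'attributes': ['AGE_26_40', 'GENDER_MALE'],
#  'entities': [('ENTITY_PERSON_NAME', '张三')]}
```

Entity spans are resolved first so that the surface text stays in the transcript while the tag tokens are dropped.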
Files
| File | Description |
| --- | --- |
| zh-citrinet-meta-v11.nemo | Full NeMo checkpoint (encoder + decoder + adapter + tag classifier) |
| onnx/model.onnx | ONNX model with dual outputs: logprobs (CTC) + encoder_output |
| onnx/tag_classifier.onnx | Standalone tag classifier (input: pooled encoder features) |
| onnx/tag_classifier.json | Tag classifier metadata (labels, class counts) |
| onnx/config.json | Preprocessor configuration (mel spectrogram parameters) |
| onnx/tokenizer.model | SentencePiece BPE tokenizer (5000 tokens) |
| onnx/vocabulary.json | Full vocabulary list with token mappings |
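The `logprobs` output of `onnx/model.onnx` is a CTC output, so turning it into token IDs requires collapsing repeated frames and dropping the blank symbol. A minimal greedy decoder might look like the sketch below. It assumes a single utterance with `logprobs` of shape `(time, vocab)` and the blank at the last index (5000 for a 5001-way output); check both against the exported model before relying on it.

```python
import numpy as np


def ctc_greedy_ids(logprobs, blank_id):
    """Greedy CTC decoding for one utterance.

    logprobs: (time, vocab) array of per-frame log-probabilities (assumed shape).
    Returns token IDs after collapsing repeats and removing blanks.
    """
    best = logprobs.argmax(axis=-1)
    ids = []
    previous = None
    for token in best.tolist():
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token != previous and token != blank_id:
            ids.append(token)
        previous = token
    return ids


# Toy example with a 4-symbol vocabulary where index 3 is the blank.
frames = [1, 1, 3, 1, 2, 2, 3]
toy_logprobs = np.log(np.eye(4)[frames] + 1e-9)
print(ctc_greedy_ids(toy_logprobs, blank_id=3))  # [1, 1, 2]
```

The resulting IDs can then be mapped back to text through the tokenizer or the vocabulary file listed above.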
Usage
NeMo Inference
    import nemo.collections.asr as nemo_asr

    # Standard NeMo transcription (CTC head only: tag classifier weights
    # are stored in the checkpoint but EncDecCTCModelBPE does not load them
    # by default). For full dual-head inference, use ONNX or PromptingNemo.
    asr_model = nemo_asr.models.ASRModel.from_pretrained(
        "WhissleAI/STT-meta-ZH-100m"
    )

    transcriptions = asr_model.transcribe(["audio.wav"])
    print(transcriptions[0])
    # Output includes inline tags:
    # "你好世界。 AGE_26_40 GENDER_MALE ENTITY_PERSON_NAME 张三 END"
PromptingNemo Inference (Full Dual-Head)
For full dual-head inference with the tag classifier, use the PromptingNemo training framework:
    # Clone PromptingNemo
    # git clone https://github.com/WhissleAI/PromptingNemo.git

    import torch
    from huggingface_hub import hf_hub_download

    # Download the .nemo checkpoint
    nemo_path = hf_hub_download(
        repo_id="WhissleAI/STT-meta-ZH-100m",
        filename="zh-citrinet-meta-v11.nemo",
    )

    # Load with PromptingNemo's custom model class that includes the tag classifier
    # See: https://github.com/WhissleAI/PromptingNemo/blob/main/scripts/asr/meta-asr
    from scripts.asr.meta_asr.tag_classifier import (
        TrailingTagClassifier,
        build_trailing_tag_maps,
        masked_mean_pool,
    )

    # The tag_classifier weights are stored inside the .nemo archive.
    # PromptingNemo's training script loads them automatically.
ONNX Inference (Production — Recommended)
Self-contained inference using only onnxruntime, numpy, soundfile, and sentencepiece:
    import json
    import numpy as np
    import onnxruntime as ort
    import soundfile as sf
    import sentencepiece as spm
    from huggingface_hub import hf_hub_download

    # Download model files
    repo = "WhissleAI/STT-meta-ZH-100m"
    model_path = hf_hub_download(repo, "onnx/model.onnx")
    cls_path = hf_hub_download(repo, "onnx/tag_classifier.onnx")
    cls_meta_path = hf_hub_download(repo, "onnx/tag_classifier.json")
    tok_path = hf_hub_download(repo, "onnx/tokenizer.model")
    vocab_path = hf_hub_download(repo, "onnx/vocabulary.json")
    config_path = hf_hub_download(repo, "onnx/config.json")

    # Load config and vocabulary
    with open(config_path) as f:
        config = json.load(f)
    with open(vocab_path) as f:
        vocab_data = json.load(f)
    with open(cls_meta_path) as f:
        cls_meta = json.load(f)

    vocabulary = vocab_data["vocabulary"]
    blank_id = vocab_data.get("blank_id", len(vocabulary))

    # Load tokenizer
    sp = spm.SentencePieceProcessor()
    sp.Load(tok_path)

    # Load ONNX sessions
    asr_session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    cls_session = ort.InferenceSession(cls_path, providers=["CPUExecutionProvider"])
    # --- Preprocessing ---
    def preprocess_audio(audio_path, config):
        """Convert audio to log-mel spectrogram features."""
        audio, sr = sf.read(audio_path, dtype="float32")
        if sr != 16000:
            raise ValueError(f"Expected 16kHz audio, got {sr}Hz")
        if audio.ndim > 1:
            audio = audio.mean(axis=1)

        # Preemphasis
        preemph =...