Cloning a voice at 48 kHz with VoxCPM2 in ElevenLabs API quality



May 17, 2026

A new TTS model just landed in Soniqo. It runs on your laptop, outputs studio-quality 48 kHz audio, and clones a voice from a single short clip. This post walks through what you can build with it, the three ways it lets you clone a voice, and how the model works inside.

What you can build
Four things that change when cloning runs locally.

Running cloning on the device unlocks four properties at once: privacy, offline use, no per-call cost, and full voice ownership. Each of these opens a class of product that's awkward to build any other way.

Personal audiobook narrators
Record 30 seconds of a parent reading. The audiobook app then narrates any chapter in their voice, with the same warmth and the same accent, generated locally each session.

Multilingual creator content
YouTubers and podcasters keep one consistent voice across 30 languages. Record once in English, ship the same episode in Japanese, Spanish, and Hindi without a vocal cast.

Accessibility & voice banking
People facing voice loss can bank their voice in a short clip and keep speaking through assistive tech that sounds like them, not like a generic TTS engine.

Product voices on demand
Describe the voice you want ("young woman, gentle and warm") and the model designs it without a reference recording. Useful for game NPCs, kiosk prompts, or A/B testing brand voices.

On-device vs hosted
How VoxCPM2 compares to ElevenLabs.

ElevenLabs is the obvious cloud-API alternative. The trade-off is what runs where, and who owns the voice afterwards.

For products that need privacy guarantees, offline operation, or zero per-call cost, on-device cloning is the only option: every ElevenLabs call uploads audio to their servers.

| | VoxCPM2 (Soniqo) | ElevenLabs |
|---|---|---|
| Where it runs | On the user's device | Hosted API |
| Audio leaves the device | No | Yes (uploaded to ElevenLabs) |
| Offline use | Yes | No (requires internet) |
| Per-call cost | None | Per-character billing |
| Model licence | Apache 2.0, open weights | Proprietary, SaaS only |
| Max output sample rate | 48 kHz native | 48 kHz (Pro tier and above) |
| Languages | 30 | 29 (Multilingual v2) · 70+ (Eleven v3) |
| Reference clip required | 5–30 s | 1 min (Instant) · 30 min (Professional) |
| Voice design from text | Yes | Yes |

Both engines reach 48 kHz; both support a similar language spread for everyday cloning; both expose voice design from a text description. The genuine difference is whether the audio ever leaves the device.

Three cloning modes
One model, three ways in.

The model is the same in every call. What changes is which arguments you pass: that decides whether you're designing a voice from a description, copying a recorded one, or preserving an accent.
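As a rough sketch of how "which arguments you pass" maps onto the three modes, the enum below pairs each mode with the argument labels used in this post's snippets (`instruct`, `refAudio`, `promptText`, `promptAudio`). The `VoiceSource` type and its `argumentLabels` property are hypothetical and not part of the Soniqo API; they only illustrate the mapping.

```swift
// Hypothetical wrapper type; only the argument labels come from the post's snippets.
enum VoiceSource {
    case design(description: String)                     // no recording available
    case reference(clipPath: String)                     // short clip of the speaker
    case ultimate(clipPath: String, transcript: String)  // clip plus what it says

    /// Labels of the generateVoxCPM2 arguments this mode fills in, besides `text`.
    var argumentLabels: [String] {
        switch self {
        case .design:
            return ["instruct"]
        case .reference:
            return ["refAudio"]
        case .ultimate:
            return ["refAudio", "promptText", "promptAudio"]
        }
    }
}

let source = VoiceSource.ultimate(clipPath: "speaker.wav", transcript: "what the clip says")
print(source.argumentLabels)
```

Modelling the choice as an enum makes an invalid combination, such as a transcript without a clip, impossible to express at the call site.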

Voice design
When you don't have a reference recording.

Describe the voice in natural language. The model picks a matching voice and stays consistent across calls.

    // `tts` is an already-loaded TTS instance (setup is not shown in this post).
    let audio = try await tts.generateVoxCPM2(
        text: "Welcome to the show.",
        instruct: "A young woman, gentle and warm voice."
    )
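The snippets in this post stop at `audio`. As a minimal sketch of saving 48 kHz output, the helper below encodes samples as a 16-bit PCM WAV file in memory. It assumes the result can be read as mono `[Float]` samples in the range -1...1; the post does not say what type `generateVoxCPM2` returns, and `makeWAV` is a hypothetical helper, not a Soniqo function.

```swift
import Foundation

/// Hypothetical helper: encodes mono samples in -1...1 as 16-bit PCM WAV bytes.
func makeWAV(samples: [Float], sampleRate: UInt32 = 48_000) -> Data {
    let channels: UInt16 = 1
    let bitsPerSample: UInt16 = 16
    let blockAlign = channels * bitsPerSample / 8        // bytes per frame
    let byteRate = sampleRate * UInt32(blockAlign)       // bytes per second
    let dataSize = UInt32(samples.count) * UInt32(blockAlign)

    var wav = Data()
    // Appends an integer in little-endian byte order, as RIFF requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { wav.append(contentsOf: $0) }
    }

    wav.append(contentsOf: "RIFF".utf8)
    append(36 + dataSize)                // size of everything after this field
    wav.append(contentsOf: "WAVE".utf8)
    wav.append(contentsOf: "fmt ".utf8)
    append(UInt32(16))                   // size of the PCM format chunk
    append(UInt16(1))                    // format tag 1 = uncompressed PCM
    append(channels)
    append(sampleRate)
    append(byteRate)
    append(blockAlign)
    append(bitsPerSample)
    wav.append(contentsOf: "data".utf8)
    append(dataSize)

    for sample in samples {
        // Clamp to the valid range; non-finite values become silence.
        let clamped = sample.isFinite ? max(-1, min(1, sample)) : 0
        append(Int16(clamped * 32767))
    }
    return wav
}

let wav = makeWAV(samples: [0, 1, -1])
print(wav.count)
```

The resulting bytes can be written to disk with Foundation's `Data.write(to:)`. If the library ships its own audio writer, that is the better choice; this sketch only shows what a 48 kHz mono file contains.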

Reference cloning
When you have a short clip of the target speaker.

Pass any 5–30 s of clean speech. The model copies the timbre and rhythm and synthesises new text in that voice.

    // Load the reference clip, resampled to 16 kHz.
    let ref = try AudioFileLoader.load(
        url: URL(fileURLWithPath: "speaker.wav"),
        targetSampleRate: 16000
    )

    // Synthesise new text in the reference speaker's voice.
    let audio = try await tts.generateVoxCPM2(
        text: "This is a cloned voice.",
        refAudio: ref
    )
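The 5–30 s window is easy to check before calling the model. The sketch below derives the clip duration from a sample count at the 16 kHz rate used in the loader call and rejects clips outside the window. `validateReferenceClip` and `ReferenceClipError` are hypothetical names, and the post does not say whether the library performs a similar check itself.

```swift
// Hypothetical error type for clips outside the 5–30 s window.
enum ReferenceClipError: Error, Equatable {
    case tooShort(seconds: Double)
    case tooLong(seconds: Double)
}

/// Returns the clip duration in seconds, or throws if it falls outside 5...30 s.
func validateReferenceClip(sampleCount: Int, sampleRate: Double = 16_000) throws -> Double {
    let seconds = Double(sampleCount) / sampleRate
    if seconds < 5 { throw ReferenceClipError.tooShort(seconds: seconds) }
    if seconds > 30 { throw ReferenceClipError.tooLong(seconds: seconds) }
    return seconds
}

// 160,000 samples at 16 kHz is a 10-second clip, inside the window.
do {
    let seconds = try validateReferenceClip(sampleCount: 160_000)
    print("Clip is \(seconds) s long")
} catch {
    print("Unusable reference clip: \(error)")
}
```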

Ultimate cloning
When the speaker has a distinctive accent and you want it preserved.

Pass the clip and its transcript. The model can now line up acoustic features with phonemes, so accent and vowel choices carry through.

    // `ref` is the clip loaded in the reference-cloning example;
    // it is passed as both `refAudio` and `promptAudio`.
    let audio = try await tts.generateVoxCPM2(
        text: "Hello from the cloned voice.",
        refAudio: ref,
        promptText: "this is what the reference clip actually said",
        promptAudio: ref
    )

Figure: Three cloning modes, same model. Each mode arranges different pieces in the input sequence before the model. Voice design adds a written description, reference cloning adds an audio prefix, and ultimate cloning adds a paired audio-and-transcript example.

Voice design: (description), text to say
Reference cloning: reference audio, text to say
Ultimate cloning: reference audio, transcript, text to say
Legend: prompt audio, audio frames, text condition, text to synthesise

The same input slot, filled with different pieces. The model never sees a flag; it reads the sequence.
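The "same slot, different pieces" idea can be sketched as a function that assembles an ordered list of segments with no mode flag anywhere. The `Segment` type and `buildSequence` function are hypothetical, and the ordering is read off the figure; the model's real tokenisation and internal layout are not described in this post.

```swift
// Hypothetical stand-ins for the pieces named in the figure.
enum Segment: Equatable {
    case description(String)   // voice design instruction
    case referenceAudio        // placeholder for the encoded reference clip
    case transcript(String)    // what the reference clip says
    case textToSay(String)     // text to synthesise
}

/// Builds the input layout from whichever pieces are present; there is no mode flag.
func buildSequence(text: String,
                   instruct: String? = nil,
                   hasRefAudio: Bool = false,
                   promptText: String? = nil) -> [Segment] {
    var sequence: [Segment] = []
    if let instruct = instruct {
        sequence.append(.description(instruct))
    }
    if hasRefAudio {
        sequence.append(.referenceAudio)
        // A transcript only makes sense when paired with the clip it describes.
        if let promptText = promptText {
            sequence.append(.transcript(promptText))
        }
    }
    sequence.append(.textToSay(text))
    return sequence
}

print(buildSequence(text: "Hello from the cloned voice.",
                    hasRefAudio: true,
                    promptText: "what the clip says"))
```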

Under the hood
How VoxCPM2 produces audio.

Four cooperating modules. You don't need to know any of this to use the model, but if you're curious where the 48 kHz comes from, here it is.

Figure: VoxCPM2 architecture. Text and optional voice prompts feed an autoregressive language model and a residual refiner. A local diffusion transformer produces audio latents, which the AudioVAE V2 decodes to a 48 kHz waveform.

Inputs: text, voice prompt audio, voice instruction, prompt transcript
LocEnc: audio + text fused into one stream
TSLM (MiniCPM-4 backbone): 28-layer autoregressive LM that decides what audio patch comes next
RALM: refines each patch for prosodic detail
LocDiT ·...

