Cloning a voice at 48 kHz with VoxCPM2 in ElevenLabs API quality


Blog · Voice cloning · May 17, 2026

A new TTS model just landed in Soniqo. It runs on your laptop, outputs studio-quality 48 kHz audio, and clones a voice from a single short clip. This post walks through what you can build with it, the three ways it lets you clone a voice, and a friendly look at how the model works inside.

What you can build

Four things that change when cloning runs locally.

Running cloning on the device unlocks four properties at once: privacy, offline use, no per-call cost, and full voice ownership. Each of these opens a class of product that's awkward to build any other way.

Personal audiobook narrators
Record 30 seconds of a parent reading. The audiobook app then narrates any chapter in their voice, with the same warmth and the same accent, generated locally each session.

Multilingual creator content
YouTubers and podcasters keep one consistent voice across 30 languages. Record once in English, then ship the same episode in Japanese, Spanish, and Hindi without a vocal cast.

Accessibility & voice banking
People facing voice loss can bank their voice in a short clip and keep speaking through assistive tech that sounds like them, not like a generic TTS engine.

Product voices on demand
Describe the voice you want ("young woman, gentle and warm") and the model designs it without a reference recording. Useful for game NPCs, kiosk prompts, or A/B testing brand voices.

On-device vs hosted

How VoxCPM2 compares to ElevenLabs.

ElevenLabs is the obvious cloud-API alternative. The trade-off is what runs where, and who owns the voice afterwards.

For products that need privacy guarantees, offline operation, or zero per-call cost, on-device cloning is the only option, because every ElevenLabs call uploads audio to their servers.

| | VoxCPM2 (Soniqo) | ElevenLabs |
|---|---|---|
| Where it runs | On the user’s device | Hosted API |
| Audio leaves the device | No | Yes (uploaded to ElevenLabs) |
| Offline use | Yes | No (requires internet) |
| Per-call cost | None | Per-character billing |
| Model licence | Apache 2.0, open weights | Proprietary, SaaS only |
| Max output sample rate | 48 kHz native | 48 kHz (Pro tier and above) |
| Languages | 30 | 29 (Multilingual v2) · 70+ (Eleven v3) |
| Reference clip required | 5–30 s | 1 min (Instant) · 30 min (Professional) |
| Voice design from text | Yes | Yes |

Both engines reach 48 kHz; both support a similar language spread for everyday cloning; both expose voice design from a text description. The genuine difference is whether the audio ever leaves the device.

Three cloning modes

One model, three ways in.

The model is the same in every call. What changes is which arguments you pass: they decide whether you're designing a voice from a description, copying a recorded one, or preserving an accent.

Voice design
When you don't have a reference recording.

Describe the voice in natural language. The model picks a matching voice and stays consistent across calls.

    // No reference audio: the voice comes from the text description alone.
    let audio = try await tts.generateVoxCPM2(
        text: "Welcome to the show.",
        instruct: "A young woman, gentle and warm voice."
    )

Reference cloning
When you have a short clip of the target speaker.

Pass any 5–30 s of clean speech. The model copies the timbre and rhythm and synthesises new text in that voice.

    // Load the reference clip, resampled to 16 kHz.
    let ref = try AudioFileLoader.load(
        url: URL(fileURLWithPath: "speaker.wav"),
        targetSampleRate: 16000
    )
    // Synthesise new text in the reference speaker's voice.
    let audio = try await tts.generateVoxCPM2(
        text: "This is a cloned voice.",
        refAudio: ref
    )
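Since the clip length matters, it can help to check it before calling the model. The following is a minimal sketch, assuming a mono buffer whose sample count and sample rate are known. `ReferenceClipCheck` is a hypothetical helper written for this post and is not part of the Soniqo API; the 5–30 s window is the only number taken from the text above.

```swift
/// Hypothetical helper (not part of the Soniqo API): checks that a mono
/// reference clip falls inside the 5–30 s window recommended above.
struct ReferenceClipCheck {
    let minSeconds: Double = 5
    let maxSeconds: Double = 30

    /// Duration in seconds of a mono buffer at the given sample rate.
    func duration(sampleCount: Int, sampleRate: Int) -> Double {
        precondition(sampleRate > 0, "sample rate must be positive")
        return Double(sampleCount) / Double(sampleRate)
    }

    /// True when the clip length lies within the recommended window.
    func isUsable(sampleCount: Int, sampleRate: Int) -> Bool {
        let seconds = duration(sampleCount: sampleCount, sampleRate: sampleRate)
        return seconds >= minSeconds && seconds <= maxSeconds
    }
}

let check = ReferenceClipCheck()
// 12 s of audio at 16 kHz, the rate used when loading the clip.
let twelveSecondsOk = check.isUsable(sampleCount: 12 * 16_000, sampleRate: 16_000)
// 2 s is shorter than the recommended minimum.
let twoSecondsOk = check.isUsable(sampleCount: 2 * 16_000, sampleRate: 16_000)
print(twelveSecondsOk, twoSecondsOk)
```

A check like this only looks at length. Whether the speech is "clean" enough still has to be judged by ear or by a separate noise check.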

Ultimate cloning
When the speaker has a distinctive accent and you want it preserved.

Pass the clip and its transcript. The model can now line up acoustic features with phonemes, so accent and vowel choices carry through.

    // `ref` is the clip loaded in the reference-cloning example.
    let audio = try await tts.generateVoxCPM2(
        text: "Hello from the cloned voice.",
        refAudio: ref,
        promptText: "this is what the reference clip actually said",
        promptAudio: ref
    )
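The three calls differ only in which optional arguments are present. One way to make that explicit in app code is to infer the mode from the arguments, as in the sketch below. `CloneRequest`, `CloningMode`, and `cloningMode(for:)` are hypothetical names that mirror the argument labels used in this post; `[Float]` stands in for whatever buffer type the real loader returns, and the `plain` case for a call with no voice arguments is an assumption.

```swift
/// Hypothetical request type mirroring the argument labels used in this post.
/// `[Float]` is a stand-in for the real audio buffer type.
struct CloneRequest {
    var text: String
    var instruct: String? = nil
    var refAudio: [Float]? = nil
    var promptText: String? = nil
    var promptAudio: [Float]? = nil
}

enum CloningMode: String {
    case plain = "plain TTS"            // assumption: no voice arguments at all
    case voiceDesign = "voice design"
    case referenceCloning = "reference cloning"
    case ultimateCloning = "ultimate cloning"
}

/// Infers the mode purely from which optional arguments are present.
func cloningMode(for request: CloneRequest) -> CloningMode {
    if request.refAudio != nil {
        // A transcript plus prompt audio upgrades reference cloning to ultimate cloning.
        if request.promptText != nil && request.promptAudio != nil {
            return .ultimateCloning
        }
        return .referenceCloning
    }
    if request.instruct != nil { return .voiceDesign }
    return .plain
}

let clip: [Float] = [0.0, 0.1, -0.1]  // placeholder samples
let designed = CloneRequest(text: "Welcome to the show.",
                            instruct: "A young woman, gentle and warm voice.")
let cloned = CloneRequest(text: "This is a cloned voice.", refAudio: clip)
let ultimate = CloneRequest(text: "Hello from the cloned voice.",
                            refAudio: clip,
                            promptText: "this is what the reference clip actually said",
                            promptAudio: clip)
for request in [designed, cloned, ultimate] {
    print(cloningMode(for: request).rawValue)
}
```

Logging the inferred mode this way can help catch a call that silently falls back to plain reference cloning because the transcript was left out.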

Figure: Three cloning modes, same model. Each mode arranges different pieces in the input sequence before the model. Voice design adds a written description, reference cloning adds an audio prefix, and ultimate cloning adds a paired audio-and-transcript example.

The same input slot, filled with different pieces. The model never sees a flag; it reads the sequence.
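To make the "same slot, different pieces" idea concrete, here is a toy sketch that lists the pieces placed in front of the model for each combination of inputs. The `Piece` labels and the `inputLayout` function are illustrative names, not part of the library, and the ordering follows the figure labels as they appear in this post, so treat it as an assumption rather than the model's actual tokenisation.

```swift
/// Labels for the pieces named in the figure; the names are illustrative.
enum Piece: String {
    case voiceDescription = "description"
    case referenceAudio = "reference audio"
    case transcript = "transcript"
    case textToSay = "text to say"
    case promptAudio = "prompt audio"
}

/// Returns the ordered pieces placed in front of the model.
/// The ordering is an assumption based on the figure labels.
func inputLayout(hasDescription: Bool,
                 hasReferenceAudio: Bool,
                 hasTranscript: Bool) -> [Piece] {
    var pieces: [Piece] = []
    if hasDescription { pieces.append(.voiceDescription) }
    if hasReferenceAudio { pieces.append(.referenceAudio) }
    // The transcript only makes sense alongside the clip it transcribes.
    let paired = hasReferenceAudio && hasTranscript
    if paired { pieces.append(.transcript) }
    pieces.append(.textToSay)
    if paired { pieces.append(.promptAudio) }
    return pieces
}

let ultimateLayout = inputLayout(hasDescription: false,
                                 hasReferenceAudio: true,
                                 hasTranscript: true)
print(ultimateLayout.map(\.rawValue).joined(separator: " | "))
// prints "reference audio | transcript | text to say | prompt audio"
```

No mode flag appears anywhere in the result: the layout is determined entirely by which pieces are present, which is the point the figure makes.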

Under the hood

How VoxCPM2 produces audio.

Four cooperating modules. You don't need to know any of this to use the model, but if you're curious where the 48 kHz comes from, here it is.

Figure: VoxCPM2 architecture. Text and optional voice prompts feed an autoregressive language model and a residual refiner. A local diffusion transformer produces audio latents which the AudioVAE V2 decodes to a 48 kHz waveform.

Inputs: text, voice prompt audio, voice instruction, prompt transcript.

LocEnc: audio + text fused into one stream.
TSLM (MiniCPM-4 backbone): 28-layer autoregressive LM; decides what audio patch comes next.
RALM: refines each patch for prosodic detail.
LocDiT · ...
