Comments (50)
- vessenes: This is cool. It makes me want an unsloth quant though! A 7B local model with tool calling would be genuinely useful, although I understand this is not that. UPDATE: I'd skip this for now - it does not allow any kind of interactive conversation, as I learned after downloading 5 GB of models; it's a proof of concept that takes a WAV file in.
- ilaksh: Does anyone have working code for fine-tuning PersonaPlex for outgoing calls? I have tried to take the fine-tuning LoRA stuff from Kyutai/moshi-finetune and apply it to the PersonaPlex code. Or, more accurately, various LLMs have worked on that. I have something that seems to work in a rough way, but only if I turn the LoRA scaling factor up to 5, and that generally screws it up in other ways. And then of course when GPT-5.3 Codex looked at it, it said that speaker A and speaker B were switched in the LoRA code. So that is now completely changed and I am going to do another dataset generation and training run. If anyone is curious, it's a bit of a mess, but it's on my GitHub under runvnc moshi-finetune and personaplex. It even has a Gradio app to generate data and train. But so far no usable results.
- armcat: I really like this, and have actually tried (unsuccessfully) to get PersonaPlex to run on my Blackwell device - I will try this on Mac now as well. There are a few caveats here for those of you venturing in, since I've spent considerable time looking at these voice agents. First is that a VAD->ASR->LLM->TTS pipeline can still feel real-time with sub-second RTT. For example, see my project https://github.com/acatovic/ova and also a few others here on HN (e.g. https://www.ntik.me/posts/voice-agent and https://github.com/Frikallo/parakeet.cpp). Another aspect, after talking to people about PersonaPlex, is that this full-duplex architecture is still a bit off in terms of giving you good accuracy/performance, and it's quite difficult to train. On the other hand, ASR->LLM->TTS gives you a composable pipeline where you can swap parts out and have a mixture of tiny and large LLMs, as well as local and API-based endpoints.
- 4dregress: This sounds quite dangerous: https://www.theguardian.com/technology/2026/mar/04/gemini-ch...
- dubeye: It doesn't feel like speech recognition has been improving at the same rate as other generative AI. It had a big jump up to about 6% WER a year or two ago, but it seems to have plateaued. Am I just using the wrong model? Or is human-level error rate, which I estimate to be about 5%, some kind of limit?
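For context on the 5-6% figures: WER is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)


# One substitution ("the" -> "a") over 6 reference words -> WER ~0.167
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

In practice libraries such as jiwer also apply text normalization (casing, punctuation) before scoring, which changes reported numbers considerably.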
- scosman: I'm a big fan of WhisperKit for this, and they just added TTS. Great because they support features like speaker diarization ("who spoke when") and custom dictionaries. Here's a load test where they run 4 models in real time on the same device: Qwen3-TTS (text to speech), Parakeet v2 (Nvidia speech-to-text model), Canary v2 (multilingual/translation STT), and Sortformer (speaker diarization). https://x.com/atiorh/status/2027135463371530695
- sgt: My problem with TTS is that I've been struggling to find models that support less common use cases like mixed bilingual Spanish/English and also non-ideal audio conditions. Still haven't found anything great, to be honest.
- jwr: As a heavy user of MacWhisper (for dictation), I'm looking forward to better speech-to-text models. MacWhisper with the Whisper Large v3 Turbo model works fine, but latency adds up quickly, especially if you use online LLMs for post-processing (and that really improves things a lot).
- michelsedgh: It's really cool, but for real-life use cases I think it lacks the ability to have a silent text stream output - for example for JSON and other stuff - so that as it's talking it can run commands for you. Right now it can only listen and talk back, which limits what you can make with this a lot.
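One plausible way to get the silent channel described above is to reserve markers in the model's token stream that route spans to a tool channel instead of the TTS. The `<tool>`/`</tool>` markers below are invented for illustration - PersonaPlex does not emit anything like them today:

```python
import json


def split_channels(tokens):
    """Route a token stream into a spoken channel and a silent tool channel.

    Spans wrapped in the (made-up) <tool>...</tool> markers are parsed as
    JSON commands and never forwarded to TTS; everything else is spoken.
    """
    spoken, tools, buf, in_tool = [], [], [], False
    for tok in tokens:
        if tok == "<tool>":
            in_tool = True
        elif tok == "</tool>":
            tools.append(json.loads("".join(buf)))  # silent JSON command
            buf, in_tool = [], False
        elif in_tool:
            buf.append(tok)
        else:
            spoken.append(tok)  # goes to the TTS / spoken reply
    return " ".join(spoken), tools


speech, cmds = split_channels(
    ["Sure,", "opening", "it", "now.",
     "<tool>", '{"cmd": "open", "path": "notes.txt"}', "</tool>"]
)
print(speech)  # Sure, opening it now.
print(cmds)    # [{'cmd': 'open', 'path': 'notes.txt'}]
```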
- WeaselsWin: This full-duplex spoken thing has already been in use for quite a long time by the big players in whatever "conversation mode" their apps offer, right? Those modes always seemed too fast to be going through the STT->LLM->TTS pipeline.
- nerdsniper: Do we have real-time (or close-enough) face-to-face models as well? I'd like to gracefully prove a point to my boss that some of our IAM procedures need to be updated.
- Serenacula: This is really cool. I think what I really wanna see, though, is a full multimodal text-and-speech model that can dynamically handle tasks like looking up facts or using text-based tools while maintaining the conversation with you.
- Tepix: It's cool tech and I will give it a try. I will probably make an 8-bit quant instead of the 4-bit, which should be easy with the provided script. That said, I found the example telling. Input: "Can you guarantee that the replacement part will be shipped tomorrow?" Response with prompt: "I can't promise a specific time, but we'll do our best to get it out tomorrow. It's one of the top priorities, so yes, we'll try to get it done as soon as possible and ship it first thing in the morning." It's not surprising that people have little interest in talking to AI if they're being lied to. PS: Is it just me or are we seeing AI-generated copy everywhere? I just hope the general talking style will not drift towards this style. I don't like it one bit.
- api: How close are we to the Star Trek universal translator?
- pothamk: What’s interesting about full-duplex speech systems isn’t just the model itself, but the pipeline latency. Even if each component is fast individually, the chain of audio capture → feature extraction → inference → decoding → synthesis can quickly add noticeable delay. Getting that entire loop under ~200–300ms is usually what makes the interaction start to feel conversational instead of “assistant-like”.
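A back-of-the-envelope latency budget makes the point concrete. Every per-stage number below is an illustrative assumption, not a measurement of any particular system; the stages simply sum, so one slow component blows the whole conversational window:

```python
# Hypothetical per-stage latencies (ms) for one round trip of a
# cascaded voice-agent loop. Numbers are illustrative assumptions.
stages = {
    "audio capture (frame buffering)": 20,
    "feature extraction": 10,
    "ASR inference": 80,
    "LLM time-to-first-token": 120,
    "TTS synthesis (first audio chunk)": 50,
}

total = sum(stages.values())
for name, ms in stages.items():
    print(f"{name:35} {ms:4d} ms")
print(f"{'total round trip':35} {total:4d} ms")  # 280 ms: just inside the window
```

Note that streaming matters here: what counts is time-to-first-audio (LLM first token, first TTS chunk), not time to complete the full reply.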
- nicktikhonov: From what I've seen, it's really easy to get PersonaPlex stuck in a death spiral - talking to itself, stuttering, and descending deeper and deeper into total nonsense. Useless for any production use case. But I think this kind of end-to-end model is needed to correctly model conversations. STT/TTS compresses a lot of information (tone, timing, emotion) out of the input data to the model, so it seems obvious that the results will always be somewhat robotic. Excited to see the next iteration of these models!
- khalic: Ugh, Qwen, I wish they'd use an open data model for this kind of project.