Microsoft VibeVoice: Open-Source Frontier Voice AI

tosh

Comments (139)

steinvakt2
This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow at inference. It's also bad at multilingual input. Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.
maxloh
I think we should stop calling this type of model open source. They are indeed "open weight." The training code is proprietary and was never released. https://github.com/microsoft/VibeVoice/issues/102
isodev
I think in this category, Voxtral by Mistral is a lot better. It also happens to be small enough to run on WebGPU: https://huggingface.co/spaces/mistralai/Voxtral-Realtime-Web...
triage8004
Surprised it wasn't called Copilot Voice
pluc
Interesting story about this repo/product/author by cybersecurity researcher Kevin Beaumont: https://cyberplace.social/@GossiTheDog/116454846703138243
aqme28
Interesting to see "vibe" enshrined by the likes of Microsoft as an AI product word.
yayadarsh
Someone tell me if this is better or worse than Parakeet
embedding-shape
Isn't this the project Microsoft published but then pulled soon after for security/safety reasons? What has changed since then?
xnx
Still waiting for the open-weights model that conclusively beats the multi-year-old Whisper in accuracy, features, and performance.
dragonfax
Shouldn't it be called something like "Copilot Voice"?
mberg
I've been using VibeVoice's ASR (speech to text) model quite intensively for the past month and have found it to be a lot more reliable and out-of-the-box functional than Whisper, Parakeet and other models. The fact that it has diarization built into the model is a huge win in my book. Without that, you have to run a separate model just for diarization, which adds significantly to the overall processing time, whereas VibeVoice gives you reliably great results in one pass. Big fan.
CubsFan1060
Great post last night from Simon: https://simonwillison.net/2026/Apr/27/vibevoice/
podgietaru
So we've really just settled on Vibe as the verb for AI then?
chaosprint
Microsoft Store App Vibing.exe Accused of Harvesting Screens, Audio, and Clipboard Data: https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...
ryukoposting
Holy moly, a Microsoft AI product that isn't named Copilot!
Anonyneko
You have selected Microsoft Sam as the computer's default voice.
nickandbro
This is a very good model, but can it be run on the web?
frangonf
I looked into local options for ASR and diarization some months ago, and I missed that VibeVoice now has this feature. My conclusion back then (which came only from shallow research on the topic and zero real experience, mind you) was that Whisper + Pyannote was the "stable" approach. Have the VibeVoice, Voxtral, Qwen or NeMo solutions caught up in segmentation and speaker recognition?
Mobius01
Microsoft has historically made poor choices in product naming, but this has to be a new low.
solomatov
It would have been better if they provided not just weights, but also some frontend where it is usable as is.
Void_
In the past month or so, I added 2 models to my app Whisper Memos (https://whispermemos.com): Cohere Transcribe (self hosted) and Grok Speech To Text (they provide an API, only $0.10/hr!). They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?
JumpCrisscross
What’s the current state of the art for learning my voice, both when training locally and when training in the cloud?
BlastBash192
Maybe Microsoft’s real strength was never making the best model, it was knowing you don’t need to, as long as you own the platform everyone builds on.
khimaros
looks like this offers ASR support in GGUF https://github.com/CrispStrobe/CrispASR -- haven't tested
mistic92
For me it's giving very poor results.
Zopieux
English only?
walthamstow
Seems quite heavy for an STT model. Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to additional accuracy and speaker diarisation? The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck
ChrisArchitect
Previously: Sept 2025 https://news.ycombinator.com/item?id=45114245
starkeeper
Microsoft is famous for choosing terrible names, but how could they be this terrible?
villgax
lol they rug-pulled the 7B for our own safety some months ago