For much of the past year, ChatGPT’s voice mode repeatedly transcribed the author’s English speech as Welsh, and OpenAI said Whisper, its speech recognition model, sometimes got confused. FT reporting says OpenAI has known about the issue for over a year. Users also reported the same unexpected language switching involving Malay and Icelandic, while developers linked errors to difficult speech conditions such as background noise, accents, overlapping speech, and unusual requests.
The article says voice AI remains hard to build for three reasons: high-quality recorded speech datasets are scarcer than text datasets, processing takes longer, and even a few milliseconds of reply delay can feel uncomfortable in conversation. The sector is still committing to voice as the next AI interface, and the article links this push to acquisitions and hires: Meta bought Play AI last summer, Google recently hired Hume’s founder, and Apple bought Israeli start-up Q.ai.
The article highlights how the risk escalates from inconvenience to safety-critical contexts: a language switch at home is annoying, but in systems like robotic surgery or autonomous driving it is alarming, and the article’s example is a car travelling at 70 miles per hour. Measured by word error rate, where 0% is perfect, Whisper is listed at 7.44%, down from over 8% a few months earlier, while Nvidia’s Canary-Qwen-2.5B leads at 5.63%. OpenAI says the Welsh issue should be fixed in its latest model update and attributes the problem to mislabelled data.
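For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition, not the procedure behind the figures quoted above: the function name `word_error_rate` and the sample sentences are hypothetical, and it splits on whitespace without the text normalisation (casing, punctuation) that published benchmarks typically apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no normalisation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a six-word reference.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"{word_error_rate(reference, hypothesis):.2%}")
```

A perfect transcript scores 0%. Because insertions count as errors, the rate can exceed 100% when the system outputs many more words than were spoken.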