Microsoft has developed VALL-E 2, a text-to-speech AI model that mimics human speech closely enough to reproduce an original speaker’s voice with high naturalness and precision. However, citing concerns about fraud and impersonation, the company has opted not to make the model available to the public.
VALL-E 2 represents a significant advance in text-to-speech synthesis. Microsoft’s evaluations indicate that it can match, and in certain cases even surpass, the quality of human speech. In experiments on the LibriSpeech and VCTK datasets, the company’s researchers found that VALL-E 2 outperforms previous zero-shot TTS systems in robustness, naturalness, and voice similarity. According to Microsoft, it is the first system to achieve human parity on these benchmarks.
While Microsoft has emphasized that VALL-E 2 is purely a research project with no current plans for public release, the company has outlined potential applications in education, journalism, self-authored content, accessibility features, voice response systems, translation, and chatbots.