With releases like Operator, Deep Research, Computer-Using Agents, and the Responses API with built-in tools, OpenAI has spent the past few months advancing the intelligence, capabilities, and usefulness of text-based agents, that is, systems that independently accomplish tasks on behalf of users. But for agents to be truly useful, people must be able to have deeper, more intuitive interactions with them that go beyond text, using natural spoken language to communicate effectively.
On Thursday, OpenAI released new audio models with improved accuracy and reliability through its application programming interface (API). The San Francisco-based AI company launched three new artificial intelligence (AI) models for text-to-speech (TTS) and speech-to-text transcription. According to the company, these models will let developers build apps with agentic workflows, and it added that businesses can use the API to automate tasks such as customer service. Notably, the new models are built on the company’s GPT-4o and GPT-4o mini AI models.
OpenAI is introducing new speech-to-text and text-to-speech audio models in the API today, which will enable developers to build more powerful, customizable, and intelligent voice agents that offer real value. Its latest speech-to-text models set a new state-of-the-art bar, outperforming existing solutions in accuracy and reliability, particularly in challenging scenarios involving accents, noisy environments, and varying speech speeds. These improvements increase transcription reliability, making the models especially well suited for use cases such as customer call centres, meeting note transcription, and more.
In a blog post, the AI firm outlined the new API-specific AI models. The company noted that in recent months it has shipped several agentic releases, such as Operator, Deep Research, Computer-Using Agents, and the Responses API with built-in tools. It added, though, that agents’ full potential won’t be realized until they can communicate and function intuitively in mediums other than text.
Three new audio models are available: GPT-4o-transcribe and GPT-4o-mini-transcribe for speech-to-text, and GPT-4o-mini-tts for text-to-speech, as the name implies. According to OpenAI, these models outperform the company’s existing Whisper models, which were introduced in 2022. Unlike those earlier models, however, the new ones are not open source.
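For illustration, here is a minimal sketch of how a developer might call one of the new speech-to-text models through OpenAI’s existing audio transcription endpoint in the official Python SDK; the input file name and surrounding setup are placeholders, not part of OpenAI’s announcement.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Transcribe a local recording with the new GPT-4o-transcribe model.
# "meeting.wav" is a hypothetical file used purely for illustration.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```

Swapping the model name to gpt-4o-mini-transcribe would use the smaller, cheaper variant instead.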
The AI company claimed that GPT-4o-transcribe shows improved word error rate (WER) performance on the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) benchmark, which evaluates AI models on multilingual speech across more than 100 languages. According to OpenAI, the gains come from targeted training techniques, including reinforcement learning (RL) and extensive midtraining with high-quality audio datasets.
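As a reminder of what the metric measures, WER is the number of word-level substitutions, deletions, and insertions needed to turn a model’s transcript into the reference transcript, divided by the number of reference words (lower is better). The rough, self-contained sketch below illustrates the calculation; it is not the FLEURS evaluation tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,          # deletion
                d[i][j - 1] + 1,          # insertion
                d[i - 1][j - 1] + cost,   # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("weather" -> "whether") out of five reference words -> WER = 0.2
print(word_error_rate("the weather is nice today", "the whether is nice today"))
```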
These speech-to-text models can capture audio accurately even in difficult conditions, including noisy environments, strong accents, and varying speaking speeds.
The GPT-4o-mini-tts model also includes significant enhancements, though it notably offers only artificial, preset voices. According to the AI company, the model can speak with customizable inflections, intonations, and emotional expressiveness, allowing developers to build applications for a variety of tasks, such as customer service and creative storytelling.
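A minimal sketch of how the new TTS model might be driven with a steering instruction through the Python SDK is shown below; the voice name, instruction text, and output file are illustrative assumptions rather than details from the announcement.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Generate spoken audio with a steerable delivery style.
# Voice, instructions, and file name are illustrative placeholders.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Thanks for calling! Your order is on its way and should arrive tomorrow.",
    instructions="Speak in a warm, upbeat customer-service tone.",
) as response:
    response.stream_to_file("reply.mp3")
```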
According to OpenAI’s API pricing page, the GPT-4o-based audio model costs $40 per million input tokens and $80 per million output tokens, while the GPT-4o-mini-based audio models cost $10 per million input tokens and $20 per million output tokens.
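To make the pricing concrete, the back-of-the-envelope calculation below applies the per-million-token rates quoted above to a hypothetical session; the token counts are made up for illustration.

```python
# Rates quoted above, in USD per million tokens.
GPT_4O_AUDIO = {"input": 40.00, "output": 80.00}
GPT_4O_MINI_AUDIO = {"input": 10.00, "output": 20.00}

def session_cost(rates: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost of a session given token counts and per-million-token rates."""
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# Hypothetical session: 50,000 input tokens and 20,000 output tokens.
print(f"GPT-4o audio:      ${session_cost(GPT_4O_AUDIO, 50_000, 20_000):.2f}")       # $3.60
print(f"GPT-4o mini audio: ${session_cost(GPT_4O_MINI_AUDIO, 50_000, 20_000):.2f}")  # $0.90
```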
All of the new audio models are now available to developers via the API. To help users build speech agents, OpenAI is also releasing an integration with its Agents software development kit (SDK).
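The Agents SDK integration itself is not shown here, but the sketch below strings the new models together by hand into a bare-bones voice loop (transcribe, respond, speak) using only the standard Python SDK; the file names, system prompt, and model choices are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Transcribe the caller's audio (hypothetical input file).
with open("caller_question.wav", "rb") as f:
    user_text = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe", file=f
    ).text

# 2. Decide how to respond with a text model.
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise customer-support agent."},
        {"role": "user", "content": user_text},
    ],
).choices[0].message.content

# 3. Speak the reply back with the new TTS model.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts", voice="alloy", input=reply
) as speech:
    speech.stream_to_file("agent_reply.mp3")
```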
OpenAI shared more about the technical innovations behind the models:
- Utilizing Real Audio Datasets for Pretraining
In order to maximize model performance, our new audio models are extensively pretrained on specialized, audio-centric datasets, building on the GPT‑4o and GPT‑4o-mini architectures. This focused approach gives the models a deeper understanding of speech nuances and enables outstanding performance across a variety of audio-related tasks.
- Sophisticated Techniques for Distillation
By improving our distillation techniques, we are able to transfer knowledge from our biggest audio models to smaller, more efficient ones. Using advanced self-play methods, our distillation datasets capture realistic conversational dynamics, effectively replicating genuine user-assistant interactions. This helps our smaller models deliver excellent conversational quality and responsiveness. (A generic sketch of the distillation idea appears after this list.)
- Reinforcement Learning Paradigm
We’ve adopted a reinforcement learning (RL)-heavy paradigm for our speech-to-text models, achieving state-of-the-art transcription accuracy. This approach significantly increases precision and reduces hallucination, making our speech-to-text solutions extremely competitive in difficult speech recognition settings.
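OpenAI has not published its distillation recipe; purely to illustrate the general idea behind the distillation bullet above, the sketch below shows a textbook knowledge-distillation loss in which a small student model is trained to match a large teacher’s softened output distribution.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Generic knowledge-distillation objective (not OpenAI's actual recipe):
    the student mimics the teacher's softened predictions via KL divergence."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```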
These advancements mark a step forward in the field of audio modelling, combining cutting-edge techniques with practical enhancements to improve performance in voice applications.