OpenAI GPT-Realtime-2: Voice AI Gets GPT-5 Reasoning


What OpenAI Shipped on May 7

On May 7, 2026, OpenAI released three new voice models through its API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. This is not a single model update — it is a deliberate split into purpose-built tools for different jobs. Understanding which model does what matters before you reach for any of them.

The original GPT-Realtime (now effectively v1.5) handled conversational voice in a single model. OpenAI has unbundled the functionality: high-intelligence dialogue goes to GPT-Realtime-2, live translation gets its own model, and streaming transcription is split out into GPT-Realtime-Whisper. Each trades off differently on cost, latency, and capability.

GPT-Realtime-2: The One That Can Actually Think

GPT-Realtime-2 brings GPT-5-class reasoning into a real-time audio pipeline for the first time. That sounds like marketing until you look at what changes operationally.

The context window expanded from 32,000 to 128,000 tokens. For a customer support agent handling a complex billing dispute across a 40-minute call, or a healthcare documentation workflow capturing a patient history, this is the difference between a model that forgets and one that holds the thread. The prior limit forced workarounds — summarization, context pruning — that introduced their own errors.

OpenAI introduced adjustable reasoning effort levels: normal, high, and xhigh. Developers can tune the latency-vs-intelligence tradeoff per use case. On OpenAI benchmarks, GPT-Realtime-2 at high effort scores 15.2% above GPT-Realtime-1.5 on Big Bench Audio, and 13.8% higher on Audio MultiChallenge for instruction following at xhigh effort.
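If the effort parameter works the way the announcement implies, selecting a tier should be a one-line config change per session. A minimal sketch in Python, assuming the existing Realtime API's session.update event; the reasoning_effort field name and its placement are guesses until the docs confirm them:

```python
# Assumption: effort is set via a session.update-style payload. The
# "reasoning_effort" field name is hypothetical; check OpenAI's docs.
EFFORT_PROFILES = {
    # effort tier -> where it plausibly fits, per the stated tradeoff
    "normal": "latency-sensitive small talk, simple lookups",
    "high":   "default tier for multi-step support dialogue",
    "xhigh":  "hard cases: billing disputes, adversarial callers",
}

def session_config(effort: str) -> dict:
    """Build a session.update-style payload selecting a reasoning tier."""
    if effort not in EFFORT_PROFILES:
        raise ValueError(f"unknown effort tier: {effort!r}")
    return {
        "type": "session.update",       # event name from the current Realtime API
        "session": {
            "model": "gpt-realtime-2",  # model id as named in the announcement
            "reasoning_effort": effort, # assumed field name
        },
    }
```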

The most striking external data point comes from Zillow, one of the launch partners. Zillow reports a 26-point lift in call-success rate on its hardest adversarial benchmark — from 69% on the prior model to 95% on GPT-Realtime-2. That is a material gain. Though the benchmark is internal and unpublished, a 26-point improvement in production-like test conditions is hard to dismiss.

GPT-Realtime-Translate and Whisper: Focused Tools

GPT-Realtime-Translate is a live translation model that keeps pace with the speaker, accepting 70+ input languages and producing output in 13. Deutsche Telekom is testing it for multilingual customer support. Vimeo is experimenting with using it to translate product education videos as they play.

The 70-input / 13-output asymmetry matters. Many languages can be understood; only a subset can be generated fluently. OpenAI has not published the full list of which 13 output languages are supported — relevant if your use case involves Thai or Arabic output rather than French or Spanish.
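Until that list is published, it is worth guarding output-language requests at configuration time rather than discovering the gap mid-call. A tiny sketch, with a placeholder set standing in for the real 13:

```python
# Placeholder only: the 13 supported output languages are unpublished.
# Replace with the real list once OpenAI documents it.
SUPPORTED_OUTPUT_LANGS: set[str] = {"en", "es", "fr"}

def check_output_lang(requested: str) -> None:
    """Fail fast at config time rather than mid-call."""
    if requested.lower() not in SUPPORTED_OUTPUT_LANGS:
        raise ValueError(
            f"{requested!r} is not a confirmed output language; "
            "consider transcription plus text translation as a fallback"
        )
```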

GPT-Realtime-Whisper is the streaming transcription model — Whisper speech-to-text delivered token-by-token as the speaker talks, rather than in batches after they stop. The use case is medical documentation, meeting notes, and real-time captioning where latency matters more than reasoning. It is not a dialogue model.
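Here is roughly what consuming that token-by-token stream could look like, assuming the new model plugs into the existing Realtime WebSocket endpoint and emits incremental transcription deltas the way the current API does. The model wiring in the URL and the event name are assumptions, and sending microphone audio upstream is omitted:

```python
import asyncio
import json
import os

import websockets  # pip install websockets

# Assumption: the new model is addressed through the existing Realtime
# WebSocket endpoint via the model query parameter.
URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper"

async def stream_transcript() -> None:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # "additional_headers" in websockets >= 13; older releases call it
    # "extra_headers".
    async with websockets.connect(URL, additional_headers=headers) as ws:
        async for raw in ws:
            event = json.loads(raw)
            # Assumed event name, modeled on the current Realtime API's
            # incremental transcription deltas.
            if event.get("type") == "conversation.item.input_audio_transcription.delta":
                print(event.get("delta", ""), end="", flush=True)

if __name__ == "__main__":
    asyncio.run(stream_transcript())
```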

The Cost and Latency Tradeoffs

GPT-Realtime-2 is priced at $32 per million audio-input tokens and $64 per million audio-output tokens. Cached input tokens drop to $0.40 per million. In practical terms, an uncached voice session runs $0.18–$0.46 per minute depending on reasoning effort. With prompt caching enabled and tool output trimmed, that falls to $0.05–$0.10 per minute.
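Those per-minute figures are easy to sanity-check against the per-token rates. A small calculator using only the published prices; the token counts in the example are hypothetical, since audio tokenization density was not disclosed:

```python
# Per-token rates from the announcement (USD per 1M audio tokens).
INPUT_RATE = 32.00
OUTPUT_RATE = 64.00
CACHED_INPUT_RATE = 0.40

def session_cost(input_toks: int, output_toks: int, cached_frac: float = 0.0) -> float:
    """Inference cost in USD for one voice session."""
    cached = input_toks * cached_frac
    fresh = input_toks - cached
    return (fresh * INPUT_RATE
            + cached * CACHED_INPUT_RATE
            + output_toks * OUTPUT_RATE) / 1_000_000

# Hypothetical 8-minute call: 40K input tokens, 10K output tokens.
print(f"uncached:      ${session_cost(40_000, 10_000):.2f}")
print(f"90% cache hit: ${session_cost(40_000, 10_000, cached_frac=0.9):.2f}")
```

Under those assumed token counts, the call comes out to $1.92 uncached (about $0.24/min) and roughly $0.78 at a 90% cache hit rate (just under $0.10/min), both inside the quoted ranges.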

That per-minute cost compounds quickly. A contact center running 100,000 calls per month at 8 minutes average handle time, even at the low-end $0.10/min with caching, adds up to $80,000 monthly in inference costs alone — before infrastructure, tooling, or human escalation. The Zillow lift numbers only justify the expense if they translate to fewer escalations or measurable conversion improvements.

The reasoning effort dial is where developers will spend time calibrating. Normal effort gives faster responses at lower cost; xhigh delivers maximum accuracy but adds latency that may break the conversational feel. OpenAI has not published exact latency numbers for each tier, which makes evaluation difficult without running your own benchmarks.
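In the absence of published numbers, measuring time-to-first-audio per tier yourself is cheap. A generic harness sketch; run_turn is a stand-in for whatever call your stack makes, and it should resolve when the first audio delta arrives:

```python
import time
from statistics import median
from typing import Awaitable, Callable

async def time_to_first_token(
    run_turn: Callable[[str], Awaitable[None]],  # hypothetical: your API call
    effort: str,
    trials: int = 20,
) -> float:
    """Median seconds from request start to first audio token for one tier."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        await run_turn(effort)  # resolves on the first audio delta
        samples.append(time.perf_counter() - start)
    return median(samples)
```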

Who Is Actually Using It and What They Are Seeing

OpenAI named a specific set of launch partners: Zillow, Glean, Genspark, Bluejay, Intercom, Priceline, Foundation Health, BolnaAI, Vimeo, and Deutsche Telekom. The mix is telling — real estate, search, healthcare, travel, and telecoms — all industries where a reasoning-capable voice model changes what is possible at the interaction layer.

Intercom is worth watching. Its Fin AI agent is already deployed across thousands of enterprise support desks. Adding GPT-Realtime-2 into that stack suggests voice-first support routing — not just text-based chat — is arriving as a product feature, not a demo.

Foundation Health points at a different use case: clinical documentation. A model that holds 128K tokens of context and reasons across a complex patient intake conversation, then generates structured notes, addresses a real gap in EHR workflows that transcription-only tools cannot fill. The regulatory questions around AI-generated clinical documentation remain open, but the technical capability is now present.

This connects to a broader pattern: moving AI agents from pilots to production in 2026 increasingly requires voice capability, particularly in customer-facing roles where text-only interfaces create friction.

What Changes for Voice-First Product Teams

GPT-Realtime-2 removes one of the primary objections to deploying conversational voice AI in production: the model was not smart enough to handle hard cases without escalation. The Zillow numbers suggest that bar has moved. The remaining objections are cost at scale, regulatory compliance in healthcare and finance, and latency on higher reasoning tiers.

The three-model split also signals an architectural shift. Instead of a single general-purpose voice model, OpenAI is offering a portfolio — which means developers need to route intelligently across models, not just invoke one. A pipeline that routes simple queries to Whisper-based transcription, sends standard dialogue to GPT-Realtime-2 at normal effort, and escalates to xhigh only for complex cases could meaningfully reduce cost while maintaining quality. That orchestration adds engineering complexity but is likely necessary for cost-effective deployment at scale.
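A toy version of that routing policy, with model ids taken from the announcement and everything else, especially the turn classifier that produces turn_kind, left as an exercise:

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    effort: str | None = None  # transcription has no reasoning tier

def route(turn_kind: str) -> Route:
    """Toy routing policy; classifying turn_kind is the hard part and
    is out of scope here."""
    if turn_kind == "dictation":
        return Route("gpt-realtime-whisper")
    if turn_kind == "standard_dialogue":
        return Route("gpt-realtime-2", effort="normal")
    if turn_kind == "hard_case":  # e.g. billing dispute, escalation risk
        return Route("gpt-realtime-2", effort="xhigh")
    return Route("gpt-realtime-2", effort="high")  # conservative default
```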

For teams evaluating voice-first features, the question is no longer capability — it is whether the cost-per-interaction works given your deflection or conversion economics. That math is now easier to run, and for high-value interactions, it increasingly does.
