Everything you need to know about GPT-4o, including pricing, features, and how to get access
OpenAI CEO Sam Altman said GPT-4o “feels like AI from the movies” - here’s everything you need to know about the new model
OpenAI has introduced GPT-4o, its latest flagship large language model, featuring improvements to how it deals with text, voice, and video – and a lot more charm.
The ‘o’ in GPT-4o stands for ‘omni’ and refers to its multimodal capabilities, meaning it has the ability to accept combinations of text, audio, and images as input, and then to generate text, audio, and images as outputs in response.
This, OpenAI said, is a step towards much more natural human-computer interaction.
OpenAI showcased the new model’s abilities in a series of videos that feature the AI assistant creating and then singing a song, cooing over a cute dog it is shown, joking and flirting with people, and even being sarcastic towards users as they chat.
OpenAI CEO Sam Altman described the new voice and video mode as “the best computer interface I’ve ever used”.
“It feels like AI from the movies; and it’s still a bit surprising to me that it’s real. Getting to human-level response times and expressiveness turns out to be a big change,” he said.
“The original ChatGPT showed a hint of what was possible with language interfaces; this new thing feels viscerally different.
“Talking to a computer has never felt really natural for me; now it does.”
What can GPT-4o actually do?
The new model can respond to spoken questions in an average of 320 milliseconds, similar to human response times. OpenAI said the new model matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages.
GPT-4o is also notably better at visual and audio understanding than existing models, it said.
“Today, GPT-4o is much better than any existing model at understanding and discussing the images you share. For example, you can now take a picture of a menu in a different language and talk to GPT-4o to translate it, learn about the food's history and significance, and get recommendations,” OpenAI said.
In the future, improvements will allow for more natural, real-time voice conversation and the ability to converse with ChatGPT via real-time video, the firm promised.
Before the arrival of GPT-4o, you could already use ‘Voice Mode’ to talk to ChatGPT, but it was a slow process with an average latency – or waiting time – of 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4.
That’s because Voice Mode strings together three separate models: one basic model transcribes audio to text, GPT-3.5 or GPT-4 does the actual work of generating a text response, and then a third simple model converts that text back to audio.
But that also means GPT-4 loses a lot of information in the process – such as tone, whether there are multiple speakers, or background noise – and it can’t output laughter, singing, or express emotion, OpenAI explained.
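To make the contrast concrete, that kind of cascaded pipeline could be sketched with the OpenAI Python SDK roughly as follows – the model names, file names, and voice choice here are illustrative assumptions, not OpenAI’s actual Voice Mode internals:

```python
# Rough sketch of the cascaded "Voice Mode" approach GPT-4o replaces:
# speech -> text -> LLM -> text -> speech. Each hop adds latency and
# strips information such as tone, multiple speakers, and background noise.
# Model names, file names, and the voice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Transcribe the user's spoken question to plain text
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("question.wav", "rb"),
)

# 2) Generate the actual answer with a text-only chat model
answer = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3) Convert the text answer back into audio
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=answer.choices[0].message.content,
)
with open("answer.mp3", "wb") as f:
    f.write(speech.content)
```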
“With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network,” the company explained.
“Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.”
How do I get access to GPT-4o?
OpenAI said it is beginning to roll out GPT-4o to ChatGPT Plus and Team users, with availability for enterprise users coming soon.
The firm is making GPT-4o available in the free tier, and to Plus users with 5x higher message limits. OpenAI said it will roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus in the coming weeks.
There will be a limit on the number of messages that free users can send with GPT-4o depending on usage and demand. When the limit is reached, ChatGPT will automatically switch to GPT-3.5 so users can continue their conversations.
Developers can also now access GPT-4o in the API as a text and vision model. GPT-4o is twice as fast, half the price, and has five times higher rate limits compared with GPT-4 Turbo.
“We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks,” OpenAI said.
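As a rough illustration of the text-and-vision access described above, a minimal request to GPT-4o through the chat completions API might look something like this – the prompt and image URL are placeholder assumptions:

```python
# Minimal sketch: sending GPT-4o a text prompt plus an image via the
# chat completions API. The prompt and URL are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Translate this menu and suggest a dish."},
                {"type": "image_url", "image_url": {"url": "https://example.com/menu.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```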
What does all of this mean?
OpenAI also showed the AI assistant being used in a customer service scenario.
One version of the AI stood in for a customer who had a broken phone, while another iteration was playing the customer service agent, helping to get the handset sent back.
While it was strange to listen to the two bots chatting away to each other to get the job done – a conversation between two pieces of software peppered with phrases like ‘Got it’, ‘Cool’, and ‘Great, thanks’ – it’s easy to see how tools like this could be used to automate large parts of customer service at a rapid pace.
The race between OpenAI and other big players like Google is only going to become more fierce as the broader potential benefits of generative AI become clearer.
What else did OpenAI announce?
For free and paid users, OpenAI has also launched a new ChatGPT desktop app for macOS. This allows users to ask ChatGPT questions by using a simple keyboard shortcut (Option + Space).
Users can also take and discuss screenshots directly in the app. You can currently have voice conversations with ChatGPT using Voice Mode; OpenAI said that GPT-4o’s new audio and video capabilities are coming in the future.
OpenAI also said it was introducing a new look and feel for ChatGPT with a new home screen and message layout that's designed to be friendlier and more conversational.
Steve Ranger is an award-winning reporter and editor who writes about technology and business. Previously he was the editorial director at ZDNET and the editor of silicon.com.