[Image: OpenAI GPT-4o label on a colored background.]

GPT-4o – The New GPT on the Block

OpenAI released GPT-4o, their latest GPT model, yesterday. The “o” stands for omni, and hints at the model’s ability to process audio, image, video and text seamlessly. The demo was performed live, with the model responding in audio and with a rich personality. Among the showcased talents were voice acting, singing, real-time translation and recognising handwritten notes.

One of the focal points was when it solved a linear algebra problem in real time, as it was being written down on paper. This contrasts with the Google Gemini demo of recognising a hand-drawn duck, which was later controversially revealed to have been staged.

AI Companion

The main strength of the model appears to lie in its holistic ability to combine video, audio, images and text, seamlessly switching between any of these as a source of input. This is similar to what the Humane AI Pin tried to achieve, which Marques Brownlee famously roasted as “bad at almost everything”:

However, GPT-4o avoids many of the pitfalls of the Humane AI Pin. It is packaged as a convenient mobile and desktop app, is accurate and has quick response times, which could make it a viable real-life assistant for people in their day-to-day activities.

Performance

The new model is capable of conversing with response times matching human levels; OpenAI reports an average audio latency of around 320 milliseconds, similar to human conversational timing. It matches GPT-4 performance for English text and code, improves on it for non-English text, is faster, and its API is 50% cheaper.

On the 0-shot CoT MMLU benchmark for general-knowledge text questions, GPT-4o comes out ahead of the Claude (Anthropic), Gemini (Google) and Llama (Meta) models, with a record-breaking 88.7% accuracy.

It also sets new high scores for audio translation, standardised exam answering and visual recognition. It does lag behind Whisper-v3 for audio recognition, but since that is an OpenAI model too, it may yet be integrated with the API.
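Until audio input reaches the GPT-4o API, one plausible stopgap is to pair OpenAI’s hosted Whisper endpoint with a GPT-4o text call. Below is a minimal sketch using the official OpenAI Python SDK; `meeting.mp3` is a placeholder file, and `whisper-1` is the identifier of the hosted Whisper model:

```python
# Sketch: transcribe audio with the Whisper endpoint, then hand the text to GPT-4o.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

# Transcribe a local audio file ("meeting.mp3" is a placeholder).
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# Feed the transcript to GPT-4o as plain text.
summary = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Summarise this transcript:\n{transcript.text}"}],
)

print(summary.choices[0].message.content)
```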

Availability

The new video and voice capabilities of the model have been recognised as a potential security risk, so for now they will only be rolled out to select red teams for vulnerability analysis.

Text and image capabilities are being made available as part of the free and paid tiers, with plans to make voice available to ChatGPT Plus subscribers in the coming weeks.

For developers, text and vision are currently available via the API, boasting a 5x higher rate limit, 2x the speed and half the price of GPT-4. The video and audio capabilities will be rolled out more gradually, starting with a small group of trusted users.
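As an illustration of how developers might call the new model, here is a minimal sketch of a combined text-and-vision request using the official OpenAI Python SDK; the image URL is a placeholder, and the message format follows OpenAI’s chat completions documentation at the time of writing:

```python
# Minimal sketch: a text + vision request to GPT-4o via the OpenAI Python SDK.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Free-form text and an image can be mixed in a single message.
                {"type": "text", "text": "What is written on this handwritten note?"},
                # Placeholder URL; a base64 data URL also works here.
                {"type": "image_url", "image_url": {"url": "https://example.com/note.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```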