Unveiling GPT-4o: The Next Generation in AI
May 15, 2024

Yesterday marked a significant milestone in the world of artificial intelligence with the release of GPT-4o, the latest iteration of the powerful GPT-4 model. This enhanced version is 2x faster and 50% cheaper in the API than GPT-4 Turbo, and it brings significant advancements in AI capabilities.
In this blog post, we’ll dive into the new features and improvements that GPT-4o brings to the table. We’ll also provide some practical code snippets to help you get started with integrating GPT-4o into your projects.
So, let’s explore the advancements of GPT-4o and see how you can leverage its capabilities to create innovative and impactful applications.
Some Hard Facts: GPT-4o
GPT-4o offers significant advancements over GPT-4.
Before GPT-4o, using Voice Mode to talk to ChatGPT involved latencies of 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4. This was achieved through a pipeline of three separate models: one transcribing audio to text, GPT-3.5 or GPT-4 processing the text, and another converting the text back to audio.
This setup meant the main model, GPT-4, missed out on important context such as tone, multiple speakers, and background noise, and it could not produce expressive output such as laughter or singing.
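For illustration, here is a minimal sketch of what that three-model pipeline looks like with the OpenAI Python SDK (the file names are placeholders; it chains the same transcription, chat, and speech calls covered later in this post):

from openai import OpenAI

client = OpenAI()

# Step 1: transcribe the spoken question to plain text
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("question.mp3", "rb"),
)

# Step 2: the chat model only ever sees the text -- tone, speakers,
# and background sounds have already been discarded at this point
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcription.text}],
)

# Step 3: synthesize the text answer back into audio
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=completion.choices[0].message.content,
)
speech.stream_to_file("answer.mp3")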
With GPT-4o, a single model processes text, vision, and audio end-to-end, allowing all inputs and outputs to be handled by the same neural network. As this is the first model to combine these modalities, we're just beginning to explore its full potential and limitations.
Here are some highlights:
- Capabilities: GPT-4o accepts text, audio, image, and video inputs, and can generate text, audio, and image outputs.
- Speed: Responds to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, comparable to human response times in conversations.
- Improved Understanding: Matches GPT-4 Turbo's performance in text (English and code), and significantly improves on text in non-English languages. It also excels in vision and audio understanding.
- Pricing: GPT-4o is 50% cheaper in the API than GPT-4 Turbo, while also being 2x faster.
Improvements Over Previous Models
- Multimodal Abilities: Unlike previous models, GPT-4o can process text, audio, image, and video inputs and generate text, audio, and image outputs.
- Enhanced Speed and Cost Efficiency: It is significantly faster and more cost-effective than GPT-4 Turbo, making it accessible for a wider range of applications.
- Advanced Multilingual Support: GPT-4o sets new benchmarks for text performance in non-English languages, surpassing previous models.
- Integrated Model: GPT-4o combines all modalities into a single model, enhancing its ability to understand and generate complex, multimodal content.
How to use the GPT-4o Chat Completions API
Let's start with the easiest use case: text completion. Chat models take a list of messages as input and return a response. This format is great for both ongoing conversations and single replies.
Here's a simple example of a Chat Completions API call:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the sense of life?"},
    ],
)
print(response.choices[0].message.content)
This results in: "The sense of life is often considered to be about finding happiness, purpose, and meaning in your experiences and relationships."
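Because the API is stateless, you keep a conversation going by appending each reply to the message list and sending the whole history back. Here is a minimal sketch of that pattern (the follow-up question is made up for illustration):

from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the sense of life?"},
]
first = client.chat.completions.create(model="gpt-4o", messages=messages)

# Feed the assistant's answer back in, together with the next user turn
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Can you summarize that in one sentence?"})

second = client.chat.completions.create(model="gpt-4o", messages=messages)
print(second.choices[0].message.content)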
GPT-4o Vision - Understanding Images
Images can be provided to the model in two main ways: by passing a link to the image or by including the base64 encoded image directly in the request. Images can be included in user, system, and assistant messages.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
This results in: "A wooden boardwalk extends through a lush green meadow under a bright blue sky with scattered clouds."
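If your image is a local file rather than a public URL, you can base64-encode it and pass it as a data URL instead (the file path below is a placeholder):

import base64

from openai import OpenAI

client = OpenAI()

# Read a local image and encode it as base64
with open("path/to/image.jpg", "rb") as image_file:
    b64_image = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)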
The model excels at answering general questions about what is present in images. While it understands the relationship between objects, it is not yet optimized for answering detailed questions about the exact location of specific objects.
Text to Speech - Turn Text into Spoken Audio
The speech API takes as input the text you want to convert to audio, the TTS model to use, and the voice the audio should be generated in. It supports several output formats, with MP3 as the default.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

speech_file_path = Path(__file__).parent / "speech.mp3"
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Today is a wonderful day to build something people love!",
)
response.stream_to_file(speech_file_path)
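The endpoint also supports several voices and output formats beyond the defaults used above. For example, to request Opus audio with a different voice:

from pathlib import Path
from openai import OpenAI

client = OpenAI()

response = client.audio.speech.create(
    model="tts-1",
    voice="nova",              # other options include echo, fable, onyx, shimmer
    response_format="opus",    # mp3 is the default; aac and flac also work
    input="Today is a wonderful day to build something people love!",
)
response.stream_to_file(Path(__file__).parent / "speech.opus")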
How to Create Transcriptions
The transcriptions API takes as input the audio file you want to transcribe and, optionally, the desired output format for the transcription. It supports multiple input and output file formats.
from openai import OpenAI

client = OpenAI()

audio_file = open("/path/to/file/audio.mp3", "rb")
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
)
print(transcription.text)
This code imports the OpenAI library, initializes a client, opens an audio file, and sends it to OpenAI's Whisper transcription model, which converts the audio into text. Finally, it prints the transcribed text.
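To pick one of the other output formats mentioned above, pass the response_format parameter; for non-JSON formats the SDK returns a plain string instead of an object. For example, to get subtitles:

from openai import OpenAI

client = OpenAI()

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("/path/to/file/audio.mp3", "rb"),
    response_format="srt",  # also: json (default), text, verbose_json, vtt
)
print(transcription)  # an SRT-formatted string rather than an object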
Conclusion
And there you have it! We've covered some exciting new features of GPT-4o, from handling text completions to understanding images and converting text to speech. This latest release opens up a world of possibilities for developers, researchers, and AI enthusiasts looking to create innovative applications.
Whether you're building chatbots that can hold natural conversations, generating audio content on the fly, or exploring new ways to understand visual content, GPT-4o has you covered. We hope these examples inspire you to dive in and start experimenting with this powerful new tool.
Got any questions or cool projects you’re working on with GPT-4o? Share them in the comments or reach out on social media. Let’s learn and build together!