How to Calculate Audio Token Costs in Multimodal APIs
With the release of models like GPT-4o, audio processing has fundamentally changed. Previously, developers had to run a separate speech-to-text API (such as Whisper) and send the resulting transcript to an LLM. Now, models can process audio natively.
Audio Tokenization
Unlike text, which is tokenized into subword units, or video, which is tokenized per frame, native audio tokenization is typically based on duration.
- OpenAI (GPT-4o): Audio input is significantly more expensive than text input. While exact ratios fluctuate, native audio input often costs roughly as much as processing a low-resolution image every few seconds.
- Google (Gemini 1.5): Google simplifies this by billing audio by duration, measured in seconds or hours, and translating that duration into a fixed per-second token footprint.
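Duration-based billing makes cost estimation straightforward arithmetic. A minimal sketch, assuming a fixed rate of 32 tokens per second of audio (the figure Google has documented for Gemini 1.5; check the current rate card before relying on it):

```python
# Assumed duration-to-token rate; verify against current provider docs.
TOKENS_PER_SECOND = 32

def estimate_audio_tokens(duration_seconds: float) -> int:
    """Translate a clip's duration into an approximate token count."""
    return int(duration_seconds * TOKENS_PER_SECOND)

def estimate_audio_cost(duration_seconds: float, price_per_million_tokens: float) -> float:
    """Convert the token estimate into dollars at a given per-1M-token price."""
    return estimate_audio_tokens(duration_seconds) / 1_000_000 * price_per_million_tokens
```

At that rate, a one-hour recording is 3,600 × 32 = 115,200 tokens, so a hypothetical price of $1 per million tokens would cost about $0.12 per hour of audio.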
Cost-Saving Strategy: Whisper vs. Native
If your application only needs a transcript (e.g., summarizing a podcast), it is almost always cheaper to run a local, open-source Whisper model (or an inexpensive Whisper API endpoint) to produce text, and then send that text to the LLM.
Native audio should be reserved for use cases where the tone, emotion, or timing of the voice is critical to the prompt.
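The gap between the two pipelines can be sketched with back-of-the-envelope arithmetic. Every number below is a hypothetical placeholder, not a real rate card; the point is the structure of the comparison, not the specific prices:

```python
# All constants are illustrative assumptions, not actual provider pricing.
NATIVE_AUDIO_PER_M = 40.0   # assumed $/1M native audio input tokens
TEXT_PER_M = 2.5            # assumed $/1M text input tokens
AUDIO_TOKENS_PER_SEC = 32   # assumed duration-to-token rate
WORDS_PER_MIN = 150         # typical conversational speech rate
TOKENS_PER_WORD = 1.3       # rough English text-tokenization ratio

def native_cost(seconds: float) -> float:
    """Cost of sending the raw audio to the model directly."""
    return seconds * AUDIO_TOKENS_PER_SEC / 1e6 * NATIVE_AUDIO_PER_M

def transcribe_then_text_cost(seconds: float, stt_cost: float = 0.0) -> float:
    """Cost of transcribing first (stt_cost is zero for a local Whisper run),
    then sending only the transcript text to the model."""
    estimated_words = seconds / 60 * WORDS_PER_MIN
    return stt_cost + estimated_words * TOKENS_PER_WORD / 1e6 * TEXT_PER_M
```

Under these assumptions, an hour of audio costs dollars natively but only cents as a transcript, because an hour of speech compresses to roughly 10,000 text tokens versus over 100,000 audio tokens.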
Compare the costs of text versus rich media using our Pricing Tool.