How Video FPS Affects AI API Token Costs

Processing video through multimodal LLMs such as Gemini 1.5 or GPT-4o is fundamentally different from processing text. These models don't "watch" video as a continuous stream; they analyze image frames sampled from it, in some cases alongside the audio track.

The 1 FPS Standard

Multimodal APIs that accept video natively, such as Gemini, typically sample it at 1 frame per second (1 FPS) by default. If you submit a 60-second video recorded at 30 FPS, the model doesn't process all 1,800 frames. It sees about 60.

In models like Gemini 1.5 Pro, video is billed at approximately 263 tokens per second of video, so a 60-second clip comes to roughly 15,780 input tokens regardless of its source framerate.
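
The arithmetic above can be sketched in a few lines of Python. This is a rough estimator, not a billing tool: the 263 tokens-per-second figure and the 1 FPS sampling rate come from the text above, while the function names and the `usd_per_million_tokens` parameter are hypothetical and the price must come from your provider's current price list.

```python
import math

TOKENS_PER_SECOND = 263  # approximate Gemini 1.5 Pro figure quoted above
SAMPLE_FPS = 1           # default sampling rate described above


def sampled_frames(duration_s: float, sample_fps: float = SAMPLE_FPS) -> int:
    """Frames the model looks at; the source framerate plays no part."""
    return math.ceil(duration_s * sample_fps)


def estimate_video_tokens(duration_s: float,
                          tokens_per_second: float = TOKENS_PER_SECOND) -> int:
    """Approximate input tokens for a clip of the given length in seconds."""
    return math.ceil(duration_s * tokens_per_second)


def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars. The price is an input, not a known value."""
    return tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # A 60-second clip, whether shot at 30 FPS or 60 FPS.
    print(sampled_frames(60), estimate_video_tokens(60))  # 60 15780
```

Because duration is the only video property in the formula, shortening the clip is what moves the estimate; changing the source framerate does not.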

Strategies for Cost Reduction

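Pre-processing: Never send 4K 60 FPS video directly to an API. Downscale the video to 720p and reduce the framerate to 1 FPS before making the API call. Because the API samples at roughly 1 FPS anyway, this mainly cuts upload size and transfer time; the token count itself follows the clip's duration, so also trim any footage you don't need.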
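Keyframe Extraction: Instead of using native video endpoints, use a local tool such as FFmpeg to extract only the frames where the scene changes and send them as a batch of images. Codec keyframes (I-frames) are not the same thing as scene changes, although encoders often place one at a cut.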
Audio Separation: If the visual content isn't changing (a talking head or a static slide, for example), transcribe the audio locally using Whisper and send only the transcript text to the expensive LLM.
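
The first two strategies can be sketched as FFmpeg command lines built from Python. This is a minimal sketch, assuming `ffmpeg` is installed and on your PATH; the function names, file names, and the 0.4 scene threshold are illustrative choices rather than recommended values. The `scale`, `fps`, and `select` filters are standard FFmpeg filters. Recent FFmpeg releases prefer `-fps_mode vfr` over the older `-vsync vfr` used here.

```python
import shlex
import subprocess


def downscale_cmd(src: str, dst: str, height: int = 720, fps: int = 1) -> list[str]:
    """Build a command that downscales to `height` pixels and resamples to `fps`."""
    # scale=-2:720 keeps the aspect ratio and forces an even width.
    return ["ffmpeg", "-y", "-i", src,
            "-vf", f"scale=-2:{height},fps={fps}", dst]


def scene_frames_cmd(src: str, pattern: str, threshold: float = 0.4) -> list[str]:
    """Build a command that writes one image per detected scene change."""
    # The select filter keeps frames whose scene-change score exceeds the
    # threshold; vfr output stops FFmpeg from duplicating frames to fill gaps.
    return ["ffmpeg", "-y", "-i", src,
            "-vf", f"select='gt(scene,{threshold})'",
            "-vsync", "vfr", pattern]


def run(cmd: list[str]) -> None:
    """Execute a command and raise if FFmpeg exits with an error."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands for inspection; call run(...) to execute them.
    print(shlex.join(downscale_cmd("input.mp4", "small.mp4")))
    print(shlex.join(scene_frames_cmd("input.mp4", "scene_%04d.jpg")))
```

A lower threshold yields more frames (and more image tokens); a higher one yields fewer but may miss gradual transitions, so it is worth checking the extracted frames on a sample clip before batching them to an API.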

Analyze your local video files without uploading them using our Multimodal Pricing Calculator.