Create transcription
Transcribes audio into text using a speech-to-text model. The audio file is sent as multipart/form-data.
POST
Transcribe audio into text using OpenAI’s speech-to-text models through Requesty’s routing.
Base URL
Authentication
Include your Requesty API key in the request headers.
Example Request
The endpoint accepts multipart/form-data. Send the audio as the file field and the model identifier as the model field.
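As a rough illustration of the multipart layout described above, the sketch below builds (but does not send) such a request with only the Python standard library. The base URL placeholder, the /audio/transcriptions path, and the Bearer authorization scheme are assumptions borrowed from the OpenAI convention, not values confirmed on this page; substitute the ones given under Base URL and Authentication. The helper names are hypothetical.

```python
import urllib.request
import uuid

# Assumption: placeholder only; use the value from the "Base URL" section.
BASE_URL = "https://requesty-base-url.invalid/v1"


def encode_multipart(fields, filename, audio):
    """Encode text fields plus one `file` part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    body = b""
    for name, value in fields.items():
        body += (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode()
    body += (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    body += audio + b"\r\n" + f"--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"


def build_transcription_request(api_key, filename, audio, model, language=None):
    """Build (without sending) a POST request carrying `file` and `model`."""
    fields = {"model": model}
    if language:
        fields["language"] = language  # optional ISO 639-1 hint
    body, content_type = encode_multipart(fields, filename, audio)
    return urllib.request.Request(
        BASE_URL + "/audio/transcriptions",  # assumed OpenAI-style path
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed header scheme
            "Content-Type": content_type,
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen() would send it; the response body is JSON and can be parsed with json.loads().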
OpenAI SDK
The endpoint is fully compatible with the OpenAI SDK. Just point the client at Requesty’s base URL.
Supported Models
Browse the full catalog on the Transcription model library. Today the available transcription models are all from OpenAI:
| Model | Best for | Billing |
|---|---|---|
| openai/gpt-4o-transcribe | Highest accuracy, multilingual | Token-based |
| openai/gpt-4o-mini-transcribe | Fast and cost-efficient | Token-based |
| openai/whisper-1 | Drop-in replacement for legacy Whisper | Duration-based (per second of audio) |
Date-pinned snapshots (for example openai/gpt-4o-mini-transcribe-2025-12-15) are also available when you need a stable model version.
Supported Audio Formats
The file field accepts the following formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.
The maximum upload size per request is 32 MB. For longer recordings, split the audio into chunks and concatenate the resulting transcripts on your side.
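A pre-flight check can catch format and size problems before any bytes are uploaded. The sketch below is one way to do that; the function names are hypothetical, and reading "32 MB" as 32 × 1024 × 1024 bytes is an assumption, since the page does not say which definition of megabyte applies.

```python
import os

SUPPORTED_FORMATS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
# Assumption: "32 MB" is read as 32 * 1024 * 1024 bytes; the exact cutoff may differ.
MAX_UPLOAD_BYTES = 32 * 1024 * 1024


def check_upload(filename, size_bytes):
    """Return a list of problems that would make the upload fail; empty if none."""
    problems = []
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(
            f"file is {size_bytes} bytes, over the {MAX_UPLOAD_BYTES}-byte limit"
        )
    return problems


def join_transcripts(chunk_texts):
    """Concatenate per-chunk transcripts in order, skipping empty chunks."""
    return " ".join(t.strip() for t in chunk_texts if t.strip())
```

The check looks only at the file extension, not the actual container, so a mislabeled file would still be rejected by the server. Splitting the audio itself is left to an external tool.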
Language Hint
Set language to the ISO 639-1 code of the spoken language to improve accuracy and latency. When omitted, the model auto-detects the language.
Response Format
Set response_format to control the output shape. Supported values:
| Value | Description |
|---|---|
| json | (default) JSON object with text and usage |
| verbose_json | JSON with full timestamp data (words, segments, language, duration) |
Default JSON (json)
When response_format is omitted or set to json, the response is a JSON object with text and a usage block. The usage block takes one of two forms: token usage (gpt-4o-transcribe, gpt-4o-mini-transcribe) or duration usage (whisper-1). Use the type discriminator to decide how to render or aggregate usage on your side.
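Branching on the discriminator might look like the sketch below. The discriminator values ("tokens", "duration") and the field names (input_tokens, output_tokens, seconds) are assumptions based on the OpenAI response shape; this page does not show them, so check them against a real response before relying on them. The function name is hypothetical.

```python
def aggregate_usage(usages):
    """Sum usage blocks, branching on each block's `type` field."""
    totals = {"input_tokens": 0, "output_tokens": 0, "seconds": 0.0}
    for usage in usages:
        kind = usage.get("type")
        if kind == "tokens":  # assumed value for token-billed models
            totals["input_tokens"] += usage.get("input_tokens", 0)
            totals["output_tokens"] += usage.get("output_tokens", 0)
        elif kind == "duration":  # assumed value for duration-billed models
            totals["seconds"] += usage.get("seconds", 0.0)
        else:
            raise ValueError(f"unknown usage type: {kind!r}")
    return totals
```

Raising on an unknown type keeps a new usage shape from being silently counted as zero.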
Verbose JSON with Timestamps (verbose_json)
Use response_format=verbose_json together with timestamp_granularities[] to get word-level and/or segment-level timestamps. This is especially useful for subtitle generation, audio alignment, and video localization.
timestamp_granularities[] accepts word, segment, or both (OpenAI whisper-1 supports both simultaneously).
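For the subtitle use case, segment timestamps can be turned into SubRip (SRT) cues. The sketch below assumes each segment is a dict with start and end in seconds plus a text field, as in the OpenAI verbose_json shape; those field names are not confirmed on this page, and the function names are hypothetical.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments (assumed keys: start, end, text) as numbered SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(cues)
```

Word-level timestamps could be grouped into cues the same way when finer alignment is needed.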
mistral/voxtral-mini-latest only supports a single timestamp_granularities value per request. If you need both word and segment timestamps with Mistral, make two separate requests.
The verbose_json response format with timestamp_granularities is supported for openai/whisper-1 and mistral/voxtral-mini-latest. The newer gpt-4o-transcribe models return timestamps through their own response schema.
Pricing
Transcription models are priced either per token of input audio (for gpt-4o-transcribe and gpt-4o-mini-transcribe) or per second of input audio (for whisper-1). The exact rate per model is on the Transcription model library. Charges appear in your usage dashboard immediately after the request completes.
Error Handling
The API returns standard HTTP status codes:
| Status | Meaning |
|---|---|
| 200 | Success |
| 400 | Bad Request (missing file or model, unsupported audio format) |
| 401 | Unauthorized (invalid API key) |
| 404 | Model not found or not approved for your organization |
| 413 | Payload Too Large (audio file exceeds 32 MB) |
| 429 | Rate limited |
| 500 | Internal Server Error |
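A client might sort the status codes listed above into a few coarse actions. The sketch below is one possible policy, not documented behavior: treating 429 and 500 as retryable, and the exponential backoff schedule, are assumptions, and both function names are hypothetical.

```python
# Assumption: rate limits and server errors are worth retrying; the rest are not.
RETRYABLE = {429, 500}
CLIENT_ERRORS = {400, 401, 404, 413}


def classify_status(status):
    """Map an HTTP status code to a coarse client action."""
    if status == 200:
        return "ok"
    if status in RETRYABLE:
        return "retry"
    if status in CLIENT_ERRORS:
        return "fix_request"  # resending the same request would fail again
    return "unknown"


def backoff_delays(attempts, base=1.0, cap=30.0):
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * 2**i) for i in range(attempts)]
```

A 413, for example, lands in fix_request: the file has to be split into smaller chunks before the request can succeed.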
This endpoint is fully compatible with the OpenAI Audio Transcriptions API. You can use the OpenAI SDK’s client.audio.transcriptions.create() method directly.
Authorizations
API key for authentication
Body
multipart/form-data
file
The audio file to transcribe. Supported formats are flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm. Maximum upload size is 32 MB.
model
The speech-to-text model to use, prefixed with the provider slug. Currently only OpenAI models are supported.
Example:
"openai/gpt-4o-transcribe"
language
The language of the input audio in ISO 639-1 format (for example, en, fr, ja). Supplying the language improves accuracy and latency. Auto-detected when omitted.
Last modified on June 8, 2026