# Create Transcription

> Transcribes audio into text using a speech-to-text model. The audio file is sent as `multipart/form-data`.

Transcribe audio into text using OpenAI's speech-to-text models through Requesty's routing.

## Base URL

```
https://router.requesty.ai/v1/audio/transcriptions
```

## Authentication

Include your Requesty API key in the request headers:

```bash theme={"dark"}
Authorization: Bearer YOUR_REQUESTY_API_KEY
```

## Example Request

The endpoint accepts `multipart/form-data`. Send the audio as the `file` field and the model identifier as the `model` field.

```bash theme={"dark"}
curl https://router.requesty.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_REQUESTY_API_KEY" \
  -F "model=openai/gpt-4o-transcribe" \
  -F "file=@./meeting.mp3"
```

Example response:

```json theme={"dark"}
{
  "text": "Hello, this is a transcription of the audio.",
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "output_tokens": 11,
    "total_tokens": 25,
    "input_token_details": {
      "audio_tokens": 14,
      "text_tokens": 0
    }
  }
}
```

### OpenAI SDK

The endpoint is fully compatible with the OpenAI SDK. Just point the client at Requesty's base URL:

```python theme={"dark"}
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="YOUR_REQUESTY_API_KEY",
)

with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="openai/gpt-4o-transcribe",
        file=audio,
    )

print(transcript.text)
```

```typescript theme={"dark"}
import OpenAI from "openai";
import fs from "node:fs";

const client = new OpenAI({
  baseURL: "https://router.requesty.ai/v1",
  apiKey: process.env.REQUESTY_API_KEY,
});

const transcript = await client.audio.transcriptions.create({
  model: "openai/gpt-4o-transcribe",
  file: fs.createReadStream("meeting.mp3"),
});

console.log(transcript.text);
```

## Supported Models

Browse the full catalog on the [Transcription model library](https://app.requesty.ai/model-library/transcriptions). The table below lists the OpenAI transcription models:

| Model                           | Best for                               | Billing                              |
| ------------------------------- | -------------------------------------- | ------------------------------------ |
| `openai/gpt-4o-transcribe`      | Highest accuracy, multilingual         | Token-based                          |
| `openai/gpt-4o-mini-transcribe` | Fast and cost-efficient                | Token-based                          |
| `openai/whisper-1`              | Drop-in replacement for legacy Whisper | Duration-based (per second of audio) |

Date-pinned snapshots (for example `openai/gpt-4o-mini-transcribe-2025-12-15`) are also available when you need a stable model version.

## Supported Audio Formats

The `file` field accepts the following formats: `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, `webm`.

The maximum upload size per request is **32 MB**. For longer recordings, split the audio into chunks and concatenate the resulting transcripts on your side.
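
The chunk-and-concatenate approach can be sketched with Python's standard library for the simplest case, uncompressed WAV input. This is a minimal sketch under stated assumptions, not a method prescribed by Requesty: `split_wav`, `join_transcripts`, and the 30 MiB default margin are hypothetical choices, and compressed formats such as `mp3` would need an external tool to split.

```python
import io
import wave

# Assumption: stay a little below the documented 32 MB limit to leave room
# for multipart overhead. The exact margin is a guess, not a documented value.
DEFAULT_MAX_CHUNK_BYTES = 30 * 1024 * 1024
WAV_HEADER_BYTES = 44  # size of the PCM header written by the wave module


def split_wav(data: bytes, max_chunk_bytes: int = DEFAULT_MAX_CHUNK_BYTES) -> list:
    """Split WAV bytes into standalone WAV files no larger than max_chunk_bytes."""
    chunks = []
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        frame_size = params.sampwidth * params.nchannels
        frames_per_chunk = max(1, (max_chunk_bytes - WAV_HEADER_BYTES) // frame_size)
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                dst.setparams(params)    # same channels, sample width, rate
                dst.writeframes(frames)  # header is patched to the real length on close
            chunks.append(buf.getvalue())
    return chunks


def join_transcripts(parts: list) -> str:
    """Concatenate per-chunk transcripts, skipping empty ones."""
    return " ".join(p.strip() for p in parts if p.strip())
```

Each element returned by `split_wav` can be uploaded as its own `file`, and the `text` values passed to `join_transcripts` in order. Cutting at a fixed frame count can split a word across two chunks, so cutting at silences is likely to give cleaner results.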

## Language Hint

Set `language` to the ISO 639-1 code of the spoken language to improve accuracy and latency. When omitted, the model auto-detects the language.

```bash theme={"dark"}
curl https://router.requesty.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_REQUESTY_API_KEY" \
  -F "model=openai/gpt-4o-transcribe" \
  -F "language=fr" \
  -F "file=@./conference.m4a"
```

## Response Format

Set `response_format` to control the output shape. Supported values:

| Value          | Description                                                                 |
| -------------- | --------------------------------------------------------------------------- |
| `json`         | **(default)** JSON object with `text` and `usage`                           |
| `verbose_json` | JSON with full timestamp data (`words`, `segments`, `language`, `duration`) |

### Default JSON (`json`)

When `response_format` is omitted or set to `json`, the response is a JSON object with `text` and a `usage` block. The shape of `usage` depends on how the model is billed:

#### Token usage (`gpt-4o-transcribe`, `gpt-4o-mini-transcribe`)

```json theme={"dark"}
{
  "text": "Hello, world.",
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "output_tokens": 11,
    "total_tokens": 25,
    "input_token_details": {
      "audio_tokens": 14,
      "text_tokens": 0
    }
  }
}
```

#### Duration usage (`whisper-1`)

```json theme={"dark"}
{
  "text": "Hello, world.",
  "usage": {
    "type": "duration",
    "seconds": 4.2
  }
}
```

Use the `type` discriminator to decide how to render or aggregate usage on your side.
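
As an illustration of branching on `type`, the sketch below sums usage across several parsed responses while keeping token and duration totals separate. `aggregate_usage` is a hypothetical helper, and the input is assumed to be response bodies already decoded into dictionaries.

```python
def aggregate_usage(responses: list) -> dict:
    """Sum usage across responses, keeping token and duration totals apart."""
    totals = {"total_tokens": 0, "seconds": 0.0}
    for response in responses:
        usage = response.get("usage") or {}
        kind = usage.get("type")
        if kind == "tokens":
            totals["total_tokens"] += usage["total_tokens"]
        elif kind == "duration":
            totals["seconds"] += usage["seconds"]
        else:
            # Fail loudly rather than silently dropping an unknown variant.
            raise ValueError(f"unknown usage type: {kind!r}")
    return totals


responses = [
    {"text": "Hello, world.", "usage": {"type": "tokens", "input_tokens": 14,
                                        "output_tokens": 11, "total_tokens": 25}},
    {"text": "Hello, world.", "usage": {"type": "duration", "seconds": 4.2}},
]
print(aggregate_usage(responses))  # {'total_tokens': 25, 'seconds': 4.2}
```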

### Verbose JSON with Timestamps (`verbose_json`)

Use `response_format=verbose_json` together with `timestamp_granularities[]` to get word-level and/or segment-level timestamps. This is especially useful for subtitle generation, audio alignment, and video localization.

`timestamp_granularities[]` accepts `word`, `segment`, or both. `openai/whisper-1` can return both in a single request.

<Note>
  `mistral/voxtral-mini-latest` only supports a single `timestamp_granularities` value per request. If you need both word and segment timestamps with Mistral, make two separate requests.
</Note>

```bash theme={"dark"}
curl https://router.requesty.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_REQUESTY_API_KEY" \
  -F "model=openai/whisper-1" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word" \
  -F "timestamp_granularities[]=segment" \
  -F "file=@./meeting.mp3"
```

```python theme={"dark"}
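from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="YOUR_REQUESTY_API_KEY",
)

with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-1",
        file=audio,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )

# Print each word with its start and end time in seconds
for word in transcript.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")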
```

Example response:

```json theme={"dark"}
{
  "task": "transcribe",
  "language": "english",
  "duration": 4.2,
  "text": "Hello, world.",
  "words": [
    { "word": "Hello,", "start": 0.0, "end": 0.52 },
    { "word": "world.", "start": 0.52, "end": 1.04 }
  ],
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 4.2,
      "text": " Hello, world.",
      "tokens": [50364, 2425, 11, 1002, 13, 50574],
      "temperature": 0.0,
      "avg_logprob": -0.28,
      "compression_ratio": 0.7,
      "no_speech_prob": 0.1
    }
  ]
}
```
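
For the subtitle use case, the `segments` array maps fairly directly onto SubRip (`.srt`) cues. The following is a rough sketch that assumes the response has been parsed into a dictionary with the shape shown above. `srt_timestamp` and `segments_to_srt` are hypothetical helper names, not part of any SDK.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Turn verbose_json segments into SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


segments = [{"id": 0, "start": 0.0, "end": 4.2, "text": " Hello, world."}]
print(segments_to_srt(segments))
```

Only `start`, `end`, and `text` are read from each segment, so the other fields can be ignored or dropped before conversion.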

<Note>
  The `verbose_json` response format with `timestamp_granularities` is supported for `openai/whisper-1` and `mistral/voxtral-mini-latest`. The newer `gpt-4o-transcribe` models return timestamps through their own response schema.
</Note>

## Pricing

Transcription models are priced either per token of input audio (for `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`) or per second of input audio (for `whisper-1`). The exact rate per model is on the [Transcription model library](https://app.requesty.ai/model-library/transcriptions). Charges appear in your [usage dashboard](https://app.requesty.ai/analytics) immediately after the request completes.

## Error Handling

The API returns standard HTTP status codes:

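* `200` Success
* `400` Bad Request (missing `file` or `model`, unsupported audio format, or other invalid parameters)
* `401` Unauthorized (missing or empty `Authorization` header)
* `402` Payment Required (organization balance exhausted)
* `403` Forbidden (invalid token, or model not in your access list)
* `404` Not Found (provider or model not supported)
* `413` Payload Too Large (audio file exceeds 32 MB)
* `429` Rate limited (retry after the `Retry-After` header value)
* `500` Internal Server Error
* `502` Bad Gateway (upstream provider returned an invalid response)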
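
One way to act on these codes is a small helper that decides whether a failed request is worth retrying. The OpenAPI description for `429` says to retry after the `Retry-After` header value. Treating `500` and `502` as retryable with exponential backoff is an assumption of this sketch, not documented behavior, and `retry_delay` is a hypothetical name.

```python
from typing import Optional


def retry_delay(status: int, headers: dict, attempt: int) -> Optional[float]:
    """Return seconds to wait before retrying, or None if a retry will not help."""
    backoff = float(2 ** attempt)  # 1s, 2s, 4s, ... (assumed policy)
    if status == 429:
        # Header names are case-insensitive, so normalize before the lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        value = lowered.get("retry-after")
        if value is not None:
            try:
                return float(value)
            except ValueError:
                pass  # not a number of seconds; fall back to backoff
        return backoff
    if status in (500, 502):
        return backoff  # assumption: transient server or upstream failure
    return None  # 4xx client errors need a fixed request, not a retry
```

A caller would sleep for the returned number of seconds and resend the request, stopping after a fixed number of attempts.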

<Info>
  This endpoint is fully compatible with the OpenAI Audio Transcriptions API. You can use the OpenAI SDK's `client.audio.transcriptions.create()` method directly.
</Info>

<Tip>
  To go the other direction and turn text into audio, use the [Create Speech endpoint](/api-reference/endpoint/audio-speech-create).
</Tip>


## OpenAPI

````yaml POST /v1/audio/transcriptions
openapi: 3.0.3
info:
  title: Requesty API
  description: Requesty API for AI model routing and key management
  version: 1.0.0
servers:
  - url: https://api-v2.requesty.ai
    description: Management API endpoint
  - url: https://router.requesty.ai
    description: Inference router endpoint
security:
  - BearerAuth: []
paths:
  /v1/audio/transcriptions:
    servers:
      - url: https://router.requesty.ai
        description: Inference router endpoint
    post:
      summary: Create transcription
      description: >-
        Transcribes audio into text using a speech-to-text model. The audio file
        is sent as `multipart/form-data`.
      operationId: createTranscription
      requestBody:
        required: true
        content:
          multipart/form-data:
            schema:
              $ref: '#/components/schemas/TranscriptionMultipartRequest'
      responses:
        '200':
          description: Transcription result
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/TranscriptionResponse'
        '400':
          description: Bad request - malformed payload or invalid parameters.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '401':
          description: Unauthorized - missing or empty Authorization header.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '402':
          description: Payment required - organization balance exhausted.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '403':
          description: Forbidden - invalid token or model not in access list.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '404':
          description: Not found - provider/model not supported.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '413':
          description: Payload too large
        '429':
          description: Rate limit exceeded. Retry after the Retry-After header value.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '500':
          description: Internal server error.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
        '502':
          description: Bad gateway - upstream provider returned an invalid response.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorResponse'
components:
  schemas:
    TranscriptionMultipartRequest:
      type: object
      required:
        - file
        - model
      properties:
        file:
          type: string
          format: binary
          description: >-
            The audio file to transcribe. Supported formats are `flac`, `mp3`,
            `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`. Maximum
            upload size is 32 MB.
        model:
          type: string
          description: >-
            The speech-to-text model to use, prefixed with the provider slug.
            Currently only OpenAI models are supported.
          example: openai/gpt-4o-transcribe
        language:
          type: string
          description: >-
            The language of the input audio in ISO 639-1 format (for example,
            `en`, `fr`, `ja`). Supplying the language improves accuracy and
            latency. Auto-detected when omitted.
    TranscriptionResponse:
      type: object
      required:
        - text
        - usage
      properties:
        text:
          type: string
          description: The transcribed text.
          example: Hello, world.
        usage:
          $ref: '#/components/schemas/TranscriptionUsage'
    ErrorResponse:
      type: object
      required:
        - error
      properties:
        error:
          type: object
          required:
            - origin
            - message
          properties:
            origin:
              type: string
              enum:
                - router
                - provider
              description: >-
                Whether the error originated from Requesty's router or an
                upstream provider.
            message:
              type: string
              description: Human-readable error description.
    TranscriptionUsage:
      type: object
      description: >-
        Usage stats for the transcription. The shape depends on how the model is
        billed: token-based (`gpt-4o-transcribe`, `gpt-4o-mini-transcribe`) or
        duration-based (`whisper-1`).
      oneOf:
        - $ref: '#/components/schemas/TranscriptionTokenUsage'
        - $ref: '#/components/schemas/TranscriptionDurationUsage'
    TranscriptionTokenUsage:
      type: object
      required:
        - type
        - input_tokens
        - output_tokens
        - total_tokens
      properties:
        type:
          type: string
          enum:
            - tokens
          description: Discriminator. Always `tokens` for this variant.
          example: tokens
        input_tokens:
          type: integer
          description: Number of input tokens billed for this request.
        output_tokens:
          type: integer
          description: Number of output tokens generated.
        total_tokens:
          type: integer
          description: Total tokens used (input + output).
        input_token_details:
          type: object
          properties:
            audio_tokens:
              type: integer
              description: Number of audio tokens in the input.
            text_tokens:
              type: integer
              description: Number of text tokens in the input.
    TranscriptionDurationUsage:
      type: object
      required:
        - type
        - seconds
      properties:
        type:
          type: string
          enum:
            - duration
          description: Discriminator. Always `duration` for this variant.
          example: duration
        seconds:
          type: number
          format: float
          description: Duration of the input audio in seconds.
  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      description: API key for authentication

````