Speech to Text
Convert audio to text with industry-leading accuracy in 100+ languages.
Overview
The Speech to Text API provides accurate transcription for audio files and real-time streams. It supports speaker diarization, word-level timestamps, and custom vocabularies to handle domain-specific terminology.
Key Capabilities
- Real-time streaming
- Speaker diarization
- 100+ languages
- Word-level timestamps
- Custom vocabulary
- Punctuation & formatting
Quickstart
Transcribe an audio file in just a few lines of code.
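A minimal sketch of the upload flow using the third-party `requests` library. The base URL and the `STT_API_KEY` environment variable are placeholders, and the `text` field in the JSON response is an assumption; substitute your actual endpoint and credentials.

```python
import os

import requests  # third-party HTTP client: pip install requests

# Placeholder base URL -- replace with the real API host.
API_URL = "https://api.example.com/v1/stt/transcribe"


def build_transcribe_request(language=None):
    """Assemble the headers and form fields for a basic transcription call."""
    headers = {"Authorization": f"Bearer {os.environ.get('STT_API_KEY', '')}"}
    data = {}
    if language is not None:
        data["language"] = language  # BCP-47 code; auto-detected when omitted
    return headers, data


def transcribe(path, language=None):
    """Upload one audio file and return the transcript text."""
    headers, data = build_transcribe_request(language)
    with open(path, "rb") as f:
        resp = requests.post(API_URL, headers=headers, data=data,
                             files={"file": f})
    resp.raise_for_status()
    return resp.json()["text"]  # "text" field name is an assumption
```

Calling `transcribe("meeting.mp3", language="en")` would return the transcript string on success.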
Endpoints
/v1/stt/transcribe
Transcribe a complete audio file. Supports MP3, WAV, FLAC, OGG, and WebM formats up to 500 MB. Returns the full transcript with optional segments, speaker labels, and word-level timestamps.
| Parameter | Type | Description |
|---|---|---|
| file (required) | file | The audio file to transcribe. Max 500 MB. |
| language | string | BCP-47 language code (e.g. "en", "ar", "de"). Auto-detected if omitted. |
| speaker_diarization | boolean | Enable speaker diarization to identify individual speakers. Default: false. |
| timestamps | boolean | Include word-level timestamps in the response. Default: false. |
| punctuation | boolean | Apply automatic punctuation and formatting. Default: true. |
| vocabulary | string[] | Custom vocabulary list for domain-specific terms (e.g. product names, acronyms). |
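The optional parameters above can be combined in one request. The sketch below builds the non-file form fields; how booleans and arrays are serialized on the wire is an assumption here (booleans as `"true"`/`"false"` strings, `vocabulary` as a JSON-encoded array), so check against your client library's conventions.

```python
import json


def transcribe_options(speaker_diarization=False, timestamps=False,
                       punctuation=True, vocabulary=None, language=None):
    """Build the non-file form fields for /v1/stt/transcribe.

    Serialization format is an assumption: booleans become "true"/"false"
    strings and the vocabulary list is JSON-encoded.
    """
    fields = {
        "speaker_diarization": "true" if speaker_diarization else "false",
        "timestamps": "true" if timestamps else "false",
        "punctuation": "true" if punctuation else "false",
    }
    if language:
        fields["language"] = language
    if vocabulary:
        fields["vocabulary"] = json.dumps(vocabulary)
    return fields


# A diarized, timestamped request with domain-specific terms.
opts = transcribe_options(speaker_diarization=True, timestamps=True,
                          vocabulary=["Kubernetes", "gRPC"], language="en")
```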
/v1/stt/stream
Open a WebSocket connection for real-time speech-to-text transcription. Send audio chunks and receive partial and final transcriptions as they are produced. Ideal for live captioning, voice assistants, and real-time note-taking.
| Parameter | Type | Description |
|---|---|---|
| language | string | BCP-47 language code. Auto-detected if omitted. |
| sample_rate | integer | Audio sample rate in Hz. Default: 16000. |
| encoding | string | Audio encoding: "pcm16", "opus", or "flac". Default: "pcm16". |
| interim_results | boolean | Return partial results as speech is recognized. Default: true. |
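Before sending audio over the WebSocket, it must be split into fixed-duration chunks. A sketch of that framing for the default `pcm16` encoding, assuming raw mono little-endian PCM16 input; the 20 ms frame duration is an illustrative choice, not a documented requirement:

```python
def pcm16_frames(audio: bytes, sample_rate: int = 16000, frame_ms: int = 20):
    """Split raw mono PCM16 audio into fixed-duration frames for streaming.

    Each sample is 2 bytes, so a 20 ms frame at 16 kHz is
    16000 * 0.020 * 2 = 640 bytes. A trailing short frame is kept.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * 2
    return [audio[i:i + frame_bytes]
            for i in range(0, len(audio), frame_bytes)]


# One second of silence at 16 kHz mono -> 50 frames of 640 bytes each.
frames = pcm16_frames(b"\x00" * 32000)
```

Each frame would then be sent as a binary WebSocket message while transcription messages arrive on the same connection.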
Response Objects
Transcription responses are returned as JSON. All times are in seconds.
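As an illustration of working with a diarized response, the shape below is a hypothetical example only: the field names (`text`, `segments`, `speaker`, `start`, `end`) are assumptions, not the documented schema. The helper groups segment text by speaker label.

```python
# Hypothetical response shape -- field names are assumptions for illustration.
response = {
    "text": "Hello there. Hi, how are you?",
    "segments": [
        {"speaker": "A", "start": 0.0, "end": 1.2, "text": "Hello there."},
        {"speaker": "B", "start": 1.4, "end": 2.9, "text": "Hi, how are you?"},
    ],
}


def text_by_speaker(resp):
    """Group segment text by speaker label, preserving segment order."""
    out = {}
    for seg in resp.get("segments", []):
        out.setdefault(seg["speaker"], []).append(seg["text"])
    return {spk: " ".join(parts) for spk, parts in out.items()}


by_speaker = text_by_speaker(response)
```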
Best Practices
Use custom vocabulary for domain-specific terms
Add product names, acronyms, and technical terms to the vocabulary parameter to improve recognition accuracy for specialized content.
Choose the right encoding for streaming
Use Opus for low-bandwidth connections and PCM16 for highest quality. Match the sample rate to your audio source to avoid resampling artifacts.
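The bandwidth gap is easy to quantify. Raw PCM16 costs 16 bits per sample per channel, while Opus voice streams typically run around 24 kbit/s (an illustrative figure, not a documented one):

```python
def pcm16_bandwidth_bps(sample_rate: int, channels: int = 1) -> int:
    """Raw PCM16 bandwidth: 16 bits per sample per channel."""
    return sample_rate * channels * 16


# 16 kHz mono PCM16 needs 256 kbit/s on the wire; an Opus voice stream
# at ~24 kbit/s (assumed typical bitrate) is roughly a tenth of that.
pcm_bps = pcm16_bandwidth_bps(16000)  # 256000 bit/s
```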
Enable diarization only when needed
Speaker diarization adds processing time. Only enable it when you need to distinguish between multiple speakers, such as in meetings or interviews.
Handle interim results for responsive UIs
Display partial transcripts immediately and replace them when final results arrive. This provides a much more responsive user experience for live captioning.
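One way to sketch this replace-on-final pattern, assuming each streaming message carries `text` and `is_final` fields (assumed names): finalized text is committed, and the display string is the committed text plus the latest partial.

```python
def apply_message(committed: str, msg: dict):
    """Fold one streaming message into the transcript state.

    `committed` holds finalized text; the returned display string is what
    the UI should show. Message fields {"text": ..., "is_final": ...} are
    assumptions for illustration.
    """
    if msg["is_final"]:
        committed = (committed + " " + msg["text"]).strip()
        return committed, committed  # the partial is replaced by the final
    return committed, (committed + " " + msg["text"]).strip()


committed, display = "", ""
stream = [
    {"text": "hello", "is_final": False},
    {"text": "hello wor", "is_final": False},
    {"text": "Hello, world.", "is_final": True},
]
for msg in stream:
    committed, display = apply_message(committed, msg)
```

After the final message, the interim strings ("hello", "hello wor") have been replaced by the finalized "Hello, world.".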