Text to Speech
Generate natural-sounding speech from text in 100+ languages.
Overview
The Text to Speech API transforms written text into lifelike audio using premium neural voices. Fine-tune speed, pitch, and emotional expression to create the perfect voice experience for your application.
Key Capabilities
- 50+ premium voices
- SSML support
- Speed & pitch control
- Streaming output
- Multiple output formats
- Emotional expression
Quickstart
Generate speech from text in just a few lines of code.
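A minimal sketch of a generation request using only the Python standard library. The host `https://api.example.com`, the bearer-token `Authorization` header, the POST method, and the JSON request body are assumptions made for illustration, since this page does not specify them. The API key and `voice_id` values are placeholders.

```python
import json
import urllib.request

# Assumptions (not stated on this page): host name, bearer-token auth,
# and a POST request with a JSON body.
BASE_URL = "https://api.example.com"  # hypothetical host


def build_generate_request(api_key: str, text: str, voice_id: str) -> urllib.request.Request:
    """Build (but do not send) a request to /v1/tts/generate."""
    payload = {"text": text, "voice_id": voice_id}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/tts/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_generate_request("YOUR_API_KEY", "Hello, world!", "YOUR_VOICE_ID")
# To send it, pass the request to urllib.request.urlopen(request).
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.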
Endpoints
/v1/tts/generate
Generate a complete audio file from text. Supports plain text and SSML input. Returns a URL to the generated audio file along with metadata. Ideal for batch processing, audiobook generation, and pre-rendered content.
| Parameter | Type | Description |
|---|---|---|
| text (required) | string | The text to synthesize. Supports plain text and SSML. Max 10,000 characters. |
| voice_id (required) | string | The voice to use. See the Voices page for available options. |
| language | string | BCP-47 language code. Auto-detected from voice if omitted. |
| output_format | string | Output format: "mp3", "wav", "ogg", or "flac". Default: "mp3". |
| speed | number | Playback speed multiplier from 0.25 to 4.0. Default: 1.0. |
| pitch | number | Pitch adjustment in semitones from -12 to 12. Default: 0. |
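The limits in the table above can be checked on the client before a request is sent. The following is a sketch only: the function name `build_generate_payload` is hypothetical, and measuring the 10,000-character limit with Python's `len` (which counts Unicode code points, SSML markup included) is an assumption about how the service counts characters.

```python
from typing import Optional

MAX_TEXT_CHARS = 10_000
GENERATE_FORMATS = {"mp3", "wav", "ogg", "flac"}


def build_generate_payload(
    text: str,
    voice_id: str,
    language: Optional[str] = None,
    output_format: str = "mp3",
    speed: float = 1.0,
    pitch: float = 0.0,
) -> dict:
    """Validate parameters against the documented limits and return a payload dict."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if not voice_id:
        raise ValueError("voice_id is required")
    if output_format not in GENERATE_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if not -12 <= pitch <= 12:
        raise ValueError("pitch must be between -12 and 12 semitones")

    payload = {
        "text": text,
        "voice_id": voice_id,
        "output_format": output_format,
        "speed": speed,
        "pitch": pitch,
    }
    # language is optional; omit it so the service can infer it from the voice.
    if language is not None:
        payload["language"] = language
    return payload


payload = build_generate_payload("Hi", "YOUR_VOICE_ID", speed=1.5)
```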
/v1/tts/stream
Stream generated audio in real time as chunks. The first audio chunk is delivered within 200ms, enabling low-latency playback for conversational interfaces, voice assistants, and interactive applications.
| Parameter | Type | Description |
|---|---|---|
| text (required) | string | The text to synthesize and stream. Max 10,000 characters. |
| voice_id (required) | string | The voice to use for generation. |
| output_format | string | Output format: "mp3" or "pcm16". Default: "mp3". |
| chunk_size | integer | Size of each audio chunk in bytes. Default: 4096. |
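On the client side, streamed audio can be consumed incrementally instead of waiting for the full body. The sketch below reads from any binary file-like object, such as the response returned by `urllib.request.urlopen`; the helper name `iter_audio_chunks` is hypothetical, and an in-memory buffer stands in for the HTTP response. The client's read size does not have to match the `chunk_size` sent to the service.

```python
import io
from typing import BinaryIO, Iterator


def iter_audio_chunks(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes until the stream ends."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        yield chunk


# Stand-in for an HTTP response body: 10,000 zero bytes.
fake_response = io.BytesIO(bytes(10_000))
chunk_lengths = [len(chunk) for chunk in iter_audio_chunks(fake_response)]
print(chunk_lengths)  # [4096, 4096, 1808]
```

In a real player, each chunk would be handed to an audio decoder or output buffer as soon as it arrives.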
Response Objects
The generation response follows this schema. Streaming responses deliver raw audio bytes with metadata in response headers.
Best Practices
Use streaming for conversational interfaces
The stream endpoint delivers the first audio chunk within 200ms. Use it for chatbots and voice assistants where latency matters more than having a complete file.
Choose the right output format
Use MP3 for web delivery and general use. Use WAV for post-processing workflows. Use OGG in bandwidth-constrained environments where audio quality still matters.
Leverage SSML for fine-grained control
Use SSML tags to add pauses, emphasis, and pronunciation hints. This is especially useful for reading numbers, dates, acronyms, and domain-specific terms correctly.
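As an illustration, the sketch below assembles an SSML string with a pause, emphasis, and a pronunciation hint. The tags `<speak>`, `<break>`, `<emphasis>`, and `<say-as>` come from the W3C SSML specification; which tags and attribute values this API honors is an assumption, and the helper names are hypothetical. Text is XML-escaped so that characters such as `&` do not break the markup.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str) -> str:
    """Wrap text in a <say-as> pronunciation hint (for example a date)."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def pause(milliseconds: int) -> str:
    """Insert a pause of the given length."""
    return f'<break time="{milliseconds}ms"/>'


def emphasize(text: str, level: str = "moderate") -> str:
    """Wrap text in an <emphasis> element."""
    return f"<emphasis level={quoteattr(level)}>{escape(text)}</emphasis>"


ssml = (
    "<speak>"
    + escape("Your order ships on ")
    + say_as("2024-03-15", "date")
    + pause(300)
    + emphasize("Thank you", level="strong")
    + "</speak>"
)
```

The resulting string would be sent as the `text` parameter in place of plain text.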
Cache generated audio when possible
If you generate the same text repeatedly with the same voice and settings, cache the audio URL or file locally to reduce API calls and lower latency.
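One way to implement this is to derive a cache key from the text plus every setting that affects the audio. The sketch below is an in-memory illustration: `cache_key`, `get_or_generate`, and the `generate` callable (which is expected to call the API and return the audio URL) are hypothetical names, not part of the API.

```python
import hashlib
import json
from typing import Callable, Dict

audio_cache: Dict[str, str] = {}  # cache key -> audio URL


def cache_key(text: str, voice_id: str, output_format: str = "mp3",
              speed: float = 1.0, pitch: float = 0.0) -> str:
    """Hash the text and all settings that change the generated audio."""
    settings = {
        "text": text,
        "voice_id": voice_id,
        "output_format": output_format,
        "speed": speed,
        "pitch": pitch,
    }
    # sort_keys gives a stable serialization, so equal settings hash equally.
    canonical = json.dumps(settings, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def get_or_generate(generate: Callable[..., str], text: str, voice_id: str,
                    **settings) -> str:
    """Return the cached audio URL, calling generate only on a cache miss."""
    key = cache_key(text, voice_id, **settings)
    if key not in audio_cache:
        audio_cache[key] = generate(text, voice_id, **settings)
    return audio_cache[key]
```

This page does not say whether generated audio URLs expire. If they do, caching the downloaded file under the same key is the safer choice.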