Text to Speech

Generate natural-sounding speech from text in 100+ languages.

Overview

The Text to Speech API transforms written text into lifelike audio using premium neural voices. Fine-tune speed, pitch, and emotional expression to create the perfect voice experience for your application.

Key Capabilities

50+ premium voices
SSML support
Speed & pitch control
Streaming output
Multiple output formats
Emotional expression

Quickstart

Generate speech from text in just a few lines of code.

curl -X POST https://api.nur.ai/v1/tts/generate \
  -H "Authorization: Bearer nur_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello! Welcome to Nur.",
    "voice_id": "rachel_v2"
  }'

Endpoints

POST /v1/tts/generate

Generate a complete audio file from text. Supports plain text and SSML input. Returns a URL to the generated audio file along with metadata. Ideal for batch processing, audiobook generation, and pre-rendered content.

Parameter	Type	Description
text (required)	string	The text to synthesize. Supports plain text and SSML. Max 10,000 characters.
voice_id (required)	string	The voice to use. See the Voices page for available options.
language	string	BCP-47 language code. Auto-detected from voice if omitted.
output_format	string	Output format: "mp3", "wav", "ogg", or "flac". Default: "mp3".
speed	number	Playback speed multiplier from 0.25 to 4.0. Default: 1.0.
pitch	number	Pitch adjustment in semitones from -12 to 12. Default: 0.
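
The limits in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: `build_generate_payload` is a hypothetical helper name, and it assumes the API treats the documented ranges as inclusive.

```python
from typing import Any, Dict, Optional

# Limits copied from the parameter table above.
MAX_TEXT_CHARS = 10_000
OUTPUT_FORMATS = {"mp3", "wav", "ogg", "flac"}
SPEED_RANGE = (0.25, 4.0)
PITCH_RANGE = (-12, 12)


def build_generate_payload(
    text: str,
    voice_id: str,
    language: Optional[str] = None,
    output_format: str = "mp3",
    speed: float = 1.0,
    pitch: float = 0,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments and return the JSON body."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if not voice_id:
        raise ValueError("voice_id is required")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    # Assumption: both ends of each documented range are allowed.
    if not SPEED_RANGE[0] <= speed <= SPEED_RANGE[1]:
        raise ValueError("speed must be between 0.25 and 4.0")
    if not PITCH_RANGE[0] <= pitch <= PITCH_RANGE[1]:
        raise ValueError("pitch must be between -12 and 12")

    payload: Dict[str, Any] = {
        "text": text,
        "voice_id": voice_id,
        "output_format": output_format,
        "speed": speed,
        "pitch": pitch,
    }
    # Leave language out so it is auto-detected from the voice.
    if language is not None:
        payload["language"] = language
    return payload
```

Failing early on an out-of-range `speed` or an over-long `text` avoids spending a round trip on a request that would likely be rejected.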

curl -X POST https://api.nur.ai/v1/tts/generate \
  -H "Authorization: Bearer nur_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The quarterly results exceeded expectations, with revenue growing by 23 percent year over year.",
    "voice_id": "marcus_v2",
    "language": "en",
    "output_format": "wav",
    "speed": 0.95,
    "pitch": -1
  }'

{
  "audio_url": "https://cdn.nur.ai/audio/gen_abc123def456.wav",
  "duration": 4.82,
  "format": "wav",
  "voice_id": "marcus_v2",
  "characters_used": 95
}
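
Because the response carries an `audio_url` rather than the audio bytes, saving the file takes a second request. The sketch below uses only the Python standard library. The function names are illustrative, and it assumes the URL can be fetched with a plain GET and no `Authorization` header, which this page does not confirm.

```python
import os
import urllib.request
from urllib.parse import urlparse


def local_filename(response: dict, directory: str = ".") -> str:
    """Derive a local file path from the audio_url in a generate response."""
    name = os.path.basename(urlparse(response["audio_url"]).path)
    if not name:
        # Fall back to a generic name built from the reported format.
        name = f"audio.{response['format']}"
    return os.path.join(directory, name)


def download_audio(response: dict, directory: str = ".") -> str:
    """Fetch the generated file and return the path it was saved to."""
    path = local_filename(response, directory)
    # Assumption: the URL is publicly fetchable without extra headers.
    urllib.request.urlretrieve(response["audio_url"], path)
    return path
```

Given the example response above, `download_audio(response)` would save the file under the name taken from the URL path.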

POST /v1/tts/stream

Stream generated audio in real time as chunks. The first audio chunk is delivered within 200ms, enabling low-latency playback for conversational interfaces, voice assistants, and interactive applications.

Parameter	Type	Description
text (required)	string	The text to synthesize and stream. Max 10,000 characters.
voice_id (required)	string	The voice to use for generation.
output_format	string	Output format: "mp3" or "pcm16". Default: "mp3".
chunk_size	integer	Size of each audio chunk in bytes. Default: 4096.
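
Both endpoints cap `text` at 10,000 characters, so longer input has to be split across several requests. One possible approach, shown here as a sketch, is to break on sentence boundaries so that no request ends mid-sentence. `split_text` is a hypothetical helper, and the sentence detection is a simple punctuation heuristic that will not suit every language.

```python
import re
from typing import List

MAX_TEXT_CHARS = 10_000  # per-request limit from the parameter tables


def split_text(text: str, limit: int = MAX_TEXT_CHARS) -> List[str]:
    """Split text into pieces of at most `limit` characters.

    Pieces break after sentence-ending punctuation where possible; a single
    sentence longer than `limit` is cut at the limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-cut any sentence that cannot fit in one request on its own.
        while len(sentence) > limit:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces
```

Each returned piece can then be sent as its own request and the resulting audio played or concatenated in order.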

from nur import NurClient

client = NurClient()

# Option 1: iterate over the stream and write each chunk to a file.
stream = client.tts.stream(
    text="Welcome to Nur. This audio is being streamed in real time.",
    voice_id="rachel_v2",
    output_format="mp3",
    chunk_size=4096,
)
with open("streamed_output.mp3", "wb") as f:
    for chunk in stream:
        f.write(chunk)


# Option 2: pass a callback that receives each chunk as it arrives.
# `audio_player` stands in for your application's playback object.
def on_chunk(chunk):
    audio_player.write(chunk)


client.tts.stream(
    text="Playing audio in real time.",
    voice_id="rachel_v2",
    on_chunk=on_chunk,
)

Content-Type: audio/mpeg
Transfer-Encoding: chunked
X-Request-Id: req_tts_abc123
X-Voice-Id: rachel_v2
X-Characters-Used: 58
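
Since a streaming response has no JSON body, usage data has to be read from the headers shown above. The sketch below collects them into a dictionary. `parse_stream_metadata` is a hypothetical helper, and it treats every header as optional because this page does not say which ones are guaranteed.

```python
from typing import Dict, Mapping, Union


def parse_stream_metadata(
    headers: Mapping[str, str],
) -> Dict[str, Union[str, int, None]]:
    """Collect the streaming metadata headers into a plain dict."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    used = lowered.get("x-characters-used")
    return {
        "content_type": lowered.get("content-type"),
        "request_id": lowered.get("x-request-id"),
        "voice_id": lowered.get("x-voice-id"),
        # Convert the character count to an int when the header is present.
        "characters_used": int(used) if used is not None else None,
    }
```

Logging `request_id` alongside `characters_used` makes it easier to reconcile usage and to reference a specific request when reporting a problem.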

Response Objects

The generation response follows this schema. Streaming responses deliver raw audio bytes with metadata in response headers.

{
  "audio_url": "string (URL to generated audio file)",
  "duration": "number (seconds)",
  "format": "string (mp3 | wav | ogg | flac)",
  "voice_id": "string",
  "characters_used": "number"
}
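
If you call the HTTP endpoint directly, the schema above maps naturally onto a small typed object. This is an illustrative sketch. `GenerationResult` is not an SDK class, and it assumes all five fields are always present in a successful response.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationResult:
    audio_url: str
    duration: float  # seconds
    format: str      # mp3 | wav | ogg | flac
    voice_id: str
    characters_used: int

    @classmethod
    def from_json(cls, body: str) -> "GenerationResult":
        """Parse a response body; a missing field raises KeyError."""
        data = json.loads(body)
        return cls(
            audio_url=str(data["audio_url"]),
            duration=float(data["duration"]),
            format=str(data["format"]),
            voice_id=str(data["voice_id"]),
            characters_used=int(data["characters_used"]),
        )
```

Parsing into a frozen dataclass surfaces an unexpected response shape at the point of parsing instead of later in your application.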

Best Practices

Use streaming for conversational interfaces

The stream endpoint delivers the first audio chunk within 200ms. Use it for chatbots and voice assistants where latency matters more than having a complete file.

Choose the right output format

Use MP3 for web delivery and general use, WAV for post-processing workflows, and OGG where bandwidth is constrained but quality still matters. Note that the stream endpoint supports only MP3 and PCM16.

Leverage SSML for fine-grained control

Use SSML tags to add pauses, emphasis, and pronunciation hints. This is especially useful for reading numbers, dates, acronyms, and domain-specific terms correctly.
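
As an illustration, SSML input can be assembled from small helpers that escape user text so characters such as `&` do not break the markup. The elements used here (`speak`, `break`, `emphasis`, `say-as`) come from the W3C SSML specification. Which elements and `interpret-as` values this API accepts is an assumption, as are the helper names.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def pause(milliseconds: int) -> str:
    """A silent break of the given length."""
    return f'<break time="{milliseconds}ms"/>'


def emphasize(text: str, level: str = "strong") -> str:
    """Stress a word or phrase."""
    return f"<emphasis level={quoteattr(level)}>{escape(text)}</emphasis>"


def say_as(text: str, interpret_as: str) -> str:
    """Hint how a token should be read, e.g. spelled out letter by letter."""
    return (
        f"<say-as interpret-as={quoteattr(interpret_as)}>"
        f"{escape(text)}</say-as>"
    )


def speak(*parts: str) -> str:
    """Wrap already-built fragments in the SSML root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("Revenue grew by "),
    emphasize("23 percent"),
    escape(" in Q3 & Q4."),
    pause(400),
    escape("This text was written in "),
    say_as("SSML", "characters"),
    escape("."),
)

# Raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
ET.fromstring(ssml)
```

The resulting string would be sent as the `text` parameter and counts toward the 10,000-character limit like any other input.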

Cache generated audio when possible

If you generate the same text repeatedly with the same voice and settings, cache the audio URL or file locally to reduce API calls and lower latency.
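
One way to sketch such a cache is to hash the full request payload, so that any change to the text, voice, or settings produces a different key. `AudioCache` and `cache_key` are hypothetical names, and the `generate` callable stands in for whatever code performs the API request and returns the audio URL.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(payload: dict) -> str:
    """Stable key: identical text, voice, and settings hash to the same value."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class AudioCache:
    """In-memory cache mapping request payloads to generated audio URLs."""

    def __init__(self, generate: Callable[[dict], str]) -> None:
        # `generate` performs the API request and returns the audio URL.
        # It is injected so the cache can be exercised without network access.
        self._generate = generate
        self._urls: Dict[str, str] = {}

    def get(self, payload: dict) -> str:
        """Return the cached URL, calling `generate` only on a cache miss."""
        key = cache_key(payload)
        if key not in self._urls:
            self._urls[key] = self._generate(payload)
        return self._urls[key]
```

This page does not say how long audio URLs remain valid. If they expire, caching the downloaded file under the same key is the safer variant.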