
Text to Speech

Generate natural-sounding speech from text in 100+ languages.

Overview

The Text to Speech API transforms written text into lifelike audio using premium neural voices. Fine-tune speed, pitch, and emotional expression to create the perfect voice experience for your application.

Key Capabilities

  • 50+ premium voices
  • SSML support
  • Speed & pitch control
  • Streaming output
  • Multiple output formats
  • Emotional expression

Quickstart

Generate speech from text in just a few lines of code.

    curl -X POST https://api.nur.ai/v1/tts/generate \
      -H "Authorization: Bearer nur_your_api_key" \
      -H "Content-Type: application/json" \
      -d '{
        "text": "Hello! Welcome to Nur.",
        "voice_id": "rachel_v2"
      }' \
      --output output.mp3
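
If you prefer Python without an SDK, the same request can be sketched with the standard library. This is an illustration under assumptions: the helper name `build_generate_request` is hypothetical, and the endpoint, headers, and body fields are copied from the curl command above. The helper only builds the request; nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.nur.ai/v1/tts/generate"


def build_generate_request(api_key: str, text: str, voice_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the quickstart.

    Hypothetical helper for illustration; it is not part of any Nur SDK.
    """
    body = json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_generate_request("nur_your_api_key", "Hello! Welcome to Nur.", "rachel_v2")
# To send it: urllib.request.urlopen(request) returns the HTTP response object.
```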

Endpoints

POST /v1/tts/generate

Generate a complete audio file from text. Supports plain text and SSML input. Returns a URL to the generated audio file along with metadata. Ideal for batch processing, audiobook generation, and pre-rendered content.

| Parameter | Type | Description |
|---|---|---|
| text (required) | string | The text to synthesize. Supports plain text and SSML. Max 10,000 characters. |
| voice_id (required) | string | The voice to use. See the Voices page for available options. |
| language | string | BCP-47 language code. Auto-detected from voice if omitted. |
| output_format | string | Output format: "mp3", "wav", "ogg", or "flac". Default: "mp3". |
| speed | number | Playback speed multiplier from 0.25 to 4.0. Default: 1.0. |
| pitch | number | Pitch adjustment in semitones from -12 to 12. Default: 0. |

    curl -X POST https://api.nur.ai/v1/tts/generate \
      -H "Authorization: Bearer nur_your_api_key" \
      -H "Content-Type: application/json" \
      -d '{
        "text": "The quarterly results exceeded expectations, with revenue growing by 23 percent year over year.",
        "voice_id": "marcus_v2",
        "language": "en",
        "output_format": "wav",
        "speed": 0.95,
        "pitch": -1
      }' \
      --output report.wav

    {
      "audio_url": "https://cdn.nur.ai/audio/gen_abc123def456.wav",
      "duration": 4.82,
      "format": "wav",
      "voice_id": "marcus_v2",
      "characters_used": 94
    }
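
The limits in the parameter table can be checked on the client before a request is sent, which avoids a round trip for obviously invalid input. The sketch below is a minimal illustration, not part of the API: `validate_generate_params` is a hypothetical helper, and it assumes the documented ranges are inclusive and that the 10,000-character limit counts characters the way Python's `len` does.

```python
ALLOWED_FORMATS = {"mp3", "wav", "ogg", "flac"}  # from the output_format row
MAX_TEXT_CHARS = 10_000                          # from the text row


def validate_generate_params(params: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid.

    Hypothetical client-side check. Ranges are assumed to be inclusive.
    """
    problems = []

    text = params.get("text")
    if not isinstance(text, str) or not text:
        problems.append("text is required and must be a non-empty string")
    elif len(text) > MAX_TEXT_CHARS:
        problems.append(f"text exceeds {MAX_TEXT_CHARS} characters")

    voice_id = params.get("voice_id")
    if not isinstance(voice_id, str) or not voice_id:
        problems.append("voice_id is required and must be a non-empty string")

    if params.get("output_format", "mp3") not in ALLOWED_FORMATS:
        problems.append("output_format must be one of mp3, wav, ogg, flac")

    speed = params.get("speed", 1.0)
    if not isinstance(speed, (int, float)) or not 0.25 <= speed <= 4.0:
        problems.append("speed must be a number between 0.25 and 4.0")

    pitch = params.get("pitch", 0)
    if not isinstance(pitch, (int, float)) or not -12 <= pitch <= 12:
        problems.append("pitch must be a number between -12 and 12")

    return problems


# The request body from the example above passes every check.
example_problems = validate_generate_params({
    "text": "The quarterly results exceeded expectations.",
    "voice_id": "marcus_v2",
    "language": "en",
    "output_format": "wav",
    "speed": 0.95,
    "pitch": -1,
})
```

Returning a list rather than raising on the first failure lets a caller report every problem at once.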

POST /v1/tts/stream

Stream audio in chunks as it is generated. The first audio chunk is delivered within 200 ms, enabling low-latency playback for conversational interfaces, voice assistants, and interactive applications.

| Parameter | Type | Description |
|---|---|---|
| text (required) | string | The text to synthesize and stream. Max 10,000 characters. |
| voice_id (required) | string | The voice to use for generation. |
| output_format | string | Output format: "mp3" or "pcm16". Default: "mp3". |
| chunk_size | integer | Size of each audio chunk in bytes. Default: 4096. |

    from nur import NurClient

    client = NurClient()

    stream = client.tts.stream(
        text="Welcome to Nur. This audio is being streamed in real time.",
        voice_id="rachel_v2",
        output_format="mp3",
        chunk_size=4096,
    )

    # Write streamed chunks to file
    with open("streamed_output.mp3", "wb") as f:
        for chunk in stream:
            f.write(chunk)

    # Or play audio directly with a callback.
    # audio_player is a placeholder for your application's audio sink.
    def on_chunk(chunk):
        audio_player.write(chunk)

    client.tts.stream(
        text="Playing audio in real time.",
        voice_id="rachel_v2",
        on_chunk=on_chunk,
    )

    Content-Type: audio/mpeg
    Transfer-Encoding: chunked
    X-Request-Id: req_tts_abc123
    X-Voice-Id: rachel_v2
    X-Characters-Used: 58
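
Because a streaming response carries its metadata in headers rather than in a JSON body, a client may want to collect those headers into a typed object. The following is a minimal sketch: `StreamMetadata` and `parse_stream_headers` are hypothetical names, and the header set is assumed to match the example above.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class StreamMetadata:
    """Metadata carried in the streaming response headers (hypothetical type)."""
    content_type: str
    request_id: str
    voice_id: str
    characters_used: int


def parse_stream_headers(headers: Mapping[str, str]) -> StreamMetadata:
    """Extract the metadata headers shown in the example response."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return StreamMetadata(
        content_type=lowered["content-type"],
        request_id=lowered["x-request-id"],
        voice_id=lowered["x-voice-id"],
        characters_used=int(lowered["x-characters-used"]),
    )


metadata = parse_stream_headers({
    "Content-Type": "audio/mpeg",
    "Transfer-Encoding": "chunked",
    "X-Request-Id": "req_tts_abc123",
    "X-Voice-Id": "rachel_v2",
    "X-Characters-Used": "58",
})
```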

Response Objects

The generation response follows this schema. Streaming responses deliver raw audio bytes with metadata in response headers.

    {
      "audio_url": "string (URL to generated audio file)",
      "duration": "number (seconds)",
      "format": "string (mp3 | wav | ogg | flac)",
      "voice_id": "string",
      "characters_used": "number"
    }
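
One way to work with this schema in Python is to parse the JSON body into a small typed object and then fetch the audio from `audio_url`. The names `GenerationResult`, `parse_generation_response`, and `download_audio` are hypothetical, and the sketch assumes the audio URL can be fetched with a plain GET request and no extra headers, which this page does not confirm.

```python
import json
import shutil
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationResult:
    """Typed view of the generation response schema (hypothetical type)."""
    audio_url: str
    duration: float        # seconds
    format: str            # mp3 | wav | ogg | flac
    voice_id: str
    characters_used: int


def parse_generation_response(raw: str) -> GenerationResult:
    """Parse the JSON response body into a GenerationResult."""
    data = json.loads(raw)
    return GenerationResult(
        audio_url=data["audio_url"],
        duration=float(data["duration"]),
        format=data["format"],
        voice_id=data["voice_id"],
        characters_used=int(data["characters_used"]),
    )


def download_audio(result: GenerationResult, path: str) -> None:
    """Save the generated audio locally.

    Assumes audio_url is reachable with an unauthenticated GET.
    """
    with urllib.request.urlopen(result.audio_url) as response, open(path, "wb") as f:
        shutil.copyfileobj(response, f)


result = parse_generation_response(
    '{"audio_url": "https://cdn.nur.ai/audio/gen_abc123def456.wav",'
    ' "duration": 4.82, "format": "wav",'
    ' "voice_id": "marcus_v2", "characters_used": 94}'
)
```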

Best Practices

Use streaming for conversational interfaces

The stream endpoint delivers the first audio chunk within 200 ms. Use it for chatbots and voice assistants, where latency matters more than having a complete file.
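
To check the latency you actually observe in your own environment, you can wrap any iterable of audio chunks and record when the first one arrives. `FirstChunkTimer` below is a hypothetical helper, not part of the SDK. It starts its clock when iteration begins, so it measures the client-side wait for the first chunk rather than the server's processing time.

```python
import time
from typing import Iterable, Iterator, Optional


class FirstChunkTimer:
    """Wrap a chunk iterable and record the wait for the first chunk."""

    def __init__(self, chunks: Iterable[bytes]) -> None:
        self._chunks = chunks
        self.first_chunk_seconds: Optional[float] = None

    def __iter__(self) -> Iterator[bytes]:
        start = time.monotonic()  # monotonic clock is safe for measuring intervals
        for chunk in self._chunks:
            if self.first_chunk_seconds is None:
                self.first_chunk_seconds = time.monotonic() - start
            yield chunk


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a real audio stream, used only for this demonstration."""
    yield b"first"
    yield b"second"


timer = FirstChunkTimer(fake_stream())
audio = b"".join(timer)
```

With a real stream, pass the object you would otherwise iterate over and read `timer.first_chunk_seconds` after the loop.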

Choose the right output format

Use MP3 for web delivery and general use. Use WAV for post-processing workflows. Use OGG in bandwidth-constrained environments where you still need good quality.

Leverage SSML for fine-grained control

Use SSML tags to add pauses, emphasis, and pronunciation hints. This is especially useful for reading numbers, dates, acronyms, and domain-specific terms correctly.
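
As an illustration, SSML input can be assembled with small helpers that escape user text so the markup stays well-formed. This sketch uses standard SSML elements (`speak`, `break`, `emphasis`, `say-as`). Which elements and `interpret-as` values Nur supports is an assumption here, and the helper names are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def pause(milliseconds: int) -> str:
    """Insert a pause of the given length."""
    return f'<break time="{milliseconds}ms"/>'


def emphasize(text: str, level: str = "strong") -> str:
    """Wrap text in an emphasis element, escaping XML special characters."""
    return f"<emphasis level={quoteattr(level)}>{escape(text)}</emphasis>"


def say_as(text: str, interpret_as: str) -> str:
    """Hint how text should be read, for example letter by letter."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def speak(*parts: str) -> str:
    """Join already-escaped parts inside the root speak element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("Welcome to the R&D update. "),  # plain text must be escaped too
    pause(300),
    escape("The "),
    say_as("API", "characters"),            # assumed interpret-as value
    escape(" is "),
    emphasize("live"),
    escape("."),
)
```

The resulting string would be sent as the `text` parameter in place of plain text.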

Cache generated audio when possible

If you generate the same text repeatedly with the same voice and settings, cache the audio URL or file locally to reduce API calls and lower latency.
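
A cache needs a key that changes whenever the text, voice, or any setting changes. One possible approach, sketched below with hypothetical helper names, is to hash a canonical JSON encoding of the request parameters and keep results in a dictionary. Here `generate` stands in for whatever function actually calls the API.

```python
import hashlib
import json
from typing import Any, Callable


def audio_cache_key(text: str, voice_id: str, **settings: Any) -> str:
    """Derive a stable key from the text, voice, and generation settings."""
    payload = {"text": text, "voice_id": voice_id, **settings}
    # Sorted keys and fixed separators make the encoding order-independent.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_generate(cache: dict, generate: Callable[..., Any],
                    text: str, voice_id: str, **settings: Any) -> Any:
    """Call generate only when no cached result exists for these parameters."""
    key = audio_cache_key(text, voice_id, **settings)
    if key not in cache:
        cache[key] = generate(text=text, voice_id=voice_id, **settings)
    return cache[key]


calls = []


def fake_generate(**params: Any) -> str:
    """Stand-in for a real API call; records each invocation."""
    calls.append(params)
    return f"audio-{len(calls)}"


cache: dict = {}
first = cached_generate(cache, fake_generate, "Hello", "rachel_v2", speed=1.0)
second = cached_generate(cache, fake_generate, "Hello", "rachel_v2", speed=1.0)
```

Note that the key treats an omitted setting and an explicitly passed default as different requests, so pass settings consistently to get cache hits.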