Speech to Text

Convert audio to text with industry-leading accuracy in 100+ languages.

Overview

The Speech to Text API provides accurate transcription for audio files and real-time streams. It supports speaker diarization, word-level timestamps, and custom vocabularies to handle domain-specific terminology.

Key Capabilities

  • Real-time streaming
  • Speaker diarization
  • 100+ languages
  • Word-level timestamps
  • Custom vocabulary
  • Punctuation & formatting

Quickstart

Transcribe an audio file in just a few lines of code.

curl -X POST https://api.nur.ai/v1/stt/transcribe \
  -H "Authorization: Bearer nur_your_api_key" \
  -F "file=@recording.mp3" \
  -F "language=en" \
  -F "punctuation=true"
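For reference, here is a minimal Python equivalent of the call above using the third-party `requests` library. The helper names are illustrative, not part of an official SDK, and sending `vocabulary` as a comma-separated string is an assumption based on the curl examples in this page.

```python
API_URL = "https://api.nur.ai/v1/stt/transcribe"

def build_form_fields(language=None, punctuation=True, vocabulary=None):
    """Assemble the non-file form fields for /v1/stt/transcribe."""
    fields = {"punctuation": "true" if punctuation else "false"}
    if language:
        fields["language"] = language
    if vocabulary:  # string[] sent as a comma-separated list (assumed)
        fields["vocabulary"] = ",".join(vocabulary)
    return fields

def transcribe_file(path, api_key, **options):
    """Upload an audio file and return the parsed JSON transcript."""
    import requests  # third-party: pip install requests

    with open(path, "rb") as audio:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": audio},
            data=build_form_fields(**options),
        )
    resp.raise_for_status()
    return resp.json()
```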

Endpoints

POST /v1/stt/transcribe

Transcribe a complete audio file. Supports MP3, WAV, FLAC, OGG, and WebM formats up to 500 MB. Returns the full transcript with optional segments, speaker labels, and word-level timestamps.

Parameters

  • file (file, required): The audio file to transcribe. Max 500 MB.
  • language (string): BCP-47 language code (e.g. "en", "ar", "de"). Auto-detected if omitted.
  • speaker_diarization (boolean): Enable speaker diarization to identify individual speakers. Default: false.
  • timestamps (boolean): Include word-level timestamps in the response. Default: false.
  • punctuation (boolean): Apply automatic punctuation and formatting. Default: true.
  • vocabulary (string[]): Custom vocabulary list for domain-specific terms (e.g. product names, acronyms).
Example request:

curl -X POST https://api.nur.ai/v1/stt/transcribe \
  -H "Authorization: Bearer nur_your_api_key" \
  -F "file=@interview.mp3" \
  -F "language=en" \
  -F "speaker_diarization=true" \
  -F "timestamps=true" \
  -F "punctuation=true" \
  -F "vocabulary=Nur,API,SDK"
Example response:

{
  "text": "Welcome to the Nur API demo. Today we will walk through the speech to text features.",
  "segments": [
    {
      "speaker": "speaker_0",
      "text": "Welcome to the Nur API demo.",
      "start": 0.0,
      "end": 2.34,
      "confidence": 0.97
    },
    {
      "speaker": "speaker_0",
      "text": "Today we will walk through the speech to text features.",
      "start": 2.56,
      "end": 5.81,
      "confidence": 0.95
    }
  ],
  "language": "en",
  "duration": 5.81
}
POST /v1/stt/stream

Open a WebSocket connection for real-time speech-to-text transcription. Send audio chunks and receive partial and final transcriptions as they are produced. Ideal for live captioning, voice assistants, and real-time note-taking.

Parameters

  • language (string): BCP-47 language code. Auto-detected if omitted.
  • sample_rate (integer): Audio sample rate in Hz. Default: 16000.
  • encoding (string): Audio encoding: "pcm16", "opus", or "flac". Default: "pcm16".
  • interim_results (boolean): Return partial results as speech is recognized. Default: true.
Example (Python SDK):

import asyncio

from nur import NurClient

client = NurClient()

async def live_transcribe():
    async with client.stt.stream(
        language="en",
        sample_rate=16000,
        encoding="pcm16",
        interim_results=True,
    ) as stream:
        # Send audio chunks from the microphone, then read results
        async for result in stream:
            if result.is_final:
                print(f"Final: {result.text}")
            else:
                print(f"Partial: {result.text}")

asyncio.run(live_transcribe())
Example transcript message:

{
  "type": "transcript",
  "is_final": true,
  "text": "Hello, how can I help you today?",
  "confidence": 0.96,
  "start": 0.0,
  "end": 1.82
}

Response Objects

The transcription response follows this schema. All times are in seconds.

{
  "text": "string",
  "segments": [
    {
      "speaker": "string",
      "text": "string",
      "start": "number",
      "end": "number",
      "confidence": "number (0-1)"
    }
  ],
  "language": "string (BCP-47 code)",
  "duration": "number (seconds)"
}
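As a worked example of consuming this schema, the sketch below turns the `segments` array into timestamped caption lines. The formatting helpers are illustrative; the sample data mirrors the response shown earlier.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as an HH:MM:SS.mmm timestamp."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

def segments_to_captions(response: dict) -> list:
    """Turn a transcription response's segments into caption lines."""
    return [
        f"[{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}] "
        f"{seg['speaker']}: {seg['text']}"
        for seg in response.get("segments", [])
    ]

response = {
    "segments": [
        {"speaker": "speaker_0", "text": "Welcome to the Nur API demo.",
         "start": 0.0, "end": 2.34, "confidence": 0.97},
    ]
}
for line in segments_to_captions(response):
    print(line)  # [00:00:00.000 --> 00:00:02.340] speaker_0: Welcome to the Nur API demo.
```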

Best Practices

Use custom vocabulary for domain-specific terms

Add product names, acronyms, and technical terms to the vocabulary parameter to improve recognition accuracy for specialized content.
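For example, a vocabulary list can be cleaned and deduplicated before it is sent (a small sketch; the comma-joined form matches the curl example above):

```python
def build_vocabulary(terms):
    """Trim, drop empties, and deduplicate terms (preserving order),
    then join them into the vocabulary form-field value."""
    seen = []
    for term in terms:
        cleaned = term.strip()
        if cleaned and cleaned not in seen:
            seen.append(cleaned)
    return ",".join(seen)

print(build_vocabulary(["Nur", "API", " SDK", "API"]))  # Nur,API,SDK
```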

Choose the right encoding for streaming

Use Opus for low-bandwidth connections and PCM16 for highest quality. Match the sample rate to your audio source to avoid resampling artifacts.
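For PCM16, chunk sizes follow directly from the sample rate: 2 bytes per sample per channel. A small helper (an illustration, not part of the SDK) for sizing, say, 100 ms chunks:

```python
def pcm16_chunk_bytes(sample_rate_hz: int, chunk_ms: int, channels: int = 1) -> int:
    """Bytes per audio chunk for 16-bit PCM (2 bytes per sample per channel)."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * 2 * channels

# 100 ms of 16 kHz mono PCM16:
print(pcm16_chunk_bytes(16000, 100))  # 3200
```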

Enable diarization only when needed

Speaker diarization adds processing time. Only enable it when you need to distinguish between multiple speakers, such as in meetings or interviews.
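When diarization is enabled, the per-segment speaker labels can be collapsed into a per-speaker transcript. A minimal sketch, using the segment fields from the response schema above (the sample data is invented):

```python
from collections import defaultdict

def transcript_by_speaker(segments):
    """Group segment text by speaker label, preserving segment order."""
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg["speaker"]].append(seg["text"])
    return {speaker: " ".join(texts) for speaker, texts in grouped.items()}

segments = [
    {"speaker": "speaker_0", "text": "Welcome to the Nur API demo."},
    {"speaker": "speaker_1", "text": "Thanks for having me."},
    {"speaker": "speaker_0", "text": "Let's get started."},
]
print(transcript_by_speaker(segments)["speaker_0"])
```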

Handle interim results for responsive UIs

Display partial transcripts immediately and replace them when final results arrive. This provides a much more responsive user experience for live captioning.
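One way to implement this replace-on-final pattern is a small display buffer that overwrites the pending partial whenever a final result arrives. This sketch uses the `is_final` and `text` fields from the streaming messages above; the class itself is illustrative.

```python
class CaptionBuffer:
    """Holds committed final text plus the latest partial transcript."""

    def __init__(self):
        self.finals = []
        self.partial = ""

    def update(self, result: dict) -> str:
        """Fold one streaming result into the buffer; return display text."""
        if result["is_final"]:
            self.finals.append(result["text"])
            self.partial = ""  # the final supersedes any pending partial
        else:
            self.partial = result["text"]  # each partial replaces the last
        return self.render()

    def render(self) -> str:
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)

buf = CaptionBuffer()
buf.update({"is_final": False, "text": "Hello, how"})
print(buf.update({"is_final": True, "text": "Hello, how can I help you today?"}))
```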