Speech to Text
Convert audio to text with industry-leading accuracy in 100+ languages.
Overview
The Speech to Text API provides accurate transcription for audio files and real-time streams. It supports speaker diarization, word-level timestamps, and custom vocabularies to handle domain-specific terminology.
Key Capabilities
- Real-time streaming
- Speaker diarization
- 100+ languages
- Word-level timestamps
- Custom vocabulary
- Punctuation & formatting
Quickstart
Transcribe an audio file in just a few lines of code.
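A minimal sketch of the upload flow using the third-party `requests` library. The base URL and the `STT_API_KEY` environment variable are placeholders, and the `text` field in the JSON response is an assumption; substitute your actual endpoint and credentials.

```python
import os

import requests  # third-party HTTP client: pip install requests

# Placeholder base URL -- replace with the real API host.
API_URL = "https://api.example.com/v1/stt/transcribe"


def build_transcribe_request(language=None):
    """Assemble the headers and form fields for a basic transcription call."""
    headers = {"Authorization": f"Bearer {os.environ.get('STT_API_KEY', '')}"}
    data = {}
    if language is not None:
        data["language"] = language  # BCP-47 code; auto-detected when omitted
    return headers, data


def transcribe(path, language=None):
    """Upload one audio file and return the transcript text."""
    headers, data = build_transcribe_request(language)
    with open(path, "rb") as f:
        resp = requests.post(API_URL, headers=headers, data=data,
                             files={"file": f})
    resp.raise_for_status()
    return resp.json()["text"]  # "text" field name is an assumption
```

Calling `transcribe("meeting.mp3", language="en")` would return the transcript string on success.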
Endpoints
/v1/stt/transcribe
Transcribe a complete audio file. Supports MP3, WAV, FLAC, OGG, and WebM formats up to 500 MB. Returns the full transcript with optional segments, speaker labels, and word-level timestamps.
| Parameter | Type | Description |
|---|---|---|
| file (required) | file | The audio file to transcribe. Max 500 MB. |
| language | string | BCP-47 language code (e.g. "en", "ar", "de"). Auto-detected if omitted. |
| speaker_diarization | boolean | Enable speaker diarization to identify individual speakers. Default: false. |
| timestamps | boolean | Include word-level timestamps in the response. Default: false. |
| punctuation | boolean | Apply automatic punctuation and formatting. Default: true. |
| vocabulary | string[] | Custom vocabulary list for domain-specific terms (e.g. product names, acronyms). |
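The optional parameters above can be combined in one request. The sketch below builds the non-file form fields; how booleans and arrays are serialized on the wire is an assumption here (booleans as `"true"`/`"false"` strings, `vocabulary` as a JSON-encoded array), so check against your client library's conventions.

```python
import json


def transcribe_options(speaker_diarization=False, timestamps=False,
                       punctuation=True, vocabulary=None, language=None):
    """Build the non-file form fields for /v1/stt/transcribe.

    Serialization format is an assumption: booleans become "true"/"false"
    strings and the vocabulary list is JSON-encoded.
    """
    fields = {
        "speaker_diarization": "true" if speaker_diarization else "false",
        "timestamps": "true" if timestamps else "false",
        "punctuation": "true" if punctuation else "false",
    }
    if language:
        fields["language"] = language
    if vocabulary:
        fields["vocabulary"] = json.dumps(vocabulary)
    return fields


# A diarized, timestamped request with domain-specific terms.
opts = transcribe_options(speaker_diarization=True, timestamps=True,
                          vocabulary=["Kubernetes", "gRPC"], language="en")
```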
/v1/stt/stream
Open a WebSocket connection for real-time speech-to-text transcription. Send audio chunks and receive partial and final transcriptions as they are produced. Ideal for live captioning, voice assistants, and real-time note-taking.
| Parameter | Type | Description |
|---|---|---|
| language | string | BCP-47 language code. Auto-detected if omitted. |
| sample_rate | integer | Audio sample rate in Hz. Default: 16000. |
| encoding | string | Audio encoding: "pcm16", "opus", or "flac". Default: "pcm16". |
| interim_results | boolean | Return partial results as speech is recognized. Default: true. |
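Before sending audio over the WebSocket, it must be split into fixed-duration chunks. A sketch of that framing for the default `pcm16` encoding, assuming raw mono little-endian PCM16 input; the 20 ms frame duration is an illustrative choice, not a documented requirement:

```python
def pcm16_frames(audio: bytes, sample_rate: int = 16000, frame_ms: int = 20):
    """Split raw mono PCM16 audio into fixed-duration frames for streaming.

    Each sample is 2 bytes, so a 20 ms frame at 16 kHz is
    16000 * 0.020 * 2 = 640 bytes. A trailing short frame is kept.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * 2
    return [audio[i:i + frame_bytes]
            for i in range(0, len(audio), frame_bytes)]


# One second of silence at 16 kHz mono -> 50 frames of 640 bytes each.
frames = pcm16_frames(b"\x00" * 32000)
```

Each frame would then be sent as a binary WebSocket message while transcription messages arrive on the same connection.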
Response Objects
Transcription responses are returned as JSON. All times are in seconds.
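As an illustration of working with a diarized response, the shape below is a hypothetical example only: the field names (`text`, `segments`, `speaker`, `start`, `end`) are assumptions, not the documented schema. The helper groups segment text by speaker label.

```python
# Hypothetical response shape -- field names are assumptions for illustration.
response = {
    "text": "Hello there. Hi, how are you?",
    "segments": [
        {"speaker": "A", "start": 0.0, "end": 1.2, "text": "Hello there."},
        {"speaker": "B", "start": 1.4, "end": 2.9, "text": "Hi, how are you?"},
    ],
}


def text_by_speaker(resp):
    """Group segment text by speaker label, preserving segment order."""
    out = {}
    for seg in resp.get("segments", []):
        out.setdefault(seg["speaker"], []).append(seg["text"])
    return {spk: " ".join(parts) for spk, parts in out.items()}


by_speaker = text_by_speaker(response)
```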
Best Practices
Use custom vocabulary for domain-specific terms
Add product names, acronyms, and technical terms to the vocabulary parameter to improve recognition accuracy for specialized content.
Choose the right encoding for streaming
Use Opus for low-bandwidth connections and PCM16 for highest quality. Match the sample rate to your audio source to avoid resampling artifacts.
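The bandwidth gap is easy to quantify. Raw PCM16 costs 16 bits per sample per channel, while Opus voice streams typically run around 24 kbit/s (an illustrative figure, not a documented one):

```python
def pcm16_bandwidth_bps(sample_rate: int, channels: int = 1) -> int:
    """Raw PCM16 bandwidth: 16 bits per sample per channel."""
    return sample_rate * channels * 16


# 16 kHz mono PCM16 needs 256 kbit/s on the wire; an Opus voice stream
# at ~24 kbit/s (assumed typical bitrate) is roughly a tenth of that.
pcm_bps = pcm16_bandwidth_bps(16000)  # 256000 bit/s
```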
Enable diarization only when needed
Speaker diarization adds processing time. Only enable it when you need to distinguish between multiple speakers, such as in meetings or interviews.
Handle interim results for responsive UIs
Display partial transcripts immediately and replace them when final results arrive. This provides a much more responsive user experience for live captioning.
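One way to sketch this replace-on-final pattern, assuming each streaming message carries `text` and `is_final` fields (assumed names): finalized text is committed, and the display string is the committed text plus the latest partial.

```python
def apply_message(committed: str, msg: dict):
    """Fold one streaming message into the transcript state.

    `committed` holds finalized text; the returned display string is what
    the UI should show. Message fields {"text": ..., "is_final": ...} are
    assumptions for illustration.
    """
    if msg["is_final"]:
        committed = (committed + " " + msg["text"]).strip()
        return committed, committed  # the partial is replaced by the final
    return committed, (committed + " " + msg["text"]).strip()


committed, display = "", ""
stream = [
    {"text": "hello", "is_final": False},
    {"text": "hello wor", "is_final": False},
    {"text": "Hello, world.", "is_final": True},
]
for msg in stream:
    committed, display = apply_message(committed, msg)
```

After the final message, the interim strings ("hello", "hello wor") have been replaced by the finalized "Hello, world.".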