Conversational AI
Build real-time voice conversations with AI — speak naturally and get instant spoken responses.
Overview
The Conversational AI API enables real-time, spoken dialogue between users and AI. Create a conversation session, open a WebSocket audio stream, and begin talking. The AI listens, understands context, and responds in natural speech with minimal latency. Conversations maintain full context across turns so the AI remembers what was said earlier. Use it to power voice assistants, customer service agents, interactive tutors, companion apps, and any experience where users need to speak with AI as naturally as they would with another person.
Key Capabilities
- Real-time dialogue
- Context memory
- Interruption handling
- Multilingual support
- Tool / function calling
- Custom personas
Quickstart
Create a conversation session and start talking to AI in just a few lines of code.
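As a starting point, here is a minimal Python sketch of assembling the request body for the create endpoint. The host URL, voice_id, and system prompt below are placeholders, and the HTTP call itself is left to your client of choice; only the payload shape follows the parameter table in this document.

```python
import json

API_BASE = "https://api.example.com"  # placeholder; substitute your API host


def build_create_session_payload(voice_id, system_prompt, language="en",
                                 tools=None, context_window=50, max_turns=None):
    """Assemble the JSON body for /v1/conversations/create.

    voice_id and system_prompt are required; the remaining fields
    mirror the documented defaults and are omitted when unset.
    """
    payload = {
        "voice_id": voice_id,
        "system_prompt": system_prompt,
        "language": language,
        "context_window": context_window,
    }
    if tools:
        payload["tools"] = tools
    if max_turns is not None:
        payload["max_turns"] = max_turns
    return json.dumps(payload)


body = build_create_session_payload(
    voice_id="voice_abc123",  # placeholder voice ID
    system_prompt="You are a friendly booking assistant. Keep answers short.",
)
# POST `body` to f"{API_BASE}/v1/conversations/create", then open a
# WebSocket to the returned session's stream endpoint to start talking.
```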
Endpoints
/v1/conversations/create
Create a new conversation session. A session holds the AI persona, voice configuration, conversation history, and any registered tools. Once created, connect to the session via WebSocket to begin real-time voice dialogue.
| Parameter | Type | Description |
|---|---|---|
| voice_id (required) | string | The voice the AI will use when speaking. See the Voices page for available options. |
| system_prompt (required) | string | Instructions that define the AI persona, tone, and behavior for the conversation. |
| language | string | Primary language for the conversation (ISO 639-1 code). Default: "en". |
| tools | array | Tool/function definitions the AI can invoke mid-conversation (e.g., booking, lookups). |
| context_window | integer | Number of previous turns to keep in context. Default: 50. |
| max_turns | integer | Maximum number of dialogue turns before the session auto-closes. Default: unlimited. |
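The exact shape of a tool definition is not specified on this page; the sketch below assumes a JSON-Schema-style function definition, as is common in function-calling APIs, and the booking tool itself is hypothetical. Adapt the field names to the real schema.

```python
# Hypothetical tool definition, assuming a JSON-Schema-style format.
# A clear description helps the AI decide when to invoke the tool
# mid-conversation (see Best Practices below).
book_table = {
    "name": "book_table",
    "description": "Reserve a restaurant table. Use when the user asks to book.",
    "parameters": {
        "type": "object",
        "properties": {
            "party_size": {"type": "integer", "description": "Number of guests"},
            "time": {"type": "string", "description": "ISO 8601 reservation time"},
        },
        "required": ["party_size", "time"],
    },
}

# Passed as the `tools` array when creating the session.
tools = [book_table]
```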
/v1/conversations/{session_id}/stream
Open a persistent WebSocket connection for real-time audio streaming on an active conversation session. Send raw audio frames from the user's microphone and receive AI-generated speech frames back. The connection supports voice activity detection, mid-utterance interruptions, and bidirectional audio flow for natural turn-taking.
| Parameter | Type | Description |
|---|---|---|
| sample_rate | integer | Audio sample rate in Hz for both input and output. Default: 16000. |
| encoding | string | Audio encoding format: "pcm16", "opus", or "mulaw". Default: "pcm16". |
| vad_threshold | number | Voice activity detection sensitivity (0.0 to 1.0). Higher values require louder speech to trigger. Default: 0.5. |
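For illustration, a small helper that builds the stream URL with the parameters above. The wss host is a placeholder, and passing these values as query parameters is an assumption; depending on the API they may instead be sent in a setup message after the socket opens.

```python
from urllib.parse import urlencode


def stream_url(session_id, sample_rate=16000, encoding="pcm16", vad_threshold=0.5):
    """Build a WebSocket URL for a session's audio stream.

    Defaults mirror the documented values. Query-string transport of
    these parameters is assumed, not confirmed by this page.
    """
    params = urlencode({
        "sample_rate": sample_rate,
        "encoding": encoding,
        "vad_threshold": vad_threshold,
    })
    return f"wss://api.example.com/v1/conversations/{session_id}/stream?{params}"


# Example: a noisier environment warrants a higher VAD threshold.
url = stream_url("sess_123", vad_threshold=0.7)
```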
/v1/conversations/{session_id}
Retrieve details and full conversation history for a session. Returns the session configuration, current status, and an ordered list of all dialogue turns including transcripts and metadata.
/v1/conversations/{session_id}
End a conversation session and release all associated resources. Any active WebSocket connections on the session will be closed immediately. The conversation history remains available for retrieval for 30 days after deletion.
Response Objects
The API uses the following core object schemas across conversation endpoints and WebSocket events.
Best Practices
Tune VAD sensitivity for your environment
In noisy environments (call centers, outdoors), raise vad_threshold toward 0.7-0.8 to avoid false activations. In quiet environments, lower it to 0.3-0.4 for more responsive detection. Test with real users to find the right balance between responsiveness and false triggers.
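The tuning advice above can be captured as starting presets; the helper below is only a sketch of that guidance, and the exact values should still be validated with real users.

```python
def pick_vad_threshold(environment):
    """Heuristic starting points for vad_threshold by environment.

    These presets restate the tuning guidance above; they are a
    starting point, not measured optima.
    """
    presets = {
        "quiet": 0.35,   # quiet rooms: more responsive detection
        "normal": 0.5,   # the documented default
        "noisy": 0.75,   # call centers, outdoors: fewer false activations
    }
    return presets[environment]
```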
Craft detailed system prompts for natural dialogue
A well-written system prompt dramatically improves conversation quality. Specify the persona, tone, response length preferences, and how the AI should handle ambiguity. For example: "Keep responses under two sentences unless the user asks for detail. Always confirm before taking actions."
Design tool calls for conversational flow
When registering tools, write clear descriptions so the AI knows when to invoke them. Keep tool execution fast (under 2 seconds) to avoid awkward pauses. If a tool takes longer, configure the AI to say a brief hold message like "Let me look that up for you" while waiting.
Handle errors and disconnections gracefully
WebSocket connections can drop due to network issues. Implement automatic reconnection with exponential backoff and resume the session using the same session_id. Store the last turn_id locally so you can detect and skip duplicate events after reconnecting.
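The reconnection pattern above can be sketched as two small pieces of pure logic: an exponential backoff schedule and a deduper keyed on the last seen turn_id. This assumes turn_ids are monotonically increasing integers, which this page does not confirm.

```python
import random


def backoff_delays(max_retries=5, base=0.5, cap=30.0, jitter=False):
    """Exponential backoff schedule (in seconds) for reconnect attempts.

    Doubles the delay each attempt, capped at `cap`; optional jitter
    spreads simultaneous reconnects from many clients.
    """
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay *= random.uniform(0.5, 1.0)
        delays.append(delay)
    return delays


class TurnDeduper:
    """Track the last turn_id seen so events replayed after a
    reconnect can be detected and skipped.

    Assumes integer turn_ids that only increase within a session.
    """

    def __init__(self):
        self.last_turn_id = -1

    def is_new(self, turn_id):
        if turn_id <= self.last_turn_id:
            return False  # duplicate delivered after reconnecting
        self.last_turn_id = turn_id
        return True
```

On disconnect, walk the schedule from `backoff_delays`, reconnect with the same session_id, and pass every incoming turn event through `TurnDeduper.is_new` before handling it.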