From 75289ac86ec7908526567e93c9d6ec7934a43d0b Mon Sep 17 00:00:00 2001
From: Rishabh Bhargava
Date: Fri, 17 Apr 2026 15:58:19 -0700
Subject: [PATCH] docs: add VAD/turn_detection params to realtime transcription endpoint

Document the Voice Activity Detection configuration for the /realtime
WebSocket endpoint:

- Add transcription_session.updated client event with turn_detection schema
- Document all 5 client-settable VAD parameters with production defaults
- Document how to disable VAD (turn_detection: null or query param none)
- Document query parameter configuration at connection time
- Document VAD on/off behavior (auto completed events vs manual commit)
- Add transcription_session.updated server confirmation event

MLE-5017

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 openapi.yaml | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/openapi.yaml b/openapi.yaml
index 3c2f527..8449b9d 100644
--- a/openapi.yaml
+++ b/openapi.yaml
@@ -6748,12 +6748,60 @@ paths:
             "audio": ""
           }
           ```
-        - `input_audio_buffer.commit`: Signal end of audio stream
+        - `input_audio_buffer.commit`: Signal end of audio stream. When VAD is enabled, the server automatically detects speech boundaries and emits `completed` events. When VAD is disabled, you must send `commit` to trigger transcription of the buffered audio.
           ```json
           {
             "type": "input_audio_buffer.commit"
           }
           ```
+        - `transcription_session.updated`: Update session configuration, including Voice Activity Detection (VAD) parameters. Send it after receiving `session.created`; it can also be sent at any time during the session to change VAD settings.
+          ```json
+          {
+            "type": "transcription_session.updated",
+            "session": {
+              "turn_detection": {
+                "type": "server_vad",
+                "threshold": 0.3,
+                "min_silence_duration_ms": 500,
+                "min_speech_duration_ms": 250,
+                "max_speech_duration_s": 5.0,
+                "speech_pad_ms": 250
+              }
+            }
+          }
+          ```
+          To disable VAD entirely (manual commit mode), set `turn_detection` to `null`:
+          ```json
+          {
+            "type": "transcription_session.updated",
+            "session": {
+              "turn_detection": null
+            }
+          }
+          ```
+
+          **Voice Activity Detection (VAD)**
+
+          VAD controls how the server automatically detects speech segments in the audio stream. When enabled (the default), the server uses Silero VAD to identify speech regions and emits transcription events as each segment completes. When disabled, you must manually call `input_audio_buffer.commit` to trigger transcription.
+
+          VAD can be configured in two ways:
+          1. **Query parameters** at connection time: `turn_detection=server_vad&threshold=0.3&min_silence_duration_ms=500`
+          2. **Session message** after connection: Send `transcription_session.updated` with a `turn_detection` object (see above)
+
+          To disable VAD at connection time, use `turn_detection=none` as a query parameter.
+
+          **VAD Parameters:**
+
+          All parameters are optional; omitted fields use their defaults.
+
+          | Parameter | Type | Default | Description |
+          |-----------|------|---------|-------------|
+          | `type` | string | `server_vad` | VAD mode. Use `server_vad` to enable, or set `turn_detection` to `null` to disable. |
+          | `threshold` | float | `0.3` | Speech probability threshold (0.0–1.0). Audio frames with probability above this value are classified as speech. Lower values detect more speech but may increase false positives. For low-SNR audio (e.g., 8kHz phone calls), values of 0.01–0.2 may work better. |
+          | `min_silence_duration_ms` | int | `500` | Minimum silence duration in milliseconds before ending a speech segment. Higher values merge nearby speech bursts into single segments. For phone calls with mid-sentence pauses, 2000–5000ms prevents over-segmentation. |
+          | `min_speech_duration_ms` | int | `250` | Minimum speech segment duration in milliseconds. Segments shorter than this are discarded. Filters out brief noise bursts or clicks. |
+          | `max_speech_duration_s` | float | `5.0` | Maximum speech segment duration in seconds. Segments longer than this are force-split at the longest internal silence gap. Useful for continuous speech without natural pauses. |
+          | `speech_pad_ms` | int | `250` | Padding in milliseconds added to the start and end of each detected segment. Prevents clipping speech edges. When padding would cause adjacent segments to overlap, the gap is split at the midpoint instead. |
 
           **Server Events:**
           - `session.created`: Initial session confirmation (sent first)
@@ -6768,6 +6816,22 @@ paths:
             }
           }
           ```
+        - `transcription_session.updated`: Confirms session configuration was applied. Sent in response to a client `transcription_session.updated` message.
+          ```json
+          {
+            "type": "transcription_session.updated",
+            "session": {
+              "turn_detection": {
+                "type": "server_vad",
+                "threshold": 0.3,
+                "min_silence_duration_ms": 500,
+                "min_speech_duration_ms": 250,
+                "max_speech_duration_s": 5.0,
+                "speech_pad_ms": 250
+              }
+            }
+          }
+          ```
         - `conversation.item.input_audio_transcription.delta`: Partial transcription results
           ```json
           {
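Note for reviewers: the client-side flow this patch documents can be sketched in Python. This is a minimal sketch, not part of the API surface — `VAD_DEFAULTS` mirrors the parameter table in the patch, the `session_update` and `realtime_url` helper names are hypothetical, and the base URL is a placeholder.

```python
import json
from urllib.parse import urlencode

# Client-settable VAD parameters and their documented defaults
# (mirrors the VAD Parameters table in the patch above).
VAD_DEFAULTS = {
    "type": "server_vad",
    "threshold": 0.3,
    "min_silence_duration_ms": 500,
    "min_speech_duration_ms": 250,
    "max_speech_duration_s": 5.0,
    "speech_pad_ms": 250,
}


def session_update(turn_detection):
    """Build the transcription_session.updated client message.

    turn_detection: a dict of VAD overrides (omitted fields fall back to
    the documented defaults), or None to disable VAD (manual commit mode).
    """
    if turn_detection is not None:
        unknown = set(turn_detection) - set(VAD_DEFAULTS)
        if unknown:
            raise ValueError(f"unknown VAD parameters: {sorted(unknown)}")
        turn_detection = {**VAD_DEFAULTS, **turn_detection}
    return json.dumps({
        "type": "transcription_session.updated",
        "session": {"turn_detection": turn_detection},
    })


def realtime_url(base, vad_enabled=True, **vad_params):
    """Build the /realtime connection URL, configuring VAD via query params."""
    query = {"turn_detection": "server_vad" if vad_enabled else "none"}
    query.update(vad_params)
    return f"{base}/realtime?{urlencode(query)}"
```

Under the flow described in the patch, a client would connect via `realtime_url(...)`, wait for `session.created`, optionally send `session_update(...)` to change VAD settings mid-session, then stream `input_audio_buffer.append` messages; with VAD disabled, it must finish with `input_audio_buffer.commit` to trigger transcription.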