Overview
This guide explains how to use Exotel’s Stream Applet (Unidirectional) and Voicebot Applet (Bidirectional) for media streaming applications. These applets only relay audio: they stream it in real time between Exotel and external bot platforms over a WebSocket-based integration. Note: Exotel does not provide inbuilt STT (Speech-to-Text) or TTS (Text-to-Speech) capabilities. You must use your own bot platform to handle media decoding and AI processing.
This article is an extension of: Working with the Stream and Voicebot Applet – Basic Setup
For stream metadata, logging, and observability, refer to the companion article: Passthru Streaming Metadata Guide
Why Choose WebSocket-Based Integration?
WebSockets are ideal for low-latency, bi-directional audio streaming. Compared to SIP or polling mechanisms, WebSocket integration offers:
Real-time bidirectional media transfer
Persistent low-latency connection
Lightweight and scalable protocol for AI integrations
Simplified firewall/NAT traversal compared to SIP RTP
This makes it the preferred method for modern bot platforms such as Dialogflow, Azure Bot Framework, Gupshup, Yellow.ai, Haptik, Google CCAI, Amazon Lex, and in-house NLU/LLM-based bots that support real-time WebSocket-based audio streaming.
Key Technical Concepts from the Core Stream/Voicebot Applet Article
To ensure compatibility with the base setup, this section re-emphasises the following from the core guide:
WebSocket Endpoint URL: Must be publicly reachable, use the ws:// or wss:// scheme, and exchange audio as base64-encoded frames.
Maximum Connection Time: Streaming can last up to 60 minutes per session (check limits based on plan).
Timeout Handling: If the bot server does not respond within 10 seconds, the session will fail.
Connection Retry: Exotel will attempt one automatic retry if the WebSocket handshake fails. Ensure redundancy in bot endpoints.
Security: For wss:// endpoints, ensure valid TLS certificates are installed. Self-signed certs may be rejected.
Payload Format: Base64-decode each payload to obtain Linear PCM audio (raw/slin, 16-bit, 8 kHz, mono) before feeding it to your STT engine.
These constraints and protocols apply to both Stream and Voicebot applets and should be adhered to during bot deployment.
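As a rough illustration of the payload format described above, the following Python sketch decodes one base64 payload into signed 16-bit samples. The function names are hypothetical, and little-endian byte order is assumed from the s16le format named later in this guide.

```python
import base64
import struct

SAMPLE_RATE_HZ = 8000   # 8 kHz mono, as stated in this guide
BYTES_PER_SAMPLE = 2    # 16-bit Linear PCM


def decode_pcm_payload(payload_b64: str) -> list[int]:
    """Decode a base64 payload into signed 16-bit little-endian samples."""
    raw = base64.b64decode(payload_b64)
    if len(raw) % BYTES_PER_SAMPLE:
        raise ValueError("PCM payload ends with a partial sample")
    count = len(raw) // BYTES_PER_SAMPLE
    return list(struct.unpack(f"<{count}h", raw))


def duration_ms(samples: list[int]) -> float:
    """Playback duration of the decoded samples in milliseconds."""
    return 1000.0 * len(samples) / SAMPLE_RATE_HZ
```

Most STT SDKs accept the raw bytes directly, so the unpacking step is mainly useful for inspection, level metering, or silence detection.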
Applet Differences
Note: Exotel only handles the media relay. The bot must provide transcription, NLU, TTS, etc.
Events Sent Over WebSocket
Each applet emits events during the call session. The communication over the WebSocket is bi-directional, with distinct roles for Exotel and the bot:
Exotel to Bot: Transmits session events (Connected, Start, Media, DTMF, Stop, Clear) and the audio stream in base64-encoded chunks.
Bot to Exotel: Returns media (for Voicebot Applet only) and may trigger control markers (e.g., Mark). The bot must gracefully handle session events and terminate the WebSocket connection when the bot conversation ends. Ending the WSS connection triggers Exotel to move to the next applet. There is no explicit Stop event sent from the bot to Exotel.
Each event type Exotel emits over the WebSocket serves a distinct purpose in orchestrating the media stream and enabling seamless bot interaction. Understanding these events allows for optimised bot design, real-time feedback loops, and intelligent call flow transitions.
Best Practices & Use Cases per Event:
Connected: Confirms WebSocket handshake. Use this to initialise your bot pipeline (e.g., STT/TTS service initialization, session state allocation).
Best Practice: Log and correlate with call_sid for session tracing.
Use Case: Trigger bot intro prompt preparation.
Start: Indicates that audio streaming is beginning.
Best Practice: Start buffering/streaming audio to the STT engine.
Use Case: Sync with a user-facing prompt like "How can I help you today?"
Media: Continuous chunks of base64-encoded PCM audio.
Best Practice: Ensure your STT engine handles 100ms PCM blocks efficiently.
Use Case: Real-time speech transcription or voice intent detection.
DTMF: Keypress detection on the caller's side.
Best Practice: Use to branch hybrid IVR+Bot logic without STT latency.
Use Case: Press 1 to speak to agent → Exotel hears DTMF → triggers escalation.
Mark: Bot-sent event to indicate a logical milestone.
Best Practice: Use to sync analytics or inject debugging hooks.
Use Case: Mark when the bot completes a sub-flow (e.g., "address collected").
Stop: Triggered by Exotel when the customer's leg is disconnected.
Use Case: Detect that the caller has disconnected so the bot can finalise the session (e.g., save the transcript or trigger routing).
Clear: Resets session context mid-call. Voicebot sends this event to indicate that the current conversational context should be flushed and re-initialized. This is useful when the bot needs to start a fresh session mid-call, such as when the caller says "start over" or the previous context is corrupted.
Best Practice: When a Clear event is received, wipe the bot's session memory (e.g., after context drops) and re-establish a clean state.
Use Case: Caller says "start over" → Clear is issued → bot resets all entities and replays the welcome prompt.
Supported Chunk Size: Ensure the bot handles audio chunks of at least 100 ms (approx. 3.2 KB base64 payload per frame) for seamless reset and session realignment during mid-call context switches.
These events should be monitored in real time and mapped to your backend decision logic and conversation orchestration flow.
Connected: Indicates the WebSocket connection is successfully established. (Initialize bot session, allocate STT/TTS/LLM resources)
Start: Voice media stream is starting. (Start decoding audio for transcription or detection)
Media: Base64-encoded PCM audio payload. (Feed to STT or analytics pipeline)
DTMF: DTMF tones detected. (Capture digit input in IVR scenarios)
Mark: Developer-defined markers. (Sync checkpoints in bot logic)
Stop: Media streaming session ends. (Escalate, save transcript, or trigger routing)
Clear: Reset session context. (Re-initiate bot logic mid-call)
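The event handling summarised above could be sketched as a small dispatcher. This is only an illustration under stated assumptions: it assumes each WebSocket message is a JSON text frame with an "event" field, that event names can be compared case-insensitively, and that audio and digits arrive under hypothetical "media.payload" and "dtmf.digit" keys. Check the core guide for the actual message schema before relying on any of these names.

```python
import base64
import json


class BotSession:
    """Minimal per-call state driven by stream events (hypothetical schema)."""

    def __init__(self) -> None:
        self.audio = bytearray()      # decoded PCM waiting for the STT engine
        self.digits: list[str] = []   # DTMF keypresses seen so far
        self.streaming = False
        self.finished = False

    def handle(self, message: str) -> str:
        """Apply one JSON event and return the name of the event handled."""
        msg = json.loads(message)
        event = str(msg.get("event", "")).lower()
        if event == "connected":
            pass                                   # allocate STT/TTS/LLM resources here
        elif event == "start":
            self.streaming = True                  # begin feeding audio to STT
        elif event == "media":
            self.audio.extend(base64.b64decode(msg["media"]["payload"]))
        elif event == "dtmf":
            self.digits.append(str(msg["dtmf"]["digit"]))
        elif event == "clear":
            self.audio.clear()                     # flush context and start fresh
            self.digits.clear()
        elif event == "stop":
            self.streaming = False                 # caller leg ended; finalise session
            self.finished = True
        else:
            return "ignored"                       # unknown events are not fatal
        return event
```

In a real server, each text frame received on the WebSocket would be passed to handle(), and the bot would close the connection itself once the conversation is complete.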
Streaming Audio Format
Codec: 16-bit Linear PCM (s16le)
Sample Rate: 8000 Hz
Channels: Mono
Encoding: base64 in WebSocket frames
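To size buffers for this format, it can help to work out how many bytes a chunk of a given duration occupies. The sketch below uses only the values listed above (16-bit, 8000 Hz, mono) plus the standard base64 expansion of 4 output characters per 3 input bytes; the function names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 2   # 16-bit
CHANNELS = 1           # mono


def raw_bytes(duration_ms: int) -> int:
    """Raw PCM bytes needed for a chunk of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * duration_ms // 1000


def base64_chars(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes."""
    return 4 * math.ceil(n_bytes / 3)


print(raw_bytes(100))                 # 1600
print(base64_chars(raw_bytes(100)))   # 2136
print(raw_bytes(200))                 # 3200
```

By this arithmetic, 100 ms of audio is 1,600 raw bytes (about 2.1 KB once base64-encoded), while 3,200 raw bytes corresponds to 200 ms. It is therefore worth confirming against live frames which quantity the 3.2 KB figure quoted elsewhere in this guide refers to.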
Dynamic URL and Custom Parameters
You can configure a dynamic WebSocket URL using placeholders and custom parameters.
Custom Parameter Rules:
Max 3 parameters
Total param length (after ?) must not exceed 256 characters
Sample:
ws://127.0.0.1:5001/media?param1=value1&param2=value2&param3=value3
A URL resolved dynamically through an HTTP(S) endpoint must also be a valid ws:// or wss:// URL in this format.
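If your service generates the dynamic URL, the parameter rules above can be checked before the URL is returned. The following is a minimal sketch with a hypothetical helper name; it assumes the 256-character limit is measured on the URL-encoded query string, which the rules above do not specify.

```python
from urllib.parse import urlencode, urlsplit

MAX_PARAMS = 3           # "Max 3 parameters"
MAX_QUERY_LENGTH = 256   # total length after "?"


def build_stream_url(base_url: str, params: dict[str, str]) -> str:
    """Return base_url with params appended, enforcing the documented limits."""
    scheme = urlsplit(base_url).scheme
    if scheme not in ("ws", "wss"):
        raise ValueError(f"expected a ws:// or wss:// URL, got scheme {scheme!r}")
    if len(params) > MAX_PARAMS:
        raise ValueError(f"at most {MAX_PARAMS} custom parameters are allowed")
    query = urlencode(params)   # percent-encodes values, joins with "&"
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(f"query string exceeds {MAX_QUERY_LENGTH} characters")
    return f"{base_url}?{query}" if query else base_url
```

For example, build_stream_url("ws://127.0.0.1:5001/media", {"param1": "value1", "param2": "value2", "param3": "value3"}) reproduces the sample URL shown above.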
Recording (Optional)
Recordings, if enabled, are returned via Passthru as a RecordingUrl.
Use Cases:
Train STT/LLM models
Compliance and QA reviews
Escalation audits
Voice sentiment labeling datasets
Routing to Agent or Contact Center After Voicebot Applet
When the WebSocket connection is closed, whether because the bot disconnects after completing its interaction or because of a network-level termination, Exotel automatically moves to the next applet in the flow. There is no explicit Stop event sent by the bot to Exotel. Instead, the bot must close the WebSocket session once the conversation ends.
Upon disconnection, Exotel internally emits a Stop event and transitions to the next applet, typically a Passthru Applet. This Passthru makes a GET request to your endpoint, and based on your response (e.g., escalate=true), Exotel decides how to route the call.
Scenarios:
Connect Applet → Route to Exotel agent
SIP Connect via vSIP Trunk → Route to enterprise contact center
Hangup Applet → Gracefully end the call
Example:
User: "Talk to human" → Bot finishes → WSS disconnects → Exotel emits Stop → Passthru GET call → Response: escalate=true → SwitchCase → vSIP Trunk over Connect applet
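The escalation flow in this example could be backed by logic along the following lines: the bot records an outcome just before closing the WebSocket, and the Passthru endpoint looks it up when Exotel makes its GET request. This is a sketch only. The in-memory store, the function names, and the plain "escalate=true" response body are assumptions modelled on the example above, not a documented Exotel response format.

```python
# Hypothetical in-memory store keyed by call SID; a real deployment would
# more likely use a shared cache or database.
OUTCOMES: dict[str, dict[str, bool]] = {}


def record_outcome(call_sid: str, requested_agent: bool = False,
                   bot_error: bool = False) -> None:
    """Called by the bot just before it closes the WebSocket."""
    OUTCOMES[call_sid] = {
        "requested_agent": requested_agent,
        "bot_error": bot_error,
    }


def passthru_body(call_sid: str) -> str:
    """Build the body returned to the Passthru GET request for this call."""
    outcome = OUTCOMES.get(call_sid)
    if outcome is None:
        # No outcome recorded (e.g., network-level termination): this sketch
        # chooses to escalate so the caller is not dropped silently.
        return "escalate=true"
    escalate = outcome["requested_agent"] or outcome["bot_error"]
    return "escalate=true" if escalate else "escalate=false"
```

Defaulting to escalation when no outcome was recorded is a design choice in this sketch; a flow that should end the call instead can return the non-escalating value there.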
Passthru Integration
Always place a Passthru Applet immediately after Voicebot/Stream Applet to:
Fetch session metadata
Log streaming stats (SID, duration, Recording URL)
Detect disconnects
Read escalation flags from response
Best Practices
Place Passthru immediately after streaming applet
Use Clear/Mark events for context handling and observability
Use Active Streams API for concurrency limits
Design Passthru logic to interpret escalation/disconnect
Follow WebSocket timeout, reconnect, and handshake guidelines strictly
Keep custom params concise and secure
Ensure the bot closes the WebSocket connection to end the stream gracefully (there is no bot-sent Stop event)
Advanced Use Cases
STT pipeline using OpenAI Whisper, Google STT
Bot+LLM conversations via voice
IVR replacement with NLP flows
DTMF + Speech combo journeys
Hybrid fallback to live agents via SIP trunk
Summary
Exotel’s Voicebot and Stream Applets power modern voice automation by offering developer-first WebSocket-based audio streaming infrastructure. These applets act as programmable media bridges, streaming audio in real time between Exotel’s telephony platform and any compliant bot platform capable of understanding linear PCM data.
By adopting this architecture, enterprises gain:
Low-latency streaming that supports responsive AI-driven conversations
Vendor-neutral integration with STT, TTS, LLMs, and custom NLP systems
Dynamic routing and escalation via Passthru Applets and intelligent fallback logic
Secure, observable, and resilient media sessions with active stream monitoring
Enterprise-grade call flow design that integrates Exotel CC, SIP trunks, and external agents
This document is a production-ready extension to the Stream/Voicebot Applet Basic Guide and complements the Passthru Streaming Metadata Guide.
For deployment, ensure compliance with session lifecycle best practices, chunk size handling (100ms), parameter limits, recording strategies, and escalation routing logic through WebSocket termination events.