Overview

This guide explains how to use Exotel’s Stream Applet (unidirectional) and Voicebot Applet (bidirectional) for media streaming applications. These applets stream audio only, in real time, between Exotel and external bot platforms over a WebSocket-based integration. Note: Exotel does not provide built-in STT (Speech-to-Text) or TTS (Text-to-Speech) capabilities. You must use your own bot platform to handle media decoding and AI processing.

This article is an extension of: Working with the Stream and Voicebot Applet – Basic Setup

For stream metadata, logging, and observability, refer to the companion article, the Passthru Streaming Metadata Guide.


Why Choose WebSocket-Based Integration?

WebSockets are ideal for low-latency, bi-directional audio streaming. Compared to SIP or polling mechanisms, WebSocket integration offers:

  • Real-time bidirectional media transfer

  • Persistent low-latency connection

  • Lightweight and scalable protocol for AI integrations

  • Simplified firewall/NAT traversal compared to SIP RTP

This makes it the preferred method for modern bot platforms such as Dialogflow, Azure Bot Framework, Gupshup, Yellow.ai, Haptik, Google CCAI, Amazon Lex, and in-house NLU/LLM-based bots that support real-time WebSocket-based audio streaming.

Key Technical Concepts from the Core Stream/Voicebot Applet Article

To ensure compatibility with the base setup, this section re-emphasises the following from the core guide:

  • WebSocket Endpoint URL must be publicly reachable and must support ws:// or wss:// protocol with base64-encoded audio frames.

  • Maximum Connection Time: Streaming can last up to 60 minutes per session (check limits based on plan).

  • Timeout Handling: If the bot server does not respond within 10 seconds, the session will fail.

  • Connection Retry: Exotel will attempt one automatic retry if the WebSocket handshake fails. Ensure redundancy in bot endpoints.

  • Security: For wss:// endpoints, ensure valid TLS certificates are installed. Self-signed certs may be rejected.

  • Payload Format: Base64-decode each payload to obtain Linear PCM audio (raw/slin, 16-bit, 8 kHz, mono) before feeding it to your STT engine.

These constraints and protocols apply to both Stream and Voicebot applets and should be adhered to during bot deployment.
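The endpoint rules above lend themselves to a quick pre-deployment check. The following is a minimal sketch, not an Exotel tool: the function name `check_endpoint` and the loopback test are illustrative assumptions, and it only inspects the URL string (it does not verify TLS certificates or actual reachability).

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"ws", "wss"}
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def check_endpoint(url: str) -> list:
    """Return a list of problems found in a bot endpoint URL (empty if none).

    Hypothetical helper: checks only what can be read from the URL itself.
    """
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        problems.append(f"scheme must be ws or wss, got {parsed.scheme!r}")
    if not parsed.hostname:
        problems.append("URL has no host")
    elif parsed.hostname in LOOPBACK_HOSTS:
        # A loopback address cannot be reached from outside the machine.
        problems.append("loopback hosts are not publicly reachable")
    return problems
```

For example, `check_endpoint("wss://bot.example.com/media")` reports no problems, while a loopback or `http://` URL is flagged. The host name `bot.example.com` is a placeholder.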

Applet Differences

Applet          | Streaming Direction | Primary Use Case
Stream Applet   | Unidirectional      | Transcribe user audio, e.g., STT
Voicebot Applet | Bidirectional       | Full voice AI interaction (bot speaks + listens)


Note: Exotel only handles the media relay. The bot must provide transcription, NLU, TTS, etc.

Events Sent Over WebSocket

Each applet emits events during the call session. The communication over the WebSocket is bi-directional, with distinct roles for Exotel and the bot:

  • Exotel to Bot: Transmits session events (Connected, Start, Media, DTMF, Stop, Clear) and the audio stream in base64-encoded chunks.

  • Bot to Exotel: Returns media (for Voicebot Applet only) and may trigger control markers (e.g., Mark). The bot must gracefully handle session events and terminate the WebSocket connection when the bot conversation ends. Ending the WSS connection triggers Exotel to move to the next applet. There is no explicit Stop event sent from the bot to Exotel.

Each event type Exotel emits over the WebSocket serves a distinct purpose in orchestrating the media stream and enabling seamless bot interaction. Understanding these events allows for optimised bot design, real-time feedback loops, and intelligent call flow transitions.
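For the Voicebot Applet, returning audio to Exotel means splitting the bot's PCM output into chunks, base64-encoding each one, and sending it as a WebSocket message. The sketch below illustrates that framing only. The JSON field names (`event`, `stream_sid`, `media.payload`) are assumptions modelled on common media-streaming protocols and are not confirmed by this article; check them against the message schema in the core guide. The chunk size is derived from 16-bit, 8 kHz mono PCM.

```python
import base64
import json

SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 2  # 16-bit linear PCM, mono


def frame_audio(pcm: bytes, stream_sid: str, chunk_ms: int = 100) -> list:
    """Split raw PCM into chunk_ms-sized pieces and wrap each as a JSON message.

    Field names in the message are assumed, not taken from Exotel's schema.
    """
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    messages = []
    for offset in range(0, len(pcm), chunk_bytes):
        chunk = pcm[offset:offset + chunk_bytes]  # the last chunk may be shorter
        messages.append(json.dumps({
            "event": "media",
            "stream_sid": stream_sid,
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        }))
    return messages
```

Each returned string would be sent as one WebSocket text frame. Once the conversation is over, the bot closes the WebSocket connection rather than sending a final event. If the platform requires full-length chunks, the last chunk may need padding with silence.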

Best Practices & Use Cases per Event:

  • Connected: Confirms WebSocket handshake. Use this to initialise your bot pipeline (e.g., STT/TTS service initialization, session state allocation).

    • Best Practice: Log and correlate with call_sid for session tracing.
    • Use Case: Trigger bot intro prompt preparation.
  • Start: Indicates that audio streaming is beginning.

    • Best Practice: Start buffering/streaming audio to the STT engine.

    • Use Case: Sync with a user-facing prompt like "How can I help you today?"

  • Media: Continuous chunks of base64-encoded PCM audio.

    • Best Practice: Ensure your STT engine handles 100ms PCM blocks efficiently.

    • Use Case: Real-time speech transcription or voice intent detection.

  • DTMF: Keypress detection on the caller's side.

    • Best Practice: Use to branch hybrid IVR+Bot logic without STT latency.

    • Use Case: Press 1 to speak to agent → Exotel hears DTMF → triggers escalation.

  • Mark: Bot-sent event to indicate a logical milestone.

    • Best Practice: Use to sync analytics or inject debugging hooks.

    • Use Case: Mark when the bot completes a sub-flow (e.g., "address collected").

  • Stop: Triggered by Exotel when the customer's leg is disconnected.

    • Use Case: To identify when the customer's leg has disconnected. 

  • Clear: Resets session context mid-call. Sent by Voicebot. The Voicebot sends this event to indicate that the current conversational context should be flushed and re-initialized. This is useful when the bot needs to start a fresh session mid-call, such as when the caller says “start over” or the previous context is corrupted.

    • Best Practice: Ensure your bot wipes session memory (e.g., after context drops) and re-establishes a clean state when a Clear event is received.
    • Use Case: Caller says "start over" → bot requests Clear → bot resets all entities and replays the welcome prompt.
    • Supported Chunk Size: Ensure the bot handles minimum 100ms audio chunks (approx. 3.2 KB base64 payload per frame) for seamless reset and session realignment during mid-call context switches.


These events should be monitored in real time and mapped to your backend decision logic and conversation orchestration flow.

  1. Connected: Indicates the WebSocket connection is successfully established. (Initialize bot session, allocate STT/TTS/LLM resources)

  2. Start: Voice media stream is starting. (Start decoding audio for transcription or detection)

  3. Media: Base64-encoded PCM audio payload. (Feed to STT or analytics pipeline)

  4. DTMF: DTMF tones detected. (Capture digit input in IVR scenarios)

  5. Mark: Developer-defined markers. (Sync checkpoints in bot logic)

  6. Stop: Media streaming session ends. (Escalate, save transcript, or trigger routing)

  7. Clear: Reset session context. (Re-initiate bot logic mid-call)
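The event list above maps naturally onto a small dispatcher that updates per-call state as messages arrive. The following is a minimal sketch under stated assumptions: the JSON layout (`event`, `media.payload`, `dtmf.digit`, `mark.name`) and the `Session` class are hypothetical and should be aligned with the message schema in the core guide. It is not a full bot server; it shows only how each event could be routed.

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class Session:
    """Per-call state kept by the bot (hypothetical structure)."""
    connected: bool = False
    streaming: bool = False
    audio: bytearray = field(default_factory=bytearray)
    digits: list = field(default_factory=list)
    marks: list = field(default_factory=list)


def handle_message(session: Session, raw: str) -> str:
    """Apply one incoming WebSocket text message; return the event name handled."""
    msg = json.loads(raw)
    event = str(msg.get("event", "")).lower()
    if event == "connected":
        session.connected = True          # allocate STT/TTS/LLM resources here
    elif event == "start":
        session.streaming = True          # begin feeding audio to STT
    elif event == "media":
        session.audio += base64.b64decode(msg["media"]["payload"])
    elif event == "dtmf":
        session.digits.append(msg["dtmf"]["digit"])
    elif event == "mark":
        session.marks.append(msg["mark"]["name"])
    elif event == "clear":
        # Wipe conversational state and start from a clean slate.
        session.audio.clear()
        session.digits.clear()
        session.marks.clear()
    elif event == "stop":
        session.streaming = False         # save transcript, trigger routing
    else:
        event = "unknown"                 # log and ignore unrecognised events
    return event
```

In a real server, `handle_message` would be called once per received text frame, with one `Session` per WebSocket connection.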


Streaming Audio Format

  • Codec: 16-bit Linear PCM (s16le)

  • Sample Rate: 8000 Hz

  • Channels: Mono

  • Encoding: base64 in WebSocket frames
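As a worked illustration of the format above, the sketch below computes how many raw PCM bytes a given duration occupies and decodes one base64 payload into signed 16-bit samples. The function names are illustrative. The arithmetic follows directly from the listed parameters (8000 samples per second, 2 bytes per sample, 1 channel); the frame sizes actually seen on the wire should be confirmed against live traffic.

```python
import base64
import struct

SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 2  # 16-bit
CHANNELS = 1          # mono


def pcm_bytes_for(duration_ms: int) -> int:
    """Raw PCM byte count (before base64) for the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * duration_ms // 1000


def decode_payload(payload_b64: str) -> tuple:
    """Decode a base64 payload into little-endian signed 16-bit samples."""
    raw = base64.b64decode(payload_b64)
    if len(raw) % BYTES_PER_SAMPLE:
        raise ValueError("payload is not a whole number of 16-bit samples")
    count = len(raw) // BYTES_PER_SAMPLE
    return struct.unpack(f"<{count}h", raw)
```

The resulting tuple of integers can be handed to an STT client or written to a WAV file with the standard `wave` module.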


Dynamic URL and Custom Parameters


You can configure a dynamic WebSocket URL using placeholders and custom parameters.


Custom Parameter Rules:

  • Max 3 parameters

  • Total param length (after ?) must not exceed 256 characters

  • Sample:

ws://127.0.0.1:5001/media?param1=value1&param2=value2&param3=value3

  • When the URL is resolved dynamically by an HTTP(S) endpoint, that endpoint must also return a valid ws:// or wss:// URL in this format.
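The parameter rules above can be checked before a URL is configured or returned by a dynamic endpoint. This is a minimal sketch; the function name is illustrative, and it assumes the 256-character limit applies to the raw query string exactly as it appears after the `?`.

```python
from urllib.parse import parse_qsl, urlsplit

MAX_PARAMS = 3
MAX_QUERY_LENGTH = 256


def validate_custom_params(url: str) -> list:
    """Return a list of rule violations for a stream URL (empty if it passes)."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("ws", "wss"):
        problems.append("URL must start with ws:// or wss://")
    # Assumption: the limit counts the query string as written, not decoded.
    if len(parts.query) > MAX_QUERY_LENGTH:
        problems.append(f"query is {len(parts.query)} characters, limit is {MAX_QUERY_LENGTH}")
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if len(pairs) > MAX_PARAMS:
        problems.append(f"{len(pairs)} parameters given, limit is {MAX_PARAMS}")
    return problems
```

The sample URL shown above passes this check; adding a fourth parameter or an overly long value produces a violation message.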

Recording (Optional)

Recordings, if enabled, are returned via Passthru as a RecordingUrl.

Use Cases:

  • Train STT/LLM models

  • Compliance and QA reviews

  • Escalation audits

  • Voice sentiment labeling datasets

Routing to Agent or Contact Center After Voicebot Applet

When the WebSocket connection is closed—either due to the bot disconnecting after completing its interaction or a network-level termination—Exotel automatically moves to the next applet in the flow. There is no explicit Stop event sent by the bot to Exotel. Instead, the bot must close the WebSocket session once the conversation ends.


Upon disconnection, Exotel internally emits a Stop event and transitions to the next applet, typically a Passthru Applet. This Passthru makes a GET request to your endpoint, and based on your response (e.g., escalate=true), Exotel decides how to route the call.

Scenarios:

  • Connect Applet → Route to Exotel agent

  • SIP Connect via vSIP Trunk → Route to enterprise contact center

  • Hangup Applet → Gracefully end the call

Example:

User: "Talk to human" → Bot finishes → WSS disconnects → Exotel emits Stop → Passthru GET call → Response: escalate=true → SwitchCase → vSIP Trunk over Connect applet
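The escalation flow above can be sketched as a small decision function that a Passthru endpoint might call. Everything here is an assumption for illustration: the in-memory `ESCALATION_REQUESTS` store, the `CallSid` query parameter name, and the `escalate=true` response body (which mirrors the example in this article). How Exotel actually interprets the Passthru response should be confirmed in the Passthru documentation.

```python
from urllib.parse import parse_qs

# Hypothetical store: call IDs for which the bot decided a human is needed.
# A production system would use a shared store such as a database or cache.
ESCALATION_REQUESTS = set()


def record_escalation(call_sid: str) -> None:
    """Called by the bot before it closes the WebSocket, e.g. on 'talk to human'."""
    ESCALATION_REQUESTS.add(call_sid)


def passthru_response(query_string: str) -> str:
    """Build the body for the Passthru GET request from its query string.

    Assumes the call identifier arrives as a 'CallSid' parameter.
    """
    params = parse_qs(query_string)
    call_sid = params.get("CallSid", [""])[0]
    escalate = call_sid in ESCALATION_REQUESTS
    return f"escalate={'true' if escalate else 'false'}"
```

The bot records its decision while the stream is live, then closes the WebSocket; the Passthru request that follows reads that decision and reports it so the flow can branch.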

Passthru Integration

Always place a Passthru Applet immediately after Voicebot/Stream Applet to:

  • Fetch session metadata

  • Log streaming stats (SID, duration, Recording URL)

  • Detect disconnects

  • Read escalation flags from response


Best Practices

  • Place Passthru immediately after streaming applet

  • Use Clear/Mark events for context handling and observability

  • Use Active Streams API for concurrency limits

  • Design Passthru logic to interpret escalation/disconnect

  • Follow WebSocket timeout, reconnect, and handshake guidelines strictly

  • Keep custom params concise and secure

  • Ensure the bot closes the WebSocket connection to end the stream gracefully (there is no bot-sent Stop event)

Advanced Use Cases

  • STT pipeline using OpenAI Whisper, Google STT

  • Bot+LLM conversations via voice

  • IVR replacement with NLP flows

  • DTMF + Speech combo journeys

  • Hybrid fallback to live agents via SIP trunk


Summary

Exotel’s Voicebot and Stream Applets power modern voice automation by offering developer-first WebSocket-based audio streaming infrastructure. These applets act as programmable media bridges, streaming audio in real time between Exotel’s telephony platform and any compliant bot platform capable of understanding linear PCM data.

By adopting this architecture, enterprises gain:

  • Low-latency streaming that supports responsive AI-driven conversations

  • Vendor-neutral integration with STT, TTS, LLMs, and custom NLP systems

  • Dynamic routing and escalation via Passthru Applets and intelligent fallback logic

  • Secure, observable, and resilient media sessions with active stream monitoring

  • Enterprise-grade call flow design that integrates Exotel CC, SIP trunks, and external agents

This document is a production-ready extension to the Stream/Voicebot Applet Basic Guide and complements the Passthru Streaming Metadata Guide.

For deployment, ensure compliance with session lifecycle best practices, chunk size handling (100ms), parameter limits, recording strategies, and escalation routing logic through WebSocket termination events.