How to convert text to speech via websocket and save to mp3
Install `elevenlabs-latency` via npm and follow the instructions here.
Create a `.env` file in your project directory and fill it with your credentials like so:
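A minimal sketch of the `.env` contents, assuming the script reads an `ELEVENLABS_API_KEY` variable and a `VOICE_ID` variable; adjust the names to whatever your script expects:

```
ELEVENLABS_API_KEY=your_elevenlabs_api_key
VOICE_ID=your_voice_id
```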
Create a new file called `text-to-speech-websocket.py` for Python or `text-to-speech-websocket.ts` for TypeScript.
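Below is a minimal Python sketch of the websocket flow, assuming the `websockets` and `python-dotenv` packages and the `stream-input` endpoint; treat the exact URL, model id, and message fields as assumptions to verify against the API reference:

```python
# text-to-speech-websocket.py (minimal sketch, not the full tutorial script)
import asyncio
import base64
import json
import os

import websockets
from dotenv import load_dotenv

load_dotenv()
API_KEY = os.getenv("ELEVENLABS_API_KEY")  # assumed variable names, see .env above
VOICE_ID = os.getenv("VOICE_ID")

# Assumed endpoint and model id; check the API reference for current values.
URI = f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input?model_id=eleven_turbo_v2"

TEXT = "The first move is what sets everything in motion. "


async def text_to_speech() -> None:
    os.makedirs("output", exist_ok=True)
    async with websockets.connect(URI) as ws:
        # Initial message: a single space, voice settings, and the API key.
        await ws.send(json.dumps({
            "text": " ",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
            "xi_api_key": API_KEY,
        }))
        # Stream the text to convert.
        await ws.send(json.dumps({"text": TEXT}))
        # An empty string tells the server we are done; it also forces
        # generation of any buffered text.
        await ws.send(json.dumps({"text": ""}))

        # Each response may carry a base64-encoded MP3 chunk; append them to a file.
        with open("output/test.mp3", "wb") as f:
            async for message in ws:
                data = json.loads(message)
                if data.get("audio"):
                    f.write(base64.b64decode(data["audio"]))
                if data.get("isFinal"):
                    break


if __name__ == "__main__":
    asyncio.run(text_to_speech())
```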
The generated audio will be saved as an MP3 file in the `output` directory.
You can send `flush=true` to clear out the buffer and force the generation of any buffered text. This can be useful, for example, when you have reached the end of a document and want to generate audio for the final section. In addition, closing the websocket will automatically force the generation of any buffered text.
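As a sketch, assuming the message format shown earlier, a flush can be sent together with the final piece of text:

```python
import json

# Sketch: send the last piece of text together with flush=true so the server
# generates audio for it immediately instead of waiting for more input.
async def send_final_text(ws, text: str) -> None:
    await ws.send(json.dumps({"text": text, "flush": True}))
```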
How much text is buffered before audio is generated is controlled by `chunk_length_schedule` in `generation_config`. Avoid using `try_trigger_generation`, as it is deprecated.
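As a sketch, the schedule is set in the initial websocket message; the values below are illustrative, not a recommendation:

```python
import json

# Sketch: configure buffering thresholds in the initial message.
# Smaller values mean earlier, shorter generations (lower latency);
# larger values favor quality.
initial_message = {
    "text": " ",
    "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
    "generation_config": {"chunk_length_schedule": [120, 160, 250, 290]},
    "xi_api_key": "YOUR_API_KEY",
}
# await ws.send(json.dumps(initial_message))
```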
Send `flush=true` along with the text at the end of a conversation turn to ensure timely audio generation.

You can also reduce latency by lowering the values in `chunk_length_schedule`. However, be mindful that reducing latency through this adjustment may come at the expense of quality.

Instead of adjusting `chunk_length_schedule`, you can use `flush=true` to clear out the buffer and force the generation of any buffered text.
To keep the connection open during periods of inactivity, you can send a single space character, " ". Please note that this string must include a space, as sending a fully empty string, "", will close the websocket.
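A sketch of a keep-alive loop, assuming it runs as a background task alongside your own sending logic (the interval is arbitrary):

```python
import asyncio
import json

# Sketch: periodically send a single space so the connection is not closed
# for inactivity. An empty string would instead end the stream.
async def keep_alive(ws, interval_seconds: float = 15.0) -> None:
    while True:
        await asyncio.sleep(interval_seconds)
        await ws.send(json.dumps({"text": " "}))

# Usage: task = asyncio.create_task(keep_alive(ws))
```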
You can use `alignment` to get the word-level timestamps for each word in the text. This can be useful for aligning the audio with the text in a video or for other applications that require precise timing.
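As a rough sketch, the `alignment` field can be collected from each incoming message alongside the audio; the exact layout of the timing data is not shown here and should be checked against the API reference:

```python
import base64
import json

# Sketch: collect audio chunks and alignment objects from incoming messages.
# Inspect the alignment structure (character/word timings) before relying on it.
async def collect_audio_and_alignment(ws):
    audio = bytearray()
    alignments = []
    async for message in ws:
        data = json.loads(message)
        if data.get("audio"):
            audio.extend(base64.b64decode(data["audio"]))
        if data.get("alignment"):
            alignments.append(data["alignment"])
        if data.get("isFinal"):
            break
    return bytes(audio), alignments
```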