Frequently Asked Questions

What is the WellSaid API?

The WellSaid API is a RESTful Text-to-Speech (TTS) service that allows developers to convert text into spoken audio using high-quality AI voices. It can be used in a wide range of applications, including voice assistants, long-form narration of books and articles, interactive voice response (IVR) systems, podcasts, and more. We have collaborated with professional voice actors to develop the highest-quality AI voice models, offering over 200 combinations of voices and styles, and we ensure that the data used to create our natural-sounding AI voices is secure, high quality, and developed ethically.

What capabilities does the API offer?

The API has a basic endpoint as well as a streaming TTS (text-to-speech) endpoint. The API accepts select SSML tags, and an avatar endpoint lets you quickly pull a list of all available voices and their metadata. The API also offers options to create and store pronunciation changes in libraries using replacements. Additional endpoints allow you to manage and combine clips so you can produce the best possible final product.
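
For illustration, a minimal call to the basic TTS endpoint might look like the sketch below. The endpoint URL, the X-Api-Key header, and the speaker_id parameter shown here are assumptions; check the API reference for the exact paths and field names.

```python
import requests

# Substitute your own API key and a voice ID from the avatar (voices) endpoint.
API_KEY = "YOUR_WELLSAID_API_KEY"
TTS_URL = "https://api.wellsaidlabs.com/v1/tts/stream"  # assumed endpoint path

payload = {
    "text": "Hello from the WellSaid API!",
    "speaker_id": 3,  # example voice ID; assumed parameter name
}

response = requests.post(
    TTS_URL,
    json=payload,
    headers={"X-Api-Key": API_KEY},  # assumed auth header name
    timeout=30,
)
response.raise_for_status()

# Responses are returned as MP3 audio, so the bytes can be written straight to disk.
with open("greeting.mp3", "wb") as f:
    f.write(response.content)
```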

What is the rate limit?

The default WellSaid API key is currently limited to 3 requests per second and 1,000 characters per request, along with your account's total monthly character quota. No monthly limits or quotas are hard-coded; we bill for monthly character usage and overages so as not to disrupt any services you build with the API. Plans with higher rate limits are available; contact us to learn more.
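
One way to work within these defaults is to split long text into request-sized chunks and pace requests on the client side. Below is a minimal sketch; the splitting strategy and the synthesize_fn helper are illustrative, not part of the API.

```python
import textwrap
import time

MAX_CHARS_PER_REQUEST = 1000   # per-request character limit
MAX_REQUESTS_PER_SECOND = 3    # default key rate limit

def synthesize_long_text(text: str, synthesize_fn) -> list[bytes]:
    """Split text into chunks of at most 1,000 characters and pace
    requests so no more than 3 are sent in any one-second window.

    synthesize_fn is a placeholder for whatever function sends a single
    TTS request and returns the audio bytes.
    """
    chunks = textwrap.wrap(text, MAX_CHARS_PER_REQUEST, break_long_words=False)
    audio_parts = []
    for i, chunk in enumerate(chunks):
        audio_parts.append(synthesize_fn(chunk))
        # Simple pacing: pause after every third request.
        if (i + 1) % MAX_REQUESTS_PER_SECOND == 0:
            time.sleep(1)
    return audio_parts
```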

What is the render speed of the API?

Render speed is the total amount of time it takes before the audio can be played back after the entire text has been rendered. The WellSaid API renders as quickly as approximately 500 ms per 30 characters, ensuring seamless voice integration for your application. Total render speed can vary based on factors such as text length, voice selection, and how the API is integrated with your technology stack.

What is the Time to First Byte (TTFB)?

Time to First Byte (TTFB) measures the time it takes for the server to send the first byte of data in response to the client's request. Our streaming endpoint delivers audio in real-time as it's generated, significantly reducing TTFB. For applications needing low latency, the streaming endpoint is recommended.
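
As a rough sketch of consuming the streaming endpoint, the request below reads the audio as it arrives instead of waiting for the full clip. The URL, header, and body fields are the same assumptions as in the earlier example.

```python
import requests

API_KEY = "YOUR_WELLSAID_API_KEY"
STREAM_URL = "https://api.wellsaidlabs.com/v1/tts/stream"  # assumed streaming endpoint

# stream=True lets us start handling audio bytes as soon as the first
# chunk arrives, rather than waiting for the entire clip to render.
with requests.post(
    STREAM_URL,
    json={"text": "Streaming keeps latency low.", "speaker_id": 3},  # assumed parameters
    headers={"X-Api-Key": API_KEY},  # assumed auth header
    stream=True,
    timeout=30,
) as response:
    response.raise_for_status()
    with open("streamed.mp3", "wb") as f:
        for chunk in response.iter_content(chunk_size=4096):
            f.write(chunk)
```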

Which audio formats are supported?

Responses are returned as an MP3 file.

What voices are available in the API?

We offer over 200 high-quality voices and styles with a diverse range of regional and international accents. You can find a full list of voices on our Available Voice Avatars page.
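
You can also pull the voice list programmatically through the avatar endpoint mentioned above. The path and the response shape in this sketch are assumptions; consult the API reference for the exact details.

```python
import requests

API_KEY = "YOUR_WELLSAID_API_KEY"
AVATARS_URL = "https://api.wellsaidlabs.com/v1/tts/avatars"  # assumed path for the avatar endpoint

response = requests.get(AVATARS_URL, headers={"X-Api-Key": API_KEY}, timeout=30)
response.raise_for_status()

# Print each voice's ID and name from the returned metadata.
# The field names here are assumptions about the response shape.
for voice in response.json():
    print(voice.get("id"), voice.get("name"))
```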

What languages are available in the API?

We have global languages available, including Chinese (Cantonese and Mandarin), French, German, Italian, Japanese, Korean, Portuguese, and Spanish. You can find the full list on our Available Global Languages Voices page.

Can I enable a custom voice in the API?

Yes! Contact us and we will help deploy a custom voice to your API.