When you generate text-to-speech audio with the WellSaid API, you can choose which voice model to use. The model is the voice performance engine that produces the audio, and some models support enhanced features such as AI Director.
Available Models
Tag (Optional) | Description |
---|---|
legacy | The original WellSaid voice model. If no model is specified in your API request, this model will be used automatically. It delivers natural-sounding speech and supports most core voice features. |
caruso | The next-generation WellSaid voice model, built to support advanced performance cues via AI Director. It enables dynamic control over pitch, tempo, loudness, and respelling, giving you more expressive and natural results. |
How to Specify the Model in the API
To choose a model, include the model field in the JSON payload sent to any /tts endpoint. Here's an example using caruso:
{
"speaker_id": 26,
"model": "caruso",
"text": "<pitch value=\"-250\">This sentence feels deeper and more serious.</pitch>"
}
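As a rough sketch of how a client might assemble this payload, the Python snippet below builds the same JSON body. The helper name build_tts_payload is hypothetical and not part of any WellSaid SDK; only the field names (speaker_id, model, text) come from the example above. Sending the request, including the endpoint URL and authentication, is left out deliberately.

```python
import json
from typing import Any, Dict, Optional


def build_tts_payload(
    speaker_id: int, text: str, model: Optional[str] = None
) -> Dict[str, Any]:
    """Build the JSON body for a /tts request (hypothetical helper)."""
    payload: Dict[str, Any] = {"speaker_id": speaker_id, "text": text}
    # Only include "model" when the caller chose one; leaving it out lets the
    # API apply its own default.
    if model is not None:
        payload["model"] = model
    return payload


payload = build_tts_payload(
    26,
    '<pitch value="-250">This sentence feels deeper and more serious.</pitch>',
    model="caruso",
)
# json.dumps escapes the inner double quotes, matching the JSON shown above.
body = json.dumps(payload)
```

Building the body as a dictionary and serializing it with json.dumps avoids hand-escaping the quotes inside the markup tag.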
If the model field is omitted, the API will default to the legacy model:
{
"speaker_id": 26,
"text": "This uses the legacy model because no model is specified."
}
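If client code needs to know which model a given payload will run on (for logging, say), the documented default can be mirrored locally. This is a minimal sketch: effective_model and DEFAULT_MODEL are hypothetical names, and the function only restates the rule described above rather than querying the API.

```python
from typing import Any, Mapping

# Assumption: mirrors the documented behavior that requests without a
# "model" field use the legacy model.
DEFAULT_MODEL = "legacy"


def effective_model(payload: Mapping[str, Any]) -> str:
    """Return the model name a payload is expected to use."""
    return payload.get("model", DEFAULT_MODEL)


with_model = {"speaker_id": 26, "model": "caruso", "text": "Hello."}
without_model = {
    "speaker_id": 26,
    "text": "This uses the legacy model because no model is specified.",
}

print(effective_model(with_model))
print(effective_model(without_model))
```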
Why Choose the Caruso Model?
The caruso model unlocks AI Director capabilities, allowing you to fine-tune the emotional and stylistic delivery of voiceovers using inline markup tags. With caruso, you can modify:
Pitch (e.g. deeper or higher tone)
Tempo (slower or faster pace)
Loudness (quiet to commanding projection)
These advanced controls give you studio-level flexibility directly through the API.
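To illustrate how such markup might be generated in code, here is a small Python sketch that wraps text in a cue tag. Only the pitch syntax, <pitch value="...">, appears in this article; the helper name wrap_cue is hypothetical, and using the same tag shape for tempo or loudness is an assumption that should be checked against the AI Director documentation, along with the accepted value ranges.

```python
import json


def wrap_cue(tag: str, value: int, text: str) -> str:
    """Wrap text in an inline cue tag such as <pitch value="-250">...</pitch>.

    Assumption: other cues (tempo, loudness) follow the same tag shape as
    the pitch example; only pitch is confirmed by this article.
    """
    return f'<{tag} value="{value}">{text}</{tag}>'


marked_up = wrap_cue("pitch", -250, "This sentence feels deeper and more serious.")

# The marked-up text goes in the "text" field of a caruso request.
request_body = json.dumps(
    {"speaker_id": 26, "model": "caruso", "text": marked_up}
)
```

The sketch does no escaping of the wrapped text, so input containing angle brackets would need separate handling.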
Learn more: Using AI Director with the API