Using AI Director with the API

The AI Director feature lets you fine-tune speech delivery by adjusting tempo, pauses, loudness, and pitch, and by respelling words, using XML cues embedded directly in your text input. These cues produce more expressive, human-like voiceovers so you can get the right performance with each take.

Basic Usage

AI Director currently supports four tags. To use one in a request, wrap it around the specific text you want the effect applied to. Each tag that takes a numeric value has a specified range that you should stay within for best results; values outside these ranges generally lead to unfavorable results.

The following XML tags can be added within the speak_tag to provide additional cues to the AI.

Tag (Optional)	Function	Attribute	Range [low, neutral, high]	Example
`tempo`	Cue a change in pace for the enclosed text	`value` (float), required	Slow to neutral to fast: [0.5, 1, 2.5]	This speech is slower.
`loudness`	Cue a change in loudness for the enclosed text	`value` (float), required	Quiet to neutral to loud: [-20, 0, 10]	This speech is a little louder.
`pitch`	Cue a change in pitch for the enclosed text	`value` (float), required	Low to neutral to high: [-250, 0, 500]	This speech is a little higher.
`respell`	Inline respelling for the enclosed text	`value` (string), required	Respelling of the enclosed text	Pizza
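
The ranges in the table can be checked on the client before a request is sent. The following is a minimal Python sketch, not part of any official SDK: the `RANGES` mapping and the `cue` helper are hypothetical names, and the bounds are copied from the table above.

```python
# Documented (low, high) bounds for each numeric tag, copied from the table.
RANGES = {
    "tempo": (0.5, 2.5),
    "loudness": (-20, 10),
    "pitch": (-250, 500),
}


def cue(tag: str, value: float, text: str) -> str:
    """Wrap `text` in an AI Director tag after checking the documented range."""
    if tag not in RANGES:
        raise ValueError(f"unsupported numeric tag: {tag!r}")
    low, high = RANGES[tag]
    if not low <= value <= high:
        raise ValueError(f"{tag} value {value} is outside [{low}, {high}]")
    return f'<{tag} value="{value}">{text}</{tag}>'


print(cue("tempo", 0.5, "This speech is slower."))
# prints: <tempo value="0.5">This speech is slower.</tempo>
```

Raising an error rather than silently clamping is a design choice here; it makes an out-of-range value visible before any audio is generated.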

Examples

Each of the following examples shows how AI Director tags modify the voice actor’s delivery, not just the audio file’s properties. These changes affect cadence, emotion, intensity, and presence, producing a more natural and expressive result.

Loudness:

The loudness tag controls the intensity of the delivery. Unlike simply increasing the volume of the audio file, this tag prompts the voice actor to speak louder or softer, changing their energy, projection, and vocal presence.

Low Value (-20) → Subdued, intimate

Neutral (0) → Natural, balanced

High Value (10) → Assertive, commanding, emphatic

Suggested values: [-20, -12, -8, -4, -2, 0, 2, 4, 6, 8, 10]

{
  "speaker_id": 20,
  "text": "<loudness value=\"6\">This part is louder, providing more emphasis.</loudness>",
  "model": "caruso"
}
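
Because the tag attributes use double quotes, they have to be escaped inside the JSON string, as in the request body above. One way to avoid escaping by hand is to build the body with a JSON serializer. This is a minimal Python sketch that assumes the same field names as the body above; `build_body` is a hypothetical helper name.

```python
import json


def build_body(speaker_id: int, text: str, model: str = "caruso") -> str:
    """Serialize a request body; json.dumps escapes the quotes inside the tags."""
    return json.dumps({"speaker_id": speaker_id, "text": text, "model": model})


# Write the tag with plain double quotes; the serializer adds the backslashes.
text = '<loudness value="6">This part is louder, providing more emphasis.</loudness>'
body = build_body(20, text)
print(body)
```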

Tempo:

The tempo tag adjusts the pacing of the voice actor. This influences how quickly or slowly the words are spoken, changing the urgency, clarity, and emotional weight of the delivery.

Low Value (0.5) → Deliberate, thoughtful, calm

Neutral (1) → Conversational, balanced

High Value (2.5) → Rapid, energetic, urgent

Suggested values: [0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.3, 1.6, 1.9, 2.3, 2.5]

{
  "speaker_id": 19,
  "text": "<tempo value=\"0.5\">This part is spoken slowly, with intention and gravity.</tempo>",
  "model": "caruso"
}

Pitch:

The pitch tag shifts the tonal height of the speaker’s voice. Lower pitches sound more grounded, serious, and mysterious, while higher pitches can come across as more enthusiastic, light, or youthful.

Low Value (-250) → Deep, resonant, solemn

Neutral (0) → Natural, centered

High Value (500) → High-pitched, animated, playful

Suggested values: [-250, -200, -150, -100, -50, 0, 100, 200, 300, 400, 500]

{
  "speaker_id": 19,
  "text": "<pitch value=\"-250\">This sentence feels deeper and more serious.</pitch>",
  "model": "caruso"
}
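
If you compute cue values programmatically and want to stay on the suggested values listed in the three sections above, one option is to snap each computed value to the nearest listed one. The sketch below is only an illustration: `SUGGESTED` and `snap` are hypothetical names, and nothing in this guide says that intermediate values are rejected.

```python
# Suggested values as listed in the Loudness, Tempo, and Pitch sections.
SUGGESTED = {
    "loudness": [-20, -12, -8, -4, -2, 0, 2, 4, 6, 8, 10],
    "tempo": [0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.3, 1.6, 1.9, 2.3, 2.5],
    "pitch": [-250, -200, -150, -100, -50, 0, 100, 200, 300, 400, 500],
}


def snap(tag: str, value: float) -> float:
    """Return the suggested value closest to `value` (ties go to the lower one)."""
    return min(SUGGESTED[tag], key=lambda s: (abs(s - value), s))


print(snap("pitch", -180))   # prints -200
print(snap("loudness", 15))  # prints 10
```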

Combining Multiple Tags:

These tags can also be nested. To draw attention to a call to action, for instance, you can combine a lower pitch, a slower tempo, and a decreased loudness to achieve a more serious and brooding delivery:

Example with cURL

curl --location 'https://api.wellsaidlabs.com/v1/tts/stream' \
--header 'Content-Type: application/json' \
--header 'Accept: audio/mpeg' \
--header 'X-Api-Key: YOUR_API_KEY' \
--data '{
    "speaker_id": 26,
    "model": "caruso",
    "text": "<pitch value=\"-200\"><tempo value=\"0.5\"><loudness value=\"-10\">This sentence is quieter, slower, and pitched lower — making the delivery convey a serious tone.</loudness></tempo></pitch>"
}' \
--output ai_director_demo.mp3
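
The same request can be assembled in Python using only the standard library. This is a sketch under stated assumptions rather than an official client: the URL, headers, and body fields are copied from the cURL command above, while `nest`, `build_request`, and `save_audio` are hypothetical helper names. The network call is left commented out so the block runs without an API key.

```python
import json
import urllib.request

API_URL = "https://api.wellsaidlabs.com/v1/tts/stream"  # from the cURL example


def nest(text: str, **cues: float) -> str:
    """Wrap `text` in one tag per keyword; the first keyword ends up outermost."""
    for tag, value in reversed(list(cues.items())):
        text = f'<{tag} value="{value}">{text}</{tag}>'
    return text


def build_request(api_key: str, speaker_id: int, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the cURL command."""
    body = json.dumps({"speaker_id": speaker_id, "model": "caruso", "text": text})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
            "X-Api-Key": api_key,
        },
        method="POST",
    )


def save_audio(request: urllib.request.Request, path: str) -> None:
    """Send the request and write the returned audio bytes to `path`."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


text = nest(
    "This sentence is quieter, slower, and pitched lower.",
    pitch=-200, tempo=0.5, loudness=-10,
)
request = build_request("YOUR_API_KEY", 26, text)
print(text)
# save_audio(request, "ai_director_demo.mp3")  # uncomment to call the API
```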

Adding Pauses with the Tempo tag:

The tempo tag can also adjust the length of a pause when you apply it to any punctuation mark that denotes a pause in your script. To make sure the pause does not include a breath, also apply a loudness tag set below the suggested values listed above, such as -40.

The following example lengthens the pause at the comma, making the delivery feel more impactful:

{
  "speaker_id": 19,
  "text": "You have to stay focused<tempo value=\"0.5\"><loudness value=\"-40\">,</loudness></tempo> even when it's hard.",
  "model": "caruso"
}
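
For longer scripts, the wrapping shown above can be applied to every pause mark automatically. The following Python sketch assumes plain input text with no tags already in it; `lengthen_pauses` is a hypothetical helper, and its default values are the ones used in the example above.

```python
import re


def lengthen_pauses(text: str, tempo: float = 0.5, loudness: float = -40,
                    marks: str = ",;:") -> str:
    """Wrap each pause punctuation mark in nested tempo and loudness tags."""
    pattern = re.compile(f"[{re.escape(marks)}]")

    def wrap(match: re.Match) -> str:
        return (f'<tempo value="{tempo}"><loudness value="{loudness}">'
                f'{match.group(0)}</loudness></tempo>')

    return pattern.sub(wrap, text)


print(lengthen_pauses("You have to stay focused, even when it's hard."))
```

Run it on the raw script before adding any other tags, since it does not distinguish punctuation in the script from punctuation inside a tag attribute such as a respelling.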