The AI Director feature allows you to fine-tune speech delivery by embedding XML cues directly into your text input. These cues include tempo, loudness, pitch, and respelling, enabling more expressive, human-like voiceovers so you can get the right feeling with each take.


Basic Usage

AI Director currently supports four tags. To apply one, wrap it around the specific text you want to affect. Each tag has a specified value range; staying within it gives the best results from the model, and going outside it will lead to unfavorable results.

The following XML tags can be added within the speak_tag to provide additional cues to the AI.

Tag (Optional): Description

tempo
Function: Cue a change in pace for the enclosed text
Attributes: value (float): [Required] slow to neutral to fast [0.5, 1, 2.5]
Example: This speech is slower.

loudness
Function: Cue a change in loudness for the enclosed text
Attributes: value (float): [Required] quiet to neutral to loud [-20, 0, 10]
Example: This speech is a little louder.

pitch
Function: Cue a change in pitch for the enclosed text
Attributes: value (float): [Required] low to neutral to high [-250, 0, 500]
Example: This speech is a little higher.

respell
Function: Inline respelling for the enclosed text
Attributes: value (string): [Required] respelling of the enclosed text
Example: Pizza
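
As a rough illustration of wrapping text in a tag while staying inside the ranges listed above, the following Python sketch builds the tagged string and rejects out-of-range values. The helper name `wrap_cue` and the `CUE_RANGES` table are hypothetical and not part of the API; the ranges are copied from the tag table, and the sketch assumes the enclosed text needs no extra escaping.

```python
# Allowed ranges copied from the tag table above: (minimum, neutral, maximum).
CUE_RANGES = {
    "tempo": (0.5, 1, 2.5),
    "loudness": (-20, 0, 10),
    "pitch": (-250, 0, 500),
}


def wrap_cue(tag: str, value: float, text: str) -> str:
    """Wrap text in one AI Director tag, rejecting out-of-range values.

    Hypothetical helper for illustration; it only builds a string.
    """
    if tag not in CUE_RANGES:
        raise ValueError(f"unknown tag: {tag!r}")
    low, _neutral, high = CUE_RANGES[tag]
    if not low <= value <= high:
        raise ValueError(f"{tag} value {value} is outside [{low}, {high}]")
    # ':g' drops a trailing '.0', so 6.0 is written as "6" and 0.5 stays "0.5".
    return f'<{tag} value="{value:g}">{text}</{tag}>'


print(wrap_cue("tempo", 0.5, "This speech is slower."))
```

The returned string can be used as the "text" field of a request. Validating on the client side is a design choice made here so that a typo such as a tempo of 25 fails early instead of producing an unfavorable take.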

Examples

Each of the following examples shows how AI Director tags modify the voice actor’s delivery, not just the audio file’s properties. These changes affect cadence, emotion, intensity, and presence, producing a more natural and expressive result.

Loudness:

The loudness tag controls the intensity of the delivery. Unlike simply increasing the volume of the audio file, this tag prompts the voice actor to speak louder or softer, changing their energy, projection, and vocal presence.

Low Value (-20) → Subdued, intimate

Neutral (0) → Natural, balanced

High Value (10) → Assertive, commanding, emphatic

Suggested values: [-20, -12, -8, -4, -2, 0, 2, 4, 6, 8, 10]

{
  "speaker_id": 20,
  "text": "<loudness value=\"6\">This part is louder, providing more emphasis.</loudness>",
  "model": "caruso"
}

Tempo:

The tempo tag adjusts the pacing of the voice actor. This influences how quickly or slowly the words are spoken, changing the urgency, clarity, and emotional weight of the delivery.

Low Value (0.5) → Deliberate, thoughtful, calm

Neutral (1) → Conversational, balanced

High Value (2.5) → Rapid, energetic, urgent

Suggested values: [0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.3, 1.6, 1.9, 2.3, 2.5]

{
  "speaker_id": 19,
  "text": "<tempo value=\"0.5\">This part is spoken slowly, with intention and gravity.</tempo>",
  "model": "caruso"
}

Pitch:

The pitch tag shifts the tonal height of the speaker’s voice. Lower pitches sound more grounded, serious, and mysterious, while higher pitches can come across as more enthusiastic, light, or youthful.

Low Value (-250) → Deep, resonant, solemn

Neutral (0) → Natural, centered

High Value (500) → High-pitched, animated, playful

Suggested values: [-250, -200, -150, -100, -50, 0, 100, 200, 300, 400, 500]

{
  "speaker_id": 19,
  "text": "<pitch value=\"-250\">This sentence feels deeper and more serious.</pitch>",
  "model": "caruso"
}
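
If you compute cue values programmatically, one option is to snap each value to the nearest entry in the suggested lists for loudness, tempo, and pitch. The Python sketch below shows that idea; the name `snap_to_suggested` and the `SUGGESTED` table are hypothetical, and whether values between the suggested ones behave differently is not something this page states, so treat the snapping as an optional convention.

```python
# Suggested values copied from the loudness, tempo, and pitch sections above.
SUGGESTED = {
    "loudness": [-20, -12, -8, -4, -2, 0, 2, 4, 6, 8, 10],
    "tempo": [0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.3, 1.6, 1.9, 2.3, 2.5],
    "pitch": [-250, -200, -150, -100, -50, 0, 100, 200, 300, 400, 500],
}


def snap_to_suggested(tag: str, value: float) -> float:
    """Return the suggested value closest to `value` for the given tag.

    Hypothetical helper; on an exact tie the lower suggested value wins,
    because min() keeps the first of equally close candidates.
    """
    return min(SUGGESTED[tag], key=lambda candidate: abs(candidate - value))


print(snap_to_suggested("loudness", 5.2))
print(snap_to_suggested("pitch", -230))
```

Here a loudness of 5.2 snaps to 6 and a pitch of -230 snaps to -250.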

Combining Multiple Tags:

These tags can also be nested. To draw attention to a call to action, for instance, you can nest a lower pitch, a slower tempo, and a decreased loudness to achieve a more serious and brooding delivery:

Example with cURL

curl --location 'https://api.wellsaidlabs.com/v1/tts/stream' \
--header 'Content-Type: application/json' \
--header 'Accept: audio/mpeg' \
--header 'X-Api-Key: YOUR_API_KEY' \
--data '{
    "speaker_id": 26,
    "model": "caruso",
    "text": "<pitch value=\"-200\"><tempo value=\"0.5\"><loudness value=\"-10\">This sentence is quieter, slower, and pitched lower — making the delivery convey a serious tone.</loudness></tempo></pitch>"
}' \
--output ai_director_demo.mp3
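
For readers who prefer Python, the following is a minimal sketch of the same nested request using only the standard library. The endpoint, headers, and body fields are taken from the cURL example above; the function names `nest_cues`, `build_payload`, and `synthesize` are hypothetical, the sentence is shortened for readability, and error handling is omitted. The network call is defined but not executed, so the sketch only prints the JSON body.

```python
import json
import urllib.request

# Endpoint taken from the cURL example above.
API_URL = "https://api.wellsaidlabs.com/v1/tts/stream"


def nest_cues(text, cues):
    """Wrap text in each (tag, value) pair; the first pair ends up outermost."""
    for tag, value in reversed(cues):
        text = f'<{tag} value="{value:g}">{text}</{tag}>'
    return text


def build_payload(text, speaker_id, model="caruso"):
    """Build the JSON body with the same fields as the cURL example."""
    return {"speaker_id": speaker_id, "model": model, "text": text}


def synthesize(payload, api_key, out_path):
    """POST the payload and save the returned audio (not called in this sketch)."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
            "X-Api-Key": api_key,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())


text = nest_cues(
    "This sentence is quieter, slower, and pitched lower.",
    [("pitch", -200), ("tempo", 0.5), ("loudness", -10)],
)
payload = build_payload(text, speaker_id=26)
print(json.dumps(payload, indent=2))
# To request audio: synthesize(payload, "YOUR_API_KEY", "ai_director_demo.mp3")
```

Building the nested string from a list of pairs keeps the opening and closing tags balanced automatically, which is easy to get wrong when writing deeply nested cues by hand.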