Using Word Timing with the WellSaid API
The Word Timing endpoint allows you to generate timing information for individual words in a speech synthesis request. This can be useful for synchronizing text highlighting, captions, or animations with audio playback.
When using this API, the response includes a ZIP file containing:

- Audio file (`audio.mp3`)
- JSON file (`word-timing.json`) with word timing data
- SRT file (`srt.srt`) for subtitles
- VTT file (`vtt.vtt`) for web captions
How do you use the endpoint?
Generate Word Timing
This API call processes your text input and returns both word timing data and the generated audio.
To generate word timing, make a POST request to the Word Timing endpoint, including your API key in the `X-API-KEY` header:

`https://api.wellsaidlabs.com/v1/tts/word-timing`
Example curl Command

```bash
curl --request POST \
  --url https://api.wellsaidlabs.com/v1/tts/word-timing \
  --header 'X-API-KEY: <YOUR_API_KEY>' \
  --header 'X-Enable-SSML: false' \
  --header 'accept: */*' \
  --header 'content-type: application/json' \
  --data '
{
  "speaker_id": 3,
  "text": "this is the text"
}
' --output word-timing.zip
```
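If you are calling the endpoint from code rather than the command line, the same request works with any HTTP client. Below is a minimal Node.js sketch, assuming Node 18+ (for the built-in `fetch`) and a `WELLSAID_API_KEY` environment variable; the `speaker_id` and `text` values simply mirror the curl example above.

```js
// request-word-timing.js — minimal sketch, assumes Node.js 18+ (built-in fetch)
// and that your API key is available as the WELLSAID_API_KEY environment variable.
const fs = require('fs');

async function requestWordTiming(apiKey) {
  const response = await fetch('https://api.wellsaidlabs.com/v1/tts/word-timing', {
    method: 'POST',
    headers: {
      'X-API-KEY': apiKey,
      'X-Enable-SSML': 'false',
      'accept': '*/*',
      'content-type': 'application/json',
    },
    body: JSON.stringify({ speaker_id: 3, text: 'this is the text' }),
  });

  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${response.statusText}`);
  }

  // The endpoint returns a ZIP archive; write the raw bytes to disk.
  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync('word-timing.zip', buffer);
}

requestWordTiming(process.env.WELLSAID_API_KEY).catch(console.error);
```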
Response
This request returns a ZIP file (`word-timing.zip`), which contains:

- `audio.mp3` → The synthesized speech audio.
- `word-timing.json` → JSON file with word timing details.
- `srt.srt` → SubRip subtitle file.
- `vtt.vtt` → Web Video Text Track (VTT) subtitle file.
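To work with the individual files, unpack the archive first. The sketch below assumes the third-party `adm-zip` npm package (any ZIP tool, including the command-line `unzip`, works just as well); the file names come from the response described above.

```js
// extract-word-timing.js — a sketch assuming the adm-zip package (npm install adm-zip).
const AdmZip = require('adm-zip');

const zip = new AdmZip('word-timing.zip');

// Extract audio.mp3, word-timing.json, srt.srt, and vtt.vtt into ./word-timing.
zip.extractAllTo('word-timing', /* overwrite */ true);

// Or read the timing data directly from the archive without touching disk.
const timing = JSON.parse(zip.readAsText('word-timing.json'));
console.log(timing.results[0].alternatives[0].transcript);
```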
Understanding the Word Timing Response
The `word-timing.json` file includes word-by-word timing information:

```json
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "this is the text",
          "confidence": 0.98413205,
          "words": [
            {
              "endOffset": "0.200s",
              "word": "this"
            },
            {
              "startOffset": "0.200s",
              "endOffset": "0.500s",
              "word": "is"
            },
            {
              "startOffset": "0.500s",
              "endOffset": "0.700s",
              "word": "the"
            },
            {
              "startOffset": "0.700s",
              "endOffset": "1.300s",
              "word": "text"
            }
          ]
        }
      ],
      "resultEndOffset": "1.380s",
      "languageCode": "en-us"
    }
  ]
}
```
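For programmatic use, the words you need live under `results[].alternatives[].words`. The helper below is a small sketch (not part of the API) that flattens this structure and converts the string offsets into numeric seconds, which the later examples can reuse.

```js
// Flatten the word timing response into [{ word, start, end }] with numeric seconds.
// Note: the first word may omit startOffset (as in the sample above), so it defaults to 0.
function flattenWordTimings(timing) {
  return timing.results.flatMap(result =>
    result.alternatives.flatMap(alt =>
      alt.words.map(w => ({
        word: w.word,
        start: parseFloat(w.startOffset || '0'), // "0.200s" -> 0.2
        end: parseFloat(w.endOffset),            // "0.500s" -> 0.5
      }))
    )
  );
}

// For the sample response above, this yields:
// [{ word: "this", start: 0, end: 0.2 }, { word: "is", start: 0.2, end: 0.5 }, ...]
```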
Key Response Fields
| Field | Type | Description |
|---|---|---|
| `transcript` | String | The full spoken sentence |
| `confidence` | Float | The AI confidence score for the transcription |
| `words` | Array | List of individual words with their respective timing |
| `startOffset` | String | Start time of the word, in seconds (e.g. `"0.200s"`) |
| `endOffset` | String | End time of the word, in seconds (e.g. `"0.500s"`) |
| `resultEndOffset` | String | Total duration of the speech |
| `languageCode` | String | The detected language (e.g. `en-us`) |
How to Use the Word Timing Data
- For Subtitles: Use the `srt.srt` or `vtt.vtt` files to display timed captions (see the sketch after this list).
- For Word Highlighting: Use the `word-timing.json` data to highlight words in sync with audio.
- For Animations: Sync text animations to the `startOffset` and `endOffset` timestamps.
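For the subtitle case, a lightweight option is to attach the generated `vtt.vtt` as a text track. The snippet below is a sketch, not part of the WellSaid API: browsers render caption tracks on `<video>` elements rather than `<audio>`, so it points a `<video>` element at the extracted audio file, and the element id and file paths are illustrative.

```js
// Sketch: display the generated VTT captions during playback.
// Assumes <video id="player" src="audio.mp3" controls></video> in the page;
// browsers show text tracks on <video>, not <audio>, even for audio-only sources.
const video = document.getElementById('player');

const track = document.createElement('track');
track.kind = 'captions';
track.label = 'English';
track.srclang = 'en';
track.src = 'vtt.vtt';   // extracted from word-timing.zip
track.default = true;    // show this track by default

video.appendChild(track);
```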
Example: Displaying Word-by-Word Highlighting
If you want to highlight words as they are spoken, extract the `startOffset` and `endOffset` timestamps from `word-timing.json` and match them with your audio playback position.
Example Code (JavaScript)
```js
// Highlights the element for each word while the audio is playing that word.
// Assumes each word of the transcript is rendered in an element whose id is the word itself.
function highlightText(wordData, audioElement) {
  audioElement.addEventListener('timeupdate', () => {
    const currentTime = audioElement.currentTime;
    wordData.words.forEach(word => {
      // The first word may omit startOffset, so default it to 0.
      // parseFloat ignores the trailing "s" in offsets like "0.200s".
      const start = parseFloat(word.startOffset || '0');
      const end = parseFloat(word.endOffset);
      const el = document.getElementById(word.word);
      if (!el) return;
      if (currentTime >= start && currentTime <= end) {
        el.classList.add('highlight');
      } else {
        el.classList.remove('highlight');
      }
    });
  });
}
```
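One possible way to wire this up (the element id, file paths, and markup here are illustrative, not part of the API): fetch the extracted `word-timing.json`, pass the first alternative to `highlightText`, and start playback.

```js
// Usage sketch — assumes word-timing.json and audio.mp3 were extracted from the ZIP
// and are served next to this page, and that each word of the transcript is rendered
// as <span id="this">this</span> <span id="is">is</span> <span id="the">the</span> <span id="text">text</span>.
const audio = document.getElementById('speech-audio'); // <audio id="speech-audio" src="audio.mp3">

fetch('word-timing.json')
  .then(res => res.json())
  .then(timing => {
    highlightText(timing.results[0].alternatives[0], audio);
    audio.play(); // some browsers require a user gesture before playback starts
  });
```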
Conclusion
The Word Timing API enables accurate synchronization of audio with visual elements like captions and text highlighting. By leveraging the JSON timing data, you can create an engaging and accessible user experience.
Now, you can:
- Generate word timing
- Retrieve word timings
- Sync subtitles and highlights with speech
Start integrating the WellSaid Word Timing API today.