Using Word Timing with the WellSaid API

The Word Timing endpoint returns timing information for each word in a speech synthesis request. Use it to synchronize text highlighting, captions, or animations with audio playback.

The response to a Word Timing request is a ZIP file containing:

Audio file (audio.mp3)
JSON file (word-timing.json) with word timing data
SRT file (srt.srt) for subtitles
VTT file (vtt.vtt) for web captions

How do you use the endpoint?

Generate Word Timing

This API call processes your text input and returns both word timing data and the generated audio.

To generate word timing, send a POST request to the Word Timing endpoint with your API key in the X-API-KEY header:

https://api.wellsaidlabs.com/v1/tts/word-timing

Example `curl` Command

curl --request POST \
     --url https://api.wellsaidlabs.com/v1/tts/word-timing \
     --header 'X-API-KEY: <YOUR_API_KEY>' \
     --header 'X-Enable-SSML: false' \
     --header 'accept: */*' \
     --header 'content-type: application/json' \
     --data '
{
  "speaker_id": 3,
  "text": "this is the text"
}
' --output word-timing.zip
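
The same request can also be made from JavaScript. The following is a minimal sketch, assuming a runtime with a global fetch (for example Node 18 or later). The helper names buildWordTimingRequest and downloadWordTiming are illustrative and not part of the WellSaid API; the URL, headers, and body fields simply mirror the curl command above.

```javascript
// Endpoint taken from the curl example above.
const WORD_TIMING_URL = 'https://api.wellsaidlabs.com/v1/tts/word-timing';

// Hypothetical helper: builds the fetch arguments without sending anything.
function buildWordTimingRequest(apiKey, speakerId, text) {
  return {
    url: WORD_TIMING_URL,
    options: {
      method: 'POST',
      headers: {
        'X-API-KEY': apiKey,
        'X-Enable-SSML': 'false',
        'accept': '*/*',
        'content-type': 'application/json',
      },
      body: JSON.stringify({ speaker_id: speakerId, text }),
    },
  };
}

// Hypothetical helper: sends the request and returns the ZIP as raw bytes.
async function downloadWordTiming(apiKey, speakerId, text) {
  const { url, options } = buildWordTimingRequest(apiKey, speakerId, text);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Word Timing request failed with status ${response.status}`);
  }
  return new Uint8Array(await response.arrayBuffer());
}
```

The returned bytes can be written to disk as word-timing.zip or handed to a ZIP library of your choice.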

Response

This request returns a ZIP file (word-timing.zip), which contains:

audio.mp3 → The synthesized speech audio.
word-timing.json → JSON file with word timing details.
srt.srt → SubRip subtitle file.
vtt.vtt → Web Video Text Track (VTT) subtitle file.

Understanding the Word Timing Response

The word-timing.json file includes word-by-word timing information:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "this is the text",
          "confidence": 0.98413205,
          "words": [
            {
              "endOffset": "0.200s",
              "word": "this"
            },
            {
              "startOffset": "0.200s",
              "endOffset": "0.500s",
              "word": "is"
            },
            {
              "startOffset": "0.500s",
              "endOffset": "0.700s",
              "word": "the"
            },
            {
              "startOffset": "0.700s",
              "endOffset": "1.300s",
              "word": "text"
            }
          ]
        }
      ],
      "resultEndOffset": "1.380s",
      "languageCode": "en-us"
    }
  ]
}

Key Response Fields

Field	Type	Description
`transcript`	String	The full spoken sentence
`confidence`	Float	The AI confidence score for the transcription
`words`	Array	List of individual words with their respective timing
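`startOffset`	String	Start time of the word in seconds, written with a trailing "s" (for example "0.200s"); not present on the first word in the sample above
`endOffset`	String	End time of the word in seconds, in the same format
`resultEndOffset`	String	Total duration of the speech, in the same format
`languageCode`	String	The detected language, as a language code such as en-us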
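
Because the offsets are strings nested several levels deep, it can help to flatten them into plain numbers before use. The sketch below is one way to do that, not an official client: extractWords and offsetToSeconds are hypothetical names, only the first result and first alternative are read, and a missing startOffset is assumed to mean 0 seconds (inferred from the sample, not confirmed by the API).

```javascript
// Converts an offset string such as "0.200s" to a number of seconds.
// Assumption: a missing offset (as on the first word in the sample) means 0.
function offsetToSeconds(offset) {
  return offset === undefined ? 0 : parseFloat(offset);
}

// Hypothetical helper: reads the first alternative of the first result and
// returns its words as { word, start, end } with numeric seconds.
function extractWords(timing) {
  const alternative = timing.results[0].alternatives[0];
  return alternative.words.map((entry) => ({
    word: entry.word,
    start: offsetToSeconds(entry.startOffset),
    end: offsetToSeconds(entry.endOffset),
  }));
}

// Shortened version of the sample response shown earlier.
const sampleTiming = {
  results: [{
    alternatives: [{
      transcript: 'this is',
      words: [
        { endOffset: '0.200s', word: 'this' },
        { startOffset: '0.200s', endOffset: '0.500s', word: 'is' },
      ],
    }],
  }],
};

console.log(extractWords(sampleTiming));
```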

How to Use the Word Timing Data

For Subtitles: Use the srt.srt or vtt.vtt files to display timed captions.
For Word Highlighting: Use the word-timing.json data to highlight words in sync with audio.
For Animations: Sync text animations to the startOffset and endOffset timestamps.
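
For animations, it is often more convenient to work with a start delay and a duration per word than with two absolute timestamps. The following sketch shows one possible conversion to milliseconds. The name buildAnimationSchedule is hypothetical, and a missing startOffset is again assumed to mean 0.

```javascript
// Converts an offset string such as "0.500s" to whole milliseconds.
// Assumption: a missing offset is treated as 0.
function toMilliseconds(offset) {
  return Math.round(parseFloat(offset ?? '0') * 1000);
}

// Hypothetical helper: maps each word entry to the delay before its
// animation should start and how long that animation should run.
function buildAnimationSchedule(words) {
  return words.map((entry) => {
    const delayMs = toMilliseconds(entry.startOffset);
    const durationMs = toMilliseconds(entry.endOffset) - delayMs;
    return { word: entry.word, delayMs, durationMs };
  });
}

// Word entries copied from the sample word-timing.json.
const sampleWords = [
  { endOffset: '0.200s', word: 'this' },
  { startOffset: '0.200s', endOffset: '0.500s', word: 'is' },
  { startOffset: '0.500s', endOffset: '0.700s', word: 'the' },
  { startOffset: '0.700s', endOffset: '1.300s', word: 'text' },
];

console.log(buildAnimationSchedule(sampleWords));
```

Each entry could then drive a timer or a CSS animation delay, started at the same moment as audio playback.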

Example: Displaying Word-by-Word Highlighting

If you want to highlight words as they are spoken, read the startOffset and endOffset values from word-timing.json, convert them to seconds, and compare them with the current audio playback position.

Example Code (JavaScript)

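// wordData is one entry of results[].alternatives[] from word-timing.json.
// Assumes each word is rendered in an element with id "word-<index>",
// so that repeated words get distinct elements.
function highlightText(wordData, audioElement) {
  // parseFloat reads "0.200s" as 0.2; a missing startOffset is treated as 0.
  const toSeconds = (offset) => parseFloat(offset ?? '0');
  audioElement.addEventListener('timeupdate', () => {
    const currentTime = audioElement.currentTime;
    wordData.words.forEach((word, index) => {
      const element = document.getElementById(`word-${index}`);
      if (!element) return;
      // Half-open interval so only one word is active at a boundary.
      const isActive = currentTime >= toSeconds(word.startOffset) &&
        currentTime < toSeconds(word.endOffset);
      element.classList.toggle('highlight', isActive);
    });
  });
}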
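
The handler above checks every word on each timeupdate event. For long transcripts, the lookup can instead be done with a binary search, which also keeps the timing logic testable outside the browser. This is a sketch under stated assumptions: findActiveWordIndex is a hypothetical name, the offsets have already been converted to numeric seconds, and the words are sorted by start time and do not overlap (true of the sample, but not something the source guarantees).

```javascript
// Hypothetical helper: returns the index of the word being spoken at `time`,
// or -1 if `time` falls in a gap or outside the audio.
// Assumes `words` holds { start, end } in numeric seconds, sorted by start
// and non-overlapping. Each word covers the half-open range [start, end).
function findActiveWordIndex(words, time) {
  let low = 0;
  let high = words.length - 1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    if (time < words[mid].start) {
      high = mid - 1; // the active word, if any, is earlier
    } else if (time >= words[mid].end) {
      low = mid + 1; // the active word, if any, is later
    } else {
      return mid;
    }
  }
  return -1;
}

// Timings taken from the sample response, converted to seconds.
const timedWords = [
  { word: 'this', start: 0, end: 0.2 },
  { word: 'is', start: 0.2, end: 0.5 },
  { word: 'the', start: 0.5, end: 0.7 },
  { word: 'text', start: 0.7, end: 1.3 },
];

console.log(findActiveWordIndex(timedWords, 0.3));
```

Inside a timeupdate handler, the returned index can be compared with the previously highlighted index so that the DOM is only touched when the active word changes.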

Conclusion

The Word Timing API enables accurate synchronization of audio with visual elements like captions and text highlighting. By leveraging the JSON timing data, you can create an engaging and accessible user experience.

Now, you can:

Generate word timing for a text input
Read word timings from word-timing.json
Sync subtitles and highlights with speech

Start integrating the WellSaid Word Timing API today.