Using Word Timing with the WellSaid API

The Word Timing endpoint allows you to generate timing information for individual words in a speech synthesis request. This can be useful for synchronizing text highlighting, captions, or animations with audio playback.

The endpoint responds with a ZIP file containing:

  • Audio file (audio.mp3)
  • JSON file (word-timing.json) with word timing data
  • SRT file (srt.srt) for subtitles
  • VTT file (vtt.vtt) for web captions

How do you use the endpoint?

Generate Word Timing

This API call processes your text input and returns both word timing data and the generated audio.

To generate word timing, make a POST request to the Word Timing endpoint, passing your API key in the X-API-KEY header:

https://api.wellsaidlabs.com/v1/tts/word-timing

Example curl Command

curl --request POST \
     --url https://api.wellsaidlabs.com/v1/tts/word-timing \
     --header 'X-API-KEY: <YOUR_API_KEY>' \
     --header 'X-Enable-SSML: false' \
     --header 'accept: */*' \
     --header 'content-type: application/json' \
     --data '
{
  "speaker_id": 3,
  "text": "this is the text"
}
' --output word-timing.zip
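The same request can be sketched in JavaScript. This is a minimal sketch, assuming Node.js 18 or later (for the built-in fetch); the helper names buildWordTimingRequest and downloadWordTiming are hypothetical and not part of the WellSaid API. The URL, headers, and body fields are copied from the curl command above.

```javascript
const WORD_TIMING_URL = 'https://api.wellsaidlabs.com/v1/tts/word-timing';

// Hypothetical helper: builds fetch options that mirror the curl example.
function buildWordTimingRequest(apiKey, speakerId, text) {
  return {
    method: 'POST',
    headers: {
      'X-API-KEY': apiKey,
      'X-Enable-SSML': 'false',
      'accept': '*/*',
      'content-type': 'application/json',
    },
    body: JSON.stringify({ speaker_id: speakerId, text }),
  };
}

// Hypothetical helper: sends the request and writes the ZIP response to disk.
async function downloadWordTiming(apiKey, speakerId, text, outputPath = 'word-timing.zip') {
  const response = await fetch(WORD_TIMING_URL, buildWordTimingRequest(apiKey, speakerId, text));
  if (!response.ok) {
    throw new Error(`Word Timing request failed with status ${response.status}`);
  }
  // Dynamic import so the sketch works in both CommonJS and ES module files.
  const { writeFile } = await import('node:fs/promises');
  await writeFile(outputPath, Buffer.from(await response.arrayBuffer()));
  return outputPath;
}
```

Calling downloadWordTiming('<YOUR_API_KEY>', 3, 'this is the text') should then produce the same word-timing.zip as the curl command.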

Response

This request returns a ZIP file (word-timing.zip), which contains:

  • audio.mp3 → The synthesized speech audio.
  • word-timing.json → JSON file with word timing details.
  • srt.srt → SubRip subtitle file.
  • vtt.vtt → Web Video Text Tracks (WebVTT) caption file.

Understanding the Word Timing Response

The word-timing.json file includes word-by-word timing information:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "this is the text",
          "confidence": 0.98413205,
          "words": [
            {
              "endOffset": "0.200s",
              "word": "this"
            },
            {
              "startOffset": "0.200s",
              "endOffset": "0.500s",
              "word": "is"
            },
            {
              "startOffset": "0.500s",
              "endOffset": "0.700s",
              "word": "the"
            },
            {
              "startOffset": "0.700s",
              "endOffset": "1.300s",
              "word": "text"
            }
          ]
        }
      ],
      "resultEndOffset": "1.380s",
      "languageCode": "en-us"
    }
  ]
} 

Key Response Fields

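Field            Type    Description
transcript       String  The full spoken sentence
confidence       Float   The confidence score for the transcription
words            Array   List of individual words with their respective timing
startOffset      String  Start time of the word in seconds, with an "s" suffix (for example "0.200s"); absent for the first word in the sample above, which starts at 0
endOffset        String  End time of the word in seconds, with an "s" suffix (for example "0.500s")
resultEndOffset  String  Total duration of the speech (for example "1.380s")
languageCode     String  The detected language (for example "en-us")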
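Because the offsets are strings, it can help to convert them to numbers once before using them. The following is a minimal sketch, assuming the JSON layout shown in the sample above, that only the first alternative of each result is needed, and that a missing startOffset means the word starts at 0. The names parseOffset and normalizeWords are hypothetical.

```javascript
// Converts an offset string such as "0.200s" to seconds as a number.
// parseFloat stops at the trailing "s", so "0.200s" becomes 0.2.
function parseOffset(offset) {
  // Assumption: a missing offset means 0, as for the first word in the sample.
  if (offset === undefined) return 0;
  return parseFloat(offset);
}

// Flattens the parsed word-timing.json into { word, start, end } objects.
function normalizeWords(wordTiming) {
  return wordTiming.results
    .flatMap((result) => result.alternatives[0].words)
    .map(({ word, startOffset, endOffset }) => ({
      word,
      start: parseOffset(startOffset),
      end: parseOffset(endOffset),
    }));
}
```

The resulting array holds plain numbers in seconds, which compare directly against an audio element's currentTime.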

How to Use the Word Timing Data

  • For Subtitles: Use the srt.srt or vtt.vtt files to display timed captions.
  • For Word Highlighting: Use the word-timing.json data to highlight words in sync with audio.
  • For Animations: Sync text animations to the startOffset and endOffset timestamps.
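For the animation case, one possible approach is to turn each word's offsets into a delay and a duration in milliseconds. This is a sketch only: toAnimationSchedule is a hypothetical helper, and it assumes the words array from word-timing.json, with a missing startOffset treated as 0.

```javascript
// Converts word offsets into per-word animation delays and durations (ms).
function toAnimationSchedule(words) {
  return words.map((w) => {
    // Round to whole milliseconds to avoid floating-point noise.
    const startMs = Math.round(parseFloat(w.startOffset ?? '0') * 1000);
    const endMs = Math.round(parseFloat(w.endOffset) * 1000);
    return { word: w.word, delayMs: startMs, durationMs: endMs - startMs };
  });
}
```

Each entry could then be applied to an element as, for example, a CSS animation-delay and animation-duration, with the animation started at the same moment as audio playback.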

Example: Displaying Word-by-Word Highlighting

If you want to highlight words as they are spoken, extract the startOffset and endOffset timestamps from word-timing.json and match them with your audio playback position.

Example Code (JavaScript)

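// wordData is the parsed word-timing.json. Each word is assumed to be rendered
// in an element with the id "word-<index>", so repeated words stay distinct.
function highlightText(wordData, audioElement) {
  const words = wordData.results[0].alternatives[0].words;
  audioElement.addEventListener('timeupdate', () => {
    const currentTime = audioElement.currentTime;
    words.forEach((word, index) => {
      // Offsets are strings such as "0.200s"; a missing startOffset is treated as 0.
      const start = parseFloat(word.startOffset ?? '0');
      const end = parseFloat(word.endOffset);
      const element = document.getElementById(`word-${index}`);
      if (!element) return;
      // Highlight only the word whose time range contains the playback position.
      element.classList.toggle('highlight', currentTime >= start && currentTime < end);
    });
  });
}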
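For long texts, scanning every word on each timeupdate event may be more work than needed. As an alternative, the active word can be located with a binary search. This is a sketch, assuming the words are sorted by time and do not overlap, as in the sample response; findActiveWordIndex is a hypothetical helper name.

```javascript
// Returns the index of the word being spoken at currentTime, or -1 if none.
// Assumes words are sorted by time and their ranges do not overlap.
function findActiveWordIndex(words, currentTime) {
  let low = 0;
  let high = words.length - 1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    // A missing startOffset is treated as 0, as for the first sample word.
    const start = parseFloat(words[mid].startOffset ?? '0');
    const end = parseFloat(words[mid].endOffset);
    if (currentTime < start) {
      high = mid - 1; // the active word is earlier
    } else if (currentTime >= end) {
      low = mid + 1; // the active word is later
    } else {
      return mid; // start <= currentTime < end
    }
  }
  return -1; // between words, or before or after the speech
}
```

A timeupdate handler could call this once per event and update only the previously highlighted element and the newly active one.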

Conclusion

The Word Timing API enables accurate synchronization of audio with visual elements like captions and text highlighting. By leveraging the JSON timing data, you can create an engaging and accessible user experience.

Now, you can:

  • Generate word timing for synthesized speech
  • Read per-word timings from word-timing.json
  • Sync subtitles and highlights with speech

Start integrating the WellSaid Word Timing API today.