Introduction to SSML

SSML, or Speech Synthesis Markup Language, is a standard markup language that allows developers to control various aspects of speech synthesis such as pronunciation. It is used to enhance text-to-speech (TTS) capabilities by providing a way to add more natural and expressive elements to synthesized speech.

The WSL API lets you control voice output through the <say-as> tag. The tag specifies how particular portions of text, such as a date, time, telephone number, or currency amount, should be pronounced or interpreted by our text-to-speech engine.

For example, the following SSML code uses the <say-as> tag to specify that the text "2024-01-01" should be spoken as a date:


<speak><say-as interpret-as="date" format="ymd">2024-01-01</say-as></speak>

This would be spoken as "January first, twenty twenty-four" by a text-to-speech engine that supports SSML, not "2024 dash 01 dash 01."
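
Before sending markup to a speech engine, it can help to confirm that it is well-formed and to see which say-as hints it carries. The following is a minimal sketch using Python's standard library XML parser; the function name list_say_as is hypothetical and is not part of the WSL API.

```python
import xml.etree.ElementTree as ET

ssml = '<speak><say-as interpret-as="date" format="ymd">2024-01-01</say-as></speak>'


def list_say_as(ssml_text):
    """Return (interpret-as, format, text) for each say-as element.

    ET.fromstring raises xml.etree.ElementTree.ParseError if the
    markup is not well-formed, so this doubles as a syntax check.
    """
    root = ET.fromstring(ssml_text)
    return [
        (el.get("interpret-as"), el.get("format"), el.text)
        for el in root.iter("say-as")
    ]


print(list_say_as(ssml))  # [('date', 'ymd', '2024-01-01')]
```

This only inspects the markup locally. How the text is actually spoken is decided by the text-to-speech engine.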

Overall, SSML enhances text-to-speech capabilities by providing developers with the tools needed to create high-quality, customizable speech experiences within an application.


Benefits of SSML

The <say-as> tag enhances the naturalness, accuracy, and comprehensibility of synthesized speech in our API in cases that our model has not been specifically trained to handle. Key benefits to your business include:

  1. Improved Pronunciation: It ensures that specifically formatted text, such as dates, times, and numbers, is pronounced correctly by our text-to-speech engine.
  2. Enhanced Understanding: By indicating the intended interpretation of the text, it helps users better understand the content being spoken, especially when dealing with complex or ambiguous data.
  3. Consistent Formatting: It helps maintain consistency in speech output, ensuring that dates, times, numbers, and other formatted text are spoken uniformly across different contexts and platforms.
  4. Customization: Developers can customize how specific text is spoken, such as specifying the format of dates or how numbers are pronounced, to tailor the speech output to the needs of their application or audience.

How to Use SSML

The say-as element in SSML lets you specify the type of text inside it, helping to control how the text is read aloud. The element has further attributes like interpret-as and format which provide additional hints about the text's formatting.

NOTE: The most important attribute is interpret-as, which tells the processor the type of content in the element. interpret-as is the only required attribute.

To use the <say-as> tag in SSML, enclose the text you wish to format within the <say-as> opening and closing tags. Here's the basic syntax:


<speak>
    Your prescription will be ready by <say-as interpret-as="date" format="ymd">2024-01-01</say-as>.
</speak>

In this example, the text "2024-01-01" is wrapped in the <say-as> tag with the attribute interpret-as set to "date" to indicate that the text should be spoken as a date. The format attribute specifies the format of the date ("ymd" stands for year-month-day).

The result for the example above would be "Your prescription will be ready by January first, twenty twenty-four."
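
When SSML is assembled in application code, wrapping the text by hand is easy to get wrong, for example when the text contains a character such as & that must be escaped. The sketch below shows one way to build the markup in Python. The helper names say_as and speak are hypothetical conveniences and are not part of the WSL API.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text, interpret_as, fmt=None):
    """Wrap text in a say-as element; the format attribute is optional."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    # escape() protects &, < and > in the spoken text
    return f"<say-as {attrs}>{escape(text)}</say-as>"


def speak(body):
    """Wrap an already-built SSML fragment in the root speak element."""
    return f"<speak>{body}</speak>"


ssml = speak(
    "Your prescription will be ready by "
    + say_as("2024-01-01", "date", "ymd")
    + "."
)
print(ssml)
```

The printed string matches the basic syntax example above, on a single line.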

The say-as element's attributes are described in the following table.

| Attribute | Description | Required or optional |
| --- | --- | --- |
| interpret-as | Indicates the content type of an element's text. For a list of types, see the table under Supported tags. | Required |
| format | Provides additional information about the precise formatting of the element's text for content types that might have ambiguous formats. SSML defines formats for content types that use them. See the table under Supported tags. | Optional |

Supported tags

The following content types are supported for the interpret-as and format attributes. Include the format attribute where indicated in this table.

| interpret-as | format | Interpretation |
| --- | --- | --- |
| characters | spell-out, verbatim | The text is spoken as individual letters (spelled out). Example: <say-as interpret-as="characters">api</say-as> The speech synthesis engine pronounces this as "A P I." |
| cardinal | none | The text is spoken as a cardinal number. Example: There are <say-as interpret-as="cardinal">12</say-as> people in the queue. The speech synthesis engine pronounces this as "There are twelve people in the queue." |
| ordinal | none | The text is spoken as an ordinal number. Example: Select the <say-as interpret-as="ordinal">4th</say-as> option. The speech synthesis engine pronounces this as "Select the fourth option." |
| date | dmy, mdy, ymd, ydm, ym, my, md, dm, d, m, y | The text is spoken as a date. The format attribute specifies the date's format (d=day, m=month, and y=year). Example: Today is <say-as interpret-as="date">10-17-2024</say-as> The speech synthesis engine pronounces this as "Today is October seventeenth two thousand twenty-four." For date formats used outside the US, swap the day and month by setting format="dmy". Example: Today is <say-as interpret-as="date" format="dmy">10-12-2024</say-as> The speech synthesis engine pronounces this as "Today is December tenth two thousand twenty-four." |
| time | hms12, hms24 | The text is spoken as a time. The format attribute specifies whether the time uses a 12-hour clock (hms12) or a 24-hour clock (hms24). Use a colon to separate the numbers representing hours, minutes, and seconds. Valid time examples: 12:35, 1:14:32, 08:15, and 02:50:45. Example: The train departs at <say-as interpret-as="time" format="hms12">09:00am</say-as> The speech synthesis engine pronounces this as "The train departs at nine A M." NOTE: The hour component is required and may include a leading zero (for example, 02 for 2 o'clock). The minute component is optional and must be 2 digits. The seconds component is also optional; if it is included, the minute component must be included as well. This does not work for durations (for example, 2:30 meaning 2 hours and 30 minutes). |
| telephone | Country code (optional) to use as the default country when reading the phone number. If not provided, the default is US. For example, <say-as interpret-as="telephone" format="44">020-7499-9000</say-as> is read as a local number dialed from within the UK, whereas <say-as interpret-as="telephone" format="1">+4420-7499-9000</say-as> is read as an international number dialed from the US. | The text is spoken as a telephone number. Example: The number is <say-as interpret-as="telephone">(888) 555-1212</say-as> The speech synthesis engine pronounces this as "The number is area code eight eight eight five five five one two one two." |
| currency | none | The text is spoken as a currency. Example: <say-as interpret-as="currency">$79.9 USD</say-as> The speech synthesis engine pronounces this as "seventy-nine US dollars and ninety cents." |
| address | none | The text is spoken as an address. Example: I'm at <say-as interpret-as="address">123 15th ST NE, Redmond, WA</say-as> The speech synthesis engine pronounces this as "I'm at one twenty three 15th Street Northeast Redmond Washington." Road suffixes (St, Blvd, etc.), directionals (NE, SW, ...), unit designators (ste, apt), and state codes (TX, CA, ...) are expanded. Zip codes are read digit by digit (nine oh two one oh instead of ninety thousand two hundred and ten). For best results, follow USPS-recommended address formats, such as abbreviating only street as st and spelling out Saint in Saint Paul, MN. |
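
The combinations in the table above can be checked in application code before a request is sent. The following is a rough sketch, not an official validator. It assumes that content types listed with format "none" reject any format value and that a telephone country code is written as digits only; neither rule is stated explicitly by the API. The names SUPPORTED and check_say_as are hypothetical.

```python
# interpret-as value -> allowed format values, transcribed from the table.
# An empty set means the table lists "none"; None marks the telephone
# case, where format is a country code rather than a fixed keyword.
SUPPORTED = {
    "characters": {"spell-out", "verbatim"},
    "cardinal": set(),
    "ordinal": set(),
    "date": {"dmy", "mdy", "ymd", "ydm", "ym", "my", "md", "dm", "d", "m", "y"},
    "time": {"hms12", "hms24"},
    "telephone": None,
    "currency": set(),
    "address": set(),
}


def check_say_as(interpret_as, fmt=None):
    """Return a list of problems; an empty list means nothing was flagged."""
    if interpret_as not in SUPPORTED:
        return [f"unsupported interpret-as value: {interpret_as!r}"]
    if fmt is None:
        return []  # format is always optional
    allowed = SUPPORTED[interpret_as]
    if allowed is None:
        # Assumption: country codes are plain digits such as "1" or "44".
        if fmt.isdigit():
            return []
        return [f"telephone format should be a country code, got {fmt!r}"]
    if fmt not in allowed:
        return [f"format {fmt!r} is not listed for {interpret_as!r}"]
    return []


print(check_say_as("date", "ymd"))    # []
print(check_say_as("date", "hms12"))  # one problem reported
```

Returning a list of messages rather than raising an exception lets a caller collect every issue in a document in one pass.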

Code samples

  1. Address:

    <say-as interpret-as="address">1600 Amphitheatre Pkwy, Mountain View, CA</say-as>
    
    will be read as
    
    sixteen hundred Amphitheatre Parkway, Mountain View, California
    
    <say-as interpret-as="address">Miss Jane Doe, 1928 Hollywood Blvd, Beverly Hills, CA 90210</say-as>

    will be read as

    nineteen twenty eight Hollywood Boulevard, Beverly Hills, California nine oh two one oh
    
  2. Cardinal Number:

    <say-as interpret-as="cardinal">12345</say-as>

    will be read as

    twelve thousand three hundred and forty five
    
  3. Currency:

    
    <say-as interpret-as="currency" >$50.25</say-as>
    Fifty dollars and twenty five cents
    
    
  4. Date:

    <say-as interpret-as="date" format="ymd">2022-01-01</say-as>

    will be read as

    January first, twenty twenty two
    
  5. Ordinal Number:

    <say-as interpret-as="ordinal">1st</say-as>

    will be read as

    first
    
  6. Telephone Number:

    <say-as interpret-as="telephone">1-800-555-1212</say-as>

    will be read as

    one eight hundred five five five one two one two
    
  7. Time:

    <say-as interpret-as="time" format="hms12">3:30pm</say-as>

    will be read as

    three thirty PM
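
The rules for time strings (hour required, minutes optional and exactly two digits, seconds allowed only when minutes are present) can be approximated with a regular expression. This is a sketch only: the optional am/pm suffix is inferred from the examples on this page, the two-digit seconds requirement is an assumption, and the pattern does not range-check values such as hour 25. The names TIME_RE and looks_like_time are hypothetical.

```python
import re

# hour (1-2 digits), then optionally :MM, then optionally :SS,
# then an optional am/pm suffix as seen in "09:00am" and "3:30pm".
TIME_RE = re.compile(
    r"(\d{1,2})(?::(\d{2})(?::(\d{2}))?)?\s*(am|pm)?",
    re.IGNORECASE,
)


def looks_like_time(value):
    """Return True if value has the shape described for interpret-as="time"."""
    return TIME_RE.fullmatch(value.strip()) is not None


for sample in ["12:35", "1:14:32", "08:15", "02:50:45", "3:30pm", "9:5"]:
    print(sample, looks_like_time(sample))
```

Every sample except 9:5 passes, because a minute component must be two digits. A string such as 2:30 also passes, which is why a duration wrapped in a time element is read as a clock time rather than as 2 hours and 30 minutes.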