Introduction to SSML
SSML, or Speech Synthesis Markup Language, is a standard markup language that allows developers to control various aspects of speech synthesis such as pronunciation. It is used to enhance text-to-speech (TTS) capabilities by providing a way to add more natural and expressive elements to synthesized speech.
The WSL API lets you shape the voice output with the <say-as> tag. The tag specifies how a portion of text, such as a date, time, telephone number, or currency amount, should be interpreted and pronounced by our text-to-speech engine.
For example, the following SSML code uses the <say-as> tag to specify that the text "2024-01-01" should be spoken as a date:
<speak><say-as interpret-as="date" format="ymd">2024-01-01</say-as></speak>
This would be spoken as "January first, twenty twenty-four" by a text-to-speech engine that supports SSML, not “2024 dash 01 dash 01.”
Overall, SSML enhances text-to-speech capabilities by providing developers with the tools needed to create high-quality, customizable speech experiences within an application.
Benefits of SSML
The <say-as> tag improves the naturalness, accuracy, and comprehensibility of synthesized speech in our API, particularly for text formats that our model has not been specifically trained on. Key benefits to your business include:
- Improved Pronunciation: It ensures that specifically formatted text, such as dates, times, and numbers, is pronounced correctly by our text-to-speech engine.
- Enhanced Understanding: By indicating the intended interpretation of the text, it helps users better understand the content being spoken, especially when dealing with complex or ambiguous data.
- Consistent Formatting: It helps maintain consistency in speech output, ensuring that dates, times, numbers, and other formatted text are spoken uniformly across different contexts and platforms.
- Customization: Developers can customize how specific text is spoken, such as specifying the format of dates or how numbers are pronounced, to tailor the speech output to the needs of their application or audience.
How to Use SSML
The say-as element in SSML lets you specify the type of text inside it, helping to control how the text is read aloud. The element has further attributes like interpret-as and format which provide additional hints about the text's formatting.
The most important attribute is interpret-as, which tells the processor the type of content in the element. interpret-as is the only required attribute.
To utilize the <say-as> tag in SSML, simply enclose the text you wish to format within the <say-as> opening and closing tags. Here's the basic syntax:
<speak>
Your prescription will be ready by <say-as interpret-as="date" format="ymd">2024-01-01</say-as>.
</speak>
In this example, the text "2024-01-01" is wrapped in the <say-as> tag with the attribute interpret-as set to "date" to indicate that the text should be spoken as a date. The format attribute specifies the format of the date ("ymd" stands for year-month-day).
The result for the example above would be "Your prescription will be ready by January first, twenty twenty-four."
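As a minimal sketch of how an application might assemble this markup, the Python below builds the same SSML string and checks that it is well-formed XML. The helper names `say_as` and `speak` are hypothetical and not part of the API; the sketch assumes only that the engine accepts the SSML shown above as a string.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap text in a <say-as> element (hypothetical helper, not an API call)."""
    if not interpret_as:
        # interpret-as is the only required attribute.
        raise ValueError("interpret-as is required")
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt:
        attrs += f" format={quoteattr(fmt)}"
    # escape() protects characters such as & and < inside the spoken text.
    return f"<say-as {attrs}>{escape(text)}</say-as>"


def speak(body: str) -> str:
    """Wrap an SSML fragment in the <speak> root element."""
    return f"<speak>{body}</speak>"


ssml = speak(
    "Your prescription will be ready by "
    + say_as("2024-01-01", "date", "ymd")
    + "."
)

# Parsing raises ParseError if the markup is not well-formed XML.
ET.fromstring(ssml)
print(ssml)
```

Escaping the text matters for values such as "AT&T", which would otherwise produce invalid markup.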
Usage of the say-as element's attributes is described in the following table.
Attribute | Description | Required or optional |
---|---|---|
interpret-as | Indicates the content type of an element's text. For a list of types, see the following table. | Required |
format | Provides additional information about the precise formatting of the element's text for content types that might have ambiguous formats. SSML defines formats for content types that use them. See the following table. | Optional |
Supported tags
The following content types are supported for the interpret-as and format attributes. Include the format attribute where indicated in this table.
interpret-as | format | Interpretation |
---|---|---|
characters, spell-out, verbatim | none | The text is spoken as individual letters (spelled out). The speech synthesis engine pronounces: <say-as interpret-as="characters">api</say-as> As "A P I." |
cardinal | none | The text is spoken as a cardinal number. The speech synthesis engine pronounces: There are <say-as interpret-as="cardinal">12</say-as> people in the queue As "There are twelve people in the queue." |
ordinal | none | The text is spoken as an ordinal number. The speech synthesis engine pronounces: Select the <say-as interpret-as="ordinal">4th</say-as> option As "Select the fourth option." |
date | dmy, mdy, ymd, ydm, ym, my, md, dm, d, m, y | The text is spoken as a date. The format attribute specifies the date's format (d=day, m=month, and y=year). The speech synthesis engine pronounces: Today is <say-as interpret-as="date">10-17-2024</say-as> As "Today is October seventeenth two thousand twenty-four." For date formats used outside of the US, use the format attribute to swap the day and month. The speech synthesis engine pronounces: Today is <say-as interpret-as="date" format="dmy">10-12-2024</say-as> As "Today is December tenth two thousand twenty-four." |
time | hms12, hms24 | The text is spoken as a time. The format attribute specifies whether the time is specified by using a 12-hour clock (hms12) or a 24-hour clock (hms24). Use a colon to separate numbers representing hours, minutes, and seconds. Here are some valid time examples: 12:35, 1:14:32, 08:15, and 02:50:45. The speech synthesis engine pronounces: The train departs at <say-as interpret-as="time" format="hms12">09:00am</say-as> As "The train departs at nine A M." NOTE: The "hour" component is required and may optionally include a leading zero (e.g. 02 to represent 2 o'clock). The "minute" component is optional and must be 2 digits. The "seconds" component is also optional. If the seconds component is included, the minute component must be included as well. Please note that this does not work for durations (for example, 2:30 meaning 2 hours and 30 minutes). |
telephone | The format parameter can specify the country code to use as the default country when reading the phone number. <say-as interpret-as="telephone" format="44">020-7499-9000</say-as> will read as a local number dialed from within the UK, whereas <say-as interpret-as="telephone" format="1">+4420-7499-9000</say-as> will read as an international number dialed from the US. If not provided, the default is US. | The text is spoken as a telephone number. The speech synthesis engine pronounces: The number is <say-as interpret-as="telephone">(888) 555-1212</say-as> As "The number is area code eight eight eight five five five one two one two." |
currency | none | The text is spoken as a currency. The speech synthesis engine pronounces: <say-as interpret-as="currency">$79.9 USD</say-as> As "seventy-nine US dollars and ninety cents." |
address | none | The text is spoken as an address. The speech synthesis engine pronounces: I'm at <say-as interpret-as="address">123 15th ST NE, Redmond, WA</say-as> As "I'm at one twenty three 15th Street Northeast Redmond Washington." Road suffixes (St, Blvd, etc.), directionals (NE, SW,…), unit designators (ste, apt), and state codes (TX, CA,…) will be expanded. Zip codes will be read by digit (nine oh two one oh instead of ninety thousand two hundred and ten). Best results will be obtained by following USPS recommended address formats, such as only abbreviating street as st, and spelling out Saint in Saint Paul, MN. |
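The table above can be turned into a small pre-flight check that runs before markup is sent to the engine. The following is only a sketch: the `SUPPORTED` mapping is transcribed from the table, `check_say_as` is a hypothetical helper, and treating the telephone format as an all-digit country code is an assumption based on the "44" and "1" examples.

```python
from typing import List, Optional, Set

# Allowed format values per interpret-as type, transcribed from the table above.
# An empty set means the table lists no format for that type.
NO_FORMAT: Set[str] = set()
SUPPORTED = {
    "characters": NO_FORMAT,
    "spell-out": NO_FORMAT,
    "verbatim": NO_FORMAT,
    "cardinal": NO_FORMAT,
    "ordinal": NO_FORMAT,
    "date": {"dmy", "mdy", "ymd", "ydm", "ym", "my", "md", "dm", "d", "m", "y"},
    "time": {"hms12", "hms24"},
    "telephone": None,  # format is a country code; checked separately below
    "currency": NO_FORMAT,
    "address": NO_FORMAT,
}


def check_say_as(interpret_as: str, fmt: Optional[str] = None) -> List[str]:
    """Return a list of problems; an empty list means the pair looks valid."""
    if interpret_as not in SUPPORTED:
        return [f"unsupported interpret-as value: {interpret_as!r}"]
    if fmt is None:
        return []  # format is optional for every type
    allowed = SUPPORTED[interpret_as]
    if allowed is None:
        # Assumption: country codes are plain ASCII digits such as "44" or "1".
        if not (fmt.isascii() and fmt.isdigit()):
            return [f"telephone format should be a country code, got {fmt!r}"]
        return []
    if fmt not in allowed:
        return [f"format {fmt!r} is not listed for interpret-as={interpret_as!r}"]
    return []


print(check_say_as("date", "ymd"))
print(check_say_as("date", "hms12"))
```

Returning a list of messages rather than raising lets a caller collect every problem in a document in one pass.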
Code samples
- Address:
<say-as interpret-as="address">1600 Amphitheatre Pkwy, Mountain View, CA</say-as> will be read as sixteen hundred Amphitheatre Parkway, Mountain View, California
<say-as interpret-as="address">Miss Jane Doe, 1928 Hollywood Blvd, Beverly Hills, CA 90210</say-as> will be read as nineteen twenty eight Hollywood Boulevard, Beverly Hills, California nine oh two one oh
- Cardinal Number:
<say-as interpret-as="cardinal">12345</say-as> will be read as twelve thousand three hundred and forty five
- Currency:
<say-as interpret-as="currency">$50.25</say-as> will be read as fifty dollars and twenty five cents
- Date:
<say-as interpret-as="date" format="ymd">2022-01-01</say-as> will be read as January first, twenty twenty two
- Ordinal Number:
<say-as interpret-as="ordinal">1st</say-as> will be read as first
- Telephone Number:
<say-as interpret-as="telephone">1-800-555-1212</say-as> will be read as one eight hundred five five five one two one two
- Time:
<say-as interpret-as="time" format="hms12">3:30pm</say-as> will be read as three thirty PM
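The time rules in the Supported tags table (required hour, optional two-digit minute, seconds only when minutes are present) can be approximated with a regular expression. This sketch is not the engine's parser: `is_valid_time` is a hypothetical helper, and the two-digit seconds, the optional am/pm suffix, and the hour ranges for hms12 and hms24 are assumptions inferred from the examples in this document.

```python
import re

# Hour is required (one or two digits, leading zero allowed).
# Minute is optional and must be two digits.
# Seconds are optional and can only follow a minute component.
TIME_RE = re.compile(
    r"(?P<hour>[0-9]{1,2})"
    r"(?::(?P<minute>[0-9]{2})"
    r"(?::(?P<second>[0-9]{2}))?)?"
    r"\s*(?P<meridiem>am|pm)?",
    re.IGNORECASE,
)


def is_valid_time(text: str, fmt: str = "hms12") -> bool:
    """Check a time string against the documented shape (assumed ranges)."""
    m = TIME_RE.fullmatch(text.strip())
    if m is None:
        return False
    hour = int(m["hour"])
    minute = int(m["minute"] or 0)
    second = int(m["second"] or 0)
    if minute > 59 or second > 59:
        return False
    if fmt == "hms12":
        return 1 <= hour <= 12  # assumption: 12-hour clock runs 1 to 12
    if fmt == "hms24":
        # Assumption: an am/pm suffix makes no sense on a 24-hour clock.
        return m["meridiem"] is None and 0 <= hour <= 23
    return False


for sample in ["12:35", "1:14:32", "08:15", "02:50:45", "09:00am", "3:30pm"]:
    print(sample, is_valid_time(sample))
```

A check like this can catch malformed input such as "9:5" before it reaches the engine, though the engine remains the authority on what it accepts.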