The Actions on Google platform supports a number of SSML Beta features in addition to the Actions on Google standard SSML elements.
Summary of supported Beta SSML features:
- <phoneme>: Customize the pronunciation of specific words.
- <say-as interpret-as="duration">: Specify durations.
- <voice>: Switch between voices in the same request.
- <lang>: Use multiple languages in the same request.
- Timepoints: Use the <mark> tag to return the timepoint of a specified point in your transcript.
<phoneme>
You can use the <phoneme> tag to produce custom pronunciations of words inline. Actions on Google accepts the IPA and X-SAMPA phonetic alphabets. See the phonemes page for a list of supported languages and phonemes.
Each application of the <phoneme> tag directs the pronunciation of a single word:
<phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme>
<phoneme alphabet="x-sampa" ph='m@"hA:g@%ni:'>mahogany</phoneme>
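If you generate SSML programmatically, attribute quoting needs care because X-SAMPA uses the double quote as its primary stress marker. The following is a minimal Python sketch, not part of the platform: the `phoneme` helper is a hypothetical name, and it relies only on the standard library's `xml.sax.saxutils` to pick a safe quote character.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(word: str, ph: str, alphabet: str = "ipa") -> str:
    """Wrap a single word in a <phoneme> tag (hypothetical helper)."""
    if alphabet not in ("ipa", "x-sampa"):
        raise ValueError("alphabet must be 'ipa' or 'x-sampa'")
    # quoteattr switches to single quotes when the value contains a double
    # quote, which X-SAMPA uses to mark primary stress.
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(word)}</phoneme>"
    )


print(phoneme("manitoba", "ˌmænɪˈtoʊbə"))
print(phoneme("mahogany", 'm@"hA:g@%ni:', alphabet="x-sampa"))
```

The two calls reproduce the tags shown above, with the X-SAMPA transcription wrapped in single quotes.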
Stress markers
There are up to three levels of stress that can be placed in a transcription:
- Primary stress: Denoted with ˈ in IPA and " in X-SAMPA.
- Secondary stress: Denoted with ˌ in IPA and % in X-SAMPA.
- Unstressed: Not denoted with a symbol (in either notation).
Some languages might have fewer than three levels or not denote stress placement at all. See the phonemes page to see the stress levels available for your language. Stress markers are placed at the start of each stressed syllable. For example, in US English:
Example word | IPA | X-SAMPA |
---|---|---|
water | ˈwɑːtɚ | "wA:t@` |
underwater | ˌʌndɚˈwɑːtɚ | %Vnd@`"wA:t@` |
Broad vs narrow transcriptions
As a general rule, keep your transcriptions more broad and phonemic in nature.
For example, in US English, transcribe intervocalic t as a plain t (instead of using a tap):
Example word | IPA | X-SAMPA |
---|---|---|
butter | ˈbʌtɚ instead of ˈbʌɾɚ | "bVt@` instead of "bV4@` |
There are some instances where using the phonemic representation makes your TTS results sound unnatural (for example, if the sequence of phonemes is anatomically difficult to pronounce).
One example of this is voicing assimilation for s in English. In this case the assimilation should be reflected in the transcription:
Example word | IPA | X-SAMPA |
---|---|---|
cats | ˈkæts | "k{ts |
dogs | ˈdɑːgz instead of ˈdɑːgs | "dA:gz instead of "dA:gs |
Reduction
Every syllable must contain one (and only one) vowel. This means that you should avoid syllabic consonants and instead transcribe them with a reduced vowel. For example:
Example word | IPA | X-SAMPA |
---|---|---|
kitten | ˈkɪtən instead of ˈkɪtn | "kIt@n instead of "kItn |
kettle | ˈkɛtəl instead of ˈkɛtl | "kEt@l instead of "kEtl |
Syllabification
You can optionally specify syllable boundaries by using a period (.). Each syllable must contain one (and only one) vowel. For example:
Example word | IPA | X-SAMPA |
---|---|---|
readability | ˌɹiː.də.ˈbɪ.lə.tiː | %r\i:.d@."bI.l@.ti: |
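The one-vowel-per-syllable rule can be checked before sending a transcription. The sketch below is illustrative only: the vowel inventory is a small assumed set for US English IPA (not the platform's phoneme list), the function name is hypothetical, and a run of adjacent vowel symbols is treated as a single nucleus so that diphthongs count as one vowel.

```python
import re

# Assumption: a small illustrative vowel inventory, not the platform's list.
VOWELS = "iɪeɛæaɑɒɔoʊuʌəɚɜ"
# One nucleus = a run of vowel symbols, optionally followed by a length mark.
NUCLEUS = re.compile(f"[{VOWELS}]+ː?")
STRESS_MARKS = "ˈˌ"


def syllables_without_single_vowel(ipa: str) -> list[str]:
    """Return the syllables that do not contain exactly one vowel nucleus."""
    bad = []
    for syllable in ipa.split("."):
        core = syllable.strip(STRESS_MARKS)
        if len(NUCLEUS.findall(core)) != 1:
            bad.append(core)
    return bad


print(syllables_without_single_vowel("ˌɹiː.də.ˈbɪ.lə.tiː"))  # no offenders
print(syllables_without_single_vowel("ˈkɪ.tn"))  # flags the syllabic consonant
```

A transcription such as ˈkɪ.tn is flagged because its second syllable has no vowel, which matches the advice in the Reduction section to write ˈkɪtən instead.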
Durations
The Actions on Google platform supports <say-as interpret-as="duration"> to read durations correctly. For example, the following tag would be verbalized as "five hours and thirty minutes":
<say-as interpret-as="duration" format="h:m">5:30</say-as>
The format string supports the following values:
Abbreviation | Value |
---|---|
h | hours |
m | minutes |
s | seconds |
ms | milliseconds |
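As a rough illustration, a duration tag can be assembled from the abbreviations in the table. This Python sketch uses a hypothetical `duration_ssml` helper. Joining units with a colon follows the h:m example above; whether every other combination of units is accepted is an assumption rather than something this page confirms.

```python
# Abbreviations taken from the table above.
UNITS = ("h", "m", "s", "ms")


def duration_ssml(**parts: int) -> str:
    """Build a duration <say-as> tag from keyword units, e.g. h=5, m=30."""
    unknown = [unit for unit in parts if unit not in UNITS]
    if unknown or not parts:
        raise ValueError(f"expected one or more of {UNITS}, got {list(parts)}")
    # Keyword arguments keep their call order, so format and value line up.
    fmt = ":".join(parts)
    value = ":".join(str(amount) for amount in parts.values())
    return f'<say-as interpret-as="duration" format="{fmt}">{value}</say-as>'


print(duration_ssml(h=5, m=30))
```

Calling `duration_ssml(h=5, m=30)` reproduces the tag shown earlier in this section.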
<voice>
The <voice> tag allows you to use more than one voice in a single SSML request. In the following example, the default voice is an English male voice. All words will be synthesized in this voice except for "qu'est-ce qui t'amène ici", which will be verbalized in French using a female voice instead of the default language (English) and gender (male).
<speak>And then she asked, <voice language="fr-FR" gender="female">qu'est-ce qui t'amène ici</voice><break time="250ms"/> in her sweet and gentle voice.</speak>
Alternatively, you can use a <voice> tag to specify an individual voice (the voice name on the supported voices and languages page) rather than specifying a language and/or gender:
<speak>The dog is friendly <voice name="fr-CA-Wavenet-B">mais le chat est mignon</voice><break time="250ms"/> said a pet shop owner</speak>
When you use the <voice> tag, Actions on Google expects to receive either a name (the name of the voice you want to use) or a combination of the following attributes. All three attributes are optional but you must provide at least one if you don't provide a name.
- gender: One of male, female or neutral.
- variant: Used as a tiebreaker in cases where there are multiple possibilities of which voice to use based on your configuration.
- language: Your desired language. Only one language can be specified in a given <voice> tag. Specify your language in BCP-47 format. You can find the BCP-47 code for your language in the language code column on the supported voices and languages page.
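The attribute rules above can be enforced when building the tag in code. The following Python sketch uses a hypothetical `voice` helper. It reads "either a name or a combination of attributes" as mutually exclusive, which is an assumption of this sketch and not a documented platform rule.

```python
from xml.sax.saxutils import escape, quoteattr

GENDERS = ("male", "female", "neutral")


def voice(text, *, name=None, language=None, gender=None, variant=None):
    """Wrap text in a <voice> tag selected by name or by attributes."""
    selectors = {"language": language, "gender": gender, "variant": variant}
    selectors = {k: v for k, v in selectors.items() if v is not None}
    if name is not None and selectors:
        # Assumption: this sketch treats name and selectors as exclusive.
        raise ValueError("pass either name or language/gender/variant")
    if name is None and not selectors:
        raise ValueError("provide a name or at least one attribute")
    if gender is not None and gender not in GENDERS:
        raise ValueError(f"gender must be one of {GENDERS}")
    attrs = {"name": name} if name is not None else selectors
    rendered = " ".join(f"{k}={quoteattr(str(v))}" for k, v in attrs.items())
    return f"<voice {rendered}>{escape(text)}</voice>"


print(voice("qu'est-ce qui t'amène ici", language="fr-FR", gender="female"))
print(voice("mais le chat est mignon", name="fr-CA-Wavenet-B"))
```

Calling the helper with no name and no attributes raises a `ValueError`, mirroring the requirement to provide at least one of them.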
You can also control the relative priority of each of the gender, variant, and language attributes using two additional attributes: required and ordering.
- required: If an attribute is designated as required and not configured properly, the request fails.
- ordering: Any attributes listed in the ordering attribute are considered preferred rather than required. Preferred attributes are considered on a best effort basis, in the order in which they are listed. If any preferred attributes are configured incorrectly, Actions on Google might still return a valid voice but with the incorrect configuration dropped.
Examples of configurations using the required and ordering attributes:
<speak>And there it was <voice language="en-GB" gender="male" required="gender" ordering="gender language">a flying bird </voice>roaring in the skies for the first time.</speak>
<speak>Today is supposed to be <voice language="en-GB" gender="female" ordering="language gender">Sunday Funday.</voice></speak>
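Because required and ordering take space-separated lists of the attribute names gender, variant, and language, a typo in either value is easy to miss. The following Python sketch, with a hypothetical `parse_priority` helper, only checks that each listed name is one of those three. It does not model how the platform resolves the priorities.

```python
# The three attribute names that required and ordering can refer to.
SELECTORS = ("gender", "variant", "language")


def parse_priority(value: str) -> list[str]:
    """Split a required/ordering value and reject unknown attribute names."""
    names = value.split()
    unknown = [name for name in names if name not in SELECTORS]
    if unknown:
        raise ValueError(f"unknown attribute name(s): {unknown}")
    return names


print(parse_priority("gender language"))
print(parse_priority("gender"))
```

Running this check on the values from the examples above ("gender", "gender language", "language gender") passes, while a misspelled name such as "langauge" raises a `ValueError`.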
<lang>
You can use <lang> to include text in multiple languages within the same SSML request. All languages will be synthesized in the same voice unless you use the <voice> tag to explicitly change the voice. The xml:lang string must contain the target language in BCP-47 format (this value is listed as "language code" in the supported voices table). In the following example "chat" will be verbalized in French instead of the default language (English):
<speak>The French word for cat is <lang xml:lang="fr-FR">chat</lang></speak>
The Actions on Google platform supports the <lang> tag on a best effort basis. Not all language combinations produce the same quality results if specified in the same SSML request. In some cases, a language combination might produce an effect that is detectable but subtle or perceived as negative. Known issues:
- Japanese with Kanji characters is not supported by the <lang> tag. The input is transliterated and read as Chinese characters.
- Semitic languages such as Arabic and Hebrew, as well as Persian, are not supported by the <lang> tag and will result in silence. If you want to use any of these languages we recommend using the <voice> tag to switch to a voice that speaks your desired language (if available).
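The recommendation in the last known issue can be applied automatically when generating SSML. In this Python sketch the `in_language` helper is hypothetical. It falls back to <voice language="..."> for Arabic, Hebrew, and Persian (primary subtags ar, he, and fa) and uses <lang> otherwise. Whether a voice exists for a given code still has to be checked against the supported voices page.

```python
from xml.sax.saxutils import escape, quoteattr

# Primary language subtags for Arabic, Hebrew, and Persian, which the known
# issues above report as producing silence inside <lang>.
VOICE_ONLY = {"ar", "he", "fa"}


def in_language(text: str, code: str) -> str:
    """Wrap text in <lang>, or in <voice> for languages <lang> cannot read."""
    primary = code.split("-")[0].lower()
    if primary in VOICE_ONLY:
        tag, attr = "voice", "language"
    else:
        tag, attr = "lang", "xml:lang"
    return f"<{tag} {attr}={quoteattr(code)}>{escape(text)}</{tag}>"


print(in_language("chat", "fr-FR"))
print(in_language("שלום", "he-IL"))
```

The first call produces the same <lang> tag as the French example above. The second call switches to a <voice> tag instead.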