After you have installed your voice, start the test program called ModelTalker2.exe by double-clicking its desktop icon. If you then type something in the open text area of ModelTalker2 and click the Speak button, it should render the text in your synthetic ModelTalker voice.
Using your voice in other programs
In addition to hearing your voice in our ModelTalker2.exe program, you should be able to select it as the Windows default Synthetic Speech voice in the Speech control panel of any recent version of Windows. Most other programs that are “speech aware” and conform to the SAPI 5.0 standard should be able to use this voice as well.
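For readers who write their own software, the following is a minimal sketch of how a speech-aware program might select the voice through the SAPI automation interface. It assumes Windows, the third-party pywin32 package, and that the installed voice's description contains the text "ModelTalker"; that search string is a guess for illustration, so check the name shown in your Speech control panel.

```python
def pick_voice(descriptions, needle):
    """Return the index of the first description containing `needle`.

    The comparison ignores case. Returns None when nothing matches.
    """
    needle = needle.lower()
    for index, description in enumerate(descriptions):
        if needle in description.lower():
            return index
    return None


def speak_with_sapi(text, needle="ModelTalker"):
    """Speak `text` with the first installed SAPI voice matching `needle`.

    Windows only. The default `needle` is an assumption, not a documented name.
    """
    import win32com.client  # provided by the pywin32 package

    speaker = win32com.client.Dispatch("SAPI.SpVoice")
    tokens = speaker.GetVoices()
    descriptions = [tokens.Item(i).GetDescription() for i in range(tokens.Count)]
    index = pick_voice(descriptions, needle)
    if index is None:
        raise LookupError("No installed voice matches %r" % needle)
    speaker.Voice = tokens.Item(index)
    speaker.Speak(text)
```

Calling `speak_with_sapi("Hello from my own voice.")` on a machine where the voice is installed should speak the sentence aloud. Remember to close such programs before changing synthesis settings.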
Adjusting your voice
The SAPI speech controls provide only limited control of your voice. The ModelTalker2.exe program, however, offers many additional voice settings under the Settings > Synthesis menu. You are welcome to try changing these settings to see if they produce improvements in your synthetic voice. When you close ModelTalker2, your changes to the settings will be saved as the new default settings.
In the Settings dialog, there is a button to Restore Defaults if you find that you do not like changes you have introduced, so don’t be afraid to experiment.
One important thing to keep in mind is that you should not attempt to change any of these settings while your ModelTalker voice is selected as the default Windows voice in the control panel, or while it is being used by any other speech-aware software. Always make sure nothing else is trying to use your voice when changing settings, or the changes will not take effect.
There are three tabs in the Synthesis Settings dialog box. In the following, we briefly describe the meaning of the settings that may be of most use in adjusting the sound of your voice.
Target Costs Tab
ModelTalker works by choosing very brief segments of your recorded speech to concatenate into a full word or sentence. Usually, there are multiple examples of each segment to choose from, but some are likely to be better than others, depending on exactly what word or sentence is being produced. Target costs determine how ModelTalker chooses which segments might work best. It will typically choose quite a few examples and rank them by “cost”, with lower costs representing what are probably better candidates. Costs are spread over a number of features such as F0, Duration, Phonetic Context, etc., with each feature having a value between 0 and 100. The larger the value assigned to a particular feature, the more important that feature will be in determining how individual segments are ranked.
- Stress Target Cost: Stress in this context is related to the pattern of strong and weak syllables in words. For example, in the word “about”, the first syllable is unstressed and the second one is stressed, but in the word “apple” the reverse is true (the first syllable is stressed and the second is unstressed). ModelTalker should try to find segments that have the same syllable stress as the utterance to be synthesized. We normally set this parameter to an intermediate value.
- Boundary Target Cost: Speech sounds are made somewhat differently depending on how close they are to the beginning or end of a sentence or phrase, that is, how close they are to a phrase or sentence boundary. For instance, speech tends to be louder and more rapid at the beginning of a sentence, and a bit drawn-out and quieter at the end of a sentence. ModelTalker should try to find segments that come from the same part of a sentence as the part they are being used to synthesize. We normally set this parameter to an intermediate value, somewhat higher than the value for stress.
- Phonetic Target Cost: The phonetic context (i.e., the identity of the speech sounds immediately preceding and following each segment) is very important, and ModelTalker should always try to find segments that were originally recorded in the same phonetic context as the context in which they will be used for synthesis. We normally set this parameter to 100, and it is unlikely that smaller values will improve the synthesis quality.
- Accent Target Cost: Pitch accents are another way of describing the intonational features of utterances. They are a symbolic description of the different types of pitch variations talkers use to signal important parts of an utterance. As with stress and boundary features, ModelTalker should attempt to select segments that are associated with the most appropriate accent type. We normally assign a small, non-zero value to the Accent target cost.
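To make the idea of ranking by target cost concrete, here is a simplified sketch. It is not ModelTalker's actual code: it assumes each feature either matches or does not, and that a candidate's cost is the sum of the weights of its mismatched features. The weight values and feature labels are hypothetical, chosen only to follow the relative sizes described above.

```python
# Hypothetical weights on the 0-100 scale: phonetic context matters most,
# boundary slightly more than stress, accent only a little.
WEIGHTS = {"stress": 50, "boundary": 60, "phonetic": 100, "accent": 10}


def target_cost(candidate, target, weights=WEIGHTS):
    """Add up the weight of every feature where candidate and target differ."""
    return sum(w for name, w in weights.items() if candidate[name] != target[name])


def rank_candidates(candidates, target, weights=WEIGHTS):
    """Return the candidates sorted from lowest (best) to highest cost."""
    return sorted(candidates, key=lambda c: target_cost(c, target, weights))


# What the sentence being synthesized calls for at one position.
target = {"stress": "stressed", "boundary": "final", "phonetic": "a_t", "accent": "H*"}

candidates = [
    {"id": "C", "stress": "stressed", "boundary": "final", "phonetic": "a_d", "accent": "H*"},
    {"id": "B", "stress": "unstressed", "boundary": "final", "phonetic": "a_t", "accent": "H*"},
    {"id": "D", "stress": "stressed", "boundary": "initial", "phonetic": "a_t", "accent": "none"},
    {"id": "A", "stress": "stressed", "boundary": "final", "phonetic": "a_t", "accent": "H*"},
]

ranked_ids = [c["id"] for c in rank_candidates(candidates, target)]
```

In this toy example, raising a feature's weight pushes candidates that mismatch on that feature further down the ranking, which is the effect the sliders on this tab are meant to have.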
Join Costs Tab
After ranking each segment based on target costs, ModelTalker then constructs a synthetic utterance by concatenating a specific sequence of segments. Choosing which segments to concatenate is partly based on the target costs of the segments, but also on how well they will join with one another to form a smooth and natural sounding utterance. Join costs determine which features are most important in this final stage of the process. Although target costs tend to be mostly associated with linguistic features, join costs tend to be more related to acoustic properties, with the goal of reducing acoustic discontinuities in the output speech.
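The trade-off between target costs and join costs can be illustrated with a small search over candidate sequences. This is a sketch under stated assumptions, not a description of ModelTalker's internals: it assumes the total cost of a sequence is the sum of each segment's target cost plus a join cost at each boundary, uses only pitch and amplitude differences for the join cost, and finds the cheapest sequence by standard dynamic programming. All names and numbers are made up for the example.

```python
def seg(target, f0_start, f0_end, rms_start=1.0, rms_end=1.0):
    """Build a hypothetical candidate: a target cost plus edge measurements."""
    return {"target": target, "f0_start": f0_start, "f0_end": f0_end,
            "rms_start": rms_start, "rms_end": rms_end}


def join_cost(left, right, w_f0=1.0, w_rms=1.0):
    """Penalize pitch and amplitude jumps where two segments meet."""
    return (w_f0 * abs(left["f0_end"] - right["f0_start"])
            + w_rms * abs(left["rms_end"] - right["rms_start"]))


def best_sequence(slots):
    """Pick one candidate per slot so that target plus join costs are minimal.

    `slots` is a list of candidate lists. Returns (total_cost, chosen_indices).
    """
    # best[i][j] = (cheapest cost of any path ending at candidate j, back pointer)
    best = [[(c["target"], None) for c in slots[0]]]
    for i in range(1, len(slots)):
        row = []
        for cand in slots[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, cand), k)
                for k, prev in enumerate(slots[i - 1])
            )
            row.append((cost + cand["target"], back))
        best.append(row)

    # Trace the back pointers from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(slots) - 1, -1, -1):
        path.append(j)
        j = best[i][j][1]
    path.reverse()
    return total, path


slots = [
    [seg(0, 100, 100), seg(5, 120, 120)],   # candidates for the first segment
    [seg(0, 120, 120), seg(10, 100, 100)],  # candidates for the second segment
]
total, path = best_sequence(slots)
```

Here the candidates with the lowest target costs (index 0 in both slots) join badly because of a 20 Hz pitch jump, so the search prefers the second candidate for the first segment, which joins smoothly with the first candidate for the second segment.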
- F0 Join Cost: Determines the importance of avoiding discontinuities in the intonation contour by favoring segments that are the same pitch at their adjoining edges.
- RMS Join Cost: Determines the importance of avoiding sudden changes in amplitude or loudness by favoring segments that are of the same amplitude at their adjoining edges.
- Spectral Join Cost: Determines the importance of avoiding discontinuities in spectrum shape by favoring segments that are very similar in spectrum shape at their adjoining edges.
- Splice Cost: Determines the importance of trying to find stretches of segments that were originally recorded as continuous speech.
- Playback Rate: This is a linear scaling factor applied to the output speech. A value of -10 will reduce the speaking rate by a factor of 2, and a value of +10 will increase the speaking rate by a factor of 2; however, the scale is not bounded by [-10,+10] and values outside of that range can be used (though they are of doubtful utility).
- Speaking Rate (WPM): This sets a target speaking rate in words per minute. It is used in two ways. First, based on the segmental timing generated by the prosody model, it can bias the selection of candidates from the database. If Duration Target Cost is non-zero, segments closest to the target duration generated by the prosody model have a better chance of being selected. Second, if Duration Control is checked (see below), ModelTalker2 will modify the duration of segments to match the desired segment target duration.
- Max Candidates: The maximum number of candidates that will be considered for each unit. Increasing this value increases the chances of finding the ideal sequence of candidates for a particular utterance; however, it also increases the amount of memory required to conduct the search and the CPU demands for conducting the search. On desktops and modern laptops, we typically use values of around 1000, but for mobile devices, 250 or fewer may be a better balance between quality and resources.
- Duration Control: Adjusts the length of chunks of speech to be played back based on a model trained on your speech. Some voices benefit from this, others do not. Enabling Duration Control may improve the perceived timing of the speech at the expense of some loss of natural voice quality and the addition of some signal processing distortion.
- Pitch Control: As with Duration Control, enabling Pitch Control allows ModelTalker to force a smooth intonation contour on every utterance. Sometimes that can make the utterance sound more fluent, but it can also introduce distortion.
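As a worked example for the Playback Rate setting described above: the text gives only two reference points (-10 halves the rate, +10 doubles it). One mapping consistent with both, and with no change at 0, is a factor of 2 raised to the power of the setting divided by 10. The formula below is an assumption made for illustration, not a documented part of ModelTalker.

```python
def playback_rate_factor(setting):
    """Assumed mapping from the Playback Rate setting to a speed multiplier.

    Matches the two documented points: -10 gives 0.5 and +10 gives 2.0.
    """
    return 2.0 ** (setting / 10.0)


def scaled_duration(seconds, setting):
    """Length of an utterance after applying the assumed rate factor."""
    return seconds / playback_rate_factor(setting)


# A 3-second utterance at a few settings.
durations = {s: scaled_duration(3.0, s) for s in (-10, 0, 10)}
```

Under this assumption, a 3-second utterance would last 6 seconds at -10, 3 seconds at 0, and 1.5 seconds at +10. Values outside that range keep scaling in the same way, which is one reason they are rarely useful.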
One of the most common problems people encounter with any TTS system is that it fails to correctly pronounce names (people, streets, towns, etc.). If you find that ModelTalker fails to pronounce some important names for people or places in your life, let us know. We can probably fix those problems very easily. Sometimes TTS systems make errors that are not simple pronunciation errors, but sound like the word is missing some sounds or contains extraneous sounds. These problems are much harder to fix, but if you find some that are really horrible, do let us know about them too.