The ModelTalker Version 2 unit selection engine uses a function we refer to as hmmcat to perform the actual unit selection and waveform concatenation. The behavior of this module is controlled by settings in the <hmmcat> stanza of the voice-specific XML configuration file. An example of the stanza is given in Table I.
The voice configuration file has various names and locations depending on the operating system. On recent versions of Windows, the file is:
On Windows systems, nearly all of the configuration parameters of interest can be controlled from settings in the ModelTalker2.exe program, and we recommend using it to adjust the parameters (see ModelTalker2 User Guide). Other systems provide no alternative access to these settings, so changing them requires editing the configuration file by hand with a good text editor.
On Mac OSX systems, the configuration file is:
On Linux computers, the configuration file is:
Table I. hmmcat section of the voice configuration.
<hmmcat>
  <speakingRate>0.000000</speakingRate>
  <speakingRateWPM>150.000000</speakingRateWPM>
  <speakingVolume>100.000000</speakingVolume>
  <speakingPitch>0.000000</speakingPitch>
  <spliceType>middle</spliceType>
  <f0_tc>0.000000</f0_tc>
  <dur_tc>0.000000</dur_tc>
  <str_tc>1.000000</str_tc>
  <bnd_tc>2.000000</bnd_tc>
  <pc_tc>4.000000</pc_tc>
  <acc_tc>0.000000</acc_tc>
  <int_tc>0.000000</int_tc>
  <df_tc>0.000000</df_tc>
  <rob_tc>0.000000</rob_tc>
  <llk_tc>0.000000</llk_tc>
  <f0_jc>2.000000</f0_jc>
  <sp_jc>1.000000</sp_jc>
  <esp_jc>0.000000</esp_jc>
  <tp_jc>0.000000</tp_jc>
  <rms_jc>0.000000</rms_jc>
  <path_jc>0.000000</path_jc>
  <dur_jc>0.000000</dur_jc>
  <dc>1.000000</dc>
  <maxcand>250</maxcand>
  <dopitch>0</dopitch>
  <dotime>1</dotime>
  <dospec>0</dospec>
  <smoothF0>0</smoothF0>
  <smoothRMS>0</smoothRMS>
  <smoothSpec>0</smoothSpec>
</hmmcat>
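On systems where the file must be edited by hand, a small script can make the same change more repeatably. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. It parses an abbreviated copy of the stanza on its own; the layout of the enclosing configuration file is not described here, so locating the stanza inside the real file is left as an assumption. The helper names read_hmmcat and set_param are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the <hmmcat> stanza from Table I (only a few keys shown).
STANZA = """
<hmmcat>
  <speakingRateWPM>150.000000</speakingRateWPM>
  <spliceType>middle</spliceType>
  <dur_tc>0.000000</dur_tc>
  <maxcand>250</maxcand>
  <dotime>1</dotime>
</hmmcat>
"""


def read_hmmcat(xml_text):
    """Return the stanza's settings as a dict, converting numeric values to float."""
    root = ET.fromstring(xml_text)
    settings = {}
    for child in root:
        text = (child.text or "").strip()
        try:
            settings[child.tag] = float(text)
        except ValueError:
            settings[child.tag] = text  # e.g. spliceType holds a string
    return settings


def set_param(xml_text, key, value):
    """Return new XML text with one existing parameter changed."""
    root = ET.fromstring(xml_text)
    node = root.find(key)
    if node is None:
        raise KeyError(key)
    node.text = str(value)
    return ET.tostring(root, encoding="unicode")


settings = read_hmmcat(STANZA)
updated = read_hmmcat(set_param(STANZA, "maxcand", 1000))
```

Keeping a backup copy of the original configuration file before writing any change back is advisable.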
The following is a brief description of each parameter that might be of interest. Note that some parameters are there for internal experimental purposes. This document does not attempt to cover those features, just the ones that are likely to have significant impact on performance, especially on mobile platforms where memory and/or CPU cycles may be limited.
speakingRate – This is a linear scaling factor applied to the output speech. It is applied whether or not the voice is set to modify segment timing (see ‘dotime’ below). A value of -10 reduces the speaking rate by a factor of 2, and a value of +10 increases it by a factor of 2. The scale is not bounded by [-10,+10], however, and values outside that range can be used (though they are of doubtful utility).
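The description of speakingRate fixes only the values at -10 and +10, with the default of 0 presumably leaving the rate unchanged. How hmmcat interpolates between those points is not stated. One mapping consistent with them is 2 raised to the power value/10, and the sketch below assumes that form purely for illustration. The function name rate_factor is hypothetical, and the actual engine may behave differently between the anchor points.

```python
def rate_factor(speaking_rate):
    """Assumed mapping from speakingRate to a rate multiplier.

    Uses 2 ** (value / 10), which gives 0.5 at -10, 1.0 at 0 and 2.0 at +10.
    This form is an assumption, not taken from the hmmcat source.
    """
    return 2.0 ** (speaking_rate / 10.0)


half = rate_factor(-10.0)      # half the normal rate
normal = rate_factor(0.0)      # unchanged
double = rate_factor(10.0)     # twice the normal rate
```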
speakingRateWPM – This sets a target speaking rate in words per minute. It is used in two ways. First, based on the segmental timing generated by the prosody model, it can bias the selection of candidates from the database. If dur_tc is non-zero, segments closest to the target duration generated by the prosody model have a better chance of being selected. Second, if timing control is active (see ‘dotime’ below), MT2 will modify the duration of segments to make them match the segment target duration.
speakingVolume – A linear scale factor in the range 0 – 100. This value, divided by 100, is a multiplier on the output sample values. Normally this should be left at 100 (a scaling factor of 1.0) and amplitude should be controlled using the system audio controls.
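The arithmetic for speakingVolume can be sketched as follows. The function name apply_volume and the use of a plain list of sample values are assumptions made for illustration; only the divide-by-100 multiplier comes from the description above.

```python
def apply_volume(samples, speaking_volume=100.0):
    """Scale output samples by speaking_volume / 100, as described for speakingVolume."""
    if not 0.0 <= speaking_volume <= 100.0:
        raise ValueError("speakingVolume is expected to lie in the range 0 to 100")
    gain = speaking_volume / 100.0
    return [sample * gain for sample in samples]


unchanged = apply_volume([1000, -2000])        # default of 100 is a gain of 1.0
halved = apply_volume([1000, -2000], 50.0)     # gain of 0.5
```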
speakingPitch – A value in Hertz for the voice average F0. If 0.0, the database average F0 is used instead.
All target costs have the form key_tc. These influence the selection of candidate units from the database. Setting a target cost to 0.0 removes its influence from the candidate selection process. Setting it to a large value increases its importance in determining which candidate units are favored for synthesis. The actual values of these costs are only meaningful in a relative sense. For example, setting one cost to 6 and another to 3 (all other costs being 0) will have exactly the same effect as setting the first to 10 and the second to 5; that is, the first feature will be twice as important as the second in ranking the costs of candidates.
The keys and their meaning are:
- f0 – unit average f0
- dur – unit duration
- str – the lexical stress associated with the syllable in which the unit is found.
- bnd – the boundary (AKA break index) for the syllable in which the unit is found.
- pc – the phonetic context in which the unit was uttered. I.e., how closely a unit’s context matches the target context.
- acc – the pitch accent associated with the syllable containing the candidate unit.
The remaining keys are for internal experimental purposes and should not be modified.
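The relative nature of the target cost weights can be illustrated with a small sketch. It assumes that a candidate's total target cost is a weighted sum of per-feature mismatch scores, which is a common formulation but is not confirmed here as the one hmmcat uses. The candidate names and mismatch values are invented for the example.

```python
def target_cost(mismatch, weights):
    """Weighted sum of per-feature mismatch scores (assumed form)."""
    return sum(weights.get(key, 0.0) * value for key, value in mismatch.items())


def rank(candidates, weights):
    """Return candidate names ordered from lowest to highest target cost."""
    return sorted(candidates, key=lambda name: target_cost(candidates[name], weights))


# Hypothetical mismatch scores for three candidate units (0 = perfect match).
candidates = {
    "unit_a": {"str": 1.0, "pc": 0.0},
    "unit_b": {"str": 0.0, "pc": 1.0},
    "unit_c": {"str": 0.5, "pc": 0.2},
}

ranking_6_3 = rank(candidates, {"str": 6.0, "pc": 3.0})
ranking_10_5 = rank(candidates, {"str": 10.0, "pc": 5.0})
ranking_str_off = rank(candidates, {"str": 0.0, "pc": 3.0})
```

Both weightings with a 2:1 ratio produce the same ordering, while setting str to 0 removes stress from consideration and changes which unit ranks first.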
After selecting candidate phonetic segments and assigning a rating to each based on its target cost, MT2 considers how smoothly the candidates will blend together to make a natural-sounding synthetic utterance. This is based on a set of join costs that weight the various types of discontinuities that one might encounter when concatenating one unit with another. Each type of join cost is specified as key_jc. The important keys are listed below:
- f0 – The difference in fundamental frequency between the edges of two units. Using a large f0 join cost penalizes concatenation where there might be large jumps in voice pitch.
- esp – Epoch spectral differences. This join cost penalizes large differences in the speech spectrum at a concatenation boundary.
- rms – RMS Amplitudes. This join cost penalizes large increases or decreases in overall amplitude across a concatenation boundary.
Other join costs are experimental in nature and should not be modified at this time.
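As a rough illustration of how the join cost weights act, the sketch below treats each boundary feature as a single number and sums the weighted absolute differences. Both the scalar representation (a real spectral comparison would involve more than one value) and the weighted-sum form are simplifying assumptions, and the function name join_cost is hypothetical.

```python
def join_cost(left_edge, right_edge, weights):
    """Weighted sum of absolute feature differences across a join (assumed form)."""
    return sum(
        weight * abs(left_edge[key] - right_edge[key])
        for key, weight in weights.items()
        if weight != 0.0  # a weight of 0 removes that feature from the cost
    )


# Hypothetical edge measurements for two units that might be concatenated.
left = {"f0": 120.0, "esp": 0.3, "rms": 0.5}
right = {"f0": 135.0, "esp": 0.1, "rms": 0.5}

# Only the f0 join cost is active, mirroring a stanza with f0_jc = 2 and the others 0.
cost = join_cost(left, right, {"f0": 2.0, "esp": 0.0, "rms": 0.0})
```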
There are 4 additional parameters that can substantially affect the way the synthetic output sounds. These are:
- dc – Discontinuity cost. Every concatenation point is assumed to have some amount of discontinuity unless the two units were adjacent in a natural utterance. The dc is a cost that is added to each concatenation join unless it results in reconstructing a sequence that was naturally recorded. This cost actually grows with the square of the value specified.
- maxcand – The maximum number of candidates that will be considered for each unit. Increasing this value increases the chances of finding the ideal sequence of candidates for a particular utterance; however, it also increases the amount of memory and the CPU time required to conduct the search. On desktops and modern laptops, we typically use values of around 1000, but for mobile devices, 250 or less may be a better balance between quality and resources.
- dopitch – MT2 can control utterance pitch to follow an intonation contour that is determined by our prosody model. This can give sentences a more natural/expected sort of intonation, but it also adds distortion to the voice due to the additional signal processing that is involved. That, in turn, reduces the extent to which the synthesizer sounds like the target talker. Most users seem to prefer having dopitch turned off by setting it to 0. Any non-zero value turns it on.
- dotime – As with pitch, MT2 can modify segment durations so that they are a better match to the segment duration that is determined by our prosody model. Setting dotime to any non-zero value enables this control. We tend to favor voices generated with dotime turned on.
The remaining features (dospec, smoothF0, smoothRMS, smoothSpec) are experimental and should always be set to 0 for user voices.
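The interaction between dc and maxcand can be sketched with a toy search. The example assumes a Viterbi-style dynamic programming search and assumes that candidates are pruned by target cost; neither detail is confirmed here. The field names (id, utt, pos, tc) are hypothetical, and all join costs other than the dc penalty are omitted for brevity.

```python
def discontinuity_penalty(prev, cur, dc):
    """Return dc squared unless cur directly followed prev in a recorded utterance."""
    naturally_adjacent = prev["utt"] == cur["utt"] and cur["pos"] == prev["pos"] + 1
    return 0.0 if naturally_adjacent else dc ** 2


def prune(candidates, maxcand):
    """Keep the maxcand candidates with the lowest target cost (assumed criterion)."""
    return sorted(candidates, key=lambda c: c["tc"])[:maxcand]


def best_path(slots, dc, maxcand):
    """Viterbi-style search over one candidate list per target unit."""
    slots = [prune(slot, maxcand) for slot in slots]
    # best[i] = (total cost, path) of the cheapest path ending at candidate i
    best = [(c["tc"], [c["id"]]) for c in slots[0]]
    for prev_slot, slot in zip(slots, slots[1:]):
        new_best = []
        for cur in slot:
            cost, path = min(
                (
                    (best[i][0] + discontinuity_penalty(prev, cur, dc), best[i][1])
                    for i, prev in enumerate(prev_slot)
                ),
                key=lambda item: item[0],
            )
            new_best.append((cost + cur["tc"], path + [cur["id"]]))
        best = new_best
    return min(best, key=lambda item: item[0])


# Two target units, two candidates each. a1 and b1 were recorded back to back.
slots = [
    [{"id": "a1", "utt": 1, "pos": 4, "tc": 0.4}, {"id": "a2", "utt": 2, "pos": 0, "tc": 0.0}],
    [{"id": "b1", "utt": 1, "pos": 5, "tc": 0.4}, {"id": "b2", "utt": 3, "pos": 7, "tc": 0.0}],
]

_, path_no_dc = best_path(slots, dc=0.0, maxcand=250)
_, path_dc = best_path(slots, dc=1.0, maxcand=250)
_, path_pruned = best_path(slots, dc=1.0, maxcand=1)
```

With dc set to 0 the search picks the candidates with the best target costs regardless of where they came from. With dc set to 1 it prefers the naturally adjacent pair, and with maxcand reduced to 1 that pair is pruned away before the search can consider it.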