While it is best to record all of the standard inventory sentences, that sometimes turns out to be too difficult. We will try to build a voice from as many recordings as you are able to complete. Our sentence material is ordered so that the most important material is recorded earliest. In studies we’ve run with these sentences, we have found the following to be a rough guideline to the tradeoff between the number of sentences recorded and the quality of the resulting TTS voice, first in intelligibility and then in naturalness:
- 200 sentences: Using only the first 200 sentences, it is possible to get a voice that will work some of the time, but it will not generally be usable for communication, particularly with strangers.
- 400 sentences: Voices made with the first 400 sentences can be usable, but there will still be many words that are mangled and hard or impossible to understand. The prosody (speech timing and intonation) will be quite robotic. This is the smallest number of sentences we recommend attempting to use as a real TTS voice.
- 800 sentences: With 800 sentences recorded, the synthetic voice will be approaching its maximum intelligibility. That is, recording more sentences will probably improve the intelligibility of the voice only slightly. However, speech prosody will still be awkward and will frequently sound incorrect. For example, questions are more likely to sound like statements, or statements like questions, because the intonation is not appropriate.
- 1600 sentences: As you go from 800 to 1600 sentences, the majority of the changes in voice quality will be changes in the naturalness of the speech. Sentences will more frequently sound like they have the correct rhythm and intonation. Effects like the way we indicate phrase and sentence boundaries will more often be correct.
- 3155 sentences: After the first 1600 sentences, nearly all of the remaining changes in voice quality will be changes in the naturalness of the speech. Rhythm, intonation, and the way we indicate phrase and sentence boundaries will be correct more often still.
Note that the studies we conducted to determine these guidelines used voices created from speech recorded under studio conditions by American English-speaking voice talent. These breakpoints are likely to be optimistic for speakers of other English dialects, for speech recorded under less-than-ideal audio conditions, and for talkers who are dysarthric or otherwise less able to produce exactly the correct sentences with a consistent speaking rate and style. Your experience may differ considerably.
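For readers who want to check a recording count against the guideline above, the breakpoints can be written as a simple lookup. This is only an illustrative sketch: the names `GUIDELINES` and `expected_quality` are hypothetical and not part of any tool we provide, the summaries are paraphrased from the list above, and the same caveats apply (the breakpoints are rough and are likely optimistic outside studio conditions).

```python
from bisect import bisect_right

# Rough breakpoints paraphrased from the guideline list above.
# Hypothetical table for illustration only; real results may differ.
GUIDELINES = [
    (200, "works some of the time; not generally usable for communication"),
    (400, "can be usable; many words still mangled, robotic prosody"),
    (800, "approaching maximum intelligibility; prosody still awkward"),
    (1600, "gains mostly in naturalness (rhythm, intonation)"),
    (3155, "further gains almost entirely in naturalness"),
]


def expected_quality(num_sentences: int) -> str:
    """Return the summary for the highest breakpoint already reached."""
    thresholds = [count for count, _ in GUIDELINES]
    # Index of the first breakpoint strictly greater than num_sentences.
    i = bisect_right(thresholds, num_sentences)
    if i == 0:
        return "below the smallest breakpoint studied (200 sentences)"
    return GUIDELINES[i - 1][1]


if __name__ == "__main__":
    for n in (150, 400, 1000, 3155):
        print(n, "->", expected_quality(n))
```

The lookup reports the last breakpoint fully reached rather than interpolating between tiers, since the guideline describes only these five points.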