Almost Human (But Not Quite): Evaluating Text-to-Speech for eLearning Narration

In our previous article, we discussed the use of audio narration for our online courses. One potential voice source is text-to-speech, or TTS. We want to report what we learned about its viability from our view as internal eLearning developers.

The use of live narrators for eLearning can be a lengthy and resource-intensive process, both for initial production and for subsequent revisions. TTS, if viable, could make our production more efficient. Of course, the trade-off is voice quality. Can TTS really sound human enough to be practical? After spot-checking the TTS market for over a year, we recently took a more in-depth look.

A variety of considerations and opinions

On one hand, we learned that some learners, our own employees included, can accommodate TTS as long as they don’t have to strain to understand it. One source found that after several minutes, learners came to regard it as listening to someone with an accent. On the other hand, there are the elements of cost, suitability of voices, and ease of use.

Posts on an ASTD eLearning discussion group were unanimously against using TTS. Other articles were generally in favor of it under certain circumstances. We think some of the disparity stems from the wide variance in quality, not only between TTS engine manufacturers but even between different voices that use the same TTS engine.

TTS engines we reviewed

We evaluated TTS engines and voices from the following TTS engine manufacturers:

  • Acapela

  • Cepstral

  • Ivona

  • Loquendo

  • NeoSpeech

  • NextUp

In addition to these companies, all of which specialize in TTS services, we evaluated the voices that come with Adobe Captivate.

Voice quality

Typical options include male and female personalities along with accents such as American, British, and Australian. All voices were judged using the same passage from a script in one of our eLearning courses. Voice quality ranged from highly robotic to amazingly human-like. Besides one voice’s diction sounding quite different from another’s, we also found that a single voice could vary in quality depending on the passage.

Price

We found that the same package was priced quite differently depending on whether it was being licensed for individual use, internal distribution on an intranet, or commercial use.

TTS manufacturers seem to use one of two general business models. One is a hosted model. Text is entered on the host website and read by the selected voice. The user adjusts pronunciation and inflection until the sound is satisfactory. (See note under Ease of Use.) The user then downloads the finished product as an audio file. Most manufacturers who use this model base their fee on the number of finished minutes of audio. In our sample, fees for this kind of service ran between $7.50 and $11.00 per finished minute.

The other model is based on licensed downloads of the engine and voices. Fees for this kind of service varied from $2,500 per year for the engine and three voices to a one-time fee of $1,100 for the engine and two voices. Either way, additional voices are available for an additional fee.
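To put the two pricing models side by side, a rough break-even calculation can help. The Python sketch below uses only the sample prices quoted above; the function name and the simplifying assumptions are ours, for illustration. It assumes the license fee is the only cost of the licensed model (no additional voices) and ignores staff time, and it compares packages that include different numbers of voices, so treat the result as a ballpark figure rather than a recommendation.

```python
# Sample prices quoted in this article (US dollars).
HOSTED_FEE_LOW = 7.50       # hosted model, per finished minute (low end)
HOSTED_FEE_HIGH = 11.00     # hosted model, per finished minute (high end)
ANNUAL_LICENSE = 2500.00    # licensed model: engine + three voices, per year
ONE_TIME_LICENSE = 1100.00  # licensed model: engine + two voices, one-time


def break_even_minutes(license_fee: float, fee_per_minute: float) -> float:
    """Finished minutes at which hosted fees add up to the license fee.

    Assumption: the license fee is the only cost of the licensed model.
    """
    return license_fee / fee_per_minute


if __name__ == "__main__":
    licenses = [
        ("Annual license (minutes per year)", ANNUAL_LICENSE),
        ("One-time license (minutes in total)", ONE_TIME_LICENSE),
    ]
    for label, license_fee in licenses:
        # A higher hosted fee means the license pays for itself sooner.
        fewest = break_even_minutes(license_fee, HOSTED_FEE_HIGH)
        most = break_even_minutes(license_fee, HOSTED_FEE_LOW)
        print(f"{label}: break-even at {fewest:.0f} to {most:.0f} finished minutes")
```

Under these assumptions, the one-time license pays for itself after roughly 100 to 147 finished minutes of hosted audio, and the annual license after roughly 227 to 333 finished minutes per year.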

Ease of use

In our small eLearning shop, no one specializes in a particular skill or tool. Thus, if TTS is going to work for us, tweaking a voice’s inflection and punctuation must be very intuitive. We found that several TTS products do not have a graphical user interface. Rather, some of them use an SDK (software development kit) and are intended for use by developers only. Note: We were able to adjust some pronunciation by changing spelling and punctuation in a trial-and-error fashion.

Technical support

Finally, a critical factor in anyone’s use of TTS is technical support. Judging by the responsiveness to our inquiries, the quality of technical support can vary widely. Thus, we urge anyone considering TTS to check this carefully.

Conclusion

We believe the quality, price, and ease of use are reaching a point where text-to-speech is becoming a viable alternative to recording human voices for certain narration. After evaluating a variety of sources and voices, we feel the ones that ship with Adobe Captivate are acceptable for short passages. Some others are getting close to human-sounding. In our sample, which is not comprehensive, we found the following products to be viable based on quality of voices, price, and ease of use:

  • Virtual Speaker and Acapela Box by the Acapela Group

  • Studio Two by Ivona

  • Cepstral

However, because we found there can be noticeable variation between voices using the same engine, and even within the same voice from one passage to the next, we urge anyone considering TTS to evaluate the product thoroughly, across a wide sample of phrases.

Reference

“Text-to-Speech vs Human Narration for eLearning.” eLearning Technology, Tony Karrer, September 14, 2010, downloaded from https://elearningtech.blogspot.com/2010/09/text-to-speech-vs-human-narration-for.html

