The Speech Synthesis Technology Computer Science Essay

Stephen Hawking is one of the most famous people using speech synthesis to communicate. Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. [1]

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely “synthetic” voice output. [2]

The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.


Overview of a typical TTS system

A text-to-speech system (or “engine”) is composed of two parts [3]: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations [4]), which is then imposed on the output speech.


Long before electronic signal processing was invented, there were those who tried to build machines to create human speech. Some early legends of the existence of “speaking heads” involved Gerbert of Aurillac (d. 1003 AD), Albertus Magnus (1198-1280), and Roger Bacon (1214-1294).

In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation, they are [aː], [eː], [iː], [oː] and [uː]). [5] This was followed by the bellows-operated “acoustic-mechanical speech machine” by Wolfgang von Kempelen of Vienna, Austria, described in a 1791 paper. [6] This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837, Charles Wheatstone produced a “speaking machine” based on von Kempelen’s design, and in 1857, M. Faber built the “Euphonia”. Wheatstone’s design was resurrected in 1923 by Paget. [7]

In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this device into the VODER, which he exhibited at the 1939 New York World’s Fair.

The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories in the late 1940s and completed in 1950. There were several different versions of this hardware device but only one currently survives. The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound. Using this device, Alvin Liberman and colleagues were able to discover acoustic cues for the perception of phonetic segments (consonants and vowels).

Dominant systems in the 1980s and 1990s were the MITalk system, based largely on the work of Dennis Klatt at MIT, and the Bell Labs system; [8] the latter was one of the first multilingual language-independent systems, making extensive use of natural language processing methods.

Early electronic speech synthesizers sounded robotic and were often barely intelligible. The quality of synthesized speech has steadily improved, but output from contemporary speech synthesis systems is still clearly distinguishable from actual human speech.

As the cost-performance ratio causes speech synthesizers to become cheaper and more accessible, more people will benefit from the use of text-to-speech programs. [9]

Electronic devices

The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. In 1961, physicist John Larry Kelly, Jr and colleague Louis Gerstman [10] used an IBM 704 computer to synthesize speech, an event among the most prominent in the history of Bell Labs. Kelly’s voice recorder synthesizer (vocoder) recreated the song “Daisy Bell”, with musical accompaniment from Max Mathews. Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey, [11] where the HAL 9000 computer sings the same song as it is being put to sleep by astronaut Dave Bowman. [12] Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers. [13]

Handheld electronics featuring speech synthesis began emerging in the 1970s. One of the first was the Telesensory Systems Inc. (TSI) Speech+ portable calculator for the blind in 1976. [14] [15] Other devices were produced primarily for educational purposes, such as Speak & Spell, produced by Texas Instruments [16] in 1978. The first multi-player game using voice synthesis was Milton from Milton Bradley Company, which produced the device in 1980.

Synthesizer technologies

The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible. Speech synthesis systems usually try to maximize both characteristics.

The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.

Concatenative synthesis

Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

Unit selection synthesis

Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a “forced alignment” mode with some manual correction afterward, using visual representations such as the waveform and spectrogram. [17] An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
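The idea of finding the “best chain of candidate units” can be illustrated with a small search. The sketch below does not use the weighted decision tree mentioned above; it swaps in a Viterbi-style dynamic-programming search over a target cost and a join cost, which is one common way to frame the problem. The unit tuples, the pitch-only cost functions, and the 0.5 join weight are all assumptions made for illustration.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one unit per position so that total target + join cost is minimal.

    targets:    list of target specifications, one per position
    candidates: list of lists; candidates[i] holds the database units for position i
    """
    # best holds (accumulated cost, path) for each candidate at the current position
    best = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
    for i in range(1, len(targets)):
        new_best = []
        for u in candidates[i]:
            # cheapest way to arrive at u from any unit at the previous position
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(targets[i], u), path + [u]))
        best = new_best
    return min(best, key=lambda item: item[0])


# Hypothetical units: (name, pitch in Hz). Targets are desired pitches.
def pitch_target_cost(target_pitch, unit):
    return abs(target_pitch - unit[1])


def pitch_join_cost(prev_unit, unit):
    return 0.5 * abs(prev_unit[1] - unit[1])  # assumed weight


targets = [100, 120]
candidates = [[("a1", 100), ("a2", 130)], [("b1", 90), ("b2", 125)]]
total, path = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
print(total, [name for name, _ in path])
```

A real system would score many more features (duration, phonetic context, spectral mismatch at the join) and prune the candidate lists, but the structure of the search stays the same.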

Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. [18] Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database. [19]

Diphone synthesis

Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA [20] or MBROLA. [21] The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.
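To make the notion of a diphone concrete, the following minimal sketch turns a phone sequence into the list of sound-to-sound transitions a diphone synthesizer would look up. The function name, the `_` silence symbol, and the phone labels are assumptions chosen for illustration, not part of any particular system.

```python
def to_diphones(phones, silence="_"):
    """Return the diphone names needed to speak a sequence of phones.

    The utterance is padded with silence on both sides so that the
    transitions into the first phone and out of the last one are included.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["k", "ae", "t"]))
# ['_-k', 'k-ae', 'ae-t', 't-_']
```

An utterance of n phones needs n + 1 diphones, and a language with p phones has at most (p + 1) squared possible diphones including silence; phonotactics rule many of those out, which is why the inventory size differs between languages.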

Domain-specific synthesis

Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. [22] The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.

Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in non-rhotic dialects of English the “r” in words like “clear” /ˈkliːə/ is usually only pronounced when the following word has a vowel as its first letter (e.g. “clear out” is realized as /ˌkliːəɹˈɑʊt/). Likewise in French, many final consonants become no longer silent if followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.

Formant synthesis

Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using additive synthesis and an acoustic model (physical modelling synthesis). [23] Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components. Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.

Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, in the early 1980s Sega arcade machines, [24] and in many Atari, Inc. arcade games [25] using the TMS5220 LPC chips. Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces. [26]

Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.

Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré’s “distinctive region model”.

HMM-based synthesis

HMM-based synthesis is a synthesis method based on hidden Markov models, also called statistical parametric synthesis. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion. [27]

Sinewave synthesis

Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles. [28]
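As a rough illustration of replacing formants with pure tones, the sketch below sums one sinusoid per formant for a single steady vowel. The function name, the 16 kHz sample rate, the amplitudes, and the formant frequencies (approximate textbook values for an “ah”-like vowel) are assumptions for the example; real sinewave speech tracks formant frequencies as they change over time.

```python
import math


def sinewave_speech(formants_hz, amplitudes, n_samples, sample_rate=16000):
    """Return n_samples of a signal made of one pure tone per formant."""
    samples = []
    for t in range(n_samples):
        value = sum(
            amp * math.sin(2 * math.pi * freq * t / sample_rate)
            for freq, amp in zip(formants_hz, amplitudes)
        )
        samples.append(value)
    return samples


# Assumed, approximate first three formants of an "ah"-like vowel.
vowel = sinewave_speech([730, 1090, 2440], [1.0, 0.5, 0.25], n_samples=160)
print(len(vowel))  # 160 samples, i.e. 10 ms at 16 kHz
```

The samples could be written to a WAV file with the standard `wave` module after scaling to 16-bit integers; that step is omitted here to keep the sketch short.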

Challenges

Text normalization challenges

The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation. There are many spellings in English which are pronounced differently based on context. For example, “My latest project is to learn how to better project my voice” contains two pronunciations of “project”.

Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics about frequency of occurrence.

Recently TTS systems have begun to use HMMs (discussed above) to generate “parts of speech” to aid in disambiguating homographs. This technique is quite successful for many cases, such as whether “read” should be pronounced as “red”, implying past tense, or as “reed”, implying present tense. Typical error rates when using HMMs in this fashion are usually below five percent. These techniques also work well for most European languages, although access to the required training corpora is frequently difficult in these languages.

Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words (at least in English), like “1325” becoming “one thousand three hundred twenty-five.” However, numbers occur in many different contexts; “1325” may also be read as “one three two five”, “thirteen twenty-five” or “thirteen hundred and twenty five”. A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous. [29] Roman numerals can also be read differently depending on context. For example “Henry VIII” reads as “Henry the Eighth”, while “Chapter VIII” reads as “Chapter Eight”.
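The three readings of “1325” can be reproduced with a few lines of code. This is a minimal sketch, assuming English, non-negative integers below one million, and hypothetical function names (`cardinal`, `digit_by_digit`, `in_pairs`); choosing which reading applies in context is the hard part and is not attempted here.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def _below_thousand(n):
    """Words for 1..999."""
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        word = TENS[n // 10]
        if n % 10:
            word += "-" + ONES[n % 10]
        parts.append(word)
    elif n > 0:
        parts.append(ONES[n])
    return " ".join(parts)


def cardinal(n):
    """Read n (0 <= n < 1,000,000) as a quantity."""
    if n == 0:
        return "zero"
    parts = []
    if n >= 1000:
        parts.append(_below_thousand(n // 1000) + " thousand")
        n %= 1000
    if n:
        parts.append(_below_thousand(n))
    return " ".join(parts)


def digit_by_digit(text):
    """Read a digit string one digit at a time, e.g. a PIN."""
    return " ".join(ONES[int(ch)] for ch in text)


def in_pairs(text):
    """Read a digit string two digits at a time, e.g. a year."""
    return " ".join(cardinal(int(text[i:i + 2])) for i in range(0, len(text), 2))


print(cardinal(1325))          # one thousand three hundred twenty-five
print(digit_by_digit("1325"))  # one three two five
print(in_pairs("1325"))        # thirteen twenty-five
```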

Similarly, abbreviations can be ambiguous. For example, the abbreviation “in” for “inches” must be differentiated from the word “in”, and the address “12 St John St.” uses the same abbreviation for both “Saint” and “Street”. TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs.

Text-to-phoneme challenges

Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. This is similar to the “sounding out”, or synthetic phonics, approach to learning reading.

Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but completely fails if it is given a word which is not in its dictionary. As dictionary size grows, so too do the memory space requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. (Consider that the word “of” is very common in English, yet is the only word in which the letter “f” is pronounced [v].) As a result, nearly all speech synthesis systems use a combination of these approaches.
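A combination of the two approaches can be sketched as a dictionary lookup with a rule-based fallback. Everything in the example is a toy assumption: the two-entry lexicon, the one-letter-to-one-sound rules, and the ARPAbet-style phoneme labels stand in for the much larger resources a real system would use.

```python
# Exception dictionary: irregular words are listed explicitly.
LEXICON = {
    "of": ["AH", "V"],   # the one word where "f" is pronounced [v]
    "the": ["DH", "AH"],
}

# Naive letter-to-sound rules, one phoneme per letter (toy values).
LETTER_RULES = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K", "y": "Y", "z": "Z",
}


def pronounce(word):
    """Look the word up first; fall back to letter rules if it is unknown."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return [LETTER_RULES[ch] for ch in word if ch in LETTER_RULES]


print(pronounce("of"))  # ['AH', 'V'] from the dictionary
print(pronounce("if"))  # ['IH', 'F'] from the rules
```

The lookup handles the irregular cases cheaply, while the rules guarantee some output for any input, which mirrors the trade-off described above.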

Languages with a phonemic orthography have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful. Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and borrowings, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems for languages like English, which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words, or words that are not in their dictionaries.

Evaluation challenges

The consistent evaluation of speech synthesis systems may be difficult because of a lack of universally agreed objective evaluation criteria. Different organizations often use different speech data. The quality of speech synthesis systems also depends to a large degree on the quality of the production technique (which may involve analogue or digital recording) and on the facilities used to replay the speech. Evaluating speech synthesis systems has therefore often been compromised by differences between production techniques and replay facilities.

Recently, however, some researchers have started to evaluate speech synthesis systems using a common speech dataset. [30]

Prosodics and emotional content

A recent study in the journal “Speech Communication” by Amy Drahota and colleagues at the University of Portsmouth, UK, reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling. [31] It was suggested that identification of the vocal features which signal emotional content may be used to help make synthesized speech sound more natural.

The first speech system integrated into an operating system that shipped in quantity was Apple Computer’s MacInTalk in 1984. Since the 1980s Macintosh computers have offered text-to-speech capabilities through the MacInTalk software. In the early 1990s Apple expanded its capabilities, offering system-wide text-to-speech support. With the introduction of faster PowerPC-based computers, they included higher-quality voice sampling. Apple also introduced speech recognition into its systems, which provided a fluid command set. More recently, Apple has added sample-based voices. Starting as a curiosity, the speech system of Apple Macintosh has evolved into a fully supported program, PlainTalk, for people with vision problems. VoiceOver was first featured in Mac OS X Tiger (10.4). During 10.4 (Tiger) and the first releases of 10.5 (Leopard) there was only one standard voice shipping with Mac OS X. Starting with 10.6 (Snow Leopard), the user can choose from a wide-ranging list of multiple voices. VoiceOver voices feature realistic-sounding breaths between sentences, as well as improved clarity at high read rates over PlainTalk. Mac OS X also includes say, a command-line application that converts text to audible speech. The AppleScript Standard Additions include a say verb that allows a script to use any of the installed voices and to control the pitch, speaking rate, and modulation of the spoken text.


The second operating system with advanced speech synthesis capabilities was AmigaOS, introduced in 1985. The voice synthesis was licensed by Commodore International from a third-party software house (Don’t Ask Software, now Softvoice, Inc.) and it featured a complete system of voice emulation, with both male and female voices and “stress” indicator markers, made possible by advanced features of the Amiga hardware audio chipset. [33] It was divided into a narrator device and a translator library. Amiga Speak Handler featured a text-to-speech translator. AmigaOS considered speech synthesis a virtual hardware device, so the user could even redirect console output to it. Some Amiga programs, such as word processors, made extensive use of the speech system.

Microsoft Windows

See also: Microsoft Agent

Modern Windows systems use SAPI4- and SAPI5-based speech systems that include a speech recognition engine (SRE). SAPI 4.0 was available on Microsoft-based operating systems as a third-party add-on for systems like Windows 95 and Windows 98. Windows 2000 added a speech synthesis program called Narrator, directly available to users. All Windows-compatible programs could make use of speech synthesis features, available through menus once installed on the system. Microsoft Speech Server is a complete package for voice synthesis and recognition, for commercial applications such as call centers.

Text-to-speech (TTS) capability refers to the ability of the operating system to play back printed text as spoken words. [34]

An internal driver (installed with the operating system), called a TTS engine, recognizes the text and speaks it using a synthesized voice chosen from several pre-generated voices. Additional engines, often specialized for a certain jargon or vocabulary, are also available from third-party manufacturers. [34]


Version 1.6 of Android added support for speech synthesis (TTS). [35]


The most recent TTS development in the web browser is the JavaScript Text to Speech work of Yury Delendik, which ports the Flite C engine to pure JavaScript. This allows web pages to convert text to audio using HTML5 technology. The ability to use Yury’s TTS port currently requires a custom browser build that uses Mozilla’s Audio-Data-API. However, much work is being done in the context of the W3C to move this technology into the mainstream browser market through the W3C Audio Incubator Group, with the involvement of the BBC and Google Inc.

Currently, there are a number of applications, plugins and gadgets that can read messages directly from an e-mail client and web pages from a web browser or Google Toolbar, such as Text-to-voice, which is an add-on to Firefox. Some specialized software can narrate RSS feeds. On one hand, online RSS-narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to podcasts. On the other hand, online RSS-readers are available on almost any PC connected to the Internet. Users can download generated audio files to portable devices, e.g. with the help of a podcast receiver, and listen to them while walking, jogging or commuting to work.

A growing field in Internet-based TTS is web-based assistive technology, e.g. ‘Browsealoud’ from a UK company and Readspeaker. It can deliver TTS functionality to anyone (for reasons of accessibility, convenience, entertainment or information) with access to a web browser. The non-profit project Pediaphon was created in 2006 to provide a similar web-based TTS interface to Wikipedia. [36] Additionally, SPEAK.TO.ME from Oxford Information Laboratories is capable of delivering text to speech through any browser without the need to download any special applications, and includes smart delivery technology to ensure only what is seen is spoken and the content is logically pathed.


Some models of Texas Instruments home computers produced in 1979 and 1981 (Texas Instruments TI-99/4 and TI-99/4A) were capable of text-to-phoneme synthesis or reciting complete words and phrases (text-to-dictionary), using a very popular Speech Synthesizer peripheral. TI used a proprietary codec to embed complete spoken phrases into applications, primarily video games. [37]

IBM’s OS/2 Warp 4 included VoiceType, a precursor to IBM ViaVoice.

Various systems operate on free and open source software platforms, including Linux. They include open-source programs such as the Festival Speech Synthesis System, which uses diphone-based synthesis (and can use a limited number of MBROLA voices), and gnuspeech from the Free Software Foundation, which uses articulatory synthesis. [38]

Companies which developed speech synthesis systems but which are no longer in this business include BeST Speech (bought by L&H), Eloquent Technology (bought by SpeechWorks), Lernout & Hauspie (bought by Nuance), SpeechWorks (bought by Nuance), and Rhetorical Systems (bought by Nuance).

Speech synthesis markup languages

A number of markup languages have been established for the rendering of text as speech in an XML-compliant format. The most recent is Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 2004. Older speech synthesis markup languages include Java Speech Markup Language (JSML) and SABLE. Although each of these was proposed as a standard, none of them has been widely adopted.

Speech synthesis markup languages are distinguished from dialogue markup languages. VoiceXML, for example, includes tags related to speech recognition, dialogue management and touchtone dialing, in addition to text-to-speech markup.

Applications

Speech synthesis has long been a vital assistive technology tool and its application in this area is significant and widespread. It allows environmental barriers to be removed for people with a wide range of disabilities. The longest-standing application has been in the use of screen readers for people with visual impairment, but text-to-speech systems are now commonly used by people with dyslexia and other reading difficulties as well as by pre-literate children. They are also frequently employed to aid those with severe speech impairment, usually through a dedicated voice output communication aid.

Sites such as Ananova and YAKiToMe! have used speech synthesis to convert written news to audio content, which can be used for mobile applications.

Speech synthesis techniques are also used in entertainment productions such as games and anime. In 2007, Animo Limited announced the development of a software application package based on its speech synthesis software FineSpeech, explicitly geared towards customers in the entertainment industries, able to generate narration and lines of dialogue according to user specifications. [39] The application reached maturity in 2008, when NEC Biglobe announced a web service that allows users to create phrases from the voices of Code Geass: Lelouch of the Rebellion R2 characters. [40]

TTS applications such as YAKiToMe! and Speakonia are often used to add synthetic voices to YouTube videos for comedic effect, as in Barney Bunch videos. YAKiToMe! is also used to convert entire books for personal podcasting purposes, RSS feeds and web pages for news stories, and educational texts for enhanced learning.

Software such as Vocaloid can generate singing voices via lyrics and melody. This is also the aim of the Singing Computer project (which uses GNU LilyPond and Festival) to help blind people check their lyric input. [41]

Beyond these applications, text-to-speech software is also popular in interactive voice response systems, often in combination with speech recognition. Examples of such voices can be found at or Nextup.

New PhD thesis presented on HMM-based speech synthesis

Posted by gabrielf

Last Friday, July 16th, a new PhD thesis was presented within the research group on media technologies. The author of the thesis is Xavier Gonzalvo and the title is “HMM-based speech synthesis applied to Spanish and English, its applications and a hybrid approach”. The advisors of this work are Dr. Joan Claudi Socoro and Dr. Ignasi Iriondo. We congratulate all for this excellent work!


This work presents a text-to-speech (TTS) system based on a statistical framework using hidden Markov models (HMMs) that deals with the main topics under study in recent years, such as voice style adaptation, trainable TTS systems and low-footprint databases. Furthermore, a cutting-edge hybrid approach combining concatenative and statistical synthesis is also presented. Ideas and results in this work show a step forward in the field of HMM-based TTS systems.