Text to Lips – Put a Face on Speech Synthesis

Text to Speech is becoming not only more popular, but easier to implement. Android devices have a built-in TTS library, and tools like Emscripten are making it easy to port utilities written in lower-level languages, such as TTS libraries, to JavaScript.

The downside is that many of the voices used for speech sound very "robotic". You can pay a premium for more human voices, but not all of us can make that kind of investment. My solution was to use a little JavaScript and Photoshop manipulation to create a moving mouth on a given image. I found that this made the robotic voice feel just a little bit more human.

The first thing that might come to your mind is an animated GIF file. When the speech begins, simply show the animated GIF. As the speech finishes, hide the GIF, or replace it with a non-animated equivalent. This might work, but it would not give you any flexibility over the mouth movement. I instead used three individual image files with varying mouth positions as seen below. Yes, it's an iguana!

The idea here is that the first image is shown when the iguana is not speaking. As soon as speech begins, we immediately switch to the second image, then set up a JavaScript interval to flip between the second and third images. Because we essentially only have two "frames" to flip between, the difference in mouth position needs to be minimal.
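To make the flip between the two "frames" concrete, here is a small sketch that models which mouth class is in effect at a given moment. It uses the timings from the JavaScript later in this post (a 500 ms interval, with changes at 150 ms and 300 ms after each tick). The frameAt helper and the constant names are hypothetical and exist only for illustration; they are not part of the app code.

```javascript
// Hypothetical model of the mouth schedule. The numbers mirror the
// setInterval/setTimeout values used in moveMouthStart.
var TICK_MS = 500;      // interval period
var OPEN2_AT_MS = 150;  // delay after a tick before the third image appears
var OPEN_AT_MS = 300;   // delay after a tick before returning to the second image

// Returns the mouth class in effect `ms` milliseconds after speech starts.
function frameAt(ms) {
    if (ms < TICK_MS) {
        return 'open';          // the first tick has not fired yet
    }
    var phase = ms % TICK_MS;   // time since the most recent tick
    return (phase >= OPEN2_AT_MS && phase < OPEN_AT_MS) ? 'open2' : 'open';
}
```

Under this model the third image is visible for 150 ms out of every 500 ms cycle, and nothing changes during the first 650 ms of speech. Adjusting the three constants is an easy way to reason about a faster or slower mouth before touching the real timers.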

Once we have the three individual images, it's just a matter of using CSS to position them and JavaScript to manage the rotation. In my case, this was for a mobile app, so I wanted to let the viewport size determine the size of the image. Given that, I started with higher-res images and placed them all as background images inside <div> elements. This allows the background-size CSS property to do its magic, adjusting them appropriately.

The HTML

	<div id="iguana" class="iguana"></div>
	<div id="iguana2" class="iguana"></div>
	<div id="iguana3" class="iguana"></div>

The CSS

	html, body { height: 100%; }

	/* first image: mouth closed */
	#iguana {
	    width: 100%;
	    height: 100%;
	    background: url(img/iguana.png) center center no-repeat;
	}

	/* second image */
	#iguana.open { background-image: url(img/iguana1.png); }

	/* third image */
	#iguana.open2 { background-image: url(img/iguana2.png); }

	@media only screen and (max-width: 1023px) {
	    /* smaller tablet portrait */
	    /* width and height are separated by a space; a comma would start a second background layer */
	    #iguana { background-size: 512px 384px; }
	}

	@media only screen and (max-width: 480px) {
	    /* mobile */
	    #iguana { background-size: 256px 192px; }
	}

The JavaScript

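	var mouthInterval;

	// Start the mouth animation, then speak; stop the mouth when speech ends.
	var speak = function(text) {
	    moveMouthStart();
	    startSpeech(text, function() {
	        /* called when speech is complete */
	        moveMouthStop();
	    });
	};

	var moveMouthStart = function() {
	    // Switch to the second image right away.
	    $('#iguana').addClass('open');
	    // Every 500 ms: show the third image at 150 ms, return to the second at 300 ms.
	    mouthInterval = setInterval(function() {
	        setTimeout(function() { $('#iguana').addClass('open2'); }, 150);
	        setTimeout(function() { $('#iguana').removeClass('open2'); }, 300);
	    }, 500);
	};

	var moveMouthStop = function() {
	    // The timer was created with setInterval, so clear it with clearInterval.
	    clearInterval(mouthInterval);
	    // Wait out any pending 150/300 ms timeouts before restoring the closed mouth.
	    setTimeout(function() { $('#iguana').removeClass('open open2'); }, 300);
	};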

The startSpeech function is essentially a placeholder, to be replaced with the function provided by your application's text-to-speech module. The rest is just basic JavaScript code, leveraging a callback and the browser's built-in interval and timeout functions.
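As one possible stand-in for startSpeech, the sketch below wraps the browser's Web Speech API (speechSynthesis and SpeechSynthesisUtterance). This is not the TTS module used for the iguana app, just an illustration of the callback contract that speak expects. The trailing synth and Utterance parameters are hypothetical additions so the function can be exercised outside a browser; in a page they can be omitted.

```javascript
// Illustrative startSpeech built on the Web Speech API.
// `synth` and `Utterance` are optional, hypothetical injection points;
// when omitted, the browser globals are used.
function startSpeech(text, onComplete, synth, Utterance) {
    synth = synth || window.speechSynthesis;
    Utterance = Utterance || window.SpeechSynthesisUtterance;

    var utterance = new Utterance(text);
    var finished = false;

    // Call onComplete at most once, whether speech ends normally or fails,
    // so the mouth never keeps moving after the voice has stopped.
    function done() {
        if (finished) { return; }
        finished = true;
        if (typeof onComplete === 'function') { onComplete(); }
    }

    utterance.onend = done;
    utterance.onerror = done;
    synth.speak(utterance);
    return utterance;
}
```

With this in place, speak('Hello from the iguana') would start the mouth, hand the text to the browser's voice, and stop the mouth from the onend or onerror handler. Treating an error the same as a normal finish is a design choice of this sketch; an app might prefer to report the failure separately.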

I put a little video together to demonstrate the final result: http://www.youtube.com/watch?v=zZ81Imt4IM4&feature=youtu.be

