In my preceding article, "Read My Lips: Facial Animation Techniques," I left off with a short list of the visemes I would need to represent speech realistically. However, now I am left with the not insignificant problem of determining exactly how to display these visemes in a real-time application.
It may seem as if this is purely an art problem, better left to your art staff. Or, if you are a one-person development team, at least left to the creative side of your brain. However, your analytical side needs to inject itself in here a bit. This is one of those early production decisions you read about so much in the Postmortem column that can make or break your schedule and budget. Choose wisely and everything will work out great. Choose poorly and your art staff or even your own brain will throttle you.
For the final result, I want a 3D real-time character that can deliver various pieces of dialog in the most convincing manner possible. Thanks to the information learned last month, I know I can severely limit the amount of work I need to do. I know that with 13 visemes, or visual phoneme positions, I can reasonably represent most sounds I expect to encounter. I even have a nice mapping from American English to my set of visemes. Most other languages could probably be represented by these visemes as well, but could require a different mapping table.
From this information I can expect that if I can reasonably represent these 13 visemes with my character mesh, then continuous lip-synch should be possible. So the problem really comes down to how I construct and manipulate those meshes.
Certainly, the obvious method for creating these 13 visemes is to generate 13 versions of my character head mesh, one to represent each viseme. I can then use the morphing techniques I discussed in my column “Mighty Morphing Mesh Machine,” in the December 1998 issue of Game Developer to interpolate smoothly between different sounds.
Figure 1. The “l” viseme as seen at the start of the word “life.”
Modeling the face to match the visemes is pretty easy. Once the artist has the base mesh created, each viseme can be generated by deforming the mesh any way necessary to get the right target frame. As long as no vertices are added or deleted and the triangle topology remains the same, everything should work out great. Figure 1 shows an image of a character displaying the “L” viseme, as in the word “life.” The tongue is behind the top teeth, slightly cupped, leaving gaps at the side of the mouth, and the teeth are slightly parted.
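To make the topology requirement concrete, here is a minimal sketch of morphing between two targets. It is not the code from the earlier column; the `morph` function, the `Vertex` alias, and the two-vertex sample data are hypothetical names invented for illustration. It assumes each target stores its vertices in the same order as the base mesh, so vertex i in one frame corresponds to vertex i in every other frame.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]


def morph(source: List[Vertex], target: List[Vertex], t: float) -> List[Vertex]:
    """Linearly interpolate every vertex from source (t = 0) to target (t = 1)."""
    # Morphing pairs vertices by index, so both frames need the same count.
    if len(source) != len(target):
        raise ValueError("morph targets must have the same vertex count")
    return [
        tuple(s + (e - s) * t for s, e in zip(src_vertex, tgt_vertex))
        for src_vertex, tgt_vertex in zip(source, target)
    ]


# Hypothetical data: two lip vertices in a base frame and in an "L" viseme frame.
base_frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
viseme_l = [(0.0, 0.5, 0.0), (1.0, -0.5, 0.0)]

# Halfway between the base frame and the viseme.
halfway = morph(base_frame, viseme_l, 0.5)
```

If an artist adds or deletes a vertex in one target, the length check fails immediately instead of silently pairing the wrong vertices.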
Sounds pretty good so far. Just create 13 morph targets for the visemes in addition to the base frame and you’re done. Life’s great, back to physics, right? Well, not quite yet.
Suppose in addition to simply lip-synching dialog, your characters must express some emotion. You want them to be able to say things sadly, or speak cheerfully. We need to add an emotional component to the system.
Adding Some Heart to the Story
At first glance, it may seem that you can simply add some additional morph targets for the base emotions. Most people describe six basic emotions. Here they are with some of their traits. (See Goldfinger under “For Further Info” for photo examples of the six emotions.)
1. Happiness: Mouth smiles open or closed, cheeks puff, eyes narrow.
2. Sadness: Mouth corners pull down, brows incline, upper eyelids droop.
3. Surprise: Brows raise up and arch, upper eyelids raise, jaw drops.
4. Fear: Brows raise and draw together, upper eyelids raise, lower eyelids tense upwards, jaw drops, mouth corners go out and down.
5. Anger: Inner brows pull together and down, upper eyelids raise, nostrils may flare, lips are closed tightly or open exposing teeth.
6. Disgust: Middle portion of upper lip pulls up exposing teeth, inner brows pull together and down, nose wrinkles.
There are variations of these emotions, such as contempt, pain, distress, and excitement, but you get the idea. Very distinct versions of these six will get the message across.
The key thing to notice about this list is that many of these emotions directly affect the same regions of the model as the visemes. If you simply layer these emotions on top of the existing viseme morph targets, you can get an additive effect. This can lead to ugly results.
Figure 2. A very surprised
For example, let me start with the “L” sound from before and blend in a surprised emotion at 100 percent. The “L” sound moves the tongue up to the top set of teeth and parts the mouth slightly. However, the surprise target drops the jaw even farther but leaves the tongue alone. This combination blends into the odd-looking character you see in Figure 2.
This problem really becomes apparent when the two meshes are actually fighting each other. For example, the “oo” viseme drives the lips into a tight, pursed shape while the surprise emotion drives the lips apart. Nothing pretty or realistic will come out of that combination.
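A small numeric sketch shows why the two targets fight. It assumes the simplest form of layering, where each target contributes its weighted offset from the base frame and the offsets are summed. The `blend_additive` function and the two-value "mouth" (lip width and lip opening) are hypothetical stand-ins for real vertex data.

```python
def blend_additive(base, targets, weights):
    """Return base + sum(weight_i * (target_i - base)) for every coordinate."""
    result = list(base)
    for target, weight in zip(targets, weights):
        if len(target) != len(base):
            raise ValueError("every target must match the base frame's size")
        for i, (b, t) in enumerate(zip(base, target)):
            result[i] += weight * (t - b)
    return result


# Hypothetical measurements: [lip width, lip opening].
base = [4.0, 1.0]
viseme_oo = [2.0, 0.5]   # pursed: narrow and nearly closed
surprise = [4.0, 3.0]    # jaw dropped: normal width, wide open

# Both targets at 100 percent.
combined = blend_additive(base, [viseme_oo, surprise], [1.0, 1.0])
```

With these numbers `combined` is `[2.0, 2.5]`: the lips are as narrow as the “oo” shape but opened almost as far as the surprise shape, which matches neither target.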
When I ran into this issue a couple of years ago, my solution was to weight the targets. By assigning a weight or priority to each morph target, I could compensate for these problems. For example, I give the “oo” viseme priority over the surprise frame, which suppresses the effect that the surprise emotion has on shared vertices.
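The text does not spell out the exact weighting rule, so the following is only one possible interpretation, sketched under stated assumptions. Targets are visited from highest to lowest priority, and each vertex has a budget of 1.0. A target that actually moves a vertex claims `weight` times whatever budget remains, leaving less for lower-priority targets. The `MorphTarget` class, the `blend_with_priority` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vertex = Tuple[float, float, float]


@dataclass
class MorphTarget:
    name: str
    vertices: List[Vertex]
    weight: float   # blend amount, 0.0 to 1.0
    priority: int   # higher values win on shared vertices


def blend_with_priority(base: List[Vertex], targets: List[MorphTarget],
                        eps: float = 1e-6) -> List[Vertex]:
    """Blend targets so higher-priority ones suppress lower ones per vertex."""
    ordered = sorted(targets, key=lambda t: t.priority, reverse=True)
    result = []
    for i, base_vertex in enumerate(base):
        position = list(base_vertex)
        remaining = 1.0  # share of this vertex not yet claimed
        for target in ordered:
            delta = [t - b for t, b in zip(target.vertices[i], base_vertex)]
            if max(abs(d) for d in delta) <= eps:
                continue  # this target leaves the vertex alone
            share = target.weight * remaining
            for axis, d in enumerate(delta):
                position[axis] += share * d
            remaining -= share
        result.append(tuple(position))
    return result


# Hypothetical two-vertex face: vertex 0 is a lip corner, vertex 1 is a brow.
base = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
viseme_oo = MorphTarget("oo", [(-1.0, 0.0, 0.0), (0.0, 2.0, 0.0)], 1.0, 2)
surprise = MorphTarget("surprise", [(0.0, -2.0, 0.0), (0.0, 3.0, 0.0)], 1.0, 1)

blended = blend_with_priority(base, [surprise, viseme_oo])
```

In this sample the lip corner follows the “oo” viseme only, because that target has the higher priority and claims the whole budget, while the brow still rises with the surprise target because the viseme never touches it.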