Gama Network Presents:

Flex Your Facial Muscles
By Jeff Lander
Gamasutra
April 14, 2000
URL: http://www.gamasutra.com/features/20000414/lander_01.htm
In my preceding article, "Read My Lips: Facial Animation Techniques," I left off with a nice short
list of the visemes I would need to represent speech realistically. However,
now I am left with the not insignificant problem of determining exactly how
to display these visemes in a real-time application.
It may seem as if this is purely an art problem, better left to your art staff
or, if you are a one-person development team, at least left to the creative
side of your brain. However, your analytical side needs to inject itself in
here a bit. This is one of those early production decisions you read about so
much in the Postmortem column that can make or break your schedule and budget.
Choose wisely and everything will work out great. Choose poorly and your art
staff or even your own brain will throttle you.
Decisions, Decisions
For the final result, I want a 3D real-time character that can deliver various pieces
of dialog in the most convincing manner possible. Thanks to the information
learned last month, I know I can severely limit the amount of work I need to
do. I know that with 13 visemes, or visual phoneme positions, I can reasonably
represent most sounds I expect to encounter. I even have a nice mapping from
American English to my set of visemes. Most other languages could probably be
represented by these visemes as well, but could require a different mapping
table.
From this information I can expect that if I can reasonably represent these 13 visemes
with my character mesh, then continuous lip-synch should be possible. So the
problem really comes down to how I construct and manipulate those meshes.
Viseme-Based Methods
Certainly, the obvious method for creating these 13 visemes is to generate 13 versions
of my character head mesh, one to represent each viseme. I can then use the
morphing techniques I discussed in my column “Mighty Morphing Mesh Machine,”
in the December 1998 issue of Game Developer to interpolate smoothly
between different sounds.
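As a rough illustration of that morphing step, here is a minimal sketch of linear interpolation between two viseme meshes. It assumes each mesh is a list of (x, y, z) vertex tuples in the same vertex order. The function name `morph` and the two-vertex stand-in meshes are invented for this example and are not taken from the earlier column.

```python
# Minimal sketch: linear morph between two viseme targets.
# Assumption: both meshes share vertex count and vertex order.

def morph(mesh_a, mesh_b, t):
    """Blend from mesh_a (t = 0.0) to mesh_b (t = 1.0)."""
    if len(mesh_a) != len(mesh_b):
        raise ValueError("morph targets must have the same vertex count")
    return [
        tuple(a + (b - a) * t for a, b in zip(va, vb))
        for va, vb in zip(mesh_a, mesh_b)
    ]

# Hypothetical two-vertex "meshes" standing in for full head models.
viseme_l = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
viseme_oo = [(0.0, 1.0, 0.0), (1.0, 2.0, 0.0)]

halfway = morph(viseme_l, viseme_oo, 0.5)
```

Stepping `t` from 0 to 1 over the length of a transition gives the smooth slide from one sound to the next.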
Figure 1. The “l” viseme as seen at the start of the word “life.”
Modeling the face to match the visemes is pretty easy. Once the artist has the base mesh
created, each viseme can be generated by deforming the mesh any way necessary
to get the right target frame. As long as no vertices are added or deleted and
the triangle topology remains the same, everything should work out great. Figure
1 shows an image of a character displaying the “L” viseme, as in the word “life.”
The tongue is behind the top teeth, slightly cupped, leaving gaps at the side
of the mouth, and the teeth are slightly parted.
Sounds pretty good so far. Just create 13 morph targets for the visemes in addition
to the base frame and you’re done. Life’s great, back to physics, right? Well,
not quite yet.
Suppose, in addition to simply lip-synching dialog, your characters must express some
emotion. You want them to be able to say things sadly, or speak cheerfully.
We need to add an emotional component to the system.
Adding Some Heart to the Story
At first glance, it may seem that you can simply add some additional morph targets
for the base emotions. Most people describe six basic emotions. Here they are
with some of their traits. (See Goldfinger under “For Further Info” for photo
examples of the six emotions.)
1. Happiness: Mouth smiles open or closed, cheeks puff, eyes narrow.
2. Sadness: Mouth corners pull down, brows incline, upper eyelids droop.
3. Surprise: Brows raise up and arch, upper eyelids raise, jaw drops.
4. Fear: Brows raise and draw together, upper eyelids raise, lower eyelids
tense upwards, jaw drops, mouth corners go out and down.
5. Anger: Inner brows pull together and down, upper eyelids raise, nostrils
may flare, lips are closed tightly or open exposing teeth.
6. Disgust: Middle portion of upper lip pulls up exposing teeth, inner brows
pull together and down, nose wrinkles.
There are variations of these emotions, such as contempt, pain, distress, and excitement,
but you get the idea. Very distinct versions of these six will get the message
across.
The key thing to notice about this list is that many of these emotions directly
affect the same regions of the model as the visemes. If you simply layer these
emotions on top of the existing viseme morph targets, you can get an additive
effect. This can lead to ugly results.
Figure 2. A very surprised “l” viseme.
For example, let me start with the “L” sound from before and blend in a surprised
emotion at 100 percent. The “L” sound moves the tongue up to the top set of
teeth and parts the mouth slightly. However, the surprise target drops the jaw
even farther but leaves the tongue alone. This combination blends into the odd-looking
character you see in Figure 2.
This problem really becomes apparent when the two meshes are actually fighting each
other. For example, the “oo” viseme drives the lips into a tight, pursed shape
while the surprise emotion drives the lips apart. Nothing pretty or realistic
will come out of that combination.
When I ran into this issue a couple of years ago, the solution was tied to the weighting.
By assigning a weight or priority to each morph target, I can compensate for
these problems. I give the “oo” viseme priority over the surprise frame. This
will suppress the effect that the surprise emotion has over shared vertices.
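One way to read that priority scheme is sketched below. It is only an interpretation, not the exact code I used back then: each target stores per-vertex offsets from the base mesh, targets are applied from highest priority to lowest, and each vertex keeps track of how much influence is left for lower-priority targets. The function name, the data layout, and the numbers are all assumptions made for the example.

```python
# Sketch of priority-weighted morph blending (an interpretation, not a
# reference implementation). Each target is (priority, weight, deltas),
# where deltas maps vertex index -> (dx, dy, dz) offset from the base.

def blend_with_priority(base, targets):
    result = [list(v) for v in base]
    remaining = [1.0] * len(base)  # influence still available per vertex
    # Highest priority first, so it claims shared vertices before the rest.
    for priority, weight, deltas in sorted(targets, key=lambda t: -t[0]):
        for index, delta in deltas.items():
            applied = weight * remaining[index]
            for axis in range(3):
                result[index][axis] += delta[axis] * applied
            remaining[index] -= applied
    return [tuple(v) for v in result]

# Hypothetical data: vertex 0 is a lip vertex shared by both targets,
# vertex 1 is a brow vertex touched only by the surprise target.
base = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
oo_viseme = (2, 1.0, {0: (0.0, -1.0, 0.0)})
surprise = (1, 1.0, {0: (0.0, 2.0, 0.0), 1: (0.0, 1.0, 0.0)})

blended = blend_with_priority(base, [surprise, oo_viseme])
```

With the “oo” viseme at full weight, the surprise target no longer moves the shared lip vertex, but it still raises the brow vertex that the viseme does not touch.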
Welcome to Muscle Beach
Most of the academic research on facial animation has not approached the problem
from a viseme basis. This is due to a fundamental drawback of the viseme-frame-based
approach. In the viseme-based system, every source frame of animation
is completely specified. While I can specify the amount each frame contributes
to the final model, I cannot create new source models dynamically. Say, for
example, I want to allow the character to raise one eyebrow. With the frames
I have described so far, this would not be possible. In order to accomplish
this goal, I would need to create individual morph targets with each eyebrow
raised individually. Since a viseme can incorporate a combination of many facial
actions, isolating these actions can lead to an explosive need for source meshes.
You may find yourself breaking these targets into isolated regions of the face.
Figure 3. The zygomaticus major muscle will put a smile on your face.
For this reason, researchers such as Frederic Parke and Keith Waters began examining
how the face actually works biologically. Examining the muscle structure
underneath the skin made a parametric representation of the face possible.
In fact, psychologists Paul Ekman and Wallace Friesen developed a system to
determine emotional state based on the measurement of individual muscle groups
as “action units.” Their system, called the Facial Action Coding System (FACS),
describes 50 of these action units that can create thousands of facial expressions.
By creating a facial model that is controlled via these action units, Waters
was able to simulate the effect that changes in the action units have on the
skin.
While I’m not sure if artists are ready to start creating parametric models controlled
by virtual muscles, there are definitely some lessons to be learned here. With
this system, it’s possible to describe any facial expression using these 50
parameters. It also completely avoids the additive morph problem I ran into
with the viseme system. Once a muscle is completely contracted, it cannot contract
any further. This limits the expression to ones that are at least physically
possible.
Artist-Driven Muscle-Based Facial Animation
Animation tools are not really developed to a point where artists can place virtual muscles
and attach them to a model. This would require a serious custom application
that the artists may be reluctant even to use. However, that doesn’t mean that
these methods are not available for game production. It just requires a different
way of thinking about modeling.
For instance, let me take a look at creating a simple smile. Biologically, I smile
by contracting the zygomaticus major muscle on each side of my face. This muscle
connects the outside of the zygomatic bone to the corner of the mouth as shown
in Figure 3. Contract one muscle and half a smile is born.
Figure 4. Pucker up: Incisivus labii at work.
O.K., Mr. Science, what does that have to do with modeling? Well, this muscle contracts
in a linear fashion. Take a neutral mouth and deform it as you would when the
left zygomaticus major is contracted. This mesh can be used to create a delta
table for all vertices that change. Repeat this process for all the muscles
you wish to simulate and you have all the data you need to start making faces.
You will find that you probably don’t need all 50 muscle groups described in
the FACS system. Particularly if your model has a low polygon count, this will
be overkill. The point is to create the muscle frames necessary to create all
the visemes and emotions you will need, plus any additional flexibility you
want. You will probably want to add some eye blinks, perhaps some eye shifts,
and tongue movement to make the simulation more realistic.
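The delta table idea can be sketched in a few lines. This assumes the neutral mesh and the artist-deformed muscle mesh are lists of (x, y, z) tuples with matching vertex order. The function names, the `epsilon` threshold, and the three-vertex sample data are hypothetical choices for illustration.

```python
# Sketch: build a per-muscle delta table from a neutral mesh and a mesh
# deformed as if one muscle were fully contracted, then apply it.

def build_delta_table(neutral, deformed, epsilon=1e-6):
    """Return {vertex_index: (dx, dy, dz)} for vertices that moved."""
    table = {}
    for index, (n, d) in enumerate(zip(neutral, deformed)):
        delta = tuple(dc - nc for nc, dc in zip(n, d))
        if any(abs(c) > epsilon for c in delta):
            table[index] = delta
    return table

def apply_muscle(mesh, table, contraction):
    """Offset mesh by a delta table scaled by contraction in [0, 1]."""
    # A fully contracted muscle cannot contract further, so clamp.
    contraction = max(0.0, min(1.0, contraction))
    out = list(mesh)
    for index, delta in table.items():
        out[index] = tuple(c + dc * contraction
                           for c, dc in zip(mesh[index], delta))
    return out

# Hypothetical three-vertex mouth; only the middle vertex moves when the
# left zygomaticus major contracts.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
left_smile = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (2.0, 0.0, 0.0)]

left_zygomaticus = build_delta_table(neutral, left_smile)
half_smile = apply_muscle(neutral, left_zygomaticus, 0.5)
```

Storing only the vertices that move keeps each muscle table small, which matters when most muscles affect just one region of the face.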
The FACS system is a scientifically based general modeling system. It does not consider
the individual features of a particular model. By allowing the modeler to deform
the mesh for the muscles instead of using this algorithmic system, I am giving
up general flexibility over a variety of meshes. However, I gain creative control
by allowing for exaggeration as well as artistic judgement.
The downside is that it is now much harder to describe to the artists what it is
you need. You need to purchase some sort of anatomy book (see my suggestions
at the end of the column) and figure out exactly what you want to achieve. Your
artists are going to resist. You had this nice list of 13 visemes and now you
are creating more work. They don’t know what an incisivus labii is and don’t
want to. You can explain that it is what makes Lara pucker up and they won’t
care. You will have to win the staff over by showing the creative possibilities
for character expression that are now available. They probably still won’t care,
so get the producer to force them to do it. I have created a sample muscle set
in Chart 1. This will give you some groups from which to pick.
Chart 1. The basic muscle groups involved in facial animation.
Now I need to relate these individual muscle meshes to the viseme and emotional
states. This is accomplished with “muscle macros” that blend the percentages
of the basic muscles to form complex expressions. This flexibility permits speech
and emotion in any language without the need for special meshes.
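A muscle macro can be pictured as a small table of muscle percentages. The sketch below is one possible layout, not the format of any particular tool: the macro names, muscle names, and percentages are made up for illustration, and summing overlapping muscles with a clamp at full contraction is my assumption about how macros might combine.

```python
# Sketch: "muscle macros" as tables of per-muscle contraction amounts.
# All names and numbers below are hypothetical.

MUSCLE_MACROS = {
    "viseme_oo": {"incisivus_labii": 0.75, "digastric": 0.25},
    "surprise": {"frontalis": 1.0, "digastric": 1.0},
}

def expand_macros(active, macros=MUSCLE_MACROS):
    """active maps macro name -> intensity in [0, 1].

    Returns the contraction for each muscle, summed across all active
    macros and clamped to 1.0 (a muscle cannot over-contract).
    """
    contraction = {}
    for name, intensity in active.items():
        for muscle, amount in macros[name].items():
            contraction[muscle] = contraction.get(muscle, 0.0) + amount * intensity
    return {muscle: min(1.0, c) for muscle, c in contraction.items()}

pose = expand_macros({"viseme_oo": 1.0, "surprise": 1.0})
```

The resulting per-muscle values are what would then scale each muscle's vertex deltas on the mesh.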
I still need to handle the case where several muscles interact with the same vertices.
However, now there is a biological foundation to what I am doing.
Certain muscles counteract the actions of other muscles. For example, the muscles needed
to create the “oo” viseme (incisivus labii) will counter the effect of the
jaw dropping (digastric, for those of you playing along at home). One real-time
animation package I have been working with, called Geppetto, from Quantumworks,
calls this Muscle Relations Channels. You can create a simple mathematical expression
between the two to enforce this relationship. You can see this effect in Figure
5.
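To make the idea of a relation expression concrete, here is a small sketch. It does not reproduce Geppetto's actual expressions, which I am not showing here. It simply assumes a linear rule in which one muscle's contraction scales down another's, with a hypothetical `strength` factor.

```python
# Sketch of a muscle relation: the suppressor muscle scales down the
# suppressed muscle. The linear rule and "strength" are assumptions.

def apply_relation(contractions, suppressor, suppressed, strength=1.0):
    """Return a copy of contractions with the suppressed muscle reduced."""
    out = dict(contractions)
    s = contractions.get(suppressor, 0.0)
    factor = max(0.0, 1.0 - strength * s)
    out[suppressed] = contractions.get(suppressed, 0.0) * factor
    return out

# Jaw fully dropped while the lips are half pursed for an "oo" sound.
raw = {"incisivus_labii": 0.5, "digastric": 1.0}
related = apply_relation(raw, "incisivus_labii", "digastric")
```

In this example the half-pursed lips cut the jaw drop in half, and a fully pursed mouth would cancel it entirely.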
Figure 5. W.C. Fields’s jaw is open and then blended with the “oo” viseme. Image courtesy of Virtual Celebrity Productions and Quantumworks.
Now for the Animation
I finally have my system set up and my models created. It is time to create some
real-time animation. The time-tested animation production method is to take
a track of audio dialog and go through it, matching the visemes in your model
set to the dialog. Then, in a second pass, go through it and add any emotional
elements you want. This, as you can imagine, is pretty time consuming. Complicating
the matter is that there are not many off-the-shelf solutions to help you out.
The job requires handling data in a very special way and most commercial animation
packages are not up to the task without help.
Detecting the individual phonemes within an audio track is part of the puzzle that you
can get help with. There is an excellent animation utility called Magpie Pro
from Third Wish Software that simplifies this task. It can take an audio track
and automatically analyze it for phoneme patterns you provide. While not entirely
accurate, it will at least get you started. From there you can manually match
up the visemes to the waveform until it looks right. The software also allows
you to create additional channels for things such as emotions and eye movements.
All this information can be exported as a text file containing the transition
information. This in turn can be converted directly to a game-ready stream of
data. You can see Magpie Pro in action in Figure 6.
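The conversion step might look something like the sketch below. The text layout here (one "frame viseme" pair per line) is purely hypothetical and does not reproduce Magpie Pro's actual export format. The viseme IDs, the 30 frames-per-second rate, and the function name are assumptions as well.

```python
# Sketch: turn a text list of viseme transitions into a compact stream
# of (time_in_seconds, viseme_id) pairs. The input format is invented
# for this example and is NOT Magpie Pro's real export format.

VISEME_IDS = {"rest": 0, "l": 1, "oo": 2}  # hypothetical subset of the 13

def build_stream(text, frames_per_second=30):
    stream = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        frame, name = line.split()
        viseme = VISEME_IDS[name]
        if stream and stream[-1][1] == viseme:
            continue  # same viseme as before: no new transition needed
        stream.append((int(frame) / frames_per_second, viseme))
    return stream

exported = "# frame viseme\n0 rest\n15 l\n16 l\n30 oo\n"
stream = build_stream(exported)
```

At run time the game only has to walk this list and morph toward the next viseme as the audio clock passes each timestamp.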
Figure 6. Magpie Pro simplifies the task of isolating phoneme patterns in your audio track.
Wire Me Up, Baby
With all the high-tech toys available these days, it may seem like a waste to spend
all this time hand-synching dialog. What about this performance capture everyone
has been talking about? There are many facial capture devices on the market.
Some determine facial movements by looking at dots placed on the subject’s face.
Others use a video analysis method for determining facial position. For more
detailed information on this aspect, have a look at Jake Rodgers’s article “Animating
Facial Expressions” in the November 1998 issue of Game Developer. The
end result is a series of vectors that describe how certain points on the face
move during a capture session. The number of points that can be captured varies
based on the system used. However, typically you get from about eight to hundreds
of sensor positions in either 2D or 3D. The data is commonly brought into an
animation system like Softimage or Maya and the data points drive the deformation
of a model. Filmbox by Kaydara is designed specifically to aid in the process
of capturing, cleaning up, and applying this form of data. Filmbox can also
apply suppressive expressions and inverse kinematic constraints, and perform audio
analysis similar to Magpie Pro.
This form of motion capture clearly can speed up the process of generating animation
information. However, it’s geared much more toward traditional animation and
high-end performance animation. In this respect it doesn’t really suit the real-time
game developer’s needs. It’s possible to drive a real-time character by using
the raw motion capture data to drive a facial deformation model. However, for
a real-time game application, I do not believe this is currently feasible.
In order to convert this stream of positional data into my limited real-time animation
system, I would need to analyze the data and determine what visemes and emotions
the performer is trying to convey. That calls for a filtering method that takes
the multiple sample points and selects the viseme or muscle action that is occurring.
This is really the key to making motion capture data usable for real-time character
animation. This area of research, termed gesture recognition, is pretty active
right now, and there is a lot of information out there for study. On the tool side, Quantumworks’
Geppetto provides gesture recognition from motion capture data to drive “muscle
macros” as both a standalone application and a plug-in for Filmbox.
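The simplest filter of this kind I can think of is a nearest-template match, sketched below. This is a deliberately naive stand-in for real gesture recognition and says nothing about how Geppetto does it. It assumes each captured frame and each reference pose is a list of 2D marker positions in the same marker order, and the template data is invented.

```python
# Sketch: pick the viseme whose reference marker layout is closest to
# the captured frame (nearest neighbor by squared distance). A naive
# stand-in for gesture recognition, with made-up template data.

def classify_pose(sample, templates):
    """Return the template name with the smallest squared distance."""
    def distance_sq(a, b):
        return sum((pa - pb) ** 2
                   for va, vb in zip(a, b)
                   for pa, pb in zip(va, vb))
    return min(templates, key=lambda name: distance_sq(sample, templates[name]))

# Two hypothetical mouth-corner markers per pose.
templates = {
    "rest": [(0.0, 0.0), (1.0, 0.0)],
    "oo": [(0.3, 0.0), (0.7, 0.0)],  # corners pulled in toward center
}

captured = [(0.25, 0.0), (0.8, 0.0)]
detected = classify_pose(captured, templates)
```

A production system would need smoothing over time and some tolerance for differences between performers, but the output is the same kind of thing: a viseme or muscle action name per frame rather than raw marker positions.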
Where Do We Go from Here?
Between viseme-based and muscle-based facial animation, you can see that there are a
lot of possible approaches and creative areas to explore. In fact, the whole
field has really opened up to game development in terms of opportunities for
game productions as well as tool developers. Games are going to need content
to start filling up those new DVD drives and I think facial animation is a great
way to take our productions to the next level.
For Further Information:
• Ekman, P. and W. Friesen. Manual for the Facial Action Coding System.
Palo Alto, Calif.: Consulting Psychologists Press, 1977.
• Faigin, Gary. The Artist’s Complete Guide to Facial Expression. New York:
Watson-Guptill Publications, 1990.
• Goldfinger, Eliot. Human Anatomy for Artists. New York: Oxford University
Press, 1991.
• Landreth, C. “Faces with Personality: Modeling Faces That Exude Personality
When Animated.” Computer Graphics World (February 1996): p. 58(3).
• Waters, Keith. “A Muscle Model for Animating Three-Dimensional Facial Expression,”
SIGGRAPH Vol. 21, No. 4 (July 1987): pp. 17-24.
Facial Animation
http://mambo.ucsc.edu/psl/fan.html
Gesture Recognition
http://www.cs.cmu.edu/~face
Performance Animation Society
http://www.pasociety.org
Magpie Pro
http://thirdwish.simplenet.com
Filmbox
http://www.kaydara.com
Geppetto
http://www.quantumworks.com
Acknowledgements
Thanks to Steve Tice of Quantumworks Corporation for the skull model and the
use of Geppetto as well as insight into muscle-based animation systems. The
W. C. Fields image is courtesy of Virtual Celebrity Productions LLC (http://www.virtualceleb.com)
and was created using Geppetto. The female kiss image is courtesy of Tom Knight of Imagination
Works.
When not massaging
the faces of digital beauties or doing stunt falls in a mo-cap rig, Jeff can
be found flapping his own lips at Darwin 3D. Send him some snappier dialogue
at jeffl@darwin3d.com.
Copyright © 2003 CMP Media Inc. All rights reserved.