Gama Network Presents:

Read My Lips: Facial Animation Techniques
By Jeff Lander
Gamasutra
April 06, 2000
URL: http://www.gamasutra.com/features/20000406/lander_01.htm

The "Graphic Content" column in Game Developer follows the erratic
path of a professional computer graphics developer, namely me. Anyone who has
ever been in a professional production situation realizes that real-world coding
these days requires a broad area of expertise. When this expertise is lacking,
developers need to be humble enough to look things up and turn to people around
them who are more experienced in that particular area.

As I continue to explore areas of graphics technology, I have attempted to document
the research and resources I have used in creating projects for my company.
My research demands change from month to month depending on what is needed at
the time. This month, I have the need to develop some facial animation techniques,
particularly lip sync. This means I need to shelve my physics research for a
bit and get some other work done. I hope to get back to moments of inertia,
and such, real soon.

And Now for Something Completely Different

My problem right now is facial animation. In particular, I need to know enough
in order to create a production pathway and technology to display real-time
lip sync. My first step when trying to develop new technology is to take a historic
look at the problem and examine previous solutions. The first people I could
think of who had explored facial animation in depth were the animators who created
cartoons and feature animation in the early days of Disney and Max Fleischer.

Facial animation in games has built on this tradition. Chiefly, this has been achieved
through cut-scene movies animated using many of the same methods. Games like
Full Throttle and The Curse of Monkey Island used facial animation
for their 2D cartoon characters in the same way that the Disney animators would
have. More recently, games have begun to include some facial animation in real-time
3D projects. Tomb Raider has had scenes in which the 3D characters pantomime
the dialog, but the face is not actually animated. Grim Fandango uses
texture animation and mesh animation for a basic level of facial animation.
Even console titles like Banjo Kazooie are experimenting with real-time
“lip-flap” without even having a dialog track. How do I leverage this tradition
into my own project?

Phonemes and Visemes

No discussion of facial animation is possible without discussing phonemes. Jake
Rodgers’s article “Animating Facial Expressions” (Game Developer, November
1998) defined a phoneme as an abstract unit of the phonetic system of a language
that corresponds to a set of similar speech sounds. More simply, phonemes are
the individual sounds that make up speech. A naive facial animation system may
attempt to create a separate facial position for each phoneme. However, in English
(at least where I speak it) there are about 35 phonemes. Other regional dialects
may add more.

Now, that’s a lot of facial positions to create and keep organized. Luckily, the
Disney animators realized a long time ago that using all phonemes was overkill.
When creating animation, an artist is not concerned with individual sounds,
just how the mouth looks while making them. Fewer facial positions are necessary
to visually represent speech since several sounds can be made with the same
mouth position. These visual references to groups of phonemes are called visemes.
How do I know which phonemes to combine into one viseme? Disney animators relied
on a chart of 12 archetypal mouth positions to represent speech as you can see
in Figure 1.

Figure 1. The 12 classic Disney mouth positions.

Each mouth position or viseme represented one or more phonemes. This reference chart
became a standard method of creating animation. As a game developer, however,
I am concerned with the number of positions I need to support. What if my game
only has room for eight visemes? What if I could support 15 visemes? Would it
look better?

Throughout my career, I have seen many facial animation guidelines with different numbers
of visemes and different organizations of phonemes. They all seem to be similar
to the Disney 12, but also seem like they involved animators talking to a mirror
and doing some guessing.

I wanted to establish a method that would be optimal for whatever number of visemes
I wanted to support. Along with the animator’s eye for mouth positions, there
are the more scientific models that reduce sounds into visual components. For
the deaf community, which does not hear phonemes, spoken language recognition
relies entirely on lip reading. Lip-reading samples base speech recognition
on 18 speech postures. Some of these mouth postures show very subtle differences
that a hearing individual may not see.

So, the Disney 12 and the lip reading 18 are a good place to start. However, making
sense of the organization of these lists requires a look at what is physically
going on when we speak. I am fortunate to have a linguist right in the office.
It’s times like this when it helps to know people in all sorts of fields, no
matter how obscure.

Science Break

The field of linguistics, specifically phonetics, compares phonemes according to
their actual physical attributes. The grouping does not really concentrate on
the visual aspects, as sounds rely on things going on in the throat and in the
mouth, as well as on the lips. But, perhaps this can help me organize the phonemes
a bit.

Sounds can be categorized according to voicing, manner of articulation (airflow), and
the places of articulation. There are more, but these will get the job done.
As speakers of English, we automatically create sounds correctly without thinking
about what is going on inside the mouth. Yet, when we see a bad animation, we
know it doesn’t look quite right although we may not know why. With the information
below, you will be equipped to know why things look wrong. Now for some group
participation. This is an interactive article. Go on, no one is looking. The
categories we want to examine are:

Voiced vs. Voiceless. Put your hand on your throat and say something. You can feel
an intermittent vibration. Now say, “p-at, b-at, p-at, b-at,” (emphasizing the
initial consonant). Looking at the face, there is no visual difference between
voiced and voiceless sounds. In some sounds the vocal cords are vibrating together
(b, voiced) and in some the vocal cords are apart (p, voiceless). This is an
automatic no-brainer as far as reducing sounds into one viseme. Any pair of
sounds that is only different because of voicing can be reduced to the same
viseme. In English, that eliminates eight phonemes.

Nasal vs. Oral. Put your fingers on your nose. Slowly say “momentary.” You can
feel your nose vibrating when you are saying the “m.” Some sounds are said through
the nasal cavity, but most are said through the oral cavity. These are also
not visibly different. So again, we have an automatic reduction in phonemes.
All three nasal sounds in English can be folded into the visemes of their oral counterparts.

Manners of Speech. Sounds can also be differentiated by the amount of opening through
the oral tract. These also do not offer a visible clue, but are very important
for categorizing phonemes. Sounds that have complete closure of the airstream
are called stops. Sounds that have a partially obstructed closure and turbulent
airflow are called fricatives. A sound that combines a stop/fricative is called
an affricate. Sounds that have a narrowing of the vocal tract, but no turbulent
airflow, are called approximants. And then there are sounds that have relatively
no obstruction of the airflow; these are the vowels.

Figure 2. Side cut-out view of places of articulation.

Places of Articulation. This involves where the sound is being made in the mouth.
This is where the visible differences occur. There are several places of articulation
(see Figure 2) involving the lips, teeth, tongue, and stuff in the back of the
mouth (the palate, velum, and glottis) for the consonants. Vowel placement is
based on the relative height of the tongue and whether the tongue is more front
or back in the mouth. A differentiating factor not listed in Chart 1 is lip
rounding. This is not associated with any particular place of articulation and
will be addressed below. Whew.

As I said, there are 35 phonemes in my dialect of American English. You may have
more. Chart 1 is a summary of these phonemes. Read the chart from the front
of the mouth to the back of the mouth. Try saying each of the words that illustrate
the phoneme that is in bold. Have a look in the mirror and see what is going
on as well as feel what is going on inside the head. By using the distinction
of voicing and oral/nasal, we have already eliminated 11 phonemes. Let’s continue
the reduction of phonemes into the usable visemes.
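
The bookkeeping so far can be checked with a short Python sketch. Chart 1 is not reproduced in this text, so the pair list below is an assumption drawn from standard English phonetics rather than copied from the chart. The labels (example words stand in for sounds that have no single letter) and the `look_alike` helper are hypothetical names used only for illustration.

```python
# Voiced/voiceless consonant pairs of English, written (voiceless, voiced).
# Assumption: these are the eight pairs the text has in mind.
VOICING_PAIRS = [
    ("p", "b"), ("t", "d"), ("k", "g"), ("f", "v"),
    ("thigh", "thy"), ("s", "z"), ("shy", "vision"), ("chime", "jive"),
]

# Each nasal is made in the same place as an oral stop.
NASAL_TO_ORAL = {"m": "b", "n": "d", "hang": "g"}

TOTAL_PHONEMES = 35  # the author's count for his dialect


def look_alike(phoneme):
    """Return the phoneme whose mouth shape can stand in for `phoneme`."""
    # First drop nasality, then drop voicing.
    phoneme = NASAL_TO_ORAL.get(phoneme, phoneme)
    for voiceless, voiced in VOICING_PAIRS:
        if phoneme == voiced:
            return voiceless
    return phoneme


eliminated = len(VOICING_PAIRS) + len(NASAL_TO_ORAL)
print(eliminated, TOTAL_PHONEMES - eliminated)  # 11 24
```

Eight voicing pairs plus three nasals account for the 11 eliminated phonemes, which leaves 24 distinct sounds still to be grouped by how they look.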

Take It to the Limit

According to the chart, there are three bilabials, which are sounds made with both lips.
They are [b], [p], and [m]. According to Figures 3a, 3b, and 3c, they have
different attributes inside the mouth. B and P differ only in that B makes
use of the vocal cords and P does not. The M sound is voiced like the B sound,
but it is made through the nose. The cool thing about these
sounds is that while there are differences inside the mouth, visually there
is no difference. If you look in a mirror and say “buy,” “pie,” and “my” they
all look identical. We have reduced three phonemes into one viseme.

Chart 1. American English phoneme summary chart.

While you’re working, remember that you are thinking with respect to sounds (phonemes),
not letters. In many cases a phoneme is written with multiple letters. So, if
we go through Chart 1, we can continue to reduce the 35 phonemes into 13 visemes.
For the most part, the visemes are categorized along the lines of the Places
of Articulation (with the exception of [r]).

Take a look at the following listing of visemes. It describes the look of each phoneme
in American English. The only phoneme not listed is [h]. “In English, ‘h’ acts
like a consonant, but from an articulatory point of view it is simply the voiceless
counterpart of the following vowel.” (Ladefoged, 1982:33-4). In other words,
treat [h] like the vowel that comes after it.

Visemes

1. [p, b, m] - Closed lips.
2. [w] & [boot] - Pursed lips.
3. [r*] & [book] - Rounded open lips with corner of lips slightly puckered.
   If you look at Chart 1, [r] is made in the same place in the mouth as
   the sounds of #7 below. One of the attributes not denoted in the chart
   is lip rounding. If [r] is at the beginning of a word, then it fits here.
   Try saying “right” vs. “car.”
4. [v] & [f] - Lower lip drawn up to upper teeth.
5. [thy] & [thigh] - Tongue between teeth, no gaps on sides.
6. [l] - Tip of tongue behind open teeth, gaps on sides.
7. [d, t, z, s, r*, n] - Relaxed mouth with mostly closed teeth with pinkness
   of tongue behind teeth (tip of tongue on ridge behind upper teeth).
8. [vision, shy, jive, chime] - Slightly open mouth with mostly closed teeth
   and corners of lips slightly tightened.
9. [y, g, k, hang, uh-oh] - Slightly open mouth with mostly closed teeth.
10. [beat, bit] - Wide, slightly open mouth.
11. [bait, bet, but] - Neutral mouth with slightly parted teeth and slightly
    dropped jaw.
12. [boat] - Very round lips, slightly dropped jaw.
13. [bat, bought] - Open mouth with very dropped jaw.

To see how helpful this information can be when animating a face, take a word like
“hack.” It has four letters, three phonemes, and only two visemes (13 and 9
in the listing).
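
As a rough sketch, the listing can be turned into a lookup table. The Python below is illustrative only: the label spellings, `VISEME_OF`, and `visemes_for_word` are hypothetical names, and the handling of [h] and [r] is one reading of the notes in the listing, not a definitive rule.

```python
# Viseme numbers follow the listing (1-13). Labels reuse the listing's own
# spellings; sounds with no single letter are named by their example word.
VISEME_OF = {
    "p": 1, "b": 1, "m": 1,
    "w": 2, "boot": 2,
    "book": 3,
    "v": 4, "f": 4,
    "thy": 5, "thigh": 5,
    "l": 6,
    "d": 7, "t": 7, "z": 7, "s": 7, "n": 7,
    "vision": 8, "shy": 8, "jive": 8, "chime": 8,
    "y": 9, "g": 9, "k": 9, "hang": 9, "uh-oh": 9,
    "beat": 10, "bit": 10,
    "bait": 11, "bet": 11, "but": 11,
    "boat": 12,
    "bat": 13, "bought": 13,
}


def visemes_for_word(phonemes):
    """Map one word's phoneme labels to viseme numbers, merging repeats."""
    result = []
    for i, ph in enumerate(phonemes):
        if ph == "h":
            # [h] borrows the look of the sound that follows it.
            if i + 1 >= len(phonemes):
                continue  # assumption: a trailing [h] adds no mouth shape
            ph = phonemes[i + 1]
        if ph == "r":
            # Word-initial [r] is rounded (viseme 3); elsewhere it sits in 7.
            viseme = 3 if i == 0 else 7
        else:
            viseme = VISEME_OF[ph]
        # Consecutive identical visemes collapse into a single mouth shape.
        if not result or result[-1] != viseme:
            result.append(viseme)
    return result


print(visemes_for_word(["h", "bat", "k"]))  # "hack" -> [13, 9]
print(visemes_for_word(["r", "bet", "d"]))  # "red"  -> [3, 11, 7]
```

Run on “hack,” the three phonemes collapse to visemes 13 and 9, matching the count given above.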

Say that you don’t have enough space to include 13 visemes and whatever emotions
you want expressed. Well, by using Chart 1 and the viseme listing,
you can make logical decisions of where to cut. For example, if you only have
room for 12 visemes, you can combine visemes 5 and 6, or 6 and 7, from the listing. For 11
visemes, continue by combining visemes 7 and 9. For
10, combine visemes 2 and 3. For 9, combine 8 with the new viseme 7/9. For 8,
combine 11 and 13.
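
The reduction schedule can be written down as data. The sketch below is only one way to do it in Python: `MERGE_STEPS` and `merged_visemes` are hypothetical names, and for the first step, where the text allows either 5 and 6 or 6 and 7, it arbitrarily picks 5 and 6.

```python
# Merge steps in the order the text gives them; each step removes one viseme.
# Assumption: the first step uses 5+6 (the text also allows 6+7).
MERGE_STEPS = [(5, 6), (7, 9), (2, 3), (7, 8), (11, 13)]
TOTAL = 13


def merged_visemes(budget):
    """Return {original viseme: viseme that represents it} for a budget."""
    smallest = TOTAL - len(MERGE_STEPS)
    if not smallest <= budget <= TOTAL:
        raise ValueError(f"budget must be between {smallest} and {TOTAL}")
    rep = {v: v for v in range(1, TOTAL + 1)}
    for a, b in MERGE_STEPS[: TOTAL - budget]:
        keep, drop = rep[a], rep[b]
        # Everything currently represented by `drop` now uses `keep`.
        for v in rep:
            if rep[v] == drop:
                rep[v] = keep
    return rep


groups = merged_visemes(8)
print(sorted(set(groups.values())))  # [1, 2, 4, 5, 7, 10, 11, 12]
```

With a budget of 8, visemes 7, 8, and 9 share one mouth shape, as do 5 and 6, 2 and 3, and 11 and 13.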

If I were really pressed for space, I could keep combining and drop this list down
further. Most drastic would be three frames (Open, Closed, and Pursed as in
boot) or even a simple two frames of lip flap open and closed. In this case
you would just alternate between opened and closed once in a while. But that
isn’t very fun or realistic, is it?

Art Issues

Side view of the sound [b], as in “buy.”

These viseme descriptions are enough to realistically represent speech. However, the
use of individual visemes is more an artistic judgement than a hard rule. When
speaking, people tend to slur phonemes together. They do not clearly articulate
each phoneme all the time. Also, the look of a viseme can change depending on
the visemes that surround it. For example, the Disney guidelines describe the
use of a slightly different viseme for B, P, and M if they precede the ea sound
as in beat.

This dependency on surrounding sounds is called co-articulation and makes viseme
choice more complicated. This is one reason that the automatic phoneme recognition
software in some packages doesn’t always provide realistic results. Smooth interpolation
between viseme keyframes can help, but this alone may not be good enough. In
many cases, it requires an artistic judgement for which viseme really looks
best. In computer animation, realistic looks are all that matter. So, when you
work, put in the viseme that looks best.
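
For reference, the simplest form of interpolation between viseme keyframes is a linear cross-fade. The Python sketch below assumes keyframes are (time in seconds, viseme number) pairs sorted by time; that layout and the name `viseme_weights` are hypothetical, and how the weights would drive a mesh is left open.

```python
def viseme_weights(keyframes, t):
    """Blend weights at time t for sorted (time, viseme) keyframes.

    Returns a dict {viseme: weight} whose weights sum to 1.0.
    """
    if not keyframes:
        return {}
    # Before the first key, hold the first viseme.
    if t <= keyframes[0][0]:
        return {keyframes[0][1]: 1.0}
    for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
        if t0 <= t < t1:
            if v0 == v1:
                return {v0: 1.0}
            a = (t - t0) / (t1 - t0)  # 0.0 at the first key, 1.0 at the second
            return {v0: 1.0 - a, v1: a}
    # At or after the last key, hold the last viseme.
    return {keyframes[-1][1]: 1.0}


# "hack": viseme 13 at 0.0 s, viseme 9 at 0.5 s.
keys = [(0.0, 13), (0.5, 9)]
print(viseme_weights(keys, 0.125))  # {13: 0.75, 9: 0.25}
```

A straight cross-fade like this knows nothing about co-articulation, so the keyframes themselves still have to be chosen by eye.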

Side view of the sound [p], as in “pie.”

Emphasis and exaggeration are also very important in animation. You may wish to punch
up a sound by the use of a viseme to punctuate the animation. This emphasis
along with the addition of secondary animation to express emotion is key to
a believable sequence.

In addition to these viseme frames, you will want to have a neutral frame that
you can use for pauses. In fast speech, you may not want to add the neutral
frame between all words, but in general it gives good visual cues to sentence
boundaries.
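
One possible way to place the neutral frame is to drop it into any silence longer than some threshold. This Python sketch rests on several assumptions of mine: timed segments are (start, end, viseme) tuples, the neutral frame has the made-up ID 0, and a quarter second counts as a pause. None of these values come from the text.

```python
NEUTRAL = 0  # hypothetical ID for the neutral frame; the listing uses 1-13


def add_neutral_frames(segments, min_pause=0.25):
    """Insert a neutral segment into each silent gap of at least min_pause.

    segments: (start, end, viseme) tuples sorted by start time, in seconds.
    """
    out = []
    for seg in segments:
        if out:
            gap_start = out[-1][1]  # end of the previous segment
            if seg[0] - gap_start >= min_pause:
                out.append((gap_start, seg[0], NEUTRAL))
        out.append(seg)
    return out


# "hack", then a half-second pause, then a closed-lip sound.
track = [(0.0, 0.5, 13), (0.5, 0.75, 9), (1.25, 1.5, 1)]
for seg in add_neutral_frames(track):
    print(seg)
```

Raising `min_pause` keeps the neutral frame out of fast speech, so it appears only at longer breaks such as sentence boundaries.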

So What Do I Do with This Stuff?

Side view of the sound [m], as in “my.”

So far, I have been discussing issues that only seem important to the artists working
on the facial animation. If the only use of facial animation in your project
is for pre-rendered cut scenes, this may be true. However, I believe facial
animation will become an important aspect in real-time 3D rendering as we take
character simulation to the next level. This requires a close relationship between
the art assets and engine features. As a technical lead on a cutting-edge 3D
project, you will be required to create the production pathway that the artists
will use to create assets. You will be responsible for deciding how many visemes
the engine can support and the manner in which the meshes must be created. Having
a clear understanding of what goes into the creation of the assets will allow
you to interface more effectively with those creating the assets.

However, even with the viseme count I am still not ready to set the artists loose creating
my viseme frames. There are several basic engine decisions that I must make
before modeling begins. Unfortunately, I will have to wait until the next column
to dig into that. Until then, think back on my 3D morphing column (“Mighty Morphing
Mesh Machine,” December 1998) as well as last year’s skeletal deformation column
(“Skin Them Bones,” Graphic Content, May 1998) and see if you can get a jump
on the rest of the class.

Acknowledgements

Special thanks go to my partner in crime, Margaret Pomeroy. She was able to
explain to me what was really going on when I made all those funny faces in
the mirror. When she was studying ancient languages in school I am sure she
never imagined working on lip-synching character dialog.

For Further Info:

• Culhane, Shamus. Animation from Script to Screen. New York: St. Martin’s Press, 1988.
• Ladefoged, Peter. A Course in Phonetics. San Diego: Harcourt Brace Jovanovich, 1982.
• Maestri, George. [digital] Character Animation. Indianapolis: New Riders Publishing, 1996.
• Parke, Frederic I. and Keith Waters. Computer Facial Animation. Wellesley: A. K. Peters, 1996.

Jeff Lander often sounds like he knows what he’s talking about. Actually, he’s
just lip-synched to someone who really knows what’s going on. Let him know you
are on to the scam at jeffl@darwin3d.com.

Copyright © 2003 CMP Media Inc. All rights reserved.