Read My Lips: Facial Animation Techniques
Anyone who has ever been in a professional production situation realizes that
real-world coding these days requires a broad range of expertise. When
this expertise is lacking, developers need to be humble enough to look
things up and turn to people around them who are more experienced in
that particular area.
As I continue to explore areas of graphics technology, I have attempted
to document the research and resources I have used in creating projects
for my company. My research demands change from month to month depending
on what is needed at the time. This month, I need to develop
some facial animation techniques, particularly lip sync. This means
I need to shelve my physics research for a bit and get some other work
done. I hope to get back to moments of inertia, and such, real soon.
And Now for Something Completely Different
My problem right now is facial animation. In particular, I need to know
enough to create a production pathway and technology to display
real-time lip sync. My first step when trying to develop new technology
is to take a historic look at the problem and examine previous solutions.
The first people I could think of who had explored facial animation
in depth were the animators who created cartoons and feature animation
in the early days of Disney and Max Fleischer.
Facial animation in games has built on this tradition. Chiefly, this has
been achieved through cut-scene movies animated using many of the same
methods. Games like Full Throttle and The Curse of Monkey
Island used facial animation for their 2D cartoon characters in
the same way that the Disney animators would have. More recently, games
have begun to include some facial animation in real-time 3D projects.
Tomb Raider has had scenes in which the 3D characters pantomime
the dialog, but the face is not actually animated. Grim Fandango
uses texture animation and mesh animation for a basic level of facial
animation. Even console titles like Banjo Kazooie are experimenting
with real-time “lip-flap” without even having a dialog track. How do
I leverage this tradition into my own project?
Phonemes and Visemes
No discussion of facial animation is possible without discussing phonemes.
Jake Rodgers’s article “Animating Facial Expressions” (Game Developer,
November 1998) defined a phoneme as an abstract unit of the phonetic
system of a language that corresponds to a set of similar speech sounds.
More simply, phonemes are the individual sounds that make up speech.
A naive facial animation system may attempt to create a separate facial
position for each phoneme. However, in English (at least where I speak
it) there are about 35 phonemes. Other regional dialects may add more.
Now, that’s a lot of facial positions to create and keep organized. Luckily,
the Disney animators realized a long time ago that using all phonemes
was overkill. When creating animation, an artist is not concerned with
individual sounds, just how the mouth looks while making them. Fewer
facial positions are necessary to visually represent speech since several
sounds can be made with the same mouth position. These visual references
to groups of phonemes are called visemes. How do I know which phonemes
to combine into one viseme? Disney animators relied on a chart of 12
archetypal mouth positions to represent speech, as you can see in Figure 1.
Figure 1. The 12 classic Disney mouth positions.
Each mouth position or viseme represented one or more phonemes. This reference
chart became a standard method of creating animation. As a game developer,
however, I am concerned with the number of positions I need to support.
What if my game only has room for eight visemes? What if I could support
15 visemes? Would it look better?
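The many-to-one relationship between phonemes and visemes can be sketched as a simple lookup table. This is only a minimal illustration: the phoneme symbols, the viseme names, and the `visemes_for` helper are all hypothetical, and the groupings are not taken from the Disney chart. They rely only on the familiar observation that sounds such as "p", "b", and "m" share a closed-lip mouth shape.

```python
# Hypothetical phoneme-to-viseme table. The symbols, names, and groupings
# are illustrative assumptions, not the Disney 12 or any standard set.
PHONEME_TO_VISEME = {
    "p": "closed_lips", "b": "closed_lips", "m": "closed_lips",
    "f": "lip_to_teeth", "v": "lip_to_teeth",
    "ow": "rounded", "uw": "rounded", "w": "rounded",
    "aa": "open_wide", "ae": "open_wide",
    "iy": "spread", "eh": "spread",
}

# Fallback mouth shape for any phoneme the table does not cover.
REST = "rest"


def visemes_for(phonemes):
    """Map a phoneme sequence to viseme names, collapsing consecutive repeats."""
    result = []
    for phoneme in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, REST)
        # Two adjacent phonemes that share a mouth shape need only one pose.
        if not result or result[-1] != viseme:
            result.append(viseme)
    return result
```

For example, `visemes_for(["m", "aa", "m"])` returns `["closed_lips", "open_wide", "closed_lips"]`. Supporting fewer visemes in a scheme like this means merging more phonemes into each table value; supporting more means splitting groups apart.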
Throughout my career, I have seen many facial animation guidelines with different
numbers of visemes and different organizations of phonemes. They all
seem to be similar to the Disney 12, but also seem like they involved
animators talking to a mirror and doing some guessing.
I wanted to establish a method that would be optimal for whatever number
of visemes I wanted to support. Along with the animator’s eye for mouth
positions, there are the more scientific models that reduce sounds into
visual components. For the deaf community, which does not hear phonemes,
spoken language recognition relies entirely on lip reading. Lip-reading
samples base speech recognition on 18 speech postures. Some of these
mouth postures show very subtle differences that a hearing individual
may not see.
So, the Disney 12 and the lip-reading 18 are a good place to start. However,
making sense of the organization of these lists requires a look at what
is physically going on when we speak. I am fortunate to have a linguist
right in the office. It’s times like this when it helps to know people
in all sorts of fields, no matter how obscure.
_______________________________________________________________
Science Break