Anyone who has ever been in a professional production situation realizes that real-world coding these days requires a broad range of expertise. When that expertise is lacking, developers need to be humble enough to look things up and turn to the people around them who are more experienced in that particular area.
As I continue to explore areas of graphics technology, I have attempted to document the research and resources I have used in creating projects for my company. My research demands change from month to month depending on what is needed at the time. This month, I need to develop some facial animation techniques, particularly lip sync. That means I need to shelve my physics research for a bit and get some other work done. I hope to get back to moments of inertia and such real soon.
And Now for Something Completely Different
My problem right now is facial animation. In particular, I need to know enough to create a production pathway and the technology to display real-time lip sync. My first step when trying to develop new technology is to take a historical look at the problem and examine previous solutions. The first people I could think of who had explored facial animation in depth were the animators who created cartoons and feature animation in the early days of Disney and Max Fleischer.
Facial animation in games has built on this tradition. Chiefly, it has been achieved through cut-scene movies animated using many of the same methods. Games like Full Throttle and The Curse of Monkey Island used facial animation for their 2D cartoon characters in the same way that the Disney animators would have. More recently, games have begun to include some facial animation in real-time 3D projects. Tomb Raider has had scenes in which the 3D characters pantomime the dialog, but the face is not actually animated. Grim Fandango uses texture animation and mesh animation for a basic level of facial animation. Console titles like Banjo-Kazooie are even experimenting with real-time “lip-flap” without a dialog track. How do I leverage this tradition in my own project?
Phonemes and Visemes
No discussion of facial animation is possible without discussing phonemes. Jake Rodgers’s article “Animating Facial Expressions” (Game Developer, November 1998) defined a phoneme as an abstract unit of the phonetic system of a language that corresponds to a set of similar speech sounds. More simply, phonemes are the individual sounds that make up speech. A naive facial animation system may attempt to create a separate facial position for each phoneme. However, in English (at least where I speak it) there are about 35 phonemes. Other regional dialects may add more.
Now, that’s a lot of facial positions to create and keep organized. Luckily, the Disney animators realized a long time ago that using all phonemes was overkill. When creating animation, an artist is not concerned with individual sounds, just how the mouth looks while making them. Fewer facial positions are necessary to visually represent speech, since several sounds can be made with the same mouth position. These visual references to groups of phonemes are called visemes. How do I know which phonemes to combine into one viseme? Disney animators relied on a chart of 12 archetypal mouth positions to represent speech, as you can see in Figure 1.
Figure 1. The 12 classic Disney mouth positions.
Each mouth position or viseme represented one or more phonemes. This reference chart became a standard method of creating animation. As a game developer, however, I am concerned with the number of positions I need to support. What if my game only has room for eight visemes? What if I could support 15 visemes? Would it look better?
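To make that trade-off concrete, one practical approach is to treat the phoneme-to-viseme mapping as a data table rather than as code. The sketch below is a minimal illustration in C++, assuming made-up phoneme labels and an arbitrary grouping rather than any official chart; shrinking the viseme budget from 15 to 8 just means reassigning entries in the table, while the playback code that displays viseme keyframes stays the same.

// Hypothetical sketch: collapsing phonemes into a smaller viseme set.
// The phoneme labels and the grouping below are illustrative assumptions,
// not a production mapping; the point is that the viseme count is a
// data decision, not a code change.
#include <cstdio>
#include <map>
#include <string>

int main()
{
    // Map each phoneme label to a viseme index. Reducing the number of
    // supported visemes only means pointing more phonemes at the same index.
    std::map<std::string, int> visemeForPhoneme = {
        { "p",  0 }, { "b",  0 }, { "m", 0 },   // lips pressed together
        { "f",  1 }, { "v",  1 },               // lower lip against upper teeth
        { "th", 2 }, { "dh", 2 },               // tongue between teeth
        { "aa", 3 }, { "ah", 3 },               // open mouth
        { "iy", 4 }, { "ih", 4 },               // spread lips
        { "uw", 5 }, { "w",  5 },               // rounded lips
        // ... remaining phonemes assigned to the rest of the viseme set
    };

    // Example lookup: which mouth shape do we show for the phoneme "b"?
    printf("viseme for 'b' = %d\n", visemeForPhoneme["b"]);
    return 0;
}

This way the viseme count becomes a content decision the animators can tune per project, rather than something baked into the engine.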
Throughout my career, I have seen many facial animation guidelines with different numbers of visemes and different groupings of phonemes. They all seem similar to the Disney 12, but they also look as though they came from animators talking into a mirror and doing some guessing.
I wanted to establish a method that would be optimal for whatever number of visemes I chose to support. Along with the animator’s eye for mouth positions, there are more scientific models that reduce sounds to their visual components. For the deaf community, which does not hear phonemes, spoken-language recognition relies entirely on lip reading. Lip-reading samples base speech recognition on 18 speech postures. Some of these mouth postures show very subtle differences that a hearing individual may not see.
So, the Disney 12 and the lip reading 18 are a good place to start. However, making sense of the organization of these lists requires a look at what is physically going on when we speak. I am fortunate to have a linguist right in the office. It’s times like this when it helps to know people in all sorts of fields, no matter how obscure.