“We didn't need dialogue. We had faces!”
claims Norma Desmond (Gloria Swanson) in Billy Wilder’s Sunset Blvd.
(1950). Except that now we do, and video games have a lot of it.
You snipe a guard in a murky alley, there goes the AI-triggered cue; you’ve
just started a brand new adventure game, there goes the pre-rendered
exposition scene; you’ve defeated a level boss, there goes the scripted
event hinting at what to do next. In terms of audio, a big game in the
past couple of years had over 10,000 recorded lines. With next-gen
platforms and publishers banking on massive content, those figures are
likely to grow threefold, if not five- or tenfold, in the medium or long
run (and we know the long run isn’t very long in this industry),
depending on how effectively AI manages audiobases [1] and cue variations.
Localizing in-game and video dialogue is a hot topic: it is complicated and
expensive, and it usually takes place at a time when you have a million
other fish to fry (the first being this game you need to
finish). This difficulty originates from our games’ high level of
sophistication, but also from worldlier principles: the intricacy of
level design, programming, animation, script writing (and many other
yet to be described tasks) results in what one could describe as a real
house of cards. Add pressure from publishers
for last-minute changes and your production calendar is offset by four
weeks while a cold wind of anxiety blows over your already
sleep-deprived teams.
Finally, localizing dialogue means dealing with external resources that will
carry out translation, casting, recordings and linguistic testing, all
tasks that also need to fit into your schedule while your submission
date isn’t likely to change.
Localizing audio and video is high maintenance. To paraphrase Harry Burns (Billy Crystal) in “When Harry Met Sally” [2],
it’s the worst kind: it’s high maintenance that people misconstrue as
low maintenance. It has become even more so in the age of “sim ship” [3]:
more and more games, at least AAA titles that need international markets to
break even and start generating revenue faster, are localized while the
original game is still in its final (or not so final) stage. That
is exactly like trying to build a house without the final blueprint.
The purpose of this article is to shed some light on how original and
localized versions become entangled, and hopefully to offer some advice
to prevent things from escalating to DEFCON 1.
1. Scheduling & Budgeting
1.1 Let’s first take a look at a (very) simplified audio production process:
Figure 1: audio process
Your design is locked, your character brief is ready and you have a “final”
recording script for the “original” version (we’ll assume it’s American
English). Simultaneously, animation is being created, sometimes based on
temporary voices. A voice director is hired and briefed on the game
mechanics and “feel”.
U.S. casting begins and goes through in-house and/or licensor / publisher approval (if applicable).
U.S. recordings begin (you can do this in-house if you have the facilities
or contract a studio if you don’t). Cues are delivered in raw sessions
and need to be cut, cleaned (of clicks, scratches and mouth noises)
and named [4]. Keepers [5]
need to be selected. Lines that will be integrated in cinematic scenes
are delivered to the animation department for editing. That’s it: you
have a U.S. audiobase that is your reference for localization recording,
and here’s why you should definitely wait for U.S. recordings to be over:
U.S. lines provide time references (see 1.3). Animation cannot adapt to five languages [6].
Even if it could, your disc space is limited. If you record “blind”,
each foreign file will have a different length. You need the actual
U.S. recorded lines so that their localized equivalents match.
Each language’s “sound databank” will also be the same size, which of course
will facilitate data and memory optimization. Before local actors
render their lines, they listen to the U.S. ones, then deliver theirs while
watching the waveform on screen so they can adapt.
Figure 2: audio waves
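The length matching described above can also be checked mechanically once localized files come back. What follows is a minimal sketch, not the tooling used on any particular project: it assumes cues are delivered as uncompressed WAV files, and the function names and the 10% tolerance are hypothetical choices made for illustration.

```python
import wave


def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def flag_length_mismatches(us_durations, localized_durations, tolerance=0.10):
    """Return cue names whose localized duration strays too far from the U.S. one.

    Both arguments map cue name -> duration in seconds. The tolerance is a
    fraction of the U.S. duration (0.10 = 10%, an assumed value).
    """
    flagged = []
    for cue, us_length in sorted(us_durations.items()):
        localized_length = localized_durations.get(cue)
        if localized_length is None:
            continue  # missing files are a separate integrity problem
        if abs(localized_length - us_length) > tolerance * us_length:
            flagged.append(cue)
    return flagged
```

In practice the two dictionaries would be built by calling `wav_duration` on every file of the U.S. audiobase and of each localized one; the acceptable tolerance depends on how tightly the animation is cut.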
You don’t always have the time to select keepers (a lengthy process). Working from the complete U.S. recordings allows countries to “mirror” all alternative takes in all languages so that you can pick the right ones later (which frequently happens when animation is not final).
Original recordings provide
additional artistic direction and help reduce recording time. U.S.
recording times are by and large longer than those of localized versions.
You get back mirrored localized
audiobases that are consistent: same folder architecture, identical
file naming, matching number of files. This comes in very handy when
it’s time for post production and integration.
If U.S. recordings are done, chances are retakes [7]
(if deemed necessary) won’t be too numerous. Remember: all retake costs
(studio, engineer, actors and voice director) are multiplied by the
number of languages.
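To make the multiplication concrete, here is a small sketch of the arithmetic. The function name and every figure in the usage note are hypothetical; only the four cost items and the multiplication by the number of languages come from the text.

```python
def retake_cost(studio, engineer, actors, voice_director, languages):
    """Total cost of one round of retakes across all languages.

    The four cost items are per language, in any single currency.
    """
    per_language = studio + engineer + actors + voice_director
    return per_language * languages
```

With made-up per-language figures of 800 for the studio, 400 for the engineer, 1,200 for actors and 600 for the voice director, `retake_cost(800, 400, 1200, 600, languages=5)` comes to 15,000: a session that looks affordable in one language is five times that across FIGS + US.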
Localization begins: casting, then script translation. Once localization agencies [8] have everything they need (translated script and U.S. audiobase), they start recording.
Localized audiobases are delivered back to the developer, where they are checked
for integrity and quality (everything is there and meets the
standards). Files are post produced the same way the U.S. files were. You can
then move on with integration.
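The integrity half of that check (same folder architecture, identical file naming, matching number of files) lends itself to a script. The sketch below is one possible approach under assumptions: each audiobase is taken to be a folder tree on disk, and the function names are hypothetical.

```python
from pathlib import Path


def list_cues(root):
    """Return the set of file paths under root, relative to root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def compare_audiobases(us_files, localized_files):
    """Compare two sets of relative paths.

    Returns (missing, unexpected): files present in the U.S. audiobase but
    absent from the localized one, and files only the localized one contains.
    """
    missing = sorted(us_files - localized_files)
    unexpected = sorted(localized_files - us_files)
    return missing, unexpected
```

Running `compare_audiobases(list_cues(us_root), list_cues(french_root))` for each language returns two empty lists when the delivery mirrors the U.S. audiobase exactly. It says nothing about audio quality, which still has to be judged by ear.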
QA and debug can start. Note: if post production is significant and
requires a lot of special effects on voices, you can integrate the dry
files [9] and proceed with linguistic testing to save some time. The dry files
will eventually be swapped with the final ones (voices with effects,
final videos), which of course you can’t afford not to check in actual
gaming conditions.
1. Audiobase: ensemble of all lines recorded for the game.
2. He was of course referring to women.
3. Simultaneous shipping: original and localized versions ship at the same time.
4. The raw session is cut into single files that need naming, e.g. CHAR1_Intro_001
5. When you have alternative takes, you then need to select the keeper (the one that meets your demands).
6. FIGS (French,
Italian, German, Spanish) + US. Those are generally the 5 languages you
find on PAL discs. Of course the number of languages can go up.
7. Additional recording session (lines were overlooked, quality was poor, lines were added in between etc.)
8. Local service providers that will translate, cast and record the game in each language.
Excellent technical article and detail. Without this becoming a shameless plug, I'd like to elaborate on an automation process currently used in many games that can save time and fits this framework.
Annosoft licenses the "Text based Lipsync SDK". This software automatically aligns VO with the text script, generating word and phonetic markers. It's very accurate and can streamline the animation process.
The localized scripts include animation markers/tags which define when an animation should occur in relation to the script. The automatic process will produce timing for these without having to re-animate or short-change localized versions. It's a clean way to handle mouth and general animations, and it actually works!
Anyway, sorry if this comes off as a plug, but it can really save some effort while keeping quality good.
Mark Zartler
Annosoft
Really Automatic Lipsync