Gamasutra: The Art & Business of Making Games
Asset Recovery: What to do When the Data is Gone!


August 3, 2001 (Page 4 of 5)
 

Finding Graphics in a Hex Dump
The key to identifying raw graphics data in a hex dump is the ability to spot repeating data. In palettized images (eight bpp or less), you have runs of the same color on any given scan line. This is less obvious with dithered, digitized, or rendered data, especially if it is 16, 24, or 32 bit data, but if your image uses transparent colors, you might be able to identify runs of the transparent color.

Hex dump excerpt of a 4 bpp 256x256 image.

In the example above, notice that we have long strings of 0x99 and 0xCC. We can tell that this is four bpp data because the repetition (in this case 9s and Cs) is by nibble. Furthermore, by noting that the runs recur eight hex dump lines apart, at 16 bytes per line, we can tell that the image is 128 bytes wide, or 256 pixels wide at two pixels per byte, assuming that these are both part of the same line. We can also discern that the image uses reverse order nibbles because the end of each run always has an out of order value: notice the first block starts with 91 99 and ends with 99 89, which seems very unlikely otherwise.
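
The nibble-order observation can be checked with a few lines of Python. This is only a sketch: the function name `unpack_4bpp` and its `low_nibble_first` flag are invented for illustration, and the four sample bytes are modeled on the 91 99 ... 99 89 run described above rather than taken from a real file.

```python
def unpack_4bpp(data: bytes, low_nibble_first: bool = True) -> list[int]:
    """Expand packed 4 bpp data into one palette index per pixel."""
    pixels = []
    for byte in data:
        high, low = byte >> 4, byte & 0x0F
        # "Reverse order" is taken here to mean that the low nibble
        # holds the left-hand pixel of each pair (an assumption).
        pixels.extend((low, high) if low_nibble_first else (high, low))
    return pixels


# Shortened stand-in for the run in the hex dump: 91 99 ... 99 89
row = bytes([0x91, 0x99, 0x99, 0x89])
print(unpack_4bpp(row, low_nibble_first=True))   # [1, 9, 9, 9, 9, 9, 9, 8]
print(unpack_4bpp(row, low_nibble_first=False))  # [9, 1, 9, 9, 9, 9, 8, 9]
```

With the low nibble first, the run of 9s is unbroken and the odd values sit at the two ends. With the high nibble first, a stray pixel lands inside the run at each end, which is the "very unlikely" case.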

For an eight bpp bitmap, we can see repeated values for every byte instead of every nibble when there is a horizontal line. In addition, we can see the second scan line emerge 256 bytes into the data for a 256 pixel wide bitmap.

There are similar patterns in 16 bpp and 24 bpp data, but they might not be so apparent. Look for repeating sections of two or three byte codes and try to scan a larger area to determine the length of scan lines. This is where it is better to use the raw mode of your paint program.
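
If eyeballing the dump does not reveal the scan line length, one rough way to estimate it is to compare the data against a copy of itself shifted by each candidate width and see which shift lines up best. The sketch below assumes the image has some vertical coherence (neighboring scan lines look alike). The function name and the candidate list are illustrative choices, not part of any tool.

```python
def guess_stride(data: bytes, candidates=(16, 32, 64, 128, 256, 512, 1024)) -> int:
    """Return the candidate width in bytes at which the data best matches itself."""

    def match_ratio(stride: int) -> float:
        pairs = len(data) - stride
        if pairs <= 0:
            return 0.0
        # Count bytes that equal the byte exactly one candidate line below.
        same = sum(1 for i in range(pairs) if data[i] == data[i + stride])
        return same / pairs

    # max() keeps the first of several equal scores, so on a tie the
    # smallest candidate wins.
    return max(candidates, key=match_ratio)


# Synthetic test image: eight identical scan lines, each 128 bytes wide.
image = bytes(range(128)) * 8
print(guess_stride(image))  # 128
```

Multiples of the true width also score well, so treat the result as a hint and confirm it in the raw mode of your paint program.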




Data from an 8 bpp 256x256 bitmap.

Tile maps
Older video game platforms like the Sega Genesis and Super Nintendo made great use of tile maps to efficiently display large sections of scrolling graphics. You don't see tile maps as frequently in today's 3D game worlds, but they are still used for Game Boy and Game Boy Advance games. Additionally, some games store their terrain as a uniform grid or height map, and the texture selection data is sometimes stored as a tile map.

Often, tile maps are created by automatically ripping out tile-sized sections of a full-sized bitmap, from left-to-right, top-to-bottom. When a tile that matches an already existing tile is found, the code of the existing tile is added, saving a tile definition. Since the process is automatic, one generally ends up with runs of increasing numbers, starting with tile 0. A hex dump of an automatically created tile map might start out like this:

The data is always increasing by one, except when using a previously defined code.

Notice that the data is always increasing by one, except when using a previously defined code. This technique only works with automatically generated tile map data. Of course, if there are bits for H-Flip, V-Flip, color palette, etc. they should be ignored.

The high 5 bits are used for priority, palette, H-flip, and V-flip.

In this example, the high 5 bits are used for priority, palette, H-flip, and V-flip. However, the patterns of tile allocation remain consistent if that extra data is ignored. Otherwise, you can treat a tile map just about the same as a bitmap, using the same techniques described above.
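
The "increasing by one" rule is easy to test mechanically. The sketch below assumes 16-bit entries whose low 11 bits are the tile index and whose high 5 bits are attribute flags, as in the example above. The byte order, the mask, and the function name are assumptions to adjust for your platform.

```python
import struct


def looks_like_auto_tilemap(data: bytes, index_mask: int = 0x07FF,
                            byteorder: str = "<") -> bool:
    """Check that each entry either reuses an earlier tile or introduces
    the next unused tile number, starting from tile 0."""
    count = len(data) // 2
    if count == 0:
        return False
    entries = struct.unpack(f"{byteorder}{count}H", data[:count * 2])
    next_new = 0
    for entry in entries:
        tile = entry & index_mask  # ignore priority, palette, and flip bits
        if tile == next_new:
            next_new += 1          # a newly defined tile
        elif tile > next_new:
            return False           # skipped ahead: not auto-generated
    return True


# Tiles 0, 1, 2, then tile 1 reused with a flag bit set, then tile 3.
sample = struct.pack("<5H", 0x0000, 0x0001, 0x8002, 0x0801, 0x0003)
print(looks_like_auto_tilemap(sample))  # True
```

Sliding this check across a file at each even offset is one way to shortlist candidate tile map blocks. Very short blocks can pass by accident.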

Palette Data
Palette data is usually 16, 24 or 32 bit RGB data. For 32 bit data, the extra byte is usually for alpha channel or other ID bits. In 16 bit data, the high bit is frequently unused, or may be used for transparency information, leaving five bits each for RGB data. Twenty-four bit data is generally all RGB data.

16 bit palette data

This example is 16 bit palette data. The high bit is transparency data. The palette data probably starts at 0x14 because the first palette entry is often all zeros for black transparent data. We know that our platform uses Intel format data, so the low byte is first. The subsequent set of data is for semi-transparent pixels, so we can see that the high bit is set for almost all of the following values. This seems to be a collection of multiple 16 color palettes, because the value 00 00 repeats every 32 bytes (2 bytes per color x 16 colors), meaning that the transparent zero color is repeated.
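
To sanity-check a suspected palette block, it can help to split each 16-bit entry into its fields and look at the resulting colors. This sketch assumes Intel (little-endian) byte order and a 1.5.5.5 layout with red in the highest five color bits. The channel order is a guess that varies by platform, and the function name is made up for this example.

```python
import struct


def decode_1555(data: bytes) -> list[tuple[int, int, int, int]]:
    """Decode little-endian 16-bit entries into (flag, r, g, b) tuples.

    The flag is the high bit; r, g, and b are 5-bit values (0 to 31).
    """
    count = len(data) // 2
    colors = []
    for (value,) in struct.iter_unpack("<H", data[:count * 2]):
        flag = value >> 15
        r = (value >> 10) & 0x1F
        g = (value >> 5) & 0x1F
        b = value & 0x1F
        colors.append((flag, r, g, b))
    return colors


# FF 7F is pure white with the high bit clear; 00 80 is black with it set.
print(decode_1555(bytes([0xFF, 0x7F, 0x00, 0x80])))
# [(0, 31, 31, 31), (1, 0, 0, 0)]
```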

You may also try calculating some common colors in the correct bit format and searching the binary for them. For 16 bit (1.5.5.5) data, pure white is 7FFF (or FF7F in Intel ordering), but that value might also occur by chance. The next lower shade of gray is 7BDE (DE7B), followed by 77BD (BD77) and 739C (9C73), so if you see those numbers in your suspected palette block, you are probably in the right place.
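
The gray values can be generated rather than worked out by hand. In the sketch below, levels 31, 30, and 28 reproduce 7FFF, 7BDE, and 739C. The helper names are illustrative, and the byte order is assumed to be Intel (little-endian).

```python
def gray_1555(level: int) -> bytes:
    """Return a 1.5.5.5 gray (same 5-bit level in every channel) in Intel byte order."""
    value = (level << 10) | (level << 5) | level
    return value.to_bytes(2, "little")


def find_all(haystack: bytes, needle: bytes) -> list[int]:
    """Return every offset at which needle occurs in haystack."""
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        start = haystack.find(needle, start + 1)
    return offsets


for level in (31, 30, 29, 28):
    print(level, gray_1555(level).hex().upper())
# 31 FF7F
# 30 DE7B
# 29 BD77
# 28 9C73
```

Calling `find_all(file_bytes, gray_1555(30))` on the contents of a file then lists candidate offsets. Hits that fall on even offsets and cluster together are the more promising ones.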

Text Data
It seems like it should be easy to find text data, especially if your data is in ASCII. However, for a variety of reasons, an alternative encoding is often used. Certainly, some minimal encoding is required if you want to protect the text from the prying eyes of hackers. On the other hand, if you have an Asian language with tens of thousands of characters, chances are that a custom character ordering is used.

If it's raw ASCII, it's like taking candy from a baby…

For the case of encoded ASCII data, chances are that either the encoding is incredibly simple, or the data is compressed and you have little chance of getting it out without the source for the decompressor. Let's use a simple code as an example. Ninety percent of the time the data is in ASCII order with an offset; for instance, the letter 'A' or the space could be assigned the value one. The key to recognizing the data is the frequency of the symbols. You know space will appear regularly every few characters, and that sentences will start with capital letters and end with punctuation. Furthermore, capital letters will be grouped together, lowercase letters will be in a different group, numbers will be sequential, etc.

Here, 1F has been subtracted from each letter.
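
Since there are only 256 possible offsets, the symbol-frequency idea can be automated by trying every one and keeping the result that looks most like English. The scoring rule here (count spaces and ASCII letters) is a deliberately crude assumption, and `guess_offset` is a made-up name.

```python
def guess_offset(data: bytes) -> int:
    """Return the value that, added back to every byte (mod 256),
    yields the most spaces and ASCII letters."""

    def score(offset: int) -> int:
        total = 0
        for b in data:
            c = (b + offset) & 0xFF
            if c == 0x20 or 0x41 <= c <= 0x5A or 0x61 <= c <= 0x7A:
                total += 1
        return total

    return max(range(256), key=score)


plain = b"The quick brown fox jumps over the lazy dog. Hello World!"
encoded = bytes((b - 0x1F) & 0xFF for b in plain)  # 1F subtracted, as in the figure
offset = guess_offset(encoded)
print(hex(offset))                                  # 0x1f
print(bytes((b + offset) & 0xFF for b in encoded))  # the original sentence
```

On short or unusual strings the best-scoring offset can be wrong, so it is worth printing the top few candidates and reading them.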

Another common method for encoding the text is to XOR each letter with a given value. XORing is reversible by XORing again with the same value. Therefore, the same rules of symbol frequency apply, although they may be more difficult to pick out in a hex editor.
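
A single-byte XOR key can be recovered the same brute-force way, because applying the right key a second time restores the text. This sketch assumes one key byte for the whole block and scores candidates by counting spaces and ASCII letters. Both function names are illustrative.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with the same one-byte key (encoding and decoding are identical)."""
    return bytes(b ^ key for b in data)


def guess_xor_key(data: bytes) -> int:
    """Return the key whose output contains the most spaces and ASCII letters."""

    def score(key: int) -> int:
        total = 0
        for b in data:
            c = b ^ key
            if c == 0x20 or 0x41 <= c <= 0x5A or 0x61 <= c <= 0x7A:
                total += 1
        return total

    return max(range(256), key=score)


plain = b"The quick brown fox jumps over the lazy dog. Hello World!"
scrambled = xor_bytes(plain, 0x5A)
key = guess_xor_key(scrambled)
print(hex(key))                   # 0x5a
print(xor_bytes(scrambled, key))  # the original sentence
```

Note that a key differing from the right one only in bit 0x20 keeps every letter a letter with its case flipped, so the spaces are what separate the correct key from that near miss.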

Sometimes, especially in the case of games with Asian encoding, the letters will be assigned using a sparse encoding. In this case, they are allocated using the same rules as ripping a tilemap—the symbols are assigned in the order that they appear. The best bet for determining the encoding in that situation is to extract the font data, which should be in the same order as the character encoding.

For Asian languages, it is more likely that the ROM font will actually be used if it is available because there are so many symbols that take up memory. In this case, chances are that one of the standard two-byte encodings like Shift-JIS, EUC, Big-5, etc. is being used. To check for that, you need a program that displays the proper encoding, such as Hongbo Ni's NJStar, or JWP. Also, if you have Microsoft Office or Internet Explorer installed, you can install the far-east encoding option with the appropriate fonts and then open the files in your web browser, trying various encodings. You may need to crop a byte off the start of the file or reverse the byte order of the text block in order for the double-byte data to be displayed correctly.
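
The trial-and-error with encodings can also be scripted. The sketch below uses Python's built-in codecs to attempt several encodings, with and without the first byte cropped. The candidate list and the function name are assumptions, and a decode that succeeds is not proof that the encoding is right, so the output still has to be read by someone who knows the language.

```python
CANDIDATES = ("shift_jis", "euc_jp", "big5", "utf-16-le", "utf-16-be")


def try_decodings(data: bytes) -> dict[str, str]:
    """Map 'codec+skip' to the decoded text for every attempt that decodes cleanly."""
    results = {}
    for skip in (0, 1):  # optionally crop one byte off the start
        for codec in CANDIDATES:
            try:
                results[f"{codec}+{skip}"] = data[skip:].decode(codec)
            except UnicodeDecodeError:
                continue  # this codec rejects the bytes; try the next one
    return results


block = "\u65e5\u672c\u8a9e".encode("shift_jis")  # the word for "Japanese language"
for name, text in try_decodings(block).items():
    print(name, repr(text))
```

Using `utf-16-be` alongside `utf-16-le` covers the reversed byte order case for Unicode text.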

I've noticed most console programs from Japan tend to use Shift-JIS; I never see EUC, JIS, or Unicode. Windows games use Shift-JIS for normal text and Unicode for menus and resource text. One useful attribute of either encoding is that the characters that overlap ASCII are easy to decipher. First, in Shift-JIS, normal ASCII characters are allowed, and shifted characters begin with a high byte of 0x81 or higher. There are also double-width ASCII letters available under the lead byte 0x82; for digits and capital letters, the second byte is the ASCII value with 0x1F added to it (double-width 'A' is 0x8260). There is more to it than just that, but you should be able to see something that almost looks like ASCII text where roman letters appear. For Unicode, it is even easier: for ASCII symbols, every code is an ASCII code with 0x00 in the high byte.
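
Both observations can be tried out with Python's standard library. In the `shift_jis` codec, double-width 'A' encodes as the bytes 82 60, and NFKC normalization folds double-width letters back to plain ASCII. The UTF-16 check assumes little-endian text made up only of printable ASCII. The function names are made up for this sketch.

```python
import unicodedata


def fold_shift_jis(data: bytes) -> str:
    """Decode Shift-JIS and fold double-width letters and digits to plain ASCII."""
    text = data.decode("shift_jis", errors="replace")
    return unicodedata.normalize("NFKC", text)


def looks_like_utf16le_ascii(data: bytes) -> bool:
    """True if every 16-bit code is a printable ASCII value with 0x00 in the high byte."""
    if len(data) < 2 or len(data) % 2:
        return False
    low, high = data[0::2], data[1::2]
    return all(b == 0 for b in high) and all(0x20 <= b < 0x7F for b in low)


print(fold_shift_jis(b"OK \x82\x60"))                        # OK A
print(looks_like_utf16le_ascii("Menu".encode("utf-16-le")))  # True
```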

