A big part of making a successful Gameboy Advance game is managing system resources. In this paper I present algorithms and code for managing one of the most important resources on AGB: the various kinds of memory. These types include: OAM memory, OBJ sprite memory, tile management for tile engines (like sidescrollers), and general purpose memory management. Each of the various kinds of memory on AGB has special restrictions that require special attention to how memory is allocated. Furthermore, because of some of the inherent restrictions on how memory can be allocated we can tailor the memory allocation routines for each memory type for added efficiency.
Code listings for all the systems described in this paper are available on: http://oroboro.com/rafael/gdctalk.html
AGB Memory Architecture
The AGB has 9 different memory areas: System ROM, CPU External Work RAM, CPU Internal Work RAM, Register memory, Palette RAM, OBJ VRAM, BG VRAM, OAM and CART ROM. Some of these areas are either not dynamically managed (System ROM, Register memory, and CART ROM) or are trivial to manage (Palette RAM).
Palette RAM Management
The Palette RAM memory area is 1k, just enough memory to store two 256-color palettes with 16 bits of color resolution per palette entry (actually only 15 bits are used, but the storage is 16 bits). For these you simply block copy the palettes to the two palette blocks and you are done. It is possible to implement a palette stack where palettes can be pushed onto the stack and the palette manager simply keeps the top of the palette stack on the palette area and copies palettes that are lower down in the stack to a buffer in work RAM.
OAM Memory Management
The OAM memory area is 1k, and is used to store 128 8-byte entries. Each entry represents one sprite being displayed. The entry has room to store the screen coordinates of the displayed sprite, a reference to the sprite data in OBJ VRAM, a bunch of flags and other display parameters. Moving a sprite on the screen, or changing its display parameters (flipping, transparency, Z-depth, rotation) is a matter of tweaking the OAM entry associated with that sprite. Managing this memory would be trivial except for the fact that the Z-depth of the sprites is determined by where the sprite appears on the list. OAM entries near the top of this memory area (smaller memory addresses) are displayed on top of sprites lower down in memory. There is a rough scale four-layer priority scheme in OAM memory—that can be done using the priority flags in each OAM entry—but for finer-scale Z-ordering the OAM entries must be stored in Z-depth order.
Keeping the OAM entries sorted can be a big pain. Ideally, when you allocate an OAM entry, you want to store the entry number so that you can change its display flags to move the sprite around on the screen. This requires that either the entry number never changes, or whenever it changes (due to Z-depth changes) you have to track down all references to this OAM entry number and change them.
For our game engine we decided that we want the finer control of Z-depth and that a call back scheme or reference-counted OAM entries was too much trouble. So we implemented a system whereby we keep two versions of the OAM data. One version resides in the hardware OAM memory area, and another version, the "Shadow OAM" is allocated in CPU External Work RAM. The work RAM version includes some extra data entries to help manage Z-depth without changing the memory order of the OAM entries. The Shadow OAM has a copy of each OAM entry, plus the Z-depth of each OAM entry and a linked list pointer, an 8-bit reference to the OAM entry with the next highest Z-depth.
Every game frame (which happens every hardware V-blank) we copy the Shadow OAM to the hardware OAM in Z-sorted order. This way the hardware OAM can change its order as necessary, while the order of the OAM entries in the Shadow OAM stays fixed. All other game systems that need to change the display parameters of a sprite can make changes to the flags in the Shadow OAM with full confidence that during the lifetime of any given sprite its location in the shadow OAM will never change.
The OAM manager also sets the OAM entry's BG Priority flags appropriately. It is enough to just define the Z-depth of each BG layer, and the BG priority flags will be set appropriately.
For most modifications to the sprite's display parameters, game systems can simply write new flags into the Shadow OAM. To change the Z-depth is a little more involved, so we provide a function to change the Z-depth that will update the linked list. Given that we provide only a singly linked list to encode the Z-depth, changing the Z-depth involves searching from the top of the linked list until you find the appropriate spot. There are usually far fewer elements on the screen than the maximum ( 128 ), so this has not been a problem for us. There are various ways to eliminate even this cost. One could implement a doubly linked list by allocating just one more 8-bit list pointer ( total cost: 128 bytes ). One could also subdivide the Z-depth range into segments each of which has its own linked list. For example one could implement a Z-range for game UI elements, another for a particle system and another for characters. In this case it is likely that only the characters would have many Z-depth values, and OBJ memory limitations are such that you are unlikely to have more than six of these on the screen at a time.
The cost of refreshing the hardware OAM is linear with the number of sprites being displayed, and the majority of the copying can be done using DMA. The total memory cost for this system is almost neglible, 1k. The code itself is stored in ROM, so its cost is even less of an issue. If you don't need fine scale Z-depth for your project it might not be worth the small amount of time it takes to maintain the Z-order linked list pointers and the time it takes to refresh the hardware OAM every frame.
OBJ Sprite Memory Management
The sprite image map itself is stored in OBJ VRAM. In BG mode this region is 32k. This is a very small amount of memory to store images. For example a single sprite image can be as much as 4k. It is possible to manage this memory area "manually". You decide how big each type of game element is going to be and you pre-allocate slots for each type. So if each of your main enemies is no bigger than 2k each then you statically pre-allocate a fixed number of slots for AI's and a slot for your main character, some slots for the particles and a small area for the UI. The problem with this is that it quickly becomes unmanageable when the needs of the game change. You run the danger of keeping certain slots always free in case you need to load a font, an extra UI meter, etc. Furthermore, you may need to have a different memory map for each screen type. These two problems interact badly. You may find that you have to write several versions of the same code that will, for example, display a similar in-game UI in different ways depending on the state of the game. This type of management also becomes a code maintenance problem. As the project progresses, as the art needs become clearer you may find yourself having to refigure the memory maps many times. On the other hand this kind of memory management makes you think about exactly what is on the screen at all times, and you can be sure you will never run out of OBJ memory.
Implementing a general-purpose memory manager for OBJ VRAM solves many of the problems above. The code that displays various game elements receives pointers from the allocator and never has to worry about exactly where they are.
A memory manager for OBJ VRAM needs to have several properties to be useful. First it needs to be extremely efficient with memory. Given the very small memory area, we need to make sure that we have as much of it available as possible at all times. It also needs to be very careful about memory fragmentation. If for example a small UI element is allocated in an unfortunate location early on, it may keep a larger section of memory off limits for all but the smallest sprites for the duration of a game level. This could mean the difference between being able to reliably display six characters on the screen versus just three. To deal with fragmentation the manager needs to coalesce small blocks of free'd memory into larger blocks. Finally the memory manager needs to be somewhat fast. Absolute top speed is not critical because loading a new sprite involves copying a new image to OBJ VRAM, which takes some time anyway.
Screen from Rayman for the Game Boy Advance.
The blocks allocated into OBJ memory have certain properties that you can rely on to simplify what the allocator has to do. The AGB sprite unit supports 12 different sprite size configurations (8x8, 16x16, 32x32, 64x64, 16x8, 32x16, 64x32, 8x16, 16x32 and 32x64). These configurations translate into a smaller number of absolute memory sizes (32, 64, 128, 256, 512, and 1024 bytes). All of these are powers of two, and all are multiples of 32 bytes. Given that the OBJ VRAM is 32k, we can manage this memory as 1024 32-byte blocks. With such a small number of blocks, we can keep track of which blocks are used and which are free with a single "allocation map" array of 1024 bytes, and we can refer to any allocated memory area with a 16 bit block ID. We fill the elements of the allocation map with one of three values, UNUSED, USED, and CONTINUE. Unused blocks are simply labeled UNUSED. When several contiguous blocks are allocated we label the first allocated block with USED and all of the following allocated blocks with CONTINUE. This way when that memory area is free'd, we will receive the block id of the first allocated block and we know to free all the following blocks until we reach one that is not labeled CONTINUE. We free them by simply labeling all the free'd blocks UNUSED again. Contiguous free'd memory areas will become joined automatically.
We allocate blocks only on regions that are even multiple of the desired block size. This way when blocks are allocated they are less likely to create regions that are unusable by blocks of the allocated size or smaller. And when blocks are free'd they are more likely to create regions usable by larger block sizes. For example, if we allocate a single 32-byte region (a sprite that is 8x8 pixels, by 4 bits) and then a 64 byte region, when we free the original 32 byte region we open a space big enough to allocate another 64 byte region.
Tile Management in Tile-Based Engines
It is straightforward to create tile based rendering engines on AGB. The graphics hardware supports multiple layers of 8x8 pixel tile maps. Typically, you store a set of tiles in ROM. This can be an array of all the tiles in a level, and a set of tile maps. The tile maps are 2D buffers with indexes into the tile array. During rendering, for each layer, you allocate a smaller 2D tile map in BG VRAM, and you copy all the visible tiles into BG VRAM version of the tile set. The RAM tile maps have indexes into the tile set in RAM. You pass the addresses of the tile set and tile maps to the AGB hardware, and the AGB graphics chip does the rest.
The only tricky part is keeping track of the mapping between the RAM and ROM versions of the tile set. As you scroll the smaller RAM based tile map around you need to copy new tiles into VRAM as needed and free tiles that are no longer needed out of the RAM tile set.
So for example when you copy in a new row into the RAM tile map, you need to read a row from the ROM tile map. This has indexes into the ROM tile set. You have to quickly determine if the referenced tile has already been copied into RAM. If so, you simply use the RAM version of that tile index in the new row of your RAM tile set. If not, you need to allocate a new tile in RAM, copy the appropriate tile from ROM. Whenever a row is removed from the RAM tile map, you need to quickly determine if the tiles in that row are still being used in another part of the map. Only if they are no longer being used is it safe to free that tile from the RAM tile set.
To do this we implement three arrays:
To initialize the arrays, you set the RAM-to-ROM and ROM-to-RAM arrays with a magic value that indicates an unused mapping. You need to find a value that is bigger than the largest valid ROM tile index. (As a practical matter, use 0xFFFF. Your options on AGB are to use 8-bit, 16-bit or 32-bit indexes. 8-bit indexes are too small, allowing only 255 tiles. 16-bit indexes allow up to 64k tiles, which is way more than the AGB can support, so there is no need to use 32-bit indexes.) The ref count array is initialized with each entry pointing to the next entry on the list. In other words: initialize entry "0" with the value "1", entry "1" with the value "2", etc.
When you need to load a new tile into RAM, you start with the ROM index of the tile you need that was read from the ROM tile map. You check the ROM-to-RAM array. If the value is not the magic unused value, then the tile you want has already been allocated, and the ROM-to-RAM array gives you the RAM index you need. Simply increase the reference count in the ref count array, and write the RAM tile index into the RAM tile map. If the tile has not been allocated then you unlink first element off the free list, and set its reference count to "1", update the ROM-to-RAM mapping, and the RAM-to-ROM mapping arrays. Finally copy the actual tile data into the RAM tile set, at the address specified by the RAM index.
When you need to free a tile, you start with the RAM tile index of the tile you need to free. Use this RAM index to check the reference count in reference count array. If the reference count is greater than one, decrement the reference count and you are done. If the reference count is only one, then you need to link this tile back into the free list. Now you need to clear its entry in the ROM-to-RAM array. Unfortunately you only have that tile's RAM index, so you look up its ROM index in the RAM-to-ROM array, and use the ROM index to clear the proper entry in the ROM-to-RAM array.
There is one tile that you will want to handle specially, which is the all-transparent tile. In a system with multiple layers it is likely that a large number of tiles in each layer will be copies of the tile that is all transparent (on AGB this is all pixels set to color "0"). You could dynamically manage this tile like the others, but given that this tile is likely to always be on the screen you can special case it. Make this tile always be tile zero in both the ROM and RAM maps. At initialization time, copy the all transparent tile to the first tile in the RAM tile set, and remove it from the free list. Then, whenever you read a zero in the ROM tile map, just copy a zero to the RAM tile map. Don't bother with allocating, reference counting and freeing this tile. There is no need to bother. You will typically save having to do any tile management for about half the tiles being displayed at any time.
The nice thing about the allocation scheme described above is that allocation and deallocation are fast constant time operations. You don't have to worry about coalescing allocated blocks because all the blocks you allocate are the same size.
In AGB it is possible to have simultaneously 4-bit and 8-bit layers. In the 8-bit layers each 8x8 pixel tile is 64 bytes. In the 4-bit layers each tile is 32 bytes. Each layer can only use one kind of tile, but you can simultaneously display multiple layers that have different bit depths. One way to deal with this is to simply duplicate the system described above. Reserve one region of memory for the 4-bit tiles, and another region for the 8-bit tiles, and make two versions of the tile management data structures. One downside to doing this is that as the number of 4-bit and 8-bit tiles change, the 8-bit tile region will not be able to use free space in the 4-bit tile region and visa-versa.
We can extend our allocation scheme to manage tiles of two sizes using the same data area to store both kinds of tiles, but its kind of tricky. The RAM tile maps that are passed in to the AGB rendering hardware use 16 bits to encode each map entry. Only 10 bits of those 16 are used to encode the tile index. The remaining six bits are used to store various flags. This means that the entire tile set for any one map can have no more than 1024 tiles. For 8-bit tiles this is a memory area of 64k, and for 4-bit tiles this is a memory area of 32k. If you want to allocate both 4- and 8-bit tiles from the same shared-memory area, you must be careful to allocate the 4-bit tiles from a single 32k sub region of the 64k data area. For example, if you were allocating from a 64k buffer, you could allocate shared 8- and 4-bit tiles from the first 32k and then allocate only 8-bit tiles from the second 32k. For a 64k tile set region you need to keep track of 2048 32-byte blocks.
To implement this scheme we need three linked lists interwoven in the reference count array. One linked list, "list-a" links all the even numbered entries for the first 1024 entries, and the second linked list, "list-b", links all the even numbered entries for the second 1024 blocks. This allows you to allocate only pairs of blocks on even block boundaries. When you allocate an 8-bit tile you use two consecutive blocks to store the tile. When you allocate a 4-bit tile you use a just the first 32-byte block. The second block in the pair that is left free is placed on a third free list, "list-c", which keeps track of left over single blocks that are good only for allocating single 4-bit tiles.
To allocate a new 8-bit tile, you check first to see if you can find a free pair of blocks from "list-b" where you can only allocate 8-bit blocks. If this list has no free blocks, then you try allocating from "list-a".
To allocate a new 4-bit tile first you see if there are free 4-bit blocks in "list-c". If so you use one of the blocks from that list, if not you use a pair of blocks from "list-a". The first block you use for the 4-bit tile, and the second block you move into "list-c".
To free an 8-bit block you determine if the free block belongs in "list-a" or "list-b". 8-bit blocks with a tile index of smaller than 1024 go into "list-a", bigger blocks into "list-b".
To free a 4-bit block you need to check if the block that is part of its two block pair is already free. For even numbered blocks this is the following block, for odd numbered blocks you check for the previous block. If this pair block is free then you remove the pair block from "list-c" and add the pair back into "list-a". If the pair block is not free you just add the free'd 4-bit block into "list-c".
On the AGB the BG VRAM area is 64k in total, so you cannot use this entire area for tiles because you must save some memory for the actual tile maps. For our sidescroller we need four small tile maps that take up 2k each, leaving 56k for the RAM tile set. So, our shared 4- and 8-bit tile region is 32k and the remaining 24k is used for the 8-bit only tile region.
This allocation scheme is extremely fast. Allocating and deallocating tiles in all cases is a constant time operation. This is critical because in a 4-layer sidescroller for example, if you allow each layer to scroll a maximum of 8 pixels per frame in any direction you will need to reallocate a maximum of 500 tiles per frame. The majority of those tiles are likely to already be loaded, so you may not have to actually copy all 500 tiles into memory each frame, but each one of those 500 tiles will need to be checked through the tile allocation and deallocation routines.
On the down side this scheme does take significant memory resources. We store the three linked lists we need interwoven in the same array that we use to store reference counts, so this does not take much memory, but the various RAM-to-ROM and ROM-to-RAM mappings can be large. In the configuration we are using for our game engine we are using about 48k of RAM to store all the data. Given that the AGB main system memory is only 256k, this is quite significant. This memory requirement can be significantly reduced by reducing the size of the largest allowable ROM tile set.
General Purpose Memory Management
Given the memory constraints of the AGB, and the lack of a swap device for memory, you need to be extremely careful about how you allocate main system memory. For most data structures in a game it makes sense to statically allocate all the structures you will need at compile time. This way you don't have to worry memory allocations failing during run-time. For some game systems however it really helps to have small pools of dynamically allocated memory. These pools can be quite small. For example, we use a tiny buffer of 4k as scratch memory for the AI routines of enemies that are on the screen.
For these small general-purpose memory pools you cannot make too many assumptions about how the memory will be used. You want an allocator that will quickly give you memory chunks of any desired size. Because the pools are so small it is important that the memory overhead of dynamically managing these pools be very small as well. We present here a general-purpose memory manager that uses just three percent of the managed buffer for memory management overhead. This allocator must of course also coalesce adjacent free'd buffers and must choose the areas it allocates from carefully, to minimize dead space between allocated blocks (otherwise what is the point of getting the management overhead down to three percent?) It can manage buffers as small as 32 bytes and as large as 512k, twice the size of the AGB's main memory.
In the version we present here the memory manager manages the memory using 8-byte memory blocks. Of course, you can use memory blocks of any size you want. Raising the block size will decrease the percentage of memory used for overhead, but will increase the amount of dead memory that cannot be allocated when the user asks for memory regions that are not multiples of the block size. Decreasing the block size does the reverse. Therefore, for a block size of 16, the overhead of the memory management data structures falls to 1.5%. The amount of "dead memory" will depend on the user's allocation pattern. When implementing a system like this you will want to consider the block size carefully.
The trick to making a memory manager with low memory overhead is to store as much of the heap management data structures as possible in the unallocated memory blocks. In our memory manager we use two basic data structures: a memory map that indicates which blocks are allocated, and an ordered binary tree that keeps track of all the free blocks. The memory map is stored in the beginning of the allocated buffer and is not available for allocation. The binary tree, however, is stored in the free blocks themselves, and so costs us nothing.
For each memory block in the buffer we need to store two pieces of information in the memory map:
To store these two pieces of information in the memory map we use two bits per block. So we can manage up to 32 bytes of data per byte of memory map. (This is where the three percent memory overhead comes from.)
For free blocks we don't need to store any information in the memory map. Once we know its free we know that we can find out all we want to know about it from the binary tree node that is stored inside the free block itself.
The memory manager also has a small 8-byte header that contains the total number of blocks being managed, an index into the root of the binary tree of free blocks, and a pointer to the first managed node.
The memory map follows the header, which is of the right size to store two bits of information per node. Finally each free block may contain a node for the binary tree. The tree nodes hold a pointer to the parent node, a pointer to all that node's smaller children, a pointer to all of the node's larger children and the size of the node. Each free node structure is 8-bytes-just fitting into the block.
The manager initializes all the bits in the memory map to free, with the binary tree holding a single node whose size is as big as all of the allocatable memory in the buffer.
To allocate a block you round the memory requested to the next block size, and then you walk the binary tree searching for the smallest available block that is large enough to allocate the required memory. Once an appropriate tree node is found, we shrink the node by the requested size. If no memory is left over the node is removed from the tree, if there is memory left over the node is resized and re-linked to the appropriate place in the tree. Finally, you need to update the memory map to flag all of the allocated blocks as used, and you need to set the last block bit for the last block in the allocated region.
To free a block you find its area in the memory map, which you can do arithmetically from the pointer passed in, and start clearing allocated bits until you find a map entry that is labeled as the last block in the region. Then you check in the memory map to see if the areas adjacent to this block are free. If they are, you merge the free'd block with its neighbors.
For a reasonably well-balanced tree the allocation operation will average m + log(n) time where n is the number of free nodes, and m is the size of the block. The free operation is linear with the size of the block. Our current implementation uses just a sorted binary tree. We do not do any automatic rebalancing, so it is possible, though unlikely, for the tree to become extremely unbalanced and slow the allocator down significantly when there are large numbers of free nodes. If this becomes a problem it is straightforward to implement a rebalancing routine like red-black. We have not found it to be necessary.
Random Number Generation with the Mersenne Twister
And now for something completely different...
The AGB's main CPU does not implement division and mod in hardware. They are implemented in software and are too slow to use for run time generation of random numbers. Most of the high quality random number generation routines require these operations. There is a relatively new random number generation routine called the Mersenne Twister developed by Makoto Matsumoto and Takuji Nishimura in 1997, which does not require mod or divide. This is an industrial strength generator with a period of 2^19937 and is 623-dimensionally equi-distributed. This is far more powerful than you are likely to need in a game—but it is simple to implement and fast to run, so why not use it? To implement the full generator as described in their paper you need to allocate an array of 2496 bytes to store the current state of the generator. It is possible to make this buffer smaller but this will reduce the period of the generator. But then again, this generator already has a period far greater than the number of stars in the universe, so cutting the period down a little bit won't hurt much if you need to get an extra kilobyte out of the generator's state buffer. Generating a new random number out of the state buffer is a matter of doing 4 XORs, 2 ANDs and 4 register shifts, so its fast enough for any purpose. I won't include any information on the theory of how the Mersenne Twister works. You can read about it in Matsumoto and Nishimura's paper (available on their web site http://www.math.keio.ac.jp/~matumoto/emt.html)
Return to the full version of this article
Copyright © UBM Tech, All rights reserved