It's free to join Gamasutra!|Have a question? Want to know who runs this site? Here you go.|Targeting the game development market with your product or service? Get info on advertising here.||For altering your contact information or changing email subscription preferences.
Registered members can log in here.Back to the home page.    

Search articles, jobs, buyers guide, and more.

by Pete Hallenberg
Gamasutra
[Author's Bio]
September 16, 2002

Do You Need SQL?

Database Design

How Big is Big?

Printer Friendly Version

[Back To] Online Games Resource Guide

Sponsored by:

This feature originally appeared in the Proceedings of the 2002 Game Develoers Conference.

 

 


Resource Guide

Of Internet Servers and SQL Databases:
Designing the Backend for Power and Performance

How Big is Big?
A question that plagued me for a long time when I first started working with databases is just what constitutes a “big” table? You'll often hear people say that this or that operation doesn't work well on “big” tables, but exactly how many rows are we talking about here? On ITNet, we have just under 200 tables, ranging in size from 1 to 60 million rows. I've summarized my experiences working with these table to give you a better sense of how size impacts the utility of a table:

Indexes are Your Friend
Although indexes are well covered by most database books, there are a couple of important concepts that bear clarification. First and foremost, be aware that you will not be able to do anything quickly on a Medium size or larger table unless you have at least one index on it. The only exception to this is inserting new data rows. Indexes are what makes it possible to do anything useful with a really big table.

All tables can be given a special index called a primary key. The primary key has many uses, but it's main purpose is to define the column (or columns) that will principally be used as the “key” when retrieving records from the table. As such, the primary key must be unique for each record in the table. A good example of a primary key would be a unique player ID number that you might assign to all players of your game. The reason I bring up primary keys is that there is a very common trick used by database programmers to assign a unique ID number to all records in a table. This trick involves generating a random key value for each record in the table at insertion time. This random – but unique – value is then used to identify the record in all other tables that reference it. While this technique is great for tables that don't have a naturally occurring unique value as part of their data, I don't recommend using it as a one-size-fits-all solution. You will find that some database programmers have strongly held “religious” beliefs that only synthetic data should be used in a primary key. The argument is that real data is subject to change at some later date, so only made up data (which is under the complete control of the database programmer) should only ever be part of the key.

This argument is, to put it bluntly, baloney. Using a unique value from a table's data as the primary key has some important advantages – notably that the information actually means something when you see it stored as a reference in another table. It can also prevent unnecessary database work in some situations in which the unique data is readily available, but must first be converted to the synthetic primary key value via a table lookup before it can be used. As long as the unique data you have earmarked for the primary key is fully under your control (as in the case of player ID numbers) or unlikely to ever change (as in the case of, say, social security numbers) there is much to be gained from using it in this way.

Indexes are often used as a way of guaranteeing uniqueness in a table. In ITNet's GUPE table, for example, we have a multi-column unique index that covers several columns which, taken as a group, are always different from one game to the next. We use this index not to look up records, but as a way of preventing duplicate game records from being inserted into the table. This is an excellent (albeit somewhat expensive) trick.

Although indexes are great, you should keep in mind that every index you add to a table increases the amount of time it takes to insert new data into the table. This is because when you insert a row, every index in a table must be updated to include the data in the new row sorted in its proper place. Also, indexes can take up a lot of space. Indexes on really big tables, in particular, are really big. Remember that adding indexes incurs a resource cost, so you need to carefully weigh the value of every index before you add it to a table.

The Pros and Cons of Normalized Data
Another topic that can stir up strong opinions among database professionals is data normalization. Put simply, normalizing data is about representing data in it's most compact form, and ensuring that every value in the database exists in one and only one place. The best way to explain normalization is through an example. Imagine that you're saving player data in a database, and you want to allow each players to associate up to two email addresses with their data. The simplest way to do this would be to add two text columns to your standard player data table that already contains their name, address, phone number, etc. If a player entered only one email address, you could just put a NULL value in the second email address column. While this sounds like a perfectly acceptable solution, there are many database programmers who will strongly object to the scheme. The problem is that the data is not normalized properly. A normalized solution would create a separate two-column table to hold the email data: the first column would be the primary key value of the player record that the data belongs to (to allow you to associate the record with the proper player), and the second column would hold a single email address. Now if a player enters a single email address they will have only a single record in the email table. A player that enters two email addresses will have two records in the table. Database programmers often prefer the normalized approach because it more closely matches the table structure to the data and prevents wasted space (in the above example, the often empty second email address column in the player data table).

The problem with normalization is that it isn't always the best solution. Although it clearly models the data much more accurately, it also incurs a processing penalty: in the above example, whenever you want to retrieve a player's information you will be forced to join the player table with the email table to gain access to all the data. In the non-normalized approach, all that's needed is a single lookup in the player table and you're done. A normalized scheme can also cause table contention bottlenecks when multiple processes need access to the same data at the same time. Although redundancy is potentially bad in terms of data integrity, it can be a critical technique for eliminating resource contention bottlenecks on busy servers.

Although there are many such trades offs in when working with databases, this particular issue can be a real bone of contention with database programmers. From what I've seen, there appear to be a large number of database professional who look at normalization as a philosophical rather than a practical issue. To many, normalization is essential to the “purity” and “elegance” of the overall database design. What's important to take away from this, is that the cultures of database and C programmers are different and can easily clash. It took me a long time to feel confident enough to stand up to a database programmer when he or she insisted that a solution was the “only right way” to solve a problem, but it turned out to be an essential part of making ITNet work well.

Writing Stored Procedures for Fun and Profit
Stored procedures are truly wonderful things, and are essential for making your servers run quickly. As any good database book will tell you, a stored procedure (or SP) is basically just a script that runs on the database. You can issue any of the SQL queries that you might normally run interactively from within a stored procedure. You can also use variables, temporary tables, looping, branching, and other sophisticated constructs to coax complex behavior from a stored procedure.

Stored procedures provide a speed boost in two ways. First, they allow you to embed logic on the database that can keep you from having to send data back and forth over the network every time you need to make a decision. If you want to update a value in a table based on some value in a second table, for example, you can examine the second value from within the stored procedure and issue the correct update query on the spot. The only way to perform this kind of thing without stored procedures is to first request the value from the second table. Wait for it to arrive back at the server. Examine the value. And finally, issue a second query to update the first table. Stored procedures save you all that up and back.

The second way stored procedures help you is by pre-processing all the SQL queries you plan to use before the SP is invoked. In the normal case, when a SQL query string is sent to a database for execution, the database must first parse the string, check its validity, and hand it off to a component called the query optimizer to figure out the best way to execute it. The optimizer generates something called a query plan which describes the steps the database engine must take to satisfy the query. Obviously, performing all these extra steps can add significantly to a query's execution time. The amount of time it takes to run the optimizer is unnoticeable to a human operator, however it can add up for a query that gets executed over and over.

I advise you to make liberal use of stored procedures when designing your server. In addition to the performance benefits, stored procedures give you a terrific way to change your server's behavior without having to recompile and deploy new server code. ITNet makes heavy use of stored procedures for both of these reasons.

As good as stored procedures are, there are a few things you need to take into account when using them. Perhaps the most important is that there's no easy way to debug a stored procedure. We C/C++ programmers have gotten spoiled on sophisticated debugging tools that let us single step, watch variables, etc. Stored procedures will welcome you back to the bad old days of debugging by “printf” statement. The combination of a powerful scripting language and non-existent debugging tools means you can get yourself into a lot of trouble if you try to make your stored procedures too long or tricky. Your best bet is to stick with stored procedures that are reasonably short and simple. Most of ITNet's stored procedures are less than 500 lines in length. The few times we've risked longer and more complicated SPs we've been disappointed on several fronts. Not only were the SP's hard to write and debug, they seemed to run less satisfactorily against the database than their shorter brethren. If I had to guess, I'd say modern database engines are optimized for running short to medium length stored procedures. Since that's what the tools make easiest to write, my advise is to just go with the flow.

While on the topic of stored procedures, its worth it to pause for a moment and talk about the query optimizer. While this piece of code does an absolutely remarkable job most of the time, be aware that it will sometimes make disastrously bad choices about the best way to execute a query. Under Sybase ASE, there is a command you can issue in an interactive SQL window to output a text version of the query plan. By examining this output, you can get an idea of what decisions the optimizer has made. Not surprisingly, the more complicated the query (particularly when multiple table joins are involved) the less reliable the optimizer. Over time, as you get better and better at designing queries, the process of query optimizing will begin to reduce to figuring out how to trick the query optimizer into not making bone headed decisions.

There are two techniques that I've come to rely on heavily to work around the optimizer. The first involves modifying the query to contain a hint to the optimizer about what index should be used to resolve a specific query. I'm not sure if the syntax for this is specific to Sybase databases, so I'll refer you to your database manuals for the instructions on how to do this. You should also read the cautionary information about overriding the query optimizer's index selections that you will find there. Unless you know what you're doing, you're as likely to bring a query to its knees as speed it up when using index hints.

The other technique is much simpler and safer: break a query down into several smaller queries that are issued one after another instead of all at once. In theory, issuing a bunch of small queries can involve more work than would be needed to resolve a single big query. In practice, however, spoon feeding the optimizer is often the only thing that lets a query run in a reasonable amount of time at all.

Batch Processing, Shadow Tables, Caching, and Other Tricks of the Trade
When we initially created the specs for ITNet, we had great plans for how we would improve over the old ITS system. One of the things we were especially keen to add were tournament leader boards that could update the instant a new game arrived that affected the standings. Because we wanted leader board data to be available on the web site, our plan was to keep this data in the database where it could be accessed by any external program that cared about it.

Over the course of the next year as ITNet got busier and busier, we went through three major overhauls of the leader board code. Our first attempt used a heavily optimized stored procedure to sort and display messages straight out of our massive game data table GUPE. When this failed due to load, we created a set of tables that attempted to speed up the process by working on a smaller subset of data. Six months later, we were right back where we started from due to additional load from all the new games that had come online. Our final attempt was a heavily optimized set of tables that kept most of the data needed to calculate the leader boards on the database, but relied on the ITNet servers to cache certain pieces of information on the server for faster processing. Every hour, the server would write the updated data to the leader board tables to keep the database in sync. In the end, the system provided instantaneous updating of leader boards for games connecting to ITNet, but allowed the leader board data in the database to lag as much as an hour behind the most up-to-date version. The system worked, and is still in use to this day.

The lesson we learned while creating our leader board system is an important truth about databases: instant gratification is expensive. Designing a database involves a seemingly endless series of tradeoffs and compromises. What the leader board experience taught us was that you have to trade away a lot in terms of performance and development time if you want your database to give you instantly updating views of complex data.

The good thing that came out of the experience, though was that it forced us to learn three of the most powerful techniques for improving a database's performance: batch processing, shadow tables, and server-side caching.

Batch processing is a tried and true method for reducing the processing load on a database. The basic idea is that instead of trying to keep some view of the data instantly updated, you run a task at regular intervals that updates the view to the current state. Note that by “view” here I'm not talking about the database object of the same name. Instead, I'm talking about a snapshot of the data, which is usually kept in a special table for quick access. This external table is what, in ITNet parlance, we call a “shadow” table: a non-authoritative copy of the data that is maintained only for performance reasons. As you may recall from the discussion of normalization, duplicate copies of key data are often kept in separate tables to speed up access to the data. Shadow tables are where this duplicate data lives. Batch processing and shadow tables are extremely useful concepts that are used throughout the ITNet system.

Server-side data caching is another important concept that can really improve your server's performance. The benefits of such a scheme are immediately obvious: instead of having to wait for a database roundtrip to get some important bit of data, the server can retrieve it instantly from its own RAM. The one thing you need to be careful of, though, is that you don't start using the server side cache to make changes to the data that aren't represented somewhere in the database. The risk here is that the server could crash and the changed data would be lost forever. With a little care, though, this problem can be easily avoided. If you think about it, a server-side cache as just a server-local version of a shadow table, so as long as you treat this data as non-authoritative and disposable you won't run into any trouble.

________________________________________________________

[Back To] Do I Need SQL?


join | contact us | advertise | write | my profile
news | features | companies | jobs | resumes | education | product guide | projects | store



Copyright © 2002 CMP Media LLC. All rights reserved.
privacy policy | terms of service