|
Resource
Guide

Of Internet Servers and SQL Databases:
Designing the Backend for Power and Performance
How Big is Big?
A question that plagued me for a long time when I first started working
with databases is just what constitutes a “big” table? You'll
often hear people say that this or that operation doesn't work well on
“big” tables, but exactly how many rows are we talking about
here? On ITNet, we have just under 200 tables, ranging in size from 1
to 60 million rows. I've summarized my experiences working with these
table to give you a better sense of how size impacts the utility of a
table:

Indexes are Your Friend
Although indexes are well covered by most database books, there are
a couple of important concepts that bear clarification. First and foremost,
be aware that you will not be able to do anything quickly on a Medium
size or larger table unless you have at least one index on it. The only
exception to this is inserting new data rows. Indexes are what makes it
possible to do anything useful with a really big table.
All tables can be given a special index called a primary key. The primary
key has many uses, but it's main purpose is to define the column (or columns)
that will principally be used as the “key” when retrieving
records from the table. As such, the primary key must be unique for each
record in the table. A good example of a primary key would be a unique
player ID number that you might assign to all players of your game. The
reason I bring up primary keys is that there is a very common trick used
by database programmers to assign a unique ID number to all records in
a table. This trick involves generating a random key value for each record
in the table at insertion time. This random – but unique –
value is then used to identify the record in all other tables that reference
it. While this technique is great for tables that don't have a naturally
occurring unique value as part of their data, I don't recommend using
it as a one-size-fits-all solution. You will find that some database programmers
have strongly held “religious” beliefs that only synthetic
data should be used in a primary key. The argument is that real data is
subject to change at some later date, so only made up data (which is under
the complete control of the database programmer) should only ever be part
of the key.
This argument is, to put it bluntly, baloney. Using a unique value from
a table's data as the primary key has some important advantages –
notably that the information actually means something when you see it
stored as a reference in another table. It can also prevent unnecessary
database work in some situations in which the unique data is readily available,
but must first be converted to the synthetic primary key value via a table
lookup before it can be used. As long as the unique data you have earmarked
for the primary key is fully under your control (as in the case of player
ID numbers) or unlikely to ever change (as in the case of, say, social
security numbers) there is much to be gained from using it in this way.
Indexes are often used as a way of guaranteeing uniqueness in a table.
In ITNet's GUPE table, for example, we have a multi-column unique index
that covers several columns which, taken as a group, are always different
from one game to the next. We use this index not to look up records, but
as a way of preventing duplicate game records from being inserted into
the table. This is an excellent (albeit somewhat expensive) trick.
Although indexes are great, you should keep in mind that every index
you add to a table increases the amount of time it takes to insert new
data into the table. This is because when you insert a row, every index
in a table must be updated to include the data in the new row sorted in
its proper place. Also, indexes can take up a lot of space. Indexes on
really big tables, in particular, are really big. Remember that adding
indexes incurs a resource cost, so you need to carefully weigh the value
of every index before you add it to a table.
The Pros and Cons of Normalized Data
Another topic that can stir up strong opinions among database professionals
is data normalization. Put simply, normalizing data is about representing
data in it's most compact form, and ensuring that every value in the database
exists in one and only one place. The best way to explain normalization
is through an example. Imagine that you're saving player data in a database,
and you want to allow each players to associate up to two email addresses
with their data. The simplest way to do this would be to add two text
columns to your standard player data table that already contains their
name, address, phone number, etc. If a player entered only one email address,
you could just put a NULL value in the second email address column. While
this sounds like a perfectly acceptable solution, there are many database
programmers who will strongly object to the scheme. The problem is that
the data is not normalized properly. A normalized solution would create
a separate two-column table to hold the email data: the first column would
be the primary key value of the player record that the data belongs to
(to allow you to associate the record with the proper player), and the
second column would hold a single email address. Now if a player enters
a single email address they will have only a single record in the email
table. A player that enters two email addresses will have two records
in the table. Database programmers often prefer the normalized approach
because it more closely matches the table structure to the data and prevents
wasted space (in the above example, the often empty second email address
column in the player data table).
The problem with normalization is that it isn't always the best solution.
Although it clearly models the data much more accurately, it also incurs
a processing penalty: in the above example, whenever you want to retrieve
a player's information you will be forced to join the player table with
the email table to gain access to all the data. In the non-normalized
approach, all that's needed is a single lookup in the player table and
you're done. A normalized scheme can also cause table contention bottlenecks
when multiple processes need access to the same data at the same time.
Although redundancy is potentially bad in terms of data integrity, it
can be a critical technique for eliminating resource contention bottlenecks
on busy servers.
Although there are many such trades offs in when working with databases,
this particular issue can be a real bone of contention with database programmers.
From what I've seen, there appear to be a large number of database professional
who look at normalization as a philosophical rather than a practical issue.
To many, normalization is essential to the “purity” and “elegance”
of the overall database design. What's important to take away from this,
is that the cultures of database and C programmers are different and can
easily clash. It took me a long time to feel confident enough to stand
up to a database programmer when he or she insisted that a solution was
the “only right way” to solve a problem, but it turned out
to be an essential part of making ITNet work well.
Writing Stored Procedures for Fun and Profit
Stored procedures are truly wonderful things, and are essential for
making your servers run quickly. As any good database book will tell you,
a stored procedure (or SP) is basically just a script that runs on the
database. You can issue any of the SQL queries that you might normally
run interactively from within a stored procedure. You can also use variables,
temporary tables, looping, branching, and other sophisticated constructs
to coax complex behavior from a stored procedure.
Stored procedures provide a speed boost in two ways. First, they allow
you to embed logic on the database that can keep you from having to send
data back and forth over the network every time you need to make a decision.
If you want to update a value in a table based on some value in a second
table, for example, you can examine the second value from within the stored
procedure and issue the correct update query on the spot. The only way
to perform this kind of thing without stored procedures is to first request
the value from the second table. Wait for it to arrive back at the server.
Examine the value. And finally, issue a second query to update the first
table. Stored procedures save you all that up and back.
The second way stored procedures help you is by pre-processing all the
SQL queries you plan to use before the SP is invoked. In the normal case,
when a SQL query string is sent to a database for execution, the database
must first parse the string, check its validity, and hand it off to a
component called the query optimizer to figure out the best way to execute
it. The optimizer generates something called a query plan which describes
the steps the database engine must take to satisfy the query. Obviously,
performing all these extra steps can add significantly to a query's execution
time. The amount of time it takes to run the optimizer is unnoticeable
to a human operator, however it can add up for a query that gets executed
over and over.
I advise you to make liberal use of stored procedures when designing
your server. In addition to the performance benefits, stored procedures
give you a terrific way to change your server's behavior without having
to recompile and deploy new server code. ITNet makes heavy use of stored
procedures for both of these reasons.
As good as stored procedures are, there are a few things you need to
take into account when using them. Perhaps the most important is that
there's no easy way to debug a stored procedure. We C/C++ programmers
have gotten spoiled on sophisticated debugging tools that let us single
step, watch variables, etc. Stored procedures will welcome you back to
the bad old days of debugging by “printf” statement. The combination
of a powerful scripting language and non-existent debugging tools means
you can get yourself into a lot of trouble if you try to make your stored
procedures too long or tricky. Your best bet is to stick with stored procedures
that are reasonably short and simple. Most of ITNet's stored procedures
are less than 500 lines in length. The few times we've risked longer and
more complicated SPs we've been disappointed on several fronts. Not only
were the SP's hard to write and debug, they seemed to run less satisfactorily
against the database than their shorter brethren. If I had to guess, I'd
say modern database engines are optimized for running short to medium
length stored procedures. Since that's what the tools make easiest to
write, my advise is to just go with the flow.
While on the topic of stored procedures, its worth it to pause for a
moment and talk about the query optimizer. While this piece of code does
an absolutely remarkable job most of the time, be aware that it will sometimes
make disastrously bad choices about the best way to execute a query. Under
Sybase ASE, there is a command you can issue in an interactive SQL window
to output a text version of the query plan. By examining this output,
you can get an idea of what decisions the optimizer has made. Not surprisingly,
the more complicated the query (particularly when multiple table joins
are involved) the less reliable the optimizer. Over time, as you get better
and better at designing queries, the process of query optimizing will
begin to reduce to figuring out how to trick the query optimizer into
not making bone headed decisions.
There are two techniques that I've come to rely on heavily to work around
the optimizer. The first involves modifying the query to contain a hint
to the optimizer about what index should be used to resolve a specific
query. I'm not sure if the syntax for this is specific to Sybase databases,
so I'll refer you to your database manuals for the instructions on how
to do this. You should also read the cautionary information about overriding
the query optimizer's index selections that you will find there. Unless
you know what you're doing, you're as likely to bring a query to its knees
as speed it up when using index hints.
The other technique is much simpler and safer: break a query down into
several smaller queries that are issued one after another instead of all
at once. In theory, issuing a bunch of small queries can involve more
work than would be needed to resolve a single big query. In practice,
however, spoon feeding the optimizer is often the only thing that lets
a query run in a reasonable amount of time at all.
Batch Processing, Shadow Tables, Caching, and Other Tricks of the
Trade
When we initially created the specs for ITNet, we had great plans
for how we would improve over the old ITS system. One of the things we
were especially keen to add were tournament leader boards that could update
the instant a new game arrived that affected the standings. Because we
wanted leader board data to be available on the web site, our plan was
to keep this data in the database where it could be accessed by any external
program that cared about it.
Over the course of the next year as ITNet got busier and busier, we went
through three major overhauls of the leader board code. Our first attempt
used a heavily optimized stored procedure to sort and display messages
straight out of our massive game data table GUPE. When this failed due
to load, we created a set of tables that attempted to speed up the process
by working on a smaller subset of data. Six months later, we were right
back where we started from due to additional load from all the new games
that had come online. Our final attempt was a heavily optimized set of
tables that kept most of the data needed to calculate the leader boards
on the database, but relied on the ITNet servers to cache certain pieces
of information on the server for faster processing. Every hour, the server
would write the updated data to the leader board tables to keep the database
in sync. In the end, the system provided instantaneous updating of leader
boards for games connecting to ITNet, but allowed the leader board data
in the database to lag as much as an hour behind the most up-to-date version.
The system worked, and is still in use to this day.
The lesson we learned while creating our leader board system is an important
truth about databases: instant gratification is expensive. Designing a
database involves a seemingly endless series of tradeoffs and compromises.
What the leader board experience taught us was that you have to trade
away a lot in terms of performance and development time if you want your
database to give you instantly updating views of complex data.
The good thing that came out of the experience, though was that it forced
us to learn three of the most powerful techniques for improving a database's
performance: batch processing, shadow tables, and server-side caching.
Batch processing is a tried and true method for reducing the processing
load on a database. The basic idea is that instead of trying to keep some
view of the data instantly updated, you run a task at regular intervals
that updates the view to the current state. Note that by “view”
here I'm not talking about the database object of the same name. Instead,
I'm talking about a snapshot of the data, which is usually kept in a special
table for quick access. This external table is what, in ITNet parlance,
we call a “shadow” table: a non-authoritative copy of the
data that is maintained only for performance reasons. As you may recall
from the discussion of normalization, duplicate copies of key data are
often kept in separate tables to speed up access to the data. Shadow tables
are where this duplicate data lives. Batch processing and shadow tables
are extremely useful concepts that are used throughout the ITNet system.
Server-side data caching is another important concept that can really
improve your server's performance. The benefits of such a scheme are immediately
obvious: instead of having to wait for a database roundtrip to get some
important bit of data, the server can retrieve it instantly from its own
RAM. The one thing you need to be careful of, though, is that you don't
start using the server side cache to make changes to the data that aren't
represented somewhere in the database. The risk here is that the server
could crash and the changed data would be lost forever. With a little
care, though, this problem can be easily avoided. If you think about it,
a server-side cache as just a server-local version of a shadow table,
so as long as you treat this data as non-authoritative and disposable
you won't run into any trouble.
________________________________________________________
[Back
To] Do I Need SQL?
|