|
[In this Intel-sponsored feature, part of the Gamasutra Visual Computing microsite, Goo! developer Tommy Refenes of PillowFort matches wits with multi-threading in four gripping acts -- and emerges victorious]
I'll admit it.
I'm a speed freak. (I tried to get help, but
it turns out
those clinics by the train station are for
something totally
different than my addiction.) Simply
put, I like what
I write to run fast, getting a certain
amount of
personal joy when something I've written
runs at over 100
FPS.
I've always strived to make my
code as efficient
as possible, but I didn't realize how little
I knew until my
previous job. Back then I was an Xbox
360 engine
programmer at a studio working crazy hours
to get a game up
on XBLA. I was thrown into the world
of memory
management, cache optimizations, rendering
optimizations,
and briefly touched on the land of threading.
In May 2006 I
left that company to start working on
Goo!, a game in
which you control big globs of liquid and
use those globs
to envelop enemy goos. Players have
total control
over their goo, which means a ton of
collision
calculations and even more physics calculations.
My new job, once
again, was an engine programmer,
and I knew that
to have real-time interactive liquid
rendering on the
screen at a speed that wouldn't
give me FPS
withdrawal, I would need to thread the
collision and
physics to the extreme.
(A
small disclaimer: Goo! is my first attempt
at multi-threaded programming. I've learned everything I know about threading
reading MSDN help, a few random Intel documents, and various other
resources on threads. I didn't have the fancy pants books or money
to buy said books. So in this article I may miss a few official terms and
concepts and I may be totally wrong about some things,
but like all programming, it's all about results.)
Act 1:
Power Overwhelming
Goo!
began life on a Intel Pentium 4
processor
3.0 GHz with Hyper-Threading
Technology.
Goos are rendered by
combining
thousands of small blob
particles
on the screen to make up
a
height map, which is then passed
through
filters to make the goos look
like
liquid or mercury. That's how they
are
rendered now, but in the beginning
they
were hundreds of smiley faces.
Back
then, Goo! rendered and calculated
around
256 blobs on the screen. Blobs'
collision
calculations were culled based
on
velocity and position, but it was all
pushed
through the Intel Pentium 4
processor
and needless to say the
results
weren't great.
With
the game rendering at around
20
FPS I knew it was time to thread the
collision
detection. I broke down and
purchased
an Intel Pentium D 965
Extreme
Edition processor. The dual
3.73
GHz cores would be more than
enough
to start building a new
multi-threaded
collision detection
system.
Like a five-year old back in
1986
with a brand new Optimus Prime,
I
was excited and anxious to get started
on
the new collision engine.
So
began my first attempt at threading.
I
adopted a philosophy inspired by Ron
Popeil
(of Ronco fame): "Set it and forget
it."
Logically data that isn't needed
immediately
is a perfect candidate for
a
threaded operation. Collision detection
fell
into this thinking very nicely because
most
of the time collision results will be
the
same for several frames before
changing.
Therefore, I reasoned, collision
could
run on a separate thread and be
checked
every frame to see if the routine
had
finished, copying over the results
when
finished, and using them in the
current
frame's physics calculations.
At
the
time, this seemed like a great idea.
Not
waiting for collision detections to
finish
every frame allowed the engine
to
move on and continue crunching
physics
and gameplay calculations
with
old data, while still maintaining a
pretty
high frame rate. At this point,
Goo!
was pushing through around 625
blobs
rendering at about 60 FPS.
The
threading model was simple:
A
collision thread was created when
the
game loaded, sleeping until it had
something
to work on. The main thread
copied
position and size data for every
blob
into a static array the thread could
access
and then continued on to physics
calculations
and rendering. The collision
thread
woke up, saw it had work to do,
and
proceeded with a very brute force
method
of figuring out who's touching
who.
It then saved this data into an
array
of indices that corresponded to
every
blob in the game.
Once finished, it
posted
a flag to the main thread. When
the
main thread saw that the collision
thread
had completed its calculations,
it
copied the newly calculated collision
data
over the old data, copied new
position
and size data, and once again
sent
it to the collision thread to calculate.
At
this time, each collision calculation
took
around 20 ms, which meant that
collision
data was updated just about
every
1.5 frames.
With
625 blobs on the screen rendering
at
around 60 FPS, gameplay could now
start
development. This is where the
first
problems began to hit. The goos
themselves
didn't really feel like goos.
When
625 blobs represent one goo it
looks
fine, but Goo! is a game where
you
battle against other goos so having
more
than one on the screen made the
goos
look a little chunky.
Chunky goos
(unless
you are talking about GooGoo
Clusters)
are no good. The solution:
more
blobs! I expanded the threading
model
by adding another collision thread
and
splitting up the work over the two
threads.
The main thread now broke the
collision
work up into two pieces, telling
collision
thread 1 to calculate the first
half
of the data, and collision thread 2 to
calculate
the second half. Once the two
threads
finished, they sent two flags
and
waited for the main thread to see
that
data, copy it over, and send more
work.
This method yielded some better
results,
allowing around 900 blobs to be
rendered
to the screen at 60 FPS.
With
900 blobs on the screen goos
started
looking like goos, and it was time
to
put the collision to rest for a while
and
focus on rendering. But as gameplay
development
progressed, various little
unexplained
bugs started to pop up:
Memory
would appear as if it were
being
accessed after being freed, blobs
wouldn't
change ownership properly,
and
the simulation would occasionally
explode.
Although these were gamebreaking
bugs,
they were so infrequent
that
I didn't bother with them until
about
seven months later after some
core
gameplay was flushed out.
|
Being smarter always beats the brute force approach ...
You can see the video of it here: http://vrlab.epfl.ch/multimedia/BA_behavioural_animation.html#Riot
Can you tell us more about the scaling from 1 to 4 cores ? and 2 to 4 cores ?
Do you think your threading architecture can handle a 16 or 32 core processor ?
Vincent
I also find it strange that frequently creating and destroying threads would be more efficient than waiting on a system object. How do you decide when you 'need' to create a new thread, and how do you allocate work to those threads? Are they dedicated to specific tasks or do they have some sort of generic queue system just running through any kind of task that may be available?
The next article is going explain the new engine which has a threading model that self profiles itself on multiple core machines.
Matt / Erik:
The new threading model is similar because I do create more threads to do work when I need them, but when they are not working they are set to a wait state. These threads are assigned to specific functions instead of a giant thread pool that I pick out of when needed. I found this to be better due to the nature and frequency of the work needed to be ran on these threads. There are some threads that are created and then allowed to expire, but the ones that do collision detection / physics and object updates are activated by the main CPU, determine through their self profiling how many threads they can split the work over, create threads if needed (i.e. the profiling determines that instead of running over 3 threads, it will benefit from 4) and then does that work while the main CPU does something then they are put into a wait state and then activated by the main CPU once again when they are needed. This model is by far the most efficient. The purpose of this article was to explain the first attemps and put forth the obvious problems with that old threading model.
I hope you the best in the contest. Personally I think using multiprocessor multithreaded power in games is the next step.
I have used threads in Embedded systems and they are really challenging but as soon as you understand them they help a lot optimizing your application.
I will be waiting for you next article
Thank you very much for the complement! http://softwarecommunity.intel.com/articles/eng/3978.htm
"First Place "Goo" by Tommy Refenes, PillowFort Games."
Vincent, as an appetizer to your next article, just tell us if your threading model may work on a 32 core processor :p