# How porting to the PSVita improved performance everywhere else.

by Lars Doucet on 05/30/18 04:00:00 am

The following blog post, unless otherwise noted, was written by a member of Gamasutra’s community.
The thoughts and opinions expressed are those of the writer and not Gamasutra or its parent company.

Defender's Quest: Valley of the Forgotten DX has always had lingering issues with performance, and I've finally fixed them. Our main impetus to massively improve performance was our PlayStation Vita port. The game had been out on PC already, and ran alright on Xbox One and PS4, if not perfectly. But unless we seriously upped our game, there was no way it would run on the Vita.

When a game is slow, internet commenters tend to blame the programming language or engine. It's true that languages like C# and Java incur more overhead than C and C++, and that tools like Unity have longstanding issues with e.g. garbage collection, but the real reason people reach for these explanations is that they're the software's most visible properties. In practice, the performance killers are just as likely to be stupid little details that have nothing to do with architecture.

# 0. Profiling Tools

There's only one real way to make a game faster -- profile it. Figure out where the computer is spending too much time, and make it spend less time to do the same thing, or even better -- make it spend no time by doing nothing at all.

The simplest profiling tool is just the Windows performance monitor:

This is actually a pretty versatile tool, and it's dead simple to run. Just Ctrl+Alt+Delete, open task manager, and hit the "Performance" tab. Just make sure you're not running a bunch of other programs. You should easily be able to detect CPU spikes and even memory leaks, if you're watching carefully. It's low-tech, but this is the first step for finding slow spots, besides just playing the game and noticing when stuff is sluggish.

Defender's Quest is written in Haxe, a high-level language that compiles to other languages, my chief target being C++. This means that any tool that can profile C++ can profile my Haxe-generated C++ code. So when I want to dig into what's causing problems, I boot up Visual Studio's Performance explorer:

The various consoles also have their own performance profiling tools, which are pretty awesome, but due to NDAs I'm not allowed to tell you anything about them. But if you have access to them, definitely use them!

Rather than write a crappy tutorial on how to use profiling tools like Performance Explorer, I'll just link to the official docs and get to the main event -- surprising things that yielded huge speedups, and how I found them!

# 1. Identifying the Problem

Performance is as much about perception as it is actual speed. Defender's Quest is a tower defense game that renders at 60 FPS, but with a variable game speed anywhere from 1/4x to 16x. Regardless of game speed, the simulation uses a fixed timestep of 60 updates per 1 second of 1x simulation time. So if you run the game at 16x, you're actually running the update logic at 960 FPS. That's honestly a lot to ask of a game! But I'm the one who made this mode possible, and if it's slow, players aren't wrong to notice.
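To make that arithmetic concrete, here's a minimal sketch of a fixed-timestep loop with a speed multiplier. The class and function names are made up for illustration and are not the game's actual code; the only things taken from the text above are the 60 FPS render rate and the 1/4x to 16x speed range.

```haxe
class FixedStepDemo {
    // Returns how many update() calls to run for one rendered frame.
    // `accumulator` holds the fractional steps carried over from earlier
    // frames, which is what makes speeds below 1x work.
    public static function stepsForFrame(speed:Float, accumulator:Float):{steps:Int, accumulator:Float} {
        var total = accumulator + speed;
        var steps = Math.floor(total);
        return {steps: steps, accumulator: total - steps};
    }

    // Counts update() calls over one second of real time (60 rendered frames).
    public static function updatesPerSecond(speed:Float):Int {
        var acc = 0.0;
        var total = 0;
        for (frame in 0...60) {
            var r = stepsForFrame(speed, acc);
            total += r.steps;
            acc = r.accumulator;
        }
        return total;
    }

    static function main() {
        trace(updatesPerSecond(1));    // 60
        trace(updatesPerSecond(16));   // 960
        trace(updatesPerSecond(0.25)); // 15
    }
}
```

Every one of those 960 updates has to fit inside the same second of real time, which is why anything in the update path gets 16 times more expensive at top speed.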

And then there's this level:

That's the final bonus battle, "Endless 2", AKA the bane of my existence. This screenshot is from the New Game+ mode, where enemies are not only a lot stronger, but also gain powers like regenerating health. A favorite player strategy here is to spec dragons with maximum roar (an AOE attack that stuns enemies), and follow it up with a row of knights specced for maximum knockback to push anything that gets past the dragons back into range. The cumulative effect is that you can keep an enormous group of monsters frozen in place indefinitely, well past the point you could survive if you had to actually kill them. Since you only have to reach waves to earn rewards and achievements, not kill them, this is a perfectly valid and frankly brilliant strategy, exactly the kind of thing I want to encourage.

Unfortunately, it's also a pathological case for performance, especially when players try to run it at 16x or 8x speed. Sure, only the hardest of the hardcore are going to try for the Endless 2 New Game+ Wave 100 achievement, but they also tend to be the ones who talk loudest about the game, so I'd like for them to be happy.

"I mean, it's just a 2D game with a bunch of sprites, what's so hard about that?"

Indeed. Let's find out.

# 2. Collision decision

Look at this screenshot:

See that donut shape around the ranger? That's her range -- note also the dead zone where she can't target things. Every class has its own range shape, and each defender can have a differently sized range, depending on boost level and individual stats. And every defender can in theory target any enemy on the board if it's in range, and vice versa, for certain enemy types. You can have up to 36 defenders on the map at any time (not including Azra, the main character), and there's no upper limit to the number of enemies. Every defender and enemy maintains a list of eligible targets based on range check calls every update step (minus sensible culling for those who can't possibly attack right now, etc).

Nowadays GPUs are really fast -- unless you're pushing things to the screaming bleeding edge, they can handle just about as many polygons as you want to throw at them. But CPUs, even fast ones, are extremely easy to bottleneck on simple subroutines, particularly ones whose cost grows much faster than the number of objects involved (checking every defender against every enemy grows with the product of the two counts). This is why you can have a 2D game that is slower than a much prettier 3D one -- not because the programmer sucks (well, maybe that too, at least in my case), but principally because logic can sometimes be more expensive than drawing! It's not really a factor of "how many things you've got on screen", so much as it is what those things are doing.

So let's dive in and speed up collision detection. For perspective, before optimization, collision detection took up ~50% of the main battle loop's CPU time. Afterwards it was less than 5%.

The basic fix for slow collision detection is Space Partitioning -- and we'd already been using a decent QuadTree since day one. These basically divide up space efficiently so you can skip a bunch of unnecessary collision checks.

Every frame, you update the entire QuadTree to track where everyone is, and whenever an enemy or defender wants to target anything, it asks the QuadTree for a list of who's nearby. But the profiler said both of these operations were far slower than they had to be.

What was wrong?

A lot of things, as it turned out.

## Stringly Typed

Since I store both enemies and defenders in the same quadtree, you need to specify what you're asking for, and I was doing it like this:

var things:Array<XY> = _qtree.queryRange(zone.bounds, "e"); //"e" is for "enemy"

This is called Stringly Typed code, and among other reasons, it's bad because string comparison is generally slower than integer comparison.

I rigged up some quick integer constants and changed it to this instead:

var things:Array<XY> = _qtree.queryRange(zone.bounds, QuadTree.ENEMY);

(Yeah, I probably should have used an Enum Abstract for maximum type safety but I was in a hurry and it got the job done.)

This one change made a big difference because this function gets called all the time, recursively, every time anybody needs a new list of targets.
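For reference, here's roughly what the Enum Abstract version could look like. This is a sketch under assumptions: TargetKind and countKind are hypothetical names, not part of the game's QuadTree, and the syntax shown is Haxe 4's (Haxe 3 spells it `@:enum abstract`).

```haxe
// At runtime these are plain Ints, so comparisons stay cheap, but the
// compiler rejects any value that isn't one of the declared kinds.
enum abstract TargetKind(Int) {
    var Enemy = 0;
    var Defender = 1;
}

class TargetKindDemo {
    // Hypothetical stand-in for a query filter: counts the entries matching `kind`.
    public static function countKind(kinds:Array<TargetKind>, kind:TargetKind):Int {
        var n = 0;
        for (k in kinds) {
            if (k == kind) n++;
        }
        return n;
    }

    static function main() {
        var kinds:Array<TargetKind> = [TargetKind.Enemy, TargetKind.Enemy, TargetKind.Defender];
        trace(countKind(kinds, TargetKind.Enemy)); // 2
    }
}
```

The upside over loose integer constants is that passing an arbitrary number where a TargetKind is expected becomes a compile error instead of a silent bug.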

## Array vs Vector

See this?
var things:Array<XY>

Haxe Arrays are quite similar to ActionScript and JS arrays in that they're a resizable collection of objects, but in Haxe they're strongly typed.

There is, however, another data structure that is more performant on static targets like cpp, which is haxe.ds.Vector. Haxe Vectors are basically the same as Arrays, except that you pick a fixed size when you create them.

Since my QuadTrees had a fixed capacity already, I swapped Arrays for Vectors for a noticeable performance increase.
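Here's a small sketch of the difference in usage, assuming a made-up capacity of 4. Only haxe.ds.Vector itself is real; the demo class is just for illustration.

```haxe
import haxe.ds.Vector;

class VectorDemo {
    // Builds a fixed-size vector of ids. The size is chosen once, up front.
    public static function fill(capacity:Int):Vector<Int> {
        var ids = new Vector<Int>(capacity);
        for (i in 0...ids.length) {
            ids[i] = i * 10;
        }
        return ids;
    }

    static function main() {
        var ids = fill(4);
        trace(ids.length); // 4
        trace(ids[3]);     // 30

        // An Array, by contrast, can grow after creation.
        var growable:Array<Int> = [];
        growable.push(1);
        trace(growable.length); // 1
    }
}
```

The trade-off is that a Vector has no push(), so you have to track how many slots are in use yourself.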

## Ask only for what you need

Previously my queryRange function was returning a list of objects, XY instances. These held the x/y location of the referenced game object and its unique integer identifier (a lookup index in a master array). The requesting game object would take those XY's, extract the integer id to get its target, and then forget the rest.

So why was I passing all these XY object references around for each QuadTree node, recursively, up to 960 times per frame? All I needed to return was a list of integer ids.

PROTIP: Integers are way faster to pass around than basically anything else!

Compared to the other fixes this one was rather unsophisticated, but the performance gains were still noticeable because it was in such a hot inner loop.

## Tail-call optimization

There's this fancy thing called Tail-call optimization which is kind of hard to explain, so I'll just show you.

Before:

nw.queryRange(Range, -1, result);
ne.queryRange(Range, -1, result);
sw.queryRange(Range, -1, result);
se.queryRange(Range, -1, result);
return result;  

After:

return se.queryRange(Range, filter,
sw.queryRange(Range, filter,
ne.queryRange(Range, filter,
nw.queryRange(Range, filter, result))));

These return the same logical results, but according to the profiler, the second one is faster, at least on the cpp target. Both of these are doing the exact same logic - make some changes to the "result" data structure and pass that into the next function until we return. When you do this recursively, you're able to avoid the compiler generating temporary references as it can just return the result from the previous function immediately rather than having to hang on to it for an extra step. Or something like that. I'm not 100% clear on how it works, read the link above.

(To my knowledge, the current version of the Haxe compiler does not have a tail-call optimization feature, so this was probably the C++ compiler at work -- so you shouldn't be surprised if this trick doesn't work on non-cpp targets when using Haxe.)

## Object Pooling

If I want accurate results, I have to tear down my QuadTree and build it back up again for every update call. Creating new QuadTree instances was fairly tame, but the vast amount of new AABB and XY objects those QuadTrees depended on was causing some real memory thrashing. Since these are such simple objects, it makes sense to allocate a bunch ahead of time and just keep reusing them over and over. This is called Object Pooling.

Previously I was doing stuff like this:

nw = new QuadTree( new AABB( cx - hs2x, cy - hs2y, hs2x, hs2y) );
ne = new QuadTree( new AABB( cx + hs2x, cy - hs2y, hs2x, hs2y) );
sw = new QuadTree( new AABB( cx - hs2x, cy + hs2y, hs2x, hs2y) );
se = new QuadTree( new AABB( cx + hs2x, cy + hs2y, hs2x, hs2y) ); 

I replaced it with this:

nw = new QuadTree( AABB.get( cx - hs2x, cy - hs2y, hs2x, hs2y) );
ne = new QuadTree( AABB.get( cx + hs2x, cy - hs2y, hs2x, hs2y) );
sw = new QuadTree( AABB.get( cx - hs2x, cy + hs2y, hs2x, hs2y) );
se = new QuadTree( AABB.get( cx + hs2x, cy + hs2y, hs2x, hs2y) );  

We use the open source framework HaxeFlixel, so we implemented this using HaxeFlixel's FlxPool class. For narrow-case optimization like this I often found myself replacing some core Flixel elements like collision detection with my own implementation (as I did with QuadTrees), but FlxPool is better than anything I could have written myself and did exactly what I needed.
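If you've never seen a pool before, the core idea fits in a few lines. This is a hand-rolled sketch, not FlxPool's implementation or API; PooledBox and its fields are hypothetical stand-ins for something like AABB.

```haxe
class PooledBox {
    public var cx:Float = 0;
    public var cy:Float = 0;
    public var halfW:Float = 0;
    public var halfH:Float = 0;

    // Instances waiting to be reused.
    static var pool:Array<PooledBox> = [];

    // How many times we really allocated (tracked for demonstration only).
    public static var allocations(default, null):Int = 0;

    function new() {}

    // Hands out a recycled instance if one is available, otherwise allocates.
    public static function get(cx:Float, cy:Float, halfW:Float, halfH:Float):PooledBox {
        var box = pool.pop(); // null when the pool is empty
        if (box == null) {
            box = new PooledBox();
            allocations++;
        }
        box.cx = cx;
        box.cy = cy;
        box.halfW = halfW;
        box.halfH = halfH;
        return box;
    }

    // Returns this instance to the pool. The caller must not use it afterwards.
    public function put():Void {
        pool.push(this);
    }
}

class PoolDemo {
    static function main() {
        for (i in 0...1000) {
            var box = PooledBox.get(i, i, 8, 8);
            box.put();
        }
        trace(PooledBox.allocations); // 1
    }
}
```

The catch with any pool is discipline: forget to put() and you're back to allocating, put() twice and two owners end up sharing one object.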

## Specialize as necessary

The XY object is a simple class that has an x, y, and int_id property. Since this was used in a particularly hot inner loop, I could save many allocations and operations by moving all this data into a special data structure that provides the same functionality as Vector<XY>. I called this new class XYVector and you can see the result here. It's a very narrow use case and not flexible at all, but it gave us some perf gains.
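To give a feel for the approach, here's a sketch of one way to flatten a list of XY-style records into parallel fixed-size vectors. This is my illustration of the general technique, not the actual XYVector class; XYList and its methods are hypothetical.

```haxe
import haxe.ds.Vector;

// Stores x, y and id in three parallel vectors instead of one object per
// entry, so adding an entry never allocates.
class XYList {
    var xs:Vector<Float>;
    var ys:Vector<Float>;
    var ids:Vector<Int>;

    // Number of slots currently in use.
    public var length(default, null):Int = 0;

    public function new(capacity:Int) {
        xs = new Vector<Float>(capacity);
        ys = new Vector<Float>(capacity);
        ids = new Vector<Int>(capacity);
    }

    // Appends an entry. Returns false when the list is already full.
    public function push(x:Float, y:Float, id:Int):Bool {
        if (length >= ids.length) return false;
        xs[length] = x;
        ys[length] = y;
        ids[length] = id;
        length++;
        return true;
    }

    public inline function getX(i:Int):Float {
        return xs[i];
    }

    public inline function getY(i:Int):Float {
        return ys[i];
    }

    public inline function getId(i:Int):Int {
        return ids[i];
    }

    // Forgets all entries without releasing the underlying storage.
    public function clear():Void {
        length = 0;
    }
}

class XYListDemo {
    static function main() {
        var list = new XYList(2);
        list.push(1.5, 2.5, 7);
        list.push(3.0, 4.0, 9);
        trace(list.push(0, 0, 11)); // false, the list is full
        trace(list.getId(1));       // 9
    }
}
```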

## Inline functions

So, after we do the broad-phase collision detection, we have to do a lot of tests to check which things are actually colliding. Whenever possible I would try to compare a point vs a shape rather than a shape vs. a shape, but sometimes you have to do the latter. In any case, these all require their own special tests:

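private static function _collide_circleCircle(a:Zone, b:Zone):Bool
{
    var dx:Float = a.centerX - b.centerX;
    var dy:Float = a.centerY - b.centerY;
    var d2:Float = (dx * dx) + (dy * dy);
    // r2 was not defined in the original snippet; this assumes Zone exposes a radius
    var r:Float = a.radius + b.radius;
    var r2:Float = r * r;
    return d2 < r2;
}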

Which can be improved with the single keyword inline:

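private static inline function _collide_circleCircle(a:Zone, b:Zone):Bool
{
    var dx:Float = a.centerX - b.centerX;
    var dy:Float = a.centerY - b.centerY;
    var d2:Float = (dx * dx) + (dy * dy);
    // r2 was not defined in the original snippet; this assumes Zone exposes a radius
    var r:Float = a.radius + b.radius;
    var r2:Float = r * r;
    return d2 < r2;
}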

When you 'inline' a function what you're doing is telling the compiler to basically copy-and-paste that code and fill in the variables wherever it's used, rather than make an external call to a separate function, which incurs overhead. Inlining isn't appropriate everywhere (it will bloat your code size, for one), but it's perfect for situations like this with small functions that get called over and over.

## Collision wrap-up

The real lesson here is that real-world optimization is not all of one type. These fixes are a mix of advanced techniques, cheap hacks, common-sense best practices, and cleaning up stupid mistakes. All of them add up to better performance.

But still -- measure twice, cut once!

Spending two hours meticulously optimizing a function that gets called once every six frames and takes 0.001 ms is not worth the effort, no matter how ugly or stupid the code is.

# 3. Sort yourself out

This was actually one of the very last performance tweaks I made, but it was such a big win it gets its own heading. It was also the simplest, and gave tremendous bang for its buck. There was one subroutine flagged by the profiler that I'd never been able to improve -- the main draw() loop was simply taking too long. And the culprit was the function that sorted all on screen elements prior to drawing them -- that's right, sorting all the sprites took way longer than actually drawing them!

If you look back at the screenshots from the game, you can see that all enemies and defenders are sorted first by y, then by x, so that elements stack back to front, left to right, as you move from the upper left to the lower right of the screen.

One way to cheat this would be to simply skip draw sorting every other frame. This is a useful trick for some expensive functions, but it immediately resulted in extremely noticeable visual bugs, so that was a no-go.

The solution finally came from Jens Fischer, one of HaxeFlixel's maintainers, who casually asked me, "Did you also make sure to use a sorting algorithm that's fast on almost-sorted arrays?"

No! No I wasn't. I was using the default Haxe standard library array sort, which I think is a Merge Sort implementation -- a good choice for general use. But I had a very specific use case here. Only a few sprites are going to change sorting position each frame, even when there's a lot of them. So I swapped out our old sort call for an Insertion Sort, and bam -- instant speed up.
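For anyone who hasn't written one since school, here's a minimal generic insertion sort, plus a y-then-x comparator like the one described above. It's a sketch, not the game's code; DrawItem and the function names are hypothetical.

```haxe
typedef DrawItem = {x:Float, y:Float};

class InsertionSortDemo {
    // Sorts in place. On an almost-sorted array the inner loop barely runs,
    // so the cost is close to one pass over the data.
    public static function insertionSort<T>(items:Array<T>, compare:T->T->Int):Void {
        for (i in 1...items.length) {
            var current = items[i];
            var j = i - 1;
            // Shift larger elements one slot right until current fits.
            while (j >= 0 && compare(items[j], current) > 0) {
                items[j + 1] = items[j];
                j--;
            }
            items[j + 1] = current;
        }
    }

    // Back to front (by y), then left to right (by x).
    public static function byYThenX(a:DrawItem, b:DrawItem):Int {
        if (a.y != b.y) return a.y < b.y ? -1 : 1;
        if (a.x != b.x) return a.x < b.x ? -1 : 1;
        return 0;
    }

    static function main() {
        var items:Array<DrawItem> = [{x: 5, y: 1}, {x: 2, y: 1}, {x: 9, y: 0}];
        insertionSort(items, byYThenX);
        trace(items[0].x); // 9
        trace(items[1].x); // 2
        trace(items[2].x); // 5
    }
}
```

The flip side is that insertion sort degrades badly on shuffled input, so it only makes sense when the list really is nearly in order from one frame to the next.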

# 4. Other technical issues

Collision detection and sorting were the big wins in our update() and draw() logic, but there were also tons of miscellaneous performance gotchas lurking in hot inner loops.

## Std.is() and cast

In various hot inner loops, I had code like this:

if(Std.is(something,Type))
{
var typed:Type = cast(something,Type);
}

In Haxe, Std.is() tells you if an object belongs to a certain Type or Class, and cast attempts to cast it to the given type at runtime.

Now, there are both safe and unsafe versions of cast -- safe casts incur a performance penalty, but unsafe casts do not.

Safe: cast(something, Type);
Unsafe: var typed:Type = cast something;

When an unsafe cast fails you wind up with a null value, whereas when a safe cast fails it throws an exception. But if you don't bother to catch the exception, what was the point of doing a safe cast? Without the catch, the operation still fails, and now it's slower.

It's also pointless to precede a safe cast with an Std.is() check. The only reason to use a safe cast is for the guaranteed exception, but if you check the type before casting, you've already guaranteed the cast can't fail!

I could speed this up a bit by using an unsafe cast after my Std.is() check. But why not rewrite things so I don't have to check the class type at all?

Let's say I have a CreatureSprite that could be an instance of either the DefenderSprite or EnemySprite subclass. Rather than calling Std.is(this,DefenderSprite) why not have an integer field in CreatureSprite with values like CreatureType.DEFENDER or CreatureType.ENEMY that is even faster to check?
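A sketch of what that could look like is below. The class names follow the paragraph above, but the field layout, the bounty field, and the helper function are assumptions of mine for illustration, not the game's real classes.

```haxe
class CreatureType {
    public static inline var DEFENDER:Int = 0;
    public static inline var ENEMY:Int = 1;
}

class CreatureSprite {
    // Set once by the subclass constructor, read everywhere else.
    public var creatureType(default, null):Int;

    public function new(creatureType:Int) {
        this.creatureType = creatureType;
    }
}

class DefenderSprite extends CreatureSprite {
    public function new() {
        super(CreatureType.DEFENDER);
    }
}

class EnemySprite extends CreatureSprite {
    public var bounty:Int = 5; // hypothetical enemy-only field

    public function new() {
        super(CreatureType.ENEMY);
    }
}

class TypeTagDemo {
    public static function totalBounty(creatures:Array<CreatureSprite>):Int {
        var total = 0;
        for (c in creatures) {
            // A plain integer comparison replaces the runtime type check.
            if (c.creatureType == CreatureType.ENEMY) {
                // Unsafe cast: no runtime check, so the tag must be trustworthy.
                var enemy:EnemySprite = cast c;
                total += enemy.bounty;
            }
        }
        return total;
    }

    static function main() {
        var creatures:Array<CreatureSprite> = [new EnemySprite(), new DefenderSprite(), new EnemySprite()];
        trace(totalBounty(creatures)); // 10
    }
}
```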

Again, this is only worth fixing in areas where the profiler has concretely measured a significant slowdown.

Incidentally, you can read more about safe and unsafe casting in the Haxe Manual.

## Serializing/deserializing the universe

Finding stuff like this was embarrassing:

function copy():SomeClass
{
return SomeClass.fromXML(this.toXML());
}

Yeah. So in order to copy an object we are going to serialize it into XML, then we're going to parse all that XML, then immediately throw the XML away and return the new object. This is just about the slowest way I can think to copy an object, and as a bonus, it thrashes memory. I originally wrote the XML calls for saving and loading to disk, and I suppose I had just been too lazy to write proper copy routines.

It probably would have been okay if this was a rarely used function, but I found these calls cropping up in bad places in the middle of gameplay. So I sat down and just did the work of writing and testing a proper deep copy function.
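A proper copy routine is usually boring, which is the point. Here's a sketch with a made-up class (UnitStats and its fields are hypothetical): copy the value fields directly and duplicate any containers so the copy doesn't share state with the original.

```haxe
class UnitStats {
    public var hp:Int;
    public var tags:Array<String>;

    public function new(hp:Int, tags:Array<String>) {
        this.hp = hp;
        this.tags = tags;
    }

    // Field-by-field copy: no serialization, no parsing, one small allocation
    // per container. Strings are immutable, so copying the array is enough.
    public function copy():UnitStats {
        return new UnitStats(hp, tags.copy());
    }
}

class CopyDemo {
    static function main() {
        var a = new UnitStats(10, ["fire"]);
        var b = a.copy();
        b.tags.push("ice");
        b.hp = 3;
        trace(a.tags.length); // 1, the original is untouched
        trace(a.hp);          // 10
    }
}
```

If a field holds mutable objects rather than strings or numbers, those need their own copy() calls too, which is where the testing effort goes.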

## Say No to Null

Checking if something is equal to null is pretty common, but on Haxe's cpp target, a nullable object incurs overhead that you don't get if the compiler can assume the object will never be null. This is especially true for basic types like Int -- Haxe implements nullability for these on static targets by "boxing" them, and this occurs not only for variables you explicitly declare nullable (var myVar:Null<Int>), but also for things like optional parameters (?myParam:Int). Null checks themselves also incur some overhead.

I was able to fix several of these by just looking at my code and thinking of alternatives -- could I do a simpler test that would always be true if some object was null? Could I catch the null much earlier in the function call chain and send a simple integer or boolean flag down to the cascading child calls? Could I structure things so that this value would be guaranteed never to be null? And so on. You can't eliminate null checks and nullable values entirely, but keeping them out of hot functions helped a lot.
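As one illustration of restructuring around null, here's a sketch that swaps an optional parameter for a sentinel default. The function names and the -1 sentinel are my own choices for the example; whether this pays off in a given spot is something only the profiler can tell you.

```haxe
class NullDemo {
    // `?limit:Int` is typed as Null<Int>, so on static targets the value is
    // boxed and has to be null-checked before use.
    public static function sumFirstBoxed(values:Array<Int>, ?limit:Int):Int {
        var n:Int = (limit == null) ? values.length : limit;
        var total = 0;
        for (i in 0...n) total += values[i];
        return total;
    }

    // A default value keeps the parameter a plain Int. Here -1 is an
    // arbitrary sentinel meaning "no limit".
    public static function sumFirstPlain(values:Array<Int>, limit:Int = -1):Int {
        var n = (limit < 0) ? values.length : limit;
        var total = 0;
        for (i in 0...n) total += values[i];
        return total;
    }

    static function main() {
        var values = [1, 2, 3, 4];
        trace(sumFirstBoxed(values));    // 10
        trace(sumFirstPlain(values));    // 10
        trace(sumFirstPlain(values, 2)); // 3
    }
}
```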

On the PSVita in particular, we had some real issues with load times for certain scenes. Profiling revealed the culprits as mostly being down to text rasterization, unnecessary software rendering, wasteful button rendering, and a few other things.

## Text

HaxeFlixel is based on OpenFL, which has some really excellent and robust TextFields. But the way I was using FlxText objects was not ideal -- FlxText objects have an internal OpenFL text field, which it then rasterizes. However, it turns out I didn't need most of the fancy text features, and because of some of the dumb ways I'd set up my UI system, the text fields had to be rendered before other objects could be placed. This led to a small but noticeable hitch whenever e.g. a popup screen was loaded.

There were three fixes here -- the first was to replace as much text as possible with Bitmap Fonts. Flixel has built-in support for various bitmap font formats, including AngelCode's BMFont which allows you to easily handle unicode, style, and kerning, but the API for bitmap text is a bit different than for regular text, so I had to write a small wrapper class to smooth over the transition. (Fittingly, I called it FlxUITextHack).

This improved things a bit -- bitmap fonts render really quickly -- but incurred a bit of complexity overhead as I had to specially prepare all the specific character sets I wanted and throw in some logic to switch between them depending on the locale, rather than just set up a proper text field that can handle just about anything.

The second fix was to create a new UI object which was a simple placeholder for text, but had all the same public properties as text. I called this a "text region" and created a new class in my UI library for it, so that my UI system could use these text regions the same as it would actual text fields, but not have to render anything before it could calculate size and position for everything else. Then, after my scene was set up, I'd run a routine to replace the text regions with actual text fields (or bitmap font text fields).

The third fix was perceptual. Even if there's only half a second of wait time, it still feels sluggish if there's an input and no response. So I tried to find every scene where there was any input lag at all until the next transition and added either a half-transparent black overlay with the word "Loading...", or just the overlay with no text. This simple fix drastically improved the perception of responsiveness, because something happens as soon as you touch the input, even if it takes the same time for the menu to appear.

## Software Rendering

Most of the menus used to rely on a combination of software scaling and 9-slice compositing. This was because the PC version has a resolution-independent UI that could work in both 4:3 and 16:9 aspect ratios, and scaled accordingly. But on the PSVita we already know the resolution, we don't need all these fancy high resolution assets and algorithms to pop out a scaled version at runtime. We can just pre-render the assets at exactly the resolution they're going to be and plop them straight on the screen.

First, I put some conditionals in my UI markup for the Vita that would swap to using a parallel set of assets. Next, I needed to create these ready-to-use single-resolution assets. The HaxeFlixel debugger came in super handy here -- I added a custom script to it that would simply dump the bitmap cache to disk. Then, I created a special build configuration on Windows that mimicked the resolution of the Vita, visited all the menus in the game, opened the debugger, and ran my command to export the scaled versions of those assets as finished PNGs. Then I simply applied a naming convention to them and used those as my Vita assets.

## Button rendering

My UI system had a real problem with buttons -- buttons would render a default set of assets on construction, only to be resized (and re-rendered) by the UI loading code an instant later, and possibly even a third time before the entire UI was done loading. I fixed this by adding some parameters to defer button rendering until the final step.

## Unnecessary text scanning

The journal in particular had some bad load times in it. Originally I thought this was down to the text fields I was using, but nope. Text in the journal can include links to other pages, which is indicated by some special characters embedded in the raw text itself. These characters are then stripped out and used to calculate the position of the link.

Turns out I was scanning every single text field in order to find and replace these characters with properly formatted links, without checking to see if the text field even had a special character in it first! Even worse, by design, links only appear on the table of contents, but I was checking every text field on every page.

I was able to short circuit these checks considerably with an if statement that just said, "does this text field even use links", which was usually "no." Finally, the page that took the longest time to load was the index page. Since it never changes for the lifetime of the journal menu, why not cache it?
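The early-out itself is tiny. In this sketch the "@" marker and the formatLinks function are hypothetical placeholders, since the real special characters and link formatting aren't shown here; the counter only exists to show how often the expensive path runs.

```haxe
class LinkScanDemo {
    // Hypothetical marker character standing in for the real link syntax.
    static inline var LINK_MARKER:String = "@";

    // Counts how often we actually paid for the expensive path.
    public static var expensiveScans(default, null):Int = 0;

    public static function formatLinks(text:String):String {
        // Cheap check first: most text fields contain no marker at all.
        if (text.indexOf(LINK_MARKER) == -1) return text;

        expensiveScans++;
        // Placeholder for the real work of stripping markers and placing links.
        return StringTools.replace(text, LINK_MARKER, "");
    }

    static function main() {
        trace(formatLinks("Just a plain paragraph.")); // Just a plain paragraph.
        trace(formatLinks("See @Chapter 2@"));         // See Chapter 2
        trace(expensiveScans);                         // 1
    }
}
```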

# 6. Memory Profiling

Performance isn't just about CPU -- memory can be an issue too, especially on limited targets like the Vita. Even when you've killed your last memory leak, you still might be dealing with memory sawtoothing in a Garbage-collected environment.

What's "memory sawtoothing?" Well, the way Garbage-collection works is that data and objects you're not using will build up over time until they're periodically cleaned up. But you don't have precise control over when that happens, so your memory graph looks like a sawtooth:

## Take out the trash

Because cleanup isn't instant, the total RAM you're using is usually more than what you strictly need. But if you go over your total system RAM one of two things will happen -- on a PC, it will probably just use the Page File, which means temporarily converting some hard drive space to virtual RAM. The alternative on a constrained memory environment like a console is a hard crash -- even if it's just a few measly bytes over the limit. And even if you weren't using those bytes and they were just about to be garbage collected!

One of the nice things about Haxe is that it's totally open source, so you aren't locked into a black box you can't fix like you are with Unity. And the hxcpp backend exposes a lot of GC controls directly from the API!

We used this to immediately clean up memory after a big level to ensure we stayed under the limit:
cpp.vm.Gc.run(false); //run the Garbage collector (true/false specifies major or minor)

You shouldn't use these willy-nilly if you don't know what you're doing, but it's nice to have tools like this when you really need them.

# 7. Design around the problem

All of these performance gains were more than enough to get things to a good place on the PC, but we were also trying to ship a PSVita version, and we have long-term plans for the Nintendo Switch, so we need to squeeze out every ounce of performance we can.

But, it's easy to get tunnel vision on technical hacks and forget that simple design can make as big a difference as heroic programming.

## Throttle effects on high speed

At 16x speed many of the effects happen so fast you can't even see them. We had some built in mitigation already -- Azra's lightning bolt becomes simpler the faster the game is running, and the number of particles for AOE attacks is lower. We expanded on that, automatically hiding bouncy battle damage numbers on the higher speeds, and various other similar tweaks.

We also realized that after a certain point, 16x speed could actually be slower than 8x speed if there were too many things on screen, so we put in natural throttle points when the enemy count got too high, which would automatically kick the speed level to 8x or 4x. In practice you will likely only ever see this happen on Endless Battle 2. This would keep performance and rendering steady by not overtaxing the CPU.
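In code, a throttle like this can be as simple as a cap on the requested speed. The enemy-count thresholds below are invented for the example; the real values came from tuning and aren't given here.

```haxe
class SpeedThrottle {
    // Returns the speed the game should actually run at.
    // The thresholds (500 and 1000 enemies) are made-up illustration values.
    public static function effectiveSpeed(requested:Float, enemyCount:Int):Float {
        var cap:Float = 16;
        if (enemyCount > 1000) {
            cap = 4;
        } else if (enemyCount > 500) {
            cap = 8;
        }
        return Math.min(requested, cap);
    }

    static function main() {
        trace(effectiveSpeed(16, 100));  // 16
        trace(effectiveSpeed(16, 600));  // 8
        trace(effectiveSpeed(16, 2000)); // 4
        trace(effectiveSpeed(2, 2000));  // 2
    }
}
```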

There's also a little bit of platform-specific throttling. On the Vita we skip the lightning effect when Azra summons or boosts a character, and various other subtle things like that.

## Bury the bodies

So about that giant pile of enemies in the lower right hand corner of Endless Battle 2 -- there's literally hundreds or even thousands of enemies there, all drawn one on top of the other. So... why not just skip drawing the ones at the bottom that you can't even see?

This is a clever design trick that also requires clever programming, because you need the right algorithm to figure out which ones should be hidden.

Most games like this draw using the Painter's Algorithm -- earlier things in the draw list are overlapped by everything after.

By reversing the order of the painter's algorithm, you can generate a "bury map" and figure out what should be hidden. I created a fake "canvas" with 8 levels of "darkness" (just a two-dimensional byte array), at a much lower resolution than the actual battlefield. Then, starting from the end of the draw list I would take each object's bounding box and "draw" it to this canvas, increasing the "darkness" at that location by 1 for every "pixel" the low-res bounding box obscured. Simultaneously, I would read out the average "darkness" of the area I was trying to draw to. What this effectively does is predict how much overdraw each object will experience during the real draw call.

If the predicted overdraw is high enough, I flag that enemy as "buried," with two thresholds -- complete burying, which is totally invisible, and partial burying, which means draw it, but don't draw its health bar.

(Here's the function for the overdraw test, by the way.)

The trick to get this to work is to properly tune the resolution of the bury map. If it's too high, you're doing a whole extra set of simplified draw calls, but if it's too low you'll bury things too aggressively and get visual bugs. When it's just right, the effect is hardly noticeable, but the performance gains are real -- there's no faster way to draw something than to not draw it!
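Here's a stripped-down sketch of the bury map idea as described above: a coarse grid, walked from the end of the draw list, where each box first reads the average darkness under it and then darkens those cells. It is not the game's actual overdraw function; the class names, cell size, and threshold are assumptions for the example.

```haxe
import haxe.ds.Vector;

class BuryMap {
    public static inline var MAX_DARKNESS:Int = 8;

    var cols:Int;
    var rows:Int;
    var cellSize:Float;
    var cells:Vector<Int>;

    public function new(width:Float, height:Float, cellSize:Float) {
        this.cellSize = cellSize;
        cols = Math.ceil(width / cellSize);
        rows = Math.ceil(height / cellSize);
        cells = new Vector<Int>(cols * rows);
        clear();
    }

    public function clear():Void {
        for (i in 0...cells.length) cells[i] = 0;
    }

    static inline function clampInt(v:Int, lo:Int, hi:Int):Int {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    // Returns the average darkness already covering this box (its predicted
    // overdraw), then darkens the covered cells for whatever is tested next.
    public function testAndDraw(x:Float, y:Float, w:Float, h:Float):Float {
        var c0 = clampInt(Math.floor(x / cellSize), 0, cols - 1);
        var c1 = clampInt(Math.floor((x + w) / cellSize), 0, cols - 1);
        var r0 = clampInt(Math.floor(y / cellSize), 0, rows - 1);
        var r1 = clampInt(Math.floor((y + h) / cellSize), 0, rows - 1);
        var total = 0;
        var count = 0;
        for (r in r0...(r1 + 1)) {
            for (c in c0...(c1 + 1)) {
                var i = r * cols + c;
                total += cells[i];
                count++;
                if (cells[i] < MAX_DARKNESS) cells[i]++;
            }
        }
        return total / count;
    }
}

class BuryDemo {
    // Stacks `count` identical boxes on one spot and reports how many would
    // be flagged as buried at the given threshold.
    public static function countBuried(count:Int, threshold:Float):Int {
        var map = new BuryMap(64, 64, 16);
        var buried = 0;
        // Walk the draw list from the end, i.e. topmost sprite first.
        var i = count - 1;
        while (i >= 0) {
            if (map.testAndDraw(0, 0, 10, 10) >= threshold) buried++;
            i--;
        }
        return buried;
    }

    static function main() {
        trace(countBuried(5, 3)); // 2
    }
}
```

In the demo the topmost three boxes see darkness 0, 1, and 2 and get drawn, while the two underneath see 3 and 4 and get skipped.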

In the middle of battles, I was noticing a frequent stutter I was sure was a "stop the world" Garbage Collection pause. Profiling showed this wasn't the case, however. Further testing revealed it only happened when an enemy wave started spawning, and later I found it only happened when it was a wave for an enemy type that hadn't spawned before. Clearly some enemy setup code was causing this, and sure enough, profiling found a hot function in graphics setup. I started working on a complicated multithreaded loading setup, then I realized I could just shove all enemy graphic loading routines into the battle preload itself. Individually these were very small loads, adding less than a second to the overall battle load time on our slowest platform, but they prevented very noticeable stutters during gameplay.

## Save it for later

If you're working in a constrained memory environment, an age-old industry trick is to allocate a big chunk of memory for absolutely nothing, and then forget about it until the end of the project. At the end of the project, having naturally squandered your entire memory budget, your stash will save you.

We found ourselves in this situation of needing just a few dozen more bytes to save our PSVita build, but dangit, we forgot to do this trick and now we were stuck! The only option was weeks of desperate and painful code surgery!

But wait! One of my (failed) optimizations earlier on was to preload as many assets as possible, and keep them in memory forever, because I'd falsely assumed that runtime asset fetching was causing bad load times. Turns out it wasn't, so nearly all those wasteful preload-and-keep-forever calls could be removed entirely, and now I had memory to spare!

## Get rid of stuff you aren't using

For the PSVita build especially, we realized that there was a lot of stuff we simply didn't need. Since the resolution is so small, the "original art mode" and the "hd art mode" for sprites are indistinguishable, so we went with the "original art mode" for all sprites. We were also able to improve upon our palette swap function with a bespoke pixel shader (we were previously relying on a software drawing function).

Another thing was the battle map itself -- on the PC and home console targets, we stack a bunch of tilemaps to create the map in layers. But since the map never changes, it's easy enough on the Vita to bake everything down into a single final image at runtime so it can be drawn in just one call.

In addition to wasteful assets, we had lots of wasteful calls. Things like Defenders and Enemies sending a regeneration signal every frame, even if they didn't have a regeneration ability. If the UI happened to be open for this creature, it would trigger a repaint every frame.

There were also half a dozen instances of little algorithms that would calculate something during a hot function only to never have the results returned anywhere, usually the result of speculative design earlier in the project. So we just cut all those.

## NaNocalypse

This one was fun. The profiler indicated that calculating angles was taking a really long time. Here's the Haxe-generated C++ code in the profiler:

This is one of those functions that takes a value like -90 and converts it to 270. Maybe sometimes you'd have a value like -724, which after a few loops resolves back to 356.

Somehow, this function was being passed the value -2147483648.

Do the math. If you add 360 to -2147483648 every loop, it will take approximately 5,965,233 iterations before it's greater than 0 and exits the loop. By the way, this loop happens every update (not every frame -- every update!) every time a projectile (or anything else) changes its angle.
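To check that number, here's a small Python sketch. It assumes the loop has the shape `while angle < 0: angle += 360`, which is a guess based on the description above, not a copy of the generated code.

```python
def wrap_by_loop(angle, step=360):
    """The slow way: add `step` until the angle is no longer negative."""
    iterations = 0
    while angle < 0:
        angle += step
        iterations += 1
    return iterations, angle


def wrap_iterations(angle, step=360):
    """Same answer as wrap_by_loop, computed directly instead of by looping."""
    if angle >= 0:
        return 0, angle
    iterations = -(angle // step)  # ceil(-angle / step) using integer math
    return iterations, angle + iterations * step


print(wrap_iterations(-90))          # (1, 270)
print(wrap_iterations(-2147483648))  # (5965233, 232)
```

So the bad value costs 5,965,233 trips around the loop just to land on 232.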

This was my fault of course, as the value I was passing in was NaN, a special float value that means "Not a number" and usually signifies an error somewhere earlier in your code. If you hard cast this to an integer without checking first, weird stuff like this can happen.

As a temporary fix, I threw in a Math.isNaN() check that would set the angle to zero whenever this (fairly rare, but inevitable) case happened while I tracked down what its ultimate cause was, and the lag immediately vanished. Turns out not doing 6 million iterations of pointless nothing can bring you some big speedups!

(The fix for this has since been merged upstream into HaxeFlixel).
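For illustration, a guard along these lines might look like the following in Python. This is a sketch, not the HaxeFlixel patch: `safe_wrap_degrees` is a made-up name, it also treats infinities as bad input (an extra assumption), and it uses modulo instead of a loop so the cost no longer depends on how far off the input is.

```python
import math


def safe_wrap_degrees(angle):
    """Wrap an angle in degrees into [0, 360); NaN and infinities become 0."""
    if not math.isfinite(angle):
        return 0.0  # same fallback as the temporary fix: treat garbage as zero
    wrapped = angle % 360.0  # Python's % already returns a non-negative result here
    if wrapped >= 360.0:
        wrapped = 0.0  # tiny negative inputs can round up to exactly 360.0
    return wrapped


print(safe_wrap_degrees(-90.0))         # 270.0
print(safe_wrap_degrees(float("nan")))  # 0.0
```

The NaN check has to happen before any cast to an integer; once the value has been cast, the information that it was ever NaN is gone.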

## Don't outsmart yourself

Both OpenFL and HaxeFlixel have built in asset caching. This means that when you load an asset, the next time you fetch that asset it will grab it from the cache rather than loading it from the disk from scratch. This behavior can be overridden, and there's some sensible times to do that.

However, I was doing some weird speculative stuff where I would load an asset and explicitly tell the system not to cache the result, because I totally knew what I was doing and I didn't want to "waste memory" on the cache. Years later, these "clever" calls were causing me to load the same asset over and over again from scratch, both slowing down the game and trashing the precious memory I was "saving" by skipping the cache.
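Here's a toy version of the problem in Python. `AssetLoader` and its `use_cache` flag are invented for this sketch and are not the OpenFL or HaxeFlixel API; the counter just makes the cost visible.

```python
class AssetLoader:
    def __init__(self):
        self._cache = {}
        self.disk_reads = 0  # counts the expensive loads

    def _read_from_disk(self, path):
        self.disk_reads += 1
        return f"<data for {path}>"  # placeholder for real file I/O and decoding

    def get(self, path, use_cache=True):
        if use_cache and path in self._cache:
            return self._cache[path]
        asset = self._read_from_disk(path)
        if use_cache:
            self._cache[path] = asset
        return asset


sensible = AssetLoader()
clever = AssetLoader()

for _ in range(100):
    sensible.get("sprites/enemy.png")                # default: cached after the first load
    clever.get("sprites/enemy.png", use_cache=False)  # "saving memory"

# sensible.disk_reads == 1, clever.disk_reads == 100
```

Each uncached load also creates a fresh copy of the asset, so the "clever" path can easily end up using more memory than the cache it was avoiding.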

# 8. Also maybe just don't make levels like Endless Battle 2

So it's great that we did all these little tricks to speed up performance. To be honest, we didn't notice most of them until we started porting our game to lower powered systems, and the problems only became absolutely intolerable on certain levels. I'm glad we got the performance gains in the end, but I think it's also good to avoid pathological level design. Endless Battle 2 just put way more stress on the system than it needed to, especially compared to all other levels in the game.

Even with all those tweaks, the PSVita version just could not handle Endless 2 as originally designed, and I didn't want to risk performance on the base model XB1 and PS4, so I bit the bullet and rebalanced the console versions of Endless 2. I lowered the number of enemies but pumped up their stats, playing through it until it felt like roughly the same difficulty. On the PSVita we also capped the waves at 100 to avoid risking an out of memory crash, but left waves uncapped on PS4 and XB1. This way the "endurance" achievement is still the same experience across all three consoles. The design of the PC version of Endless Battle 2 remains unchanged, however.

All this was a lesson for Defender's Quest II -- we're going to be extremely careful about levels with no upper bound on the number of on-screen enemies! Of course, "endless" missions are a huge draw for Tower Defense fans so I wouldn't eliminate them entirely, but what if we had levels with checkpoints, where you HAVE to eliminate everything on screen before proceeding to the next set of waves? Not only would that let us set a ceiling on the number of on-screen enemies, it would also let us support mid-level saving without having to worry about serializing the insane object soup state of a frenzied battle -- we could just save defender locations, boost levels, etc.

# 9. Closing Thoughts

Performance is a tricky subject, because players don't often understand what goes into it, nor should we expect them to. But I hope this exercise clarifies a bit about how things look under the hood and how design, technical trade-offs, and plain old stupid mistakes conspire to make games slower than they need to be.

The thing is, even in a well designed game written by a talented team, these little "rusty" bits of code are absolutely everywhere. But in practice, only a few of them will actually impact performance. Being able to sniff them out and fix them is equal parts art and science.

I'm happy that all these benefits will carry over to Defender's Quest II. If I'm being perfectly honest, if we hadn't done the PSVita port, I probably wouldn't have even tried at least half of these optimizations. So even if you never pick up the game on the PSVita, you can thank the little console that could for drastically improving Defender's Quest's performance :)
