Whenever I think of program flow I think of the rivers, locks and weirs of my native England. Code execution is just another path for the water to flow. Programs are just an assortment of gates (if statements), cogs and wheels that do something.
At the most basic level a modern processor has multiple cores. Each core is the same as an old school CPU if you think about it. In the single core days there was no such thing as true multi-core, hyper-threading or any of those fun terms. We cheated by saving a copy of the processor's registers (the thread context) and quickly loading them with a copy from another flow of execution: a thread. Today it's much more exciting because, whether you noticed it or not, there's a war going on between AMD and Intel as to who can put the most cores into their processors. 18 cores with 36 hardware threads is not uncommon now and the count just keeps on getting bigger. It's awesome. But there's a catch: you'll burn out your processor real fast if you run every thread at 4GHz! So each core slows down a bit, the most active cores get a boost of speed, but generally life is good when we have lots of cores running slower than our old school Pentium 4 clock rate.
Case in point. A good friend of mine used to work on the Xbox 360. I think it's safe to say it's common knowledge that the console had more than one core. A brilliant bit of silicon based on the PowerPC. His multi-threading code was SO efficient that it literally melted the chip. He went through a large number of development kits before he backed off from 100% efficiency, I believe. It's an awesome story because you can brown out your processor by making your code too good. Anyways...
How do you get efficiencies that can burn out your CPU? Haha, not going to tell you. Ok, yes I am. Stick a hard core loop that does lots of work on every core and every thread and listen to your fans cry out in pain. I'm sure most of you have seen the videos on the Internet of old AMD and Intel CPUs running without fans. Intel's chip just slowed way down whereas the AMD chip "let the smoke out". And as we all know modern processors run on smoke. Let the smoke out and they don't run any more. :) So the bottom line is both AMD and Intel now have special circuitry to slow the processor down when it grows too hot.
What I'm really trying to point out is the basic fact that single threaded games are going to go the way of the dinosaur eventually, because our CPUs will have dozens if not hundreds of cores but those cores can't possibly all run at full clock. That's my prediction. If we break the 1nm scale then I'll eat my words. But regardless of everything I just said, taking the shotgun approach and breaking up your program into many, many small pieces, so they can execute at the same time on different cores and their dueling hardware threads, is a good idea.
One of the extreme ways I pushed my EON engine was to design my asset database on both loose files and a ZIP-like file. While I'm developing I use loose files, and when I ship I use the prefab approach. I built my own reflection system which is 100% thread safe. So asking for a piece of data is incredibly easy. All I do is use one of two streaming functions: e_unique<> and e_stream<>. Like most engines every object derives from an Object class, and every asset from a Resource class. So think of an FBX mesh for example. It is made up of renderable and physics geometry, textures, lights, cameras, materials, etc. The simple approach is just to load each piece in turn. That can take a very long time. So instead I use a hierarchical threading approach.
If I say "auto hMesh = e_stream<Mesh>( path, lambda );" in my code it will immediately kick off a thread for the mesh class and pass it a Reader object. Nothing too complicated there. But then when we hit another large asset such as a texture or a vertex buffer I don't serialize it right there; I just kick off another stream, which spawns another thread. All of the streams have lambda callbacks so you know exactly when the asset has finished loading. Until then you can display a proxy object such as the first LOD in the texture, which is quicker to load than the entire 4096x4096 texture even if it's compressed.
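To make the idea concrete, here is a rough sketch of hierarchical streaming in plain C++. It is not the EON code: streamMesh, streamTexture, Mesh and Texture are hypothetical stand-ins, std::async replaces whatever the engine really uses to spawn threads, and the parent simply waits for its children before firing its own callback. A real engine would more likely show a proxy and return straight away.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical asset types, just enough to show the shape of the idea.
struct Texture { std::string path; };
struct Mesh {
    std::string path;
    std::vector<Texture> textures;
};

// Hypothetical stand-in for a streaming call: "loads" one texture on its own
// thread, then fires the callback on that thread.
std::future<void> streamTexture(std::string path,
                                std::function<void(Texture)> onLoaded) {
    assert(onLoaded && "a callback is required");
    return std::async(std::launch::async, [=] {
        Texture t{path};          // a real loader would read and decode here
        onLoaded(std::move(t));
    });
}

// The mesh stream runs on its own thread and kicks off one child stream per
// texture instead of loading them in turn.
std::future<void> streamMesh(std::string path,
                             std::vector<std::string> texturePaths,
                             std::function<void(Mesh)> onLoaded) {
    assert(onLoaded && "a callback is required");
    return std::async(std::launch::async, [=] {
        Mesh mesh{path, {}};
        std::mutex guard;         // child callbacks run on different threads
        std::vector<std::future<void>> children;
        for (const auto& tp : texturePaths) {
            children.push_back(streamTexture(tp, [&mesh, &guard](Texture t) {
                std::lock_guard<std::mutex> lock(guard);
                mesh.textures.push_back(std::move(t));
            }));
        }
        for (auto& child : children) child.get();   // wait for every child
        onLoaded(std::move(mesh));                  // whole asset is ready
    });
}
```

Calling streamMesh("ship.fbx", {"hull.png", "glass.png"}, callback) returns immediately with a future; the callback fires once the mesh and both textures are in. The textures arrive in whatever order the threads finish, so don't rely on their order.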
My load times are unbelievably fast!
I also wrote a neat little terrain generator based on open simplex noise. Now this is expensive to do because you basically iterate over a block of pixels running a lot of math. Doing this on one thread is annoying but you can easily break it up into multiple threads. What I do is create a 1024x1024 bitmap, divide that up into 128x128 blocks (64 of them) and assign each thread to its own block. Each thread knows its position in the bitmap and writes only to its own pixels, so it doesn't need any synchronization. No mutexes. No spinlocks. Just raw speed. By breaking down the problem into concurrent blocks I get extreme performance.
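Here's a minimal sketch of that tiling scheme using std::thread. The names generateTerrain and sampleHeight are made up for illustration, and sampleHeight is just a cheap sin/cos placeholder where the real generator would call open simplex noise. The point is the layout: one thread per 128x128 block, each writing only to its own pixels, with a join at the end as the only synchronization.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder height function (an assumption); swap in open simplex noise.
inline float sampleHeight(int x, int y) {
    return 0.5f + 0.5f * std::sin(x * 0.05f) * std::cos(y * 0.05f);
}

// Fills a size x size heightmap using one thread per block x block tile.
std::vector<float> generateTerrain(int size = 1024, int block = 128) {
    assert(size > 0 && block > 0 && size % block == 0);
    std::vector<float> pixels(static_cast<std::size_t>(size) * size);
    std::vector<std::thread> workers;

    for (int by = 0; by < size; by += block) {
        for (int bx = 0; bx < size; bx += block) {
            // Each worker gets its own tile origin and touches no other
            // tile's pixels, so no mutex or atomic is needed.
            workers.emplace_back([&pixels, size, block, bx, by] {
                for (int y = by; y < by + block; ++y)
                    for (int x = bx; x < bx + block; ++x)
                        pixels[static_cast<std::size_t>(y) * size + x] =
                            sampleHeight(x, y);
            });
        }
    }

    for (auto& w : workers) w.join();   // the only synchronization point
    return pixels;
}
```

With the defaults this spawns 64 threads, which is more than most machines have cores. That's fine for a sketch, but a pool sized from std::thread::hardware_concurrency() that hands out blocks would probably be kinder to the scheduler.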
I also use this approach for other generation tasks like creating a color picker widget. The color palette has four threads.
There's lockless and then there's true lockless. If you can design your algorithm in such a way that it can run concurrently, like the terrain idea above, you don't need compare-exchanges, you don't need spinlocks, read-write locks, memory locks, or anything. If you can group things and work within the group via a thread (as long as the cost for spawning the threads is less than the time it takes to do the work) you've got a win.
In future blogs I'll give you some code on some speedy synchronizers and especially how to do multi-threading the cross platform way. std::thread is great but it gives you no portable way to control priority, and on Windows understanding priorities is huge.