Lock and Lock-Free Code Compared, Optimized With Intel Thread Profiler

CD Projekt Red (The Witcher) senior programmer Maciej Sinilo has written up his experience using an evaluation version of Intel Thread Profiler on a dual-core machine, where he set out to see whether the lock-based and lock-free versions of his multi-threaded experiments really behave differently.

Aimed at helping users "tune multi-threaded applications faster, for optimal performance on Intel multi-core processors," Intel Thread Profiler enables developers to visualize what percent of code is optimally parallel and where application performance issues exist.

Using Intel Thread Profiler's timeline view, Sinilo found that the average concurrency for his test using code based on locks was 1.99, with 5902.61 transitions per second. His results for the lock-free implementation test showed average concurrency at 3.04 with 57.36 transitions per second.

Sinilo found the lock-free implementation better, but not good enough. He tried a few methods to optimize the results, including eliminating as many semaphore waits as possible and a trick he found in Intel's Threading Building Blocks: "Every worker thread has its own task queue, if it's empty it tries to steal work from another thread. It wont sleep immediately, instead spin a little bit trying to steal something (yielding from time to time and pausing for a very short periods of time… It may be tricky to fine tune this)."

He continued, "Eventually, it may wait, but it shouldn't happen that often. Wake-up events are not signaled every time task is added, only when changing queue state from empty to full (possible contention here, as I guard gate state variable, but it's very short). I do not need to implement work-stealing, as all threads acquire tasks from one queue, so it more or less auto-balances itself. I simply spin a little bit waiting for new task to arrive."
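To make the idea in the quotes above more concrete, here is a minimal sketch of a single shared task queue whose consumers spin briefly before blocking, and whose producer signals a wake-up only when the queue goes from empty to non-empty. This is not Sinilo's code: the class name `SpinThenWaitQueue`, the `spin_attempts` parameter, the `CLOSED` sentinel, and the `run_demo` helper are all hypothetical names invented for illustration, and the sketch guards the queue with a short lock rather than being lock-free. It only illustrates the spin-then-wait and wake-on-transition policy.

```python
import threading
import time
from collections import deque

CLOSED = object()  # hypothetical sentinel: queue is closed and fully drained


class SpinThenWaitQueue:
    """Single shared queue; consumers spin a little before they block."""

    def __init__(self, spin_attempts=200):
        self._tasks = deque()
        self._cond = threading.Condition()  # short critical sections only
        self._spin_attempts = spin_attempts  # assumed tuning knob
        self._closed = False
        self.wakeups_signaled = 0

    def put(self, task):
        with self._cond:
            was_empty = not self._tasks
            self._tasks.append(task)
            if was_empty:
                # Signal only on the empty -> non-empty transition,
                # not on every added task.
                self.wakeups_signaled += 1
                self._cond.notify_all()

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def get(self):
        # Phase 1: spin, yielding the CPU between attempts.
        for _ in range(self._spin_attempts):
            with self._cond:
                if self._tasks:
                    return self._tasks.popleft()
            if self._closed:
                break
            time.sleep(0)  # yield to other threads
        # Phase 2: nothing arrived while spinning, so block until signaled.
        with self._cond:
            while not self._tasks and not self._closed:
                self._cond.wait()
            if self._tasks:
                return self._tasks.popleft()
            return CLOSED


def run_demo(num_workers=4, num_tasks=100):
    """Square the numbers 0..num_tasks-1 on a small worker pool."""
    queue = SpinThenWaitQueue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            task = queue.get()
            if task is CLOSED:
                return
            with results_lock:
                results.append(task * task)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for i in range(num_tasks):
        queue.put(i)
    queue.close()
    for t in threads:
        t.join()
    return results, queue.wakeups_signaled
```

Calling `run_demo()` returns the collected results together with the number of wake-up signals actually sent. Because signals are only raised when the queue was empty at the moment of a `put`, that count can never exceed the number of tasks and is often lower; the exact value depends on thread timing. As the quote suggests, the spin count is the part that needs tuning: too low and workers fall back to blocking almost immediately, too high and idle workers burn CPU.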

His final lock-free implementation had an average concurrency of 3.58 with 6.29 transitions per second. Sinilo admitted that the improved results didn't amount to much on his dual-core system: "Anyway, what's [the] 'real' difference between all those versions? Not that big, honestly, fractions of one frame. I blame [the] test machine, partially. I may test it on my quad-core work machine and will get back with the results (if they're interesting)."

About

This specially written weblog combines Gamasutra and Intel knowhow to present and deconstruct the latest happenings in visual computing and game technology.

Editor: Eric Caoili
