|
1. Introduction
Interlocked
operations provide a simple mechanism for synchronizing access to
variables that are shared by multiple threads. Interlocked operations
don't supplant classic synchronization functions like critical sections
or mutex objects but complement them as they answer a very specific
need.
2. What are interlocked operations?
Interlocked operations are a high-performance way of updating integer-sized or pointer-sized variables in an atomic manner.
Interlocked
operations are high-performance as they use hardware-based
synchronization techniques rather than operating system level
mechanisms. Interlocked operations are implemented as intrinsic
functions and are even translated in some cases to a single processor
instruction.
Interlocked
operations are atomic. An operation is atomic if it is indivisible.
During an atomic write, you can't have another thread reading the value
that you are writing half way through the write. Similarly, during an
atomic read, you can't have another thread changing the value that you
are reading half way through the read. If the operations were not
atomic, there would be a possibility that a thread will read part of an
old value and part of a new value from a variable that another thread
is writing to. Additionally, the atomicity of interlocked operations is
guaranteed across CPUs.
The
table below shows the interlocked operations available in the Windows
API. Keep in mind that other APIs have more or less operations
available.
|
Windows API function
|
Description
|
|
_InterlockedAnd
|
Perform a Boolean And operation on the specified variable
|
|
_InterlockedCompareExchange
|
Conditionally set a variable on the outcome of a comparison with the current value
|
|
_InterlockedDecrement
|
Decrements the specified variable
|
|
_InterlockedExchange
|
Set a variable and returns its prior value
|
|
_InterlockedExchangeAdd
|
Add to a variable and returns its prior value
|
|
_InterlockedIncrement
|
Increments the specified variable
|
|
_InterlockedOr
|
Perform a Boolean Or operation on the specified variable
|
|
_InterlockedXor
|
Perform a Boolean Exclusive Or operation on the specified variable
|
Because
some of the architectures capable of running Linux don't provide the
necessary hardware support, interlocked operations are not available in
the user side of the Linux API. They are available only on the kernel
side and only on specific architectures.
3. The pros of interlocked operations
Compared
to classic lock mechanisms, interlocked operations are the most
lightweight and efficient synchronization primitives. The advantages
are many.
Interlocked
operations are directly supported by the hardware in the form of one or
many processor instructions. Compiler support is provided in the form
of intrinsic functions. That configuration is very cache friendly. The
flow in the instruction cache will not be disturbed by using an
interlocked operation while the data cache may take a minimal hit,
depending on the platform memory model.
There
is no wait state with interlocked operations. When you try to acquire a
mutex object or a critical section, your thread may have to wait for
them to be released while interlocked operations are performed
immediately. That specific property is the base of lockless
programming. Lockless programming is a model where synchronization is
achieved using non locking primitives. Alternative names are or
lock-free, non-blocking and wait-free programming. As lockless
programming is notoriously difficult, don't expect to make your
application totally lockless. Instead, concentrate your efforts on
trying to make lockless some of the algorithms you are using. Good
candidates are linked lists [Val95] [Har01] and deque [ST04].
There
is no user mode to kernel mode transition as for mutex objects. This is
a significant advantage as user-mode to kernel-mode transitions are
particularly slow operations on most platforms (about 100 times slower
than a normal function call on a Pentium 4 [Ric05]).
It
is possible and relatively easy to create complex synchronization
primitives using interlocked operations. Below is a rudimentary
implementation of a critical section API using
_InterlockedCompareExchange and _InterlockedExchange. Keep in mind that
the following implementation is only an example. You shouldn't write a
replacement for the standard critical section API unless you have a
very specific need.
#define CS long
#define CS_FREE 0
#define CS_USED 1
void InitializeCS(CS* pCS)
{
// Set the free flag
*pCS = CS_FREE;
}
void EnterCS(CS* pCS)
{
while (_InterlockedCompareExchange(pCS, CS_USED, CS_FREE)
== CS_USED)
{
// Spin until we see the free flag then acquire
// the critical section by setting the used flag
}
}
void LeaveCS(CS* pCS)
{
// Set the free flag
_InterlockedExchange(pCS, CS_FREE);
}
4. The cons of interlocked operations
While
interlocked operations are great when you want performance, they are
limited in functionality. Boolean operations or bit manipulation are
usually not available. However, if you are ready to slightly compromise
the non-locking aspect of interlocked operations, it is usually
possible to build the missing operations from already available
operations. Below is an example of an _InterlockedAnd based on the very
useful _InterlockedCompareExchange. That example can easily be
transformed in a variety of Boolean or bit manipulation interlocked
operations by replacing the calculation of the desired value.
void _InterlockedAnd( int * piDestination, int iAnd)
{
int iOriginalValue
int iCurrentValue;
// Get the destination and don't access it
// anymore except through the interlocked operation
iCurrentValue= *piDestination;
do
{
iOriginalValue = iCurrentValue;
// Calculate the desired value
int iDesiredValue = iOriginalValue & iAnd;
// Try to perform an exchange
iCurrentValue = _InterlockedCompareExchange(piDestination,
iDesiredValue, iOriginalValue);
// The value prior to potential exchange is returned
// If the value has changed, repeat the operation
}
while (iOriginalValue != iCurrentValue);
}
Because
of their limitation to integer-sized or pointer-sized variables, there
are cases where you won't be able to avoid critical sections or mutex
objects. When you have to update several members of a data structure,
you will have to use in most case a more complex synchronization
mechanism.
Hardware
support is required for interlocked operations. Most of the modern
processors implement the required support but that is not a given,
especially in the embedded and mobile market segment. In that segment,
adding support for such complex operations would push both the cost and
the power consumption beyond what is currently acceptable.
The
processor instructions used to implement interlocked operations are
highly complex instructions. Not only the processor architecture has a
performance impact, bus design and cache structure have significant
performance implications too. Depending on the conditions, hundreds to
thousands of cycles will be necessary to complete an interlocked
operation. Remember, the fastest synchronization primitive is the one
that you are not using.
A
last limitation of interlock operations lies with inter-process
synchronization. As processes have separate memory spaces,
inter-process synchronization can only be achieved with mutex objects.
5. Debugging tips
5.1 Don't mix synchronization mechanisms
Mixing
interlocked operations with any other synchronization mechanism is a
fast track to synchronization bugs. Synchronization mechanisms have not
been designed to be mixed. If you are using a critical section to
synchronize a data structure, don't use an interlocked operation to
access part of your data structure.
5.2 Don't confuse interlocked operations with volatile variables.
Confusing
interlocked operations with volatile variables can lead to hard to
track bugs. The volatile keyword only ensures that variables are not
kept in registers. It doesn't guarantee atomic interactions across
threads and the C++ standard even allows compliant compilers to
completely ignore the volatile keyword. Additionally, if you access a
volatile variable through another variable (a pointer for example) that
is not qualified as well as volatile, the program behavior is
undefined.
5.3 Beware of memory consistency
Depending
of the hardware and the compiler, interlocked operations may or may not
guarantee the memory consistency. If no guarantee is given, both the
compiler and the hardware can reorder instructions around an
interlocked operation and strict program order will have to be
explicitly enforced.
If
your program writes to memory and then uses what has been written in an
interlocked operation, you need to commit outstanding memory accesses
before performing the interlocked operation. This prevents both the
hardware and the compiler from reordering accesses around the
interlocked operation, causing a synchronization bug. This process is
called a release memory barrier or release memory fence.
If
your program uses an interlocked operation to read a value and
subsequently access it thru memory, you need to commit outstanding
memory accesses after performing the interlocked operation. Again, this
prevents both the hardware and the compiler from reordering accesses
around the interlocked operation, causing a synchronization bug. This
process is called an acquire memory barrier or acquire memory fence.
The table below shows the memory barrier functions available in the Windows API.
|
Windows API function
|
Description
|
|
_ReadBarrier
|
Forces memory reads to complete
|
|
_WriteBarrier
|
Forces memory writes to complete
|
|
_ReadWriteBarrier
|
Block the optimization of reads and writes to global memory
|
You
should note that the x86 implementation of the Windows API provides a
full fence / full memory barrier on interlocked operations.
6. References
[Ric05] Jeffrey Richter. Concurrent Affairs - Performance-Conscious Thread Synchronization. In MSDN Magazine, October 2005.
[Val95]
John D. Valois. Lock-Free Linked Lists Using Compare-and-Swap. In
Proceedings of the 14th ACM Symposium on Principles of Distributed
Computing, August 1995.
[Har01]
Timothy L. Harris. A Pragmatic Implementation of Non-Blocking Linked
Lists. In Proceedings of the 15th International Symposium of
Distributed Computing, October 2001.
[ST04]
Håkan Sundell, Philippas Tsigas.Lock-Free and Practical Deques using
Single-Word Compare-And-Swap. Technical Report no. 2004-02.
|