Lock-Free IPC
16 March 2011
I was doing some reading a while back into lock-free algorithms for communication between threads or processes via shared buffers. The general idea is to use a circular buffer in shared memory so that one process (or thread) can write data into the buffer while another reads it. The algorithm relies on atomic compare-and-swap (CAS) instructions, which are part of the x86 instruction set and are also present in several other architectures with features designed to assist with concurrency.
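To make the idea concrete, here is a minimal sketch of a single-producer/single-consumer circular buffer using C11 atomics. All names and sizes are illustrative, not from any particular implementation. Note that in the strict one-producer, one-consumer case, acquire/release loads and stores are sufficient; CAS becomes necessary once multiple producers (or consumers) contend for the same index.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024              /* must be a power of two */

typedef struct {
    _Atomic size_t head;            /* next slot to write (producer-owned) */
    _Atomic size_t tail;            /* next slot to read (consumer-owned) */
    int            data[RING_SIZE];
} ring_t;

bool ring_push(ring_t *r, int value)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;               /* buffer full */
    r->data[head & (RING_SIZE - 1)] = value;
    /* release: the slot contents are published before head advances */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

bool ring_pop(ring_t *r, int *out)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;               /* buffer empty */
    *out = r->data[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The unmasked indices deliberately wrap around as unsigned integers, so `head - tail` always yields the number of occupied slots even after overflow.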
It turns out that one of my colleagues has already written about this on the Rapita blog - it’s a useful approach, and used in many systems.
IPC
There are some issues with using shared memory for IPC: it is generally considered ‘dangerous’ because it punctures the memory isolation that most operating system kernels are built around, so its use is carefully controlled by the OS.
Linux supports shared memory by creating shared memory ‘files’ (segments) which are then mapped into the address space of one or more processes. These segments must be cleaned up after use, however: they persist even after every process using them has exited, effectively leaking memory in the kernel.
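The POSIX flavour of this on Linux looks roughly as follows (the segment name `/demo_shm` is illustrative). The important part is the explicit `shm_unlink()`: without it, the segment lives on in `/dev/shm` after the process exits, which is exactly the kind of leak described above.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int shm_demo(void)
{
    const char *name = "/demo_shm";
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);                     /* the mapping stays valid after close */

    strcpy(p, "hello");            /* visible to any process mapping the segment */

    munmap(p, 4096);
    shm_unlink(name);              /* remove the name so the kernel can free it */
    return 0;
}
```

A second process would call `shm_open` with the same name (without `O_CREAT`) and `mmap` the same pages to see the data.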
The design principles of a microkernel often demand complete memory isolation, and the first generation of microkernels had very poor performance because of the resulting dilemma: on the one hand they demanded isolation, which prevented high-performance communication between processes; on the other hand, by moving system services out of the kernel and into independent processes (servers), they vastly increased the amount of IPC needed for the basic operation of the system.
Later designs have allowed higher performance by internally using something akin to shared memory to implement their message-passing IPC. The Mach kernel uses a mechanism where the memory containing the message is mapped into the receiving process’ address space. By flagging the memory pages as read-only, the operating system can step in and make a new copy of the memory if the receiving process tries to modify it.
The documentation for the QNX kernel describes why shared memory alone isn’t enough for all IPC mechanisms, since in the end some form of synchronisation will be required, and the process must fall back on the OS to provide this.
Unidirectional buffer
It strikes me that there ought to be a middle ground, similar to the approach used by the Mach microkernel, but one that doesn’t require so many system calls to pass messages. My current idea: to create a producer-consumer buffer between two processes, a process asks the kernel to create a new channel (or port, in microkernel terminology) to the receiving process. This allocates a number of memory pages mapped into both processes, read/write for the producer and read-only for the consumer. We can then implement the circular buffer mechanism as described in the Rapita article, except that at least one variable must still be writeable by the consumer, so that it can signal the current position of the read pointer back to the producer. This could be accomplished by mapping a single extra shared page, writable only by the consumer, to hold that pointer.
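The proposed channel layout might be sketched like this (type names and sizes are my own illustrative choices, not a finished design). The data pages would be mapped read/write for the producer and read-only for the consumer; the acknowledgement page is the single consumer-writable page holding the read pointer.

```c
#include <stdatomic.h>
#include <stddef.h>

#define CHAN_SLOTS 4096   /* bytes of payload per channel, illustrative */

/* Page(s) writable by the producer, mapped read-only into the consumer. */
typedef struct {
    _Atomic size_t write_idx;          /* advanced only by the producer */
    unsigned char  data[CHAN_SLOTS];   /* the circular buffer itself */
} chan_data_page;

/* Single page writable only by the consumer: its sole job is to expose
 * the read pointer back to the producer. */
typedef struct {
    _Atomic size_t read_idx;           /* advanced only by the consumer */
} chan_ack_page;

/* Bytes the producer may write without overtaking the consumer. */
size_t chan_space(const chan_data_page *d, const chan_ack_page *a)
{
    size_t w = atomic_load_explicit(&d->write_idx, memory_order_relaxed);
    size_t r = atomic_load_explicit(&a->read_idx, memory_order_acquire);
    return CHAN_SLOTS - (w - r);
}
```

Because each index has exactly one writer, the kernel’s page protections enforce the ownership rules that a purely software lock-free buffer can only rely on by convention.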
For a single variable this seems extravagant: a page is usually 4 KB in current operating systems, but can be as large as 4 MB on some systems.
Other concerns include how to handle buffers between processes on separate cores (the ideal use case for lock-free buffers) or separate systems in a network. I think it ought to work in the first case, but I suspect extra work needs to be done to flush memory pages from cache - the CAS instruction ought to at least force its own cache line to be flushed, but the rest of a message might not be.
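In the language of the C11 memory model, the usual way to guarantee that a whole message is visible before its index update (rather than relying on the CAS alone) is a release store paired with an acquire load. A minimal single-file sketch, with names of my own invention:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

static char        msg_buf[64];   /* message payload: plain, non-atomic writes */
static _Atomic int msg_ready;     /* publication flag */

void publish(const char *msg)
{
    strncpy(msg_buf, msg, sizeof msg_buf - 1);    /* write the payload... */
    /* ...then a release store: all the writes above are ordered before
     * the flag update, and so become visible to any thread or core that
     * observes the flag with an acquire load. */
    atomic_store_explicit(&msg_ready, 1, memory_order_release);
}

const char *consume(void)
{
    if (atomic_load_explicit(&msg_ready, memory_order_acquire))
        return msg_buf;   /* the payload writes are guaranteed visible here */
    return NULL;
}
```

On a cache-coherent multi-core system this ordering is what matters; the hardware keeps the caches themselves consistent without explicit flushing.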
For networked buffers, the best solution might be to create a ‘dummy’ or stub process which simply streams the content of the buffer over the network. This would be transparent to the sending process, and a corresponding stub on the receiving system should provide the same effect to the receiver. I have concerns about latency and so on, but if the buffer can be kept sufficiently empty the throughput should be high.
Finally, the receiver would have to be careful to make its own copy of any data before modifying it - the read-only pages of the buffer would otherwise cause a segmentation fault. This might be worked around in the same way as the Mach message passing IPC using copy-on-write mechanisms, but would then incur the extra overhead of system calls to perform the memory mapping as each message is received.
Conclusion
I’ll almost certainly be experimenting with this approach in the future, and will be interested to see the relative performance of the different approaches.