Subject Fine Granularity Threading on Linux
Author Jim Starkey
A primary goal of FB2 is to implement fine granularity multi-threading
to take advantage of multiple processors in an SMP configuration.
I've been doing quite a bit of work in that general area on a different
project, and thought I'd share some observations on pitfalls and
challenges, particularly on Linux and similar systems.

The first problem is that gdb (and presumably any other debugger)
can't be used on a multi-threaded system under load. The problem
is an unfortunate interaction between ptrace (the Unix debugging
mechanism) and the Linux implementation of pthreads. The basic
ptrace mechanism works this way: a signal intended for the target
process causes the target process to stall while the signal
is delivered to the debugging process, which figures out what to
do, then restarts the target process. The Linux threading
mechanism uses a separate OS process for each thread. Synchronization
is performed by the pthread mutex calls. When one thread (process)
fails to acquire a mutex, it goes to sleep. When another thread
releases the mutex, it sends a signal to all sleeping threads
(in fact, all threads) to wake up and look around. Unfortunately,
when the target process/threads are under control of the debugger,
there is a very complex multi-process dance involving (apparently)
multiple debugger interactions per wake up. Kinda like the
guys who designed the threads didn't talk to the guys who designed
ptrace, or one or the other didn't care.

The bottom line is when debugging a multi-threaded application
under load on Linux, gdb consumes between 45% and 100% of the
cpu. The practical implication is that Linux is not a viable
debugging platform. NT, happily, doesn't exhibit the same
characteristics.

A partial but painful workaround is for the debugging image
to set up handlers for SIGSEGV and SIGILL to invoke gdb via
the system() call to attach to the process. This catches
crashes, but makes ordinary debugging tedious indeed.

A second problem is implementing a use-count mechanism to
control object lifetimes in a multi-threaded environment.
The two alternatives are to use mutexes or other synchronization
mechanisms to protect all addRef/release calls (very, very
expensive) or to use interlocked increment/decrement mechanisms.
Microsoft provides intrinsic
InterlockedIncrement/InterlockedDecrement functions that perform
atomic multiprocessor interlocked operations and correctly
return the result of the operation. Unfortunately, there
are no such functions available on Linux. Atomic.h provides
interlocked increment/decrement, but they don't return values.
Interestingly enough, Google couldn't find any example of
the Intel instruction sequences required to implement the
necessary atomic operations using the GNU assembler dialect.
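
For what it's worth, here is a sketch of what such a sequence might
look like in GNU inline assembler, using the x86 "lock xadd"
instruction (486 and later). This is an illustration under
assumptions, not FB2 code: the names interlocked_add,
interlocked_increment, interlocked_decrement, Counted, add_ref and
release are invented for the example, and the non-x86 branch falls
back on GCC's __sync_add_and_fetch builtin where the compiler has it.
The returned value is the whole point: release() must know that it,
and no other thread, took the count to zero.

```c
/* Atomically add delta to *value and return the NEW value. */
static int interlocked_add(volatile int *value, int delta)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int previous = delta;

    /* lock xadd: memory receives old + delta, and the register
       receives the old memory contents, as one atomic operation. */
    __asm__ __volatile__("lock; xaddl %0, %1"
                         : "+r" (previous), "+m" (*value)
                         :
                         : "memory", "cc");
    return previous + delta;
#else
    /* Assumption: a GCC-compatible compiler with __sync builtins. */
    return __sync_add_and_fetch(value, delta);
#endif
}

static int interlocked_increment(volatile int *value)
{
    return interlocked_add(value, 1);
}

static int interlocked_decrement(volatile int *value)
{
    return interlocked_add(value, -1);
}

/* Hypothetical use-counted object. */
typedef struct {
    volatile int use_count;
    int destroyed;      /* stands in for real destruction */
} Counted;

static int add_ref(Counted *object)
{
    return interlocked_increment(&object->use_count);
}

static int release(Counted *object)
{
    int remaining = interlocked_decrement(&object->use_count);

    /* Exactly one caller sees zero, so exactly one caller destroys. */
    if (remaining == 0)
        object->destroyed = 1;  /* real code would free the object */
    return remaining;
}
```

With an increment/decrement that returns nothing, release() would
have to re-read use_count after the decrement, and two threads
could both see zero (or neither could).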

All this means two things. First, there is no alternative
to inline assembler code in FB2. Distasteful, yes, but get
used to it. Second, fine granularity threading is alien
to Linux, so be prepared to be a pioneer in a nasty environment
without a workable debugger.

Jim Starkey