Previously I described a thin lock as a very efficient implementation of an object monitor. It’s time to flesh out the design. Since most of the thin lock implementation is itself lock-free, all kinds of multicore subtleties come into play. I did my best to analyze every step, but I’m only human. The code hasn’t yet been written in D, much less tested.

Object Layout

Every object (instance of a class) in the D programming language inherits from Object. Object contains a hidden header consisting of two fields:

  • pointer to vtable
  • pointer-sized thin lock

I’m not adding any new fields, just replacing an existing pointer field with the thin lock. The pointer used to point to a lazily evaluated Monitor object.

The thin lock is a union of a bit field and a pointer to a FatLock struct. The union is discriminated by the two lowest bits: if they are zero, it’s a pointer. A null pointer has special meaning: the object can never be shared (it was created as non-shared). See sharing and unsharing in D.

Here’s the thin lock bit-field layout (the low 32 bits; the whole field is either 32 or 64 bits wide, depending on the native pointer size).

Bits  Name               Purpose
11    thread index       One-based index into the global thread table
19    recursion counter  Used for recursive locking by the same thread
1     currently shared   Set during the creation of a shared object. It can be turned off and back on with the help of sharing casts.
1     created shared     Set only if the object was created as shared

The D runtime has a fixed-size global array of threads. An index into this array is used to identify a thread. This is different from the operating-system-defined thread ID.
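To make this concrete, here is a minimal sketch of the thin lock word in D. The names and constants are illustrative, not the actual runtime definitions, and the assignment of the two flag bits to bit 0 and bit 1 is my assumption, chosen to be consistent with the checks used later in the locking algorithm.

// Illustrative layout; not the actual druntime definitions.
enum THREAD_INDEX_BITS = 11;                    // bits 21..31
enum COUNT_BITS        = 19;                    // bits 2..20
enum THREAD_SHIFT      = 21;
enum COUNT_INCREMENT   = 1 << 2;                // count lives above the two flag bits
enum MAX_COUNT_MASK    = ((1 << COUNT_BITS) - 1) << 2;
enum CREATED_SHARED    = 0x1;                   // assumed: bit 0
enum CURRENTLY_SHARED  = 0x2;                   // assumed: bit 1

struct FatLock;                                 // OS-based lock, described later

// One pointer-sized word: either the bit field above (low two bits
// non-zero) or a pointer to FatLock (low two bits zero).
union ThinLock
{
    size_t   bits;
    FatLock* fat;
}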

Sharing policy

To share an object between threads, it must be created as shared (possibly allocated from a separate, shared heap). Any global object that is not declared shared is only accessible through thread-local handles.

A shared object’s thin lock (which hasn’t been inflated to a fat lock, see later) has the two lowest bits set. Sharing can be cast away, at which point the currently-shared bit is cleared, but the created-shared bit is still on. The object can then be cast back to shared; however, if the created-shared bit is not on, such a cast will throw an exception.

This guarantees that an object that was not created for sharing can never be cast to shared.

When sharing is cast away, the casting thread’s identifier is remembered in the thread index field. The object becomes exclusively owned by the current thread (essentially, locked by it). Any attempt by a different thread to lock such an object will result in an exception.
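In user code, the intended sequence of casts might look like the following sketch. Foo is a placeholder class, and the runtime checks described above are assumed to happen inside the casts; this shows the proposed scheme, not necessarily current D behavior.

class Foo { int x; }

void example()
{
    shared Foo s = new shared(Foo);    // created shared: both sharing bits set
    Foo t = cast(Foo) s;               // cast sharing away: exclusive to this thread
    shared Foo u = cast(shared Foo) t; // fine: the created-shared bit is on

    Foo p = new Foo;                   // created non-shared: thin lock stays zero
    shared Foo q = cast(shared Foo) p; // would throw: not created for sharing
}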

Locking a never-shared object

An object that was created as non-shared (the default) has zero in its thin lock. This state is permanent and not reachable from any other state.

Originally I thought that the comparison (thinlock == 0) didn’t require any synchronization. I totally spaced out on publication safety (thank you, Andrei, for a reality check!). In general, this test has to be preceded by a read fence. Fortunately, on an x86, publication safety is guaranteed without fencing, so at least on that platform, the check is lightning fast. Because the thin lock is co-located with the rest of the object–next to the vtable pointer–even the overhead of fetching it from main memory is rarely incurred.

Therefore the incentive for writing two versions of a class–one with synchronized, and the other with non-synchronized methods–is practically absent in D.

I mentioned before that the D compiler might be able to elide synchronization of non-shared objects altogether. If that happens, the testing of the two sharing bits in the thin lock would be redundant. There is however a code-size/performance trade-off between the two solutions.

Locking algorithm

– Test thin lock for zero (on an x86, without any synchronization, otherwise precede it with a read fence). If zero, return. This is the most common case: the object was not created for sharing and will never be accessed by another thread.
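As a sketch, assuming the thin lock word is accessible as a shared integer (the names are illustrative), the portable form of this check could use an acquire load to stand in for the read fence; on an x86 it compiles down to a plain load anyway.

import core.atomic : atomicLoad, MemoryOrder;

// thinlock is the object's pointer-sized lock word, declared shared.
if (atomicLoad!(MemoryOrder.acq)(thinlock) == 0)
    return;   // never-shared object: no locking needed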

– Fetch current compact thread ID, which is conveniently stored in thread-local memory. Compact thread ID is pre-calculated for each thread using an 11-bit thread index (see above) shifted left by 21 and OR’ed with 3 (the lowest two bits are set). This is an optimization. Notice that an XOR with a compact thread ID flips the two lowest bits of the thin lock, which makes subsequent tests look logically inverted.
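A sketch of the pre-computation (threadIndex is the one-based index from the global thread table; the variable names are illustrative):

size_t compactTID;   // module-level variables are thread-local in D

void initCompactTID(size_t threadIndex)   // run once, at thread startup
{
    // Thread index in bits 21..31, both sharing bits set.
    compactTID = (threadIndex << 21) | 3;
}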

– XOR thin lock with this ID. If the result is 2 (remember the inversion), return. This is the case when the object was created shared, but was cast to non-shared by the current thread. This operation doesn’t require any synchronization, because only the current thread could cast the object back to shared.

if (thinlock ^ compactThreadId == 2)
  return;

– If we are past this point, we know that the object is shared, or we are attempting (incorrectly) to lock an object that is exclusive to (cast to non-shared by) another thread.

– Perform a CAS (atomic Compare And Swap operation) on thinlock. The value we are expecting is 3 (two lowest bits set and nothing else); the replacement value is the compact thread ID of the current thread. This operation will succeed in the next most common case–the object is shared but is not currently locked. The resulting thin lock state has the thread index filled with a non-zero value, and the two lowest bits set (see the description of compact thread ID). This state is interpreted as “locked once by a given thread”.
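In code, assuming thinlock is a shared word and using core.atomic’s cas, this step might look like:

import core.atomic : cas;

// Expect the unlocked-shared value 3; install our compact thread ID.
if (cas(&thinlock, cast(size_t) 3, compactTID))
    return;   // acquired: locked once by the current thread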

– If the CAS fails (thinlock didn’t have the expected value and the swap did not occur), we know that the object is locked (or the thin lock has been inflated to a fat lock, see later).

– Try the next most common case–the object is locked by the current thread (recursive locking). First XOR the value of the thin lock with the current compact thread ID to isolate the count field. If the result is less than the maximum count shifted left by 2 and the two lowest bits are zero (meaning, they were set before the XOR), then increment the count. These operations don’t require any synchronization, because they only succeed when the lock is owned by the current thread.

uint tmp = thinlock ^ compactTID;
// Low two bits zero: the lock is held by this thread and the count is below the maximum.
if (tmp < MAX_COUNT_MASK && (tmp & 3) == 0)
{
  thinlock += COUNT_INCREMENT;   // bump the recursion count
  return;
}

– Check for the error of trying to lock an object that is owned exclusively by another thread.

// Only the created-shared bit is set: the object is exclusively
// owned by another thread (it cast sharing away).
if ((thinlock & 3) == 1)
  throw new ExclusiveLockViolation;

– Check if the lock has been inflated (the two lowest bits are zero). If so, interpret the thin lock as a pointer to FatLock and lock it. Fat lock is implemented using the operating system locking primitives and can deal with contention. Return.
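A sketch of this step; FatLock and its lock method are the illustrative names used throughout:

if ((thinlock & 3) == 0)
{
    auto fat = cast(FatLock*) thinlock;   // the word now holds a pointer
    fat.lock();                           // OS primitive; may block under contention
    return;
}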

– If we reach this point, we know that there is contention for the lock or the count has overflowed. In either case we have to inflate the lock. We have to preserve one invariant–the lock can only be modified by the thread that holds it, otherwise we are open to all kinds of races. Therefore we have to busy-wait for the lock to be released.

while (thinlock != 3 && (thinlock & 3) != 0)
  compiler_fence();

Notice that busy waiting requires that the compiler not optimize this loop–we need a “compiler fence”. However, no processor memory fences are required, because there is no ordering dependence. It’s enough that, when another thread modifies the lock, the new value eventually becomes visible to the spinning thread.

It’s possible to miss the unlocked state and spin longer than necessary. Starvation is theoretically possible if the other thread keeps unlocking and locking without ever discovering contention, while the current thread repeatedly misses the unlocked state. (This part of the algorithm may be optimized further by introducing exponential backoff, as sketched below.)
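One possible shape of such a backoff, purely as an illustration (MAX_SPIN is a made-up tuning constant):

enum MAX_SPIN = 1024;                     // hypothetical tuning constant
int spins = 1;
while (thinlock != 3 && (thinlock & 3) != 0)
{
    foreach (i; 0 .. spins)
        compiler_fence();                 // same compiler fence as above
    if (spins < MAX_SPIN)
        spins *= 2;                       // exponential backoff
}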

– The loop is exited if either the lock has been released (thinlock == 3) or another thread managed to inflate the lock (the two lowest bits are zero, signifying that the value stored is a pointer to FatLock).

– Try to acquire the lock using CAS (the arguments are the same as in the original attempt).

– If it succeeds, allocate the fat lock, lock it, and atomically store the pointer to it in the thin lock. Return. Once the lock has been inflated, it will remain so for the lifetime of the object.
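A sketch of the acquisition-and-inflation step, using core.atomic for the CAS and the publishing store (FatLock and its methods remain illustrative):

import core.atomic : cas, atomicStore;

if (cas(&thinlock, cast(size_t) 3, compactTID))   // re-try the original CAS
{
    auto fat = new FatLock;                       // allocate the OS-based lock
    fat.lock();                                   // we hold it before publishing
    // Publishing the pointer (low two bits zero) makes the inflation
    // visible to any thread spinning on the thin lock.
    atomicStore(thinlock, cast(size_t) cast(void*) fat);
    return;
}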

– If we reach this point, we know that either the lock has been inflated, another thread is in the process of inflating it, or another thread has acquired the lock without contention while we were busy waiting.

– Try again: go back to busy waiting.

Unlocking

When unlocking, we have the guarantee that the current thread owns the lock, so we don’t need any additional synchronization.

  • If the thin lock is zero, return. The object was not created for sharing.
  • XOR thin lock with the compact thread ID.
  • If the result is 2, we own the object exclusively. Return.
  • If the result is zero, we hit the next most common case–the lock has been taken once by the current thread. Store 3 in the thin lock and return.
  • If the result of XOR is non-zero and its lowest two bits are clear, decrement the recursion count.
  • Otherwise, the lock has been inflated. Unlock the fat lock.
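Put together, a sketch of the unlocking sequence (shared qualifiers are elided for brevity; the names match the earlier sketches):

// Illustrative sketch of the unlock path.
void unlockThin(ref size_t thinlock, size_t compactTID)
{
    if (thinlock == 0)
        return;                          // never-shared object
    size_t tmp = thinlock ^ compactTID;
    if (tmp == 2)
        return;                          // exclusively owned by this thread
    if (tmp == 0)
    {
        thinlock = 3;                    // locked once: back to unlocked-shared
        return;
    }
    if ((tmp & 3) == 0)
    {
        thinlock -= COUNT_INCREMENT;     // recursive lock: decrement the count
        return;
    }
    (cast(FatLock*) thinlock).unlock();  // inflated: delegate to the fat lock
}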

FatLock must also contain a field exclusively-owned-by. This field is filled with a compact thread ID when sharing is cast away. When sharing is re-established, this field goes back to zero. Therefore, after locking the fat lock, additional checking is done: If the field is non-zero and the result of XOR with the current compact thread ID is also non-zero, an exception is thrown (attempt to lock an exclusively owned object).
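In sketch form, with exclusiveOwner as an illustrative name for the exclusively-owned-by field:

fat.lock();
// The fat lock is held; now verify that the object is not exclusively
// owned by a different thread.
if (fat.exclusiveOwner != 0 && (fat.exclusiveOwner ^ compactTID) != 0)
    throw new ExclusiveLockViolation;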