Most languages were caught unaware by the multicore revolution. C++, predictably, developed a portable assembly language for multiprocessors (C++0x atomics). Out of sheer desperation some programmers turned to Erlang. Many still try to ignore the elephant in the room, while others try to shoot it with BB guns provided by their favorite languages.
I’ve looked at many languages and decided that none provided the right combination of performance and safety. We are still waiting for the next big one to break into the multicore territory.
Towards this goal, I’ll present a series of posts in which I’m going to develop a
threading model for that hypothetical language. The model is based on several papers that I reviewed in my previous blog posts. My own contribution is putting the best of those ideas together into one package. I will use syntax similar to that of the D programming language, but C++ and Java programmers shouldn’t have problems following it.
Teaser #1
In one of my previous posts I described a concurrent data structure based on Haskell’s MVar. It’s essentially a one-element message queue. Let’s have a peek at the implementation that uses my proposed multithreaded scheme. Notice the total absence of special annotations–even the synchronized keyword is missing. The only unusual thing is the move operator, :=, introduced in my previous blog. It is needed in case we want to pass unique objects through MVar. You are probably asking yourself: how the heck is this supposed to work in a multithreaded environment? Read on!
class MVar<T> {
private:
    T _msg;
    bool _full;
public:
    // put: asynchronous (non-blocking)
    // Precondition: MVar must be empty
    void put(T msg) {
        assert (!_full);
        _msg := msg; // move
        _full = true;
        notify();
    }
    // take: if empty, blocks until full.
    // Removes the message and switches state to empty
    T take() {
        while (!_full)
            wait();
        _full = false;
        return := _msg;
    }
}
Let’s instantiate this template as a monitor–that is, an object accessible from multiple threads. We do it by specifying the owner as “self” (the owner is not explicitly listed as a template parameter, but it can be passed to the template during instantiation).
auto mVar = new MVar<owner::self, int>;
In a self-owned object all methods are automatically synchronized. The move operator acting on an int simply copies it.
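To see what “automatically synchronized” amounts to, here is one way the compiler-generated code for a self-owned MVar might look in present-day C++. The mutex, condition variable, and member names are my own scaffolding, not part of the proposal:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <utility>

// Hand-written C++ analogue of a self-owned MVar<T>: every method
// takes the object's lock, as the proposed compiler would do implicitly.
template <typename T>
class MVar {
    std::mutex _mtx;
    std::condition_variable _cond;
    T _msg{};
    bool _full = false;
public:
    // put: non-blocking; precondition: MVar must be empty
    void put(T msg) {
        std::lock_guard<std::mutex> lock(_mtx);
        assert(!_full);
        _msg = std::move(msg);   // stands in for the := move operator
        _full = true;
        _cond.notify_one();
    }
    // take: blocks until full, then empties the MVar
    T take() {
        std::unique_lock<std::mutex> lock(_mtx);
        _cond.wait(lock, [this] { return _full; });
        _full = false;
        return std::move(_msg);  // stands in for return := _msg
    }
};
```

The lock/condition-variable pair is exactly the monitor machinery that the proposed language would generate behind the scenes for every self-owned object.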
That was easy! How about instantiating a self-owned MVar for passing unique objects? Piece of cake!
auto q = new MVar<owner::self, unique Foo>;
auto foo = new unique Foo;
q.put(:= foo); // move foo
assert (foo is null);
// another thread
unique Foo f2 = q.take(); // implicit move of rvalue
So far data races have been avoided by implicit synchronization in self-owned objects and by the use of move semantics for unique objects.
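The unique-object discipline above is close in spirit to what C++ later offered with std::unique_ptr, except that C++ leaves the is-it-null check to the programmer rather than the type system. A rough sketch of the same moves, using my own helper function:

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct Foo { int x = 7; };

// Mimics the article's sequence: construct a unique Foo,
// move it onward, and observe that the source alias goes null.
void demo() {
    auto foo = std::make_unique<Foo>();  // ~ auto foo = new unique Foo;
    auto inTransit = std::move(foo);     // ~ q.put(:= foo);  (move foo)
    assert(foo == nullptr);              // ~ assert (foo is null);
    assert(inTransit->x == 7);           // the receiver now owns the object
}
```

The crucial difference: in the proposed language the compiler rejects any later use of the moved-from alias, whereas C++ merely leaves it null.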
Of course you know how to break the MVar, right? Instantiate it with a regular, non-unique class object and watch aliases spread through threads. Let’s try it:
auto mVar = new MVar<owner::self, Foo>;
auto myFoo = new Foo;
mVar.put(myFoo);
// now let me race through my local alias!
myFoo.method();
Surprise! This code will not compile because a self-owned object cannot have a thread-local member. Because of a richer type system, the compiler can easily detect such violations. I’ll explain the details of the ownership scheme in my future posts–for now let me assure you that, no matter how hard you try, you can’t create a data race in this system (unless you bypass it with explicit casts).
How to make concurrent programming safer?
I propose to do this by extending the type system. (You’ve already seen some examples above.) I realize that there is strong resistance to extending a language’s type system. The main problem is increased complexity: programmers are reluctant to study and use complex annotation schemes. I will argue that, in the case of multithreading, the complexity argument is reversed. Shared-memory multithreading is hard! Figuring out how to avoid all the pitfalls of sharing is harder than learning a type system that prevents them.
If a programmer is not willing to learn or cannot understand the race-free type system, he or she should not be writing multithreaded programs.
How to limit complexity?
To limit the complexity of the type system, I came up with several simple principles. The most important one is that multithreaded extensions should not require modifications to single-threaded programs. That means coming up with reasonable defaults for all new annotations.
If necessary, the compiler should be able to derive some of the defaults from program analysis. Because modularity is very important, such analysis should not creep across compilation unit boundaries. Whole-program analysis is out of the question.
How to keep goals reasonable?
There are two major challenges in multithreaded programming:
- Avoiding races
- Preventing deadlocks
The first problem is approachable, whereas the second seems to be a pie in the sky. I will therefore concentrate on designing a race-free system that is based on existing academic research.
Such a system would be incomplete without some support for lock-free programming. There are very good reasons for the language to enforce sequential consistency. For comparison, the C++ committee, after a long struggle, decided to allow non-SC primitives (weak atomics); Java enforces SC only for volatile variables.
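The distinction shows up directly in C++0x atomics (shown here with the eventual std::atomic spelling): operations are sequentially consistent by default, and you must opt in explicitly to get the weak orderings. A minimal illustration, with my own variable names:

```cpp
#include <atomic>

std::atomic<int> x{0};

void writer() {
    x.store(1);                             // SC by default (memory_order_seq_cst)
    x.store(2, std::memory_order_relaxed);  // weak atomic: explicit opt-in only
}

int reader() {
    return x.load();                        // SC load by default
}
```

A language that enforces SC would, in effect, permit only the default forms and reject the relaxed ones.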
Teaser #2
Speaking of lock-free programming, here’s the implementation of the infamous Double-Checked Locking Pattern:
class Singleton<T> {
private:
    lockfree T _value;
public:
    T get() lockfree {
        if (_value is null) {
            synchronized(this) {
                if (_value is null)
                    _value = new T;
            }
        }
        return _value;
    }
}
A lockfree variable is guaranteed to be always atomically accessed in a sequentially consistent way. A lockfree method is not synchronized, but it can only operate on lockfree variables (or use a synchronized section). Even though the use of lockfree doesn’t introduce low-level races, it may create high-level races when accessing more than one lockfree variable at a time. But we all know that lock-free programming is only for desperadoes.
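For comparison, here is how the same pattern is typically written in present-day C++, with std::atomic playing the role of lockfree and an explicit mutex playing the role of synchronized(this). This is my own sketch of the equivalent, not the proposal’s generated code:

```cpp
#include <atomic>
#include <mutex>

template <typename T>
class Singleton {
    std::atomic<T*> _value{nullptr};  // the 'lockfree' variable
    std::mutex _mtx;                  // what synchronized(this) would lock
public:
    // The 'lockfree' method: atomic fast path, lock taken only
    // on the first call, when the value must be created.
    T* get() {
        T* p = _value.load();         // first check, no lock (SC load)
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(_mtx);
            p = _value.load();        // second check, under the lock
            if (p == nullptr) {
                p = new T;
                _value.store(p);      // SC store publishes the object
            }
        }
        return p;
    }
};
```

Because both atomic accesses are sequentially consistent by default, this version avoids the publication bug that made the classic (non-atomic) DCLP notorious.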
Lessons learned from previous work
There are many commonalities in various approaches to race freedom.
- Ownership should be built into the type system. In OO languages each object must have an owner. In C-like languages each piece of data must have an associated lock.
- There should be efficient mechanisms for passing data between threads
- By value
- By reference. (The object being passed must be a monitor.)
- As immutable data, by const reference
- By unique reference using move semantics
Most of those ideas are expressible through type qualifiers.
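The “immutable data, by const reference” case already has a close C++ approximation: hand every thread a shared pointer to a const object, and the type system forbids all of them from mutating it. A sketch, with my own type and function names:

```cpp
#include <memory>
#include <string>

struct Config { std::string name; };

// Any number of threads may read through this alias;
// const blocks mutation at compile time.
using SharedConfig = std::shared_ptr<const Config>;

std::string readName(const SharedConfig& cfg) {
    return cfg->name;       // OK: read-only access
    // cfg->name = "x";     // would not compile: the object is const
}
```

In the proposed language this qualifier-based sharing would be checked transitively, so no mutable alias could leak out at all.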
Higher-level models
Explicit synchronization is hard, no doubt about it. In my proposal, the type system makes it safe. There are however other models of concurrency that, although more restrictive, are often easier to use.
One such model is based on message passing. In my opinion, support for message passing should be library-based. The race-free type system will make it safe and flexible. It will, for instance, support passing messages not only by value but also by reference–without sacrificing safety. In traditional type systems there is no way to express the requirement that a message passed by reference must either be a monitor itself (have its own lock) or behave like a C++ unique_ptr (an object that leaves no aliases behind). This requirement should be expressible through the type system, allowing the compiler to check for its violations.
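Today one can at least approximate the unique-reference half of that requirement in C++: make the channel accept messages only by moved-in unique_ptr, so the sender cannot keep an alias. The Channel class below is my own toy illustration, not part of the proposal:

```cpp
#include <memory>
#include <mutex>
#include <queue>
#include <utility>

// Toy message queue: messages can only enter by move, so the sender
// is left holding a null pointer and has no alias to race on.
template <typename T>
class Channel {
    std::mutex _mtx;
    std::queue<std::unique_ptr<T>> _q;
public:
    void send(std::unique_ptr<T>&& msg) {
        std::lock_guard<std::mutex> lock(_mtx);
        _q.push(std::move(msg));
    }
    // Returns the oldest message, or null if the channel is empty.
    std::unique_ptr<T> tryReceive() {
        std::lock_guard<std::mutex> lock(_mtx);
        if (_q.empty()) return nullptr;
        auto msg = std::move(_q.front());
        _q.pop();
        return msg;
    }
};
```

What C++ cannot express, and the proposed type system can, is the other alternative: accepting a plain reference only when the referred-to object is a monitor.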
I’ve been paying a lot of attention to software transactional memory (STM). I believe that STM could be built into the language–again, with the support of the type system. It looks like the hardest problem with STM is how it should inter-operate with other types of multithreaded access (both locked and lock-free). The simplest solution is to enforce full isolation–no STM object can ever be accessed outside of a transaction. It’s not clear how practical such an approach is, but it definitely simplifies things to the point that STM becomes orthogonal to other types of synchronization. And that means it will be possible to tackle it at a later time, after the race-free multithreading model is firmly established.
In the next installment, I will describe the ownership scheme that is the foundation of the race-free type system.
Bibliography
C. Flanagan, M. Abadi, Object Types against Races. The seminal paper introducing ownership-based type systems.
See also my previous posts on Guava and GRFJ that discuss race free dialects of Java.
May 26, 2009 at 1:15 pm
Wouldn’t you need some kind of “move-return” in MVar::take ? I.e.:
T take() {
while (!_full) wait();
_full = false;
return :=_msg;
}
Cheers!
May 26, 2009 at 4:06 pm
A great read, as usual. Thanks!
By the way, Charles Bloom (a veteran in graphics/gamedev) referred to your post here: http://cbloomrants.blogspot.com/2009/05/05-26-09-automatic-multithreading.html
May 26, 2009 at 4:12 pm
“Shared-memory multithreading is hard!” — you said it! And the type system seems like a great place to start. The idea of owners feels a lot like the way I understand Io to approach the Actor model of concurrency, too.
But while you’re talking about a hypothetical future language, are you also assuming that no-shared-memory architectures won’t be dominant/emerging by the time this language would gain broad exposure? PS3’s Cell is already a step in this direction, and languages like Erlang seem uniquely poised to tackle the no-shared-memory future. (Also, some coroutine-based VM languages [such as Io] have some seemingly solid features for deadlock detection and prevention.)
At least, I hope it’s the future, given the von Neumann bottleneck and the inefficiencies of memory access bookkeeping and synchronization across multiple cores. Locking down a section of memory with respect to one of four cores seems a lot more tractable on a hardware level than locking it down with respect to hundreds of cores. We already have multichannel memory controllers, so it seems like designating entire regions of memory or memory controllers to specific processors is only a few generations away. (There surely will be a point, won’t there, when it’s faster to pass tiny bits of data across CPU buses than to globally negotiate the access of many CPUs across a shared resource?)
Do I completely misunderstand and misrepresent your position and the trends in system architecture? I don’t want to seem like a know-it-all, and I promise I’m really looking for advice and information here. Why are you explicitly targeting shared-memory concurrency as your problem here?
May 26, 2009 at 5:16 pm
To pizer: There is no need for special notation when returning a unique object because it’s an rvalue. It disappears immediately after return. The compiler does a shallow copy and makes sure it doesn’t call the destructor–that’s all.
Another example of a unique rvalue is the result of a construction, e.g.:
auto foo = new unique Foo;
Here, again, you don’t have to use a move operator because the right-hand side is an rvalue. The compiler just has to remember not to destroy the rvalue source.
May 26, 2009 at 5:38 pm
I’m glad you’re finally starting to post the real details of your multi-threading scheme. I can’t wait for D to adopt a real system for this stuff.
How does the lockfree storage class translate into overhead on different processors? (Especially x86) How would CAS work?
May 26, 2009 at 5:45 pm
To Joe Osborn: Frankly, I don’t know what the future of computer architecture is going to be. Currently, shared memory multiprocessing seems to be dominant and there is a huge gap between hardware and the 20th century programming model we are using.
Message passing a la Erlang is ok, but there are many programming problems that just don’t fit into that mold (or cannot be efficiently implemented with all the copying going around).
Locking down sections of memory makes sense only if your data structures are well localized, which is not the norm in OO languages.
I’d rather bet on transactional memory, especially if it’s aided by hardware.
And as far as NUMA architectures are concerned I have little idea how they can be reflected in programming languages (I’ve seen a custom-made 2D graph-based language for NUMA, but I don’t know how practical it is).
May 26, 2009 at 6:00 pm
To Jason House: There is strong opposition to including this type system in D. The usual arguments are that it’s too complicated and that it doesn’t cover all possible situations. So don’t hold your breath.
The lockfree storage class is very similar to Java’s volatile and C++0x strong (default) atomics. On the x86 the overhead is on the writing side, where the compiler converts a store into a (locked) exchange (xchg).
CAS could be a primitive of the language, or part of a library written in the assembly language (on the x86, cmpxchg). Most lock-free algorithms don’t use anything other than atomic store, load, and CAS.
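In C++0x terms (using the later std::atomic spelling), those three primitives look like the following; the lock-free counter is my own illustrative use of CAS, not from the post:

```cpp
#include <atomic>

std::atomic<int> counter{0};

// Lock-free increment built from load + CAS (compare_exchange),
// the typical shape of an x86 cmpxchg retry loop.
void increment() {
    int expected = counter.load();   // atomic load
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
        // CAS failed: expected was refreshed with the current value; retry
    }
    // atomic store happens inside the successful CAS
}
```

On failure, compare_exchange_weak reloads the current value into expected, which is exactly why the loop needs no separate re-read.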
May 26, 2009 at 8:25 pm
Thanks again for your blogs. I’m happy with how this is taking shape, though unhappy that there is resistance to putting it in D. (Also, your examples are much more C++-ish than D-ish.) I’m looking forward to seeing how ownership shapes up.
Speaking of reasonable defaults, is there some logic behind specifying both lockfree variables and lockfree methods? Wouldn’t it be more logical that accessing a variable on a shared object is implicitly lockfree if not inside a synchronized block?
Also, what is your opinion of disallowing shared objects from having public variables? (i.e. forcing all data access to occur through methods?)
May 27, 2009 at 12:24 am
How is “_msg” — a member of MVar — considered an rvalue in MVar::take so you don’t need an explicit move?
May 27, 2009 at 3:53 am
Interesting post, again !
Can you explain what you mean by Software Transactional Memory and how would a type system help ?
May 27, 2009 at 3:53 am
Great read! Thanks for this article.
One objection: T get() method is not lock-free, because it has a synchronized section (i.e. mutex) inside of the method body.
I’d call it race-free, but then, isn’t the whole language supposed to be race-free?
May 27, 2009 at 8:34 am
To Robert Jacques:
“Wouldn’t it be more logical that accessing a variable on a shared object is implicitly lockfree if not inside a synchronized block?”
That wouldn’t work without whole-program analysis. You can make a lockfree variable protected and create a subclass without having access to the implementation. Also, it wouldn’t work for global lockfree objects (these would not be thread-local). But the most important reason is that the programmer has to explicitly commit to lock-free programming. It’s the “I know what I’m doing” pronouncement.
“Also, what is your opinion of disallowing shared objects from having public variables? (i.e. forcing all data access to occur through methods?)”
Absolutely! This is coming in the next post.
May 27, 2009 at 8:39 am
To pizer: You caught me! I had it returned by := in the earlier version, but I thought I could get away without it. Obviously, only local unique variables become rvalues before they are returned. I will edit my post, so I don’t confuse other readers. Thanks!
May 27, 2009 at 9:05 am
To Nick B:
I gave several talks about STM. You can view online slides to the one I gave to the Northwest C++ Users Group. I also talked at the D conference, where I explained the type system angle.
May 27, 2009 at 9:14 am
To Denis Koroskin:
I define a lockfree method as one that is not synchronized by default. Internally, it may contain lock-free access and synchronized sections. It’s a matter of language semantics.
May 27, 2009 at 10:13 am
Re:
“Wouldn’t it be more logical that accessing a variable on a shared object is implicitly lockfree if not inside a synchronized block?”
That wouldn’t work without whole-program analysis. You can make a lockfree variable protected and create a subclass without having access to the implementation.
—
A lockfree variable is defined by enforcing sequential consistency, which the subclass would guarantee when not inside a synchronized block. So I don’t see how whole-program analysis is needed. I can see how unnecessary memory fences could be added, by calling a lockfree method from inside a synchronized block, but this is an issue with your alternative as well and doesn’t negate program validity.
Re:
Also, it wouldn’t work for global lockfree objects (these would not be thread-local).
—
Aren’t all lockfree objects shared? (i.e. not thread local)
Re:
But the most important reason is that the programmer has to explicitly commit to lock-free programming. It’s the “I know what I’m doing” pronouncement.
—
But the programmer still has to declare “I know what I’m doing” by declaring the method as lock free.
To elaborate, what I was proposing was that instead of:
class Singleton(T) {
    private lockfree T _value;
public:
    T get() lockfree {
        if (_value is null) { // lockfree access
            synchronized(this) {
                if (_value is null) // normal access
                    _value = new T; // normal access
            }
        }
        return _value; // lockfree access
    }
}
where you’re limited to only using lockfree variables outside the synchronized block, you’d write this:
class Singleton(T) {
    private T _value;
public:
    T get() lockfree {
        if (_value is null) { // lockfree access
            synchronized(this) {
                if (_value is null) // normal access
                    _value = new T; // normal access
            }
        }
        return _value; // lockfree access
    }
}
where outside the synchronized block all variables are guaranteed to be sequentially consistent (i.e. lockfree)
It seems to me that requiring the programmer to declare that ‘I know what I’m doing’ in multiple, dependent locations doesn’t add any safety, only complexity.
May 27, 2009 at 11:20 am
Okay, let me try to come up with more convincing examples. Consider this:
(Not possible in Java, but possible in D.) Obviously you can’t call makeBar in its present form. You need a version of makeBar that takes a pointer to lockfree bar:
With global variables, you want something like this:
It means that the object Bar is protected by its own lock, but also the handle can be shared between threads and accessed atomically.
May 27, 2009 at 1:48 pm
Well, what prevents lockfree from be applicable to free functions? i.e.
Regarding global variables, not accessing the handle of shared objects atomically seems to me to be a very bad idea (i.e. a race). So I think I’m missing your point. Could you re-phrase, please?
May 27, 2009 at 2:50 pm
Explaining the semantics of lockfree data and methods to the end user is pretty straightforward. In what you’re proposing, the semantics of lockfree is no longer simple. Essentially, what you are saying is that the compiler can always guess which data must be accessed atomically and which needn’t be–a much harder task.
What if makeBar is a method of some factory object? Must the method be lockfree? What if the factory is shared and requires a lock?
As for globals, I want to make distinction between these two:
In the first case the handle is thread-local. It’s invisible to other threads, although the object, being self-owned, may be safely passed through a channel.
Using C-like syntax:
Maybe that’s a better way to explain lockfree.
May 27, 2009 at 4:56 pm
Regarding simplicity, what you’ve proposed is:
1) method must be declared lockfree
2) variable must be declared lockfree
I’ve proposed simplifying this to
1) method must be declared lockfree
In both cases the same thing happens: shared variables outside the synchronized block are sequentially consistent and variables inside the block are lock protected. There’s no guessing involved.
Regarding factories, there’s no clear reason that the member variables should or need be tagged as lockfree. I see the usage as being something like this:
being useful. But I’m fairly sure you could capture the exact same level of flexibility using:
without breaking the type system. That said, I’m not against the former, per se, as you should always have a way to break the type system, but I don’t see the need for variables to be declared lockfree if the method is already declared lockfree. For some reason, I think the ugliness of something like:
would be superior to a single keyword.
I agree with your statement on globals, but it seems you’re using the same keyword for two different concepts. In this case I think
would be clearer. While I generally support keyword reuse, in the future it would be helpful if you could make it clear when you do so.
May 27, 2009 at 5:37 pm
In reply to Robert:
All multi-threaded analysis requires examining what can be done by other threads, with interference going both ways. By not marking variables lockfree, you open up the possibility of lock-free and non-lock-free use of the same variable. That has to be a recipe for disaster. I like the idea of data segregation. It may help flag misuse and also aids code analysis (by people).
May 27, 2009 at 6:03 pm
Thanks Jason. You’re completely right. I had mis-read the article and was a bit confused by the multiple, different meanings of the lockfree keyword in the article.
June 2, 2009 at 2:18 pm
[…] system. I presented the bird’s eye view of the system and provided a few teasers in my previous post. The design is based on published papers (see bibliography at the end). My contribution was to […]
June 23, 2009 at 10:39 am
[…] June 23, 2009 Multithreading Tutorial: Globals Posted by Bartosz Milewski under C++, Concurrency, D Programming Language, Java, Multithreading, Programming, Type System Leave a Comment If it weren’t for the multitude of opportunities to shoot yourself in the foot, multithreaded programming would be easy. I’m going to discuss some of these “opportunities” in relation to global variables. I’ll talk about general issues and discuss the ways compilers can detect them. In particular, I’ll show the protections provided by my proposed extensions to the type system. […]
July 7, 2009 at 3:38 pm
[…] you’ve been following my blog, you know that I’m working on a type system that eliminates races. If I can convince Walter and Andrei, this system might be implemented in the D programming […]
July 16, 2009 at 11:28 am
[…] D programming language with my proposed race-free type system could dramatically improve the safety of message passing. Race-free type system distinguishes […]
August 2, 2010 at 12:03 pm
[…] However, no matter what synchronization mechanism are used (including STM), if the language doesn’t enforce their use, programmers end up with data races–the bane of concurrent programming. The time spent debugging racy programs may significantly cut into, or even nullify, potential productivity gains. Fortress is the only language that attempted to keep track of which data is shared (and, therefore, requires synchronization), and which is local. None of the HPCS languages tried to tie sharing and synchronization to the type system in the way it is done, for instance, in the D programming language (see also my posts about race-free multithreading). […]