Q&A: Developing Multithreaded Applications

Introduction

Dr. Clay Breshears is the author of “The Art of Concurrency,” published by O’Reilly. Clay has been involved with parallel and concurrent programming for more than 25 years and is currently a Courseware Architect on the Intel Innovative Software Education team, specializing in multi-core and multithreaded programming and training.

Aaron Tersteeg is the community manager for the Intel Parallel Programming Community. Aaron started as a mechanical engineer (oil pump design didn’t do it for him), moved on to Information Resource Management, worked as a Business Process Consultant, Fortune 500 Webmaster, Web Application Development VP of Sales and Marketing, Data Center Product Management and now Software Development.

Breshears and Tersteeg were also joined by Dmitriy Vyukov, a developer of high-performance C/C++ software for client/server systems and network servers. In his spare time he develops innovative synchronization algorithms, programming models for multi-core processors, and verification systems for multithreaded code. He is the author of the Relacy Race Detector (RRD) tool.

Here is a sample of some of the questions and answers that appeared in the chat.

Question: I am aware of the Intel TBB library and am impressed by its feature list and claimed performance gains. However, in the face of OpenCL being more widely supported across platforms in the near future, is the TBB library still relevant?

I understand that the TBB library operates at a much higher level and scales to the hardware it is run on. However, I believe OpenCL implementations will also scale, not only to the CPU but to any hardware that supports OpenCL. This means that software written using OpenCL would automatically utilize any additional hardware (like the GPU) if present.

If I were to begin working on a new software project, why should I choose the TBB library over OpenCL?

Aaron Tersteeg: I asked Tim Mattson, one of Intel’s Parallel Programming Rock Stars, to share his thoughts on this question, and this is his reply:

Comparing OpenCL and TBB is like comparing apples and oranges. The question is not “do I need one or the other?” I need both.

TBB is great for C++ programmers working in a shared address space. It is a high level API and VERY powerful. It is an open source project, but the API is tightly controlled by a small cadre of developers. I don’t make that comment to put it down. I like TBB very much and I am really excited that it is open source software. This has made it a powerful platform for research on shared address space programming in C++.

OpenCL is for C and C++ programmers wanting to write code for heterogeneous platforms. It is VERY low level. It is extremely portable, but it achieves this by exposing the details of the underlying platform. I can write one program that runs on a CPU or a GPU, but I will be totally honest: to make this work you have to query the system for details of the target platform and adapt to them in your software. This is not the sort of thing the casual programmer can manage very well.

OpenCL is an industry standard. If you judge openness by how many organizations are actively contributing to the standard, then OpenCL is one of the most open standards out there today. Intel, AMD, Nvidia, Texas Instruments, National Labs, Electronic Arts, Apple: the list of contributors is long. The standard is young. The first commercial products supporting OpenCL only emerged in the fall of 2009. The standard is evolving steadily (1.0 is out there, 1.1 will be released soon and we are already working on 1.2 and 2.0). This evolution is critical. If you look at how aggressively many-core CPUs and GPUs are changing, you will appreciate how aggressive we have to be in evolving OpenCL.

OpenCL is important and will become even more important over time. But it addresses a different market segment than TBB. Hence, Intel is committed to both APIs and working hard to support them on our platforms.

P.S. And don’t forget OpenMP. People tend to forget that if you want a mature multithreading standard that runs just about everywhere, it’s hard to beat OpenMP. And OpenMP works well for Fortran, C, and C++.

Question: At the moment we have several parallel building blocks to make it easier for us to write parallel code. We have parallel_for, parallel_while, and so on. Can you tell us something more about what kind of building blocks are being worked on for the future?

Clay Breshears: The 2.2 version of Intel Threading Building Blocks (available now) has several new parallel algorithms. These include parallel_for_each and parallel_invoke. Support for C++0x lambda functions has also been added. There is also a fourth concurrent container: concurrent_bounded_queue.

The parallel_while was changed over to parallel_do and the need for an explicit range object has been removed from the parallel_for.

Question: Interesting. From the sound of it, it seems that quite a few of the building blocks available in ITBB match those available in the latest Visual C++ 2010 Parallel Patterns Library, like parallel_for, parallel_for_each, parallel_invoke. I was wondering if there is maybe some data available to see whether ITBB or the MS PPL is recommended in certain cases?

Clay Breshears: As for what came first or who had the idea that was copied by the other, I think we’re talking a chicken-and-egg question. In most cases, after an idea has been vetted as being something useful, lots of adoption will be seen. This is one reason we get so many movie sequels or disaster films or super-hero movies all coming out at about the same time.

I don’t have any guide as to which library should be used over the other except in limited circumstances. If you’re developing on Linux, you would want TBB. If you’re doing things with .NET, I think PPL is better suited.

That last one may not be as hard and fast as I think. I’ve heard many MS folks talk about VS 2010 and what is going to be available within the IDE; they all seemed to focus on their tools being geared toward .NET. Thus, the Intel tools and VS 2010 would complement each other and give Windows programmers a wide range of support and choice.

Question: Having written a small amount of multi-threaded code, one of the hardest things to deal with is detecting where race conditions could occur. This is especially true in code you write, believing it is thread safe, only to find out later it isn’t. Are there any general tips for analyzing and writing code to determine where race conditions could occur?

Dmitriy Vyukov: The situation is quite similar to memory leaks; just do not do chaotic ad-hoc programming. First, define a system, define components and responsibilities, then implement them.

For example, the data-race/deadlock-bulletproof pattern – encapsulated synchronization:


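// Here mutex and scoped_lock stand for any mutex type and its RAII lock
// (for example std::mutex and std::lock_guard<std::mutex> in standard C++).
class foo_t
{
private:
    mutex guard; // protects all other private state of foo_t

public:
    void bar()
    {
        scoped_lock lock (guard); // held until bar() returns

    }

    void baz()
    {
        scoped_lock lock (guard); // held until baz() returns

    }
};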

It’s trivial to implement correctly, and there is no way one can get a data-race or deadlock with it. If it’s not applicable, then there are other patterns: hierarchical locking, ordered locking, etc.
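As a minimal runnable sketch of the encapsulated-synchronization pattern, assuming C++11 and the standard std::mutex and std::lock_guard in place of the generic mutex and scoped_lock above: the class name counter_t and the helper count_concurrently are hypothetical and exist only for illustration. The pattern holds as long as no public method calls another locking method of the same object, or outside code, while the lock is held.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example class: the shared state and the mutex that
// guards it are both private, so callers cannot bypass the lock.
class counter_t
{
private:
    mutable std::mutex guard; // protects value
    long value = 0;

public:
    void increment()
    {
        std::lock_guard<std::mutex> lock(guard); // released on return
        ++value;
    }

    long get() const
    {
        std::lock_guard<std::mutex> lock(guard);
        return value;
    }
};

// Hypothetical helper: several threads increment one shared counter.
long count_concurrently(int num_threads, int increments_per_thread)
{
    assert(num_threads >= 0 && increments_per_thread >= 0);
    counter_t counter;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i)
                counter.increment();
        });
    }
    for (std::thread& w : workers)
        w.join(); // join() makes every increment visible to this thread
    return counter.get();
}
```

Calling count_concurrently(4, 10000) should return exactly 40000 on every run, because each increment happens under the private lock. On some toolchains, programs that use std::thread need to be built with -pthread.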

Just do not do “OK, here I need to lock this, and then lock that, and now probably I can unlock this, and now I need access to that object as well, so lock it too, …”.
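By contrast with that ad-hoc style, ordered locking fixes one global order in which locks may be taken. The following is a rough sketch only, assuming C++11 and assuming mutex address as the ordering criterion (any consistent order would do). The names account_t, transfer, and run_opposing_transfers are hypothetical and do not come from the discussion.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical type for illustration: each account has its own lock.
struct account_t
{
    std::mutex guard; // protects balance
    long balance = 0;
};

// Moves amount between two distinct accounts. Both locks are needed,
// and they are always taken in address order, so two threads can never
// each hold one lock while waiting for the other.
void transfer(account_t& from, account_t& to, long amount)
{
    assert(&from != &to); // locking one std::mutex twice is undefined
    account_t* first = &from;
    account_t* second = &to;
    if (std::less<account_t*>()(second, first))
        std::swap(first, second); // lower address is locked first
    std::lock_guard<std::mutex> lock_first(first->guard);
    std::lock_guard<std::mutex> lock_second(second->guard);
    from.balance -= amount;
    to.balance += amount;
}

// Hypothetical driver: two threads transfer in opposite directions,
// the classic setup for a lock-order deadlock. Returns the total money.
long run_opposing_transfers(int rounds)
{
    account_t a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

If transfer instead locked from.guard and then to.guard, the two threads in run_opposing_transfers could each take one lock and wait forever for the other. With the fixed order the call completes and the total stays at 2000.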

Regarding tooling support, indeed, there are tools for detecting data races: Helgrind, CHESS, Parallel Inspector, Relacy Race Detector, etc. They greatly simplify life.

Clay Breshears: If you think the iterations of a loop are independent, but you’re not sure, run the loop in reverse (in serial) and check to see if you get the same answer. For example, if your for-loop goes from 0 to 100, run the loop from 100 down to 0. It’s not a guaranteed method, but it can detect unsafe loops in many cases. If the problem isn’t the final result but all the intermediate computation results – like adding up a list of numbers; the total is the same, but the intermediate sums are different if done in a different order – you might need to look at the partial results along the way and figure out how those are being affected by running the loop in reverse and ultimately in some concurrent schedule.
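The reversal check can be sketched in a few lines of C++. This is an illustration under assumptions, not code from the chat: the functions squares and running_sum are hypothetical, and the reversed flag simply walks the same index range from the other end. The first loop has independent iterations; the second has a loop-carried dependence, because iteration i reads what iteration i - 1 wrote.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: each element depends only on its own index,
// so the visiting order cannot change the result.
std::vector<int> squares(std::size_t n, bool reversed)
{
    std::vector<int> out(n, 0);
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t i = reversed ? n - 1 - k : k;
        out[i] = static_cast<int>(i * i);
    }
    return out;
}

// Loop-carried dependence: iteration i reads data[i - 1], which the
// forward loop has already updated but the reversed loop has not.
std::vector<int> running_sum(std::vector<int> data, bool reversed)
{
    const std::size_t n = data.size();
    for (std::size_t k = 1; k < n; ++k) {
        std::size_t i = reversed ? n - k : k;
        data[i] += data[i - 1];
    }
    return data;
}

// Hypothetical helper: true when both loop orders agree.
bool same_both_ways(const std::vector<int>& forward,
                    const std::vector<int>& backward)
{
    return forward == backward;
}
```

Here same_both_ways(squares(100, false), squares(100, true)) holds, while running_sum on {1, 2, 3, 4} gives {1, 3, 6, 10} forward and {1, 3, 5, 7} in reverse, which flags the loop as unsafe to parallelize as written. As noted above, matching results do not prove a loop is safe.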

Of course, tools are easiest and best.
