Building the Right Environment to Support AI, Machine Learning and Deep Learning
Why Too Many Threads Hurts Performance, and What to do About It.
Arch D. Robison, Intel Corporation
Too Many Threads
Threading is the current method of choice for extracting performance from multi-core chips. It might seem that if a little threading is good, then a lot must be better. In fact, having too many threads can bog down a program. This article discusses why and how task-based programming avoids the problem. The Intel® Threading Building Blocks (Intel® TBB) task scheduler serves as an example.
It is important to distinguish software threads from hardware threads. Software threads are the threads that programs create. Hardware threads are real physical resources. There may be one hardware thread per core on the chip, or more, as for example with Intel Hyper-Threading Technology.
When there are more software threads than hardware threads, the operating system typically resorts to round robin scheduling. Each software thread gets a short turn, called a time slice, to run on a hardware thread. When the time slice runs out, the scheduler suspends the thread and allows the next thread waiting its turn to run on the hardware thread.
Time slicing ensures that all software threads make some progress. Otherwise, some software threads might hog all the hardware threads and starve other software threads. However, fair distribution of hardware threads incurs overhead. There are several kinds of overhead, and it helps to know the culprits so you can spot them when they appear.
The most obvious overhead is saving register state of a thread when suspending it, and restoring the state when resuming it. You might be surprised how much state there is on modern processors. However, schedulers typically allocate big enough time slices so that the save/restore overheads are insignificant, so this obvious overhead is in fact not much of a concern.
A more subtle but significant overhead of time slicing is saving and restoring a thread's cache state, which can be megabytes. Modern processors rely heavily on cache memory, which can be about 10 to 100 times faster than main memory. Accesses that hit in cache are not only much faster; they also consume no bandwidth from the memory bus. Caches are fast, but finite. When the cache is full, a processor must evict data from the cache to make room for new data. Typically, the choice for eviction is the least recently used data, which is typically data from an earlier time slice. Thus software threads tend to evict each other's data, and the cache fighting from too many threads can hurt performance.
A similar overhead, at a different level, is thrashing virtual memory. Most computers use virtual memory. Virtual memory resides on disk, and the frequently used portions are kept in real memory. Similar to caches, the least recently used data is evicted from memory to disk when necessary to make room. Each software thread requires virtual memory for its stack and private data structures. As with caches, time slicing causes threads to fight each other for real memory and thus hurts performance. In extreme cases, there can be so many threads that the program runs out of even virtual memory.
Another problem arises when a time slice expires for a thread holding a lock. All threads waiting for the lock must now wait for the holding thread to get another time slice and release the lock. The problem is even worse if the lock implementation is fair, in which the lock is acquired in first-come first-served order. If a waiting thread is suspended, then all threads waiting behind it are blocked from acquiring the lock. It's like having someone fall asleep in a check-out line. The more software threads there are without hardware threads to run them, the more likely this will become a problem.
A good solution is to limit the number of runnable threads to the number of hardware threads, and possibly limit it to the number of outer-level caches if cache contention is a problem. Because target platforms vary in the number of hardware threads, avoid hard-coding your program to a fixed number of threads. Let your program's degree of threading adapt to the hardware.
Runnable threads, not blocked threads, cause time-slicing overhead. When a thread blocks on an external event, such as a mouse click or disk I/O request, the operating system takes it off the round-robin schedule, so the thread no longer incurs time-slicing overhead. A program may have many more software threads than hardware threads, and still run efficiently if most of those software threads are blocked.
A helpful organizing principle is to separate compute threads from I/O threads. Compute threads should be the threads that are runnable most of the time, and ideally never block on external events. The number of compute threads should match the processor resources. The I/O threads are threads that wait on external events most of the time, and thus do not contribute to having too many threads.
Because the most efficient number of compute threads depends upon the particular hardware, programming in terms of threads can be a poor way to do multithreaded programming. It is often better to formulate your program in terms of logical tasks, not threads, and let a task scheduler take care of mapping the tasks onto threads. The rest of this article will use the Intel® TBB tasks as an example.
The key advantage of tasks versus logical threads is that tasks are much lighter weight than logical threads. On Linux, starting and terminating an Intel® TBB task is about 18 times faster than starting and terminating a thread. On Windows, the ratio is more than 100. This is because a thread has its own copy of many resources, such as register state and a stack. On Linux, a thread even has its own process id. A task, in contrast, is typically a small routine that cannot be preempted at the task level. It can be preempted only by preempting the software thread running it.
Another improvement is unfair scheduling. As mentioned earlier, thread schedulers typically distribute time slices fairly because it is the safest strategy without understanding the higher-level organization of a program. In task-based programming, the task scheduler does have some higher-level information, and so can sacrifice fairness for efficiency. Indeed, it often goes to the extreme of not even starting tasks until they can make useful progress, in order to reduce memory consumption.
The scheduler does load balancing; that is, spreading the work across threads so that they are kept busy. Good load balancing can be tricky, because subtle cache, paging, and interrupt effects may cause some threads to finish earlier than others, even when apparent equal pieces of work were handed out. In task-based programming, you break your program into many small tasks, and let the scheduler issue tasks to threads to keep them busy.
The big win from using tasks instead of threads is easier programming. Thread-based programming forces you to think at the low level of hardware threads to get good efficiency, because you need one runnable software thread per hardware thread, not too few or too many. You also have to deal with the relatively coarse grain of threads. With tasks, you can concentrate the logical dependences between tasks, and leave the efficient scheduling to the scheduler.
Example: Summing A Tree
We'll use summing values over a tree as an example, because it involves a common recursive pattern that demonstrates the fundamentals of a task library. If you are not a fan of recursion, do not despair. Intel® TBB has high-level algorithm templates that hide the recursion and let you take an iterative view. For example, the library template parallel_for does parallel iteration, and the template parallel_reduce does reductions like summation. Both of these work over generic iteration spaces. This article, however, looks "under the hood" at the task scheduler that powers these templates, because understanding the task scheduler lets you deal with problems beyond the algorithm templates, or even write your own algorithm templates.
Listing 1 shows serial code for recursively summing over a tree. Field node_count is unused, but declared because it is necessary in the parallel version. Listing 2 shows the parallel code. It is relatively large compared to serial_sum_tree because it expresses parallelism without the help of any linguistic extensions to standard C++. Not depending on language extensions simplifies integration into existing production environments.
The top-level routine parallel_sum_tree in Listing 2 performs three actions:
- Allocate space for the task that will process the root, using an overloaded new operator and method task::allocate_root, both provided by the library. Task objects must be allocated by overloaded new operators provided by the library so that the space can be rapidly recycled when the task completes.
- Constructed the task using constructor sum_task(root,&sum). When the task is run in step 3, it will store the sum of the (sub)tree rooted at root into *sum.
- Start and run the task to completion by invoking task::spawn_root_and_wait.
The real work is inside class sum_task, which is derived from the base class task provided by Intel® TBB. Fields n and sum respectively hold the input value and pointer to the output. These are copies of the arguments passed to the constructor for sum_task. Method execute does the actual computation. It overrides a pure virtual method task::execute. The scheduler executes a task by invoking its execute method.
Method sum_task::execute() operates as follows:
- Check if the tree is so small that serial execution would be faster. If so, use serial_sum_tree from Listing 1.
- Otherwise, create a child task for each non-null subtree, using an inherited method allocate_child() and an overloaded operator new. Put each child on a list.
- Call set_ref_count to indicate the number of children created, plus one for the wait to be done. The task scheduler uses a very light-weight synchronization mechanism that atomically decrements a reference count when each child finishes or for a wait.
- Call spawn_and_wait_for_all to spawn the child tasks and wait for them to complete.
- Store the final sum in *sum.
- Return, which implicitly causes the scheduler to destroy and deallocate the task object. In this example, the return value is NULL. In more sophisticated uses, it is a pointer to the next task to run.
Step 1, using a serial algorithm for a small problem is common in parallel programming. Even though tasks are lighter weight than threads, they still have some overhead compared to functions, and thus for small problems using the serial function is faster. Finding the ideal threshold for serial execution usually requires some experimentation. A lower threshold creates more tasks, and thus more potential parallelism. But making the tasks too small incurs excessive overhead from task management. In a program that is going to generate far more tasks than there are threads, it does not hurt to set the threshold somewhat too high, because there will still be enough potential parallelism to keep all hardware threads busy.
At first glance, the parallelism in Listing 2 might appear to be limited, because the task creates at most two child tasks. The trick here is recursive parallelism. The child tasks each create more child tasks, and so on, until small subtrees are reached. If each task creates two child tasks, then the Nth level of recursion creates 2N child tasks. That offers plenty of potential parallelism to go around.
The trick is efficiently using the potential parallelism. A poorly structured task pool can be a performance killer. For starters, the pool can become a centralized source of contention. Furthermore, the pool's structure can strongly effect performance. Let's look at two extremes to see the effects.
One extreme is making the pool a first-in first-out queue, which maximizes parallelism, because execution will tend to traverse the tree breadth-first as shown in Figure 1. As execution walks each level of the tree, it doubles the number of available tasks. The drawback is that it can thrash cache or virtual memory, because at some point, there simultaneously exists a task for every node in the tree! It's self-defeating overkill, because we need only enough parallelism to keep the hardware threads busy.
The other extreme is to make the pool a last-in first-out stack. Then execution will tend to traverse the tree depth-first as shown in Figure 2. Now the space is proportional to the depth of the tree. Furthermore, the cache behavior on a single thread is usually quite good, because the child is typically working on data that was already pulled into cache by the parent. The drawback is that the parallelism is minimized. Worse yet, multiple threads tend to get in each other's way because each will grab tasks most recently created by other threads, causing cache traffic between the threads.