| CodeGuru Home | VC++ / MFC / C++ | .NET / C# | Visual Basic | Newsletters | VB Forums | Developer.com |
|
|||||||
| C++ (Non Visual C++ Issues) Ask or answer C and C++ questions not related to Visual C++. This includes Console programming, Linux programming, or general ANSI C++. |
![]() |
|
|
Thread Tools | Search this Thread | Rate Thread | Display Modes |
|
#1
|
|||
|
|||
|
2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel code
I am writing a multithreaded Random Forest classifier (for machine learning). This is an embarrassingly parallel algorithm. The basic idea is to grow a bunch of decision trees, but each tree can be grown completely in parallel (requiring only reads from, and not writes to, a single data structure containing all of the examples in the classification problem). So I took my basic unthreaded code, and split off 2 pthreads, each of which grows half of the trees. For example, if I want 100 trees, I make 2 threads, each of which makes 50 trees. There is only one mutex in the entire program, and each thread only uses it once when it is done making its trees, and needs to copy them back into the main list of all trees. I am running this on a dual processor 2.53 GhZ Intel Core 2 Duo Macbook Pro, OS 10.5.7. If I just spawn one pthread, it is a lot slower than the original unthreaded code. If I spawn 2 pthreads, it uses 195% CPU (i.e. most of both processors), and the total runtime is decreased, but it is still slightly slower than not using any threads at all, and only taking advantage of one processor. I have run the program through a profiler, and it seems that a almost 50% of the time is being spent in two functions:
__semwait_signal __spin_lock The basic structure of the code is that I just create some threads using pthread_create, then immediately wait for them to finish with pthread_join. If anyone can help suggest how I can get a better speedup out of multithreading, please let me know. |
|
#2
|
||||
|
||||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
On some systems, when you link with a threading library, this can automatically bring in thread-safe versions of standard functions like malloc/new.
>> it seems that a almost 50% of the time is being spent in two functions: What's higher up on the call-stack? I'm guessing that you'll find it's malloc/new. gg |
|
#3
|
|||
|
|||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
Wait on a spinlock will not release the thread, that is when you are waiting on a spinlock, the thread is still eating the cpu time, and hopefully all of it. That's why you see the cpu is at full power and the work is done slowly.
Excessive synchronizations will simply make you multithreaded program actually single-threaded (if not worse). And even it works single-threaded, synchronizing actions are still expensive.
__________________
No emotions. Your English can't afford it. Just try to be of help, ignore everything else. |
|
#4
|
|||
|
|||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
Quote:
For __semwait_signal, one level up on the call stack is pthread_join. 31.8% of the execution time is spend in __semait_signal here, i.e. it seems like that is the overhead for waiting for the threads to finish. __spin_lock (18.4% of the execution time) looks like it is being called by various malloc and free functions (malloc_zone_malloc, szone_malloc, szone_free, etc). Quote:
Overall, it looks like the most time is being spent in pthread_join, so any suggestions on how to speed that up would be appreciated. I was wondering if priority or like that could be used... |
|
#5
|
||||
|
||||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
>> it looks like the most time is being spent in pthread_join
Well, pthread_join should (and most likely does) wait efficiently for the given thread to exit. Since the main thread is just creating and joining with worker threads, it isn't surprising that the main thread is spending a lot of time in pthread_join. gg |
|
#6
|
|||
|
|||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
Quote:
Thanks... |
|
#7
|
|||
|
|||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
I finally got a slight performance increase out of using 2 processors! The key observation was that all of the active threads get a roughly equal amount of time. So rather than having 2 threads working and 1 waiting, it is better to just have 2 threads working. My old code was basically:
for t = 1 to n: spawn a thread running function foo() for t = 1 to n: wait for thread t to finish (pthread_join) my new code is now: for t = 2 to n: spawn a thread running function foo() start running a copy of foo() in the current thread for t = 2 to n: wait for thread t to finish (pthread_join) this way, the original, main thread does work too. profiling the code reveals that this change has taken calls to __semwait_signal (which were 31.8% of the runtime) completely out of the picture. __spin_lock is still taking up a lot of time, but it seems to be called from malloc / free, so this is really an indication that the program would benefit from further optimization of memory use. at this point, with 2 processors, the speedup is roughly 1.25 (i.e. (runtime original no threads) / (runtime 2 threads) = 1.25). it will be interesting to see how the speedup ratio changes with more processors. |
|
#8
|
|||
|
|||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
after some further testing, it seems that the speedup depends on the size of the data set (i.e. number of examples in the classification problem). on the smaller data sets I have tested, (100s of examples), there is no speedup, or possibly a small amount of slowdown. on the largest data set I tested (4177 examples), the speedup is 1.5 with 2 processors.
|
|
#9
|
|||
|
|||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
Quote:
If the main thread is in pthread_join, then it is *not* taking up much (or any) processor. pthread_join is more intelligent than that. How did you conclude that it was? Maybe the tool you were using was reporting clock time rather than processor time. It's certainly true that larger tasks benefit more from multiple cores. The more compute-bound they are, the bigger the speedup. |
|
#10
|
||||
|
||||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
Quote:
Viggy |
|
#11
|
|||
|
|||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
Quote:
Note: there are some reports that pthreads are unusually slow in Mac OS X, so the terrible performance in this case may be a quirk of this OS that does not apply to other Linux distros. |
|
#12
|
||||
|
||||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
>> so that pthread_join wasn't idly spinning in its own thread
It's not "spinning". It isn't using any CPU at all. The OS will schedule the thread to run when it needs to (presumably with the "sem" is "signal'd"). Wall time != CPU time gg |
|
#13
|
|||
|
|||
|
Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel
Think about it like this. You've told the main thread to wait for both of the worker threads, right? That means the main thread can't possibly exit before the longer of the two workers. Thus there would be no way you could ever expect the main thread to use less than a third of the total wall time----it has to take up at least as much wall time as the program takes to execute!
CPU time is unrelated. |
![]() |
| Bookmarks |
|
||||||
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|