CodeGuru Forums -
CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic Newsletters VB Forums Developer.com


Newest CodeGuru.com Articles:

  • Deploying Windows Server 2008 with System Center
  • Remote Desktop Protocol Performance Improvements in Windows Server 2008 R2 and Windows 7
  • The Microsoft Dynamics CRM Security Model
  • SQL Server Modeling Services with Microsoft Visual Studio 2010 Beta 2

  • Search CodeGuru:
     



    Go Back   CodeGuru Forums > Visual C++ & C++ Programming > C++ (Non Visual C++ Issues)
    FAQ Members List Calendar Search Today's Posts Mark Forums Read

    C++ (Non Visual C++ Issues) Ask or answer C and C++ questions not related to Visual C++. This includes Console programming, Linux programming, or general ANSI C++.

    Reply
     
    Thread Tools Search this Thread Rate Thread Display Modes
      #1    
    Old October 3rd, 2009, 08:37 PM
    frostytrees frostytrees is offline
    Junior Member
     
    Join Date: Oct 2009
    Posts: 6
    frostytrees is an unknown quantity at this point (<10)
    2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel code

    I am writing a multithreaded Random Forest classifier (for machine learning). This is an embarrassingly parallel algorithm. The basic idea is to grow a bunch of decision trees, but each tree can be grown completely in parallel (requiring only reads from, and not writes to, a single data structure containing all of the examples in the classification problem). So I took my basic unthreaded code, and split off 2 pthreads, each of which grows half of the trees. For example, if I want 100 trees, I make 2 threads, each of which makes 50 trees. There is only one mutex in the entire program, and each thread only uses it once when it is done making its trees, and needs to copy them back into the main list of all trees. I am running this on a dual processor 2.53 GhZ Intel Core 2 Duo Macbook Pro, OS 10.5.7. If I just spawn one pthread, it is a lot slower than the original unthreaded code. If I spawn 2 pthreads, it uses 195% CPU (i.e. most of both processors), and the total runtime is decreased, but it is still slightly slower than not using any threads at all, and only taking advantage of one processor. I have run the program through a profiler, and it seems that a almost 50% of the time is being spent in two functions:

    __semwait_signal
    __spin_lock

    The basic structure of the code is that I just create some threads using pthread_create, then immediately wait for them to finish with pthread_join.

    If anyone can help suggest how I can get a better speedup out of multithreading, please let me know.
    Attached Files
    File Type: h pThreadedRandomForest.h (4.4 KB, 12 views)
    File Type: h pRandomForest.h (2.8 KB, 11 views)
    Reply With Quote
      #2    
    Old October 3rd, 2009, 08:54 PM
    Codeplug's Avatar
    Codeplug Codeplug is offline
    Senior Member
     
    Join Date: Nov 2003
    Posts: 1,235
    Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    On some systems, when you link with a threading library, this can automatically bring in thread-safe versions of standard functions like malloc/new.

    >> it seems that a almost 50% of the time is being spent in two functions:
    What's higher up on the call-stack? I'm guessing that you'll find it's malloc/new.

    gg
    Reply With Quote
      #3    
    Old October 3rd, 2009, 10:59 PM
    DreamShore DreamShore is offline
    Member
     
    Join Date: May 2008
    Posts: 295
    DreamShore will become famous soon enough (50+)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    Wait on a spinlock will not release the thread, that is when you are waiting on a spinlock, the thread is still eating the cpu time, and hopefully all of it. That's why you see the cpu is at full power and the work is done slowly.

    Excessive synchronizations will simply make you multithreaded program actually single-threaded (if not worse). And even it works single-threaded, synchronizing actions are still expensive.
    __________________
    No emotions. Your English can't afford it.

    Just try to be of help, ignore everything else.
    Reply With Quote
      #4    
    Old October 4th, 2009, 12:29 AM
    frostytrees frostytrees is offline
    Junior Member
     
    Join Date: Oct 2009
    Posts: 6
    frostytrees is an unknown quantity at this point (<10)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    Quote:
    Originally Posted by Codeplug View Post
    On some systems, when you link with a threading library, this can automatically bring in thread-safe versions of standard functions like malloc/new.

    >> it seems that a almost 50% of the time is being spent in two functions:
    What's higher up on the call-stack? I'm guessing that you'll find it's malloc/new.

    gg
    I have both the original unthreaded and the new threaded versions compiled in the same program, so whatever version of malloc is being linked, both versions used the same one. So I don't think that accounts for the difference.

    For __semwait_signal, one level up on the call stack is pthread_join. 31.8% of the execution time is spend in __semait_signal here, i.e. it seems like that is the overhead for waiting for the threads to finish. __spin_lock (18.4% of the execution time) looks like it is being called by various malloc and free functions (malloc_zone_malloc, szone_malloc, szone_free, etc).

    Quote:
    Originally Posted by DreamShore View Post
    Excessive synchronizations will simply make you multithreaded program actually single-threaded (if not worse). And even it works single-threaded, synchronizing actions are still expensive.
    I'm not really doing much synchronization (i.e. locking / unlocking of mutexes). Suppose there are 2 threads, and each is making 500 trees. They will each be running totally independently, without ever writing to the same data, for probably 10 seconds. Then at the end, they each need a mutex to write some data to a buffer, but that only happens once. However, it is possible that the STL containers (mostly vector), and calls to new / delete that I am making are causing locks.

    Overall, it looks like the most time is being spent in pthread_join, so any suggestions on how to speed that up would be appreciated. I was wondering if priority or like that could be used...
    Reply With Quote
      #5    
    Old October 4th, 2009, 02:15 PM
    Codeplug's Avatar
    Codeplug Codeplug is offline
    Senior Member
     
    Join Date: Nov 2003
    Posts: 1,235
    Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    >> it looks like the most time is being spent in pthread_join
    Well, pthread_join should (and most likely does) wait efficiently for the given thread to exit. Since the main thread is just creating and joining with worker threads, it isn't surprising that the main thread is spending a lot of time in pthread_join.

    gg
    Reply With Quote
      #6    
    Old October 4th, 2009, 02:35 PM
    frostytrees frostytrees is offline
    Junior Member
     
    Join Date: Oct 2009
    Posts: 6
    frostytrees is an unknown quantity at this point (<10)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    Quote:
    Originally Posted by Codeplug View Post
    >> it looks like the most time is being spent in pthread_join
    Well, pthread_join should (and most likely does) wait efficiently for the given thread to exit. Since the main thread is just creating and joining with worker threads, it isn't surprising that the main thread is spending a lot of time in pthread_join.

    gg
    So basically, if I spawn 2 worker threads, there are a total of 3 threads (including the original, main thread). The problem is, the main thread isn't doing anything other than waiting for the 2 worker threads to finish, but it is still being given roughly 1/3 of the processor time. This doesn't really seem like an efficient way to wait for the workers to finish. Is there some way of giving the main thread lower priority, or having it sleep so it doesn't hog the processor? I tried shed_yield(), but that didn't seem to do much.
    Thanks...
    Reply With Quote
      #7    
    Old October 4th, 2009, 03:24 PM
    frostytrees frostytrees is offline
    Junior Member
     
    Join Date: Oct 2009
    Posts: 6
    frostytrees is an unknown quantity at this point (<10)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    I finally got a slight performance increase out of using 2 processors! The key observation was that all of the active threads get a roughly equal amount of time. So rather than having 2 threads working and 1 waiting, it is better to just have 2 threads working. My old code was basically:

    for t = 1 to n:
    spawn a thread running function foo()

    for t = 1 to n:
    wait for thread t to finish (pthread_join)


    my new code is now:

    for t = 2 to n:
    spawn a thread running function foo()

    start running a copy of foo() in the current thread

    for t = 2 to n:
    wait for thread t to finish (pthread_join)


    this way, the original, main thread does work too. profiling the code reveals that this change has taken calls to __semwait_signal (which were 31.8% of the runtime) completely out of the picture.

    __spin_lock is still taking up a lot of time, but it seems to be called from malloc / free, so this is really an indication that the program would benefit from further optimization of memory use.

    at this point, with 2 processors, the speedup is roughly 1.25 (i.e. (runtime original no threads) / (runtime 2 threads) = 1.25). it will be interesting to see how the speedup ratio changes with more processors.
    Attached Files
    File Type: h pThreadedRandomForest2.h (4.6 KB, 6 views)
    Reply With Quote
      #8    
    Old October 4th, 2009, 03:36 PM
    frostytrees frostytrees is offline
    Junior Member
     
    Join Date: Oct 2009
    Posts: 6
    frostytrees is an unknown quantity at this point (<10)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    after some further testing, it seems that the speedup depends on the size of the data set (i.e. number of examples in the classification problem). on the smaller data sets I have tested, (100s of examples), there is no speedup, or possibly a small amount of slowdown. on the largest data set I tested (4177 examples), the speedup is 1.5 with 2 processors.
    Reply With Quote
      #9    
    Old October 4th, 2009, 08:15 PM
    Lindley Lindley is online now
    Elite Member
    Power Poster
     
    Join Date: Oct 2007
    Location: Fairfax, VA
    Posts: 6,905
    Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    Quote:
    Originally Posted by frostytrees View Post
    So basically, if I spawn 2 worker threads, there are a total of 3 threads (including the original, main thread). The problem is, the main thread isn't doing anything other than waiting for the 2 worker threads to finish, but it is still being given roughly 1/3 of the processor time. This doesn't really seem like an efficient way to wait for the workers to finish. Is there some way of giving the main thread lower priority, or having it sleep so it doesn't hog the processor? I tried shed_yield(), but that didn't seem to do much.
    Thanks...
    You should never call sched_yield() yourself. If you think about it long enough the reason why becomes obvious----there's no good reason to give up the processor if you're not waiting for something specific to happen, so there's always a better function to be calling.

    If the main thread is in pthread_join, then it is *not* taking up much (or any) processor. pthread_join is more intelligent than that. How did you conclude that it was? Maybe the tool you were using was reporting clock time rather than processor time.

    It's certainly true that larger tasks benefit more from multiple cores. The more compute-bound they are, the bigger the speedup.
    Reply With Quote
      #10    
    Old October 5th, 2009, 03:48 PM
    MrViggy's Avatar
    MrViggy MrViggy is offline
    Elite Member
     
    Join Date: Feb 2002
    Posts: 3,833
    MrViggy has much to be proud of (1500+) MrViggy has much to be proud of (1500+) MrViggy has much to be proud of (1500+) MrViggy has much to be proud of (1500+) MrViggy has much to be proud of (1500+) MrViggy has much to be proud of (1500+) MrViggy has much to be proud of (1500+) MrViggy has much to be proud of (1500+) MrViggy has much to be proud of (1500+) MrViggy has much to be proud of (1500+) MrViggy has much to be proud of (1500+)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    Quote:
    Originally Posted by frostytrees View Post
    after some further testing, it seems that the speedup depends on the size of the data set (i.e. number of examples in the classification problem). on the smaller data sets I have tested, (100s of examples), there is no speedup, or possibly a small amount of slowdown. on the largest data set I tested (4177 examples), the speedup is 1.5 with 2 processors.
    Yep. This is somewhat typical. Smaller datasets run faster, due to the small amount of data. As your dataset gets smaller still, you run into the extra overhead of creating the multiple threads, etc., and that's why the single threaded version will run faster. However, the runtime should be insignificant for small datasets.

    Viggy
    Reply With Quote
      #11    
    Old October 6th, 2009, 06:30 PM
    frostytrees frostytrees is offline
    Junior Member
     
    Join Date: Oct 2009
    Posts: 6
    frostytrees is an unknown quantity at this point (<10)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    Quote:
    Originally Posted by Lindley View Post

    If the main thread is in pthread_join, then it is *not* taking up much (or any) processor. pthread_join is more intelligent than that. How did you conclude that it was? Maybe the tool you were using was reporting clock time rather than processor time.
    I ran the (Shark) profiling tool included in Xcode. It indicated that roughly 1/3 of the (wall) time was being spent in __semwait_signal, and one level up in the call stack was pthread_join. After I restructured the code (see post above), so that pthread_join wasn't idly spinning in its own thread, the program because significantly faster, and the profiling no longer showed any time in __semwait_signal. The obvious conclusion is that in the original 3 thread code, equal time was being given to each thread, including the one doing nothing but sitting in pthread_join.

    Note: there are some reports that pthreads are unusually slow in Mac OS X, so the terrible performance in this case may be a quirk of this OS that does not apply to other Linux distros.
    Reply With Quote
      #12    
    Old October 6th, 2009, 09:23 PM
    Codeplug's Avatar
    Codeplug Codeplug is offline
    Senior Member
     
    Join Date: Nov 2003
    Posts: 1,235
    Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+) Codeplug is a glorious beacon of light (400+)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    >> so that pthread_join wasn't idly spinning in its own thread
    It's not "spinning". It isn't using any CPU at all. The OS will schedule the thread to run when it needs to (presumably with the "sem" is "signal'd").

    Wall time != CPU time

    gg
    Reply With Quote
      #13    
    Old October 6th, 2009, 11:35 PM
    Lindley Lindley is online now
    Elite Member
    Power Poster
     
    Join Date: Oct 2007
    Location: Fairfax, VA
    Posts: 6,905
    Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+) Lindley is a name known to all (1000+)
    Re: 2 pthreads on dual CPU slower than non-threaded code for embarrassingly parallel

    Think about it like this. You've told the main thread to wait for both of the worker threads, right? That means the main thread can't possibly exit before the longer of the two workers. Thus there would be no way you could ever expect the main thread to use less than a third of the total wall time----it has to take up at least as much wall time as the program takes to execute!

    CPU time is unrelated.
    Reply With Quote
    Reply

    Bookmarks
    Go Back   CodeGuru Forums > Visual C++ & C++ Programming > C++ (Non Visual C++ Issues)


    Thread Tools Search this Thread
    Search this Thread:

    Advanced Search
    Display Modes Rate This Thread
    Rate This Thread:

    Posting Rules
    You may not post new threads
    You may not post replies
    You may not post attachments
    You may not edit your posts

    BB code is On
    Smilies are On
    [IMG] code is On
    HTML code is Off
    Forum Jump


    All times are GMT -5. The time now is 01:57 PM.



    Acceptable Use Policy


    The Network for Technology Professionals

    Search:

    About Internet.com

    Legal Notices, Licensing, Permissions, Privacy Policy.
    Advertise | Newsletters | E-mail Offers


    Powered by vBulletin® Version 3.7.3
    Copyright ©2000 - 2010, Jelsoft Enterprises Ltd.