hi, i'm basically analyzing a large amount of code trying to find parallelization opportunities (in a nutshell). I'ved edited the code to exploit the opportunties found but am having issues with joining a group of threads who all have different purposes. I'm moved away from the large amount of code and wrote my own short test, which is below(should be).
Summary of code:
I am using conditional pthreads to kick off 5 handlers. These handlers continuously loop and wait until they're called to execute again. I'm having issues b/c the code randomly is getting hung up at different points and i'm not sure if it's a result of misusing the pthreads or b/c i'm not handling the synchronization correctly. I'm currently trying to use semaphores to ensure each thread has finished before continuing on in the calling function. What is the best/most efficient way of waiting on each thread? Posix semaphores are atomic so why does the semaphore value get stuck at 9? I would appreciate some direction.
Codeplug
April 16th, 2009, 10:11 AM
Couple things to read up on:
http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_wait.html
1) mutex should be acquired prior to calling pthread_cond_wait()
2) you should have a shared boolean predicate variable in order to handle "spurious wakeups"
gg
qmyre20
April 16th, 2009, 05:47 PM
appreciate the response.
I quickly realized that i was referencing some lacking explanations for conditional variables and was able to fix the conditional variable issue. The code below handles the functionality i'm looking for but i'm still questioning if it's the most efficient way. Please comment if something could be improved...until then...i'll be digging. Thanks again.
I assume so. Here what I see is a loop that waits in a while loop until other threads have indicated all of them are completed, presumably at different times.
You could look into the concept of joinable threads. Not all OS's support them, but they are one classic solution.
However, you could reconsider if the section //do post work actually implies that such code must be executed in the main thread in this way.
That is, you could consider that the 'post work' is simply another process to be run after all 5 of the previous have completed, launched by the last thread to finish.
This means each thread would have a post processing step, a check to see if IT is the last thread to finish. All of the 5 threads perform this quick check, similar to what you're doing in the while loop.
If the thread 'discovers' it IS the last thread to finish, it would call the 'post work' function itself.
Sometimes this isn't a construction that works in your interests - it may be that your algorithm is designed so this plan doesn't work, but it can in many cases.
Indeed, there is a design patter (I forget the formal name at the moment) that deals with a series of steps to be performed, where the steps and their order are indicated at runtime (think macro programmer/player, for example). A small class library can be created to abstract this notion, so that any collection of operations can be tossed off to thread pools, and post processing requirements can be executed in a series as a feature of the library (which means you'll have a tool for doing this in the future).
One reason this plan is a good thing to consider (where the last thread to finish is the one that calls the post processing), is that it doesn't depend upon a main thread 'to be joined' when the 5 are complete - and such a requirement is usually counter to message based implementations typical of windowed operating systems. That is, you don't want to do much waiting in the 'main' thread that's usually receiving UI messages.
You can, of course, create 6 threads and have one 'drive and wait' as you require, but how high performance is this going to be?
Waiting, signaling, locking, unlocking - they take time. They're quick, yes, but not immediate. Most are dependent on the task scheduler of the OS, so even with joinable threads, there can be a delay between the completion of the last thread and the 'release' or 'firing' of your post processing step. If this isn't performance critical, it doesn't much matter, but if it does, a design which doesn't depend so much on waiting for a join is going to be much faster.
Also, you may consider managing work count with an atomic operation rather than protecting it with a mutex.
Protecting a value with a mutex is, of course, a staple. However, if all of your target OS/Chips have atomic operations, you can use them instead for a much faster/lighter method. Atomic operations are 'akin' to the same notion as protecting with a mutex, and there are only a few of them - increment and decrement, usually with some compare (against zero for example) are available. The idea with an atomic operation is that, without using a mutex, you are guaranteed that the CPU will perform the operation, to completion, without the possibility of interruption from another thread. It's the same basic notion. It's used, for example, in boost's smart pointers in management of the reference counter, when atomic operations are available. It makes the handling of the reference counter hardly any 'heavier' (or slower) than a simple increment or decrement by itself.
qmyre20
April 23rd, 2009, 02:57 PM
appreciate the input jvene. it's an understatement to say that this code is extremely scaled down in comparison to the code i'm analyzing. In saying that, i understand it more than likely limits the feedback i get compared to if i were to post the original code but the point of my research is for me to learn.
It's clear to me that the present code posting isn't very efficient but my initial goal was strictly for functionality to learn if the threading effected my original output. Now that the threading is working correctly i'm trying to obtain some degree of speed up.
I don't think that thread joins are applicable for this model b/c these threads execute x amount of times depending on input parameters. to my (limited) knowledge, the joined threads would prevent multiple executions.
This method of using conditional variables and mutex locks doubled my execution time compared to the original code by the way. It's understandably difficult to implement threading that doesn't create detrimental overhead through 50,000+ iterations of code that is inherently sequential.
Thanks again...
JVene
April 24th, 2009, 02:29 PM
You may be correct that joined threads aren't applicable. I don't use them much myself.
I've used threading for many years, long before it was considered mainstream (way before the common advent of multiple CPU and now multiple Core systems), and I've worked on a number of scenarios. It's use breaks down into a few general categories; offloading work from the GUI messaging thread, parallel execution of an otherwise, typically 'linear' algorithm, and a scheduled series of tasks. Yours seems to be in the latter category.
Based on your feedback, I assume your situation has these contributing features:
The individual tasks are short in duration (which is why the outer, driving loop affects performance - otherwise, the duration would have obviated the opportunity for the outer loop to be much of an interruption).
A collection of tasks are not interdependent in phase 1, but must be completed (and therefore are interdependent) on a phase 2 - the post processing phase.
Subsequent loops are not interdependent on previous task groups. This is conjecture on my part, but if it's true, it lends itself to a solution hinted to in my previous post.
The number of threads invoked may be tightly related to the number of available cores in performance oriented implementation. For something like GUI thread offloading, the number of cores isn't as important, what is important is simply that such processing NOT prevent the GUI thread from continuing it's service to UI messages. In performance work, however, there is very little to be gained (often lost) by running more threads than cores, but then your original design caused a 5th thread to wait until 4 partners were completed, and I assume this implies you're testing and working on a quad core machine.
If you created a queue object which represents a thread of execution, which waits for work to be handed to it (often through some wrapper representing a pointer to a member function), you can hand off work to these threads in a series of tasks. I have an old post around hear on my own QueProcessor objects you might want to search for.
If my assumption is correct, this means that phase 1 tasks for group A, which must completed before phase 2 of group A can perform the post processing, there is no interdependence between group A and the phase 1 tasks or phase 2 tasks for group B, or group C and so on through the 50,000 you mentioned are in a loop to be performed.
As such, you could use the queue processor concept to launch a series of group tasks which coordinate through a "job" object. Several jobs could be in the pipeline at once, some of group A, some of group B, perhaps a few of group C might be in the pipeline already, all the others pending.
When the last of the set of tasks in phase 1 of group A have completed, they could schedule their own post processing in the queue. It is unlikely that phase 2 MUST be completed before phase 1 of group B begins, or that any separation in time of group A's phase 2 (it's post processing) is a problematic, as long as the phase 2 executes after all of phase 1 tasks are completed. Similarly, group B's phase 1 tasks could schedule their own phase 2 task, and so on.
This means two things; first, you don't have cores sitting idle during the post processing - cores remain busy, and that means performance may be enhanced. Second, in any situation where you have waiting, the implication is that there is some kind of wait on a sync object of some kind, and while that leaves the CPU idle, the wall clock ticks by real time, doing nothing, until the OS task scheduler comes around to wake it up.
In a queue processor, there's no wait if the queue has another job already pending - it just takes the task and runs. This approach keeps the cores busy, which is part of the goal - making more use of the available cores, not simply splitting up a task among them where wait times may impact total execution speed more than expected because CPU consumption is limited due to the waits.
I'm in a hurry so I'm not reading through this - I'll reply again if this seems fuzzy.....
qmyre20
July 12th, 2010, 10:57 AM
Jvene,
I appreciate the responses even though I left this post "floating". I kind of deviated from the multithreading application b/c our advisor wanted to consider other means of optimization. I do understand the post though, I just didn't get the opportunity to extend the research to that extent(which was not my decision). Thanks again and I wish I could fast forward to your level of knowledge regarding multithreading...but that would be shortcutting the necessary work.
codeguru.com
Copyright Internet.com Inc., All Rights Reserved.