// JP opened flex table

Click to See Complete Forum and Search --> : Two threads running the same code at different speeds


MarkRMSheppard
October 3rd, 2007, 09:46 PM
I have a processor intensive application that breaks up it's work into different scenarios that are processed on two threads to make maximum use of the processor - a Pentium D (which has a dual core).

However, I'm finding that sometimes the two threads run at vastly different speeds (i.e. one runs ten times faster than the other) even though there is nothing else running on the PC and, viewing the process CPU usage on Task Manager, it is using 100% of the CPU. Sometimes it works fine and both threads run at full speed, and sometimes both threads will run slowly. I usually have to stop and restart the application several times until I (magically) get it running at full speed on both threads.

Both threads are running different instances of exactly the same code.

Does anyone have any idea what might be causing this behaviour and how I can get both threads to run at full speed every time? I'm using VC++ 2005 on a Win2K PC.

I suspect it's all to do with the processor's cache, but I would expect this to right itself as more of the cache becomes available, but it never does. Once a thread starts off running slowly it never recovers. The cache isn't too small to contain two instances because sometimes it works (with both threads executing at top speed).

Thanks in advance.

JVene
October 3rd, 2007, 11:49 PM
The possibilities are near endless without some code example.

I've written threaded code for years, and this doesn't usually happen, but there are exceptional reasons where it certainly can. 100% utilization just can't happen unless both threads are active (that is, if a thread were really stuck, the best that could be achieved on a dual core system would be 50% - translating into 100% of one core while the other is idle).

Your observation is either a) the process has the possibility of entering into infinite loop cycles which appear to do no work, but continue without production (and myriad other variations on that theme) or b) your machine (OS/hardware) is actually busy doing something else (errant or otherwise).

While b isn't likely, it's possible. More likely is that these threads are interfering with each other in some way. We'll need some vision to offer more.

MarkRMSheppard
October 4th, 2007, 12:19 AM
JVene,

The code is a bit extensive so it's hard to give you the relevant bits.

However, the two threads never interact and don't work with the same data - they are completely independant.

Your option b) is not happening because I keep a close eye on the CPU usage in task Manager and no other process uses any CPU time.

I'm pretty sure option a) isn't happening either because I can run the application with the same data several times and sometimes it will run fast and sometimes it will run slow (and there are no random factors in the data - the same data should result in exactly the same results every time).

For one test case I am using, after a fast run the total CPU time showing on Task Manager at the end is 00:01:08. After a slow run it can be anything from 00:01:26 to 00:01:48 or more.

Arjay
October 4th, 2007, 12:32 AM
What resources do these threads use to perform the work?

MarkRMSheppard
October 4th, 2007, 01:35 AM
Memory only. It uses quite a bit of memory and grabs it as lots of relatively small chunks.

Arjay
October 4th, 2007, 01:43 AM
Jvene's the guy to suggest alternative memory allocation techniques.

JVene
October 4th, 2007, 02:38 PM
Wow, thanks Arjay!


Indeed, memory allocation is where all threads must cooperate. The CRT memory allocation system is serialized. I've posted a few items around on on the subject, try 'multiplexed memory allocation' or words to that effect. Those posts suggest a generalized solution to memory allocation issues like this, but you can easily deploy one made specifically for your situation. The simple summary is this: allocate in groups, not units. If two threads attempt an allocation at the same time, one WILL lock against the other. When many allocations are made in loops, at it appears yours may be, each thread can end up waiting, in turn, against each other. If you allocate blocks of items at a time you can reduce the probability of these collisions by many orders of magnitude.

It is at the point of CRT memory allocation, if absolutely nothing else, where your threads have interaction that prepresents some random, possibly harmonic cycle, which would predict the kind of results you've detailed.

MarkRMSheppard
October 4th, 2007, 06:59 PM
Thanks JVene, I'll look up those items and see if they help. The memory allocation is what I'm focusing on now, mostly because I can't see what else could be causing it.

However, what you suggest doesn't quite ring true. All of the memory allocations occur when a thread starts working on a new "scenario" and then, after it has set itself up, no more memory allocations occur until the next scenario. So there may be contention between the threads when they both start together, but it shouldn't keep happening. The thread that keeps getting blocked should be freed up after the other thread has finished getting it's memory and has started processing the problem. I'm finding that threads can run slow even after all memory has been allocated and the threads are just doing the processing.

JVene
October 4th, 2007, 10:24 PM
Well, without more vision into what's happening, we can only be of so much help, but some of us, myself included, have over a decade of threaded development behind us, so it would be quite difficult to present a genuinely unfamiliar scenario.

Put another way, there is nothing that happens under the hood with regard to threading on dual core machines (dual CPU, quad, 8 ways, etc...) that causes threads to run at widely varying speeds like you're reporting - on their own accord or because of some intrinsic design issue with CPU/OS/Language, etc.

There are loads of things that can happen that cause it, and 99.9999% of the time that I've either had it happen, or collaborated to help those that have - a Homer Simpson moment precedes a cure.

d'Oh!

It's true that uneven RAM layout can cause cache misses to give odd performance results, and occasionally to some significant factor. If you have the means to ensure that the two processes are allocating from widely separated blocks, it may help. For example, a routine that runs through a container of objects and modifies a member variable through some simple calculation, competing with a thread that does something very similar on a different member variable, can easily run slower than if the results were being sent to widely disparate destinations. An alphablend that divides a bitmap in half, sending one thread on the upper half and the second on the lower may be faster than one that alternated every other pixel between two threads.

They key observation you've mentioned thus far is 100% CPU utilization. When threads are pending some critical section, CPU utilization can't peg 100%. That helps indicate the kind of thing you're searching for.

In the case of the multiplexed memory allocator, the fact that blocks are allocated by thread means that RAM is generally separated according to thread to some advantage (one that's tunable). This wouldn't help either of the two issues I mentioned above, but it might be a gain when the 'collision' isn't within objects but among the population of them. If the allocation happened 'together' with two threads competing, the the probability that each thread got 'every other block' from the heap is higher than if the allocation is performed in blocks. Interleaving like that offers minor performance tuning, but I don't expect it to be dramatic - yet this kind of thing is platform dependent. AMD CPU's, for example, have enjoyed high speed cache interfaces to each other's cores that gained considerably over Intel CPU's in this regard, until recent CORE architectures appeared (and I'm not entirely sure how they compare in recent examples).

If I were working on your team, I'd begin preparing test suites to analyze the problem, attempting to establish conditions by which I can agrevate the problem, smoke it out of hiding. I'd further stress test the problem as I propose solutions. If the processing CAN be broken into stages, I might create examples of those stages in isolation to see if any one can be made to exhibit the problem outside the context of the overall application.

Something like this was part of the debugging of my own multiplexed allocation system.

//JP added flex table