// JP opened flex table

Click to See Complete Forum and Search --> : array reduction


caleb705
April 18th, 2007, 03:47 PM
I am writing a program using openmp, and I am trying to reduce the time it takes to run the program, I parallelized it and ran it but it actually increased the time to run it. I am guessing this is due to the fact that each thread is writing to the same shared variable (array). I am wondering if there is some way to somehow get around this. I am thinking maybe there is some way to that I could create a different array for each thread, and then somehow reduce them into the one main array at the end of the parallel section. I don't know how to do this though. Also I was thinking if maybe I could declare the array as private, hence having each thread writing to their own version, and then bringing them together by keeping only the desired values.

Any help would be appreciated.

JVene
April 18th, 2007, 06:20 PM
Some code example would be helpful.

I haven't looked under the hood to see how OpenMP initiates threads and handles synchronization, but consider this outline - it's likely in the right ballpark.

First, the top of loop is reached, the underlying assembler prepares the 'split' of the algorithm, and must schedule a thread to execute one portion while the foreground executes the other.

If the thread were initiated at program launch, then it's probably waiting on a signal. When that signal is sent, the delay between that moment and the time the background thread initiates is up to the scheduler of the operating system. Under Windows 2000, for example, that's somewhere around a few hundred microseconds.

The foreground thread begins it's portion, the background thread may not have started before the foreground is some portion through it's part of the job. Once the background thread finally begins, the foreground is likely to have completed first. It will then have to wait, because code is otherwise written as though this loop blocks execution until it is entirely completed. The foreground probably waits on a signal, too - and again, if it does wait, it's release will occur at a time associated with the speed of the operating system's task switcher.

If the loop is short, this synchronization can interfere. If your structure is such that you're calling many short loops in a row, you'll probably gain nothing, and might loose performance.

If that's not the case, you can - and probably do - end up in a memory bound 'data pump' limit. This entirely depends on what your loop is doing, and how quickly it moves from item to item in a list.

You can reach a point where the functions perform within the loop are performed so quickly that the timing of the loop is reduced entirely to the speed of RAM, in which case there's virtually nothing you can do unless the underlying hardware offers you an option. The AMD Opteron, with it's DDR2 RAM might have more headroom than, say, a P4, in this regard.

I'd need to see what your loop is doing to help more on that point.

However, you can gain performance somewhat if the elements upon which the loop operates are separated over RAM, but the nature of intelligent cache within the CPU has enough power to overcome that problem that in my experience it's rarely beneficial to more than 10% gain.

The more work performed within each iteration of the loop, the more gain (up to 100% per additional core) you can achieve through threading.

A quick example of two cases:

Conversion of a bitmap from one color format to another, say RGB to BGRA. The interior loop operates so fast that virtually all various constructions of the loop amount to a RAM data pump limit of speed, and threading under the best conditions only managed to match the single thread performance. It didn't matter if the data was handled one byte at a time (moving R as a byte, G as a byte, B as a byte ) - or if a loop were constructed to read in a word from the source (3 color values from one pixel, 1 color value from another, then then next byte bringing in 2 color values from adjacent pixels, then the 3rd byte bringing in the final). Both were exactly the same timing, and threading usually resulted in minor losses.

The alternate example - alpha blending. Here, the work performed inside each loop involved a multiplication (by alpha), an addition by a rounding value, a psuedo division (we choose to shift and accept a small error instead of division by 255, we used division by 256 with a test for zero or 255 alpha values). This was repeated on each color value, then the output supplied.

In this case, threading resulted in a gain of about 50% over a non-threaded version, and it didn't matter if the threads each worked on alternating lines or if the threads divided the bitmap in half (widely separated data or close data) - the gain was about the same. On faster RAM machines the gain was closer to 85%.

Part of the issue of constructing the threading algorithm was the same basic coordination steps I outlined above - dividing the job, signaling a waiting thread to take it's portion and go, then coordinating the conclusion to act as a single blocking call with respect to application code.

For that I constructed a callback object. The base has an event, and a bool indicating 'finished'. Finished is defaulted to true - an important point.

This object's derived class is used to process the background call. The thread is 'captured' as it were - initialized. If for any reason the thread object can't take the call, the 'capture thread' function returns a false, indicating the foreground should process the entire job. If returning true, the foreground adjusts itself to operate on 'the other half' of the job.

At that point, the background thread has been signaled to go, and this callback object indicates what function is to be performed, along with the parameters which describe it's portion of the job.

When the foreground thread, having moved on to work on it's half of the job (in a two CPU arrangement) - the callback object falls out of scope.

At that point, the destructor checks the 'finished' bool. When the thread accepted the call (by returning true from the capture thread function), it set this value to false. If the destructor sees false, it waits on a signal until it's true. The background thread's exit function sets this back to true and signals the event, releasing the foreground in a synchronized release.

It's been useful to give the background 'a little less to do' - to increase the probability that the background will finish just about the time the foreground does, so a wait is not even necessary. You can't guarantee this, since you have no control over what the processors might have to do, but increasing the probability over time means less waiting, better synchronization.

caleb705
April 18th, 2007, 10:34 PM
here is the sequential code
for (int i=0; i<size; i++)
{
sum1=0;
int j;
j=i+1;
if (j<size && j%I1!=0)
{
sum1 = sum1 -X[j];
}
j=i+I1;
if (j<size && j%(I1*I1) >= (I1))
{
sum1=sum1-X[j];
}
j=i+I1*I1;
if (j<size)
{
sum1=sum1-X[j];
}
j=i-1;
if (j>=0 && j%I1!=(I1-1))
{
sum1=sum1-X[j];
}
j=i-I1;
if (j >=0 && j%(I1*I1) < (I1*I1-I1))
{
sum1=sum1-X[j];
}
j=i-I1*I1;
if (j >=0)
{
sum1=sum1-X[j];
}

T[i] = (rhs - sum1)/6;
}
error[0] = LIN(T,X,size);
where size is somewhere around 17 million. I am pretty sure I have figured it out, with one small exception. I have broken the T into several smaller arrays, and I am recompiling the arrays back into the original later. I am doing this by declaring each of the smaller arrays as lastprivate variables, which should make it faster, but every time I try to access those arrays, I get a segmentation fault.

more code
for (int i=0; i<size2; i++)
{
sum1=0;
int j;
j=i+1;
if (j<size && j%I1!=0)
{
sum1 = sum1 -X[j];
}
j=i+I1;
if (j<size && j%(I1*I1) >= (I1))
{
sum1=sum1-X[j];
}
j=i+I1*I1;
if (j<size)
{
sum1=sum1-X[j];
}
j=i-1;
if (j>=0 && j%I1!=(I1-1))
{
sum1=sum1-X[j];
}
j=i-I1;
if (j >=0 && j%(I1*I1) < (I1*I1-I1))
{
sum1=sum1-X[j];
}
j=i-I1*I1;
if (j >=0)
{
sum1=sum1-X[j];
}
j=i;
T1[j] = (rhs - sum1)/6;
}
error[0] = LIN(T1,X1,size2);
this is one of the parallel sections, T1 is declared shared within the parallel statement and then as a lastprivate in the sections statement. It works fine without the lastprivate statement, but that makes it much to slow.

JVene
April 18th, 2007, 11:31 PM
I'm only vaguely familiar with OpenMP, and I'm not sure lastprivate does what you need.

The code snippets are helpful to assist, but frankly I don't have enough to actually try your algorithm here to assist, or even quite understand the result you require.

You say, for example, that size is around 17 million. This implies you expect a single array of 17 million entries as the result. Do I misunderstand?

If I better understood X[], I1, and rhs, for example, I might have enough from this snippet to try the code, even reply with an alternative tuned for performance.

caleb705
April 19th, 2007, 08:41 AM
I1, and rhs are just constants that are determined at run time. They are read from an input file. X[] is an array that is the same size as T[], later on in the code X[] will get replaced by T[], at which time this function will be called again. This is basically a completely stripped down version of the Jacobi iterative method. LIN is just a function to determine the maximum difference between X and T. Hence determining when to exit the code.

//JP added flex table