Sequential to Parallel Code Transitioning Using Macros

Multiprocessor computers are very common nowadays, but few programs actually take advantage of them. This situation is mainly due to the difficulties of multithreaded design. There are ways, however, to increase parallelism at low cost. In this article, I'm going to describe a method for easily transitioning from sequential, single-threaded code to parallel, multi-threaded code using macros. I will illustrate the benefit of such a technique with an example: Imagine that you are writing an audio processing application. In the code, there are parts like this:

...
  UseFilterXonLeftAudioChannel(...);
  UseFilterYonRightAudioChannel(...);
...

Maybe in time you will start wondering whether it is worth spending time to make the two relatively small computation blocks run in separate threads. To make things worse, it is not clear whether the final version will include the AudioFilterX transformation at all, and more and more similar filters could turn out to be useful. There should be a way to avoid writing separate threading code for each function working on a pair of audio channels. The ideal solution in cases like this has to meet the following requirements:

  • No code repetition for each transformation (code for creating the thread, code for thread synchronization, code for protecting simultaneous write access to the common data, and so on).
  • The code should stay as clear as possible. (Thread-supporting code, with all its parameters, handles, and functions, can become very difficult to read and maintain.)
  • The code that makes execution parallel has to be easily removable in case it is not needed in the future (commenting out large blocks makes code hard to read).
  • Passing parameters to code in threads has to be transparent to the programmer.
  • The thread synchronization should be as fast as possible.
  • The code should be portable and platform independent.
  • The solution should give a syntactically simple way to express parallelism, which does not impose significant changes to existing sequential code.
  • Small overhead.

Hiding repeating code is often done with macros, so I decided to look for the solution as a trinity of macros in the following form:

FORK_BEG      // Macro 1
    UseFilterXonLeftAudioChannel();
  FORK_END    // Macro 2
    UseFilterYonRightAudioChannel();
  JOIN        // Macro 3
  • FORK_BEG: The beginning of the code in the first of the two threads that should run simultaneously.
  • FORK_END: The end of the code in the first thread and the beginning of the code in the second thread.
  • JOIN: The end of the code in the second thread and the execution point where both parallel threads must have finished before execution can continue.


  Sequential:

  ...
  Code 1
  Code 2
  ...

  Parallel:

  ...
  FORK_BEG
    Code 1
  FORK_END
    Code 2
  JOIN
  ...

The macro trinity solves more than the problem of running the same code in parallel on different pieces of data; it makes sequential code in general parallel. Let's consider some examples:

Parallel Data Acquisition
Often, processing at some point depends on different data sources obtained at earlier stages. If each computation that acquires intermediate data is done simultaneously, the whole computation becomes much faster.
FORK_BEG
  SomeProcessingResultingData1();
FORK_END
  SomeProcessingResultingData2();
JOIN
  FinalProcessing(Data1, Data2);
Asynchronous File Access
File access is slow. So, if some computation runs simultaneously instead of waiting for the file operation to finish, the overall work completes sooner.
FORK_BEG
  ReadData1FromFile();
FORK_END
  SomeProcessingResultingData2();
JOIN
  FinalProcessing(Data1, Data2);
More Than One Execution Split
In most cases, more than two processes could run in parallel. For example, in 3D modeling the final result is often acquired after many independent stages of processing, each working on a different sub-object. At the end, all sub-objects are put together into a single object. In that case, a trinity of macros should be used for each execution split.
FORK_BEG
  FORK_BEG
    ModelHands();
  FORK_END
    ModelLegs();
  JOIN
FORK_END
  FORK_BEG
    ModelHead();
  FORK_END
    ModelBody();
  JOIN
JOIN
  PutAllPartsTogether();

Let's see in detail how the macros are implemented in C++: Separate threads run in separate functions, so the code between the FORK_BEG and FORK_END macros should be separated into a function. C++ does not support local functions, but it does support local classes, so I'm going to use a local class's member function as a local function (see Listing 1).

Listing 1

#define FORK_BEG \
class HiddenCls \
{ \
  void LocalFunc() \
  {

#define FORK_END \
  } \
};

I separated the code from its surrounding code, but I cannot call it from independent threads yet because it does not have an address that could be used. Member functions of local classes are implicitly inline, and the local class type cannot be referred to from outside the enclosing function, so the function cannot be activated by address, which is essential for starting the separated code in a thread. To avoid that problem, I use a static member function, which is not instance dependent, defined in a virtual base class. The static function can call the local function through a "this" pointer passed as a parameter (see Listing 2).

Listing 2

class HiddenBaseCls
{
  public:
    static void StaticFunc(HiddenBaseCls* pThis)
                          { pThis->LocalFunc(); }

    virtual void LocalFunc() = 0;
};

#define FORK_BEG \
class HiddenCls : public HiddenBaseCls \
{ \
  public: \
    void LocalFunc() \
    {

#define FORK_END \
  } \
};

So, now the code block between the first two macros is separated from the surrounding code and can be called independently by any function that only needs a pointer to the local class instance. A call to one such function should be added at the end of the FORK_END macro to ensure that the LocalFunc code executes. The code that does that looks like this:

HiddenCls HCinst;
StartCodeIndependently(&HCinst);

The function "StartCodeIndependently" could start the separated code in many ways:

  1. Immediately when called, by using the pointer directly. That kind of usage leads to the same result as the sequential non-macro execution (see Listing 3):

     Listing 3

    void StartCodeIndependently(HiddenBaseCls* pInstanceThisPtr)
    {
      HiddenBaseCls::StaticFunc(pInstanceThisPtr);
        // StaticFunc calls LocalFunc
    }
  2. The function creates a thread that runs the independent code. The code in this case is almost the same; the only difference is that some platform-dependent code for working with threads should be added. The StartCodeIndependently function creates a thread and ensures that the code starts in the thread's body. The thread synchronization is done in the JOIN macro. Listing 4 shows the code for the Win32 platform:

     Listing 4

    static DWORD WINAPI ThreadFunc(void* pThis)
    {
      HiddenBaseCls::StaticFunc((HiddenBaseCls*)pThis);
      return 0;   // a Win32 thread function must return a DWORD
    }
    
    void StartCodeIndependently(HiddenBaseCls* pInstanceThisPtr,
                                HANDLE& hThread)
    {
      hThread = CreateThread( NULL, 0, // all error handling is
                                       // omitted in example
        ThreadFunc, /* pointer to thread function */
        (void*)pInstanceThisPtr,
        /* argument for the new thread */
        0, NULL );
    }
    ...
    #define JOIN WaitForSingleObject(hThread, INFINITE);
    
  3. The function declares the independent code as "requesting execution" by asking the scheduling module for a free working thread. In that case, the scheduling module is responsible for the efficiency of the parallel execution. It does not create a thread each time an execution split is needed; instead, it keeps a set of previously created threads whose number depends on the characteristics of the machine the code runs on. If the scheduling module "decides" that parallel execution would be more efficient, it returns a working thread ready to execute the code. If not, the code runs sequentially. This method has many advantages:
    • The thread creation is a relatively slow process and avoiding it speeds up execution.
    • The scheduling module knows the state of all threads currently running and can optimize execution by decreasing context switching. (When too many threads are set to work in parallel, the switching between them can significantly slow down the execution.)
    • The scheduling module could use different strategies to decide whether to hand out a free working thread or decline parallel execution.

     In the method implementation code (* available at www.codeguru.com *), the scheduling module is a simple structure giving information about the work threads. It behaves as an array of thread-information structures, each of which supports one work thread. All the StartCodeIndependently function has to do to find out whether LocalFunc will be executed in parallel is walk through the thread-information structures in search of a free work thread. If the search fails, all processors are fully used at the moment, and starting the code sequentially is better. If a free thread is found, the StartCodeIndependently function fills in some fields of the free thread-information structure before the code starts executing in the thread. The structure contains a "thread is free" flag, a pointer to the code that will be executed in the thread, and two pointers to events. The first event triggers the LocalFunc execution; the second is set by the thread to signal that the function has finished. Even after the code in the thread has finished, the thread is not marked free until the code in the JOIN macro receives the finish-event signal. Only then is the synchronization done (both execution paths are finished) and execution of the code after the JOIN macro can start.

Okay, now I have a good mechanism for running parallel execution efficiently, but something is still missing! Oh, yes: how to treat the parameters that are defined outside the code block, processed in its body, and needed after the block finishes (see Listing 5)? Or, how to ensure that the parallel threads' processing does not lead to collisions on common memory?

Listing 5

int i = 0, j = 3;
FORK_BEG
  i++;       // Code1 between FORK_BEG & FORK_END
FORK_END
  j += 2;    // Code2 between FORK_END & JOIN
  i--;       // ! hidden collision with i++ from Code1
JOIN
  funcX ( i,j );

So, how do you pass parameters into and out of the thread? Fortunately, the search for a solution is limited to the code between FORK_BEG and FORK_END (Code1 for short). The code between FORK_END and JOIN (Code2 for short) stays in the original thread and needs no special treatment. After macro replacement, Code1 becomes the body of the local class's member function, so it has full access to all class members. The only thing that needs to be done is to declare local class reference members pointing to the real parameters. Listing 6 illustrates the macros for passing one parameter; the idea with more than one parameter is the same (* as can be seen in the method implementation code available at www.codeguru.com *).

Listing 6

#define FORK_BEG_1(Var1Type, Var1Name) /* line1 */ \
{ /* line2 */ \
  Var1Type& Unique_Ref_1 = Var1Name; /* line3 */ \
  Var1Type* Var1Name = NULL; /* line4 */ \
\
  class HiddenCls : public HiddenBaseCls \
  {\
    Var1Type& Var1Name; /* line8 */ \
\
    public:\
      HiddenCls(Var1Type& var1_init) \
        : Var1Name(var1_init) { } /* line11 */ \
\
    void LocalFunc() \
    {

#define FORK_END_1 \
    } \
  }; \
\
  HiddenCls HiddenClsInst (Unique_Ref_1); /* line20 */ \
  StartInSeparateThread(&HiddenClsInst);

#define JOIN \
  WaitForBothThreadsToFinishBeforeContinuingExecution(); \
  /* (* Please see the full method implementation code \
     available at www.codeguru.com *) */ \
} /* line26 */
  • Line 1—The macro now uses two macro parameters, one for the type of the thread parameter and one for its name.
  • Line 2—The opening brace declares a new local block isolating the contents between FORK_BEG and end of JOIN.
  • Line 3—A new reference with a unique name is created for the thread parameter.
  • Line 4—A new identifier is defined whose scope covers all the code between FORK_BEG and the end of JOIN (Code1 and Code2). The identifier has the same name as the parameter, but its type is different, so the second thread (Code2) can no longer use it as the original type without the compiler generating an error. This is an elegant way to protect Code2 from simultaneous access to the common data. To give Code1 access to the data again, an identifier with the same name as the parameter is declared as a class member (Line 8). It refers to the real parameter after the reference initialization in Line 11. The scope of that reference is the class scope, so the code in LocalFunc uses it exactly as if there were no macros at all.
  • Line 20—The reference should be initialized with the proper identifier. Defining Unique_Ref_1 not only ensures data protection of the parameter, but also spares the programmer the inconvenience of using FORK_END with parameters: Unique_Ref_1 in FORK_END_1 acts like a default macro parameter defined in FORK_BEG_1.
  • Line 26—The closing brace ends the macro trinity's scope. The pair of braces (Lines 2 and 26) also ensures that no name collisions occur if more than one macro trinity is used in the same scope, or if the macros are nested (more than one parallel split).

So after defining one parameter macro, Listing 5 becomes what you see in Listing 7:

Listing 7

int i = 0, j = 3;
FORK_BEG_1(int, i)
  i++;
FORK_END_1
  j += 2;
  i--;    // collision revealed by the compiler
JOIN
  funcX( i,j );

The only difference is the explicit declaration that "int i" will be used as a parameter in Code1. The explicit declaration is a very efficient protection, ensuring that this parameter, and only this parameter, is used. All other variables are not visible to Code1, so it cannot accidentally get unauthorized read/write access to them. It is one thing to know how to avoid mistakes; it is quite another when the compiler does not let you make them. But if common data access is forbidden to both threads, how do you share data when it is really needed (see Listing 8)?

Listing 8

typedef int iArrType[50];
iArrType iArr;
FORK_BEG_1(iArrType, iArr)
  for (int i=0; i<25; i++)
    iArr[i] = func(i);    // no problem with access
FORK_END_1
  for (int i=25; i<50; i++)
    iArr[i] = func(i);    // !!! problem with iArr[] access
JOIN

The problem is that iArr is not treated as iArrType, but as iArrType*, and the compiler does not allow the operation because it is not type safe. In that case, the programmer can take full responsibility and assure the compiler that both threads write to the common data in a safe way, by creating a reference to iArr and using it instead of iArr in Code2. The example is interesting for one more reason: It shows another possible application of the described technique, making iterations of a loop execute in parallel. Loop splitting is a well-known method for making loop execution faster. It can be applied to loops whose iterations are independent (the code in any iteration does not rely on the result of any other iteration). In 2D and 3D modeling, for example, such SIMD-style code is very common in the following areas:

  • Ray Tracing—a ray is traced for each pixel in the processed scene independently.
  • 2D effects—almost all image processing filters (such as blur, sharp, emboss, and so forth) are made by applying a simple operation on each pixel of the processed image. All the pixel operations are separate from one another.
  • 3D mesh processing—often each vertex in a 3D mesh is processed apart from the others. In 3D mesh smoothing, for example, each vertex is moved according to a rule, in a direction defined only by its current position and its neighbouring vertices.
  • 3D film rendering—each frame is processed independently.

Making a loop parallel can be done with just a couple of macros bounding the loop body. Listing 8 becomes Listing 9:

Listing 9

FOR_FORK_BEG_1(int, i, 0, 50, iArrType, iArr)
  iArr[i] = func(i);
FOR_JOIN

The first line means "a parallel loop 'for (int i=0; i<50; i++)' whose body works with one in/out parameter named iArr of type iArrType."

There is nothing special about the loop-splitting macro implementation. The loop body is separated in the same way, by using a local class member function. The only difference is calling the function with two more parameters that determine which iterations are executed in the separate thread. Iteration-dividing strategies can vary, splitting the whole loop into two, three, or more parallel parts. (* For more details, please see the method implementation code available at www.codeguru.com. *)

So far, I have talked mainly about how this method can improve execution efficiency. But everything has its price; here, the price is the time spent in the supporting code. Despite the efficient implementation, using this technique is not a good idea if the execution time of either of the two code blocks that would run in parallel is too small.

Acknowledgements

Thanks to my friend Ivan Georgiev for helping me with valuable implementation ideas, examples, and support.

Conclusion

As parallelism is the next dimension of computational performance increase, methods for splitting sequential code into parallel units are becoming more important. In this article, I described a technique for easy transitioning from sequential single-threaded to parallel multi-threaded code, using only three macros per execution split. The method's implementation uses C++, which provides all the means necessary to make the solution elegant, flexible, and platform independent.

Downloads

Download demo project - 9 Kb

