Parallel Programming in Visual C++ 2010 CTP

The CTP build of Visual C++ 2010 includes a new library to help you write native parallel code. Writing parallel code is getting more and more important with the broad availability of quad-core CPUs at this time and the many-core CPUs that will appear in the coming years. I will only be talking about the new concurrency library for native code. Of course, writing parallel code has already been possible for a long time. However, you had to create and manage all threads by yourself and this could often be a complex task. Because of this, it requires quite a bit of time to parallelize a simple loop over multiple threads. The new native concurrency library makes this much easier.

This article will explain the parallel_for construct that is part of the native concurrency library in more detail and will briefly touch on a few other constructs. For the example, I will parallelize a trivial implementation of a Mandelbrot fractal renderer.

A Simple Serial Mandelbrot Renderer

The code to render a Mandelbrot fractal looks like the following:

int iHeight     = rcClient.Height();
int iHalfHeight = int(iHeight/2.0+0.5);
int iWidth      = rcClient.Width();
int iHalfWidth  = int(iWidth/2.0+0.5);
int maxiter = 1024;
// Position and size of our view on the imaginary plane
double dView_r = 0.001643721971153;
double dView_i = 0.822467633298876;
CDC memDC;
memDC.CreateCompatibleDC(&dc);
CBitmap bmp;
bmp.CreateCompatibleBitmap(&dc, iWidth, 1);
CBitmap* pOldBmp = memDC.SelectObject(&bmp);
for (int y=-iHalfHeight; y<iHalfHeight; ++y)
{
   // Formula: zi = z^2 + z0
   double dZ0_i = dView_i + y * m_dZoomLevel;
   for (int x=-iHalfWidth; x<iHalfWidth; ++x)
   {
      double dZ0_r = dView_r + x * m_dZoomLevel;
      double dZ_r = dZ0_r;
      double dZ_i = dZ0_i;
      double d = 0.0;
      int iter;
      for (iter=0; iter < maxiter; ++iter)
      {
         double dZ_rSquared = dZ_r * dZ_r;
         double dZ_iSquared = dZ_i * dZ_i;
         if (dZ_rSquared + dZ_iSquared > 4)
         {
            // We escaped
            d = iter+1-log(log(sqrt(dZ_rSquared +
                               dZ_iSquared)))/log(2.0);
            break;
         }
         dZ_i = 2 * dZ_r * dZ_i + dZ0_i;
         dZ_r = dZ_rSquared - dZ_iSquared + dZ0_r;
      }

      memDC.SetPixel(x+iHalfWidth,0,RGB(d*50,d*50,d*50));
   }
   dc.BitBlt(0, y+iHalfHeight, iWidth, 1, &memDC, 0, 0, SRCCOPY);
}
memDC.SelectObject(pOldBmp);
bmp.DeleteObject();
memDC.DeleteDC();

This code first calculates the width and height of the window to which you will be rendering. It also sets up the position and size of your view on the imaginary plane. The renderer will render line by line in a memory device context, so you set up a memory device context and select a bitmap in it whose size is the width of the rendering window and whose height is just 1 pixel. Then, you loop over each line. In each line, you loop over each pixel and for each pixel you iterate a number of times to calculate the value of that pixel. When you escape from the Mandelbrot set, you calculate the value "d" to get some kind of smooth grayscale coloring of your fractal. Once a row has been rendered, it will be blitted to the screen using BitBlt so you can see the progress of the rendering.

When you would run this renderer, it will render line by line from top to bottom. A screenshot of this can be seen below:

Parallelizing the Mandelbrot Renderer

Before you can use the new native concurrency library, you need to include the ppl.h file. Also, these concurrency functions are inside the namespace Concurrency, so either use "using namespace Concurrency" or specify Concurrency in front of every use of something from the library.

#include <ppl.h>
using namespace Concurrency;

To parallelize the above Mandelbrot renderer using the new parallel_for construct from the Visual C++ 2010 CTP concurrency library, you basically only need to change the outer for loop—the loop that is iterating over all the rows. A first version would look like the following:

int iHeight     = rcClient.Height();
int iHalfHeight = int(iHeight/2.0+0.5);
int iWidth      = rcClient.Width();
int iHalfWidth  = int(iWidth/2.0+0.5);
int maxiter = 1024;
// Position and size of our view on the imaginary plane
double dView_r = 0.001643721971153;
double dView_i = 0.822467633298876;
parallel_for(-iHalfHeight, iHalfHeight,1,[&](int y){
   CDC memDC;
   CBitmap bmp;
   memDC.CreateCompatibleDC(&dc);
   bmp.CreateCompatibleBitmap(&dc, iWidth, 1);
   CBitmap* pOldBmp = memDC.SelectObject(&bmp);
   // Formula: zi = z^2 + z0
   double dZ0_i = dView_i + y * m_dZoomLevel;
   for (int x=-iHalfWidth; x<iHalfWidth; ++x)
   {
      double dZ0_r = dView_r + x * m_dZoomLevel;
      double dZ_r  = dZ0_r;
      double dZ_i  = dZ0_i;
      double d     = 0.0;
      int iter;
      for (iter=0; iter < maxiter; ++iter)
      {
         double dZ_rSquared = dZ_r * dZ_r;
         double dZ_iSquared = dZ_i * dZ_i;
         if (dZ_rSquared + dZ_iSquared > 4)
         {
            // We escaped
            d = iter+1-log(log(sqrt(dZ_rSquared +
                                    dZ_iSquared)))/log(2.0);
            break;
         }
         dZ_i = 2 * dZ_r * dZ_i + dZ0_i;
         dZ_r = dZ_rSquared - dZ_iSquared + dZ0_r;
      }

      memDC.SetPixel(x+iHalfWidth,0,RGB(d*50,d*50,d*50));
   }
   dc.BitBlt(0, y+iHalfHeight, iWidth, 1, &memDC, 0, 0, SRCCOPY);
   memDC.SelectObject(pOldBmp);
   bmp.DeleteObject();
   memDC.DeleteDC();
});

The only things that have been changed in this code compared to the original code are the bold parts, meaning the "for" has been replaced with the "parallel_for" and the creation of the memory DC and bitmap are moved inside the parallel_for because you need a separate memory DC for every thread that will be created. The parallel_for construct is using another new feature of C++, called Lambda expressions, that allow you to create anonymous inline functions. Describing lambda expressions is outside the scope of this article.

Making the Renderer Thread Safe

When you would run the above code, you would notice that some lines will be missing in the rendering result. This is because you are writing to the same device context from different threads without any synchronization. To fix this, you will use a critical section to secure access to the device context so that only 1 thread can draw on it at the same time. The changes look as follows:

int iHeight = rcClient.Height();
int iHalfHeight = int(iHeight/2.0+0.5);
int iWidth = rcClient.Width();
int iHalfWidth = int(iWidth/2.0+0.5);
int maxiter = 1024;
// Position and size of our view on the imaginary plane
double dView_r = 0.001643721971153;
double dView_i = 0.822467633298876;
// We need this critical section to have thread safe access to
// our device context
CRITICAL_SECTION cs;
InitializeCriticalSection(&cs);
parallel_for(-iHalfHeight, iHalfHeight,1,[&](int y){
   CDC memDC;
   CBitmap bmp;
   // We need to use a critical section here because we're
   // accessing our dc.
   EnterCriticalSection(&cs);
   memDC.CreateCompatibleDC(&dc);
   bmp.CreateCompatibleBitmap(&dc, iWidth, 1);
   LeaveCriticalSection(&cs);
   CBitmap* pOldBmp = memDC.SelectObject(&bmp);
   // Formula: zi = z^2 + z0
   double dZ0_i = dView_i + y * m_dZoomLevel;
   for (int x=-iHalfWidth; x<iHalfWidth; ++x)
   {
      double dZ0_r = dView_r + x * m_dZoomLevel;
      double dZ_r = dZ0_r;
      double dZ_i = dZ0_i;
      double d = 0.0;
      int iter;
      for (iter=0; iter < maxiter; ++iter)
      {
         double dZ_rSquared = dZ_r * dZ_r;
         double dZ_iSquared = dZ_i * dZ_i;
         if (dZ_rSquared + dZ_iSquared > 4)
         {
            // We escaped
            d = iter+1-log(log(sqrt(dZ_rSquared +
                                    dZ_iSquared)))/log(2.0);
            break;


         }
         dZ_i = 2 * dZ_r * dZ_i + dZ0_i;
         dZ_r = dZ_rSquared - dZ_iSquared + dZ0_r;
      }

      memDC.SetPixel(x+iHalfWidth,0,RGB(d*50,d*50,d*50));
   }
   // We need to use a critical section here because we're
   // accessing our dc.
   EnterCriticalSection(&cs);
   dc.BitBlt(0, y+iHalfHeight, iWidth, 1, &memDC, 0, 0, SRCCOPY);
   LeaveCriticalSection(&cs);
   memDC.SelectObject(pOldBmp);








   bmp.DeleteObject();
   memDC.DeleteDC();
});

Before starting the parallel_for loop, you initialize a critical section. Inside the parallel_for loop, you will wrap all usages of the device context "dc" inside EnterCriticalSection/LeaveCriticalSection constructs to make sure only 1 thread accesses that device context at the same time.

When executing this code, you will see that it is rendering in blocks, as can be seen in the following screenshot. Each block is being handled by a different thread.

This article comes with one attachment. MandelbrotPar_src.zip contains the above Mandelbrot example. In the toolbar of the application, you will find a button with a P in it. When you toggle this button, you will switch between serial and parallel rendering. The titlebar of the application shows the time it took to render the image in milliseconds. Please note that this is a very basic example application. The rendering is happening in the WM_PAINT handler directly, meaning it will redraw the entire fractal each time it needs to paint the window.

Parallel Programming in Visual C++ 2010 CTP

Other Concurrency Constructs

The native concurrency library also has some other high level constructs, such as parallel_for_each and parallel_invoke.

parallel_for_each

This construct can replace the std::for_each construct to iterate over a container in a parallel manner. An example could be a vector drawing application. The application might hold a vector of all objects that have to be rendered. When the application needs to render everything to the screen, it simply iterates over the vector and calls the Draw() functions on each object. For example:

#include <algorithm>
#include <vector>

class CObj
{
public:
   virtual void Draw();
   // ...
}

class CObjRectangle : public CObj
{
public:
   virtual void Draw();
   // ...
}

int main()
{
   std::vector<CObj*> vec;
   // Add objects to this vector
   // ...
   // Draw all objects
   for_each(vec.begin(), vec.end(), [&](CObj* p) {
      if (p)
         p->Draw();
   });
   return 0;
}

The above defines a base class for renderable objects and a specific rectangle class as an example. A vector is filled with renderable objects; then, the for_each construct will call the Draw() method on each object. This is happening in a serial way. You can simply parallelize this as follows:

#include <ppl.h>
using namespace Concurrency;

// ...

int main()
{
   std::vector<CObj*> vec;
   // Add objects to this vector
   // ...
   // Draw all objects
   parallel_for_each(vec.begin(), vec.end(), [&](CObj* p) {
      if (p)
         p->Draw();
   });
   return 0;
}

By simply replacing for_each with parallel_for_each, the loop is now being executed in parallel in several threads.

Note: This assumes that the Draw function will modify only a single object in each iteration.

parallel_invoke

parallel_invoke can be used to launch several functions, execute them in parallel, and wait until they are all finished. A simple example is shown below:

#include <windows.h>
#include <iostream>
#include <ppl.h>

using namespace std;
using namespace Concurrency;

int main()
{
   parallel_invoke(
      [=]{for (int i=0; i<100; ++i)
         {cout << i << endl; Sleep(100);}},
      [=]{for (int i=0; i<10; ++i)
         {cout << "   " << i << endl; Sleep(1000);}}
   );

   return 0;
}

This example will launch two functions in parallel if you have more than one core. The first function will print the numbers 0 to 100 and waits 100 milliseconds between two numbers. The second function will print the numbers 0 to 10 and will wait one second between two numbers. When you execute the above application on a system with more than one core, you will see that the numbers will interleave each other. Note that I didn't include any synchronization, so some numbers might appear on the same line. The parallel_invoke function can accept from 2 up to 10 functions to run in parallel. parallel_for internally is using parallel_invoke to split its work.

Profiling Parallel Code

Visual Studio 2010 CTP contains a new profiling option to help you in profiling the concurrency of your application. It allows you to check CPU utilization, thread blocking, and core execution. Explaining this profiler is outside the scope of this article. You can find some more information regarding it at http://msdn.microsoft.com/en-us/magazine/cc817396.aspx.



About the Author

Marc Gregoire

Marc graduated from the Catholic University Leuven, Belgium, with a degree in "Burgerlijk ingenieur in de computer wetenschappen" (equivalent to Master of Science in Engineering in Computer Science) in 2003. In 2004 he got the cum laude degree of Master In Artificial Intelligence at the same university. In 2005 he started working for a big software consultancy company. His main expertise is C/C++ and specifically Microsoft VC++ and the MFC framework. Next to C/C++, he also likes C# and uses PHP for creating webpages. Besides his main interest for Windows development, he also has experience in developing C++ programs running 24x7 on Linux platforms and in developing critical 2G,3G software running on Solaris for big telecom operators.

Downloads

Comments

  • There are no comments yet. Be the first to comment!

Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • With JRebel, developers get to see their code changes immediately, fine-tune their code with incremental changes, debug, explore and deploy their code with ease (both locally and remotely), and ultimately spend more time coding instead of waiting for the dreaded application redeploy to finish. Every time a developer tests a code change it takes minutes to build and deploy the application. JRebel keeps the app server running at all times, so testing is instantaneous and interactive.

  • The latest release of SugarCRM's flagship product gives users new tools to build extraordinary customer relationships. Read an in-depth analysis of SugarCRM's enhanced ability to help companies execute their customer-facing initiatives from Ovum, a leading technology research firm.

Most Popular Programming Stories

More for Developers

Latest Developer Headlines

RSS Feeds