Parallel Programming in Visual C++ 2010 CTP

The CTP build of Visual C++ 2010 includes a new library that helps you write native parallel code. Parallel code matters more and more now that quad-core CPUs are widely available and many-core CPUs are expected in the coming years. This article covers only the new concurrency library for native code. Writing parallel code has of course been possible for a long time, but you had to create and manage every thread yourself, which is often a complex task. As a result, even parallelizing a simple loop across multiple threads took quite a bit of time. The new native concurrency library makes this much easier.

This article will explain the parallel_for construct that is part of the native concurrency library in more detail and will briefly touch on a few other constructs. For the example, I will parallelize a trivial implementation of a Mandelbrot fractal renderer.

A Simple Serial Mandelbrot Renderer

The code to render a Mandelbrot fractal looks like the following:

int iHeight     = rcClient.Height();
int iHalfHeight = int(iHeight/2.0+0.5);
int iWidth      = rcClient.Width();
int iHalfWidth  = int(iWidth/2.0+0.5);
int maxiter = 1024;
// Position and size of our view on the complex plane
double dView_r = 0.001643721971153;
double dView_i = 0.822467633298876;
CDC memDC;
memDC.CreateCompatibleDC(&dc);
CBitmap bmp;
bmp.CreateCompatibleBitmap(&dc, iWidth, 1);
CBitmap* pOldBmp = memDC.SelectObject(&bmp);
for (int y=-iHalfHeight; y<iHalfHeight; ++y)
{
   // Formula: z(n+1) = z(n)^2 + z0
   double dZ0_i = dView_i + y * m_dZoomLevel;
   for (int x=-iHalfWidth; x<iHalfWidth; ++x)
   {
      double dZ0_r = dView_r + x * m_dZoomLevel;
      double dZ_r = dZ0_r;
      double dZ_i = dZ0_i;
      double d = 0.0;
      int iter;
      for (iter=0; iter < maxiter; ++iter)
      {
         double dZ_rSquared = dZ_r * dZ_r;
         double dZ_iSquared = dZ_i * dZ_i;
         if (dZ_rSquared + dZ_iSquared > 4)
         {
            // We escaped
            d = iter+1-log(log(sqrt(dZ_rSquared +
                               dZ_iSquared)))/log(2.0);
            break;
         }
         dZ_i = 2 * dZ_r * dZ_i + dZ0_i;
         dZ_r = dZ_rSquared - dZ_iSquared + dZ0_r;
      }

      memDC.SetPixel(x+iHalfWidth,0,RGB(d*50,d*50,d*50));
   }
   dc.BitBlt(0, y+iHalfHeight, iWidth, 1, &memDC, 0, 0, SRCCOPY);
}
memDC.SelectObject(pOldBmp);
bmp.DeleteObject();
memDC.DeleteDC();

This code first calculates the width and height of the window to which you will be rendering. It also sets up the position of your view on the complex plane; the size of the view follows from m_dZoomLevel, the distance on that plane between two adjacent pixels. The renderer works line by line in a memory device context, so you set up a memory device context and select into it a bitmap that is as wide as the rendering window and just 1 pixel high. Then, you loop over each line. In each line, you loop over each pixel, and for each pixel you iterate up to maxiter times to calculate its value. As soon as the point escapes (its squared magnitude exceeds 4, so it is not part of the Mandelbrot set), you calculate the value “d” from the iteration count to get a smooth grayscale coloring of the fractal. Points that never escape keep d = 0 and are drawn black. Once a row has been rendered, it is blitted to the screen using BitBlt so you can see the progress of the rendering.
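The per-pixel calculation described above can be isolated from the MFC drawing code. The following is a minimal sketch, not part of the attached sample: the function name MandelbrotValue and its parameters are made up for illustration, while the iteration and the smoothing formula are the ones used in the listing.

```cpp
#include <cmath>

// Hypothetical helper (not in the original sample): returns the smooth
// escape value "d" for the point (c_r, c_i), or 0.0 when the point has not
// escaped after max_iter iterations.
double MandelbrotValue(double c_r, double c_i, int max_iter)
{
    double z_r = c_r;
    double z_i = c_i;
    for (int iter = 0; iter < max_iter; ++iter)
    {
        const double r2 = z_r * z_r;
        const double i2 = z_i * z_i;
        if (r2 + i2 > 4.0)
        {
            // We escaped: same smoothing formula as in the renderer.
            return iter + 1
                 - std::log(std::log(std::sqrt(r2 + i2))) / std::log(2.0);
        }
        // z = z^2 + c, split into real and imaginary parts.
        z_i = 2.0 * z_r * z_i + c_i;
        z_r = r2 - i2 + c_r;
    }
    return 0.0;
}
```

Because this function only reads its arguments and touches no shared state, every pixel can be computed independently of every other pixel, which is what makes the renderer a good candidate for parallelization.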

When you run this renderer, it renders line by line from top to bottom. A screenshot of this can be seen below:

Parallelizing the Mandelbrot Renderer

Before you can use the new native concurrency library, you need to include the ppl.h header. The concurrency functions live in the Concurrency namespace, so either add “using namespace Concurrency” or prefix every name you use from the library with Concurrency::.

#include <ppl.h>
using namespace Concurrency;

To parallelize the above Mandelbrot renderer using the new parallel_for construct from the Visual C++ 2010 CTP concurrency library, you basically only need to change the outer for loop, which is the loop that iterates over all the rows. A first version would look like the following:

int iHeight     = rcClient.Height();
int iHalfHeight = int(iHeight/2.0+0.5);
int iWidth      = rcClient.Width();
int iHalfWidth  = int(iWidth/2.0+0.5);
int maxiter = 1024;
// Position and size of our view on the complex plane
double dView_r = 0.001643721971153;
double dView_i = 0.822467633298876;
parallel_for(-iHalfHeight, iHalfHeight,1,[&](int y){
   CDC memDC;
   CBitmap bmp;
   memDC.CreateCompatibleDC(&dc);
   bmp.CreateCompatibleBitmap(&dc, iWidth, 1);
   CBitmap* pOldBmp = memDC.SelectObject(&bmp);
   // Formula: z(n+1) = z(n)^2 + z0
   double dZ0_i = dView_i + y * m_dZoomLevel;
   for (int x=-iHalfWidth; x<iHalfWidth; ++x)
   {
      double dZ0_r = dView_r + x * m_dZoomLevel;
      double dZ_r  = dZ0_r;
      double dZ_i  = dZ0_i;
      double d     = 0.0;
      int iter;
      for (iter=0; iter < maxiter; ++iter)
      {
         double dZ_rSquared = dZ_r * dZ_r;
         double dZ_iSquared = dZ_i * dZ_i;
         if (dZ_rSquared + dZ_iSquared > 4)
         {
            // We escaped
            d = iter+1-log(log(sqrt(dZ_rSquared +
                                    dZ_iSquared)))/log(2.0);
            break;
         }
         dZ_i = 2 * dZ_r * dZ_i + dZ0_i;
         dZ_r = dZ_rSquared - dZ_iSquared + dZ0_r;
      }

      memDC.SetPixel(x+iHalfWidth,0,RGB(d*50,d*50,d*50));
   }
   dc.BitBlt(0, y+iHalfHeight, iWidth, 1, &memDC, 0, 0, SRCCOPY);
   memDC.SelectObject(pOldBmp);
   bmp.DeleteObject();
   memDC.DeleteDC();
});

Only two things have changed compared to the original code: the outer “for” has been replaced with “parallel_for”, and the creation and cleanup of the memory DC and bitmap have moved inside the parallel_for body, because every thread needs its own memory DC. The arguments to parallel_for are the first row, the row one past the last, the step, and the function to call for each row. That function is written using another new C++ feature, lambda expressions, which allow you to create anonymous inline functions. The [&] at the start of the lambda means that the local variables it uses, such as dc and iWidth, are captured by reference. Describing lambda expressions in detail is outside the scope of this article.
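To get a feel for what a parallel loop with a lambda body does, here is a portable sketch built on std::thread. It is an illustration only: ParallelForSketch and SquareRows are hypothetical names, and splitting the range into one contiguous chunk per hardware thread is an assumption made for simplicity, not a description of how the concurrency library schedules its work.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for parallel_for(first, last, 1, body): splits
// [first, last) into contiguous chunks and runs each chunk on its own thread.
void ParallelForSketch(int first, int last,
                       const std::function<void(int)>& body)
{
    if (first >= last) return;
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 2;   // the value is only a hint and may be 0
    const int total = last - first;
    const int threadCount = std::min<int>(static_cast<int>(hw), total);
    const int chunk = (total + threadCount - 1) / threadCount;

    std::vector<std::thread> workers;
    for (int begin = first; begin < last; begin += chunk)
    {
        const int end = std::min(begin + chunk, last);
        workers.emplace_back([begin, end, &body] {
            for (int i = begin; i < end; ++i) body(i);
        });
    }
    // Like parallel_for, do not return until every iteration has run.
    for (auto& w : workers) w.join();
}

// Example body: each iteration writes only its own slot, so no locking
// is needed. [&] captures rows and halfHeight by reference.
std::vector<int> SquareRows(int halfHeight)
{
    std::vector<int> rows(2 * halfHeight);
    ParallelForSketch(-halfHeight, halfHeight, [&](int y) {
        rows[y + halfHeight] = y * y;
    });
    return rows;
}
```

The body in SquareRows is safe without synchronization because no two iterations touch the same element. The renderer's body is different: every iteration draws to the same device context, which is shared state.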

Making the Renderer Thread Safe

When you run the above code, you will notice that some lines are missing from the rendered result. This is because you are accessing the same device context from different threads without any synchronization. To fix this, you will use a critical section to guard access to the device context so that only one thread can use it at a time. The changes look as follows:

int iHeight = rcClient.Height();
int iHalfHeight = int(iHeight/2.0+0.5);
int iWidth = rcClient.Width();
int iHalfWidth = int(iWidth/2.0+0.5);
int maxiter = 1024;
// Position and size of our view on the complex plane
double dView_r = 0.001643721971153;
double dView_i = 0.822467633298876;
// We need this critical section to have thread safe access to
// our device context
CRITICAL_SECTION cs;
InitializeCriticalSection(&cs);
parallel_for(-iHalfHeight, iHalfHeight,1,[&](int y){
   CDC memDC;
   CBitmap bmp;
   // We need to use a critical section here because we're
   // accessing our dc.
   EnterCriticalSection(&cs);
   memDC.CreateCompatibleDC(&dc);
   bmp.CreateCompatibleBitmap(&dc, iWidth, 1);
   LeaveCriticalSection(&cs);
   CBitmap* pOldBmp = memDC.SelectObject(&bmp);
   // Formula: z(n+1) = z(n)^2 + z0
   double dZ0_i = dView_i + y * m_dZoomLevel;
   for (int x=-iHalfWidth; x<iHalfWidth; ++x)
   {
      double dZ0_r = dView_r + x * m_dZoomLevel;
      double dZ_r = dZ0_r;
      double dZ_i = dZ0_i;
      double d = 0.0;
      int iter;
      for (iter=0; iter < maxiter; ++iter)
      {
         double dZ_rSquared = dZ_r * dZ_r;
         double dZ_iSquared = dZ_i * dZ_i;
         if (dZ_rSquared + dZ_iSquared > 4)
         {
            // We escaped
            d = iter+1-log(log(sqrt(dZ_rSquared +
                                    dZ_iSquared)))/log(2.0);
            break;
         }
         dZ_i = 2 * dZ_r * dZ_i + dZ0_i;
         dZ_r = dZ_rSquared - dZ_iSquared + dZ0_r;
      }

      memDC.SetPixel(x+iHalfWidth,0,RGB(d*50,d*50,d*50));
   }
   // We need to use a critical section here because we're
   // accessing our dc.
   EnterCriticalSection(&cs);
   dc.BitBlt(0, y+iHalfHeight, iWidth, 1, &memDC, 0, 0, SRCCOPY);
   LeaveCriticalSection(&cs);
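   memDC.SelectObject(pOldBmp);
   bmp.DeleteObject();
   memDC.DeleteDC();
});
// parallel_for has returned, so no thread is using the critical
// section anymore and it can be released.
DeleteCriticalSection(&cs);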

Before starting the parallel_for loop, you initialize a critical section. Inside the loop, every use of the device context “dc” is wrapped in an EnterCriticalSection/LeaveCriticalSection pair so that only one thread accesses that device context at a time. The expensive part, calculating the pixels of a row into the thread's own memory DC, stays outside the critical section and therefore still runs in parallel.
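The same locking pattern can be expressed with the standard library instead of the Win32 API. The sketch below swaps CRITICAL_SECTION for std::mutex and uses std::lock_guard so that the lock is released automatically; SharedSurface, RenderRows and RenderInTwoThreads are hypothetical names, and a vector of row numbers stands in for the shared device context.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for the shared device context: records the rows that were
// "blitted". The vector is not thread safe on its own.
struct SharedSurface
{
    std::vector<int> blittedRows;
    std::mutex       guard;   // plays the role of the CRITICAL_SECTION
};

void RenderRows(SharedSurface& surface, int firstRow, int lastRow)
{
    for (int y = firstRow; y < lastRow; ++y)
    {
        // ... the per-row calculation would go here, outside the lock ...

        // Locks here and unlocks when 'lock' goes out of scope at the end
        // of this iteration, even if the guarded code throws.
        std::lock_guard<std::mutex> lock(surface.guard);
        surface.blittedRows.push_back(y);
    }
}

// Renders the top and bottom halves on two threads and returns how many
// rows reached the shared surface.
std::size_t RenderInTwoThreads(int rowCount)
{
    SharedSurface surface;
    std::thread top(RenderRows, std::ref(surface), 0, rowCount / 2);
    std::thread bottom(RenderRows, std::ref(surface), rowCount / 2, rowCount);
    top.join();
    bottom.join();
    return surface.blittedRows.size();
}
```

A scope-based guard like this removes the risk of leaving a critical section locked when an early return or an exception skips the matching LeaveCriticalSection call.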

When you run the thread-safe parallel renderer, you will see that it renders in blocks, as can be seen in the following screenshot. Each block is being handled by a different thread.

This article comes with one attachment. MandelbrotPar_src.zip contains the above Mandelbrot example. In the toolbar of the application, you will find a button with a P in it. When you toggle this button, you will switch between serial and parallel rendering. The titlebar of the application shows the time it took to render the image in milliseconds. Please note that this is a very basic example application. The rendering is happening in the WM_PAINT handler directly, meaning it will redraw the entire fractal each time it needs to paint the window.
