High-Performance Computing, Part 1

CodeGuru content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More.

Environment: C++/MFC

Introduction

This article first addresses some of the common programmer’s algorithm problems when writing a computing-intensive image processing application. It then explains how the existing code can be improved from native C++ code to using MMX acceleration. MMX refers to a feature that exists in the Intel processors, which can boost up multimedia software by executing integer instructions on several data in a parallel fashion. Although the processor supports the acceleration, it does not means your application can make use of it automatically. Hence, your application must be manually written to take advantage of it.

Once the user is able to make use of MMX acceleration, he/she can advance one step further by utilising SSE (Streaming SIMD Extension), which features parallel floating point calculations; and SSE2, which is an upgrade to both floating point and integer calculation. You have to be aware that not all hardware supports the acceleration described. Well, to keep the scope small, I had separated this explanation into another article (will be posted soon).

Now, let’s take a look at the list of Intel processors that support each feature.

  • Pentium MMX, Pro, Pentium 2—MMX
  • Pentium 3, Xeon—MMX, SSE
  • Pentium 4 and above—MMX, SSE, SSE2/HHT

The article contents mainly apply to developers using Intel processor hardware. As for AMD users, some of the assembly instructions in MMX are compatible! Prior to this, some basic assembly language knowledge is needed. I will also assume that you have some knowledge on MFC doc/view architecture.

I came out with this article because I think there might be a need to highlight some high performance computing knowledge, especially to those developers who are developing time-critical applications. Image processing is a good example for demo purposes.

In this downloadable demo, you will need to have at least a Pentium processor with MMX to see the results.

Background

In image processing/3D graphics and so forth, performance is critical. The less the time an algorithm takes, the better it is because more functionalities can be squeezed in.

In my examples, I generated two images of size 1024 x 1024. It’s a megabyte in size. For the sake of simplicity, both are 8-bit, grayscale images. I created synthetic images because dependencies upon other classes/objects/DLLs are minimum. Image 1 is an image with white vertical lines on a black background. Image 2 is an image with gradient fills of gray.

The two images will be added together to create a third image. The final product would be an image with gradient fill and vertical line. This is shown in the figure at the beginning of the article.

I can easily create the best performing algorithm in the demo, but instead I want to show you the difference between various kinds of code. I had developed four algorithms that demonstrate image addition: two in C++ and two in assembly language. For each of the algorithms, I put in a high-resolution timer to take the elapsed time reading. The final image will be shown onscreen (main window). The two source images are displayed in two mini windows on the right. Then, I present you with the results of different kinds of PCs in the later section of the article.

Writing the Algorithms

For some basic understanding, let’s go through this C++ code.

void CMMXDemoDoc::OnRunbasic()
{
  int x=0, y=0, i=0, iGray=0;

  // Start timing
  m_el.Begin();

  // Add two images using direct memory access
  // Assume both images are the same size
  for (y=0; y<m_iHeight; y++)
  {
    for (x=0; x<m_iWidth; x++)
    {
      // calc index
      i = x + y*m_iWidth;
      // add two pixels
      iGray = m_pImg1[i] + m_pImg2[i];
      // keep saturation value
      if (iGray > 255) iGray = 255;
      // save to destination image
      m_pImg[i] = iGray;

    }
  }

  // End Timing
  m_el.End();

  // Force redraw
  UpdateAllViews(0);
}

<m_pImg>, <m_pImg1>, and <m_pImg2> are an one-dimensional array of unsigned char. All images are the same size.

There are two for loops; the outer loop iterates through each line in the image. The inner loop iterates through each pixel in the line. For every pixel, we need to calculate the index in the array, add the pixel from image 1 and image 2, check whether it’s saturated (>255), then then store it into the destination image.

What is the problem with this algorithm?

  1. There are lots of arithmetic operations used on addition and there is a multiplication operation as well. Multiply is a bit slower than addition. Try changing the addition operator to multiply and see the differences in the demo.
  2. Array indexing to access random memory data. This causes some slowdown because the processor has to convert from an array index to the actual address in memory.
  3. Two levels of for loops cause the processor to operate stack instructions because ecx (counter register) pushes and pops often. This imposes some processor cycles.

Optimising Using Pointer Arithmetic

void CMMXDemoDoc::OnRunopt()
{
  // Begin timing
  m_el.Begin();

  // Precalculate the pointers
  BYTE *pSrc = m_pImg2;
  BYTE *pSrcEnd = m_pImg2 + m_iSize;
  BYTE *pDest = m_pImg;
  BYTE *pDestEnd = m_pImg + m_iSize;
  int iGray;

  // Copy from img1 to tmp
  memcpy(m_pImg, m_pImg1, m_iSize);

  // loop each pixel and Add
  while (pSrc < pSrcEnd)
  {
    iGray = *pDest + *pSrc;
    if (iGray > 255) iGray=255;
    *pDest = iGray;
    pSrc++;
    pDest++;
  }

  // End Timing
  m_el.End();
  // Force redraw
  UpdateAllViews(0);
}

Notice that I copied the data to the destination image. I can save a pointer addition because memcpy() is an optimized function from VC (memcpy() still takes time).

The preceding code uses a pointer to access the data. There is nothing more to optimize because the pointer is already storing the data memory address. The pointer moves from one element to the next in a sequential form. The source and destination pointers are incremented by 1 each after the pixel has been calculated.

Converting to Assembly

The following code shows how to write the algorithm in assembly. The elapsed time of the function will be much more constant.

int AsmAdd(BYTE *d, BYTE *s, int w, int h)
{
  int iCount = w * h;

  // we assume all data in the register is not used by others
  __asm
  {
    // Assign pointers to register
    mov       esi, [s]         ; put src addr to esi reg
    mov       edi, [d]         ; put dest addr to edi reg
    mov       ecx, [iCount]    ; put count to ecx reg

    codeloop:
    mov       al, [edi]        ; mov a byte of src data to low
                               ; word of eax register
    add       al, [esi]        ; Add 8 bit from dest ptr to al
    jc        nosave           ; jump if carry flag on
    mov       [edi], al
    nosave:
    inc       esi
    inc       edi
    dec       ecx              ; decrement count by 1
    jnz
    codeloop                   ; jump to codeloop if not 0

  }

  return 1;
}

Utilising MMX

We have eight general-purpose registers in our 80×86/Pentium processor. They are eax, ebx, ecx, edx, esi, edi, ebp, and esp. These registers are used to hold operands for logical and arithmetic operation, and operands for address calculations and memory pointers. Register esp is used to hold stack pointers most of the time, so that leaves us with seven registers. When we have lots of data, we would have to make sure these data are well handled by these seven registers. Besides that, we also have another eight 80-bit registers located in the FPU (Floating Point Unit). We can make use of these eight registers in the FPU if we have MMX feature in our processor. If we use MMX, we could access 64 bits from the 80-bit register in the FPU. Why only 64? Because the Intel manual says so. Anyhow, that’s not an issue as we will be happy with eight additional registers. We name this registers as mm0-mm7.

MMX Technology Execution Environment

So, what can we do with these eight additional MMX registers? They are not going to run any faster, right, because the clock speed of the CPU is fixed to the number of operations it can execute in a particular time. Yes, it is! How? By executing MMX parallel instructions on a preloaded stream of bytes!

SIMD Execution model

Data Types introduced with MMX Technologies

The parallel execution can save some clock cycles per instruction. You can load eight 8-bit, four 16-bit, and two 32-bit registers into the 64-bit MMX register. To load or store them, you use the instruction “MOVx”, where x is the data size. Then, you can perform a calculation by calling the MMX instructions. The MMX instruction sets consist of 47 instructions, grouped into several categories:

  • Data Transfer
  • Arithmetic
  • Comparison
  • Conversion
  • Unpacking
  • Logical
  • Shift
  • Empty mmx state instruction (emms)

Study the following code; eight 8-bit unsigned saturated add operations are performed on two 64-bit MMX registers by executing these lines.

unsigned char mask[8] = {0x01, 0x01, 0x01, 0x01, 0x01, 0x01,
                         0x01, 0x01 }
unsigned char txt[8]  = {0x41, 0x42, 0x43, 0x44, 0x45, 0x46,
                         0x47, 0x48 }
txt                   = "abcdefgh"
__asm {
movq    mm0, txt   ; mm0 = ( 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 )
movq    mm1, mask  ; mm1 = ( 01 | 01 | 01 | 01 | 01 | 01 | 01 | 01 )
paddusb mm0, mm1   ; mm0 = ( 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 )
movq    txt, mm0   ; mov back the data in register to memory.
emms               ; switch back to FPU state.
}

The string txt[], which contains “abcdefgh”, will now be “bcdefghi”.

By using MMX, there are only four instructions executed and eight bytes of data are processed at a time. A typical assembly algorithm without MMX will have to loop eight times, every iteration executing a move instruction and an addition, giving a total of at least 16 instructions. Now what is the speed improvement? I bet you’ll get at least a minimum of 200% in time gain.

In Visual C++ 6.0, the easiest way to make use of MMX is writing assembly code using an inline assembler. The preceding code illustrates this. The list of MMX instruction sets is available on MSDN Web sites. Intrinsics (wrapup function) is available in VC++ .NET.

For this demo, you should at least have a Pentium processor with MMX and above to produce the desired results.

Executing the Demo

Building and executing the demo is pretty straightforward. The source code is presented together with some inline assembler instructions (very well commented). There are also some helper classes and functions.

  • <CElapsed>—provides high-resolution timing using QueryPerformanceCounter().
  • <CMiniView>—Serve additional views for a CDocument class.
  • Some raster drawing codes to assist image display.
  • <CreateNewViews()>—Shows how to create additional views for a typical document.
  • <CheckSSEAvail()>—Checks the availability of SSE presence in your processor. You also can check other features in your processor, such as MMX and HHT.

Test Result

Various kinds of PC platforms and their test results on a 1024×1024 image
Computer and Processor Array Index (in milliseconds) Pointer arithmetic (in milliseconds) MMX Instructions (in milliseconds)
Clone Pentium 2, 400 MHz 46 35 16
Clone Pentium 3(eb) 600 MHz 30.6 24 11
Acer TravMate 332T, Mobile Pentium 2 300 MHz 71 66 58
Acer TravMate 270, Mobile Pentium 4 1.7 GHz 34 20 5.9
Toshiba Satellite 2410, Mobile Pentium 4 1.8 GHz 22 14 4.5
Clone, Pentium 4 1.7 GHz 30 18 5.5
Dell Dimension 2400, Pentium 4 2.2 GHz 18.5 12.5 6

Please note that the speed is affected by several factors. This includes processors clock speed, 1st and 2nd level cache memory size, system bus speed, and DRAM speed. A fast system bus and DRAM will result in faster data transfer into the core register set. FYI, data is fetched from DRAM into the cache memory and then to the processor register set.

Points of Interest

I had recently read through the Alex Farber—Introduction to SSE Programming articles, and found out that he had just posted a similar topic. So, I modified the content and separated it into two parts. The first was to emphasize C++ and MMX and the second describes SSE/SSE2 implementation by using external compilers (Netwide assembler). The version Alex posted is a .NET version of SSE implementation by using intrinsics. So, that makes room for me to post a VC 6.0 version for my next article.

MMX development is meant for intermediate programmers. Some learning curves are required. You have to catch up on Assembly, Intel Processor architectures, MMX instruction sets, assembly optimization, and its terms. I hope this article really helps!

Based on the test results, I found that my laptop, a Toshiba Satellite 2410, seems be the best performer if it’s utilising MMX instructions. It’s bizarre seeing why the Dell Dimension desktop PC with a 2.2ghz processor loses out so much, even to a 1.7 GHz clone system (gigabyte motherboard). I had double confirm this with testing on a few of the Dell systems.

References

You can refer to these helpful sites:

Downloads


Download demo project – 53 Kb

More by Author

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Must Read