High-Performance Computing, Part 1

Environment: C++/MFC

Introduction

This article first addresses some of the common programmer's algorithm problems when writing a computing-intensive image processing application. It then explains how the existing code can be improved from native C++ code to using MMX acceleration. MMX refers to a feature that exists in the Intel processors, which can boost up multimedia software by executing integer instructions on several data in a parallel fashion. Although the processor supports the acceleration, it does not means your application can make use of it automatically. Hence, your application must be manually written to take advantage of it.

Once the user is able to make use of MMX acceleration, he/she can advance one step further by utilising SSE (Streaming SIMD Extension), which features parallel floating point calculations; and SSE2, which is an upgrade to both floating point and integer calculation. You have to be aware that not all hardware supports the acceleration described. Well, to keep the scope small, I had separated this explanation into another article (will be posted soon).

Now, let's take a look at the list of Intel processors that support each feature.

  • Pentium MMX, Pro, Pentium 2—MMX
  • Pentium 3, Xeon—MMX, SSE
  • Pentium 4 and above—MMX, SSE, SSE2/HHT

The article contents mainly apply to developers using Intel processor hardware. As for AMD users, some of the assembly instructions in MMX are compatible! Prior to this, some basic assembly language knowledge is needed. I will also assume that you have some knowledge on MFC doc/view architecture.

I came out with this article because I think there might be a need to highlight some high performance computing knowledge, especially to those developers who are developing time-critical applications. Image processing is a good example for demo purposes.

In this downloadable demo, you will need to have at least a Pentium processor with MMX to see the results.

Background

In image processing/3D graphics and so forth, performance is critical. The less the time an algorithm takes, the better it is because more functionalities can be squeezed in.

In my examples, I generated two images of size 1024 x 1024. It's a megabyte in size. For the sake of simplicity, both are 8-bit, grayscale images. I created synthetic images because dependencies upon other classes/objects/DLLs are minimum. Image 1 is an image with white vertical lines on a black background. Image 2 is an image with gradient fills of gray.

The two images will be added together to create a third image. The final product would be an image with gradient fill and vertical line. This is shown in the figure at the beginning of the article.

I can easily create the best performing algorithm in the demo, but instead I want to show you the difference between various kinds of code. I had developed four algorithms that demonstrate image addition: two in C++ and two in assembly language. For each of the algorithms, I put in a high-resolution timer to take the elapsed time reading. The final image will be shown onscreen (main window). The two source images are displayed in two mini windows on the right. Then, I present you with the results of different kinds of PCs in the later section of the article.

Writing the Algorithms

For some basic understanding, let's go through this C++ code.

void CMMXDemoDoc::OnRunbasic()
{
  int x=0, y=0, i=0, iGray=0;

  // Start timing
  m_el.Begin();

  // Add two images using direct memory access
  // Assume both images are the same size
  for (y=0; y<m_iHeight; y++)
  {
    for (x=0; x<m_iWidth; x++)
    {
      // calc index
      i = x + y*m_iWidth;
      // add two pixels
      iGray = m_pImg1[i] + m_pImg2[i];
      // keep saturation value
      if (iGray > 255) iGray = 255;
      // save to destination image
      m_pImg[i] = iGray;

    }
  }

  // End Timing
  m_el.End();

  // Force redraw
  UpdateAllViews(0);
}

<m_pImg>, <m_pImg1>, and <m_pImg2> are an one-dimensional array of unsigned char. All images are the same size.

There are two for loops; the outer loop iterates through each line in the image. The inner loop iterates through each pixel in the line. For every pixel, we need to calculate the index in the array, add the pixel from image 1 and image 2, check whether it's saturated (>255), then then store it into the destination image.

What is the problem with this algorithm?

  1. There are lots of arithmetic operations used on addition and there is a multiplication operation as well. Multiply is a bit slower than addition. Try changing the addition operator to multiply and see the differences in the demo.
  2. Array indexing to access random memory data. This causes some slowdown because the processor has to convert from an array index to the actual address in memory.
  3. Two levels of for loops cause the processor to operate stack instructions because ecx (counter register) pushes and pops often. This imposes some processor cycles.

Optimising Using Pointer Arithmetic

void CMMXDemoDoc::OnRunopt() 
{
  // Begin timing
  m_el.Begin();

  // Precalculate the pointers
  BYTE *pSrc = m_pImg2;
  BYTE *pSrcEnd = m_pImg2 + m_iSize;
  BYTE *pDest = m_pImg;
  BYTE *pDestEnd = m_pImg + m_iSize;
  int iGray;

  // Copy from img1 to tmp
  memcpy(m_pImg, m_pImg1, m_iSize);

  // loop each pixel and Add
  while (pSrc < pSrcEnd)
  {
    iGray = *pDest + *pSrc;
    if (iGray > 255) iGray=255;
    *pDest = iGray;
    pSrc++;
    pDest++;
  }

  // End Timing
  m_el.End();
  // Force redraw
  UpdateAllViews(0);
}

Notice that I copied the data to the destination image. I can save a pointer addition because memcpy() is an optimized function from VC (memcpy() still takes time).

The preceding code uses a pointer to access the data. There is nothing more to optimize because the pointer is already storing the data memory address. The pointer moves from one element to the next in a sequential form. The source and destination pointers are incremented by 1 each after the pixel has been calculated.

Converting to Assembly

The following code shows how to write the algorithm in assembly. The elapsed time of the function will be much more constant.

int AsmAdd(BYTE *d, BYTE *s, int w, int h)
{
  int iCount = w * h;

  // we assume all data in the register is not used by others
  __asm
  {
    // Assign pointers to register
    mov       esi, [s]         ; put src addr to esi reg
    mov       edi, [d]         ; put dest addr to edi reg
    mov       ecx, [iCount]    ; put count to ecx reg

    codeloop:
    mov       al, [edi]        ; mov a byte of src data to low
                               ; word of eax register
    add       al, [esi]        ; Add 8 bit from dest ptr to al
    jc        nosave           ; jump if carry flag on
    mov       [edi], al
    nosave:
    inc       esi
    inc       edi
    dec       ecx              ; decrement count by 1
    jnz
    codeloop                   ; jump to codeloop if not 0

  }

  return 1;
}

Utilising MMX

We have eight general-purpose registers in our 80x86/Pentium processor. They are eax, ebx, ecx, edx, esi, edi, ebp, and esp. These registers are used to hold operands for logical and arithmetic operation, and operands for address calculations and memory pointers. Register esp is used to hold stack pointers most of the time, so that leaves us with seven registers. When we have lots of data, we would have to make sure these data are well handled by these seven registers. Besides that, we also have another eight 80-bit registers located in the FPU (Floating Point Unit). We can make use of these eight registers in the FPU if we have MMX feature in our processor. If we use MMX, we could access 64 bits from the 80-bit register in the FPU. Why only 64? Because the Intel manual says so. Anyhow, that's not an issue as we will be happy with eight additional registers. We name this registers as mm0-mm7.

MMX Technology Execution Environment

So, what can we do with these eight additional MMX registers? They are not going to run any faster, right, because the clock speed of the CPU is fixed to the number of operations it can execute in a particular time. Yes, it is! How? By executing MMX parallel instructions on a preloaded stream of bytes!

SIMD Execution model

Data Types introduced with MMX Technologies

The parallel execution can save some clock cycles per instruction. You can load eight 8-bit, four 16-bit, and two 32-bit registers into the 64-bit MMX register. To load or store them, you use the instruction "MOVx", where x is the data size. Then, you can perform a calculation by calling the MMX instructions. The MMX instruction sets consist of 47 instructions, grouped into several categories:

  • Data Transfer
  • Arithmetic
  • Comparison
  • Conversion
  • Unpacking
  • Logical
  • Shift
  • Empty mmx state instruction (emms)

Study the following code; eight 8-bit unsigned saturated add operations are performed on two 64-bit MMX registers by executing these lines.

unsigned char mask[8] = {0x01, 0x01, 0x01, 0x01, 0x01, 0x01,
                         0x01, 0x01 }
unsigned char txt[8]  = {0x41, 0x42, 0x43, 0x44, 0x45, 0x46,
                         0x47, 0x48 }
txt                   = "abcdefgh"
__asm {
movq    mm0, txt   ; mm0 = ( 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 )
movq    mm1, mask  ; mm1 = ( 01 | 01 | 01 | 01 | 01 | 01 | 01 | 01 )
paddusb mm0, mm1   ; mm0 = ( 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 )
movq    txt, mm0   ; mov back the data in register to memory.
emms               ; switch back to FPU state.
}

The string txt[], which contains "abcdefgh", will now be "bcdefghi".

By using MMX, there are only four instructions executed and eight bytes of data are processed at a time. A typical assembly algorithm without MMX will have to loop eight times, every iteration executing a move instruction and an addition, giving a total of at least 16 instructions. Now what is the speed improvement? I bet you'll get at least a minimum of 200% in time gain.

In Visual C++ 6.0, the easiest way to make use of MMX is writing assembly code using an inline assembler. The preceding code illustrates this. The list of MMX instruction sets is available on MSDN Web sites. Intrinsics (wrapup function) is available in VC++ .NET.

For this demo, you should at least have a Pentium processor with MMX and above to produce the desired results.

Executing the Demo

Building and executing the demo is pretty straightforward. The source code is presented together with some inline assembler instructions (very well commented). There are also some helper classes and functions.

  • <CElapsed>—provides high-resolution timing using QueryPerformanceCounter().
  • <CMiniView>—Serve additional views for a CDocument class.
  • Some raster drawing codes to assist image display.
  • <CreateNewViews()>—Shows how to create additional views for a typical document.
  • <CheckSSEAvail()>—Checks the availability of SSE presence in your processor. You also can check other features in your processor, such as MMX and HHT.

Test Result

Various kinds of PC platforms and their test results on a 1024x1024 image
Computer and Processor Array Index (in milliseconds) Pointer arithmetic (in milliseconds) MMX Instructions (in milliseconds)
Clone Pentium 2, 400 MHz 46 35 16
Clone Pentium 3(eb) 600 MHz 30.6 24 11
Acer TravMate 332T, Mobile Pentium 2 300 MHz 71 66 58
Acer TravMate 270, Mobile Pentium 4 1.7 GHz 34 20 5.9
Toshiba Satellite 2410, Mobile Pentium 4 1.8 GHz 22 14 4.5
Clone, Pentium 4 1.7 GHz 30 18 5.5
Dell Dimension 2400, Pentium 4 2.2 GHz 18.5 12.5 6

Please note that the speed is affected by several factors. This includes processors clock speed, 1st and 2nd level cache memory size, system bus speed, and DRAM speed. A fast system bus and DRAM will result in faster data transfer into the core register set. FYI, data is fetched from DRAM into the cache memory and then to the processor register set.

Points of Interest

I had recently read through the Alex Farber—Introduction to SSE Programming articles, and found out that he had just posted a similar topic. So, I modified the content and separated it into two parts. The first was to emphasize C++ and MMX and the second describes SSE/SSE2 implementation by using external compilers (Netwide assembler). The version Alex posted is a .NET version of SSE implementation by using intrinsics. So, that makes room for me to post a VC 6.0 version for my next article.

MMX development is meant for intermediate programmers. Some learning curves are required. You have to catch up on Assembly, Intel Processor architectures, MMX instruction sets, assembly optimization, and its terms. I hope this article really helps!

Based on the test results, I found that my laptop, a Toshiba Satellite 2410, seems be the best performer if it's utilising MMX instructions. It's bizarre seeing why the Dell Dimension desktop PC with a 2.2ghz processor loses out so much, even to a 1.7 GHz clone system (gigabyte motherboard). I had double confirm this with testing on a few of the Dell systems.

References

You can refer to these helpful sites:

Downloads

Download demo project - 53 Kb


Comments

  • The Secret master the nike-scene Is Fairly Straight forward!

    Posted by Acuddence on 05/02/2013 11:44pm

    Progressive questions about nike clarified in addition to the reason why you really should try to read through every message within this story.[url=http://www.nikejpgolf.biz/]nike ゴルフ[/url] Ones double take on nike [url=http://www.nikejpgolf.biz/nike-ゴルフボール-c-23.html]ナイキ ボール[/url] Hot questions regarding nike have been answered and as a consequence why you really need to start reading every term of this specific insider report. [url=http://www.nikejpgolf.biz/nike-アイアン-c-1.html]ナイキゴルフ[/url] Independent brief article lets out 4 brand new stuff over nike that absolutely no one is discussing. [url=http://www.nikejpgolf.biz/nike-アイアン-c-1.html]ナイキクラブ[/url] How the nike Endeavor Call - Workers who cares about nada is declared the winner?! [url=http://www.nikejpgolf.biz/nike-ゴルフシューズ-c-15.html]nike dunk[/url] Stuff and construction in Sin City - - nike basically leaves with no hasta la vista [url=http://www.nikeyasuyi.com/]ナイキ スニーカー[/url] Products and release in The philipines - - mizuno actually leaves without good-bye [url=http://www.nikeyasuyi.com/nikeナイキRunning-c-3.html]ナイキランニング[/url] Some of the mizuno Firm Dialogue - The Employees Who cares for next to nothing benefits?? [url=http://www.nikeyasuyi.com/nikeナイキDunk-c-9.html]nike シューズ[/url] Our nike Home business Presentation : People who cares for virtually nothing benefits? [url=http://www.nikeyasuyi.com/nikeナイキDunk-c-9.html]nike dunk[/url] nike is giving new life span for an old topic-- defacto primary

    Reply
  • The Secrets If you like to master the nike-scene Is Rather Basic!

    Posted by Acuddence on 04/26/2013 09:31am

    Progressive queries about mizuno replied to and the reasons you must absolutely view every word of this specific post.[url=http://www.nikejpgolf.biz/]ゴルフ ナイキ[/url] An essential double strain on nike [url=http://www.nikejpgolf.biz/nike-ゴルフボール-c-23.html]ナイキ ボール[/url] Interesting queries about nike addressed and as a consequence reasons why you have to look at every single word in this write up. [url=http://www.nikejpgolf.biz/nike-アイアン-c-1.html]ナイキクラブ[/url] Independent page shows Some brand new stuff concerning nike that no company is bringing up. [url=http://www.nikejpgolf.biz/nike-アイアン-c-1.html]ゴルフ ナイキ[/url] The most important mizuno Organisation Talk : Buyers who cares for little or nothing benefits? [url=http://www.nikejpgolf.biz/nike-ゴルフシューズ-c-15.html]nike air force 1[/url] Units and construction in Houston - - nike has left without any kind regards [url=http://www.nikeyasuyi.com/]nike free[/url] Devices and developing in Michigan - - nike has left with no good-bye [url=http://www.nikeyasuyi.com/nikeナイキRunning-c-3.html]ナイキ ランニングシューズ[/url] All the mizuno Endeavor Meaning : Users who likes nothing wins?!? [url=http://www.nikeyasuyi.com/nikeナイキDunk-c-9.html]nike シューズ[/url] The nike Home business Chat - - Those Who loves absolutely nothing triumphs?! [url=http://www.nikeyasuyi.com/nikeナイキDunk-c-9.html]nike dunk[/url] mizuno is giving spanking new life for an old matter. . . metallic standards

    Reply
  • Bad article

    Posted by Legacy on 01/29/2004 12:00am

    Originally posted by: Stanislav

    This article is not very good. It is possible to find a much better article on MMX through Google!

    Reply
  • Finally, an MMX example I can understand

    Posted by Legacy on 11/04/2003 12:00am

    Originally posted by: PeterGV

    Dude, you have my eternal gratitude.
    
    

    With your dumb example of adding two vectors of 8 bytes together, finally there's an MMX example that I can understand. Heck, I'm just a windows driver writer who knows nothing about graphics or media, or...

    Folks reading this might be interested to know that your example:
    __asm {
    movq mm0, txt
    movq mm1, mask
    paddusb mm0, mm1
    movq txt, mm0
    emms
    }

    Becomes:

    _asm {
    movq xmm0, qword ptr [txt]
    paddusb xmm0, qword ptr [mask]
    movq qword ptr [txt], xmm0
    }

    using SSE/SSE2 with in-line assembler and (more or less):

    *(__m128i *)txt = _mm_adds_epu8(*(__m128i *)mask,*(__m128i *)txt);

    using the SSE intrinsics (though, clearly, the sse intrinsic version operates on 16 bytes of data, not 8).

    P

    Reply
  • Optimising Using Pointer Arithmetic?

    Posted by Legacy on 08/14/2003 12:00am

    Originally posted by: immo

    There is not to much difference between using "array style" (brackets) and pointers. Pointers are HARDER to be optimized by optimizer, so i think it's even better to use "array style".
    
    

    BYTE *pSrcEnd = m_pImg2 + m_iSize;
    BYTE *pDestEnd = m_pImg + m_iSize;
    int iGray;

    // Copy from img1 to tmp
    memcpy(m_pImg, m_pImg1, m_iSize);

    m_iSize = -m_iSize;

    // loop each pixel and Add
    do {
    iGray = pDestEnd[iSize] + pSrcEnd[iSize];
    if (iGray > 255) iGray=255;
    pDestEnd[iSize] = iGray;
    } while ( iSize++ ); // asm = inc reg; jnz top_of_loop

    This code is better because:
    1. Only one INC operation.
    2. There is no need to use CMP instruction (INC sets ZERO flag)

    Reply
  • Very good

    Posted by Legacy on 07/31/2003 12:00am

    Originally posted by: Stanislav

    Hi !

    IMHO, this article is in top 5 on codeguru+codeproject for past 6 months !

    Waiting for next :-)

    Best regards !
    Stas.

    Reply
  • Excellent

    Posted by Legacy on 07/30/2003 12:00am

    Originally posted by: cyberman_77

    Great article, Waiting for your next posting.

    Reply
Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Live Event Date: December 11, 2014 @ 1:00 p.m. ET / 10:00 a.m. PT Market pressures to move more quickly and develop innovative applications are forcing organizations to rethink how they develop and release applications. The combination of public clouds and physical back-end infrastructures are a means to get applications out faster. However, these hybrid solutions complicate DevOps adoption, with application delivery pipelines that span across complex hybrid cloud and non-cloud environments. Check out this …

  • CentreCorp is a fully integrated and diversified property management and real estate service company, specializing in the "shopping center" segment, and is one of the premier retail service providers in North America. Company executives travel a great deal, carrying a number of traveling laptops with critical current business data, and no easy way to back up to the network outside the office. Read this case study to learn how CentreCorp implemented a suite of business continuity services that included …

Most Popular Programming Stories

More for Developers

RSS Feeds