**Wong Shao Voon**on

**December 14th, 2011**

## Introduction

In this article, we will explore vectorizing the pixel alphablending code in my earlier article. I believe most C/C++ programmers have tinkered with SIMD instructions like MMX and Streaming SIMD Extensions 2 (SSE) or even Advanced Vector Extensions (AVX). Most programmers, I believe, gave up after their first SIMD attempt, after that effort produced an mediocre or worse performance than the original non-vectorizing code. The main reason for slow performance is CPU cycles are typically wasted in packing the primitive data into the vector prior to SIMD execution and unpacking them into their proper place in the structure after execution. We will dive deeper into this in the next section. The main reason that SIMD is slow to be adopted by programmers around the world, is because SIMD code is harder to write and the resultant code, even with using intrinsics, are not readable and even less understandable.

In this article, we will only focus on SSE2; the reason being MMX is a 64bit SIMD technology, which does not offer significant speed up compared to SSE2 (mostly 128 bit SIMD from Intel) on today's 64-bit processors, and Advanced Vector Extensions (AVX) (256 bit SIMD from Intel) was not chosen for this article because not many mainstream users own these latest Intel processors that can exploit AVX: SSE2 has been around since 2001. (Okay, I admit, the real reason is actually because I do not own AVX supported processor. I promise to update the article when I have access to a PC with an AVX-enabled processor.) All, but the oldest, personal computers should have housed an x86/x64 processor, which support SSE2. Rather than using inline assembly code, we will use SSE2 intrinsics provided by Visual Studio to assist us in calling the SSE2 instructions: Currently, Microsoft 64 bit C/C++ compiler does not permit inline assembly code.

## Structure of Arrays

struct SomeStruct { int aInteger; double aDouble; short aShort; };

To use SSE2 arithmetic instructions on aInteger, we must first pack 4 aInteger into a 128 bit vector of integer type, __m128i.

After we have the computation result, we must unpack 4 aInteger into its individual structure objects in the array.

This is usually what kills the performance! To enhance performance, programmers are advised to pack their data into structure of arrays rather than array of structures; the same advice is applicable to GPGPU programming so as to allow memory coalescing. It is often not possible to alter the existing data domain/model to unnaturally suit the SIMD data packing requirement. In the structure shown below, arrInteger, arrDouble and arrShort are usually pointers to dynamic allocated arrays whose size is only known during runtime but I show them here to be of array type so as to emphasize they are of array type rather than pointer type.

struct SomeStruct { int arrInteger[1000]; double arrDouble[1000]; short arrShort[1000]; };

As the reader may have noticed, integer, double and short data type have different sizes. 4 integers would fit into __m128i and 2 doubles would fit into __m128i and 8 shorts would fit into __m128i. This might present difficulties, when programmers are writing code to loop through the __m128i arrays to do computation. For our case, this is not a problem, because we are only using unsigned short type in this alphablend article.

## Explanation of the Code

For our SSE2 alphablend code we will, as I have just said, use unsigned short type to store each color channel. The reader may ask why use unsigned short to store a value, which only ranges from 0 to 255? The answer is due to the multiplication used, the intermediate values would most likely exceed 255 while the final result still stays within 255. In normal coding, we can use byte variables to do the same computation, without using 16-bit variable. The reason is because the byte value is automatically promoted to integer (word size of the platform) before any computation takes place and is also automatically converted back to byte after computation. We have declared __m128i array to store unsigned short for each color channel of 2 alphablending images. m_fullsize is to store the full size of the array in bytes. m_arraysize is the array size in terms of a 128-bit sized element. m_remainder is to hold the number to indicate which of the unsigned short are valid in the last 128-bit element if the number of unsigned short required is not cleanly divisible by 8 (8 x 16 bits=128-bits). For example, 4 integers fits into the __m128i, if we have 6 integers to computer, array of 2 __m128i element is needed but on the last __m128i element, only 2 integers is used, thus m_remainder would reflect 2.

__m128i* m_pSrc1R; __m128i* m_pSrc1G; __m128i* m_pSrc1B; __m128i* m_pSrc2R; __m128i* m_pSrc2G; __m128i* m_pSrc2B; // the full size of array in bytes int m_fullsize; // the full size of array in 128-bit chunks int m_arraysize; // the remainder of last 128-bit element // which should be filled. int m_remainder;

With GetUShortElement method, we can get unsigned short element from the __m128i array, using an index, which counts in terms of unsigned short elements.

unsigned short& CAlphablendDlg::GetUShortElement( __m128i* ptr, int index) { int bigIndex = index>>3; int smallIndex = index - (bigIndex<<3); return ptr[bigIndex].m128i_u16[smallIndex]; }

With Get128BitElement method, we can get __m128i element from the __m128i array, using a bigIndex, which counts in terms of __m128i elements. bigSize is the size of __m128i array in 128-bit chunks. smallRemainder is the number of valid unsigned short element in the last __m128i element. When we detect that it is the last element and the number of short integer is no clearly divisible by 8, we need to set the unused shorts in the last __m128i element to 1s, so as to prevent division-by-zero error. It is a good practice to do this for every last element returned, regardless of if it is integer type or float type.

__m128i& CAlphablendDlg::Get128BitElement( __m128i* ptr, int bigIndex, int bigSize, int smallRemainder) { if(bigIndex != bigSize-1) // not last element return ptr[bigIndex]; else if(smallRemainder==0) // last element return ptr[bigIndex]; else // last element { for(int i=smallRemainder; i<8; ++i) ptr[bigIndex].m128i_u16[i] = 1; return ptr[bigIndex]; } }

We must take care to allocate our __m128i array with _aligned_malloc function so that they will align with 16 bytes boundary as required by 128-bit SSE2 instructions. We calculate and store the remainder if the number of unsigned short required is not cleanly divisible by 8 (as explained before)(**Note**: there are 8 unsigned short (16 bits data type) in 128-bit).

// Calculate the size of 128 bit array m_fullsize = m_BmpData1.Width * m_BmpData1.Height; // divide by 8, to get the size in 128bit. m_arraysize = (m_fullsize) >> 3; // calculate the remainder m_remainder = m_fullsize%8; // if there is remainder, // add 1 to the 128bit m_arraysize if(m_remainder!=0) ++m_arraysize; // since we are using unsigned short (16bit) elements, // let's muliply the total number of bytes needed by 2. m_fullsize = (m_arraysize) * 8 * 2; // Allocate 128bit arrays m_pSrc1R = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16)); m_pSrc1G = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16)); m_pSrc1B = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16)); m_pSrc2R = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16)); m_pSrc2G = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16)); m_pSrc2B = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16));

After allocating our __m128i arrays, we will populate them with the color information from the image object. We use GetUShortElement to get our unsigned short element from __m128i array.

// Populate the SSE2 pointers to array UINT col = 0; UINT stride = m_BmpData1.Stride >> 2; for(UINT row = 0; row < m_BmpData1.Height; ++row) { for(col = 0; col < m_BmpData1.Width; ++col) { UINT index = row * stride + col; BYTE r = (m_pPixels1[index] & 0xff0000) >> 16; BYTE g = (m_pPixels1[index] & 0xff00) >> 8; BYTE b = (m_pPixels1[index] & 0xff); // assign the pixel color to the 128bit arrays GetUShortElement(m_pSrc1R, (int)(index)) = r; GetUShortElement(m_pSrc1G, (int)(index)) = g; GetUShortElement(m_pSrc1B, (int)(index)) = b; } }

Lastly, we must remember to deallocate the __m128i arrays with _aligned_free in destructor of the parent class. This is a good practice to remember to write deallocation code before we begin our main coding.

if( m_pSrc1R ) { _aligned_free(m_pSrc1R); m_pSrc1R = NULL; } if( m_pSrc1G ) { _aligned_free(m_pSrc1G); m_pSrc1G = NULL; } if( m_pSrc1B ) { _aligned_free(m_pSrc1B); m_pSrc1B = NULL; } if( m_pSrc2R ) { _aligned_free(m_pSrc2R); m_pSrc2R = NULL; } if( m_pSrc2G ) { _aligned_free(m_pSrc2G); m_pSrc2G = NULL; } if( m_pSrc2B ) { _aligned_free(m_pSrc2B); m_pSrc2B = NULL; }

Before we run any SSE2 code, we should ensure that SSE2 is supported on the processor first. We can use the SIMD Detector class, written by the author, to do the check.

#include "SIMD.h" SIMD m_simd; if(m_simd.HasSSE2()==false) { MessageBox( "Sorry, your processor does not support SSE2", "Error", MB_ICONERROR); return; }

Now, we have come to the meat of our article: using SSE2 to do alphablending. The formula used is ((SrcColor * Alpha) + (DestColor * InverseAlpha)) >> 8. The code listing below is heavily commented. We use _mm_set1_epi16 instruction to set the same short value (0) in every short element in the __m128i element. _mm_mullo_epi16 instruction is used to multiply the unsigned short and only retain the 16 bits if the final result is overflowed into 32 bits. We used _mm_add_epi16 instruction to adding of unsigned shorts. _mm_srli_epi16 instruction is to do shift right by 8 times to simulate division by 256. We can get away without using division instruction, which is a blessing because SSE2 does not support division for integers. After SSE2 operations, we need to unpack the data back to ARGB format. This unpacking offsets some of the performance gains.

// set alpha 128 bit __m128i nAlpha128 = _mm_set1_epi16 ((short)(Alpha)); // set inverse alpha 128 bit __m128i nInvAlpha128 = _mm_set1_epi16 ((short)(255-Alpha)); // initialize the __m128i variables // so that the compiler shut up about // uninitialized variables. __m128i src1R = _mm_set1_epi16 ((short)(0)); __m128i src1G = _mm_set1_epi16 ((short)(0)); __m128i src1B = _mm_set1_epi16 ((short)(0)); __m128i src2R = _mm_set1_epi16 ((short)(0)); __m128i src2G = _mm_set1_epi16 ((short)(0)); __m128i src2B = _mm_set1_epi16 ((short)(0)); __m128i rFinal = _mm_set1_epi16 ((short)(0)); __m128i gFinal = _mm_set1_epi16 ((short)(0)); __m128i bFinal = _mm_set1_epi16 ((short)(0)); for(int i=0;i<m_arraysize;++i) { // multiply and retain the result in lower 16bits src1R = _mm_mullo_epi16 ( Get128BitElement(m_pSrc1R,i,m_arraysize,m_remainder), nAlpha128); src1G = _mm_mullo_epi16 ( Get128BitElement(m_pSrc1G,i,m_arraysize,m_remainder), nAlpha128); src1B = _mm_mullo_epi16 ( Get128BitElement(m_pSrc1B,i,m_arraysize,m_remainder), nAlpha128); // multiply and retain the result in lower 16bits src2R = _mm_mullo_epi16 ( Get128BitElement(m_pSrc2R,i,m_arraysize,m_remainder), nInvAlpha128); src2G = _mm_mullo_epi16 ( Get128BitElement(m_pSrc2G,i,m_arraysize,m_remainder), nInvAlpha128); src2B = _mm_mullo_epi16 ( Get128BitElement(m_pSrc2B,i,m_arraysize,m_remainder), nInvAlpha128); // Add a and b together rFinal = _mm_add_epi16 (src1R, src2R); gFinal = _mm_add_epi16 (src1G, src2G); bFinal = _mm_add_epi16 (src1B, src2B); // Use shift right by 8 to do division by 256 rFinal = _mm_srli_epi16 (rFinal, 8); gFinal = _mm_srli_epi16 (gFinal, 8); bFinal = _mm_srli_epi16 (bFinal, 8); // unpack the final results into the 8 pixels // if possible int step = i << 3; // multiply by 8 if(i!=m_arraysize-1) // not the last element { pixels[step+0] = 0xff000000 | rFinal.m128i_u16[0] << 16 | gFinal.m128i_u16[0] << 8 | bFinal.m128i_u16[0]; pixels[step+1] = 0xff000000 | rFinal.m128i_u16[1] << 16 | gFinal.m128i_u16[1] << 8 | bFinal.m128i_u16[1]; pixels[step+2] = 0xff000000 | rFinal.m128i_u16[2] << 16 | gFinal.m128i_u16[2] << 8 | bFinal.m128i_u16[2]; pixels[step+3] = 0xff000000 | rFinal.m128i_u16[3] << 16 | gFinal.m128i_u16[3] << 8 | bFinal.m128i_u16[3]; pixels[step+4] = 0xff000000 | rFinal.m128i_u16[4] << 16 | gFinal.m128i_u16[4] << 8 | bFinal.m128i_u16[4]; pixels[step+5] = 0xff000000 | rFinal.m128i_u16[5] << 16 | gFinal.m128i_u16[5] << 8 | bFinal.m128i_u16[5]; pixels[step+6] = 0xff000000 | rFinal.m128i_u16[6] << 16 | gFinal.m128i_u16[6] << 8 | bFinal.m128i_u16[6]; pixels[step+7] = 0xff000000 | rFinal.m128i_u16[7] << 16 | gFinal.m128i_u16[7] << 8 | bFinal.m128i_u16[7]; } // last 128 bit element, not all 16 bit element // are filled with valid value. else if(m_remainder!=0) { for(int smallIndex=0; smallIndex<m_remainder; ++smallIndex) { pixels[i+smallIndex] = 0xff000000 | rFinal.m128i_u16[smallIndex] << 16 | gFinal.m128i_u16[smallIndex] << 8 | bFinal.m128i_u16[smallIndex]; } } // last 128 bit element, all 16 bit element // are filled with valid value. // Because the remainder is zero else { pixels[step+0] = 0xff000000 | rFinal.m128i_u16[0] << 16 | gFinal.m128i_u16[0] << 8 | bFinal.m128i_u16[0]; pixels[step+1] = 0xff000000 | rFinal.m128i_u16[1] << 16 | gFinal.m128i_u16[1] << 8 | bFinal.m128i_u16[1]; pixels[step+2] = 0xff000000 | rFinal.m128i_u16[2] << 16 | gFinal.m128i_u16[2] << 8 | bFinal.m128i_u16[2]; pixels[step+3] = 0xff000000 | rFinal.m128i_u16[3] << 16 | gFinal.m128i_u16[3] << 8 | bFinal.m128i_u16[3]; pixels[step+4] = 0xff000000 | rFinal.m128i_u16[4] << 16 | gFinal.m128i_u16[4] << 8 | bFinal.m128i_u16[4]; pixels[step+5] = 0xff000000 | rFinal.m128i_u16[5] << 16 | gFinal.m128i_u16[5] << 8 | bFinal.m128i_u16[5]; pixels[step+6] = 0xff000000 | rFinal.m128i_u16[6] << 16 | gFinal.m128i_u16[6] << 8 | bFinal.m128i_u16[6]; pixels[step+7] = 0xff000000 | rFinal.m128i_u16[7] << 16 | gFinal.m128i_u16[7] << 8 | bFinal.m128i_u16[7]; } }

## Benchmark Results

Listed below is the formula used in each benchmark.

**Unoptimized Code formula used**

**Optimized Code formula used**

**Optimized Code with Shift formula used**

**SSE2 Optimized Code formula used**

This is timing of doing alphablending 1000 times on a Intel i7 870 at 2.93 Ghz. The performance characteristic is different from what is posted in the earlier article because the benchmarks are done on different PCs. The earlier PC used, has 'retired' from useful service in this world.

(Lower is better) Unoptimized Code : 4679 milliseconds Optimized Code : 2108 milliseconds Optimized Code with Shift : 1931 milliseconds SSE2 Optimized Code : 862 milliseconds

Below is the number of times of speedup we got when compared to the unoptimized version. But comparing the SSE2 speedup to the 2nd runner up (Optimized Code with Shift), the speedup is only 2.2 times, not the 8 times that we expected. So what is the memory consumption that is needed for this SSE2 speedup? 2 bytes for each color channel instead of 1 byte. The original implementation needs 4 bytes to store a color pixel, ARGB. (4 bytes is used because the processor can load and store 32 bit data faster than 24 bit data, which may required shifting it to proper 32 bit boundary before loading inside the 32 bit register). The SSE2 version needs 6 bytes to store a color pixel, RGB (because 2 bytes are used to store each color channel). Individual alpha channel is not stored because they are not used. SSE2 version would need 50% more memory to perform its magic.

(Higher is better) Unoptimized Code : 1x Optimized Code : 2.2x Optimized Code with Shift : 2.4x SSE2 Optimized Code : 5.4x

## Conclusion

We have used SSE2 to speed up from the previous fastest performance by 2 times. The speedup is not dramatic. It is probably not worth the amount of extended effort and unreadable code to do this. To achieve better readability, we could use the vector class, written by Agner Fog of Copenhagen University College of Engineering, which basically extends the ones listed in dvec.h header. These vector class support overloaded operators for -, +, *, / and so on. And they also support the missing operations like multiplication and division for some integer types in SSE2 though those operations are sometimes implemented using the scalar version of instruction (not SSE2, so they have no speed advantage). This eliminates the programmer's effort to implement these unsupported/missing operations himself/herself. Any suggestion to improve the performance further is welcome. We could ramp up performance by using OpenMP or Parallel Patterns Library (PPL) in tandem with SSE2. But that is an article for another day. Any reader interested in a future article about utilizing OpenCL to do alphablending, please drop me a message below. OpenCL is a heterogeneous technology; OpenCL kernel can be run on either CPUs or GPUs.

## Comments

## Interesting

Posted bycircularon09/15/2012 01:07pmThanks for this article. So the key to have some optimization is to pack values of one channel in the same array. Something that should be mentionned is that aligning data and directly loading values with movaps would be faster.

Reply