C++ AMP Implementation
C++ AMP is a new technology being developed by Microsoft for C++ to allow you to easily write code that takes advantage of the computing power of accelerators, such GPUs. For now, C++ AMP requires DirectX 11 GPUs. C++ AMP abstracts the accelerators for you. When using C++ AMP, you don't need to worry about what kind of accelerator is in the system. For example, there is no need to write one implementation for an NVidia GPU and another for an AMD GPU. You just write 1 implementation, and C++ AMP handles the rest. If you have multiple GPUs in your system, C++ AMP can use them all at the same time, and they don't even need to be from the same vendor. For example, if you have a system with both an AMD and an NVidia GPU in it, C++ AMP can use them both together.
Microsoft has the intention to make C++ AMP an open specification, so that other compiler vendors could also add support for it.
For this simple Mandelbrot renderer, once you have a PPL implementation, it's almost trivial to transform it to C++ AMP. Without further ado, here is the C++ AMP version, with a detailed explanation after the code:
array_view<writeonly<unsigned __int32>, 2> a(m_nBuffHeight,
parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d)
// Formula: zi = z^2 + z0
int x = idx - halfWidth;
int y = idx - halfHeight;
float Z0_i = view_i + y * zoomLevel;
float Z0_r = view_r + x * zoomLevel;
float Z_r = Z0_r;
float Z_i = Z0_i;
float res = 0.0f;
for (int iter = 0; iter < maxiter; ++iter)
float Z_rSquared = Z_r * Z_r;
float Z_iSquared = Z_i * Z_i;
if (Z_rSquared + Z_iSquared > escapeValue)
// We escaped
res = iter + 1 - fast_log(fast_log(fast_sqrt(Z_rSquared +
Z_iSquared))) * invLogOf2;
Z_i = 2 * Z_r * Z_i + Z0_i;
Z_r = Z_rSquared - Z_iSquared + Z0_r;
unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
a[idx] = result;
catch (const Concurrency::runtime_exception& ex)
MessageBoxA(NULL, ex.what(), "Error", MB_ICONERROR);
Everything of C++ AMP is in the concurrency namespace. Certain Direct3D interoperability functionality, such as fast_log(), is in the concurrency::direct3d namespace.
The first thing the C++ AMP implementation does is defining a 2D array_view over the buffer. The array_view class creates a multi-dimensional view over a user supplied buffer. The data of this buffer is copied to and from GPU memory on demand by C++ AMP. The basic use of array_view is:
where type is the type of data in the buffer and dim is the dimensionality of your data. The dimensionality can be anything you want; it can be 1D, 2D, 3D, 4D... When you define your array_view like this, it will be a read and write view over your buffer, which means that initial data is copied from system memory to GPU memory, and the end result is copied from GPU memory back to system memory. If your GPU kernel is only writing to the buffer, the array_view can be defined as a write-only view over your buffer as follows:
This will optimize memory transfers. No initial data is copied from system memory to the GPU memory; data is only copied after finishing the calculations, from GPU memory to system memory.
The opposite is also possible. You can create a read-only array_view as follows:
array_view<const type, dim>
This will only copy initial data from system memory to GPU memory, and will not copy it back from GPU memory to system memory.
The above Mandelbrot implementation only writes to the buffer, so the array_view is defined as write-only.
Another option is to use the array class instead of array_view. The array class allocates a buffer, while array_view creates a view over a user allocated buffer.
After the C++ AMP Mandelbrot implementation has created the array_view, it uses concurrency::parallel_for_each().
The first parameter to the parallel_for_each() is a compute domain, called a grid. If you have an array_view, you can simply use its grid property as shown in the above implementation. For this Mandelbrot renderer, we have a two dimensional array_view, so the grid will also be 2D. On each position of the grid, a computation will be performed.
The second parameter to parallel_for_each() is a lambda expression that specifies restrict(direct3d) to say that the code in the lambda expression should be executed on your GPU, instead of your CPU. The restrict(direct3d) attribute is a compile time check; the compiler will check the code to see whether the code will be able to run on GPUs. For example, if you call a system call in your restrict(direct3d) GPU kernel, the code will not work on the GPU, and compilation will fail. In other words, if you use restrict(direct3d) and it compiles without errors, you are not violating any GPU restrictions and the code is valid to be executed on GPUs. Any function that you call from inside a GPU kernel should also have the restrict(direct3d) specifier. There are a number of restrictions on the code in a restrict(direct3d) function. MSDN lists them as follows:
- The function can call only functions that have the restrict(direct3d) clause.
- The function must be inlinable.
- The function can declare only int, unsigned int, float, and double variables, and classes and structures that contain only these types.
- Lambda functions cannot capture by reference and cannot capture pointers.
- References and single-indirection pointers are supported only as local variables and function arguments.
- The following are not allowed:
- variables declared with the volatile keyword.
- virtual functions.
- pointers to functions.
- pointers to member functions.
- pointers in structures.
- pointers to pointers.
- goto statements.
- labeled statements.
- try, catch, or throw statements.
- global variables.
- static variables. Use tile_static Keyword instead.
- dynamic_cast casts.
- the typeid operator.
- asm declarations.
The lambda expression accepts only one parameter. This parameter is an index into the compute domain (= grid). For the 2D Mandelbrot case, this index is a 2D index denoted as:
You can access each dimensionality of this index using array notation; for this 2D index you can use idx and idx. This gives you the position in the grid. An added bonus of using an index is that the code looks cleaner. In the C++ AMP version of the Mandelbrot renderer there is no ugly buffer index calculation. Compare the C++ AMP version:
a[idx] = result;
with the PPL version and single-threaded version:
pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result;
The lambda expression should capture all variables it needs by value, except concurrency::array objects, which have to be captured by reference. MSDN also states that the following restrictions apply on the object types that can be captured:
Parallel_for_each() will throw exceptions in case of errors. Possible reasons are:
- No pointers or references (exception: array)
- No char, short, or long long types
- No bool types
- Structs or classes that contain supported types are allowed
- No objects that contain virtual functions or virtual bases
- Structs or classes must be PODs (more formally, they must be copyable by blitting)
- Failure to create the shader, which is the code that runs on the accelerator
- Failure to create buffers
- Invalid grid passed
- Mismatched accelerators
The body of the lambda expression supplied to the parallel_for_each() (= kernel) runs asynchronously with the CPU code following the parallel_for_each() call, which means that your CPU code after the parallel_for_each() call continues to execute in parallel with the code inside the body of the parallel_for_each() lambda, until a synchronization point is reached. If your CPU code at a certain point observes the result of the parallel_for_each() kernel, by inspecting the array_view in this example, at that point C++ AMP synchronizes and blocks until the GPU kernel is finished. A synchronize is also forced when the array_view object goes out of scope. However, when the array_view goes out of scope, the synchronize() method is called from within the array_view destructor and thus is not allowed to throw any exceptions. In that case, GPU and CPU exceptions that happen during the synchronization, including during transferring results from GPU memory back to system memory, will be lost. For this reason, it is recommended to manually call synchronize() at a suitable point in time, and properly catch any exceptions.
C++ AMP also includes a function called copy_async() that copies data from a source to a destination. The function returns an std::future, which you can check whenever you want to see whether the copy has finished or not.
To get even more performance out of your accelerators, you can investigate the tiled model, which works up to three dimensions. However, this article will not go deeper on this more advanced feature. You can find an example in this blog post: tile_static, tile_barrier, and tiled matrix multiplication with C++ AMP.
You can use the accelerator and accelerator_view classes to retrieve information from installed accelerators in a system; and use the get_accelerators() function to get a vector of accelerators. This introductory article on C++ AMP does not go deeper into the functionality of these classes. Only a small example is given.
You can get a hold of the default accelerator in a system by using the default constructor of the accelerator class. Once you have this, you can query it for information using the supports() method and the accelerator_restriction enumeration. For example, the following code checks whether the default accelerator supports double precision calculations or not:
cout << "Your C++ AMP default GPU supports double precision.";
cout << "Your C++ AMP default GPU does not support double precision.";