Handling Multiple Processors in Your Code Using RapidMind

Think back about a decade: remember the standard trick for speeding up your computer? Yes, overclocking, which squeezes better performance out of existing hardware, mainly processors. It is still around because software performance has traditionally benefited from processors' increasing clock speeds.

Those of us in the High Performance Computing (HPC) industry (commonly game developers, financial application developers, and people developing real-time data analysis, 3D visualization, or medical imaging applications) are always looking for every bit of performance that can make a significant difference in our products. But simple overclocking no longer works; processor vendors have all changed course and now add multiple cores to increase performance.

If you are not a veteran of the HPC industry, perhaps the notion of multi-core programming freaks you out and raises a lot of questions and fears. Complexity, low-level interfaces, the need to understand specific hardware architectures, multi-threading issues such as deadlocks, and being tied to one hardware platform are all concerns. In a competitive market, the price you have to pay in cost, effort, and time might be enough to make you defer multi-core programming for now.

Don't you think it would be better if there were an SDK that could take care of all this (okay, not all, but most of it) multi-core complexity? The answer is yes, there is. In fact, there are not one but two solutions:

  • PeakStream
  • RapidMind

PeakStream is no longer available, however, because Google has acquired it, which leaves RapidMind as the only one in the race. The release of the RapidMind Multi-core Software Platform brought new language features, an improved runtime API, and support for the Cell BE along with NVIDIA and ATI/AMD GPUs, with comprehensive support for Windows and Linux-based development. The RapidMind Multi-core Software Platform allows software developers to embrace multi-core processors, including GPUs and the Cell BE, to deliver higher performing software with an order of magnitude less effort.

The RapidMind Multi-core Software Platform is a software development platform that allows developers to use standard C++ programming to create high-performance and massively parallel applications or to extend existing applications to run on high-performance processors, including CPUs, GPUs, or Cell BE. The RapidMind Multi-core Software Platform is not a separate IDE, but instead works with your current IDE to provide immediate ease of use. You are given a package of header files, libraries, samples, and documentation to use in your applications.

The RapidMind Multi-core Software Platform lets you develop the application just like any other single-threaded application, without the challenges of understanding the processor hardware or complex parallel programming techniques. It executes and manages platform-specific computations and data across all cores, with the following hardware, OS, and compiler support.

GPU
  • NVIDIA GeForce® 6000, 7000, or 8000 series cards
  • NVIDIA Quadro® card with Shader Model 3.0 support (for example, Quadro FX5500)
  • ATI™ x1X00 family of cards
  • ATI HD 2900 cards
Cell
  • Native Cell Broadband Engine™ hardware (for example, IBM® Cell Blade) or Cell BE Simulator software
  • Cell on Sony PlayStation®3 using Yellow Dog™ Linux
Operating Systems
  • Windows XP Pro
  • Windows Vista
  • Red Hat™ Enterprise Linux 4
  • Fedora™ Core 4 Linux
  • Fedora Core 5 Linux
  • Ubuntu 6.10
  • Yellow Dog Linux 5 on Sony PlayStation®3
Compilers
  • Microsoft® Visual C++® 7 or 8 under Windows
  • GCC 4 under Linux

The Developer Edition of RapidMind Multi-core Software Platform for 32- and 64-bit systems is available to download free from http://www.rapidmind.net/downloadeval.php.

Traditional Multi-Threading Model

Figure 1 shows a graphical representation of the traditional multi-threading model used to get performance from multiple cores.
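
For comparison, a hand-threaded version of an element-wise array addition might look like the following minimal sketch. It uses standard C++11 std::thread rather than RapidMind, and the function names add_range and threaded_add are made up for illustration. Even for this trivial job, the code has to partition the data, launch the workers, and join them explicitly.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Adds a[i] + b[i] into out[i] for every index in [begin, end).
void add_range(const std::vector<float>& a, const std::vector<float>& b,
               std::vector<float>& out, std::size_t begin, std::size_t end)
{
   for (std::size_t i = begin; i < end; ++i)
      out[i] = a[i] + b[i];
}

// Splits the element-wise add into one contiguous chunk per worker thread.
// Assumes a and b have the same length (hypothetical helper, not a library API).
std::vector<float> threaded_add(const std::vector<float>& a,
                                const std::vector<float>& b,
                                unsigned num_threads)
{
   const std::size_t n = a.size();
   std::vector<float> out(n);
   if (num_threads == 0) num_threads = 1;

   // Round up so that every element lands in some chunk.
   const std::size_t chunk = (n + num_threads - 1) / num_threads;
   std::vector<std::thread> workers;

   for (unsigned t = 0; t < num_threads; ++t)
   {
      const std::size_t begin = std::min(n, t * chunk);
      const std::size_t end   = std::min(n, begin + chunk);
      if (begin == end) break;   // no elements left for further workers
      workers.emplace_back(add_range, std::cref(a), std::cref(b),
                           std::ref(out), begin, end);
   }

   // Each worker writes a disjoint range, so joining is the only sync needed.
   for (std::thread& w : workers) w.join();
   return out;
}
```

Calling threaded_add(input1, input2, 4) on two equal-length vectors returns their element-wise sum. The chunking, thread lifetime, and any platform-specific tuning remain the caller's problem, which is the burden the rest of this article tries to remove.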

Figure 1

RapidMind Multi-core Software Platform (RMDP)

The RapidMind Multi-core Software Platform is presented as an advanced dynamic compiler and runtime management system for parallel processing. It has a sophisticated interface embedded within standard ISO C++. It can be used to express arbitrary parallel computations, but it is not a new language. Instead, it merely adds a new vocabulary to standard ISO C++: a set of nouns (types) and verbs (operations). A user of the RapidMind Multi-core Software Platform writes C++ code in the usual way, but uses specific types for numbers, vectors of numbers, matrices, and arrays. In immediate mode, operations on these values can be executed on the host processor, in the manner of a simple operator-overloaded matrix-vector library. In this mode, the RapidMind Multi-core Software Platform simply reflects standard practice in numerical programming under C++.

However, the RapidMind Multi-core Software Platform also supports a unique retained mode. In this mode, operations are recorded and dynamically compiled into a "program object" rather than being immediately executed. These program objects can be used as functions in the host program. Program objects mimic the behavior of native C++ functions, including support for modularity and scope, so standard C++ object-oriented programming techniques can be leveraged. It should be noted that at runtime, program objects only execute the numerical computations they have recorded, and can completely avoid any overhead due to the object-oriented nature of the specification. The platform uses C++ only as scaffolding to define computations, but rips away this scaffolding for more efficient runtime execution.
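
The record-then-execute idea behind retained mode can be illustrated in plain C++. The sketch below is not RapidMind code and makes no claim about how the platform is implemented; Instr, Recording, Sym, and BinaryProgram are hypothetical names. Overloaded operators append instructions to a list instead of computing anything, and the resulting list is later replayed over whole arrays.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// One recorded operation: regs[dst] = regs[lhs] (op) regs[rhs].
struct Instr { char op; int dst, lhs, rhs; };

// Hypothetical "program object" contents: a flat instruction list.
struct Recording
{
   std::vector<Instr> code;
   int num_regs = 0;
   int new_reg() { return num_regs++; }
};

// Symbolic value: arithmetic on it is recorded instead of being evaluated.
struct Sym
{
   Recording* rec;
   int reg;
};

inline Sym record_op(char op, Sym x, Sym y)
{
   Sym r{x.rec, x.rec->new_reg()};
   x.rec->code.push_back({op, r.reg, x.reg, y.reg});
   return r;
}
inline Sym operator+(Sym x, Sym y) { return record_op('+', x, y); }
inline Sym operator*(Sym x, Sym y) { return record_op('*', x, y); }

// Capture: run the body once on symbolic inputs (registers 0 and 1).
struct BinaryProgram
{
   Recording rec;
   int result_reg = 0;

   explicit BinaryProgram(const std::function<Sym(Sym, Sym)>& body)
   {
      Sym a{&rec, rec.new_reg()};
      Sym b{&rec, rec.new_reg()};
      result_reg = body(a, b).reg;
   }

   // Replay the recorded instructions for every pair of elements.
   // Assumes in1 and in2 have the same length.
   std::vector<float> operator()(const std::vector<float>& in1,
                                 const std::vector<float>& in2) const
   {
      std::vector<float> out(in1.size());
      std::vector<float> regs(rec.num_regs);
      for (std::size_t i = 0; i < in1.size(); ++i)
      {
         regs[0] = in1[i];
         regs[1] = in2[i];
         for (const Instr& ins : rec.code)
            regs[ins.dst] = (ins.op == '+') ? regs[ins.lhs] + regs[ins.rhs]
                                            : regs[ins.lhs] * regs[ins.rhs];
         out[i] = regs[result_reg];
      }
      return out;
   }
};
```

Constructing BinaryProgram prg([](Sym a, Sym b) { return a + b; }); records a single add instruction, and prg(x, y) then replays it for each element of two equal-length vectors. The C++ lambda runs only once, during capture, which is the sense in which the C++ code acts as scaffolding that is gone by the time the numerical work executes.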

Figure 2

By using existing C++ compilers and programming environments (IDEs), application developers using RapidMind are given a small set of types to create parallel programs within their existing C++ application:

  • Value: Contains fixed-length data, similar to the primitive types such as float and int in C++
  • Array: Contains RapidMind values, like C arrays or C++ vectors
  • Program: Encapsulates a computation, in the same way that a C++ function does

When using the RapidMind Multi-core Software Platform, developers continue to program in C++. After identifying components of their application to accelerate, the overall process of integration is as follows:

  1. Replace types: The developer replaces numerical types representing floating point numbers and integers with the equivalent RapidMind Multi-core Software Platform types.
  2. Capture computations: While the user's application is running, sequences of numerical operations invoked by the user's application can be captured, recorded, and dynamically compiled to a program object by the RapidMind Multi-core Software Platform.
  3. Stream execution: The RapidMind Multi-core Software Platform runtime is used for managed parallel execution of program objects on the target hardware platform, which can be a GPU, the Cell processor, or a multi-core CPU.

How difficult is it to use RapidMind? You can find out. You can make a simple program first and then port it to RapidMind.

#include <iostream>

int main()
{
   // 1. Preparing data

   float input1[ 10000 ];
   float input2[ 10000 ];

   for ( int i = 0 ; i < 10000 ; ++i )
   {
      input1[ i ] = i;
      input2[ i ] = i * 2;
   }

   // 2. Performing computation

   float results[ 10000 ];

   for ( int i = 0; i < 10000 ; ++i )
   {
      results[ i ] = input1[ i ] + input2[ i ];
   }

   // 3. Showing results

   for ( int i = 0; i < 10000 ; ++i )
   {
      std::cout    << "output[" << i << "] = ("
                   << results[ i ] << ")"
                   << std::endl;
   }

   return 0;
}

Pretty simple, isn't it? You just declare two float arrays, fill them with data, add them, and display the results. Now, move the above code to the RapidMind Development Platform and see how difficult it is.

#include <rapidmind/platform.hpp>
#include <iostream>

using namespace rapidmind;

int main()
{
   // General initialization of the platform
   rapidmind::init();

   // Optionally select specific backends. We'll let the platform
   // decide on the best one to use by not including any
   // use_backend lines.
   // use_backend("glsl");
   // use_backend("cell");
   // use_backend("cc");

   // 1. Preparing data

   // float input1[ 10000 ];
   // float input2[ 10000 ];

   Array< 1 , Value1f> input1( 10000 );
   Array< 1 , Value1f> input2( 10000 );

   // Access the internal arrays where the data is stored
   float * input_data1 = input1.write_data();
   float * input_data2 = input2.write_data();

   for ( int i = 0 ; i < 10000 ; ++i )
   {
   //      input1[ i ] = i;
   //      input2[ i ] = i * 2;

           input_data1[ i ] = i;
           input_data2[ i ] = i * 2;
   }

   // 2. Performing computation

   // float results[ 10000 ];
   Array< 1 , Value1f > output;

   // for ( int i = 0; i < 10000 ; ++i )
   // {
   //    results[ i ] = input1[ i ] + input2[ i ];
   // }

   // The stream program that will be executed on the data
   Program prg = RM_BEGIN {
      In<Value1f> a;    // first input
      In<Value1f> b;    // second input
      Out<Value1f> c;   // output

      c = a + b;              // operation on the data
   } RM_END;

   // Execute the stream program
   output = prg(input1, input2);

   // 3. Showing results
   const float* results = output.read_data();

   for ( int i = 0; i < 10000 ; ++i )
   {
      std::cout      << "output[" << i << "] = ("
                     << results[ i ] << ")"
                     << std::endl;
   }

}

Now, dissect the code above for a better understanding.

#include <rapidmind/platform.hpp>

using namespace rapidmind;

The first line includes the main header file for the RapidMind platform and is necessary for applications using the platform. The second line is optional but makes it unnecessary to specify rapidmind:: in front of functions and types included with the RapidMind platform.

rapidmind::init();

Here, you just initialized RMDP.

// Optionally select specific backends. We'll let the platform
// decide on the best one to use by not including any use_backend
// lines.
// use_backend("glsl");
// use_backend("cell");
// use_backend("cc");

As an optional step, one or more backends can be specified. The choice of backend determines how and with which hardware RapidMind programs will be run. If you skip this step, the RapidMind platform will pick the best available backend. If you specify a backend, the RapidMind platform will use only that backend. If you specify more than one backend, the RapidMind platform will pick the best one of the specified backends.
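
The selection rule described above can be written down as a small function. This is only a sketch of the rule as stated in the text, not the platform's actual logic: pick_backend is a hypothetical helper, and the ranking of available backends (best first) is assumed to be supplied by the caller because the article does not say how RapidMind ranks them.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// ranked_available: backends usable on this machine, best first (assumed order).
// requested: names passed to use_backend-style calls; empty means "no preference".
// Returns the chosen backend name, or an empty string if nothing matches.
std::string pick_backend(const std::vector<std::string>& ranked_available,
                         const std::vector<std::string>& requested)
{
   for (const std::string& name : ranked_available)
   {
      // With no request, every available backend is a candidate;
      // otherwise only the requested ones are.
      const bool allowed = requested.empty() ||
         std::find(requested.begin(), requested.end(), name) != requested.end();
      if (allowed) return name;
   }
   return std::string();
}
```

With two available backends ranked "glsl" then "cc", an empty request yields "glsl", a request for only "cc" yields "cc", and a request for both yields "glsl" again, because it is the better of the two requested.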

Array< 1 , Value1f> input1( 10000 );
Array< 1 , Value1f> input2( 10000 );

Here, you created two arrays. These arrays will hold the 10,000 elements that the stream program will operate on. As you see, Array is a simple template class. The first template parameter specifies the dimensionality of the array (one, two, or three), and the second parameter specifies the element type of which the array holds a collection.

Value1f is an RMDP value type; the 1 means that this value contains a single scalar. Values can contain any fixed number of scalars, but they usually contain between one and four elements. The f stands for float. Values can contain any standard C++ type. Values are templated; Value1f is actually a typedef for Value<1, float>. Typedefs for up to four elements of all basic types are provided by the platform. These arrays will be passed as input to the stream program. The RapidMind platform automatically allocates memory for the arrays.
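
To make the naming scheme concrete, here is a minimal stand-in for a templated fixed-length value type with the typedefs the paragraph describes. It is a plain C++ sketch for illustration, not the RapidMind header: the member layout and the operator shown are assumptions, and apart from Value1f the typedef names simply follow the pattern described in the text.

```cpp
#include <cstddef>

// Stand-in for a fixed-length value: N scalars of type T (illustrative only).
template <std::size_t N, typename T>
struct Value
{
   T data[N];

   T&       operator[](std::size_t i)       { return data[i]; }
   const T& operator[](std::size_t i) const { return data[i]; }
};

// Component-wise addition, evaluated immediately on the host.
template <std::size_t N, typename T>
Value<N, T> operator+(const Value<N, T>& x, const Value<N, T>& y)
{
   Value<N, T> r{};
   for (std::size_t i = 0; i < N; ++i) r[i] = x[i] + y[i];
   return r;
}

// The digit is the element count and the letter is the scalar type.
typedef Value<1, float> Value1f;
typedef Value<2, float> Value2f;
typedef Value<3, float> Value3f;
typedef Value<4, float> Value4f;
typedef Value<3, int>   Value3i;   // assumed name, following the same pattern
```

With these definitions, Value3f p = {{1.0f, 2.0f, 3.0f}}; declares a three-component float value, and p + p adds it component by component, much like the operator-overloaded immediate mode described earlier.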

float * input_data1 = input1.write_data();
float * input_data2 = input2.write_data();

Next, you request two pointers, one per input array, through which you can write initial values for these arrays. These pointers are valid for writing only until the next call to the RapidMind platform that modifies the array.

for ( int i = 0 ; i < 10000 ; ++i )
{
   input_data1[ i ] = i;
   input_data2[ i ] = i * 2;
}

Here, you just filled the arrays.

Array< 1 , Value1f > output;

Next, you create an array to receive the output from the program. Note that you didn't specify the size of the array; the array does not need to be declared with a particular size because it will be replaced completely when the results of the stream computation are placed in it.

   Program prg = RM_BEGIN {
      In<Value1f> a;     // first input
      In<Value1f> b;     // second input
      Out<Value1f> c;    // output

      c = a + b;         // operation on the data
   } RM_END;

In a normal C++ application, the computation happens immediately in the same thread. With RMDP, however, the computation is captured and stored in the program object, prg, which can be used later to compute the sum of two numbers. When a program object is defined, every computation on RMDP types between the RM_BEGIN and RM_END statements is collected and stored within the object. This process happens at runtime. This runtime compilation mechanism is powerful: the generated code is optimized for the exact conditions it is being run under, which is why RMDP-generated code can outperform plain C or manually optimized assembly code in many cases. Although programs can be defined in any function, a program is generally defined in the constructor of a class encapsulating some computation.

Here, now that your input and output arrays are declared, you need to define the computation to be executed. You define a stream program that will be executed on the data. The first input is In<Value1f> a, the second input is In<Value1f> b, and the output is Out<Value1f> c. The two inputs are added to calculate the output.

output = prg(input1, input2);

Here, you pass the inputs to the program, executing it and placing the result in the output array. The first array is connected to the first program input and the second array is connected to the second program input.

const float* results = output.read_data();

for ( int i = 0; i < 10000 ; ++i )
{
   std::cout      << "output[" << i << "] = ("
                  << results[ i ] << ")"
                  << std::endl;
}

And finally, here you just displayed the results to the screen.

Sounds trivial, doesn't it? Of course, in real life you'll have more complicated applications than just adding two arrays, and a lot more to explore.

Conclusion

Hardware design has shifted from increasing clock speed to a focus on multi-core processing. If you are part of the High Performance Computing industry in any form, whether as an enthusiastic game developer trying to implement real-time physics in your game engine (based on either Ageia or Havok), as someone developing a next-generation automated arbitrage system for capital markets, or in any other way, you'll run into a gap in the software ecosystem, which is not yet equipped to tap the enormous performance benefits of multi-core processors. Sooner or later, we all will have to face it, directly or indirectly, and as always, early movers will reap the benefits.
