Getting Advanced Performance for Multicore Applications in C/C++ and Fortran

Introduction

Maybe you haven’t yet faced the task of modifying code to run efficiently across multiple microprocessor cores, but it won’t be long before you do. Any day now, you may be told to make your company’s current software take full advantage of High Performance Computing (HPC) hardware.

And it’s likely that most new code you write from here on out will need to be optimized to run on multiple processor cores and across multiple processors. What’s driving this trend is the fact that microprocessor companies are no longer working hard to increase clock speeds but are now improving performance primarily by adding more parallel cores.

A modern, eight-core microprocessor will not noticeably improve the performance of a program that is designed to take advantage of only one or two of those cores. But software that takes full advantage of parallel computing is more complex, and has more potential points of failure, than traditional code written to execute serially on a single processor core, so writing and debugging it takes new tools and new skills.

James Reinders, an expert on parallelism and Chief Evangelist of Intel’s Software and Solutions Group, says the performance difference between a single-core processor and a newer dual-core one will not be noticeable, but when your users start moving to quad-core (or newer eight-core) machines, they most certainly will notice a difference between software that takes advantage of multiple cores and software that doesn’t.

Don’t think parallelization is only important for high-end, math-intensive programs. It is now becoming an important feature in new versions of desktop business software. Reinders says, “The latest versions of Microsoft Excel have been modified to utilize parallelism because some people have incredibly complex spreadsheets that do numerical calculations quite deeply.”

Adobe is another company that has jumped on the parallelization bandwagon with both feet. Adobe’s latest Creative Suite is optimized to run on multiple-core processors and across multiple processors, so that as Adobe customers upgrade their hardware they can upgrade their software to take full advantage of recent advances in processor technology.

Fortran Is Still Important

It’s not just packaged commercial software, typically written in C, C++, or the .NET languages supported by Visual Studio, that is being adapted to take advantage of modern hardware, but also Fortran programs of all ages.

In this arena, where computationally intensive tasks are the order of the day, performance increases are often more obvious and more important than in desktop-level software. While Reinders notes that the Fortran market in general is now “flat,” he sees Fortran users turning more and more to automated tools that help them update old programs to run on multiple cores and processors, most notably to Intel® Parallel Studio XE 2011.

A key component of this software suite, for Fortran developers, is Visual Fortran Composer XE, which replaces (and includes) Intel’s Visual Fortran Compiler Professional Edition. Intel’s Fortran compiler is now up to version 12, and Reinders says its performance is almost always better than that of any other proprietary or open source Fortran compiler, even on non-Intel processors.

Intel Math Kernel Library (Intel MKL) and Intel Integrated Performance Primitives (Intel IPP) performance libraries are other important features in Parallel Studio XE 2011, and cluster support is available separately with Intel Cluster Studio 2011.

No Need to Rewrite All Your Code From Scratch

Claire Cates, Principal Developer at SAS Institute, says the most efficient and lowest-stress way to get started with programming for HPC is not to start from scratch but to open the source code of an existing program in the Composer XE module of Parallel Studio XE 2011 and then run Intel VTune™ Amplifier XE 2011. Intel’s Amplifier XE 2011 product brief (pdf) says the tool “…helps the C/C++ and Fortran developer with static and dynamic code analysis by providing threading and memory analysis tools, to develop highly robust, secure, and highly optimized applications.”

The tools in Parallel Studio XE 2011 give you graphical representations of code hotspots and performance bottlenecks. You’ll be able to check for threading and memory errors with little or no effort, and you’ll get an overall performance profile that can help you optimize your code for both serial and parallel processing.

Along the way, you’ll almost certainly want to check in at the Intel Learning Lab, which has tutorials and videos that will help you get the most out of Intel Parallel Studio XE 2011, which is available as a free trial download.

Reinders says, “Almost nobody has the luxury of rewriting their code. We’re consistently worried about how we help people with legacy code, making as few modifications as possible. The other thing is, if you’re new to doing a parallel program, if you write it from scratch you may learn a great deal, but it’s unlikely that you’re going to do an optimal job writing your first parallel program from scratch.

“My advice is to always look for the portion of the program that would gain the most from parallelism and try making modifications in place. Over time you’ll get an appreciation for where you’ll get benefits from restructuring your program.

“In general, it’s not really the language you’re programming in; it’s the approach that you take, the algorithm that you take, that matters. Algorithms require thinking at an architectural level of the program. It’s a skill that, like a lot of things, you get over time. It’s difficult to convey very quickly… we don’t have mind-melds, like in Star Trek, to convey all this information right away. It’s something you gain from experience.”

Intel Parallel Studio XE 2011 has tools that help you find areas of your code that will benefit most from parallelism. Reinders says, “The best way in the world to modify code is to go in and find some places where you can call some of our libraries — our multi-media and math libraries — that will do things in parallel for you and just let the libraries do the work.”

That applies to some programs, but not all, he points out. There are times when programmers may need to put out a little more effort than just calling a library, and may need to actually change part of their program.
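One way to see that library-first approach in practice: the short Fortran sketch below hands a matrix multiplication to the standard BLAS routine DGEMM, which Intel MKL supplies in a threaded form, so the calling code stays serial while the library does the work in parallel. This is a minimal sketch; the matrix size and initialization are arbitrary, and linking against the threaded version of the library is assumed.

   program library_parallelism
     implicit none
     integer, parameter :: n = 256
     real(8) :: a(n,n), b(n,n), c(n,n)

     ! Fill the input matrices with arbitrary data
     call random_number(a)
     call random_number(b)
     c = 0.0d0

     ! C = 1.0*A*B + 0.0*C; the threaded BLAS library parallelizes this internally
     call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

     print *, 'c(1,1) =', c(1,1)
   end program library_parallelism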

At the same time, you can eliminate many security worries. Because of their additional complexity, parallelized applications are subject to some vulnerabilities that serial-only applications are not, but static security analysis is built into the compilers that are part of Intel Parallel Studio XE 2011, and Intel Inspector XE — another component in the software suite — displays the vulnerabilities it finds in an attractive GUI.

These tools are like having an expert guiding you, says Reinders, “as if you had the luxury of having such a person next to you.”
Working Across Platforms

“We’re using [Intel Parallel Studio XE 2011] on both Windows and Linux,” says Cates. “We’re doing a lot of grid computing and high performance computing, and that group is specifically using Linux.”

One of her developers, Cates says, told her that trying to find one bug he was chasing “was like playing Whack-a-Mole.” He spent five days becoming increasingly frustrated as each fix he tried created new problems. But when he used the thread-checking facility built into Intel Parallel Studio XE 2011, “it pulled up the data race immediately. There was one place in his code where he didn’t have the serialization primitives around where he was using a shared resource, and that’s what was causing the problem.”

Data race conditions often don’t show up until the software is stressed by multiple users or is otherwise ramped up in a production environment. But, says Cates, Intel Parallel Studio XE 2011 “can find [problems] before you have to ramp it up or it gets to the customer and they find a problem.”
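Here is a minimal sketch of the kind of bug Cates describes (an illustration, not code from SAS): several threads update a shared total with no serialization primitive around the shared resource, and an OpenMP critical section supplies the missing protection.

   program data_race_sketch
     implicit none
     integer :: i, total

     ! The race: every thread does an unprotected read-modify-write on 'total'
     total = 0
     !$omp parallel do
     do i = 1, 100000
        total = total + 1
     end do
     !$omp end parallel do
     print *, 'racy total:    ', total   ! often comes up short of 100000

     ! The fix: put a serialization primitive around the shared resource
     total = 0
     !$omp parallel do
     do i = 1, 100000
        !$omp critical
        total = total + 1
        !$omp end critical
     end do
     !$omp end parallel do
     print *, 'correct total: ', total   ! always 100000
   end program data_race_sketch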

She points out that just about every Fortune 500 company has SAS software, “and we don’t want to give them code that either performs poorly or that has data-race things or potential deadlocks in it. We do everything in our power to make sure that the code we ship is correct.”

As for Intel Parallel Studio XE 2011 improving performance and security across multiple brands of microprocessors, Cates says SAS runs on both Intel and AMD processors. Some parts of its software run only on Intel processors, but because 95 percent of SAS software is platform-independent, “If we can find a performance bottleneck on the Intel [microprocessors], it fixes that bottleneck on the AMD hardware, too.”

Almost All Programmers Need to Learn Parallelization

“Over the next couple of years, [parallelization] is going to go from being an optional topic to being a required topic for most developers,” Reinders says.

The two critical concepts to understand as you approach parallelism, he says, are scaling and some of the parallel programming errors that can happen, “specifically data races and deadlocks.”
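Data races of the kind shown earlier aren’t the only hazard. The sketch below (again, purely an illustration) shows how a deadlock can arise when two OpenMP threads acquire the same pair of locks in opposite order, leaving each waiting on the lock the other holds.

   program deadlock_sketch
     use omp_lib
     implicit none
     integer(omp_lock_kind) :: lock_a, lock_b

     call omp_init_lock(lock_a)
     call omp_init_lock(lock_b)

     !$omp parallel sections num_threads(2)
     !$omp section
       call omp_set_lock(lock_a)
       call omp_set_lock(lock_b)   ! may wait forever if the other thread holds lock_b
       call omp_unset_lock(lock_b)
       call omp_unset_lock(lock_a)
     !$omp section
       call omp_set_lock(lock_b)
       call omp_set_lock(lock_a)   ! may wait forever if the other thread holds lock_a
       call omp_unset_lock(lock_a)
       call omp_unset_lock(lock_b)
     !$omp end parallel sections

     call omp_destroy_lock(lock_a)
     call omp_destroy_lock(lock_b)
   end program deadlock_sketch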

Scaling, he says, is “an intuitive concept that can change your life. Scaling is a favorite topic of mine. It’s a huge mindset change. You build a program to scale for parallel computing. You don’t build a program to be fast on one core. You build it to scale, and to take advantage of many cores. To me, that’s the most fundamental thing that you need to make intuitive when you’re doing parallel programming.”

There was a time when there were only a few parallel computers in the world. Now forward-looking software vendors are incorporating parallelism in their home and small business programs. And Reinders points out that machines like those that parallel programming specialists at NASA drooled over 15 years ago are nothing in today’s computing world. “You can walk down to Office Depot, and buy a laptop that’s got that much performance now, and with a little effort you can get things that were supercomputers 10 years ago from Dell.com for not a lot of money, and have that much compute power sitting next to your desk.

“It shouldn’t be too surprising that the programming methods to take advantage of those are starting to look like the gyrations someone at NASA would have gone through a decade ago to squeeze performance out of a computer that filled a room.”

Picking Code That Can Be Easily Parallelized

The Fortran code below can be auto-parallelized by a compiler because each iteration is independent of the others, and the final result of array z will be correct regardless of the execution order of the other iterations.

   do i=1, n
     z(i) = x(i) + y(i)
   enddo
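
If you would rather make the parallelism explicit than rely on the auto-parallelizer, the same loop can be annotated with an OpenMP directive. This variant is an added sketch, not part of the original example, and assumes OpenMP support is enabled in your compiler:

   !$omp parallel do
   do i=1, n
     z(i) = x(i) + y(i)
   enddo
   !$omp end parallel do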

On the other hand, the following code cannot be auto-parallelized, because the value of z(i) depends on the result of the previous iteration, z(i-1).

     do i=2, n
       z(i) = z(i-1)*2
     enddo

This does not mean that the code cannot be parallelized. Indeed, it is equivalent to:

     do i=2, n
       z(i) = z(1)*2**(i-1)
     enddo

However, current parallelizing compilers are not usually capable of bringing out this parallelism automatically, and it is questionable whether this code would benefit from parallelization in the first place.

The serial loop examples above are from Wikipedia’s article on Automatic Parallelization.
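
As an added illustration (not part of the Wikipedia examples), the rewritten loop could also be parallelized by hand with an OpenMP directive, since each iteration depends only on z(1); as noted above, though, it is questionable whether doing so would pay off.

     !$omp parallel do
     do i=2, n
       z(i) = z(1)*2**(i-1)
     enddo
     !$omp end parallel do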
