
Optimizing HPC Applications with Intel Cluster Tools


For your convenience Apress has placed some of the front
matter material after the index. Please use the Bookmarks
and Contents at a Glance links to access them.


Contents at a Glance
About the Authors
About the Technical Reviewers
Acknowledgments
Foreword
Introduction
Chapter 1: No Time to Read This Book?
Chapter 2: Overview of Platform Architectures
Chapter 3: Top-Down Software Optimization
Chapter 4: Addressing System Bottlenecks
Chapter 5: Addressing Application Bottlenecks: Distributed Memory
Chapter 6: Addressing Application Bottlenecks: Shared Memory
Chapter 7: Addressing Application Bottlenecks: Microarchitecture
Chapter 8: Application Design Considerations
Index


Introduction

Let’s optimize some programs. We have been doing this for years, and we still love doing it.
One day we thought, Why not share this fun with the world? And just a year later, here we are.
Oh, you just need your program to run faster NOW? We understand. Go to Chapter 1
and get quick tuning advice. You can return later to see how the magic works.
Are you a student? Perfect. This book may help you pass that “Software Optimization
101” exam. Talking seriously about programming is a cool party trick, too. Try it.
Are you a professional? Good. You have hit the one-stop-shopping point for Intel’s
proven top-down optimization methodology and Intel Cluster Studio that includes
Message Passing Interface* (MPI), OpenMP, math libraries, compilers, and more.
Or are you just curious? Read on. You will learn how high-performance computing
makes your life safer, your car faster, and your day brighter.
And, by the way: You will find all you need to carry on, including free trial
software, code snippets, checklists, expert advice, fellow readers, and more at

HPC: The Ever-Moving Frontier
High-performance computing, or simply HPC, is mostly concerned with
floating-point operations per second, or FLOPS. The more FLOPS you get, the better.
For convenience, FLOPS on large HPC systems are typically counted by the trillions
(tera, or 10 to the power of 12) and by the quadrillions (peta, or 10 to the power of 15): hence,
TeraFLOPS and PetaFLOPS. Performance of stand-alone computers is currently hovering
at around 1 to 2 TeraFLOPS, which is three orders of magnitude below PetaFLOPS. In
other words, you need around a thousand modern computers to get to the PetaFLOPS
level for the whole system. This will not stay this way forever, for HPC is an ever-moving
frontier: ExaFLOPS are three orders of magnitude above PetaFLOPS, and whole countries
are setting their sights on reaching this level of performance now.
We have come a long way since the days when computing started in earnest. Back
then [sigh!], just before WWII, computing speed was indicated by the two hours necessary
to crack the daily key settings of the Enigma encryption machine. It is indicative that
already then the computations were being done in parallel: each of the several “bombs”1
united six reconstructed Enigma machines and reportedly relieved a hundred human
operators from boring and repetitive tasks.


Here and elsewhere, certain product names may be the property of their respective third parties.



Computing has progressed a lot since those heady days. There is hardly a better
illustration of this than the famous TOP500 list.2 Twice a year, the teams running the
most powerful non-classified computers on earth report their performance. This
data is then collated and published in time for two major annual trade shows: the
International Supercomputing Conference (ISC), typically held in Europe in June; and the
Supercomputing (SC), traditionally held in the United States in November.
Figure 1 shows how certain aspects of this list have changed over time.

Figure 1.  Observed and projected performance of the Top 500 systems
(Source: top500.org; used with permission)



There are several observations we can make looking at this graph:3

Performance available in every represented category
is growing exponentially (hence, linear graphs in this
logarithmic representation).


Only part of this growth comes from the incessant
improvement of processor technology, as represented, for
example, by Moore’s Law.4 The other part is coming from
putting many machines together to form still larger machines.


An extrapolation made on the data obtained so far predicts
that an ExaFLOPS machine is likely to appear by 2018. Very
soon (around 2016) there may be PetaFLOPS machines at
personal disposal.

So, it’s time to learn how to optimize programs for these systems.

Why Optimize?
Optimization is probably the most profitable time investment an engineer can make, as
far as programming is concerned. Indeed, a day spent optimizing a program that takes an
hour to complete may decrease the program turn-around time by half. This means that
after 48 runs, you will recover the time invested in optimization, and then move into
the black.
Optimization is also a measure of software maturity. Donald Knuth famously said,
“Premature optimization is the root of all evil,”5 and he was right in some sense. We will
deal with how far this goes when we get closer to the end of this book. In any case, no one
should start optimizing what has not been proven to work correctly in the first place. And
a correct program is still a very rare and very satisfying piece of art.
Yes, this is not a typo: art. Despite zillions of thick volumes that have been written
and the conferences held on a daily basis, programming is still more art than science.
Likewise, for the process of program optimization. It is somewhat akin to architecture: it
must include flight of fantasy, forensic attention to detail, deep knowledge of underlying
materials, and wide expertise in the prior art. Only this combination—and something
else, something intangible and exciting, something we call “talent”—makes a good
programmer in general and a good optimizer in particular.
Finally, optimization is fun. Some 25 years later, one of us still cherishes the
memories of a day when he made a certain graphical program run 300 times faster than
it used to. A screen update that had been taking half a minute in the morning became
almost instantaneous by midnight. It felt almost like love.

The Top-down Optimization Method
Of course, the optimization process we mention is of the most common type—namely,
performance optimization. We will be dealing with this kind of optimization almost
exclusively in this book. There are other optimization targets, going beyond performance
and sometimes hurting it a lot, like code size, data size, and energy.



The good news is that once you know what you want to achieve, the methodology is
roughly the same. We will look into those details in Chapter 3. Briefly, you proceed in
top-down fashion from the higher levels of the problem under analysis (platform,
distributed memory, shared memory, microarchitecture) and iterate in a closed-loop manner
until you exhaust optimization opportunities at each of these levels. Keep in mind that
a problem fixed at one level may expose a problem somewhere else, so you may need to
revisit those higher levels once more.
This approach crystallized quite a while ago. Its previous reincarnation was
formulated by Intel application engineers working in Intel’s application solution centers
in the 1990s.6 Our book builds on that solid foundation, certainly taking some things a tad
further to account for the time passed.
Now, what happens when top-down optimization meets the closed-loop approach?
Well, this is a happy marriage. Every single level of the top-down method can be handled
by the closed-loop approach. Moreover, the top-down method itself can be enclosed
in another, bigger closed loop where every iteration addresses the biggest remaining
problem at any level where it has been detected. This way, you keep your priorities
straight and stay focused.

Intel Parallel Studio XE Cluster Edition
Let there be no mistake: the bulk of HPC is still made up of C and Fortran, MPI, OpenMP,
Linux OS, and Intel Xeon processors. This is what we will focus on, with occasional
excursions into several adjacent areas.
There are many good parallel programming packages around, some of them
available for free, some sold commercially. However, to the best of our absolutely
unbiased professional knowledge, none of them comes anywhere close to
Intel Parallel Studio XE Cluster Edition7 in completeness.
Indeed, just look at what it has to offer—and for a very modest price that does not
depend on the size of the machines you are going to use, or indeed on their number.

Intel Composer XE8 compilers and libraries, including:

Intel Fortran Compiler9

Intel C++ Compiler10

Intel Cilk Plus11

Intel Math Kernel Library (MKL)12

Intel Integrated Performance Primitives (IPP)13

Intel Threading Building Blocks (TBB)14

Intel MPI Benchmarks (IMB)15

Intel MPI Library16

Intel Trace Analyzer and Collector17



Intel VTune Amplifier XE18

Intel Inspector XE19

Intel Advisor XE20

All these riches and beauty work on Linux and Microsoft Windows OS,
sometimes more; support all modern Intel platforms, including, of course, Intel Xeon
processors and Intel Xeon Phi coprocessors; and come at a cumulative discount akin
to the miracles of the Arabian 1001 Nights. Best of all, Intel runtime libraries come
traditionally free of charge.
Certainly, there are good tools beyond Intel Parallel Studio XE Cluster Edition, both
offered by Intel and available in the world at large. Whenever possible and sensible, we
employ those tools in this book, highlighting their relative advantages and drawbacks
compared to those described above. Some of these tools come as open source, some
come with the operating system involved; some can be evaluated for free, while others
may have to be purchased. While considering the alternative tools, we focus mostly on
the open-source, free alternatives that are easy to get and simple to use.

The Chapters of this Book
This is what awaits you, chapter by chapter:

No Time to Read This Book? helps you out on the burning
optimization assignment by providing several proven recipes
out of an Intel application engineer’s magic toolbox.


Overview of Platform Architectures introduces common
terminology, outlines performance features in modern
processors and platforms, and shows you how to estimate
peak performance for a particular target platform.


Top-down Software Optimization introduces the generic
top-down software optimization process flow and the
closed-loop approach that will help you keep the challenge of
multilevel optimization under secure control.


Addressing System Bottlenecks demonstrates how you can
utilize Intel Cluster Studio XE and other tools to discover
and remove system bottlenecks as limiting factors to the
maximum achievable application performance.


Addressing Application Bottlenecks: Distributed Memory
shows how you can identify and remove distributed memory
bottlenecks using Intel MPI Library, Intel Trace Analyzer and
Collector, and other tools.


Addressing Application Bottlenecks: Shared Memory explains
how you can identify and remove threading bottlenecks using
Intel VTune Amplifier XE and other tools.




Addressing Application Bottlenecks: Microarchitecture
demonstrates how you can identify and remove microarchitecture
bottlenecks using Intel VTune Amplifier XE and Intel
Composer XE, as well as other tools.


Application Design Considerations deals with the key tradeoffs
guiding the design and optimization of applications. You will
learn how to make your next program fast from the start.

Most chapters are sufficiently self-contained to permit individual reading in
any order. However, if you are interested in one particular optimization aspect, you
may decide to go through those chapters that naturally cover that topic. Here is a
recommended reading guide for several selected topics:

System optimization: Chapters 2, 3, and 4.

Distributed memory optimization: Chapters 2, 3, and 5.

Shared memory optimization: Chapters 2, 3, and 6.

Microarchitecture optimization: Chapters 2, 3, and 7.

Use your judgment and common sense to find your way around. Good luck!

1. “Bomba_(cryptography),” [Online]. Available:
2. Top500.Org, “TOP500 Supercomputer Sites,” [Online]. Available:
3. Top500.Org, “Performance Development TOP500 Supercomputer
Sites,” [Online]. Available: http://www.top500.org/statistics/
4. G. E. Moore, “Cramming More Components onto Integrated
Circuits,” Electronics, p. 114–117, 19 April 1965.
5. “Knuth,” [Online]. Available: http://en.wikiquote.org/wiki/
6. Intel Corporation, “ASC Performance Methodology - Top-Down/
Closed Loop Approach,” 1999. [Online]. Available:
7. Intel Corporation, “Intel Cluster Studio XE,” [Online]. Available:



8. Intel Corporation, “Intel Composer XE,” [Online]. Available:
9. Intel Corporation, “Intel Fortran Compiler,” [Online]. Available:
10. Intel Corporation, “Intel C++ Compiler,” [Online]. Available:
11. Intel Corporation, “Intel Cilk Plus,” [Online]. Available:
12. Intel Corporation, “Intel Math Kernel Library,” [Online]. Available:
13. Intel Corporation, “Intel Performance Primitives,” [Online].
Available: http://software.intel.com/en-us/intel-ipp.
14. Intel Corporation, “Intel Threading Building Blocks,” [Online].
Available: http://software.intel.com/en-us/intel-tbb.
15. Intel Corporation, “Intel MPI Benchmarks,” [Online]. Available:
16. Intel Corporation, “Intel MPI Library,” [Online]. Available:
17. Intel Corporation, “Intel Trace Analyzer and Collector,” [Online].
Available: http://software.intel.com/en-us/intel-traceanalyzer/.
18. Intel Corporation, “Intel VTune Amplifier XE,” [Online]. Available:
19. Intel Corporation, “Intel Inspector XE,” [Online]. Available:
20. Intel Corporation, “Intel Advisor XE,” [Online]. Available:


Chapter 1

No Time to Read This Book?
We know what it feels like to be under pressure. Try out a few quick and proven optimization
stunts described below. They may provide a good enough performance gain right away.
There are several parameters that can be adjusted with relative ease. Here are the
steps we follow when hard pressed:

Use Intel MPI Library1 and Intel Composer XE2

Got more time? Tune Intel MPI:

Collect built-in statistics data

Tune Intel MPI process placement and pinning

Tune OpenMP thread pinning

Got still more time? Tune Intel Composer XE:

Analyze optimization and vectorization reports

Use interprocedural optimization

Using Intel MPI Library
The Intel MPI Library delivers good out-of-the-box performance for bandwidth-bound
applications. If your application belongs to this popular class, you should feel the
difference immediately when switching over.
If your application has been built for an Intel MPI compatible implementation, such as
MPICH,3 MVAPICH2,4 or IBM POE,5 among others, there is no need to recompile the
application. You can switch by dynamically linking the Intel MPI 5.0 libraries at runtime:

$ source /opt/intel/impi_latest/bin64/mpivars.sh
$ mpirun -np 16 -ppn 2 xhpl

If you use another MPI and have access to the application source code, you can
rebuild your application using Intel MPI compiler scripts (see the example after this list):

Use mpicc (for C), mpicxx (for C++), and mpifc/mpif77/mpif90
(for Fortran) if you target GNU compilers.

Use mpiicc, mpiicpc, and mpiifort if you target Intel Composer XE.
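
For instance, a hypothetical C application (the source file, executable name, and
process counts below are illustrative) could be rebuilt with the Intel MPI wrapper
and relaunched as follows:

$ source /opt/intel/impi_latest/bin64/mpivars.sh
$ mpiicc -O2 -o my_app my_app.c
$ mpirun -np 16 -ppn 2 ./my_app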



Using Intel Composer XE
The invocation of the Intel Composer XE is largely compatible with the widely used GNU
Compiler Collection (GCC). This includes both the most commonly used command line
options and the language support for C/C++ and Fortran. For many applications you can
simply replace gcc with icc, g++ with icpc, and gfortran with ifort. However, be aware
that although the binary code generated by Intel C/C++ Composer XE is compatible with the
GCC-built executable code, the binary code generated by the Intel Fortran Composer is not.
For example:

$ source /opt/intel/composerxe/bin/compilervars.sh intel64
$ icc -O3 -xHost -qopenmp -c -o example.o example.c

Revisit the compiler flags you used before the switch; you may have to remove some
of them. Make sure that Intel Composer XE is invoked with the flags that give the best
performance for your application (see Table 1-1). More information can be found in the
Intel Composer XE documentation.6
Table 1-1.  Selected Intel Composer XE Optimization Flags

-O0              Disable (almost all) optimization. Not
                 something you want to use for performance!
-O1              Optimize for speed (no code size increase
                 for ICC)
-O2              Optimize for speed and enable vectorization
-O3              Turn on high-level optimizations
-ipo             Enable interprocedural optimization
-vec             Enable auto-vectorization (auto-enabled
                 with -O2 and -O3)
-prof-gen        Generate runtime profile for optimization
-prof-use        Use runtime profile for optimization
-parallel        Enable auto-parallelization
-qopenmp         Enable OpenMP
-g               Emit debugging symbols
-qopt-report     Generate the optimization report
-vec-report      Generate the vectorization report
-ansi-alias      Enable ANSI aliasing rules for C/C++



Table 1-1. (continued)

-xSSE4.1         Generate code for Intel processors with SSE
                 4.1 instructions
-xAVX            Generate code for Intel processors with
                 AVX instructions
-xCORE-AVX2      Generate code for Intel processors with
                 AVX2 instructions
-xHost           Generate code for the current machine used
                 for compilation

For most applications, the default optimization level of -O2 will suffice. It runs fast
and gives reasonable performance. If you feel adventurous, try -O3. It is more aggressive
but it also increases the compilation time.
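
A quick, admittedly unscientific way to check whether -O3 pays off for your application
is to build the same source at both levels and time the runs (the file names here are
hypothetical):

$ icc -O2 -xHost -o app.O2 app.c && time ./app.O2
$ icc -O3 -xHost -o app.O3 app.c && time ./app.O3

If -O3 brings no measurable gain, stay with -O2 and enjoy the shorter compilation time.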

Tuning Intel MPI Library
If you have more time, you can try to tune Intel MPI parameters without changing the
application source code.

Gather Built-in Statistics
Intel MPI comes with a built-in statistics-gathering mechanism. It creates a negligible
runtime overhead and reports key performance metrics (for example, MPI to
computation ratio, message sizes, counts, and collective operations used) in the popular
IPM format.7
To switch the IPM statistics gathering mode on and do the measurements, enter the
following commands:

$ export I_MPI_STATS=ipm
$ mpirun -np 16 xhpl

By default, this will generate a file called stats.ipm. Listing 1-1 shows an example
of the MPI statistics gathered for the well-known High Performance Linpack (HPL)
benchmark.8 (We will return to this benchmark throughout this book, by the way.)



Listing 1-1.  MPI Statistics for the HPL Benchmark with the Most Interesting Fields Highlighted

Intel(R) MPI Library Version 5.0

Summary MPI Statistics
Stats format: region
Stats scope : full

# command   : /home/book/hpl/./xhpl_hybrid_intel64_dynamic (completed)
# host      : esg066/x86_64_Linux      mpi_tasks : 16 on 8 nodes
# start     : 02/14/14/12:43:33        wallclock : 2502.401419 sec
# stop      : 02/14/14/13:25:16        %comm     : 8.43
# gbytes    : 0.00000e+00 total        gflop/sec : NA
# region    : *    [ntasks] = 16

# entries
# wallclock
# user
# system
# mpi
# %comm
# gflop/sec
# gbytes
# MPI_Send
# MPI_Recv
# MPI_Wait
# MPI_Iprobe
# MPI_Init_thread
# MPI_Irecv
# MPI_Type_commit
# MPI_Type_free
# MPI_Comm_split
# MPI_Comm_free
# MPI_Wtime
# MPI_Comm_size
# MPI_Comm_rank
# MPI_Finalize



From Listing 1-1 you can deduce that MPI communication occupies between 5.3
and 11.3 percent of the total runtime, and that the MPI_Send, MPI_Recv, and MPI_Wait
operations take about 81, 12, and 7 percent, respectively, of the total MPI time. With
this data at hand, you can see that there are potential load imbalances between the job
processes, and that you should focus on making the MPI_Send operation as fast as it can
go to achieve a noticeable performance hike.
Note that if you use the full IPM package instead of the built-in statistics, you will also
get data on the total communication volume and floating point performance that are not
measured by the Intel MPI Library.

Optimize Process Placement
The Intel MPI Library puts adjacent MPI ranks on one cluster node as long as there are cores
to occupy. Use the Intel MPI command line argument -ppn to control the process placement
across the cluster nodes. For example, this command will start two processes per node:

$ mpirun -np 16 -ppn 2 xhpl

Intel MPI supports process pinning to restrict the MPI ranks to parts of the system
so as to optimize process layout (for example, to avoid NUMA effects or to reduce latency
to the InfiniBand adapter). Many relevant settings are described in the Intel MPI Library
Reference Manual.9
Briefly, if you want to run a pure MPI program only on the physical processor cores,
enter the following commands:

$ export I_MPI_PIN_PROCESSOR_LIST=allcores
$ mpirun -np 2 your_MPI_app

If you want to run a hybrid MPI/OpenMP program, don’t change the default Intel
MPI settings, and see the next section for the OpenMP ones.
If you want to analyze Intel MPI process layout and pinning, set the following
environment variable:

$ export I_MPI_DEBUG=4 

Optimize Thread Placement
If the application uses OpenMP for multithreading, you may want to control thread
placement in addition to the process placement. Two possible strategies are:

$ export KMP_AFFINITY=granularity=thread,compact
$ export KMP_AFFINITY=granularity=thread,scatter

The first setting keeps threads close together to improve inter-thread
communication, while the second setting distributes the threads across the system to
maximize memory bandwidth.



Programs that use the OpenMP API version 4.0 can use the equivalent OpenMP
affinity settings instead of the KMP_AFFINITY environment variable:

$ export OMP_PROC_BIND=close
$ export OMP_PROC_BIND=spread

If you use I_MPI_PIN_DOMAIN, MPI will confine the OpenMP threads of an MPI
process to a single socket. Then you can use the following setting to avoid thread
movement between the logical cores of the socket:

$ export KMP_AFFINITY=granularity=thread 
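
Putting the pieces together, a hybrid MPI/OpenMP launch might look like the following
sketch (the process count, thread count, and application name are illustrative;
I_MPI_PIN_DOMAIN=socket is one common way to create per-socket domains):

$ export I_MPI_PIN_DOMAIN=socket
$ export OMP_NUM_THREADS=8
$ export KMP_AFFINITY=granularity=thread,compact
$ mpirun -np 4 -ppn 2 ./your_hybrid_app

Here, each MPI process is confined to one socket, and its eight OpenMP threads are
pinned to the cores of that socket.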

Tuning Intel Composer XE
If you have access to the source code of the application, you can perform optimizations
by selecting appropriate compiler switches and recompiling the source code.

Analyze Optimization and Vectorization Reports
Add compiler flags -qopt-report and/or -vec-report to see what the compiler did to
your source code. This will report all the transformations applied to your code. It will also
highlight those code patterns that prevented successful optimization. Address them if you
have time left.
Here is a small example. Because the optimization report may be very long, Listing 1-2
only shows an excerpt from it. The example code contains several loop nests of seven loops.
The compiler found an OpenMP directive to parallelize the loop nest. It also recognized
that the overall loop nest was not optimal, and it automatically permuted some loops
to improve the situation for vectorization. Then it vectorized all inner-most loops while
leaving the outer-most loops as they are.
Listing 1-2.  Example Optimization Report with the Most Interesting Fields Highlighted
$ ifort -O3 -qopenmp -qopt-report -qopt-report-file=stdout -c example.F90

Report from: Interprocedural optimizations [ipo]


OpenMP Construct at example.F90(8,7)
OpenMP Construct at example.F90(25,7)




LOOP BEGIN at example.F90(9,2)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)
remark #25448: Loopnest Interchanged : ( 1 2 3 4 ) --> ( 1 4 2 3 )
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)
remark #15018: loop was not vectorized: not inner loop


LOOP BEGIN at example.F90(15,8)
remark #25446: blocked by 125
remark #25444: unrolled and jammed by 4
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(13,6)
remark #25446: blocked by 125
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(14,7)
remark #25446: blocked by 128

LOOP BEGIN at example.F90(14,7)
remark #25460: Loop was not optimized



LOOP BEGIN at example.F90(26,2)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(29,5)
remark #25448: Loopnest Interchanged : ( 1 2 3 4 ) --> ( 1 3 2 4 )
remark #15018: loop was not vectorized: not inner loop



LOOP BEGIN at example.F90(29,5)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(29,5)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(29,5)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(29,5)
remark #25446: blocked by 125
remark #25444: unrolled and jammed by 4
remark #15018: loop was not vectorized: not inner loop


Listing 1-3 shows the vectorization report for the example in Listing 1-2. As you can
see, the vectorization report contains the same information about vectorization as the
optimization report.
Listing 1-3.  Example Vectorization Report with the Most Interesting Fields Highlighted
$ ifort -O3 -qopenmp -vec-report=2 -qopt-report-file=stdout -c example.F90


LOOP BEGIN at example.F90(9,2)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)
remark #15018: loop was not vectorized: not inner loop



LOOP BEGIN at example.F90(12,5)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(15,8)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(13,6)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(14,7)



LOOP BEGIN at example.F90(15,8)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(13,6)
remark #15018: loop was not vectorized: not inner loop


LOOP BEGIN at example.F90(14,7)









Use Interprocedural Optimization
Add the compiler flag -ipo to switch on interprocedural optimization. This will give the
compiler a holistic view of the program and open more optimization opportunities for the
program as a whole. Note that this will also increase the overall compilation time.
Runtime profiling can also increase the chances for the compiler to generate better
code. Profile-guided optimization requires a three-stage process. First, compile the
application with the compiler flag -prof-gen to instrument the application with profiling
code. Second, run the instrumented application with a typical dataset to produce a
meaningful profile. Third, feed the compiler with the profile (-prof-use) and let it
optimize the code.
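
A minimal sketch of this three-stage flow, reusing the hypothetical source file from
the earlier examples and an equally hypothetical input data set:

$ ifort -O3 -ipo -prof-gen -o app example.F90    # stage 1: instrumented build
$ ./app typical_input.dat                        # stage 2: run to produce the profile
$ ifort -O3 -ipo -prof-use -o app example.F90    # stage 3: rebuild using the profile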

Switching to Intel MPI and Intel Composer XE can help improve performance because
the two strive to optimally support Intel platforms and deliver good out-of-the-box (OOB)
performance. Tuning measures can further improve the situation. The next chapters will
reiterate the quick and dirty examples of this chapter and show you how to push the limits.

1. Intel Corporation, “Intel(R) MPI Library,” http://software.intel.com/en-us/
2. Intel Corporation, “Intel(R) Composer XE Suites,”
3.  Argonne National Laboratory, “MPICH: High-Performance Portable MPI,” www.mpich.
4. Ohio State University, “MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE,”
5. International Business Machines Corporation, “IBM Parallel
Environment,” www-03.ibm.com/systems/software/parallel/.
6. Intel Corporation, “Intel Fortran Composer XE 2013 - Documentation,”
7. The IPM Developers, “Integrated Performance Monitoring - IPM,” http://ipm-hpc.
8. A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, “HPL: A Portable
Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers,” 10 September 2008, www.netlib.org/benchmark/hpl/.
9. Intel Corporation, “Intel MPI Library Reference Manual,” http://software.intel.


Chapter 2

Overview of Platform Architectures
In order to optimize software you need to understand hardware. In this chapter we give
you a brief overview of the typical system architectures found in high-performance
computing (HPC) today. We also introduce terminology that will be used throughout
the book.

Performance Metrics and Targets
The definition of optimization found in Merriam-Webster’s Collegiate Dictionary reads
as follows: “an act, process, or methodology of making something (as a design, system,
or decision) as fully perfect, functional, or effective as possible.”1 To become practically
applicable, this definition requires establishment of clear success criteria. These objective
criteria need to be based on quantifiable metrics and on well-defined standards of
measurement. We deal with the metrics in this chapter.

Latency, Throughput, Energy, and Power
Let us start with the most common class of metrics: those that are based on the total time
required to complete an action: for example, the time it takes for a car to drive from the
start to the finish on a race track, as shown in Figure 2-1. Execution (or wall-clock) time
is one of the most common ways to measure application performance: to measure its
runtime on a specific system and report it in seconds (or hours, or sometimes days).
In this context, the time required to complete an action is a typical latency metric.



Figure 2-1.  Runtime: observed time interval between the start and the finish of a car on a
race track
The runtime, or the period of time from the start to the completion of an application,
is important because it tells you how long you need to wait for the results. In networking,
latency is the amount of time it takes a data packet to travel from the source to the
destination; it also can be referred to as the response time. For measurements inside the
processor, we often use the term instruction latency for the time from a machine
instruction entering the execution unit until the results of that instruction are available, that
is, written to the register file and ready to be used by subsequent instructions. In more
general terms, latency can be defined as the observed time interval between the start of a
process and its completion.
We can generalize this class of metrics to represent more of a general class of
consumable resources. Time is one kind of a consumable resource, such as the time
allocated for your job on a supercomputer. Another important example of a consumable
resource is the amount of electrical energy required to complete your job, called energy to
solution. The official unit in which energy is measured is the joule, while in everyday life
we more often use watt-hours. One watt-hour is equal to 3600 joules.
The amount of energy consumption defines your electricity bill and is a very visible
item among operating expenses of major, high-performance computing facilities. It drives
demand for optimization of the energy to solution, in addition to the traditional efforts
to reduce the runtime, improve parallel efficiency, and so on. Energy optimization work
has different scales, going from giga-joules (GJ, or 10^9 joules) consumed at the application
level, to pico-joules (pJ, or 10^-12 joules) per instruction.
One of the specific properties of the latency metrics is that they are additive, so that
they can be viewed as a cumulative sum of several latencies of subtasks. This means that
if the application has three subtasks following one after another, and these subtasks take
times T1, T2 and T3, respectively, then the total application runtime is Tapp = T1 + T2 + T3.
Other types of metrics describe the amount of work that can be completed by the
system per unit of time, or per unit of another consumable resource. One example of car
performance would be its speed defined as the distance covered per unit of time; or of its
fuel efficiency, defined as the distance covered per unit of fuel, such as miles per gallon.
We call these metrics throughput metrics. For example, the number of instructions per
second (IPS) executed by the processor, or the number of floating point operations per
second (FLOPS) are both throughput metrics. Other widely used metrics of this class are
memory bandwidth (reaching tens and hundreds of gigabytes per second these days),
and network interconnection throughput (in either bits per second or bytes per second).
The unit of power (watt) is also a throughput metric that is defined as energy flow per unit
of time, and is equal exactly to 1 joule per second.



You may encounter situations where throughput is described as the inverse of
latency. This is correct only when both metrics describe the same process applied to the
same amount of work. In particular, for an application or kernel that takes one second to
complete 10^9 arithmetic operations on floating point numbers, it is correct to state that its
throughput is 1 GFLOPS (gigaFLOPS, or 10^9 FLOPS).
However, very often, especially in computer networks, latency is understood
as the time from the beginning of the packet shipment until the first data arrives at
the destination. In this context, latency will not be equal to the inverse value of the
throughput. To grasp why this happens, compare sending a very large amount of data
(say, 1 terabyte (TB), which is 10^12 bytes) using two different methods2:

Shipping with overnight express mail


Uploading via broadband Internet access

The overnight (24-hour) shipment of the 1TB hard drive has good throughput but
lousy latency. The throughput is (1 × 10^12 × 8) bits / (24 × 60 × 60) seconds = about 92
million bits per second (bps), which is comparable to modern broadband networks. The
difference is that the overnight shipment bits are delayed for a day and then arrive all
at once, but the bits we send over the Internet start appearing almost immediately. We
would say that the network has much better latency, even though both methods have
approximately the same throughput when considered over the interval of one day.
Although high throughput systems may have low latency, there is no causal link.
Comparing a GDDR5 (Graphics Double Data Rate, version 5) vs. DDR3 (Double Data
Rate, type 3) memory bandwidth and latency, one notices that systems with GDDR5
(such as Intel Xeon Phi coprocessors) deliver three to five times more bandwidth, while
the latency to access data (measured in an idle environment) is five to six times higher
than in systems with DDR3 memory.
Finally, a graph of latency versus load looks very different from a graph of throughput
versus load. As we will see later in this chapter, memory access latency goes up
exponentially as the load increases. Throughput will go up almost linearly at first, then
levels out to become nearly flat when the physical capacity of the transport medium is
saturated. Simply by looking at a graph of test results and keeping those features in mind,
you can guess whether it is a latency graph or a throughput graph.
Another important concept and property of a system or process is its degree of
concurrency or parallelism. Concurrency (or degree of concurrency) is defined as the
number of work items that can potentially be performed simultaneously. In the example
illustrated by Figure 2-2, where three cars can race simultaneously, each on its own
track, we would say this system has concurrency of 3. In computation, an example of
concurrency would be the simultaneous execution of multiple, structurally different
application “threads” by a multicore processor. Presence of concurrency is an intrinsic
property of any modern high-performance system. Processes running on different
machines of a cluster form a common system that executes application code on multiple
machines at the same time. This, too, is an example of concurrency in action.



Figure 2-2.  A system with the degree of concurrency equal to 3
Cantrill and Bonwick describe three fundamental ways of using concurrency to
improve application performance.3 At the same time, these three ways represent the
typical optimization targets for either latency or throughput metrics:

Increase throughput: By executing multiple tasks concurrently,
the general system throughput can be increased.

Reduce latency: A given amount of work is completed in shorter
time by dividing it into parts that can be completed concurrently.

Hide latency: Multiple long-running tasks are executed in
parallel by the underlying system. This is particularly effective
when some tasks are blocked (for example, if they must wait
upon disk or network I/O operations), while others can proceed with their computation.

Peak Performance as the Ultimate Limit
Every time we talk about performance of an application running on a machine, we try to
compare it to the maximum attainable performance on that specific machine, or peak
performance of that machine. The ratio between the achieved (or measured) performance
and the peak performance gives the efficiency metric. This metric is often used to drive
the performance optimization, for an increase in efficiency will also lead to an increase in
performance according to the underlying metric. For example, efficiency for the wall-clock
time is the fraction of time that is spent doing useful work, while efficiency for throughput is
a measure of useful capacity utilization.
Consider the example of how to quantify efficiency for a network protocol. Network
protocols normally require each packet to contain a header and a footer. The actual data
transmitted in the packet is then the size of the packet minus the protocol overhead.
Therefore, efficiency of using the network, from the application point of view, is reduced
from the total utilization according to the size of the header and the footer. For Ethernet,
the frame payload size equals 1536 bytes. The TCP/IP header and footer take 40 bytes
extra. Hence, efficiency here is equal to 1536 / 1576 × 100, or 97.5 percent.
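
As a quick sanity check of that arithmetic, here is a shell one-liner (any calculator
would do just as well):

$ awk 'BEGIN { printf "%.1f percent\n", 1536 / (1536 + 40) * 100 }'
97.5 percent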
Understanding the limitations of maximum achievable performance is an important
step in guiding the optimization process: the limits are always there! These limits are
driven by physical properties of the available materials, maturity of the technology, or
(trivially) the cost. Particularly, the propagation of signals along the wires is limited by
the speed of light in the respective material. Thus, the latency for completing any work
using electronic equipment will always be greater than zero. In the same way, it is not
possible to build an infinitely wide highway, for its throughput will always be limited by
the number of lanes and their individual throughputs.



Scalability and Maximum Parallel Speedup
The ability to increase performance by using more resources in parallel (for example,
more processors) is called scalability. The basic approach in high-performance
computing is to use many computational resources in parallel to solve one problem, and
to add still more resources if higher performance is required. Scalability analysis indicates
how efficiently an application uses increasing numbers of parallel computing
elements, such as cores, vector units, memory, or network connections.
The gain in performance obtained by the addition of resources is called speedup.
When talking about throughput-related metrics, speedup is expressed as the ratio of the
throughput after addition of the resources versus the original throughput. For latency
metrics, speedup is the ratio between the original latency and the latency after addition of
the resources. This way speedup is always greater than 1.0 if performance improves. If the
ratio goes below 1.0, we call this negative speedup, or simply slowdown.
Amdahl’s Law, also known as Amdahl’s argument,4 is used to find the maximum
expected improvement for an entire application when only a part of the application is
improved. This law is often used in parallel computing to predict the theoretical maximum
speedup that can be achieved by adding multiple processors. In essence, Amdahl’s Law
says that speedup of a program using p processors in parallel is limited by the time needed
for the nonparallel fraction of the program (f ), according to the following formula:
Speedup ≤ p / (1 + f × (p − 1))

where f takes values between 0 and 1.
As an example, think about an application that needs 10 hours when running on
a single processor core, where a particular portion of the program takes two hours to
execute and cannot be made parallel (for instance, since it performs sequential I/O
operations). If the remaining 8 hours of the runtime can be efficiently parallelized, then
regardless of how many processors are devoted to the parallelized execution of this
program, the minimum execution time cannot be less than those critical 2 hours. Hence,
speedup is limited to at most five times (usually denoted as 5x). In reality, even this 5x
speedup goal is not attainable, since infinite parallelization of code is not possible for the
parallel part of the application. Figure 2-3 illustrates Amdahl’s law in action. If the parallel
component is made 50 times faster, then the maximum speedup with 20 percent of time
taken by the serial part will be equal to 4.63x.
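
To make these numbers concrete, here are two shell one-liners that evaluate both bounds
for a serial fraction f = 0.2 (the variable names are ours, not part of any tool):

$ awk 'BEGIN { f = 0.2; p = 1e6; printf "limit for very large p: %.2fx\n", p / (1 + f * (p - 1)) }'
limit for very large p: 5.00x
$ awk 'BEGIN { f = 0.2; s = 50; printf "parallel part sped up 50x: %.2fx\n", 1 / (f + (1 - f) / s) }'
parallel part sped up 50x: 4.63x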

