 |
» |
|
|
 |
A triangular loop is a loop nest with an inner loop whose upper or lower
bound (but not both) is a function of the outer loop's
index. Examples of a lower triangular loop and an upper triangular
loop are given below. To simplify explanations, only Fortran examples
are given in this section. While the compiler can usually auto-parallelize one of
the outer or inner loops, there are typically performance problems
in either case: If the outer loop is parallelized
by assigning contiguous chunks of iterations to each of the threads,
the load will be severely imbalanced. For example, in the lower
triangular example above, the thread doing the last chunk of iterations
does far less work than the thread doing the first chunk. If the inner loop is auto-parallelized, then on
each outer iteration in the J loop,
the threads are assigned to work on a different set of iterations
in the I loop, thus losing access
to some of their previously encached elements of F
and thrashing each other's caches in the process.
By manually controlling the parallelization, you can greatly
improve the performance of a triangular loop. Parallelizing the
outer loop is generally more beneficial than parallelizing the inner
loop. The next two sections ("“Parallelizing the outer loop”" and "“Parallelizing the inner loop”") explain how to achieve the
enhanced performance. Parallelizing the outer loop |  |
Using directives you can control the parallelization of the
outer loop in a triangular loop to optimize the performance of the
loop nest. For the outer loop, it is preferable to assign iterations
to threads in a more balanced manner. The simplest method is to
assign the threads one at a time using the attribute CHUNK_SIZE: C$DIR PREFER_PARALLEL (CHUNK_SIZE = 1) DO J = 1, N DO I = J+1, N Y(I,J) = Y(I,J) + ...X(I,J)... |
This causes each thread to execute in the following manner: DO J = MY_THREAD() + 1, N, NUM_THREADS() DO I = J+1, N Y(I,J) = Y(I,J) + ...X(I,J)... |
where 0 <= MY_THREAD()
< NUM_THREADS() In this case, the first thread still does more work than the
last, but the imbalance is greatly reduced. For example, assume
N = 128 and there are 8
threads. Then the default parallel compilation would cause thread
0 to do J = 1 to 16, resulting
in 1912 inner iterations, whereas thread 7 does J =
113 to 128, resulting in 120 inner iterations. With chunk_size
= 1, thread 0 does 1072 inner iterations, and thread
7 does 1023. Parallelizing the inner loop |  |
If the outer loop cannot be parallelized, parallelize the
inner loop if possible. There are two issues to be aware of when
parallelizing the inner loop: Cache thrashing Consider the parallelization of the following inner loop: DO J = I+1, N F(J) = F(J) + SQRT(A(J)**2 - B(I)**2) |
where I varies in the outer
loop iteration. The default iteration distribution has each thread processing
a contiguous chunk of iterations of approximately the same number
as every other thread. The amount of work per thread is about the
same; however, from one outer iteration to the next, threads work
on different elements in F, resulting
in cache thrashing. The overhead of parallelization If the loop cannot be interchanged to be outermost (or at
least outermore), then the overhead of parallelization is compounded
by the number of outer loop iterations.
Below is a scheme that assigns "ownership"
of elements to threads on a cache line basis so that threads always
work on the same cache lines and retain data locality from one iteration
to the next. In addition, the
parallel directive (see the section
“parallel[(attribute_list)]”
) is used to spawn threads just once. The outer, nonparallel loop
is replicated on all processors, and the inner loop iterations are
manually distributed to the threads.  |
C F IS KNOWN TO BEGIN ON A CACHE LINE BOUNDARY NTHD = NUM_THREADS() CHUNK = 8 ! CHUNK * DATA SIZE (4 BYTES) ! EQUALS PROCESSOR CACHE LINE SIZE; ! A SINGLE THREAD WORKS ON CHUNK = 8 ! ITERATIONS AT A TIME NTCHUNK = NTHD * CHUNK ! A CHUNK TO BE SPLIT AMONG THE THREADS ... C$DIR PARALLEL,PARALLEL_PRIVATE(ID,JS,JJ,J,I) ID = MY_THREAD() + 1 ! UNIQUE THREAD ID DO I = 1, N JS = ((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK) * NTCHUNK > + (ID-1) * CHUNK + 1 DO JJ = JS, N, NTCHUNK DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1) F(J) = F(J) + SQRT(A(J)**2 - B(I)**2) ENDDO ENDDO ENDDO C$DIR END_PARALLEL |
 |
The idea is to assign a fixed ownership of cache lines of
F and to assign a distribution
of those cache lines to threads that keeps as many threads busy
computing whole cache lines for as long as possible. Using CHUNK
= 8 for 4-byte data makes each thread work on 8
iterations covering a total of 32 bytes—the processor cache
line size for V2200 and X2000 servers. In general, set CHUNK
equal to the smallest value that multiplies by the data size to
give a multiple of 32 (the processor cache line size on
V2200 servers; also, the processor cache line size and CTIcache
line size on X2000 servers). Smaller values of CHUNK
keep most threads busy most of the time; however, setting CHUNK
to obtain a multiple of 32 is better if the application
is on an X2000 system and is executing on more than one hypernode,
which implies that it is using the CTIcache. When, because of the ever-decreasing work in the triangular
loop, there are fewer cache lines left to compute than there are
threads, threads drop out until there is only one thread left to
compute those iterations associated with the last cache line. Compare
this distribution to the default distribution that causes false
cache line sharing
(see the section "“False cache line sharing”"
in this chapter) and consequent thrashing when all threads attempt
to compute data into a few cache lines. The scheme above maps a sequence of NTCHUNK-sized
blocks over the F array. Within
each block, each thread owns a specific cache line of data.
The relationship between data, threads, and blocks of size NTCHUNK
is shown in Figure 8-1 “Data
ownership by CHUNK and NTCHUNK
blocks”. CHUNK is the number of iterations
a thread will work on at one time. The idea is to make a thread
work on the same elements of F
from one iteration of I to the
next (except for those that are already complete). The scheme above
causes thread 0 to do all work associated with the cache lines starting
at F(1), F(1+NTCHUNK),
F(1+2*NTCHUNK), and so on. Likewise,
thread 1 does the work associated with the cache lines
starting at F(9), F(9+NTCHUNK),
F(9+2*NTCHUNK), and so on. Thus,
if a thread assigns certain elements of F
for I = 2,
then it is certain that the same thread encached those elements
of F in iteration I = 1,
thus eliminating cache thrashing among the threads. Examining the code |  |
Having established the idea of assigning cache line
ownership, consider the following Fortran example in more detail: C$DIR PARALLEL,PARALLEL_PRIVATE(ID,JS,JJ,J,I) ID = MY_THREAD() + 1 ! UNIQUE THREAD ID DO I = 1, N JS = ((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK) * NTCHUNK > + (ID-1) * CHUNK + 1 DO JJ = JS, N, NTCHUNK DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1) F(J) = F(J) + SQRT(A(J)**2 - B(I)**2) ENDDO ENDDO ENDDO C$DIR END_PARALLEL |
 |
- C$DIR
PARALLEL, PARALLEL_PRIVATE(ID,JS,JJ,J,I)
The PARALLEL directive
spawns threads, each of which begins executing the statements in
the parallel region. Each thread has a private version of the variables
ID, JS,
JJ, J,
and I. - ID = MY_THREAD() + 1 ! UNIQUE
THREAD ID
This establishes a unique ID
for each thread, in the range 1 to num_threads(). - DO I = 1, N
All threads execute the I
loop redundantly (instead of thread 0 executing
it alone). - JS = ((I+1 + NTCHUNK-1 - ID*CHUNK
) / NTCHUNK) * NTCHUNK + (ID-1) * CHUNK + 1
For a given value of I+1,
the above line determines in which NTCHUNK
the value I+1 falls, then assigns
a unique CHUNK of it to each thread
ID. Suppose that there are ntc
NTCHUNKs, where ntc
is approximately N/NTCHUNK.
Then the expression: - (I+1 + NTCHUNK-1 - ID*CHUNK
) / NTCHUNK)
returns a value in the range 1 to ntcfor a given value of I+1.
Then the expression: - ((I+1 + NTCHUNK-1 - ID*CHUNK
) / NTCHUNK) * NTCHUNK
identifies the start of an NTCHUNK
that contains I+1 or is immediately
above I+1 for a given value of
ID. For the NTCHUNK that contains
I+1, if the cache lines owned by
a thread either contain I+1 or
are above I+1 in memory, this expression
returns this NTCHUNK. If the cache
lines owned by a thread are below I+1
in this NTCHUNK, this expression
returns the next highest NTCHUNK.
In other words, if there is no work for a particular thread to do
in this NTCHUNK, then start working
in the next one. - (ID-1) * CHUNK + 1
identifies the start of the particular cache line
for the thread to compute within this NTCHUNK. - DO JJ = JS, N, NTCHUNK
Each thread does a unique set of cache lines starting
at its specific JS and continuing
into succeeding NTCHUNKs until
all the work is done. - DO J = MAX (JJ, I+1), MIN (N,
JJ+CHUNK-1)
This does the work within a single cache line. If
the starting index (I+1)
is greater than the first element in the cache line (JS)
then start with I+1. If the ending
index (N) is less than the last
element in the cache line, then finish with N.
More generally, notice that: Most of the "complicated"
arithmetic is in outer loop iterations. Divides could be replaced, by the programmer, with
shift instructions because they involve powers of two. If this application were to be run on an X2000 multihypernode
system, choosing a chunk size of 8 for 4-byte data (32 bytes is
the X2000 CTIcache line size) would be appropriate.
|