Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP
More options
HP.com home
Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide > Chapter 8 Programming conventions for optimal code

Triangular loops

» 

Technical documentation

Complete book in PDF
» Feedback
Content starts here

 » Table of Contents

 » Glossary

A triangular loop is a loop nest with an inner loop whose upper or lower bound (but not both) is a function of the outer loop's index. Examples of a lower triangular loop and an upper triangular loop are given below. To simplify explanations, only Fortran examples are given in this section.

Lower triangular loop

DO J = 1, N 
DO I = J+1, N
F(I) = F(I) + ... + X(I,J) + ...

Upper triangular loop

DO J = 1, N
DO I = 1, J-1
F(I) = F(I) + ... + X(I,J) + ...

While the compiler can usually auto-parallelize one of the outer or inner loops, there are typically performance problems in either case:

  • If the outer loop is parallelized by assigning contiguous chunks of iterations to each of the threads, the load will be severely imbalanced. For example, in the lower triangular example above, the thread doing the last chunk of iterations does far less work than the thread doing the first chunk.

  • If the inner loop is auto-parallelized, then on each outer iteration in the J loop, the threads are assigned to work on a different set of iterations in the I loop, thus losing access to some of their previously encached elements of F and thrashing each other's caches in the process.

By manually controlling the parallelization, you can greatly improve the performance of a triangular loop. Parallelizing the outer loop is generally more beneficial than parallelizing the inner loop. The next two sections ("Parallelizing the outer loop”" and "Parallelizing the inner loop”") explain how to achieve the enhanced performance.

Parallelizing the outer loop

Using directives you can control the parallelization of the outer loop in a triangular loop to optimize the performance of the loop nest.

For the outer loop, it is preferable to assign iterations to threads in a more balanced manner. The simplest method is to assign the threads one at a time using the attribute CHUNK_SIZE:

C$DIR PREFER_PARALLEL (CHUNK_SIZE = 1)
DO J = 1, N
DO I = J+1, N
Y(I,J) = Y(I,J) + ...X(I,J)...

This causes each thread to execute in the following manner:

      DO J = MY_THREAD() + 1, N, NUM_THREADS()
DO I = J+1, N
Y(I,J) = Y(I,J) + ...X(I,J)...

where 0 <= MY_THREAD() < NUM_THREADS()

In this case, the first thread still does more work than the last, but the imbalance is greatly reduced. For example, assume N = 128 and there are 8 threads. Then the default parallel compilation would cause thread 0 to do J = 1 to 16, resulting in 1912 inner iterations, whereas thread 7 does J = 113 to 128, resulting in 120 inner iterations. With chunk_size = 1, thread 0 does 1072 inner iterations, and thread 7 does 1023.

Parallelizing the inner loop

If the outer loop cannot be parallelized, parallelize the inner loop if possible. There are two issues to be aware of when parallelizing the inner loop:

  • Cache thrashing

    Consider the parallelization of the following inner loop:

    DO J = I+1, N 
    F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)

    where I varies in the outer loop iteration.

    The default iteration distribution has each thread processing a contiguous chunk of iterations of approximately the same number as every other thread. The amount of work per thread is about the same; however, from one outer iteration to the next, threads work on different elements in F, resulting in cache thrashing.

  • The overhead of parallelization

    If the loop cannot be interchanged to be outermost (or at least outermore), then the overhead of parallelization is compounded by the number of outer loop iterations.

Below is a scheme that assigns "ownership" of elements to threads on a cache line basis so that threads always work on the same cache lines and retain data locality from one iteration to the next. In addition, the parallel directive (see the section parallel[(attribute_list)] ) is used to spawn threads just once. The outer, nonparallel loop is replicated on all processors, and the inner loop iterations are manually distributed to the threads.

C F IS KNOWN TO BEGIN ON A CACHE LINE BOUNDARY
NTHD = NUM_THREADS()
CHUNK = 8 ! CHUNK * DATA SIZE (4 BYTES)
! EQUALS PROCESSOR CACHE LINE SIZE;
! A SINGLE THREAD WORKS ON CHUNK = 8
! ITERATIONS AT A TIME
NTCHUNK = NTHD * CHUNK ! A CHUNK TO BE SPLIT AMONG THE THREADS
...
C$DIR PARALLEL,PARALLEL_PRIVATE(ID,JS,JJ,J,I)
ID = MY_THREAD() + 1 ! UNIQUE THREAD ID
DO I = 1, N
JS = ((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK) * NTCHUNK
> + (ID-1) * CHUNK + 1
DO JJ = JS, N, NTCHUNK
DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1)
F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)
ENDDO
ENDDO
ENDDO
C$DIR END_PARALLEL

The idea is to assign a fixed ownership of cache lines of F and to assign a distribution of those cache lines to threads that keeps as many threads busy computing whole cache lines for as long as possible. Using CHUNK = 8 for 4-byte data makes each thread work on 8 iterations covering a total of 32 bytes—the processor cache line size for V2200 and X2000 servers. In general, set CHUNK equal to the smallest value that multiplies by the data size to give a multiple of 32 (the processor cache line size on V2200 servers; also, the processor cache line size and CTIcache line size on X2000 servers). Smaller values of CHUNK keep most threads busy most of the time; however, setting CHUNK to obtain a multiple of 32 is better if the application is on an X2000 system and is executing on more than one hypernode, which implies that it is using the CTIcache.

When, because of the ever-decreasing work in the triangular loop, there are fewer cache lines left to compute than there are threads, threads drop out until there is only one thread left to compute those iterations associated with the last cache line. Compare this distribution to the default distribution that causes false cache line sharing (see the section "“False cache line sharing”" in this chapter) and consequent thrashing when all threads attempt to compute data into a few cache lines.

The scheme above maps a sequence of NTCHUNK-sized blocks over the F array. Within each block, each thread owns a specific cache line of data. The relationship between data, threads, and blocks of size NTCHUNK is shown in Figure 8-1 “Data ownership by CHUNK and NTCHUNK blocks”.

Figure 8-1 Data ownership by CHUNK and NTCHUNK blocks

Data ownership by CHUNK and NTCHUNK blocks

CHUNK is the number of iterations a thread will work on at one time. The idea is to make a thread work on the same elements of F from one iteration of I to the next (except for those that are already complete). The scheme above causes thread 0 to do all work associated with the cache lines starting at F(1), F(1+NTCHUNK), F(1+2*NTCHUNK), and so on. Likewise, thread 1 does the work associated with the cache lines starting at F(9), F(9+NTCHUNK), F(9+2*NTCHUNK), and so on. Thus, if a thread assigns certain elements of F for I = 2, then it is certain that the same thread encached those elements of F in iteration I = 1, thus eliminating cache thrashing among the threads.

Examining the code

Having established the idea of assigning cache line ownership, consider the following Fortran example in more detail:

C$DIR PARALLEL,PARALLEL_PRIVATE(ID,JS,JJ,J,I)
ID = MY_THREAD() + 1 ! UNIQUE THREAD ID
DO I = 1, N
JS = ((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK) * NTCHUNK
> + (ID-1) * CHUNK + 1
DO JJ = JS, N, NTCHUNK
DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1)
F(J) = F(J) + SQRT(A(J)**2 - B(I)**2)
ENDDO
ENDDO
ENDDO
C$DIR END_PARALLEL
C$DIR PARALLEL, PARALLEL_PRIVATE(ID,JS,JJ,J,I)

The PARALLEL directive spawns threads, each of which begins executing the statements in the parallel region. Each thread has a private version of the variables ID, JS, JJ, J, and I.

ID = MY_THREAD() + 1 ! UNIQUE THREAD ID

This establishes a unique ID for each thread, in the range 1 to num_threads().

DO I = 1, N

All threads execute the I loop redundantly (instead of thread 0 executing it alone).

JS = ((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK) * NTCHUNK + (ID-1) * CHUNK + 1

For a given value of I+1, the above line determines in which NTCHUNK the value I+1 falls, then assigns a unique CHUNK of it to each thread ID. Suppose that there are ntc NTCHUNKs, where ntc is approximately N/NTCHUNK. Then the expression:

(I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK)

returns a value in the range 1 to ntcfor a given value of I+1. Then the expression:

((I+1 + NTCHUNK-1 - ID*CHUNK ) / NTCHUNK) * NTCHUNK

identifies the start of an NTCHUNK that contains I+1 or is immediately above I+1 for a given value of ID.

For the NTCHUNK that contains I+1, if the cache lines owned by a thread either contain I+1 or are above I+1 in memory, this expression returns this NTCHUNK. If the cache lines owned by a thread are below I+1 in this NTCHUNK, this expression returns the next highest NTCHUNK. In other words, if there is no work for a particular thread to do in this NTCHUNK, then start working in the next one.

(ID-1) * CHUNK + 1

identifies the start of the particular cache line for the thread to compute within this NTCHUNK.

DO JJ = JS, N, NTCHUNK

Each thread does a unique set of cache lines starting at its specific JS and continuing into succeeding NTCHUNKs until all the work is done.

DO J = MAX (JJ, I+1), MIN (N, JJ+CHUNK-1)

This does the work within a single cache line. If the starting index (I+1) is greater than the first element in the cache line (JS) then start with I+1. If the ending index (N) is less than the last element in the cache line, then finish with N.

More generally, notice that:

  • Most of the "complicated" arithmetic is in outer loop iterations.

  • Divides could be replaced, by the programmer, with shift instructions because they involve powers of two.

  • If this application were to be run on an X2000 multihypernode system, choosing a chunk size of 8 for 4-byte data (32 bytes is the X2000 CTIcache line size) would be appropriate.

Printable version
Privacy statement Using this site means you accept its terms Feedback to webmaster
© Hewlett-Packard Development Company, L.P.