Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP
More options
HP.com home
Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 13 Troubleshooting

False cache line sharing

» 

Technical documentation

Complete book in PDF
» Feedback
Content starts here

 » Table of Contents

 » Glossary

False cache line sharing is a form of cache thrashing. It occurs whenever two or more threads in a parallel program are assigning different data items in the same cache line. This section discusses how to avoid false cache line sharing by restructuring the data layout and controlling the distribution of loop iterations among threads.

Consider the following Fortran code:

REAL*4 A(8)
DO I = 1, 8
A(I) = ...
.
.
.
ENDDO

Assume there are eight threads, each executing one of the above iterations. A(1) is on a processor cache line boundary (32-byte boundary for V2250 servers) so that all eight elements are in the same cache line. Only one thread at a time can "own" the cache line, so not only is the above loop, in effect, run serially, but every assignment by a thread requires an invalidation of the line in the cache of its previous "owner." These problems would likely eliminate any benefit of parallelization.

Taking all of the above into consideration, review the code:

REAL*4 B(100,100)
DO I = 1, 100
DO J = 1, 100
B(I,J) = ...B(I,J-1)...
ENDDO
ENDDO

Assume there are eight threads working on the I loop in parallel. The J loop cannot be parallelized because of the dependence. Table 13-2 “Default distribution of the I loop” shows how the array maps to cache lines, assuming that B(1,1) is on a cache line boundary. Array entries that fall on cache line boundaries are in shaded cells. Array entries that fall on cache line boundaries are noted by hashmarks(#).

Table 13-1 Initial mapping of array to cache lines

1, 1 1, 2 1, 3 1, 4. . .1, 99 1,100
2, 1 2, 2 2, 3 2, 4. . . 2, 99 2,100
3, 1 3, 2 3, 3 3, 4. . . 3, 99 3,100
4, 1 4, 2 4, 3 4, 4. . . 4, 99 4,100
5, 15, 2 5, 35, 4. . . 5, 995,100
6, 1 6, 2 6, 3 6, 4. . . 6, 99 6,100
7, 1 7, 2 7, 3 7, 4. . . 7, 99 7,100
8, 1 8, 2 8, 3 8, 4. . . 8, 99 8,100
9, 1 9, 29, 3 9, 4. . .9, 99 9,100
10, 1 10, 2 10, 3 10, 4. . . 10, 99 10,100
11, 1 11, 2 11, 3 11, 4. . . 11, 99 11,100
12, 1 12, 2 12, 3 12, 4. . . 12, 99 12,100
13, 113, 213, 313, 4. . .13, 9913, 100
. . .. . .. . .. . .. . .. . .. . .
97, 1 97, 297, 3 97, 4 . . .97, 99 97,100
98, 1 98, 2 98, 3 98, 4. . . 98, 99 98,100
99, 1 99, 2 99, 3 99, 4. . . 99, 99 99,100
100, 1100, 2100, 3100, 4. . .100, 99100,100

 

Array entries surrounded by hashmarks(#) are on cache line boundaries.

HP compilers, by default, give each thread about the same number of iterations, assigning (if necessary) one extra iteration to some threads. This happens until all iterations are assigned to a thread. Table 13-2 “Default distribution of the I loop” shows the default distribution of the I loop across 8 threads.

Table 13-2 Default distribution of the I loop

Thread IDIteration rangeNumber
of iterations
0 1-1212
113-2513
226-3712
338-5013
451-6212
563-7513
676-8712
788-10013

 

This distribution of iterations causes threads to share cache lines. For example, thread 0 assigns the elements B(9:12,1), and thread 1 assigns elements B(13:16,1) in the same cache line. In fact, every thread shares cache lines with at least one other thread. Most share cache lines with two other threads. This type of sharing is called false because it is a result of the data layout and the compiler's distribution of iterations. It is not inherent in the algorithm itself. Therefore, it is reduced or even removed by:

  1. Restructuring the data layout by aligning data on cache line boundaries

  2. Controlling the iteration distribution.

Aligning data to avoid false sharing

Because false cache line sharing is partially due to the layout of the data, one step in avoiding it is to adjust the layout. Adjustments are typically made by aligning data on cache line boundaries. Aligning arrays generally improves performance. However, it can occasionally decrease performance.

The second step in avoiding false cache line sharing is to adjust the distribution of loop iterations. This is covered in “Distributing iterations on c ache line boundaries”.

Aligning arrays on cache line boundaries

Note the assumption that in the previous example, array B starts on a cache line boundary. The methods below force arrays in Fortran to start on cache line boundaries:

  • Using uninitialized COMMON blocks (blocks with no DATA statements). These blocks start on 64-byte boundaries.

  • Using ALLOCATE statements. These statements return addresses on 64-byte boundaries. This only applies to parallel executables.

The methods below force arrays in C to start on cache line boundaries:

  • Using the functions malloc or memory_class_malloc. These functions return pointers on 64-byte boundaries.

  • Using uninitialized global arrays or structs that are at least 32 bytes. Such arrays and structs are aligned on 64-byte boundaries.

  • Using uninitialized data of the external storage class in C that is at least 32 bytes. Data is aligned on 64-byte boundaries.

Distributing iterations on c ache line boundaries

Recall that the default iteration distribution causes thread 0 to work on iterations 1-12 and thread 1 to work on iterations 13-25, and so on. Even though the cache lines are aligned across the columns of the array (see Table 13-2 “Default distribution of the I loop”), the iteration distribution still needs to be changed. Use the CHUNK_SIZE attribute to change the distribution:

      REAL*4 B(112,100) 
COMMON /ALIGNED/ B
C$DIR PREFER_PARALLEL (CHUNK_SIZE=16)
DO I = 1, 100
DO J = 1, 100
B(I,J) = ...B(I,J-1)...
ENDDO
ENDDO

You must specify a constant CHUNK_SIZE attribute. However, the ideal is to distribute work so that all but one thread works on the same number of whole cache lines, and the remaining thread works on any partial cache line. For example, given the following:

NITS = number of iterations

NTHDS = number of threads

LSIZE = line size in words (8 for 4-byte data, 4 for 8-byte data, 2 for 16-byte data) size in words (8 for 4-byte data

the ideal CHUNK_SIZE would be:

CHUNK_SIZE = LSIZE * (1 + ( (1 + (NITS - 1) / LSIZE ) - 1 )/NTHDS)

For the code above, these numbers are:

NITS = 100

LSIZE = 8 (aligns on V2250 boundaries for 4-byte data)

NTHDS =8

CHUNK_SIZE = 8 * (1 + ( (1 + (100 - 1) / 8 ) - 1) / 8) 
= 8 * (1 + ( (1 + 12 ) - 1) / 8)
= 8 * (1 + ( 12 ) / 8)
= 8 * (1 + 1 )
= 16

CHUNK_SIZE = 16 causes threads 0, 1, ..., 6 to execute iterations 1-16, 17-32, ..., 81-96, respectively. Thread 7 executes iterations 97-100. As a result there is no false cache line sharing, and parallel performance is greatly improved.

You cannot specify the ideal CHUNK_SIZE for every loop. However, using

CHUNK_SIZE = x

where x times the data size (in bytes) is an integral multiple of 32, eliminates false cache line sharing. This is only if the following two conditions below are met:

  • The arrays are already properly aligned (as discussed earlier in this section).

  • The first iteration accesses the first element of each array being assigned. For example, in a loop DO I = 2, N, because the loop starts at I = 2, the first iteration does not access the first element of the array. Consequently, the iteration distribution does not match the cache line alignment.

The number 32 is used because the cache line size is 32 bytes for V2250 servers.

Thread-specific array elements

Sometimes a parallel loop has each thread update a unique element of a shared array, which is further processed by thread 0 outside the loop.

Consider the following Fortran code in which false sharing occurs:

      REAL*4 S(8) 
C$DIR LOOP_PARALLEL
DO I = 1, N
.
.
.
S(MY_THREAD()+1) = ... ! EACH THREAD ASSIGNS ONE ELEMENT OF S
.
.
.
ENDDO
C$DIR NO_PARALLEL
DO J = 1, NUM_THREADS()
= ...S(J) ! THREAD 0 POST-PROCESSES S
ENDDO

The problem here is that potentially all the elements of S are in a single cache line, so the assignments cause false sharing. One approach is to change the code to force the unique elements into different cache lines, as indicated in the following code:

      REAL*4 S(8,8)
C$DIR LOOP_PARALLEL
DO I = 1, N
.
.
.
S(1,MY_THREAD()+1) = ... ! EACH THREAD ASSIGNS ONE ELEMENT OF S
.
.
.
ENDDO
C$DIR NO_PARALLEL
DO J = 1, NUM_THREADS()
= ...S(1,J) ! THREAD 0 POST-PROCESSES S
ENDDO

Scalars sharing a cache line

Sometimes parallel tasks assign unique scalar variables that are in the same cache line, as in the following code:

      COMMON /RESULTS/ SUM, PRODUCT 
C$DIR BEGIN_TASKS
DO I = 1, N
.
.
.
SUM = SUM + ...
.
.
.
ENDDO
C$DIR NEXT_TASK
DO J = 1, M
.
.
.
PRODUCT = PRODUCT * ...
.
.
.
ENDDO
C$DIR END_TASKS

Working with unaligned arrays

The most common cache-thrashing complication using arrays and loops occurs when arrays assigned within a loop are unaligned with each other. There are several possible causes for this:

  • Arrays that are local to a routine are allocated on the stack.

  • Array dummy arguments might be passed an element other than the first in the actual argument.

  • Array elements might be assigned with different offset indexes.

Consider the following Fortran code:

COMMON /OKAY/ X(112,100) 
...
CALL UNALIGNED (X(I,J))
...
SUBROUTINE UNALIGNED (Y)
REAL*4 Y(*)
! Y(1) PROBABLY NOT ON A CACHE LINE BOUNDARY

The address of Y(1) is unknown. However, if elements of Y are heavily assigned in this routine, it may be worthwhile to compute an alignment, given by the following formula:

LREM = LSIZE - ( ( ( LOC(Y(1))-4, LSIZE*x) + 4) /x)

where

LSIZE

is the appropriate cache line size in words

x

is the data size for elements of Y

For this case, LSIZE on V2250 servers is 32 bytes in single precision words (8 words). Note that:

( ( MOD ( LOC(Y(1))-4, LSIZE*4) + 4) /4)

returns a value in the set 1, 2, 3, ..., LSIZE, so LREM is in the range 0 to 7.

Then a loop such as:

DO I = 1, N 
Y(I) = ...
ENDDO

is transformed to:

C$DIR NO_PARALLEL 
DO I = 1, MIN (LREM, N) ! 0 <= LREM < 8
Y(I) = ...
ENDDO
C$DIR PREFER_PARALLEL (CHUNK_SIZE = 16)
DO I = LREM+1, N
! Y(LREM+1) IS ON A CACHE LINE BOUNDARY
Y(I) = ...
ENDDO

The first loop takes care of elements from the first (if any) partial cache line of data. The second loop begins on a cache line boundary, and is controlled with CHUNK_SIZE to avoid false sharing among the threads.

Working with dependences

Data dependences in loops may prevent parallelization and prevent the elimination of false cache line sharing. If certain conditions are met, some performance gains are achieved.

For example, consider the following code:

COMMON /ALIGNED / P(128,128), Q(128,128), R(128,128)
REAL*4 P, Q, R
DO J = 2, 128
DO I = 2, 127
P(I-1,J) = SQRT (P(I-1,J-1) + 1./3.)
Q(I ,J) = SQRT (Q(I ,J-1) + 1./3.)
R(I+1,J) = SQRT (R(I+1,J-1) + 1./3.)
ENDDO
ENDDO

Only the I loop is parallelized, due to the loop-carried dependences in the J loop. It is impossible to distribute the iterations so that there is no false cache line sharing in the above loop. If all loops that refer to these arrays always use the same offsets (which is unlikely) then you could make dimension adjustments that would allow a better iteration distribution.

For example, the following would work well for 8 threads:

      COMMON /ADJUSTED/ P(128,128), PAD1(15), Q(128,128),
> PAD2(15), R(128,128)

DO J = 2, 128
C$DIR PREFER_PARALLEL (CHUNK_SIZE=16)
DO I = 2, 127
P(I-1,J) = SQRT (P(I-1,J-1) + 1./3.)
Q(I ,J) = SQRT (Q(I ,J-1) + 1./3.)
R(I+1,J) = SQRT (R(I+1,J-1) + 1./3.)
ENDDO
ENDDO

Padding 60 bytes before the declarations of both Q and R causes the P(1,J), Q(2,J), and R(3,J) to be aligned on 64-byte boundaries for all J. Combined with a CHUNK_SIZE of 16, this causes threads to assign data to unique whole cache lines.

You can usually find a mix of all the above problems in some CPU-intensive loops. You cannot avoid all false cache line sharing, but by careful inspection of the problems and careful application of some of the workarounds shown here, you can significantly enhance the performance of your parallel loops.

Printable version
Privacy statement Using this site means you accept its terms Feedback to webmaster
© Hewlett-Packard Development Company, L.P.