Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP
More options
HP.com home
Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide > Chapter 8 Programming conventions for optimal code

False cache line sharing

» 

Technical documentation

Complete book in PDF
» Feedback
Content starts here

 » Table of Contents

 » Glossary

False cache line sharing is a form of cache thrashing. It occurs whenever two or more threads in a parallel program are assigning different data items in the same cache line. This section discusses how to avoid false cache line sharing by restructuring the data layout and controlling the distribution of loop iterations among threads. To simplify explanations, only Fortran examples are given.

Consider the following example:

REAL*4 A(8)
DO I = 1, 8
A(I) = ...
.
.
.
ENDDO

Imagine eight threads, each executing one of the above iterations. Assume that A(1) is on a processor cache line boundary (32-byte boundary for V2200 and X2000 servers) so that all eight elements are in the same cache line. Only one thread at a time can "own" the cache line, so not only is the above loop, in effect, run serially, but every assignment by a thread requires an invalidation of the line in the cache of its previous "owner." These problems would likely eliminate any benefit of parallelization.

Now consider this example:

REAL*4 B(100,100)
DO I = 1, 100
DO J = 1, 100
B(I,J) = ...B(I,J-1)...
ENDDO
ENDDO

Imagine 8 threads working on the I loop in parallel (the J loop cannot be parallelized because of the dependence). Table 8-1 “Initial mapping of array to cache lines” shows how the array maps to cache lines, assuming that B(1,1) is on a cache line boundary. Array entries that fall on cache line boundaries are noted by hashmarks(#).

Table 8-1 Initial mapping of array to cache lines

# 1, 1 # 1, 2# 1, 3 # 1, 4. . .# 1, 99 # 1,100
2, 1 2, 2 2, 3 2, 4. . . 2, 99 2,100
3, 1 3, 2 3, 3 3, 4. . . 3, 99 3,100
4, 1 4, 2 4, 3 4, 4. . . 4, 99 4,100
5, 1# 5, 2 # 5, 3# 5, 4 #. . . 5, 99# 5,100 #
6, 1 6, 2 6, 3 6, 4. . . 6, 99 6,100
7, 1 7, 2 7, 3 7, 4. . . 7, 99 7,100
8, 1 8, 2 8, 3 8, 4. . . 8, 99 8,100
# 9, 1 # 9, 2# 9, 3 # 9, 4. . .# 9, 99 # 9,100
10, 1 10, 2 10, 3 10, 4. . . 10, 99 10,100
11, 1 11, 2 11, 3 11, 4. . . 11, 99 11,100
12, 1 12, 2 12, 3 12, 4. . . 12, 99 12,100
13, 1# 13, 2 #13, 3# 13, 4 #. . .13, 99# 13, 100 #
. . .. . .. . .. . .. . .. . .. . .
# 97, 1 # 97, 2# 97, 3 # 97, 4 . . .# 97, 99 # 97,100
98, 1 98, 2 98, 3 98, 4. . . 98, 99 98,100
99, 1 99, 2 99, 3 99, 4. . . 99, 99 99,100
100, 1100, 2100, 3100, 4. . .100, 99100,100

 

Array entries surrounded by hashmarks(#) are on cache line boundaries.

The Exemplar compilers, by default, give each thread about the same number of iterations, assigning (if necessary) one extra iteration to some threads until all iterations are assigned to a thread. Table 8-2 “Default distribution of the I loop” shows the default distribution of the I loop across 8 threads.

Table 8-2 Default distribution of the I loop

Thread IDIteration rangeNumber
of iterations
0 1-1212
113-2513
226-3712
338-5013
451-6212
563-7513
676-8712
7 88-10013

 

This distribution of iterations causes threads to share cache lines. For example, thread 0 assigns the elements B(9:12,1) and thread 1 assigns elements B(13:16,1) in the same cache line. In fact, every thread shares cache lines with at least one other thread; most share cache lines with two other threads. This type of sharing is called false because it is a result of the data layout and the compiler's distribution of iterations; it is not inherent in the algorithm itself. Therefore, it can be reduced or even removed by

  1. Restructuring the data layout by aligning data on cache line boundaries

  2. Controlling the iteration distribution

Aligning data to avoid false sharing

Because false cache line sharing is partially due to the layout of the data, one step in avoiding it is to adjust the layout. Typically, these adjustments are made by aligning data on cache line boundaries. (Aligning arrays generally improves performance; however, it can occasionally decrease performance.) The second step in avoiding false cache line sharing (which is covered in the section, “Distributing iterations on cache line boundaries”) is to adjust the distribution of loop iterations.

Aligning arrays on cache line boundaries

Note the assumption that in the previous example, array B starts on a cache line boundary. The methods below force arrays in Fortran to start on cache line boundaries:

  • Using uninitialized COMMON blocks (blocks with no DATA statements). These blocks start on 64-byte boundaries.

  • Using ALLOCATE statements. These statements return addresses on 64-byte boundaries. (Applies only to parallel executables.)

  • Using the directive ALIGN_CTI on an X2000 server. This directive forces arrays of any size to start on CTIcache boundaries (32 bytes on X2000 servers).

The methods below force arrays in C to start on cache line boundaries:

  • Using the functions malloc or memory_class_malloc. These functions return pointers on 64-byte boundaries. (Applies only to parallel executables.)

  • Using uninitialized global arrays or structs that are at least 32 bytes. Such arrays and structs are aligned on 64-byte boundaries.

  • Using the pragma align_cti on an X2000 server. This pragma forces arrays of any size to start on CTIcache boundaries (32 bytes on X2000 servers).

  • Using uninitialized data of the external storage class in C that is at least 32 bytes. Data is aligned on 64-byte boundaries.

Aligning multidimensional arrays on cache line boundaries

Multidimensional arrays can also be aligned on cache line boundaries. Recall that in the example from the beginning of the section:

REAL*4 B(100,100) 
DO I = 1, 100
DO J = 1, 100
B(I,J) = ...B(I,J-1)...
ENDDO
ENDDO

we assumed B(1,1) starts on a cache line boundary. However, because the iteration distribution caused alignment on cache line boundaries to vary from dimension to dimension, performance suffered. Choose a value x for the leftmost dimension (rightmost in C and C++) in arrays so that x times the data size (in bytes) is an integral multiple of the CTIcache line size. (The code that follows gives an example of this idea.) Using such a value aligns data on CTIcache line boundaries at the same index points in all dimensions.

On X2000 servers, both the processor cache lines and the CTIcache lines are 32 bytes. (V2200 servers and nonscalable SMPs do not have CTIcaches.)

For general use, try to align everything on 32-byte boundaries for V2200 and X2000 servers. This alignment will help to eliminate false cache line sharing between processor caches and also between CTIcaches where the penalty for false sharing is even higher. Where possible, try to parameterize cache line size in anticipation of future systems that might have different cache line sizes.

Changing the example so that array B is aligned and padded (to 64 bytes), the leftmost dimension is now 112 instead of 100. (See the modified example below.) The number 112 is the smallest value x greater than 100 such that x times the data size (4 bytes) is an integral multiple of 64.

REAL*4 B(112,100) 
COMMON /ALIGNED/ B
DO I = 1, 100
DO J = 1, 100
B(I,J) = ...B(I,J-1)...
ENDDO
ENDDO

Note that loop limits have not changed. Placing B in a COMMON block has forced B(1,1) onto a cache line boundary (64 bytes), and changing the first dimension to 112 assures that B(1,2), B(1,3), ..., B(1,100) all start on cache line boundaries. Table 8-3 “Restructured mapping of array to cache lines” shows how the restructured array maps to (processor) cache lines.

The next section, "“Distributing iterations on cache line boundaries”," explains how to make the compiler distribute iterations so that threads work on whole cache lines.

Table 8-3 Restructured mapping of array to cache lines

# 1, 1 ## 1, 2 ## 1, 3 ## 1, 4 #. . . # 1, 99 ## 1, 100 #
2, 12, 22, 32, 4. . .2, 992, 100
3, 13, 23, 33, 4. . .3, 993, 100
4, 14, 24, 34, 4. . .4, 994, 100
5, 15, 25, 35, 4. . .5, 995, 100
6, 16, 26, 36, 4. . .6, 996, 100
7, 17, 27, 37, 4. . .7, 997, 100
8, 18, 28, 38, 4. . .8, 998, 100
# 9, 1 ## 9, 2 ## 9, 3## 9, 4 #. . .# 9, 99## 9, 100 #
10, 110, 210, 310, 4. . .10, 9910, 100
11, 111, 211, 311, 4. . .11, 9911, 100
12, 112, 212, 312, 4. . .12, 9912, 100
. . .. . .. . .. . .. . .. . .. . .
# 97, 1 ## 97, 2 ## 97, 3 ## 97, 4 #. . .# 97, 99 ## 97,100 #
98, 1 98, 2 98, 3 98, 4. . . 98, 99 98,100
99, 1 99, 2 99, 3 99, 4. . . 99, 99 99,100
100, 1100, 2100, 3100, 4. . .100, 99100,100
101, 1101, 2101, 3101, 4. . .101, 99101,101
102, 1102, 2102, 3102, 4. . .102, 99102,102
103, 1103, 2103, 3103, 4. . .103, 99103,103
104, 1104, 2104, 3104, 4. . .104, 99104,104
# 105, 1 ## 105, 2 ## 105, 3 ## 105, 4 #. . .# 105, 99 ## 105,105 #
106, 1106, 2106, 3106, 4. . .106, 99106,106
107, 1107, 2107, 3107, 4. . .107, 99107,107
108, 1108, 2108, 3108, 4. . .108, 99108,108
109, 1109, 2109, 3109, 4. . .109, 99109,109
110, 1110, 2110, 3110, 4. . .110, 99110,110
111, 1111, 2111, 3111, 4. . .111, 99111,112
112, 1112, 2112, 3112, 4. . .112, 99112,112

 

Array entries surrounded by hashmarks (#) are on cache line boundaries.

Distributing iterations on cache line boundaries

Recall that the default iteration distribution causes thread 0 to work on iterations 1-12 and thread 1 to work on iterations 13-25, and so on. Even though the cache lines are aligned across the columns of the array (see Table 8-3 “Restructured mapping of array to cache lines”), we still need to change the iteration distribution. Use CHUNK_SIZE to change the distribution:

      REAL*4 B(112,100) 
COMMON /ALIGNED/ B
C$DIR PREFER_PARALLEL (CHUNK_SIZE=16)
DO I = 1, 100
DO J = 1, 100
B(I,J) = ...B(I,J-1)...
ENDDO
ENDDO

You must specify a constant CHUNK_SIZE. However, the ideal would be to distribute work so that all but one thread works on the same number of whole cache lines, and the remaining thread works on any partial cache line. For example, given:

NITS = number of iterations
NTHDS = number of threads
LSIZE = CTIcache line size in words (8 for 4-byte data,
4 for 8-byte data, 2 for 16-byte data on X2000 servers)

the ideal CHUNK_SIZE would be:

CHUNK_SIZE = LSIZE * (1 + ( (1 + (NITS - 1) / LSIZE ) - 1 )/NTHDS)

For the code above, these numbers are:

NITS = 100
NTHDS = 8
LSIZE = 8 (aligns on X2000 CTIcache boundaries for 4-byte data)

CHUNK_SIZE = 8 * (1 + ( (1 + (100 - 1) / 8 ) - 1) / 8) 
= 8 * (1 + ( (1 + 12 ) - 1) / 8)
= 8 * (1 + ( 12 ) / 8)
= 8 * (1 + 1 )
= 16

CHUNK_SIZE = 16 causes threads 0, 1, ..., 6 to execute iterations 1-16, 17-32, ..., 81-96, respectively. Thread 7 executes iterations 97-100. As a result there is no false cache line sharing, and parallel performance is greatly improved.

While you cannot specify the ideal CHUNK_SIZE for every loop, using

CHUNK_SIZE = x

where x times the data size (in bytes) is an integral multiple of 32 will eliminate false cache line sharing if the two conditions below are met:

  • The arrays are already properly aligned (as discussed earlier in this section).

  • The first iteration accesses the first element of each array being assigned. (For example, in a loop DO I = 2, N, because the loop starts at I = 2, the first iteration does not access the first element of the array; consequently, the iteration distribution does not match the cache line alignment.)

The number 32 is used because the CTIcache line size is 32 bytes for X2000 servers.

Thread-specific array elements

Sometimes a parallel loop has each thread update a unique element of a shared array which is further processed by thread 0 outside the loop.

Consider the following Fortran example:

      REAL*4 S(8) 
C$DIR LOOP_PARALLEL
DO I = 1, N
.
.
.
S(MY_THREAD()+1) = ... ! EACH THREAD ASSIGNS ONE ELEMENT OF S
.
.
.
ENDDO
C$DIR NO_PARALLEL
DO J = 1, NUM_THREADS()
= ...S(J) ! THREAD 0 POST-PROCESSES S
ENDDO

The problem here is that potentially all the elements of S are in a single cache line, so the assignments cause false sharing. One approach is to change the code to force the unique elements into different cache lines, as indicated below:

      REAL*4 S(8,8)
C$DIR LOOP_PARALLEL
DO I = 1, N
.
.
.
S(1,MY_THREAD()+1) = ... ! EACH THREAD ASSIGNS ONE ELEMENT OF S
.
.
.
ENDDO
C$DIR NO_PARALLEL
DO J = 1, NUM_THREADS()
= ...S(1,J) ! THREAD 0 POST-PROCESSES S
ENDDO

For multihypernode applications on X2000 servers, the dimensions should be S(8,number_of_threads) because the CTIcache line size is 32 bytes, and the data size is 4 bytes.

Scalars sharing a cache line

Sometimes parallel tasks will assign unique scalar variables that are in the same cache line, as in the following example:

      COMMON /RESULTS/ SUM, PRODUCT 
C$DIR BEGIN_TASKS
DO I = 1, N
.
.
.
SUM = SUM + ...
.
.
.
ENDDO
C$DIR NEXT_TASK
DO J = 1, M
.
.
.
PRODUCT = PRODUCT * ...
.
.
.
ENDDO
C$DIR END_TASKS

This problem is similar to the example in the previous section ("“Thread-specific array elements”"), and can be avoided by padding enough space between the two scalar variables so that the variables are in separate CTIcache lines:

COMMON /RESULTS/ SUM, PAD(7), PRODUCT

where PAD(7) represents 28 bytes—SUM and PAD(7) together take up 32 bytes, forcing PRODUCT into the next CTIcache line on multinode X2000 applications because the CTIcache lines are 32 bytes.

Working with unaligned arrays

The most common cache-thrashing complication using arrays and loops will be that arrays assigned within a loop are unaligned (and possibly unalignable) with each other. There are several possible causes for this:

  • Arrays that are local to a routine are allocated on the stack.

  • Array dummy arguments might be passed an element other than the first in the actual argument.

  • Array elements might be assigned with different offset indexes.

Consider the following Fortran example:

COMMON /OKAY/ X(112,100) 
...
CALL UNALIGNED (X(I,J))
...
SUBROUTINE UNALIGNED (Y)
REAL*4 Y(*)
! Y(1) PROBABLY NOT ON A CACHE LINE BOUNDARY

The address of Y(1) is unknown. However, if elements of Y are heavily assigned in this routine, it may be worthwhile to compute an alignment, given by the following formula:

LREM = LSIZE - ( ( MOD ( LOC(Y(1))-4, LSIZE*x) + 4) /x)

where

LSIZE

is the appropriate cache line size in words

x

is the data size for elements of Y

For this case, assume LSIZE is CTIcache line size (32 bytes on X2000 servers) in single precision words (8 words). Note that

( ( MOD ( LOC(Y(1))-4, LSIZE*4) + 4) /4)

returns a value in the set 1, 2, 3, ..., LSIZE, so LREM is in the range 0 to 7.

Then a loop such as:

DO I = 1, N 
Y(I) = ...
ENDDO

can be transformed to:

C$DIR NO_PARALLEL 
DO I = 1, MIN (LREM, N) ! 0 <= LREM < 8
Y(I) = ...
ENDDO
C$DIR PREFER_PARALLEL (CHUNK_SIZE = 16)
DO I = LREM+1, N
! Y(LREM+1) IS ON A CACHE LINE BOUNDARY
Y(I) = ...
ENDDO

The first loop takes care of elements from the first (if any) partial cache line of data. The second loop begins on a cache line boundary, and can be controlled with CHUNK_SIZE to avoid false sharing among the threads.

Working with dependences

Data dependences in loops may prevent parallelization and prevent the elimination of false cache line sharing. If certain conditions are met, some performance gains can be achieved.

For example, consider the following code:

COMMON /ALIGNED / P(128,128), Q(128,128), R(128,128)
REAL*4 P, Q, R
DO J = 2, 128
DO I = 2, 127
P(I-1,J) = SQRT (P(I-1,J-1) + 1./3.)
Q(I ,J) = SQRT (Q(I ,J-1) + 1./3.)
R(I+1,J) = SQRT (R(I+1,J-1) + 1./3.)
ENDDO
ENDDO

Only the I loop can be parallelized. (Because of the loop-carried dependences in the J loop, it cannot be parallelized.) It is impossible to distribute the iterations such that there will be no false cache line sharing in the above loop. If all loops that refer to these arrays always use the same offsets (which is unlikely) then you could make dimension adjustments that would allow a better iteration distribution. For example, the following would work well for 8 threads:

      COMMON /ADJUSTED/ P(128,128), PAD1(15), Q(128,128),
> PAD2(15), R(128,128)

DO J = 2, 128
C$DIR PREFER_PARALLEL (CHUNK_SIZE=16)
DO I = 2, 127
P(I-1,J) = SQRT (P(I-1,J-1) + 1./3.)
Q(I ,J) = SQRT (Q(I ,J-1) + 1./3.)
R(I+1,J) = SQRT (R(I+1,J-1) + 1./3.)
ENDDO
ENDDO

Padding 60 bytes before the declarations of both Q and R causes the P(1,J), Q(2,J), and R(3,J) to be aligned on 64-byte boundaries for all J. Combined with a CHUNK_SIZE of 16, this causes threads to assign data to unique whole cache lines.

Often in real-world code you will find a mix of all the above problems in some CPU-intensive loops. You will not be able to avoid all false cache line sharing, but by careful inspection of the problems and careful application of some of the workarounds shown here, you will usually be able to significantly enhance performance of your parallel loops.

Printable version
Privacy statement Using this site means you accept its terms Feedback to webmaster
© Hewlett-Packard Development Company, L.P.