| United States-English |
|
|
|
![]() |
Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide > Chapter 8 Programming conventions for
optimal codeFalse cache line sharing |
|
False cache line sharing is a form of cache thrashing. It occurs whenever two or more threads in a parallel program are assigning different data items in the same cache line. This section discusses how to avoid false cache line sharing by restructuring the data layout and controlling the distribution of loop iterations among threads. To simplify explanations, only Fortran examples are given. Consider the following example:
Imagine eight threads, each executing one of the above iterations. Assume that A(1) is on a processor cache line boundary (32-byte boundary for V2200 and X2000 servers) so that all eight elements are in the same cache line. Only one thread at a time can "own" the cache line, so not only is the above loop, in effect, run serially, but every assignment by a thread requires an invalidation of the line in the cache of its previous "owner." These problems would likely eliminate any benefit of parallelization. Now consider this example:
Imagine 8 threads working on the I loop in parallel (the J loop cannot be parallelized because of the dependence). Table 8-1 “Initial mapping of array to cache lines” shows how the array maps to cache lines, assuming that B(1,1) is on a cache line boundary. Array entries that fall on cache line boundaries are noted by hashmarks(#). Table 8-1 Initial mapping of array to cache lines
Array entries surrounded by hashmarks(#) are on cache line boundaries. The Exemplar compilers, by default, give each thread about the same number of iterations, assigning (if necessary) one extra iteration to some threads until all iterations are assigned to a thread. Table 8-2 “Default distribution of the I loop” shows the default distribution of the I loop across 8 threads. Table 8-2 Default distribution of the I loop
This distribution of iterations causes threads to share cache lines. For example, thread 0 assigns the elements B(9:12,1) and thread 1 assigns elements B(13:16,1) in the same cache line. In fact, every thread shares cache lines with at least one other thread; most share cache lines with two other threads. This type of sharing is called false because it is a result of the data layout and the compiler's distribution of iterations; it is not inherent in the algorithm itself. Therefore, it can be reduced or even removed by
Because false cache line sharing is partially due to the layout of the data, one step in avoiding it is to adjust the layout. Typically, these adjustments are made by aligning data on cache line boundaries. (Aligning arrays generally improves performance; however, it can occasionally decrease performance.) The second step in avoiding false cache line sharing (which is covered in the section, “Distributing iterations on cache line boundaries”) is to adjust the distribution of loop iterations. Note the assumption that in the previous example, array B starts on a cache line boundary. The methods below force arrays in Fortran to start on cache line boundaries:
The methods below force arrays in C to start on cache line boundaries:
Multidimensional arrays can also be aligned on cache line boundaries. Recall that in the example from the beginning of the section:
we assumed B(1,1) starts on a cache line boundary. However, because the iteration distribution caused alignment on cache line boundaries to vary from dimension to dimension, performance suffered. Choose a value x for the leftmost dimension (rightmost in C and C++) in arrays so that x times the data size (in bytes) is an integral multiple of the CTIcache line size. (The code that follows gives an example of this idea.) Using such a value aligns data on CTIcache line boundaries at the same index points in all dimensions. On X2000 servers, both the processor cache lines and the CTIcache lines are 32 bytes. (V2200 servers and nonscalable SMPs do not have CTIcaches.) For general use, try to align everything on 32-byte boundaries for V2200 and X2000 servers. This alignment will help to eliminate false cache line sharing between processor caches and also between CTIcaches where the penalty for false sharing is even higher. Where possible, try to parameterize cache line size in anticipation of future systems that might have different cache line sizes. Changing the example so that array B is aligned and padded (to 64 bytes), the leftmost dimension is now 112 instead of 100. (See the modified example below.) The number 112 is the smallest value x greater than 100 such that x times the data size (4 bytes) is an integral multiple of 64.
Note that loop limits have not changed. Placing B in a COMMON block has forced B(1,1) onto a cache line boundary (64 bytes), and changing the first dimension to 112 assures that B(1,2), B(1,3), ..., B(1,100) all start on cache line boundaries. Table 8-3 “Restructured mapping of array to cache lines” shows how the restructured array maps to (processor) cache lines. The next section, "“Distributing iterations on cache line boundaries”," explains how to make the compiler distribute iterations so that threads work on whole cache lines. Table 8-3 Restructured mapping of array to cache lines
Array entries surrounded by hashmarks (#) are on cache line boundaries. Recall that the default iteration distribution causes thread 0 to work on iterations 1-12 and thread 1 to work on iterations 13-25, and so on. Even though the cache lines are aligned across the columns of the array (see Table 8-3 “Restructured mapping of array to cache lines”), we still need to change the iteration distribution. Use CHUNK_SIZE to change the distribution:
You must specify a constant CHUNK_SIZE. However, the ideal would be to distribute work so that all but one thread works on the same number of whole cache lines, and the remaining thread works on any partial cache line. For example, given:
the ideal CHUNK_SIZE would be: CHUNK_SIZE = LSIZE * (1 + ( (1 + (NITS - 1) / LSIZE ) - 1 )/NTHDS) For the code above, these numbers are:
CHUNK_SIZE = 16 causes threads 0, 1, ..., 6 to execute iterations 1-16, 17-32, ..., 81-96, respectively. Thread 7 executes iterations 97-100. As a result there is no false cache line sharing, and parallel performance is greatly improved. While you cannot specify the ideal CHUNK_SIZE for every loop, using CHUNK_SIZE = x where x times the data size (in bytes) is an integral multiple of 32 will eliminate false cache line sharing if the two conditions below are met:
The number 32 is used because the CTIcache line size is 32 bytes for X2000 servers. Sometimes a parallel loop has each thread update a unique element of a shared array which is further processed by thread 0 outside the loop. Consider the following Fortran example:
The problem here is that potentially all the elements of S are in a single cache line, so the assignments cause false sharing. One approach is to change the code to force the unique elements into different cache lines, as indicated below:
For multihypernode applications on X2000 servers, the dimensions should be S(8,number_of_threads) because the CTIcache line size is 32 bytes, and the data size is 4 bytes. Sometimes parallel tasks will assign unique scalar variables that are in the same cache line, as in the following example:
This problem is similar to the example in the previous section ("“Thread-specific array elements”"), and can be avoided by padding enough space between the two scalar variables so that the variables are in separate CTIcache lines: COMMON /RESULTS/ SUM, PAD(7), PRODUCT where PAD(7) represents 28 bytes—SUM and PAD(7) together take up 32 bytes, forcing PRODUCT into the next CTIcache line on multinode X2000 applications because the CTIcache lines are 32 bytes. The most common cache-thrashing complication using arrays and loops will be that arrays assigned within a loop are unaligned (and possibly unalignable) with each other. There are several possible causes for this:
Consider the following Fortran example:
The address of Y(1) is unknown. However, if elements of Y are heavily assigned in this routine, it may be worthwhile to compute an alignment, given by the following formula: LREM = LSIZE - ( ( MOD ( LOC(Y(1))-4, LSIZE*x) + 4) /x) where
For this case, assume LSIZE is CTIcache line size (32 bytes on X2000 servers) in single precision words (8 words). Note that ( ( MOD ( LOC(Y(1))-4, LSIZE*4) + 4) /4) returns a value in the set 1, 2, 3, ..., LSIZE, so LREM is in the range 0 to 7. Then a loop such as:
can be transformed to:
The first loop takes care of elements from the first (if any) partial cache line of data. The second loop begins on a cache line boundary, and can be controlled with CHUNK_SIZE to avoid false sharing among the threads. Data dependences in loops may prevent parallelization and prevent the elimination of false cache line sharing. If certain conditions are met, some performance gains can be achieved. For example, consider the following code:
Only the I loop can be parallelized. (Because of the loop-carried dependences in the J loop, it cannot be parallelized.) It is impossible to distribute the iterations such that there will be no false cache line sharing in the above loop. If all loops that refer to these arrays always use the same offsets (which is unlikely) then you could make dimension adjustments that would allow a better iteration distribution. For example, the following would work well for 8 threads:
Padding 60 bytes before the declarations of both Q and R causes the P(1,J), Q(2,J), and R(3,J) to be aligned on 64-byte boundaries for all J. Combined with a CHUNK_SIZE of 16, this causes threads to assign data to unique whole cache lines. Often in real-world code you will find a mix of all the above problems in some CPU-intensive loops. You will not be able to avoid all false cache line sharing, but by careful inspection of the problems and careful application of some of the workarounds shown here, you will usually be able to significantly enhance performance of your parallel loops. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||