| United States-English |
|
|
|
![]() |
Parallel Programming Guide for HP-UX Systems: K-Class and V-Class Servers > Chapter 13 TroubleshootingFalse cache line sharing |
|
False cache line sharing is a form of cache thrashing. It occurs whenever two or more threads in a parallel program are assigning different data items in the same cache line. This section discusses how to avoid false cache line sharing by restructuring the data layout and controlling the distribution of loop iterations among threads. Consider the following Fortran code:
Assume there are eight threads, each executing one of the above iterations. A(1) is on a processor cache line boundary (32-byte boundary for V2250 servers) so that all eight elements are in the same cache line. Only one thread at a time can "own" the cache line, so not only is the above loop, in effect, run serially, but every assignment by a thread requires an invalidation of the line in the cache of its previous "owner." These problems would likely eliminate any benefit of parallelization. Taking all of the above into consideration, review the code:
Assume there are eight threads working on the I loop in parallel. The J loop cannot be parallelized because of the dependence. Table 13-2 “Default distribution of the I loop” shows how the array maps to cache lines, assuming that B(1,1) is on a cache line boundary. Array entries that fall on cache line boundaries are in shaded cells. Array entries that fall on cache line boundaries are noted by hashmarks(#). Table 13-1 Initial mapping of array to cache lines
Array entries surrounded by hashmarks(#) are on cache line boundaries. HP compilers, by default, give each thread about the same number of iterations, assigning (if necessary) one extra iteration to some threads. This happens until all iterations are assigned to a thread. Table 13-2 “Default distribution of the I loop” shows the default distribution of the I loop across 8 threads. Table 13-2 Default distribution of the I loop
This distribution of iterations causes threads to share cache lines. For example, thread 0 assigns the elements B(9:12,1), and thread 1 assigns elements B(13:16,1) in the same cache line. In fact, every thread shares cache lines with at least one other thread. Most share cache lines with two other threads. This type of sharing is called false because it is a result of the data layout and the compiler's distribution of iterations. It is not inherent in the algorithm itself. Therefore, it is reduced or even removed by:
Because false cache line sharing is partially due to the layout of the data, one step in avoiding it is to adjust the layout. Adjustments are typically made by aligning data on cache line boundaries. Aligning arrays generally improves performance. However, it can occasionally decrease performance. The second step in avoiding false cache line sharing is to adjust the distribution of loop iterations. This is covered in “Distributing iterations on c ache line boundaries”. Note the assumption that in the previous example, array B starts on a cache line boundary. The methods below force arrays in Fortran to start on cache line boundaries:
The methods below force arrays in C to start on cache line boundaries:
Recall that the default iteration distribution causes thread 0 to work on iterations 1-12 and thread 1 to work on iterations 13-25, and so on. Even though the cache lines are aligned across the columns of the array (see Table 13-2 “Default distribution of the I loop”), the iteration distribution still needs to be changed. Use the CHUNK_SIZE attribute to change the distribution:
You must specify a constant CHUNK_SIZE attribute. However, the ideal is to distribute work so that all but one thread works on the same number of whole cache lines, and the remaining thread works on any partial cache line. For example, given the following: NITS = number of iterations NTHDS = number of threads LSIZE = line size in words (8 for 4-byte data, 4 for 8-byte data, 2 for 16-byte data) size in words (8 for 4-byte data the ideal CHUNK_SIZE would be:
For the code above, these numbers are: NITS = 100 LSIZE = 8 (aligns on V2250 boundaries for 4-byte data) NTHDS =8
CHUNK_SIZE = 16 causes threads 0, 1, ..., 6 to execute iterations 1-16, 17-32, ..., 81-96, respectively. Thread 7 executes iterations 97-100. As a result there is no false cache line sharing, and parallel performance is greatly improved. You cannot specify the ideal CHUNK_SIZE for every loop. However, using CHUNK_SIZE = x where x times the data size (in bytes) is an integral multiple of 32, eliminates false cache line sharing. This is only if the following two conditions below are met:
The number 32 is used because the cache line size is 32 bytes for V2250 servers. Sometimes a parallel loop has each thread update a unique element of a shared array, which is further processed by thread 0 outside the loop. Consider the following Fortran code in which false sharing occurs:
The problem here is that potentially all the elements of S are in a single cache line, so the assignments cause false sharing. One approach is to change the code to force the unique elements into different cache lines, as indicated in the following code:
Sometimes parallel tasks assign unique scalar variables that are in the same cache line, as in the following code:
The most common cache-thrashing complication using arrays and loops occurs when arrays assigned within a loop are unaligned with each other. There are several possible causes for this:
Consider the following Fortran code:
The address of Y(1) is unknown. However, if elements of Y are heavily assigned in this routine, it may be worthwhile to compute an alignment, given by the following formula: LREM = LSIZE - ( ( ( LOC(Y(1))-4, LSIZE*x) + 4) /x) where
For this case, LSIZE on V2250 servers is 32 bytes in single precision words (8 words). Note that: ( ( MOD ( LOC(Y(1))-4, LSIZE*4) + 4) /4) returns a value in the set 1, 2, 3, ..., LSIZE, so LREM is in the range 0 to 7. Then a loop such as:
is transformed to:
The first loop takes care of elements from the first (if any) partial cache line of data. The second loop begins on a cache line boundary, and is controlled with CHUNK_SIZE to avoid false sharing among the threads. Data dependences in loops may prevent parallelization and prevent the elimination of false cache line sharing. If certain conditions are met, some performance gains are achieved. For example, consider the following code:
Only the I loop is parallelized, due to the loop-carried dependences in the J loop. It is impossible to distribute the iterations so that there is no false cache line sharing in the above loop. If all loops that refer to these arrays always use the same offsets (which is unlikely) then you could make dimension adjustments that would allow a better iteration distribution. For example, the following would work well for 8 threads:
Padding 60 bytes before the declarations of both Q and R causes the P(1,J), Q(2,J), and R(3,J) to be aligned on 64-byte boundaries for all J. Combined with a CHUNK_SIZE of 16, this causes threads to assign data to unique whole cache lines. You can usually find a mix of all the above problems in some CPU-intensive loops. You cannot avoid all false cache line sharing, but by careful inspection of the problems and careful application of some of the workarounds shown here, you can significantly enhance the performance of your parallel loops. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||