HP 9000 systems employ high-speed cache memory to store the
most recently used instructions and data. The location of instructions
and data in the cache is a function of the n
low-order bits of the address of the data or instruction. Consequently,
instructions and data that have the same n
low-order bits compete for the same cache space. Such competing
objects are called cache aliases of each
other.
Programs that contain loops in which two or more cache aliases
are referenced will run more slowly than expected because the system
must repeatedly swap aliased objects into and out of the cache.
Data cache aliasing can occur when two data
objects that are far apart in virtual memory (so that their addresses
have the same low-order bits but different high-order bits) are
referenced in the same loop. Instruction cache aliasing,
which is less likely to happen but has more serious effects, can
occur when two routines invoked in a loop are aliases of each other.
In this case, the invocation of the second routine forces the first
routine out of the cache, and the next time through the loop the
first routine pushes the second routine out of the cache.
Table 7-1 “Typical Instruction Cache
Aliasing Situation” illustrates this
situation using a 1K-byte cache example. If the cache addresses
of subr1 and
subr2 overlap
significantly, instruction cache aliasing can produce a severe performance
degradation.
Table 7-1 Typical Instruction Cache
Aliasing Situation
Code | Description | Address |
|---|
DO 10 I = 1,1000 CALL subr1 CALL subr2 10 CONTINUE END | Main program loop | |
SUBROUTINE subr1 . . RETURN END | First subroutine | P P+L1 (L1 = length of subr1 code) |
SUBROUTINE subr2 . . RETURN END | Second
subroutine | Q Q+L2 (L2 = length of subr2 code) |
Detecting problems with instruction cache aliases is an inexact
process because the system does not keep records of cache hits and
misses. You may notice, after making some modifications to one piece
of code, that a seemingly unrelated routine runs more slowly than
before. This can happen if your modifications change the virtual
address map of your program and cause increased cache aliasing occurrences.
Or you may suspect that a routine is running much more slowly than
it should.
Fixing instruction
cache problems is also difficult because you do not have much control
over what virtual addresses are assigned to data and instructions.
However, you can move routines around within a source module, and
you can change the order in which object modules are joined by the
linker. There are no hard and fast rules about how the linker orders
various data and text sections, but it tends to place them in the
same order in which you list them in the compiler or ld
command line. You can take advantage of this fact by reordering
the modules on the command line.
Data cache aliasing is a more
common performance problem than instruction cache aliasing, but
it is also easier to deal with. If you suspect that your application
is experiencing data cache performance problems on scalar or small
vectors of data, you can usually correct the situation by rearranging
the order in which your variables are allocated in memory. In Fortran
you can do this by reordering common block assignments.
If your application uses very large arrays of data, hundreds
of kilobytes or megabytes in length, than chances are good that
you will experience data cache aliasing problems no matter how your
data arrays are allocated in memory, simply because the arrays are
too big to fit in the data cache all at once. In this case, you
may be able to improve performance by using the +Odataprefetch
option, either alone or in conjunction with +Ovectorize.
(See “Optimizing Your Program” for
information about these options.)
Another
way to improve performance for very large arrays is to increase
the locality of references to your data. This technique, sometimes
called tiling, requires selecting
an algorithm that processes as much data as possible while the data
is resident in the cache so as to minimize the number of times the
data must be re-cached later. The routines in the BLAS library use
tiling when they operate on large matrices of data. For example,
in multiplying two large matrices, each matrix is cut into tiles,
which are processed in pairs. The size of the tiles is determined
at run time by the size of the matrices and the size of the data
cache on the system executing the application.