Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide > Chapter 2 Architecture overview

System organization: scalable SMPs


NOTE: HP-UX is the operating system on V2200 servers. Only the SPP-UX operating system runs on Hewlett-Packard X2000 servers.

Think of a scalable Exemplar SMP as a shared-memory computer with two levels of memory latency. Memory available on the current hypernode (accessed through the crossbar) constitutes the first level, and all other memory (accessed through the CTI rings) constitutes the second.

Exemplar V2200 servers consist of one hypernode that has 4 to 16 PA-8200 processors and 256 Mbytes to 16 Gbytes of physical memory. The X2000 servers consist of 1 to 4 hypernodes with a total of 16 to 64 PA-8000 processors and 16 Gbytes to 64 Gbytes of memory.

Processors within a hypernode communicate with each other, with memory, and with peripherals via a nonblocking crossbar. V2200 servers feature the HP HyperPlane crossbar. Figure 2-2 “V2200 hypernode overview” shows the V2200 crossbar configuration. Figure 2-3 “X2000 hypernode overview” shows the crossbar configuration for X2000 servers. Processors in different hypernodes communicate via CTI rings. These rings are configured in a one-dimensional interconnect for X2000 systems consisting of two or three hypernodes and in two-dimensional interconnects for systems with four hypernodes.

Figure 2-2 “V2200 hypernode overview” and Figure 2-3 “X2000 hypernode overview” show overviews of a V2200 hypernode and a single X2000 hypernode, respectively. These servers are the same except for the V2200's HyperPlane crossbar and the X2000's CTI controllers. Two CPUs and a PCI bus controller share a single CPU agent. The CPUs communicate with the rest of the machine through the CPU agent. The Memory Access Controllers (MACs) provide the interface between the memory banks and the rest of the machine. All intrahypernode memory accesses take approximately 510 nanoseconds on X2000 servers, regardless of location, because they must traverse the crossbar, which gives equal access to all hypernode memory from all CPUs. The CTI rings are used for internode communication.

Figure 2-4 “X2000 crossbar connections” shows a more detailed view of the connections between the CPU agents, the crossbar, and the Memory Access Controllers in an X2000 server.

Figure 2-2 V2200 hypernode overview

Figure 2-3 X2000 hypernode overview

Figure 2-4 X2000 crossbar connections

Any processor can access memory on another hypernode by routing its request through its own crossbar to a CTI ring that attaches to that hypernode. Data is returned via a CTI ring and then routed via the crossbar back to the requesting processor.

Figure 2-5 “CTI ring connections for two-hypernode X2000 server” shows the CTI ring connections between two X2000 hypernodes. See Figure 2-3 “X2000 hypernode overview” for details not available in the figure below.

Figure 2-5 CTI ring connections for two-hypernode X2000 server

CTI rings are unidirectional; that is, packets can move in only one direction on a ring. Consider the three-hypernode X2000 server illustrated in Figure 2-6 “Unidirectional flow on a CTI ring”; for simplicity, only one of the eight rings is shown. If Node 0 initiates communication with Node 2, the request passes through the CTI controller on Node 1 to reach Node 2. The response from Node 2 to Node 0 travels in the same direction as the request and covers the remainder of the ring.
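The hop counts on a unidirectional ring can be sketched in C as follows. This is only an illustration: the function names are hypothetical, and the sketch assumes the nodes are numbered 0 through n-1 in the direction packets travel, which the guide does not state.

```c
/* Illustrative sketch, not part of the Exemplar software.
 * Assumption: nodes are numbered 0..n-1 in the direction of packet flow. */

/* Number of ring links a packet crosses going from node `from` to node `to`. */
int ring_hops(int from, int to, int n)
{
    return ((to - from) % n + n) % n;
}

/* A request followed by its response; for two distinct nodes the pair
 * covers the whole ring, that is, n links. */
int round_trip_hops(int from, int to, int n)
{
    return ring_hops(from, to, n) + ring_hops(to, from, n);
}
```

For the three-node case in the text, ring_hops(0, 2, 3) gives the two links of the request (through Node 1) and ring_hops(2, 0, 3) gives the single remaining link of the response.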

Figure 2-6 Unidirectional flow on a CTI ring

As a system scales, a one-dimensional interconnect becomes less efficient because the ring grows to include a CTI controller for every hypernode in the system. For X2000 servers, when the number of nodes exceeds three, a one-dimensional interconnect is no longer optimal. A two-dimensional interconnect is then used to shorten paths between requesting and responding nodes.

The two-dimensional interconnect uses dimension-order routing to determine the path taken by a packet. A request packet first travels the required distance on the X-dimension ring and then, if needed, on the Y-dimension ring. On the return path, the response packet again travels the X-dimension ring first, then the Y-dimension ring. Thus, the response packet does not necessarily follow the same path as the request packet.

Figure 2-7 “CTI ring connections for four-hypernode X2000 server” shows a four-hypernode X2000 server using a two-dimensional interconnect. The node IDs (0, 1, 8, and 9 in the figure) are represented in 5-bit fields. As the IDs listed below show, the three low-order bits represent the X dimension and the two high-order bits represent the Y dimension.

Nodes connected in the X dimension are:

  • Node 0 (ID:00000) and node 1 (ID:00001)

  • Node 8 (ID:01000) and node 9 (ID:01001)

Nodes connected in the Y dimension are:

  • Node 0 (ID:00000) and node 8 (ID:01000)

  • Node 1 (ID:00001) and node 9 (ID:01001)
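The ID layout can be checked with a short C sketch. The bit positions used here are an assumption inferred from the four IDs listed above (X position in the three low-order bits, Y position in the two high-order bits), and the function names are hypothetical.

```c
/* Illustrative sketch. Assumed layout of the 5-bit node ID, inferred from
 * the IDs 00000, 00001, 01000, and 01001: bits 0-2 hold the X position,
 * bits 3-4 hold the Y position. */

unsigned node_x(unsigned id) { return id & 0x7u; }
unsigned node_y(unsigned id) { return (id >> 3) & 0x3u; }

/* Two distinct nodes sit on the same X-dimension ring when only their
 * X positions differ, and on the same Y-dimension ring when only their
 * Y positions differ. */
int same_x_ring(unsigned a, unsigned b)
{
    return a != b && node_y(a) == node_y(b);
}

int same_y_ring(unsigned a, unsigned b)
{
    return a != b && node_x(a) == node_x(b);
}
```

Under this layout, nodes 0 and 9 share neither ring, so a packet between them would use one X-dimension ring and one Y-dimension ring.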

Figure 2-7 CTI ring connections for four-hypernode X2000 server

CPUs communicate directly with their own instruction and data caches, which can be accessed by the processor in one clock (assuming a full pipeline). X2000 servers use 1-Mbyte off-chip instruction caches and data caches. V2200 servers use 2-Mbyte off-chip instruction caches and data caches.

Memory

Each process running on a V-Class or K-Class server (running HP-UX 11.0 and above) accesses its own 16-Tbyte virtual address space. Almost all of this space is available to hold program text, data, and the stack; the space used by the operating system is negligible.

On X2000 servers running the SPP-UX operating system, each process can access its own 4-Gbyte virtual address space. Again, most of this space is available to program text, data, and the stack, with only a negligible amount of space used by the operating system.

The stack size is configurable; refer to the section “Default stack size” for more information.

Processes cannot access each other's virtual address spaces. This virtual memory maps to the physical memory of the system on which the process is running.

Physical memory

All memory (excluding processor caches) on V2200 servers and X2000 servers is implemented in memory banks. In 16-processor V2200 servers and X2000 servers, each hypernode consists of 32 memory banks. This memory is typically partitioned (by the system administrator) into hypernode-local, system-global, CTIcache (on multinode systems), and buffer cache. It is also interleaved as described in the “Interleaving” section later in this chapter.

Hypernode-local memory, as its name implies, is local to its hypernode and cannot be accessed by other hypernodes. This is where application and operating-system executables, as well as user process data that has been explicitly declared private, reside.

System-global memory is accessible by all processors in a given system.

The CTIcache is used to store copies of global data fetched from other hypernodes.

The buffer cache is a file system cache and is used to encache items that have been read from disk and items that are to be written to disk.

Virtual memory

Virtual memory is divided into five classes. The compilers choose default classes to provide your programs with normal SMP memory-transaction semantics. You can also manually assign data to memory classes to improve data locality and further increase performance. However, doing so also requires some other aspects of optimization, particularly loop parallelization, to be handled manually.

Brief descriptions of the virtual memory classes and their physical memory mappings follow:

thread_private

This memory is private to each thread of a process. A thread_private data object has a unique virtual address for each thread within its hypernode. These addresses map to unique physical addresses in hypernode-local physical memory on each hypernode. Threads access the physical copies of thread_private data residing on their own hypernode when they access thread_private virtual addresses.

node_private

This memory is shared among the threads running on a given hypernode but is inaccessible from other hypernodes. A node_private data object has a unique virtual address by which all threads on all hypernodes access it. This address maps to one physical address per hypernode; when a thread accesses the data, it receives the value contained in the physical memory of its own hypernode.

near_shared

Data objects of the near_shared class have a single virtual address by which they can be accessed from any hypernode in the system. Physically, near_shared data is stored entirely within the memory of a particular hypernode. All data of a near_shared object maps to physical addresses on that hypernode.

far_shared

Data objects of the far_shared class have a single virtual address by which they can be accessed from any hypernode in the system. Physically, far_shared data is distributed by pages, in a manner that is approximately round-robin, to all the hypernodes in the system, so the virtual address maps to a single physical address located on one of the hypernodes.

block_shared

Data objects of the block_shared class have a single virtual address by which they can be accessed from any hypernode in the system. Physically, block_shared data is distributed in blocks equally among the hypernodes on which the process is executing, one block per hypernode. block_shared memory must be dynamically allocated; the programmer can then easily ensure that threads on a hypernode make most of their accesses to the block residing on their hypernode.
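The difference between the far_shared and block_shared distributions can be sketched in C. This is an idealized model, not the operating system's actual placement policy: the text says far_shared pages are only approximately round-robin, and the function names, the zero-based hypernode numbering, and the page size passed in are all assumptions made for illustration.

```c
#include <stddef.h>

/* Idealized sketch. Hypernodes are assumed to be numbered 0..nodes-1. */

/* far_shared: consecutive pages go to consecutive hypernodes
 * (strict round-robin is assumed here; the real placement is approximate). */
size_t far_shared_node(size_t byte_offset, size_t page_bytes, size_t nodes)
{
    return (byte_offset / page_bytes) % nodes;
}

/* block_shared: the object is cut into `nodes` equal contiguous blocks,
 * one per hypernode. */
size_t block_shared_node(size_t byte_offset, size_t total_bytes, size_t nodes)
{
    size_t block_bytes = (total_bytes + nodes - 1) / nodes; /* round up */
    return byte_offset / block_bytes;
}
```

In this model, neighboring pages of a far_shared object land on different hypernodes, while a thread that works on one contiguous quarter of a block_shared object on a four-hypernode system touches a single hypernode.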

Using these memory classes is discussed in detail in Chapter 5, “Memory classes”.

Data caches

V2200 servers and X2000 servers use high-speed data caches to improve performance, but the architectures differ in their implementations of the cache. CTIcaches are used to improve performance on multihypernode systems. (A CTIcache is a partition of physical memory that exists on each hypernode and is used to store copies of global data fetched from other hypernodes.)

Cache lines

Before examining the specifics of caches, you must understand how data is moved between the cache and memory. A cache line describes the size of a chunk of contiguous data that must be copied into or out of a cache in one operation. V2200 servers use processor cache lines; X2000 servers use processor cache lines and CTIcache lines.

When a processor experiences a cache miss—that is, requests data that is not already encached—the cache line containing the address of the requested data is moved to the cache. This cache line also contains some number of other data objects that were not specifically requested; this number varies according to the object size and the type of cache line in question.

A CTIcache line moves data from shared memory to the CTIcache when a CTIcache miss occurs. For X2000 servers, the CTIcache line is 32 bytes, and each CTIcache line matches one-to-one to a 32-byte processor cache line. When a processor cache miss occurs, the requested data is fetched as part of a contiguous 32-byte cache line. If this data resides in any memory on the processor's hypernode, it need not traverse the CTIcache; if it resides in the memory of another hypernode, it will be fetched through the CTIcache.

All processor-encached data not residing on the processor's hypernode must pass through the CTIcache, so if this data is contained in processor cache, it is also resident in the CTIcache.

One reason cache lines are employed is to allow for data reuse. Data in a cache line is subject to reuse if, while the line is encached, any of the data elements contained in the line besides the requested element are referenced by the program, or if the requested element is referenced more than once.

Because data can only be moved to and from memory as part of a cache line, both load and store operations cause their operands to be encached. Cache-coherency hardware invalidates cache lines in other processors when they are stored to by a particular processor. This indicates to other processors that they must load the cache line from memory the next time they reference its data.

Direct-mapped data caches

V2200 servers use 2-Mbyte off-chip write-back direct-mapped data caches. In a direct-mapped cache, the cache address for a given data object is a function of the object's full virtual address. For V2200 systems, cache addresses are computed within a process using the following formula:

cache_address = MOD(virtual_address, 2^21)

where the MOD function yields the remainder when virtual_address is divided by 2^21. The value of 2^21 is 2,097,152, or 2 Mbytes. Thus, a data object's cache address is the least-significant 21 bits of its virtual address.

X2000 servers use 1-Mbyte off-chip write-back direct-mapped data caches. For X2000 systems, cache addresses are computed within a process using the following formula:

cache_address = MOD(virtual_address, 2^20)

where the MOD function yields the remainder when virtual_address is divided by 2^20. The value of 2^20 is 1,048,576, or 1 Mbyte. Thus, a data object's cache address is the least-significant 20 bits of its virtual address.
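The two formulas can be written as one C function. This is a sketch for experimenting with addresses; the macro and function names are hypothetical, and only the cache sizes come from the text.

```c
#include <stdint.h>

/* Cache sizes from the text: 2 Mbytes on V2200, 1 Mbyte on X2000. */
#define V2200_CACHE_BYTES (UINT64_C(1) << 21)
#define X2000_CACHE_BYTES (UINT64_C(1) << 20)

/* Direct-mapped cache address: the remainder of the virtual address divided
 * by the cache size. Because the size is a power of two, the remainder is
 * simply the low-order bits, so a mask is enough. */
uint64_t cache_address(uint64_t virtual_address, uint64_t cache_bytes)
{
    return virtual_address & (cache_bytes - 1);
}
```

For example, virtual address 0x300040 has cache address 0x40 in the 1-Mbyte X2000 cache and 0x100040 in the 2-Mbyte V2200 cache.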

This mapping scheme can result in cache thrashing, which is discussed in the section “Cache thrashing”.

Prefetching with the +Odataprefetch compiler option

Prefetching is supported through the command-line option +Odataprefetch. Prefetching encaches data that will be used in future iterations of a loop, while the processor is executing current iterations. The prefetch distance (distance in terms of the number of processor cycles) varies and is tuned to the target machine architecture. Prefetching is not beneficial to loops whose data fits in the cache. For loops whose data does not fit in the cache, performance improvement can be substantial. This option is off (+Onodataprefetch) by default. For additional information on this option, see Appendix D, “Optimization options”.

Data alignment

Aligning data addresses on cache line boundaries allows for efficient data reuse in loops (refer to Chapter 3, “Compiler optimizations”). The linker automatically aligns data over 32 bytes on a 32-byte boundary. Also, it aligns data greater in size than a page on a 64-byte boundary.

You can align data on 64-byte boundaries by:

  • Using Fortran ALLOCATE statements. (Applies only to parallel executables.)

  • Using the C functions malloc or memory_class_malloc. (Applies only to parallel executables.)

Only the first item in a list of data objects appearing in any of these statements is aligned on a cache line boundary. To make most efficient use of available memory, the total size, in bytes, of any array appearing in one of these statements should be an integral multiple of 32. Sizing your arrays this way prevents data following the first array from becoming misaligned. Scalar variables should be listed after arrays and ordered from longest data type to shortest (for example, REAL*8 scalars should precede REAL*4 scalars).
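The sizing rule can be sketched as a small C helper that rounds an element count up until the array's total size is a multiple of the line size. The function name is hypothetical; the 32-byte multiple is the figure given in the text.

```c
#include <stddef.h>

/* Smallest element count >= n_elems whose total size in bytes is an
 * integral multiple of line_bytes. Illustrative helper only. */
size_t padded_elements(size_t n_elems, size_t elem_bytes, size_t line_bytes)
{
    /* Terminates within line_bytes steps, because line_bytes consecutive
     * multiples of elem_bytes always include a multiple of line_bytes. */
    while ((n_elems * elem_bytes) % line_bytes != 0)
        n_elems++;
    return n_elems;
}
```

A REAL*8 array of 1001 elements (8008 bytes) would be declared with 1004 elements (8032 bytes), so that whatever follows it starts on a 32-byte boundary whenever the array itself does.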

NOTE: Aliases can inhibit data alignment. Be especially careful when equivalencing arrays in Fortran.

You can force CTIcache boundary alignment for specific scalar variables or arrays by using the align_cti directive or pragma. The Fortran directive has the form:

C$DIR ALIGN_CTI(namelist)

In C it has the form:

#pragma _CNX align_cti(namelist)

where namelist is a list of arrays and/or scalars that will be aligned on CTIcache boundaries.

Cache thrashing

Cache thrashing occurs when two or more data items that are needed by the program map to the same cache address. Each time one of the items is encached, it overwrites another needed item, causing cache misses and impairing data reuse. This section explains how thrashing happens on X2000 servers.

A type of thrashing known as false cache line sharing is discussed in the section “False cache line sharing”.

X2000 servers use a 1-Mbyte direct-mapped data cache. Thus, cache thrashing can become a problem on X2000 servers when two encachable data objects are exactly a multiple of 1 Mbyte apart in virtual memory. To eliminate the problem, you must ensure that your data is not spaced this way.

Consider the following Fortran example:

REAL*8 ORIG(65536), NEW(65536), DISP(65536)
COMMON /BLK1/ ORIG, NEW, DISP
.
.
.
DO I = 1, N
NEW(I) = ORIG(I) + DISP(I)
ENDDO

In this example, the arrays ORIG and DISP overwrite each other in a 1-Mbyte cache. Because the arrays are in a COMMON block, we know that they will be allocated in contiguous memory in the order shown. Each array element occupies 8 bytes, so each array occupies 0.5 Mbyte (8 × 65536 = 524288 bytes); therefore arrays ORIG and DISP are exactly 1 Mbyte apart in memory, and all their elements have identical cache addresses. The layout of the arrays in memory and in the data cache is shown in Figure 2-8 “Array layouts— cache-thrashing”.
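The collision can be verified with a few lines of C that compute cache addresses from byte offsets within the COMMON block. The names are hypothetical, and the sketch treats the start of the block as offset 0; because ORIG and DISP are exactly 1 Mbyte apart, the conclusion holds for any starting address.

```c
#include <stddef.h>

#define CACHE_BYTES ((size_t)1 << 20)   /* 1-Mbyte X2000 data cache */
#define ELEM_BYTES  ((size_t)8)         /* REAL*8 */
#define ARRAY_ELEMS ((size_t)65536)

/* Byte offsets of the three arrays from the start of COMMON /BLK1/. */
#define ORIG_BASE ((size_t)0)
#define NEW_BASE  (ORIG_BASE + ARRAY_ELEMS * ELEM_BYTES)
#define DISP_BASE (NEW_BASE + ARRAY_ELEMS * ELEM_BYTES)

/* Byte offset of element i (1-based, as in Fortran) of an array. */
size_t elem_offset(size_t base, size_t i)
{
    return base + (i - 1) * ELEM_BYTES;
}

/* Direct-mapped cache address of a byte offset. */
size_t cache_addr(size_t offset)
{
    return offset % CACHE_BYTES;
}
```

For every index I, cache_addr(elem_offset(ORIG_BASE, I)) equals cache_addr(elem_offset(DISP_BASE, I)), while the corresponding element of NEW sits half a cache away.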

Figure 2-8 Array layouts— cache-thrashing

When the addition in the body of the loop executes, the current elements of both ORIG and DISP must be fetched from memory into the cache. Because these elements have identical cache addresses, whichever is fetched last will overwrite the first. Remember that processor cache data is fetched 32 bytes at a time; to efficiently execute a loop such as this, the unused elements in the fetched cache line (3 extra REAL*8 elements are fetched in this case) must remain encached until they can be used in subsequent iterations of the loop. Because ORIG and DISP thrash each other, this reuse is never possible; every cache line of ORIG that is fetched is overwritten by the cache line of DISP that is subsequently fetched, and vice versa. The cache line is overwritten on every iteration; typically, in a loop like this, it would not be overwritten until all of its elements were used.

Because memory accesses take substantially longer than cache accesses, this severely degrades performance. Even if the overwriting involved the NEW array, which is stored rather than loaded on each iteration, thrashing would occur, because stores overwrite entire cache lines the same way loads do.

The problem is easily fixed by increasing the distance between the arrays. You can accomplish this by either increasing the array sizes or inserting a padding array.

The following example illustrates the padding approach:

REAL*8 ORIG(65536), NEW(65536), P(4), DISP(65536)
COMMON /BLK1/ ORIG, NEW, P, DISP
.
.
.

Here, the array P(4) moves DISP 32 bytes further from ORIG in memory. Now no two elements of the same index share a cache address, and for the given loop, this postpones cache overwriting until the entire current cache line is completely exploited. P is 4 elements, or 32 bytes, which prevents both processor cache thrashing and CTIcache thrashing on X2000 servers.

The alternate approach involves increasing the size of ORIG or NEW by 4 elements (32 bytes), as shown in the following example:

REAL*8 ORIG(65536), NEW(65540), DISP(65536)
COMMON /BLK1/ ORIG, NEW, DISP
.
.
.

Here, NEW has been increased by 4 elements, providing the padding necessary to prevent ORIG from sharing cache addresses with DISP. Figure 2-9 “Array layouts—non-thrashing” shows how both solutions prevent thrashing.
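The effect of both fixes can be estimated with a small simulation of a direct-mapped cache, sketched below in C. It is a simplified model under stated assumptions: a cold 1-Mbyte cache with 32-byte lines, a COMMON block that starts on a cache-size boundary, stores that allocate lines as loads do, N = 65536, and no other memory traffic. The function and variable names are hypothetical.

```c
#include <stddef.h>

#define N          65536L               /* loop trip count and array length */
#define ELEM_BYTES 8L                   /* REAL*8 */
#define LINE_BYTES 32L                  /* processor cache line */
#define NUM_LINES  ((1L << 20) / LINE_BYTES)   /* 1-Mbyte cache: 32768 lines */

static long tags[NUM_LINES];            /* memory line held by each cache slot */
static long misses;

/* Reference one byte offset; count a miss if its line is not encached. */
static void touch(long byte_offset)
{
    long line = byte_offset / LINE_BYTES;
    long slot = line % NUM_LINES;       /* direct-mapped placement */
    if (tags[slot] != line) {
        tags[slot] = line;
        misses++;
    }
}

/* Misses for NEW(I) = ORIG(I) + DISP(I), I = 1..N, with the layout
 * ORIG(N), NEW(new_elems), P(pad_elems), DISP(N) in one COMMON block. */
long count_misses(long new_elems, long pad_elems)
{
    long orig_base = 0;
    long new_base  = orig_base + N * ELEM_BYTES;
    long disp_base = new_base + (new_elems + pad_elems) * ELEM_BYTES;
    long i;

    for (i = 0; i < NUM_LINES; i++)
        tags[i] = -1;                   /* cold cache */
    misses = 0;

    for (i = 0; i < N; i++) {
        touch(orig_base + i * ELEM_BYTES);  /* load ORIG(I)  */
        touch(disp_base + i * ELEM_BYTES);  /* load DISP(I)  */
        touch(new_base  + i * ELEM_BYTES);  /* store NEW(I)  */
    }
    return misses;
}
```

In this model, count_misses(65536, 0) (the original layout) gives 147456 misses, because ORIG and DISP miss on every iteration. Both count_misses(65536, 4) (padding array P) and count_misses(65540, 0) (enlarged NEW) give 49152, one miss per array for every four iterations.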

Figure 2-9 Array layouts—non-thrashing

It is important to note that this is a highly simplified, worst-case example. On X2000 servers, thrashing can happen any time two data items that are referenced in the same loop are an integral multiple of 1 Mbyte apart in virtual memory. This can happen with data that is not stored in COMMON, in which case it is much more difficult to see, as such data can be stored noncontiguously and may be intermixed with completely unrelated data items.

The loop blocking optimization (described in Chapter 3, “Compiler optimizations”) will eliminate thrashing from certain nested loops, but not from all loops. Declaring arrays with dimensions that are not powers of two can help, but it will not necessarily eliminate the problem completely.

Using COMMON blocks in Fortran can also help; it allows you to accurately measure distances between data items, making thrashing problems easier to spot before they happen.

Interleaving

Physical pages are interleaved across the memory banks of a hypernode on a cache-line basis. (There are 32 banks per node in V2200 servers and X2000 servers.) Contiguous cache lines are assigned in round-robin fashion, first to the even banks, then to the odd, as shown in Figure 2-10 “V2200 and X2000 memory interleaving”.

Interleaving speeds memory accesses by allowing several processors to access contiguous data simultaneously. This is beneficial when a loop that manipulates arrays is split among many processors; in the best case, threads will access data in patterns with no bank contention. Even in the worst case, where each thread initially needs the same data from the same bank, after the initial contention delay, the accesses will be spread out among the banks.

Interleaving example

The following example illustrates a nested loop that accesses memory with very little contention. This example is greatly simplified for illustrative purposes, but the concepts apply to arrays of any size.

REAL*8 A(12,12), B(12,12)
...
DO J = 1, N
DO I = 1, N
A(I,J) = B(I,J)
ENDDO
ENDDO

Assume that arrays A and B are stored contiguously in memory, with A starting in bank 0, CTIcache line 0, processor cache line 0, as shown in Figure 2-11 “V2200 and X2000 interleaving of arrays A and B” for V2200 servers and X2000 servers.

Assume the Exemplar Fortran 90 compiler parallelizes the J loop to run on as many processors as are available in the system (up to N). Assuming N=12 and there are four processors available when the program is run, the J loop could be divided into four new loops, each with 3 iterations. Each new loop would run to completion on a separate processor. We will refer to these four processors as CPU0 through CPU3.

Figure 2-10 V2200 and X2000 memory interleaving

NOTE: This example is designed to simplify illustration. In reality, the dynamic selection optimization (discussed in Chapter 3, “Compiler optimizations”) would, given the iteration count and available number of processors described, cause this loop to run serially. The overhead of going parallel would outweigh the benefits.

In order to execute the body of the I loop, A and B must be fetched from memory and encached. Each of the four processors running the J loop will attempt to fetch its portion of the arrays, most likely simultaneously.

This means CPU0 will attempt to read arrays A and B starting at elements (1,1), CPU1 will attempt to start at elements (1,4), and so on. For clarity, Figure 2-11 “V2200 and X2000 interleaving of arrays A and B” shows the first 32 CTIcache lines consecutively; after these, only the initial cache lines for each processor are shown. Each processor's initial cache line is shaded.

Because of the number of memory banks in the V2200 and X2000 architecture, interleaving removes the contention at the beginning of the loop in this example, as shown in Figure 2-11 “V2200 and X2000 interleaving of arrays A and B”.

CPU0 needs A(1:12,1:3) and B(1:12,1:3)

CPU1 needs A(1:12,4:6) and B(1:12,4:6)

CPU2 needs A(1:12,7:9) and B(1:12,7:9)

CPU3 needs A(1:12,10:12) and B(1:12,10:12)
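The column ranges above follow from dividing the J loop into equal contiguous chunks, which can be sketched in C. This is only an illustration of that division, assuming the iteration count is an exact multiple of the processor count; the function names are hypothetical and the compiler's actual scheduling may differ.

```c
/* First and last J iteration (1-based, inclusive) given to processor `cpu`
 * (0-based) when n iterations are split into equal contiguous chunks.
 * Assumes n is an exact multiple of ncpus. */
int j_first(int cpu, int n, int ncpus)
{
    return cpu * (n / ncpus) + 1;
}

int j_last(int cpu, int n, int ncpus)
{
    return (cpu + 1) * (n / ncpus);
}
```

With n = 12 and ncpus = 4, CPU2 receives iterations 7 through 9, matching the list above.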

The data from the V2200/X2000 example above is spread out on different memory banks as described below:

  • A(1,1), the first element of the chunk needed by CPU0, is on cache line 0 in bank 0 on board 0

  • A(1,4), the first element needed by CPU1, is on cache line 9 in bank 1 on board 1

  • A(1,7), the first element needed by CPU2, is on cache line 18 in bank 2 on board 2

  • A(1,10), the first element needed by CPU3, is on cache line 27 in bank 3 on board 3
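The cache line numbers in this list can be reproduced with a short C sketch. It assumes, as the example does, that A starts at cache line 0, and relies on Fortran's column-major storage, 8-byte REAL*8 elements, and 32-byte cache lines; the function names are hypothetical. The bank and board assignment is not modeled.

```c
#define ELEM_BYTES 8L    /* REAL*8 */
#define LINE_BYTES 32L   /* cache line */

/* Byte offset of A(i,j) from the start of A, for an array with `rows` rows
 * stored in column-major (Fortran) order; i and j are 1-based. */
long element_byte_offset(int i, int j, int rows)
{
    return ((long)(j - 1) * rows + (i - 1)) * ELEM_BYTES;
}

/* Index of the cache line holding A(i,j), with A starting at line 0. */
long cache_line_index(int i, int j, int rows)
{
    return element_byte_offset(i, j, rows) / LINE_BYTES;
}
```

Each 12-element column occupies 96 bytes, or exactly three cache lines, so the starting lines for the four processors are nine lines apart.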

Because of interleaving, no contention exists between the processors when trying to read their respective portions of the arrays. Contention may surface occasionally as the processors make their way through the data, but the resulting delays are minimal compared to what could be expected without interleaving.

Figure 2-11 V2200 and X2000 interleaving of arrays A and B

Variable-sized pages

Variable-sized pages are used to reduce Translation Lookaside Buffer (TLB) misses and consequently improve performance. With variable-sized pages, each TLB entry used can map a larger portion of an application's virtual address space. Thus, applications with large reference sets can be mapped using fewer TLB entries, resulting in fewer TLB misses. (A TLB is a hardware entity used to translate a virtual memory reference to a physical page.)
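The benefit can be made concrete with a rough C estimate of how many TLB entries are needed to map a region. This is a back-of-the-envelope sketch that assumes a single uniform page size and a region aligned to that size; the function name is hypothetical, and the size of the hardware TLB is not modeled.

```c
#include <stdint.h>

/* Number of pages, and therefore TLB entries, needed to map a region of
 * region_bytes using pages of page_bytes (rounded up). */
uint64_t tlb_entries_needed(uint64_t region_bytes, uint64_t page_bytes)
{
    return (region_bytes + page_bytes - 1) / page_bytes;
}
```

A 64-Mbyte reference set needs 16384 entries with 4K pages but only 64 entries with 1-Mbyte pages.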

If an application is not experiencing performance degradation due to TLB misses, using a different page size does not help. Also, if an application uses too large a page size, fewer pages will be available to other applications on the system, potentially resulting in increased paging activity and performance degradation.

Valid page sizes on the PA-8000 and PA-8200 processors are 4K, 16K, 64K, 256K, 1 Mbyte, 4 Mbytes, 16 Mbytes, 64 Mbytes, and 256 Mbytes. (The default size, which can be configured, is 4K.) Methods for specifying a page size are described below. However, the user-specified page size is only a request for a specific size; the operating system takes various factors into account when selecting the page size.
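The valid sizes form a simple pattern: each is four times the previous one, from 4K up to 256 Mbytes. A C check based on that observation might look like the following; the function name is hypothetical and the check reflects only the list above.

```c
#include <stdint.h>

/* Returns 1 if `bytes` is one of the page sizes listed in the text
 * (4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M), otherwise 0. */
int is_valid_page_size(uint64_t bytes)
{
    uint64_t size;
    for (size = UINT64_C(4096); size <= (UINT64_C(256) << 20); size *= 4) {
        if (bytes == size)
            return 1;
    }
    return 0;
}
```

Note that a power of two such as 8K is rejected, because it is not 4K multiplied by a power of four.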

The following command options and configurable kernel parameters allow you to specify information regarding page sizes.

  • Options to the chatr utility:

    • +pi: affects the page size for the application's text segment

    • +pd: affects the page size for the application's data segment

  • Configurable kernel parameters:

    • vps_pagesize: represents the default or minimum page size (in kilobytes) if the user has not used chatr to specify a value; the default is 4K

    • vps_ceiling: represents the maximum page size (in kilobytes) if the user has not used chatr to specify a value; the default is 16K

    • vps_chatr_ceiling: places a restriction on the largest value (in kilobytes) a user can specify using chatr; the default is 64 Mbytes

For more information on the chatr utility, see the chatr(1) man page.
