Fortran 90, Fortran 77, C, aC++: Exemplar Programming Guide, Chapter 1: Introduction

Exemplar SMP architectures
Hewlett-Packard offers single-processor systems and symmetric multiprocessor (SMP) systems. The SMP systems, known as Exemplar servers, can be either nonscalable or scalable. The remainder of this section discusses the scalability of SMPs.

Hewlett-Packard's nonscalable SMPs are single-hypernode systems. (For nonscalable SMPs, a hypernode is simply the set of processors and physical memory.) Memory is shared among all the processors, with a bus serving as the interconnect, and access time to this shared memory is uniform from every processor. D-Class servers, for example, are nonscalable SMPs.

HP's scalable Exemplar systems implement parallel processing using scalable parallel processing technology, so a machine can be sized to meet your specific needs. Current configurations range from one to four hypernodes (or nodes), for a total of 4 to 64 processors. Processors communicate with each other, with memory, and with I/O devices over two interconnects: a nonblocking crossbar on each hypernode handles intrahypernode communication, and eight high-speed CTI (Coherent Toroidal Interconnect) rings link the hypernodes for interhypernode communication. The CTI ring design is derived from IEEE Standard 1596-1992, SCI (Scalable Coherent Interface), but the Exemplar implementation sacrifices complete SCI compatibility to provide lower latencies.

Physical memory is also scalable. V2200 and X2000 servers support up to 16 Gbytes of memory. Each process on an HP-UX 11.0 system can access a 16-terabyte (Tbyte) virtual address space.

Scalable parallel processing represents a departure from traditional vector/parallel supercomputers like the Convex C Series. The C Series architecture is used below to illustrate the difference between traditional and Exemplar architectures, but the same differences apply in principle to all vector/parallel machines.
Convex C Series machines contain a limited number (1-8) of custom processors connected by a high-speed crossbar to a large, shared memory. For connecting small numbers of processors such as these to memory, crossbars are cost-effective and fast, allowing all processors to access all memory with equally high speed. Each processor is equipped with one or more vector units that speed loop computations involving arrays by performing array arithmetic on up to 128 elements per vector instruction. Machines containing multiple processors can further reduce time-to-solution by adding parallelism at the process, loop, and task level.

The Exemplar architectures take a different approach. Rather than using vector units to exploit fine-grained parallelism, the processors in an Exemplar server speed scalar processing by using a reduced set of high-speed instructions coupled with pipelining, high-speed instruction and data caches, and a large register set.

Two-dimensional parallelism, which can benefit nested parallel structures, is also possible on multihypernode Exemplar servers. Rather than implementing the first dimension in the vector unit and the second across processors (as in the C Series), Exemplar servers can implement the first level within a hypernode and the second across hypernodes. Single-dimensional parallelism that spans hypernodes can also be implemented.

Because of the potentially large number of processors available on a multihypernode Exemplar server, memory access via a system-wide crossbar is not practical. Instead, low-latency, high-bandwidth memory access is provided by shared memory. In this model, physical memory is distributed among all hypernodes, and the entire virtual address space of a process is accessible by every processor. Processors within a hypernode access hypernode-local memory via the crossbar, regardless of whether the address space resides on one or more hypernodes; memory in another hypernode is accessed via the CTI rings.
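The two-level scheme described above (one level of parallelism within a hypernode, a second across hypernodes) can be pictured as a block partition of a doubly nested loop. The following C sketch is illustrative only. The names `range_t`, `block_range`, and `sum_2d_partitioned` are hypothetical and are not part of any Exemplar interface, and the code runs sequentially. It merely shows which iterations each (hypernode, processor) pair would own, assuming the outer loop is divided among hypernodes and the inner loop among the processors of each hypernode. An Exemplar compiler chooses its own partitioning.

```c
#include <stddef.h>

/* Half-open iteration range [lo, hi). Hypothetical helper type. */
typedef struct {
    size_t lo;
    size_t hi;
} range_t;

/* Split n iterations into `parts` contiguous blocks and return block `part`.
 * The first (n % parts) blocks receive one extra iteration.
 * Precondition: parts > 0 and part < parts. */
range_t block_range(size_t n, size_t parts, size_t part)
{
    size_t base = n / parts;
    size_t extra = n % parts;
    range_t r;

    r.lo = part * base + (part < extra ? part : extra);
    r.hi = r.lo + base + (part < extra ? 1 : 0);
    return r;
}

/* Sum a rows x cols row-major matrix, visiting it in the order a two-level
 * decomposition would assign it: rows are divided among hypernodes,
 * columns among the processors of each hypernode.
 * This is a sequential model of the ownership pattern, not parallel code. */
double sum_2d_partitioned(const double *a, size_t rows, size_t cols,
                          size_t hypernodes, size_t cpus_per_node)
{
    double total = 0.0;

    for (size_t h = 0; h < hypernodes; h++) {            /* second level */
        range_t outer = block_range(rows, hypernodes, h);
        for (size_t p = 0; p < cpus_per_node; p++) {     /* first level */
            range_t inner = block_range(cols, cpus_per_node, p);
            for (size_t i = outer.lo; i < outer.hi; i++)
                for (size_t j = inner.lo; j < inner.hi; j++)
                    total += a[i * cols + j];
        }
    }
    return total;
}
```

With 10 outer iterations and 4 hypernodes, for instance, `block_range(10, 4, 1)` yields the range [3, 6), so the second hypernode would own outer iterations 3, 4, and 5.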
Of course, interhypernode accesses take longer than intrahypernode accesses. However, part of every hypernode's memory is dedicated to act as a CTIcache, which holds copies of recently used data from other hypernodes. The CTIcaches and the processor caches are coherent: when a thread references a data item via its virtual address, the value it receives is the most recently assigned value. By holding frequently referenced data close to the processes that reference it, regardless of the actual memory location of the data, these caches provide excellent data distribution.

Programs that optimize well on traditional vector/parallel machines optimize well on Exemplar systems with little manual intervention. Exemplar compilers automatically exploit opportunities for parallelism and data localization in programs written for shared-memory machines. Chapters 3 through 6 discuss manual optimizations that can yield even more performance from such programs.

While the Exemplar architectures use the same processors found in HP workstations, the following features sharply distinguish the Exemplar servers from clustered workstations:
The subsections below discuss each distinguishing feature in detail.

Each workstation in a cluster has its own private memory; there is no shared memory. Any data shared among processors must therefore be passed over the low-performance network that connects them. While an Exemplar server can support this method of programming, it also offers the many advantages of shared memory, as described in “Exemplar vs. vector/parallel architectures”.

Many workstation operating systems reserve a large amount of memory for system use, restricting user processes to what is left. The HP-UX operating system requires only a small fraction of each processor's memory, leaving the large majority of it for user processes, whether they use shared memory or message passing.

Programs for clustered workstations are compiled using the workstations' compilers. If the cluster contains workstations that require different executables (that is, if it is a heterogeneous cluster), the programmer must generate each executable using the proper compiler. Homogeneous clusters eliminate this requirement, but automatic parallelization is nevertheless unavailable on any type of cluster. The compilers used may generate efficient code for each processor, but any parallelism or process coordination must be explicitly implemented by the programmer via message passing.

Exemplar compilers provide fully automatic parallelism and several new data localization optimizations designed to improve memory usage and aid parallelization. Additionally, directives allow you to further enhance the automatic optimizations performed on your shared-memory program. Exemplar compilers give the highest performance, with little or no programmer intervention, for generic programs that exploit shared memory. Message-passing programs, with their parallelism explicitly coded, also benefit from Exemplar compiler optimizations.
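To make the idea of automatically exploitable parallelism concrete, the sketch below contrasts two plain C loops. Both function names are hypothetical, and the comments describe what a parallelizing compiler can generally conclude from the data dependences, not what any particular Exemplar compiler release is documented to do. The `restrict` qualifier is standard C99 and is used here as an assumption that the two arrays do not overlap.

```c
#include <stddef.h>

/* Independent iterations: element i is computed only from x[i] and y[i].
 * No iteration reads a value written by another, so the iterations can be
 * divided among processors in any order. Loops of this shape are typical
 * candidates for automatic parallelization. */
void scale_add(double *restrict y, const double *restrict x,
               double alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

/* Loop-carried dependence: iteration i reads a[i - 1], which iteration
 * i - 1 has just written. The iterations must run in order, so this loop
 * cannot simply be split among processors as written. */
void running_sum(double *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```

On a shared-memory machine the first loop needs no explicit data movement when it is parallelized, because every thread addresses `x` and `y` in place. On a cluster, the programmer would have to distribute slices of both arrays by message passing before any work could begin.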
To communicate among themselves or access each other's data, processors in a cluster of workstations must communicate over low-performance networks and access distributed memory. Communication can be handled only by passing explicit messages between workstations over the network; because of the distributed memory and the absence of parallelizing compilers, programmers must explicitly code parallelism. Parallel tasks running on clusters, then, must be fairly autonomous to avoid wasting time waiting for data or synchronization instructions to travel over the network.

Clusters are best suited to coarse-grained parallelism, such as that possible at the process level, or to manually parallelizable algorithms that have a large ratio of computation to communication. In these cases, task chunks or processes and their data are parcelled out to underused workstations and run to completion, and the results are sent back to the parent. Fine-grained, loop-level parallelism is difficult to perform efficiently on clusters because of the need for frequent data accesses and synchronization.

Exemplar servers are suitable for both coarse- and fine-grained parallelism. Programs containing potential parallelism, when compiled with Exemplar compilers, automatically exploit the parallelism, spawning threads to run on as many processors as are available and rejoining these threads upon completion. This fine-grained parallelism takes full advantage of the fully coherent memory caches and high-speed interconnects available on an Exemplar system. While message passing is supported and can be used to speed certain applications (refer to Chapter 7, “Message-passing [...]

HP-UX automatically schedules threads within a hypernode to execute on idle and underused processors as necessary. This ensures a balanced machine load and exploits both thread- and process-level parallelism.

Peripheral devices connected to an Exemplar server can be accessed from any processor on the machine. On clustered workstations, peripherals are processor-dependent. Programs running on Exemplar systems, therefore, have access to potentially greater mass storage space.

In terms of configuring hardware, adding processors to a cluster can actually degrade performance because of the low-performance network and private memory. The network can become a bottleneck when parallelism increases to exploit the new processors; to overcome this, coarser granularity can be used, and this can require more private memory than the processors can address. The absolute performance of an Exemplar server, on the other hand, increases as processors are added, unhindered by a traditional network or private-memory limits. Adding peripherals and memory to an Exemplar server can also improve absolute performance, because all processors can access both, whereas memory and peripherals are processor-specific on clusters.
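The spawn-and-rejoin pattern described above for automatically parallelized code can be imitated by hand. The sketch below is a minimal illustration that uses POSIX threads as a stand-in; it is not the code an Exemplar compiler generates. The names `slice_t`, `sum_slice`, `parallel_sum`, and `MAX_THREADS` are hypothetical, and the limit of 64 threads is an assumption chosen to match the largest processor count mentioned in this section. Every thread reads its slice of the shared array in place, so no data is copied or sent between threads.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64  /* assumed upper bound for this sketch */

typedef struct {
    const double *data;  /* shared array, read in place by every thread */
    size_t lo, hi;       /* half-open slice [lo, hi) owned by this thread */
    double partial;      /* partial sum written by the thread */
} slice_t;

/* Thread body: sum one slice of the shared array. */
static void *sum_slice(void *arg)
{
    slice_t *s = arg;
    double acc = 0.0;

    for (size_t i = s->lo; i < s->hi; i++)
        acc += s->data[i];
    s->partial = acc;
    return NULL;
}

/* Fork nthreads threads, let each sum its slice, then join them all and
 * combine the partial sums. Returns 0 on success, or -1 if nthreads is
 * out of range or a thread could not be created. */
int parallel_sum(const double *data, size_t n, size_t nthreads, double *out)
{
    pthread_t tid[MAX_THREADS];
    slice_t slice[MAX_THREADS];
    size_t started = 0;
    int status = 0;
    double total = 0.0;

    if (nthreads == 0 || nthreads > MAX_THREADS)
        return -1;

    for (size_t t = 0; t < nthreads; t++) {               /* spawn */
        slice[t].data = data;
        slice[t].lo = n * t / nthreads;
        slice[t].hi = n * (t + 1) / nthreads;
        slice[t].partial = 0.0;
        if (pthread_create(&tid[t], NULL, sum_slice, &slice[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }

    for (size_t t = 0; t < started; t++) {                /* rejoin */
        pthread_join(tid[t], NULL);
        total += slice[t].partial;
    }

    if (status == 0)
        *out = total;
    return status;
}
```

A caller would pass an array and a thread count, for example `parallel_sum(values, 100, 4, &result)`, and check the return value before using `result`. On some systems the program must be linked with the threads library (for example with a `-lpthread` option).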