HP-MPI User's Guide > Chapter 1 Introduction

MPI concepts

The primary goals of MPI are efficient communication and portability.

Although several message-passing libraries exist on different systems, MPI is popular for the following reasons:

  • Support for full asynchronous communication—Process communication can overlap process computation.

  • Group membership—Processes may be grouped based on context.

  • Synchronization variables that protect process messaging—When sending and receiving messages, synchronization is enforced by source and destination information, message labeling, and context information.

  • Portability—All implementations are based on a published standard that specifies the semantics for usage.

An MPI program consists of a set of processes and a logical communication medium connecting those processes. An MPI process cannot directly access memory in another MPI process. Inter-process communication requires calling MPI routines in both processes. MPI defines a library of routines through which MPI processes communicate.

The MPI library routines provide a set of functions that support

  • Point-to-point communications

  • Collective operations

  • Process groups

  • Communication contexts

  • Process topologies

  • Datatype manipulation.

Although the MPI library contains a large number of routines, you can write many applications using only the six routines listed in Table 1-1 “Six commonly used MPI routines”.

Table 1-1 Six commonly used MPI routines

MPI routine      Description
MPI_Init         Initializes the MPI environment
MPI_Finalize     Terminates the MPI environment
MPI_Comm_rank    Determines the rank of the calling process within a group
MPI_Comm_size    Determines the size of the group
MPI_Send         Sends messages
MPI_Recv         Receives messages

 

You must call MPI_Finalize in your application to conform to the MPI Standard. HP-MPI issues a warning when a process exits without calling MPI_Finalize.

CAUTION: There should be no code before MPI_Init or after MPI_Finalize. Applications that violate this rule are non-portable and may give incorrect results.

As your application grows in complexity, you can introduce other routines from the library. For example, MPI_Bcast is an often-used routine for sending or broadcasting data from one process to other processes in a single operation. Use broadcast transfers to get better performance than with point-to-point transfers. The latter use MPI_Send to send data from each sending process and MPI_Recv to receive it at each receiving process.

The following sections briefly introduce the concepts underlying MPI library routines. For more detailed information, refer to MPI: A Message-Passing Interface Standard.

Point-to-point communication

Point-to-point communication involves sending and receiving messages between two processes. This is the simplest form of data transfer in a message-passing model and is described in Chapter 3, “Point-to-Point Communication” in the MPI 1.0 standard.

The performance of point-to-point communication is measured in terms of total transfer time. The total transfer time is defined as

total_transfer_time = latency + (message_size/bandwidth)

where

latency

Specifies the time between the initiation of the data transfer in the sending process and the arrival of the first byte in the receiving process.

message_size

Specifies the size of the message in Mbytes.

bandwidth

Denotes the reciprocal of the time needed to transfer a byte. Bandwidth is normally expressed in Mbytes per second.

Low latencies and high bandwidths lead to better performance.
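The transfer-time formula can be written as a small helper, shown here as a sketch. The function name and the unit choices for latency (seconds) are assumptions made for illustration; the message size and bandwidth use the Mbyte units given in the definitions.

```c
/* Estimated time, in seconds, to move one message between two processes.
 * latency_s:              startup latency in seconds (assumed unit)
 * size_mbytes:            message size in Mbytes
 * bandwidth_mbytes_per_s: bandwidth in Mbytes per second
 * total_transfer_time is a hypothetical helper, not an MPI routine. */
double total_transfer_time(double latency_s,
                           double size_mbytes,
                           double bandwidth_mbytes_per_s)
{
    return latency_s + size_mbytes / bandwidth_mbytes_per_s;
}
```

With a hypothetical latency of 10 microseconds and a bandwidth of 100 Mbytes per second, a 1-Mbyte message takes about 0.01001 seconds, almost all of it bandwidth-bound, while a 1-byte message takes about 0.00001001 seconds, almost all of it latency.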

Communicators

A communicator is an object that represents a group of processes and their communication medium or context. These processes exchange messages to transfer data. Communicators encapsulate a group of processes such that communication is restricted to processes within that group.

The default communicators provided by MPI are MPI_COMM_WORLD and MPI_COMM_SELF. MPI_COMM_WORLD contains all processes that are running when an application begins execution. Each process is the single member of its own MPI_COMM_SELF communicator.

Communicators that allow processes within a group to exchange data are termed intracommunicators. Communicators that allow processes in two different groups to exchange data are called intercommunicators.

Many MPI applications depend upon knowing the number of processes and the process rank within a given communicator. There are several communication management functions; two of the more widely used are MPI_Comm_size and MPI_Comm_rank. The process rank is a unique number assigned to each member process from the sequence 0 through (size-1), where size is the total number of processes in the communicator.

To determine the number of processes in a communicator, use the following syntax:

MPI_Comm_size (MPI_Comm comm, int *size);

where

comm

Represents the communicator handle

size

Represents the number of processes in the group of comm

To determine the rank of each process in comm, use

MPI_Comm_rank(MPI_Comm comm, int *rank);

where

comm

Represents the communicator handle

rank

Represents an integer between zero and (size - 1)

A communicator is an argument to all communication routines. The C code example “communicator.c” shows the use of MPI_Comm_dup, one of the communicator constructor functions, and MPI_Comm_free, the function that marks a communication object for deallocation.

Sending and receiving messages

There are two methods for sending and receiving data: blocking and nonblocking.

In blocking communications, the sending process does not return until the send buffer is available for reuse.

In nonblocking communications, the sending process returns immediately, and may only have started the message transfer operation, not necessarily completed it. The application cannot safely reuse the message buffer after a nonblocking routine returns until a completion routine such as MPI_Wait indicates that the message transfer has completed.

In nonblocking communications, the following sequence of events occurs:

  1. The sending routine begins the message transfer and returns immediately.

  2. The application does some computation.

  3. The application calls a completion routine (for example, MPI_Test or MPI_Wait) to test or wait for completion of the send operation.

Blocking communication

Blocking communication consists of four send modes and one receive mode.

The four send modes are:

Standard (MPI_Send)

The sending process returns when the system can buffer the message or when the message is received and the buffer is ready for reuse.

Buffered (MPI_Bsend)

The sending process returns when the message is buffered in an application-supplied buffer.

Avoid using the MPI_Bsend mode because it forces an additional copy operation.

Synchronous (MPI_Ssend)

The sending process returns only after a matching receive is posted and the receiving process has started to receive the message.

Ready (MPI_Rsend)

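The message is sent as soon as possible. A ready send is correct only if the matching receive has already been posted when the send starts; otherwise the operation is erroneous.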

You can invoke any mode by using the appropriate routine name and passing the argument list. Arguments are the same for all modes.

For example, to code a standard blocking send, use

MPI_Send (void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm);

where

buf

Specifies the starting address of the buffer.

count

Indicates the number of buffer elements.

dtype

Denotes the datatype of the buffer elements.

dest

Specifies the rank of the destination process in the group associated with the communicator comm.

tag

Denotes the message label.

comm

Designates the communication context that identifies a group of processes.

To code a blocking receive, use

MPI_Recv (void *buf, int count, MPI_Datatype dtype, int source, int tag, MPI_Comm comm, MPI_Status *status);

where

buf

Specifies the starting address of the buffer.

count

Indicates the number of buffer elements.

dtype

Denotes the datatype of the buffer elements.

source

Specifies the rank of the source process in the group associated with the communicator comm.

tag

Denotes the message label.

comm

Designates the communication context that identifies a group of processes.

status

Returns information about the received message. Status information is useful when wildcards are used or the received message is smaller than expected. Status may also contain error codes.

Examples “send_receive.f”, “ping_pong.c”, and “master_worker.f90” all illustrate the use of standard blocking sends and receives.

NOTE: You should not assume message buffering between processes because the MPI standard does not mandate a buffering strategy. HP-MPI does sometimes use buffering for MPI_Send and MPI_Rsend, but it is dependent on message size. Deadlock situations can occur when your code uses standard send operations and assumes buffering behavior for standard communication mode.

Nonblocking communication

MPI provides nonblocking counterparts for each of the four blocking send routines and for the receive routine. Table 1-2 “MPI blocking and nonblocking calls” lists blocking and nonblocking routine calls.

Table 1-2 MPI blocking and nonblocking calls

Blocking mode    Nonblocking mode
MPI_Send         MPI_Isend
MPI_Bsend        MPI_Ibsend
MPI_Ssend        MPI_Issend
MPI_Rsend        MPI_Irsend
MPI_Recv         MPI_Irecv

 

Nonblocking calls have the same arguments, with the same meaning as their blocking counterparts, plus an additional argument for a request.

To code a standard nonblocking send, use

MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm, MPI_Request *req);

where

req

Specifies the request used by a completion routine when called by the application to complete the send operation.

To complete nonblocking sends and receives, you can use MPI_Wait or MPI_Test. The completion of a send indicates that the sending process is free to access the send buffer. The completion of a receive indicates that the receive buffer contains the message, the receiving process is free to access it, and the status object, which returns information about the received message, is set.

Collective operations

Applications may require coordinated operations among multiple processes. For example, all processes need to cooperate to sum sets of numbers distributed among them.

MPI provides a set of collective operations to coordinate operations among processes. These operations are implemented such that all processes call the same operation with the same arguments. Thus, when sending and receiving messages, one collective operation can replace multiple sends and receives, resulting in lower overhead and higher performance.

Collective operations consist of routines for communication, computation, and synchronization. These routines all specify a communicator argument that defines the group of participating processes and the context of the operation.

Collective operations are valid only for intracommunicators. Intercommunicators are not allowed as arguments.

Communication

Collective communication involves the exchange of data among all processes in a group. The communication can be one-to-many, many-to-one, or many-to-many.

The single originating process in the one-to-many routines or the single receiving process in the many-to-one routines is called the root.

Collective communications have three basic patterns:

Broadcast and Scatter

Root sends data to all processes, including itself.

Gather

Root receives data from all processes, including itself.

Allgather and Alltoall

Each process communicates with each process, including itself.

The syntax of the MPI collective functions is designed to be consistent with point-to-point communications, but collective functions are more restrictive than point-to-point functions. Some of the important restrictions to keep in mind are:

  • The amount of data sent must exactly match the amount of data specified by the receiver.

  • Collective functions come in blocking versions only.

  • Collective functions do not use a tag argument, so collective calls are matched strictly according to the order of execution.

  • Collective functions come in standard mode only.

For detailed discussions of collective communications, refer to Chapter 4, “Collective Communication” in the MPI 1.0 standard. The following examples demonstrate the syntax for two collective operations, a broadcast and a scatter.

To code a broadcast, use

MPI_Bcast(void *buf, int count, MPI_Datatype dtype, int root, MPI_Comm comm);

where

buf

Specifies the starting address of the buffer.

count

Indicates the number of buffer entries.

dtype

Denotes the datatype of the buffer entries.

root

Specifies the rank of the root.

comm

Designates the communication context that identifies a group of processes.

For example, “compute_pi.f” uses MPI_BCAST to broadcast one integer from process 0 to every process in MPI_COMM_WORLD.

To code a scatter, use

MPI_Scatter (void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm);

where

sendbuf

Specifies the starting address of the send buffer.

sendcount

Specifies the number of elements sent to each process.

sendtype

Denotes the datatype of the send buffer.

recvbuf

Specifies the address of the receive buffer.

recvcount

Indicates the number of elements in the receive buffer.

recvtype

Indicates the datatype of the receive buffer elements.

root

Denotes the rank of the sending process.

comm

Designates the communication context that identifies a group of processes.

Computation

Computational operations perform global reductions, such as sum, max, min, product, or user-defined functions, across all members of a group. There are a number of global reduction functions:

Reduce

Returns the result of a reduction at one node.

All-reduce

Returns the result of a reduction at all nodes.

Reduce-Scatter

Combines the functionality of reduce and scatter operations.

Scan

Performs a prefix reduction on data distributed across a group.

Section 4.9, “Global Reduction Operations” in the MPI 1.0 standard describes each of these functions in detail.

Reduction operations are binary and are only valid on numeric data. Reductions are always associative but may or may not be commutative.

You can select a reduction operation from a predefined list (refer to section 4.9.2 in the MPI 1.0 standard) or define your own operation. The operations are invoked by placing the operation name, for example MPI_SUM or MPI_PROD, in op as described in the MPI_Reduce syntax below.

To implement a reduction, use

MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype dtype, MPI_Op op, int root, MPI_Comm comm);

where

sendbuf

Specifies the address of the send buffer.

recvbuf

Denotes the address of the receive buffer.

count

Indicates the number of elements in the send buffer.

dtype

Specifies the datatype of the send and receive buffers.

op

Specifies the reduction operation.

root

Indicates the rank of the root process.

comm

Designates the communication context that identifies a group of processes.

For example, “compute_pi.f” uses MPI_REDUCE to sum the elements provided in the input buffer of each process in MPI_COMM_WORLD, using MPI_SUM, and returns the summed value in the output buffer of the root process (in this case, process 0).

Synchronization

Collective routines return as soon as their participation in a communication is complete. However, the return of the calling process does not guarantee that the receiving processes have completed or even started the operation.

To synchronize the execution of processes, call MPI_Barrier. MPI_Barrier blocks the calling process until all processes in the communicator have called it. This is a useful approach for separating two stages of a computation so messages from each stage do not overlap.

To implement a barrier, use

MPI_Barrier(MPI_Comm comm);

where

comm

Identifies a group of processes and a communication context.

For example, “cart.C” uses MPI_Barrier to synchronize data before printing.

MPI datatypes and packing

You can use predefined datatypes (for example, MPI_INT in C) to transfer data between two processes using point-to-point communication. This transfer is based on the assumption that the data transferred is stored in contiguous memory (for example, sending an array in a C or Fortran application).

When you want to transfer data that is not homogeneous, such as a structure, or that is not contiguous in memory, such as an array section, you can use derived datatypes or packing and unpacking functions:

Derived datatypes

Specify a sequence of basic datatypes and integer displacements describing the data layout in memory. You can use user-defined datatypes or predefined datatypes in MPI communication functions.

Packing and Unpacking functions

Provide MPI_Pack and MPI_Unpack functions so that a sending process can pack noncontiguous data into a contiguous buffer and a receiving process can unpack data received in a contiguous buffer and store it in noncontiguous locations.

Using derived datatypes is more efficient than using MPI_Pack and MPI_Unpack. However, derived datatypes cannot handle the case where the data layout varies and is unknown by the receiver, for example, messages that embed their own layout description.

Section 3.12, “Derived Datatypes” in the MPI 1.0 standard describes the construction and use of derived datatypes. The following is a summary of the types of constructor functions available in MPI:

  • Contiguous (MPI_Type_contiguous)—Allows replication of a datatype into contiguous locations.

  • Vector (MPI_Type_vector)—Allows replication of a datatype into locations that consist of equally spaced blocks.

  • Indexed (MPI_Type_indexed)—Allows replication of a datatype into a sequence of blocks where each block can contain a different number of copies and have a different displacement.

  • Structure (MPI_Type_struct)—Allows replication of a datatype into a sequence of blocks such that each block consists of replications of different datatypes, copies, and displacements.

After you create a derived datatype, you must commit it by calling MPI_Type_commit.

HP-MPI optimizes collection and communication of derived datatypes.

Section 3.13, “Pack and unpack” in the MPI 1.0 standard describes the details of the pack and unpack functions for MPI. Used together, these routines allow you to transfer heterogeneous data in a single message, thus amortizing the fixed overhead of sending and receiving a message over the transmittal of many elements.

Refer to Chapter 3, “User-Defined Datatypes and Packing” in MPI: The Complete Reference for a discussion of this topic and examples of construction of derived datatypes from the basic datatypes using the MPI constructor functions.

Multilevel parallelism

By default, processes in an MPI application can only do one task at a time. Such processes are single-threaded processes. This means that each process has an address space together with a single program counter, a set of registers, and a stack.

A process with multiple threads has one address space, but each thread has its own program counter, registers, and stack.

Multilevel parallelism refers to MPI processes that have multiple threads. Processes become multithreaded through calls to multithreaded libraries, parallel directives and pragmas, or automatic compiler parallelization. Refer to “Thread-compliant library” for more information on linking with the thread-compliant library.

Multilevel parallelism is beneficial for problems you can decompose into logical parts for parallel execution; for example, a looping construct that spawns multiple threads to do a computation and joins after the computation is complete.

The program “multi_par.f” is an example of multilevel parallelism.

Advanced topics

This chapter only provides a brief introduction to basic MPI concepts. Advanced MPI topics include:

  • Error handling

  • Process topologies

  • User-defined datatypes

  • Process grouping

  • Communicator attribute caching

  • The MPI profiling interface

To learn more about the basic concepts discussed in this chapter and advanced MPI topics, refer to MPI: The Complete Reference and MPI: A Message-Passing Interface Standard.

© 1979-2007 Hewlett-Packard Development Company, L.P.