A cluster often has both regular Ethernet and some
form of higher-speed interconnect such as InfiniBand. This section
describes how to use the ping_pong_ring.c example program to confirm
that you are able to run using the desired interconnect.

Running a test like this, especially on a new cluster, is
useful to ensure that the appropriate network drivers are installed
and that the network hardware is functioning properly. If any machine
has defective network cards or cables, this test can also be useful
for identifying which machine has the problem.

To compile the program, set the MPI_ROOT environment
variable (not required, but recommended) to a value such as /opt/hpmpi (Linux) or /opt/mpi (HP-UX), then run

    % export MPI_CC=gcc    (or whatever compiler you want)
    % $MPI_ROOT/bin/mpicc -o pp.x \
          $MPI_ROOT/help/ping_pong_ring.c

Although mpicc will search for a compiler to use if you don't
specify MPI_CC, it is preferable to be explicit.

If you have a shared filesystem, it is easiest to put the
resulting pp.x executable there; otherwise you will have to explicitly
copy it to each machine in your cluster.

As discussed elsewhere, there are a variety of supported startup methods,
and you need to know which is appropriate for your cluster. Your
situation should resemble one of the following:

No srun, prun, or CCS job scheduler command is available

For this case you can create an appfile such as the
following:

    -h hostA -np 1 /path/to/pp.x
    -h hostB -np 1 /path/to/pp.x
    -h hostC -np 1 /path/to/pp.x
    ...
    -h hostZ -np 1 /path/to/pp.x

You can specify which remote shell command to use (the Linux default
is ssh) in the MPI_REMSH environment variable. For example, you might want

    % export MPI_REMSH="rsh -x"    (optional)

Then run

    % $MPI_ROOT/bin/mpirun -prot -f appfile
    % $MPI_ROOT/bin/mpirun -prot -f appfile -- 1000000

Or if LSF is being used, then the hostnames in the appfile
wouldn't matter, and the commands to run would be

    % bsub pam -mpi $MPI_ROOT/bin/mpirun -prot -f appfile
    % bsub pam -mpi $MPI_ROOT/bin/mpirun -prot -f appfile \
          -- 1000000

The srun command is available

For this case you would run commands like

    % $MPI_ROOT/bin/mpirun -prot -srun -N 8 -n 8 /path/to/pp.x
    % $MPI_ROOT/bin/mpirun -prot -srun -N 8 -n 8 \
          /path/to/pp.x 1000000

replacing "8" with the number of hosts. Or if LSF is being used, then the commands to run might be

    % bsub -I -n 16 $MPI_ROOT/bin/mpirun -prot -srun \
          /path/to/pp.x
    % bsub -I -n 16 $MPI_ROOT/bin/mpirun -prot -srun \
          /path/to/pp.x 1000000

The prun command is available

This case is basically identical to the srun case, with the obvious change of using prun in place of srun.

In each case above, the first mpirun uses 0 bytes of data per message and is for checking
latency. The second mpirun uses 1000000 bytes per message and is for checking bandwidth.
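For the appfile case, writing one line per host by hand becomes tedious on a larger cluster. The following is a minimal sketch, not part of HP-MPI, that generates the same one-rank-per-host appfile from a list of hostnames. The function name make_appfile and the host names are hypothetical; the "-h host -np 1 program" line format is taken from the appfile shown above.

```python
def make_appfile(hosts, program, ranks_per_host=1):
    """Return appfile text with one '-h HOST -np N PROGRAM' line per host."""
    if ranks_per_host < 1:
        raise ValueError("ranks_per_host must be at least 1")
    lines = [f"-h {host} -np {ranks_per_host} {program}" for host in hosts]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical host names; substitute the machines in your cluster.
    print(make_appfile(["hostA", "hostB", "hostC"], "/path/to/pp.x"), end="")
```

Redirecting the output to a file named appfile gives a file that can be passed to mpirun with the -f option as in the commands above.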
    #include <stdio.h>
    #include <stdlib.h>
    #ifndef _WIN32
    #include <unistd.h>
    #endif
    #include <string.h>
    #include <math.h>
    #include <mpi.h>

    #define NLOOPS 1000
    #define ALIGN 4096

    #define SEND(t) MPI_Send(buf, nbytes, MPI_CHAR, partner, (t), \
                             MPI_COMM_WORLD)
    #define RECV(t) MPI_Recv(buf, nbytes, MPI_CHAR, partner, (t), \
                             MPI_COMM_WORLD, &status)

    #ifdef CHECK
    # define SETBUF() for (j=0; j<nbytes; j++) { \
                          buf[j] = (char) (j + i); \
                      }
    # define CLRBUF() memset(buf, 0, nbytes)
    # define CHKBUF() for (j = 0; j < nbytes; j++) { \
                          if (buf[j] != (char) (j + i)) { \
                              printf("error: buf[%d] = %d, " \
                                     "not %d\n", \
                                     j, buf[j], j + i); \
                              break; \
                          } \
                      }
    #else
    # define SETBUF()
    # define CLRBUF()
    # define CHKBUF()
    #endif

    int
    main(argc, argv)
    int argc;
    char *argv[];
    {
        int i;
    #ifdef CHECK
        int j;
    #endif
        double start, stop;
        int nbytes = 0;
        int rank, size;
        int root;
        int partner;
        MPI_Status status;
        char *buf, *obuf;
        char myhost[MPI_MAX_PROCESSOR_NAME];
        int len;
        char str[1024];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(myhost, &len);

        if (size < 2) {
            if ( ! rank) printf("rping: must have two+ processes\n");
            MPI_Finalize();
            exit(0);
        }
        nbytes = (argc > 1) ? atoi(argv[1]) : 0;
        if (nbytes < 0) nbytes = 0;

    /*
     * Page-align buffers and displace them in the cache to avoid collisions.
     */
        buf = (char *) malloc(nbytes + 524288 + (ALIGN - 1));
        obuf = buf;
        if (buf == 0) {
            MPI_Abort(MPI_COMM_WORLD, MPI_ERR_BUFFER);
            exit(1);
        }
        buf = (char *) ((((unsigned long) buf) + (ALIGN - 1)) & ~(ALIGN - 1));
        if (rank > 0) buf += 524288;
        memset(buf, 0, nbytes);

    /*
     * Ping-pong.
     */
        for (root=0; root<size; root++) {
            if (rank == root) {
                partner = (root + 1) % size;
                sprintf(str, "[%d:%s] ping-pong %d bytes ...\n",
                        root, myhost, nbytes);
    /*
     * warm-up loop
     */
                for (i = 0; i < 5; i++) {
                    SEND(1);
                    RECV(1);
                }
    /*
     * timing loop
     */
                start = MPI_Wtime();
                for (i = 0; i < NLOOPS; i++) {
                    SETBUF();
                    SEND(1000 + i);
                    CLRBUF();
                    RECV(2000 + i);
                    CHKBUF();
                }
                stop = MPI_Wtime();

                sprintf(&str[strlen(str)],
                        "%d bytes: %.2f usec/msg\n", nbytes,
                        (stop - start) / NLOOPS / 2 * 1024 * 1024);
                if (nbytes > 0) {
                    sprintf(&str[strlen(str)],
                            "%d bytes: %.2f MB/sec\n", nbytes,
                            nbytes / (1024. * 1024.) /
                            ((stop - start) / NLOOPS / 2));
                }
                fflush(stdout);
            } else if (rank == (root+1)%size) {
    /*
     * warm-up loop
     */
                partner = root;
                for (i = 0; i < 5; i++) {
                    RECV(1);
                    SEND(1);
                }
                for (i = 0; i < NLOOPS; i++) {
                    CLRBUF();
                    RECV(1000 + i);
                    CHKBUF();
                    SETBUF();
                    SEND(2000 + i);
                }
            }
            MPI_Bcast(str, 1024, MPI_CHAR, root, MPI_COMM_WORLD);
            if (rank == 0) {
                printf("%s", str);
            }
        }
        free(obuf);
        MPI_Finalize();
        exit(0);
    }
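The two figures the program reports follow directly from the timing loop: each of the NLOOPS iterations is one round trip, that is, two messages, so the time per message is elapsed / NLOOPS / 2. The sketch below repeats that arithmetic in Python so the numbers can be checked by hand; the function name ping_pong_stats and the sample inputs are hypothetical. Note that the listing scales seconds by 1024 * 1024 rather than 10^6 when printing "usec/msg", so the printed figure is about 5% larger than the time in true microseconds. The sketch returns both values.

```python
NLOOPS = 1000  # iterations of the timing loop, as in ping_pong_ring.c


def ping_pong_stats(elapsed_sec, nbytes, nloops=NLOOPS):
    """Mirror the arithmetic in ping_pong_ring.c for one timed pair."""
    # Each loop iteration is a send plus a receive, so two messages.
    sec_per_msg = elapsed_sec / nloops / 2
    stats = {
        "usec_true": sec_per_msg * 1e6,
        # The listing multiplies by 1024 * 1024 before printing "usec/msg".
        "usec_as_printed": sec_per_msg * 1024 * 1024,
        "mb_per_sec": None,
    }
    if nbytes > 0:
        # One MB is taken as 1024 * 1024 bytes, as in the listing.
        stats["mb_per_sec"] = nbytes / (1024.0 * 1024.0) / sec_per_msg
    return stats


if __name__ == "__main__":
    # Hypothetical run: 2 seconds for 1000 round trips of 1000000 bytes.
    print(ping_pong_stats(2.0, 1000000))
```

For a 0-byte run the bandwidth entry stays empty, matching the program, which prints the MB/sec line only when nbytes is greater than 0.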
ping_pong_ring.c output

Example output might look like:

    > Host 0 -- ip 192.168.9.10 -- ranks 0
    > Host 1 -- ip 192.168.9.11 -- ranks 1
    > Host 2 -- ip 192.168.9.12 -- ranks 2
    > Host 3 -- ip 192.168.9.13 -- ranks 3
    >
    >  host | 0    1    2    3
    > ======|=====================
    >     0 : SHM  VAPI VAPI VAPI
    >     1 : VAPI SHM  VAPI VAPI
    >     2 : VAPI VAPI SHM  VAPI
    >     3 : VAPI VAPI VAPI SHM
    >
    > [0:hostA] ping-pong 0 bytes ...
    > 0 bytes: 4.57 usec/msg
    > [1:hostB] ping-pong 0 bytes ...
    > 0 bytes: 4.38 usec/msg
    > [2:hostC] ping-pong 0 bytes ...
    > 0 bytes: 4.42 usec/msg
    > [3:hostD] ping-pong 0 bytes ...
    > 0 bytes: 4.42 usec/msg
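Timing lines like those in the output above can be compared across hosts programmatically. The sketch below pairs each "[rank:host] ping-pong" header with the "usec/msg" line that follows it and flags hosts whose figure is well above the median. The function names, the regular expressions, and the 1.5x threshold are assumptions made for illustration and are not part of HP-MPI; the patterns are based only on the sample output shown here.

```python
import re
from statistics import median

HEADER = re.compile(r"\[(\d+):([^\]]+)\] ping-pong (\d+) bytes")
TIMING = re.compile(r"(\d+) bytes: ([\d.]+) usec/msg")


def parse_latencies(text):
    """Return a list of (rank, host, usec_per_msg) from ping_pong_ring output."""
    results = []
    current = None  # (rank, host) of the most recent header line
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            current = (int(header.group(1)), header.group(2))
            continue
        timing = TIMING.search(line)
        if timing and current is not None:
            results.append((current[0], current[1], float(timing.group(2))))
            current = None
    return results


def slow_hosts(latencies, factor=1.5):
    """Return hosts whose latency exceeds factor times the median latency."""
    if not latencies:
        return []
    mid = median(usec for _, _, usec in latencies)
    return [host for _, host, usec in latencies if usec > factor * mid]


if __name__ == "__main__":
    # Hypothetical output in which hostC is noticeably slower.
    sample = (
        "[0:hostA] ping-pong 0 bytes ...\n"
        "0 bytes: 4.57 usec/msg\n"
        "[1:hostB] ping-pong 0 bytes ...\n"
        "0 bytes: 4.38 usec/msg\n"
        "[2:hostC] ping-pong 0 bytes ...\n"
        "0 bytes: 12.90 usec/msg\n"
    )
    print(slow_hosts(parse_latencies(sample)))
```

Each measurement involves the rank named in the header and the next rank in the ring, so a flagged host may be slow itself or may simply be talking to a slow neighbor. A host with a real problem will usually appear in two adjacent measurements.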
The table showing SHM/VAPI is printed because of the "-prot"
option (print protocol) specified in the mpirun command. In general, it could show any of the following
settings:

    VAPI:  InfiniBand
    UDAPL: InfiniBand
    IBV:   InfiniBand
    PSM:   InfiniBand
    MX:    Myrinet MX
    IBAL:  InfiniBand (on Windows only)
    IT:    IT-API on InfiniBand
    GM:    Myrinet GM2
    ELAN:  Quadrics Elan4
    TCP:   TCP/IP
    MPID:  commd
    SHM:   Shared Memory (intra host only)

If the table shows TCP/IP for one or more hosts, it is possible
that the host doesn't have the appropriate network drivers installed.

If one or more hosts show considerably worse performance than
the others, it can often indicate a bad card or cable.

If the run aborts with some kind of error message, it is possible
that HP-MPI determined incorrectly what interconnect was available.
One common way to encounter this problem is to run a 32-bit application
on a 64-bit machine like an Opteron or Intel®64. It is not uncommon for
network vendors for InfiniBand and others to provide only 64-bit libraries
for their network. HP-MPI makes its decision about what interconnect to use before
it even knows the application's bitness. To have proper
network selection in that case, you must specify that the application is 32-bit
when running on Opteron and Intel®64
machines:

    % $MPI_ROOT/bin/mpirun -mpi32 ...
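The check for unexpected TCP/IP entries in the -prot table, described earlier in this section, can also be automated. The sketch below scans the table rows and reports every host pair whose entry is TCP. It is an illustration only: the function name tcp_pairs and the row pattern are assumptions based on the sample table in this section, and it assumes the columns list hosts 0, 1, 2, ... in order, as in that sample.

```python
import re

# A table row looks like "0 : SHM VAPI VAPI VAPI", optionally prefixed by ">".
ROW = re.compile(r"^\s*>?\s*(\d+)\s*:\s*(\S.*)$")


def tcp_pairs(prot_output):
    """Return (row_host, column_host) pairs whose table entry is TCP."""
    pairs = []
    for line in prot_output.splitlines():
        row = ROW.match(line)
        if not row:
            continue  # header, separator, or timing line
        src = int(row.group(1))
        for dst, proto in enumerate(row.group(2).split()):
            if proto == "TCP":
                pairs.append((src, dst))
    return pairs


if __name__ == "__main__":
    # Hypothetical table in which host 2 has fallen back to TCP/IP.
    table = (
        " host | 0    1    2\n"
        "======|===============\n"
        "    0 : SHM  VAPI TCP\n"
        "    1 : VAPI SHM  TCP\n"
        "    2 : TCP  TCP  SHM\n"
    )
    print(tcp_pairs(table))
```

A host that appears in every reported pair, like host 2 in the hypothetical table, is the first place to look for missing network drivers.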