Fortran 90 Compiler for HP-UX: Fortran 90 Programmer's Guide > Chapter 6 Performance and optimization
Parallelizing HP Fortran 90 programs

The following sections discuss how to use the +Oparallel option and the parallel directives when preparing and compiling HP Fortran 90 programs for parallel execution. Later sections also discuss reasons why the compiler may not have performed parallelization. The last section describes runtime warning and error messages unique to parallel-executing programs.

For a description of the +Oparallel option, see “Fine-tuning optimization options”.

The following command lines compile (without linking) three source files: x.f90, y.f90, and z.f90. The files x.f90 and y.f90 are compiled for parallel execution. The file z.f90 is compiled for serial execution, even though its object file will be linked with x.o and y.o.
The following command line links the three object files, producing the executable file para_prog:
As this command line implies, if you compile and link separately, you must link with f90, not ld. The link command line must also include the +Oparallel and +O3 options in order to link in the parallel runtime support.

To ensure the best runtime performance from programs compiled for parallel execution, do not run more than one parallel program at a time on a multiprocessor machine. Running two or more parallel programs simultaneously may result in their sharing the same processors, which will degrade performance. You should run a parallel-executing program at a higher priority than any other user program; see rtprio(1) for information about setting real-time priorities. Running a parallel program on a heavily loaded system may also slow performance.

You can profile a program that has been compiled for parallel execution in much the same way as for non-parallel programs:
The differences are:
The following sections describe conditions that can cause the compiler not to parallelize a loop: calls to routines with side effects, indeterminate iteration counts, and data dependences.

The compiler will not parallelize any loop containing a call to a routine that has side effects. A routine has side effects if it does any of the following:
You can use the DIR$ NO SIDE EFFECTS directive to force the compiler to ignore side effects when determining whether to parallelize the loop. For information about this directive, see .
If the compiler finds that a runtime determination of a loop's iteration count cannot be made before the loop starts to execute, the compiler will not parallelize the loop. The reason for this precaution is that the runtime code must know the iteration count in order to determine how many iterations to distribute to the executing processors. The following conditions can prevent a runtime count:
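To make the distinction concrete, the following sketch contrasts a counted DO loop, whose iteration count follows from its bounds before the first iteration runs, with a loop that leaves through a data-dependent EXIT. The EXIT loop is an assumed illustration of a count that cannot be known in advance; it is not taken from this guide, and the variable names are hypothetical.

```fortran
PROGRAM iteration_count
  IMPLICIT NONE
  INTEGER :: i, n, first, last, step, trips, executed
  REAL :: x

  ! Counted loop: the Fortran iteration count is
  ! MAX((last - first + step) / step, 0), known before entry.
  first = 1
  last = 10
  step = 3
  trips = MAX((last - first + step) / step, 0)

  executed = 0
  DO i = first, last, step
    executed = executed + 1
  END DO
  PRINT *, 'predicted:', trips, ' executed:', executed

  ! Assumed illustration: the exit test depends on a value
  ! computed inside the loop, so the number of iterations is
  ! only known once the loop has finished.
  x = 1.0
  n = 0
  DO
    x = x * 0.5
    n = n + 1
    IF (x < 0.01) EXIT
  END DO
  PRINT *, 'iterations before EXIT:', n
END PROGRAM iteration_count
```

In the first loop the runtime could divide the predicted count among processors before starting; in the second there is nothing to divide until the loop has already run.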
When a loop is parallelized, the iterations are executed independently on different processors, and the order of execution will differ from the serial order when executing on a single processor. This difference is not a problem if the iterations can occur in any order with no effect on the results. Consider the following loop:
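A loop of this kind might look like the following sketch. The array A and the iteration range 1 to 5 follow the discussion; the array B, its values, and the reversed replay into A_REV are assumptions added only so that the program is complete and the order-independence can be checked.

```fortran
PROGRAM independent_loop
  IMPLICIT NONE
  INTEGER :: i
  REAL :: a(5), a_rev(5), b(5)

  b = (/ 1.0, 2.0, 3.0, 4.0, 5.0 /)   ! assumed input data

  ! Each iteration writes only a(i) and reads only b(i), so no
  ! iteration uses a value produced by another iteration.
  DO i = 1, 5
    a(i) = 2.0 * b(i)
  END DO

  ! The same assignments replayed in the order 5-4-3-2-1.
  DO i = 5, 1, -1
    a_rev(i) = 2.0 * b(i)
  END DO

  PRINT *, a
  PRINT *, ALL(a == a_rev)   ! prints T: the order made no difference
END PROGRAM independent_loop
```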
In this example, the array A will always end up with the same data regardless of whether the order of execution is 1-2-3-4-5, 5-4-3-2-1, 3-1-4-5-2, or any other order. The independence of each iteration from the others makes the loop an eligible candidate for parallel execution. Such is not the case in the following:
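A loop with such a dependence can be sketched as follows. The reference to A(I-1) and the iteration range 2 to 5 follow the discussion; the array B, the initial values, and the reversed replay into A_REV are assumptions added to make the effect of reordering visible.

```fortran
PROGRAM dependent_loop
  IMPLICIT NONE
  INTEGER :: i
  REAL :: a(5), a_rev(5), b(5)

  b = 1.0          ! assumed input data
  a = 0.0
  a(1) = 1.0
  a_rev = a        ! identical starting values for the replay

  ! Iteration i reads a(i-1), which iteration i-1 has just written.
  DO i = 2, 5
    a(i) = a(i-1) + b(i)
  END DO

  ! The same statement replayed in the order 5-4-3-2: each
  ! iteration now reads a value that has not yet been updated.
  DO i = 5, 2, -1
    a_rev(i) = a_rev(i-1) + b(i)
  END DO

  PRINT *, a       ! 1.0 2.0 3.0 4.0 5.0
  PRINT *, a_rev   ! 1.0 2.0 1.0 1.0 1.0
END PROGRAM dependent_loop
```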
In this loop, the order of execution does matter. The data used in iteration I is dependent upon the data that was produced in the previous iteration (I-1). The array A would end up with very different data if the order of execution were any other than 2-3-4-5. The data dependence in this loop thus makes it ineligible for parallelization.

Not all data dependences inhibit parallelization. The following paragraphs discuss some of the exceptions.

Some nested loops that operate on matrices may have a data dependence in the inner loop only, allowing the outer loop to be parallelized. Consider the following:
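A nested loop of this shape might be written as in the sketch below. The reference pattern A(J,I) depending on A(J-1,I), with J as the inner index and I as the outer index, follows the discussion; the matrix dimensions and the starting values are assumptions chosen to keep the example small.

```fortran
PROGRAM nested_dependence
  IMPLICIT NONE
  INTEGER, PARAMETER :: nrows = 4, ncols = 3   ! assumed sizes
  INTEGER :: i, j
  REAL :: a(nrows, ncols)

  a = 0.0
  a(1, :) = (/ 10.0, 20.0, 30.0 /)   ! assumed first row

  DO i = 1, ncols        ! outer loop: one column per iteration
    DO j = 2, nrows      ! inner loop: row j needs row j-1
      a(j, i) = a(j-1, i) + 1.0
    END DO
  END DO

  ! Print the matrix row by row.
  DO j = 1, nrows
    PRINT *, a(j, :)
  END DO
END PROGRAM nested_dependence
```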
The data dependence in this nested loop occurs in the inner (J) loop: each row access of A(J,I) depends upon the preceding row (J-1) having been assigned in the previous iteration. If the iterations of the J loop were to execute in any other order than the one in which they would execute on a single processor, the matrix would be assigned different values. The inner loop, therefore, must not be parallelized.

But no such data dependence appears in the outer loop: each column access is independent of every other column access. Consequently, the compiler can safely distribute entire columns of the matrix to execute on different processors; the data assignments will be the same regardless of the order in which the columns are executed, so long as the rows execute in serial order.

When analyzing a loop, the compiler may err on the safe side and assume that what looks like a data dependence really is one and so not parallelize the loop. Consider the following:
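One loop that fits this description is sketched below. The reference A(I-K) and the value 100 for K follow the discussion; the array size, the loop bounds 101 to 200, and the initial values are assumptions. K is assigned directly here to keep the program self-contained, whereas in a real program its value would typically not be visible to the compiler.

```fortran
PROGRAM assumed_dependence
  IMPLICIT NONE
  INTEGER :: i, k
  REAL :: a(200)

  ! Assumed initial data: a(i) = i.
  DO i = 1, 200
    a(i) = REAL(i)
  END DO

  k = 100   ! stands in for a value the compiler cannot see

  ! With k = 100 the loop reads a(1:100) and writes a(101:200),
  ! so no iteration reads an element that another one writes.
  DO i = 101, 200
    a(i) = a(i-k)
  END DO

  PRINT *, a(101), a(200)
END PROGRAM assumed_dependence
```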
The compiler will assume that a data dependence exists in this loop because it appears that data that has been defined in a previous iteration is being used in a later iteration. On this assumption, the compiler will not parallelize the loop. However, if the value of K is 100, the dependence is assumed rather than real because A(I-K) is defined outside the loop. If in fact this is the case, the programmer can insert one of the following directives immediately before the loop, forcing the compiler to ignore any assumed dependences when analyzing the loop for parallelization:
For more information about these directives, see “Compatibility directives”.