Fortran 90 Compiler for HP-UX: Fortran 90 Programmer's Guide > Chapter 6 Performance and optimization

Parallelizing HP Fortran 90 programs

The following sections discuss how to use the +Oparallel option and the parallel directives when preparing and compiling HP Fortran 90 programs for parallel execution. Later sections also discuss reasons why the compiler may not have performed parallelization. The last section describes runtime warning and error messages unique to parallel-executing programs.

For a description of the +Oparallel option, see “Fine-tuning optimization options”.

Compiling for parallel execution

The following command lines compile (without linking) three source files: x.f90, y.f90, and z.f90. The files x.f90 and y.f90 are compiled for parallel execution. The file z.f90 is compiled for serial execution, even though its object file will be linked with x.o and y.o.

f90 +O3 +Oparallel -c x.f90 y.f90
f90 +O3 -c z.f90

The following command line links the three object files, producing the executable file para_prog:

f90 +O3 +Oparallel -o para_prog x.o y.o z.o

As this command line implies, if you link and compile separately, you must use f90, not ld. The command line to link must also include the +Oparallel and +O3 options in order to link in the parallel runtime support.

Performance and parallelization

For the best runtime performance from a program compiled for parallel execution, do not run more than one parallel program on a multiprocessor machine at the same time. Two or more parallel programs running simultaneously may end up sharing the same processors, which degrades the performance of each. You should also run a parallel-executing program at a higher priority than any other user program; see rtprio(1) for information about setting real-time priorities.

Running a parallel program on a heavily loaded system may also slow performance.

Profiling parallelized programs

You can profile a program that has been compiled for parallel execution in much the same way as for non-parallel programs:

  1. Compile the program with the +gprof option.

  2. Run the program to produce profiling data.

  3. Run gprof against the program.

  4. View the output from gprof.

The differences are:

  • Step 2 produces a gmon.out file with the CPU times for all executing threads.

  • In Step 4, the flat profile that you view uses the following notation to denote DO loops that were parallelized:

    routine_name##pr_line_nnnn

    where routine_name is the name of the routine containing the loop, pr (parallel region) indicates that the loop was parallelized, and nnnn is the line number of the start of the loop.

Conditions inhibiting loop parallelization

The following sections describe conditions that can cause the compiler not to parallelize a loop: calls to routines with side effects, indeterminate iteration counts, and data dependences.

Calling routines with side effects

The compiler will not parallelize any loop containing a call to a routine that has side effects. A routine has side effects if it does any of the following:

  • Modifies its arguments

  • Modifies a global variable, a common-block variable, or a SAVE variable

  • Redefines variables that are local to the calling routine

  • Performs I/O

  • Calls another subroutine or function that does any of the above

You can use the DIR$ NO SIDE EFFECTS directive to force the compiler to ignore side effects when determining whether to parallelize the loop. For information about this directive, see .

NOTE: A subroutine (but not a function) is always expected to have side effects. If you apply this directive to a subroutine call, the optimizer assumes that the call has no effect on program results and can eliminate the call to improve performance.
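As an illustration of the distinction, the following sketch contrasts a function that has no side effects with one that modifies a module variable. The names counters, twice, twice_counted, and ncalls are invented for this example and do not come from the guide; whether the compiler actually parallelizes the first loop also depends on the other conditions described in this section.

```fortran
MODULE counters
  IMPLICIT NONE
  INTEGER :: ncalls = 0        ! module variable, visible to every caller
CONTAINS
  ! No side effects: reads its argument and returns a value, nothing else.
  FUNCTION twice(x) RESULT(y)
    REAL, INTENT(IN) :: x
    REAL :: y
    y = 2.0 * x
  END FUNCTION twice

  ! Side effect: updates the module variable ncalls on every call.
  FUNCTION twice_counted(x) RESULT(y)
    REAL, INTENT(IN) :: x
    REAL :: y
    ncalls = ncalls + 1
    y = 2.0 * x
  END FUNCTION twice_counted
END MODULE counters

PROGRAM side_effects
  USE counters
  IMPLICIT NONE
  INTEGER :: i
  REAL :: a(100), b(100)

  b = 1.0

  ! The called function has no side effects, so the call does not
  ! by itself rule out parallelization of this loop.
  DO i = 1, 100
    a(i) = twice(b(i))
  END DO

  ! The called function modifies ncalls, so this loop falls under
  ! the side-effects rule described above.
  DO i = 1, 100
    a(i) = twice_counted(b(i))
  END DO

  PRINT *, a(100), ncalls
END PROGRAM side_effects
```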

Indeterminate iteration counts

If the compiler finds that a runtime determination of a loop's iteration count cannot be made before the loop starts to execute, the compiler will not parallelize the loop. The reason for this precaution is that the runtime code must know the iteration count in order to determine how many iterations to distribute to the executing processors.

The following conditions can prevent a runtime count:

  • The loop is a DO-forever construct.

  • An EXIT statement appears in the loop.

  • The loop contains a conditional GO TO statement that exits from the loop.

  • The loop modifies either the loop-control or loop-limit variable.

  • The loop is a DO WHILE construct and the condition being tested is defined within the loop.
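The sketch below contrasts a loop whose iteration count is known on entry with one whose count depends on the data because of an EXIT statement. The program and variable names (iteration_counts, first_neg) are made up for illustration.

```fortran
PROGRAM iteration_counts
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1000
  INTEGER :: i, first_neg
  REAL :: a(n), b(n)

  ! Fill b with values that change sign partway through the array.
  DO i = 1, n
    b(i) = REAL(n/2 - i)
  END DO

  ! Countable: the loop runs exactly n times, and that count is
  ! known before the first iteration starts.
  DO i = 1, n
    a(i) = 2.0 * b(i)
  END DO

  ! Not countable in advance: the EXIT makes the number of
  ! iterations depend on the contents of b.
  first_neg = 0
  DO i = 1, n
    IF (b(i) < 0.0) THEN
      first_neg = i
      EXIT
    END IF
  END DO

  PRINT *, a(1), first_neg
END PROGRAM iteration_counts
```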

Data dependences

When a loop is parallelized, the iterations are executed independently on different processors, and the order of execution will differ from the serial order when executing on a single processor. This difference is not a problem if the iterations can occur in any order with no effect on the results. Consider the following loop:

   DO I = 1, 5
     A(I) = A(I) * B(I)
   END DO

In this example, the array A will always end up with the same data regardless of whether the order of execution is 1-2-3-4-5, 5-4-3-2-1, 3-1-4-5-2, or any other order. The independence of each iteration from the others makes the loop an eligible candidate for parallel execution.

Such is not the case in the following:

   DO I = 2, 5
     A(I) = A(I-1) * B(I)
   END DO

In this loop, the order of execution does matter. The data used in iteration I is dependent upon the data that was produced in the previous iteration (I-1). The array A would end up with very different data if the order of execution were any other than 2-3-4-5. The data dependence in this loop thus makes it ineligible for parallelization.
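One way to see the difference is to run both loops serially in forward and in reverse order and compare the results. The following is a minimal sketch; the array contents and the names a_fwd and a_rev are chosen only for this illustration. For the independent loop the two orders produce identical arrays, while for the dependent loop they do not.

```fortran
PROGRAM order_matters
  IMPLICIT NONE
  INTEGER :: i
  REAL :: b(5), a_fwd(5), a_rev(5)

  b = (/ 2.0, 3.0, 4.0, 5.0, 6.0 /)

  ! Independent iterations: each A(I) uses only its own element.
  a_fwd = 1.0
  a_rev = 1.0
  DO i = 1, 5
    a_fwd(i) = a_fwd(i) * b(i)
  END DO
  DO i = 5, 1, -1
    a_rev(i) = a_rev(i) * b(i)
  END DO
  PRINT *, 'independent loop, same result: ', ALL(a_fwd == a_rev)

  ! Dependent iterations: A(I) uses A(I-1) from the previous iteration.
  a_fwd = 1.0
  a_rev = 1.0
  DO i = 2, 5
    a_fwd(i) = a_fwd(i-1) * b(i)
  END DO
  DO i = 5, 2, -1
    a_rev(i) = a_rev(i-1) * b(i)
  END DO
  PRINT *, 'dependent loop, same result:   ', ALL(a_fwd == a_rev)
END PROGRAM order_matters
```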

Not all data dependences inhibit parallelization. The following paragraphs discuss some of the exceptions.

Nested loops and matrices

Some nested loops that operate on matrices may have a data dependence in the inner loop only, allowing the outer loop to be parallelized. Consider the following:

   DO I = 1, 10
     DO J = 2, 100
       A(J,I) = A(J-1,I) + 1
     END DO
   END DO

The data dependence in this nested loop occurs in the inner (J) loop: each row access of A(J,I) depends upon the preceding row (J-1) having been assigned in the previous iteration. If the iterations of the J loop were to execute in any other order than the one in which they would execute on a single processor, the matrix would be assigned different values. The inner loop, therefore, must not be parallelized.

But no such data dependence appears in the outer loop: each column access is independent of every other column access. Consequently, the compiler can safely distribute entire columns of the matrix to execute on different processors; the data assignments will be the same regardless of the order in which the columns are executed, so long as the rows execute in serial order.
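The independence of the columns can be checked with a small serial experiment: process the columns in forward order and then in reverse order, keeping the inner loop in its serial order each time, and compare the results. This is only a sketch; the second array c and the program name are assumptions made for the comparison.

```fortran
PROGRAM column_order
  IMPLICIT NONE
  INTEGER :: i, j
  REAL :: a(100,10), c(100,10)

  a = 0.0
  c = 0.0

  ! Columns processed in forward order.
  DO i = 1, 10
    DO j = 2, 100
      a(j,i) = a(j-1,i) + 1
    END DO
  END DO

  ! Columns processed in reverse order; the inner (row) loop still
  ! runs serially from 2 to 100 within each column.
  DO i = 10, 1, -1
    DO j = 2, 100
      c(j,i) = c(j-1,i) + 1
    END DO
  END DO

  ! The column order does not affect the values assigned.
  PRINT *, 'same result: ', ALL(a == c)
END PROGRAM column_order
```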

Assumed dependences

When analyzing a loop, the compiler may err on the safe side and assume that what looks like a data dependence really is one and so not parallelize the loop. Consider the following:

   DO I = 101, 200
     A(I) = A(I-K)
   END DO

The compiler will assume that a data dependence exists in this loop because it appears that data that has been defined in a previous iteration is being used in a later iteration. On this assumption, the compiler will not parallelize the loop.

However, if the value of K is 100, the dependence is assumed rather than real: the loop then reads only A(1) through A(100) and assigns only A(101) through A(200), so every value it uses was defined before the loop started. If this is in fact the case, the programmer can insert one of the following directives immediately before the loop, forcing the compiler to ignore any assumed dependences when analyzing the loop for parallelization:

  • DIR$ IVDEP

  • FPP$ NODEPCHK

  • VD$ NODEPCHK

For more information about these directives, see “Compatibility directives”.
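The following sketch shows where such a directive would sit relative to the loop. The !DIR$ spelling of the directive line in free source form is an assumption here, as are the program name and the initial values; check “Compatibility directives” for the exact form. A compiler that does not recognize the directive treats the line as an ordinary comment.

```fortran
PROGRAM assumed_dependence
  IMPLICIT NONE
  INTEGER :: i, k
  REAL :: a(200)

  DO i = 1, 200
    a(i) = REAL(i)
  END DO

  ! In real code K would typically be known only at run time, which
  ! is why the compiler cannot rule out a dependence by itself.
  k = 100

  ! Assumed directive spelling: asserts that the loop below carries
  ! no real dependence. It is safe only because K is 100, so the
  ! loop reads A(1:100) and writes A(101:200).
!DIR$ IVDEP
  DO i = 101, 200
    a(i) = a(i-k)
  END DO

  PRINT *, a(101), a(200)
END PROGRAM assumed_dependence
```

Using the directive when K is smaller than 100 would be an error on the programmer's part, because the dependence would then be real.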

© Hewlett-Packard Development Company, L.P.