Parallel Programming Guide for HP-UX Systems

HP Part Number: B3909-90015

Edition: Sixth Edition

Published: September 2003


Table of Contents

Preface
Scope
Notational conventions
Command syntax
Associated documents
1 Introduction
HP SMP architectures
Bus-based systems
Hyperplane Interconnect systems
Parallel programming model
The shared-memory paradigm
The message-passing paradigm
Overview of HP optimizations
Basic scalar optimizations
Advanced scalar optimizations
Parallelization
2 Architecture overview
System architectures
Data caches
Cache thrashing
Memory Systems
Physical memory
Virtual memory
Interleaving
Variable-sized pages on HP-UX
3 Optimization levels
HP optimization levels and features
Cumulative Options
Using the Optimizer
General guidelines
4 Standard optimization features
Machine instruction level optimizations (+O0)
Constant folding
Partial evaluation of test conditions
Simple register assignment
Data alignment on natural boundaries
Block level optimizations (+O1)
Branch optimization
Dead code elimination
Faster register allocation
Instruction scheduling
Peephole optimizations
Routine level optimizations (+O2)
Advanced constant folding and propagation
Common subexpression elimination
Global register allocation (GRA)
Loop-invariant code motion
Loop unrolling
Register reassociation
Software pipelining
Strength reduction of induction variables and constants
Store and copy optimization
Unused definition elimination
5 Loop and cross-module optimization features
Strip mining
Inlining within a single source file
Cloning within a single source file
Data localization
Conditions that inhibit data localization
Loop blocking
Data reuse
Loop distribution
Loop fusion
Loop interchange
Loop unroll and jam
Preventing loop reordering
Test promotion
Cross-module cloning
Global and static variable optimizations
Inlining across multiple source files
6 Parallel optimization features
Levels of parallelism
Loop-level parallelism
Threads
Loop transformations
Idle thread states
Determining idle thread states
Parallel optimizations
Dynamic selection
Inhibiting parallelization
Loop-carried dependences (LCDs)
Reductions
Preventing parallelization
Parallelism in the aC++ compiler
Cloning across multiple source files
7 Controlling optimization
Command-line optimization options
Invoking command-line options
+O[no]aggressive
+O[no]all
+O[no]autopar
+O[no]conservative
+O[no]dataprefetch
+O[no]dynsel
+O[no]entrysched
+O[no]fail_safe
+O[no]fastaccess
+O[no]fltacc
+O[no]global_ptrs_unique[=namelist]
+O[no]info
+O[no]initcheck
+O[no]inline[=namelist]
+Oinline_budget=n
+O[no]libcalls
+O[no]limit
+O[no]loop_block
+O[no]loop_transform
+O[no]loop_unroll[=unroll factor]
+O[no]loop_unroll_jam
+O[no]moveflops
+O[no]multiprocessor
+O[no]parallel
+O[no]parmsoverlap
+O[no]pipeline
+O[no]procelim
+O[no]ptrs_ansi
+O[no]ptrs_strongly_typed
+O[no]ptrs_to_globals[=namelist]
+O[no]regreassoc
+O[no]report[=report_type]
+O[no]sharedgra
+O[no]signedpointers
+O[no]size
+O[no]static_prediction
+O[no]vectorize
+O[no]volatile
+O[no]whole_program_mode
+tm target
C aliasing options
Optimization directives and pragmas
Rules for usage
block_loop[(block_factor=n)]
dynsel[(trip_count=n)]
no_block_loop
no_distribute
no_dynsel
no_loop_dependence(namelist)
no_loop_transform
no_parallel
no_side_effects(funclist)
unroll_and_jam[(unroll_factor=n)]
8 Optimization Report
Optimization Report contents
Loop Report
Supplemental tables
9 Parallel programming techniques
Parallelizing directives and pragmas
Parallelizing loops
prefer_parallel
loop_parallel
prefer_parallel, loop_parallel attributes
Combining the attributes
Comparing prefer_parallel, loop_parallel
Stride-based parallelism
critical_section, end_critical_section
Disabling automatic loop thread-parallelization
Parallelizing tasks
Parallelizing regions
Reentrant compilation
Setting thread default stack size
Modifying thread stack size
Collecting parallel information
Number of processors
Number of threads
Thread ID
Stack memory type
10 OpenMP Parallel Programming Model
What is OpenMP?
HP’s implementation of OpenMP
Command-line option
Optimization levels and parallelism
Arrays
Portable timing routines
Nested lock routines
Additional features
New library
Implementation-defined behavior
From HP Programming Model to OpenMP
Syntax
HP Programming Model directives
More information on OpenMP
11 Data privatization
Directives and pragmas for data privatization
Privatizing loop variables
loop_private
save_last[(list)]
Privatizing task variables
task_private
Privatizing region variables
parallel_private
12 Memory classes
Porting multinode applications to single-node servers
Private versus shared memory
thread_private
node_private
Memory class assignments
C and C++ data objects
Static assignments
13 Parallel synchronization
Thread-parallelism
Thread ID assignments
Synchronization tools
Using gates and barriers
Synchronization functions
sync_routine
loop_parallel(ordered)
Critical sections
Ordered sections
Synchronizing code
Using critical sections
Using ordered sections
Manual synchronization
14 Troubleshooting
Aliasing
ANSI algorithm
Type-safe algorithm
Specifying aliasing modes
Iteration and stop values
Global variables
False cache line sharing
Aligning data to avoid false sharing
Distributing iterations on cache line boundaries
Thread-specific array elements
Scalars sharing a cache line
Working with unaligned arrays
Working with dependences
Floating-point imprecision
Enabling sudden underflow
Invalid subscripts
Misused directives and pragmas
Loop-carried dependences
Reductions
Nondeterminism of parallel execution
Triangular loops
Parallelizing the outer loop
Parallelizing the inner loop
Examining the code
Compiler assumptions
Incrementing by zero
Trip counts that may overflow
A Porting CPSlib functions to pthreads
Introduction
Accessing pthreads
Mapping CPSlib functions to pthreads
Environment variables
Using pthreads
Symmetric parallelism
Asymmetric parallelism
Synchronization using high-level functions
Synchronization using low-level functions
Glossary
Index

List of Tables

Notational conventions (table title not available)
3-1 Locations of HP compilers
3-2 Optimization levels and features
5-1 Loop transformations affecting data localization
5-2 Form of no_loop_dependence directive and pragma
5-3 Computation sequence of A(I,J): original loop
5-4 Computation sequence of A(I,J): interchanged loop
5-5 Forms of block_loop, no_block_loop directives and pragmas
5-6 Form of no_distribute directive and pragma
5-7 Forms of unroll_and_jam, no_unroll_and_jam directives and pragmas
5-8 Form of no_loop_transform directive and pragma
6-1 Form of MP_IDLE_THREADS_WAIT environment variable
6-2 Form of dynsel directive and pragma
6-3 Form of reduction directive and pragma
6-4 Form of no_parallel directive and pragma
7-1 Command-line optimization options
7-2 +O[no]fltacc and floating-point optimizations
7-3 Optimization Report contents
7-4 +tm target and +DA/+DS
7-5 Directive-based optimization options
7-6 Form of optimization directives and pragmas
8-1 Optimization Report contents
8-2 Loop Report column definitions
8-3 Reordering transformation values in the Loop Report
8-4 Optimizing/special transformations values in the Loop Report
8-5 Analysis Table column definitions
8-6 Privatization Table column definitions
8-7 Variable Name Footnote Table column definitions
9-1 Parallel directives and pragmas
9-2 Forms of prefer_parallel and loop_parallel directives and pragmas
9-3 Attributes for loop_parallel, prefer_parallel
9-4 Comparison of loop_parallel and prefer_parallel
9-5 Iteration distribution using chunk_size = 1
9-6 Iteration distribution using chunk_size = 5
9-7 Forms of critical_section/end_critical_section directives and pragmas
9-8 Forms of task parallelization directives and pragmas
9-9 Attributes for task parallelization
9-10 Forms of region parallelization directives and pragmas
9-11 Attributes for region parallelization
9-12 Forms of CPS_STACK_SIZE environment variable
9-13 Number of processors functions
9-14 Number of threads functions
9-15 Thread ID functions
9-16 Stack memory type functions
10-1 Parallel and work-shared directives
10-2 OpenMP and HPPM Directives/Clauses
11-1 Data Privatization Directives and Pragmas
11-2 Form of loop_private directive and pragma
11-3 Form of save_last directive and pragma
11-4 Form of task_private directive and pragma
11-5 Form of parallel_private directive and pragma
12-1 Form of memory class directives and variable declarations
13-1 Forms of gate and barriers variable declarations
13-2 Forms of allocation functions
13-3 Forms of deallocation functions
13-4 Forms of locking functions
13-5 Form of unlocking functions
13-6 Form of wait functions
13-7 Form of sync_routine directive and pragma
13-8 Forms of critical_section, end_critical_section directives and pragmas
13-9 Forms of ordered_section, end_ordered_section directives and pragmas
14-1 Initial mapping of array to cache lines
14-2 Default distribution of the I loop
A-1 CPSlib library functions to pthreads mapping
A-2 CPSlib environment variables

List of Examples

2-1 Cache thrashing
2-2 Cache padding
2-3 Interleaving
4-1 Data alignment on natural boundaries
4-2 Conditional/unconditional branches
4-3 Advanced constant folding and propagation
4-4 Common subexpression elimination
4-5 Loop-invariant code motion
4-6 Loop unrolling
4-7 Loop unrolling
4-8 Register allocation
4-9 Software pipelining
4-10 Strength reduction of induction variables and constants
4-11 Unused definition elimination
5-1 Strip mining
5-2 Inlining within single source file
5-3 Loop-carried dependences
5-4 Avoid loop interchange
5-5 Aliasing
5-6 Aliasing
5-7 I/O statements
5-8 Multiple loop entries or exits
5-9 Simple loop blocking
5-10 Matrix multiply blocking
5-11 Loop blocking
5-12 Loop distribution
5-13 Loop fusion
5-14 Loop fusion
5-15 Loop peeling
5-16 Loop interchange
5-17 Unroll and jam
5-18 Test promotion
6-1 Loop-level parallelism
6-2 Loop transformations
6-3 Parallel-inhibiting LCDs
6-4 Parallel-inhibiting LCDs
6-5 Output LCDs
6-6 Apparent LCDs
6-7 Reduction
6-8 no_parallel
7-1 Data type interaction
7-2 Unsafe type cast
8-1 Optimization Report
8-2 Optimization Report
8-3 Optimization Report
9-1 prefer_parallel, loop_parallel
9-2 prefer_parallel, loop_parallel
9-3 critical_section
9-4 Parallelizing tasks
9-5 Parallelizing tasks
9-6 Region parallelization
11-1 loop_private
11-2 Using loop_private with loop_parallel
11-3 Denoting induction variables in parallel loops
11-4 Denoting induction variables in parallel loops
11-5 Secondary induction variables
11-6 Secondary induction variables
11-7 save_last
11-8 save_last
11-9 task_private
11-10 parallel_private
12-1 thread_private
12-2 thread_private
12-3 thread_private COMMON blocks in parallel subroutines
12-4 node_private
13-1 sync_routine
13-2 sync_routine
13-3 loop_parallel, ordered
13-4 critical_section
13-5 Gated critical sections
13-6 Ordered sections
13-7 Ordered section limitations
13-8 Ordered section limitations
13-9 Critical sections and gates
13-10 Conditionally lock critical sections
© Hewlett-Packard Development Company, L.P.