HP PEX Implementation and Programming Supplement: HP9000 Series 700 Color Workstations > Chapter 5 Performance Hints

Performance Analysis Tools

Several HP-UX tools are available to help you determine how the system resources are being used.

/bin/time is a UNIX command that runs a program and reports how much time was spent in user code and how much was spent in the system.

For example, running a demo program from the PEXlib Programming Manual produced these timing results:

$ /bin/time night-time
real 15.2
user 11.4
sys 0.4

The first time is the time elapsed while running the program. The user time shows how much time was spent in the night-time code, plus how much time was spent in all of the libraries linked with the code (including libPEX5.sl). The sys time shows how much time was spent in the HP-UX kernel.
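
The same three figures can also be collected around a single region of code from inside a program. The following is a minimal sketch, assuming a POSIX system that provides times(2) and sysconf(3). The names Timing, time_region, and spin are hypothetical, and spin is only a stand-in for the code being measured.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>

/* Elapsed, user, and system time in seconds, like the three /bin/time lines. */
typedef struct {
    double real;
    double user;
    double sys;
} Timing;

/* Run work(arg) once and measure it with times(2). */
Timing time_region(void (*work)(void *), void *arg)
{
    struct tms before, after;
    clock_t start, stop;
    long ticks = sysconf(_SC_CLK_TCK);   /* clock ticks per second */
    Timing t;

    assert(work != NULL && ticks > 0);
    start = times(&before);
    work(arg);
    stop = times(&after);

    t.real = (double)(stop - start) / ticks;
    t.user = (double)(after.tms_utime - before.tms_utime) / ticks;
    t.sys  = (double)(after.tms_stime - before.tms_stime) / ticks;
    return t;
}

/* Hypothetical stand-in for the code being measured. */
void spin(void *arg)
{
    volatile unsigned long sum = 0;
    unsigned long i, n = *(unsigned long *)arg;
    for (i = 0; i < n; i++)
        sum += i;
}

/* Print the result in the same layout as /bin/time. */
void print_timing(const Timing *t)
{
    printf("real %.1f\nuser %.1f\nsys %.1f\n", t->real, t->user, t->sys);
}
```

As with /bin/time, a real time that is much larger than user plus sys suggests that the process spent time waiting, for example on I/O or on another process.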

GlancePlus is an interactive performance diagnostic tool for HP-UX systems. It provides general data on system resources and active processes. You can use Glance to view information about the current use of system resources and about active processes. Specific data is available about the current CPU, memory, disk I/O, LAN, NFS, and swap usage. Glance provides both global system information and specific process information.

HP-PerfRX is a tool that continually logs global performance data about processes running on your system and prints tables and graphs showing global system resource usage. It provides summaries of system usage metrics over time. These tools would probably not be useful for initial performance tuning. However, the information might provide insight into how different processes are affecting overall application performance on a particular system.

If you order a new system with instant ignition, you will automatically get trial copies of Glance and HP-PerfRX. For ordering information, or access to trial copies of Glance, GlancePlus, or HP-PerfRX, call 1-800-237-3990 in North America.

HP/PAK, the HP Performance Analysis Kit, consists of three tools that help you analyze the performance of your applications. Each of the tools examines program performance at a different level of detail. XPS looks at the relative use of system resources by all processes at the system level. DPAT is an interactive tool that looks at the performance of a process at the procedure level. HPC looks at the performance of compute-bound procedures at the statement/instruction level. In HP-UX 10.0, HP/PAK is bundled with compilers.

Profiling Your Code

Profiling tools are available on HP-UX. Execution profiles provide information about where your application spends most of its time to help you to identify performance bottlenecks.

The gprof(1) command gives you raw data about the amount of time spent in each of your application's procedures. gprof requires recompilation of your application. It also provides cumulative timing information about the execution time of each procedure and the subroutines it calls. Shared libraries cannot be profiled, so unless you follow the instructions in the Profiled PEXlib section below, gprof reports only on the procedures in your application and provides no information about the graphics calls it makes.

If you have purchased SoftBench, you also have access to the SoftBench Performance Analyzer. SoftBench provides user-friendly access to profiling information similar to the information produced by gprof, as well as some other features. More details on SoftBench performance profiling can be found in the SoftBench User's Guide.

Profiled PEXlib

If you have HP-UX 9.07 or 10.10, and have installed HP-PEX 5.1, Version 3.0 or later, a profiled archive library containing the highest level of PEXlib calls is shipped in <profile>[11]. To get additional information about time spent in PEXlib calls, add libPEX5_prof.a to your link line, as shown in the example below.

cc -DHPPEX_PROCEDURES program.c -I/usr/include/X11R6 \
-L/usr/lib/X11R6 \
-Wl,-L/usr/contrib/PEX5/lib \
-lPEX5_prof -lPEX5 -lXext -lX11 -lm

This profiled library is only useful for performance tuning. It should not be used to build an actual product, since it is not a supported part of the HP-PEXlib product.

Other Tools

Many of the techniques described in this document are somewhat invasive. In other words, you need to be able to modify and recompile code to perform some experiments. You may also be able to get access to less invasive tools by contacting your sales representative. Included in this set are tools that extract graphics calls from your application, producing compilable code. By using that extracted code, you can effectively duplicate the interactions of your library with the graphics library, without rebuilding your entire application.

There are several reasons to do this. First, you can easily find out what percentage of time is spent in your application outside of the graphics library. To do this, run your application benchmark and time the results. Next, run your application benchmark and extract graphics calls. Create a compilable program of just graphics calls, run it, and time the results. By comparing the two timings you can see exactly how much time is spent in application overhead compared to graphics calls.
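
The comparison of the two timings reduces to simple arithmetic. The sketch below is illustrative only; the function name app_overhead_fraction is hypothetical, and the inputs are whatever timings you measured for the full benchmark and for the extracted graphics-only program.

```c
#include <assert.h>

/* Fraction of benchmark time spent outside the graphics calls.
   full_secs:     timing of the complete application benchmark.
   graphics_secs: timing of the extracted graphics-only program. */
double app_overhead_fraction(double full_secs, double graphics_secs)
{
    assert(full_secs > 0.0 && graphics_secs >= 0.0);
    if (graphics_secs >= full_secs)
        return 0.0;   /* measurement noise: treat as no overhead */
    return (full_secs - graphics_secs) / full_secs;
}
```

For instance, a full run of 20 seconds against a graphics-only run of 5 seconds gives a fraction of 0.75, which would point to the application rather than the graphics library as the place to tune first.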

Another reason for extracting calls from your program is to see how efficiently graphics calls are being used. For example, you can look for redundant attribute setting calls. You can also see the effects of different sequences of calls on graphics performance. For example, polylines might be rendered very fast when preceded by one sequence of attribute calls, but not rendered according to published specifications when preceded by another set of attribute setting calls. Extracting the calls gives you an easy way to view the sequence of graphics operations as performed by your application.

Interpreting Published Performance Data

One way to determine if your application's graphics performance is reaching acceptable levels is to compare your benchmark results with the published figures from the Product Data Sheet for your system.

The performance level that you achieve may vary from the published benchmark numbers. For example, the size of vectors in your application might differ from the size of vectors quoted in the Product Data Sheet. If your vectors are longer than the vectors described in the benchmark, and more pixels need to be drawn, your application will not draw as many vectors/second as were drawn in the benchmark. It is best to use the GPC benchmarks to get an indication of the type of application performance you can expect to see.

If all conditions are the same as the Product Data Sheet benchmarks, you should be able to achieve performance comparable to the numbers listed on the Product Data Sheet. Otherwise, it is possible that your application is not executing the most optimized paths of the graphics libraries. If so, it is worth trying some of the experiments described next.

Examining Graphics Interactions

If after profiling your benchmarks you determine that graphics is the bottleneck, and, furthermore, that you are not achieving maximum performance on a graphics system, you need to look at your application's interaction with the graphics libraries.

HP's Graphics Library Optimizations

All HP-UX graphics products are tuned for maximum performance based on typical application usage. In other words, there are some combinations of primitives and attributes that HP graphics libraries will execute faster than other combinations of primitives and attributes. In order to determine what those paths are, you need to study the available documentation for each release. If this documentation does not help you understand the performance problems you are experiencing, then you will need to perform some simple experiments.

Documentation Sources

In most cases, the combinations of primitives and attributes that are optimized do not vary from one API to another, since optimizations are focused on typical application usage. However, there may be some performance issues specific to the graphics products that you are using. HP tunes its graphics libraries for each release. For most graphics APIs, some documentation is shipped with the product about performance tuning. In order to be aware of all performance improvements in the graphics library, read the Release Notes and PERF_NOTES whenever you plan to support a new release of any of HP's graphics APIs.

Most vendors ship similar documentation. When designing your application, it is good practice to read the documentation from multiple vendors, in order to determine which primitives and attributes work best across all of the platforms you plan to support.

Online Documentation

Tips for improving application performance on PEXlib are published in the Release Notes document. These files are found in the /etc/newconfig directory on HP-UX 9.01, 9.03, 9.05 and 9.07 releases, or, on HP-UX 10.0 and later releases, in /opt/graphics/PEX5.

Starbase tips are available in /usr/lib/starbase/PERF_NOTES on HP-UX 9.07 or earlier releases, and in /opt/graphics/PERF_NOTES for systems running HP-UX 10.0 or later releases.

Studying Optimizations Shown in the GPC Quarterly

In general, PLB benchmarks are most useful for comparing systems before purchase. However, there is one piece of information in the GPC Quarterly that is helpful to application developers for performance tuning. Published with a summary of the GPC results is a description of the optimizations made by the vendor to achieve maximum performance for the benchmark. By applying the optimizations used in the benchmarks to your program, you should be able to improve performance in your application.

Systematically Tuning Your Graphics Application

Attributes

The settings of attributes, the number of times attributes are called, and the types of attributes used in your program can all affect graphics performance. Some attribute settings simply involve more work than others. For example, for each light turned on in a PEXSetLightSourceState call, a set of mathematical computations must be done to light the primitives in the scene. The more light sources turned on, the more expensive the call.

Redundant attribute settings (for example, attribute calls that are made more than once but don't change values of the current settings) can be very expensive in some implementations. Although the HP graphics libraries do a lot of redundancy checking, certain redundant attribute calls will cause primitives to be drawn using non-optimized paths in the graphics libraries. This can slow graphics performance considerably. Always avoid making duplicate calls to attribute setting routines. If you must set attributes frequently to different values, consider grouping primitives that share similar attributes. For example, sort the primitives according to reflection characteristics, and render all primitives with the same reflection characteristics at once.

Finally, some attribute settings are not optimized by the implementation.

All three of these factors can affect your application performance. The next section describes an experimentation process that will help you determine which attribute calls are having the most impact on graphics performance.

Attribute Suppression Experiments

It is relatively simple to determine whether or not you are setting attributes correctly in order to execute the optimized paths in the graphics libraries. Run your benchmark on your application as it is currently written and record the timing results. Then, experimenting with one attribute call at a time, suppress the attribute calls (that is, comment out the function calls) in your benchmark and rerun it. Compare the timing results. This will change the appearance of the rendering, but that is acceptable for this kind of experimentation. If you get significantly better results with a reduction in attribute calls, look for redundancy in attribute calls.

A single attribute call may not affect whether or not your application hits the optimized paths. Sometimes you need to experiment with sets of related attributes. If attributes are changed in sets, you need to experiment with the entire set. For example, in PEXlib, both the view orientation matrix and the view mapping matrix might be modified to change the view. Commenting out these calls one at a time would have no effect, since the view orientation matrix and the view mapping matrix are concatenated each time one of them changes; but commenting both calls out at the same time might show a significant improvement.
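
Instead of commenting calls out by hand, the suppression can be driven by a run-time mask so that related calls are switched off together. The following is a minimal sketch of that approach. The group names, the ATTR_CALL macro, and the set_* functions are all hypothetical; the set_* functions stand in for the real attribute calls in your benchmark.

```c
#include <assert.h>

/* Attribute groups that should be suppressed together. */
enum {
    SUPPRESS_NONE     = 0,
    SUPPRESS_VIEWING  = 1 << 0,   /* view orientation and view mapping */
    SUPPRESS_LIGHTING = 1 << 1,
    SUPPRESS_COLOR    = 1 << 2
};

unsigned suppress_mask = SUPPRESS_NONE;

/* Choose the groups to suppress, for example from a command-line option. */
void set_suppression(unsigned mask)
{
    assert((mask & ~(unsigned)(SUPPRESS_VIEWING | SUPPRESS_LIGHTING |
                               SUPPRESS_COLOR)) == 0);
    suppress_mask = mask;
}

/* Execute an attribute call unless its group is suppressed. */
#define ATTR_CALL(group, call) \
    do { if (!(suppress_mask & (group))) { call; } } while (0)

/* Hypothetical stand-ins for the real attribute calls. */
int orientation_calls = 0, mapping_calls = 0, color_calls = 0;
void set_view_orientation(void) { orientation_calls++; }
void set_view_mapping(void)     { mapping_calls++; }
void set_line_color(void)       { color_calls++; }

void draw_frame(void)
{
    ATTR_CALL(SUPPRESS_VIEWING, set_view_orientation());
    ATTR_CALL(SUPPRESS_VIEWING, set_view_mapping());
    ATTR_CALL(SUPPRESS_COLOR,   set_line_color());
    /* ... primitives would be drawn here ... */
}
```

Running the benchmark once per mask value gives one timing per attribute group without rebuilding the program between experiments.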

For example, in PEXlib, you might experiment with the following attribute calls:

PEXlib Lighting and Shading Calls:

  • PEXSetLightSourceState

  • PEXSetReflectionModel

  • PEXSetSurfaceInterpMethod

  • PEXSetTableEntries (for lighting table setup)

  • PEXSetDepthCueIndex

  • PEXSetTableEntries (for depth cueing table)

  • PEXSetPolylineInterpMethod

PEXlib Viewing Calls:

  • PEXViewOrientationMatrix

  • PEXViewMappingMatrix

  • PEXSetTableEntries (for view matrix)

  • PEXSetViewIndex

PEXlib Surface Attributes:

  • PEXSetReflectionAttributes

  • PEXSetInteriorStyle

PEXlib Color Attributes:

  • PEXSetLineColor

  • PEXSetLineColorIndex

  • PEXSetMarkerColor

  • PEXSetMarkerColorIndex

  • PEXSetSurfaceColor

  • PEXSetSurfaceColorIndex

  • PEXSetTextColor

Other PEXlib Calls You Should Experiment With:

  • PEXSetFacetCullingMode

  • PEXSetFacetDistinguishFlag

  • PEXSetLineWidth

If your application does not set attributes redundantly, then it might be that your application is setting attributes in a way that is not optimized in the libraries. You need to look at the attribute calls and determine if an optimized path might work for your application instead. While it is not always possible to reduce the number of attribute calls, you may want to make some appropriate tradeoffs between appearance and performance. For example, in a preview operation, it may not be necessary to turn depth cueing on or use wide lines.

If you are confused about which paths are optimized, use the published documentation that is shipped with your libraries, or study the optimizations described in the GPC Quarterly.

Data Formatting Experiments

Just as redundant attribute changes can impact performance, frequent changes in the data formats can also affect application performance. In this case, data format refers to whether or not normals and colors are passed to PEXlib with the vertices. In primitive calls like PEXFillAreaWithData, this information is passed to PEXlib in the vertex_attributes mask. If you are using the OCC interface, vertex attributes are described in the PEXOCC structure.

Determining how changes in data formats are affecting your overall application performance is more difficult than the attribute experiments described above. You will need either to sort the data passed to PEXlib or to perform multipass rendering using your data. To sort data, you would need to group all of the geometry with identical vertex attributes and render it all at once. In a multipass rendering, you would need to traverse the data several times. The first pass might only render primitives without vertex normals. The second pass might render only primitives with vertex normals, etc., until you have rendered all of the primitives in the model. Timing results can be a little confusing, though, because the traversal time needs to be accounted for in a multipass rendering.
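
The multipass approach can be sketched as one traversal per data format, with each pass rendering only the primitives that match. The code below is an illustration under stated assumptions: the FMT_* flags are hypothetical and are not the PEXlib vertex_attributes mask values, and the render callback stands in for the actual primitive call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-vertex data format flags (not the PEXlib mask values). */
enum { FMT_NORMALS = 1, FMT_COLORS = 2 };

typedef struct {
    unsigned format;   /* combination of FMT_* flags */
    /* ... geometry would follow ... */
} Primitive;

/* Count how often the data format changes between consecutive primitives
   when they are rendered in stored order. */
size_t count_format_switches(const Primitive *prims, size_t n)
{
    size_t i, switches = 0;
    assert(prims != NULL || n == 0);
    for (i = 1; i < n; i++)
        if (prims[i].format != prims[i - 1].format)
            switches++;
    return switches;
}

/* Multipass traversal: one pass per format, rendering only the matching
   primitives. render may be NULL when only counting.
   Returns the number of primitives visited for rendering. */
size_t render_multipass(const Primitive *prims, size_t n,
                        void (*render)(const Primitive *))
{
    unsigned format;
    size_t i, rendered = 0;
    assert(prims != NULL || n == 0);
    for (format = 0; format <= (FMT_NORMALS | FMT_COLORS); format++)
        for (i = 0; i < n; i++)
            if (prims[i].format == format) {
                if (render != NULL)
                    render(&prims[i]);
                rendered++;
            }
    return rendered;
}
```

Comparing count_format_switches for the stored order against the small, fixed number of switches in the multipass version gives a rough idea of how much format churn the experiment removes. The extra traversals are the cost that has to be subtracted from the timing.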

Translation between the application's native data format and a packed data format can also have an impact on performance. See the section called Data Formats below for PEX-specific information on data formats.

Window System Interactions

Window size may be a factor in rendering performance. Larger windows can be slower, especially when rasterization is not done in hardware. On HP systems, this is usually not a problem. In HP-PEXlib, window size might be a factor in texture mapping performance, since hardware acceleration is not available for texture mapping on all devices.

Window system interactions can affect performance if the user interactions cause the graphics to be redrawn frequently. This can happen when the application generates a lot of exposure events, and when menus and other user interface items are drawn in the image planes.

Geometry Suppression

Often the amount of detail in the geometric model is greater than the amount of detail needed to render an object realistically. By experimenting with geometry suppression you may also be able to improve application performance. In this case, your application does not render all of the geometry available.

Two general techniques can be applied to many graphics applications. In the first, multiple variations of the geometry are used. Depending on the level of resolution required for the user's task, different geometry is rendered. In some cases, a very coarse resolution is acceptable. Multiple primitives can be combined into a single primitive. Objects that are too small to be seen can be removed altogether, and replaced with an alternate representation.
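
Choosing among several variations of the geometry can be as simple as a threshold test on the approximate on-screen size of an object. The sketch below is hypothetical throughout: the function name, the number of levels, and the pixel thresholds are illustrative and would need to be tuned for a real application.

```c
#include <assert.h>

/* Pick a level of detail from the approximate on-screen size of an object.
   Level 0 is the full model, higher levels are coarser, and -1 means the
   object can be skipped. The pixel thresholds are illustrative only. */
int select_detail_level(double projected_pixels)
{
    assert(projected_pixels >= 0.0);
    if (projected_pixels < 1.0)   return -1;  /* too small to be seen */
    if (projected_pixels < 20.0)  return 2;   /* coarse stand-in */
    if (projected_pixels < 200.0) return 1;   /* reduced geometry */
    return 0;                                 /* full geometry */
}
```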

Another technique uses bounding boxes to trivially reject all offscreen geometry. This technique is useful when you have a "world" scene, and the viewer can only look in one direction.

PEX Specifics

DHA, Protocol Mode and VMX Mode

On HP systems, there are three fundamental ways to communicate with the graphics libraries and render 3D graphics in PEXlib. Direct Hardware Access (DHA) is the fastest method available when both client and server are on the same workstation. In DHA mode, PEXlib sends graphics commands directly to the graphics rendering libraries. In contrast, in protocol mode, graphics requests are sent over the network to the graphics server, where they are decoded and translated into graphics commands that are then sent to the graphics libraries on the server. In VMX mode, the X protocol method, PEXlib commands are translated into X protocol requests, which in turn travel over the network to be decoded by the X server and rendered.

Whenever both the client and server are available on the same system and performance is important, you should run in DHA mode. This is the default, but you can explicitly control the mode by setting the environment variable HPPEX_CLIENT_PROTOCOL to DHA.
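
When timing a local benchmark, it can help to record or check which mode was requested. The sketch below only inspects the environment variable named above. The function names are hypothetical, and the sketch makes no claim about when or how PEXlib itself reads the variable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Interpret a value of HPPEX_CLIENT_PROTOCOL. An unset variable (NULL)
   counts as DHA because the text states that DHA is the default. */
int value_requests_dha(const char *value)
{
    return value == NULL || strcmp(value, "DHA") == 0;
}

/* Check the current environment. */
int dha_requested(void)
{
    return value_requests_dha(getenv("HPPEX_CLIENT_PROTOCOL"));
}

/* Guard for a local benchmark: stop if a non-DHA mode was requested. */
void require_dha_for_benchmark(void)
{
    assert(dha_requested());
}
```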

Structure Mode, Immediate Mode

In PEXlib, there are two ways to draw a scene. If you are using immediate mode, you can pass all of the primitives and attributes in the scene to PEXlib one at a time, each time you want to draw the picture. In structure mode, you store all of the primitives and attributes in a graphics database, then tell PEX to render the contents of that database. Immediate mode rendering is best suited for applications that need to modify model data frequently, or need to reduce memory usage. Structure mode can be used when the data is somewhat static throughout the application.

When running locally, structure mode reduces procedure call overhead and parameter processing times. Most of the cost is incurred at the time the model is built, not when it is rendered. For example, error checking of parameters can be done when the data is stored in the structure, and does not have to be repeated each time the model is rendered.

Structure mode is even more useful in a distributed environment. Since the network is frequently the bottleneck in distributed application performance, it is important to try to minimize network traffic. Storing the data in structures is one way to do that.

Many applications combine the two modes. Non-changing data is stored in structures, but other data that changes frequently is sent to the graphics server in immediate mode. For example, the geometry of a model might be stored in a structure, but viewing calls are made each time the scene is redrawn.

Structure Permissions

Structure permissions control the access to structures by applications. By calling PEXSetStructurePermission, an application can set the permission of a structure to either PEXStructureWriteOnly or PEXStructureLocked. Write-only structures cannot be read by PEXFetchElements, and locked structures cannot be edited.

By setting structure permissions appropriately, you permit PEXlib to use its internal knowledge about the best performance paths, and pack primitives in the most efficient way for the hardware on which it is running. For example, if a locked structure contains multiple PEXPolyline primitives, PEXlib can pack those primitives into a single PEXPolylineSetWithData call, reducing procedure call overhead and resulting in faster execution of those polylines. This example is only possible when the structure is locked.

Other optimizations are possible even in write-only structures. For example, decomposition of polygons can be done only once per write-only structure, instead of every time the contents of the structure are rendered.

Whenever an application needs to continually rerender unchanging models, storing data in structures with write-only or locked permissions should be considered. If you need to edit your structure, set the structure permissions to PEXStructureWriteOnly. If you don't need to edit, best performance can be achieved by setting permissions to PEXStructureLocked.

Using Structures Efficiently

The ExecuteStructure output command that is used to create a structure network can be expensive, because attributes' values are saved when a child structure is executed, and are restored when the traversal returns to the parent structure. Consequently, it is good practice to avoid excessively deep structure networks and avoid creating structures that have very few elements.

Stride and OCC vs. PEX 5.1 Interface

PEXlib offers two major argument interfaces for output commands (primitives and attributes): an explicit interface and an output command context interface (OCC). The explicit interface requires that you specify the display, resource ID (renderer or structure), and request type for every output command function call that you make. The explicit interface is the only interface available on PEXlib 5.1 (including HP-PEXlib, Versions 1.0 and 2.0).

The OCC interface is currently available in HP-PEXlib 5.1, Version 3, and will be available in the PEXlib 5.2 implementation. The OCC interface generates the same protocol as the explicit interface, so that PEXlib programs using the OCC interface can communicate with earlier 5.1 servers.

Output commands using the OCC interface replace the first three arguments, and other frequently used primitive descriptions, like vertex_attributes, with a single OC context. The OC context is an opaque structure that contains many of the arguments that are commonly found in the explicit interface output commands.

In addition to providing a reduced argument count for output commands, the OCC interface supports different data formats. The packed form is the same form that was used in earlier releases of PEXlib. It requires you to format data into packed data structures defined by PEXlib. The stride form allows you to supply data formatted in application-defined structured arrays without the need to copy the data into the PEXlib-defined structures before invoking the PEXlib functions. The unpacked form allows you to supply the data in separate lists for each data type. Vertex coordinates, normals, and colors are stored in separate lists in the unpacked form.
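
The difference between the stride and packed forms is easiest to see in terms of memory layout. The sketch below does not use any PEXlib types; Coord, AppVertex, and both functions are hypothetical. It shows how a stride description (record size plus field offset) lets a consumer read coordinates in place, while a packed form requires a copy first.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Coord;

/* Application-defined vertex record: coordinates plus private data. */
typedef struct {
    int   id;         /* application bookkeeping */
    Coord position;
    Coord normal;
} AppVertex;

/* Stride-style access: read the i-th Coord from an array of arbitrary
   records without copying, given the record size and the field offset. */
const Coord *stride_coord(const void *base, size_t stride,
                          size_t offset, size_t i)
{
    assert(base != NULL && offset + sizeof(Coord) <= stride);
    return (const Coord *)((const char *)base + i * stride + offset);
}

/* Packed-style preparation: copy positions into a contiguous array.
   This copy is the conversion cost that the stride form avoids. */
void pack_positions(const AppVertex *in, size_t n, Coord *out)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = in[i].position;
}
```

A caller would pass sizeof(AppVertex) as the stride and offsetof(AppVertex, position) or offsetof(AppVertex, normal) as the offset. The unpacked form corresponds to keeping positions, normals, and colors in separate arrays from the start.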

Using the OCC Interface

In general, the OCC interface uses far fewer arguments than the explicit interface, making coding easier and improving performance. The OCC interface is recommended for applications that are supported on HP-PEXlib, Version 3.0 or later. Best performance is achieved by minimizing the number of calls that modify the OCC context, and not intermixing calls that use different OCC contexts.

Data Formats

If your application is running in DHA mode on HP, the selection of data interface may have a significant impact on the performance of your application. The OCC interface is implemented at the PEXlib level; the protocol generated by the OCC interface is identical to the protocol generated using the 5.1 PEXlib interface. Consequently, performance is only affected when running in DHA mode.

HP-PEXlib is optimized to use the packed and stride data interfaces most efficiently. The unpacked data interface executes significantly more slowly. However, whether or not to use the packed interface depends on the size and nature of your data. Using the unpacked form may be more efficient than converting large amounts of data to the packed form, if the packed form is different from the application's native data format, and if the conversion routines cost more time than you gain by calling HP-PEXlib through the packed data interface. In order to determine which interface to use for your application, write some simple benchmarks and experiment with the data formats and conversion routines.

One advantage of the stride interface is that it is possible to change vertex and facet attributes without copying data. If your application is going to rerender the same geometric data with different attributes, the stride interface is an appropriate choice.

Shape Hints

All PEXlib FillArea calls accept a shape_hint parameter. By providing a value other than PEXShapeUnknown, you can bypass some unnecessary processing in some cases. For example, on HP hardware, convex shapes can be passed to the graphics hardware immediately. Non-convex shapes may require some preprocessing by PEXlib.

In early releases of HP-PEXlib, shape hints were ignored. In HP-PEXlib 5.1, Version 3.0, shape hints make significant performance differences in many cases. The use of shape hints for potentially non-convex polygons (that is, polygons with more than four sides) is strongly recommended.

Remember that the shape hints must be accurate. Incorrect shape hints can result in the wrong picture being drawn, or in slower performance.
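
If the shape of a polygon is not known in advance, an application can classify it before choosing a hint. The following is a minimal sketch for planar 2D polygons, assuming the polygon is simple (its edges do not cross each other). The ShapeHint values are hypothetical names for this example; an application would map them onto the shape_hint constants defined by PEXlib.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double x, y; } Point2;

/* Hypothetical hint values for this sketch only. */
typedef enum { HINT_UNKNOWN, HINT_CONVEX, HINT_NONCONVEX } ShapeHint;

/* Classify a simple planar polygon by checking that every turn between
   consecutive edges has the same direction. */
ShapeHint classify_polygon(const Point2 *pts, size_t n)
{
    size_t i;
    int sign = 0;
    assert(pts != NULL || n == 0);
    if (n < 3)
        return HINT_UNKNOWN;
    for (i = 0; i < n; i++) {
        const Point2 *a = &pts[i];
        const Point2 *b = &pts[(i + 1) % n];
        const Point2 *c = &pts[(i + 2) % n];
        double cross = (b->x - a->x) * (c->y - b->y)
                     - (b->y - a->y) * (c->x - b->x);
        if (cross != 0.0) {
            int s = cross > 0.0 ? 1 : -1;
            if (sign == 0)
                sign = s;
            else if (s != sign)
                return HINT_NONCONVEX;
        }
    }
    /* All vertices collinear: no reliable answer. */
    return sign == 0 ? HINT_UNKNOWN : HINT_CONVEX;
}
```

Because the same-direction test cannot detect self-intersecting outlines, polygons that might cross themselves should be left with the unknown hint. The cost of classification is paid once per polygon, so it is most worthwhile for geometry that is rendered repeatedly.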

Use of Complex Primitives

Best performance on the newer graphics products can be achieved by reducing the CPU "non-graphics" overhead, such as procedure call overhead. Applications can reduce overhead by packing more primitives into library calls. In HP-PEXlib 5.1, Version 3, a number of complex primitives are supported. However, just using those primitives is not enough. Applications must send enough primitives per call to amortize the overhead. Best performance is achieved when at least eight primitives are packed per call, with performance levelling out at some point above fifty primitives per call. However, some applications have achieved significant performance improvements simply by changing the number of triangles per strip from two or three to five or six.

Compound primitives optimized for PEXlib include:

  • PEXTriangleStrip

  • PEXPolylineSetWithData

  • PEXFillAreaSet

  • PEXSetOfFillAreaSets

  • PEXQuadrilateralMesh

  • PEXFillAreaSetWithData
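
The benefit of packing more primitives into each call can be estimated with a little arithmetic. The helper functions below are hypothetical and only count vertices and calls; they do not measure time.

```c
#include <assert.h>
#include <stddef.h>

/* Vertices sent for n triangles drawn as separate triangles. */
size_t separate_triangle_vertices(size_t n)
{
    return 3 * n;
}

/* Vertices sent for n triangles drawn as one triangle strip. */
size_t strip_vertices(size_t n)
{
    return n == 0 ? 0 : n + 2;
}

/* Library calls needed to draw total primitives, per_call at a time. */
size_t calls_needed(size_t total, size_t per_call)
{
    assert(per_call > 0);
    return (total + per_call - 1) / per_call;
}
```

For example, 1200 triangles sent two per strip need 600 calls, while six per strip need 200 calls, and a six-triangle strip sends 8 vertices where separate triangles would send 18.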

If the Bottleneck is Not Graphics

By profiling your code, or running /bin/time, you may have determined that graphics is not the bottleneck for your application. Several of the more promising options for tuning your code are described below.

Build Environments

Compilers are continually tuned by Hewlett-Packard. Simply updating to newer revisions of compilers and rebuilding your application may improve performance significantly.

Compiler and Linker Options

If your application is CPU bound, you may be able to substantially improve performance just by compiling and linking with some optimization options. For example, compiler options can automatically remove dead code, make better use of registers, optimize loops, generate in-line code, optimize for a particular architecture (for example, PA-RISC, Version 1.0; or PA-RISC, Version 1.1), and optimize your application based on a run-time profile. For C programs, this process is described in Optimizing HP C Programs. Linker optimization options are described in Programming on HP-UX.

Archive Math Libraries

If graphics applications are spending significant time in the math libraries, linking with the archive version instead of the shared version might help. This is because the symbol resolution overhead is reduced with archive libraries.

Cache and TLB Misses

Cache and Translation Lookaside Buffer (TLB) misses can cause a CPU bottleneck. A cache holds frequently accessed data and instructions in "local" memory that is faster for the processor to access than main memory. A cache miss occurs when the processor needs to reference memory, and a copy of the memory is not stored in cache. Because it is very expensive for the processor to access main memory, frequent cache misses will slow down application performance considerably.

The Translation Lookaside Buffer is used to map virtual memory addresses to physical memory addresses. It contains translations for recently addressed virtual pages. A TLB miss occurs when your application accesses a page of virtual memory whose translation is not currently held in the TLB.

Both cache misses and TLB misses can be avoided by improving locality in your application. For example, loops can be written to sequentially access contiguous memory addresses, as opposed to accessing data that is scattered throughout memory. Code routines that frequently call each other should be included in the same source files, or the files containing those routines should be listed next to each other in the ld command. You can also use the ld(1) options for profile-based optimization to reposition code so that better locality is achieved. Note that profile-based optimization will not improve the locality of data references. Profile-based optimization is described in Optimizing HP C Programs.
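
The loop-ordering advice can be illustrated with two traversals of the same two-dimensional array. This is a generic C sketch; the array dimensions and function names are arbitrary choices for the example.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Row-major traversal: the inner loop touches consecutive addresses. */
double sum_sequential(double grid[ROWS][COLS])
{
    size_t r, c;
    double sum = 0.0;
    assert(grid != NULL);
    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            sum += grid[r][c];
    return sum;
}

/* Column-major traversal: consecutive accesses are a whole row apart,
   so each access tends to land on a different cache line. */
double sum_strided(double grid[ROWS][COLS])
{
    size_t r, c;
    double sum = 0.0;
    assert(grid != NULL);
    for (c = 0; c < COLS; c++)
        for (r = 0; r < ROWS; r++)
            sum += grid[r][c];
    return sum;
}
```

Both functions visit every element exactly once. Only the order of memory accesses differs, which is why timing the two against each other on a large array is a quick way to see the cost of poor locality on a given system.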

Memory Bottlenecks

If memory is the bottleneck and your program is thrashing (that is, pages of virtual memory are excessively swapped in and out of physical memory), you may be able to improve application performance just by increasing the amount of physical memory in your system. You can also tune your application's memory usage. To do this, consider the following, all of which will require some code changes in your application:

  • Improve locality within your program. Frequently accessed items that are relatively small and reside in different pages will increase the size of memory needed to swap in pages containing those items.

  • Reduce heap fragmentation. Fragmentation occurs when memory is allocated and freed in patterns that leave unused holes in the heap.

  • Eliminate memory leaks. Memory leaks occur when an allocated piece of memory is no longer needed but is not freed. Several high-quality commercial products are available on Hewlett Packard systems to help you identify and eliminate memory leaks within your program. You should be able to get information about commercial tools from your sales representative.

  • Reduce the size of code and data structures. For example, if a structure contains 32-bit values for each of several boolean values, consider using a group of 1-bit fields instead. Infrequently used code and data can be separated from frequently used code and data.

  • Re-use memory. You can consider using buffers that are allocated once to store temporary items, instead of allocating and freeing memory at different times throughout the execution of your program. This will help reduce fragmentation of the heap, and avoid calls to malloc and free, which are expensive procedures.

  • Consider using primitives that reuse data, like PEXTriangleStrip or PEXSetOfFillAreaSets. PEXSetOfFillAreaSets uses a single "database" of vertices. A set of connectivity lists describe how to connect those vertices to make a polygon.
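
Two of the suggestions above, replacing wide boolean fields with bit-fields and re-using a buffer instead of repeatedly allocating and freeing, can be sketched as follows. All type and function names are hypothetical, and the size comparison assumes a platform where int is 32 bits.

```c
#include <assert.h>
#include <stdlib.h>

/* One full int per boolean value. */
typedef struct {
    int visible, selected, lit, culled;
} WideFlags;

/* The same four booleans stored as 1-bit fields. */
typedef struct {
    unsigned visible : 1, selected : 1, lit : 1, culled : 1;
} BitFlags;

/* Scratch buffer that is allocated once and grown only when needed. */
typedef struct {
    void  *data;
    size_t capacity;
} Scratch;

/* Return a buffer of at least size bytes, reusing the existing block
   whenever it is already large enough. Returns NULL if allocation fails. */
void *scratch_get(Scratch *s, size_t size)
{
    assert(s != NULL);
    if (size > s->capacity) {
        void *grown = realloc(s->data, size);
        if (grown == NULL)
            return NULL;
        s->data = grown;
        s->capacity = size;
    }
    return s->data;
}

/* Free the buffer once, for example at program exit. */
void scratch_release(Scratch *s)
{
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->capacity = 0;
}
```

A Scratch initialized to {NULL, 0} can serve every temporary request in a rendering loop, so the heap is touched only when a request exceeds anything seen before.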

If Disk Access is the Bottleneck

If the problem is disk access, you might consider modifying your program to access the disk more efficiently. For example, make some tradeoffs between memory usage and disk usage. Blocks of frequently used data could be read in all at once and stored in memory, instead of accessing the disk each time an item is accessed.
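
Reading a block of data in one operation, rather than item by item, might look like the following sketch. It uses only standard C I/O; the function name is hypothetical, and it assumes the data fits comfortably in memory.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the rest of an open file into one heap block with a single
   fread call. The caller frees the result. Returns NULL on failure. */
void *read_whole_file(FILE *fp, size_t *size_out)
{
    long start, end;
    void *block;
    size_t size;

    assert(fp != NULL && size_out != NULL);
    if ((start = ftell(fp)) < 0 || fseek(fp, 0L, SEEK_END) != 0)
        return NULL;
    if ((end = ftell(fp)) < 0 || fseek(fp, start, SEEK_SET) != 0)
        return NULL;
    size = (size_t)(end - start);
    block = malloc(size > 0 ? size : 1);
    if (block == NULL)
        return NULL;
    if (fread(block, 1, size, fp) != size) {
        free(block);
        return NULL;
    }
    *size_out = size;
    return block;
}
```

This trades memory for fewer disk accesses, so it suits data that is used frequently and is small relative to physical memory.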

Summary

Good graphics application performance depends on a number of factors, including the raw performance capabilities of the graphics hardware and software used by the application; the efficient, balanced use of system resources; and calling sequences that use the most optimized paths through the graphics libraries. Each application is different. No single set of rules will provide optimal performance for a specific application. Good design for efficient use of system resources, an understanding of the performance-critical tasks, and some amount of experimentation are the keys to achieving the best graphics application performance possible.



[11] The actual pathname of this directory depends on the file system structure. See the Graphics Administration Guide for details.

© 1996 Hewlett-Packard Development Company, L.P.