HP-UX 11i December 2001 Release Notes: HP-UX Servers and Workstations > Chapter 13 Programming

Libraries


aC++ Runtime (libCsup*, libstd*, libstream*, librwtool*) (new at 11i original release)

The aC++ runtime provides the run-time environment necessary for deploying C++ based (aC++ compiled) applications on HP-UX 11i.

This release of the aC++ Runtime includes a new ANSI compliant Standard C++ library. The previous version of the runtime included the "classical" C++ STL library that corresponds to the pre-standard (Sept. 1998) definition of the C++ language and library. The updated C++ runtime included for HP-UX 11i retains the classical C++ library functionality but it also includes new components (libstd_v2 and libCsup_v2) that introduce a standard compliant set of C++ interfaces, as required by the "ISO/IEC 14882 Standard for the C++ Programming Language".

The added components, libstd_v2 and libCsup_v2, are new libraries with functionality that did not exist prior to this release of the C++ runtime. The details of the newly added libraries are covered in:

  • file:/opt/aCC/html/libstd_v2/stdug/index.htm

  • file:/opt/aCC/html/libstd_v2/stdref/index.htm

which are available after installation of version A.03.25 or later of the aC++ product.

Over time, with the acceptance of the new library, it is expected that the old classic library will be deprecated and possibly removed from some future operating system release.

Detailed manpages for the new library are included with the Independent Software Unit release. The library is also discussed in the aC++ Online Help.

Impact

Overall (file) size of the C++ runtime will increase by about 44%, with 10 new libraries.

Provides access to the standard compliant C++ library for application developers (and deployment of such applications). This is by far the most heavily requested enhancement by the users of the aC++ compiler.

The performance of the new library (iostreams) may be slower.

Compatibility Issues

C++ application (source and binary) forward compatibility with 11.x is fully maintained by preserving the classic C++ library in the new runtime; source files, build systems and object files or libraries produced under HP-UX 11.0 with the previous version of C++ runtime should continue to work under the new runtime.

The new libraries are binary incompatible with the classic C++ libraries. The option -AA must be used to enable the new libraries and headers.

To preserve backward source and runtime compatibility from HP-UX 11i to 11.0, application developers who develop C++ applications with the use of the new standard C++ library must ensure that the June 2000 Application Release dependent C++ library patches (C++ library and Header File patches: PHSS_21906, PHSS_21947, PHSS_21950, PHSS_21075, and PHSS_22217 as shown at http://www.hp.com/esy/lang/cpp/rels.html#11) are applied to the 11.0 system.

Changes to libc

Large Files Support for C++ Applications

libc has been modified to support large files for C++ applications. C++ applications can now access files greater than 2 GB. This is done by setting _FILE_OFFSET_BITS to 64 in 32-bit mode. More details can be found in the HP-UX Large Files White Paper Version 1.4 on http://docs.hp.com.
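As a minimal sketch of the mechanism described above, the fragment below defines _FILE_OFFSET_BITS as 64 before any system header is included and then reports the width of off_t. The helper name off_t_bits() is hypothetical and not part of libc. On a 64-bit compile off_t is normally 64 bits wide already, so the define only matters in 32-bit mode.

```c
/* Must appear before any system header so that it takes effect. */
#define _FILE_OFFSET_BITS 64

#include <limits.h>
#include <sys/types.h>

/* Hypothetical helper: width of off_t, in bits, for this compilation unit.
   With _FILE_OFFSET_BITS set to 64 this is expected to be 64, which is what
   allows file offsets beyond 2 GB. */
int off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}
```

The same define can instead be supplied on the compile line (for example with -D_FILE_OFFSET_BITS=64) so that every source file in the application sees it consistently.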

HP CxDL Development Tool Support

libc support for HP CxDL Development tool has been included in the setjmp() and longjmp() family of APIs in both 64-bit and 32-bit libc.

libdbm

A new patch for the dbm libraries (libdbm(1) and libndbm(2)) has been created to increase performance of dbm_nextkey().

Header Files

Header files ftw.h and stdio.h were patched to enable C++ large files support.

In addition, numerous defects were fixed.

New Environment Variables for malloc

libc uses a single lock in the malloc() routines to make them thread-safe. In a multi-threaded application, there could be contention on this single lock if multiple threads are calling malloc and free at the same time. This patch provides multiple arenas, where malloc() can allocate space from, and a lock for each arena. Threads are distributed among the arenas. Two new environment variables are introduced:

  • _M_ARENA_OPTS

  • _M_SBA_OPTS

_M_ARENA_OPTS can be used to tune the number of arenas and the arena expansion factor for threaded applications. In general, the more threads in an application, the more arenas should be used for better performance. The expansion factor controls the number of pages by which an arena is expanded each time, and assumes a page size of 4096 bytes. The number of arenas can be from 4 to 64 for threaded applications. For non-threaded applications, only one arena is used regardless of whether this environment variable is set. However, you can still use this environment variable to change the expansion factor for non-threaded applications.

If the environment variable is not set, or the number of arenas is set to be out of the range, the default number of 8 is used. The expansion factor is from 1 to 4096; the default value is 32. Again, if the factor is out of the range, the default value will be used. For example:

$ export _M_ARENA_OPTS=8:32

where the number of arenas is 8, and the expansion size is 32*4096 bytes. In general, the more arenas you use, the smaller the expansion factor should be, and vice versa.
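The ranges and defaults above can be sketched in C as follows. This only illustrates the documented rules; it is not how libc itself reads the variable, and the struct and function names are hypothetical. It also does not model the single-arena behavior of non-threaded applications.

```c
#include <stdio.h>

#define ARENA_PAGE_SIZE 4096L   /* page size assumed by the expansion factor */

/* Hypothetical holder for the two values carried by _M_ARENA_OPTS. */
struct arena_opts {
    int  arenas;            /* number of arenas */
    int  expansion_pages;   /* expansion factor, in pages */
    long expansion_bytes;   /* expansion factor * 4096 */
};

/* Parse "<arenas>:<expansion>" and apply the documented ranges.
   A missing or out-of-range field falls back to its default (8 and 32). */
struct arena_opts parse_arena_opts(const char *value)
{
    struct arena_opts o = { 8, 32, 0 };
    int arenas = 0, pages = 0;
    int n = (value != NULL) ? sscanf(value, "%d:%d", &arenas, &pages) : 0;

    if (n >= 1 && arenas >= 4 && arenas <= 64)
        o.arenas = arenas;
    if (n == 2 && pages >= 1 && pages <= 4096)
        o.expansion_pages = pages;
    o.expansion_bytes = o.expansion_pages * ARENA_PAGE_SIZE;
    return o;
}
```

For the value 8:32 this yields 8 arenas and an expansion size of 32 * 4096 = 131072 bytes. A value such as 100:5000 is out of range in both fields and falls back to 8 arenas and a factor of 32.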

_M_SBA_OPTS turns on the small block allocator and sets its parameters, namely maxfast, num_smallblocks, and grain. Refer to mallopt() for details about the small block allocator and its parameters. Applications usually run faster with the small block allocator turned on than with it turned off.

A small block allocator can be turned on through mallopt(); however, it is not early enough for C++/Java applications. The environment variable turns it on before the application starts.

The mallopt() call can still be used the same way. If the environment variable is set and no small block has yet been allocated, subsequent mallopt() calls can still override whatever is set through _M_SBA_OPTS. If the environment variable is set and a small block has already been allocated, then mallopt() will have no effect. For example:

$ export _M_SBA_OPTS=512:100:16

where the maxfast size is 512, the number of small blocks is 100, and the grain size is 16. You must supply all 3 values, and in that order. If not, the default ones will be used instead.

The _M_ARENA_OPTS and _M_SBA_OPTS environment variables have the following impact:

  • Performance is improved for multi-threaded applications.

  • Threaded applications may experience increased heap storage usage but you can adjust the heap usage through _M_ARENA_OPTS.

NOTE: Threaded applications which are linked with archive libc and other shared libraries where those shared libraries have dependencies on shared libc may break.

libc Performance Improvements (new at 11i original release)

Overall libc Performance Tuning

This information refers to the system library libc, /usr/lib/libc.sl. Several header files have been changed as described below. A new archive library has been added so that the string and memory routines can be linked archived while the application as a whole is linked shared.

There are now two different 32-bit system libraries. One is built for use on a PA1.1 machine and the other is built for use on a PA2.0 machine. The correct library is installed at installation time. Other changes to these libraries include a decreased calling overhead for the shared library. Also the build process makes use of pragmas introduced in release 10.20 to decrease the calling overhead in shared libraries.

In addition to the changes to the library builds, changes have been made to selected header files to allow building applications that have decreased calling overhead. These changes apply to both 32-bit and 64-bit applications.

Two new libraries are added, /usr/lib/libcres.a and /usr/lib/pa20_64/libcres.a. These archive libraries include the common string and memory functions along with an improved-performance qsort routine. A few other selected small routines are also included. The intent of this library is that an application can link this library archived while linking the application as a whole shared. The use of this archived library is a supported link mode and will not introduce the problems normally associated with a shared/archive link.

The 32-bit system libraries now have selected APIs built with the pragmas HP_DEFINED_EXTERNAL, HP_LONG_RETURN and HP_NO_RELOCATION. When these three pragmas are used in the building of libc.sl, the result is referred to as a fastcalled library. The export stubs for the selected interfaces are inlined in the library code, which reduces the call overhead. Applications that have already been built will benefit from this without any effort other than the replacement of this library. The benefit a given application will gain is very dependent on the application's use of the libc APIs that have been fastcalled.

Along with the changes to the build process for libc.2, the following header files have been changed:

  • ctype.h

  • grp.h

  • mntent.h

  • pwd.h

  • stdio.h

  • stdlib.h

  • strings.h

  • string.h

  • time.h

These header files now contain the necessary fastcall pragmas to enable building a fastcalled application. To make use of the pragmas when building the application, _HP_SHLIB_CALLS must be defined for the application compile. With this define, the application will have the import stubs inlined in the application code, further reducing the shared library call overhead.

CAUTION: An application that has been built with the _HP_SHLIB_CALLS define can *ONLY* be used with a fastcalled libc. If the application also has APIs that are fastcalled and are part of the application's shared libraries, then those libraries must also be built with the fastcall technology.

The /usr/lib/pa20_64/libc.2 Library

Although the build process for this library has not changed, the runtime architecture for HPPA-2.0 can make use of a reduced call overhead technology similar to the one that exists with the 32-bit library. Unlike with the 32-bit library, there is no restriction on matching the correct /usr/lib/pa20_64/libc.2 with the fastcalled application.

There is a manpage available for libcres.a(5).

Compatibility Issues

An existing PA1.1 application will not have a compatibility issue with the new 32-bit fastcalled /usr/lib/libc.sl. However, if the fastcall technology is used to build an application, then that application can only be used with a fastcall technology library.

An existing 64-bit application does not have any compatibility issues with the existing /usr/lib/pa20_64/libc.sl libraries. If a 64-bit application is built with the fastcall technology, this application will not have any compatibility issues with an existing /usr/lib/pa20_64/libc.sl.

To make use of the application fastcall and the libcres.a features, changes will need to be made to existing make files.

Other Considerations

There is little to no impact from these changes. There is a slight (125KB) increase in the amount of disk space required for libcres.a. The changes to the system libraries are transparent to current applications.

Any performance gains for an application are highly dependent on the application's use of libc.sl and what interfaces in this library are used.

The fastcall technology will be delivered with all systems. If there are compatibility concerns, the applications should not be built with this technology.

More APIs in libc may make use of the fastcall technology in future releases. Appropriate changes to the header files will be delivered to track these changes.

Performance Improvements to libc's ftw(3C) and nftw(3C)

The libc functions ftw() and nftw() have been rewritten to operate faster, avoid stack overflow conditions, reduce data space usage, and improve parallelism in multi-threaded applications.

libc and commands which call ftw() and nftw() are affected.

ftw()

ftw() was rewritten to eliminate internal recursion, thus avoiding the possibility of a stack overflow on deep file trees. A single fixed-size data structure is allocated on the stack instead of using malloc() to allocate separate buffers for each depth of the tree. Use of strlen() was eliminated, as were trivial comparisons such as strcmp(buf,"."). The file descriptor re-use algorithm was changed from most-recently-opened to least-recently-opened, which can show significant performance gains on very deep file trees.

ftw() will typically show 8% reductions in elapsed time and 50% or more reduction in heap space used.
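For readers unfamiliar with the interface, here is a minimal sketch of a typical ftw() caller, using only the standard ftw(3C) API. The helper name count_regular_files() and the descriptor limit of 16 are illustrative assumptions, not values taken from these release notes.

```c
#include <ftw.h>
#include <sys/stat.h>

/* ftw() passes no user argument to the callback, so this sketch keeps its
   counter in a file-scope variable (which makes it not thread-safe). */
static long regular_files;

/* Callback invoked by ftw() once for each object in the tree. */
static int count_regular(const char *path, const struct stat *sb, int type)
{
    (void)path;
    (void)sb;
    if (type == FTW_F)
        regular_files++;
    return 0;               /* zero tells ftw() to continue the walk */
}

/* Hypothetical helper: count the regular files under root.
   Returns the count, or -1 if ftw() reports an error. */
long count_regular_files(const char *root)
{
    regular_files = 0;
    /* Third argument: maximum number of file descriptors ftw() may hold open. */
    if (ftw(root, count_regular, 16) != 0)
        return -1;
    return regular_files;
}
```

Callers written this way need no source changes to benefit from the rewritten ftw(); the third argument is the descriptor budget to which the re-use algorithm described above applies.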

nftw()

nftw() was rewritten similarly to ftw() with the same benefits. nftw() now fully conforms with the UNIX95 definition, including the fact that when FTW_PHYS is not set, files are reported only once.

Threaded applications can obtain greater concurrency when they specify an absolute path name for the starting path and FTW_CHDIR is not set. In addition, an internal unbalanced binary tree was replaced with a much more efficient splay tree. The effect of this tree change becomes significant as the number of object inodes being tracked increases. Directory inodes are always tracked, and when executing in UNIX95 mode and the FTW_PHYS option is not set, all files and directories are tracked. When the number of tracked objects reaches about 20,000, the user CPU time with the splay tree is about half the user CPU time for the old nftw(). At 100,000 tracked inodes, the user CPU time is about 90% less for the splay tree.

Another performance improvement to nftw() eliminated calls to access() by checking the mode bits in the stat() buffer. This decreased system CPU time by approximately 4%.

Two defects were fixed in nftw():

  • When the FTW_CHDIR option is set, directories are considered unreadable unless they have both read and execute permissions. (The old nftw() would try to chdir() into a directory without execute permissions and then abort the walk with an error).

  • When the FTW_CHDIR option is set, a directory object is reported to the user function before it is chdir()'ed into.

nftw() improvements vary depending on options provided, with the most significant improvements seen in UNIX95 standard mode with the FTW_PHYS option not set, or when a very large number of directories exist in the file tree being traversed.

Documentation Change

The ftw(3C) and nftw(3C) manpages have been updated, particularly with respect to the two defect fixes and means of achieving best concurrency in threaded applications.

Other Issues

The code size of ftw() and nftw() has increased by about 40%, but the heap requirements are reduced by 50% or more.

If you relied on the FTW_CHDIR defects which were mentioned above, there may need to be an application change.

Performance Issues

Minimally, you should find that ftw() operates about 6% faster and nftw() 4% faster. On very large file trees where the number of tracked inodes is in the tens of thousands or more, the performance gain of nftw() could be 30% to 40% or more.

Performance Improvements to libc's malloc(3C)

A new environment variable, _M_CACHE_OPTS, is available to help tune malloc() performance in kernel-threaded applications. This environment variable configures a thread-private cache for malloc'ed blocks. If cache is configured, malloc'ed blocks are placed into a thread's private cache when free() is called, and may thereafter be allocated from cache when malloc() is called. Having such a cache potentially improves speed performance for some kernel-threaded applications, by reducing mutex contention among threads and by deferring coalescence of blocks.

The thread-private cache is only available for kernel-threaded applications, i.e. those linked with the pthread library. The installed shared pthread library version must be PHCO_19666 or later, or the application must be statically linked with an archive pthread library that is version PHCO_19666 or later, or else cache is not available.

By default cache is not active and must be activated by setting _M_CACHE_OPTS to a legal value. If _M_CACHE_OPTS is set to any out of range values, it is ignored and cache remains disabled.

There are two portions to the thread private cache: one for ordinary blocks and one for small blocks. Small blocks are blocks that are allocated by the small block allocator (SBA), which is configured with the environment variable _M_SBA_OPTS or by calls to mallopt(3C). The small block cache is automatically active whenever both the ordinary block cache and the SBA are active. The ordinary block cache is active only when it is configured by setting _M_CACHE_OPTS. There are no mallopt() options to configure the thread-private cache.

The following shows _M_CACHE_OPTS's subparameters and their meaning:

_M_CACHE_OPTS=<bucket_size>:<buckets>:<retirement_age>

<bucket_size> is (roughly) the number of cached ordinary blocks per bucket that will be held in the ordinary block cache. The allowable values range from 0 through 8*4096 = 32768. If <bucket_size> is set to 0, cache is disabled.

<buckets> is the number of power-of-2 buckets that will be maintained per thread. The allowable values range from 8 through 32. This value controls the size of the largest ordinary block that can be cached. For example, if <buckets> is 8, the largest ordinary block that can be cached will be 2^8 or 256 bytes. If <buckets> is 16, the largest ordinary block that can be cached will be 2^16 or 65536 bytes, etc.

<bucket_size>*<buckets> is (exactly) the maximum number of ordinary blocks that will be cached per thread. There is no maximum number of small blocks that will be cached per thread if the small block cache is active.

<retirement_age> controls what happens to unused caches. It may happen that an application has more threads initially than it does later on. In that case, there will be unused caches, because caches are not automatically freed on thread exit -- by default they are kept and assigned to newly-created threads. But for some applications, this could result in some caches being kept indefinitely and never reused. <retirement_age> sets the maximum amount of time in minutes that a cache may be unused by any thread before it is considered due for retirement. As threads are created and exit, caches due for retirement are freed back to their arena. The allowable values of <retirement_age> range from 0 to 1440 minutes (=24*60, i.e. one day). If <retirement_age> is 0, retirement is disabled and unused caches will be kept indefinitely. It is recommended that <retirement_age> be configured to 0 unless space efficiency is important and it is known that an application will stabilize to a smaller number of threads than its initial number.
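The arithmetic behind these subparameters can be sketched in C as follows. The struct and function names are hypothetical, and the sketch assumes that all three fields must be present; it illustrates the documented ranges only and does not reproduce libc's own handling of the variable.

```c
#include <stdio.h>

/* Hypothetical holder for the three _M_CACHE_OPTS subparameters. */
struct cache_opts {
    int bucket_size;      /* cached ordinary blocks per bucket, 0..32768 */
    int buckets;          /* power-of-2 buckets per thread, 8..32 */
    int retirement_age;   /* minutes, 0..1440 (0 disables retirement) */
};

/* Parse "<bucket_size>:<buckets>:<retirement_age>".
   Returns 1 if the cache would be active, 0 if it stays disabled
   (malformed value, out-of-range field, or a bucket_size of 0). */
int parse_cache_opts(const char *value, struct cache_opts *out)
{
    struct cache_opts o;

    if (value == NULL ||
        sscanf(value, "%d:%d:%d",
               &o.bucket_size, &o.buckets, &o.retirement_age) != 3)
        return 0;
    if (o.bucket_size < 0 || o.bucket_size > 8 * 4096)
        return 0;
    if (o.buckets < 8 || o.buckets > 32)
        return 0;
    if (o.retirement_age < 0 || o.retirement_age > 24 * 60)
        return 0;
    *out = o;
    return o.bucket_size != 0;
}

/* Largest ordinary block that can be cached: 2^buckets bytes. */
unsigned long long largest_cached_block(const struct cache_opts *o)
{
    return 1ULL << o->buckets;
}

/* Maximum number of ordinary blocks cached per thread. */
long max_cached_blocks(const struct cache_opts *o)
{
    return (long)o->bucket_size * o->buckets;
}
```

With the value 100:20:0, for instance, at most 100 * 20 = 2000 ordinary blocks are cached per thread, and the largest cacheable block is 2^20 = 1048576 bytes.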

In general, kernel threaded applications that benefit in performance from activating the small block allocator may also benefit further by activating a modest-sized ordinary cache, which also activates caching small blocks (from which most of the benefit is derived). For example, a setting that might be tried to begin with would be:

_M_SBA_OPTS=256:100:8
_M_CACHE_OPTS=100:20:0

The smallest ordinary cache that is legal and will activate small block caching (if the SBA is also configured) is

_M_CACHE_OPTS=1:8:0

It can happen that activating small block caching with this minimum level of ordinary cache gives all the performance benefit that can be gained from malloc cache, and increasing the ordinary block cache size further does not improve matters. Or, increasing cache size further may give some further improvement for a particular application.

The malloc() per-thread cache is a heuristic which may or may not benefit a given kernel-threaded application that makes intensive use of malloc. Only by trying different configurations can you determine whether any speed improvement can be obtained from per-thread cache for a given application, and what the optimal tuning is for that application.

Impact

There is no impact on performance if cache is not configured or if the application is not kernel-threaded. There are possible significant speed performance improvements for some kernel-threaded applications if cache is configured.

There is a small additional space cost (in process heap size) associated with the cache machinery. There is no per-block space cost for caching small blocks. However, there is a small space cost per ordinary block cached. ISVs whose applications are very memory intensive may want to configure only a minimum-sized or very small ordinary cache when experimenting with this feature.

malloc() thread-private cache does not change the function of malloc() for nonthreaded or cma threaded applications. It does maintain binary compatibility. However, because it is a change in allocation policy, it can cause different sequences of addresses to be emitted for the same sequence of requests than a previous version of malloc would have emitted. This level of compatibility is more stringent than ordinary binary compatibility and has never been guaranteed across releases of malloc.

The libcres.a Library

libcres.a is a small archive library provided at 11i.

libcres.a contains string, memory and other functions, to provide customers running performance-critical applications with the benefit of a static link.

Linking statically with libc is not a supported method of linking an application. Any performance improvement is highly dependent on the application's use of the included functions. The functions included in this library are:

abs(), bsearch(), div(), ffs(), insque(), labs(), ldiv(), memchr(), memcmp(), memcpy(), memmove(), memset(), strcat(), strchr(), strcmp(), strcpy(), strcspn(), strlen(), strncat(), strncmp(), strncpy(), strrchr(), strspn(), strstr(), swab()

The libcres.a(5) manpage describes its use more thoroughly.

To make use of this library, existing makefiles must be modified to include it on the link line. Existing applications must be re-linked to use this library.

The modules of this library are compiled with the HP optimizing compiler using the +O4 flag. As a result, applications using this library can be linked only by using the HP optimizing compiler.

The functions in this library cannot be overridden with a user-defined function of the same name, as is possible today with libc names. If this library is used, user libraries must not contain identically named functions, or unexpected results may occur.

Performance of some applications may improve by using this library. The improvement is highly dependent on the application's use of the included functions.

Linker and Object File Tools (ld, crt0.o, dld.sl, libdld.sl, chatr and odump) (new at 11i original release)

The following list summarizes the changes to linker and object file tools.

Linker changes:

  • Incremental linking support in 64-bit ld and elfdump.

  • Unix 98 (32-bit dl*() calls) support in libdld.sl and dld.sl.

  • 32-bit Filtered shared libraries support in ld, dld.sl and in odump.

  • GProf 32-bit shared library support in crt0.o and dld.sl.

  • ld +filter option to create filtered shared libraries.

  • ldd32 (list dynamic dependencies of executable files or shared libraries) support in dld.sl.

  • Plabel cache (caches PLABELs at run time) support in ld and dld.sl.

  • ld +dependdb and +dependdb_outputdir options for generation of dependency database, .ldb file.

  • ld +objdebugonly in both 32-bit and 64-bit, to ignore debug information from non objdebug objects or archives and proceed in +objdebug mode.

  • Special support for OGL's TLS shared library in dld (both 32- and 64-bit).

Tools enhancements:

  • elfdump +ild to display incremental linking information.

  • ar -x option to allow modules from lib to keep datestamp.

  • odump -tlssym option for displaying the TLS (thread) symbols.

  • chatr +q3p enable/disable and +q4p enable/disable options to support marking the 3rd/4th quadrant for private data space.

  • odump -verifyall option to suppress stub warnings on executable.

  • odump -filtertable to display the filtered shared library's implementation libraries.

Details of Linker Changes

Incremental linking: Incremental linking provides significant link-time improvements for compile-link-debug development cycles by processing only those input files that are actually modified between cycles. Files that are not modified do not need to be reprocessed. For large applications, incremental linking may provide up to 10x and sometimes greater improvements in link time.

Unix 98: Support for the APIs dlopen, dlsym, dlerror and dlclose is added for 32-bit programs.
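A minimal sketch of how these four calls fit together is shown below. The helper name lookup_symbol() is hypothetical, and the library path is whatever the caller supplies; nothing here is specific to the 32-bit implementation described in these notes.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: look up `symbol` in the shared library at `path`.
   Passing NULL for `path` searches the running program and the libraries
   already loaded with it. Returns the symbol's address, or NULL on failure.
   On success the library is deliberately left open so that the returned
   address stays valid; a caller that keeps the handle itself would release
   it with dlclose() when done. */
void *lookup_symbol(const char *path, const char *symbol)
{
    void *handle = dlopen(path, RTLD_LAZY);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return NULL;
    }

    dlerror();                          /* clear any stale error state */
    void *addr = dlsym(handle, symbol);
    const char *err = dlerror();
    if (err != NULL) {
        fprintf(stderr, "dlsym: %s\n", err);
        dlclose(handle);
        return NULL;
    }
    return addr;
}
```

Checking dlerror() after dlsym(), rather than testing the returned pointer alone, distinguishes a failed lookup from a symbol whose value is legitimately NULL.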

Filtered Libraries: Filtered shared libraries divide up a large library into one filter and several implementation libraries. The user links against the filter library, but the real definitions of data and functions actually reside in the implementation libraries. At run time, only those implementation libraries that are actually used are loaded. Filtered libraries can be nested; an implementation library can itself be a filtered library containing other implementation libraries.

GProf 32-bit support: GProf is an enhanced version of prof which produces a call graph over the input generated by prof. However, the profiling of shared libraries was not supported in earlier releases. This release supports profiling of shared libraries using the environment variable LD_PROFILE. No recompilation is required for profiling shared libraries.

ldd32: dld.sl adds support for ldd32, which lists the dynamic dependencies of incomplete executable files or shared libraries.

PLabel cache: +plabel_cache is added to 32-bit linker and dld.sl to control the global symbol hash mechanism.

+objdebugonly: ld +objdebugonly in both 32-bit and 64-bit, to ignore debug information from non-objdebug objects or archives and proceed in +objdebug mode.

Other Issues

Various serious and critical defects were repaired.

Forward and backward compatibility are maintained. Use of new features in this release may break backward compatibility.

Invoking chatr on some binaries built with an older linker may emit the following message: chatr(error): dl_header_ext.size != sizeof(dl_header_ext). Please update your version of the linker/chatr. This message should be regarded as a warning rather than an error. chatr operation will be successful in spite of the warning.

Changes to libm

The fesetround() and fehold() functions in fenv.h have been upgraded to the latest ISO C9x specification. Previously the functions returned nonzero to indicate success and zero to indicate failure; now they return zero to indicate success and nonzero to indicate failure.

Any code that depended on the return value will need to change. For example:

if (!fesetround(FE_UPWARD))

{/* deal with failure to set rounding direction */}

could be changed to:

if (fesetround(FE_UPWARD))

{/* deal with failure to set rounding direction */}

Previous code that depended on the return value is not compatible beginning with the 11.0 May 1999 Extension Pack.

© 1983-2001 Hewlett-Packard Development Company, L.P.