HP-UX Floating-Point Guide: HP 9000 Computers > Chapter 2 Floating-Point Principles and the IEEE Standard for Binary Floating-Point Arithmetic

Floating-Point Operations


The IEEE standard requires a complying system to support the following floating-point operations:

Addition

Algebraic addition.

Subtraction

Algebraic subtraction.

Multiplication

Algebraic multiplication.

Division

Algebraic division.

Comparison

There are four possible relations between any two floating-point values: less than, equal, greater than, and unordered. The unordered relation occurs when one or both of the operands is a Not-a-Number (NaN). See “Comparison” for details.

Square Root

The square root operation never overflows or underflows.

Conversion

The following conversions must be supported by a conforming implementation, if the implementation supports single-precision, double-precision, and quad-precision formats:

  • Single-precision to double-precision

  • Single-precision to quad-precision

  • Double-precision to single-precision

  • Double-precision to quad-precision

  • Quad-precision to single-precision

  • Quad-precision to double-precision

  • Floating-point to integer

  • Integer to floating-point

  • Binary floating-point to decimal

  • Decimal to binary floating-point

See “Conversion Between Operand Formats” for more information about these conversions.

Round to Nearest Integral Value

Rounds an argument to the nearest integral value (in floating-point format) based on the current rounding mode. Rounding modes are described in “Inexact Result (Rounding)”.

Remainder

The remainder operation takes two arguments, x and y, and is defined as x - y * n, where n is the integer nearest the exact value x/y. See “The Remainder Operation” for more information.

To understand the properties of each operation, you need a full understanding of denormalized numbers, infinities, and NaNs (see “Normalized and Denormalized Values”, “Infinity”, and “Not-a-Number (NaN)”). HP 9000 systems conform to the IEEE standard for all of these operations.

The standard requires that the result of each operation be rounded from its mathematically exact value into an IEEE representation in accordance with the current rounding mode. In round-to-nearest mode (the default), the result is within 1/2 ULP (unit in the last place) of the exact value. (There is one exception to this rule: conversions between binary and decimal need not be rounded perfectly at the extremes of their ranges.)

Comparison

The comparison operation determines the truth of an assertion about the relationship of two floating-point values. The four basic assertions are

operand1 < operand2

The first operand is less than the second.

operand1 = operand2

The first operand is equal to the second.

operand1 > operand2

The first operand is greater than the second.

operand1 ? operand2

Unordered. This assertion is true if either operand is a NaN.

The basic assertions can be combined with each other. For example, "a >= b" asserts that a is greater than or equal to b. Similarly, "a <> b" asserts that a is either greater than or less than b; for operands that are not NaNs, this assertion is the opposite of "a = b".

NOTE: The assertion operators should not be confused with actual programming language operators. Programming languages generally do not provide a ? comparison operator, for example.

As of Release 11.0, the C math library provides six new macros that implement comparison operations without raising exceptions: isgreater, isgreaterequal, isless, islessequal, islessgreater, and isunordered. See “C9X Functions and Macros” for details.
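
The four relations can be sketched with the C99 comparison macros from <math.h> (isless, isgreater, islessgreater, isunordered). The enum and the function names below are hypothetical, chosen only for illustration.

```c
#include <math.h>

/* The four mutually exclusive relations between two values. */
enum relation { REL_LESS, REL_EQUAL, REL_GREATER, REL_UNORDERED };

/* Hypothetical helper: classify a and b using only the quiet
   comparison macros, which do not raise the invalid operation
   exception for quiet NaN operands. */
enum relation classify(double a, double b)
{
    if (isunordered(a, b)) return REL_UNORDERED;  /* a ? b */
    if (isless(a, b))      return REL_LESS;       /* a < b */
    if (isgreater(a, b))   return REL_GREATER;    /* a > b */
    return REL_EQUAL;                             /* a = b */
}

/* The combined assertion "a <> b": true only if a is less than or
   greater than b, so it is false when either operand is a NaN. */
int less_or_greater(double a, double b)
{
    return islessgreater(a, b) != 0;
}
```

For non-NaN operands, less_or_greater(a, b) is the opposite of a == b; for a NaN operand both are false.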

An assertion may also be negated.

The IEEE standard defines two versions of every possible assertion: the aware and the non-aware version. Both the aware and non-aware versions of an assertion treat a NaN as a special value that compares as neither less than nor greater than any numeric value, and as unequal to any value, including any other NaN and even itself. This definition yields the interesting fact that the assertion "x = x" will evaluate to FALSE if x is a NaN. In fact, applications sometimes use this comparison operation specifically to detect NaNs, although it is a dangerous practice because some vendors' optimizers remove this operation from the code.
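
A minimal sketch of the self-comparison idiom next to the C99 isnan macro, which is usually the safer choice. The function names are hypothetical, and the first function assumes the compiler preserves IEEE comparison semantics (options in the style of "fast math" may remove the test).

```c
#include <math.h>

/* x == x is false only when x is a NaN. An optimizer that ignores
   IEEE semantics may reduce this to a constant 0. */
int is_nan_by_self_compare(double x)
{
    return !(x == x);
}

/* Equivalent check using the C99 classification macro. */
int is_nan_by_macro(double x)
{
    return isnan(x) != 0;
}
```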

The non-aware version of an assertion behaves the same as the aware version, except that it also raises an invalid operation exception for the <, <=, >, and >= assertions if either operand is a NaN. The =, !=, and ? assertions are the only ones that are valid with NaN operands.

Signaling NaNs cause an invalid operation exception for both aware and non-aware assertions.

The behavior of the comparison operation for each of the possible operand kinds is as follows:

Normalized and Denormalized Values

The operands are algebraically compared.

Zero

Zeros are greater than any nonzero negative value and less than any nonzero positive value. The sign of a zero is ignored, so that two zeros always compare as equal even if they have opposite signs.

Infinity

To the comparison operators, infinity is just another signed numeric value whose magnitude is greater than the largest normalized magnitude. Infinities with the same sign compare as equal to each other.

NaN

A NaN compares as unequal to all other operands, including other NaNs and itself. The rules above are used to evaluate assertions involving NaNs as TRUE or FALSE. If the assertion is non-aware, an invalid operation exception is also signaled for any comparison involving a <, <=, >, or >= assertion.
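
The rules for zeros, infinities, and NaNs can be checked directly in C. This is a small sketch with hypothetical function names; it assumes the C99 INFINITY and NAN macros from <math.h> and a compiler that keeps IEEE comparison semantics.

```c
#include <float.h>
#include <math.h>

/* Each function returns 1 when the stated rule holds. */

/* Zeros compare equal regardless of sign. */
int zeros_compare_equal(void)
{
    double pz = 0.0, nz = -0.0;
    return pz == nz;
}

/* Infinity is greater in magnitude than the largest finite value. */
int infinity_exceeds_largest_finite(void)
{
    double inf = INFINITY;
    return inf > DBL_MAX && -inf < -DBL_MAX;
}

/* Infinities with the same sign compare equal to each other. */
int same_sign_infinities_equal(void)
{
    double a = INFINITY, b = INFINITY;
    return a == b && -a == -b && a != -b;
}

/* A NaN is unequal to everything, including itself. */
int nan_unequal_to_everything(void)
{
    double n = NAN;
    return n != n && n != 0.0 && !(n == n);
}
```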

Conversion Between Operand Formats

The standard requires that it be possible to convert between decimal and binary floating-point, and between binary floating-point and integer formats. This section describes some of the properties of various conversions. The operand type integer refers to either signed or unsigned integers.

Single-Precision to Double-Precision or Quad-Precision

These conversions can never overflow, underflow, or be inexact. The only possible type of exception is an invalid operation if the operand is an SNaN.

Double-Precision to Quad-Precision

These conversions can never overflow, underflow, or be inexact. The only possible type of exception is an invalid operation if the operand is an SNaN.

Quad-Precision or Double-Precision to Single-Precision

These conversions can overflow or underflow and are usually inexact.

Quad-Precision to Double-Precision

These conversions can overflow or underflow and are usually inexact.
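
A brief sketch of the widening and narrowing cases above, using C float and double for single-precision and double-precision. The function names are hypothetical.

```c
/* Widening is exact, so converting a single-precision value to
   double-precision and back recovers the original value
   (for non-NaN inputs). */
int widening_round_trips(float f)
{
    volatile double wide = f;            /* single -> double: exact */
    volatile float back = (float)wide;   /* exact here, value came from a float */
    return back == f;
}

/* Returns nonzero if narrowing d to single-precision changes its
   value, that is, if the conversion is inexact. */
int narrowing_is_inexact(double d)
{
    volatile float narrow = (float)d;    /* double -> single: may round */
    return (double)narrow != d;
}
```

For example, 0.1 has no exact binary representation, so its double-precision approximation carries more significant bits than single-precision can hold and narrowing_is_inexact(0.1) is nonzero; 0.5 is a power of two and narrows exactly.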

Decimal to Single-Precision, Double-Precision, or Quad-Precision

These conversions can overflow or underflow and are usually inexact. See “Conversions Between Binary and Decimal” for more information about these conversions.

Single-Precision, Double-Precision, or Quad-Precision to Decimal

These conversions can overflow or underflow and are usually inexact. See “Conversions Between Binary and Decimal” for more information about these conversions.

Single-Precision, Double-Precision, or Quad-Precision to Integer

These conversions are usually inexact. Out-of-range finite values, infinities, and NaNs cause an invalid operation exception. The overflow and underflow exceptions do not apply to these conversions. Values whose magnitude is too small to round to one are converted to zero. Signed zeros become integer zeros.

HP 9000 systems round these conversions in accordance with IEEE rounding rules. However, some programming languages, such as C, require that these conversions be performed with truncation. See “Truncation to an Integer Value” for information about problems that can result when floating-point values are truncated to integer.

Integer to Quad-Precision

These conversions are always exact and never generate an exception.

Integer to Double-Precision or Single-Precision

These conversions are exact except for conversions of 32-bit integer values greater than 2^24 - 1 to single-precision, or of 64-bit integer values greater than 2^53 - 1 to double-precision, which may generate an inexact result exception.
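
The precision limits come from the 24-bit and 53-bit significands of the single-precision and double-precision formats. The following sketch, with hypothetical function names, checks whether a given integer survives the conversion unchanged.

```c
#include <stdint.h>

/* Returns nonzero if the 32-bit integer i converts to
   single-precision without rounding. */
int int32_to_float_is_exact(int32_t i)
{
    volatile float f = (float)i;
    return (int64_t)f == (int64_t)i;   /* f always fits in 64 bits */
}

/* Returns nonzero if the 64-bit integer i converts to
   double-precision without rounding. */
int int64_to_double_is_exact(int64_t i)
{
    volatile double d = (double)i;
    if (d >= 9223372036854775808.0)    /* rounded up to 2^63: not exact */
        return 0;
    return (int64_t)d == i;
}
```

For example, 16777217 (2^24 + 1) needs 25 significant bits, so it rounds to 16777216 in single-precision, while 16777218 is even and converts exactly.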

The Remainder Operation

The remainder operation is an exact modulo function. When y is not equal to zero, the remainder r = remainder(x, y) is defined as

r = x - y * n 

where n is the integer nearest the exact value x/y. When |n - x/y| = 1/2, n is even. If r is zero, its sign is that of x.

Two examples:

  • The integer closest to the exact value 1.6/2.0 is 1. So the remainder of 1.6 and 2.0 is 1.6 - (2.0 * 1), or -0.4.

  • The integer closest to the exact value 5.0/2.0 is 2 (the exact value is halfway between 2 and 3, so n is even). So the remainder of 5.0 and 2.0 is 5.0 - (2.0 * 2), or 1.

The result of the remainder operation is not affected by the rounding mode. (The result is always exact, so rounding is not a factor.)

The C math library remainder function implements the IEEE remainder operation.

© 1997 Hewlett-Packard Development Company, L.P.