The IEEE standard requires a complying system to support the
following floating-point operations:

- Addition
  Algebraic addition.
- Subtraction
  Algebraic subtraction.
- Multiplication
  Algebraic multiplication.
- Division
  Algebraic division.
- Comparison
  There are four possible relations between any two floating-point values:
  less than, equal, greater than, and unordered.
  The unordered relation occurs when one or both of the operands is
  a Not-a-Number (NaN). See “Comparison” for details.
- Square Root
  The square root operation never overflows or underflows.
- Conversion
  The following conversions must be supported by a conforming implementation,
  if the implementation supports single-precision, double-precision,
  and quad-precision formats:
    - Single-precision to double-precision
    - Single-precision to quad-precision
    - Double-precision to single-precision
    - Double-precision to quad-precision
    - Quad-precision to single-precision
    - Quad-precision to double-precision
    - Floating-point to integer
    - Integer to floating-point
    - Binary floating-point to decimal
    - Decimal to binary floating-point
  See “Conversion Between Operand Formats” for more information about these conversions.
- Round to Nearest Integral Value
  Rounds an argument to the nearest integral value (in floating-point format)
  based on the current rounding mode. Rounding modes are described
  in “Inexact Result (Rounding)”.
- Remainder
  The remainder operation takes two arguments, x and y, and is defined
  as x - y * n, where n is the integer nearest the exact value x/y.
  See “The Remainder Operation” for more information.

To understand the properties of each operation, you need a
full understanding of denormalized numbers, infinities, and NaNs
(see “Normalized and Denormalized Values”, “Infinity”, and “Not-a-Number (NaN)”). HP 9000
systems conform to the IEEE standard for all of these operations.

The standard requires that the result of each operation be
rounded from its mathematically exact value into an IEEE representation
in accordance with the rounding mode. In round-to-nearest mode (the
default), the result is within 1/2 ULP (unit in the last place) of the
exact value. (There is one exception to this rule: conversions between
binary and decimal need not be exact at the extremes of their ranges.)

Comparison

The comparison operation determines the truth of an assertion
about the relationship of two floating-point values. The four basic
assertions are

- operand1 < operand2
  The first operand is less than the second.
- operand1 = operand2
  The first operand is equal to the second.
- operand1 > operand2
  The first operand is greater than the second.
- operand1 ? operand2
  Unordered. This assertion is true if either operand is a NaN.

The basic assertions can be combined with each other. For
example, "a >= b"
asserts that a
is greater than or equal to b.
Similarly, "a <> b"
asserts that a
is either greater than or less than b;
for operands that are not NaNs, this assertion is the opposite of
"a = b".

NOTE: The assertion operators should not be confused with
actual programming language operators. Most languages, for example, do
not support the ? operator.

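C is one language without a ? operator, but C99 <math.h> offers comparison macros that can express the same relations. The sketch below is one possible mapping, not a required one; the enum and the function names classify and less_or_greater are hypothetical. The C standard specifies that these macros do not raise the invalid operation exception when an operand is a quiet NaN.

```c
#include <math.h>

enum relation { REL_LESS, REL_EQUAL, REL_GREATER, REL_UNORDERED };

/* Classify the relation between a and b. Exactly one of the four
   relations holds for any pair of floating-point values. */
enum relation classify(double a, double b)
{
    if (isunordered(a, b)) return REL_UNORDERED;  /* a ? b */
    if (isless(a, b))      return REL_LESS;       /* a < b */
    if (isgreater(a, b))   return REL_GREATER;    /* a > b */
    return REL_EQUAL;                             /* a = b */
}

/* The combined assertion "a <> b": less than or greater than.
   It is false for equal operands and for unordered operands. */
int less_or_greater(double a, double b)
{
    return islessgreater(a, b) != 0;
}
```

For example, classify(1.0, 2.0) returns REL_LESS, and classify(NAN, 2.0) returns REL_UNORDERED.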
An assertion may also be negated; for non-NaN operands, negation
of an assertion is the same as asserting the opposite assertion.

The IEEE standard defines
two versions of every possible assertion: the aware and the non-aware
version. The aware version of an assertion
treats a NaN as a special value that compares as neither less than
nor greater than any numeric value, and as unequal to any value,
including any other NaN and even itself. This definition yields
the interesting fact that the assertion "x = x"
will evaluate to FALSE if x
is a NaN. In fact, applications sometimes use this comparison operation
specifically to detect NaNs, although it is a dangerous practice
because some vendors' optimizers remove this operation
from the code.

The non-aware
version of an assertion behaves the same as the aware version, except
that if either or both operands is a NaN, it also raises an invalid
operation exception for the <, <=, >, and >= assertions.
The =, !=, and ? assertions are the only ones that are valid with
NaN operands.

Signaling NaNs cause an invalid operation exception for both
aware and non-aware assertions.

The behavior of the comparison operation for each of the possible
operand kinds is as follows:

- Normalized and Denormalized Values
  The operands are algebraically compared.
- Zero
  Zeros are greater than any nonzero negative value
  and less than any nonzero positive value. The sign of a zero is
  ignored, so that two zeros always compare as equal even if they
  have opposite signs.
- Infinity
  To the comparison operators, infinity is just another
  signed numeric value whose magnitude is greater than the largest
  normalized magnitude. Infinities with the same sign compare as equal
  to each other.
- NaN
  A NaN compares as unequal to
  all other operands, including other NaNs and itself. The rules above
  are used to evaluate assertions involving NaNs as TRUE or FALSE.
  If the assertion is non-aware, an invalid operation exception is
  also signaled for any comparison involving a <, <=,
  >, or >= assertion.
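The rules for zeros, infinities, and NaNs can be checked with a few lines of C. This is a minimal sketch that assumes IEEE arithmetic and a compiler not run in a mode that relaxes NaN semantics; all three function names are hypothetical. Because self-comparison can be removed by an optimizer, the standard isnan macro from <math.h> is usually the safer way to detect a NaN.

```c
#include <float.h>
#include <math.h>

/* Detect a NaN through the "x = x is FALSE" property. */
int is_nan_by_self_compare(double x)
{
    volatile double v = x;  /* volatile discourages folding of v != v */
    return v != v;
}

/* Zeros of opposite sign compare as equal even though their
   sign bits differ. */
int zeros_compare_equal(void)
{
    double pz = 0.0, nz = -0.0;
    return pz == nz && signbit(nz) && !signbit(pz);
}

/* Infinity behaves as a signed value beyond the largest finite
   magnitude, and equals an infinity of the same sign. */
int infinity_orders_like_a_number(void)
{
    double inf = INFINITY;
    return inf > DBL_MAX && -inf < -DBL_MAX && inf == INFINITY;
}
```

All three functions return nonzero on a conforming system when is_nan_by_self_compare is given a NaN; it returns 0 for any other argument, including an infinity.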

Conversion Between Operand Formats

The standard requires that it be possible to convert between
decimal and binary floating-point, and between binary floating-point
and integer formats. This section describes some of the properties
of various conversions. The operand type integer
refers to either signed or unsigned integers.

- Single-Precision to Double-Precision or Quad-Precision
  These conversions can never overflow, underflow,
  or be inexact. The only possible type of exception is an invalid
  operation if the operand is an SNaN.
- Double-Precision to Quad-Precision
  These conversions can never overflow, underflow,
  or be inexact. The only possible type of exception is an invalid
  operation if the operand is an SNaN.
- Quad-Precision or Double-Precision to Single-Precision
  These conversions can overflow or underflow and
  are usually inexact.
- Quad-Precision to Double-Precision
  These conversions can overflow or underflow and
  are usually inexact.
- Decimal to Single-Precision, Double-Precision, or Quad-Precision
  These conversions can overflow or underflow and
  are usually inexact. See “Conversions Between Binary and
  Decimal” for more information about these conversions.
- Single-Precision, Double-Precision, or Quad-Precision to Decimal
  These conversions can overflow or underflow and
  are usually inexact. See “Conversions Between Binary and
  Decimal” for more information about these conversions.
- Single-Precision, Double-Precision, or Quad-Precision to Integer
  These conversions are usually inexact. Out-of-range
  finite values, infinities, and NaNs cause an invalid operation exception.
  The underflow exception does not apply to these conversions; results
  that are too small to round up to one round down to zero. Signed
  zeros become integer zeros.

  HP 9000 systems round these conversions in accordance with
  IEEE rounding rules. However, some programming languages, such as
  C, require that these conversions be performed with truncation.
  See “Truncation to an Integer Value” for
  information about problems that can result when floating-point values
  are truncated to integer.
- Integer to Quad-Precision
  These conversions are always exact and never generate
  an exception.
- Integer to Double-Precision or Single-Precision
  These conversions are exact except for conversions
  of large 32-bit integer values to single-precision, or of large
  64-bit integer values to double-precision, which may generate an
  inexact result exception.


The Remainder Operation

The remainder operation is an exact modulo function. When
y is not equal to zero, the remainder
r = remainder(x, y)
is defined as r = x - y * n, where n is the integer nearest
the exact value x/y.
When |n - x/y| = 1/2,
n is even. If r
is zero, its sign is that of x.

Two examples:

- The integer closest to the exact value
  1.6/2.0 is 1. So the remainder of 1.6 and 2.0 is 1.6 - (2.0 * 1),
  or -0.4.
- The integer closest to the exact value 5.0/2.0 is
  2 (the exact value is halfway between 2 and 3, so n
  is even). So the remainder of 5.0 and 2.0 is 5.0 - (2.0 * 2),
  or 1.

The result of the remainder operation is not affected by the
rounding mode. (The result is always exact, so rounding is not a
factor.) The C math library remainder
function implements the IEEE remainder operation.