HP-UX Floating-Point Guide: HP 9000 Computers > Chapter 2 Floating-Point Principles and the IEEE Standard for Binary Floating-Point Arithmetic

Floating-Point Formats


The IEEE standard specifies four formats for representing floating-point values:

  • Single-precision

  • Double-precision (optional, though a double type wider than IEEE single-precision is required by standard C)

  • Single-extended precision (optional)

  • Double-extended precision (optional)

The IEEE standard does not require an implementation to support single-extended precision and double-extended precision in order to be standard-conforming.

HP 9000 systems fully support the single-precision and double-precision formats. They also support the quadruple-precision (quad-precision) format, which is similar to the double-extended precision format.

Single-Precision, Double-Precision, and Quad-Precision Formats

Single-precision, double-precision, and quad-precision values consist of three fields: sign bit, exponent, and fraction. The sign bit reflects the algebraic sign of the value. A 1 indicates a negative value; a 0 indicates a positive value. The exponent represents an integer value that is a power to which 2 is raised. The fraction, also called the significand, represents a value greater than or equal to 1.0 and less than 2.0 (for normalized values). The power of 2 given by the exponent is multiplied by the fraction to yield the actual numerical value.

The only difference among the single-precision, double-precision, and quad-precision formats is the number of bits allocated for the exponent and fraction. Figure 2-1 “IEEE Single-Precision Format”, Figure 2-2 “IEEE Double-Precision Format”, and Figure 2-3 “IEEE Quad-Precision Format” show the number of bits allocated in each format.

The single-precision format is 32 bits long: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fraction.

Figure 2-1 IEEE Single-Precision Format


The double-precision format is 64 bits long: 1 bit for the sign, 11 bits for the exponent, and 52 bits for the fraction.

The double-precision format is sometimes divided conceptually into two 32-bit words. The word containing the sign bit, the exponent field, and the first portion of the fraction field is referred to as the most significant word. The other word, containing the last portion of the fraction, is called the least significant word.

Figure 2-2 IEEE Double-Precision Format


The quad-precision format is 128 bits long: 1 bit for the sign, 15 bits for the exponent, and 112 bits for the fraction. This format is divided conceptually into four 32-bit words: the most significant word, two middle words, and the least significant word.

Figure 2-3 IEEE Quad-Precision Format

NOTE: On HP 9000 systems, the most significant word is stored at a lower memory address than the least significant word. If, for example, a double-precision value is stored at address 0x1000, the least significant word is stored at address 0x1004. If a quad-precision value is stored at address 0x1000, the least significant word is at address 0x100C. This ordering is often referred to as "big-endian."
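As a minimal sketch of this word ordering, the following Python fragment packs a double-precision value in big-endian byte order and splits it into its two 32-bit words. Python and its standard struct module are assumptions made for illustration only; they are not part of the HP 9000 toolchain. The `>` format prefix forces big-endian order regardless of the host machine.

```python
import struct

# Pack 6.0 as an IEEE double-precision value, most significant byte first.
raw = struct.pack(">d", 6.0)

# The first four bytes (lowest addresses) hold the sign bit, the exponent,
# and the first portion of the fraction; the last four hold the rest of the fraction.
most_significant_word = raw[:4].hex()
least_significant_word = raw[4:].hex()

print(most_significant_word, least_significant_word)
```

If the buffer started at address 0x1000, the second word would occupy the bytes at offsets 4 through 7, that is, address 0x1004.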

The Fraction Field

For normalized values (see “Normalized and Denormalized Values”), the fraction represents a value greater than or equal to 1.0 and less than 2.0. Each bit in the fraction represents the value 2 raised to a negative power. For example, the first bit represents the value $2^{-1}$ (0.5), the second bit represents $2^{-2}$ (0.25), and so on. The sum of 1.0 and the values represented by all these bits is the value of the fraction. The 1.0 in the sum corresponds to the zeroth fraction bit, $2^{0}$. Since this bit would always be set for a normalized value, it is not included in the actual format, but it is implied. It is sometimes referred to as the fraction implicit bit or the hidden bit.

For example, if the 23 bits in the fraction field of a single-precision number are

011 0100 0000 0000 0000 0000

and the exponent field is not all 1's or all 0's, the fraction value is

$1.0 + 2^{-2} + 2^{-3} + 2^{-5} = 1.0 + 0.25 + 0.125 + 0.03125 = 1.40625$

The 1.0 represents the fraction implicit bit, and the exponents of -2, -3, and -5 indicate that the second, third, and fifth bits of the fraction field are set.
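The same sum can be checked mechanically. The sketch below is illustrative only; the helper name `fraction_value` is hypothetical and not part of any HP library. It assumes a normalized value, so the implicit bit contributes 1.0.

```python
def fraction_value(bits: str) -> float:
    """Return 1.0 (the implicit bit) plus 2**-i for every set bit i, counting from 1."""
    bits = bits.replace(" ", "")
    return 1.0 + sum(2.0 ** -(i + 1) for i, b in enumerate(bits) if b == "1")

# The 23-bit fraction field from the example above.
value = fraction_value("011 0100 0000 0000 0000 0000")
print(value)
```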

The Exponent Field

The exponent field uses a biased representation. This means that the value represented by the exponent field is the value in the exponent field interpreted as an unsigned integer minus a constant value (the bias). The purpose of the bias is to allow all exponent calculations to be performed using unsigned arithmetic. For single-precision formats, the bias is 127; for double-precision formats, it is 1023; for quad-precision formats, it is 16383.

Floating-Point Format: Examples

The value 6.0 would be represented in single-precision format as shown in Figure 2-4 “IEEE Single-Precision Format: Example”.

Figure 2-4 IEEE Single-Precision Format: Example


The first bit is the sign bit. Because the sign bit is 0, the floating-point value is positive. The next eight bits make up the exponent. 1000 0001 equals 129, but the true value of the exponent is derived by subtracting the bias constant 127 from this value. So the true exponent value is 2. The fraction bits are 100 0000 0000 0000 0000 0000, which, when added to the implicit bit, equal 1 + 0.5, or 1.5.

In algebraic terms, a floating-point value is

$(-1)^{S} \times M \times 2^{E-B}$

where S is the value of the sign bit, M is the fraction (with implicit bit), E is the exponent, and B is the bias.

In our example, this would be

$(-1)^{0} \times 1.5 \times 2^{2} = 1.5 \times 4.0 = 6.0$
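The formula can be applied directly to a 32-bit single-precision pattern. The following is a sketch under stated assumptions: the function name `decode_single` is hypothetical, and the code handles normalized values only (exponent field neither all zeros nor all ones).

```python
def decode_single(word: int):
    """Split a 32-bit pattern into S, E, and M, then evaluate (-1)^S * M * 2^(E-B)."""
    bias = 127
    sign = word >> 31                    # S: 1 bit
    exponent = (word >> 23) & 0xFF       # E: 8 bits, still biased
    fraction = word & 0x7FFFFF           # 23 stored fraction bits
    significand = 1.0 + fraction / 2**23 # M: fraction with the implicit bit added
    value = (-1.0) ** sign * significand * 2.0 ** (exponent - bias)
    return sign, exponent, significand, value

print(decode_single(0x40C00000))
```

For the pattern 40C0 0000 this yields a sign of 0, a biased exponent of 129, a fraction of 1.5, and the value 6.0, matching the worked example.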

Table 2-1 “IEEE Representations of Floating-Point Values” shows some additional examples.

Table 2-1 IEEE Representations of Floating-Point Values

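| Format | Hexadecimal Representation | Sign | Exponent | Fraction | Value |
|---|---|---|---|---|---|
| SP | 40C0 0000 | + | 129 - 127 = 2 | 1.0 + 0.5 = 1.5 | $+1.5 \times 2^{2} = 6.0$ |
| DP | 4018 0000 0000 0000 | + | 1025 - 1023 = 2 | 1.0 + 0.5 = 1.5 | $+1.5 \times 2^{2} = 6.0$ |
| QP | 4001 8000 0000 0000 0000 0000 0000 0000 | + | 16385 - 16383 = 2 | 1.0 + 0.5 = 1.5 | $+1.5 \times 2^{2} = 6.0$ |
| SP | BF00 0000 | - | 126 - 127 = -1 | 1.0 + 0.0 = 1.0 | $-1.0 \times 2^{-1} = -0.5$ |
| DP | BFE0 0000 0000 0000 | - | 1022 - 1023 = -1 | 1.0 + 0.0 = 1.0 | $-1.0 \times 2^{-1} = -0.5$ |
| QP | BFFE 0000 0000 0000 0000 0000 0000 0000 | - | 16382 - 16383 = -1 | 1.0 + 0.0 = 1.0 | $-1.0 \times 2^{-1} = -0.5$ |
| SP | 7F00 0001 | + | 254 - 127 = 127 | $1.0 + 2^{-23}$ | $+1.00000011921 \times 2^{127}$ |
| DP | 7FE0 0000 0000 0001 | + | 2046 - 1023 = 1023 | $1.0 + 2^{-52}$ | +1.000...001 (51 zeros) $\times 2^{1023}$ |
| QP | 7FFE 0000 0000 0000 0000 0000 0000 0001 | + | 32766 - 16383 = 16383 | $1.0 + 2^{-112}$ | +1.000...001 (111 zeros) $\times 2^{16383}$ |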

 

Floating-Point Formats and the Limits of IEEE Representation

Because floating-point numbers have a finite number of bits in the fraction, only a finite subset of the continuum of real numbers can be represented exactly in IEEE format. The unit of granularity of the representable numbers is the ULP (Unit in the Last Place). ULPs measure the distance between two numbers in terms of their representation in binary. One ULP is the distance from one value to the next representable value in the direction away from 0.

One ULP is about 1 part in 17 million for single-precision values, 1 part in $10^{16}$ for double-precision values, and 1 part in $10^{34}$ for quad-precision values. For this reason, there is a general rule of thumb that single-precision arithmetic represents about 7 decimal places, double-precision about 16, and quad-precision about 34. If you try to read or write a value with a greater number of decimal digits, the last digits will probably not contain useful information.
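To make the size of one ULP concrete, the sketch below measures it at 1.0 in single and double precision. It is an illustration under stated assumptions: the helper name is hypothetical, Python floats are taken to be IEEE double-precision (true on common platforms), and the helper is meant for finite values below the largest representable magnitude. Relative to the value itself, one ULP in single precision lies between $2^{-24}$ and $2^{-23}$.

```python
import struct
import sys

def next_single_away_from_zero(x: float) -> float:
    """Return the single-precision value one ULP farther from zero than x."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    # Adding 1 to the bit pattern steps to the adjacent value of larger magnitude.
    (neighbor,) = struct.unpack(">f", struct.pack(">I", word + 1))
    return neighbor

ulp_single_at_one = next_single_away_from_zero(1.0) - 1.0
ulp_double_at_one = sys.float_info.epsilon  # gap between 1.0 and the next double

print(ulp_single_at_one, ulp_double_at_one)
```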

Because of this granularity in floating-point representation, most real numbers cannot be represented exactly. The result of an arithmetic operation (including the operation of converting from a decimal string into IEEE format) usually must be rounded to a nearby representable number. (For information on rounding, see “Inexact Result (Rounding)”.)

Even some simple fractions cannot be represented exactly. Consider the fraction 1/3. The exact value of this expression would require an infinite number of bits, because the value is an infinitely repeating fraction (0.33333...in decimal, 0.010101...in binary). Many values that can be represented exactly in a few decimal digits cannot be represented exactly in binary: for example, 1/10, which in decimal is 0.1, is in binary 0.000110011001100... Because simple numbers like 1/3 and 1/10 cannot be represented exactly, no floating-point operation can ever yield these exact values.

Although most real numbers cannot be represented exactly in floating-point arithmetic, a great many can. Any integer with a magnitude less than 16 million can be represented exactly in any format, and any 32-bit integer can be represented exactly in double-precision or quad-precision. Also, all numbers representable as some number over a power of 2, such as 0.1875 (3/16) or 27.375 (219/8), can be represented exactly if they have no more decimal digits than the chosen precision can faithfully represent.
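These claims can be tested with exact rational arithmetic. The sketch below assumes Python floats are IEEE double-precision and uses a hypothetical helper, `to_single`, to round a value to single precision; `Fraction(x)` converts a float to the exact rational number it stores.

```python
import struct
from fractions import Fraction

def to_single(x: float) -> float:
    """Round x to the nearest single-precision value, returned as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# 1/10 is not stored exactly; 3/16 and 219/8 are.
tenth_is_exact = Fraction(0.1) == Fraction(1, 10)
three_sixteenths_is_exact = Fraction(0.1875) == Fraction(3, 16)
mixed_is_exact = Fraction(27.375) == Fraction(219, 8)

# Integers survive single precision up to 2**24; 2**24 + 1 does not.
below_limit_is_exact = to_single(16777215.0) == 16777215.0
above_limit_is_exact = to_single(16777217.0) == 16777217.0

print(tenth_is_exact, three_sixteenths_is_exact, mixed_is_exact)
print(below_limit_is_exact, above_limit_is_exact)
```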

Normalized and Denormalized Values

Values that are represented by a sign bit, a fraction, and an exponent whose bits are not all zeros and not all ones are called normalized values (also called normal values). The size of the exponent field, and the fact that the value in the exponent field of a normalized value cannot be 0, determine the smallest magnitude that can be represented in normalized form. For single-precision numbers, the largest-magnitude negative exponent is -126 (that is, 1 - 127); for double-precision numbers, it is -1022 (that is, 1 - 1023); for quad-precision numbers, it is -16382 (that is, 1 - 16383).

Denormalized values (also called subnormal values) fill in the gap on the number line between the smallest-magnitude normalized value and zero. They also allow floating-point values to satisfy the arithmetic rule that x is equal to y if and only if x - y is equal to 0.

A denormalized value is represented by a zero exponent field and a nonzero fraction (if the fraction were also zero, the floating-point value would be zero). You can compute the value of a denormalized number by interpreting the fraction as an integer and then multiplying this integer by $2^{-149}$ for single-precision numbers, by $2^{-1074}$ for double-precision numbers, and by $2^{-16494}$ for quad-precision numbers. The maximum fraction is always $2^{k} - 1$, where k is the number of bits in the fraction. (Alternatively, you can compute the value by regarding the implicit bit as 0 and the exponent as 1 minus the bias.)
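The integer-times-scale rule can be sketched for single precision as follows. Both function names are hypothetical; the second decodes the same bit pattern through Python's struct module so the two results can be compared.

```python
import struct

def denormal_single_value(word: int) -> float:
    """Value of a single-precision denormal: fraction (as an integer) times 2**-149."""
    sign = -1.0 if word >> 31 else 1.0
    fraction = word & 0x7FFFFF
    return sign * fraction * 2.0 ** -149

def unpack_single(word: int) -> float:
    """Decode the same 32-bit pattern with the struct module, for comparison."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

for pattern in (0x00000001, 0x007FFFFF, 0x80000001):
    print(f"{pattern:08X}", denormal_single_value(pattern), unpack_single(pattern))
```

The largest denormal, 007F FFFF, comes out just below $2^{-126}$, the smallest normalized single-precision value.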

The purpose of denormalized values is to allow the spaces between zero and the smallest magnitude normalized values to be divided up, so that as values become smaller they underflow with a gradually increasing loss of accuracy.

In the range of representable values, normalized values flow smoothly into denormalized values, but there is an increasing loss of accuracy as denormalized values become smaller and smaller. Table 2-2 “Minimum and Maximum Positive Denormalized Values” shows the range of positive denormalized values. (The hexadecimal representation of the equivalent negative values begins with the digit 8; for example, the minimum negative denormalized value in single-precision is 8000 0001.)

When used as operands, denormalized values are treated exactly like normalized values in most instances. When a denormalized value is the result of an arithmetic operation, however, an underflow exception condition may occur. See “Underflow Conditions” for more information about underflow exceptions. Also, you should be aware that denormalized values can significantly degrade performance. This issue is addressed in “Denormalized Operands”.

Table 2-2 Minimum and Maximum Positive Denormalized Values

| Precision | Values | Hexadecimal Representation | Value |
|---|---|---|---|
| Single | Minimum denormalized | 0000 0001 | $2^{-149}$ |
| Single | Maximum denormalized | 007F FFFF | $2^{-149} \times (2^{23} - 1)$ |
| Single | Minimum normalized | 0080 0000 | $2^{-126}$ |
| Double | Minimum denormalized | 0000 0000 0000 0001 | $2^{-1074}$ |
| Double | Maximum denormalized | 000F FFFF FFFF FFFF | $2^{-1074} \times (2^{52} - 1)$ |
| Double | Minimum normalized | 0010 0000 0000 0000 | $2^{-1022}$ |
| Quad | Minimum denormalized | (24 zeros)...0000 0001 | $2^{-16494}$ |
| Quad | Maximum denormalized | 0000 FFFF...(24 more F's) | $2^{-16494} \times (2^{112} - 1)$ |
| Quad | Minimum normalized | 0001 0000...(24 more zeros) | $2^{-16382}$ |

 

Infinity

Values that are larger in magnitude than the maximum-magnitude normalized values are approximated by special bit patterns that represent positive and negative infinity.

According to the IEEE standard, infinities are represented by setting all the bits in the exponent field to 1 (value 255 for single-precision, 2047 for double-precision, 32767 for quad-precision) and setting the fraction bits to 0. There are actually two infinity values: negative infinity if the sign bit is 1 and positive infinity if the sign bit is 0.

The IEEE standard defines the properties of infinities. For example, it defines what happens when you add a number to an infinity or subtract one infinity from another. Table 2-3 “Arithmetic Properties of Infinity” shows some of these properties. The term finite value in the table refers to any floating-point value other than infinity or NaN (see “Not-a-Number (NaN)” for information about NaN values). For the multiplication and division operators, the sign of the result is determined by the usual arithmetic rules.

Table 2-3 Arithmetic Properties of Infinity

| Operand | Operator | Operand | Result |
|---|---|---|---|
| +Infinity | + | Finite Value | +Infinity |
| -Infinity | + | Finite Value | -Infinity |
| +Infinity | + | +Infinity | +Infinity |
| -Infinity | + | -Infinity | -Infinity |
| +Infinity | + | -Infinity | NaN (invalid operation) |
| +Infinity | - | Finite Value | +Infinity |
| -Infinity | - | Finite Value | -Infinity |
| Finite Value | - | +Infinity | -Infinity |
| Finite Value | - | -Infinity | +Infinity |
| +Infinity | - | -Infinity | +Infinity |
| -Infinity | - | +Infinity | -Infinity |
| +Infinity | - | +Infinity | NaN (invalid operation) |
| -Infinity | - | -Infinity | NaN (invalid operation) |
| ±Infinity | * | ±Finite Value (except 0) | ±Infinity |
| ±Infinity | * | 0 | NaN (invalid operation) |
| ±Infinity | * | ±Infinity | ±Infinity |
| ±Infinity | / | ±Finite Value | ±Infinity |
| ±Finite Value | / | ±Infinity | 0 |
| ±Infinity | / | ±Infinity | NaN (invalid operation) |
| +Infinity | sqrt() | | +Infinity |
| -Infinity | sqrt() | | NaN (invalid operation) |

 

NOTE: In multiplication and division operations with infinity operands, the sign is determined by the usual arithmetic rules.
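Several of these properties can be observed from a high-level language. The sketch below uses Python floats, which are assumed here to be IEEE double-precision values. It covers only the arithmetic operators; Python's `math.sqrt` raises an exception for negative infinity instead of returning a NaN, so the sqrt() rows are not shown.

```python
import math

inf = math.inf

sum_same_sign = inf + inf        # +Infinity
sum_opposite = inf + (-inf)      # NaN (invalid operation)
diff_same_sign = inf - inf       # NaN (invalid operation)
product_with_zero = inf * 0.0    # NaN (invalid operation)
product_sign = -inf * 2.0        # -Infinity: sign follows the usual rules
quotient_to_zero = 5.0 / -inf    # zero, with a negative sign
quotient_inf_inf = inf / inf     # NaN (invalid operation)

print(sum_same_sign, sum_opposite, product_sign, quotient_to_zero)
```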

Not-a-Number (NaN)

A NaN (Not-a-Number) is a special IEEE representation for values that are

  • The result of an invalid operation

  • The result returned by a library function when it would be incorrect to return a numeric value

  • An undetermined value

NaNs are represented by setting all of the bits in the exponent to 1 and setting at least one of the bits in the fraction field to 1.

There are two types of NaNs—a signaling NaN (SNaN) and a quiet NaN (QNaN). When an SNaN is used, it generates an invalid operation exception and, if a trap for this exception is enabled, it produces a trap. A QNaN does not generate an exception; instead, it silently propagates through an operation. Floating-point operations produce only QNaNs.

NOTE: The IEEE standard does not fully define the bit patterns used by the two types of NaNs. HP 9000 systems use the most significant bit of the fraction to differentiate between the two types. If the bit is set to 1, it is an SNaN; if the bit is 0, it is a QNaN.
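The convention in this note can be expressed as a bit test on a single-precision pattern. This is a sketch only: the function name `nan_kind_hp9000` is hypothetical, and the test encodes the HP 9000 rule stated above, which other platforms may not share.

```python
def nan_kind_hp9000(word: int) -> str:
    """Classify a 32-bit pattern as a quiet NaN, a signaling NaN, or not a NaN."""
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    # A NaN has an all-ones exponent and a nonzero fraction.
    if exponent != 0xFF or fraction == 0:
        return "not a NaN"
    # HP 9000 rule: most significant fraction bit 1 means signaling, 0 means quiet.
    return "signaling" if fraction & 0x400000 else "quiet"

for pattern in (0x7F800001, 0x7FC00000, 0x7F800000):
    print(f"{pattern:08X}", nan_kind_hp9000(pattern))
```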

Table 2-4 “Properties of NaNs” shows some of the properties of NaNs.

Table 2-4 Properties of NaNs

| Operand | Operator | Operand | Result |
|---|---|---|---|
| SNaN | + | Finite Value | QNaN (invalid operation) |
| QNaN | + | Finite Value | QNaN |
| SNaN1 | + | SNaN2 | QNaN (invalid operation) |
| QNaN1 | + | QNaN2 | QNaN1 or QNaN2 (implementation-dependent) |
| SNaN | float_to_int() | | Largest-magnitude integer (invalid operation) |
| QNaN | float_to_int() | | Largest-magnitude integer (invalid operation) |
| SNaN | sqrt() | | QNaN (invalid operation) |
| QNaN | sqrt() | | QNaN |

 

Zero

The IEEE standard defines both a positive zero and a negative zero. In both cases, the value is represented by setting all bits in the exponent and fraction to zero. The only difference, therefore, is that the sign bit is set for a negative zero. Table 2-5 “Operations With Zero” shows some of the properties of floating-point zeros.

Table 2-5 Operations With Zero

| Operand | Operator | Operand | Result |
|---|---|---|---|
| +Zero | .EQ. | -Zero | True |
| +Zero | + | +Zero | +Zero |
| -Zero | + | -Zero | -Zero |
| +Zero | + | -Zero | +Zero (in round-to-nearest mode) |
| -Zero | + | +Zero | +Zero (in round-to-nearest mode) |
| +Zero | - | +Zero | +Zero (in round-to-nearest mode) |
| -Zero | - | -Zero | +Zero (in round-to-nearest mode) |
| +Zero | - | -Zero | +Zero |
| -Zero | - | +Zero | -Zero |
| +Zero | * | -Zero | -Zero |
| ±Infinity | / | ±Zero | ±Infinity |
| ±Finite Value (except 0) | / | ±Zero | ±Infinity |
| -Zero | sqrt() | | -Zero |

 

NOTE: In multiplication and division operations with positive and negative zero operands, the sign is determined by the usual arithmetic rules.

The result of some operations is dependent on the rounding mode. The table assumes that the rounding is set to the default round-to-nearest mode. See “Inexact Result (Rounding)”.
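The sign of a zero result is not visible through an equality test, so the sketch below uses `math.copysign` to expose it. Python floats are assumed to be IEEE double-precision values operating in the default round-to-nearest mode. The division rows are omitted because Python raises an exception on division by zero instead of returning an infinity.

```python
import math

def sign_of(x: float) -> float:
    """Return +1.0 or -1.0 according to the sign bit of x, including for zeros."""
    return math.copysign(1.0, x)

zeros_compare_equal = (0.0 == -0.0)      # True
sum_mixed = sign_of(0.0 + -0.0)          # +Zero in round-to-nearest mode
diff_same = sign_of(-0.0 - -0.0)         # +Zero in round-to-nearest mode
diff_neg_pos = sign_of(-0.0 - 0.0)       # -Zero
product = sign_of(0.0 * -0.0)            # -Zero
root = sign_of(math.sqrt(-0.0))          # -Zero

print(zeros_compare_equal, sum_mixed, diff_same, diff_neg_pos, product, root)
```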

Complex Data Types

The IEEE standard does not address the topic of complex arithmetic, so it does not define complex data type formats. HP Fortran 90 and HP FORTRAN/9000 implement two complex data types, single-precision complex (COMPLEX, COMPLEX(KIND=4)) and double-precision complex (COMPLEX(KIND=8)). The COMPLEX type consists of a real and an imaginary component, each of which is a single-precision IEEE operand. The COMPLEX(KIND=8) data type is analogous to COMPLEX, except that each component is a double-precision IEEE operand type. HP Fortran 90 and HP FORTRAN/9000 support both complex data types and a full range of complex arithmetic operations.

NOTE: HP Fortran 90 and HP FORTRAN/9000 also support the nonstandard data type names DOUBLE COMPLEX and COMPLEX*16 (equivalent to COMPLEX(KIND=8)) and COMPLEX*8 (equivalent to COMPLEX).

IEEE Representation Summary

Table 2-6 “IEEE Single-Precision Value Summary (Hexadecimal Values)”, Table 2-7 “IEEE Single-Precision Value Summary (Decimal Values)”, Table 2-8 “IEEE Double-Precision Value Summary (Hexadecimal Values)”, and Table 2-9 “IEEE Double-Precision Value Summary (Decimal Values)” summarize how IEEE values are represented in binary. To determine the class (normalized, infinity, NaN, and so on) of a floating-point value at run time, you can

Table 2-6 IEEE Single-Precision Value Summary (Hexadecimal Values)

| Value | Exponent | Fraction | Positive (hexadecimal) | Negative (hexadecimal) |
|---|---|---|---|---|
| Zero | All zeros | All zeros | 0000 0000 | 8000 0000 |
| Denormalized | All zeros | Nonzero | 0000 0001 to 007F FFFF | 8000 0001 to 807F FFFF |
| Normalized | Neither all zeros nor all ones | Anything | 0080 0000 to 7F7F FFFF | 8080 0000 to FF7F FFFF |
| Infinity | All ones | All zeros | 7F80 0000 | FF80 0000 |
| Quiet NaN | All ones | Nonzero, most significant bit 0 | 7F80 0001 to 7FBF FFFF | FF80 0001 to FFBF FFFF |
| Signaling NaN | All ones | Most significant bit 1 | 7FC0 0000 to 7FFF FFFF | FFC0 0000 to FFFF FFFF |

 

Table 2-7 IEEE Single-Precision Value Summary (Decimal Values)

| Value | Positive (decimal) | Negative (decimal) |
|---|---|---|
| Zero | 0.0 | -0.0 |
| Denormalized | 1.4012985E-45 to 1.1754942E-38 | -1.4012985E-45 to -1.1754942E-38 |
| Normalized | 1.1754944E-38 to 3.4028235E+38 | -1.1754944E-38 to -3.4028235E+38 |
| Infinity | Not applicable | Not applicable |
| Quiet NaN | Not applicable | Not applicable |
| Signaling NaN | Not applicable | Not applicable |

 

Table 2-8 IEEE Double-Precision Value Summary (Hexadecimal Values)

| Value | Exponent | Fraction | Positive (hexadecimal) | Negative (hexadecimal) |
|---|---|---|---|---|
| Zero | All zeros | All zeros | 0000 0000 0000 0000 | 8000 0000 0000 0000 |
| Denormalized | All zeros | Nonzero | 0000 0000 0000 0001 to 000F FFFF FFFF FFFF | 8000 0000 0000 0001 to 800F FFFF FFFF FFFF |
| Normalized | Neither all zeros nor all ones | Anything | 0010 0000 0000 0000 to 7FEF FFFF FFFF FFFF | 8010 0000 0000 0000 to FFEF FFFF FFFF FFFF |
| Infinity | All ones | All zeros | 7FF0 0000 0000 0000 | FFF0 0000 0000 0000 |
| Quiet NaN | All ones | Nonzero, most significant bit 0 | 7FF0 0000 0000 0001 to 7FF7 FFFF FFFF FFFF | FFF0 0000 0000 0001 to FFF7 FFFF FFFF FFFF |
| Signaling NaN | All ones | Most significant bit 1 | 7FF8 0000 0000 0000 to 7FFF FFFF FFFF FFFF | FFF8 0000 0000 0000 to FFFF FFFF FFFF FFFF |
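The exponent and fraction rules in the hexadecimal summary tables translate into a small classifier that works on raw bit patterns. The sketch below is an assumption-laden illustration, not an HP-supplied routine: the function name `classify` and its parameters are hypothetical, and the quiet/signaling test follows the HP 9000 convention used in this section.

```python
def classify(word: int, exponent_bits: int, fraction_bits: int) -> str:
    """Classify a bit pattern given the widths of its exponent and fraction fields."""
    exponent_mask = (1 << exponent_bits) - 1
    exponent = (word >> fraction_bits) & exponent_mask  # the mask drops the sign bit
    fraction = word & ((1 << fraction_bits) - 1)
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == exponent_mask:
        if fraction == 0:
            return "infinity"
        # HP 9000 convention: most significant fraction bit 1 marks a signaling NaN.
        top_bit = fraction >> (fraction_bits - 1)
        return "signaling NaN" if top_bit else "quiet NaN"
    return "normalized"

print(classify(0x007FFFFF, 8, 23))             # single-precision pattern
print(classify(0x7FF8000000000000, 11, 52))    # double-precision pattern
```

Passing 8 and 23 selects the single-precision layout, and 11 and 52 the double-precision layout; 15 and 112 would describe quad-precision patterns supplied as 128-bit integers.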

 

Table 2-9 IEEE Double-Precision Value Summary (Decimal Values)

| Value | Positive (decimal) | Negative (decimal) |
|---|---|---|
| Zero | 0.0 | -0.0 |
| Denormalized | 4.94065E-324 to 2.22507E-308 | -4.94065E-324 to -2.22507E-308 |
| Normalized | 2.22507E-308 to 1.79769E+308 | -2.22507E-308 to -1.79769E+308 |
| Infinity | Not applicable | Not applicable |
| Quiet NaN | Not applicable | Not applicable |
| Signaling NaN | Not applicable | Not applicable |

 

© 1997 Hewlett-Packard Development Company, L.P.