HP-UX Floating-Point Guide: HP 9000 Computers
Chapter 2: Floating-Point Principles and the IEEE Standard for Binary Floating-Point Arithmetic
Floating-Point Formats
The IEEE standard specifies four formats for representing floating-point values: single-precision, double-precision, single-extended precision, and double-extended precision.
The IEEE standard does not require an implementation to support single-extended precision and double-extended precision in order to be standard-conforming. HP 9000 systems support the single-precision, double-precision, and double-extended precision formats. Double-extended precision format on these systems is also known as quadruple-precision or quad-precision format.

Single-precision, double-precision, and quad-precision values consist of three fields: sign bit, exponent, and fraction. The sign bit reflects the algebraic sign of the value. A 1 indicates a negative value; a 0 indicates a positive value. The exponent represents an integer value that is a power to which 2 is raised. The fraction, also called the significand, represents a value between 1.0 and 2.0 (for normalized values). The power of 2 given by the exponent is multiplied by the fraction to yield the actual numerical value.

The only difference among the single-precision, double-precision, and quad-precision formats is the number of bits allocated for the exponent and fraction. Figure 2-1 “IEEE Single-Precision Format”, Figure 2-2 “IEEE Double-Precision Format”, and Figure 2-3 “IEEE Quad-Precision Format” show the number of bits allocated in each format.

The single-precision format is 32 bits long: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fraction.

The double-precision format is 64 bits long: 1 bit for the sign, 11 bits for the exponent, and 52 bits for the fraction. The double-precision format is sometimes divided conceptually into two 32-bit words. The word containing the sign bit, the exponent field, and the first portion of the fraction field is referred to as the most significant word. The other word, containing the last portion of the fraction, is called the least significant word.

The quad-precision format is 128 bits long: 1 bit for the sign, 15 bits for the exponent, and 112 bits for the fraction. This format is divided conceptually into four 32-bit words: the most significant word, two middle words, and the least significant word.
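The field layout described above can be sketched in a few lines of code. The following is a minimal illustration in Python, which is used here only for convenience and is not part of the HP-UX toolset this guide describes. The names `LAYOUTS`, `split_fields`, and `split_words_double` are hypothetical, and quad-precision is omitted because Python's `struct` module has no 128-bit floating-point code.

```python
import struct

# Field widths from the text: (total bits, exponent bits, fraction bits),
# plus the struct codes used to reinterpret a value as an unsigned integer.
LAYOUTS = {
    "single": (32, 8, 23, ">f", ">I"),
    "double": (64, 11, 52, ">d", ">Q"),
}

def split_fields(value, precision="single"):
    """Return (sign, exponent_field, fraction_field) as unsigned integers."""
    total, exp_bits, frac_bits, float_code, int_code = LAYOUTS[precision]
    # Pack the value in the chosen format, then read the same bytes as an integer.
    (bits,) = struct.unpack(int_code, struct.pack(float_code, value))
    sign = bits >> (total - 1)
    exponent_field = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction_field = bits & ((1 << frac_bits) - 1)
    return sign, exponent_field, fraction_field

def split_words_double(value):
    """Return (most significant word, least significant word) of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    return bits >> 32, bits & 0xFFFFFFFF
```

For instance, `split_fields(6.0)` returns `(0, 129, 0x400000)`: a clear sign bit, an exponent field of 129, and a fraction field whose first bit is set.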
For normalized values, the fraction represents a value greater than or equal to 1.0 and less than 2.0. Each bit in the fraction represents the value 2 raised to a negative power. For example, the first bit represents the value $2^{-1}$ (0.5), the second bit is $2^{-2}$ (0.25), and so on. The sum of 1.0 and the values represented by all these bits is the value of the fraction. The 1.0 in the sum corresponds to the zeroth fraction bit, $2^{0}$. Since this bit would always be set for a normalized value, it is not included in the actual format, but it is implied. It is sometimes referred to as the fraction implicit bit or the hidden bit.

For example, if the 23 bits in the fraction field of a single-precision number are 011 0100 0000 0000 0000 0000 and the exponent field is not all 1's or all 0's, the fraction value is

$$1.0 + 2^{-2} + 2^{-3} + 2^{-5} = 1.0 + 0.25 + 0.125 + 0.03125 = 1.40625$$

The 1.0 represents the fraction implicit bit, and the exponents of -2, -3, and -5 indicate that the second, third, and fifth bits of the fraction field are set.

The exponent field uses a biased representation. That is, the true exponent is the value in the exponent field, interpreted as an unsigned integer, minus a constant value (the bias). The purpose of the bias is to allow all exponent calculations to be performed using unsigned arithmetic. For single-precision formats, the bias is 127; for double-precision formats, it is 1023; for quad-precision formats, it is 16383.

The value 6.0 would be represented in single-precision format as shown in Figure 2-4 “IEEE Single-Precision Format: Example”. The first bit is the sign bit. Because the sign bit is 0, the floating-point value is positive. The next eight bits make up the exponent. 1000 0001 equals 129, but the true value of the exponent is derived by subtracting the bias constant 127 from this value. So the true exponent value is 2. The fraction bits are 100 0000 0000 0000 0000 0000, which, when added to the implicit bit, equal 1 + 0.5, or 1.5.

In algebraic terms, a floating-point value is

$$(-1)^{S} \times M \times 2^{E - B}$$

where S is the value of the sign bit, M is the fraction (with implicit bit), E is the exponent, and B is the bias. In our example, this would be

$$(-1)^{0} \times 1.5 \times 2^{129 - 127} = 1.5 \times 2^{2} = 6.0$$
Table 2-1 “IEEE Representations of Floating-Point Values” shows some additional examples.
Table 2-1 IEEE Representations of Floating-Point Values
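The decoding rule described above (the sign bit selects the sign, the fraction field plus the implicit bit gives the fraction, and the exponent field minus the bias gives the power of 2) can be sketched as follows. This is a minimal Python illustration for normalized single-precision values only; the function name `decode_single` is hypothetical.

```python
def decode_single(bits):
    """Evaluate (-1)**S * M * 2**(E - B) for a normalized single-precision pattern."""
    sign = bits >> 31                  # S: 1 bit
    exponent_field = (bits >> 23) & 0xFF   # E: 8 bits, biased
    fraction_field = bits & 0x7FFFFF   # 23 explicit fraction bits
    if exponent_field in (0, 0xFF):
        # All-zeros and all-ones exponent fields are reserved for
        # zero/denormalized values and infinity/NaN respectively.
        raise ValueError("not a normalized value")
    m = 1.0 + fraction_field / 2**23   # M: implicit bit plus fraction bits
    bias = 127                         # B for single-precision
    return (-1.0) ** sign * m * 2.0 ** (exponent_field - bias)
```

With the bit pattern for the 6.0 example, `decode_single(0x40C00000)` returns `6.0`. With an exponent field of 127 and the fraction bits 011 0100 0000 0000 0000 0000, `decode_single(0x3FB40000)` returns `1.40625`.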
Because floating-point numbers have a finite number of bits in the fraction, only a finite subset of the continuum of real numbers can be represented exactly in IEEE format. The unit of granularity of the representable numbers is the ULP (Unit in the Last Place). ULPs measure the distance between two numbers in terms of their representation in binary. One ULP is the distance from one value to the next representable value in the direction away from 0. One ULP is about 1 part in 17 million for single-precision values, 1 part in $10^{16}$ for double-precision values, and 1 part in $10^{34}$ for quad-precision values. For this reason, there is a general rule of thumb that single-precision arithmetic represents about 7 or 8 decimal places, double-precision about 16, and quad-precision about 34. If you try to read or write a value with a greater number of decimal digits, the last digits will probably not contain useful information.

Because of this granularity in floating-point representation, most real numbers cannot be represented exactly. The result of an arithmetic operation (including the operation of converting from a decimal string into IEEE format) usually must be rounded to a nearby representable number. (For information on rounding, see “Inexact Result (Rounding)”.)

Even some simple fractions cannot be represented exactly. Consider the fraction 1/3. The exact value of this expression would require an infinite number of bits, because the value is an infinitely repeating fraction (0.33333... in decimal, 0.010101... in binary). Many values that can be represented exactly in a few decimal digits cannot be represented exactly in binary: for example, 1/10, which in decimal is 0.1, is in binary 0.000110011001100... Because simple numbers like 1/3 and 1/10 cannot be represented exactly, no floating-point operation can ever yield these exact values.

Although most real numbers cannot be represented exactly in floating-point arithmetic, a great many can.
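The granularity described above can be observed directly. The sketch below, in Python, measures one ULP of a single-precision value by stepping to the adjacent bit pattern, and uses exact rational arithmetic to show which decimal values are stored exactly. The helper names `to_single` and `ulp_single` are hypothetical, and the sketch assumes a positive, finite argument.

```python
import struct
from fractions import Fraction

def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def ulp_single(x):
    """Distance from positive finite x (rounded to single) to the next value away from 0."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # Adding 1 to the bit pattern of a positive value yields the next
    # representable value in the direction away from 0.
    (next_value,) = struct.unpack(">f", struct.pack(">I", bits + 1))
    return next_value - to_single(x)

# Fraction(x) gives the exact rational value of the stored double.
tenth_is_exact = Fraction(0.1) == Fraction(1, 10)
three_sixteenths_is_exact = Fraction(0.1875) == Fraction(3, 16)
```

Here `ulp_single(1.0)` is `2.0 ** -23`, `tenth_is_exact` is `False`, and `three_sixteenths_is_exact` is `True`.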
Any integer with a magnitude less than 16 million can be represented exactly in any format, and any 32-bit integer can be represented exactly in double-precision or quad-precision. Also, all numbers representable as some number over a power of 2, such as 0.1875 (3/16) or 27.375 (219/8), can be represented exactly if they have no more decimal digits than the chosen precision can faithfully represent.

Values that are represented by a sign bit, a fraction, and an exponent whose bits are not all zeros and not all ones are called normalized values (also called normal values). Because the value in the exponent field of a normalized value cannot be 0, the size of the exponent field determines the smallest value that can be represented in normalized format. For single-precision numbers, the largest-magnitude negative exponent is -126 (that is, 1 - 127); for double-precision numbers, it is -1022 (that is, 1 - 1023); for quad-precision numbers, it is -16382 (that is, 1 - 16383).

Denormalized values (also called subnormal values) fill in the gap on the number line between the smallest-magnitude normalized value and zero. They also allow floating-point values to satisfy the arithmetic rule that x is equal to y if and only if x - y is equal to 0. A denormalized value is represented by a zero exponent field and a nonzero fraction (if the fraction were also zero, the floating-point value would be zero). You can compute the value of a denormalized number by interpreting the fraction as an integer and then multiplying this integer by $2^{-149}$ for single-precision numbers, by $2^{-1074}$ for double-precision numbers, and by $2^{-16494}$ for quad-precision numbers. The maximum fraction is always $2^{k} - 1$, where k is the number of bits in the fraction. (Alternatively, you can compute the value by regarding the implicit bit as 0 and the exponent as 1 minus the bias.)
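Both ways of computing a denormalized value described above can be sketched for single-precision as follows. This is a minimal Python illustration; the function names are hypothetical, and `stored_value` is included only to compare the results against the value that the bit pattern actually encodes.

```python
import struct

def denormal_single(bits):
    """Fraction field read as an integer, multiplied by 2**-149."""
    exponent_field = (bits >> 23) & 0xFF
    fraction_field = bits & 0x7FFFFF
    if exponent_field != 0 or fraction_field == 0:
        raise ValueError("not a denormalized value")
    sign = -1.0 if bits >> 31 else 1.0
    return sign * fraction_field * 2.0 ** -149

def denormal_single_alt(bits):
    """Same value, with the implicit bit taken as 0 and the exponent as 1 - bias."""
    fraction_field = bits & 0x7FFFFF
    sign = -1.0 if bits >> 31 else 1.0
    m = 0.0 + fraction_field / 2**23   # implicit bit is 0, not 1
    return sign * m * 2.0 ** (1 - 127)

def stored_value(bits):
    """Value encoded by a 32-bit pattern, as interpreted by the platform."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

The two methods agree because $2^{-23} \times 2^{-126} = 2^{-149}$.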
The purpose of denormalized values is to allow the space between the smallest normalized values to be divided up, so that as values become smaller they underflow with a gradually increasing loss of accuracy. In the range of representable values, normalized values flow smoothly into denormalized values, but there is an increasing loss of accuracy as denormalized values become smaller and smaller.

Table 2-2 “Minimum and Maximum Positive Denormalized Values” shows the range of positive denormalized values. (The hexadecimal representation of the equivalent negative values begins with the digit 8; for example, the minimum negative denormalized value in single-precision is 8000 0001.)

When used as operands, denormalized values are treated exactly like normalized values in most instances. When a denormalized value is the result of an arithmetic operation, however, an underflow exception condition may occur. See “Underflow Conditions” for more information about underflow exceptions. Also, you should be aware that denormalized values can significantly degrade performance. This issue is addressed in “Denormalized Operands”.

Table 2-2 Minimum and Maximum Positive Denormalized Values
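The endpoints of the denormalized range follow from the bit patterns alone: the minimum has only the last fraction bit set, and the maximum has every fraction bit set, with a zero exponent field in both cases. The Python sketch below builds these patterns for single-precision and double-precision; the names are hypothetical, and quad-precision is omitted because Python has no native 128-bit floating-point type.

```python
import struct

def from_bits_single(bits):
    """Value encoded by a 32-bit single-precision pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def from_bits_double(bits):
    """Value encoded by a 64-bit double-precision pattern."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

MIN_SINGLE = from_bits_single(0x00000001)          # last fraction bit only
MAX_SINGLE = from_bits_single(0x007FFFFF)          # all 23 fraction bits
MIN_DOUBLE = from_bits_double(0x0000000000000001)
MAX_DOUBLE = from_bits_double(0x000FFFFFFFFFFFFF)  # all 52 fraction bits

# The smallest normalized value sits exactly one step above the largest
# denormalized value, so the two ranges join without a gap.
SMALLEST_NORMAL_SINGLE = from_bits_single(0x00800000)
STEP_ACROSS_BOUNDARY = SMALLEST_NORMAL_SINGLE - MAX_SINGLE
```

`STEP_ACROSS_BOUNDARY` equals `MIN_SINGLE`, which illustrates how normalized values flow smoothly into denormalized ones.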
Values that are larger in magnitude than the maximum-magnitude normalized values are approximated by special bit patterns that represent positive and negative infinity. According to the IEEE standard, infinities are represented by setting all the bits in the exponent field to 1 (value 255 for single-precision, 2047 for double-precision, 32767 for quad-precision) and setting the fraction bits to 0. There are actually two infinity values, negative infinity if the sign bit is 1 and positive infinity if the sign bit is 0.

The IEEE standard defines the properties of infinities. For example, it defines what happens when you add a number to an infinity or subtract one infinity from another. Table 2-3 “Arithmetic Properties of Infinity” shows some of these properties. The term finite value in the table refers to any floating-point value other than infinity or NaN (see “Not-a-Number (NaN)” for information about NaN values). For the multiplication and division operators, the sign of the result is determined by the usual arithmetic rules.

Table 2-3 Arithmetic Properties of Infinity
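The infinity encoding and a few of the arithmetic properties that IEEE arithmetic defines can be checked with a short sketch. The Python code below is illustrative only; the helper names are hypothetical, and the properties shown are a small sample chosen for this example rather than a reproduction of the table.

```python
import math
import struct

def single_pattern(x):
    """32-bit pattern of x after conversion to single-precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def double_pattern(x):
    """64-bit pattern of a double-precision value."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits

INF = float("inf")

# A sample of infinity arithmetic, evaluated in double-precision.
finite_plus_inf = 1.0 + INF          # stays +infinity
inf_minus_inf = INF - INF            # no meaningful result: NaN
negative_times_inf = -2.0 * INF      # sign follows the usual rules
finite_over_inf = 1.0 / INF          # zero
```

The patterns are `0x7F800000` and `0xFF800000` for single-precision positive and negative infinity: an exponent field of 255 and a zero fraction, differing only in the sign bit.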
A NaN (Not-a-Number) is a special IEEE representation for error values. A NaN can be
NaNs are represented by setting all of the bits in the exponent to 1 and setting at least one of the bits in the fraction field to 1. There are two types of NaNs: a signaling NaN (SNaN) and a quiet NaN (QNaN). When an SNaN is used, it generates an invalid operation exception and, if a trap for this exception is enabled, it produces a trap. A QNaN does not generate an exception; instead, it silently propagates through an operation. Floating-point operations produce only QNaNs.
Table 2-4 “Properties of NaNs” shows some of the properties of NaNs.
Table 2-4 Properties of NaNs
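The NaN encoding rule (all exponent bits set, at least one fraction bit set) and the quiet propagation of a QNaN can be sketched as follows. This Python illustration uses double-precision and hypothetical helper names. It does not distinguish signaling from quiet NaNs, because that distinction and the associated traps are platform-specific and are not exposed by standard Python.

```python
import math
import struct

def double_pattern(x):
    """64-bit pattern of a double-precision value."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits

def is_nan_pattern_double(bits):
    """True if the exponent field is all ones and the fraction is nonzero."""
    exponent_field = (bits >> 52) & 0x7FF
    fraction_field = bits & ((1 << 52) - 1)
    return exponent_field == 0x7FF and fraction_field != 0

nan = float("nan")

# A quiet NaN propagates through arithmetic without raising an error,
# and it compares unequal to everything, including itself.
propagated = nan + 1.0
equals_itself = nan == nan
```

After this code runs, `propagated` is still a NaN and `equals_itself` is `False`.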
The IEEE standard defines both a positive zero and a negative zero. In both cases, the value is represented by setting all bits in the exponent and fraction to zero. The only difference, therefore, is that the sign bit is set for a negative zero. Table 2-5 “Operations With Zero” shows some of the properties of floating-point zeros.
Table 2-5 Operations With Zero
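The two zero encodings can be inspected with a short sketch. The Python code below is illustrative, with a hypothetical helper name; the behaviors shown are general IEEE properties of signed zeros chosen for this example, not a reproduction of the table.

```python
import math
import struct

def single_pattern(x):
    """32-bit pattern of x after conversion to single-precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

pos_zero = 0.0
neg_zero = -0.0

# The two zeros compare equal even though their bit patterns differ.
zeros_compare_equal = pos_zero == neg_zero

# copysign reads the sign bit, so it can tell the two zeros apart.
sign_of_neg_zero = math.copysign(1.0, neg_zero)
sign_of_sum = math.copysign(1.0, neg_zero + pos_zero)
```

Only the sign bit differs: `single_pattern(0.0)` is `0x00000000` and `single_pattern(-0.0)` is `0x80000000`.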
The IEEE standard does not address the topic of complex arithmetic, so it does not define complex data type formats. HP Fortran 90 and HP FORTRAN/9000 implement two complex data types, single-precision complex (COMPLEX, COMPLEX(KIND=4)) and double-precision complex (COMPLEX(KIND=8)). The COMPLEX type consists of a real and an imaginary component, each of which is a single-precision IEEE operand. The COMPLEX(KIND=8) data type is analogous to COMPLEX, except that each component is a double-precision IEEE operand type. HP Fortran 90 and HP FORTRAN/9000 support both complex data types and a full range of complex arithmetic operations.
Table 2-6 “IEEE Single-Precision Value Summary (Hexadecimal Values)”, Table 2-7 “IEEE Single-Precision Value Summary (Decimal Values)”, Table 2-8 “IEEE Double-Precision Value Summary (Hexadecimal Values)”, and Table 2-9 “IEEE Double-Precision Value Summary (Decimal Values)” summarize how IEEE values are represented in binary. To determine the class (normalized, infinity, NaN, and so on) of a floating-point value at run time, you can
Table 2-6 IEEE Single-Precision Value Summary (Hexadecimal Values)
Table 2-7 IEEE Single-Precision Value Summary (Decimal Values)
Table 2-8 IEEE Double-Precision Value Summary (Hexadecimal Values)
Table 2-9 IEEE Double-Precision Value Summary (Decimal Values)
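The classes summarized in these tables are distinguished entirely by the exponent and fraction fields, so a value can be classified by inspecting its bit pattern. The sketch below shows one way to do this for single-precision in Python. The function `classify_single` is hypothetical and is not one of the HP-UX library routines; it only illustrates the encoding rules described in this section.

```python
def classify_single(bits):
    """Classify a 32-bit single-precision pattern by its exponent and fraction fields."""
    exponent_field = (bits >> 23) & 0xFF
    fraction_field = bits & 0x7FFFFF
    if exponent_field == 0xFF:
        # All exponent bits set: infinity if the fraction is zero, else NaN.
        return "infinity" if fraction_field == 0 else "NaN"
    if exponent_field == 0:
        # No exponent bits set: zero if the fraction is zero, else denormalized.
        return "zero" if fraction_field == 0 else "denormalized"
    return "normalized"
```

The sign bit plays no part in the classification, so `classify_single(0x80000000)` (negative zero) returns `"zero"` just as `classify_single(0x00000000)` does.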