Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP
More options
HP.com home
HP-UX Floating-Point Guide: HP 9000 Computers > Chapter 3 Factors that Affect the Results of Floating-Point Computations

Floating-Point Coding Practices that Affect Application Results

» 

Technical documentation

Complete book in PDF
» Feedback
Content starts here

 » Table of Contents

 » Glossary

 » Index

The most common types of floating-point "bugs" reported to Hewlett-­Packard are not bugs at all, but rather a class of programming mistakes. These types of mistakes usually stem from one of the following invalid assumptions:

  • It is invalid to assume that an arithmetic expression in a computer language produces an exactly representable result and that all of the digits in a floating-point value are always meaningful. The fact that an application prints out 25 decimal digits of a result does not mean that the value printed actually has 25 significant digits. Many of the rightmost digits printed may be meaningless. Values may lose precision in the course of computation, and you must be alert for the kinds of operations that can cause precision loss.

    The significance limitations of the system are immutable. Entering a datum of 3.14159265358979323846 for pi is no better than entering 3.1415926535897932 (in double-precision). In fact, the former may be worse, because it might beguile a programmer into thinking that the system accepted all 21 digits, when in reality it accepted only 17 (9 in single-precision).

  • It is invalid to assume that an arithmetic expression in a computer language will abide by all algebraic rules, including the associative and distributive laws. You cannot make this assumption unless you have made a thorough analysis of the code to determine that rounding errors and other sources of inaccuracy will not invalidate the rules.

Sometimes the source of an erroneous result is related to a particular optimization generated by the compiler. This kind of problem can be particularly hard to solve, because it may disappear when you compile the program with a debugging option in order to debug it. See “Compiler Behavior and Compiler Version” for a discussion of the effects of compiler optimization on floating-point results.

The following sections describe common floating-point programming mistakes that can lead to incorrect application results:

  • Testing floating-point values for equality

  • Taking the difference of similar values

  • Adding values with very different magnitudes

  • Unintentional underflow

  • Truncation to an integer value

  • Ill-conditioned computations

The first mistake produces results that are simply wrong. The others are more insidious: they usually cause the result of an operation to lose a substantial amount of precision.

NOTE: The illustrations in these sections use decimal representations, rather than binary or hexadecimal ones, so as to correspond more closely to the way most users think about floating-point values. These examples illustrate general principles; they cannot necessarily be reproduced using IEEE-754 arithmetic.

Testing Floating-Point Values for Equality

Think very carefully before you test two floating-point values for equality. Because of the inherent inexactness of floating-point representations and because of the many sources of rounding inaccuracies in a floating-point computation, values that should be equal from a purely algebraic perspective in fact rarely will be.

For example, on many computers 1.2 - 0.1 is not exactly equal to 1.1. Consider the following example.

Example 3-4 Sample Program: fpeq.c

#include <stdio.h>

int main(void)
{
union {
double x;
int a[2];
} u1, u2;

u1.x = 1.2 - 0.1;
u2.x = 1.1;

if (u1.x == u2.x)
printf("1.2 - 0.1 equals 1.1\n");
else {
printf("1.2 - 0.1 is NOT equal to 1.1.\n");
printf("1.2 - 0.1 = %x%x\n1.1 = %x%x\n",
u1.a[0], u1.a[1], u2.a[0], u2.a[1]);
}
}

From an algebraic viewpoint, this routine should print that 1.2 - 0.1 equals 1.1. In fact, though, when executed on a Series 700 machine, the output is

1.2 - 0.1 is NOT equal to 1.1.
1.2 - 0.1 = 3ff1999999999999
1.1 = 3ff199999999999a

This anomaly occurs because the numbers 0.1 and 1.1 cannot be represented exactly in IEEE-754 double-precision format. Both values, 1.1 and (1.2 - 0.1), are very close to the real number 1.1, but neither is exact.

A better technique in most circumstances is to test that two values lie within a relative proximity to each other. For example, the Fortran test

IF( X .EQ. Y )

could be replaced by

IF( ABS(X - Y) .LE. EPSILON )

where EPSILON is a sufficiently small floating-point value. This test establishes whether X is within +/-EPSILON of Y.

Consider again the example, rounderr.f, in “How Basic Operations Affect Application Results”. A better way to code that example is to test whether the two values are sufficiently close to each other, rather than exactly the same:

Example 3-5 Sample Program: roundeps.f

      PROGRAM ROUNDEPS
REAL A, B, C, D, E, F, EPSILON

PRINT *, 'Enter epsilon:'
READ *, EPSILON
PRINT *, 'Enter 4 reals:'
READ *, A, B, C, D

E = (A + B) * (C + D)
F = (A * C) + (A * D) + (B * C) + (B * D)

IF (ABS(E - F) .LE. EPSILON) THEN
WRITE (*, 20) E, ' equals ', F
ELSE
WRITE (*, 20) E, ' not equal to ', F
WRITE (*, *) 'Math error!'
ENDIF

20 FORMAT(F, A, F)
END

You must choose a value for the error bound EPSILON that is appropriate to the magnitudes of the values being computed. If computations are yielding results with magnitudes around m, then for single-precision computations, a reasonable value for EPSILON might be m/1.0e5; for double-precision computations, a reasonable value might be m/1.0e14; for quad-precision computations, a reasonable value would be m/1.0e26. Choosing an appropriate value for the error bound, however, requires a thorough knowledge of the mathematical nature of your application. For example, a value of 1.0e-5 for EPSILON yields the following result, which may or may not be acceptable in your application:

$ f90 roundeps.f
roundeps.f
program ROUNDEPS

23 Lines Compiled
$ ./a.out
Enter epsilon:
1.0e-5

Enter 4 reals:
1.1 2.2 3.3 4.4

25.4100018 equals 25.4099998

The actual difference between the values is one bit (in hexadecimal, 41CB47AF versus 41CB47AE).

Taking the Difference of Similar Values

Calculations can lose precision when a program attempts to take the difference between two values that are similar in magnitude and also have some degree of inaccuracy to begin with. If the operands have M bits of insignificance, and the fractions are the same for the first N significant bits, then the difference between these two values will have M+N bits of insignificance because of the cancellation of significant bits during the subtraction.

For example, suppose that a single-precision application performs a computation using two different algorithms and then takes the difference between them to check the similarity of the results. Assume that the true result should be 1.0 and that both actual results have up to four bits of insignificance. If the actual results are

0x3f80001f
0x3f80003e

the magnitude of the difference is

0x36780000

However, the two values were the same for the first eighteen bits of their fractions, which were canceled during the subtraction. This leaves six bits in the difference, four of which are insignificant. So only two bits are significant, and the remaining 22 bits are insignificant. (Although the fraction field has only 23 bits, we include the fraction implicit bit in our count.)

Figure 3-1 “Taking the Difference of Similar Values” shows how this problem commonly manifests itself in practice. Suppose you have two very large values:

       A = 1.350107523E50
B = 1.350106321E50

If you subtract B from A, the result is

C = 0.0000001202E50

which is normalized to 1.202E44. Suppose there are already 4 digits of insignificance in A and B. Normalizing the result from 0.0000001202E50 to 1.202E44 adds another 6 digits of insignificance. So the result has 10 digits of insignificance in all, though the operands had only 4.

Figure 3-1 Taking the Difference of Similar Values

Taking the Difference of Similar Values

The modulo operation (mod(x, y) in Fortran) is an instance of this type of problem when x is much greater than y; remember that the modulo formula is

mod(x,y) = x - int(x/y) * y 

(See “The Remainder Operation” for details.) Trigonometric and transcendental functions use an enhanced version of mod(x, pi/2) during argument reduction. Therefore, although HP-UX math libraries perform extremely careful and accurate argument reduction, trigonometric functions like cos(x) can lose significance when x is large.

Adding Values with Very Different Magnitudes

When the system adds two floating-point values, it equalizes the operands' exponents before performing the calculation. It does so by right-shifting the smaller value so as to give it the same exponent as the larger. If the two values are very different in magnitude, this right shift causes a major loss of precision in the smaller value, as Figure 3-2 “Adding Values with Very Different Magnitudes” illustrates.

Figure 3-2 Adding Values with Very Different Magnitudes

Adding Values with Very Different Magnitudes

In fact, if the difference between the exponents of two single-precision values is 25 or greater, the smaller value is right-shifted out of existence, and adding the two values results in no change at all in the larger value.

For example, consider a criminally minded American bank employee who arranges to have a penny transferred to his or her own account every so often. If the bank's computer uses single-precision floating-point arithmetic, and if the account starts with a balance of zero, then after almost twenty-three million transfers, when the bank account reaches $262,144.00, it will suddenly stop growing. Well before then, the employee may notice irregularities due to rounding errors. After 5 million transfers, the account will have more than 5 million pennies in it; but after 15 million transfers, it will have fewer than 14 million pennies. The larcenous employee may become frustrated with the bank's computer system, but it is behaving perfectly properly. (This irregularity is the reason why bank computers use COBOL's fixed-point arithmetic, not floating-point.)

The following example makes the same point more simply. It accumulates a sum by adding 0.01 to the starting value a thousand times, in single-­precision, and then subtracting it.

Example 3-6 Sample Program: diffmag1.f

      PROGRAM DIFFMAG1
REAL X
INTEGER I

X = 0.01
DO 10 I = 1, 1000
X = X + 0.01
10 CONTINUE
PRINT *, 'X is', X
DO 20 I = 1, 1000
X = X - 0.01
20 CONTINUE
PRINT *, 'X is', X
END

The result is not exact. Instead of 10.01 and 0.01, you are likely to get results similar to the following:

 X is 10.01013
X is 9.99994E-03

The following example is even simpler. At higher magnitudes, adding 1 to a value has no effect at all.

Example 3-7 Sample Program: diffmag2.f

      REAL BIG, SMALL, SUM, DIFF
INTEGER I

SMALL = 1.0

DO I = 1, 10
BIG = 10.0**I
SUM = BIG + SMALL
DIFF = ABS(BIG - SUM)

PRINT 100, SUM, DIFF
END DO

100 FORMAT(F14.1, F8.3)
END

The addition of 1 stops affecting the result when the value reaches about 108:

          11.0   1.000
101.0 1.000
1001.0 1.000
10001.0 1.000
100001.0 1.000
1000001.0 1.000
10000001.0 1.000
100000000.0 .000
1000000000.0 .000
10000000000.0 .000

One way to minimize this kind of precision loss, if you can tolerate the added execution time, is to sort the elements of an array in ascending order before you add them together.

Unintentional Underflow

An underflow may occur when a calculation produces a result that is smaller in magnitude than the smallest normalized value. Sometimes a calculation can lose virtually all significance when it underflows, yielding a denormalized value that is then used in subsequent operations.

Suppose that the minimum decimal exponent of a system is -5, and that A is assigned the value 1.23456789E-3. Then the expression A/1.0E8 would underflow, as Figure 3-3 “Unintentional Underflow” shows.

Figure 3-3 Unintentional Underflow

Unintentional Underflow

Six significant digits are lost during the division, and they cannot be recovered by scaling the quotient back into the normalized range.

As another example, suppose the familiar form of the Pythagorean theorem is coded as

       Z = SQRT(X**2 + Y**2)

Suppose further that this expression is executed using the following values for the single-precision variables X and Y:

       X = 0.95830116E-20
Y = 0.25553963E-20

These values, when squared, produce the values 9.1834095E-41 and 6.5300508E-42 respectively, whose single-precision representations in hexadecimal are

0000ffff
00001234

Both of these values have only 16 bits of significance. The final result, 9.9178697E-21, is a reasonable-looking normalized number. However, because it is produced by a calculation that once lost all but 16 bits of significance, it can have at most 16 bits of significance itself. In fact, it actually has considerably less.

You can find out whether your application has underflowed by using the fetestexcept function. Alternatively, you can use the fesettrapenable function or the +FP compiler option to run the application with the underflow exception trap enabled. See “Exception Bits” and “Command-Line Mode Control: The +FP Compiler Option” for more information.

If you have a single-precision application that underflows frequently, you can solve the problem by changing to double-precision. If a double-­precision application underflows frequently, you could migrate it to quad-precision, but at a considerable loss of efficiency; you may want to restructure your application instead so as to avoid underflows, if possible.

Truncation to an Integer Value

The floor of a value is the greatest whole number less than the value. The ceiling of a value is the smallest whole number greater than the value.

Rounding, precision mode problems, and compiler optimizations can all contribute to inaccurate results, but under most circumstances the inaccuracy is very small and not noticeable. However, some operations can magnify the inaccuracy of a calculation to the point where the result is meaningless. This can occur in algorithms that truncate a floating-point value to make it an integer. Truncating a positive floating-point value, for instance, reduces its magnitude to the floor integer, regardless of how close to the ceiling value it may be. An expression may yield 1.999999 on one system and 2.00001 on another. Both results may be acceptable in terms of expected rounding errors. However, if the result is truncated to an integer, these two systems will produce 1 and 2, respectively, which can be an unacceptable difference.

The following program provides a simple example of this situation.

Example 3-8 Sample Program: trunc.c

#include <stdio.h>

int main(void)
{
double x;
int i, n;

x = 1.5;
for (i = 0; i < 10; i++) {
n = x;
printf("x is %g, n is %d\n", x, n);
x += 0.1;
}
}

No matter how close the value of x gets to 2.0, C conversion rules require the fractional part to be truncated. Therefore, the output of the program is as follows:

x is 1.5, n is 1
x is 1.6, n is 1
x is 1.7, n is 1
x is 1.8, n is 1
x is 1.9, n is 1
x is 2, n is 2
x is 2.1, n is 2
x is 2.2, n is 2
x is 2.3, n is 2
x is 2.4, n is 2

Many algorithms legitimately require truncation of results to integral values. One way to avoid the kind of problem illustrated by the preceding example is to add 0.5 to the result of a floating-point value that must be assigned to an integer:

n = x + 0.5;

Doing this will effectively round the result to the nearest integer (if x is greater than or equal to 0). Another solution is to call the function rint, which rounds a double value to the nearest integer:

n = rint(x);

If you use either of these solutions, the program output is

x is 1.5, n is 2
x is 1.6, n is 2
x is 1.7, n is 2
x is 1.8, n is 2
x is 1.9, n is 2
x is 2, n is 2
x is 2.1, n is 2
x is 2.2, n is 2
x is 2.3, n is 2
x is 2.4, n is 2

Ill-Conditioned Computations

If relatively small changes to the input of a program or to the intermediate results generated by a program cause relatively large changes in the final output, the program is said to be ill-conditioned or numerically unstable.

The following example illustrates an ill-conditioned program:

Example 3-9 Sample Program: sloppy_tangent.f

C The following program is an example of how small
C perturbations in the argument to a function near the
C function's singularity can cause large variations in the
C function's result. A program may be ill-conditioned because
C of a case like this, where minor inaccuracies created during
C preliminary calculations become major inaccuracies when they
C are passed through a function at a point where the function
C has a very steep slope.

PROGRAM SLOPPY_TANGENT
DOUBLE PRECISION X

X = 1.570796D0
WRITE(*,*) 'TAN( X-1.0D-5 ):', TAN( X-1.0D-5 )
WRITE(*,*) 'TAN( X ): ', TAN( X )
WRITE(*,*) 'TAN( X+1.0D-5 ):', TAN( X+1.0D-5 )
END

The output from this program shows how small changes in the argument to the TAN function lead to wildly varying results:

 TAN( X-1.0D-5 ): 96835.46637430933
TAN( X ): 3060023.306952844
TAN( X+1.0D-5 ): -103378.351773411

Ill-conditioned computations cause trouble not because their input values may change, but because the seemingly innocuous rounding errors and loss of significance can have large effects on the final results.

One way to establish that rounding errors and loss of significance are causing a program to produce incorrect results is to run the program in various rounding modes. (See “Rounding Mode: fegetround and fesetround” for information on how to change rounding modes using the fesetround function.) However, this technique does not always work. In the preceding example, for instance, changing the rounding mode has little effect on the results. However, if you do observe that simply changing the rounding mode causes large changes in the application results, then your application is most likely ill-conditioned.

A better technique, which does not require you to use additional functions or even to modify your code, is to make very small changes in the input data and to observe the amount by which the result changes. Wild swings in output magnitude may indicate an ill-conditioned application.

Fixing an ill-conditioned application requires a thorough understanding of the computations executed by the application. Chances are that the instability is caused by very few intermediate computations—possibly by just one. You must try to identify the location of the instability using your knowledge of the application, of the characteristics of the math functions called, and of the data being processed.

Printable version
Privacy statement Using this site means you accept its terms Feedback to webmaster
© 1997 Hewlett-Packard Development Company, L.P.