I am interested in getting a ± tolerance range of 1e-8 for a value of type double (8 bytes, 64 bits).
For example:
0.123456782456789
I would like to get the plus/minus tolerance range for this that is within 8 decimal places.
They would be:
low range = 0.123456781456789
high range = 0.123456783456789
Notice the difference in the 9th decimal place.
If the input is a double given in hexadecimal, what do I need to add to or subtract from the hexadecimal value? Since this is a double, it will contain 4 data words. Would I need to add or subtract 256 (in decimal) from the 3rd data word?
For example:
3FBF 9ADD 1B1F 0D35 is the hex for 0.123456782456789
So… would the low and high range be:
3FBF 9ADD 1*A*1F 0D35
3FBF 9ADD 1*C*1F 0D35
To compute an interval [a, b] that includes the points that are within |x|·10⁻⁸ of x, set a = x − |x|·10⁻⁸ and b = x + |x|·10⁻⁸.
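A minimal sketch of that computation in C, assuming `x` is a finite double. The helper name `relative_interval` is hypothetical and used only for illustration:

```c
#include <math.h>

/* Hypothetical helper: stores the endpoints of the interval
   [x - |x|*1e-8, x + |x|*1e-8] in *a and *b.
   Assumes x is finite. */
void relative_interval(double x, double *a, double *b)
{
    double tolerance = fabs(x) * 1e-8;  /* tolerance scaled by the magnitude of x */
    *a = x - tolerance;                 /* lower endpoint */
    *b = x + tolerance;                 /* upper endpoint */
}
```

For x = 0.123456782456789 this gives endpoints about 1.23e-9 on either side of x. Note that this is a relative tolerance; if a fixed absolute tolerance is wanted instead, replace `fabs(x) * 1e-8` with that constant.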
This is approximate because rounding errors could make a or b slightly inaccurate. If you want the interval to absolutely include all the points, you can use a slightly higher value than 1e-8, or you can use more advanced techniques.
Some warnings:
If you are using an interval as part of a test to determine whether some computed value is “almost equal to” another value, then determining how large the interval must be requires analysis of the floating-point operations used and the values involved. It is possible for floating-point values to produce errors ranging from zero to infinity, depending on the situation. It is not possible to state any single amount of tolerance that is useful in all situations, or even in “typical” situations.
Determining how large the interval may be requires determining what errors are acceptable for your application. Accepting unequal values as equal to allow for computation errors means that your program will sometimes accept as equal values that are truly unequal (if computed exactly, with no errors). So you need to figure out how large the interval can be before your program produces unacceptable results.
Obviously, if the interval must be larger than it may be, then your program is broken; this use of an interval cannot work. In such a case, you must redesign the floating-point operations to produce less error or redesign the program otherwise to avoid this.
It is generally a bad idea to use the representation of a floating-point number to read or alter its value. Doing so requires careful attention to details of your compiler or platform specification. (In particular, code that appears to work in tests may actually be broken in the sense that it is not supported by the compiler and will break if a different version of the compiler is used or the compilation switches, such as switches for debugging and optimization, are changed.) Additionally, code to access the representation of a floating-point number is generally not portable. (Most notably, some platforms store the bytes of a double in “little endian” order and some store the bytes in “big endian” order. There are other portability issues as well.) Even when code is correctly written to access the representation of a floating-point number, it may be slower than other methods.
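To illustrate the last warning with the hex value from the question, here is a rough sketch in C. It assumes that `double` is an IEEE-754 binary64 value and that `uint64_t` and `double` share the same byte order on the platform; neither is guaranteed by the C standard. The function names are hypothetical. It uses `memcpy` rather than pointer casts to copy the bytes.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: double is IEEE-754 binary64 with the same byte order as uint64_t. */

/* Copy the bytes of a double into a 64-bit integer. */
uint64_t double_to_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Copy a 64-bit pattern back into a double. */
double bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Change in value caused by adding 0x0100 to the third 16-bit word
   (counting from the most significant word), as proposed in the question. */
double third_word_step(uint64_t bits)
{
    uint64_t bumped = bits + (UINT64_C(0x0100) << 16);
    return bits_to_double(bumped) - bits_to_double(bits);
}
```

Under those assumptions, `third_word_step(UINT64_C(0x3FBF9ADD1B1F0D35))` is 2⁻³², about 2.33e-10, not 1e-9. The step also changes whenever the exponent of the value changes, so a fixed adjustment to the hex digits does not correspond to a fixed decimal tolerance.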
For tolerances as large as 10⁻⁸, it is likely sufficient to use 1e-8 to compute the interval. That is, you can use ordinary floating-point arithmetic, and it is not necessary to compute the ULP of a floating-point number or to access its representation. However, if you do want to compute the ULP of an IEEE-754 floating-point number, you can do so without accessing its representation using the code in this answer. That code is written for float rather than double, but you can change FLT to DBL to change the constants it uses. Also change float to double, of course.