Please Sign up or sign in to vote.

Introduction

Many scientific applications require accuracy greater than that provided by ‘double precision’. However, no hardware implementation of ‘quadruple precision’ exists, and the cost of implementation in software is often prohibitive. This article describes an effective compromise, implemented using the ‘double-precision’ type.

Disclaimer

The current implementation has not been thoroughly tested. While every attempt has been made to ensure accuracy, there are probably cases in which the code does not behave as expected. Do not trust your career or reputation to this code without thorough testing!

Enhancements

Version 1.1

Accuracy improvements for most functions (the acos(), atan(), asinh() functions still need improvement). A test program, TestQD, has been added to the solution in order to provide basic "sanity checks" of the functions. Bug fixes - fmod(), round(), stream output. The fma() function has been renamed Fma() in order to avoid incompatibility with the VS2013 run-time library.

Version 1.1.1

Bug fix - pow(negative, integer) now returns the correct value Bug fix - 0.0 / 0.0 == NaN The Visual Studio solution provided uses Visual Studio 2013. For those still using VS 2012/2010, the solution and project files from version 1.1 should still work (no file names have been changed/added).

Version 1.1.3

Bug fixes - (unsigned) int, (unsigned) long, and (unsigned) long long constructors Bug fixes - toInt(), toLong(), and toLongLong() converters

Background

I came across this solution for extending precision when performing a simulation of Brownian movement. The program required updating the positions of atoms in the simulation by extremely small amounts, which meant that the increment was rounded (and sometimes - shifted completely out). The only way to ensure that the calculations were accurate was to use higher precision.

Using the code

The code is provided as a Visual Studio 2010 project. The files qd\config.h and include\qd\qd_config.h contain macros that should be modified in order to support different compilers. In particular, some of these APIs are present in the standard C++11 library (but not in earlier versions).

With one exception, the code is designed as a drop-in replacement for the 'double' type. For example, if your original code is:

#include < iostream > #include < math > int main( void ) { double two = 2 . 0 ; std::cout << std::sqrt( two ); return 0 ; }

then the code could be rewritten to use the 'double-double' type as follows:

#include < iostream > #include " dd_real.h" int main( void ) { dd_real two = " 2.0" ; std::cout << std::sqrt( two ); return 0 ; }

The exception is initialization to a constant. In order to initialize a variable to full 'double-double' precision, the value must be passed as a string. This is because a value that is not passed as a string will be rounded at compilation time to 'double' precision, which misses the point.

The IEEE 754 Standard for Floating-Point Arithmetic vs 'double-double'

The IEEE 754-2008 Standard for Floating-Point Arithmetic [1] defines three types for binary arithmetic – binary32, binary64, and binary128. Their properties are summarized in table 1.

Sign Exponent Mantissa Hardware? Table 1 - IEEE 754 binary types binary32 1 bit 8 bits 24 bits Y - 'float' binary64 1 bit 11 bits 53 bits Y - 'double' binary128 1 bit 15 bits 113 bits N

In the binary representations, the MSB of the mantissa is implicit. This accounts for the ‘extra’ bit in the above representations. While the binary32 and binary64 types are typically implemented in hardware, binary128 currently has no hardware implementation. Given that software implementations of binary64 are typically 10 to 100 times slower than hardware implementations, this implies that a software implementation of binary128 will be prohibitively slow.

The double-double type is a compromise between the extra precision of binary128 and the speed of binary64. It relies on properties of IEEE 754 floating point numbers to represent numbers in pairs of binary64 values, such that the sum of the pair represents the number. For best results, there should be no overlap between the bits in the pair, i.e. the absolute value of the smaller number should be no larger than 2-53 of the absolute value of the larger.

The precision achieved in this manner is slightly lower than binary128 (106 bits rather than 113), but the exponent range is slightly smaller than that of binary64. The reason for this is that many operations will only work if neither number in the pair underflows. This means that the larger number must be at least 253*epsilon (the smallest binary64 number).

In this article, I present the basic algorithms without proof. Those interested in the proofs may find them in the references.

Preconditions

The implementation of double-double relies on the following conditions:

All rounding is to “nearest or even” No “hidden bits” are involved in the calculation – operations are always rounded to binary64 with no intermediate step(s)

[2] implies that using an 80x87 in its standard mode (64-bit precision) won’t work; it has to be set to 53-bit precision. On the other hand, a program using SSE2 instructions will work fine. Note that Windows by default places the 80x87 in 53-bit mode, in an attempt to be compatible with the SSE/SSE2 instructions.

Basic Operations [2]

The implementation of ‘double-double’ arithmetic relies on the ability to perform certain floating-point operations exactly. In all of the following, the notation RN( ) implies that the operation should be performed with rounding to nearest or even. It must be emphasized that these operations are only valid if neither overflow nor underflow occur during the operations. For example, addition of two like-signed infinities will produce a result of (inf, NaN), rather than the expected (inf, 0).

It is possible to make these basic operations behave as expected, but only at the cost of an additional test and (possible) branch.

The basic operations are as follows:

Addition/Subtraction [3, 4, 5]

Adds/subtracts two binary64 numbers, returning an exact result – the larger in magnitude is the rounded result of the addition/subtraction, and the smaller number is the residue from the exact result.

Two variants exist:

(s, t) = Fast2Sum(a, b)[3] – requires that exponent(a) >= exponent(b)

s = RN( a + b ) t1 = RN( s - a ) t = RN( b - t1 )

(s, t) = 2Sum(a, b)[4,5] – no preconditions on a, b

s = RN( a + b ) t1 = RN( s - b ) t2 = RN( s - t1 ) t3 = RN( a - t1 ) t4 = RN( b - t2 ) t = RN( t3 + t4 )

Veltkamp Splitting [3]

Takes a floating-point number x of precision p, and splits it into two floating-point numbers (x h , x l ) so that:

x h has p-s significant bits

has p-s significant bits x l has s significant bits

This assumes that no overflow occurs during the calculations. A quirk of binary floating-point is that x l will actually fit into s-1 bits (the sign bit of x l is used as an additional bit). For binary64 (p=53), using s=27 means that both halves of the number will fit into 26 bits.

(x h , x l ) = Split(x, s)

Define: C = 2^s + 1 t1 = RN( C * x ) t2 = RN( x – t1 ) xh = RN( t1 + t2 ) xl = RN( x – xh )

Multiplication [2, 3]

Multiplies two binary64 numbers, returning an exact result - the larger in magnitude is the rounded result of the multiplication, while the smaller is the resiudue from the exact result.

(r1, r2) = 2Mult(x, y) – AKA Dekker Product

Define: s = 27 (xh, xl) = Split( x, s ) (yh, yl ) = Split( y, s ) r1 = RN( x * y ) t1 = RN( -r1 + RN( xh * yh ) ) t2 = RN( t1 + RN( xh * yl ) ) t3 = RN( t2 + RN( xl * yh ) ) r2 = RN( t3 + RN( xl * yl ) )

If a hardware fused-multiply-accumulate instruction is available, the following, much faster, implementation is possible:

(r1, r2) = 2MultFMA(x, y)

r1 = RN( x * y ) r2 = fma( x, y, -r1 )

The ‘double-double’ Type

In the following, I base my examples on the code published by Bailey et al. in their QD library [6,7]. While the basic implementation is theirs, I have made the following enhancements to the code:

Functional:

The implementations of trigonometric and hyperbolic functions have been re-written for greater accuracy and speed. In particular, the reduction of trigonometric function arguments is now performed using the excellent code in fdlibm [8]

Additional functions, not defined in the original QD library, have been added (e.g. expm1, logp1, cbrt, etc.)

Code organization:

The original code may be configured to use faster, but less accurate basic functions (addition, multiplication, division). The less accurate versions of these functions have been removed

All code for the ‘double-double’ type (with exception of class dd_real) has been moved into the qd namespace

All helper functions have been moved into the dd_real class, so as not to pollute the global namespace

If a mathematical function is defined in the C++ standard library for the ‘double’ type, it is also defined in the std namespace for the dd_real type

General

It must be emphasized that the ‘double-double’ type is an approximation to a 106-bit floating-point type. It differs from the IEEE 754 binary128 (113 bits) type in many ways:

Correct rounding is not guaranteed

While every attempt is made to provide ‘faithful’ rounding (i.e. return one of the values closest to the infinitely-precise value), no guarantee is made that the returned value is actually the closest of the two.

‘Wobbling’ precision

A number such as 1 + 2-200 may easily be represented in ‘double-double’ format, but is not a valid IEEE 754 binary128 number. This ‘wobbling’ precision complicates error analysis and other issues.

Infinities and NaNs

Full implementation of infinity and NaN arithmetic requires testing for normal operands and for normal results at all stages of the operation. It is possible to write a cheap test that will work for non-normal (infinite / nan ) and for most normal operands, but it is possible that results within (1 - 2-53) of the largest magnitude will trigger an erroneous overflow condition.

Theorems and lemmas relating to floating-point

The IEEE 754 floating point type obeys a large number of theorems that make error analysis etc. more tractable. Most of these theorems do not apply to 'double-double'

Basic Arithmetic

The basic idea is to use the exact operations defined above to calculate an approximation to the exact operation. Note that only the sum of the two components of the result is meaningful. In particular, the high part of the result does not necessarily equal the result of operating on the high parts, e.g. z h <> x h + y h !

Note that optimized versions of the operators exist, e.g. for adding a 'double' to a 'double-double', etc.

Addition / Subtraction

(z h , z l ) = ddAdd( (x h , x l ), (y h , y l ) )

(sh, sl) = 2Sum(xh, yh) (th, tl) = 2Sum(xl, yl) c = RN(sl + th) (vh, vl) = Fast2Sum(sh, c) w = RN( tl + vl ) (zh, zl) = Fast2Sum(vh, w)

Multiplication

(p h , p l ) = ddMul( (x h , x l ), (y h , y l ) )

Note that the product of x l and y l is not calculated. This is because it can never contribute to the lower half of the result.

(ph, pl) = 2Prod(xh, yh) pl = RN(pl + RN(xh * yl)) pl = RN(pl + RN(xl * yh))



Division

(p h , p l ) = ddDiv( (x h , x l ), (y h , y l ) )

No exact method for performing division exists. Instead, a first approximation is calculated as x h / y h , which is then refined by Newton-Raphson iteration.

Polynomial Functions

sqrt() and cbrt() are calculated using variants of Newton-Raphson iteration.

hypot() is calculated using an algorithm that ensures accuracy for all values.

pow() is calculated as follows:

If the exponent is integral and smaller than LLONG_MAX, the value is calculated using successive squaring and multiplications. If the exponent is non-integral, but smaller than LLONG_MAX, the exponent is split into integral and non-integral components. The integral exponent is calculated as above, and the fractional exponent is calculated as exp( x* log( frac(y) ). If the exponent is larger than LLONG_MAX, it is calculated as exp( x * log( y ) )

Trigonometric Functions

For trigonometric functions, the argument is first reduced to the range [-pi/4, pi/4] using the pi/2 remainder algorithm provided in fdlibm [8]. This is then divided by 8 in order to get a value in the range [-pi/32, pi/32]. The sine and cosine are calculated as a polynomial, and the tangent is calculated as a ratio between the sine and cosine.

For the inverse functions, Newton-Raphson iterations are performed in order to calculate, e.g. y where tan( y ) – x = 0.

Logarithms

The logarithm functions (log, log10, log2) are calculated by multiplying x by the appropriate constant, and then calculating log2(C * x). We use log2() as the base function because it allows us to easily extract the integer component of the result (using the frexp() API), leaving only the logarithm of the fraction to be calculated.

Exponentials

The functions exp, exp2, expm1 are all calculated by reducing the argument to less than 1, and calculating the Taylor series. For expm1, the zeroth term of the Taylor series is not added.

Note that if the argument of expm1 is large enough, the result is identical to exp(x) - 1.0.

Hyperbolic Functions

With the exception of sinh(x) where x is close to zero, these functions are calculated by using the exp() function as appropriate. For sinh(x) where x is close to zero, the Taylor series for the function is used in an attempt to reduce catastrophic cancellation.

Utility Functions

ldexp, frexp, fmod, modf, fma, copysign, etc. all exist and are defined in a manner identical to that of the similar functions in the C++ Standard.

std::numeric_limits is fully defined.

Constants

The functions dd_pi(), dd_e(), dd_ln10(), etc. return the expected values, rounded to nearest or even.

Known Bugs

The accuracy of some of the inverse trigonometric (acos(), atan()) and inverse hyperbolic functions (asin()) needs to be improved.

References

IEEE Standard for Floating-Point Arithmetic. IEEE Standard 754-2008, Aug. 2008 Handbook of Floating-Point Arithmetic. J.M. Muller et al., Birkhauser 201x A floating-point technique for extending the available precision. T.J. Dekker, Numerische Mathematik 18(3):224-242, 1971 The Art of Computer Programming vol. 2. D. Knuth, Addison-Wesley Reading MA, 3rd Edition, 1998 Quasi-double precision in floating-point addition. O. Moller. BIT 5:37-50, 1965 Algorithms for quad-double precision floating-point arithmetic. Y. Hida, X.S. Li, D.H. Bailey, Proceedings of the 15th IEEE Symposium on Computer Arithmetic pp 155-162, 2001 The original QD library is available at http://crd.lbl.gov/~dhbailey/mpdist The fdlibm library is available at http://www.netlib.org/fdlibm Software Manual for the Elementary Functions. W. J. Cody Jr. & W. Waite, Prentice-Hall, 1980

History

2015-03-14 Original version 2015-03-31 Bug fixes, basic test program