Problem: float +/- result depends on computation order, Win32/x64, compiler options
Pieneer





Posted: Visual C++ Language, Problem: float +/- result depends on computation order, Win32/x64, compiler options

Hi, I have this small test program test.cpp:

#include <iostream>
using namespace std;

int main()
{
    cout.precision(8);
    float v1 = -1.9513026f, v2 = 0.31476471f, v3 = 3.1415927f;
    cout << "v1-v2+v3= " << v1 - v2 + v3 << endl;
    cout << "v1+v3-v2= " << v1 + v3 - v2 << endl;
    return 0;
}

When I compile it (using Visual Studio 2005 SP1) with cl /EHsc test.cpp as a Win64 application (both on Windows XP x64 and Vista x64) and run it I get this result:

v1-v2+v3= 0.87552547
v1+v3-v2= 0.87552536

So, depending on the order of computations I get different results. I know that float has only 6-7 significant digits, so some inaccuracy is to be expected (using double instead of float gives the result 0.87552539), but my problem is that I have to compare the results of the program to the results of the same program compiled as a Win32 application, which gives:

v1-v2+v3= 0.87552536
v1+v3-v2= 0.87552536

If I compile with cl /clr test.cpp I get 0.87552536 in all cases, both for Win64 and Win32, so this raises the question of whether the different result (0.87552547) is a compiler bug. Could someone please verify this with another processor? I have an AMD Athlon 64 X2 4200+.

Unfortunately I can't use /clr or double in this case; I have to use /EHsc, /MT and float in order to speed up the computation and save memory (the real application handles lots of data). Is there any way of getting the same results for the Win32 and Win64 applications under these restrictions? Any help is much appreciated.



 
 
oflebbe






Hi,

this behaviour of + and - is by design. In numerical mathematics this is called "cancellation": subtracting numbers of comparable size yields large relative errors.

In 32-bit mode all floating point operations (regardless of float or double) are done in the x87 FPU registers by default, which have an internal precision of 80 bits. Rounding down to float (32 bits) or double (64 bits) is done only when a write back to memory is needed. This 80-bit mode was a design by Intel, which I loved, but it was very controversial in the numerical mathematics world, because it can lead to strange numerical artefacts if one tries to apply numerical tricks.

In 64-bit mode floating point is done in the SSE registers, which have both a double precision and a single precision mode. Now every computation is done in the respective precision, yielding predictable, but larger, numerical rounding errors.

Your code will behave the same way on Linux/x86 and Linux/amd64, btw.

The workaround is

1) Never do numerical comparison with ==; instead use
std::abs(a-b) < error. This is the textbook solution.

2) use double for calculation.

Further Reading: What Every Computer Scientist Should
Know About Floating-Point Arithmetic
http://docs-pdf.sun.com/800-7895/800-7895.pdf

Regards,
Olaf Flebbe



 
 
Pieneer






Hi,

Thank you very much for the quick and informative reply. I was afraid the reason was due to design.

However, as I mentioned, compiling with cl /clr test.cpp on x64 gives:

v1-v2+v3= 0.87552536
v1+v3-v2= 0.87552536

Does /clr imply that the FPU is used instead of SSE, or what is the reason for that one working as I would like it to, i.e. giving identical results regardless of using Win32 or Win64?

Regarding the workarounds:

1) The small differences discussed here eventually result in big differences when applied many times on big datasets. The comparisons are actually done between the results of the 64-bit application and the 32-bit application. The comparison is now much harder due to these differences.

2) Unfortunately double takes twice the memory space and approximately twice the time, therefore I have to use float.


 
 
oflebbe






Hi,

I recommend reading:

What Every Computer Scientist Should Know About Floating-Point Arithmetic

Please have a look at any numerics book about how to handle sums of many operands with mixed sign and comparable size.

Best Regards
Olaf Flebbe