Yet another floating point tutorial

Why though?

I know, the topic is already covered by excellent tutorials and explanations elsewhere.

I still think that reexplaining some obscure concepts with different words (and buttons) might help someone understand floating point numbers better. To make fewer mistakes, to make things faster, to create better software in general.

So here we are.

6-bit integer number

Floating point numbers were invented to represent real numbers. Like 0.5 or 3.1415926. The ones that can't be represented by integers like 1 or 3. But in fact, computers only do integers.

And only some finite subset of them. Computers operate on bit tuples, and there are only as many different combinations of bits as 2^n, where n is the length of the tuple.

Here's a model of a 6-bit integer number.

Quest 1. Please make the model above show 31. Check!

Integer numbers are not universally standardized but there are some common conventions. For instance, the most common way to represent negative integers is by using the higher half of the range.

Quest 2. Please make the model above show -32. Check!

Bitwise, it looks a little odd, but it all makes perfect sense when we see all the range in one picture. Let's pick an even simpler model. The 1-digit decimal number will do.

Unsigned 1-digit decimal    Signed 1-digit decimal
0                            0
1                            1
2                            2
3                            3
4                            4
5                           -5
6                           -4
7                           -3
8                           -2
9                           -1

The first half of the range is shared between signed and unsigned types. Then there's a leap back for the signed ones. After that, they diverge by exactly half a range.

If you take overflows into consideration, it seems natural. Yes, signed numbers overflow at half-range, but unsigned numbers overflow too, only half a range later. It's like two continuous rolls of numbers going along each other.

Unsigned 1-digit decimal    Signed 1-digit decimal
...                         ...
9                           -1
0                            0
1                            1
2                            2
3                            3
4                            4
5                           -5
6                           -4
7                           -3
8                           -2
9                           -1
0                            0
1                            1
2                            2
3                            3
4                            4
5                           -5
...                         ...

Please don't write your code expecting this behavior though. Integer overflows are not universally standardized. And even when they are, you'd be better off not exploiting them because it makes code obscure, error-prone, and often non-portable.

Puzzle 1. What will this C++ program print out? This puzzle requires significant C++ knowledge. If you're not familiar with the language, feel free to guess.

// built with Clang 3.8 on Intel Core i7
uint16_t a = 50'000;
uint16_t b = 50'000;
uint64_t c = a*b;
std::cout << c;

0
500000000
-1794967296
18446744071914584320

Check!

Integer numbers may be used to represent some non-integer values. For instance, 0.45 of a meter is 45 centimeters. 45.6 centimeters is 456 millimeters, and so on.



456 mm = 45.6 cm = 0.456 m

It's just a matter of representation. It's all about where to put the decimal point.

6-bit fixed point number

With things like price tags or body temperature, we know where the decimal point goes. These things could be manipulated with the decimal point in mind.



$ 9.99
36.6°C



They are still integers, but they are integer numbers of fractional units. In our case, an integer number of quarters.

Quest 3. Please make the model above show -8. Check!

You can add and subtract them like regular numbers, but you have to introduce more elaborate rules for multiplication and division. And you have to watch for overflows and underflows accordingly.

6-bit floating point number

Floating point numbers are so popular because they work even if you don't know where the decimal point should be or where it will appear after the computation.

They come from scientific (AKA exponential) notation. With this notation, you can write down both astronomical and subatomic values.



m_Sun: 1.989e+30 kg
m_e: 9.1093897e-31 kg





Just like a number in scientific notation, a floating point number has a sign, some meaningful digits, and an exponent.

The three major differences are:

- it's all binary, not decimal;
- the exponent value goes before the meaningful digits;
- and since it's binary, and all the numbers except for 0 start with "1", you don't have to write down the first "1" most of the time. It will be there implicitly. Saves you a bit.

Quest 4. Please make the model above show 1. Check!

Only for the very small numbers, when we want our exponent to be smaller than we can afford, we do not imply the "1". In scientific notation, it's like writing "0.00123e-45". It's not normal. And we call these numbers subnormal or denormalized.

Quest 5. Please make the model above show 0.25. Check!

Since you're still working with bits, and every number is still just an integer in disguise, you can imagine a floating point number as an integer number of 2^n.

Quest 6. Please make the model above show 2. Check!

Unlike integer numbers, floating point numbers are standardized, and the standard specifies several useful conventions. The whole range of possible bits consists not only of numbers, but of two distinct zeros, two infinities, and a whole subrange of "not a number"s.

To understand why you need special values for infinities, you need to understand floating point zeros first. They are not actual zeros. They model anything that is smaller in absolute value than the smallest representable number. Of course, they may occasionally represent an actual zero, but they may very well be some small numbers instead. And as such, they deserve to retain their signs.

auto min_float = std::numeric_limits<float>::denorm_min();
std::cout << min_float;        // prints 1.4013e-45
std::cout << min_float / 2.f;  // prints 0
std::cout << -min_float / 2.f; // prints -0

The smallest representable number is the smallest denormalized number. One half of it is semantically a number, but pragmatically a zero.

Denormalized values are not universally available. If you don't need that extra precision, you can gain some performance by turning them off.

Speaking of performance, some of the compiler optimizations are algebraic. This means that the compiler optimizes floating point computations as if they operated on real numbers. Usually, it's not a problem. Sometimes it is.

Puzzle 2. What will this C++ program print out?

std::cout << 0 - (min_float / 2.f);

-1.4013e-45
-0
0
1.4013e-45
It depends

Check!

When you divide a non-zero number by a zero, you get an infinity. And just like the zeros are not really zeros, these are not really infinities. They are just numbers that are too big to be represented.

There are also numbers that can't possibly be represented in this model at all, like square roots of negative numbers. Technically, they are still numbers, but being complex numbers and not real, they don't have a representation here. Operations that take numbers as input but can't provide a number as output return "not a number" then.

Representation error

Of course, even real numbers are only represented partially. We have only so many bits, only so many combinations, and the range of real numbers is infinite.

When a number doesn't have the representation, we pick the nearest number that has one instead.

[Interactive calculator: enter a number to see it as a 6-bit FP value and the resulting instrumental error.]



The difference between the number you enter and the number that gets represented is our representation error.

Quest 7. Using the calculator above, please enter a number with an error of exactly 0.001. Check!

A common misconception is that since our smallest representable numbers are small, the representation error should be small too. But remember, we are talking about integer numbers of 2^n. The greater the n, the greater the possible representation error.

Quest 8. Using the calculator above, please enter a number with an error of 1000. Check!

Computational error

Representation error is not our worst enemy. Real-world data, the values that come from sensors or user input, have errors of their own. And those usually overshadow the representation error of floating point numbers.

However, as we process the data, we introduce computational error, too.

[Interactive widget: pick two addends from 0.25, 0.5, 2, and 16, add them as 6-bit floats, and see the error.]

Computational error is not inevitable. If the arguments' exponents are close enough, the result of the operation may be represented without any error at all.

Puzzle 3. The widget above lets you pick 16 variants of 6-bit float addition. How many of them cause no error?

0
2
4
8
16

Check!

There is a common belief that comparing two floats exactly is unsafe because of the possible error, but this error is often manageable. For instance, this loop is quite safe (remember, floats are just integer numbers of 2^n).

// 16 iterations loop
for(auto i = 0.; i != 4; i += 0.25)
    cout << i << ' ';

It might get needlessly tricky to tell safe from unsafe, though. For instance, this one isn't safe, since 0.1 is only represented in the floating point model with an error.

// "infinite" loop
for(auto i = 0.; i != 4; i += 0.1)
    cout << i << ' ';

What's worse, this error tends to accumulate as the computation goes. Of course, it depends on the algorithm; some are more prone to accumulating error than others.

If you want to learn more about how to estimate this error, please see Estimating floating point error the easy way.