This post explains how to convert floating-point numbers to binary in the IEEE 754 format. A good reference on IEEE 754 conversion is available on Thomas Finley's website. This post sticks with the IEEE 754 single-precision binary floating-point format: binary32. See this other posting for C++, Java and Python implementations for converting between the binary and decimal formats.
Expressing numbers in scientific notation
You may be aware that binary numbers, like decimal numbers, can have a fractional part after a radix point (the binary equivalent of the decimal point), and that binary numbers, like decimal numbers, can be expressed using scientific notation:
decimal: 923.52 = 9.2352 x 10^2
binary: 101011.101 = 1.01011101 x 2^5
The number that the 10 or 2 is raised to, the "exponent", is the number of places the radix point has been shifted: positive for a shift to the left, negative for a shift to the right.
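As a rough illustration of the binary case, the following Python sketch splits a positive number into a significand between 1 and 2 and a power-of-two exponent. The helper name binary_scientific is made up for this post; math.frexp is the standard library function it relies on.

```python
import math

def binary_scientific(x):
    """Return (significand, exponent) with x == significand * 2**exponent
    and 1 <= significand < 2. Intended for positive, finite, non-zero x."""
    m, e = math.frexp(x)      # frexp gives x == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1       # rescale so the significand lies in [1, 2)

# 101011.101 (binary) is 43.625 (decimal)
print(binary_scientific(43.625))   # (1.36328125, 5), i.e. 1.01011101 x 2^5
```

The significand 1.36328125 is the decimal value of 1.01011101 (binary), and the exponent 5 matches the five-place shift in the example above.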
IEEE 754 Representation for binary32
In IEEE 754 floating-point representation, the binary number is divided into three sections: the sign bit, the exponent and the mantissa (fractional part).
The sign bit occupies just one bit and represents the sign: 0 for positive and 1 for negative.
The exponent section stores the exponent value described above. For a 16-bit (half-precision) floating-point number it occupies 5 bits; for the 32-bit (single-precision) binary32 format used in this post it occupies 8 bits; for 64-bit (double-precision) formats it occupies 11 bits.
Dealing with positive and negative exponents
An 8-bit exponent encoding can represent integers from 0 (00000000) to 255 (11111111). But what about negative exponents? We need to be able to include these, too. To cover this, the stored value is the true exponent plus a fixed offset, the "bias", which is 127 for binary32.
If our exponent is (say) 3 then add 127 to it to give 3 + 127 = 130 (decimal) = 10000010 (binary). The bias is simply 2^(n-1) - 1 where n is the number of exponent bits, so 8-bit exponent encodings have a bias of 2^(8-1) - 1 = 128 - 1 = 127.
If our exponent were minus 3, then the outcome would be -3 + 127 = 124 (decimal) = 01111100 (binary). In other words, (00000000) to (01111111) correspond to the exponents from -127 to zero, and (10000000) to (11111111) correspond to the exponents from +1 to +128. In binary32 the all-zeros and all-ones patterns are reserved (for zero and subnormal numbers, and for infinities and NaN, respectively), so ordinary (normal) numbers use exponents from -126 to +127.
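The bias arithmetic can be sketched in a few lines of Python. The function names exponent_bias and encode_exponent are illustrative only, and the range check simply rejects values that do not fit in the field; it does not model the reserved all-zeros and all-ones patterns.

```python
def exponent_bias(exponent_bits):
    """Bias for an exponent field of the given width: 2^(n-1) - 1."""
    return 2 ** (exponent_bits - 1) - 1

def encode_exponent(exponent, exponent_bits=8):
    """Return the biased exponent as a zero-padded binary string."""
    biased = exponent + exponent_bias(exponent_bits)
    if not 0 <= biased < 2 ** exponent_bits:
        raise ValueError("exponent does not fit in the field")
    return format(biased, "0{}b".format(exponent_bits))

print(exponent_bias(8))      # 127
print(encode_exponent(3))    # 10000010
print(encode_exponent(-3))   # 01111100
```

The same formula gives a bias of 15 for a 5-bit exponent and 1023 for an 11-bit exponent.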
The third section of our 32-bit representation is 23 bits long. The mantissa, sometimes called the significand, represents the fractional part of the number in binary scientific notation, i.e. the binary digits to the right of the binary point.
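To see the three sections on a real value, one option is to let Python's struct module produce the binary32 encoding and then pick the fields apart with shifts and masks. This is only a sketch: binary32_fields is a hypothetical helper name, and the exponent it returns is the stored (biased) value.

```python
import struct

def binary32_fields(value):
    """Split a number's binary32 encoding into (sign, biased exponent, mantissa)."""
    # Pack as a big-endian 32-bit float, then reread the same 4 bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # top bit
    exponent = (bits >> 23) & 0xFF    # next 8 bits, still biased by 127
    mantissa = bits & 0x7FFFFF        # low 23 bits
    return sign, exponent, mantissa

print(binary32_fields(1.0))    # (0, 127, 0): 1.0 x 2^0, stored exponent 0 + 127
print(binary32_fields(-2.0))   # (1, 128, 0): -1.0 x 2^1, stored exponent 1 + 127
```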
Example: 12.375 into IEEE 754 binary format
This example for converting from decimal representation into a binary32 format is taken from the Wikipedia page. Consider the number 12.375.
Take the non-fractional part of 12.375 and convert it into binary in the normal way:
12 (decimal) is 1100 (binary)
Since 12 = (8 * 1) + (4 * 1) + (2 * 0) + (1 * 0)
Converting the fractional part (0.375) into binary is done using the following procedure:
1. multiply the fraction by 2
2. keep the integer part of multiplication as the binary result
3. re-multiply new fraction by 2
4. repeat steps 1 to 3 until the fraction becomes zero or the precision limit is reached (23 fraction digits for the IEEE 754 binary32 format), i.e.:
0.375 x 2 = 0.750 = 0 + 0.750 => 0
0.750 x 2 = 1.500 = 1 + 0.500 => 1
0.500 x 2 = 1.000 = 1 + 0.000 => 1
The fraction part eventually comes to 0.000, so we terminate. The binary result is 011, therefore
0.375 (decimal) is 0.011 (binary)
12.375 (decimal) is now 1100.011 (binary).
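The multiply-by-two procedure above could be written in Python roughly as follows. The name fraction_to_binary and the max_bits parameter are assumptions made for this sketch; it stops after 23 digits as described in the text and simply truncates there rather than rounding.

```python
def fraction_to_binary(fraction, max_bits=23):
    """Convert a fraction in [0, 1) to a string of binary digits."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in the range [0, 1)")
    bits = []
    while fraction and len(bits) < max_bits:
        fraction *= 2              # step 1: multiply the fraction by 2
        bit = int(fraction)        # step 2: the integer part is the next digit
        bits.append(str(bit))
        fraction -= bit            # step 3: carry on with the new fraction
    return "".join(bits)

print(fraction_to_binary(0.375))                           # 011
print(format(12, "b") + "." + fraction_to_binary(0.375))   # 1100.011
```

A value such as 0.1 never reaches a fraction of zero, so for it the loop ends at the 23-digit limit instead.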
Convert result to the required binary scientific format
IEEE 754 binary32 format requires that you represent values in the scientific format described previously, so that
1100.011 = 1.100011 x 2^3.
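The shift of the binary point can also be done mechanically on the digit string. The sketch below assumes the input is a string of binary digits with an optional binary point and at least one 1 digit; normalize is a name invented for this illustration.

```python
def normalize(binary):
    """Rewrite e.g. '1100.011' as ('1.100011', 3), meaning 1.100011 x 2^3."""
    integer, _, fraction = binary.partition(".")
    digits = integer + fraction
    first_one = digits.find("1")
    if first_one < 0:
        raise ValueError("zero has no normalized form")
    # Number of places the point moves left (negative means it moves right).
    exponent = len(integer) - 1 - first_one
    rest = digits[first_one + 1:] or "0"
    return "1." + rest, exponent

print(normalize("1100.011"))   # ('1.100011', 3)
print(normalize("0.011"))      # ('1.1', -2)
```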
From this scientific notation we can now deduce:
Sign = 0 (positive number)
Exponent = 3
bias = 2^(8-1) - 1 = 127 (8-bit exponent encoding for binary32)
adding this to the exponent gives:
3 + 127 = 130 (decimal) = 10000010 (binary)
Mantissa = 100011 (the fractional part to the right of the binary point, padded with trailing zeros to fill 23 bits)
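As a cross-check, the three deduced values can be joined into a 32-bit pattern and decoded again with Python's struct module. assemble_binary32 is a hypothetical helper written for this post; it assumes a normal number whose mantissa digits are given as a string of at most 23 bits.

```python
import struct

def assemble_binary32(sign, exponent, fraction_bits):
    """Join sign, biased exponent and zero-padded mantissa into one bit string."""
    biased = exponent + 127                    # binary32 bias
    mantissa = fraction_bits.ljust(23, "0")    # pad on the right to 23 bits
    return "{} {:08b} {}".format(sign, biased, mantissa)

pattern = assemble_binary32(0, 3, "100011")
print(pattern)   # 0 10000010 10001100000000000000000

# Decode the same 32 bits as a big-endian float to confirm the value.
word = int(pattern.replace(" ", ""), 2)
(value,) = struct.unpack(">f", word.to_bytes(4, "big"))
print(hex(word), value)   # 0x41460000 12.375
```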
From these we form the resulting 32-bit IEEE 754 binary32 representation of 12.375 as: