Chapter 17. CUDA Floating-Point Bit Tricks

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler

Bits in Floating-Point Numbers

The basic 32-bit floating-point number in C++ is a float with a size of 4 bytes. How can you manipulate the bits in a floating-point value, using the 32-bit float type? You cannot use any of the C++ bitwise operators on floating-point numbers, as they only work for integers.

The trick is to convert it to an unsigned integer (32-bit) with the same bits, and then use the integer bitwise operations. The obvious way to convert a float to unsigned is casting:

    float f = 3.14f;
    unsigned int u = (unsigned)f;  // Fail!

Nope. That doesn’t get to the bits, because it does a proper numeric conversion between floating-point and integer, which is usually what you want when you aren’t thinking about bits (i.e., all normal people). For example, (unsigned)3.14f yields the integer 3, not the underlying bit pattern 0x4048f5c3.

To get to the bits in C++, we have to trick the compiler into thinking that it’s already got an unsigned integer with pointer type casts:

    unsigned int u = *(unsigned int*)(&f);  // Tricky!

That’s a bit old-school for type casting. Here’s the modern way with reinterpret_cast:

    unsigned int u = *reinterpret_cast<unsigned int*>(&f);

Once we have the bits, then we can twiddle the bits of our unsigned integer to our heart’s content. When we’re finished, we can do the same trick in reverse to re-create a floating-point number:

    f = *(float *)(&u);   // Floating again...
    f = *reinterpret_cast<float*>(&u);  // Trendy version

And here’s a timely reminder that it’s important to use an “unsigned” type in C++ for the bit faking code, because the “>>” right-shift operator has implementation-defined behavior on negative values (an arithmetic shift is not guaranteed until C++20).

Other Methods: Type casts aren’t the only way in C++. There’s also a trick involving “union” structures, and you can also directly copy the bits to a differently typed variable using “memcpy” or “bcopy”.
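
For illustration, here are minimal sketches of those two alternatives (assuming, as elsewhere in this chapter, that float and unsigned int are both 32 bits):

    #include <cstring>  // For memcpy

    // Union trick: write the float member, read back the unsigned member
    union FloatBits { float f; unsigned int u; };
    FloatBits fb;
    fb.f = 3.14f;
    unsigned int u1 = fb.u;      // The raw bits

    // memcpy trick: copy the bytes into a differently-typed variable
    float f = 3.14f;
    unsigned int u2 = 0;
    memcpy(&u2, &f, sizeof u2);  // The raw bits, byte-for-byte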

It seems to me that this type cast trick should be the fastest way, because a good compiler should convert the address-of, reinterpret_cast and indirection sequence into a simple variable copy, especially with the “reinterpret_cast” hint. However, I haven’t actually benchmarked the speed of the different methods.

Pitfalls

Bitwise manipulation of float data is not the most portable code in the world. Let’s examine some of the possible pitfalls in using these techniques.

Bitwise zero testing: If you’ve gone to the trouble to access the bits of a floating-point number, you might as well use them. Obviously, testing for “0.0” is a common requirement, so let’s make it faster:

    #define FLOAT_IS_ZERO(f) \
     ((*reinterpret_cast<unsigned int*>(&f)) == 0u) // Bug!

Oops! We forgot about negative zero. There are two zeros in floating-point, differing only in the sign bit, so a bitwise test must either mask out the sign bit or shift it out (e.g., shift left by one) before comparing with zero.
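
Here’s a corrected version that masks off the sign bit first, so that both zeros match (a sketch of the fix; shifting left by one and comparing with zero also works):

    #define FLOAT_IS_ZERO(f) \
      (((*reinterpret_cast<unsigned int*>(&f)) & 0x7fffffffu) == 0u)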

Strict aliasing rule. An important point about all this is that most of it is platform-dependent, and officially “undefined behavior”. Some of it is standardized by IEEE 754, but many variations are possible. Another issue is that C++ has a “strict aliasing rule” that makes many of these tricks officially non-standard: accessing a floating-point number as if it were an unsigned integer is a technical violation of this rule. The “reinterpret_cast” spelling doesn’t actually avoid the violation; it merely makes the intent explicit, so it’s still not guaranteed.

Anyway, the union trick doesn’t really help with portability either, since type punning through a union is legal in C but officially undefined behavior in C++ (although mainstream compilers tolerate it). The memcpy approach is the one actually blessed by the standard, and good compilers optimize the call away to a simple copy; C++20 also added std::bit_cast for exactly this purpose. Some additional risk mitigations are still warranted, such as adding a lot of unit tests of even the most basic arithmetic operations, because with the pointer and union tricks you’re not officially covered against an over-zealous optimizer that assumes no such aliases exist.

Byte sizes. Another much simpler portability issue is checking the byte sizes of data types, which can vary across platforms. Most of this bit-fiddling stuff relies on particular 16-bit and 32-bit layouts. It doesn’t hurt to add some self-tests to your code so you don’t get bitten on a different platform, or even by a different set of compiler options:

   assert(sizeof(int) == 4);
   assert(sizeof(short int) == 2);
   assert(sizeof(float) == 4);
   assert(sizeof(unsigned int) == 4);

Also note that for this to work well, both types must be the same size. So, this would be a useful code portability check if it worked:

   #if sizeof(float) != sizeof(unsigned int)  // Fails!
   #error Big blue bug
   #endif

This macro preprocessor trick doesn’t work because sizeof isn’t allowed in a preprocessor expression: the preprocessing phase runs before syntax analysis, so no type information is available yet. A better version uses a “static_assert” statement, which does compile-time checking in a more powerful way.

    static_assert(sizeof(float) == sizeof(unsigned), "Bug!");

Floating-Point Builtin Functions

The alternative to directly accessing the bits as an unsigned integer is to use the existing C++ functions. There are various existing functions for bitwise manipulation of floating-point numbers, in two categories: standard C++ library functions and compiler-specific intrinsics.

C++ has standard functions for manipulating floating-point numbers and their bitwise representations.

  • std::signbit — Portably test the sign bit of a floating-point number.
  • std::copysign — Portably combines the magnitude (i.e., the exponent and mantissa) of the first argument with the sign bit of the second.
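
For example, both are in <cmath>, and both handle negative zero correctly:

    #include <cmath>

    bool neg = std::signbit(-0.0f);          // true: the sign bit is set
    float x = std::copysign(3.14f, -2.0f);   // -3.14f: magnitude of the first, sign of the second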

Most of the following are standard library functions declared in <cmath>, although compilers also provide their own “intrinsics” or “builtins” to manipulate floating-point numbers (e.g., Microsoft Visual Studio C++ has <intrin.h>, and GCC has its own __builtin versions).

  • frexp — Splits a float into its significand and integer exponent.
  • ldexp — Scales by a power of two (an integer “bitshift” of a float).
  • scalbn — Also scales a float by a power of two (essentially identical to ldexp for binary floats).
  • logb — Extracts the exponent as a floating-point value.
  • ilogb — Extracts the exponent as an integer.
  • modf — Splits into whole and fractional parts.
  • fma — Fused multiply-add (standard since C++11).
  • remainder — The IEEE remainder of a floating-point division (standard since C++11).
  • _fcvt — Low-level conversion of a float to a string (Microsoft-specific).

For many of the listed functions, there are additional versions for different floating-point data types, such as float, double and long double. For example, “frexp” will split a double type into its significand (fractional part) and exponent integer, but there’s also “frexpf” for 32-bit float types, and “frexpl” for long double types.
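
For example, here’s a quick sketch of the float versions in action (the values in the comments follow from 12 = 0.75 × 2^4):

    #include <cmath>

    int e = 0;
    float m = frexpf(12.0f, &e);  // m = 0.75f, e = 4
    float x = ldexpf(m, e);       // 0.75 * 2^4 = 12.0f again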

Floating-Point Bit Tricks

Once you’ve got the bits into an unsigned integer, what can you do?

Assuming you’re willing to throw the standards documents to the curb, you can do quite a lot. The bits can be directly manipulated in non-obvious ways to speed up some types of floating-point arithmetic with integer bitwise arithmetic on the underlying bits. Examples of floating-point bit manipulations used to optimize neural networks include:

  • Sign bit flipping: this can be used for fast non-multiplication binarized networks with floating-point computations.
  • Exponent bit manipulations: bitshifting float values in logarithmic quantization can be implemented as integer addition on the exponent bits of a float.
  • Add-as-integer networks: This method simply adds the underlying bit representations together as integers, to create a type of multiplication-free neural network. Weirdly, this simple trick implements an approximate multiplication algorithm known as Mitchell’s algorithm.
  • Fast log2 computation on float types using the exponent bits directly.

The first step is to extract the bit patterns. Let’s assume it’s a standard 32-bit float type with 1 sign bit, 8 exponent bits, and 23 stored mantissa bits. You can get the different bits:

   int signbit = (u >> 31);
   int exponent = ( (u >> 23) & 255 );  // Fail!
   int mantissa = ( u & ((1 << 23) - 1 ));

Nice try, but that’s only 2 out of 3. The exponent is wrong here! The bits are correct, but it’s not the right number. We have to subtract the “offset” (or “bias”) of the exponent, which is 127 for an 8-bit exponent. This is correct:

   int exponent = ( (u >> 23) & 255 ) - 127;  // Correct!

Note that the sign bit and mantissa can be stored as unsigned (i.e., positive or zero), but the exponent must be a signed integer, even though it is extracted from the bits of an unsigned integer. For a fraction like decimal 0.25 (i.e., a quarter), which equals 2^-2, the exponent is -2. With an 8-bit exponent and a bias of 127, the unbiased exponent ranges from -127 to +128, although the two extremes are reserved (for denormals at one end, and infinity and NaN at the other), so normal numbers use -126 to +127. Note that the sign bit in a float specifies the overall sign of the whole number, and is not the sign of the exponent.
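
As a quick worked example (the hex constant in the comment is the IEEE 754 bit pattern of 0.25f):

    float f = 0.25f;                        // 0.25 = 1.0 x 2^-2
    unsigned int u = *(unsigned int*)&f;    // Bits are 0x3e800000
    int signbit = (u >> 31);                // 0 (positive)
    int exponent = ((u >> 23) & 255) - 127; // 125 - 127 = -2
    int mantissa = (u & 0x7fffff);          // 0 (the leading 1 is implicit)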

Here are some macro versions of the above bit extractions:

    #define AUSSIE_FLOAT_SIGN(f)     \
      ((*(unsigned *)&(f)) >> 31u)   // Leftmost bit
    #define AUSSIE_FLOAT_EXPONENT(f) \
      ((int)(((((*(unsigned*)&(f)))>> 23u) & 255) - 127)) 
    #define AUSSIE_FLOAT_MANTISSA(f) \
      ((*(unsigned*)&(f)) & 0x007fffffu) // Right 23 bits

Note that these macros don’t work for constants, but give a compilation error such as “l-value required”. This is because the “&” address-of trick needs a variable, not a constant. I don’t see an easy way around it for bitwise trickery.

If you dislike bits for some strange reason, here’s a simple way to define the sign bit macro using the “<” operator, which also works on constants (though, unlike the bitwise version, it reports negative zero as non-negative):

    #define AUSSIE_FLOAT_SIGN(f)  ( (f) < 0.0f)  // Sign test

Example: Add-as-Integer Approximate Multiply

The add-as-integer method suggested by Mogami (2020) simply adds the integer bit representation of two floating-point variables, as if they are integers. It’s quite surprising that this has any useful meaning, but it’s actually a type of approximate multiplication called Mitchell’s algorithm. Here’s what the C++ code looks like on 32-bit float types:

    float aussie_add_as_int_mogami(float f1, float f2)
    {
        // Add the bit representations as integers, Mogami (2020)
        int c = *(int*)&f1 + *(int*)&f2 - 0x3f800000;
        return *(float*)&c;
    }

The magic number 0x3f800000 is (obviously) equal to “127<<23” and its purpose is to fix up the offset of the exponent. Otherwise, there are two offsets of 127 combined. (Is there a faster way? It’s annoying to waste a whole addition operation on what’s just an adjustment.)

Note that this algorithm is one exceptional case where we don’t want to use unsigned integer types when tweaking bit representations. This trick needs the temporary variable of type “int” and the pointers to be “int*” so that it can correctly handle the sign bits of the two floating-point numbers.

This add-as-integer algorithm is not restricted to 32-bit float data. It should also work for 16-bit floating-point numbers in both float16 and bfloat16 formats, provided the magic number is changed: for FP16, with its 5-bit exponent and bias of 15, the magic number becomes 15<<10, and for bfloat16, which keeps FP32’s 8-bit exponent and bias of 127, it becomes 127<<7.
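
For illustration, here’s a minimal sketch of an FP16 version that operates on the raw 16-bit patterns (taking the bits as uint16_t parameters is my simplification; real code must first extract the bits from a half-precision type):

    #include <cstdint>

    uint16_t add_as_int_fp16(uint16_t b1, uint16_t b2)
    {
        // FP16 = 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits
        // Magic number 0x3c00 is 15<<10, removing the doubled-up bias
        return (uint16_t)(b1 + b2 - 0x3c00u);
    }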

Example: Float Bitshift via Integer Addition

This is another surprising bitwise trick on floating-point numbers. You cannot perform the standard bitshift operators on float types in C++, so you cannot easily speed up floating-point multiplication via bitshifts in the same way as for integers.

Bitshifts are a fast way of doing an integer multiplication by a power-of-two (e.g., “x<<1” is the same as “x*2”). Note that it also doesn’t work to convert the float to its unsigned int bit version and shift it using integer bitshift operators.

On some platforms, there are builtin special functions such as ldexp and scalbn for doing bitshifting on float data. The ldexp function accepts an integer power, and bitshifts a floating-point number by that many places. The ldexp function is for double types, ldexpf is for float, and ldexpl is for long double types. The scalbn set of functions is almost identical to the ldexp functions. There is also a reverse function, “frexp”, which extracts the significand (fraction) and the power-of-two from a floating-point argument.

Although we can’t bitshift floating-point values, there is an intriguing alternative optimization using integer arithmetic directly: addition. The suggestion in the DenseShift paper (Li et al., 2023) is to simply add the shift count to the exponent bits using integer addition.

Here’s some example C++ code that works for 32-bit floating-point numbers:

    float aussie_float_bitshift_add_int(float f1, int bits)   
    {
        // Bitshift float by adding int to exponent bits
        // FP32 = 1 sign bit, 8 exponent, 23 mantissa
        unsigned int u = *(unsigned int*)&f1; // Get the bits
        if (u == 0) return f1;  // special case, don’t change
        u += ((unsigned)bits << 23);  // Add shift count to exponent (unsigned shift avoids UB for negative counts)
        return *(float*)&u; // Convert back to float
    }

How does it work? Well, it makes a certain kind of sense. The exponent in a floating-point representation is a power-of-two, and bitshifting increases the number by a power-of-two. Hence, adding 1 to the exponent doubles the number, and adding a count greater than 1 multiplies by a larger power-of-two.

Note that this code also works for a bitshift of a negative count (e.g., bitshift of -1 subtracts from the exponent and thereby halves the number) or zero (unchanged). However, this exponent-addition trick can fail if the result overflows or underflows the exponent range (normal numbers only cover unbiased exponents -126 to +127).

This method has thereby improved the performance of floating-point multiplication by changing it to integer addition. The idea works provided we are multiplying by a power-of-two, which is done in logarithmic quantization. However, it’s a little tricky in that special formats like zero (and NaN) are problematic for this algorithm. I had to add the test “u==0” which slows things down (maybe there’s a better way?). Also, this approach can theoretically overflow the exponent bits, messing up the sign bit, but that’s only if the float is very big or very tiny. Checking for all these wrinkles will slow down the code.
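
As a quick usage check (the expected results follow from adding to or subtracting from the exponent):

    float y = aussie_float_bitshift_add_int(3.0f, 1);   // 6.0f (3 * 2^1)
    float z = aussie_float_bitshift_add_int(3.0f, -1);  // 1.5f (3 * 2^-1)
    float w = aussie_float_bitshift_add_int(0.0f, 5);   // 0.0f (special case)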

Example: Log2 of Floating-Point is the Exponent

The log2 function for float types is a non-linear function that is quite expensive to compute. We already computed log2 of an integer with low-level bit fiddling methods based on a count-leading-zeros algorithm in the previous chapter. There’s also a different bitwise trick for log2 of floating-point numbers. This method computes the truncated integer version of the log2 algorithm (e.g., for use in logarithmic power-of-two quantization). There’s a very easy way:

    The base-2 logarithm is the exponent!

It’s sitting right there, already calculated, hidden in plain sight amongst the 32 bits of your friendly float variables. Here’s some C++ code to extract it:

    int ilog2_exponent(float f)  // Log2 for 32-bit float
    {
        unsigned int u = *(unsigned int*)&f;
        int iexp = ((u >> 23) & 255);  // 8-bit exponent
        iexp -= 127;  // Remove the "offset"
        return iexp;
    }

Alternatively, for greater portability, and probably extra speed too, there are standard C++ library functions available across various platforms (including Linux and Microsoft compilers) that work with the exponent: frexp and ilogb extract it, while ldexp and scalbn apply one.
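
For example, a quick sanity check that the standard functions agree with the bit extraction version above (ilogbf returns the unbiased exponent as an int):

    #include <cmath>
    #include <cassert>

    assert(ilogbf(8.0f) == 3);                     // 8 = 2^3
    assert(ilogbf(0.25f) == -2);                   // 0.25 = 2^-2
    assert(ilogbf(6.0f) == ilog2_exponent(6.0f));  // Both yield 2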

References

  1. Eric Sakk (2018), Understanding Floating-Point Numbers, Concepts in Computer Systems (Volume 2), 7 June 2018, https://www.amazon.com/dp/1983093025/
  2. Sean Eron Anderson (2005), Bit Twiddling Hacks, Stanford University, https://graphics.stanford.edu/~seander/bithacks.html
  3. T. Mogami (2020), Deep neural network training without multiplications, In Beyond BackPropagation WS at 34th Conference on Neural Information Processing Systems, 2020, https://arxiv.org/abs/2012.03458 (Uses integer addition of the bits of an IEEE 754 floating-point representation to perform approximate floating-point multiplication.)
  4. Jean-Michel Muller, Nicolas Brisebarre, Florent de Dinechin, Claude-Pierre Jeannerod, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Damien Stehlé, Serge Torres (2018), Handbook of Floating-Point Arithmetic, Birkhäuser, 2018, https://link.springer.com/book/10.1007/978-3-319-76526-6, Contents: https://cds.cern.ch/record/1315760/files/9780817647049_TOC.pdf
  5. Wonyeol Lee, Rahul Sharma, Alex Aiken (2016), Verifying Bit-Manipulations of Floating-Point, Stanford University, USA, https://theory.stanford.edu/~aiken/publications/papers/pldi16b.pdf
  6. Xinlin Li, Bang Liu, Rui Heng Yang, Vanessa Courville, Chao Xing, Vahid Partovi Nia (2023), DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization, Oct 2023, https://arxiv.org/abs/2208.09708 (Uses integer addition on the sign and exponent bits of IEEE 754 floating-point to perform bitshifts on floats to perform power-of-two number quantization on 32-bit floats.)
  7. Mostafa Elhoushi, Zihao Chen, Farhan Shafiq, Ye Henry Tian, Joey Yiwei Li (2021), DeepShift: Towards Multiplication-Less Neural Networks, July 2021, https://arxiv.org/abs/1905.13298 (Bitwise shifting and sign bit manipulation.)

 
