Chapter 16. CUDA Arithmetic Optimizations
Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
by David Spuler
Arithmetic Overload
You wouldn’t be reading a CUDA book unless you had a lot of arithmetic to do! Crunching through large volumes of data is where a GPU shines. The methods for fixing an arithmetic bottleneck fall into two broad categories:
- Do some other type of arithmetic instead.
- Do fewer computations.
Alternative forms of faster arithmetic include changing to integers or doing bitwise shifting or addition. The ways to do fewer arithmetic computations tend to involve higher-level algorithmic changes to your application or reducing the amount of data.
There are two basic ways that arithmetic computations can be sped up whilst retaining the same results:
- Single operator improvements
- Expression-level optimizations (multiple operators)
Some of the methods of speeding up arithmetic come from the theory of compiler optimization (e.g., strength reduction, sub-expression elimination). Hence, the compiler will often automatically perform these types of optimizations (when the optimizer is invoked). To some extent, this makes these transformations redundant. Even so, good programming practice is to avoid situations where these optimizations are needed on a large scale. The compiler does not look at the program as a whole and can miss some “obvious” optimizations.
General Arithmetic Optimization Methods
Smaller Data Sizes (Quantization). Reducing the size of data from 32-bit floating-point to 8-bit integers (or 4-bit) is effective at reducing both arithmetic computation costs and memory access costs. The computation kernels for 4-bit tensors are much more efficient than for 32-bit calculations.
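As a minimal sketch of the idea (hypothetical helper functions, not from any particular library), symmetric int8 quantization maps each float onto an integer grid via a scale factor:
#include <math.h> // for roundf
// Sketch: symmetric int8 quantization (names are illustrative)
inline signed char quantize_int8(float x, float scale)
{
    int q = (int)roundf(x / scale); // map onto the integer grid
    if (q > 127) q = 127;           // clamp to the int8 range
    if (q < -128) q = -128;
    return (signed char)q;
}
inline float dequantize_int8(signed char q, float scale)
{
    return q * scale; // approximate reconstruction
}
The 8-bit values shrink memory traffic by 4x versus 32-bit floats, and the integer arithmetic in the kernel is cheaper, at the cost of some precision.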
Low-Precision Arithmetic. Choose float over double. Also prefer the lower-precision mathematical library functions, such as expf rather than exp.
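For example (a trivial illustration):
double d = exp(1.0);   // double-precision math function
float f = expf(1.0f);  // single-precision version: faster, usually accurate enough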
Integer arithmetic. Calculations involving integers are faster than their floating-point equivalents. This is another advantage of quantizing tensors from 32-bit floating-point to smaller integers. Even 32-bit integers may have a lower compute cost, although they offer no memory benefit over a 32-bit float.
Bit Parallelism (Bit-Level Kernels). A single integer operation can be viewed as 32 or 64 one-bit operations performed in parallel. This idea is used in data structures such as bit vectors and Bloom filters, and in AI kernels such as low-bit quantization. Using these in CUDA adds a second layer of concurrency beyond thread parallelism, thereby even further parallelizing the algorithm.
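As a sketch of a bit-level kernel (assuming 1-bit values packed 32 per unsigned integer; the kernel name and data layout are illustrative, not from a real library):
// Count the matching bits of two packed 1-bit vectors:
// XOR finds the differing bits, __popc counts them, one 32-bit word per thread.
__global__ void bit_match_kernel(const unsigned int *a, const unsigned int *b,
    int nwords, int *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nwords) {
        int matches = 32 - __popc(a[i] ^ b[i]); // 32 one-bit comparisons in one operation
        atomicAdd(result, matches);             // accumulate across threads
    }
}
Each thread processes 32 one-bit elements per instruction, on top of the thousands of threads already running in parallel.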
Instruction Latency. Some hardware instructions are simply faster than others. Different generations of GPUs have instructions with improved speed, as well as additional machine-code instructions. Knowledge of the exact speed of individual instructions can be exploited by inserting inline PTX assembly with the asm statement.
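The syntax looks like this (a minimal sketch wrapping a single PTX integer add; in practice you would reserve inline PTX for instructions the compiler won't emit on its own):
__device__ int add_via_ptx(int a, int b)
{
    int r;
    asm("add.s32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b)); // one 32-bit signed add in PTX
    return r;
}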
String Data Processing. The C and C++ style of null-terminated string data is not well suited to parallel processing, being somewhat linear in nature. In general, prefer string representations with a fixed size, or at least a “length” variable, rather than null-terminated storage where the length is not known. On the other hand, there are several CUDA C++ libraries for text processing on a GPU, such as RAPIDS, which are very fast.
Operator Strength Reduction
Individual operations in C++ can be optimized in several ways. The general term is “strength reduction” because a stronger operator with high computation complexity is “reduced” to an equivalent operator that is simpler and faster. Strength reduction is a technique used in automatic optimization by compilers, but can also be used by programmers to improve algorithms.
The main “strong” operations that we’re trying to avoid are:
- Floating-point arithmetic (even addition)
- Multiplication
- Division
- Remainder (% operator)
- Math functions (e.g., sqrtf or expf)
Some of the general approaches to strength reduction include:
- Bitwise operations (e.g., bitshifts can replace multiplication)
- Replacing multiplication with addition where possible (multiplication is slower)
- Avoiding division and modulo/remainder operators (they’re the worst!)
- Using integer arithmetic rather than floating-point (where possible)
- Using float single-precision arithmetic, not double-precision
- Approximate arithmetic (e.g., for math functions)
Bitshift for multiplication: The canonical example that everybody knows is that shift operators can replace multiplications by a power of two. But it’s only for integers, not for floating-point numbers. Note that there’s a whole class of AI engines that rely on this integer multiply-vs-bitshift optimization, called “logarithmic quantization” or “power-of-two quantization.” Here’s a dummy example of integer multiplication:
y = x * 4;
This can be more efficiently coded as a left bitshift:
y = x << 2;
Bug alert! If you’re making this code change, you’re likely to introduce some bugs. The “<<” and “*” operators have different precedence levels, so make sure you add extra parentheses. Also, consider whether you need an “unsigned” type when switching to a bitwise operator.
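For example, the precedence trap looks like this (a contrived illustration):
y = x << 2 + 1;   // Wrong: parses as x << (2 + 1), i.e., x * 8
y = (x << 2) + 1; // Right: (x * 4) + 1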
Right shift for division: The use of bitshifting works for division, too (but only for unsigned integers):
y = x / 4;
y = x >> 2; // faster (x must be unsigned)
Avoiding multiplication: There are some simple cases even with the most basic operators that have multiple options:
y = x * 2;
y = x + x; // Addition
y = x << 1; // Shift
Avoid % Remainder Operations
The arithmetic modulus operator (remainder) can also be optimized for power-of-two operands using bitwise arithmetic (but only on integers, and the two forms agree only for non-negative values):
y = x % 512; // Remainder (mod)
y = x & 511u; // Bitwise-AND
And here’s another one with integer relational comparisons versus bitwise-and, although this one might not necessarily be faster:
if (x >= 512) // Relational comparison
if (x & ~511u) // Bitwise-AND of the complement (unsigned)
Another common use of the remainder operator is modulo arithmetic, such as the wraparound array implementation of a queue abstract data type, where the value of a variable is cyclically counted from 0 up to N-1, and then back to 0. The most common idiom for coding this is:
x = (x + 1) % N;
However, the % operator is expensive, and in this case it is not really needed. The following code sequence performs the same task more efficiently:
if (x == N - 1)
x = 0;
else
x++;
This can also be written more concisely, but not necessarily more efficiently, as an expression with the “?:” ternary operator:
(x == N - 1) ? (x = 0) : (x++);
Another example of a clever avoidance of % is when the operand is similar to the usual byte or word size. For example, consider this remainder:
x % 256
This can be more efficiently coded with bitwise-and using:
x & 255
But this can be even more efficiently coded as a type cast:
(unsigned char) x
The conversion to this “unsigned char” type will be efficiently implemented by grabbing a byte out of a word. Unfortunately, this method is not portable to some obscure systems, as it relies on the “overflow” being handled harmlessly, and on “unsigned char” always containing 8 bits.
Reciprocal Multiplication
Division is a slow operation, whether in a CPU or a GPU. Multiplication is often significantly faster than division, and in some cases a division can be replaced by a multiplication using the reciprocal. A case in point is floating-point division by a constant. For example, consider the division:
f = g / 100.0;
This can be replaced by the multiplication:
f = g * 0.01; // Reciprocal
If the divisor is a symbolic constant, it is possible to replace the symbolic constant with a hard-coded constant (or another symbolic constant). However, it is more convenient to replace the constant with an explicit reciprocal calculation. For example, consider the code:
f = g / DIVISOR;
This can be rewritten as:
f = g * (1.0 / DIVISOR);
The compiler should calculate the reciprocal using “constant folding” at compile-time. Note that the parentheses around the division expression do matter here: floating-point arithmetic is not associative, so without the explicit grouping the compiler must evaluate “g * 1.0 / DIVISOR” left-to-right and cannot legally fold the reciprocal. The parentheses guarantee the constant folding, as well as making life easier for the optimizer (and those poor critters need a break every now and then).
If the divisor is a complex expression, the compiler might not automate the use of a reciprocal. Here’s the slow version of division by a scale factor:
v[i] /= sqrtf(3.14159f);
Here’s the faster way using the reciprocal of the constant:
v[i] *= 1.0f / sqrtf(3.14159f);
And we really should pre-calculate this constant using constant folding and a const variable:
const float scalefactor = 1.0f / sqrtf(3.14159f);
v[i] *= scalefactor;
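Inside a loop, this hoists all of the constant arithmetic out of the loop body (a sketch, assuming a simple scaling loop over n elements):
const float scalefactor = 1.0f / sqrtf(3.14159f); // folded at compile-time
for (int i = 0; i < n; i++)
    v[i] *= scalefactor; // one multiplication per element, no divisions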
Implicit Type Conversions
Hidden unnecessary C++ type conversions are a common source of extra inefficiency. The main type in a Transformer is usually “float” (32-bit), rather than “double” (64-bit).
Avoid unnecessary type conversion code in two ways:
- Don’t mix float and double
- Don’t mix float and int
The use of float and int tends to be something professional C++ programmers are aware of, after having been burned a few times, and doesn’t occur that often by accident. However, inadvertently mixing float and double is difficult to avoid, and sneaks into your code all the time.
For example, here’s some C++ code that looks perfectly correct:
float scalefactor = sqrt(2.0) * 3.14159;
You know this isn’t real AI code because it doesn’t have 27 decimal places for pi, which we’ve memorized by rote. AI engines don’t really need anywhere near that much precision, but it looks good for the boss.
The above code is also a small slug, because it may be unnecessarily using “double” size arithmetic, although the compiler might fix it with constant folding (but emit a warning anyway).
Here’s the corrected code:
float scalefactor = sqrtf(2.0f) * 3.14159f;
Note that this example shows there are two places where an “f” suffix is needed to signify that float arithmetic is required:
- Numeric constants (i.e., “2.0f” specifying a 32-bit float, rather than “2.0”, which is a 64-bit double constant).
- Standard C++ functions (i.e., “sqrtf” returns float, rather than “sqrt”, which returns double).
Without the suffix “f”, the default is double type constants and double arithmetic functions.
A lot of C++ compilers will warn about these type conversions losing precision, so if you aim for warning-free compilation as a quality goal, you’ll also fix most of these wasteful hidden type conversions.