Aussie AI
Vectorized Multiply Vector by Scalar
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Multiplying a vector by a scalar is a common requirement when applying scaling factors. Division by a scalar can also be handled by multiplying by the reciprocal (e.g., as needed for Softmax). Multiplication by a scalar is amenable to vectorization because the naive C++ version is very simple:
void aussie_vector_multiply_scalar(float v[], int n, float c)
{
    // Multiply all vector elements by constant
    for (int i = 0; i < n; i++) {
        v[i] *= c;
    }
}
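As mentioned above, division by a scalar (e.g., the normalization step in Softmax) can reuse this routine by passing the reciprocal, replacing n divisions with a single one. A minimal sketch, using a hypothetical wrapper name:

void aussie_vector_divide_scalar(float v[], int n, float c)
{
    // Hypothetical wrapper: divide all elements by constant (assumes c != 0)
    float recip = 1.0f / c;                      // One division instead of n
    aussie_vector_multiply_scalar(v, n, recip);
}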
Loop Pointer Arithmetic. First, we can try the basic C++ optimization of pointer arithmetic:
void aussie_vector_multiply_scalar_pointer_arith(float v[], int n, float c)
{
    // Multiply all vector elements by constant
    for (; n > 0; n--, v++) {
        *v *= c;
    }
}
AVX1 multiply-by-scalar:
There is no special scalar multiplication opcode in AVX or AVX-2, but we can populate a constant register (128-bit or 256-bit) with multiple copies of the scalar (i.e., _mm_set1_ps or _mm256_set1_ps), and we need to do this only once. We can then use the SIMD multiply intrinsics in the unrolled loop section. These intrinsics are declared in <immintrin.h>. The AVX 128-bit vector multiplication by scalar becomes:
void aussie_vector_multiply_scalar_AVX1(float v[], int n, float c)
{
    const __m128 rscalar = _mm_set1_ps(c);      // Vector of 4 copies of the scalar
    for (int i = 0; i < n; i += 4) {            // Process 4 floats per iteration
        __m128 r1 = _mm_loadu_ps(&v[i]);        // Load 4 floats (unaligned)
        __m128 dst = _mm_mul_ps(r1, rscalar);   // Multiply by scalars
        _mm_storeu_ps(&v[i], dst);              // Store 4 floats (unaligned)
    }
}
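Note that this loop assumes n is a multiple of 4. If that is not guaranteed, one common pattern is a scalar cleanup loop after the SIMD section; here is a sketch (the function name is illustrative):

void aussie_vector_multiply_scalar_AVX1_any_n(float v[], int n, float c)
{
    const __m128 rscalar = _mm_set1_ps(c);      // Vector of scalars
    int i = 0;
    for (; i + 4 <= n; i += 4) {                // SIMD loop: groups of 4
        __m128 r1 = _mm_loadu_ps(&v[i]);
        _mm_storeu_ps(&v[i], _mm_mul_ps(r1, rscalar));
    }
    for (; i < n; i++) {                        // Scalar cleanup: up to 3 leftovers
        v[i] *= c;
    }
}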
AVX2 multiply-by-scalar:
Even faster is to use 8 parallel multiplications with AVX-2's 256-bit registers. The AVX-1 version is simply changed to use the "__m256" type and the analogous AVX-2 intrinsics. The new code looks like:
void aussie_vector_multiply_scalar_AVX2(float v[], int n, float c)
{
    const __m256 rscalar = _mm256_set1_ps(c);    // Vector of 8 copies of the scalar
    for (int i = 0; i < n; i += 8) {             // Process 8 floats per iteration
        __m256 r1 = _mm256_loadu_ps(&v[i]);      // Load 8 floats (unaligned)
        __m256 dst = _mm256_mul_ps(r1, rscalar); // Multiply by scalars
        _mm256_storeu_ps(&v[i], dst);            // Store 8 floats (unaligned)
    }
}
Combining AVX-2 with pointer arithmetic. Finally, we can get a small extra benefit by adding pointer arithmetic optimizations to the AVX-2 parallelized version. The new code is:
void aussie_vector_multiply_scalar_AVX2_pointer_arith(float v[], int n, float c)
{
    // Multiply all vector elements by constant
    const __m256 rscalar = _mm256_set1_ps(c);    // Vector full of scalars...
    for (; n > 0; n -= 8, v += 8) {              // Assumes n is a multiple of 8
        __m256 r1 = _mm256_loadu_ps(v);          // Load 8 floats into 256 bits
        __m256 dst = _mm256_mul_ps(r1, rscalar); // Multiply by scalars
        _mm256_storeu_ps(v, dst);                // Store 8 floats (unaligned)
    }
}
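The unaligned load and store intrinsics are used throughout for safety. If the vector is known to be allocated on a 32-byte boundary, the aligned variants (_mm256_load_ps and _mm256_store_ps) can be substituted, which may be slightly faster on some CPUs. One way to get such an allocation is the _mm_malloc intrinsic; a sketch with an illustrative helper name:

#include <immintrin.h>

float* aussie_alloc_aligned_vector(int n)   // Illustrative helper name
{
    // 32-byte alignment permits aligned AVX2 loads/stores
    return (float*)_mm_malloc(n * sizeof(float), 32);
}

// Usage: float* v = aussie_alloc_aligned_vector(1024);
// ... process v ...
// _mm_free(v);   // Must pair _mm_free with _mm_malloc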
Benchmarking results. In theory, the AVX-2 intrinsics could parallelize the computation by 8 times, but benchmarking showed that it only achieved about a 4-times speedup.
Vector-scalar operation benchmarks (N=1024, ITER=1000000):
Vector mult-scalar C++: 1412 ticks (1.41 seconds)
Vector mult-scalar pointer-arith: 995 ticks (0.99 seconds)
Vector mult-scalar AVX1: 677 ticks (0.68 seconds)
Vector mult-scalar AVX2: 373 ticks (0.37 seconds)
Vector mult-scalar AVX2 + pointer arith: 340 ticks (0.34 seconds)
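For reference, timings like these can be collected with a simple harness along the following lines (a sketch, not the exact benchmark code used above; tick counts will vary by compiler and hardware):

#include <ctime>
#include <cstdio>

void benchmark_vector_scalar(const char* name,
    void (*fn)(float*, int, float),
    float v[], int n, int iterations)
{
    clock_t before = clock();
    for (int i = 0; i < iterations; i++) {
        fn(v, n, 3.14f);                     // Arbitrary scalar multiplier
    }
    clock_t ticks = clock() - before;
    printf("%s: %ld ticks (%.2f seconds)\n",
        name, (long)ticks, (double)ticks / CLOCKS_PER_SEC);
}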