1. C++ Vector Sum Performance

Vector sum: to sum up the elements of an integer vector and store the result in some destination.

In this post we’re going to begin investigating how to implement the ‘Vector sum’ function defined above, in doing so we can see how attempts at performance improvement affect the functions runtime.

A caveat before proceeding, we’re strictly using un-optimised compiled code. Compilers are smart and can (generally) implement the following optimisations on their own.

Meet the functions

0 – The slowest one

There’s nothing special to see here. This is a basic implementation of ‘Vector sum’ and is chock full of repetitive, unnecessary memory references.

int get_sum_of_int_vector_0(const vector<int>& vec, int* destination)
{
    for (int i = 0; i < vec.size(); i++)
    {
        *destination = *destination + vec[i];
    }
    return *destination;
}

1 – The slow one

This is identical to the one above aside from the fact that vec.size() is now stored in a local variable. This won’t give much improvement here, but instead if the vec size was being calculated on each loop iteration (e.g doing something silly like sizeof(vec)/sizeof(vec[0])) then storing the size in a local variable would be of great benefit.

// move vec.size() out of the for loop
int get_sum_of_int_vector_1(const vector<int>& vec, int* destination)
{
    size_t vector_size = vec.size();
    for (int i = 0; i < vector_size; i++)
    {
        *destination = *destination + vec[i];
    }
    return *destination;
}

2 – The ‘normal’ one

This is the ‘normal’ one because it’s the most likely implementation that programmers would create from the get-go. The biggest difference here is that the repeated writes to destination are gone from each loop iteration, instead we accumulate the result in a local variable sum and then store sum in destination at the end.

// use a local variable to accumulate the result
int get_sum_of_int_vector_2(const vector<int>& vec, int* destination)
{
    int sum = 0;
    size_t vector_size = vec.size();
    for (int i = 0; i < vector_size; i++)
    {
        sum += vec[i];
    }
    *destination = sum;
    return sum;
}

3 – Loop unrolling 1.0

Building on the previous implementation we include some minor loop unrolling (2×1 loop unrolling). Loop unrolling here essentially means performing more computations per loop iteration, the idea is that this can reduce the overhead related to loop iteration. In this instance we add two vec elements to sum on each loop iteration until we reach first_loop_limit, this is a safety bound so we don’t try access data beyond the end of vec (because we’re using i+=2).

// use 2x1 loop unrolling to perform two computations per loop iteration
int get_sum_of_int_vector_3(const vector<int>& vec, int* destination)
{
    int sum = 0;
    size_t vector_size = vec.size();
    size_t first_loop_limit = vector_size - 1;

    int i;
    for (i = 0; i < first_loop_limit; i += 2)
    {
        sum = (sum + vec[i]) + vec[i + 1];
    }

    // add the rest beyond first_loop_limit
    for (; i < vector_size; i++)
    {
        sum += vec[i];
    }

    *destination = sum;
    return sum;
}

4 – Loop unrolling 2.0

Now things are starting to get interesting, we’re adding an extra dimension (2×2 loop unrolling) onto the loop unrolling introduced above.

Using the fact that

(sum of even-indexed vec elements) + (sum of odd-indexed vec elements) = sum of all vec elements

we can introduce and use two accumulating variables sum_even and sum_odd to keep track of the even-indexed elements sum and the odd-indexed elements sum.

This allows us to break out

sum = (sum + vec[i]) + vec[i + 1]; // from Loop unrolling 1.0

into two independent expressions

 sum_even += vec[i];
 sum_odd += vec[i + 1];

because there’s no dependency between these expressions in each loop iteration, the CPU can more readily utilise instruction-level parallelism to perform the two computations simultaneously. (This is machine-dependent)

Now we have a loop-unrolled, ‘Vector sum’ implementation that should perform two computations simultaneously in each loop iteration.

// use 2x2 loop unrolling to perform two computations using two accumulators per loop iteration
int get_sum_of_int_vector_4(const vector<int>& vec, int* destination)
{
    int sum_even = 0;
    int sum_odd = 0;
    size_t vector_size = vec.size();
    size_t first_loop_limit = vector_size - 1;

    int i;
    for (i = 0; i < first_loop_limit; i += 2)
    {
        sum_even += vec[i];
        sum_odd += vec[i + 1];
    }

    // add the rest beyond first_loop_limit
    int sum_of_rest = 0;
    for (; i < vector_size; i++)
    {
        sum_of_rest += vec[i];
    }

    *destination = sum_even + sum_odd + sum_of_rest;
    return *destination;
}

Next time..

In the next post we’ll contrast and compare the execution times of the functions above – testing the assumptions made about improving performance.