2. C++ Vector Sum Performance

Vector sum: to sum up the elements of an integer vector and store the result in some destination.

Following on from the previous post we’re now going to look at some rough timing measurements to compare function performance.

To talk about performance we need to create some simple benchmarking code which can approximate a functions execution time, the details of this can be found here in the projects github (the benchmarking code is in main.cpp). To approximate a functions execution time we repeat it X number of times and calculate the average value.

Results

To attain the results below, a vector of 10,000 elements is used and each function is executed 100,000 times.

As mentioned in the previous post, we’re using code that hasn’t been optimised by the compiler to better see the effect of our changes.
The times listed below are in milliseconds.

0 – The slowest one

int get_sum_of_int_vector_0(const vector<int>& vec, int* destination)
{
    for (int i = 0; i < vec.size(); i++)
    {
        *destination = *destination + vec[i];
    }
    return *destination;
}

TOTAL TIME: 4073.5
TOTAL ITERATIONS: 100,000
AVERAGE TIME: 0.040735

As expected, the performance of this function is pretty poor with a total execution time of just over 4 seconds.

1 – The slow one

// move vec.size() out of the for loop
int get_sum_of_int_vector_1(const vector<int>& vec, int* destination)
{
    size_t vector_size = vec.size();
    for (int i = 0; i < vector_size; i++)
    {
        *destination = *destination + vec[i];
    }
    return *destination;
}

TOTAL TIME: 4012
TOTAL ITERATIONS: 100,000
AVERAGE TIME: 0.04012

A very measly performance gain can be seen here by moving vec.size() out of the for loop, so measly in fact that it could be disregarded but looking at the assembly we can see that some instructions have been optimised away for the comparison of i to the loops upper limit.

*Assembly comparison – function 0 on the left and 1 on the right.*

2 – The ‘normal’ one

// use a local variable to accumulate the result
int get_sum_of_int_vector_2(const vector<int>& vec, int* destination)
{
    int sum = 0;
    size_t vector_size = vec.size();
    for (int i = 0; i < vector_size; i++)
    {
        sum += vec[i];
    }
    *destination = sum;
    return sum;
}

TOTAL TIME: 2560.16
TOTAL ITERATIONS: 100,000
AVERAGE TIME: 0.0256016

Now our execution time is down to just over 2.5 seconds, a big relative performance increase over the previous two functions! By simply accumulating the result in a local variable we remove a vector_size number of writes and reads to destination, as can be seen in the assembly below.

*Assembly comparison – function 1 on the left and 2 on the right.*

3 – Loop unrolling 1.0

// use 2x1 loop unrolling to perform two computations per loop iteration
int get_sum_of_int_vector_3(const vector<int>& vec, int* destination)
{
    int sum = 0;
    size_t vector_size = vec.size();
    size_t first_loop_limit = vector_size - 1;

    int i;
    for (i = 0; i < first_loop_limit; i += 2)
    {
        sum = (sum + vec[i]) + vec[i + 1];
    }

    // add the rest beyond first_loop_limit
    for (; i < vector_size; i++)
    {
        sum += vec[i];
    }

    *destination = sum;
    return sum;
}

TOTAL TIME: 2470.61
TOTAL ITERATIONS: 100,000
AVERAGE TIME: 0.0247061

There’s no major improvement to see here, though we are doing more ‘work’ per loop iteration we’re not yet taking advantage of instruction level parallelism (ILP) – we’re still carrying out two sums in sequence here: sum = (sum + vec[i]) + vec[i + 1];

4 – Loop unrolling 2.0

TOTAL TIME: 1643.42
TOTAL ITERATIONS: 100,000
AVERAGE TIME: 0.0164342

// use 2x2 loop unrolling to perform two computations using two accumulators per loop iteration
int get_sum_of_int_vector_4(const vector<int>& vec, int* destination)
{
    int sum_even = 0;
    int sum_odd = 0;
    size_t vector_size = vec.size();
    size_t first_loop_limit = vector_size - 1;

    int i;
    for (i = 0; i < first_loop_limit; i += 2)
    {
        sum_even += vec[i];
        sum_odd += vec[i + 1];
    }

    // add the rest beyond first_loop_limit
    int sum_of_rest = 0;
    for (; i < vector_size; i++)
    {
        sum_of_rest += vec[i];
    }

    *destination = sum_even + sum_odd + sum_of_rest;
    return *destination;
}

Another increase in performance, from 2.47 seconds to 1.64 seconds! By using the even and odd accumulator variables we can take advantage of ILP in the main loop body (executing two arithmetic operations at the same time).

Conclusion

We managed to significantly increase the performance of our vector-sum function, dropping execution time (according to the benchmark) from just over 4 seconds to 1.6 seconds.

The key ideas are

Removing unnecessary reads and writes
Using local accumulator variables
Using minor loop unrolling and taking advantage of ILP when possible