cplusplus.co.il

Tuples

Posted on: 15/01/2010

One of the containers introduced within TR1 (which is already widely available – both in gcc and Visual Studio) is a Tuple type, adopted from the Boost Tuple Library. Tuple types are very convenient at times; for example, it is possible to return multiple values from a function through a tuple, or to write more intuitive and expressive code by utilizing tuples.

In this post we will examine the functionality offered by the new Tuple container, and have a go at profiling its performance. Actually, the results of said profiling were a small (pleasant) surprise to me.

Let us consider the most straightforward iterative implementation of a Fibonacci number computation:

unsigned normal_fib (unsigned n) {
    unsigned prev(1), curr(1);

    for (unsigned c = 2; c < n; ++c) {
        unsigned old = curr; // keep the previous value around
        curr += prev;
        prev = old;
    }

    return curr;
}

Nothing fancy here, of course. The problem, however, is that the facilities used in this code are not very expressive: I had to use a temporary variable, old, to keep the value of the previous number, while all I wanted to do was swap (prev, curr) with (curr, prev+curr). You may think that using a temporary variable is inefficient and that a trick to swap the values would be in order, but the compiler is actually pretty smart, and in most cases it will do a better job if such tricks are avoided.

With that said, we would like to utilize the new Tuple facility to improve expressiveness (and, ideally, get better performance as well):

#include <tr1/tuple>

// let's treat tr1 as part of std (convenient here, though strictly
// speaking the standard does not sanction adding to namespace std):
namespace std { using namespace std::tr1; }

unsigned tupled_fib (unsigned n) {
    unsigned prev(1), curr(1);

    for (unsigned c = 2; c < n; ++c)
        std::tie(prev, curr) = std::make_tuple(curr, curr + prev);

    return curr;
}

Using std::tie and the notion of tuples, we were able to provide a shorter, more intuitive body for the given function. In fact, we implemented it exactly as it was described in plain English (two paragraphs above) – this is indeed very expressive!

… .. .

Now you may be asking yourselves: what is the performance penalty, if any?

Exploiting the useful Timer class, I assembled the following test program:

#include <vector>
#include <algorithm>
#include <functional>

#include "timer.hpp" // Timer class

// normal_fib and tupled_fib are defined above, in the same file

int main () {
    std::vector<unsigned> data(1000000, 100);

    {
        Timer t("normal");
        std::for_each(data.begin(), data.end(),
                        std::ptr_fun(normal_fib));
    }

    {
        Timer t("tupled");
        std::for_each(data.begin(), data.end(),
                        std::ptr_fun(tupled_fib));
    }
}

Note that the first test is the normal one. One could argue that the test order influences the results since the first accesses to the vector are likely to generate cache misses, but the results are consistent when the order of tests is reversed.

Here is what the performance looked like with no optimizations whatsoever:

antrikot:~/work/sandbox/tr1> g++ -Wall -pedantic tuple.cc
antrikot:~/work/sandbox/tr1> ./a.out
normal: 0.62secs.
tupled: 11.41secs.

But here is the interesting part; See what happens when we use an optimization level of 2:

antrikot:~/work/sandbox/tr1> g++ -Wall -pedantic -O2 tuple.cc
antrikot:~/work/sandbox/tr1> ./a.out
normal: 0.24secs.
tupled: 0.19secs.

The tupled version is both better written (in terms of readability, maintainability and expressiveness) and slightly faster! This also makes sense: the STL relies heavily on compiler optimizations for its efficiency, so the abstraction cost of tuples largely disappears once the optimizer inlines it away.

If the benchmarking test is modified to contain a few invocations of each function in a random order (not only once like the given code suggests), the performance gap is pretty much closed. However, the difference in code quality remains.

The tests were run on my Intel Pentium M 1.73GHz laptop, with gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4).

I have also examined the generated assembly (available here): the loop body in the optimized version is implemented exactly the same for both functions (tupled and normal), while the non-optimized tupled implementation contains a lot of inefficiency. Can you spot, and explain, other major differences?
