In my previous post I started exploring and benchmarking a simple test system using an Array of Structures (AoS) approach and a Structure of Arrays (SoA) approach. For this simple particle system test case, using two rather straightforward implementations, I found the SoA implementation to be faster on my Core 2 Duo with particle counts between 1024 and 32768. For the last two test cases of 65536 and 131072 particles, however, the SoA implementation started to lose ground dramatically. It is this performance cliff that I want to take a closer look at in the first part of this post.
As a refresher, the code for our SoA update loop looked like this:
for (int i=0; i<count; i++)
{
    x[i] += vx[i] * dt;
    y[i] += vy[i] * dt;
    z[i] += vz[i] * dt;
    w[i] += vw[i] * dt;
}
So what can we do to improve the speed here? Looking at the assembly, I can see that the compiler is not unrolling this loop at all, and I know that unrolling can give a nice speedup in loops like this. Using a little bit of template programming we can coax the compiler into doing it, and make it easy to experiment with different levels of unrolling without any pain. This gives us some new code that looks like this:
template<int UNROLL>
void loop(float * a, float * b, int i, float dt)
{
    a[i] += b[i] * dt;
    loop<UNROLL - 1>(a, b, i + 1, dt);
}

template<>
void loop<0>(float * a, float * b, int i, float dt) {}

void update(float dt)
{
    for (int i=0; i<count; i+=16)
    {
        loop<16>(x, vx, i, dt);
        loop<16>(y, vy, i, dt);
        loop<16>(z, vz, i, dt);
        loop<16>(w, vw, i, dt);
    }
}
From testing, I found that unrolling the loop into blocks of 16 gave me the best speed boost. If you'll recall, previously our SoA code took around 10.33 seconds to perform the test when the particles per emitter was 65536. With just this simple change the time to perform the test has decreased to around 5.42 seconds - a pretty massive improvement! In fact it also brings the SoA performance back ahead of the AoS performance for this test case, which was at about 7.7 seconds. Moving on to the next test case of 131072 particles per emitter, the new SoA update loop clocks in at 10.90 seconds versus 30.23 seconds (nearly triple!) for the original SoA loop and 15.44 seconds for the AoS loop. Attempting a similar loop unrolling for the AoS loop brought no such improvement.
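For context, the kind of wall-clock harness behind numbers like these can be sketched in a few lines. This is an illustrative sketch, not the post's actual test code: std::chrono stands in here as a portable timer, and the inner body is the scalar SoA position update for a single stream.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Illustrative timing sketch: run an update loop repeatedly and return the
// elapsed wall-clock time in seconds. The inner body is the scalar SoA
// update (x += vx * dt) for one stream; std::chrono is a portable stand-in
// for a platform-specific high-resolution timer.
double time_updates(std::vector<float> & x, std::vector<float> & vx,
                    float dt, int iterations)
{
    using clock = std::chrono::steady_clock;
    clock::time_point start = clock::now();
    for (int it = 0; it < iterations; ++it)
    {
        for (std::size_t i = 0; i < x.size(); ++i)
            x[i] += vx[i] * dt;
    }
    std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count();
}
```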
The next thing I would like to look at is the effect of vectorizing our loops. Now, if I tell the compiler (in this case VC++2008) to make use of SSE it should go ahead and do the work for us. Great! Unfortunately, when I do so the code doesn't end up any faster. Why? Well, looking at the disassembly again, I can see that the compiler is indeed using SSE now for both the AoS and SoA update loops - but it is using all scalar instructions, doing the work one element at a time. I am no SSE guru (far from it), but I know we can do better than that. So I am going to break out some intrinsics and see where it takes me.
Using SSE intrinsics leads me to the following AoS update loop:
void update(float dt)
{
    __m128 vdt = _mm_load1_ps(&dt);
    for (int i=0; i<count; i++)
    {
        __m128 pt  = _mm_load_ps(&particles[i].x);
        __m128 vel = _mm_load_ps(&particles[i].vx);
        pt = _mm_add_ps(pt, _mm_mul_ps(vel, vdt));
        _mm_store_ps(&particles[i].x, pt);
    }
}
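Note that this loop only works if the particle struct groups its four position floats and four velocity floats contiguously, each group starting on a 16-byte boundary, so that _mm_load_ps(&particles[i].x) and _mm_load_ps(&particles[i].vx) are legal aligned loads. The post doesn't show the struct, but a plausible reconstruction looks like this (alignas is the modern spelling; VC++2008 would use __declspec(align(16))):

```cpp
#include <cstddef>

// Hypothetical reconstruction of the AoS particle layout implied by the
// intrinsics above: four position floats followed by four velocity floats,
// each group forming one 16-byte-aligned __m128's worth of data.
struct alignas(16) Particle
{
    float x, y, z, w;     // position block: one aligned __m128 load
    float vx, vy, vz, vw; // velocity block: one aligned __m128 load
};

static_assert(sizeof(Particle) == 32,
              "32 bytes per particle, matching the cache math later on");
static_assert(offsetof(Particle, vx) == 16,
              "velocity block starts on a 16-byte boundary");
```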
My SSE version of the SoA update loop looks like this:
template<int UNROLL>
void loop(float * a, float * b, int i, __m128 vdt)
{
    __m128 va = _mm_load_ps(a + i);
    __m128 vb = _mm_load_ps(b + i);
    va = _mm_add_ps(va, _mm_mul_ps(vb, vdt));
    _mm_store_ps(a + i, va);
    loop<UNROLL - 4>(a, b, i + 4, vdt);
}

template<> void loop<0>(float * a, float * b, int i, __m128 vdt) {}

void update(float dt)
{
    __m128 vdt = _mm_load1_ps(&dt);
    for (int i=0; i<count; i+=16)
    {
        loop<16>(x, vx, i, vdt);
        loop<16>(y, vy, i, vdt);
        loop<16>(z, vz, i, vdt);
        loop<16>(w, vw, i, vdt);
    }
}
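One caveat with both SSE loops: _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses, so the x/y/z/w and vx/vy/vz/vw streams can't come from a plain new[]. The post doesn't show the allocation, but a minimal sketch using _mm_malloc (the function names here are my own) might look like:

```cpp
#include <xmmintrin.h> // _mm_malloc / _mm_free

// Hypothetical allocation helpers for one SoA stream. The aligned loads and
// stores in the update loops fault on unaligned addresses, so each stream is
// allocated on a 16-byte boundary; count is assumed to be a multiple of 16
// to match the unrolled loop.
float * alloc_stream(int count)
{
    return static_cast<float *>(_mm_malloc(count * sizeof(float), 16));
}

void free_stream(float * p)
{
    _mm_free(p);
}
```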
I will repeat the same test on these loops as in my previous post. As a refresher - I am making 128 of these particle systems, each with the same number of particles, and timing how long it takes to update all 128 systems 60 times. I am running the test with particle counts going from 16 up to 131072, doubling the count for each test. I am measuring the timing using QueryPerformanceCounter. Using this test setup I obtain the following results running on the P8400 (Core 2 Duo @ 2.26 GHz) inside my laptop:
For both the AoS loop and the SoA loop we can see that the SSE versions perform significantly faster than the original loops when the particles per emitter is less than 1024. After that point the SSE loops are both still generally faster than their non-SSE counterparts, but not by much. Now, I should point out that there isn't really anything special about hitting 1024 particles per emitter - it is more an artifact of how I have constructed this test. With 32 bytes of data per particle, 512 particles per emitter and 128 emitters, the entire working data set is 2 MB - less than the size of my L2 cache (3 MB). Once I go up to 1024 ppe, the working set is 4 MB, I have blown past my L2 cache size, and performance predictably suffers. Indeed, if I limit the test to 1 emitter then I get a pretty consistent doubling in execution time all the way up to 65536 ppe. If I were going to continue to optimize this then I would probably focus on this aspect of the performance.
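The arithmetic behind that cache cliff is worth spelling out. Using the numbers from the test setup, a tiny helper makes the crossover point obvious:

```cpp
// Working-set arithmetic from the cache discussion: 32 bytes per particle
// (4 position + 4 velocity floats) times particles-per-emitter times the
// 128 emitters in the test.
long long working_set_bytes(long long particles_per_emitter)
{
    const long long bytes_per_particle = 32;
    const long long emitters = 128;
    return bytes_per_particle * particles_per_emitter * emitters;
}
```

At 512 ppe this gives 2 MB, which fits in the 3 MB L2; at 1024 ppe it gives 4 MB, which does not - right where the SSE loops stop pulling ahead.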
I think that about covers it for this post. I was going to touch on one big advantage of SoA in being able to separate hot data from cold data versus an AoS approach as well as take a look at a hybrid AoS/SoA approach as per @mcnabbd's suggestion, but time is short so I will save that for another post.