After spending some time talking with Doug McNabb at GDC about this I got the bug to go back and take a look at the hybrid "AoSoA" approach I presented in my previous post and see if I couldn't improve the speed of that even more. As it turns out, I was able to make one simple change that significantly improved performance over what I had been getting.
If you'll recall, in this hybrid approach I was storing four different arrays of structures where each structure held 8 floats of information - the position and velocity for four particles along a particular axis. The idea being that we are storing some data together that will be used together in order to reduce the number of data streams we have to maintain relative to an SoA approach while still getting some of the benefits that come with SoA. Unfortunately, just doing this yielded some disappointing results, performing roughly on par with an SSE implementation utilizing a more straightforward AoS approach to storing the data.
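For reference, that earlier layout looked something like the sketch below (the struct name here is just illustrative, not necessarily what the previous post used) - one axis worth of position and velocity for four particles, interleaved per struct:

```cpp
// A sketch of the earlier 8-float hybrid layout (struct name is illustrative):
// position and velocity along one axis for four particles, interleaved.
struct SSEVec
{
    float pos[4];   // positions of particles i..i+3 along this axis
    float vel[4];   // velocities of particles i..i+3 along this axis
};
// 32 bytes total - two of these share a typical 64-byte cache line, so a
// dirty line always carries 32 bytes of velocity data that never changed.
```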
Playing around with this, I hit upon the idea of trying to increase the amount of data stored in each structure. Namely, I ended up storing 16 floats of position followed by 16 floats of velocity for a single axis inside a single one of these structures. This improved performance dramatically, bringing it right up to the same level as the SoA approach.
What we end up with is a structure that looks like this:
struct SSEVec2
{
    float pos[16];
    float vel[16];
};

struct ParticleSystem
{
    SSEVec2 * x;
    SSEVec2 * y;
    SSEVec2 * z;
    SSEVec2 * w;
    int count;

    // Integrate 16 particles along one axis: pos += vel * dt,
    // four SSE lanes at a time.
    __inline void update_loop(__m128 dt, SSEVec2 & entry)
    {
        __m128 pos = _mm_load_ps(entry.pos);
        __m128 vel = _mm_load_ps(entry.vel);
        pos = _mm_add_ps(pos, _mm_mul_ps(vel, dt));
        _mm_store_ps(entry.pos, pos);

        pos = _mm_load_ps(entry.pos + 4);
        vel = _mm_load_ps(entry.vel + 4);
        pos = _mm_add_ps(pos, _mm_mul_ps(vel, dt));
        _mm_store_ps(entry.pos + 4, pos);

        pos = _mm_load_ps(entry.pos + 8);
        vel = _mm_load_ps(entry.vel + 8);
        pos = _mm_add_ps(pos, _mm_mul_ps(vel, dt));
        _mm_store_ps(entry.pos + 8, pos);

        pos = _mm_load_ps(entry.pos + 12);
        vel = _mm_load_ps(entry.vel + 12);
        pos = _mm_add_ps(pos, _mm_mul_ps(vel, dt));
        _mm_store_ps(entry.pos + 12, pos);
    }

    void update(float dt)
    {
        __m128 vdt = _mm_load1_ps(&dt);
        for (int i = 0; i < (count / 16); i++)
            update_loop(vdt, x[i]);
        for (int i = 0; i < (count / 16); i++)
            update_loop(vdt, y[i]);
        for (int i = 0; i < (count / 16); i++)
            update_loop(vdt, z[i]);
        for (int i = 0; i < (count / 16); i++)
            update_loop(vdt, w[i]);
    }
};
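One practical note: `_mm_load_ps` and `_mm_store_ps` require 16-byte-aligned addresses, so the four `SSEVec2` arrays need to come from an aligned allocator. A minimal sketch of allocating one axis (the helper names here are my own, not from the post):

```cpp
#include <immintrin.h>   // _mm_malloc / _mm_free and the SSE intrinsics
#include <cstdint>

struct SSEVec2 { float pos[16]; float vel[16]; };   // as above

// Allocate storage for one axis; 'count' is assumed to be a multiple of 16.
// _mm_malloc guarantees the 16-byte alignment that _mm_load_ps and
// _mm_store_ps demand - plain malloc/new does not promise that.
inline SSEVec2 * alloc_axis(int count)
{
    return (SSEVec2 *)_mm_malloc((count / 16) * sizeof(SSEVec2), 16);
}

inline void free_axis(SSEVec2 * axis)
{
    _mm_free(axis);
}
```

The same allocation would be repeated for the y, z, and w arrays before calling update each frame.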
Which gives us the following results:
So - why does this improve performance so much? My first thought has to do with memory performance, and in particular with spending less time tying up the bus doing writes. Typically, when you modify something in memory you don't write directly to RAM but rather to the CPU cache, with modified cache lines being flushed back to main memory later. In our original structure, each 32-byte entry held 16 bytes of position next to 16 bytes of velocity, so every 64-byte cache line flushed back to main memory was half full of velocity data that had not changed at all. By packing 16 positions together, pos[16] is exactly 64 bytes - a full cache line (on a typical 64-byte-line machine) containing nothing but modified position data - so each write-back does twice the useful work it did before.
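The back-of-envelope numbers behind that argument, assuming a typical 64-byte cache line (the line size is my assumption, not measured):

```cpp
#include <cstddef>

// Assumed cache line size for the write-back argument (typical for x86).
const std::size_t kCacheLine = 64;

// Old hybrid layout: pos[4] + vel[4] = 32 bytes per struct, two structs per
// line, so a dirty line carries only 2 * 16 bytes of actually-modified data.
const std::size_t kOldUsefulBytesPerLine = 2 * 4 * sizeof(float);   // 32

// New layout: pos[16] = 64 bytes fills an entire dirty line with updates.
const std::size_t kNewUsefulBytesPerLine = 16 * sizeof(float);      // 64
```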