DivergentCoder - AoS & SoA Explorations Part 3

If you've been following along, I have been looking at Array of Structures (AoS) and Structure of Arrays (SoA) through the simple example of a particle system. If you haven't, you can read up here and here. In this final post I am going to take a brief look at just two more things: what happens when more data is involved, and how a hybrid approach compares.

Thus far I have been looking at a very simple particle system, one in which we really only have a position vector and a velocity vector that we use to modify the position during every update. Realistically though, we would likely have more than just that to deal with if we were building an actual particle system. For example, we would probably want to assign a color to each of our particles. In the case of SoA we don't particularly care - the data is allocated and initialized and the next time we care about it will probably be when we render the particles. Indeed, taking my SoA ParticleSystem structure from previous posts and adding in a new array of particle colors and then re-running my tests shows no sign of any performance impact. In fact, I can keep adding in array after array of cold data (data I'm not touching during the update) and not see any performance degradation - it simply isn't a factor.
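To make the cold-data point concrete, here is a minimal sketch of an SoA layout with a color array added. It is not the ParticleSystem from the earlier posts: the struct name, the use of std::vector, and the scalar update loop are all assumptions chosen to keep the example short. The thing to notice is that update() only ever walks the position and velocity arrays, so the color array never enters the loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical SoA layout (illustrative names): one array per component,
// plus a "cold" color array that update() never reads or writes.
struct SoAParticles
{
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<std::uint32_t> color; // cold during update

    explicit SoAParticles(std::size_t n)
        : x(n), y(n), z(n), vx(n), vy(n), vz(n), color(n) {}

    // Scalar for brevity. Walks the six float arrays only; color is untouched.
    void update(float dt)
    {
        const std::size_t n = x.size();
        for (std::size_t i = 0; i < n; ++i)
        {
            x[i] += vx[i] * dt;
            y[i] += vy[i] * dt;
            z[i] += vz[i] * dt;
        }
    }

    // Bytes of particle data update() touches per particle: six floats,
    // no matter how many cold arrays are added alongside them.
    static constexpr std::size_t hotBytesPerParticle() { return 6 * sizeof(float); }
};
```

Adding a second or third cold array to this struct changes the constructor and nothing else, which matches what the measurements above suggest.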

What happens in the AoS case though? If I take the AoS data structure and add another 32-bit integer to it I end up with a 36-byte structure. Because 36 is not a multiple of 16, most particles in the array no longer start on a 16-byte boundary, so my position and velocity vectors lose their nice neat alignment and I have to modify my SSE update loop to use unaligned loads/stores. This gives us something like the following:

#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_loadu_ps, _mm_storeu_ps, ...

struct Particle
{
    float x;
    float y;
    float z;
    float w;

    float vx;
    float vy;
    float vz;
    float vw;

    unsigned int color;
};

struct ParticleSystem
{
    Particle * particles;
    int count;

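    void update(float dt)
    {
        // Broadcast dt into all four lanes.
        __m128 vdt = _mm_load1_ps(&dt);
        for (int i=0; i<count; i++)
        {
            // With a 36-byte stride most elements are not 16-byte aligned,
            // so the unaligned load/store variants are required.
            __m128 pt = _mm_loadu_ps(&particles[i].x);
            __m128 vel = _mm_loadu_ps(&particles[i].vx);

            // position += velocity * dt
            pt = _mm_add_ps(pt, _mm_mul_ps(vel, vdt));

            _mm_storeu_ps(&particles[i].x, pt);
        }
    }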
};

Re-running the same test I used in the previous two posts we obtain the following results:

To be sure, the unaligned memory accesses are slower and I could have just padded the data structure, but testing showed that to be even worse. In practice this might not be an issue since you will likely have other data in your particle structure if you are going the AoS route. Regardless, we can see how the addition of the color data has affected the performance of the update.
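The alignment arithmetic behind that trade-off can be checked at compile time. The sketch below uses illustrative struct names (Particle36, Particle48) and a hypothetical helper, positionAligned, none of which appear in the code above. It assumes the array itself starts on a 16-byte boundary and that the platform has 4-byte float and unsigned int, which holds on the usual desktop targets.

```cpp
#include <cstddef>
#include <cstdint>

// Same fields as the 36-byte Particle above (name is illustrative).
struct Particle36
{
    float x, y, z, w;
    float vx, vy, vz, vw;
    std::uint32_t color;
};

// Padded variant: alignas(16) rounds the size up to the next multiple of 16.
struct alignas(16) Particle48
{
    float x, y, z, w;
    float vx, vy, vz, vw;
    std::uint32_t color;
};

static_assert(sizeof(Particle36) == 36, "nine 4-byte fields, no padding");
static_assert(sizeof(Particle48) == 48, "padded up to a multiple of 16");

// Hypothetical helper: given a 16-byte-aligned array base, is the position
// of element i (offset 0 within the element) on a 16-byte boundary?
constexpr bool positionAligned(std::size_t elemSize, std::size_t i)
{
    return (elemSize * i) % 16 == 0;
}
```

With the 36-byte layout only every fourth particle lands on a 16-byte boundary, hence the unaligned loads. The padded layout is always aligned but moves 48 bytes per particle instead of 36. That extra memory traffic is one plausible reason padding tested slower, though that is a guess rather than something measured here.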

The next thing I want to look at briefly is a hybrid of the two approaches. In this approach we store data that we know will be used together in a structure and make arrays of those structures. In each of these structures we can store groups of each property according to the width of our SIMD registers (so 4 floats in the case of SSE) to aid in vectorization. For our simple particle system it might make sense to store the position and the velocity for one axis together in one of these structures. This leads us to something like this:

struct ParticleSystem
{
    struct SSEVec2
    {
        float pos[4];
        float vel[4];
    };

    SSEVec2 * x;
    SSEVec2 * y;
    SSEVec2 * z;
    SSEVec2 * w;

    int count;

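    void update(float dt)
    {
        // Processes count/4 blocks of four particles; a remainder
        // (count % 4) would be skipped. _mm_load_ps/_mm_store_ps require
        // 16-byte-aligned addresses, so each array must be allocated that way.
        __m128 vdt = _mm_load1_ps(&dt);
        for (int i=0; i<(count/4); i++)
        {
            __m128 pos = _mm_load_ps(x[i].pos);
            __m128 vel = _mm_load_ps(x[i].vel);
            pos = _mm_add_ps(pos, _mm_mul_ps(vel, vdt));
            _mm_store_ps(x[i].pos, pos);
        }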
        for (int i=0; i<(count/4); i++)
        {    
            __m128 pos = _mm_load_ps(y[i].pos);
            __m128 vel = _mm_load_ps(y[i].vel);
            pos = _mm_add_ps(pos, _mm_mul_ps(vel, vdt));
            _mm_store_ps(y[i].pos, pos);
        }
        for (int i=0; i<(count/4); i++)
        {    
            __m128 pos = _mm_load_ps(z[i].pos);
            __m128 vel = _mm_load_ps(z[i].vel);
            pos = _mm_add_ps(pos, _mm_mul_ps(vel, vdt));
            _mm_store_ps(z[i].pos, pos);
        }
        for (int i=0; i<(count/4); i++)
        {    
            __m128 pos = _mm_load_ps(w[i].pos);
            __m128 vel = _mm_load_ps(w[i].vel);
            pos = _mm_add_ps(pos, _mm_mul_ps(vel, vdt));
            _mm_store_ps(w[i].pos, pos);
        }
    }
};
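One thing the listing above leaves implicit is how a single particle index maps into this blocked layout, and what to do when the particle count is not a multiple of four. The following is a rough sketch of one way to handle both. The helpers blockCount, posAt, velAt and updateAxis are hypothetical names, the alignas(16) on SSEVec2 is an addition, and the update is written as a scalar loop so the example does not depend on SSE. It assumes C++17 or later, where std::vector respects the alignas request.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors SSEVec2 from the listing above; alignas(16) is an addition so
// that aligned SSE loads would be valid on storage allocated for it.
struct alignas(16) SSEVec2
{
    float pos[4];
    float vel[4];
};

// Hypothetical helper: blocks needed for `count` particles, rounded up so
// a count that is not a multiple of 4 still fits.
inline std::size_t blockCount(std::size_t count) { return (count + 3) / 4; }

// Hypothetical accessors: particle i lives in block i / 4, lane i % 4.
inline float& posAt(std::vector<SSEVec2>& axis, std::size_t i)
{
    assert(i / 4 < axis.size());
    return axis[i / 4].pos[i % 4];
}

inline float& velAt(std::vector<SSEVec2>& axis, std::size_t i)
{
    assert(i / 4 < axis.size());
    return axis[i / 4].vel[i % 4];
}

// Scalar stand-in for one of the per-axis SSE loops. Whole blocks are
// updated, so unused padding lanes in the last block are processed too;
// they start at zero and stay at zero.
inline void updateAxis(std::vector<SSEVec2>& axis, float dt)
{
    for (SSEVec2& block : axis)
        for (int lane = 0; lane < 4; ++lane)
            block.pos[lane] += block.vel[lane] * dt;
}
```

For six particles, std::vector<SSEVec2> x(blockCount(6)) allocates two zero-initialized blocks, and particle 5 ends up in x[1], lane 1. Rounding the allocation up like this trades a few wasted lanes for not needing a scalar cleanup loop.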

Applying this approach to our simple particle system, we get the following results (compared with the SoA and AoS SSE implementations):

So - not quite what I was hoping for. The point of doing something like this (rather than straight up SoA) is to reduce the number of data streams we have to manage, and with it the overhead of managing them. In this particular case it doesn't seem to have done much for us: it performs slightly worse than even the more straightforward AoS SSE implementation. Of course, there are several ways we could structure our data in this hybrid approach and I tried a few of them without much luck, but I would be interested to hear about other approaches. I thought it worthwhile to share these results even though they are negative (and maybe particularly because of that).

That covers it for the time being for this series of posts. If there is a point to all of this then I suppose it would be to, of course, know your data and know how your data will be accessed. But also, don't just assume - actually try some different approaches out and see what works and how things compare. I hope these three posts have been somewhat informative to some of you out there - I know it was informative for me putting them together.
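If you want to try comparisons like these yourself, a timing helper along the following lines is enough to get started. This is only a sketch and not the harness used for the results in these posts: averageMillis is a hypothetical name, and a real benchmark would also want warm-up runs, several repetitions, and some care that the compiler does not optimize the work away.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls `fn` `iterations` times and returns the
// average wall-clock time per call in milliseconds (0 if iterations is 0).
template <typename Fn>
double averageMillis(Fn&& fn, std::size_t iterations)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        fn();
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return iterations ? elapsed.count() / static_cast<double>(iterations) : 0.0;
}
```

Usage would look something like averageMillis([&] { particles.update(dt); }, 1000), run once per layout on the same particle count, where particles and dt are whatever system and time step you are testing.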