DivergentCoder - AoS & SoA Explorations Part 1

I thought I would go ahead and give a very simple example of something I am sure many people reading this are already at least somewhat familiar with. Going by my experience talking with people about this there are at least some of you out there who are somewhat familiar with the topic but haven't really done anything with it and aren't quite sure whether it is all that big of a deal. Since this is an area I am exploring myself right now, I thought I would add some fuel to the fire for those of you out there who fall in to this area by providing some simple examples and some hard numbers.

For those of you who are completely lost, if you haven't stopped reading already, a simple overview: AoS and SoA refer to "Array of Structures" and "Structure of Arrays" respectively. These two terms refer to two different ways of laying out your data in memory. The names are pretty self explanatory, but to be sure - an array of structures is what you are probably most familiar with, grouping properties of an object together and making an array of those objects in memory, whereas a structure of arrays would be a single structure in which you make an array (sized according to how many instances of your object you want) for each property. If you are wondering why one would bother with this difference - the argument is that a structure of arrays can allow for better cache utilization, making better use of each read you make from memory, separating out data you care about for a particular computation from data you don't, as well as providing a more effective route to vectorizing your code (i.e. using SSE, AltiVec, AVX, etc.).

For my exploration of the practical differences between the two approaches, I will go ahead and use the tried and true example of a particle system. This particle system is going to start off very simply - each particle will simply have a four component "position" and a four component "velocity" that gets multiplied by our dt and added to the position during each update. The AoS/SoA implementations for this particle system will also start off in a very straightforward manner and I will be adding some complication to them later.

In the AoS approach we end up with something like this:

struct Particle
{
    float x;
    float y;
    float z;
    float w;

    float vx;
    float vy;
    float vz;
    float vw;
};

struct ParticleSystem
{
    Particle * particles;
    int count;

    void update(float dt)
    {
        for (int i=0; i<count; i++)
        {
            x += vx * dt;
            y += vy * dt;
            z += vz * dt;
            w += vw * dt;
        }
    }
};

Pretty standard fare. For my test setup I am making 128 of these particle systems, each with the same number of particles and timing how long it takes to update all 128 systems 60 times. I am running the test with particle counts going from 16 up to 131072, doubling the count for each test. I am measuring the timing using QueryPerformanceCounter. Using this test setup I obtain the following results running on the P8400 (Core 2 Duo @ 2.26 GHz) inside my laptop:

As you can see, the time it takes to update pretty neatly doubles with the number of particles being updated, right up until between 512 & 1024 particles per emitter where it quadruples and then resumes its doubling again.

Now, we will look at the SoA version:

struct ParticleSystem
{
    float * x;
    float * y;
    float * z;
    float * w;

    float * vx;
    float * vy;
    float * vz;
    float * vw;

    int count;

    void update(float dt)
    {
        for (int i=0; i<count; i++)
        {
            x[i] += vx[i] * dt;
            y[i] += vy[i] * dt;
            z[i] += vz[i] * dt;
            w[i] += vw[i] * dt;
        }
    }
};

As you can see it is not really all that different other than the important point that all instances of an individual property are stored together. Repeating the same test for this structure gives the following results:

We see here that the time to update just about doubles all the way up until between 32768 ppe and 131072 ppe. We also see that this SoA implementation starts off slower than the AoS implementation, only becoming faster once we reach 1024 particles. It then remains faster until we reach 65536 particles, at which point the performance starts to become pretty atrocious for the SoA version.

That's it for the first part. In the next part I will take a closer look at those performance cliffs and I will start to try and make these straightforward implementations perform a little better. After that I will look a little more closely at some of the real strength on the SoA side by vectorizing the code and adding in some more data to our rather simplistic particle system.