DivergentCoder - Messing with ISPC

I just started reading about Intel's SPMD Program Compiler - ISPC - tonight and thought I would give it a quick test. If you haven't been following, a little while back I wrote a series of posts looking at data organization and SIMD optimization for a very dumb particle system - you can read over the posts starting here. To briefly recap, I set up a simple system with a 4-component position and a 4-component velocity, went over various ways to organize the data and write an update loop, and ran tests on a variety of particle counts, providing timings for those tests.

After reading about ISPC I went ahead and dusted off the code and test setup I used for those posts. Very quickly I wrote a pretty straightforward program to compile with ISPC, looking like this:

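export void simple_update(uniform float px[],
                          uniform float py[],
                          uniform float pz[],
                          uniform float pw[],
                          uniform float vx[],
                          uniform float vy[],
                          uniform float vz[],
                          uniform float vw[],
                          uniform int count,
                          uniform float dt)
{
    // Each pass updates programCount particles at once; programIndex picks
    // this program instance's element. Assumes count is a multiple of
    // programCount, otherwise the last pass runs past the end of the arrays.
    for (uniform int i=0; i<count; i += programCount)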
    {
        int index = i + programIndex;
        px[index] = px[index] + vx[index] * dt;
        py[index] = py[index] + vy[index] * dt;
        pz[index] = pz[index] + vz[index] * dt;
        pw[index] = pw[index] + vw[index] * dt;
    }
}

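task void
update_axis(uniform float pos[],
            uniform float vel[],
            uniform int count,
            uniform float dt)
{
    // Updates a single axis, programCount elements per pass.
    // Assumes count is a multiple of programCount.
    for (uniform int i=0; i<count; i += programCount)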
    {
        int index = i + programIndex;
        pos[index] = pos[index] + vel[index] * dt;
    }
} 

export void task_update(uniform float px[],
                        uniform float py[],
                        uniform float pz[],
                        uniform float pw[],
                        uniform float vx[],
                        uniform float vy[],
                        uniform float vz[],
                        uniform float vw[],
                        uniform int count,
                        uniform float dt)
{
    launch < update_axis(px, vx, count, dt) >;
    launch < update_axis(py, vy, count, dt) >;
    launch < update_axis(pz, vz, count, dt) >;
    launch < update_axis(pw, vw, count, dt) >;
}

Like I said - straightforward. As you can see I made two different implementations: the simple_update just goes through the elements, performing the update "one at a time" much like the original SOA loop I wrote in C++; the task_update behaves similarly but launches a task for each axis independently. The language itself is obviously very understandable, reminiscent of writing a simple shader program. To compile this I just do the following at the command line:

ispc -O2 --arch=x86 --target=SSE2 -o ParticleSimd.o -h ParticleSimd.h ParticleSimd.ispc

ParticleSimd.ispc is the file that contains the code above. Running this command gives me ParticleSimd.o to link with my C/C++ program and ParticleSimd.h to include from it. To use the code in a program all I have to do is call ispc::simple_update or ispc::task_update (in the case of C++ where the namespace makes sense). Pretty simple! How does it perform? I re-ran the last few test cases using this code and got the following results:
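For anyone who wants to see the calling side, here is a rough sketch of how the C++ end could look. This is not the test harness from the earlier posts. The ParticlesSoA struct, padded_count, step, and kGangSize are names made up for illustration, the gang width of 4 is an assumption for the SSE2 target, and the ispc namespace block is a scalar stand-in (with an assumed signature) so the snippet builds without the generated header and object file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar stand-in so this sketch builds on its own. In a real build, delete
// this block, #include "ParticleSimd.h" and link ParticleSimd.o instead.
// The signature is an assumption about what the generated header declares.
namespace ispc {
inline void simple_update(float px[], float py[], float pz[], float pw[],
                          float vx[], float vy[], float vz[], float vw[],
                          int32_t count, float dt) {
    for (int32_t i = 0; i < count; ++i) {
        px[i] += vx[i] * dt;
        py[i] += vy[i] * dt;
        pz[i] += vz[i] * dt;
        pw[i] += vw[i] * dt;
    }
}
}  // namespace ispc

// Assumed gang width (programCount) for the SSE2 target.
constexpr std::size_t kGangSize = 4;

// Round n up to a whole number of gangs, since the ISPC loop always
// touches programCount elements per pass.
inline std::size_t padded_count(std::size_t n) {
    return (n + kGangSize - 1) / kGangSize * kGangSize;
}

// Hypothetical SOA container: one array per component, zero-filled and
// padded so a full gang can safely run over the tail.
struct ParticlesSoA {
    std::vector<float> px, py, pz, pw, vx, vy, vz, vw;

    explicit ParticlesSoA(std::size_t n)
        : px(padded_count(n)), py(padded_count(n)),
          pz(padded_count(n)), pw(padded_count(n)),
          vx(padded_count(n)), vy(padded_count(n)),
          vz(padded_count(n)), vw(padded_count(n)) {}
};

// Advance every particle (including the zeroed padding) by dt.
inline void step(ParticlesSoA& p, float dt) {
    assert(p.px.size() % kGangSize == 0);
    ispc::simple_update(p.px.data(), p.py.data(), p.pz.data(), p.pw.data(),
                        p.vx.data(), p.vy.data(), p.vz.data(), p.vw.data(),
                        static_cast<int32_t>(p.px.size()), dt);
}
```

Passing the padded size rather than the live particle count is the design choice worth noting: the padding elements have zero velocity, so updating them is harmless and the kernel never has to deal with a partial gang.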

So, interesting results - not that much different from what I already had but still a little bit faster for the task-based update. It seems like there is some definite potential here for some cool stuff, so I'll probably dig into it a little deeper and see what I can get out of it. Fun.