After my last post I spent some more time playing with ISPC, this time looking at a different problem: taking a single-channel texture of 32-bit floating point values and downsampling it by taking the minimum value from each 2x2 block of texels. The scalar code for this is pretty simple:
```cpp
#include <algorithm>

using std::min;

void downsample(float * src, float * dst, int width, int height)
{
    for (int y = 0; y < height; y += 2)
    {
        int offs = y * width;
        for (int x = 0; x < (width / 2); x++)
        {
            // Indices of the four texels in the 2x2 source block.
            int x0 = 2 * x;
            int i0 = offs + x0;
            int i1 = offs + x0 + 1;
            int i2 = offs + width + x0;
            int i3 = offs + width + x0 + 1;
            // Destination index: (y / 2) * (width / 2) + x.
            int d0 = (offs / 4) + x;
            dst[d0] = min(min(src[i0], src[i1]), min(src[i2], src[i3]));
        }
    }
}
```
To time this I disabled SpeedStep and used RDTSC to count the number of cycles spent on a single downsample, reporting the minimum over 2000 iterations and making sure to dirty the cache between iterations; a sketch of the harness follows the table. All tests presented here are from running on the P8400 Core 2 Duo inside my laptop. I tested the code by downsampling from three different resolutions: 2048x2048, 1024x1024 and 512x512. The results for the scalar code (in millions of cycles):
|        | 2048   | 1024  | 512   |
|--------|--------|-------|-------|
| Scalar | 12.323 | 3.133 | 0.657 |
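For reference, the measurement loop looks roughly like this. This is a sketch rather than the exact harness; in particular the dirty-buffer size and the use of the `__rdtsc` intrinsic are my assumptions:

```cpp
#include <intrin.h>   // __rdtsc on MSVC; GCC/Clang expose it via <x86intrin.h>
#include <string.h>

void downsample(float * src, float * dst, int width, int height);

// Report the minimum cycle count over 2000 runs, dirtying the cache
// before each run so that every iteration sees cold data.
unsigned long long time_downsample(float * src, float * dst, int width, int height)
{
    static char dirt[8 * 1024 * 1024]; // comfortably larger than the P8400's 3MB L2
    unsigned long long best = ~0ull;
    for (int i = 0; i < 2000; i++)
    {
        memset(dirt, i, sizeof(dirt)); // touch a large buffer to evict src and dst
        unsigned long long t0 = __rdtsc();
        downsample(src, dst, width, height);
        unsigned long long t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```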
How much better can we do with ISPC? My first stab at implementing this in ISPC was a pretty straightforward translation of the scalar code:
```c
export void downsample(uniform float input[], uniform float output[], uniform int width, uniform int height)
{
    for (uniform int y = 0; y < height; y += 2)
    {
        uniform int offs = y * width;
        for (uniform int x = 0; x < (width >> 1); x += programCount)
        {
            // x counts output texels; each program instance handles one,
            // so the four source reads below are strided by two across
            // the gang - this is what forces the gather.
            int x0 = (x + programIndex) << 1;
            int i0 = offs + x0;
            int i1 = offs + x0 + 1;
            int i2 = offs + width + x0;
            int i3 = offs + width + x0 + 1;
            int d0 = (offs >> 2) + x + programIndex;
            output[d0] = min(min(input[i0], input[i1]), min(input[i2], input[i3]));
        }
    }
}
```
Upon compiling this you will notice that ISPC complains that it cannot compute the expression feeding the assignment to the output array without performing a gather - not ideal for performance. Indeed, timing this bit of code yields results that are only marginally better than the original scalar code. Disappointed with this, I took some time to think about how I would implement the downsample using intrinsics, and came up with the bit of code shown after the sketch below, which works on the problem in 8x2 blocks of texels.
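To make the cost of that gather concrete: SSE-class hardware has no gather instruction, so a varying load through computed indices has to be expanded into one scalar load per program instance plus a repack. A hypothetical four-wide sketch of a single such load - not the actual code ISPC generates:

```cpp
#include <xmmintrin.h>

// Four scalar loads and a repack stand in for what would ideally be a
// single vector load; something along these lines is needed for every
// varying load through computed indices in the inner loop above.
static __m128 gather4(const float * base, int i0, int i1, int i2, int i3)
{
    return _mm_set_ps(base[i3], base[i2], base[i1], base[i0]);
}
```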
```cpp
#include <xmmintrin.h>

void downsample(float * src, float * dst, int width, int height)
{
    float * src0 = src;          // current scanline
    float * src1 = src + width;  // scanline below it
    for (int y = 0; y < height; y += 2)
    {
        for (int x = 0; x < width; x += 8)
        {
            // Load an 8x2 block of texels.
            __m128 r00 = _mm_load_ps(src0);
            __m128 r01 = _mm_load_ps(src0 + 4);
            __m128 r10 = _mm_load_ps(src1);
            __m128 r11 = _mm_load_ps(src1 + 4);
            // Reduce vertically across the two scanlines.
            __m128 m0 = _mm_min_ps(r00, r10);
            __m128 m1 = _mm_min_ps(r01, r11);
            // Split into odd and even lanes, then reduce horizontally.
            __m128 a = _mm_shuffle_ps(m0, m1, _MM_SHUFFLE(3, 1, 3, 1));
            __m128 b = _mm_shuffle_ps(m0, m1, _MM_SHUFFLE(2, 0, 2, 0));
            // Stream the four results out, bypassing the cache.
            _mm_stream_ps(dst, _mm_min_ps(a, b));
            src0 += 8;
            src1 += 8;
            dst += 4;
            if ((y + 2) < height)
            {
                // Prefetch the corresponding block two scanlines down.
                _mm_prefetch((char *)(src0 + (width * 2)), _MM_HINT_NTA);
                _mm_prefetch((char *)(src1 + (width * 2)), _MM_HINT_NTA);
            }
        }
        // Skip over the scanline we already consumed as src1.
        src0 += width;
        src1 += width;
    }
}
```
Performing the test on this bit of code yields the following results:
|            | 2048   | 1024  | 512   |
|------------|--------|-------|-------|
| Scalar     | 12.323 | 3.133 | 0.657 |
| Intrinsics | 8.501  | 1.953 | 0.261 |
Significantly better! Now, is there any way to translate this back to ISPC? Unfortunately, I could not see any way of getting rid of the gather in ISPC and making it do something similar; however, the intrinsics version did guide me to reformulate the ISPC code into something it was able to do much better with:
```c
export void downsample(uniform float input[], uniform float output[], uniform int width, uniform int height)
{
    for (uniform int y = 0; y < height; y += 2)
    {
        uniform int offs = y * width;
        for (uniform int x = 0; x < width; x += (programCount << 1))
        {
            uniform float temp[programCount << 1];
            // Load two gangs' worth of texels from each of the two scanlines.
            float r00 = input[offs + x + programIndex];
            float r01 = input[offs + x + programCount + programIndex];
            float r10 = input[offs + x + width + programIndex];
            float r11 = input[offs + x + width + programCount + programIndex];
            // Reduce vertically and stash the results in temp.
            temp[programIndex] = min(r00, r10);
            temp[programIndex + programCount] = min(r01, r11);
            // Reduce horizontally: lane i takes the min of temp[2i] and temp[2i+1].
            temp[programIndex] = min(temp[programIndex << 1], temp[(programIndex << 1) + 1]);
            int d0 = (offs >> 2) + (x >> 1);
            output[d0 + programIndex] = temp[programIndex];
        }
    }
}
```
Running this code through my test setup yielded the following expanded results:
|            | 2048   | 1024  | 512   |
|------------|--------|-------|-------|
| Scalar     | 12.323 | 3.133 | 0.657 |
| Intrinsics | 8.501  | 1.953 | 0.261 |
| ISPC       | 10.788 | 2.665 | 0.451 |
Looking better, but still not as good as the intrinsics. I went ahead and took @mattpharr - the author of ISPC - up on an offer to look over the problem. I had been using the v1.0 release of ISPC; he pointed me to a newer release with some additional operations available in it, and provided his own implementation of the function that doesn't have to do any gathers:
```c
export void downsample_ispc(uniform float in[], uniform float out[], uniform int width, uniform int height)
{
    // loop over pixels in input image. Assumes that programCount divides
    // width evenly and that height is a multiple of two.
    for (uniform int y = 0; y < height; y += 2) {
        for (uniform int x = 0; x < width; x += 2 * programCount) {
            uniform int baseInOffset = y * width + x;
            float r00 = in[baseInOffset + programIndex];
            float r01 = in[baseInOffset + programCount + programIndex];
            float r10 = in[baseInOffset + width + programIndex];
            float r11 = in[baseInOffset + width + programCount + programIndex];
            // reduce vertically across two scanlines
            float m0 = min(r00, r10), m1 = min(r01, r11);
            // reduce horizontally with a rotated version of self; the
            // results we want are now in offsets 0 and 2.
            float m0h = min(m0, rotate(m0, 1));
            float m1h = min(m1, rotate(m1, 1));
            // repack them so that we have the results we want repeating,
            // like [0,2,0,2]
            const int permute = 2 * (programIndex & 1);
            float m0r = shuffle(m0h, permute);
            float m1r = shuffle(m1h, permute);
            // mask off the first half of the lanes from m0r and the second
            // half from m1r.
            const int mask0 = (programIndex < programCount / 2) ? 0xffffffff : 0;
            const int mask1 = ~mask0;
            // mask and OR together the results; use intbits() to
            // reinterpret bits to be an integer for bit operations, then
            // go back to a float with floatbits(). (These should have no
            // runtime cost, just let the compiler see the intent.)
            float result = floatbits((intbits(m0r) & mask0) |
                                     (intbits(m1r) & mask1));
            uniform int outOffset = (y / 2) * (width / 2) + (x / 2);
            out[outOffset + programIndex] = result;
        }
    }
}
```
This is doing essentially the same thing as my intrinsics version, but it makes up for the lack of a shuffle operator that works on two distinct vectors by doing some bit-masking work, which isn't quite as optimal (I sketch the difference in SSE terms after the table below). Beyond that, ISPC also doesn't have an operator for prefetching, although Matt has added a note to the ISPC bug tracker, so I imagine we can expect to see one in the future. Running Matt's code through the test setup yields the following results:
|             | 2048   | 1024  | 512   |
|-------------|--------|-------|-------|
| Scalar      | 12.323 | 3.133 | 0.657 |
| Intrinsics  | 8.501  | 1.953 | 0.261 |
| ISPC - Me   | 10.788 | 2.665 | 0.451 |
| ISPC - Matt | 10.556 | 2.475 | 0.317 |
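In SSE terms, the difference between a two-vector shuffle and the bit masking looks something like this. This is my own illustration, not the code either compiler actually emits:

```cpp
#include <xmmintrin.h>

// With a shuffle that can draw from two source vectors, combining the two
// partial results is a single instruction...
static __m128 combine_shuffle(__m128 m0h, __m128 m1h)
{
    // Take lanes 0 and 2 from each source: [m0h[0], m0h[2], m1h[0], m1h[2]].
    return _mm_shuffle_ps(m0h, m1h, _MM_SHUFFLE(2, 0, 2, 0));
}

// ...while the mask-and-OR select needs three (andps, andnps, orps).
static __m128 combine_masked(__m128 m0r, __m128 m1r, __m128 mask0)
{
    return _mm_or_ps(_mm_and_ps(m0r, mask0), _mm_andnot_ps(mask0, m1r));
}
```

The single-instruction form is essentially what the new two-vector shuffle operator mentioned below would let ISPC express directly.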
This is getting closer to the speed of the intrinsics version; Matt has also sent me a version that utilizes a new shuffle operator he is adding to ISPC, which gets it even closer still (and makes the code quite a bit more intuitive). With the future addition of a prefetch operator, that should probably make up the rest of the difference. While the results still aren't quite at the same level as my implementation using intrinsics, ISPC is an evolving and interesting tool with a skilled and responsive author behind it. I am having fun playing with ISPC and can definitely see adding it to my toolset when working on a project - so, thank you to Matt Pharr and Intel for putting it out there.