I have finished adding some preliminary support for "blocking regions" to taco. The idea here is to tag regions of a task that perform high latency, low cpu overhead work - like, say, blocking network i/o. With these regions appropriately tagged, we can migrate the fiber they are executing on to a dedicated i/o thread, freeing the current thread to continue processing tasks. Once the blocking region is exited, the worker fiber can migrate back to the task execution threads.
Internally, taco maintains a list of inactive "blocker" threads. When a task wants to enter one of these regions, it calls taco::BeginBlocking. This yields execution on the current task thread to the next appropriate worker fiber, while posting the current worker fiber to one of these inactive blocker threads and signalling it awake. When the task is finished with whatever blocking work it is doing, it just calls taco::EndBlocking. This puts the blocker thread back to sleep on the inactive list and schedules the fiber back onto one of the task threads.
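From calling code, that pair of functions just brackets the blocking work. Since an early return or an exception could leave a region open, a small RAII guard is a natural way to use the API. To be clear, this guard (and the i/o call in the example) is my own sketch, not something taco provides - only taco::BeginBlocking and taco::EndBlocking come from the library:

#include <unistd.h> // read

// Hypothetical RAII helper so a blocking region can't be left open
// by an early return or an exception
struct BlockingRegion
{
    BlockingRegion() { taco::BeginBlocking(); }
    ~BlockingRegion() { taco::EndBlocking(); }
};

// Hypothetical usage from inside a task doing blocking file i/o
void LoadAsset(int fd, char * buffer, size_t size)
{
    BlockingRegion scope;    // fiber migrates to a blocker thread
    read(fd, buffer, size);  // blocking call runs off the task threads
}                            // ~BlockingRegion migrates the fiber back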
Simple enough in concept, but how well does it work? I went ahead and coded up an artificial test that enters one of these blocking regions, sleeps for some amount of time inside it, and then does some CPU work:
void test_blocking(bool blocking, unsigned ntasks, unsigned sleeptime, unsigned cputime)
{
    taco::future<uint32_t> * tasks = new taco::future<uint32_t>[ntasks];
    for (unsigned i=0; i<ntasks; i++)
    {
        tasks[i] = taco::Start([=]() -> uint32_t {
            // High latency, low cpu work - simulated with a sleep,
            // optionally wrapped in a blocking region
            if (blocking) { taco::BeginBlocking(); }
            std::this_thread::sleep_for(std::chrono::microseconds(sleeptime));
            if (blocking) { taco::EndBlocking(); }

            // CPU-bound work - count collatz steps, repeatedly
            unsigned steps = 0;
            for (unsigned j=0; j<cputime; j++)
            {
                unsigned n = 837799;
                while (n != 1)
                {
                    n = (n & 1) ? ((3 * n) + 1) : (n / 2);
                    steps++;
                }
            }
            return steps;
        });
    }

    // Every task computes the same value, so adjacent futures should
    // match; reading a future waits on its task's result
    for (unsigned i=1; i<ntasks; i++)
    {
        BASIS_TEST_VERIFY_MSG(tasks[i] == tasks[i - 1], "Mismatch between %u and %u (%u vs %u)",
            i, i - 1, (uint32_t)tasks[i], (uint32_t)tasks[i - 1]);
    }
    delete [] tasks;
}
Okay, so it is a completely silly test, but hopefully it approximates the sort of workload this blocking facility is meant for. The test setup measures the runtime of this function over a variety of task counts, sleep durations, and cpu workloads; it does this both with the blocking region enabled and with it disabled, and reports the relative speedup.
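For reference, the measurement loop might look something like the sketch below - the run_timed helper and the parameter grid here are illustrative stand-ins, not the actual test harness:

#include <chrono>
#include <cstdio>

// Time one configuration of test_blocking (hypothetical helper)
static double run_timed(bool blocking, unsigned ntasks, unsigned sleeptime, unsigned cputime)
{
    auto begin = std::chrono::high_resolution_clock::now();
    test_blocking(blocking, ntasks, sleeptime, cputime);
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}

// Sweep a grid of configurations and report relative speedup;
// these particular ranges are made up for illustration
void report_speedups()
{
    for (unsigned ntasks = 16; ntasks <= 1024; ntasks *= 4)
    {
        for (unsigned sleeptime = 100; sleeptime <= 100000; sleeptime *= 10)
        {
            double plain = run_timed(false, ntasks, sleeptime, 100);
            double regions = run_timed(true, ntasks, sleeptime, 100);
            printf("%4u tasks, %6uus sleep: %.2fx speedup\n",
                ntasks, sleeptime, plain / regions);
        }
    }
}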
With a maximum speedup of 33-34x, I would say that is a pretty good improvement. Of course, this is just an artificial test, and putting a thread to sleep for some requested amount of time isn't exactly a precise thing. Perhaps a more realistic test would provide some more interesting results. Still, this gives me hope that I am working in a good direction. The results also make me think about some other things to test; for example, right now every fiber that blocks will potentially spin up a new thread - what if instead there were a limited pool? Something to experiment with in the future, I think.
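To make that idea a bit more concrete, here is a minimal sketch of what capping the blocker threads could look like. None of this is taco's actual implementation - the class and its names are hypothetical, and in a real fiber scheduler the wait would presumably be a fiber-level yield rather than blocking the OS thread as it does here:

#include <condition_variable>
#include <mutex>

// Hypothetical gate limiting how many blocker threads are awake at
// once; BeginBlocking would Acquire() a slot before migrating a
// fiber over, and EndBlocking would Release() it on the way back
class BlockerPoolGate
{
public:
    explicit BlockerPoolGate(unsigned limit) : m_available(limit) {}

    // Wait until a blocker thread slot is free, then claim it
    void Acquire()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [&]() { return m_available > 0; });
        m_available--;
    }

    // Free a slot and wake one waiter, if any
    void Release()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_available++;
        }
        m_cond.notify_one();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    unsigned m_available;
};

The obvious tradeoff is that once the pool is saturated, tasks entering a blocking region have to wait for a slot, so the limit would need tuning against the expected amount of blocking work.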