I have finished adding some preliminary support for "blocking regions" to taco. The idea is to tag regions of a task that perform high-latency, low-CPU-overhead work - blocking network I/O, say. With these regions appropriately tagged, we can migrate the fiber they are executing on to a dedicated I/O thread, freeing the current thread to continue processing tasks. Once the blocking region is exited, the worker fiber can migrate back to the task execution threads.
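
Usage is just a matter of bracketing the blocking work. For example, a read from a socket might get wrapped like this - BlockingRead is a hypothetical helper of my own naming, and the POSIX read call is just a stand-in for any blocking operation:

#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: perform a blocking read inside a tagged region,
// freeing the task thread while we wait on the socket.
ssize_t BlockingRead(int sock, void * buffer, size_t size)
{
    taco::BeginBlocking();                  // migrate this fiber to an i/o thread
    ssize_t n = read(sock, buffer, size);   // block without holding up task work
    taco::EndBlocking();                    // migrate back to the task threads
    return n;
}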

Internally, taco maintains a list of inactive "blocker" threads. When a task wants to enter one of these regions, it calls taco::BeginBlocking. This yields execution on the current task thread to the next appropriate worker fiber, while posting the current worker fiber to one of the inactive blocker threads and signalling it awake. When the task is finished with its blocking work, it calls taco::EndBlocking. This returns the blocker thread to the inactive list and puts it back to sleep, while scheduling the fiber back onto one of the task threads.
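
To make the hand-off a little more concrete, here is a rough sketch of the protocol a blocker thread might follow, written with a plain std::thread and condition variable instead of taco's fibers and scheduler - the BlockerThread class and its Post interface are illustrative guesses, not taco's actual internals:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

class BlockerThread
{
public:
    BlockerThread() : m_thread(&BlockerThread::Run, this) {}

    ~BlockerThread()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_quit = true;
        }
        m_wake.notify_one();
        m_thread.join();
    }

    // Called from a task thread: hand over the blocking work and signal
    // the sleeping blocker thread awake. Assumes one job at a time.
    void Post(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_job = std::move(job);
        }
        m_wake.notify_one();
    }

private:
    void Run()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        for (;;)
        {
            // Sleep until a job is posted or we are shutting down.
            m_wake.wait(lock, [this] { return m_quit || m_job != nullptr; });
            if (m_quit) { return; }

            std::function<void()> job = std::move(m_job);
            m_job = nullptr;
            lock.unlock();
            job();          // the high-latency, low-CPU work
            lock.lock();    // done - go back to sleep on the inactive list
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_wake;
    std::function<void()> m_job;
    bool m_quit = false;
    std::thread m_thread;   // declared last so it starts after the members above
};

In taco it would be the worker fiber itself being handed over rather than a std::function, but the sleep/wake dance around the inactive list is the point here.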

Simple enough in concept, but how well does it work? I went ahead and coded up a small artificial test that enters one of these blocking regions and sleeps for some amount of time before doing some CPU work:

#include <chrono>
#include <cstdint>
#include <thread>

void test_blocking(bool blocking, unsigned ntasks, unsigned sleeptime, unsigned cputime)
{
    taco::future<uint32_t> * tasks = new taco::future<uint32_t>[ntasks];
    for (unsigned i=0; i<ntasks; i++)
    {
        tasks[i] = taco::Start([=]() -> uint32_t {
            // Tag the sleep as a blocking region so the fiber migrates to
            // a blocker thread, freeing this task thread for other work.
            if (blocking) { taco::BeginBlocking(); }

            // Stand-in for high latency, low CPU work (e.g. network I/O).
            std::this_thread::sleep_for(std::chrono::microseconds(sleeptime));

            if (blocking) { taco::EndBlocking(); }

            // Stand-in for CPU-bound work: iterate the Collatz sequence
            // starting from 837799, cputime times over.
            unsigned steps = 0;
            for (unsigned j=0; j<cputime; j++)
            {
                unsigned n = 837799;
                while (n != 1)
                {
                    n = (n & 1) ? ((3 * n) + 1) : (n / 2);
                    steps++;
                }
            }
            return steps;
        });
    }

    // Every task does identical work, so all of the results should match.
    for (unsigned i=1; i<ntasks; i++)
    {
        BASIS_TEST_VERIFY_MSG(tasks[i] == tasks[i - 1], "Mismatch between %u and %u (%u vs %u)",
            i, i - 1, (uint32_t)tasks[i], (uint32_t)tasks[i - 1]);
    }

    delete [] tasks;
}

Okay, so it is a completely silly test, but hopefully it approximates some vision of what this blocking facility could be used for. The test setup measures the runtime of this function over a variety of task counts, sleep durations, and amounts of CPU work; it does this both with the blocking region enabled and with it disabled, and reports the relative speedup.
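
The measurement loop looks roughly like the following - measure() is a hypothetical helper for illustration, and any taco scheduler setup is omitted:

#include <chrono>
#include <cstdio>

// Time one run of test_blocking, in milliseconds.
static double measure(bool blocking, unsigned ntasks, unsigned sleeptime, unsigned cputime)
{
    auto start = std::chrono::steady_clock::now();
    test_blocking(blocking, ntasks, sleeptime, cputime);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

static void run_benchmarks()
{
    // Sweep the same parameter grid as the table below.
    for (unsigned ntasks = 64; ntasks <= 512; ntasks *= 2)
    for (unsigned cputime = 64; cputime <= 512; cputime *= 2)
    for (unsigned sleeptime = 1024; sleeptime <= 16384; sleeptime *= 2)
    {
        double enabled  = measure(true, ntasks, sleeptime, cputime);
        double disabled = measure(false, ntasks, sleeptime, cputime);
        printf("%u, %u, %u, %.0f, %.0f, %.2f\n",
            ntasks, cputime, sleeptime, enabled, disabled, disabled / enabled);
    }
}

In the table that follows, CPU is the cputime loop count (Collatz iterations per task) and Sleep is in microseconds.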

Task Count  CPU  Sleep (µs)  Enabled (ms)  Disabled (ms)  Relative
----------  ---  ----------  ------------  -------------  --------
64          64   1024        28            240            8.57
64          64   2048        34            246            7.24
64          64   4096        33            246            7.45
64          64   8192        30            251            8.37
64          64   16384       48            497            10.35
64          128  1024        44            250            5.68
64          128  2048        33            247            7.48
64          128  4096        38            242            6.37
64          128  8192        42            238            5.67
64          128  16384       52            496            9.54
64          256  1024        68            241            3.54
64          256  2048        44            251            5.70
64          256  4096        58            237            4.09
64          256  8192        45            252            5.60
64          256  16384       89            501            5.63
64          512  1024        83            246            2.96
64          512  2048        124           250            2.02
64          512  4096        79            251            3.18
64          512  8192        121           252            2.08
64          512  16384       145           494            3.41
128         64   1024        28            495            17.68
128         64   2048        48            497            10.35
128         64   4096        48            497            10.35
128         64   8192        48            497            10.35
128         64   16384       64            997            15.58
128         128  1024        69            492            7.13
128         128  2048        75            500            6.67
128         128  4096        71            491            6.92
128         128  8192        75            501            6.68
128         128  16384       82            996            12.15
128         256  1024        95            495            5.21
128         256  2048        80            496            6.20
128         256  4096        124           499            4.02
128         256  8192        96            495            5.16
128         256  16384       146           994            6.81
128         512  1024        173           499            2.88
128         512  2048        233           499            2.14
128         512  4096        172           497            2.89
128         512  8192        236           496            2.10
128         512  16384       257           1002           3.90
256         64   1024        50            992            19.84
256         64   2048        65            995            15.31
256         64   4096        76            999            13.14
256         64   8192        63            997            15.83
256         64   16384       91            1997           21.95
256         128  1024        99            993            10.03
256         128  2048        98            993            10.13
256         128  4096        127           995            7.83
256         128  8192        130           993            7.64
256         128  16384       146           1991           13.64
256         256  1024        237           994            4.19
256         256  2048        242           989            4.09
256         256  4096        237           994            4.19
256         256  8192        236           995            4.22
256         256  16384       261           2001           7.67
256         512  1024        467           1000           2.14
256         512  2048        459           991            2.16
256         512  4096        258           1005           3.90
256         512  8192        263           999            3.80
256         512  16384       378           1996           5.28
512         64   1024        93            1987           21.37
512         64   2048        96            1994           20.77
512         64   4096        83            1990           23.98
512         64   8192        99            1990           20.10
512         64   16384       118           3984           33.76
512         128  1024        240           1989           8.29
512         128  2048        178           1990           11.18
512         128  4096        247           1997           8.09
512         128  8192        192           1991           10.37
512         128  16384       261           3983           15.26
512         256  1024        350           1989           5.68
512         256  2048        334           1989           5.96
512         256  4096        262           1999           7.63
512         256  8192        342           1996           5.84
512         256  16384       355           3996           11.26
512         512  1024        933           2002           2.15
512         512  2048        526           2000           3.80
512         512  4096        644           1991           3.09
512         512  8192        642           1993           3.10
512         512  16384       933           3999           4.29

With a maximum speedup of nearly 34x, I would say that is a pretty good improvement. Of course, this is just an artificial test, and putting a thread to sleep for some requested amount of time isn't exactly a precise thing. Perhaps a more realistic test would provide more interesting results. Still, this gives me hope that I am working in a good direction. The results also make me think about some other things to test; for example, right now every fiber that blocks will potentially spin up a new thread - what if instead there were a limited pool (sketched below)? Something to experiment with in the future, I think.
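
For what it's worth, here is one way that limited pool might look, using a C++20 counting semaphore as the gatekeeper. The BeginBlockingLimited/EndBlockingLimited names and the cap of 8 are purely illustrative:

#include <semaphore>

// Hypothetical cap on the number of blocker threads.
constexpr int kMaxBlockerThreads = 8;
std::counting_semaphore<kMaxBlockerThreads> g_blockerSlots(kMaxBlockerThreads);

void BeginBlockingLimited()
{
    g_blockerSlots.acquire();   // wait for a free blocker thread slot
    taco::BeginBlocking();      // then migrate as before
}

void EndBlockingLimited()
{
    taco::EndBlocking();        // migrate back to the task threads
    g_blockerSlots.release();   // return the slot to the pool
}

The catch is that acquire() here would stall the task thread whenever the pool is exhausted, so a real version would want a fiber-aware wait - exactly the sort of thing the experiment would need to sort out.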