I have finished adding some preliminary support for "blocking regions" to taco. The idea is to tag regions of a task that perform high-latency, low-CPU-overhead work - blocking network I/O, say. With these regions appropriately tagged, we can migrate the fiber they are executing on to a dedicated I/O thread, freeing the current thread to continue processing tasks. Once the blocking region is exited, the worker fiber can migrate back to the task execution threads.
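
Usage is just a matter of bracketing the blocking work. For example, a read from a socket might get wrapped like this - BlockingRead is a hypothetical helper of my own naming, and the POSIX read call is just a stand-in for any blocking operation:

#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: perform a blocking read inside a tagged region,
// freeing the task thread while we wait on the socket.
ssize_t BlockingRead(int sock, void * buffer, size_t size)
{
    taco::BeginBlocking();                  // migrate this fiber to an i/o thread
    ssize_t n = read(sock, buffer, size);   // block without holding up task work
    taco::EndBlocking();                    // migrate back to the task threads
    return n;
}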

Internally, taco maintains a list of inactive "blocker" threads. When a task wants to enter one of these regions, it calls taco::BeginBlocking. This yields execution on the current task thread to the next appropriate worker fiber, while posting the current worker fiber to one of the inactive blocker threads and signalling it awake. When the task is finished with its blocking work, it calls taco::EndBlocking. This returns the blocker thread to the inactive list and puts it back to sleep, while scheduling the fiber back onto one of the task threads.
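
To make the hand-off a little more concrete, here is a rough sketch of the protocol a blocker thread might follow, written with a plain std::thread and condition variable instead of taco's fibers and scheduler - the BlockerThread class and its Post interface are illustrative guesses, not taco's actual internals:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

class BlockerThread
{
public:
    BlockerThread() : m_thread(&BlockerThread::Run, this) {}

    ~BlockerThread()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_quit = true;
        }
        m_wake.notify_one();
        m_thread.join();
    }

    // Called from a task thread: hand over the blocking work and signal
    // the sleeping blocker thread awake. Assumes one job at a time.
    void Post(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_job = std::move(job);
        }
        m_wake.notify_one();
    }

private:
    void Run()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        for (;;)
        {
            // Sleep until a job is posted or we are shutting down.
            m_wake.wait(lock, [this] { return m_quit || m_job != nullptr; });
            if (m_quit) { return; }

            std::function<void()> job = std::move(m_job);
            m_job = nullptr;
            lock.unlock();
            job();          // the high-latency, low-CPU work
            lock.lock();    // done - go back to sleep on the inactive list
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_wake;
    std::function<void()> m_job;
    bool m_quit = false;
    std::thread m_thread;   // declared last so it starts after the members above
};

In taco it would be the worker fiber itself being handed over rather than a std::function, but the sleep/wake dance around the inactive list is the point here.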

Simple enough in concept, but how well does it work? I went ahead and coded up a small artificial test that enters one of these blocking regions and sleeps for some amount of time before doing some CPU work:

#include <chrono>
#include <cstdint>
#include <thread>

void test_blocking(bool blocking, unsigned ntasks, unsigned sleeptime, unsigned cputime)
{
    taco::future<uint32_t> * tasks = new taco::future<uint32_t>[ntasks];
    for (unsigned i=0; i<ntasks; i++)
    {
        tasks[i] = taco::Start([=]() -> uint32_t {
            // Tag the sleep as a blocking region so the fiber migrates to
            // a blocker thread, freeing this task thread for other work.
            if (blocking) { taco::BeginBlocking(); }

            // Stand-in for high latency, low CPU work (e.g. network I/O).
            std::this_thread::sleep_for(std::chrono::microseconds(sleeptime));

            if (blocking) { taco::EndBlocking(); }

            // Stand-in for CPU-bound work: iterate the Collatz sequence
            // starting from 837799, cputime times over.
            unsigned steps = 0;
            for (unsigned j=0; j<cputime; j++)
            {
                unsigned n = 837799;
                while (n != 1)
                {
                    n = (n & 1) ? ((3 * n) + 1) : (n / 2);
                    steps++;
                }
            }
            return steps;
        });
    }

    // Every task does identical work, so all of the results should match.
    for (unsigned i=1; i<ntasks; i++)
    {
        BASIS_TEST_VERIFY_MSG(tasks[i] == tasks[i - 1], "Mismatch between %u and %u (%u vs %u)",
            i, i - 1, (uint32_t)tasks[i], (uint32_t)tasks[i - 1]);
    }

    delete [] tasks;
}

Okay, so it is a completely silly test, but hopefully it approximates some vision of what this blocking facility could be used for. The test setup measures the runtime of this function over a variety of task counts, sleep durations, and amounts of CPU work; it does this both with the blocking region enabled and with it disabled, and reports the relative speedup.
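
The measurement loop looks roughly like the following - measure() is a hypothetical helper for illustration, and any taco scheduler setup is omitted:

#include <chrono>
#include <cstdio>

// Time one run of test_blocking, in milliseconds.
static double measure(bool blocking, unsigned ntasks, unsigned sleeptime, unsigned cputime)
{
    auto start = std::chrono::steady_clock::now();
    test_blocking(blocking, ntasks, sleeptime, cputime);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

static void run_benchmarks()
{
    // Sweep the same parameter grid as the table below.
    for (unsigned ntasks = 64; ntasks <= 512; ntasks *= 2)
    for (unsigned cputime = 64; cputime <= 512; cputime *= 2)
    for (unsigned sleeptime = 1024; sleeptime <= 16384; sleeptime *= 2)
    {
        double enabled  = measure(true, ntasks, sleeptime, cputime);
        double disabled = measure(false, ntasks, sleeptime, cputime);
        printf("%u, %u, %u, %.0f, %.0f, %.2f\n",
            ntasks, cputime, sleeptime, enabled, disabled, disabled / enabled);
    }
}

In the table that follows, CPU is the cputime loop count (Collatz iterations per task) and Sleep is in microseconds.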

Task Count  CPU  Sleep (µs)  Enabled (ms)  Disabled (ms)  Relative
----------  ---  ----------  ------------  -------------  --------
64          64   1024        28            240            8.57
64          64   2048        34            246            7.24
64          64   4096        33            246            7.45
64          64   8192        30            251            8.37
64          64   16384       48            497            10.35
64          128  1024        44            250            5.68
64          128  2048        33            247            7.48
64          128  4096        38            242            6.37
64          128  8192        42            238            5.67
64          128  16384       52            496            9.54
64          256  1024        68            241            3.54
64          256  2048        44            251            5.70
64          256  4096        58            237            4.09
64          256  8192        45            252            5.60
64          256  16384       89            501            5.63
64          512  1024        83            246            2.96
64          512  2048        124           250            2.02
64          512  4096        79            251            3.18
64          512  8192        121           252            2.08
64          512  16384       145           494            3.41
128         64   1024        28            495            17.68
128         64   2048        48            497            10.35
128         64   4096        48            497            10.35
128         64   8192        48            497            10.35
128         64   16384       64            997            15.58
128         128  1024        69            492            7.13
128         128  2048        75            500            6.67
128         128  4096        71            491            6.92
128         128  8192        75            501            6.68
128         128  16384       82            996            12.15
128         256  1024        95            495            5.21
128         256  2048        80            496            6.20
128         256  4096        124           499            4.02
128         256  8192        96            495            5.16
128         256  16384       146           994            6.81
128         512  1024        173           499            2.88
128         512  2048        233           499            2.14
128         512  4096        172           497            2.89
128         512  8192        236           496            2.10
128         512  16384       257           1002           3.90
256         64   1024        50            992            19.84
256         64   2048        65            995            15.31
256         64   4096        76            999            13.14
256         64   8192        63            997            15.83
256         64   16384       91            1997           21.95
256         128  1024        99            993            10.03
256         128  2048        98            993            10.13
256         128  4096        127           995            7.83
256         128  8192        130           993            7.64
256         128  16384       146           1991           13.64
256         256  1024        237           994            4.19
256         256  2048        242           989            4.09
256         256  4096        237           994            4.19
256         256  8192        236           995            4.22
256         256  16384       261           2001           7.67
256         512  1024        467           1000           2.14
256         512  2048        459           991            2.16
256         512  4096        258           1005           3.90
256         512  8192        263           999            3.80
256         512  16384       378           1996           5.28
512         64   1024        93            1987           21.37
512         64   2048        96            1994           20.77
512         64   4096        83            1990           23.98
512         64   8192        99            1990           20.10
512         64   16384       118           3984           33.76
512         128  1024        240           1989           8.29
512         128  2048        178           1990           11.18
512         128  4096        247           1997           8.09
512         128  8192        192           1991           10.37
512         128  16384       261           3983           15.26
512         256  1024        350           1989           5.68
512         256  2048        334           1989           5.96
512         256  4096        262           1999           7.63
512         256  8192        342           1996           5.84
512         256  16384       355           3996           11.26
512         512  1024        933           2002           2.15
512         512  2048        526           2000           3.80
512         512  4096        644           1991           3.09
512         512  8192        642           1993           3.10
512         512  16384       933           3999           4.29

With a maximum speedup of nearly 34x, I would say that is a pretty good improvement. Of course, this is just an artificial test, and putting a thread to sleep for some requested amount of time isn't exactly a precise thing. Perhaps a more realistic test would provide more interesting results. Still, this gives me hope that I am working in a good direction. The results also make me think about some other things to test; for example, right now every fiber that blocks will potentially spin up a new thread - what if instead there were a limited pool (sketched below)? Something to experiment with in the future, I think.
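
For what it's worth, here is one way that limited pool might look, using a C++20 counting semaphore as the gatekeeper. The BeginBlockingLimited/EndBlockingLimited names and the cap of 8 are purely illustrative:

#include <semaphore>

// Hypothetical cap on the number of blocker threads.
constexpr int kMaxBlockerThreads = 8;
std::counting_semaphore<kMaxBlockerThreads> g_blockerSlots(kMaxBlockerThreads);

void BeginBlockingLimited()
{
    g_blockerSlots.acquire();   // wait for a free blocker thread slot
    taco::BeginBlocking();      // then migrate as before
}

void EndBlockingLimited()
{
    taco::EndBlocking();        // migrate back to the task threads
    g_blockerSlots.release();   // return the slot to the pool
}

The catch is that acquire() here would stall the task thread whenever the pool is exhausted, so a real version would want a fiber-aware wait - exactly the sort of thing the experiment would need to sort out.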