I've been spending a little more time working on taco lately and thought it would be good to spend some time writing out a little more about how the internals work.

As I mentioned before, the intent of taco is to create a coroutine based task processing library. On Windows these coroutines are just a basic wrapper around the system fiber support. For other platforms I am building them on top of the POSIX ucontext functionality and I employ this little bit of code with setjmp/longjmp to speed things up. With this as a basis the library is able to suspend tasks and then resume them at any point in the future, retaining their state of execution (i.e. these are not stackless coroutines).

Strictly speaking, each task is not itself a coroutine and instead there is really only one type of coroutine - the worker coroutine. The whole library is built around instances of this coroutine running on a set of threads and yielding execution to one another when appropriate. The worker routine itself is fairly simple, looking a little like:

void worker()
{
	for (;;)
	{
		if (HasScheduledTasks())
		{
			ExecuteTask();
		}
		else if (HasScheduledWorkers())
		{
			YieldTo(NextWorker()); // current fiber becomes 'inactive'
		}
		else
		{
			SleepUntilSignaled();
		}
	}
}

When you first initialize the library it will spin up a number of threads (by default it aims for the same number of threads as there are hardware threads) and each one of these threads will execute one of these workers. When you schedule a task for execution, one of these workers will pick it up to execute it; at this point things don't look all that different from any other task processing library. However, returning to the silly example from the last post:

// ...
taco::Schedule([]() -> void {
	printf("Hello ");
	taco::Switch();
	printf("World\n");
});
// ...

When this task gets executed and makes the call to Switch, it will cause the worker it is running on to be suspended and rescheduled for future execution. At this point if there are more tasks to be started then we will invoke an inactive worker (or create a new one); if there are no more tasks pending we will instead execute a previously scheduled fiber (e.g. like the one we just suspended). In code this looks a little like this:

void Switch()
{
	fiber current = FiberCurrent();
	fiber next = HasScheduledTasks() ? GetInactiveWorker() : NextWorker();
	if (next)
	{
		YieldTo(next, [=]() -> void { ScheduleFiber(current); });
	}
}

This looks a little different in the actual code but it should give you the right idea. The lambda being passed in as an argument is actually executed by the next worker at the point it is resumed - it is important to schedule the fiber we are currently executing only after we have actually switched to the next one (we do not want another thread to pick it up and try to start executing before this thread has switched from it).

Since we are executing tasks on a number of different threads it follows that some amount of synchronization between tasks might be required. However, with the ability to suspend and resume our tasks at any point, we don't rely on the normal thread blocking synchronization primitives. It becomes very simple to instead build our own non-blocking synchronization routines - if we can't acquire a lock, then we just switch to another worker until we can acquire the lock. This can look as simple as:

void LockMutex(mutex * m)
{
	uint32_t ticket = m->lock_counter.fetch_add(1);
	while (m->unlock_counter.load() != ticket)
	{
		Switch();
	}
}
void UnlockMutex(mutex * m)
{
	m->unlock_counter++;
}

So, this is maybe a little redundant with the previous post, but hopefully it explains a little more what taco is about and how it works. Right now I am prototyping a mechanism for marking regions of a task as doing "blocking" work - e.g. something like blocking file i/o that will spend a lot of time waiting instead of computing. The idea is that these regions will get migrated off their current thread to a thread dedicated to handling these regions of work, allowing the current thread to continue processing other tasks. Early testing shows some promising results - but I'll save that for a follow on post.