I have found myself crossing paths with threadpools quite a lot over the last few weeks. For my first encounter, I found myself poring over the FastAPI docs in an attempt to better understand how it dispatches functions to its external threadpool (outside the main event-loop). Not long afterwards (during my orderbook rewrite[1]), I began to think about matching-engines again, specifically, high-throughput message handling once an order has been executed (at 100k's of messages p/s). As a matching-engine operates on orders sequentially, with resource allocation focused on processing incoming orders as fast as possible, post-trade execution requires a pragmatic approach to handling those trade messages reliably outside of the core matching-engine logic (at least that's how I've always interpreted it). This is where threadpools began to show their relevance once more (frequency illusion at its finest). What this post entails is a brief overview of threadpools, how I went about implementing one, and some of the interesting things I learned along the way 🕳️🐇.
Using multiple threads (of execution) allows multiple tasks to be scheduled within a single process in such a way that they can be handled concurrently. What this means is that rather than waiting for a single function to execute sequentially before moving to the next, a process can try to optimize its pipeline by scheduling tasks so as to minimize idle time between instructions (e.g. executing instructions in another function while waiting for an asynchronous call to complete). An important caveat here is that instantiating a thread is an expensive endeavour. A thread requires its own stack, along with copying over any relevant state from the parent thread before beginning execution, all of which requires kernel involvement for scheduling and memory allocation.
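To make that cost concrete, here is a minimal sketch (my own illustration, not code from thool) of spawning one short-lived pthread per unit of work. Every call pays for a fresh stack and a kernel scheduling round trip, which is exactly the overhead a threadpool amortises:

```c
#include <pthread.h>

/* Runs on the newly created thread's own stack. */
static void *double_it(void *arg) {
    int *n = arg;
    *n *= 2;
    return NULL;
}

/* One fresh thread per call: correct, but the create/teardown
 * cost dominates when the unit of work is this small. */
int run_once_on_fresh_thread(int n) {
    pthread_t t;
    pthread_create(&t, NULL, double_it, &n);
    pthread_join(t, NULL);  /* wait for completion and teardown */
    return n;
}
```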
Threadpool overview: Each task (T) is enqueued before being executed by a thread (note the fixed number of threads in the pool)
This is where threadpools come into play. Threadpools can offer an efficient implementation of the traditional multi-threaded model by providing a fixed set of long-running threads of execution (the pool), rather than going through the unnecessary, repeated creation and teardown of threads as work arises. Limiting the number of threads being created can also help mitigate potential resource contention, consumption and excessive context-switching. The important thing to note when using threadpools is that they are rather inelastic; by that I mean we want to have a relatively good idea of our average workload before deciding to use a threadpool. After all, this inelasticity is what can make a threadpool more efficient than spawning threads on demand.
I'll try not to get into the details of threadpool implementations too much as I believe the design considerations offer a better insight into how they operate. The program was written in C and turned out to be much more compact and simple than I anticipated (which I'm super happy about). For those curious about the technical details, I'll provide a link to the source code of thool below[2]. For now, here are some of the considerations and common patterns that I found interesting while writing thool.
A task object comprises a function and an argument property. A worker will pull a task from the queue and execute the function using the given argument (if any was given). There appeared to be two common approaches to implementing a taskqueue: a linked-list or a circular buffer. Using a linked-list offers more flexibility through an unbounded queue, i.e., it can grow indefinitely, opening up the possibility of adopting a truly lock-free, unbounded queueing mechanism. However, with these benefits comes the cost of frequent malloc and free usage, an expensive set of calls that I would like to avoid. The circular buffer, on the other hand, is the antithesis of the linked-list. While it is bounded in how many tasks it can accommodate, we can drastically minimize the number of allocations required. Along with the circular buffer, mutexes become a good default option, as I find them easier to reason about and there is potential to be blocked by the queue size anyway.
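As a sketch of the circular-buffer approach (the struct and function names here are my own assumptions, not necessarily thool's), the head and tail indices wrap around a fixed array, so enqueueing and dequeueing never touch malloc or free:

```c
#include <stddef.h>

#define QUEUE_CAP 256  /* fixed capacity: bounded, but allocation-free */

/* A task pairs a function pointer with its single argument. */
typedef struct {
    void (*fn)(void *);
    void *arg;
} task_t;

/* Circular buffer: indices wrap modulo QUEUE_CAP. */
typedef struct {
    task_t buf[QUEUE_CAP];
    size_t head;   /* next slot to dequeue from */
    size_t tail;   /* next slot to enqueue into */
    size_t count;  /* number of pending tasks   */
} taskqueue_t;

/* Returns -1 when full: this is the "blocked by the queue size" case. */
static int taskqueue_push(taskqueue_t *q, void (*fn)(void *), void *arg) {
    if (q->count == QUEUE_CAP) return -1;
    q->buf[q->tail] = (task_t){fn, arg};
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    return 0;
}

/* Returns -1 when empty; otherwise copies the oldest task into *out. */
static int taskqueue_pop(taskqueue_t *q, task_t *out) {
    if (q->count == 0) return -1;
    *out = q->buf[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return 0;
}
```

A worker would call `taskqueue_pop` under the queue's mutex and then invoke `t.fn(t.arg)` outside the lock.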
Condition variables allow a thread to wait when a given scenario is met (e.g. no tasks to process) and to be notified or signalled once that state of the system changes. This allows us to avoid the cost of constantly spinning threads, like you would see in a spin-lock for example, and helps to save CPU cycles. In thool, you will find a single cond-var used to signal a thread to wait while the taskqueue is empty and to awaken when there are tasks pending. I believe a binary semaphore could also be used similarly here, but it's not something I am familiar with and it didn't feel entirely appropriate compared to using condition variables.
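The wait/signal pattern looks roughly like the following sketch. This is the generic pthread condition-variable idiom rather than thool's exact code, and the names are mine:

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tasks_cond = PTHREAD_COND_INITIALIZER;
int pending = 0;  /* stand-in for "tasks in the queue" */

/* Producer side: publish a task, then wake one sleeping worker. */
void submit_task(void) {
    pthread_mutex_lock(&lock);
    pending++;
    pthread_cond_signal(&tasks_cond);
    pthread_mutex_unlock(&lock);
}

/* Worker side: sleep (no CPU spinning) until work is available.
 * The while-loop re-checks the predicate, which guards against
 * spurious wakeups. */
void claim_task(void) {
    pthread_mutex_lock(&lock);
    while (pending == 0)
        pthread_cond_wait(&tasks_cond, &lock);
    pending--;  /* claim exactly one task */
    pthread_mutex_unlock(&lock);
}
```

`pthread_cond_wait` atomically releases the mutex while sleeping and re-acquires it before returning, which is what makes the predicate check safe.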
Two mutexes are used in thool: one guarding thread_count, which was responsible for managing the number of alive and active threads in the threadpool, and taskqueue_lock, which governed which thread had control over the taskqueue's read-write privileges. Fortunately, synchronisation is mutually exclusive, which means a small number of mutexes can be used to govern threadpool state effectively.
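A minimal sketch of how a couple of mutexes can partition pool state (the struct layout and names here are assumptions for illustration, not thool's actual fields):

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t count_lock;  /* guards the thread counters        */
    int threads_alive;           /* threads that exist in the pool    */
    int threads_working;         /* threads currently running a task  */
    pthread_mutex_t queue_lock;  /* guards taskqueue read-write access */
} pool_state_t;

void pool_state_init(pool_state_t *p) {
    pthread_mutex_init(&p->count_lock, NULL);
    pthread_mutex_init(&p->queue_lock, NULL);
    p->threads_alive = 0;
    p->threads_working = 0;
}

/* Counter updates take only count_lock, so bookkeeping never
 * contends with queue traffic happening under queue_lock. */
void worker_started(pool_state_t *p) {
    pthread_mutex_lock(&p->count_lock);
    p->threads_alive++;
    pthread_mutex_unlock(&p->count_lock);
}

void worker_exited(pool_state_t *p) {
    pthread_mutex_lock(&p->count_lock);
    p->threads_alive--;
    pthread_mutex_unlock(&p->count_lock);
}
```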
While I'm happy with the finished product, I think there are some interesting areas that I would like to explore with thool. I would like to consider adding a force_add option when queueing tasks onto the threadpool, allowing the queue to grow when full. While this would require a malloc, the number of allocations would hopefully be logarithmic (e.g. by doubling the queue's capacity on each resize). Finally, I think it would be nice to expand the thool library such that it could be used on Windows machines. This would require writing some kind of wrapper around the pthread library, as it is not available on Windows runtimes. Anyways, I hope someone found this useful in some way, I really enjoyed the process 🤙.