I have found myself crossing paths with threadpools quite a lot over the last few weeks. For my first encounter, I found myself poring over the FastAPI docs in an attempt to better understand how it dispatches functions to its external threadpool (outside the main event-loop). Not long afterwards (during my orderbook rewrite[1]), I began to think about matching-engines again, specifically, high-throughput message handling once an order has been executed (at 100k's of messages p/s). As a matching-engine operates on orders sequentially, with resource allocation focused on processing incoming orders as fast as possible, post-trade execution requires a pragmatic approach to handling those trade messages reliably outside of the core matching-engine logic (at least that's how I've always interpreted it). This is where threadpools began to show their relevance once more (frequency illusion at its finest). What this post entails is a brief overview of threadpools, how I went about implementing one, and some of the interesting things I learned along the way 🕳️🐇.
Using multiple threads (of execution) allows multiple tasks to be scheduled within a single process in such a way that they can be handled concurrently. What this means is that rather than waiting for a single function to execute sequentially before moving to the next, a process can try to optimize its pipeline by scheduling tasks so as to minimize idle time between instructions (e.g. executing instructions in another function while waiting for an asynchronous call to complete). An important caveat here is that instantiating a thread is an expensive endeavour. A thread requires its own stack, along with copying over any relevant state from the parent thread before beginning execution, all of which requires kernel involvement for scheduling and memory allocation.
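To make that cost concrete, here is a minimal sketch (my own illustration, not code from thool) of spawning one short-lived pthread per unit of work. Every call pays for a fresh stack and a kernel scheduling round trip, which is exactly the overhead a threadpool amortises:

```c
#include <pthread.h>

/* Runs on the newly created thread's own stack. */
static void *double_it(void *arg) {
    int *n = arg;
    *n *= 2;
    return NULL;
}

/* One fresh thread per call: correct, but the create/teardown
 * cost dominates when the unit of work is this small. */
int run_once_on_fresh_thread(int n) {
    pthread_t t;
    pthread_create(&t, NULL, double_it, &n);
    pthread_join(t, NULL);  /* wait for completion and teardown */
    return n;
}
```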
Threadpool overview: Each task (T) is enqueued before being executed by a thread (note the fixed number of threads in the pool)
This is where threadpools come into play. Threadpools can offer an efficient implementation of the traditional multi-threaded model by providing a fixed set of long-running threads of execution (the pool), rather than going through the unnecessary, repeated creation and teardown of threads as work arises. Limiting the number of threads being created can also help mitigate potential resource contention, consumption and excessive context-switching. The important thing to note when using threadpools is that they are rather inelastic; by that I mean we want to have a relatively good idea of our average workload before deciding to use a threadpool. After all, this inelasticity is what can make a threadpool more efficient than spawning threads on demand.
I'll try not to get into the details of threadpool implementations too much as I believe the design considerations offer a better insight into how they operate. The program was written in C and turned out to be much more compact and simple than I anticipated (which I'm super happy about). For those curious about the technical details, I'll provide a link to the source code of thool below[2]. For now, here are some of the considerations and common patterns that I found interesting while writing thool.
A task object comprises a function and an argument property. A worker will pull a task from the queue and execute the function using the given argument (if any was given). There appeared to be two common approaches to implementing a taskqueue: a linked-list or a circular buffer. Using a linked-list offers more flexibility through an unbounded queue, i.e., it can grow indefinitely, opening up the possibility of adopting a truly lock-free, unbounded queueing mechanism. However, with these benefits comes the cost of frequent malloc and free usage, an expensive set of calls that I would like to avoid. The circular buffer, on the other hand, is the antithesis of the linked-list. While it is bounded in how many tasks it can accommodate, we can drastically minimize the number of allocations required. Along with the circular buffer, mutexes become a good default option, as I find them easier to reason about and there is potential to be blocked by the queue size anyway.
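As a sketch of the circular-buffer approach (the struct and function names here are my own assumptions, not necessarily thool's), the head and tail indices wrap around a fixed array, so enqueueing and dequeueing never touch malloc or free:

```c
#include <stddef.h>

#define QUEUE_CAP 256  /* fixed capacity: bounded, but allocation-free */

/* A task pairs a function pointer with its single argument. */
typedef struct {
    void (*fn)(void *);
    void *arg;
} task_t;

/* Circular buffer: indices wrap modulo QUEUE_CAP. */
typedef struct {
    task_t buf[QUEUE_CAP];
    size_t head;   /* next slot to dequeue from */
    size_t tail;   /* next slot to enqueue into */
    size_t count;  /* number of pending tasks   */
} taskqueue_t;

/* Returns -1 when full: this is the "blocked by the queue size" case. */
static int taskqueue_push(taskqueue_t *q, void (*fn)(void *), void *arg) {
    if (q->count == QUEUE_CAP) return -1;
    q->buf[q->tail] = (task_t){fn, arg};
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    return 0;
}

/* Returns -1 when empty; otherwise copies the oldest task into *out. */
static int taskqueue_pop(taskqueue_t *q, task_t *out) {
    if (q->count == 0) return -1;
    *out = q->buf[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return 0;
}
```

A worker would call `taskqueue_pop` under the queue's mutex and then invoke `t.fn(t.arg)` outside the lock.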
Condition variables allow a thread to wait when a given scenario is met (e.g. no tasks to process) and to be notified or signalled once that state of the system changes. This allows us to avoid the cost of constantly spinning threads, like you would see in a spin-lock for example, and helps to save CPU cycles. In thool, you will find a single cond-var used to signal a thread to wait while the taskqueue is empty and to awaken when there are tasks pending. I believe a binary semaphore could also be used similarly here, but it's not something I am familiar with and it didn't feel entirely appropriate compared to using condition variables.
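The wait/signal pattern looks roughly like the following sketch. This is the generic pthread condition-variable idiom rather than thool's exact code, and the names are mine:

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tasks_cond = PTHREAD_COND_INITIALIZER;
int pending = 0;  /* stand-in for "tasks in the queue" */

/* Producer side: publish a task, then wake one sleeping worker. */
void submit_task(void) {
    pthread_mutex_lock(&lock);
    pending++;
    pthread_cond_signal(&tasks_cond);
    pthread_mutex_unlock(&lock);
}

/* Worker side: sleep (no CPU spinning) until work is available.
 * The while-loop re-checks the predicate, which guards against
 * spurious wakeups. */
void claim_task(void) {
    pthread_mutex_lock(&lock);
    while (pending == 0)
        pthread_cond_wait(&tasks_cond, &lock);
    pending--;  /* claim exactly one task */
    pthread_mutex_unlock(&lock);
}
```

`pthread_cond_wait` atomically releases the mutex while sleeping and re-acquires it before returning, which is what makes the predicate check safe.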
Two mutexes are used in thool: one guarding thread_count, which was responsible for managing the number of alive and active threads in the threadpool, and taskqueue_lock, which governed which thread had control over the taskqueue's read-write privileges. Fortunately, synchronisation is mutually exclusive, which means a small number of mutexes can be used to govern threadpool state effectively.
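A minimal sketch of how a couple of mutexes can partition pool state (the struct layout and names here are assumptions for illustration, not thool's actual fields):

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t count_lock;  /* guards the thread counters        */
    int threads_alive;           /* threads that exist in the pool    */
    int threads_working;         /* threads currently running a task  */
    pthread_mutex_t queue_lock;  /* guards taskqueue read-write access */
} pool_state_t;

void pool_state_init(pool_state_t *p) {
    pthread_mutex_init(&p->count_lock, NULL);
    pthread_mutex_init(&p->queue_lock, NULL);
    p->threads_alive = 0;
    p->threads_working = 0;
}

/* Counter updates take only count_lock, so bookkeeping never
 * contends with queue traffic happening under queue_lock. */
void worker_started(pool_state_t *p) {
    pthread_mutex_lock(&p->count_lock);
    p->threads_alive++;
    pthread_mutex_unlock(&p->count_lock);
}

void worker_exited(pool_state_t *p) {
    pthread_mutex_lock(&p->count_lock);
    p->threads_alive--;
    pthread_mutex_unlock(&p->count_lock);
}
```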
While I'm happy with the finished product, I think there are some interesting areas that I would like to explore with thool. I would like to consider adding a force_add option when queueing tasks onto the threadpool, allowing the queue to grow when full. While this would require a malloc, the number of allocations would hopefully be logarithmic (e.g. by doubling the queue's capacity on each resize). Finally, I think it would be nice to expand the thool library such that it could be used on Windows machines. This would require writing some kind of wrapper around the pthread library, as it is not available on Windows runtimes. Anyways, I hope someone found this useful in some way, I really enjoyed the process 🤙.