This post is inspired by Guido van Rossum’s post on Python’s Global Interpreter Lock (GIL). There has been some recent buzz on efforts to remove the GIL in Python 3000 to enable real threading of applications in an effort take advantage of the trend towards multi-core systems.
This is a hard problem (as evidenced by Greg Stein and Mark Hammond?’s efforts in 1999) mainly from a performance point of view. GvR himself has stated that although he doesn’t have the time to do it himself, he would be willing to accept any patches to CPython which removes the GIL (as a compile time option) which does not negatively affect performance of single threaded applications and significantly improves I/O bound multi-threaded apps.
As I learnt at PyCon 2007, the problems in general have to do with Python’s garbage collection mechanism which uses reference counting and hence requires synchronized access to objects to ensure that the increfs and decrefs are done properly. Additionally, the GIL protects various caches (notably the integer (and some other types?) free lists so that creating and destroying those items don’t take too much time because the PyObjects are recycled.
So far, the solutions discussed involve some type of fine grained locking around critical code sections. The problem here is that beyond two or three cores, this doesn’t scale well because of lock contention (since so much stuff needs to be synchronized).
Well, what if we tried a different approach… Locks don’t seem like they’re going to solve the problem in a scalable way and if we’re going to take the effort to do this, might as well do it so that the benefits scale in an environment of more than 100 cores (its possible – see Azul Systems).
What if we dedicated one core for running synchronized code? Naturally this can’t be done without support from the kernel (perhaps some sort of kernel module/VxD).
At process initialization time, the kernel dedicates a single core to be the “synchronization server” and provides an instruction queue for it. This can vary between processes. When the interpreter tries to acquire the GIL, the kernel simply does a context switch and queues the thread to the dedicated sync server. The thread is run normally by the sync server by popping it from the queue. When the GIL is released, another kernel call is made which simply switches the thread out from the sync server to a different core.
Since we are guaranteeing that synchronized code is running on a single core, it is the equivalent of a lock at the cost of a context switch. All other code can run in parallel on the remaining cores without problems. Moreover, this requires no changes to existing Python code or extension modules which acquire and release the GIL properly.
Obviously, the benefits from two cores will be negligible (possibly even negative because of the context switches and zero parallelism). However, every additional core you add will contribute to the parallelism of the process. Optimizing the context switches shouldn’t be too difficult because no memory pages are copied (no shared memory or IPC).
Limitations: The main limitation that I can foresee where scalability is limited by the sync server. The more code which needs to be synchronized, the greater the CPU usage on the sync core. Once that core is maxed out, the queue length may start increasing as the other cores wait around for the sync core. It looks to me that notwithstanding the cost of context switches this is at least as good as running everything on one core.
The core acting as sync server does not need to be special in any way. Neither does it need to be dedicated to the process alone. If the sync queue is empty (or even if its not), the kernel can always use the core for other processes like normal. The only guarantee we need from the kernel module is that the core will never be used for non-synchronized code within a process where it is designated as the sync server.
Does this make sense or am I crazy?
-Prateek Sureka
Hi Prateek,
I think a couple things are worth pointing out about this idea. First, one of the premises seems to be mistaken, or at least misleading. You mentioned improving performance for multi-threaded I/O bound applications. Using more cores isn’t a solution for this problem. If the application is I/O bound, then it isn’t slow because it isn’t getting enough CPU resources; it’s slow because it isn’t getting enough I/O resources. Utilizing more cores on a multi-core system gives you more CPU resources, but typically not more I/O resources. I may have misunderstood your intent here, if so it would be great if you could clarify it, but it sounds like one of the very common mistakes people think about when they think about I/O.
That issue aside, if we consider a multi-threaded CPU bound application instead of one which is I/O bound, then the prospect of using additional cores does actually seem attractive, since those cores will bring additional CPU resources which may cause the program to run faster. So, to move on to the core of your idea with that in mind…
The first response I have is that the “synchronization server”, as you call it, will actually have to run almost all of the code in a Python process, and the other threads, possibly running on other cores, will be idle almost all the time. This is because there’s very little which can be done in CPython without (actually or notionally) holding the GIL. Consider a two threads computing factorials. In order to perform integer division or modulus, you need to examine two integers and create a third. Integers are immutable, so at least the value won’t change, no matter how many threads are using that integer object. However, the reference counting with CPython uses means that in order to look at the integer, you do have to change the refcount on the object. If this can be done without holding the GIL (or, equivalently, without running in the sync server) then you’re cool. If not, then you’ve probably lost any benefit of threading already (but I suspect you could manage it). Next you need to create a new integer. This might require allocating memory: a per-thread object allocator could let you do this without the GIL (or running in the sync server), though CPython doesn’t have one of these yet (and there might be some negative consequences of adding one – you still need to make the integer freelist aware of multiple threads, for example). On the other hand, you might want to re-use an existing, cached integer object to avoid the cost of allocation. In this case, you need to make your integer cache threadsafe, since other threads may be allocating integers at the same time and be hitting the cache as well. Even if you manage to do this with a tiny lock around only the integer cache, you still lose a lot of the benefit of multiple cores, since both threads in this example are doing heavy integer operations. Maybe parallelizing the actual division by itself is worthwhile though, so I’ll press on. Finally, the two inputs to the division operation are probably dropped, which means their reference counts need to be decremented. It may be possible to do the decrement without holding any locks, but on the chance that this is the last reference, you’ll need to release the integer (either free its memory or return it to a freelist, eg). This again needs a multi-thread friendly allocator or a tiny lock around the integer freelist (or cache or whatever).
So integer division probably still requires acquiring and then releasing two locks. If you are actually exploiting multiple cores in parallel to do the actual underlying division operation… well, you’re still probably losing compared to the GIL. Of course, one would have to measure this to be sure, and it will vary from platform to platform, etc, but in general it is far from a pure win.
If you have to do integer division in the sync server, then you’re much worse off, since you don’t get to parallelize on hardware and you pay the cost of a context switch for every integer division. Context switches are expensive, by the way. Often more expensive than mutex acquisition (certainly the good case for futex acquisition is much cheaper than a context switch – that’s why futexes exist in the first place).
So at a first look, this approach would only add complexity, not improve performance.
Some other problems that come to mind is that getting the scheduling behavior this requires out of the kernel requires a bit of trickery (you can do it quite easily with mutexes, actually – but the point here is to avoid locking and unlocking things, so doing it that way would be self-defeating). You also seem to de-emphasize the cost of context switches, which seems to be a mistake to me. If you could elaborate on this point, or perhaps provide some benchmarks for various architectures/platforms, that would go a long way toward making a convincing argument.
Still, reference counting in CPython is going to make this rather difficult. You should probably also discuss how this plan will avoid being dragged down by the costs associated with that.