<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Suggestion for removing the Python Global Interpreter Lock</title>
	<atom:link href="http://www.brainwavelive.com/blog/technology/suggestion-for-removing-the-python-global-interpreter-lock/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.brainwavelive.com/blog/technology/suggestion-for-removing-the-python-global-interpreter-lock/</link>
	<description></description>
	<lastBuildDate>Fri, 14 Sep 2007 19:23:37 -0700</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Jean-Paul Calderone</title>
		<link>http://www.brainwavelive.com/blog/technology/suggestion-for-removing-the-python-global-interpreter-lock/comment-page-1/#comment-4</link>
		<dc:creator>Jean-Paul Calderone</dc:creator>
		<pubDate>Fri, 14 Sep 2007 19:23:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.brainwavelive.com/wordpress/?p=32#comment-4</guid>
		<description>Hi Prateek,

I think a couple of things are worth pointing out about this idea. First, one of the premises seems to be mistaken, or at least misleading. You mentioned improving performance for multi-threaded I/O bound applications. Using more cores isn&#039;t a solution for this problem. If the application is I/O bound, then it isn&#039;t slow because it isn&#039;t getting enough CPU resources; it&#039;s slow because it isn&#039;t getting enough I/O resources. Utilizing more cores on a multi-core system gives you more CPU resources, but typically not more I/O resources. I may have misunderstood your intent here; if so, it would be great if you could clarify it, but it sounds like one of the very common mistakes people make when they think about I/O.

That issue aside, if we consider a multi-threaded CPU bound application instead of one which is I/O bound, then the prospect of using additional cores does actually seem attractive, since those cores will bring additional CPU resources which may cause the program to run faster. So, to move on to the core of your idea with that in mind...

The first response I have is that the &quot;synchronization server&quot;, as you call it, will actually have to run almost all of the code in a Python process, and the other threads, possibly running on other cores, will be idle almost all the time. This is because there&#039;s very little which can be done in CPython without (actually or notionally) holding the GIL. Consider two threads computing factorials. In order to perform integer division or modulus, you need to examine two integers and create a third. Integers are immutable, so at least the value won&#039;t change, no matter how many threads are using that integer object. However, the reference counting which CPython uses means that in order to look at the integer, you do have to change the refcount on the object. If this can be done without holding the GIL (or, equivalently, without running in the sync server) then you&#039;re cool. If not, then you&#039;ve probably lost any benefit of threading already (but I suspect you could manage it). Next you need to create a new integer. This might require allocating memory: a per-thread object allocator could let you do this without the GIL (or running in the sync server), though CPython doesn&#039;t have one of these yet (and there might be some negative consequences of adding one - you still need to make the integer freelist aware of multiple threads, for example). On the other hand, you might want to re-use an existing, cached integer object to avoid the cost of allocation. In this case, you need to make your integer cache threadsafe, since other threads may be allocating integers at the same time and be hitting the cache as well. Even if you manage to do this with a tiny lock around only the integer cache, you still lose a lot of the benefit of multiple cores, since both threads in this example are doing heavy integer operations. Maybe parallelizing the actual division by itself is worthwhile though, so I&#039;ll press on. 
Finally, the two inputs to the division operation are probably dropped, which means their reference counts need to be decremented. It may be possible to do the decrement without holding any locks, but on the chance that this is the last reference, you&#039;ll need to release the integer (either free its memory or return it to a freelist, e.g.). This again needs a multi-thread friendly allocator or a tiny lock around the integer freelist (or cache or whatever).

So integer division probably still requires acquiring and then releasing two locks. If you are actually exploiting multiple cores in parallel to do the actual underlying division operation... well, you&#039;re still probably losing compared to the GIL. Of course, one would have to measure this to be sure, and it will vary from platform to platform, etc, but in general it is far from a pure win.

If you have to do integer division in the sync server, then you&#039;re much worse off, since you don&#039;t get to parallelize on hardware and you pay the cost of a context switch for every integer division. Context switches are expensive, by the way. Often more expensive than mutex acquisition (certainly the good case for futex acquisition is much cheaper than a context switch - that&#039;s why futexes exist in the first place).

So at a first look, this approach would only add complexity, not improve performance.

Another problem that comes to mind is that getting the scheduling behavior this requires out of the kernel takes a bit of trickery (you can do it quite easily with mutexes, actually - but the point here is to avoid locking and unlocking things, so doing it that way would be self-defeating). You also seem to de-emphasize the cost of context switches, which seems to be a mistake to me. If you could elaborate on this point, or perhaps provide some benchmarks for various architectures/platforms, that would go a long way toward making a convincing argument.

Still, reference counting in CPython is going to make this rather difficult. You should probably also discuss how this plan will avoid being dragged down by the costs associated with that.</description>
		<content:encoded><![CDATA[<p>Hi Prateek,</p>
<p>I think a couple of things are worth pointing out about this idea. First, one of the premises seems to be mistaken, or at least misleading. You mentioned improving performance for multi-threaded I/O bound applications. Using more cores isn&#8217;t a solution for this problem. If the application is I/O bound, then it isn&#8217;t slow because it isn&#8217;t getting enough CPU resources; it&#8217;s slow because it isn&#8217;t getting enough I/O resources. Utilizing more cores on a multi-core system gives you more CPU resources, but typically not more I/O resources. I may have misunderstood your intent here; if so, it would be great if you could clarify it, but it sounds like one of the very common mistakes people make when they think about I/O.</p>
<p>That issue aside, if we consider a multi-threaded CPU bound application instead of one which is I/O bound, then the prospect of using additional cores does actually seem attractive, since those cores will bring additional CPU resources which may cause the program to run faster. So, to move on to the core of your idea with that in mind&#8230;</p>
<p>The first response I have is that the &#8220;synchronization server&#8221;, as you call it, will actually have to run almost all of the code in a Python process, and the other threads, possibly running on other cores, will be idle almost all the time. This is because there&#8217;s very little which can be done in CPython without (actually or notionally) holding the GIL. Consider two threads computing factorials. In order to perform integer division or modulus, you need to examine two integers and create a third. Integers are immutable, so at least the value won&#8217;t change, no matter how many threads are using that integer object. However, the reference counting which CPython uses means that in order to look at the integer, you do have to change the refcount on the object. If this can be done without holding the GIL (or, equivalently, without running in the sync server) then you&#8217;re cool. If not, then you&#8217;ve probably lost any benefit of threading already (but I suspect you could manage it). Next you need to create a new integer. This might require allocating memory: a per-thread object allocator could let you do this without the GIL (or running in the sync server), though CPython doesn&#8217;t have one of these yet (and there might be some negative consequences of adding one &#8211; you still need to make the integer freelist aware of multiple threads, for example). On the other hand, you might want to re-use an existing, cached integer object to avoid the cost of allocation. In this case, you need to make your integer cache threadsafe, since other threads may be allocating integers at the same time and be hitting the cache as well. Even if you manage to do this with a tiny lock around only the integer cache, you still lose a lot of the benefit of multiple cores, since both threads in this example are doing heavy integer operations. Maybe parallelizing the actual division by itself is worthwhile though, so I&#8217;ll press on. 
Finally, the two inputs to the division operation are probably dropped, which means their reference counts need to be decremented. It may be possible to do the decrement without holding any locks, but on the chance that this is the last reference, you&#8217;ll need to release the integer (either free its memory or return it to a freelist, e.g.). This again needs a multi-thread friendly allocator or a tiny lock around the integer freelist (or cache or whatever).</p>
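<p>As a rough illustration of the refcount and integer cache points above, here is a minimal sketch. It assumes a current CPython 3 in its default (GIL) build, where the details differ from the interpreter discussed here, and the function names are made up for the example. It only shows that holding one more reference writes to the object, and that small integers are shared objects while large ones are not.</p>

```python
import sys

BIG = "123456789012345678901234567890"  # far outside CPython's small-int cache


def refcount_delta():
    """Return how much the refcount changes when one more name refers to an int."""
    n = int(BIG)  # built at runtime, so it is an ordinary refcounted object
    before = sys.getrefcount(n)
    alias = n  # merely holding another reference writes to the object's refcount
    after = sys.getrefcount(n)
    del alias
    return after - before


def shares_cached_object(text):
    """Return True if parsing the same text twice yields the very same int object."""
    first = int(text)
    second = int(text)
    return first is second


if __name__ == "__main__":
    print("refcount delta:", refcount_delta())
    print("small int shared:", shares_cached_object("7"))
    print("large int shared:", shares_cached_object(BIG))
```

<p>Both behaviours are CPython implementation details rather than language guarantees, which is part of why removing the GIL touches so much of the interpreter.</p>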
<p>So integer division probably still requires acquiring and then releasing two locks. If you are actually exploiting multiple cores in parallel to do the actual underlying division operation&#8230; well, you&#8217;re still probably losing compared to the GIL. Of course, one would have to measure this to be sure, and it will vary from platform to platform, etc, but in general it is far from a pure win.</p>
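<p>One hedged way to start measuring this: the sketch below times the same integer division with and without two uncontended lock acquire/release pairs around it. The name time_division and the two locks are hypothetical stand-ins for the allocator and freelist locks described above. It measures Python-level threading.Lock overhead in a single thread, not C-level mutexes or real contention, so treat the numbers as a rough indication only.</p>

```python
import threading
import timeit


def time_division(iterations=200_000):
    """Return (plain_seconds, locked_seconds) for the same integer division."""
    lock_a = threading.Lock()  # stand-in for a lock around the allocator or cache
    lock_b = threading.Lock()  # stand-in for a lock around the freelist
    x, y = 10**30, 7

    def plain():
        return x // y

    def locked():
        with lock_a:  # uncontended acquire and release before the work
            pass
        result = x // y
        with lock_b:  # uncontended acquire and release after the work
            pass
        return result

    plain_s = timeit.timeit(plain, number=iterations)
    locked_s = timeit.timeit(locked, number=iterations)
    return plain_s, locked_s


if __name__ == "__main__":
    plain_s, locked_s = time_division()
    print(f"plain:  {plain_s:.4f}s")
    print(f"locked: {locked_s:.4f}s")
```

<p>Results will vary from platform to platform, so it is worth running on each target before drawing conclusions.</p>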
<p>If you have to do integer division in the sync server, then you&#8217;re much worse off, since you don&#8217;t get to parallelize on hardware and you pay the cost of a context switch for every integer division. Context switches are expensive, by the way. Often more expensive than mutex acquisition (certainly the good case for futex acquisition is much cheaper than a context switch &#8211; that&#8217;s why futexes exist in the first place).</p>
<p>So at a first look, this approach would only add complexity, not improve performance.</p>
<p>Another problem that comes to mind is that getting the scheduling behavior this requires out of the kernel takes a bit of trickery (you can do it quite easily with mutexes, actually &#8211; but the point here is to avoid locking and unlocking things, so doing it that way would be self-defeating). You also seem to de-emphasize the cost of context switches, which seems to be a mistake to me. If you could elaborate on this point, or perhaps provide some benchmarks for various architectures/platforms, that would go a long way toward making a convincing argument.</p>
<p>Still, reference counting in CPython is going to make this rather difficult. You should probably also discuss how this plan will avoid being dragged down by the costs associated with that.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

