PyCon 2007 was a great experience. I got to meet Guido and hear him speak on Python 3000, which I thought was quite illuminating.
One key point I took away is that there is a big difference between the language and its implementation. Currently, the implementation of choice is CPython, but there are many promising new projects like IronPython, PyPy and Jython (to which I am now a contributor), which hope to bridge the gap between Python and various other languages and fix some of the commonly raised complaints about the CPython implementation.
This is particularly relevant to Brainwave. Although we don’t expect to move to a different implementation anytime soon, there are some key areas which generated a lot of buzz in terms of lectures, open space talks and BoF sessions. The ones I’m specifically interested in today are 1) Security and 2) Parallelism.
Python is an extremely introspective language. The dir() and eval() functions give you a lot of power to work with objects in ways the original creators never intended. This is a double-edged sword: the same power makes it hard to implement security at the object level. The deep introspective nature of Python seems to be better suited to a capability-based approach than to a pure permissions model. Brett Cannon and I discussed a lot of options, ranging from running multiple instances of the interpreter as separate sandboxes in the same process space (not possible because of the way the CPython implementation caches object references) to adding keywords to the language (it doesn’t seem like a good idea to change the language to serve the implementation). Brett is working on an interesting project to run a separate interpreter process in secure mode (very much like the old rexec solution, which, according to Brett, was dropped because of the number of bugs it caused in the implementation).
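To make the introspection point concrete, here is a minimal sketch. The Account class and the find_hidden_attributes helper are hypothetical names I made up for illustration; only dir(), getattr() and eval() are real built-ins. It shows that an attribute "hidden" with a double underscore is only renamed, and that any code holding the object can still find and read it.

```python
class Account:
    """Hypothetical class that tries to hide a field by naming convention."""

    def __init__(self, owner, pin):
        self.owner = owner
        # A double-underscore attribute is only name-mangled to
        # _Account__pin; it is not access-controlled in any way.
        self.__pin = pin


def find_hidden_attributes(obj):
    """Return {name: value} for attributes with a single leading underscore.

    dir() lists every attribute name, including mangled "private" ones;
    the filter just skips the built-in dunder names such as __class__.
    """
    return {
        name: getattr(obj, name)
        for name in dir(obj)
        if name.startswith("_") and not name.startswith("__")
    }


if __name__ == "__main__":
    account = Account("alice", "1234")
    # Any code that holds a reference to the object can enumerate it...
    print(find_hidden_attributes(account))
    # ...and eval() can reach the same attribute from a plain string.
    print(eval("account._Account__pin", {"account": account}))
```

This is why a permissions check attached to the object itself is easy to walk around: whoever holds the reference can inspect everything behind it. A capability-style design would instead avoid handing out the reference in the first place.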
At any rate, you can always drop into C to get around all these obstacles (obviously not very Pythonic!). It does mean that newcomers and enterprise developers will have a harder time getting to grips with some of these nuances. (Note: The IronPython and Jython teams get around this issue by leaving the security to the underlying VM, just like CPython.)
Parallelism was another hot topic of discussion, and the guys at IPython are doing some really interesting things. The fundamental problem is CPython’s global interpreter lock (GIL). Any thread in CPython needs access to the interpreter to run, which requires acquiring this global lock, so only one thread runs Python code at a time and true parallelism cannot be achieved. It isn’t practical to simply eliminate the GIL because of the way CPython does garbage collection (reference counting): if multiple threads had simultaneous access to the same objects, it would be pretty hard to keep the reference counts consistent across them. Stackless Python (which is the first thing most people mention when this topic arises) solves the problem of concurrency but does nothing for true parallelism.
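As a small illustration of the reference counting mentioned above, the sketch below uses sys.getrefcount(), which is specific to CPython. The function name is my own. It shows the count on an object going up and down as references are created and dropped; this is the bookkeeping that every thread would have to update safely if there were no global lock.

```python
import sys


def refcount_delta_for_new_reference():
    """Return (delta after adding a reference, delta after dropping it).

    CPython-specific: sys.getrefcount() reports the interpreter's
    reference count for an object. The absolute number includes temporary
    references, so only the differences are meaningful here.
    """
    payload = object()
    before = sys.getrefcount(payload)

    holder = [payload]  # the list now holds one more reference
    after_add = sys.getrefcount(payload)

    del holder          # dropping the list releases that reference
    after_drop = sys.getrefcount(payload)

    return after_add - before, after_drop - before


if __name__ == "__main__":
    # On CPython I would expect this to show one reference gained,
    # then the count back at its starting value.
    print(refcount_delta_for_new_reference())
```

Every assignment, argument pass and container insert touches a counter like this one, which is why protecting each counter individually is so much more expensive than one coarse lock.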
IronPython and Jython do not have a GIL because they once again use the underlying VM’s threading and GC models to implement true parallelism.
One good solution is once again to use multiple processes and either share objects between them (à la the POSH project) or do some form of RMI (Pyro, MPI4Py, PyMPI etc.).
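Here is a minimal sketch of the multiple-process idea using only the standard library's subprocess module, not POSH or Pyro. The worker task and the function name are hypothetical. Each worker is a separate interpreter with its own GIL, so the workers can run on different cores; results come back over a pipe rather than through shared objects.

```python
import subprocess
import sys

# Hypothetical CPU-bound task that each worker runs in its own interpreter.
WORKER_SOURCE = """
import sys
n = int(sys.argv[1])
print(sum(i * i for i in range(n)))
"""


def sum_squares_in_processes(sizes):
    """Start one interpreter process per input, then collect the results."""
    # Launch every worker before waiting on any of them, so that they can
    # all be running at the same time.
    workers = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE, str(n)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for n in sizes
    ]

    results = []
    for worker in workers:
        output, _ = worker.communicate()
        if worker.returncode != 0:
            raise RuntimeError("worker exited with status %d" % worker.returncode)
        results.append(int(output))
    return results


if __name__ == "__main__":
    print(sum_squares_in_processes([10, 1000, 100000]))
```

The cost is that everything crossing the process boundary has to be serialized, here as text on stdout. That overhead is what object-sharing projects like POSH and RMI layers like Pyro try to hide.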
Brainwave is currently using some of these workarounds. We have a capability-based security model at the meme level. Each database query also carries with it information regarding the user context (i.e. who is logged in, effectively the user’s key). This is used at the end of the query to filter any protected data out of the result set, so only accessible data is returned to the client. Since this data transfer happens across the process boundary, it is harder (impossible?) for the client to subvert the server’s memory space.
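The filtering step might look roughly like the sketch below. This is not Brainwave's actual code: Meme, Query, allowed_keys and execute are names I have invented for illustration, and the real engine is more involved. The point is only that the user's key travels with the query and the server strips protected rows before anything is sent back.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Meme:
    """Hypothetical record; an empty allowed_keys set means unprotected."""
    name: str
    value: str
    allowed_keys: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Query:
    """Hypothetical query that carries the user context along with it."""
    predicate: Callable[[Meme], bool]
    user_key: str


def execute(query: Query, store: List[Meme]) -> List[Meme]:
    """Run the query, then filter the result set by the user's key."""
    matches = [meme for meme in store if query.predicate(meme)]
    # The filter runs last, on the server side, so protected memes never
    # cross the process boundary to a client that lacks the key.
    return [
        meme
        for meme in matches
        if not meme.allowed_keys or query.user_key in meme.allowed_keys
    ]


if __name__ == "__main__":
    store = [
        Meme("public-note", "hello"),
        Meme("salary", "100k", frozenset({"key-alice"})),
    ]
    everything = lambda meme: True
    print([m.name for m in execute(Query(everything, "key-bob"), store)])
    print([m.name for m in execute(Query(everything, "key-alice"), store)])
```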
The parallelism is currently implemented using Pyro. The engine uses a host-process namespace so that it can identify all processes across all hosts (multiple processes can run on the same host to take advantage of multiple cores/processors). The code is still under heavy development, so I don’t have all the answers yet regarding exactly which algorithms will be used for the multi-master, transaction-based data synchronization. We expect to prototype a few options (let me know if you have suggestions) and go with whatever has the best benchmarks.
I’m going to keep an eye on what the CPython guys come up with to deal with these issues.
-Prateek Sureka