Showing posts with label programming.

Monday, July 09, 2012

Multi-Master Replication in MongoDB

Since starting my own consulting business, I've had the opportunity to work with lots of interesting technologies. Today I want to tell you about one I got to develop for MongoDB: multi-master replication.

Tuesday, June 19, 2012

Declarative Schemas for MongoDB in Python using Ming

Continuing my series on MongoDB and Python, this article digs further into the Python MongoDB toolkit Ming, this time looking at an alternative, declarative syntax for schema definition. If you're just getting started with MongoDB, you might want to read the previous articles in the series first:

And now that you're all caught up, let's take a look at that declarative syntax....

Friday, June 08, 2012

Schema Maintenance with Ming and MongoDB

Continuing my series on MongoDB and Python, this article introduces the Python MongoDB toolkit Ming and what it can do to simplify your MongoDB code and ease maintenance. If you're just getting started with MongoDB, you might want to read the previous articles in the series first:

And now that you're all caught up, let's jump right in with Ming....

Saturday, December 31, 2011

MongoDB's Write Lock

MongoDB, as some of you may know, has a process-wide write lock. This has drawn some ridicule from database purists when they discover such a primitive locking model. Per-database and per-collection locking is on the roadmap for MongoDB, but it's not here yet. What MongoDB version 2.0 did introduce was locking-with-yield. I was curious about the performance impact of the write lock and the improvement from locking-with-yield, so I decided to do a little benchmark, MongoDB 1.8 versus MongoDB 2.0.

Thursday, March 31, 2011

Allura Sprint-orials

So with our initial announcement at PyCon and the sprints afterwards, the newly open source SourceForge project hosting platform Allura has gotten a bit of attention. One thing we want to do at SourceForge is to be responsive and open to the community, to make Allura much more than a "throw it over the wall" open source project. To that end, we've been looking at holding a sprint-orial series.

Wednesday, March 09, 2011

Allura, the Open Source Forge

So in kind of a soft launch, SourceForge released the project I've been working on (Allura) under the Apache license last month. The project happens to be the actual software underlying the SourceForge.net 'beta' tools, and we're hoping to get lots of community involvement, starting at PyCon.

Monday, December 07, 2009

Ming 0.1 Released - Python Library for MongoDB

One of the things that's been nice about working with SourceForge for the last few months is the chance I get to work with new open source technology and contribute something back. Well, the first (in a long series, I hope) of projects we're putting out there as the result of recent work is a little library we wrote called Ming. It's been available over git for a while, but I just finished publishing our first "official" release, with a tutorial, over at http://merciless.sourceforge.net and http://pypi.python.org/pypi/Ming/0.1. Ming gives you a way of enforcing schemas on a MongoDB database in a Python application. We also throw in automatic, lazy schema migration for good measure.

Thursday, March 19, 2009

Announcing MetaPython - Macros for Python

As I mentioned in my last post, I have been considering writing some version of macros for Python and was looking for use cases. Well, having gotten the use cases I so desired from my wonderful commenters, I went ahead and put together an import hook and Google Code project that I'm calling MetaPython — all just in time for PyCon! (I have no talks, but I will be there, and would love to have a MetaPython Open Space if anyone's interested.)

So what's all the excitement about? MetaPython introduces some hooks to allow you to modify module code just before it is seen by the Python interpreter (at what I'm calling "import time"). The import-time syntax is pretty simple, and is (almost) all denoted by a question mark ? prefix (question marks are currently syntax errors in regular Python). Here is a trivial example that defines an import-time function (which as we will see can be used as a macro) that will conditionally remove a function call (and the evaluation of its associated arguments). Suppose the following text is saved in a file "cremove.mpy":

def cremove(debug, expr):
    debug = eval(str(debug))
    defcode empty_result:
        pass
    if debug:
        result = expr
    else:
        result = empty_result
    return result

The idea here is that cremove will be called with two metapython.Code values, debug and expr. cremove will convert debug to its Python code representation by calling str() and then evaluate the result. If debug is true, then expr will be returned. Otherwise a pass statement will be returned (defined using the MetaPython import-time construct defcode which defines a code template). To actually call cremove as a macro, we will need to define another MetaPython module, say "test_cremove.mpy":


import logging
logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger(__name__)

?from cremove import cremove

def do_test():
    ?cremove(True, log.debug('This statement will be logged'))
    ?cremove(False, log.debug('This statement will not be logged'))


Here, we do an import-time import (the ?from... import business). This makes the module we're importing available at import-time (a regular import would be seen just as another line of Python code at import-time). To actually call cremove as a macro, we just need to prefix it with the "?" as shown.

Now, to actually test this, we'll need to install MetaPython and fire up an interpreter. MetaPython is available from the CheeseShop, so to get it just run easy_install MetaPython. Once it's installed, we can test our MetaPython code as follows:


>>> import metapython; metapython.install_import_hook()
>>> import test_cremove
>>> test_cremove.do_test()
DEBUG:test_cremove:This statement will be logged


Since macro expansion can get pretty complex and it's always tricky debugging code you've never seen, the fully-expanded module is available as the __expanded__ attribute of the MetaPython module:


>>> print test_cremove.__expanded__
import logging
logging .basicConfig (level =logging .DEBUG )
log =logging .getLogger (__name__ )

def do_test ():
    log .debug ('This statement will be logged')
    pass


There's a lot more to MetaPython, but hopefully this has whetted your appetite. There's a short tutorial available that shows how you can implement the collections.namedtuple class factory using a macro. It also shows how you can use Jinja2 syntax along with the defcode construct to dynamically produce Python code. Have fun, and let me know what you think!

Thursday, March 12, 2009

Python Macros?

I've been thinking a bit about macros and what use they might be in Python. Basically, I was contemplating writing an import hook that would allow you to use code quoting and unquoting and stuff for your Python modules. My motive was just that Lisp people seem to rave about how awesome macros are all the time, so I figured they must be cool.

As I sat down to actually start figuring out what macro definitions and uses should look like in Python, I thought, hey, I'll just throw together a use case. But I haven't been able to come up with one (yet).

Most of the examples I found on the web focused on "hey, you can implement a 'while' loop with macros in Lisp!" or "hey, look at all the cool stuff the 'setf' macro can do!" So I started to wonder whether maybe Lisp people love macros because they let them extend Lisp's minimalist syntax with new constructs (like object-oriented programming with CLOS, while loops, etc.). Python, OTOH, has pretty rich syntax. It has a nice OOP system with syntactic support, while and for loops, generators, iterators, context managers, primitive coroutines, comprehensions, destructuring bind, and more. What would I use macros for? (OK, depending on the syntax, I could add a "switch" statement, but that hardly seems worth the trouble.)

I should mention that I also saw some examples of people using macros for performance; you basically get rid of a function call and you can potentially make the inner loop of some critical function run really fast. But if that's all it buys me in Python-land (well, that and a switch statement), my motivation is pretty low. Because let's face it -- if your critical inner loop is written in pure Python, you can pretty easily throw it at Cython and get better performance than Python macros could ever provide.

So here's the question: does anyone out there have an idea of what macros would add to Python's power or expressiveness? Or maybe some Lisp, MetaOCaml, or Template Haskell hackers who can enlighten me as to what macros can add to a language with already rich syntax?

Update 2009-03-19


I have implemented MetaPython 0.1, a macro and code quoting system for Python, covered in the next blog post.

Friday, August 22, 2008

Lazy Descriptors

Today I had a need to create a property on an object "lazily." The Python builtin property does a great job of this, but it calls the getter function every time you access the property. Here is how I ended up solving the problem:

First of all, I had (almost) the behavior I wanted by using the following pattern:

class Foo(object):
    def __init__(self):
        self._bar = None
    @property
    def bar(self):
        if self._bar is None:
            print 'Calculating self._bar'
            self._bar = 42
        return self._bar

There are a couple of problems with this, however. First of all, I'm polluting my object's namespace with a _bar attribute that I don't want. Secondly, I'm using this pattern all over my codebase, and it's quite an eyesore.

Both problems can be fixed by using a descriptor. Basically, a descriptor is an object with a __get__ method which is called when the descriptor is accessed as a property of a class. The descriptor I created is below:

class LazyProperty(object):

    def __init__(self, func):
        self._func = func
        self.__name__ = func.__name__
        self.__doc__ = func.__doc__

    def __get__(self, obj, klass=None):
        if obj is None: return None
        result = obj.__dict__[self.__name__] = self._func(obj)
        return result

The descriptor is designed to be used as a decorator, and will save the decorated function and its name. When the descriptor is accessed, it will calculate the value by calling the function and save the calculated value back to the object's dict. Saving back to the object's dict has the additional benefit of preventing the descriptor from being called the next time the property is accessed. So I can now use it in the class above:

class Foo(object):
    @LazyProperty
    def bar(self):
        print 'Calculating self._bar'
        return 42
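
Just to sanity-check the behavior, here's a quick interpreter session (my own illustration, not from the original post): the first access computes and caches the value, and the second access comes straight from the instance dict:

>>> f = Foo()
>>> f.bar
Calculating self._bar
42
>>> f.bar
42
>>> 'bar' in f.__dict__
True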

So I get a nice lazily calculated property that doesn't recalculate bar every time it's accessed, and the class itself doesn't have to bother with any memoization. What do you think about it? Is this a pattern you use in your code?

Tuesday, August 19, 2008

A Little Command Line Love

One of the things I do in my "spare" time is work on building web applications that will (hopefully) earn some spare money on the side without too much maintenance on my part. Those who have read The Four Hour Work Week will recognize this as my "muse" business.

In working on these web apps, I needed a place to host them, so I went with WebFaction due to their excellent support for TurboGears. I'm using a shared hosting environment with WebFaction, so my usual method of getting things running can't involve any scripts in /etc/init.d like I'd use at work. I used to do the old nohup python start-appname.py >> output.log 2>&1 & to "daemonize" the process and then ps -furick446 to figure out which processes I needed to kill/restart when updating code. This was irritatingly verbose, so I figured I'd build a "userspace daemonizer", which I'll describe below.

The requirements of my daemonizer were pretty straightforward:

  • I should be able to add/remove services from the command line with a minimum amount of typing

  • I should be able to start/stop/restart any service just by listing it by name (no more ps -furick446)

  • It should do "real" daemonization (fork, fork, dup stuff if you're familiar with it) rather than the nohup garbage I'd been using

  • It should perform some sort of verification or status check on processes to make sure that they're successfully started/stopped/running/etc.


The first (and most questionable) decision I made was to go with SQLAlchemy as my service database. I say this is questionable because I ended up with only one table and one client that uses the database, so a fully relational model is really overkill here. Anyway, here's the SQLAlchemy setup code I used. It's pretty simple, and represents kind of the "lowest common denominator" of SQLAlchemy usage. (I've also included all the imports required for the whole shebang at the top of the file.)

#!/usr/bin/env python2.5
import os, sys, signal, shlex, time
from optparse import OptionParser

from sqlalchemy import *
from sqlalchemy.orm import *

HOME = os.environ.get('HOME')
DBFILE = os.path.abspath(os.environ.get(
    'SERVER_FILE',
    os.path.join(HOME, 'server.sqlite')))

DBURI = 'sqlite:///' + DBFILE

engine = create_engine(DBURI)
metadata = MetaData(engine)
session = scoped_session(sessionmaker(bind=engine))

process = Table(
    'process', metadata,
    Column('name', String(255), primary_key=True),
    Column('pid', Integer),
    Column('command_line', String(255)),
    Column('working_directory', String(255)),
    Column('stdin', String(255), default='/dev/null'),
    Column('stdout', String(255), default='/dev/null'),
    Column('stderr', String(255), default='/dev/null'))

class Process(object):
    def __repr__(self):
        return '%s(PID %s, WD %s): %s < %s >> %s 2>> %s' % (
            self.name, self.pid, self.working_directory,
            self.command_line, self.stdin, self.stdout, self.stderr)

session.mapper(Process, process)

So, for all of you out there who wonder how to use SQLAlchemy outside of a web framework, there you go. The idea here is that every service has a row in the process table, and every service can optionally redirect its stdin/stdout/stderr to/from files. If a process is running, it will have a pid. You can also specify the startup directory of each process. That pretty much covers the data model.
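
Just to see the model in action before wiring up any commands, here's a quick sketch (my own illustration; the 'myapp' service and its command line are made up, and it leans on session.mapper's save-on-init behavior):

metadata.create_all()
Process(name='myapp',
        command_line='python start-myapp.py',
        working_directory=HOME)
session.commit()
print session.query(Process).get('myapp')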

My next task was to figure out how to invoke this from the command line. I decided to make all the commands of the form python server.py command [options] [service]. Actually, the only command that takes any options is the add command, so I definitely over-engineered here, but I wanted to be able to specify a global list of options shared by all commands to be used with an optparse option parser. So I built the following dictionary:

OPTPARSE_OPTIONS = {
    'working_directory':(['-w', '--working-directory'],
                         dict(dest='working_directory', default=HOME,
                              help='Set working directory to DIR',
                              metavar='DIR')),
    'stdin':(['-i', '--stdin'],
             dict(dest='stdin', default='/dev/null',
                  help='Use FILE as stdin', metavar='FILE')),
    'stdout':(['-o', '--stdout'],
              dict(dest='stdout', default='/dev/null',
                   help='Use FILE as stdout', metavar='FILE')),
    'stderr':(['-e', '--stderr'],
              dict(dest='stderr', default='/dev/null',
                   help='Use FILE as stderr', metavar='FILE')),
    }

The next task was to specify the commands. I'm lazy and I like decorators, so I decided that new commands should be as easy as possible to write. My add command, for instance, is the most complex, and it looks like this:

@Command
def add(service, command, working_directory,
        stdin, stdout, stderr):
    p = Process(name=service,
                command_line=command,
                working_directory=working_directory,
                stdin=stdin,
                stdout=stdout,
                stderr=stderr)
    session.commit()

That's pretty simple. Of course, as you might have guessed, the Command decorator isn't all that simple. Its responsibilities are as follows:

  • Add the function name to a command list so I can get a list of commands from the command line

  • Build an optparse parser based on the OPTPARSE_OPTIONS dict and the named arguments to the function

  • Wrap the function in a new function with a signature like function(args) so it can be called with sys.argv[2:]

  • In the wrapper function, use the optparse parser to initialize the argument list to the function and then call with a named argument dict


So with all that explanation, here's the code:

class Command(object):
    commands = {}

    def __init__(self, func):
        Command.commands[func.__name__] = self
        options = self.get_options(func)
        self.func = func
        self.optparse_options = []
        self.optparse_option_names = []
        self.positional_options = []
        for o in options:
            if o in OPTPARSE_OPTIONS:
                self.optparse_option_names.append(o)
                self.optparse_options.append(
                    OPTPARSE_OPTIONS[o])
            else:
                self.positional_options.append(o)
        positional_string = ' '.join(
            '<%s>' % o for o in self.positional_options)
        self.parser = OptionParser('%%prog [options] %s' % positional_string,
                                   prog=func.__name__)
        for args, kwargs in self.optparse_options:
            self.parser.add_option(*args, **kwargs)

    def get_options(self, func):
        code = func.func_code
        return [ vn for vn in code.co_varnames[:code.co_argcount] ]

    @classmethod
    def run(klass, args):
        if args and args[0] in klass.commands:
            return klass.commands[args[0]](args[1:])
        else:
            print 'Unrecognized command'
            print 'Acceptable commands:'
            for name in klass.commands:
                print ' -', name

    def __call__(self, args):
        (opts, a) = self.parser.parse_args(args)
        if len(a) != len(self.positional_options):
            self.parser.error('Wrong number of arguments')
        o = {}
        for name in self.optparse_option_names:
            o[name] = getattr(opts, name, None)
        for name, value in zip(self.positional_options, a):
            o[name] = value
        return self.func(**o)

Once I've built the decorator, the commands are fairly straightforward (as the add command shows). Here are the "short" commands:

@Command
def initialize():
    try:
        metadata.drop_all()
    except:
        pass
    metadata.create_all()
    print 'Service database initialized'

@Command
def list():
    q = Process.query()
    if q.count():
        for p in Process.query():
            print p
    else:
        print 'No processes'

@Command
def add(service, command, working_directory,
        stdin, stdout, stderr):
    p = Process(name=service,
                command_line=command,
                working_directory=working_directory,
                stdin=stdin,
                stdout=stdout,
                stderr=stderr)
    session.commit()

@Command
def remove(service):
    p = Process.query.get(service)
    session.delete(p)
    print 'Removed %s from the service list' % service
    session.commit()

@Command
def status(service):
    p = Process.query.get(service)
    print p

Starting and stopping processes is a little more complex, but not too bad. First off, I needed a way to determine if a process was running. I didn't want to parse the ps -furick446 results, so I send a unix signal 0 to the PID (which doesn't do anything to the receiving process). If there's an exception, the process is either not running or not owned by me. So here's the is_running code:

def is_running(pid):
    try:
        os.kill(pid, 0)
        return True
    except Exception, ex:
        return False

I also need a daemonizer function that will start, daemonize, and set the PID of a process object:

def daemonize(p):
    pid = os.fork()
    if pid: return # exit first parent
    os.chdir(p.working_directory)
    os.umask(0)
    os.setsid()
    pid = os.fork()
    if pid:
        Process.query.get(p.name).pid = pid
        session.commit()
        sys.exit(1) # exit second parent
    si = open(p.stdin, 'r')
    so = open(p.stdout, 'a+')
    se = open(p.stderr, 'a+', 0)
    os.dup2(si.fileno(), sys.stdin.fileno())
    os.dup2(so.fileno(), sys.stdout.fileno())
    os.dup2(se.fileno(), sys.stderr.fileno())
    args = shlex.split(p.command_line.encode('utf-8'))
    os.execvp(args[0], args)

Once these are defined, I can start, stop, and restart processes fairly simply:

@Command
def start(service):
    print 'Starting %s ...' % service
    p = Process.query.get(service)
    if p.pid is not None:
        if is_running(p.pid):
            print '... %s already running with PID %s' % (
                service, p.pid)
            return
    p.pid = 0
    session.commit()
    daemonize(p)
    print '... started %s' % service

@Command
def stop(service):
    print 'Stopping %s ...' % service
    p = Process.query.get(service)
    if not p.pid or not is_running(p.pid):
        print '... service %s is already stopped' % service
        p.pid = None
        session.commit()
        return
    for retry in range(5):
        print '... sending SIGTERM to %s' % p.pid
        os.kill(p.pid, signal.SIGTERM)
        time.sleep(0.5)
        if not is_running(p.pid):
            break
    else:
        print '... sending SIGKILL to %s' % p.pid
        os.kill(p.pid, signal.SIGKILL)
        time.sleep(0.5)
        if is_running(p.pid):
            print '... process %s could not be killed' % p.pid
            return
    p.pid = None
    session.commit()
    print '... %s is stopped' % service

@Command
def restart(service):
    stop([service])
    start([service])

Finally, I need to hook the script up and make sure it runs from the command line:

def main():
    Command.run(sys.argv[1:])

if __name__ == '__main__':
    main()

And there you have it! A userspace daemonizer that lets you manage an arbitrary number of services. It is definitely overkill in many ways, but hopefully sharing the process of building it is as educational to you as it was to me.

Wednesday, August 13, 2008

Miruku - Migrations for SQLAlchemy

One of the painful things about working with any database-oriented project in production is that you can't just drop the database and re-create it every time you have a schema change. (Of course, you could do that, but your users might get a little miffed when their data disappears.) Rails and Django have for this reason had support for migrating from one schema version to another. Well, now SQLAlchemy has an automatic migrations tool by the name of Miruku. I haven't tried it, and it's an extremely early version (0.1a7), but it looks promising. Have a look here, and let me know what you think.

Friday, July 18, 2008

RESTfulness in TurboGears

Mark Ramm and I were talking about how to differentiate GET, POST, PUT, and DELETE in TurboGears 2 and we came up with a syntax that's pretty cool. That's not the reason for this post, though. This morning, I noticed that our syntax is completely compatible with TurboGears 1.x -- so here's how you do it.

What we wanted was to expose a RESTful method like this:

class Root(controllers.RootController):

    class index(RestMethod):
        @expose('json')
        def get(self, **kw):
            return dict(method='GET', args=kw)
        @expose('json')
        def post(self, **kw):
            return dict(method='POST', args=kw)
        @expose('json')
        def put(self, **kw):
            return dict(method='PUT', args=kw)
        # NOT exposed, for some reason
        def delete(self, **kw):
            return dict(method='DELETE', args=kw)

The TurboGears 1 implementation relies on the way that CherryPy determines whether a given property is a valid URL controller. It basically looks at the property and checks to make sure that:

  • it's callable, and

  • it has a property exposed which is true (or
    "truthy")


The "a-ha!" moment came when realizing that:

  • Classes are callable in Python (calling a class == instantiating an object)

  • Classes can have exposed attributes


So with appropriate trickery behind the scenes, the above syntax should work as-is. So here's the "appropriate trickery behind the scenes":

class RestMeta(type):
    def __new__(meta, name, bases, dct):
        cls = type.__new__(meta, name, bases, dct)
        allowed_methods = cls.allowed_methods = {}
        for name, value in dct.items():
            if callable(value) and getattr(value, 'exposed', False):
                allowed_methods[name] = value
        return cls

The first thing I wanted to do was create a metaclass to use for RestMethod so that I could save the allowed HTTP methods. Nothing too complicated here.

import cherrypy as cp

class ExposedDescriptor(object):
    def __get__(self, obj, cls=None):
        if cls is None: cls = obj
        allowed_methods = cls.allowed_methods
        cp_methodname = cp.request.method
        methodname = cp_methodname.lower()
        if methodname not in allowed_methods:
            raise cp.HTTPError(405, '%s not allowed on %s' % (
                cp_methodname, cp.request.browser_url))
        return True

This next thing is tricky. If you don't understand what a "descriptor" is, I suggest the very nice description here. The basic thing I get here is the ability to intercept a reference to a class attribute the same way the property() builtin intercepts references to object attributes.

The idea here is to use this descriptor as the exposed attribute on the RestMethod class. When CherryPy tries to figure out if the method is exposed, it calls ExposedDescriptor.__get__ and uses the result as the value of exposed. If the HTTP method in question is not exposed, then the code raises a nice HTTP 405 error, which is the correct response to sending, say, a POST to a method expecting only GETs.

The final part of the solution, the actual RestMethod, is actually pretty simple:

class RestMethod(object):
    __metaclass__ = RestMeta

    exposed = ExposedDescriptor()

    def __init__(self, *l, **kw):
        methodname = cp.request.method.lower()
        method = self.allowed_methods[methodname]
        self.result = method(self, *l, **kw)

    def __iter__(self):
        return iter(self.result)

The sequence of things is now this:

  • CherryPy traverses along to the root.index class and looks up root.index.exposed

  • root.index.exposed is intercepted by the descriptor which checks the CherryPy request method to see if it's valid for this controller, and if it is, returns True

  • CherryPy says, "Great! root.index is exposed. So now I'll call root.index(...)." This calls index's constructor, which in turn calls the appropriate method and saves the result in self.result

  • CherryPy says "Cool! root.index returned me an iterable object. I'll iterate over it to get the text to send back to the browser." This calls root.index(...).__iter__, which is just delegated to the result that the real controller gave.


At the end, we get a fully REST-compliant controller with a nice (I think) syntax.

Update 2008-07-21: After some more thinking, I realized that the metaclass isn't necessary. The descriptor is more important, but also not strictly necessary. My original design did everything in the constructor of RestMethod, but this runs after all of TurboGears' validation and identity checking happens, so it's pretty inefficient. If you want the implementation with the descriptor but without the metaclass, you can do this:

import cherrypy as cp

class ExposedDescriptor(object):
    def __get__(self, obj, cls=None):
        if cls is None: cls = obj
        cp_methodname = cp.request.method
        methodname = cp_methodname.lower()
        method = getattr(cls, methodname, None)
        if callable(method) and getattr(method, 'exposed', False):
            return True
        raise cp.HTTPError(405, '%s not allowed on %s' % (
            cp_methodname, cp.request.browser_url))

The benefit to using a metaclass is that the check for the existence, "callability", and exposed-ness of the method happens up front rather than on each request. Also, I don't like the fact that the metaclass pollutes the class namespace with the "allowed_methods" attribute. There's probably a way to clean that up and put the descriptor in a closure, but I haven't had time to look at it. Maybe that will be a future post....

Friday, July 11, 2008

Cascade Rules in SQLAlchemy

Last night at the PyAtl meeting, there was a question about how you define your cascade rules in SQLAlchemy mappers. I'll confess that it confused me at first, too, but here's all you need to know:

What's "cascading" in the mapper is session-based operations. This includes putting an object into the session (saving it), deleting an object from the session, etc. Generally, you don't care about all that stuff, because it Just Works most of the time, as long as you specify cascade="all" on your relation() properties in your mappers. What this means is "whatever session operation you do to the mapped class, do it to the related class as well".

One little confusing thing is that there's another thing you'll often want to specify in your cascade rules, and that's the "delete-orphan". In fact, most of my 1:N relation()s look like:

mapper(ParentClass, parent, properties=dict(
    children=relation(ChildClass, backref='parent',
                      cascade='all,delete-orphan')
    )
)

The "delete-orphan" specifies that if you ever have a ChildClass instance that is "orphaned", that is, not connected to some ParentClass, go ahead and delete that ChildClass. You want to specify this whenever you don't want ChildClass instances hanging out with null ParentClass references. Note that even if you don't specify "delete-orphan", deletes on the ParentClass instance will still cascade to related ChildClass instances. An example is probably best. Say you have the following schema and mapper setup:

photo = Table(
    'photo', metadata,
    Column('id', Integer, primary_key=True))
tag = Table(
    'tag', metadata,
    Column('id', Integer, primary_key=True),
    Column('photo_id', None, ForeignKey('photo.id')),
    Column('tag', String(80)))

class Photo(object): pass

class Tag(object): pass

session.mapper(Photo, photo, properties=dict(
    tags=relation(Tag, backref='photo', cascade="all")))
session.mapper(Tag, tag)

I'll go ahead and create some photos and tags:

p1 = Photo(tags=[
    Tag(tag='foo'),
    Tag(tag='bar'),
    Tag(tag='baz') ])
p2 = Photo(tags=[
    Tag(tag='foo'),
    Tag(tag='bar'),
    Tag(tag='baz') ])
session.flush()
session.clear()

Now if I delete one of the photos, I'll delete the tags associated with it, as well:

>>> for t in Tag.query():
...     print t.id, t.photo_id, t.tag
...
1 1 foo
2 1 bar
3 1 baz
4 2 foo
5 2 bar
6 2 baz
>>> session.delete(Photo.query.get(1))
>>> session.flush()
>>> for t in Tag.query():
...     print t.id, t.photo_id, t.tag
...
4 2 foo
5 2 bar
6 2 baz

At this point, everything is the same whether I specify "delete-orphan" or not. The difference is in what happens when I just remove an item from a photo's "tags" collection:

>>> p2 = Photo.query.get(2)
>>> del p2.tags[0]
>>> session.flush()
>>> for t in Tag.query():
...     print t.id, t.photo_id, t.tag
...
4 None foo
5 2 bar
6 2 baz

See how the "foo" tag is just hanging out there with no photo? That's what "delete-orphan" is designed to prevent. If we'd specified "delete-orphan", we'd have the following result:

>>> p2 = Photo.query.get(2)
>>> del p2.tags[0]
>>> session.flush()
>>> for t in Tag.query():
...     print t.id, t.photo_id, t.tag
...
5 2 bar
6 2 baz

So there you go. If you don't mind orphans, then use cascade="all" and leave off the "delete-orphan". If you'd rather have them disappear when disconnected from their parent, use cascade="all,delete-orphan".

PyAtl: SQLAlchemy Theme Night

Well, last night was the Python Atlanta user group meeting (PyAtl). It had been a while since I'd been, and I'd forgotten how fun it can be. The theme was SQLAlchemy, and the speaker lineup was me, Brandon Craig Rhodes, and James Fowler.

The meeting started off with "shooting the breeze" as usual, and then moved into my presentation "Essential SQLAlchemy", which gives a 30 minute overview of the basics of SQLAlchemy. Here are the usual links to slides and the video:



After my talk, Brandon Craig Rhodes (who is, by the way, an incredibly lively presenter, using nothing but emacs!) gave a talk "SQLAlchemy Advanced Mappings" that focused on using the ORM layer in SQLAlchemy. It really was more of a mini-tutorial that took you through basic mappings all the way through relations, backrefs, and more. SQLAlchemy is an amazingly rich library, and it's hard to squeeze a talk into half an hour. Here's the video:



After Brandon, James Fowler did a "now for something completely different" kind of talk on wxPython, "WxPython Quick Bite", focusing on how you can make wxPython (designed to be event-driven and single-threaded) play nicely in a multi-threaded environment. Unfortunately the start of the video was cut off as I feverishly tried to download the other two videos to make room for James's talk. I'll post the video as soon as it gets uploaded.

I'd be remiss if I didn't thank O'Reilly for "sponsoring" the meetup with a giveaway of a number of books (including 9 copies of Essential SQLAlchemy, which I stuck around afterwards to sign). We also had a couple of copies of Beautiful Code, Beginning Development with Python Gaming, and Hackerteen to give away. A great time was had by all!

Monday, January 07, 2008

Cascading DROP TABLE with SQLAlchemy

A little quirk that can get you if you're using SQLAlchemy to create and drop your database is that PostgreSQL doesn't allow you to drop a table that has other tables referring to it via FOREIGN KEY constraints. PostgreSQL does have a DROP TABLE ... CASCADE command that handles this by dropping the dependent objects (such as the foreign key constraints) along with the table, but SQLAlchemy provides no easy way to emit it. It's not too hard to make it available, though.

It turns out, however, that SQLAlchemy has a nice, pluggable database dialect system that is fairly simple to update. One part of this dialect system is a "SchemaDropper". So to cascade the DROP TABLE statements, I just created the following SchemaDropper (derived from the existing PostgreSQL PGSchemaDropper) and installed it as the default PostgreSQL dialect schemadropper. (Most of the code is copied from the base SchemaDropper class in sqlalchemy.sql.compiler)


from sqlalchemy.databases import postgres

class PGCascadeSchemaDropper(postgres.PGSchemaDropper):
    def visit_table(self, table):
        for column in table.columns:
            if column.default is not None:
                self.traverse_single(column.default)
        self.append("\nDROP TABLE " +
                    self.preparer.format_table(table) +
                    " CASCADE")
        self.execute()

postgres.dialect.schemadropper = PGCascadeSchemaDropper


And that's it!
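
To see it take effect, a plain metadata.drop_all() is all it takes once the module above has been imported. Here's a minimal sketch (my own, with a made-up connection string):

from sqlalchemy import create_engine, MetaData

engine = create_engine('postgres://me:secret@localhost/mydb')  # hypothetical DSN
metadata = MetaData(engine)
metadata.reflect()       # load the existing table definitions
metadata.drop_all()      # each DROP TABLE statement now ends with CASCADE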

Sunday, March 25, 2007

Dynamic Language Weenies?

I saw the linked article and couldn't help but get irritated. I know I shouldn't get this worked up about programming languages, but I did anyway. (Perhaps it was the unnecessarily abrasive tone of the author....) Rather than making you wade through the rather misinformed article, let me summarize and refute some of the author's main points here.


[Edit 3/31/07 @11:58pm - It looks like some of the points I complain about in this article have been changed in the original post. I haven't gone over the article line-by-line, but the article I am responding to was posted on 3/25/07, while the one currently posted is dated 3/26/07, so please take that into account when reading this (3/25/07) article. Some of the main points that were changed in the original post had to do with the use of the confusing terms "weak" and "strong" typing. There may be others.]

[Edit 4/5/07 @5:55pm - I should make it clear that when I refer to "static languages" in the article below, I do so in the same sense that the original author refers to "static languages" -- languages in the style of C, C++, Java, C#, Pascal, etc. I am aware of statically typed languages such as Haskell and ML which convey many of the productivity benefits of "dynamic" languages in a statically typed environment. ]

Brief aside #1: Non-weenie credentials

I have worked as a commercial hardware and software developer in Real Jobs now for about 15 years. I have used, in production scenarios, C, C++, C#, Dynamic C (embedded programming), VHDL, Verilog, SQL, Javascript, and Python. I have written embedded microcontroller programs, C compilers targeting FPGAs and exotic dataflow (MONARCH) architectures, multiprocessor simulators, enterprise workflow applications, and high-volume web sites. I have implemented digital logic for the StarCore SC140s DSP core, as well as designed various IP digital logic cores, including a Viterbi decoder and an SHA-1 hashing engine. I am not a weenie.

Somewhat less brief aside #2: Muddled thinking about typing

Next, and this gets me every time, the author confuses the (at least) three axes of typing. The first axis is the strong/weak axis, and generally puts languages such as C, perl and rexx on the weak end and most everything else on the strong end. The deciding question here is whether the language implicitly converts unrelated types without warning, allowing you to add the integer 0 to the string "10" and arrive at the result "010" (or is it 11? I forget.). Strongly typed languages will cry foul, where weakly typed will simply do what they think you mean and continue.

The second axis is static versus dynamic typing [Edit 4/3/07], also known as early versus late binding. This has entirely to do with how names (variables) are resolved in the language. In statically typed languages, a name (variable) is associated with a type, and that name (variable) can never reference a value of any other type. In dynamically typed languages, a name may be associated with values of different types over its lifetime. Languages such as C/C++, Java, Pascal, Haskell, OCAML, etc. fall into the static category (with some dynamic capabilities in C++ and Java through runtime dynamic type casting), while languages such as Ruby, Python, etc. fall into the dynamic category. Many languages have support for both, including Lisp and the aforementioned Java and C++.

The third axis is manifest versus implicit typing, and it is a fascinating axis. (Note that this axis is really only applicable to statically typed languages, so it might not really even be an axis in its own right, but I think it's worth looking at here.) Implicitly typed languages such as OCAML, although they are most definitely statically typed and compiled, actually perform quite advanced type inference on the code to determine what types you intended your variables to be, generally by analyzing which operations they participate in. Remarkably, an OCAML compiler is able to produce extremely optimized, statically and strongly typed code, even in the absence of explicit type declarations. RPython (part of the PyPy project) is an example of an implicitly typed subset of Python whose compiler is able to produce highly optimized, statically typed code.

The author of the "weenies" article conflates all three axes into strong versus weak, and puts C/C++ and Java on the "strong" side, with Ruby, Python, etc. on the "weak" side, while ignoring other languages such as Haskell, LISP, OCAML, etc. Which hopefully you can see is a gross oversimplification. If you're interested, my current language of choice, Python, is a strongly, dynamically typed language.

Aside #3: Ignorance of other strong advantages of dynamic languages

The author left out what I consider to be two of the most important features, productivity-wise, of my current chosen language, Python: built-in polymorphic containers and high-order functions. (These are also present in most dynamic languages, but I'll discuss them in the context of Python, because that's what I'm familiar with.)

Built-in polymorphic containers

Python has, built into the language, the following containers: lists, tuples, dictionaries ('dicts'), and strings. And when I say that they are built-in, I mean not merely that the language includes facilities to manipulate the structures, but that it also includes a convenient literal syntax to define them. A list of the integers from 1 to 10, for instance, is represented as [1,2,3,4,5,6,7,8,9,10]. Lists have methods that mimic the union of arrays and linked lists in other languages, and even support recursive definition (not that I've used it):

>>> lst = [1,2,3]
>>> lst.append(lst)
>>> lst
[1, 2, 3, [...]]
>>> lst[3]
[1, 2, 3, [...]]
>>> lst[3][3]
[1, 2, 3, [...]]

Tuples are similar to "immutable lists", and are often used to return multiple values from a function:

a,b,c = foo(d,e,f)

This also illustrates a corollary property of tuples and lists, the "destructuring assignment", that allows you to "unpack" structured data in a very succinct way.

Dictionaries are nice syntactic sugar for hashtables, and allow you to use any "hashable" object as a key to index any Python object: {1:'foo', 2:'bar', 'the third item':'baz', (1,2):'baff'} This also illustrates that all these containers are polymorphic (though in practice the types in a container are usually restricted). Strings need little explanation, except to say that they are built in, unlike C/C++ strings.

Why do all these types make things easier? Mainly because they're already included. To use a list in C++, you have to #include the right header and declare the type of things that go in the list. To use a vector in C++, you also have to #include the right header (a different header, if I remember correctly) and declare the type of things that go in the vector. And good luck if you want to use a hash table. For that, you not only have to #include the header and declare the types of key and object, but you also have to provide a hashing function. Wouldn't it be nice if the language took care of all that for you? (Yes, it is very nice.) If you're using C++, you're also stuck with no literal syntax for specifying anything but the simplest structures, and the enormous pitfall of confusing STL string<>s with char[]s. Dynamic languages (like Python) so significantly lower the bar on creating data structures that you'll often see lists of dicts or tuples where C++ or Java would be littered with utility classes, type declarations (and in pre-generics Java, typecasts). I mean, come on -- do I really need to declare a pair class if I want a list of points? And a separate RGB class if I want color values? Give me a break.
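
To make that concrete, here's the sort of throwaway structure I mean (my own quick illustration): points, colors, and tagged records with no helper classes or declarations in sight:

points = [ (0, 0), (2, 3), (5, 1) ]                     # list of point tuples
colors = { 'red': (255, 0, 0), 'green': (0, 255, 0) }   # dict of RGB tuples
photos = [ { 'id': 1, 'tags': ['foo', 'bar'] },
           { 'id': 2, 'tags': ['baz'] } ]               # list of dicts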

High-order functions

Simply put, this is the ability to treat a function as an object in your code, and the utility of this feature is difficult to overstate. C++ really tries to do this with templates, and Boost::Lambda gets 95% of the way there, but wouldn't it be nice if you didn't have to jump through so many hoops? Python includes features such as map (apply a function to every element in a container), reduce (apply a 2-argument function to every pair of elements in a container until it's reduced to one element), filter (find all elements in a container for which a predicate function returns true), and lambda (define an anonymous function inline). C++ has support for these, but you have to create a functor object (which Boost::Lambda makes mercifully simpler). Actual C++ functions are thus second-class citizens. If you are a C++ or Java language programmer, it may never have occurred to you to write a function that takes a function as a parameter and returns a function as its result. If you are a dynamic language programmer, you probably wrote three of these last week. It's higher-order thinking, and it's simply not supported as well in most static languages.
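
Here's a trivial example of what I mean (my own, not from the linked article): a function that takes functions and returns a new one, used alongside map, filter, and reduce:

def compose(f, g):
    # return a new function that applies g, then f
    return lambda x: f(g(x))

add_one = lambda x: x + 1
double = lambda x: x * 2

print map(compose(add_one, double), [1, 2, 3])   # prints [3, 5, 7]
print filter(lambda x: x % 2, [1, 2, 3, 4])      # prints [1, 3]
print reduce(lambda a, b: a + b, [1, 2, 3, 4])   # prints 10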

I should probably pause for a moment and make the point that the C++ templating system is a Turing-complete, dynamically typed, functional programming language that happens to be interpreted at compile time. (I have a compile-time factorial program that I can show you if you don't believe me.) Its syntax leaves much to be desired, but it's semantically much closer to the dynamic language camp than the language onto which it was bolted on, C++.

OK, on to the author's main points, which he presents in a claim/reality format:


Claim: Weak Typing, Interpretation and Reduced Code Volume Increase Development Speed

Reality: No they don't, either individually or together. ....

My reality check: Dynamic typing, interpretation, and reduced code volume do indeed increase development speed.

Dynamic Typing

Have you ever tried to write a really static C++ program? You know, where you actually declare all the methods that don't modify your class as "const" methods? I tried it. Once. Might have tried it again if I didn't have to work with other people. Dynamically typed languages do increase development speed, although their impact is somewhat mitigated in larger projects where enforcement of interfaces becomes more important. Where they really shine, however, is in their "genericity." C++ tried to do generics with templates, and it succeeded to some extent. I'm sure there are other examples in other languages. Dynamic languages give you what are essentially C++ templated functions for every function you write.
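
A trivial sketch of that genericity (again, my own example): one definition, no type declarations, and it works for anything that supports the + operator:

def total(items, start):
    # works for any element type that supports +
    result = start
    for item in items:
        result = result + item
    return result

print total([1, 2, 3], 0)           # prints 6
print total(['a', 'b', 'c'], '')    # prints abc
print total([[1], [2]], [])         # prints [1, 2]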

Interpretation

Interpretation helps, not so much because compile time is prohibitive in static projects, but because the REPL (read-eval-print loop) is so freaking easy. Want to try out something quickly? Paste it into your interactive shell. Static languages are beginning to understand this, with some IDEs providing a little interactive shell. But how long did it take to "invent" this feature (which was present in Lisp in the 1960s)? Interpretation also facilitates the exploration of new language features in a way that statically compiled languages have a really hard time keeping up with. Take it from someone who has written both interpreters and compilers: it is easier to add a feature to an interpreter than it is to a compiler. OCAML does some amazing things in their compiler. You're going to have a hard time convincing me they can extend the language easier than the PyPy team, however.

Reduced Code Volume

Reduced code volume certainly does reduce development time trivially -- less typing. More importantly, however, it allows you to fit larger concepts onto one screenful of code. The units at which you are programming are larger. Also important to note is the correlation of bug count with source lines of code, independent of language used. That means that, roughly, 1000 lines of assembly has the same bug count as 1000 lines of Lisp. Which one do you think accomplishes more? Reduced code volume is easier and faster to code, debug, and maintain. I can't understand how the author could even imagine this not to be true.


Claim: Support From Major Companies Legitimizes DLs

Reality: No it doesn't. Companies know that fan boys like you are easy marks - an enthusiastic and indiscriminate market segment ripe for exploitation. They also know that you might spread your naive enthusiasms into your workplaces, opening up a corporate market for supporting tools.

My reality check: OK, fine. Companies are driven by profit, so I can accept that corporate profit chasing has little to do with the quality of a language. But this cuts both ways. Java has been pushed by Sun, and C# by Microsoft. Neither would have anywhere near the market share they currently have without their corporate backers.

But let's leave aside corporations "supporting" the languages. Let's look at those who actually get things done. Yahoo! stores was originally written in Lisp. BitTorrent in Python. Google and NASA use Python extensively. The OLPC project is using Python as their core O/S language. 37signals uses (and invented) Ruby on Rails. Reddit is Python (was Lisp). YouTube runs on Python. And tell me, how many thin-client applications use Java applets (static language) versus Javascript (dynamic language)? And that's even with the hellish problem of browser inconsistency.

Claim: As the Problems Change, People Use New Languages

Reality: As languages change, people remain the same. Software development is now, and always has been, driven by an obsession with novelty, and that is what drives language adoption. If there is a new problem to solve, that will simply make for a convenient excuse. Your misplaced enthusiasm simply perpetuates a cycle of self-defeating behaviour that prevents software development maturing into a true profession.

My reality check: Yes, people remain the same. However, the resources we use do not. CPU cycles and memory are relatively cheap today. That's why no one (except some embedded developers) can get away with saying they need to program in assembly language. Runtime performance is objectively less constraining now than it was 10 years ago for the same problems. Which means that all the things we wish we could have done in 1997 are available now. Like dynamic, interpreted languages.

Language is a tool for expressing ideas. Some languages express different ideas more easily, or with greater difficulty, than others. Try saying ninety-nine in French, if you don't believe me (quatre-vingt-dix-neuf, literally four twenty ten nine). Programming languages are no different. Things have been learned about better ways to express yourself since C++, Java, and C# were invented. C++, Java, and C# also chose to ignore certain things that were well-known in programming language research when they were invented due to design decisions that were made in a technological context dissimilar to today.

And for one further reality check, no, language adoption is not driven by novelty. Java introduced nothing whatsoever that was new. It started with known features of C++, removed a bunch of stuff, added a garbage collector that had been understood since the days of the VAX, and threw an ungodly amount of marketing behind it. Java is no longer new, but it is still widespread. C is certainly not new, and its popularity remains astonishingly high. Language adoption is driven by a wide range of factors, including but by no means dominated by novelty.

Claim: You Can Assess Productivity By Feel

Reality: No you can't. You're just trying to justify personal preference by hiding it behind a legitimate but definitionally complex term. If you've never taken measurements, you have approximately no idea what your productivity is like either with or without your favorite dynamic language. You certainly can't assess the difference between the two.

My reality check: The article's author hasn't taken measurements, either. But Lutz Prechelt at least has some data, where the linked article presents none. In fact, without exception, all studies of which I am aware [Ed: 4/5/07] which have compared productivity in languages between compiled, manifestly, statically typed languages and interpreted, dynamically typed languages have the dynamic languages easily winning out.

But what the author ignores is that the "feel" of a language, while not providing objective evidence of its productivity, is an influencing factor in its productivity due to the increased motivation to work in a dynamic language. If I like how a language "feels", I will use it more. I will be a more motivated employee, producing more. If I do open source work, I will work more on it, producing more libraries and facilitating code reuse, which even the most jaded non-weenie must admit is a Good Thing.

Claim: Syntax Can Be Natural

Reality: All programming languages are arcane and cryptic, in different ways and to varying degrees. What is perceived as "natural" varies tremendously between individuals, depending upon their experience and background. Your mischaracterisation of a syntax as "natural" is just an attempt to retro-fit a philosophy to your personal preferences.

My reality check: No syntax is completely natural, but some have more in common with non-programming languages than others. For instance, Haskell invented a wonderful syntax for specifying lists: the "list comprehension":

[x + 2*x + x/2 | x <- [1,2,3,4]]


OK, that looks weird if you've never seen it before. But does it have an analogue outside of programming? How about

{ x + 2*x + x/2 | x in {1,2,3,4} }

That's just about pure mathematical notation for a set. Python takes a compromise approach and uses a more "verbal" form:

[ x + 2*x + x/2 for x in [1,2,3,4] ]

How do I say this in C++?

#include <list>

std::list<int> mklist() {
    std::list<int> l;
    for (int x = 1; x <= 4; x++)
        l.push_back(x + 2*x + x/2);
    return l;
}

Which feels more "natural" to you? I see in programming language design two big threads, of equal power but radically different approaches, which I will name by certain scientists who inspired the respective approaches. One is the "Church" thread, where languages express mathematics. The other is the "Turing" thread, where languages command machines to accomplish tasks. Roughly, this puts languages into "declarative" and "imperative" camps. Dynamic languages pull ideas from both camps. Static languages (at least C/C++, Java, and C#) are heavily imperative, and have little support for declarative concepts. Sure, neither is particularly "natural," but dynamic languages have more expressive capabilities and can approach a "natural" syntax more easily than manifestly static languages.

Claim: A Strength Of My Language Is Its Community

Reality: If it it[sic], then you are in deep trouble, for your community appears to be dominated by juveniles who only take time out from self-gratification long enough to wipe the byproducts off their keyboard, then mindlessly flame anyone who does not share their adolescent enthusiasms.

My Reality Check: The community is a strength, no doubt about it. But communities are made up of all types. For every blithering idiot, there may be five or ten solid programmers pounding out production-quality code. I had to make a choice tonight -- write this article or work on my application. Maybe I made the wrong choice. Many of the juveniles of which you write don't have that choice, being without the requisite skills to create an application. Of course, that raises the question of what you were doing writing the article....

Claim: No Harm, No Foul

Reality: No Brain, No Pain.

And the dynamic languages people are juvenile?....