
Friday, November 12, 2010

Python: Enumerating IP Addresses on FreeBSD

As promised in my earlier post on enumerating local interfaces and their IP addresses on MacOS X, this time I'll cover how to do the same on FreeBSD and other operating systems that implement the getifaddrs API. Basically, this is just a python wrapper around the getifaddrs interface using ctypes.

The code is a bit longer than I typically like to include in a blog post, but here it goes:
"""
Wrapper for getifaddrs(3).
"""

import socket
import sys

from collections import namedtuple
from ctypes import *

class sockaddr_in(Structure):
_fields_ = [
('sin_len', c_uint8),
('sin_family', c_uint8),
('sin_port', c_uint16),
('sin_addr', c_uint8 * 4),
('sin_zero', c_uint8 * 8)
]

def __str__(self):
assert self.sin_len >= sizeof(sockaddr_in)
data = ''.join(map(chr, self.sin_addr))
return socket.inet_ntop(socket.AF_INET, data)

class sockaddr_in6(Structure):
_fields_ = [
('sin6_len', c_uint8),
('sin6_family', c_uint8),
('sin6_port', c_uint16),
('sin6_flowinfo', c_uint32),
('sin6_addr', c_uint8 * 16),
('sin6_scope_id', c_uint32)
]

def __str__(self):
assert self.sin6_len >= sizeof(sockaddr_in6)
data = ''.join(map(chr, self.sin6_addr))
return socket.inet_ntop(socket.AF_INET6, data)

class sockaddr_dl(Structure):
_fields_ = [
('sdl_len', c_uint8),
('sdl_family', c_uint8),
('sdl_index', c_short),
('sdl_type', c_uint8),
('sdl_nlen', c_uint8),
('sdl_alen', c_uint8),
('sdl_slen', c_uint8),
('sdl_data', c_uint8 * 12)
]

def __str__(self):
assert self.sdl_len >= sizeof(sockaddr_dl)
addrdata = self.sdl_data[self.sdl_nlen:self.sdl_nlen+self.sdl_alen]
return ':'.join('%02x' % x for x in addrdata)

class sockaddr_storage(Structure):
_fields_ = [
('sa_len', c_uint8),
('sa_family', c_uint8),
('sa_data', c_uint8 * 254)
]

class sockaddr(Union):
_anonymous_ = ('sa_storage', )
_fields_ = [
('sa_storage', sockaddr_storage),
('sa_sin', sockaddr_in),
('sa_sin6', sockaddr_in6),
('sa_sdl', sockaddr_dl),
]

def family(self):
return self.sa_storage.sa_family

def __str__(self):
family = self.family()
if family == socket.AF_INET:
return str(self.sa_sin)
elif family == socket.AF_INET6:
return str(self.sa_sin6)
elif family == 18: # AF_LINK
return str(self.sa_sdl)
else:
print family
raise NotImplementedError, "address family %d not supported" % family


class ifaddrs(Structure):
pass
ifaddrs._fields_ = [
('ifa_next', POINTER(ifaddrs)),
('ifa_name', c_char_p),
('ifa_flags', c_uint),
('ifa_addr', POINTER(sockaddr)),
('ifa_netmask', POINTER(sockaddr)),
('ifa_dstaddr', POINTER(sockaddr)),
('ifa_data', c_void_p)
]

# Define constants for the most useful interface flags (from if.h).
IFF_UP = 0x0001
IFF_BROADCAST = 0x0002
IFF_LOOPBACK = 0x0008
IFF_POINTTOPOINT = 0x0010
IFF_RUNNING = 0x0040
if sys.platform == 'darwin' or 'bsd' in sys.platform:
IFF_MULTICAST = 0x8000
elif sys.platform == 'linux':
IFF_MULTICAST = 0x1000

# Load library implementing getifaddrs and freeifaddrs.
if sys.platform == 'darwin':
libc = cdll.LoadLibrary('libc.dylib')
else:
libc = cdll.LoadLibrary('libc.so')

# Tell ctypes the argument and return types for the getifaddrs and
# freeifaddrs functions so it can do marshalling for us.
libc.getifaddrs.argtypes = [POINTER(POINTER(ifaddrs))]
libc.getifaddrs.restype = c_int
libc.freeifaddrs.argtypes = [POINTER(ifaddrs)]


def getifaddrs():
"""
Get local interface addresses.

Returns generator of tuples consisting of interface name, interface flags,
address family (e.g. socket.AF_INET, socket.AF_INET6), address, and netmask.
The tuple members can also be accessed via the names 'name', 'flags',
'family', 'address', and 'netmask', respectively.
"""
# Get address information for each interface.
addrlist = POINTER(ifaddrs)()
if libc.getifaddrs(pointer(addrlist)) < 0:
raise OSError

X = namedtuple('ifaddrs', 'name flags family address netmask')

# Iterate through the address information.
ifaddr = addrlist
while ifaddr and ifaddr.contents:
# The following is a hack to workaround a bug in FreeBSD
# (PR kern/152036) and MacOSX wherein the netmask's sockaddr may be
# truncated. Specifically, AF_INET netmasks may have their sin_addr
# member truncated to the minimum number of bytes necessary to
# represent the netmask. For example, a sockaddr_in with the netmask
# 255.255.254.0 may be truncated to 7 bytes (rather than the normal
# 16) such that the sin_addr field only contains 0xff, 0xff, 0xfe.
# All bytes beyond sa_len bytes are assumed to be zero. Here we work
# around this truncation by copying the netmask's sockaddr into a
# zero-filled buffer.
if ifaddr.contents.ifa_netmask:
netmask = sockaddr()
memmove(byref(netmask), ifaddr.contents.ifa_netmask,
ifaddr.contents.ifa_netmask.contents.sa_len)
if netmask.sa_family == socket.AF_INET and netmask.sa_len < sizeof(sockaddr_in):
netmask.sa_len = sizeof(sockaddr_in)
else:
netmask = None

try:
yield X(ifaddr.contents.ifa_name,
ifaddr.contents.ifa_flags,
ifaddr.contents.ifa_addr.contents.family(),
str(ifaddr.contents.ifa_addr.contents),
str(netmask) if netmask else None)
except NotImplementedError:
# Unsupported address family.
yield X(ifaddr.contents.ifa_name,
ifaddr.contents.ifa_flags,
None,
None,
None)
ifaddr = ifaddr.contents.ifa_next

# When we are done with the address list, ask libc to free whatever memory
# it allocated for the list.
libc.freeifaddrs(addrlist)

__all__ = ['getifaddrs'] + [n for n in dir() if n.startswith('IFF_')]
As always, this code is released under a BSD-style license.
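Assuming the module is saved as getifaddrs.py, a quick interactive session shows how it is meant to be used (the interface names, families, and addresses below are, of course, just illustrative):

>>> from getifaddrs import *
>>> for ifa in getifaddrs():
...     if ifa.flags & IFF_UP:
...         print ifa.name, ifa.family, ifa.address, ifa.netmask
...
lo0 2 127.0.0.1 255.0.0.0
em0 2 192.168.1.5 255.255.254.0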

Wednesday, October 20, 2010

Python: Enumerating IP Addresses on MacOS X

How do you enumerate the host's local IP addresses from python? This turns out to be a surprisingly common question. Unfortunately, there is no pretty answer; it depends on the host operating system. On Windows, you can wrap the IP Helper GetIpAddrTable using ctypes. On modern Linux, *BSD, or MacOS X systems, you can wrap getifaddrs(). Neither is trivial, though, so I'll save those for a future post.

Luckily, MacOS X provides a simpler way to get the local IP addresses: the system configuration dynamic store. Using pyObjC, which comes pre-installed on every Mac, we can write a straight port of Apple's example in Technical Note TN1145 for retrieving a list of all IPv4 addresses assigned to local interfaces:

from SystemConfiguration import *  # from pyObjC
from collections import namedtuple
import socket

def GetIPv4Addresses():
    """
    Get all IPv4 addresses assigned to local interfaces.
    Returns a generator object that produces information
    about each IPv4 address present at the time that the
    function was called.

    For each IPv4 address, the returned generator yields
    a tuple consisting of the interface name, address
    family (always socket.AF_INET), the IP address, and
    the netmask. The tuple elements may also be accessed
    by the names: "ifname", "family", "address", and
    "netmask".
    """
    ds = SCDynamicStoreCreate(None, 'GetIPv4Addresses', None, None)
    # Get all keys matching pattern State:/Network/Service/[^/]+/IPv4
    pattern = SCDynamicStoreKeyCreateNetworkServiceEntity(None,
        kSCDynamicStoreDomainState, kSCCompAnyRegex, kSCEntNetIPv4)
    patterns = CFArrayCreate(None, (pattern, ), 1, kCFTypeArrayCallBacks)
    valueDict = SCDynamicStoreCopyMultiple(ds, None, patterns)

    ipv4info = namedtuple('ipv4info', 'ifname family address netmask')

    for serviceDict in valueDict.values():
        ifname = serviceDict[u'InterfaceName']
        for address, netmask in zip(serviceDict[u'Addresses'],
                                    serviceDict[u'SubnetMasks']):
            yield ipv4info(ifname, socket.AF_INET, address, netmask)
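Calling it is straightforward; the interface name and addresses shown here are just illustrative:

>>> for info in GetIPv4Addresses():
...     print info.ifname, info.address, info.netmask
...
en0 192.168.1.10 255.255.255.0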

One interesting point regarding this code is that it doesn't actually inspect interface information in the system configuration dynamic store. The interface-related keys are stored under State:/Network/Interface/, but this code (and Apple's example on which it is based) inspects keys under State:/Network/Service/ instead. However, if you want to get IPv6 addresses then you do have to inspect the system configuration's interface information:

from SystemConfiguration import *  # from pyObjC
from collections import namedtuple
import socket
import re

ifnameExtractor = re.compile(r'/Interface/([^/]+)/')

def GetIPv6Addresses():
    """
    Get all IPv6 addresses assigned to local interfaces.
    Returns a generator object that produces information
    about each IPv6 address present at the time that the
    function was called.

    For each IPv6 address, the returned generator yields
    a tuple consisting of the interface name, address
    family (always socket.AF_INET6), the IP address, and
    the prefix length. The tuple elements may also be
    accessed by the names: "ifname", "family", "address",
    and "prefixlen".
    """
    ds = SCDynamicStoreCreate(None, 'GetIPv6Addresses', None, None)
    # Get all keys matching pattern State:/Network/Interface/[^/]+/IPv6
    pattern = SCDynamicStoreKeyCreateNetworkInterfaceEntity(None,
        kSCDynamicStoreDomainState, kSCCompAnyRegex, kSCEntNetIPv6)
    patterns = CFArrayCreate(None, (pattern, ), 1, kCFTypeArrayCallBacks)
    valueDict = SCDynamicStoreCopyMultiple(ds, None, patterns)

    ipv6info = namedtuple('ipv6info', 'ifname family address prefixlen')

    for key, ifDict in valueDict.items():
        ifname = ifnameExtractor.search(key).group(1)
        for address, prefixlen in zip(ifDict[u'Addresses'],
                                      ifDict[u'PrefixLength']):
            yield ipv6info(ifname, socket.AF_INET6, address, prefixlen)

In fact, you could easily adapt the above function to be able to fetch IPv4 addresses from the interface configuration.
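For example, here is a minimal sketch of that adaptation. It reuses the imports and the ifnameExtractor regex from above, and assumes the interface IPv4 dictionaries expose u'Addresses' and u'SubnetMasks' keys analogous to the service entries (I have not verified every key name, so treat this as a starting point rather than tested code):

def GetIPv4AddressesFromInterfaces():
    """Variation on GetIPv6Addresses() that fetches IPv4
    information from the interface (rather than service)
    configuration.
    """
    ds = SCDynamicStoreCreate(None, 'GetIPv4Addresses', None, None)
    # Get all keys matching pattern State:/Network/Interface/[^/]+/IPv4
    pattern = SCDynamicStoreKeyCreateNetworkInterfaceEntity(None,
        kSCDynamicStoreDomainState, kSCCompAnyRegex, kSCEntNetIPv4)
    patterns = CFArrayCreate(None, (pattern, ), 1, kCFTypeArrayCallBacks)
    valueDict = SCDynamicStoreCopyMultiple(ds, None, patterns)

    ipv4info = namedtuple('ipv4info', 'ifname family address netmask')

    for key, ifDict in valueDict.items():
        ifname = ifnameExtractor.search(key).group(1)
        for address, netmask in zip(ifDict[u'Addresses'],
                                    ifDict.get(u'SubnetMasks', [])):
            yield ipv4info(ifname, socket.AF_INET, address, netmask)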

Monday, September 20, 2010

Using pyapns with django

There is a handy daemon for sending push notifications to iOS-based mobile clients via Apple's Push Notification Service; it is called pyapns. It is implemented in python but, since it runs as a standalone XML-RPC server process, that fact is largely irrelevant. The important facts are that:
  • It properly and fully implements the client interface to APNs, including the requirement for maintaining a persistent connection with Apple's servers rather than repeatedly setting-up and tearing-down SSL connections.

  • It includes client libraries for communicating with the pyapns daemon from python and ruby, although any language that can speak XML-RPC (including C) will work too.
The way it works is: you first start the pyapns daemon process. This process acts as a XML-RPC server handling requests from your application(s), packing them into Apple's binary APNS protocol, and sending them to Apple to deliver to the iPhone, iPad, or iPod.

In order for your applications to send a push notification request, though, they must first tell pyapns which client certificate it should use to authenticate with the APNS servers. Here is a decent guide on how to obtain a client certificate. Once you have a certificate, you have to use the pyapns client library's configure and provision APIs to tell the pyapns daemon process to use your certificate.

If you are implementing your application in django, you can accomplish the configuration and provisioning directly from your django settings.py file like so:
# Configuration for connecting to the local pyapns daemon,
# including our certificate for pushing notifications to
# mobile terminals via APNS.
PYAPNS_CONFIG = {
    'HOST': 'http://localhost:7077/',
    'TIMEOUT': 15,
    'INITIAL': [
        ('MyAppName', 'path/to/cert/apns_sandbox.pem', 'sandbox'),
    ]
}

The pyapns python client library will automatically configure and provision itself from these settings. So, assuming you know the APNS device token of the mobile device you want to send a notification to, all you need to do to send a push notification is to call the pyapns.client.notify() function.
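For example, something along these lines (the device token is obviously fake, and the app name must match the one provisioned in PYAPNS_CONFIG above):

import pyapns.client

device_token = 'b5bb9d80...'  # hypothetical APNS device token
pyapns.client.notify('MyAppName', device_token,
                     {'aps': {'alert': 'Hello from django!'}})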

If only it were so easy. One complication arises in that the pyapns provisioning and configuration state is split between the client library and the pyapns daemon process. As a result, there are two scenarios to be wary of:

  1. The django application is restarted. In this case, the client library, which is part of your django application, loses its state and tries to re-configure and re-provision itself from your django settings. Luckily, the client library will re-read the configuration and provisioning settings from settings.py and seamlessly resume communication with the pyapns daemon.

    However, as noted in the pyapns documentation, "attempts to provision the same application id multiple times are ignored." As a result, if you change pyapns configuration in the settings.py file and restart django, you need to restart the pyapns daemon too for the new settings to take effect. Otherwise, if the settings are unchanged, the client library will seamlessly resume communication with the pyapns daemon.

  2. The pyapns daemon is restarted. In this case, the client library thinks it has already configured and provisioned the daemon, but the daemon has lost this configuration due to restart. As a result, any attempt to send a push notification will fail as the daemon does not know how to establish the connection with Apple's Push Notification service.

As I mentioned above, the first scenario isn't a big deal. If you have to restart your web application or the web server for some reason, the connection between the pyapns client library and the daemon process will automatically resume right where it left off. In the rare case that you changed the pyapns settings in your django settings.py file, you need to restart both the django application and the pyapns daemon process for the new settings to take effect.

The latter scenario, though, is a bigger problem because it is impossible to detect until it is too late: that is, it doesn't manifest itself until you try to send a push notification and fail. Luckily, however, we can catch the failure condition and resolve the problem automatically. Specifically, if the pyapns client library fails to send a push notification to the daemon process due to the daemon process not being configured or provisioned, we can force the client library to re-configure and re-provision and retry.

So here you go, a wrapper around the pyapns client library to automatically recover when the backend pyapns daemon has been restarted:
"""
Wrappers for the pyapns client to simplify sending APNS
notifications, including support for re-configuring the
pyapns daemon after a restart.
"""

import pyapns.client
import time
import logging

log = logging.getLogger('APNS')

def notify(apns_token, message, badge=None, sound=None):
"""Push notification to device with the given message

@param apns_token - The device's APNS-issued unique token
@param message - The message to display in the
notification window
"""
notification = {'aps': {'alert': message}}
if badge is not None:
notification['aps']['badge'] = int(badge)
if sound is not None:
notification['aps']['sound'] = str(sound)
for attempt in range(4):
try:
pyapns.client.notify('MyAppId', apns_token,
notification)
break
except (pyapns.client.UnknownAppID,
pyapns.client.APNSNotConfigured):
# This can happen if the pyapns server has been
# restarted since django started running. In
# that case, we need to clear the client's
# configured flag so we can reconfigure it from
# our settings.py PYAPNS_CONFIG settings.
if attempt == 3:
log.exception()
pyapns.client.OPTIONS['CONFIGURED'] = False
pyapns.client.configure({})
time.sleep(0.5)
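With that wrapper in place, sending a notification from anywhere in your django code is a one-liner (the import path here is hypothetical; use wherever you saved the wrapper):

from myproject.apns import notify

notify(device_token, 'You have new mail!', badge=1, sound='default')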

Since I glossed over it in this post, I'll cover how to get the APNS device token for a mobile device in my next post. The device token acts as an address, telling Apple's Push Notification service which mobile device it should deliver your notification message to.

Thursday, April 22, 2010

Invent with Python

Al Sweigart gave a presentation of his book, Invent Your Own Computer Games with Python, to Baypiggies tonight. His book is aimed at kids that are interested in learning how to program. He said he didn't have a particular age range in mind, but I would say from experience it would probably be fine for anyone age 8 to 15 with an interest in computers.

I was impressed with the overall tone and layout of his book. And his choice of teaching programming via writing simple computer games is right on the money. He mentioned that what got him hooked on programming was tinkering around creating simple games when he was a kid; I'd venture that is what got a great many of the best engineers I've met started. In addition, he chose to teach programming concepts via Python, which I also agree is a great language for learning because it is expressive, easy to understand, and yet powerful enough to build professional applications.

In his presentation, Al accurately pointed out that there is a trend to try and simplify programming for kids until it resembles building with Duplo blocks, which really isn't helpful for kids, nor interesting to them. I concur enthusiastically. Projects like Scratch are neat, but seem patronizing to me. I learned on BASIC and Pascal and I don't doubt kids today are just as capable. That said, BASIC and Pascal are dated now; python is just as easy to learn yet more powerful and modern, so it is a great choice.

Another difference between Al's book and many intro books is that his programs are short, fun, and mostly text. Of course, every kid dreams of writing graphical games like the video games they play but that is, frankly, not realistic. Again, Al doesn't lie to his audience; he presents fun text-based games that kids can tinker with. He starts with a simple guessing game, then hangman, tic-tac-toe, and othello. Towards the end of the book, he does introduce pygame and shows how to use it to make simple graphical games, but the vast majority of the book focuses on teaching fundamental concepts via text-based games.

I actually taught introductory computer programming to high school students, age 14 and 15, a number of years ago. Back then we used C, but I was surprised to find that Al presents software concepts in the same order I did and even uses the same games to drive those concepts home. If I could teach the same class today in python, Al Sweigart's Invent Your Own Computer Games with Python is the book I'd want to use.

Overall I was quite impressed with the great job Al did with the book and appreciate him taking the time to talk about it, and the process of writing it, to us at Baypiggies tonight.

P.S. I should mention that he has published the book under the Creative Commons license so it is free to read; you can even download the latest edition off his website. Amazon sells it in dead-tree format too, but you should hold off because the second edition will be going to print soon.

Monday, December 14, 2009

Debugging Python Windows Services

Mark Hammond has a simple example of writing Windows Services in Python using the win32serviceutil.ServiceFramework class in his book Python Programming on Win32.

The problem is that if you set a breakpoint in your service code and try to debug it, you'll find that it never stops at your breakpoint. The reason is that the win32serviceutil module will actually run PythonService.exe and have that program run your python script. Since PythonService.exe is its own process and has no knowledge of your IDE's breakpoints, it just runs without breaking.

Here is a simple trick for the small intersection of people who:
  1. Write Windows Services
  2. in Python
  3. using an IDE (I use pyDev)
  4. and need to step through their code to debug.

It turns out that win32serviceutil has a reimplementation of the logic in PythonService.exe that it uses to emulate PythonService.exe's debug mode when you convert your python script into an executable (e.g. via py2exe). The implementation is in win32serviceutil's DebugService() function. The trick to being able to set breakpoints and debug your python service is to convince win32serviceutil to call DebugService() rather than spawning PythonService.exe.

Luckily, this is trivially easy: just add the line
sys.frozen = 'windows_exe' # Fake py2exe so we can debug
before you call
win32serviceutil.HandleCommandLine(...).

Then, you just need to pass the '-debug' command-line argument when you run your service to force it into debugging mode. Your debugger should then control the process so you can debug it.
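Putting it all together, a minimal (hypothetical) service skeleton that supports this debugging trick might look like the following; note that a real service would also need to signal a stop event in SvcStop:

import sys
import win32service
import win32serviceutil

class MyService(win32serviceutil.ServiceFramework):
    _svc_name_ = 'MyService'
    _svc_display_name_ = 'My Example Service'

    def SvcDoRun(self):
        # Service logic goes here; with the sys.frozen trick
        # below, breakpoints set in here will work when the
        # service is run with the '-debug' argument.
        pass

    def SvcStop(self):
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)

if __name__ == '__main__':
    sys.frozen = 'windows_exe'  # Fake py2exe so we can debug
    win32serviceutil.HandleCommandLine(MyService)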

Tuesday, November 18, 2008

RFC868 UDP Time Protocol Client

Here is an RFC 868 UDP Time Protocol Client implementation in less than 30 lines of python.
I know...just what you always wanted. It is a long story, but I needed a time protocol client (that would run on Windows) for testing a service we're developing at work. There are tons of implementations of the TCP version of the protocol, but I couldn't find a UDP implementation for Windows to save my life. Python to the rescue.
from socket import *
from struct import unpack
from time import ctime, sleep
from sys import argv

argv = argv[1:]
if len(argv) == 0:
    argv = [ 'time-nw.nist.gov' ]

s = socket(AF_INET, SOCK_DGRAM)
s.settimeout(5.0)

for server in argv:
    print server, ":",
    try:
        s.sendto('', 0, (server, 37))
        t = long(unpack('!L', s.recv(16)[:4])[0])
        # Convert from 1900/01/01 epoch to 1970/01/01 epoch
        t -= 2208988800
        print ctime(t)
    except timeout:
        print "TIMEOUT"
    except:
        print "ERROR"

s.close()
sleep(2)
In case you are curious, the sleep(2) at the end is there so our testers can simply click the icon on their desktop and have time to actually see the results before Windows closes the console window.

By default, it queries the time-nw.nist.gov server, but you can specify any number of servers to query on the command-line.

Thursday, December 20, 2007

Less code

I was just reading Steve Yegge's rant against code size and realized that he managed to put into words exactly the feelings that have been drawing me to python in recent years. In particular, I managed to mostly skip the Java step in my journey from Pascal, through assembler, up to C, and then the leap to high-level languages including perl and, more recently, python. I don't really know why, but Java never felt "right" -- for anything. To this day, I can't think of many applications for which I would say Java was the best tool for the job. As for why, I think Steve hit the nail on the head when he writes:
Java is like a variant of the game of Tetris in which none of the pieces can fill gaps created by the other pieces, so all you can do is pile them up endlessly.

Hallelujah, brother.

Anyway, I strongly agree with Steve's general points about the merits of small code bases, but I won't go so far as to say that smaller is necessarily always better. Python hits a sweet spot for me (at least for now) between compactness and comprehensiveness. Certainly a good number of problems could be expressed more succinctly in a functional language such as Erlang or Haskell, but you lose readability. In fact, as elegantly as many problems can be expressed in a functional language, they quickly start to look like line noise when the problems exceed textbook examples.

Programming language preferences aside, what I agree with most from Steve's blog post was not so much that more succinct languages are better, but that less code is better. His post is written so as to suggest that Java itself is a problem -- which may certainly be true -- but he doesn't clarify whether he thinks it is Java the language, or Java the set of libraries.

Python, for example, combines a great set of standard libraries with a language syntax that makes it easy to use those libraries. All the lines of code hidden away in libraries are effectively "free" code. You don't have to manage their complexity. Give me a language that makes leveraging as many libraries as possible painless, then I can glue them together to make great programs with low apparent complexity. In reality, the lines of code might be astronomical, but I don't have to manage the complexity of all of it -- just the part I wrote -- so it doesn't matter.
Python does a great job here, whereas Java (and C++'s STL) largely get it wrong.

In particular, I would argue that, in addition to python's straightforward syntax, the fact that so many of python's libraries are written in C is a large factor in why they are so easy to use. There may be a huge amount of complexity, and a huge number of lines of code, in the C implementation of a library. However, the API boundary between python and C acts as a sort of line of demarcation -- no complexity inherent in the implementation of the library can leak out into the python API without the programmer explicitly allowing it. That is, the complexity of libraries written in C and callable from python is necessarily encapsulated.

As a personal anecdote, in one project I work on, we use ctypes to make foreign function calls to a number of Windows APIs. One thing that really bothers me about this technique is that I find myself re-implementing in ctypes a number of data structures that are already defined in C header files. If I make a mistake, then I introduce a bug. Ironically, had I just used C to call the APIs, I could have leveraged more existing code, often yielding fewer lines of code and less complexity. Of course, other parts of the program would become hugely unwieldy, but the point of this anecdote is that libraries (more specifically, being able to leverage known-good code) can be much more effective in reducing code than the implementation language.

So long as the implementation language isn't Java. Java just sucks. :)

Monday, December 3, 2007

Python: asserting code runs in specific thread

My buddy JJ's post to the BayPiggies mailing list reminded me of a little snippet I wrote a while back that others might find useful as well. Personally, I avoid threads like the plague, but if you are forced to use them it is generally handy to keep accurate tabs on how you use them. In particular, as JJ suggested in his post, it is a good idea to assert that code is called in the thread context you expect it to be called in. This can go a long way toward avoiding one of many classes of hard-to-find logic bugs multi-threading invites. Anyway, on to the code...
import threading

def assertThread(*threads):
    """Decorator that asserts the wrapped function is only
    run in a given thread
    """
    # If no thread list was supplied, assume the wrapped
    # function should be run in the current thread.
    if not threads:
        threads = (threading.currentThread(), )

    def decorator(func):
        def wrapper(*args, **kw):
            currentThread = threading.currentThread()
            name = currentThread.getName()
            assert currentThread in threads or name in threads, \
                "%s erroneously called in %s thread context" % \
                (func.__name__, name)
            return func(*args, **kw)

        if __debug__:
            return wrapper
        else:
            return func
    return decorator

You can restrict execution to one or more threads, each specified by either the thread object or thread name.
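For example, to catch a UI helper being called from a worker thread (python's main thread is named 'MainThread' by default):

@assertThread('MainThread')
def updateProgressBar(value):
    # AssertionError if called from any thread but the main one.
    ...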

Note the trick at the end to make the decorator effectively a no-op in production. Using this decorator around your functions and methods helps you spot logic errors during development without impacting the performance of your production code. Of course, if you are of the school that assertions should never be disabled, feel free to replace the final if __debug__:/else block with an unconditional return of wrapper.

Tuesday, October 9, 2007

Python: A better wx.ListCtrl

I'm not going to repeat the entire post here, but I would like to direct your attention to my friend and former coworker Zach's recent blog post regarding implementing a better ListCtrl via wxWidgets' virtual ListCtrl.

For those familiar with wxPython, what Zach has done is combine wx.ListCtrl and wx.lib.mixins.listctrl.ColumnSorterMixin into a single easy-to-use class. Except, rather than implement it as an entirely new class, he implements a function that transmutes a generic ListCtrl into his new & improved ListCtrl. The advantage here, as he points out, is that you don't need to modify any XRC files to gain the new functionality.

The post is on NTT MCL's recently-introduced company blog which, unfortunately, doesn't appear to accept comments (and says "Japan Window" for some strange reason). As such, I'll point out that Zach also occasionally posts to his own personal blog, which is also worth checking out.

Sunday, September 2, 2007

Python: Reconstructing datetimes from strings

Previously, I posted a small snippet for converting the str() representation of a python timedelta object back to its equivalent object. This time I'm going to address doing the same for datetime objects:

import re
from datetime import datetime, timedelta
# FixedOffset is the tzinfo helper described in the datetime
# documentation (see below).

def parseDateTime(s):
    """Create datetime object representing date/time
    expressed in a string

    Takes a string in the format produced by calling str()
    on a python datetime object and returns a datetime
    instance that would produce that string.

    Acceptable formats are: "YYYY-MM-DD HH:MM:SS.ssssss+HH:MM",
                            "YYYY-MM-DD HH:MM:SS.ssssss",
                            "YYYY-MM-DD HH:MM:SS+HH:MM",
                            "YYYY-MM-DD HH:MM:SS"
    Where ssssss represents fractional seconds. The timezone
    is optional and may be either positive or negative
    hours/minutes east of UTC.
    """
    if s is None:
        return None
    # Split string in the form 2007-06-18 19:39:25.3300-07:00
    # into its constituent date/time, microseconds, and
    # timezone fields where microseconds and timezone are
    # optional.
    m = re.match(r'(.*?)(?:\.(\d+))?(([-+]\d{1,2}):(\d{2}))?$',
                 str(s))
    datestr, fractional, tzname, tzhour, tzmin = m.groups()

    # Create tzinfo object representing the timezone
    # expressed in the input string. The names we give
    # for the timezones are lame: they are just the offset
    # from UTC (as it appeared in the input string). We
    # handle UTC specially since it is a very common case
    # and we know its name.
    if tzname is None:
        tz = None
    else:
        tzhour, tzmin = int(tzhour), int(tzmin)
        if tzhour == tzmin == 0:
            tzname = 'UTC'
        tz = FixedOffset(timedelta(hours=tzhour,
                                   minutes=tzmin), tzname)

    # Convert the date/time field into a python datetime
    # object.
    x = datetime.strptime(datestr, "%Y-%m-%d %H:%M:%S")

    # Convert the fractional second portion into a count
    # of microseconds.
    if fractional is None:
        fractional = '0'
    fracpower = 6 - len(fractional)
    fractional = float(fractional) * (10 ** fracpower)

    # Return updated datetime object with microseconds and
    # timezone information.
    return x.replace(microsecond=int(fractional), tzinfo=tz)

Last time Lawrence Oluyede was kind enough to point out that the dateutil module can likely do this and a lot more. However, I'm trying to stick to modules in the base library. Of course, it wouldn't be a bad thing if dateutil were to make it into the base library....

Speaking of which, the snippet above relies on the FixedOffset tzinfo object described in the documentation for the datetime module's tzinfo class. Being that the documentation is part of the standard python distribution, I guess you could call that code part of the base library, even if you can't import it. :|
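For reference, here is a minimal FixedOffset along the lines of the one in that documentation, adapted to take a timedelta offset (since that is what parseDateTime() passes) rather than a count of minutes:

from datetime import tzinfo, timedelta

class FixedOffset(tzinfo):
    """Fixed offset east of UTC."""
    def __init__(self, offset, name):
        self.__offset = offset  # a timedelta
        self.__name = name

    def utcoffset(self, dt):
        return self.__offset

    def tzname(self, dt):
        return self.__name

    def dst(self, dt):
        return timedelta(0)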

Update 2007/09/03 1:39pm (JST):
Fix example format examples in doc-string per comment from David Goodger.

Wednesday, August 15, 2007

Python: Reconstructing timedeltas from strings

For some reason, the date/time objects implemented by the datetime module have no methods to construct them from their own string representations (as returned by calling str() or unicode() on the objects). It so happens that reconstructing a timedelta from its string representation can be implemented using a relatively simple regular expression:
import re
from datetime import timedelta

def parseTimeDelta(s):
    """Create timedelta object representing time delta
    expressed in a string

    Takes a string in the format produced by calling str() on
    a python timedelta object and returns a timedelta instance
    that would produce that string.

    Acceptable formats are: "X days, HH:MM:SS" or "HH:MM:SS".
    """
    if s is None:
        return None
    d = re.match(
            r'((?P<days>\d+) days, )?(?P<hours>\d+):'
            r'(?P<minutes>\d+):(?P<seconds>\d+)',
            str(s)).groupdict(0)
    return timedelta(**dict(( (key, int(value))
                              for key, value in d.items() )))
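A round-trip through str() demonstrates the intent:

>>> from datetime import timedelta
>>> td = timedelta(days=2, hours=3, minutes=4, seconds=5)
>>> str(td)
'2 days, 3:04:05'
>>> parseTimeDelta(str(td)) == td
True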

But the other types are not quite so easy. Next time, I'll post my implementation for reconstructing datetime objects from their string representation.

Thursday, August 2, 2007

Python: Typed attributes using descriptors

A few weeks ago I saw a post by David Stanek on Planet Python regarding using python descriptors to implement typed attributes. At the time, I needed something very similar, and I'd been trying to find something descriptors were good for (besides re-implementing the property builtin), so I decided to give his trick a try. The only problem was, I was coding on CalTrain at the time so I couldn't access his blog to reference the code in his post. The worst part was that I struggled with problems that, as it turns out, were pointed out in the comments to his posting (specifically, attributes implemented via descriptors would erroneously share state between instances of the classes the attributes were assigned to). By the time I got to the office and could consult David's blog post, this is the implementation I had working:
from collections import defaultdict

class TypedAttr(object):
    """Descriptor implementing typed attributes, converting
    assigned values to the given type if necessary

    Constructed with three parameters: the type of the attribute,
    an initial value for the attribute, and an (optional)
    function for converting values of other types to the desired
    type.

    If the converter function is not specified, then the type
    factory is called to perform the conversion.
    """
    __slots__ = ('__type', '__converter', '__value')

    def __init__(self, type, initvalue=None, converter=None):
        if converter is None:
            converter = type
        self.__type = type
        self.__converter = converter
        initvalue = self.convert(initvalue)
        self.__value = defaultdict(lambda: initvalue)

    def convert(self, value):
        if not isinstance(value, self.__type) \
           and value is not None:
            value = self.__converter(value)
        assert isinstance(value, self.__type) \
               or value is None
        return value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return self.__value[instance]

    def __set__(self, instance, value):
        self.__value[instance] = self.convert(value)

With this, I could write my classes like so:
class Example(object):
    Name = TypedAttr(str)
    Type = TypedAttr(str, "cheezy example")
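A quick interactive session demonstrates the conversion on assignment:

>>> e = Example()
>>> e.Type
'cheezy example'
>>> e.Name = 42
>>> e.Name
'42'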

Mainly I'm using the typed attributes for data validation in objects populated from values supplied by an untrusted source. It would be really nice if I could compose descriptors to build more complex managed attributes (sort of like you can with validators in Ian Bicking's FormEncode package). Then I could make a descriptor, for example, Unsigned or NotNone and compose them with TypedAttr like so:
class CompositionExample(object):
    Name = TypedAttr(NotNone(str))
    Age = Unsigned(TypedAttr(int))

I'll admit I haven't put a whole lot of thought into it yet, but at first glance it appears that it would be impossible to compose descriptors in python. I would love to be proven wrong.

Monday, July 23, 2007

Python: Serializer benchmarks

I am working on a project in which clients will be submitting more data than my current server implementation knows what to do with. The reason the current implementation doesn't use all of the submitted data is that I don't yet know what the quality of the data will be until the client is deployed in the wild. I want to record all of the submitted data, though, in the expectation that a future implementation will be able to use it. So I was considering formats for logging the submitted data such that it would be easy to parse in the future.

Since I'm already storing a subset of the submitted data in a database, the most obvious solution is to make a table of submissions which has a column for each submitted data element. However, it turns out that this is quite slow and given that I'm not sure how much of the extra data I'll ever need or when I may update the server implementation to use it, I hate to pay a hefty price to store it now. For now, I can consider the data write-only. If and when I need to use that data, then I can write an import script that updates the server database using the saved data.

So I've been considering simply logging the submissions to a file. It is considerably faster to append to a flat file than it is to write to a database -- which makes sense since the database supports read/write access, whereas I only need write-only access for now.

The next question is what format to write the data to the log file. I have a python dictionary of the submitted data; at first I considered writing the dictionary to the log file in JSON format. The JSON format is relatively easy to convert to/from python data structures and python has quality implementations to do it. Furthermore, unlike the pickle text format, it is trivial to visually interpret the serialized data. This latter point is also important to me since I need to be able to judge the quality of the data in order to discern what portions I can use in the future.

However, to my chagrin, it turns out that the JSON module I have been using, simplejson, is slower than I had imagined. Profiling of my server implementation found that, after the database update logic, serializing the submitted data into JSON format was my second largest consumer of CPU cycles. I hate the thought of wasting so much time logging the data when it is an operation that is essentially pure overhead.

Hence I started considering other serialization formats, benchmarking them as I went. Here are the results of my benchmarks:


Serializer              Run 1 (secs)   Run 2 (secs)   Mean (secs)
pyYAML 3.05                 21953.18       25482.61      23717.89
pySyck 0.61.2                3107.06        2805.38       2956.22
pprint                       2364.91        2368.42       2366.67
pickle                       1509.31        1665.16       1587.23
pickle/protocol=2            1359.40        1330.71       1345.05
simplejson 1.7.1              710.78         604.13        657.46
cPickle                       159.27         172.26        165.77
repr                           73.50          77.24         75.37
cjson 1.0.3                    63.94          74.28         69.11
cPickle/protocol=2             50.97          57.72         54.34
marshal                        12.52          13.32         12.92

All numbers were obtained using the timeit module to serialize the dictionary created by the expression "dict([ (str(n), n) for n in range(100) ])".
The tests were run under Python 2.5 (r25:51908, Mar 3 2007, 15:40:46) built using [GCC 3.4.6 [FreeBSD] 20060305] on freebsd6. The simplejson, cjson, pyYAML, and pySyck modules were installed from their respective FreeBSD ports (I had to update the FreeBSD pySyck port to install 0.61.2 since it otherwise installs 0.55).
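Each measurement was essentially of the following form (a sketch: the module under test varied per serializer, and I do not recall the exact iteration count, though one million iterations is consistent with the magnitudes above):

from timeit import Timer

setup = "import simplejson; d = dict([ (str(n), n) for n in range(100) ])"
print Timer("simplejson.dumps(d)", setup).timeit(1000000)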

I guess I should not have been surprised, but it turns out that simply calling repr() on the dictionary is almost 9 times faster than calling simplejson.dumps(). In fact, taking repr() as a baseline (100%), I calculated how long each of the other serializers took relative to repr():

Serializer              Mean (secs)   Relative to Baseline
pyYAML 3.05                23717.89                 31469%
pySyck 0.61.2               2956.22                  3922%
pprint                      2366.67                  3140%
pickle                      1587.23                  2106%
pickle/protocol=2           1345.05                  1785%
simplejson 1.7.1             657.46                   872%
cPickle                      165.77                   220%
repr                          75.37                   100%
cjson 1.0.3                   69.11                  91.7%
cPickle/protocol=2            54.34                  72.1%
marshal                       12.92                  17.1%

The numbers in the last column are how much longer it took to serialize the test dictionary using the given serializer than it was using repr().

So now I'm thinking of sticking with JSON as my log format, but using the cjson module rather than simplejson. cPickle's latest binary format (protocol=2) is even faster, but I would lose the ability to visually scan the log file to get a feel for the quality of the data I'm not currently using.

Now, before I get a horde of comments I should point out that I am aware that simplejson has an optional C speedups module. Unfortunately, it does not appear to be installed by default on either FreeBSD (my server) or on Windows (my current client). I wouldn't be the least bit surprised if the C version of simplejson is just as fast as the cjson module, but it doesn't matter if it isn't installed. As such, it looks like I'll be switching to cjson for my JSON serialization needs from now on.

Update 2007/07/25 07:07pm:
In response to paddy3118's comment, I added benchmarks for the python pprint module to the tables above.

Update 2007/07/27 12:26pm:
In response to David Niergarth's comment, I added benchmarks for pyYAML 3.05 and pySyck 0.61.2.

Sunday, July 22, 2007

Python: Mapping arguments to their default values

Using inspect.getargspec or the getargspec implemention in my previous post, we can build a dictionary mapping a callable's argument names to their default values:
def getargmap(obj, default=None):
    """Get dictionary mapping callable's arguments to their
    default values

    Arguments without default values in the callable's argument
    specification are mapped to the value given by the default
    argument.
    """
    spec = getargspec(obj)
    argmap = dict.fromkeys(spec[0], default)
    if spec[3]:
        # Map the trailing arguments to their default values.
        argmap.update(zip(spec[0][-len(spec[3]):], spec[3]))
    return argmap
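For example, given a trivial made-up function (dictionary ordering will vary):

>>> def example(a, b, c=3, d=4):
...     pass
...
>>> getargmap(example)
{'a': None, 'c': 3, 'b': None, 'd': 4}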

Python: A (more) generic getargspec()

In my last post, I presented a generic way to get the arguments passed to the current function such that you can iterate through them. This time, I present a way to get the arguments that a callable accepts/expects. Actually, the standard inspect module already has a getargspec() function that returns the argument specification of a function. However, it only works on functions and methods, not any other python callable. It turns out that there is no way to get the argument specification for built-in callables, but we can implement a version of getargspec() that can get the specification for classes and callable objects:
import inspect

def getargspec(obj):
    """Get the names and default values of a callable's
    arguments

    A tuple of four things is returned: (args, varargs,
    varkw, defaults).
      - args is a list of the argument names (it may
        contain nested lists).
      - varargs and varkw are the names of the * and
        ** arguments or None.
      - defaults is a tuple of default argument values
        or None if there are no default arguments; if
        this tuple has n elements, they correspond to
        the last n elements listed in args.

    Unlike inspect.getargspec(), can return argument
    specification for functions, methods, callable
    objects, and classes. Does not support builtin
    functions or methods.
    """
    if not callable(obj):
        raise TypeError, "%s is not callable" % type(obj)
    try:
        if inspect.isfunction(obj):
            return inspect.getargspec(obj)
        elif hasattr(obj, 'im_func'):
            # For methods or classmethods drop the first
            # argument from the returned list because
            # python supplies that automatically for us.
            # Note that this differs from what
            # inspect.getargspec() returns for methods.
            # NB: We use im_func so we work with
            # instancemethod objects also.
            spec = list(inspect.getargspec(obj.im_func))
            spec[0] = spec[0][1:]
            return spec
        elif inspect.isclass(obj):
            return getargspec(obj.__init__)
        elif isinstance(obj, object) and \
             not isinstance(obj, type(''.__add__)):
            # ''.__add__ is a method-wrapper; if obj is one
            # of those (e.g. the __call__ of a builtin), we
            # cannot inspect it. Otherwise, we already know
            # the instance is callable, so it must have a
            # __call__ method defined. Return the arguments
            # it expects.
            return getargspec(obj.__call__)
    except NotImplementedError:
        # If a nested call to our own getargspec()
        # raises NotImplementedError, re-raise the
        # exception with the real object type to make
        # the error message more meaningful (the caller
        # only knows what they passed us; they shouldn't
        # care what aspect(s) of that object we actually
        # examined).
        pass
    raise NotImplementedError, \
          "do not know how to get argument list for %s" % \
          type(obj)
This version returns exactly the same argument specification tuple as inspect's getargspec() does with one notable exception: if called on a method, the argument list returned in the first tuple element will not include the implicit 'self' argument. The reason is that python implicitly supplies that argument so the caller does not pass it explicitly. I find it more useful to only return the argument specification as seen by callers. If you need a drop-in replacement for inspect.getargspec(), then you will need to slightly modify the method/classmethod case to not remove the first element in the argument list.
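For example, with a hypothetical callable class:

>>> class Greeter(object):
...     def __init__(self, greeting='hello'):
...         self.greeting = greeting
...     def __call__(self, name, punctuation='!'):
...         return '%s, %s%s' % (self.greeting, name, punctuation)
...
>>> getargspec(Greeter)        # inspects __init__; 'self' is dropped
[['greeting'], None, None, ('hello',)]
>>> getargspec(Greeter())      # instances are inspected via __call__
[['name', 'punctuation'], None, None, ('!',)]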

Monday, July 16, 2007

Python: Aggregating function arguments

Python has three ways to pass arguments to functions: enumerated named arguments, unenumerated named arguments, and unnamed positional arguments. Enumerated named arguments are familiar to most programmers since most modern languages use this style of naming arguments (perl being a notable exception). For example, the following function specifies that it accepts 3 arguments, and assigns those arguments the names larry, moe, and curly for the scope of the function:
    def Stooge(larry, moe, curly):
        ...

If you call this function as Stooge(1, 2, 3) then the variable named larry equals 1, moe equals 2, and curly equals 3 when the function Stooge starts. Python, like C++ and Java, also allows you to specify the arguments explicitly; calling the function as Stooge(moe=2, curly=3, larry=1) or Stooge(1, 2, curly=3) again causes larry to equal 1, moe to equal 2, and curly to equal 3 when the function starts. I call this form of argument passing enumerated named arguments since names are assigned to each argument and all acceptable arguments are enumerated in the function declaration.

Python also supports unenumerated named arguments by specifying a "catch all" argument, prefixed with two asterisks. For example:
    def Stooge2(larry, moe, **kw):
        ...

In this case, Stooge2 accepts two arguments, larry and moe, that may be specified either positionally or by name just as in the previous example. However, it also accepts any number of additional named arguments. For example, we could call the function as Stooge2(1, moe=2, shemp=3) or Stooge2(1, 2, shemp=3, curly=4). In both cases, as before, larry would start equal to 1 and moe would start equal to 2. However, now the kw argument would be populated with a dictionary mapping all other named parameters with their argument values. For example, it might contain {'shemp': 3, 'curly': 4}.

Before we move on to unnamed positional arguments, let me interrupt to touch on the point of this posting: how do you iterate over all named arguments whether they be enumerated or not?

If your function enumerates all accepted named arguments, then you can trivially get a dictionary mapping the argument names to their values if you call the builtin function locals() at the beginning of your function. For example:
    def WhichStooges(larry, moe, curly):
        stooges = locals()
        ...

This would populate stooges with a dictionary with keys, "larry", "moe", and "curly". You could then iterate through the arguments and their values with a standard loop over stooges.items().

Now, if you add unenumerated named arguments into the picture, it gets a bit trickier. The most straightforward way is to use the fact that "catch all" argument is a standard dictionary and update it from locals() at the beginning of the function:
    def WhichStooges2(larry, moe, **stooges):
        stooges.update(locals())
        ...

The only problem with this approach is that stooges still appears in the argument list, which is probably not what you want. This can be remedied like so:
    def WhichStooges2(larry, moe, **stooges):
        stooges.update(locals())
        del stooges['stooges']
        ...

Which only leaves the minor issue of the requirement for locals() to be called at the top of the function, before any other variables are defined in the function's scope. Wouldn't it be nice if we could enumerate the function arguments anywhere in the function? And wouldn't it be even better if we could encapsulate the logic for aggregating the function arguments into a utility function?

Before I get to the solution to those problems, for the sake of completeness I should cover unnamed positional arguments too. Unnamed positional arguments are additional positional arguments that are captured in a single list argument by prefixing the argument named with a single asterisk (*) in Python. For example:
    def WhichStooges3(larry, moe, *args):
        ...

In this case, larry and moe may still be passed values either by name or position as in previous examples. In addition, more values may be specified but they cannot be named. Calling this function as WhichStooges3(1, 2, 3, 4) causes larry to start with the value 1, moe to start with the value 2, and args to start as the tuple (3, 4). The rules for mixing unnamed positional arguments and named arguments are non-trivial and covered in the Python documentation so I won't rehash them here.

Finally, we can construct a single utility function that returns a dictionary of all named parameters (enumerated or not) as well as a list of all unnamed positional parameters. By using Python's inspect module we can encapsulate the logic into a common routine that can be called anywhere within a function's scope.
    def arguments():
        """Returns tuple containing dictionary of calling function's
        named arguments and a list of calling function's unnamed
        positional arguments.
        """
        from inspect import getargvalues, stack
        posname, kwname, args = getargvalues(stack()[1][0])[-3:]
        posargs = args.pop(posname, [])
        args.update(args.pop(kwname, []))
        return args, posargs

This routine removes the 'catch all' arguments (i.e. the positional catch all argument prefixed with a single asterisk and/or the keyword catch all argument prefixed with two asterisks) from the returned dictionary of named arguments for you.
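For example, calling this hypothetical function as WhichStooges4(1, 2, 3, shemp=4):

    def WhichStooges4(larry, moe, *args, **kw):
        named, positional = arguments()
        # named      -> {'larry': 1, 'moe': 2, 'shemp': 4}
        # positional -> (3,)
        ...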


Update 2009/09/29:
I updated the arguments() function to fix a bug that was brought to my attention by drewm1980's comment.

Friday, July 13, 2007

Parsing Japanese addresses

Last night Steven Bird, Ewan Klein, and Edward Loper gave a presentation about their Natural Language Toolkit at the monthly baypiggies meeting. The gist of the presentation seemed to be that their toolkit is just that: a set of basic tools commonly needed in implementing more complicated natural language processing algorithms and a set of corpora for training and benchmarking those algorithms. Given their background as academics, this makes sense as it allows them to quickly prototype and explore new algorithms as part of their research. However, I got the impression that a number of the attendees were hoping for more of a plug-and-play complete natural language processing solution they could integrate into other programs without needing to be versed in the latest research themselves.

When I get some time, I would like to try using NLTK to solve a recurring problem I encounter at work: parsing Japanese addresses. There is a commercial tool that claims to do a good job parsing Japanese postal addresses, but I've found the following python snippet does a pretty good job on the datasets I've been presented so far:
# Beware of greedy matching in the following regex lest it
# fail to split 宮城県仙台市泉区市名坂字東裏97-1 properly
# as (宮城県, None, 仙台市, 泉区, 市名坂字東裏97-1)
# In addition, we have to handle 京都府 specially since its
# name contains 都 even though it is a 府.
import re

_address_re = re.compile(
    ur'(京都府|.+?[都道府県])(.+郡)?(.+?[市町村])?(.+?区)?(.*)',
    re.UNICODE)

def splitJapaneseAddress(addrstr):
    """Splits a string containing a Japanese address into
    a tuple containing the prefecture (a.k.a. province),
    county, city, ward, and everything else.
    """
    m = _address_re.match(addrstr.strip())
    (province, county, city, ward, address) = m.groups()
    address = address.strip()
    # 東京都 is both a city and a prefecture.
    if province == u'東京都' and city is None:
        city = province
    return (province, county, city, ward, address)
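A quick check with the address from the comment above:

>>> splitJapaneseAddress(u'宮城県仙台市泉区市名坂字東裏97-1')
(u'宮城県', None, u'仙台市', u'泉区', u'市名坂字東裏97-1')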

I should add that, unlike English, it does not make sense to separate and store the Japanese street address as its own value since the full address string is commonly what is displayed. So even though the routine above returns the street address as the final tuple item, I never actually use the returned value for anything.

Anyway, as you can see this regular expression is pretty naive. During last night's meeting I kept thinking that I should put together a corpus of Japanese addresses and their proper parses so that I can experiment with writing a better parser. The Natural Language Toolkit seems to be designed for doing just this kind of experimentation. I'm hoping that next time I'm given a large dataset for import into our database at work I can justify the time to spend applying NLTK to the task.

Thursday, July 5, 2007

Python and MochiKit for the win

Last Friday we concluded the first ever NTT MCL prototyping contest. The rules were simple: we could form teams of up to 3 employees and had one month to prototype any idea we wanted. We had to submit an entry form with our idea and the team members at the beginning of the contest. The judging of submissions would not only consider the technical aspects of the idea but also the feasibility of developing it into a full-scale product to be sold/marketed by NTT Communications or one of our sister companies. Basically, cheap market research. :)

Obviously, I cannot go into the details of the submissions, except to say that one team (of three) implemented theirs in C++, one team (of three) used Java, another team (of two) used C and perl, and my team (of two) used Python and JavaScript. Of course, we all implemented our own project ideas so the amount of work per project varied greatly.

The verdict is in: my team won. And I think it was all thanks to the rapid prototyping made possible by modern scripting languages. The C/perl team dropped out at the last minute because a content provider their project depended on went off-line the day before the deadline and presentation. The other two teams (using C++ and Java) had interesting ideas and working prototypes, but in both cases the prototypes were just barely functional. It was a prototyping contest, so that is to be expected.

However, we demonstrated a fully-working dynamic web-based application with real-time updates (graphs and charts would literally change while you were looking at them in response to external data). Not to sound like I'm bragging, but it was polished.

I have to say, I haven't done full-on web development in years, and I was refreshed at how much easier it has gotten. In particular, I found that I could apply more-or-less traditional client-server design with the client implemented in JavaScript and running in the browser, using AJAX to make requests to a JSON server implemented in python. MochiKit, as promised, made JavaScript suck less. Come to think of it, since I used Bob Ippolito's MochiKit on the client and simplejson python module on the server, I guess you could say Bob won the competition for us.

Anyway, the one thing that really disappointed me was that no one asked how we did it. I didn't actually care whether we won or not, but I am really proud of the technology under the hood. I expected that, presenting to 20+ engineers at a research and development company someone would say "cool, how did you do that?" To my chagrin, not one person asked (although, my coworker Zach came by my office later to ask). I know it is cheezy, but I was really hoping someone would ask if I used Ruby on Rails so I could respond, "no, it was Kelly on Caffeine." :)

In case anyone out there reading this is curious: I didn't use TurboGears, Pylons, or Django either. I'll be the first to admit it was just a prototype rather than a full-blown production web application, but I found the python cgi and wsgiref modules, flup's FastCGI WSGI server, and Bob Ippolito's simplejson module more than adequate to implement a fast JSON server that interfaced with a PostgreSQL database backend. No proprietary templating languages, no object-relational mappers trying (futilely) to apply python syntax to SQL, no cryptic stack traces through 18 layers of unfamiliar framework code. Just SQL queries, JSON, and good old fashioned client-server request handling (where requests came via CGI). All of the user interface components were implemented by logic that executed on the client side. I can't imagine any web application framework being faster either in terms of developer or CPU time.
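To give a flavor of what I mean, here is a minimal sketch of a JSON-over-WSGI endpoint built from just wsgiref and simplejson (the handler logic here is made up; our real server dispatched to database queries):

import simplejson
from wsgiref.simple_server import make_server

def application(environ, start_response):
    # Read the JSON request body posted by the client.
    length = int(environ.get('CONTENT_LENGTH') or 0)
    request = simplejson.loads(environ['wsgi.input'].read(length))
    # Real request dispatching (SQL queries, etc.) would go here.
    response = simplejson.dumps({'echo': request})
    start_response('200 OK', [('Content-Type', 'application/json')])
    return [response]

make_server('localhost', 8000, application).serve_forever()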

Given the choice, however, I would have preferred to not have had to use JavaScript to implement the client. Suck "less" indeed.

Sunday, June 24, 2007

Useless geolocation information?

I just ran across the Windows' geographical location information API hidden amongst their national language support functions. One thing that caught my eye was that you can get latitude and longitude coordinates via the API. For example, in python:

>>> from ctypes import *
>>> nation = windll.kernel32.GetUserGeoID(16)
>>> buf = create_string_buffer(20)
>>> GEO_LATITUDE=0x0002
>>> GEO_LONGITUDE=0x0003
>>> windll.kernel32.GetGeoInfoA(nation, GEO_LATITUDE,
... byref(buf), sizeof(buf), 0)
6
>>> buf.value
'39.45'
>>> windll.kernel32.GetGeoInfoA(nation, GEO_LONGITUDE,
... byref(buf), sizeof(buf), 0)
8
>>> buf.value
'-98.908'

At first glance, this looks pretty cool. You can get geospatial coordinates on any Windows XP or Vista box. Except that the API only returns the latitude and longitude of the country selected in the "Regional and Language Options" control panel. That is: the coordinates returned are for the user's selected location (i.e. country).

Sure enough, plugging the returned latitude and longitude into Yahoo Maps, I find that the coordinates fall squarely in the middle of the country. A good thousand miles from where I sit in Palo Alto, California. Of course, being that I don't have a GPS receiver attached to my laptop, I didn't really expect the coordinates to be accurate.

For the life of me, I can't think of a single application of this information. But there it is: someone at Microsoft had to populate the database mapping countries to geospatial coordinates and some engineer had to code the API to retrieve them. For what purpose? What can you do with the latitude and longitude of a country (whatever that is supposed to mean)? Can anyone think of a use for this data?