Programming pages.
Lex Nederbragt posted a question about version control and provenance on the Software Carpentry discussion list. I responded with my Portage-based workflow, but C. Titus Brown pointed out a number of reasons why this approach isn't more widely used, which seem to boil down to “that sounds like more trouble than it's worth”. Because recording the state of a system is important for reproducible research, it is worth doing something to clean up the current seat-of-the-pants approach.
Figuring out what software you have installed on your system is actually a (mostly) solved problem. There is a long history in the Linux ecosystem of package management systems that track installed packages and install new software (and any dependencies) automatically. Unfortunately, there is no consensus package manager across distributions, with Debian-based distributions using apt, Fedora-based distributions using yum, …. If you are not the system administrator for your computer, you can either talk your sysadmin into installing the packages you need, or use one of a number of guest package managers (Gentoo Prefix, homebrew, …). The guest package managers also work if you're committed to an OS that doesn't have an existing native package manager.
Despite the existence of many high quality package managers, I know many people who continue to install significant amounts of software by hand. While this is sustainable for a handful of packages, I see no reason to struggle through manual installations (and subsequent upgrades, dependency tracking, …) when existing tools can automate the procedure. A stopgap solution is to use language-specific package managers (pip for Python, gem for Ruby, …). This works fairly well, but once you reach a certain level of complexity (e.g. integrating Fortran and C extensions with Python in SciPy), things get difficult. While language-specific packaging standards ease automation, they are not a substitute for a language-agnostic package manager.
Many distributions distribute pre-compiled binary packages, which give fast, stable installs without the need for a full build system on your local machine. When the package you need is in the official repository (or a third-party repository), this approach works quite well. There's no need to go through the time or effort of compiling Firefox, LaTeX, LibreOffice, or other software that I interact with as a general user. However, my own packages (or the actively developed libraries that I use in my own software) are rarely available as pre-compiled binaries. If you find yourself in this situation, it is useful to use a package manager that makes it easy to write source-based packages (Gentoo's Portage, Exherbo's Paludis, Arch's pacman, …).
With source-based packaging systems, packaging an existing Python package is usually a matter of listing a bit of metadata. With layman, integrating your local packages into your Portage tree is extremely simple. Does your package depend on some other package in another oddball language? Some wonky build tool? No problem! Just list the new dependency in your ebuild (it probably already exists). Source-based package managers also make it easy to stay up to date with ongoing development. Portage supports live ebuilds that build fresh checkouts from a project's version control repository (use Git!). There is no need to dig out your old installation notes or reread the project's installation instructions.
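To give you a feel for how little metadata is involved, here's roughly what a live ebuild for a hypothetical Python package might look like (the package name, homepage, and repository URL are all placeholders, and the eclass details depend on the vintage of your Portage tree):

# mypkg-9999.ebuild -- a sketch of a live ebuild, not a drop-in file
EAPI=4
PYTHON_DEPEND="2:2.7"
inherit distutils git-2

DESCRIPTION="Hypothetical package, showing how little metadata an ebuild needs"
HOMEPAGE="http://example.com/mypkg/"
EGIT_REPO_URI="git://example.com/mypkg.git"

LICENSE="GPL-3"
SLOT="0"
KEYWORDS=""
IUSE=""

RDEPEND=""
DEPEND="${RDEPEND}"

The distutils and git-2 eclasses do the heavy lifting: the default phases clone the repository and run setup.py, so the ebuild itself is almost pure metadata.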
Getting back to the goals of reproducible research, I think that existing package managers are an excellent solution for tracking the software used to perform experiments or run simulations and analysis. The main stumbling block is the lack of market penetration ;). Building a lightweight package manager that can work easily at both the system-wide and per-user levels across a range of host OSes is hard work. With the current fractured packaging ecosystem, I doubt that rolling a new package manager from scratch would be an effective approach. Existing package managers have mostly satisfied their users, and the fundamental properties haven't changed much in over a decade. Writing a system appealing enough to drag these satisfied users over to your new system is probably not going to happen.
Portage (and Gentoo Prefix) get you most of the way there, with the help of well written specifications and documentation. However, compatibility and testing in the prefix configuration still need some polishing, as does robust binary packaging support. These issues are less interesting to most Portage developers, as they usually run Portage as the native package manager and avoid binary packages. If the broader scientific community is interested in sustainable software, I think effort channeled into polishing these use-cases would be time well spent.
For those less interested in adopting a full-fledged package manager, you should at least make some effort to package your software. I have used software that didn't even have a README with build instructions, and compiling it was awful. If you're publishing your software in the hopes that others will find it, use it, and cite you in their subsequent paper, it behooves you to make the installation as easy as possible. Until your community coalesces around a single package management framework, picking a standard build system (Autotools, Distutils, …) will at least make it easier for folks to install your software by hand.
Available in a git repository.
Repository: mutt-ldap
Browsable repository: mutt-ldap
Author: W. Trevor King
I wrote this Python script to query an LDAP server for addresses from Mutt. In December 2012, I got some patches from Wade Berrier and Niels de Vos. Anything interesting enough for others to hack on deserves its own repository, so I pulled it out of my blog repository (linked above, and mirrored on GitHub).
The README is posted on the PyPI page.
I've been wanting to get into microcontroller programming for a while now, and last week I broke down and ordered components for a breadboard Arduino from Mouser. There's a fair amount of buzz about the Arduino platform, but I find the whole sketch infrastructure confusing. I'm a big fan of command line tools in general, so the whole IDE thing was a bit of a turn-off.
Because the ATMega328 doesn't have a USB controller, I also bought a Teensy 2.0 from PJRC. The Teensy is just an ATMega32u4 on a board with supporting hardware (clock, reset switch, LED, etc.). I've packaged the Teensy programmer and HID listener in my Gentoo overlay, to make it easier to install them and stay up to date.
Arduinos (and a number of similar projects) are based on AVR microcontrollers like the ATMegas. Writing code for an AVR processor is similar to writing code for any other processor. GCC will cross-compile your code once you've set up a cross-compiling toolchain. There's a good intro to the whole embedded approach in the Gentoo Embedded Handbook.
For all the AVR-specific features, you can use AVR-libc, an open source C library for AVR processors. It's hard to imagine doing anything interesting without using this library, so you should at least skim through the manual. They also have a few interesting demos to get you going.
AVR-libc sorts chip-support code into AVR architecture subdirectories. For example, object code specific to my ATMega32u4 is installed at /usr/avr/lib/avr5/crtm32u4.o; avr5 is the AVR architecture version of this chip.
Crossdev
Since you will probably not want to build a version of GCC that runs on your AVR chip, you'll be building a cross-compiling toolchain. The toolchain will allow you to use your development box to compile programs for your AVR chip. On Gentoo, the recommended approach is to use crossdev to build the toolchain (although crossdev's AVR support can be flaky). They suggest you install it in a stage3 chroot to protect your native toolchain, but I think it's easier to just make btrfs snapshots of my hard drive before doing something crazy. I didn't have any trouble skipping the chroot on my system, but your mileage may vary.
# emerge -av crossdev
Because it has per-arch libraries (like avr5), AVR-libc needs to be built with multilib support. If you (like me) have avoided multilib like the plague so far, you'll need to patch crossdev to turn on multilib for the AVR tools. Do this by applying Jess' patch from bug 377039.
# wget -O crossdev-avr-multilib.patch 'https://bugs.gentoo.org/attachment.cgi?id=304037'
# patch /usr/bin/crossdev < crossdev-avr-multilib.patch
If you're using a profile where multilib is masked (e.g. default/linux/x86/10.0/desktop), you should use Niklas' extended version of the patch from the duplicate bug 378387.
Despite claiming to use the last overlay in PORTDIR_OVERLAY, crossdev currently uses the first, so if you use layman to manage your overlays (as I do), you'll want to tweak your make.conf to look like:
source /var/lib/layman/make.conf
PORTDIR_OVERLAY="/usr/local/portage ${PORTDIR_OVERLAY}"
Now you can install your toolchain following the Crossdev wiki. First install a minimal GCC (stage 1) using
# USE="-cxx -openmp" crossdev --binutils 9999 -s1 --without-headers --target avr
Then install a full-featured GCC (stage 4) using
# USE="cxx -nocxx" crossdev --binutils 9999 -s4 --target avr
I use binutils-9999 to install live from the git mirror, which avoids a segfault bug in binutils 2.22.
After the install, I was bitten by bug 147155:
cannot open linker script file ldscripts/avr5.x
which I work around with:
# ln -s /usr/x86_64-pc-linux-gnu/avr/lib/ldscripts /usr/avr/lib/ldscripts
Now you're ready. Go forth and build!
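For example, here's an LED blinker to test the new toolchain. It assumes a Teensy 2.0, whose LED sits on PD6, and a 16 MHz clock; adjust both for your board:

/* blink.c: toggle the Teensy 2.0 LED (PD6) twice a second */
#define F_CPU 16000000UL  /* must match your real clock for _delay_ms() */
#include <avr/io.h>
#include <util/delay.h>

int main(void)
{
    DDRD |= _BV(PD6);           /* configure PD6 as an output */
    for (;;) {
        PORTD ^= _BV(PD6);      /* toggle the LED */
        _delay_ms(500);
    }
    return 0;
}

Compile, convert to Intel hex, and load it with the Teensy loader (the loader's flags vary between versions; this matches the current command-line tool):

$ avr-gcc -mmcu=atmega32u4 -Os -o blink.elf blink.c
$ avr-objcopy -O ihex blink.elf blink.hex
$ teensy_loader_cli -mmcu=atmega32u4 -w blink.hex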
Cross-compiler construction
Why do several stages of GCC need to be built anyway? From crossdev --help, here are the stages:
- Stage 0: build just binutils
- Stage 1: also build a bare C compiler (no C library/C++/shared GCC libs/C++ exceptions/etc…)
- Stage 2: also build kernel headers
- Stage 3: also build the C library
- Stage 4: also build a full compiler
Available in a git repository.
Repository: igor
Browsable repository: igor
Author: W. Trevor King
This is the home page for the igor package, Python modules for reading files written by WaveMetrics IGOR Pro. Note that if you're designing a system, HDF5 is almost certainly a better choice for your data file format than IBW or PXP. This package exists for those of you whose data is already stuck in an IGOR format.
History
When I joined Prof. Yang's lab, there was a good deal of data analysis code written in IGOR, and a bunch of old data saved in IGOR binary wave (IBW) and packed experiment (PXP) files. I don't use MS Windows, so I don't run IGOR, but I still needed a way to get at the data. Luckily, the WaveMetrics folks publish some useful notes which explain the fundamentals of these two file formats (TN003 for IBW and PTN003 for PXP). The notes themselves are in a goofy format, but strings pulls out enough meat to figure out what's going on.
For a while I used an IBW → ASCII reader that I coded up in C, but when I joined the Hooke project during the winter of 2009–2010, I translated the reader into Python to support the drivers for data from Asylum Research's MFP-* and related microscopes. This scratched my itch for a few years.
Fast forward to 2012, and for the first time I needed to extract data from a PXP file. Since my Python code only supported IBWs, I searched around and found igor.py by Paul Kienzle and Merlijn van Deen. They had a PXP reader, but no reader for stand-alone IBW files. I decided to merge the two projects, so I split my reader out of the Hooke repository and hacked up the Git repository referenced above. Now it's easy to get a hold of all that useful metadata in a hurry. No writing ability yet, but I don't know why you'd want to move data that direction anyway ;).
Parsing dynamic structures with Python
The IGOR file formats rely on lots of shenanigans with C structs. To meld all the structures together in a natural way, I've extended Python's standard struct library to support arbitrary nesting and dynamic fields. Take a look at igor.struct for some examples. This framework makes it easy to load data from structures like:
struct vector {
unsigned int length;
short data[length];
};
With the standard struct module, you'd read this using the functional approach:
>>> import struct
>>> buffer = b'\x00\x00\x00\x02\x01\x02\x03\x04'
>>> length_struct = struct.Struct('>I')
>>> length = length_struct.unpack_from(buffer)[0]
>>> data = struct.unpack_from('>' + 'h'*length, buffer, length_struct.size)
>>> print(data)
(258, 772)
This obviously works, but keeping track of the offsets, byte ordering, etc. can be tedious. My igor.struct package allows you to use a more object oriented approach:
>>> from pprint import pprint
>>> from igor.struct import Field, DynamicField, DynamicStructure
>>> class DynamicLengthField (DynamicField):
... def pre_pack(self, parents, data):
... "Set the 'length' value to match the data before packing"
... vector_structure = parents[-1]
... vector_data = self._get_structure_data(
... parents, data, vector_structure)
... length = len(vector_data['data'])
... vector_data['length'] = length
... data_field = vector_structure.get_field('data')
... data_field.count = length
... data_field.setup()
... def post_unpack(self, parents, data):
... "Adjust the expected data count to match the 'length' value"
... vector_structure = parents[-1]
... vector_data = self._get_structure_data(
... parents, data, vector_structure)
... length = vector_data['length']
... data_field = vector_structure.get_field('data')
... data_field.count = length
... data_field.setup()
>>> dynamic_length_vector = DynamicStructure('vector',
... fields=[
... DynamicLengthField('I', 'length'),
... Field('h', 'data', count=0, array=True),
... ],
... byte_order='>')
>>> vector = dynamic_length_vector.unpack(buffer)
>>> pprint(vector)
{'data': array([258, 772]), 'length': 2}
While this is overkill for such a simple example, it scales much more cleanly than an approach using the standard struct module. The main benefit is that you can use Structure instances as format specifiers for Field instances. This means that you could specify a C structure like:
struct vectors {
unsigned int length;
struct vector data[length];
};
With:
>>> dynamic_length_vectors = DynamicStructure('vectors',
... fields=[
... DynamicLengthField('I', 'length'),
... Field(dynamic_length_vector, 'data', count=0, array=True),
... ],
... byte_order='>')
The C code you're mimicking probably only uses a handful of dynamic approaches. Once you've written classes to handle each of them, it is easy to translate arbitrarily complex nested C structures into Python representations.
The pre-pack and post-unpack hooks also give you a convenient place to translate between a C struct's funky format and Python's native types. You take care of all that when you define the structure, and then any part of your software that uses the structure gets the native version automatically.
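For example, IGOR stores dates as seconds since 1904-01-01. A post_unpack hook along these lines (sketched against the API shown above; I'm assuming the field's name attribute is self.name) could hand the rest of your code real datetime objects:

>>> from datetime import datetime, timedelta
>>> class TimestampField (DynamicField):
...     "Convert IGOR's 'seconds since 1904-01-01' into datetime objects"
...     def post_unpack(self, parents, data):
...         structure = parents[-1]
...         structure_data = self._get_structure_data(
...             parents, data, structure)
...         seconds = structure_data[self.name]
...         structure_data[self.name] = (
...             datetime(1904, 1, 1) + timedelta(seconds=seconds))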
Available in a git repository.
Repository: curses-check-for-keypress
Browsable repository: curses-check-for-keypress
Author: W. Trevor King
There are some points in my experiment control code where the program does something for an arbitrary length of time (e.g., waiting while the operator manually adjusts a laser's alignment). For these situations, I wanted to be able to loop until the user pressed a key. This is a simple enough idea, but the implementation turned out to be complicated enough for me to spin it out as a stand-alone module.
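The core idea is just a non-blocking getch() poll. Here's a minimal sketch using the standard curses module directly (the packaged module wraps this more carefully):

import curses

def wait_for_keypress(screen):
    "Loop (doing the arbitrary-length work) until the user presses any key."
    screen.nodelay(True)  # make getch() return immediately instead of blocking
    screen.addstr(0, 0, 'adjust the laser; press any key when done')
    while screen.getch() == -1:  # -1 means no key has been pressed yet
        pass  # one iteration of experiment-control work (or a short sleep) goes here

curses.wrapper(wait_for_keypress)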
Portage is Gentoo's default package manager. This post isn't supposed to be a tutorial; the handbook does a pretty good job of that already. I'm just recording a few tricks so I don't forget them.
User patches
While playing around with LDAP, I was trying to troubleshoot the SASL_NOCANON handling. “Gee,” I thought, “wouldn't it be nice to be able to add debugging printfs to figure out what was happening?” Unfortunately, I had trouble getting ldapwhoami working when I compiled it by hand. “Grrr,” I thought, “I just want to add a simple patch and do whatever the ebuild already does.” This is actually pretty easy to do, once you're looking in the right places.
Write your patch
I'm not going to cover that here.
Place your patch where epatch_user will find it
This would be under /etc/portage/patches/<CATEGORY>/<PF|P|PN>/.
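For example, to queue a patch against OpenLDAP (the patch file name here is made up):

# mkdir -p /etc/portage/patches/net-nds/openldap
# cp sasl-nocanon-debug.patch /etc/portage/patches/net-nds/openldap/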
If your ebuild already calls epatch_user, or it uses an eclass like base that calls epatch_user internally, you're done. If not, read on…
Forcing epatch_user
While you could always write an overlay with an improved ebuild, a quicker fix for this kind of hack is /etc/portage/bashrc. I used:
if [ "${EBUILD_PHASE}" == "prepare" ]; then
echo ":: Calling epatch_user";
pushd "${S}"
epatch_user
popd
fi
to insert my patches at the beginning of the prepare phase.
Cleaning up
It's safe to call epatch_user multiple times, so you can leave this setup in place if you like. However, you might run into problems if you touch autoconf files, so you may want to move your bashrc somewhere else until you need it again!
Available in a git repository.
Repository: pyassuan
Browsable repository: pyassuan
Author: W. Trevor King
I've been trying to come up with a clean way to verify detached PGP signatures from Python. There are a number of existing approaches to this problem. Many of them call gpg using Python's multiprocessing or subprocess modules, but to verify detached signatures, you need to send the signature in on a separate file descriptor, and handling that in a way safe from deadlocks is difficult. The other approach, taken by PyMe, is to wrap GPGME using SWIG, which is great as far as it goes, but development seems to have stalled, and I find the raw GPGME interface excessively complicated.
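To make the deadlock problem concrete, here is a subprocess sketch. It assumes Python ≥ 3.2 (for pass_fds), a Linux-style /dev/fd, and a signature small enough to fit in a pipe buffer; anything bigger and the unserviced pipe is exactly where the deadlock bites:

import os
import subprocess

def verify_detached(document, signature):
    "Return True if signature is a valid detached signature for document."
    sig_read, sig_write = os.pipe()
    os.write(sig_write, signature)  # safe only because signatures are tiny
    os.close(sig_write)
    process = subprocess.Popen(
        ['gpg', '--verify', '/dev/fd/{}'.format(sig_read), '-'],
        stdin=subprocess.PIPE, pass_fds=(sig_read,))
    process.communicate(document)  # feed the signed data on stdin
    os.close(sig_read)
    return process.returncode == 0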
The GnuPG tools themselves often communicate over sockets using the Assuan protocol, and I'd already written an Assuan server to handle pinentry (originally for my gpg-agent post, not part of pyassuan). I thought it would be natural if there were a gpgme-agent which would handle cryptographic tasks over this protocol, which would make the pgp-mime implementation easier. It turns out that there already is such an agent (gpgme-tool), so I turned my pinentry script into the more general pyassuan package. Now using Assuan from Python should be as easy as (or easier than?) using it from C via libassuan.
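The protocol itself is refreshingly simple: newline-terminated commands from the client, and OK/ERR/data responses from the server, all over a socket or pipe. An illustrative exchange (the version string here is made up):

S: OK Pleased to meet you
C: GETINFO version
S: D 1.2.0
S: OK
C: BYE
S: OK closing connection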
The README is posted on the PyPI page.
Available in a git repository.
Repository: pygrader
Browsable repository: pygrader
Author: W. Trevor King
The last two courses I've TA'd at Drexel have been scientific computing courses where the students are writing code to solve homework problems. When they're done, they email the homework to me, and I grade it and email them back their grade and comments. I've played around with developing a few grading frameworks over the years (a few years back, one of the big intro courses kept the grades in an Excel file on a Samba share, and I wrote a script to automatically sync local comma-separated-values data with that spreadsheet. Yuck :p), so I figured this was my chance to polish up some old scripts into a sensible system to help me stay organized. This system is pygrader.
During the polishing phase, I was searching around looking for prior art ;), and found that Alex Heitzmann had already created pygrade, which is the name under which I had originally developed my own project. While they are both grade databases written in Python, Alex's project focuses on providing a more integrated grading environment.
Pygrader accepts assignment submissions from students through its mailpipe command, which you can run on your email inbox (or from procmail). Students submit assignments with an email subject like
[submit] <assignment name>
mailpipe automatically drops the submissions into a student/assignment/mail mailbox, extracts any MIME attachments into the student/assignment/ directory (without clobbering, with proper timestamps), and leaves you to get to work.
Pygrader also supports multiple graders through the mailpipe command. The other graders can request a student's submission(s) with an email subject like
[get] <student name>, <assignment name>
Then they can grade the submission and mail the grade back with an email subject like
[grade] <student name>, <assignment name>
The grade-altering messages are also stored in the student/assignment/mail mailbox, so you can peruse them later.
Pygrader doesn't spawn editors or GUIs to help you browse through submissions or assign grades. As far as I am concerned, this is a good thing.
When you're done grading, pygrader can email (the email command) your grades and comments back to the students, signing or encrypting with pgp-mime if either party has configured a PGP key. It can also email a tab-delimited table of grades to the professors to keep them up to speed. If you're running mailpipe via procmail, responses to grade requests are sent automatically.
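The procmail hookup is a one-recipe affair, something like the following (the exact mailpipe invocation here is a guess at the entry point, so check the README for the real command line):

# hand every incoming message to pygrader's mailpipe
# (hypothetical invocation; see the README for the real one)
:0
| pygrader mailpipe --base-dir=$HOME/grades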
While you're grading, pygrader can search for ungraded assignments, or for grades that have not yet been sent to students (the todo command). It can also check for resubmissions, where new submissions come in response to earlier grades.
The README is posted on the PyPI page.
Available in a git repository.
Repository: update-copyright
Browsable repository: update-copyright
Author: W. Trevor King
A few years ago I was getting tired of having missing or out-of-date copyright blurbs in packages that I was involved with (old license text, missing authors, etc.). This is important stuff, but not the kind of thing that is fun to maintain by hand. I wrote a script for bugs everywhere that automated the process, using the version control system to extract lists of authors and dates for each file. The script was great, so I ported it into a few other projects I was involved in.
This month I realized that it would be much easier to just break the script out into its own package, and only maintain a config file in each of the projects that use it. I don't know why this didn't occur to me years ago :p. Anyhow, here it is! Enjoy.
The README, with usage details, is posted on the PyPI page.
Despite some Apache comments to the contrary, it is possible to use Apache to host several SSL/TLS hosts on the same IP/port combination. The key is Server Name Indication (SNI), in which the client explicitly indicates the host name to which it wants to connect.
All you really need to use SNI is an up-to-date version of GnuTLS or OpenSSL. Your clients will be fine with any major browser written in the last few years.
For details on SNI-support, see the Apache Wiki and the Gentoo wiki.
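For reference, here's a minimal sketch of two TLS vhosts sharing one IP/port (the host names and certificate paths are placeholders; Apache 2.2 also needs the NameVirtualHost line, which 2.4 later dropped):

Listen 443
NameVirtualHost *:443

<VirtualHost *:443>
    ServerName www.example.org
    SSLEngine on
    SSLCertificateFile /etc/ssl/www.example.org.crt
    SSLCertificateKeyFile /etc/ssl/www.example.org.key
</VirtualHost>

<VirtualHost *:443>
    ServerName www.example.net
    SSLEngine on
    SSLCertificateFile /etc/ssl/www.example.net.crt
    SSLCertificateKeyFile /etc/ssl/www.example.net.key
</VirtualHost>

With SNI, Apache picks the right vhost (and certificate) by matching the client's requested name against each ServerName, rather than falling back to the first *:443 vhost.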