Posts about software versioned in Git, and about Git itself.

Relative submodules

I like Git submodules quite a bit, but they often get a bad rap. Most of the problems involve bad git hygiene (e.g. not developing in feature branches) or limitations in the current submodule implementation (e.g. it's hard to move submodules). Other problems involve not being able to fetch submodules with git:// URLs (due to restrictive firewalls).

This last case is easily solved by using relative submodule URLs in .gitmodules. I've been through the relative-vs.-absolute URL argument a few times now, so I thought I'd write up my position for future reference. I prefer the relative URL in:

[submodule "some-name"]
  path = some/path
  url = ../submod-repo.git

to the absolute URL in:

[submodule "some-name"]
  path = some/path
  url = git://example.net/submod-repo.git

Arguments in favor of relative URLs:

  • Users get submodules over their preferred transport (ssh://, git://, https://, …). Whatever transport you used to clone the superproject is recycled when you run git submodule init to set the submodule URLs in your .git/config (see the example after this list).
  • No need to tweak .gitmodules if you mirror (or move) your superproject Git hosting somewhere else (e.g. from example.net to elsewhere.com).
  • As a special case of the mirror/move situation, there's no need to tweak .gitmodules in long-term forks. If I set up a local version of the project and host it on my local box, my lab-mates can clone my local superproject and use my local submodules without my having to alter .gitmodules. Reducing trivial differences between forks makes collaboration on substantive changes more likely.
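
For example (the superproject name and host are hypothetical), cloning the superproject over ssh:// and running git submodule init resolves ../submod-repo.git against the URL you cloned from:

$ git clone ssh://example.net/super.git
$ cd super
$ git submodule init
$ git config submodule.some-name.url
ssh://example.net/submod-repo.git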

The only argument I've heard in favor of absolute URLs is Brian Granger's GitHub workflow:

  • If a user forks upstream/repo to username/repo and then clones their fork for local work, relative submodule URLs will not work until they also fork the submodules into username/.

This workflow needs absolute URLs:

But relative URLs are fine if you also fork the submodule(s):

Personally, I only create a public repository (username/repo) after cloning the central repository (upstream/repo). Several projects I contribute to (such as Git itself) prefer changes via send-email, in which case there is no need for contributors to create public repositories at all. Relative URLs are also fine here:

Once you understand the trade-offs, picking absolute/relative is just a political/engineering decision. I don't see any benefit to the absolute-URL-only repo relationship, so I favor relative URLs. The IPython folks felt that too many devs already used the absolute-URL-only relationship, and that the relative-URL benefits were not worth the cost of retraining those developers.

rss2email

Available in a git repository.
Repository: rss2email
Browsable repository: rss2email
Author: W. Trevor King

Since November 2012 I've been maintaining rss2email, a package that converts RSS or Atom feeds to email so you can follow them with your mail user agent. Rss2email was created by the late Aaron Swartz and maintained for several years by Lindsey Smith. I've added a mailing list (hosted with mlmmj) and a PyPI package, and made the GitHub location the project homepage.

Overall, setting up the standard project infrastructure has been fun, and it's nice to see interest in the newly streamlined code picking up. The timing also works out well, since the demise of Google Reader may push some talented folks in our direction. I'm not sure how visible rss2email is, especially the fresh development locations, hence this post ;). If you know anyone who might be interested in using (or contributing to!) rss2email, please pass the word.

catalyst

Available in a git repository.
Repository: catalyst-swc
Browsable repository: catalyst-swc
Author: W. Trevor King

Catalyst is a release-building tool for Gentoo. If you use Gentoo and want to roll your own live CD or bootable USB drive, this is the way to go. As I've been wrapping my head around catalyst, I've been pushing my notes upstream. This post builds on those notes to discuss the construction of a bootable ISO for Software Carpentry boot camps.

Getting a patched up catalyst

Catalyst has been around for a while, but the user base has been fairly small. If you try to do something that Gentoo's Release Engineering team doesn't do on a regular basis, built-in catalyst support can be spotty. There have been a fair number of patch submissions on gentoo-catalyst@ recently, but patch acceptance can be slow. For the SWC ISO, I applied versions of the following patches (or patch series) to 37540ff:

Configuring catalyst

The easiest way to run catalyst from a Git checkout is to set up a local config file. I didn't have enough hard drive space on my local system (~16 GB) for this build, so I set things up in a temporary directory on an external hard drive:

$ cat catalyst.conf | grep -v '^#\|^$'
digests="md5 sha1 sha512 whirlpool"
contents="auto"
distdir="/usr/portage/distfiles"
envscript="/etc/catalyst/catalystrc"
hash_function="crc32"
options="autoresume kerncache pkgcache seedcache snapcache"
portdir="/usr/portage"
sharedir="/home/wking/src/catalyst"
snapshot_cache="/mnt/d/tmp/catalyst/snapshot_cache"
storedir="/mnt/d/tmp/catalyst"

I used the default values for everything except sharedir, snapshot_cache, and storedir. Then I cloned the catalyst-swc repository into /mnt/d/tmp/catalyst.

Portage snapshot and a seed stage

Take a snapshot of the current Portage tree:

# catalyst -c catalyst.conf --snapshot 20130208

Download a seed stage3 from a Gentoo mirror:

# wget -O /mnt/d/tmp/catalyst/builds/default/stage3-i686-20121213.tar.bz2 \
>   http://distfiles.gentoo.org/releases/x86/current-stage3/stage3-i686-20121213.tar.bz2

Building the live CD

# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage1-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage2-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage3-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-livecd-stage1-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-livecd-stage2-i686-2013.1.spec

isohybrid

To make the ISO bootable from a USB drive, I used isohybrid:

# cp swc-x86.iso swc-x86-isohybrid.iso
# isohybrid swc-x86-isohybrid.iso

You can install the resulting ISO on a USB drive with:

# dd if=swc-x86-isohybrid.iso of=/dev/sdX

replacing X with the appropriate drive letter for your USB drive.

With versions of catalyst after d1c2ba9, the isohybrid call is built into catalyst's ISO construction.

Mutt-LDAP

Available in a git repository.
Repository: mutt-ldap
Browsable repository: mutt-ldap
Author: W. Trevor King

I wrote this Python script to query an LDAP server for addresses from Mutt. In December 2012, I got some patches from Wade Berrier and Niels de Vos. Anything interesting enough for others to hack on deserves its own repository, so I pulled it out of my blog repository (linked above, and mirrored on GitHub).
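
To give a feel for how it hooks into Mutt: address lookups go through Mutt's query_command setting, so a minimal ~/.muttrc entry (assuming the script is installed on your PATH as mutt_ldap.py; adjust the name to match your install) looks something like:

set query_command = "mutt_ldap.py '%s'"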

The README is posted on the PyPI page.

Reading IGOR files from Python

Available in a git repository.
Repository: igor
Browsable repository: igor
Author: W. Trevor King

This is the home page for the igor package, Python modules for reading files written by WaveMetrics IGOR Pro. Note that if you're designing a system, HDF5 is almost certainly a better choice for your data file format than IBW or PXP. This package exists for those of you whose data is already stuck in an IGOR format.

History

When I joined Prof. Yang's lab, there was a good deal of data analysis code written in IGOR, and a bunch of old data saved in IGOR binary wave (IBW) and packed experiment (PXP) files. I don't use MS Windows, so I don't run IGOR, but I still needed a way to get at the data. Luckily, the WaveMetrics folks publish some useful notes which explain the fundamentals of these two file formats (TN003 for IBW and PTN003 for PXP). The formats are goofy, but strings pulls out enough meat to figure out what's going on.

For a while I used an IBW → ASCII reader that I coded up in C, but when I joined the Hooke project during the winter of 2009–2010, I translated the reader into Python to support the drivers for data from Asylum Research's MFP-* and related microscopes. This scratched my itch for a few years.

Fast forward to 2012, and for the first time I needed to extract data from a PXP file. Since my Python code only supported IBWs, I searched around and found igor.py by Paul Kienzle and Merlijn van Deen. They had a PXP reader, but no reader for stand-alone IBW files. I decided to merge the two projects, so I split my reader out of the Hooke repository and hacked up the Git repository referenced above. Now it's easy to get a hold of all that useful metadata in a hurry. No writing ability yet, but I don't know why you'd want to move data that direction anyway ;).

Parsing dynamic structures with Python

The IGOR file formats rely on lots of shenanigans with C structs. To meld all the structures together in a natural way, I've extended Python's standard struct library to support arbitrary nesting and dynamic fields. Take a look at igor.struct for some examples. This framework makes it easy to load data from structures like:

struct vector {
  unsigned int length;
  short data[length];
};

With the standard struct module, you'd read this using the functional approach:

>>> import struct
>>> buffer = b'\x00\x00\x00\x02\x01\x02\x03\x04'
>>> length_struct = struct.Struct('>I')
>>> length = length_struct.unpack_from(buffer)[0]
>>> data = struct.unpack_from('>' + 'h'*length, buffer, length_struct.size)
>>> print(data)
(258, 772)

This obviously works, but keeping track of the offsets, byte ordering, etc. can be tedious. My igor.struct package allows you to use a more object-oriented approach:

>>> from pprint import pprint
>>> from igor.struct import Field, DynamicField, DynamicStructure
>>> class DynamicLengthField (DynamicField):
...     def pre_pack(self, parents, data):
...         "Set the 'length' value to match the data before packing"
...         vector_structure = parents[-1]
...         vector_data = self._get_structure_data(
...             parents, data, vector_structure)
...         length = len(vector_data['data'])
...         vector_data['length'] = length
...         data_field = vector_structure.get_field('data')
...         data_field.count = length
...         data_field.setup()
...     def post_unpack(self, parents, data):
...         "Adjust the expected data count to match the 'length' value"
...         vector_structure = parents[-1]
...         vector_data = self._get_structure_data(
...             parents, data, vector_structure)
...         length = vector_data['length']
...         data_field = vector_structure.get_field('data')
...         data_field.count = length
...         data_field.setup()
>>> dynamic_length_vector = DynamicStructure('vector',
...     fields=[
...         DynamicLengthField('I', 'length'),
...         Field('h', 'data', count=0, array=True),
...         ],
...     byte_order='>')
>>> vector = dynamic_length_vector.unpack(buffer)
>>> pprint(vector)
{'data': array([258, 772]), 'length': 2}

While this is overkill for such a simple example, it scales much more cleanly than an approach using the standard struct module. The main benefit is that you can use Structure instances as format specifiers for Field instances. This means that you could specify a C structure like:

struct vectors {
  unsigned int length;
  struct vector data[length];
};

With:

>>> dynamic_length_vectors = DynamicStructure('vectors',
...     fields=[
...         DynamicLengthField('I', 'length'),
...         Field(dynamic_length_vector, 'data', count=0, array=True),
...         ],
...     byte_order='>')

The C code you're mimicking probably only uses a handful of dynamic approaches. Once you've written classes to handle each of them, it is easy to translate arbitrarily complex nested C structures into Python representations.

The pre-pack and post-unpack hooks also give you a convenient place to translate between some C struct's funky format and Python's native types. You take care of all that when you define the structure, and then any part of your software that uses the structure gets the native version automatically.

curses_check_for_keypress

Available in a git repository.
Repository: curses-check-for-keypress
Browsable repository: curses-check-for-keypress
Author: W. Trevor King

There are some points in my experiment control code where the program does something for an arbitrary length of time (e.g., waits while the operator manually adjusts a laser's alignment). For these situations, I wanted to be able to loop until the user pressed a key. This is a simple enough idea, but the implementation turned out to be complicated enough for me to spin it out as a stand-alone module.
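
The core of the idea (this is just a sketch of the pattern, not the module's actual API) is a non-blocking getch() loop:

import curses

def run_until_keypress(step):
    "Call step() repeatedly until the user presses any key."
    screen = curses.initscr()
    try:
        curses.cbreak()           # deliver keypresses immediately, no Enter needed
        screen.nodelay(True)      # make getch() non-blocking
        while screen.getch() == -1:   # -1 means no key is pending yet
            step()
    finally:
        curses.nocbreak()
        curses.endwin()           # always restore the terminal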

pyassuan

Available in a git repository.
Repository: pyassuan
Browsable repository: pyassuan
Author: W. Trevor King

I've been trying to come up with a clean way to verify detached PGP signatures from Python. There are a number of existing approaches to this problem. Many of them call gpg using Python's multiprocessing or subprocess modules, but to verify detached signatures, you need to send the signature in on a separate file descriptor, and handling that in a way safe from deadlocks is difficult. The other approach, taken by PyMe, is to wrap GPGME using SWIG, which is great as far as it goes, but development seems to have stalled, and I find the raw GPGME interface excessively complicated.
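
To make the deadlock problem concrete, here is a rough sketch (the helper is hypothetical; Python 3 for pass_fds) of the file-descriptor juggling that the subprocess approach ends up requiring. It only stays deadlock-free as long as the signature fits in the kernel's pipe buffer:

import os
import subprocess

def verify_detached(signature, data):
    """Hypothetical helper: verify data (bytes) against a detached
    signature (bytes) by shelling out to gpg.  The signature goes in on
    an extra file descriptor (gpg's -&N special filename) while the
    signed data arrives on stdin."""
    sig_read, sig_write = os.pipe()
    gpg = subprocess.Popen(
        ['gpg', '--enable-special-filenames', '--verify',
         '-&{}'.format(sig_read), '-'],
        stdin=subprocess.PIPE, pass_fds=(sig_read,))
    os.close(sig_read)              # the child holds its own copy now
    os.write(sig_write, signature)  # blocks (deadlock!) if the signature
    os.close(sig_write)             # outgrows the kernel's pipe buffer
    gpg.communicate(data)
    return gpg.returncode == 0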

The GnuPG tools themselves often communicate over sockets using the Assuan protocol, and I'd already written an Assuan server to handle pinentry (originally for my gpg-agent post, not part of pyassuan). I thought it would be natural if there were a gpgme-agent that handled cryptographic tasks over this protocol, which would make the pgp-mime implementation easier. It turns out that there already is such an agent (gpgme-tool), so I turned my pinentry script into the more general pyassuan package. Now using Assuan from Python should be as easy as (or easier than?) using it from C via libassuan.

The README is posted on the PyPI page.

pygrader

Available in a git repository.
Repository: pygrader
Browsable repository: pygrader
Author: W. Trevor King

The last two courses I've TA'd at Drexel have been scientific computing courses where the students are writing code to solve homework problems. When they're done, they email the homework to me, and I grade it and email them back their grade and comments. I've played around with developing a few grading frameworks over the years (a few years back, one of the big intro courses kept the grades in an Excel file on a Samba share, and I wrote a script to automatically sync local comma-separated-value data with that spreadsheet. Yuck :p), so I figured this was my chance to polish up some old scripts into a sensible system to help me stay organized. This system is pygrader.

During the polishing phase, I was searching around looking for prior art ;), and found that Alex Heitzmann had already created pygrade, which is the name under which I had originally developed my own project. While they are both grade databases written in Python, Alex's project focuses on providing a more integrated grading environment.

Pygrader accepts assignment submissions from students through its mailpipe command, which you can run on your email inbox (or from procmail). Students submit assignments with an email subject like

[submit] <assignment name>

mailpipe automatically drops the submissions into a student/assignment/mail mailbox, extracts any MIME attachments into the student/assignment/ directory (without clobbering existing files, and with proper timestamps), and leaves you to get to work.

Pygrader also supports multiple graders through the mailpipe command. The other graders can request a student's submission(s) with an email subject like

[get] <student name>, <assignment name>

Then they can grade the submission and mail the grade back with an email subject like

[grade] <student name>, <assignment name>

The grade-altering messages are also stored in the student/assignment/mail mailbox, so you can peruse them later.

Pygrader doesn't spawn editors or GUIs to help you browse through submissions or assign grades. As far as I am concerned, this is a good thing.

When you're done grading, pygrader can email (email) your grades and comments back to the students, signing or encrypting with pgp-mime if either party has configured a PGP key. It can also email a tab-delimited table of grades to the professors to keep them up to speed. If you're running mailpipe via procmail, responses to grade requests are sent automatically.

While you're grading, pygrader can search for ungraded assignments, or for grades that have not yet been sent to students (todo). It can also check for resubmissions, where new submissions come in response to earlier grades.

The README is posted on the PyPI page.

update-copyright

Available in a git repository.
Repository: update-copyright
Browsable repository: update-copyright
Author: W. Trevor King

A few years ago I was getting tired of having missing or out-of-date copyright blurbs in packages that I was involved with (old license text, missing authors, etc.). This is important stuff, but not the kind of thing that is fun to maintain by hand. I wrote a script for bugs everywhere that automated the process, using the version control system to extract lists of authors and dates for each file. The script was great, so I ported it into a few other projects I was involved in.

This month I realized that it would be much easier to just break the script out into its own package, and only maintain a config file in each of the projects that use it. I don't know why this didn't occur to me years ago :p. Anyhow, here it is! Enjoy.

The README, with usage details, is posted on the PyPI page.

Serving Git on Gentoo

Today I decided to host all my public Git repositories on my Gentoo server. Here's a quick summary of what I did.

Gitweb

Re-emerge git with the cgi USE flag enabled.

# echo "dev-util/git cgi" >> /etc/portage/package.use/webserver
# emerge -av git

Create a virtual host for running gitweb:

# cat > /etc/apache2/vhosts.d/20_git.example.net_vhost.conf << EOF
<VirtualHost *:80>
    ServerName git.example.net
    DocumentRoot /usr/share/gitweb
    <Directory /usr/share/gitweb>
        Allow from all
        AllowOverride all
        Order allow,deny
        Options ExecCGI
        <Files gitweb.cgi>
            SetHandler cgi-script
        </Files>
    </Directory>
    DirectoryIndex gitweb.cgi
    SetEnv  GITWEB_CONFIG  /etc/gitweb.conf
</VirtualHost>
EOF

Tell gitweb where you keep your repos:

# echo "\$projectroot = '/var/git';" > /etc/gitweb.conf

Tell gitweb where people can pull your repos from:

# echo "@git_base_url_list = ( 'git://example.net', );" >> /etc/gitweb.conf

Restart Apache:

# /etc/init.d/apache2 restart

Add the virtual host to your DNS server.

# emacs /etc/bind/pri/example.net.internal
...
git  A  192.168.0.2
...

Restart the DNS server.

# /etc/init.d/named restart

If names aren't showing up in the Owner column, you can add them to the user's /etc/passwd comment with

# usermod -c 'John Doe' jdoe

Thanks to Phil Sergi for his own summary, which I've borrowed from heavily.

Git daemon

Gitweb allows browsing repositories via HTTP, but if you will be pulling from your repositories using the git:// protocol, you'll also want to run git-daemon. On Gentoo this is really easy: just edit /etc/conf.d/git-daemon as you see fit. I added --verbose, --base-path=/var/git, and --export-all to GITDAEMON_OPTS.
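
With those additions the relevant line ends up looking something like this (a sketch; your stock conf.d file may already set other options you'll want to keep):

GITDAEMON_OPTS="--verbose --base-path=/var/git --export-all"

Start the daemon with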

# /etc/init.d/git-daemon start

Add it to your default runlevel with

# rc-update add git-daemon default

If you're logging to syslog and running syslog-ng, you can configure the log location using the usual syslog tricks. See my syslog-ng post for details.