Posts about software versioned in Git, or about Git itself.
I like Git submodules quite a bit, but they often get a bad
rap. Most of the problems involve bad git hygiene (e.g. not
developing in feature branches) or limitations in the current
submodule implementation (e.g. it's hard to move submodules). Other
problems involve not being able to fetch submodules with git://
URLs (due to restrictive firewalls).
This last case is easily solved by using relative submodule URLs in .gitmodules. I've been through the relative-vs.-absolute URL argument a few times now, so I thought I'd write up my position for future reference. I prefer the relative URL in:
[submodule "some-name"]
path = some/path
url = ../submod-repo.git
to the absolute URL in:
[submodule "some-name"]
path = some/path
url = git://example.net/submod-repo.git
Arguments in favor of relative URLs:
- Users get submodules over their preferred transport (ssh://, git://, https://, …). Whatever transport you used to clone the superproject will be recycled when you use submodule init to set submodule URLs in your .git/config (see the sketch after this list).
- No need to tweak .gitmodules if you mirror (or move) your superproject Git hosting somewhere else (e.g. from example.net to elsewhere.com).
- As a special case of the mirror/move situation, there's no need to tweak .gitmodules in long-term forks. If I set up a local version of the project and host it on my local box, my lab-mates can clone my local superproject and use my local submodules without my having to alter .gitmodules. Reducing trivial differences between forks makes collaboration on substantive changes more likely.
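To make the first argument concrete, here's a rough sketch of the resolution rule. This is purely illustrative Python (the resolve_submodule_url helper and example URLs are mine), not Git's actual implementation:

import posixpath
from urllib.parse import urlsplit, urlunsplit

def resolve_submodule_url(superproject_url, relative_url):
    """Resolve a relative .gitmodules URL against the URL the
    superproject was cloned from (simplified illustration)."""
    scheme, netloc, path, query, fragment = urlsplit(superproject_url)
    # Each leading ../ strips one component from the superproject's path.
    path = posixpath.normpath(posixpath.join(path, relative_url))
    return urlunsplit((scheme, netloc, path, query, fragment))

# Whatever transport you cloned the superproject over gets recycled:
print(resolve_submodule_url('git://example.net/superproject.git',
                            '../submod-repo.git'))
# git://example.net/submod-repo.git
print(resolve_submodule_url('ssh://user@example.net/superproject.git',
                            '../submod-repo.git'))
# ssh://user@example.net/submod-repo.git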
The only argument I've heard in favor of absolute URLs is Brian Granger's GitHub workflow:
- If a user forks upstream/repo to username/repo and then clones their fork for local work, relative submodule URLs will not work until they also fork the submodules into username/.
This workflow needs absolute URLs, so that the fork's submodule URLs keep pointing at the original host. But relative URLs are fine if you also fork the submodule(s) into username/.
Personally, I only create a public repository (username/repo) after cloning the central repository (upstream/repo). Several projects I contribute to (such as Git itself) prefer changes via send-email, in which case there is no need for contributors to create public repositories at all. Relative URLs are also fine here, since contributors clone (and resolve submodule URLs against) the central repository itself.
Once you understand the trade-offs, picking absolute/relative is just a political/engineering decision. I don't see any benefit to the absolute-URL-only repo relationship, so I favor relative URLs. The IPython folks felt that too many devs already used the absolute-URL-only relationship, and that the relative-URL benefits were not worth the cost of retraining those developers.
Available in a git repository.
Repository: rss2email
Browsable repository: rss2email
Author: W. Trevor King
Since November 2012 I've been maintaining rss2email, a package that converts RSS or Atom feeds to email so you can follow them with your mail user agent. Rss2email was created by the late Aaron Swartz and maintained for several years by Lindsey Smith. I've added a mailing list (hosted with mlmmj) and PyPI package and made the GitHub location the homepage.
Overall, setting up the standard project infrastructure has been fun, and it's nice to see interest in the newly streamlined code picking up. The timing also works out well, since the demise of Google Reader may push some talented folks in our direction. I'm not sure how visible rss2email is, especially the fresh development locations, hence this post ;). If you know anyone who might be interested in using (or contributing to!) rss2email, please pass the word.
Available in a git repository.
Repository: catalyst-swc
Browsable repository: catalyst-swc
Author: W. Trevor King
Catalyst is a release-building tool for Gentoo. If you use Gentoo and want to roll your own live CD or bootable USB drive, this is the way to go. As I've been wrapping my head around catalyst, I've been pushing my notes upstream. This post builds on those notes to discuss the construction of a bootable ISO for Software Carpentry boot camps.
Getting a patched up catalyst
Catalyst has been around for a while, but the user base has been fairly small. If you try to do something that Gentoo's Release Engineering team doesn't do on a regular basis, built-in catalyst support can be spotty. There have been a fair number of patch submissions on gentoo-catalyst@ recently, but patch acceptance can be slow. For the SWC ISO, I applied versions of the following patches (or patch series) to 37540ff:
- chmod +x all sh scripts so they can run from the git checkout
- livecdfs-update.sh: Set XSESSION in /etc/env.d/90xsession
- Fix livecdfs-update.sh startx handling
Configuring catalyst
The easiest way to run catalyst from a Git checkout is to set up a local config file. I didn't have enough hard drive space on my local system (~16 GB) for this build, so I set things up in a temporary directory on an external hard drive:
$ cat catalyst.conf | grep -v '^#\|^$'
digests="md5 sha1 sha512 whirlpool"
contents="auto"
distdir="/usr/portage/distfiles"
envscript="/etc/catalyst/catalystrc"
hash_function="crc32"
options="autoresume kerncache pkgcache seedcache snapcache"
portdir="/usr/portage"
sharedir="/home/wking/src/catalyst"
snapshot_cache="/mnt/d/tmp/catalyst/snapshot_cache"
storedir="/mnt/d/tmp/catalyst"
I used the default values for everything except sharedir, snapshot_cache, and storedir. Then I cloned the catalyst-swc repository into /mnt/d/tmp/catalyst.
Portage snapshot and a seed stage
Take a snapshot of the current Portage tree:
# catalyst -c catalyst.conf --snapshot 20130208
Download a seed stage3 from a Gentoo mirror:
# wget -O /mnt/d/tmp/catalyst/builds/default/stage3-i686-20121213.tar.bz2 \
> http://distfiles.gentoo.org/releases/x86/current-stage3/stage3-i686-20121213.tar.bz2
Building the live CD
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage1-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage2-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage3-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-livecd-stage1-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-livecd-stage2-i686-2013.1.spec
isohybrid
To make the ISO bootable from a USB drive, I used isohybrid:
# cp swc-x86.iso swc-x86-isohybrid.iso
# isohybrid swc-x86-isohybrid.iso
You can install the resulting ISO on a USB drive with:
# dd if=swc-x86-isohybrid.iso of=/dev/sdX
replacing X with the appropriate drive letter for your USB drive.
With versions of catalyst after d1c2ba9, the isohybrid call is built into catalyst's ISO construction.
Available in a git repository.
Repository: mutt-ldap
Browsable repository: mutt-ldap
Author: W. Trevor King
I wrote this Python script to query an LDAP server for addresses from Mutt. In December 2012, I got some patches from Wade Berrier and Niels de Vos. Anything interesting enough for others to hack on deserves its own repository, so I pulled it out of my blog repository (linked above, and mirrored on GitHub).
The README is posted on the PyPI page.
Available in a git repository.
Repository: igor
Browsable repository: igor
Author: W. Trevor King
This is the home page for the igor
package, Python modules for
reading files written by WaveMetrics IGOR Pro. Note that if
you're designing a system, HDF5 is almost certainly a better
choice for your data file format than IBW or PXP. This package exists
for those of you whose data is already stuck in an IGOR format.
History
When I joined Prof. Yang's lab, there was a good deal of data analysis code written in IGOR, and a bunch of old data saved in IGOR binary wave (IBW) and packed experiment (PXP) files. I don't use MS Windows, so I don't run IGOR, but I still needed a way to get at the data. Luckily, the WaveMetrics folks publish some useful notes which explain the fundamentals of these two file formats (TN003 for IBW and PTN003 for PXP). The notes themselves are in a goofy format, but strings pulls out enough meat to figure out what's going on.
For a while I used an IBW → ASCII reader that I coded up in C, but when I joined the Hooke project during the winter of 2009–2010, I translated the reader into Python to support the drivers for data from Asylum Research's MFP-* and related microscopes. This scratched my itch for a few years.
Fast forward to 2012, and for the first time I needed to extract data from a PXP file. Since my Python code only supported IBWs, I searched around and found igor.py by Paul Kienzle and Merlijn van Deen. They had a PXP reader, but no reader for stand-alone IBW files. I decided to merge the two projects, so I split my reader out of the Hooke repository and hacked up the Git repository referenced above. Now it's easy to get hold of all that useful metadata in a hurry. No writing ability yet, but I don't know why you'd want to move data in that direction anyway ;).
Parsing dynamic structures with Python
The IGOR file formats rely on lots of shenanigans with C structs.
To meld all the structures together in a natural way, I've extended
Python's standard struct library to support arbitrary nesting and
dynamic fields. Take a look at igor.struct for some
examples. This framework makes it easy to load data from structures
like:
struct vector {
unsigned int length;
short data[length];
};
With the standard struct
module, you'd read this using the
functional approach:
>>> import struct
>>> buffer = b'\x00\x00\x00\x02\x01\x02\x03\x04'
>>> length_struct = struct.Struct('>I')
>>> length = length_struct.unpack_from(buffer)[0]
>>> data = struct.unpack_from('>' + 'h'*length, buffer, length_struct.size)
>>> print(data)
(258, 772)
This obviously works, but keeping track of the offsets, byte ordering, etc. can be tedious. My igor.struct package allows you to use a more object-oriented approach:
>>> from pprint import pprint
>>> from igor.struct import Field, DynamicField, DynamicStructure
>>> class DynamicLengthField (DynamicField):
... def pre_pack(self, parents, data):
... "Set the 'length' value to match the data before packing"
... vector_structure = parents[-1]
... vector_data = self._get_structure_data(
... parents, data, vector_structure)
... length = len(vector_data['data'])
... vector_data['length'] = length
... data_field = vector_structure.get_field('data')
... data_field.count = length
... data_field.setup()
... def post_unpack(self, parents, data):
... "Adjust the expected data count to match the 'length' value"
... vector_structure = parents[-1]
... vector_data = self._get_structure_data(
... parents, data, vector_structure)
... length = vector_data['length']
... data_field = vector_structure.get_field('data')
... data_field.count = length
... data_field.setup()
>>> dynamic_length_vector = DynamicStructure('vector',
... fields=[
... DynamicLengthField('I', 'length'),
... Field('h', 'data', count=0, array=True),
... ],
... byte_order='>')
>>> vector = dynamic_length_vector.unpack(buffer)
>>> pprint(vector)
{'data': array([258, 772]), 'length': 2}
While this is overkill for such a simple example, it scales much more
cleanly than an approach using the standard struct
module. The main
benefit is that you can use Structure
instances as format specifiers
for Field
instances. This means that you could specify a C
structure like:
struct vectors {
unsigned int length;
struct vector data[length];
};
With:
>>> dynamic_length_vectors = DynamicStructure('vectors',
... fields=[
... DynamicLengthField('I', 'length'),
... Field(dynamic_length_vector, 'data', count=0, array=True),
... ],
... byte_order='>')
The C code you're mimicking probably only uses a handful of dynamic approaches. Once you've written classes to handle each of them, it is easy to translate arbitrarily complex nested C structures into Python representations.
The pre-pack and post-unpack hooks also give you a convenient place to translate between some C struct's funky format and Python's native types. You take care of all that when you define the structure, and then any part of your software that uses the structure gets the native version automatically.
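For example, here's a hypothetical sketch (TimestampField, mod_date, and flags are made up, not part of the igor package, but it assumes the same DynamicField hooks demonstrated above) that translates a C time_t field to and from a Python datetime:

>>> from datetime import datetime, timezone
>>> class TimestampField (DynamicField):
...     "Translate a C time_t (seconds since the epoch) to/from datetime"
...     def pre_pack(self, parents, data):
...         record = self._get_structure_data(parents, data, parents[-1])
...         record['mod_date'] = int(record['mod_date'].timestamp())
...     def post_unpack(self, parents, data):
...         record = self._get_structure_data(parents, data, parents[-1])
...         record['mod_date'] = datetime.fromtimestamp(
...             record['mod_date'], tz=timezone.utc)
>>> record_structure = DynamicStructure('record',
...     fields=[
...         TimestampField('I', 'mod_date'),
...         Field('h', 'flags'),
...         ],
...     byte_order='>')

Any code that unpacks with such a structure would see a datetime in the 'mod_date' slot instead of a raw integer, and packing would convert it back.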
Available in a git repository.
Repository: curses-check-for-keypress
Browsable repository: curses-check-for-keypress
Author: W. Trevor King
There are some points in my experiment control code where the program does something for an arbitrary length of time (e.g., waits while the operator manually adjusts a laser's alignment). For these situations, I wanted to be able to loop until the user pressed a key. This is a simple enough idea, but the implementation turned out to be complicated enough for me to spin it out as a stand-alone module.
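The core idea looks something like this minimal sketch, which uses Python's standard curses module directly rather than this module's actual API (the function name and message are hypothetical):

import curses
import time

def wait_for_keypress(stdscr):
    # getch() normally blocks; in nodelay mode it returns -1 immediately
    # when no key has been pressed.
    stdscr.nodelay(True)
    stdscr.addstr(0, 0, 'Adjust the laser alignment; press any key when done.')
    stdscr.refresh()
    while stdscr.getch() == -1:
        # ... carry on with the arbitrary-length task here ...
        time.sleep(0.1)

if __name__ == '__main__':
    curses.wrapper(wait_for_keypress)

Here curses.wrapper takes care of initializing the terminal and restoring it afterwards, even if the callback raises an exception.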
Available in a git repository.
Repository: pyassuan
Browsable repository: pyassuan
Author: W. Trevor King
I've been trying to come up with a clean way to verify detached PGP signatures from Python. There are a number of existing approaches to this problem. Many of them call gpg using Python's multiprocessing or subprocess modules, but to verify detached signatures, you need to send the signature in on a separate file descriptor, and handling that in a way safe from deadlocks is difficult. The other approach, taken by PyMe, is to wrap GPGME using SWIG, which is great as far as it goes, but development seems to have stalled, and I find the raw GPGME interface excessively complicated.
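For reference, the plain subprocess route looks something like the following sketch, which dodges the second file descriptor by writing the signature to a temporary file instead (the helper is mine, not part of any package mentioned here):

import subprocess
import tempfile

def verify_detached(data, signature):
    """Return True if `signature` (bytes) is a valid detached PGP
    signature for `data` (bytes), by shelling out to gpg."""
    with tempfile.NamedTemporaryFile(suffix='.sig') as sig_file:
        sig_file.write(signature)
        sig_file.flush()
        # 'gpg --verify <sigfile> -' reads the signed data from stdin.
        result = subprocess.run(
            ['gpg', '--batch', '--verify', sig_file.name, '-'],
            input=data, capture_output=True)
    return result.returncode == 0

This avoids the deadlock at the cost of a temporary file on disk.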
The GnuPG tools themselves often communicate over sockets using the
Assuan protocol, and I'd already written an Assuan server to
handle pinentry (originally for my gpg-agent post, not part of
pyassuan). I thought it would be natural if there were a gpgme-agent
which would handle cryptographic tasks over this protocol, which would
make the pgp-mime implementation easier. It turns out that there
already is such an agent (gpgme-tool), so I turned my pinentry
script into the more general pyassuan package. Now using Assuan from
Python should be as easy as (or easier than) using it from C via
libassuan.
The README is posted on the PyPI page.
Available in a git repository.
Repository: pygrader
Browsable repository: pygrader
Author: W. Trevor King
The last two courses I've TA'd at Drexel have been scientific computing courses where the students are writing code to solve homework problems. When they're done, they email the homework to me, and I grade it and email them back their grade and comments. I've played around with developing a few grading frameworks over the years (a few years back, one of the big intro courses kept the grades in an Excel file on a Samba share, and I wrote a script to automatically sync local comma-separated-value data with that spreadsheet. Yuck :p), so I figured this was my chance to polish up some old scripts into a sensible system to help me stay organized. This system is pygrader.
During the polishing phase, I was searching around looking for prior art ;), and found that Alex Heitzmann had already created pygrade, which is the name under which I had originally developed my own project. While they are both grade databases written in Python, Alex's project focuses on providing a more integrated grading environment.
Pygrader accepts assignment submissions from students through its mailpipe command, which you can run on your email inbox (or from procmail). Students submit assignments with an email subject like
[submit] <assignment name>
mailpipe automatically drops the submissions into a student/assignment/mail mailbox, extracts any MIME attachments into the student/assignment/ directory (without clobbering, with proper timestamps), and leaves you to get to work.
Pygrader also supports multiple graders through the mailpipe
command. The other graders can request a student's submission(s) with
an email subject like
[get] <student name>, <assignment name>
Then they can grade the submission and mail the grade back with an email subject like
[grade] <student name>, <assignment name>
The grade-altering messages are also stored in the
student/assignment/mail
mailbox, so you can peruse them later.
Pygrader doesn't spawn editors or GUIs to help you browse through submissions or assign grades. As far as I am concerned, this is a good thing.
When you're done grading, pygrader can email (email) your grades and comments back to the students, signing or encrypting with pgp-mime
comments back to the students, signing or encrypting with pgp-mime
if either party has configured a PGP key. It can also email a
tab-delimited table of grades to the professors to keep them up to
speed. If you're running mailpipe
via procmail, responses to grade requests are sent automatically.
While you're grading, pygrader can search for ungraded assignments, or
for grades that have not yet been sent to students (todo). It can
also check for resubmissions, where new submissions come in response
to earlier grades.
The README is posted on the PyPI page.
Available in a git repository.
Repository: update-copyright
Browsable repository: update-copyright
Author: W. Trevor King
A few years ago I was getting tired of having missing or out-of-date copyright blurbs in packages that I was involved with (old license text, missing authors, etc.). This is important stuff, but not the kind of thing that is fun to maintain by hand. I wrote a script for bugs everywhere that automated the process, using the version control system to extract lists of authors and dates for each file. The script was great, so I ported it into a few other projects I was involved in.
This month I realized that it would be much easier to just break the script out into its own package, and only maintain a config file in each of the projects that use it. I don't know why this didn't occur to me years ago :p. Anyhow, here it is! Enjoy.
The README, with usage details, is posted on the PyPI page.
Today I decided to host all my public Git repositories on my Gentoo server. Here's a quick summary of what I did.
Gitweb
Re-emerge git
with the cgi
USE flag enabled.
# echo "dev-util/git cgi" >> /etc/portage/package.use/webserver
# emerge -av git
Create a virtual host for running gitweb:
# cat > /etc/apache2/vhosts.d/20_git.example.net_vhost.conf << EOF
<VirtualHost *:80>
ServerName git.example.net
DocumentRoot /usr/share/gitweb
<Directory /usr/share/gitweb>
Allow from all
AllowOverride all
Order allow,deny
Options ExecCGI
<Files gitweb.cgi>
SetHandler cgi-script
</Files>
</Directory>
DirectoryIndex gitweb.cgi
SetEnv GITWEB_CONFIG /etc/gitweb.conf
</VirtualHost>
EOF
Tell gitweb
where you keep your repos:
# echo "\$projectroot = '/var/git';" > /etc/gitweb.conf
Tell gitweb
where people can pull your repos from:
# echo "@git_base_url_list = ( 'git://example.net', ); >> /etc/gitweb.conf
Restart Apache:
# /etc/init.d/apache2 restart
Add the virtual host to your DNS server.
# emacs /etc/bind/pri/example.net.internal
...
git A 192.168.0.2
...
Restart the DNS server.
# /etc/init.d/named restart
If names aren't showing up in the Owner
column, you can add them to
the user's /etc/passwd
comment with
# usermod -c 'John Doe' jdoe
Thanks to Phil Sergi for his own summary, which I've borrowed from heavily.
Git daemon
Gitweb allows browsing repositories via HTTP, but if you will be pulling from your repositories using the git:// protocol, you'll also want to run git-daemon. On Gentoo, this is really easy: just edit /etc/conf.d/git-daemon as you see fit. I added --verbose, --base-path=/var/git, and --export-all to GITDAEMON_OPTS. Start the daemon with
# /etc/init.d/git-daemon start
Add it to your default runlevel with
# rc-update add git-daemon default
If you're logging to syslog and running syslog-ng, you can configure the log location using the usual syslog tricks. See my syslog-ng post for details.