Some software I've found useful in my research, occasionally with personal forks, but no major time investments. When I've put in some more serious work, the appropriate tag is code.

Factor analysis

I've been trying to wrap my head around factor analysis as a theory for designing and understanding test and survey results. This has turned out to be another one of those fields where the going has been a bit rough. I think the key factors in making these older topics difficult are:

• “Everybody knows this, so we don't need to write up the details.”
• “Hey, I can do better than Bob if I just tweak this knob…”

The resulting discussion ends up being overly complicated, and it's hard for newcomers to decide if people using similar terminology are in fact talking about the same thing.

Some of the better open sources for background have been Tucker and MacCallum's “Exploratory Factor Analysis” manuscript and Max Welling's notes. I'll use Welling's terminology for this discussion.

The basic idea of factor analysis is to model $d$ measurable attributes as generated by $k$ common factors and $d$ unique factors. With $d=4$ and $k=2$, you get something like:

Corresponding to the equation (Welling's eq. 1):

(1)$x=Ay+\mu +\nu$

The independent random variables $y$ are distributed according to a Gaussian with zero mean and unit variance ${𝒢}_{y}\left[0,I\right]$ (zero mean because constant offsets are handled by $\mu$; unit variance because scaling is handled by $A$). The independent random variables $\nu$ are distributed according to ${𝒢}_{\nu }\left[0,\Sigma \right]$, with (Welling's eq. 2):

(2)$\Sigma \equiv \text{diag}\left[{\sigma }_{1}^{2},\dots ,{\sigma }_{d}^{2}\right]$

Because the only source of constant offset is $\mu$, we can calculate it by averaging out the random noise (Welling's eq. 6):

(3)$\mu =\frac{1}{N}\sum _{n=1}^{N}{x}_{n}$

where $N$ is the number of measurements (survey responders) and ${x}_{n}$ is the response vector for the ${n}^{\text{th}}$ responder.

How do we find $A$ and $\Sigma$? This is the tricky bit, and there are a number of possible approaches. Welling suggests using expectation maximization (EM), and there's an excellent example of the procedure with a colorblind experimenter drawing colored balls in his EM notes (to test my understanding, I wrote color-ball.py).

To simplify calculations, Welling defines (before eq. 15):

(4)$\begin{array}{rl}A\prime & \equiv \left[A,\mu \right]\\ y\prime & \equiv \left[{y}^{T},1{\right]}^{T}\end{array}$

which reduces the model to

(5)$x=A\prime y\prime +\nu$

After some manipulation Welling works out the maximizing updates (eq'ns 16 and 17):

(6)$\begin{array}{rl}A{\prime }^{\text{new}}& =\left(\sum _{n=1}^{N}{x}_{n}E\left[y\prime \mid {x}_{n}{\right]}^{T}\right){\left(\sum _{n=1}^{N}E\left[y\prime y{\prime }^{T}\mid {x}_{n}\right]\right)}^{-1}\\ {\Sigma }^{\text{new}}& =\frac{1}{N}\sum _{n=1}^{N}\text{diag}\left[{x}_{n}{x}_{n}^{T}-A{\prime }^{\text{new}}E\left[y\prime \mid {x}_{n}\right]{x}_{n}^{T}\right]\end{array}$

The expectation values used in these updates are given by (Welling's eq'ns 12 and 13):

(7)$\begin{array}{rl}E\left[y\mid {x}_{n}\right]& ={A}^{T}\left(A{A}^{T}+\Sigma {\right)}^{-1}\left({x}_{n}-\mu \right)\\ E\left[y{y}^{T}\mid {x}_{n}\right]& =I-{A}^{T}\left(A{A}^{T}+\Sigma {\right)}^{-1}A+E\left[y\mid {x}_{n}\right]E\left[y\mid {x}_{n}{\right]}^{T}\end{array}$
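To make the updates concrete, here is a minimal NumPy sketch of one EM iteration built from equations (6) and (7). It is an illustration, not Welling's or MDP's implementation: the function name `em_step`, the synthetic data, and the choice to store $\Sigma$ as a vector of its diagonal entries are all assumptions. The matrix inverted in the $A\prime$ update is $\sum_n E[y\prime y\prime^T\mid x_n]$, which has the $(k+1)\times(k+1)$ shape the product needs.

```python
import numpy as np


def em_step(x, A, mu, sigma):
    """One EM update for factor analysis.

    x: (N, d) responses; A: (d, k) weights; mu: (d,) offsets;
    sigma: (d,) diagonal entries of Sigma.
    """
    N, d = x.shape
    k = A.shape[1]
    # E step (eq. 7): beta = A^T (A A^T + Sigma)^-1
    beta = A.T @ np.linalg.inv(A @ A.T + np.diag(sigma))
    Ey = (x - mu) @ beta.T              # row n holds E[y | x_n]
    cov_y = np.eye(k) - beta @ A        # I - A^T (A A^T + Sigma)^-1 A
    # Augment with the constant 1 from y' = [y^T, 1]^T (eq. 4)
    Ey1 = np.hstack([Ey, np.ones((N, 1))])
    sum_Eyy1 = np.empty((k + 1, k + 1))
    sum_Eyy1[:k, :k] = N * cov_y + Ey.T @ Ey
    sum_Eyy1[:k, k] = sum_Eyy1[k, :k] = Ey.sum(axis=0)
    sum_Eyy1[k, k] = N
    # M step (eq. 6)
    A1 = (x.T @ Ey1) @ np.linalg.inv(sum_Eyy1)      # A' = [A, mu]
    sigma_new = np.diag(x.T @ x - A1 @ (Ey1.T @ x)) / N
    return A1[:, :k], A1[:, k], sigma_new


# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))
A0 = rng.normal(size=(4, 2))
A_new, mu_new, sigma_new = em_step(x, A0, x.mean(axis=0), x.var(axis=0))
```

If you start with $\mu$ equal to the sample mean, the update should leave $\mu$ where it is, which agrees with equation (3). Repeating `em_step` until the parameters stop changing gives the full fit.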

# Survey analysis

Enough abstraction! Let's look at some example survey results:

``````>>> import numpy
>>> scores = numpy.genfromtxt('Factor_analysis/survey.data', delimiter='\t')
>>> scores
array([[ 1.,  3.,  4.,  6.,  7.,  2.,  4.,  5.],
[ 2.,  3.,  4.,  3.,  4.,  6.,  7.,  6.],
[ 4.,  5.,  6.,  7.,  7.,  2.,  3.,  4.],
[ 3.,  4.,  5.,  6.,  7.,  3.,  5.,  4.],
[ 2.,  5.,  5.,  5.,  6.,  2.,  4.,  5.],
[ 3.,  4.,  6.,  7.,  7.,  4.,  3.,  5.],
[ 2.,  3.,  6.,  4.,  5.,  4.,  4.,  4.],
[ 1.,  3.,  4.,  5.,  6.,  3.,  3.,  4.],
[ 3.,  3.,  5.,  6.,  6.,  4.,  4.,  3.],
[ 4.,  4.,  5.,  6.,  7.,  4.,  3.,  4.],
[ 2.,  3.,  6.,  7.,  5.,  4.,  4.,  4.],
[ 2.,  3.,  5.,  7.,  6.,  3.,  3.,  3.]])
``````

`scores[i,j]` is the answer the `i`th respondent gave for the `j`th question. We're looking for underlying factors that can explain covariance between the different questions. Do the question answers ($x$) represent some underlying factors ($y$)? Let's start off by calculating $\mu$:

``````>>> def print_row(row):
...     print('  '.join('{: 0.2f}'.format(x) for x in row))
>>> mu = scores.mean(axis=0)
>>> print_row(mu)
2.42   3.58   5.08   5.75   6.08   3.42   3.92   4.25
``````

Next we need priors for $A$ and $\Sigma$. MDP has an implementation for Python, and their FANode uses a Gaussian random matrix for $A$ and the diagonal of the score covariance for $\Sigma$. They also use the score covariance to avoid repeated summations over $n$.

``````>>> import mdp
>>> def print_matrix(matrix):
...     for row in matrix:
...         print_row(row)
>>> fa = mdp.nodes.FANode(output_dim=3)
>>> numpy.random.seed(1)  # for consistent doctest results
>>> responder_scores = fa(scores)   # hidden factors for each responder
>>> print_matrix(responder_scores)
-1.92  -0.45   0.00
0.67   1.97   1.96
0.70   0.03  -2.00
0.29   0.03  -0.60
-1.02   1.79  -1.43
0.82   0.27  -0.23
-0.07  -0.08   0.82
-1.38  -0.27   0.48
0.79  -1.17   0.50
1.59  -0.30  -0.41
0.01  -0.48   0.73
-0.46  -1.34   0.18
>>> print_row(fa.mu.flat)
2.42   3.58   5.08   5.75   6.08   3.42   3.92   4.25
>>> fa.mu.flat == mu  # MDP agrees with our earlier calculation
array([ True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)
>>> print_matrix(fa.A)  # factor weights for each question
0.80  -0.06  -0.45
0.17   0.30  -0.65
0.34  -0.13  -0.25
0.13  -0.73  -0.64
0.02  -0.32  -0.70
0.61   0.23   0.86
0.08   0.63   0.59
-0.09   0.67   0.13
>>> print_row(fa.sigma)  # unique noise for each question
0.04   0.02   0.38   0.55   0.30   0.05   0.48   0.21
``````

Because the covariance is unaffected by the rotation $A\to AR$, the estimated weights $A$ and responder scores $y$ can be quite sensitive to the seed priors. The width $\Sigma$ of the unique noise $\nu$ is more robust, because $\Sigma$ is unaffected by rotations on $A$.
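The rotation argument is easy to check numerically. The sketch below uses made-up weights and noise variances (not the survey fit) and an arbitrary orthogonal matrix `R` taken from a QR decomposition; all of the variable names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))                    # made-up weights: 8 questions, 3 factors
sigma = rng.uniform(0.1, 1.0, size=8)          # made-up unique-noise variances
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # Q factor is orthogonal: R @ R.T == I

cov = A @ A.T + np.diag(sigma)                 # model covariance of x
cov_rotated = (A @ R) @ (A @ R).T + np.diag(sigma)

same_covariance = np.allclose(cov, cov_rotated)   # the rotation is invisible here
same_weights = np.allclose(A, A @ R)              # but the weights themselves change
```

Since the data only constrain $A{A}^{T}+\Sigma$, any such `R` gives an equally good fit with different-looking weights.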

# Nomenclature

${A}_{\mathrm{ij}}$
The element from the ${i}^{\text{th}}$ row and ${j}^{\text{th}}$ column of a matrix $A$. For example, here is a 2-by-3 matrix in terms of its components:
(8)$A=\left(\begin{array}{ccc}{A}_{11}& {A}_{12}& {A}_{13}\\ {A}_{21}& {A}_{22}& {A}_{23}\end{array}\right)$
${A}^{T}$
The transpose of a matrix (or vector) $A$. ${A}_{\mathrm{ij}}^{T}={A}_{\mathrm{ji}}$
${A}^{-1}$
The inverse of a matrix $A$. ${A}^{-1}\cdot A=1$
$\text{diag}\left[A\right]$
A matrix containing only the diagonal elements of $A$, with the off-diagonal values set to zero.
$E\left[f\left(x\right)\right]$
Expectation value for a function $f$ of a random variable $x$. If the probability density of $x$ is $p\left(x\right)$, then $E\left[f\left(x\right)\right]=\int dxp\left(x\right)f\left(x\right)$. For example, $E\left[1\right]=\int dxp\left(x\right)=1$.
$\mu$
The mean of a random variable $x$ is given by $\mu =E\left[x\right]$.
$\Sigma$
The covariance of a random variable $x$ is given by $\Sigma =E\left[\left(x-\mu \right)\left(x-\mu {\right)}^{T}\right]$. In the factor analysis model discussed above, $\Sigma$ is restricted to a diagonal matrix.
${𝒢}_{x}\left[\mu ,\Sigma \right]$
A Gaussian probability density for the random variables $x$ with a mean $\mu$ and a covariance $\Sigma$.
(9)${𝒢}_{x}\left[\mu ,\Sigma \right]=\frac{1}{\left(2\pi {\right)}^{\frac{D}{2}}\sqrt{\mathrm{det}\left[\Sigma \right]}}{e}^{-\frac{1}{2}\left(x-\mu {\right)}^{T}{\Sigma }^{-1}\left(x-\mu \right)}$
$p\left(y\mid x\right)$
Probability of $y$ occurring given that $x$ occurred. This is commonly used in Bayesian statistics.
$p\left(x,y\right)$
Probability of $y$ and $x$ occurring simultaneously (the joint density). $p\left(x,y\right)=p\left(x\mid y\right)p\left(y\right)$

Note: if you have trouble viewing some of the more obscure Unicode used in this post, you might want to install the STIX fonts.

Posted
catalyst

Available in a git repository.
Repository: catalyst-swc
Browsable repository: catalyst-swc
Author: W. Trevor King

Catalyst is a release-building tool for Gentoo. If you use Gentoo and want to roll your own live CD or bootable USB drive, this is the way to go. As I've been wrapping my head around catalyst, I've been pushing my notes upstream. This post builds on those notes to discuss the construction of a bootable ISO for Software Carpentry boot camps.

# Getting a patched up catalyst

Catalyst has been around for a while, but the user base has been fairly small. If you try to do something that Gentoo's Release Engineering team doesn't do on a regular basis, built-in catalyst support can be spotty. There have been a fair number of patch submissions on gentoo-catalyst@ recently, but patch acceptance can be slow. For the SWC ISO, I applied versions of the following patches (or patch series) to 37540ff:

# Configuring catalyst

The easiest way to run catalyst from a Git checkout is to set up a local config file. I didn't have enough hard drive space on my local system (~16 GB) for this build, so I set things up in a temporary directory on an external hard drive:

``````$ cat catalyst.conf | grep -v '^#\|^$'
digests="md5 sha1 sha512 whirlpool"
contents="auto"
distdir="/usr/portage/distfiles"
envscript="/etc/catalyst/catalystrc"
hash_function="crc32"
options="autoresume kerncache pkgcache seedcache snapcache"
portdir="/usr/portage"
sharedir="/home/wking/src/catalyst"
snapshot_cache="/mnt/d/tmp/catalyst/snapshot_cache"
storedir="/mnt/d/tmp/catalyst"
``````

I used the default values for everything except `sharedir`, `snapshot_cache`, and `storedir`. Then I cloned the `catalyst-swc` repository into `/mnt/d/tmp/catalyst`.

# Portage snapshot and a seed stage

Take a snapshot of the current Portage tree:

``````# catalyst -c catalyst.conf --snapshot 20130208
``````

Then download a seed stage3 tarball into the `builds/default` directory under `storedir`:

``````# wget -O /mnt/d/tmp/catalyst/builds/default/stage3-i686-20121213.tar.bz2 \
>   http://distfiles.gentoo.org/releases/x86/current-stage3/stage3-i686-20121213.tar.bz2
``````

# Building the live CD

``````# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage1-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage2-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage3-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-livecd-stage1-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-livecd-stage2-i686-2013.1.spec
``````

# isohybrid

To make the ISO bootable from a USB drive, I used isohybrid:

``````# cp swc-x86.iso swc-x86-isohybrid.iso
# isohybrid swc-x86-isohybrid.iso
``````

You can install the resulting ISO on a USB drive with:

``````# dd if=swc-x86-isohybrid.iso of=/dev/sdX
``````

replacing `X` with the appropriate drive letter for your USB drive.

With versions of catalyst after d1c2ba9, the `isohybrid` call is built into catalyst's ISO construction.

Posted
SymPy

SymPy is a Python library for symbolic mathematics. To give you a feel for how it works, let's extrapolate the extremum location for $f\left(x\right)$ given a quadratic model:

(1)$f\left(x\right)=A{x}^{2}+Bx+C$

and three known values:

(2)$\begin{array}{rl}f\left(a\right)& =A{a}^{2}+Ba+C\\ f\left(b\right)& =A{b}^{2}+Bb+C\\ f\left(c\right)& =A{c}^{2}+Bc+C\end{array}$

Rephrase as a matrix equation:

(3)$\left(\begin{array}{c}f\left(a\right)\\ f\left(b\right)\\ f\left(c\right)\end{array}\right)=\left(\begin{array}{ccc}{a}^{2}& a& 1\\ {b}^{2}& b& 1\\ {c}^{2}& c& 1\end{array}\right)\cdot \left(\begin{array}{c}A\\ B\\ C\end{array}\right)$

So the solutions for $A$, $B$, and $C$ are:

(4)$\left(\begin{array}{c}A\\ B\\ C\end{array}\right)={\left(\begin{array}{ccc}{a}^{2}& a& 1\\ {b}^{2}& b& 1\\ {c}^{2}& c& 1\end{array}\right)}^{-1}\cdot \left(\begin{array}{c}f\left(a\right)\\ f\left(b\right)\\ f\left(c\right)\end{array}\right)=\left(\begin{array}{c}\text{long}\\ \text{complicated}\\ \text{stuff}\end{array}\right)$

Now that we've found the model parameters, we need to find the $x$ coordinate of the extremum.

(5)$\frac{\mathrm{d}f}{\mathrm{d}x}=2Ax+B\phantom{\rule{thickmathspace}{0ex}},$

which is zero when

(6)$\begin{array}{rl}2Ax& =-B\\ x& =\frac{-B}{2A}\end{array}$

Here's the solution in SymPy:

``````>>> from sympy import Symbol, Matrix, factor, expand, pprint, preview
>>> a = Symbol('a')
>>> b = Symbol('b')
>>> c = Symbol('c')
>>> fa = Symbol('fa')
>>> fb = Symbol('fb')
>>> fc = Symbol('fc')
>>> M = Matrix([[a**2, a, 1], [b**2, b, 1], [c**2, c, 1]])
>>> F = Matrix([[fa],[fb],[fc]])
>>> ABC = M.inv() * F
>>> A = ABC[0,0]
>>> B = ABC[1,0]
>>> x = -B/(2*A)
>>> x = factor(expand(x))
>>> pprint(x)
 2       2       2       2       2       2
a *fb - a *fc - b *fa + b *fc + c *fa - c *fb
---------------------------------------------
 2*(a*fb - a*fc - b*fa + b*fc + c*fa - c*fb)
>>> preview(x, viewer='pqiv')
``````

Where `pqiv` is the executable for pqiv, my preferred image viewer. With a bit of additional factoring, that is:

(7)$x=\frac{{a}^{2}\left[f\left(b\right)-f\left(c\right)\right]+{b}^{2}\left[f\left(c\right)-f\left(a\right)\right]+{c}^{2}\left[f\left(a\right)-f\left(b\right)\right]}{2\cdot \left\{a\left[f\left(b\right)-f\left(c\right)\right]+b\left[f\left(c\right)-f\left(a\right)\right]+c\left[f\left(a\right)-f\left(b\right)\right]\right\}}$
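As a quick numeric sanity check on equation (7), here is a plain-Python sketch. The function name `parabola_extremum` and the test parabola are made up for illustration.

```python
def parabola_extremum(a, fa, b, fb, c, fc):
    """Extremum location of the parabola through (a, fa), (b, fb), (c, fc)."""
    numerator = a**2 * (fb - fc) + b**2 * (fc - fa) + c**2 * (fa - fb)
    denominator = 2 * (a * (fb - fc) + b * (fc - fa) + c * (fa - fb))
    # The denominator vanishes when the points are collinear (A = 0)
    return numerator / denominator


def f(x):
    return (x - 2)**2 + 1  # test parabola with its minimum at x = 2


x_min = parabola_extremum(0, f(0), 1, f(1), 3, f(3))
```

The three sample points don't need to bracket the extremum; any three distinct, non-collinear points pin down the parabola.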
Posted
One-off Git daemon

In my gitweb post, I explain how to set up `git daemon` to serve `git://` requests under Nginx on Gentoo. This post talks about a different situation, where you want to toss up a Git daemon for collaboration on your LAN. This is useful when you're teaching Git to a room full of LAN-sharing students, and you don't want to bother setting up public repositories more permanently.

# Serving a few repositories

Say you have a repository that you want to serve:

``````$ mkdir -p ~/src/my-project
$ cd ~/src/my-project
$ git init
$ …hack hack hack…
``````

Fire up the daemon (probably in another terminal so you can keep hacking in your original terminal) with:

``````$ cd ~/src
$ git daemon --export-all --base-path=. --verbose ./my-project
``````

Then you can clone with:

``````$ git clone git://192.168.1.2/my-project
``````

replacing `192.168.1.2` with your public IP address (e.g. from `ip addr show scope global`). Add additional repository paths to the `git daemon` call to serve additional repositories.

# Serving a single repository

If you don't want to bother listing `my-project` in your URLs, you can base the daemon in the project itself (instead of in the parent directory):

``````$ cd
$ git daemon --export-all --base-path=src/my-project --verbose
``````

Then you can clone with:

``````$ git clone git://192.168.1.2/
``````

This may be more convenient if you're only sharing a single repository.

# Enabling pushes

If you want your students to be able to push to your repository during class, you can run:

``````$ git daemon --enable=receive-pack …
``````

Only do this on a trusted LAN with a junk test repository, because it will allow anybody to push anything or remove references.
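If you end up doing this for every class, the variations above are easy to collect in a small script. This is only a sketch: `git_daemon_command` is a hypothetical helper that builds the argument list from the options used in this post, and it doesn't start the daemon itself.

```python
import shlex


def git_daemon_command(base_path, repositories=(), allow_push=False):
    """Build the argv for a one-off `git daemon` (does not run it)."""
    argv = ['git', 'daemon', '--export-all',
            '--base-path={}'.format(base_path), '--verbose']
    if allow_push:
        # Only for a trusted LAN and a junk test repository
        argv.append('--enable=receive-pack')
    argv.extend(repositories)
    return argv


command = git_daemon_command('.', ['./my-project'])
print(' '.join(shlex.quote(arg) for arg in command))
```

Passing the list to `subprocess.run(command, cwd=…)` from the directory you'd otherwise `cd` into should be equivalent to typing the command by hand.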

Posted
PDF forms

You can use pdftk to fill out PDF forms (thanks for the inspiration, Joe Rothweiler). The syntax is simple:

``````$ pdftk input.pdf fill_form data.fdf output output.pdf
``````

where `input.pdf` is the input PDF containing the form, `data.fdf` is an FDF or XFDF file containing your data, and `output.pdf` is the name of the PDF you're creating. The tricky part is figuring out what to put in `data.fdf`. There's a useful comparison of the Forms Data Format (FDF) and its XML version (XFDF) in the XFDF specification. XFDF only covers a subset of FDF, so I won't worry about it here. FDF is defined in section 12.7.7 of ISO 32000-1:2008, the PDF 1.7 specification, and it has been in PDF specifications since version 1.2.

# Forms Data Format (FDF)

FDF files are basically stripped down PDFs (§12.7.7.1). A simple FDF file will look something like:

``````%FDF-1.2
1 0 obj<</FDF<</Fields[
<</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
<</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
…
] >> >>
endobj
trailer
<</Root 1 0 R>>
%%EOF
``````

Broken down into the lingo of ISO 32000, we have a header (§12.7.7.2.2):

``````%FDF-1.2
``````

followed by a body with a single object (§12.7.7.2.3):

``````1 0 obj<</FDF<</Fields[
<</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
<</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
…
] >> >>
endobj
``````

followed by a trailer (§12.7.7.2.4):

``````trailer
<</Root 1 0 R>>
%%EOF
``````

Despite the claims in §12.7.7.2.1 that the trailer is optional, pdftk choked on files without it:

``````$ cat no-trailer.fdf
%FDF-1.2
1 0 obj<</FDF<</Fields[
<</T(Name)/V(Trevor)>>
<</T(Date)/V(2012-09-20)>>
] >> >>
endobj
$ pdftk input.pdf fill_form no-trailer.fdf output output.pdf
Error: Failed to open form data file:
data.fdf
No output created.
``````

Trailers are easy to add, since all they require is a reference to the root of the FDF catalog dictionary. If you only have one dictionary, you can always use the simple trailer I gave above.
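Since the layout is so regular, a simple FDF file (header, one-object body, and trailer) can be generated from a dictionary of field values. The sketch below assumes ASCII-only names and values; `make_fdf` and `fdf_escape` are hypothetical helper names, not part of pdftk.

```python
def fdf_escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return (text.replace('\\', '\\\\')
                .replace('(', '\\(')
                .replace(')', '\\)'))


def make_fdf(fields):
    """Return FDF text for a {field name: value} mapping (ASCII only)."""
    lines = ['%FDF-1.2', '1 0 obj<</FDF<</Fields[']
    for name, value in fields.items():
        lines.append('<</T({})/V({})>>'.format(
            fdf_escape(name), fdf_escape(value)))
    # Always include the trailer, since pdftk wants it
    lines.extend(['] >> >>', 'endobj', 'trailer', '<</Root 1 0 R>>', '%%EOF'])
    return '\n'.join(lines) + '\n'


fdf = make_fdf({'Name': 'Trevor', 'Date': '2012-09-20'})
```

Write `fdf` to `data.fdf` and you have something to hand to `fill_form`.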

## FDF Catalog

The meat of the FDF file is the catalog (§12.7.7.3). Let's take a closer look at the catalog structure:

``````1 0 obj<</FDF<</Fields[
…
] >> >>
``````

This defines a new object (the FDF catalog) which contains one key (the `/FDF` dictionary). The FDF dictionary contains one key (`/Fields`) and its associated array of fields. Then we close the `/Fields` array (`]`), close the FDF dictionary (`>>`) and close the FDF catalog (`>>`).

There are a number of interesting entries that you can add to the FDF dictionary (§12.7.7.3.1, table 243), some of which require a more advanced FDF version. You can use the `/Version` key to the FDF catalog (§12.7.7.3.1, table 242) to specify the version of data in the dictionary:

``````1 0 obj<</Version/1.3/FDF<</Fields[…
``````

Now you can extend the dictionary using table 244. Let's set things up to use UTF-8 for the field values (`/V`) or options (`/Opt`):

``````1 0 obj<</Version/1.3/FDF<</Encoding/utf_8/Fields[
<</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
<</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
…
] >> >>
endobj
``````

pdftk understands raw text in the specified encoding (`(…)`), raw UTF-16 strings starting with a BOM (`(\xFE\xFF…)`), or UTF-16BE strings encoded as ASCII hex (`<FEFF…>`). You can use pdf-merge.py and its `--unicode` option to find the latter. Support for the `/utf_8` encoding in pdftk is new. I mailed a patch to pdftk's Sid Steward and posted a patch request to the underlying iText library. Until those get accepted, you're stuck with the less convenient encodings.
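If you just need the ASCII hex form for a value or two, it is short enough to compute directly in Python. This is a sketch of that one encoding, not a replacement for pdf-merge.py; the function name `utf16_hex_string` is made up.

```python
def utf16_hex_string(text):
    """Encode text as a PDF hex string: UTF-16BE with a leading BOM."""
    data = b'\xfe\xff' + text.encode('utf-16-be')
    return '<{}>'.format(data.hex().upper())


encoded = utf16_hex_string('Zo\u00eb')  # 'Zoë'
```

The result goes where the `(…)` literal string would otherwise go, e.g. `/V<FEFF…>`.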

## Fonts

Say you fill in some Unicode values, but your PDF reader is having trouble rendering some funky glyphs. Maybe it doesn't have access to the right font? You can see which fonts are embedded in a given PDF using pdffonts.

``````$ pdffonts input.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
MMXQDQ+UniversalStd-NewswithCommPi   CID Type 0C       yes yes yes   1738  0
MMXQDQ+ZapfDingbatsStd               CID Type 0C       yes yes yes   1749  0
MMXQDQ+HelveticaNeueLTStd-Roman      Type 1C           yes yes no    1737  0
CPZITK+HelveticaNeueLTStd-BlkCn      Type 1C           yes yes no    1739  0
…
``````

If you don't have the right font for your new data, you can add it using current versions of iText. However, pdftk uses an older version, so I'm not sure how to translate this idea for pdftk.

## FDF templates and field names

You can use pdftk itself to create an FDF template, which it does with embedded UTF-16BE (you can see the FE FF BOMs at the start of each string value).

``````$ pdftk input.pdf generate_fdf output template.fdf
$ hexdump -C template.fdf  | head
00000000  25 46 44 46 2d 31 2e 32  0a 25 e2 e3 cf d3 0a 31  |%FDF-1.2.%.....1|
00000010  20 30 20 6f 62 6a 20 0a  3c 3c 0a 2f 46 44 46 20  | 0 obj .<<./FDF |
00000020  0a 3c 3c 0a 2f 46 69 65  6c 64 73 20 5b 0a 3c 3c  |.<<./Fields [.<<|
00000030  0a 2f 56 20 28 fe ff 29  0a 2f 54 20 28 fe ff 00  |./V (..)./T (...|
00000040  50 00 6f 00 73 00 74 00  65 00 72 00 4f 00 72 00  |P.o.s.t.e.r.O.r.|
…
``````

You can also dump a more human friendly version of the PDF's fields (without any default data):

``````$ pdftk input.pdf dump_data_fields_utf8 output data.txt
$ cat data.txt
---
FieldType: Text
FieldName: Name
FieldNameAlt: Name:
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: Date
FieldNameAlt: Date:
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldFlags: 0
FieldJustification: Left
---
…
``````

If the fields are poorly named, you may have to fill the entire form with unique values and then see which values appeared where in the output PDF (for an example, see codehero's identify_pdf_fields.js).
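The `dump_data_fields_utf8` output is regular enough to parse with a few lines of Python, which makes it easy to spot fields that have no name at all. This sketch assumes the `---`-separated `Key: Value` layout shown above; `parse_data_fields` is a hypothetical helper.

```python
def parse_data_fields(text):
    """Split pdftk's field dump into a list of {key: value} dictionaries."""
    fields = []
    current = None
    for line in text.splitlines():
        if line.strip() == '---':
            current = {}           # '---' starts a new record
            fields.append(current)
        elif current is not None and ': ' in line:
            # Split on the first ': ' only, since values may contain colons
            key, _, value = line.partition(': ')
            current[key] = value
    return [field for field in fields if field]   # drop empty records


sample = """---
FieldType: Text
FieldName: Name
FieldNameAlt: Name:
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldFlags: 0
FieldJustification: Left
"""
fields = parse_data_fields(sample)
unnamed = [field for field in fields if 'FieldName' not in field]
```

Here `unnamed` collects the records like the third one in the dump above, which are the ones you'd have to identify by filling in unique values.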

# Conclusions

This would be so much easier if people just used YAML or JSON instead of bothering with PDFs ;).

Posted
Portage

Portage is Gentoo's default package manager. This post isn't supposed to be a tutorial; the handbook does a pretty good job of that already. I'm just recording a few tricks so I don't forget them.

# User patches

While playing around with LDAP, I was trying to troubleshoot the `SASL_NOCANON` handling. “Gee,” I thought, “wouldn't it be nice to be able to add debugging printfs to figure out what was happening?” Unfortunately, I had trouble getting `ldapwhoami` working when I compiled it by hand. “Grrr,” I thought, “I just want to add a simple patch and do whatever the ebuild already does.” This is actually pretty easy to do, once you're looking in the right places.

I'm not going to cover that here.

## Place your patch where `epatch_user` will find it

This would be under

``````/etc/portage/patches/<CATEGORY>/<PF|P|PN>/
``````
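To spell out what the `<PF|P|PN>` alternatives expand to, here is a small sketch that lists the three candidate directories for one package. The package and version in the example are made up, `user_patch_dirs` is a hypothetical helper, and the names follow Portage's usual variable definitions (`P` is `PN-PV`, and `PF` adds the revision unless it is `r0`); the real `epatch_user` may check additional variants.

```python
import posixpath


def user_patch_dirs(category, pn, pv, pr='r0', root='/etc/portage/patches'):
    """List the <PF|P|PN> directories for one package, most specific first."""
    p = '{}-{}'.format(pn, pv)
    pf = p if pr == 'r0' else '{}-{}'.format(p, pr)
    names = []
    for name in (pf, p, pn):
        if name not in names:   # PF == P when there is no revision
            names.append(name)
    return [posixpath.join(root, category, name) for name in names]


# Hypothetical package revision, for illustration only
dirs = user_patch_dirs('net-nds', 'openldap', '2.4.30', 'r1')
```

Dropping a patch in the last (`PN`) directory applies it to every version of the package, while the first only matches one specific revision.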

If your ebuild already calls `epatch_user`, or it uses an eclass like `base` that calls `epatch_user` internally, you're done. If not, read on…

## Forcing `epatch_user`

While you could always write an overlay with an improved ebuild, a quicker fix for this kind of hack is /etc/portage/bashrc. I used:

``````if [ "${EBUILD_PHASE}" == "prepare" ]; then
    echo ":: Calling epatch_user";
    # apply the user patches from inside the unpacked source tree
    pushd "${S}"
    epatch_user
    popd
fi
``````

to insert my patches at the beginning of the `prepare` phase.

## Cleaning up

It's safe to call `epatch_user` multiple times, so you can leave this setup in place if you like. However, you might run into problems if you touch autoconf files, so you may want to move your `bashrc` somewhere else until you need it again!

Posted
DVD Backup

I've been using abcde to rip our audio CD collection onto our fileserver for a few years now. Then I can play songs from across the collection using MPD without having to dig the original CDs out of the closet. I just picked up a large external hard drive and thought it might be time to take a look at ripping our DVD collection as well.

There is an excellent Quick-n-Dirty Guide that goes into more detail on all of this, but here's an executive summary.

Make sure your kernel understands the UDF file system:

``````$ grep CONFIG_UDF_FS /usr/src/linux/.config
``````

If your kernel was compiled with `CONFIG_IKCONFIG_PROC` enabled, you could use

``````$ zcat /proc/config.gz | grep CONFIG_UDF_FS
``````

instead, to make sure you're checking the configuration of the currently running kernel. If the `udf` driver was compiled as a module, make sure it's loaded.

``````$ sudo modprobe udf
``````
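The same `CONFIG_UDF_FS` check can be scripted if you want to run it on several machines. This is a sketch with made-up function names; it only reads the kernel configuration and doesn't try to load the module for you.

```python
import gzip


def udf_support(config_text):
    """Return 'y', 'm', or None for CONFIG_UDF_FS in kernel config text."""
    for line in config_text.splitlines():
        if line.startswith('CONFIG_UDF_FS='):
            return line.split('=', 1)[1].strip()
    return None   # "# CONFIG_UDF_FS is not set", or the option is missing


def running_kernel_udf_support(path='/proc/config.gz'):
    """Same check against the running kernel (needs CONFIG_IKCONFIG_PROC)."""
    with gzip.open(path, 'rt') as stream:
        return udf_support(stream.read())
```

A result of `'m'` means the driver is a module and you'll need the `modprobe`, `'y'` means it's built in, and `None` means it's time to reconfigure your kernel.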

Mount the DVD:

``````$ sudo mount /dev/dvd /mnt/dvd
``````

Now you're ready to rip. You've got two options: you can copy the VOBs over directly, or rip the DVD into an alternative container format such as Matroska.

## Vobcopy

Mirror the disc with vobcopy (`media-video/vobcopy` on Gentoo):

``````$ vobcopy -m -t "Awesome_Movie" -v -i /mnt/dvd -o ~/movies/
``````

Play with Mplayer (`media-video/mplayer` on Gentoo):

``````$ mplayer -nosub -fs -dvd-device ~/movies/Awesome_Movie dvd://1
``````

where `-nosub` and `-fs` are optional.

## Matroska

Remux the disc (without reencoding) with `mkvmerge` (from MKVToolNix, `media-video/mkvtoolnix` on Gentoo):

``````$ mkvmerge -o ~/movies/Awesome_Movie.mkv /mnt/dvd/VIDEO_TS/VTS_01_1.VOB
(Processing the following files as well: "VTS_01_2.VOB", "VTS_01_3.VOB", "VTS_01_4.VOB", "VTS_01_5.VOB")
``````

Then you can do all the usual tricks. Here's an example of extracting a slice of the Matroska file as silent video in an AVI container with `mencoder` (from Mplayer, `media-video/mplayer` on Gentoo):

``````$ mencoder -ss 00:29:20.3 -endpos 00:00:21.6 Awesome_Movie.mkv -nosound -of avi -ovc copy -o silent-clip.avi
``````

Here's an example of extracting a slice of the Matroska file as audio in an AC3 container:

``````$ mencoder -ss 51.1 -endpos 160.9 Awesome_Movie.mkv -of rawaudio -ovc copy -oac copy -o audio-clip.ac3
``````

You can also take a look through the Gentoo wiki and this Ubuntu thread for more ideas.

Posted
Screen

Screen is a ncurses-based terminal multiplexer. There are tons of useful things you can do with it, and innumerable blog posts describing them. I have two common use cases:

• On my local host when I don't start X Windows, I login to a virtual terminal and run `screen`. Then I can easily open several windows (e.g. for Emacs, Mutt, irssi, …) without having to log in on another virtual terminal.
• On remote hosts when I'm doing anything serious, I start `screen` immediately after SSH-ing into the remote host. Then if my connection is dropped (or I need to disconnect while I take the train in to work), my remote work is waiting for me to pick up where I left off.

# Treehouse X

Those are useful things, but they are well covered by others. A few days ago I thought of a cute trick for increasing security on my local host, which led me to finally write up a `screen` post. I call it “treehouse X”. Here's the problem:

You don't like waiting for X to start up when a virtual terminal is sufficient for your task at hand, so you've set your box up without a graphical login manager. However, sometimes you do need a graphical interface (e.g. to use fancy characters via Xmodmap or the Compose key), so you fire up X with `startx`, and get on with your life. But wait! You have to leave the terminal to do something else (e.g. teach a class, eat dinner, sleep?). Being a security-conscious bloke, you lock your screen with xlockmore (using your Fluxbox hotkeys). You leave to complete your task. While you're gone Mallory sneaks into your lab. You've locked your X server, so you think you're safe, but Mallory jumps to the virtual terminal from which you started X (using `Ctrl-Alt-F1`, or similar), and kills your `startx` process with `Ctrl-c`. Now Mallory can do evil things in your name, like adding `export EDITOR=vim` to your `.bashrc`.

So how do you protect yourself against this attack? Enter `screen` and treehouse X. If you run `startx` from within a `screen` session, you can jump back to the virtual terminal yourself, detach from the session, and log out of the virtual terminal. This is equivalent to climbing into your treehouse (X) and pulling up your rope ladder (`startx`) behind you, so that you are no longer vulnerable from the ground (the virtual terminal). For kicks, you can reattach to the screen session from an `xterm`, which leads to a fun chicken-and-egg picture:

Of course the whole situation makes sense when you realize that it's really:

``````$ pstree 14542
screen───bash───startx───xinit─┬─X
                               └─fluxbox───xterm───bash───screen
``````

where the first `screen` is the server and the second `screen` is the client.

Posted
Cython

Cython is a Python-like language that makes it easy to write C-based extensions for Python. This is a Good Thing™, because people who will write good Python wrappers will be fluent in Python, but not necessarily in C. Alternatives like SWIG allow you to specify wrappers in a C-like language, which makes thin wrappers easy, but can lead to a less idiomatic wrapper API. I should also point out ctypes, which has the advantage of avoiding compiled wrappers altogether, at the expense of dealing with linking explicitly in the Python code.

The Cython docs are fairly extensive, and I found them to be sufficient for writing my pycomedi wrapper around the Comedi library. One annoying thing was that Cython does not support `__all__` (cython-users). I took a stab at fixing this, but got sidetracked cleaning up the Cython parser (cython-devel, later in cython-devel). I must have bit off more than I should have, since I eventually ran out of time to work on merging my code, and the Cython trunk moved off without me ;).

Posted
SWIG

SWIG is a Simplified Wrapper and Interface Generator. It makes it very easy to provide a quick-and-dirty wrapper so you can call code written in C or C++ from code written in another language (e.g. Python). I don't do much with SWIG, because while building an object oriented wrapper in SWIG is possible, I could never get it to feel natural (I like Cython better). Here are my notes from when I do have to interact with SWIG.

# `%array_class` and memory management

`%array_class` (defined in carrays.i) lets you wrap a C array in a class-based interface. The example from the docs is nice and concise, but I was running into problems.

``````>>> import example
>>> n = 3
>>> data = example.sample_array(n)
>>> for i in range(n):
...     data[i] = 2*i + 3
>>> example.print_sample_pointer(n, data)
Traceback (most recent call last):
...
TypeError: in method 'print_sample_pointer', argument 2 of type 'sample_t *'
``````

I just bumped into these errors again while trying to add an `insn_array` class to Comedi's wrapper:

``````%array_class(comedi_insn, insn_array);
``````

so I decided it was time to buckle down and figure out what was going on. All of the non-Comedi examples here are based on my example test code.

The basic problem is that while you and I realize that an `array_class`-based instance is interchangeable with the underlying pointer, SWIG does not. For example, I've defined a `sample_vector_t` `struct`:

``````typedef double sample_t;
typedef struct sample_vector_struct {
  size_t n;
  sample_t *data;
} sample_vector_t;
``````

and a `sample_array` class:

``````%array_class(sample_t, sample_array);
``````

A bare instance of the double array class has fancy SWIG additions for getting and setting attributes. The class that adds the extra goodies is SWIG's proxy class:

``````>>> print(data)  # doctest: +ELLIPSIS
<example.sample_array; proxy of <Swig Object of type 'sample_array *' at 0x...> >
``````

However, C functions and structs interact with the bare pointer (i.e. without the proxy goodies). You can use the `.cast()` method to remove the goodies:

``````>>> data.cast()  # doctest: +ELLIPSIS
<Swig Object of type 'double *' at 0x...>
>>> example.print_sample_pointer(n, data.cast())
>>> vector = example.sample_vector_t()
>>> vector.n = n
>>> vector.data = data
Traceback (most recent call last):
...
TypeError: in method 'sample_vector_t_data_set', argument 2 of type 'sample_t *'
>>> vector.data = data.cast()
>>> vector.data  # doctest: +ELLIPSIS
<Swig Object of type 'double *' at 0x...>
``````

So `.cast()` gets you from `proxy of <Swig Object ...>` to `<Swig Object ...>`. How do you go the other way? You'll need this if you want to do something extra fancy, like accessing the array members ;).

``````>>> vector.data[0]
Traceback (most recent call last):
...
TypeError: 'SwigPyObject' object is not subscriptable
``````

The answer here is the `.frompointer()` method, which can function as a class method:

``````>>> reconst_data = example.sample_array.frompointer(vector.data)
>>> reconst_data[n-1]
7.0
``````

Or as a single line:

``````>>> example.sample_array.frompointer(vector.data)[n-1]
7.0
``````

I chose the somewhat awkward name of `reconst_data` for the reconstituted data, because if you use `data`, you clobber the earlier `example.sample_array(n)` definition. After the clobber, Python garbage collects the old `data`, and because the old data claims it owns the underlying memory, Python frees the memory. This leaves `vector.data` and `reconst_data` pointing to unallocated memory, which is probably not what you want. If keeping references to the original objects (like I did above with `data`) is too annoying, you have to manually tweak the ownership flag:

``````>>> data.thisown
True
>>> data.thisown = False
>>> data = example.sample_array.frompointer(vector.data)
>>> data[n-1]
7.0
``````

This way, when `data` is clobbered, SWIG doesn't release the underlying array (because `data` no longer claims to own the array). However, `vector` doesn't own the array either, so you'll have to remember to reattach the array to something that will clean it up before `vector` goes out of scope to avoid leaking memory:

``````>>> data.thisown = True
>>> del vector, data
``````

For deeply nested structures, this can be annoying, but it will work.

Posted