Available in a git repository.
Author: W. Trevor King

Since November 2012 I've been maintaining rss2email, a package that converts RSS or Atom feeds to email so you can follow them with your mail user agent. Rss2email was created by the late Aaron Swartz and maintained for several years by Lindsey Smith. I've added a mailing list (hosted with mlmmj), published a PyPI package, and made the GitHub repository the project homepage.

Overall, setting up the standard project infrastructure has been fun, and it's nice to see interest in the newly streamlined code picking up. The timing also works out well, since the demise of Google Reader may push some talented folks in our direction. I'm not sure how visible rss2email is, especially the fresh development locations, hence this post ;). If you know anyone who might be interested in using (or contributing to!) rss2email, please pass the word.

Posted
Factor analysis

I've been trying to wrap my head around factor analysis as a theory for designing and understanding test and survey results. This has turned out to be another one of those fields where the going has been a bit rough. I think the key factors in making these older topics difficult are:

• “Everybody knows this, so we don't need to write up the details.”
• “Hey, I can do better than Bob if I just tweak this knob…”

The resulting discussion ends up being overly complicated, and it's hard for newcomers to decide if people using similar terminology are in fact talking about the same thing.

Some of the better open sources for background have been Tucker and MacCallum's “Exploratory Factor Analysis” manuscript and Max Welling's notes. I'll use Welling's terminology for this discussion.

The basic idea of factor analysis is to model $d$ measurable attributes as generated by $k$ common factors and $d$ unique factors. With $d=4$ and $k=2$, you get something like:

Corresponding to the equation (Welling's eq. 1):

(1)$x=Ay+\mu +\nu$

The independent random variables $y$ are distributed according to a Gaussian with zero mean and unit variance ${𝒢}_{y}\left[0,I\right]$ (zero mean because constant offsets are handled by $\mu$; unit variance because scaling is handled by $A$). The independent random variables $\nu$ are distributed according to ${𝒢}_{\nu }\left[0,\Sigma \right]$, with (Welling's eq. 2):

(2)$\Sigma \equiv \text{diag}\left[{\sigma }_{1}^{2},\dots ,{\sigma }_{d}^{2}\right]$

Because the only source of constant offset is $\mu$, we can calculate it by averaging out the random noise (Welling's eq. 6):

(3)$\mu =\frac{1}{N}\sum _{n=1}^{N}{x}_{n}$

where $N$ is the number of measurements (survey responders) and ${x}_{n}$ is the response vector for the ${n}^{\text{th}}$ responder.
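
To make the model concrete, here's a quick NumPy sketch (my own, not from Welling's notes) that draws samples from the generative model in equation (1) and recovers $\mu$ by averaging, as in equation (3):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 4, 2, 100000

A = rng.normal(size=(d, k))             # factor weights
mu = np.array([1.0, 2.0, 3.0, 4.0])     # constant offsets
sigma = np.array([0.1, 0.2, 0.3, 0.4])  # unique noise widths (square roots of Sigma's diagonal)

y = rng.normal(size=(N, k))                # common factors: zero mean, unit variance
nu = rng.normal(scale=sigma, size=(N, d))  # unique factors: diagonal covariance Sigma
x = y @ A.T + mu + nu                      # equation (1): x = A y + mu + nu

mu_estimate = x.mean(axis=0)               # equation (3)
print(np.abs(mu_estimate - mu).max())      # small: averaging recovers mu
```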

How do we find $A$ and $\Sigma$? This is the tricky bit, and there are a number of possible approaches. Welling suggests using expectation maximization (EM), and there's an excellent example of the procedure with a colorblind experimenter drawing colored balls in his EM notes (to test my understanding, I wrote color-ball.py).

To simplify calculations, Welling defines (before eq. 15):

(4)$\begin{array}{rl}A'& \equiv \left[A,\mu \right]\\ y'& \equiv {\left[{y}^{T},1\right]}^{T}\end{array}$

which reduce the model to

(5)$x=A'y'+\nu$

After some manipulation Welling works out the maximizing updates (eq'ns 16 and 17):

(6)$\begin{array}{rl}{A'}^{\text{new}}& =\left(\sum _{n=1}^{N}{x}_{n}{E\left[y'\mid {x}_{n}\right]}^{T}\right){\left(\sum _{n=1}^{N}E\left[y'{y'}^{T}\mid {x}_{n}\right]\right)}^{-1}\\ {\Sigma }^{\text{new}}& =\frac{1}{N}\sum _{n=1}^{N}\text{diag}\left[{x}_{n}{x}_{n}^{T}-{A'}^{\text{new}}E\left[y'\mid {x}_{n}\right]{x}_{n}^{T}\right]\end{array}$

The expectation values used in these updates are given by (Welling's eq'ns 12 and 13):

(7)$\begin{array}{rl}E\left[y\mid {x}_{n}\right]& ={A}^{T}{\left(A{A}^{T}+\Sigma \right)}^{-1}\left({x}_{n}-\mu \right)\\ E\left[y{y}^{T}\mid {x}_{n}\right]& =I-{A}^{T}{\left(A{A}^{T}+\Sigma \right)}^{-1}A+E\left[y\mid {x}_{n}\right]{E\left[y\mid {x}_{n}\right]}^{T}\end{array}$
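
To test my understanding of these updates, here's a minimal NumPy implementation of the EM loop (my own sketch, not Welling's code, using the primed forms from equations (4) and (5)):

```python
import numpy as np

def em_factor_analysis(x, k, n_iter=500, seed=0):
    """Fit x = A y + mu + nu by expectation maximization.

    x is an (N, d) array of observations; returns (A, mu, sigma2),
    where sigma2 is the diagonal of Sigma.
    """
    rng = np.random.default_rng(seed)
    N, d = x.shape
    A = rng.normal(size=(d, k))    # random prior for the weights
    mu = x.mean(axis=0)
    sigma2 = x.var(axis=0)         # prior for Sigma's diagonal
    for _ in range(n_iter):
        # E step, equation (7)
        beta = A.T @ np.linalg.inv(A @ A.T + np.diag(sigma2))  # k x d
        Ey = (x - mu) @ beta.T             # E[y|x_n], one row per n
        cov_y = np.eye(k) - beta @ A       # shared part of E[y y^T|x_n]
        # M step, equation (6), in the primed coordinates A' = [A, mu]
        Eyp = np.hstack([Ey, np.ones((N, 1))])                  # E[y'|x_n]
        S1 = x.T @ Eyp                                          # sum_n x_n E[y'|x_n]^T
        S2 = Eyp.T @ Eyp + N * np.pad(cov_y, ((0, 1), (0, 1)))  # sum_n E[y' y'^T|x_n]
        Ap = S1 @ np.linalg.inv(S2)                             # A'^new
        A, mu = Ap[:, :k], Ap[:, k]
        # diag[x_n x_n^T - A'^new E[y'|x_n] x_n^T], averaged over n
        sigma2 = np.mean(x * x - (Ap @ Eyp.T).T * x, axis=0)
    return A, mu, sigma2
```

At convergence, $A{A}^{T}+\Sigma$ should reproduce the sample covariance of $x$, up to the rotational ambiguity in $A$ discussed below.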

# Survey analysis

Enough abstraction! Let's look at an example using survey results:

``````>>> import numpy
>>> scores = numpy.genfromtxt('Factor_analysis/survey.data', delimiter='\t')
>>> scores
array([[ 1.,  3.,  4.,  6.,  7.,  2.,  4.,  5.],
       [ 2.,  3.,  4.,  3.,  4.,  6.,  7.,  6.],
       [ 4.,  5.,  6.,  7.,  7.,  2.,  3.,  4.],
       [ 3.,  4.,  5.,  6.,  7.,  3.,  5.,  4.],
       [ 2.,  5.,  5.,  5.,  6.,  2.,  4.,  5.],
       [ 3.,  4.,  6.,  7.,  7.,  4.,  3.,  5.],
       [ 2.,  3.,  6.,  4.,  5.,  4.,  4.,  4.],
       [ 1.,  3.,  4.,  5.,  6.,  3.,  3.,  4.],
       [ 3.,  3.,  5.,  6.,  6.,  4.,  4.,  3.],
       [ 4.,  4.,  5.,  6.,  7.,  4.,  3.,  4.],
       [ 2.,  3.,  6.,  7.,  5.,  4.,  4.,  4.],
       [ 2.,  3.,  5.,  7.,  6.,  3.,  3.,  3.]])
``````

`scores[i,j]` is the answer the `i`th respondent gave for the `j`th question. We're looking for underlying factors that can explain covariance between the different questions. Do the question answers ($x$) represent some underlying factors ($y$)? Let's start off by calculating $\mu$:

``````>>> def print_row(row):
...     print('  '.join('{: 0.2f}'.format(x) for x in row))
>>> mu = scores.mean(axis=0)
>>> print_row(mu)
 2.42   3.58   5.08   5.75   6.08   3.42   3.92   4.25
``````

Next we need priors for $A$ and $\Sigma$. MDP has a Python implementation, and its FANode uses a Gaussian random matrix for $A$ and the diagonal of the score covariance for $\Sigma$. It also uses the score covariance to avoid repeated summations over $n$.

``````>>> import mdp
>>> def print_matrix(matrix):
...     for row in matrix:
...         print_row(row)
>>> fa = mdp.nodes.FANode(output_dim=3)
>>> numpy.random.seed(1)  # for consistent doctest results
>>> responder_scores = fa(scores)   # hidden factors for each responder
>>> print_matrix(responder_scores)
-1.92  -0.45   0.00
 0.67   1.97   1.96
 0.70   0.03  -2.00
 0.29   0.03  -0.60
-1.02   1.79  -1.43
 0.82   0.27  -0.23
-0.07  -0.08   0.82
-1.38  -0.27   0.48
 0.79  -1.17   0.50
 1.59  -0.30  -0.41
 0.01  -0.48   0.73
-0.46  -1.34   0.18
>>> print_row(fa.mu.flat)
 2.42   3.58   5.08   5.75   6.08   3.42   3.92   4.25
>>> fa.mu.flat == mu  # MDP agrees with our earlier calculation
array([ True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)
>>> print_matrix(fa.A)  # factor weights for each question
 0.80  -0.06  -0.45
 0.17   0.30  -0.65
 0.34  -0.13  -0.25
 0.13  -0.73  -0.64
 0.02  -0.32  -0.70
 0.61   0.23   0.86
 0.08   0.63   0.59
-0.09   0.67   0.13
>>> print_row(fa.sigma)  # unique noise for each question
 0.04   0.02   0.38   0.55   0.30   0.05   0.48   0.21
``````

Because the covariance is unaffected by the rotation $A\to AR$, the estimated weights $A$ and responder scores $y$ can be quite sensitive to the seed priors. The width $\Sigma$ of the unique noise $\nu$ is more robust, because $\Sigma$ is unaffected by rotations on $A$.
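
You can check that invariance numerically (a quick demonstration of my own): for any orthogonal $R$, the rotated weights $AR$ predict the same covariance as $A$, because $R{R}^{T}=I$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))                   # some factor weights
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix
# Rotating the factors leaves the predicted covariance A A^T + Sigma unchanged,
# so the data cannot distinguish A from A R:
print(np.allclose((A @ R) @ (A @ R).T, A @ A.T))  # True
```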

# Nomenclature

${A}_{\mathrm{ij}}$
The element from the ${i}^{\text{th}}$ row and ${j}^{\text{th}}$ column of a matrix $A$. For example, here is a 2-by-3 matrix in terms of its components:
(8)$A=\left(\begin{array}{ccc}{A}_{11}& {A}_{12}& {A}_{13}\\ {A}_{21}& {A}_{22}& {A}_{23}\end{array}\right)$
${A}^{T}$
The transpose of a matrix (or vector) $A$. ${A}_{\mathrm{ij}}^{T}={A}_{\mathrm{ji}}$
${A}^{-1}$
The inverse of a matrix $A$. ${A}^{-1}A=I$
$\text{diag}\left[A\right]$
A matrix containing only the diagonal elements of $A$, with the off-diagonal values set to zero.
$E\left[f\left(x\right)\right]$
Expectation value for a function $f$ of a random variable $x$. If the probability density of $x$ is $p\left(x\right)$, then $E\left[f\left(x\right)\right]=\int dx\,p\left(x\right)f\left(x\right)$. For example, normalization requires $E\left[1\right]=1$.
$\mu$
The mean of a random variable $x$ is given by $\mu =E\left[x\right]$.
$\Sigma$
The covariance of a random variable $x$ is given by $\Sigma =E\left[\left(x-\mu \right){\left(x-\mu \right)}^{T}\right]$. In the factor analysis model discussed above, $\Sigma$ is restricted to a diagonal matrix.
${𝒢}_{x}\left[\mu ,\Sigma \right]$
A Gaussian probability density for the random variables $x$ with a mean $\mu$ and a covariance $\Sigma$.
(9)${𝒢}_{x}\left[\mu ,\Sigma \right]=\frac{1}{{\left(2\pi \right)}^{\frac{D}{2}}\sqrt{\mathrm{det}\left[\Sigma \right]}}{e}^{-\frac{1}{2}{\left(x-\mu \right)}^{T}{\Sigma }^{-1}\left(x-\mu \right)}$
$p\left(y\mid x\right)$
Probability of $y$ occurring given that $x$ occurred. This is commonly used in Bayesian statistics.
$p\left(x,y\right)$
Probability of $y$ and $x$ occurring simultaneously (the joint density). $p\left(x,y\right)=p\left(x\mid y\right)p\left(y\right)$

Note: if you have trouble viewing some of the more obscure Unicode used in this post, you might want to install the STIX fonts.

Posted
catalyst

Available in a git repository.
Repository: catalyst-swc
Browsable repository: catalyst-swc
Author: W. Trevor King

Catalyst is a release-building tool for Gentoo. If you use Gentoo and want to roll your own live CD or bootable USB drive, this is the way to go. As I've been wrapping my head around catalyst, I've been pushing my notes upstream. This post builds on those notes to discuss the construction of a bootable ISO for Software Carpentry boot camps.

# Getting a patched up catalyst

Catalyst has been around for a while, but the user base has been fairly small. If you try to do something that Gentoo's Release Engineering team doesn't do on a regular basis, built-in catalyst support can be spotty. There have been a fair number of patch submissions on gentoo-catalyst@ recently, but patch acceptance can be slow. For the SWC ISO, I applied versions of the following patches (or patch series) to 37540ff:

# Configuring catalyst

The easiest way to run catalyst from a Git checkout is to set up a local config file. I didn't have enough hard drive space on my local system (~16 GB) for this build, so I set things up in a temporary directory on an external hard drive:

``````$ cat catalyst.conf | grep -v '^#\|^$'
digests="md5 sha1 sha512 whirlpool"
contents="auto"
distdir="/usr/portage/distfiles"
envscript="/etc/catalyst/catalystrc"
hash_function="crc32"
options="autoresume kerncache pkgcache seedcache snapcache"
portdir="/usr/portage"
sharedir="/home/wking/src/catalyst"
snapshot_cache="/mnt/d/tmp/catalyst/snapshot_cache"
storedir="/mnt/d/tmp/catalyst"
``````

I used the default values for everything except `sharedir`, `snapshot_cache`, and `storedir`. Then I cloned the `catalyst-swc` repository into `/mnt/d/tmp/catalyst`.

# Portage snapshot and a seed stage

Take a snapshot of the current Portage tree:

``````# catalyst -c catalyst.conf --snapshot 20130208
``````

Download a seed stage3 from a Gentoo mirror:

``````# wget -O /mnt/d/tmp/catalyst/builds/default/stage3-i686-20121213.tar.bz2 \
>   http://distfiles.gentoo.org/releases/x86/current-stage3/stage3-i686-20121213.tar.bz2
``````

# Building the live CD

``````# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage1-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage2-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-stage3-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-livecd-stage1-i686-2013.1.spec
# catalyst -c catalyst.conf -f /mnt/d/tmp/catalyst/spec/default-livecd-stage2-i686-2013.1.spec
``````

# isohybrid

To make the ISO bootable from a USB drive, I used isohybrid:

``````# cp swc-x86.iso swc-x86-isohybrid.iso
# isohybrid swc-x86-isohybrid.iso
``````

You can install the resulting ISO on a USB drive with:

``````# dd if=swc-x86-isohybrid.iso of=/dev/sdX
``````

replacing `X` with the appropriate drive letter for your USB drive.

With versions of catalyst after d1c2ba9, the `isohybrid` call is built into catalyst's ISO construction.

Posted
SymPy

SymPy is a Python library for symbolic mathematics. To give you a feel for how it works, let's extrapolate the extremum location of $f\left(x\right)$ given a quadratic model:

(1)$f\left(x\right)=A{x}^{2}+Bx+C$

and three known values:

(2)$\begin{array}{rl}f\left(a\right)& =A{a}^{2}+Ba+C\\ f\left(b\right)& =A{b}^{2}+Bb+C\\ f\left(c\right)& =A{c}^{2}+Bc+C\end{array}$

Rephrase as a matrix equation:

(3)$\left(\begin{array}{c}f\left(a\right)\\ f\left(b\right)\\ f\left(c\right)\end{array}\right)=\left(\begin{array}{ccc}{a}^{2}& a& 1\\ {b}^{2}& b& 1\\ {c}^{2}& c& 1\end{array}\right)\cdot \left(\begin{array}{c}A\\ B\\ C\end{array}\right)$

So the solutions for $A$, $B$, and $C$ are:

(4)$\left(\begin{array}{c}A\\ B\\ C\end{array}\right)={\left(\begin{array}{ccc}{a}^{2}& a& 1\\ {b}^{2}& b& 1\\ {c}^{2}& c& 1\end{array}\right)}^{-1}\cdot \left(\begin{array}{c}f\left(a\right)\\ f\left(b\right)\\ f\left(c\right)\end{array}\right)=\left(\begin{array}{c}\text{long}\\ \text{complicated}\\ \text{stuff}\end{array}\right)$

Now that we've found the model parameters, we need to find the $x$ coordinate of the extremum.

(5)$\frac{\mathrm{d}f}{\mathrm{d}x}=2Ax+B\,,$

which is zero when

(6)$\begin{array}{rl}2Ax& =-B\\ x& =\frac{-B}{2A}\end{array}$

Here's the solution in SymPy:

``````>>> from sympy import Symbol, Matrix, factor, expand, pprint, preview
>>> a = Symbol('a')
>>> b = Symbol('b')
>>> c = Symbol('c')
>>> fa = Symbol('fa')
>>> fb = Symbol('fb')
>>> fc = Symbol('fc')
>>> M = Matrix([[a**2, a, 1], [b**2, b, 1], [c**2, c, 1]])
>>> F = Matrix([[fa],[fb],[fc]])
>>> ABC = M.inv() * F
>>> A = ABC[0,0]
>>> B = ABC[1,0]
>>> x = -B/(2*A)
>>> x = factor(expand(x))
>>> pprint(x)
 2       2       2       2       2       2
a *fb - a *fc - b *fa + b *fc + c *fa - c *fb
---------------------------------------------
 2*(a*fb - a*fc - b*fa + b*fc + c*fa - c*fb)
>>> preview(x, viewer='pqiv')
``````

where `pqiv` is my preferred image viewer. With a bit of additional factoring, that is:

(7)$x=\frac{{a}^{2}\left[f\left(b\right)-f\left(c\right)\right]+{b}^{2}\left[f\left(c\right)-f\left(a\right)\right]+{c}^{2}\left[f\left(a\right)-f\left(b\right)\right]}{2\cdot \left\{a\left[f\left(b\right)-f\left(c\right)\right]+b\left[f\left(c\right)-f\left(a\right)\right]+c\left[f\left(a\right)-f\left(b\right)\right]\right\}}$
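
As a sanity check (my own addition, not in the SymPy session above), plugging a known parabola into equation (7) should recover its vertex. For example, $f(x)=(x-2)^2$ sampled at $a=0$, $b=1$, $c=3$:

```python
def extremum(a, b, c, fa, fb, fc):
    """x coordinate of the extremum of the quadratic through
    (a, f(a)), (b, f(b)), and (c, f(c)) -- equation (7)."""
    numerator = a**2*(fb - fc) + b**2*(fc - fa) + c**2*(fa - fb)
    denominator = 2*(a*(fb - fc) + b*(fc - fa) + c*(fa - fb))
    return numerator / denominator

f = lambda x: (x - 2)**2
print(extremum(0, 1, 3, f(0), f(1), f(3)))  # 2.0
```
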
Posted
One-off Git daemon

In my gitweb post, I explain how to set up `git daemon` to serve `git://` requests under Nginx on Gentoo. This post talks about a different situation, where you want to toss up a Git daemon for collaboration on your LAN. This is useful when you're teaching Git to a room full of LAN-sharing students, and you don't want to bother setting up public repositories more permanently.

# Serving a few repositories

Say you have a repository that you want to serve:

``````$ mkdir -p ~/src/my-project
$ cd ~/src/my-project
$ git init
$ …hack hack hack…
``````

Fire up the daemon (probably in another terminal so you can keep hacking in your original terminal) with:

``````$ cd ~/src
$ git daemon --export-all --base-path=. --verbose ./my-project
``````

Then you can clone with:

``````$ git clone git://192.168.1.2/my-project
``````

replacing `192.168.1.2` with your public IP address (e.g. from `ip addr show scope global`). Add additional repository paths to the `git daemon` call to serve additional repositories.

# Serving a single repository

If you don't want to bother listing `my-project` in your URLs, you can base the daemon in the project itself (instead of in the parent directory):

``````$ cd
$ git daemon --export-all --base-path=src/my-project --verbose
``````

Then you can clone with:

``````$ git clone git://192.168.1.2/
``````

This may be more convenient if you're only sharing a single repository.

# Enabling pushes

If you want your students to be able to push to your repository during class, you can run:

``````$ git daemon --enable=receive-pack …
``````

Only do this on a trusted LAN with a junk test repository, because it will allow anybody to push anything or remove references.

Posted
Mutt-LDAP

Available in a git repository.
Repository: mutt-ldap
Browsable repository: mutt-ldap
Author: W. Trevor King

I wrote this Python script to query an LDAP server for addresses from Mutt. In December 2012, I got some patches from Wade Berrier and Niels de Vos. Anything interesting enough for others to hack on deserves its own repository, so I pulled it out of my blog repository (linked above, and mirrored on GitHub).

The `README` is posted on the PyPI page.

Posted
script

`script` is a nice utility for recording terminal sessions. It spawns a new shell (by default), and records any activity in that shell:

``````$ script /tmp/script
Script started, file is /tmp/script
$ ps -u wking | grep script
12102 pts/7    00:00:00 script
12103 pts/7    00:00:00 script
$ pstree 12102
script───script───bash───pstree
$ exit
exit
Script done, file is /tmp/script
``````

The recorded file stores all the characters sent to the terminal:

``````$ cat -v /tmp/script
Script started on Mon 10 Dec 2012 08:07:17 AM EST
^[]0;wking@mjolnir:~^G^[[?1034h^[[01;32mwking@mjolnir^[[01;34m ~ $^[[00m ps -u wking | grep script^M
12102 pts/7    00:00:00 ^[[01;31m^[[Kscript^[[m^[[K^M
12103 pts/7    00:00:00 ^[[01;31m^[[Kscript^[[m^[[K^M
^[]0;wking@mjolnir:~^G^[[01;32mwking@mjolnir^[[01;34m ~ $^[[00m pstr^G^M
pstree   pstruct  ^M
^[[01;32mwking@mjolnir^[[01;34m ~ $^[[00m pstr^M
pstree   pstruct  ^M
^[[01;32mwking@mjolnir^[[01;34m ~ $^[[00m pstree 12102^M
scriptM-bM-^TM-^@M-bM-^TM-^@M-bM-^TM-^@scriptM-bM-^TM-^@M-bM-^TM-^@M-bM-^TM-^@bashM-bM-^TM-^@M-bM-^TM-^@M-bM-^TM-^@pstree^M
^[]0;wking@mjolnir:~^G^[[01;32mwking@mjolnir^[[01;34m ~ $^[[00m exit^M
exit^M

Script done on Mon 10 Dec 2012 08:07:51 AM EST
``````

You can also record timing information in a separate file (this approach was developed by Joey Hess and posted in Debian bug 68556):

``````$ script --timing=/tmp/timing /tmp/script
Script started, file is /tmp/script
$ echo hello
hello
$ exit
exit
Script done, file is /tmp/script
``````

The timing file has a “delay in seconds” column and a “number of characters to send” column:

``````$ cat /tmp/timing
0.671478 67
0.159100 1
0.941919 1
0.149764 1
``````
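
That two-column format is easy to post-process. For example, here's a little Python snippet (mine, not part of util-linux) that totals the delays and character counts; on the four lines above it gives roughly 1.92 seconds and 70 characters:

```python
def timing_stats(path):
    """Total duration and character count for a `script --timing` file.

    Each line holds a delay in seconds and a number of characters sent.
    """
    total_delay, total_chars = 0.0, 0
    with open(path) as f:
        for line in f:
            delay, chars = line.split()
            total_delay += float(delay)
            total_chars += int(chars)
    return total_delay, total_chars
```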

You can play back a script with timing information:

``````$ scriptreplay -t /tmp/timing -s /tmp/script
…playback…
``````

You can also play it back with altered timing (e.g. in slow motion):

``````$ scriptreplay -t /tmp/timing -s /tmp/script -d 0.5
…half-speed playback…
``````

This even works reasonably well with curses applications (`emacs`, `vi`, …), but information such as window size is not recorded or replayed. From util-linux's `scriptreplay(1)`:

Since the same information is simply being displayed, scriptreplay is only guaranteed to work properly if run on the same type of terminal the typescript was recorded on. Otherwise, any escape characters in the typescript may be interpreted differently by the terminal to which scriptreplay is sending its output.

This means that if you want interoperable replay, you'll want to do something like

``````$ reset
$ echo "$TERM $LINES $COLUMNS"
xterm 24 80
``````

early on so folks know how to set up their replay environment. If you're using some wonky `$TERM`, you may even want to post your terminfo:

``````$ infocmp
xterm|xterm terminal emulator (X Window System),
am, bce, km, mc5i, mir, msgr, npc, xenl,
colors#8, cols#80, it#8, lines#24, pairs#64,
…
``````

The user can compare with their current terminal:

``````$ infocmp xterm linux
comparing xterm to linux.
comparing booleans.
ccc: F:T.
eo: F:T.
…
comparing numbers.
cols: 80, NULL.
lines: 24, NULL.
ncv: NULL, 18.
…
``````

It would be nice if there were an `iconv`-style converter to translate between terminal operation encodings, but I haven't found one yet. Screen does something like this internally, and its list of control sequences is a useful reference. I've started work on an escape-sequence-to-HTML converter, in case you want to play around with these conversions in Python.

Posted
PDF forms

You can use pdftk to fill out PDF forms (thanks for the inspiration, Joe Rothweiler). The syntax is simple:

``````$ pdftk input.pdf fill_form data.fdf output output.pdf
``````

where `input.pdf` is the input PDF containing the form, `data.fdf` is an FDF or XFDF file containing your data, and `output.pdf` is the name of the PDF you're creating. The tricky part is figuring out what to put in `data.fdf`. There's a useful comparison of the Forms Data Format (FDF) and its XML version (XFDF) in the XFDF specification. XFDF only covers a subset of FDF, so I won't worry about it here. FDF is defined in section 12.7.7 of ISO 32000-1:2008, the PDF 1.7 specification, and it has been in PDF specifications since version 1.2.

# Forms Data Format (FDF)

FDF files are basically stripped down PDFs (§12.7.7.1). A simple FDF file will look something like:

``````%FDF-1.2
1 0 obj<</FDF<</Fields[
<</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
<</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
…
] >> >>
endobj
trailer
<</Root 1 0 R>>
%%EOF
``````

Broken down into the lingo of ISO 32000, we have a header (§12.7.7.2.2):

``````%FDF-1.2
``````

followed by a body with a single object (§12.7.7.2.3):

``````1 0 obj<</FDF<</Fields[
<</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
<</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
…
] >> >>
endobj
``````

followed by a trailer (§12.7.7.2.4):

``````trailer
<</Root 1 0 R>>
%%EOF
``````

Despite the claims in §12.7.7.2.1 that the trailer is optional, pdftk choked on files without it:

``````$ cat no-trailer.fdf
%FDF-1.2
1 0 obj<</FDF<</Fields[
<</T(Name)/V(Trevor)>>
<</T(Date)/V(2012-09-20)>>
] >> >>
endobj
$ pdftk input.pdf fill_form no-trailer.fdf output output.pdf
Error: Failed to open form data file:
no-trailer.fdf
No output created.
``````

Trailers are easy to add, since all they require is a reference to the root of the FDF catalog dictionary. If you only have one dictionary, you can always use the simple trailer I gave above.
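
Writing FDF by hand gets tedious for more than a couple of fields, so here's a small generator of my own (a sketch, independent of pdftk) following the header/body/trailer structure above. Parentheses and backslashes inside PDF literal strings need escaping:

```python
def make_fdf(fields):
    """Build a minimal FDF document from a {field name: value} mapping."""
    def pdf_string(s):
        # Backslashes, then parentheses, must be escaped in literal strings.
        for char in ('\\', '(', ')'):
            s = s.replace(char, '\\' + char)
        return '(' + s + ')'
    lines = ['%FDF-1.2', '1 0 obj<</FDF<</Fields[']
    for name, value in fields.items():
        lines.append('<</T{}/V{}>>'.format(pdf_string(name), pdf_string(value)))
    lines.extend(['] >> >>', 'endobj', 'trailer', '<</Root 1 0 R>>', '%%EOF'])
    return '\n'.join(lines) + '\n'

print(make_fdf({'Name': 'Trevor', 'Date': '2012-09-20'}))
```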

## FDF Catalog

The meat of the FDF file is the catalog (§12.7.7.3). Let's take a closer look at the catalog structure:

``````1 0 obj<</FDF<</Fields[
…
] >> >>
``````

This defines a new object (the FDF catalog) which contains one key (the `/FDF` dictionary). The FDF dictionary contains one key (`/Fields`) and its associated array of fields. Then we close the `/Fields` array (`]`), close the FDF dictionary (`>>`) and close the FDF catalog (`>>`).

There are a number of interesting entries that you can add to the FDF dictionary (§12.7.7.3.1, table 243), some of which require a more advanced FDF version. You can use the `/Version` key of the FDF catalog (§12.7.7.3.1, table 242) to specify the version of the data in the dictionary:

``````1 0 obj<</Version/1.3/FDF<</Fields[…
``````

Now you can extend the dictionary using table 244. Let's set things up to use UTF-8 for the field values (`/V`) or options (`/Opt`):

``````1 0 obj<</Version/1.3/FDF<</Encoding/utf_8/Fields[
<</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
<</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
…
] >> >>
endobj
``````

pdftk understands raw text in the specified encoding (`(…)`), raw UTF-16 strings starting with a BOM (`(\xFE\xFF…)`), or UTF-16BE strings encoded as ASCII hex (`<FEFF…>`). You can use pdf-merge.py and its `--unicode` option to find the latter. Support for the `/utf_8` encoding in pdftk is new. I mailed a patch to pdftk's Sid Steward and posted a patch request to the underlying iText library. Until those get accepted, you're stuck with the less convenient encodings.
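
For reference, generating the ASCII-hex UTF-16BE form from Python is straightforward: prepend the byte order mark and hex-encode the big-endian code units (a quick helper of my own):

```python
def utf16be_hex(s):
    """Encode a string as a PDF ASCII-hex UTF-16 string: <FEFF...>."""
    return '<{}>'.format(('\ufeff' + s).encode('utf-16-be').hex().upper())

print(utf16be_hex('Trevor'))  # <FEFF0054007200650076006F0072>
```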

## Fonts

Say you fill in some Unicode values, but your PDF reader is having trouble rendering some funky glyphs. Maybe it doesn't have access to the right font? You can see which fonts are embedded in a given PDF using pdffonts.

``````$ pdffonts input.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
MMXQDQ+UniversalStd-NewswithCommPi   CID Type 0C       yes yes yes   1738  0
MMXQDQ+ZapfDingbatsStd               CID Type 0C       yes yes yes   1749  0
MMXQDQ+HelveticaNeueLTStd-Roman      Type 1C           yes yes no    1737  0
CPZITK+HelveticaNeueLTStd-BlkCn      Type 1C           yes yes no    1739  0
…
``````

If you don't have the right font for your new data, you can add it using current versions of iText. However, pdftk uses an older version, so I'm not sure how to translate this idea for pdftk.

## FDF templates and field names

You can use pdftk itself to create an FDF template, which it does with embedded UTF-16BE (you can see the FE FF BOMs at the start of each string value).

``````$ pdftk input.pdf generate_fdf output template.fdf
$ hexdump -C template.fdf  | head
00000000  25 46 44 46 2d 31 2e 32  0a 25 e2 e3 cf d3 0a 31  |%FDF-1.2.%.....1|
00000010  20 30 20 6f 62 6a 20 0a  3c 3c 0a 2f 46 44 46 20  | 0 obj .<<./FDF |
00000020  0a 3c 3c 0a 2f 46 69 65  6c 64 73 20 5b 0a 3c 3c  |.<<./Fields [.<<|
00000030  0a 2f 56 20 28 fe ff 29  0a 2f 54 20 28 fe ff 00  |./V (..)./T (...|
00000040  50 00 6f 00 73 00 74 00  65 00 72 00 4f 00 72 00  |P.o.s.t.e.r.O.r.|
…
``````

You can also dump a more human friendly version of the PDF's fields (without any default data):

``````$ pdftk input.pdf dump_data_fields_utf8 output data.txt
$ cat data.txt
---
FieldType: Text
FieldName: Name
FieldNameAlt: Name:
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: Date
FieldNameAlt: Date:
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldFlags: 0
FieldJustification: Left
---
…
``````

If the fields are poorly named, you may have to fill the entire form with unique values and then see which values appeared where in the output PDF (for an example, see codehero's identify_pdf_fields.js).

# Conclusions

This would be so much easier if people just used YAML or JSON instead of bothering with PDFs ;).

Posted
Comparing velocity clamp experiments

I've been spending some time comparing my force spectroscopy data with Marisa's, and the main problem has been normalizing the data collected using different systems. Marisa runs experiments using some home-grown LabVIEW software, which saves the data in IGOR binary wave (IBW) files. From my point of view, this is not the best approach, but it has been stable over the time we've been running experiments.

I run my own experiments using my own code based on pyafm, saving the data in HDF5 files. I think this approach is much stronger, but it has been a bit of a moving target, and I've spent a good deal of my time working on the framework instead of running experiments.

Both approaches save the digitized voltages that we read/write during an experiment, but other constants and calibration terms are not always recorded. My pyafm-based software has been gradually getting better in this regard, especially since I moved to h5config-based initialization. Anyhow, here's a quick run-through of all the important terms.

For the TL;DR crowd, crunch.py is my Python module that crunches your input and prints out all the intermediate constants discussed below. To use it on your own data, you'll probably have to tweak the bits that read in the data to use your own format.

# Calibrating the piezo

## Calibration grid

We control the surface-tip distance by driving a piezo tube. We want to know the correspondence between the driving voltage and the piezo position, so we calibrate the piezo by imaging a sample with square pits of a known depth.

In this trace, I swept the $x$ position across most of the range of my 16-bit DAC. For each $x$ position, I adjusted the $z$ position until the measured cantilever deflection $d$ crossed a setpoint. This gives the $z$ voltage required to reach the surface as a function of $x$. You can see the 200 nm deep square pits, as well as a reasonable amount of $x$/$z$ crosstalk.

Sometimes grid calibrations are even less convincing than the example shown above. For example, I have traces with odd shoulders on the sides of the pits:

This is one of the better piezo calibration images I acquired for the piezo used in the pull below, so I'll use numbers from the funky imaging for the remainder of this analysis. This piezo has less $x$/$y$ range than the one used to measure the first calibration trace, which is why I was unable to acquire a full pit (or pits).

The calibration constant ${\alpha }_{z}$ is the number of $z$ bits per depth meter:

(1)${\alpha }_{z}=\frac{\text{grid piezo bits}}{\text{grid depth}}=\frac{2779\pm 87\,\text{bits}}{200\,\text{nm}}=\left(1.39\pm 0.04\right)\cdot {10}^{10}\,\text{bits/m}\,.$

Related conversion factors are

(2)$\begin{array}{rl}{\gamma }_{z}& =\frac{\text{piezo volts}}{\text{piezo bits}}=\frac{20\,\text{piezo volts}}{\text{output volt}}\cdot \frac{10\,\text{output volts}}{{2}^{15}\,\text{piezo bits}}=6.10\cdot {10}^{-3}\,\text{V/bit}\\ {\sigma }_{z}& =\frac{\text{grid piezo volts}}{\text{grid depth}}={\gamma }_{z}{\alpha }_{z}=8.48\cdot {10}^{7}\,\text{V/m}\,.\end{array}$

This is roughly in the ballpark of our piezo (serial number 2253E), which is spec'd at 8.96 nm/V along the $z$ axis, or $1.12\cdot {10}^{8}$ V/m.
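
These conversions are simple enough to script; here's the arithmetic from equations (1) and (2) in Python (using the instrument parameters quoted above):

```python
grid_bits = 2779       # z bits needed to cross a grid pit
grid_depth = 200e-9    # pit depth, in meters
alpha_z = grid_bits / grid_depth  # equation (1), bits/m

amplifier_gain = 20    # piezo volts per output volt
output_range = 10      # output volts
dac_bits = 2**15       # 16-bit DAC, with one bit for the sign
gamma_z = amplifier_gain * output_range / dac_bits  # V/bit
sigma_z = gamma_z * alpha_z       # equation (2), V/m

print('alpha_z = {:.2e} bits/m'.format(alpha_z))
print('sigma_z = {:.2e} V/m'.format(sigma_z))
```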

## Laser interference

Another way to ballpark a piezo calibration that is much closer to a force spectroscopy pull is to use the laser interference pattern as a standard length. The laser for our microscope has a wavelength of 670 nm. We'll assume a geometric gain of

(3)${g}_{\lambda }=\frac{\text{increased laser path}}{\text{piezo displacement}}\approx 2\,.$

Measuring the length of an interference wave in bits then gives a standard equivalent to the known-depth pits of the calibration grid.

(4)$\begin{array}{rl}{\alpha }_{z}& =\frac{\text{interference piezo bits}}{\text{interference depth}}=\frac{{g}_{\lambda }}{\lambda }\cdot \text{interference piezo bits}\\ & =\frac{2}{670\cdot {10}^{-9}\,\text{m}}\cdot \left(28935-23721\right)\,\text{bits}=1.56\cdot {10}^{10}\,\text{bits/m}\,.\end{array}$

The other piezo calibration parameters are found exactly as in the calibration grid case.

(5)${\sigma }_{z}={\gamma }_{z}{\alpha }_{z}=9.52\cdot {10}^{7}\,\text{V/m}\,,$

which is fairly close to both the spec'd value and the grid calibration value.

# Calibrating the photodiode

During experiments, we measure cantilever deflection via the top and bottom segments of a photodiode. We need to convert this deflection voltage into a deflection distance, so we'll use the already-calibrated piezo. When the tip is in contact with the surface, we can move the surface a known distance (using the piezo) and record the change in deflection.

The calibration constant ${\alpha }_{d}$ is the number of diode bits per piezo bit.

(6)${\alpha }_{d}=\frac{\text{bump diode bits}}{\text{bump piezo bits}}=2.24\,\text{diode bits/piezo bit}\,.$

Related conversion factors are

(7)$\begin{array}{rl}{\gamma }_{d}& =\frac{\text{diode volts}}{\text{diode bits}}=\frac{10\,\text{input volts}}{{2}^{15}\,\text{diode bits}}\cdot \frac{\text{diode volts}}{1\,\text{input volt}}=3.05\cdot {10}^{-4}\,\text{V/bit}\\ {\sigma }_{d}& =\frac{\text{bump diode volts}}{\text{bump tip position}}={\gamma }_{d}{\alpha }_{d}{\alpha }_{z}=9.52\cdot {10}^{6}\,\text{V/m}\,.\end{array}$

# Calibrating the cantilever spring constant

To convert cantilever tip deflection to force, we need to know the spring constant of the cantilever. After bumping the surface, we move away from the surface and measure the cantilever's thermal vibration. We use the vibration data to calculate the spring constant using the equipartition theorem.

The deflection variance $⟨{d}^{2}⟩$ is measured in frequency space, where the power spectral density (PSD) is fitted to the expected PSD of a damped harmonic oscillator.

(8)$\begin{array}{rl}{\mathrm{PSD}}_{f}& =\frac{{G}_{1f}}{{\left({f}_{0}^{2}-{f}^{2}\right)}^{2}+{\beta }_{f}^{2}{f}^{2}}\\ ⟨{d}^{2}⟩& =\frac{\pi {G}_{1f}}{2{\beta }_{f}{f}_{0}^{2}}\\ \kappa & =\frac{2{\beta }_{f}{f}_{0}^{2}}{\pi {G}_{1f}}{\left({\alpha }_{d}{\alpha }_{z}\right)}^{2}{k}_{B}T\\ & =\frac{2\cdot \left(4.17\cdot {10}^{3}\,\text{Hz}\right)\cdot {\left(8.14\cdot {10}^{3}\,\text{Hz}\right)}^{2}}{\pi \cdot 3.10\cdot {10}^{13}\,{\text{bit}}^{2}\cdot {\text{Hz}}^{3}}\cdot {\left(2.24\cdot 1.39\cdot {10}^{10}\,\text{bit/m}\right)}^{2}\cdot \left(1.38\cdot {10}^{-23}\,\text{J/K}\right)\cdot \left(294\,\text{K}\right)=22.4\cdot {10}^{-3}\,\text{N/m}\end{array}$
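
In Python, plugging the fitted parameters into the equipartition expression (a transcription of equation (8), with the numbers quoted above):

```python
import math

beta_f = 4.17e3     # damping parameter, Hz
f_0 = 8.14e3        # resonant frequency, Hz
G_1f = 3.10e13      # fitted PSD magnitude, bit^2 * Hz^3
alpha_d = 2.24      # diode bits per piezo bit
alpha_z = 1.39e10   # piezo bits per meter
k_B = 1.38e-23      # Boltzmann constant, J/K
T = 294             # temperature, K

d2 = math.pi * G_1f / (2 * beta_f * f_0**2)    # <d^2>, in diode bits^2
kappa = (alpha_d * alpha_z)**2 * k_B * T / d2  # spring constant, N/m
print('kappa = {:.3f} N/m'.format(kappa))      # about 22e-3 N/m
```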

# Analyzing a velocity-clamp pull

The raw data from a velocity-clamp pull is an array of output voltages used to sweep the piezo (moving the surface away from the cantilever tip) and an array of input voltages from the photodiode (measuring cantilever deflection). There are a number of interesting questions you can ask about such data, so I'll break this up into a few steps. Let's call the raw piezo data (in bits) ${\alpha }_{v}$, with the contact-kink located at ${\alpha }_{v0}$. Call the raw deflection data (also in bits) ${\alpha }_{F}$, with the contact-kink located at ${\alpha }_{F0}$.

## Piezo displacement

Using the piezo calibration parameters, we can calculate the raw piezo position using

(9)${z}_{\text{piezo}}=\frac{{\alpha }_{v}-{\alpha }_{v0}}{{\alpha }_{z}}\phantom{\rule{thickmathspace}{0ex}},$

measured in meters.

## Surface contact region

During the initial portion of a velocity clamp pull, the cantilever tip is still in contact with the surface. This allows you to repeat the photodiode calibration, avoiding problems due to drift in laser alignment or other geometric issues. This gives a new set of diode calibration parameters ${\alpha }_{d}\prime$ and ${\sigma }_{d}\prime$ (it is unlikely that ${\gamma }_{d}$ has changed, but it's easy to rework the following arguments to include ${\gamma }_{d}\prime$ if you feel that it might have changed).

## Tension

We can use the new photodiode calibration and the cantilever's spring constant to calculate the force from the Hookean cantilever:

(10)$F=\frac{{\alpha }_{F}-{\alpha }_{F0}}{{\alpha }_{d}\prime {\alpha }_{z}}\cdot \kappa =\left({\alpha }_{F}-{\alpha }_{F0}\right)\cdot \frac{2{\beta }_{f}{f}_{0}^{2}}{\pi {G}_{1f}}\cdot \frac{{\alpha }_{d}^{2}{\alpha }_{z}}{{\alpha }_{d}\prime }\cdot {k}_{B}T\phantom{\rule{thickmathspace}{0ex}}.$

## Protein extension

As the piezo pulls on the cantilever/protein system, some of the increased extension is due to protein extension and the rest is due to cantilever extension. We can use Hooke's law to remove the cantilever extension, leaving only protein extension:

(11)${z}_{\text{protein}}={z}_{\text{piezo}}-\frac{F}{\kappa }=\frac{1}{{\alpha }_{z}}\left({\alpha }_{v}-{\alpha }_{v0}-\frac{{\alpha }_{F}-{\alpha }_{F0}}{{\alpha }_{d}\prime }\right)$
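
Equations (9) through (11) chain together naturally in code. Here is a minimal sketch (the bit arrays are made-up illustration values, and `alpha_d_c` stands in for the recalibrated contact-region slope ${\alpha }_{d}\prime$):

```python
import numpy as np

# Convert raw piezo and diode bit traces into piezo position, force, and
# protein extension, per equations (9)-(11).  All values illustrative.
alpha_z = 1.39e10   # piezo bits / meter
alpha_d_c = 2.24    # diode bits / piezo bit, recalibrated in contact region
kappa = 22.4e-3     # N/m, cantilever spring constant

raw_v = np.array([1000, 1100, 1200])  # piezo sweep (bits)
raw_F = np.array([500, 540, 580])     # photodiode deflection (bits)
v0, F0 = 1000, 500                    # contact-kink positions (bits)

z_piezo = (raw_v - v0) / alpha_z                  # eq. (9), meters
F = (raw_F - F0) / (alpha_d_c * alpha_z) * kappa  # eq. (10), newtons
z_protein = z_piezo - F / kappa                   # eq. (11), meters
```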

## Contour-space

In order to confirm the piezo calibration, we look at changes in the unfolded contour length due to domain unfolding ($\Delta L$). There have been a number of studies of titin I27, starting with Carrion-Vazquez et al., that show a contour length increase of 28.1 ± 0.17 nm. Rather than fitting each loading region with a worm-like chain (WLC) or other polymer model, it is easier to calculate $\Delta L$ by converting the abscissa to contour-length space (following Puchner et al.). While the WLC is commonly used, Puchner gets better fits using the freely rotating chain (FRC) model.

In crunch.py, I use either Bustamante's formula (WLC) or Livadaru's equation 49 (FRC) to calculate the contour length of a particular force/distance pair.

As you can see from the figure, my curves are spaced a bit too far apart. Because the contour length depends on $F$ as well as ${z}_{\text{protein}}$, it depends on the cantilever calibration (via ${\beta }_{f}$, ${f}_{0}$, ${G}_{1f}$, and ${\alpha }_{d}$) as well as the piezo calibration (via ${\alpha }_{z}$). This makes adjusting calibration parameters to match your $\Delta L$ with previously determined values a bit awkward. However, it is likely that my cantilever calibration is to blame. If we use the value of ${\alpha }_{z}$ from the interference measurement, we get

a spacing that is still a bit too wide, but much closer.

Posted
Freely rotating chains

Velocity clamp force spectroscopy pulls are often fit to polymer models such as the worm-like chain (WLC). However, Puchner et al. had the bright idea that, rather than fitting each loading region with a polymer model, it is easier to calculate the change in contour length by converting the abscissa to contour-length space. While the WLC is commonly used, Puchner gets better fits using the freely rotating chain (FRC) model.

Computing force-extension curves for either the WLC or FRC is complicated, and it is common to use interpolation formulas to estimate the curves. For the WLC, we use Bustamante's formula:

(1)${F}_{\mathrm{WLC}}\left(x\right)=\frac{{k}_{B}T}{p}\left[\frac{1}{4}\left(\frac{1}{{\left(1-\frac{x}{L}\right)}^{2}}-1\right)+\frac{x}{L}\right]$
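
In code, Bustamante's interpolation formula (equation 1) is a one-liner. This is a minimal sketch, not the crunch.py implementation; the default thermal energy assumes room temperature (about 4.1e-21 J near 298 K):

```python
# Bustamante's WLC interpolation formula, equation (1).
def F_wlc(x, L, p, k_B_T=4.11e-21):
    """WLC force (N) at extension x, contour length L, persistence
    length p (all lengths in meters)."""
    return (k_B_T / p) * (0.25 * (1.0 / (1.0 - x / L)**2 - 1.0) + x / L)
```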

For the FRC, Puchner uses Livadaru's equation 46.

(2)$\frac{{R}_{z}}{L}\approx \left\{\begin{array}{ll}\frac{\mathrm{fa}}{3{k}_{B}T}& \text{for}\frac{\mathrm{fb}}{{k}_{B}T}<\frac{b}{l}\\ 1-{\left(\frac{\mathrm{fl}}{4{k}_{B}T}\right)}^{-\frac{1}{2}}& \text{for}\frac{b}{l}<\frac{\mathrm{fb}}{{k}_{B}T}<\frac{l}{b}\\ 1-{\left(\frac{\mathrm{fb}}{{\mathrm{ck}}_{B}T}\right)}^{-1}& \text{for}\frac{l}{b}<\frac{\mathrm{fb}}{{k}_{B}T}\end{array}\phantom{\rule{thickmathspace}{0ex}}.$

Unfortunately, there are two typos in Livadaru's equation 46. It should read (confirmed by private communication with Roland Netz):

(3)$\frac{{R}_{z}}{L}\approx \left\{\begin{array}{ll}\frac{\mathrm{fa}}{3{k}_{B}T}& \text{for}\frac{\mathrm{fb}}{{k}_{B}T}<\frac{b}{l}\\ 1-{\left(\frac{4\mathrm{fl}}{{k}_{B}T}\right)}^{-\frac{1}{2}}& \text{for}\frac{b}{l}<\frac{\mathrm{fb}}{{k}_{B}T}<\frac{l}{b}\\ 1-{\left(\frac{\mathrm{cfb}}{{k}_{B}T}\right)}^{-1}& \text{for}\frac{l}{b}<\frac{\mathrm{fb}}{{k}_{B}T}\end{array}\phantom{\rule{thickmathspace}{0ex}}.$

Regardless of the form of Livadaru's equation 46, the suggested FRC interpolation formula is Livadaru's equation 49, which has continuous cross-overs between the various regimes and adds the possibility of elastic backbone extension.

(4)$\frac{{R}_{z}}{L}=1-{\left\{{\left({F}_{\text{WLC}}^{-1}\left[\frac{\mathrm{fl}}{{k}_{B}T}\right]\right)}^{\beta }+{\left(\frac{\mathrm{cfb}}{{k}_{B}T}\right)}^{\beta }\right\}}^{\frac{-1}{\beta }}+\frac{f}{\stackrel{˜}{\gamma }}\phantom{\rule{thickmathspace}{0ex}},$

where $l=b\frac{\mathrm{cos}\left(\gamma /2\right)}{\mid \mathrm{ln}\left(\mathrm{cos}\gamma \right)\mid }$ (Livadaru's equation 22) is the effective persistence length, $\beta$ determines the crossover sharpness, $\stackrel{˜}{\gamma }$ is the backbone stretching modulus, and ${F}_{\text{WLC}}^{-1}\left[x\right]$ is related to the inverse of Bustamante's interpolation formula,

(5)${F}_{\text{WLC}}\left[x\right]=\frac{3}{4}-\frac{1}{x}+\frac{{x}^{2}}{4}\phantom{\rule{thickmathspace}{0ex}}.$

By matching their interpolation formula with simulated FRCs, Livadaru suggests using $\beta =2$, $\stackrel{˜}{\gamma }=\infty$, and $c=2$. In his paper, Puchner suggests using $b=0.4$ nm and $\gamma ={22}^{\circ }$. However, when I contacted him and pointed out the typos in Livadaru's equation 46, he reran his analysis and got similar results using the corrected formula with $b=0.11$ nm and $\gamma ={41}^{\circ }$. This makes more sense because it gives a WLC persistence length similar to the one he used when fitting the WLC model:

(6)$l=b\frac{\mathrm{cos}\left(\gamma /2\right)}{\mid \mathrm{ln}\left(\mathrm{cos}\gamma \right)\mid }=0.366\text{nm}$

(vs. his WLC persistence length of $p=0.4$ nm).
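
Putting equations (4) and (5) together, the FRC interpolation can be sketched as below. This is an illustrative sketch, not the `inverse_frc` code in crunch.py; it uses $\beta =2$, $\stackrel{˜}{\gamma }=\infty$ (so the backbone-stretching term drops out), $c=2$, Puchner's revised $b$ and $\gamma$, and inverts ${F}_{\text{WLC}}\left[x\right]$ numerically by bisection:

```python
import math

def F_wlc_dimensionless(x):
    # Equation (5): F_WLC[x] = 3/4 - 1/x + x^2/4, monotonic for x > 0.
    return 0.75 - 1.0 / x + x**2 / 4.0

def F_wlc_inverse(phi, lo=1e-6, hi=1e6):
    # Bisect for the x > 0 with F_WLC[x] = phi.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if F_wlc_dimensionless(mid) < phi:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def frc_extension(f, b=0.11e-9, gamma=math.radians(41), c=2.0, beta=2.0,
                  k_B_T=4.06e-21):
    """Relative extension R_z/L of a freely rotating chain at force f (N),
    per Livadaru's equation 49 with gamma_tilde = infinity."""
    # Effective persistence length, Livadaru's equation 22.
    l = b * math.cos(gamma / 2) / abs(math.log(math.cos(gamma)))
    return 1.0 - (F_wlc_inverse(f * l / k_B_T)**beta
                  + (c * f * b / k_B_T)**beta)**(-1.0 / beta)
```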

In any event, the two models (WLC and FRC) give similar results for low to moderate forces, with the differences kicking in as $\mathrm{fb}/{k}_{B}T$ moves above $l/b$. For Puchner's revised numbers, this corresponds to

(7)$f>\frac{l}{b}\cdot \frac{{k}_{B}T}{b}=\frac{\mathrm{cos}\left(\gamma /2\right)}{\mid \mathrm{ln}\left(\mathrm{cos}\gamma \right)\mid }\cdot \frac{{k}_{B}T}{b}\approx 122\text{pN}\phantom{\rule{thickmathspace}{0ex}},$

assuming a temperature in the range of 300 K.
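
As a quick numeric check of equation (7) (a sketch using Puchner's revised parameters and the 294 K temperature from the spring-constant example above):

```python
import math

# Crossover force above which the WLC and FRC predictions diverge, eq. (7).
k_B = 1.38e-23          # J/K
T = 294.0               # K
b = 0.11e-9             # m, bond length (Puchner's revised value)
gamma = math.radians(41)

l_over_b = math.cos(gamma / 2) / abs(math.log(math.cos(gamma)))
f_cross = l_over_b * k_B * T / b
print(f_cross * 1e12)  # ~122 pN
```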

I've written an `inverse_frc` implementation in crunch.py for comparing velocity clamp experiments. I test the implementation with frc.py by regenerating Livadaru et al.'s figure 14.

Posted