
I've been trying to wrap my head around factor analysis as a theory for designing and understanding test and survey results. This has turned out to be another one of those fields where the going has been a bit rough. I think the key factors in making these older topics difficult are:

- “Everybody knows this, so we don't need to write up the details.”
- “Hey, I can do better than Bob if I just tweak this knob…”
- “I'll just publish this seminal paper behind a paywall…”

The resulting discussion ends up being overly complicated, and it's hard for newcomers to decide if people using similar terminology are in fact talking about the same thing.

Some of the better open sources for background have been Tucker and MacCallum's “Exploratory Factor Analysis” manuscript and Max Welling's notes. I'll use Welling's terminology for this discussion.

The basic idea of factor analysis is to model $d$ measurable attributes as generated by $k<d$ common factors and $d$ unique factors. With $d=4$ and $k=2$, you get something like:

Corresponding to the equation (Welling's eq. 1):

(1)$$x=Ay+\mu +\nu $$

The independent random variables $y$ are distributed according to a Gaussian with zero mean and unit variance ${\mathcal{G}}_{y}[0,I]$ (zero mean because constant offsets are handled by $\mu $; unit variance because scaling is handled by $A$). The independent random variables $\nu $ are distributed according to ${\mathcal{G}}_{\nu}[0,\Sigma ]$, with (Welling's eq. 2):

(2)$$\Sigma \equiv \text{diag}[{\sigma}_{1}^{2},\ldots ,{\sigma}_{d}^{2}]$$
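To make the generative model concrete, here is a quick sketch that samples synthetic data from it. The loadings, offsets, and noise widths below are made up for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
d, k, N = 4, 2, 1000           # attributes, common factors, samples
A = rng.normal(size=(d, k))    # made-up factor loadings
mu = np.array([1.0, 2.0, 3.0, 4.0])      # constant offsets
sigma = np.array([0.1, 0.2, 0.1, 0.3])   # unique noise variances

y = rng.normal(size=(N, k))                          # common factors, G_y[0, I]
nu = rng.normal(scale=np.sqrt(sigma), size=(N, d))   # unique noise, G_nu[0, Sigma]
x = y @ A.T + mu + nu                                # measured attributes

# Averaging out the random noise recovers mu (Welling's eq. 6).
x_mean = x.mean(axis=0)
```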

The matrix $A$ (linking common factors with measured attributes $x$) is referred to as the *factor weights* or *factor loadings*. Because the only source of constant offset is $\mu $, we can calculate it by averaging out the random noise (Welling's eq. 6):

(3)$$\mu =\frac{1}{N}{\sum}_{n=1}^{N}{x}_{n}$$

where $N$ is the number of measurements (survey responders) and ${x}_{n}$ is the response vector for the $n^\text{th}$ responder. How do we find $A$ and $\Sigma $? This is the tricky bit, and there are a number of possible approaches. Welling suggests using expectation maximization (EM), and there's an excellent example of the procedure with a colorblind experimenter drawing colored balls in his [EM notes][EM] (to test my understanding, I wrote [color-ball.py](../../posts/Factor_analysis/color-ball.py)). To simplify calculations, Welling defines (before eq. 15):

(4)$$\begin{array}{rl}A\prime & \equiv [A,\mu ]\\ y\prime & \equiv [{y}^{T},1{]}^{T}\end{array}$$

which reduce the model to

(5)$$x=A\prime y\prime +\nu $$

After some manipulation Welling works out the maximizing updates (eq'ns 16 and 17):

(6)$$\begin{array}{rl}A{\prime}^{\text{new}}& =\left({\sum}_{n=1}^{N}{x}_{n}E[y\prime \mid {x}_{n}{]}^{T}\right){\left({\sum}_{n=1}^{N}E[y\prime y{\prime}^{T}\mid {x}_{n}]\right)}^{-1}\\ {\Sigma}^{\text{new}}& =\frac{1}{N}{\sum}_{n=1}^{N}\text{diag}[{x}_{n}{x}_{n}^{T}-A{\prime}^{\text{new}}E[y\prime \mid {x}_{n}]{x}_{n}^{T}]\end{array}$$

The expectation values used in these updates are given by (Welling's eq'ns 12 and 13):

(7)$$\begin{array}{rl}E[y\mid {x}_{n}]& ={A}^{T}(A{A}^{T}+\Sigma {)}^{-1}({x}_{n}-\mu )\\ E[y{y}^{T}\mid {x}_{n}]& =I-{A}^{T}(A{A}^{T}+\Sigma {)}^{-1}A+E[y\mid {x}_{n}]E[y\mid {x}_{n}{]}^{T}\end{array}$$

# Survey analysis

Enough abstraction! Let's look at an example: [survey results][survey]:

```
>>> import numpy
>>> scores = numpy.genfromtxt('Factor_analysis/survey.data', delimiter='\t')
>>> scores
array([[ 1.,  3.,  4.,  6.,  7.,  2.,  4.,  5.],
       [ 2.,  3.,  4.,  3.,  4.,  6.,  7.,  6.],
       [ 4.,  5.,  6.,  7.,  7.,  2.,  3.,  4.],
       [ 3.,  4.,  5.,  6.,  7.,  3.,  5.,  4.],
       [ 2.,  5.,  5.,  5.,  6.,  2.,  4.,  5.],
       [ 3.,  4.,  6.,  7.,  7.,  4.,  3.,  5.],
       [ 2.,  3.,  6.,  4.,  5.,  4.,  4.,  4.],
       [ 1.,  3.,  4.,  5.,  6.,  3.,  3.,  4.],
       [ 3.,  3.,  5.,  6.,  6.,  4.,  4.,  3.],
       [ 4.,  4.,  5.,  6.,  7.,  4.,  3.,  4.],
       [ 2.,  3.,  6.,  7.,  5.,  4.,  4.,  4.],
       [ 2.,  3.,  5.,  7.,  6.,  3.,  3.,  3.]])
```

`scores[i,j]` is the answer the $i^\text{th}$ respondent gave for the $j^\text{th}$ question. We're looking for underlying factors that can explain covariance between the different questions. Do the question answers ($x$) represent some underlying factors ($y$)? Let's start off by calculating $\mu $:

```
>>> def print_row(row):
...     print(' '.join('{:0.2f}'.format(x) for x in row))
>>> mu = scores.mean(axis=0)
>>> print_row(mu)
2.42 3.58 5.08 5.75 6.08 3.42 3.92 4.25
```

Next we need priors for $A$ and $\Sigma $. MDP has an implementation for [Python](../../posts/Python/), and their [FANode][] uses a Gaussian random matrix for $A$ and the diagonal of the score covariance for $\Sigma $. They also use the score covariance to avoid repeated summations over $n$.

```
>>> import mdp
>>> def print_matrix(matrix):
...     for row in matrix:
...         print_row(row)
>>> fa = mdp.nodes.FANode(output_dim=3)
>>> numpy.random.seed(1)  # for consistent doctest results
>>> responder_scores = fa(scores)  # common factors for each responder
>>> print_matrix(responder_scores)
-1.92 -0.45 0.00
0.67 1.97 1.96
0.70 0.03 -2.00
0.29 0.03 -0.60
-1.02 1.79 -1.43
0.82 0.27 -0.23
-0.07 -0.08 0.82
-1.38 -0.27 0.48
0.79 -1.17 0.50
1.59 -0.30 -0.41
0.01 -0.48 0.73
-0.46 -1.34 0.18
>>> print_row(fa.mu.flat)
2.42 3.58 5.08 5.75 6.08 3.42 3.92 4.25
>>> fa.mu.flat == mu  # MDP agrees with our earlier calculation
array([ True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)
>>> print_matrix(fa.A)  # factor weights for each question
0.80 -0.06 -0.45
0.17 0.30 -0.65
0.34 -0.13 -0.25
0.13 -0.73 -0.64
0.02 -0.32 -0.70
0.61 0.23 0.86
0.08 0.63 0.59
-0.09 0.67 0.13
>>> print_row(fa.sigma)  # unique noise for each question
0.04 0.02 0.38 0.55 0.30 0.05 0.48 0.21
```

Because the covariance is unaffected by the rotation $A\rightarrow AR$, the estimated weights $A$ and responder scores $y$ can be quite sensitive to the seed priors. The width $\Sigma $ of the unique noise $\nu $ is more robust, because $\Sigma $ is unaffected by rotations on $A$.

# Related tidbits

## Communality

The [communality][] ${h}_{i}^{2}$ of the $i^\text{th}$ measured attribute ${x}_{i}$ is the fraction of variance in the measured attribute which is explained by the set of common factors. Because the common factors $y$ have unit variance, the communality is given by:

(8)$${h}_{i}^{2}=\frac{{\sum}_{j=1}^{k}{A}_{\mathrm{ij}}^{2}}{{\sum}_{j=1}^{k}{A}_{\mathrm{ij}}^{2}+{\sigma}_{i}^{2}}$$

```
>>> factor_variance = numpy.array([sum(row**2) for row in fa.A])
>>> h = numpy.array(
...     [var/(var+sig) for var,sig in zip(factor_variance, fa.sigma)])
>>> print_row(h)
0.95 0.97 0.34 0.64 0.66 0.96 0.61 0.69
```

There may be some scaling issues in the communality due to deviations between the estimated $A$ and $\Sigma $ and the variations contained in the measured scores (why?):

```
>>> print_row(factor_variance + fa.sigma)
0.89 0.56 0.57 1.51 0.89 1.21 1.23 0.69
>>> print_row(scores.var(axis=0, ddof=1)) # total variance for each question
0.99 0.63 0.63 1.66 0.99 1.36 1.36 0.75
```

The proportion of total variation explained by the common factors is given by:

(9)$$\frac{{\sum}_{i=1}^{d}{\sum}_{j=1}^{k}{A}_{\mathrm{ij}}^{2}}{{\sum}_{i=1}^{d}\left({\sum}_{j=1}^{k}{A}_{\mathrm{ij}}^{2}+{\sigma}_{i}^{2}\right)}$$
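In code, using the (rounded) loadings and noise widths printed by the MDP session above, this works out to about 73%:

```python
import numpy as np

# fa.A and fa.sigma as printed in the session above (rounded to two decimals).
A = np.array([[0.80, -0.06, -0.45], [0.17, 0.30, -0.65],
              [0.34, -0.13, -0.25], [0.13, -0.73, -0.64],
              [0.02, -0.32, -0.70], [0.61, 0.23, 0.86],
              [0.08, 0.63, 0.59], [-0.09, 0.67, 0.13]])
sigma = np.array([0.04, 0.02, 0.38, 0.55, 0.30, 0.05, 0.48, 0.21])

common = (A**2).sum()           # variance explained by the common factors
total = common + sigma.sum()    # ... plus the unique noise
proportion = common / total     # about 0.73 for these numbers
```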

## Varimax rotation

As mentioned earlier, factor analysis generated loadings $A$
that are unique up to an arbitrary rotation $R$ (as you'd
expect for a $k$-dimensional Gaussian ball of factors $y$).
A number of schemes have been proposed to simplify the initial
loadings by rotating $A$ to reduce off-diagonal terms. One
of the more popular approaches is Henry Kaiser's *varimax rotation*
(unfortunately, I don't have access to either his thesis or
the subsequent paper). I did find (via
Wikipedia) Trevor Park's notes which have been
very useful.

The idea is to iterate rotations to maximize the raw varimax criterion (Park's eq. 1):

(10)$$V(A)={\sum}_{j=1}^{k}\left[\frac{1}{d}{\sum}_{i=1}^{d}{A}_{\mathrm{ij}}^{4}-{\left(\frac{1}{d}{\sum}_{i=1}^{d}{A}_{\mathrm{ij}}^{2}\right)}^{2}\right]$$

Rather than computing a $k$-dimensional rotation in one sweep, we'll iterate through 2-dimensional rotations (on successive column pairs) until convergence. For a particular column pair $(p,q)$, the rotation matrix ${R}^{*}$ is the usual rotation matrix:

(11)$${R}^{*}=\left(\begin{array}{cc}\mathrm{cos}{\varphi}^{*}& -\mathrm{sin}{\varphi}^{*}\\ \mathrm{sin}{\varphi}^{*}& \mathrm{cos}{\varphi}^{*}\end{array}\right)$$

where the optimum rotation angle ${\varphi}^{*}$ is (Park's eq. 3):

(12)$${\varphi}^{*}=\frac{1}{4}\angle \left({\sum}_{j=1}^{d}{\left({A}_{\mathrm{jp}}+i{A}_{\mathrm{jq}}\right)}^{4}-\frac{1}{d}{\left({\sum}_{j=1}^{d}{\left({A}_{\mathrm{jp}}+i{A}_{\mathrm{jq}}\right)}^{2}\right)}^{2}\right)$$

where $i\equiv \sqrt{-1}$.
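As a sketch of the whole procedure (my own condensation, not code from Park's notes), sweeping 2-dimensional rotations over column pairs until the optimal angle for every pair is numerically zero:

```python
import numpy as np

def raw_varimax(A):
    """Raw varimax criterion: summed variance of the squared loadings
    in each column."""
    A2 = A**2
    return ((A2**2).mean(axis=0) - A2.mean(axis=0)**2).sum()

def varimax(A, tol=1e-8, max_iter=100):
    """Rotate the loading matrix A to maximize the raw varimax
    criterion via successive 2-D column-pair rotations."""
    A = A.copy()
    d, k = A.shape
    for _ in range(max_iter):
        converged = True
        for p in range(k - 1):
            for q in range(p + 1, k):
                # Optimal angle for this pair from the complex form:
                # phi* = (1/4) angle(sum(z**4) - sum(z**2)**2 / d),
                # with z_j = A[j,p] + i A[j,q].
                z = A[:, p] + 1j * A[:, q]
                phi = np.angle((z**4).sum() - (z**2).sum()**2 / d) / 4
                if abs(phi) > tol:
                    converged = False
                    c, s = np.cos(phi), np.sin(phi)
                    A[:, [p, q]] = A[:, [p, q]] @ np.array([[c, -s], [s, c]])
        if converged:
            break
    return A
```

Rotating a near-simple-structure loading matrix away from its optimum and running `varimax` should recover (at least) the original criterion value.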

# Nomenclature

- ${A}_{\mathrm{ij}}$
- The element from the ${i}^{\text{th}}$ row and ${j}^{\text{th}}$
column of a matrix $A$. For example, here is a 2-by-3
matrix in terms of its components:
(13)$$A=\left(\begin{array}{ccc}{A}_{11}& {A}_{12}& {A}_{13}\\ {A}_{21}& {A}_{22}& {A}_{23}\end{array}\right)$$
- ${A}^{T}$
- The transpose of a matrix (or vector) $A$. ${A}_{\mathrm{ij}}^{T}={A}_{\mathrm{ji}}$
- ${A}^{-1}$
- The inverse of a matrix $A$. ${A}^{-1}\cdot A=1$
- $\text{diag}[A]$
- A matrix containing only the diagonal elements of $A$, with the off-diagonal values set to zero.
- $E[f(x)]$
- Expectation value for a function $f$ of a random variable $x$. If the probability density of $x$ is $p(x)$, then $E[f(x)]=\int dx\,p(x)f(x)$. For example, $E[1]=\int dx\,p(x)=1$.
- $\mu $
- The mean of a random variable $x$ is given by $\mu =E[x]$.
- $\Sigma $
- The covariance of a random variable $x$ is given by $\Sigma =E[(x-\mu )(x-\mu {)}^{T}]$. In the factor analysis model discussed above, $\Sigma $ is restricted to a diagonal matrix.
- ${\mathcal{G}}_{x}[\mu ,\Sigma ]$
- A Gaussian probability density for the random variables
$x$ with a mean $\mu $ and a covariance
$\Sigma $.
(14)$${\mathcal{G}}_{x}[\mu ,\Sigma ]=\frac{1}{(2\pi {)}^{\frac{D}{2}}\sqrt{\mathrm{det}[\Sigma ]}}{e}^{-\frac{1}{2}(x-\mu {)}^{T}{\Sigma}^{-1}(x-\mu )}$$
- $p(y\mid x)$
- Probability of $y$ occurring given that $x$ occurred. This is commonly used in Bayesian statistics.
- $p(x,y)$
- Probability of $y$ and $x$ occurring simultaneously (the joint density). $p(x,y)=p(x\mid y)p(y)$
- $\angle (z)$
- The angle of $z$ in the complex plane. $\angle ({\mathrm{re}}^{i\theta})=\theta $.

Note: if you have trouble viewing some of the more obscure Unicode used in this post, you might want to install the STIX fonts.

I've been spending some time comparing my force spectroscopy data with Marisa's, and the main problem has been normalizing the data collected using different systems. Marisa runs experiments using some home-grown LabVIEW software, which saves the data in IGOR binary wave (IBW) files. From my point of view, this is not the best approach, but it has been stable over the time we've been running experiments.

I run my own experiments using my own code based on pyafm, saving the data in HDF5 files. I think this approach is much stronger, but it has been a bit of a moving target, and I've spent a good deal of my time working on the framework instead of running experiments.

Both approaches save the digitized voltages that we read/write during an experiment, but other constants and calibration terms are not always recorded. My pyafm-based software has been gradually getting better in this regard, especially since I moved to h5config-based initialization. Anyhow, here's a quick run-through of all the important terms.

For the TL;DR crowd, crunch.py is my Python module that crunches your input and prints out all the intermediate constants discussed below. To use it on your own data, you'll probably have to tweak the bits that read in the data to use your own format.

# Calibrating the piezo

## Calibration grid

We control the surface-tip distance by driving a piezo tube. We want to know the correspondence between the driving voltage and the piezo position, so we calibrate the piezo by imaging a sample with square pits of a known depth.

In this trace, I swept the $x$ position across most of the range of my 16-bit DAC. For each $x$ position, I adjusted the $z$ position until the measured cantilever deflection $d$ crossed a setpoint. This gives the $z$ voltage required to reach the surface as a function of $x$. You can see the 200 nm deep square pits, as well as a reasonable amount of $x$/$z$ crosstalk.

Sometimes grid calibrations are even less convincing than the example shown above. For example, I have traces with odd shoulders on the sides of the pits:

This is one of the better piezo calibration images I acquired for the piezo used in the pull below, so I'll use numbers from the funky imaging for the remainder of this analysis. This piezo has less $x$/$y$ range than the one used to measure the first calibration trace, which is why I was unable to acquire a full pit (or pits).

The calibration constant ${\alpha}_{z}$ is the number of $z$ bits per depth meter:

Related conversion factors are

This is roughly in the ballpark of our piezo (serial number 2253E) which is spec'd at 8.96 nm/V along the $z$ axis, which comes out to $1.12\cdot {10}^{8}$ V/m.
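Here is a sketch of the unit bookkeeping with made-up numbers (the real values come from the grid image and the DAC settings; the $\gamma_z$ and $\sigma_z$ names follow the same pattern as the diode constants used later):

```python
# Made-up example numbers, just to show how the units plumb together.
pit_depth = 200e-9      # m, known depth of the calibration pits
pit_bits = 2300         # z-bit step measured across a pit (fabricated)
alpha_z = pit_bits / pit_depth   # bits / m

bits = 2**16            # 16-bit DAC
v_range = 20.0          # DAC output range in volts (fabricated)
gamma_z = bits / v_range         # bits / V
sigma_z = alpha_z / gamma_z      # V / m, compare with the spec'd value
```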

## Laser interference

Another way to ballpark a piezo calibration that is much closer to a force spectroscopy pull is to use the laser interference pattern as a standard length. The laser for our microscope has a wavelength of 670 nm. We'll assume a geometric gain of

Measuring the length of an interference wave in bits then gives a standard equivalent to the known-depth pits of the calibration grid.

The other piezo calibration parameters are found exactly as in the calibration grid case.

which is fairly close to both the spec'd value and grid calibration values.

# Calibrating the photodiode

During experiments, we measure cantilever deflection via the top and bottom segments of a photodiode. We need to convert this deflection voltage into a deflection distance, so we'll use the already-calibrated piezo. When the tip is in contact with the surface, we can move the surface a known distance (using the piezo) and record the change in deflection.

The calibration constant ${\alpha}_{d}$ is the number of diode bits per piezo bit.
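As a sketch, $\alpha_d$ is just the slope of deflection bits versus piezo bits while the tip is in contact (the bump data here is fabricated):

```python
import numpy as np

# Fabricated contact-region data: deflection bits vs. piezo z bits,
# with a true slope of 0.7 diode bits per piezo bit plus some noise.
rng = np.random.RandomState(0)
z_bits = np.arange(0.0, 1000.0)
d_bits = 0.7 * z_bits + 12 + rng.normal(0, 2, z_bits.size)

alpha_d = np.polyfit(z_bits, d_bits, 1)[0]  # diode bits per piezo bit
```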

Related conversion factors are

# Calibrating the cantilever spring constant

To convert cantilever tip deflection to force, we need to know the spring constant of the cantilever. After bumping the surface, we move away from the surface and measure the cantilever's thermal vibration. We use the vibration data to calculate the spring constant using the equipartition theorem.

The deflection variance $\langle {d}^{2}\rangle $ is measured in frequency space, where the power spectral density (PSD) is fitted to the expected PSD of a damped harmonic oscillator.
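The equipartition step itself is one line, $k = k_BT/\langle d^2\rangle$; here is a sketch with a made-up deflection variance:

```python
kB = 1.380649e-23   # J/K, Boltzmann constant
T = 300.0           # K
d2 = 2.0e-19        # m^2, fabricated deflection variance from the PSD fit
k = kB * T / d2     # N/m, cantilever spring constant (~0.02 N/m here)
```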

# Analyzing a velocity-clamp pull

The raw data from a velocity-clamp pull is an array of output voltages used to sweep the piezo (moving the surface away from the cantilever tip) and an array of input voltages from the photodiode (measuring cantilever deflection). There are a number of interesting questions you can ask about such data, so I'll break this up into a few steps. Let's call the raw piezo data (in bits) ${\alpha}_{v}$, with the contact-kink located at ${\alpha}_{v0}$. Call the raw deflection data (also in bits) ${\alpha}_{F}$, with the contact-kink located at ${\alpha}_{F0}$.

## Piezo displacement

Using the piezo calibration parameters, we can calculate the raw piezo position using

measured in meters.
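A sketch, with made-up values for $\alpha_z$ and the contact kink $\alpha_{v0}$:

```python
alpha_z = 1.15e10   # bits per meter (fabricated)
alpha_v0 = 15000    # contact-kink position in piezo bits (fabricated)

def piezo_position(alpha_v):
    """Convert raw piezo bits to meters relative to the contact kink."""
    return (alpha_v - alpha_v0) / alpha_z

# At the contact kink the position is zero by construction.
```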

## Surface contact region

During the initial portion of a velocity clamp pull, the cantilever tip is still in contact with the surface. This allows you to repeat the photodiode calibration, avoiding problems due to drift in laser alignment or other geometric issues. This gives a new set of diode calibration parameters ${\alpha}_{d}^{\prime}$ and ${\sigma}_{d}^{\prime}$ (it is unlikely that ${\gamma}_{d}$ has changed, but it's easy to rework the following arguments to include ${\gamma}_{d}^{\prime}$ if you feel that it might have changed).

## Tension

We can use the new photodiode calibration and the cantilever's spring constant to calculate the force from the Hookean cantilever:

## Protein extension

As the piezo pulls on the cantilever/protein system, some of the increased extension is due to protein extension and the rest is due to cantilever extension. We can use Hooke's law to remove the cantilever extension, leaving only protein extension:
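Both steps are one-liners; a sketch (the function names are mine, not from crunch.py):

```python
def tension(d, k):
    """Force (N) on the Hookean cantilever from its deflection d (m),
    given spring constant k (N/m)."""
    return k * d

def protein_extension(z_piezo, d):
    """Piezo displacement minus the cantilever's own deflection (m)."""
    return z_piezo - d

# e.g. a 10 nm deflection on a 0.05 N/m lever is a 500 pN tension,
# so of 100 nm of piezo travel only 90 nm stretched the protein.
```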

## Contour-space

In order to confirm the piezo calibration, we look at changes in the unfolded contour length due to domain unfolding ($\Delta L$). There have been a number of studies of titin I27, starting with Carrion-Vazquez et al., that show a contour length increase of 28.1 ± 0.17 nm. Rather than fitting each loading region with a worm-like chain (WLC) or other polymer model, it is easier to calculate $\Delta L$ by converting the abscissa to contour-length space (following Puchner et al.). While the WLC is commonly used, Puchner gets better fits using the freely rotating chain (FRC) model.

In crunch.py, I use either Bustamante's formula (WLC) or Livadaru's equation 49 (FRC) to calculate the contour length of a particular force/distance pair.

As you can see from the figure, my curves are spaced a bit too far apart. Because the contour length depends on $F$ as well as ${z}_{\text{protein}}$, it depends on the cantilever calibration (via ${\beta}_{f}$, ${f}_{0}$, ${G}_{1f}$ and ${\alpha}_{d}$) as well as the piezo calibration (via ${\alpha}_{z}$). This makes adjusting calibration parameters in an attempt to match your $\Delta L$ with previously determined values a bit awkward. However, it is likely that my cantilever calibration is to blame. If we use the value of ${\alpha}_{z}$ from the interference measurement we get

Which is still a bit too wide, but is much closer.

Velocity clamp force spectroscopy pulls are often fit to polymer models such as the worm-like chain (WLC). However, Puchner et al. had the bright idea that, rather than fitting each loading region with a polymer model, it is easier to calculate the change in contour length by converting the abscissa to contour-length space. While the WLC is commonly used, Puchner gets better fits using the freely rotating chain (FRC) model.

Computing force-extension curves for either the WLC or FRC is complicated, and it is common to use interpolation formulas to estimate the curves. For the WLC, we use Bustamante's formula:

$$F(x)=\frac{{k}_{B}T}{p}\left[\frac{1}{4{\left(1-x/L\right)}^{2}}-\frac{1}{4}+\frac{x}{L}\right]$$
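A sketch of Bustamante's interpolation formula in code (force as a function of extension $x$, contour length $L$, and persistence length $p$):

```python
def wlc_force(x, L, p, kBT=4.11e-21):
    """Bustamante's WLC interpolation formula.

    x: extension (m), L: contour length (m), p: persistence length (m),
    kBT: thermal energy (J, ~room temperature by default).  Returns N.
    """
    r = x / L
    return (kBT / p) * (1.0 / (4 * (1 - r)**2) - 0.25 + r)
```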

For the FRC, Puchner uses Livadaru's equation 46.

Unfortunately, there are two typos in Livadaru's equation 46. It should read (confirmed by private communication with Roland Netz).

Regardless of the form of Livadaru's equation 46, the suggested FRC interpolation formula is Livadaru's equation 49, which has continuous cross-overs between the various regimes and adds the possibility of elastic backbone extension.

where $l=b\frac{\mathrm{cos}(\gamma /2)}{\mid \mathrm{ln}(\mathrm{cos}\gamma )\mid}$ (Livadaru's equation 22) is the effective persistence length, $\beta $ determines the crossover sharpness, $\tilde{\gamma}$ is the backbone stretching modulus, and ${F}_{\text{WLC}}^{-1}[x]$ is related to the inverse of Bustamante's interpolation formula,

By matching their interpolation formula with simulated FRCs, Livadaru suggests using $\beta =2$, $\tilde{\gamma}=\mathrm{\infty}$, and $c=2$. In his paper, Puchner suggests using $b=0.4$ nm and $\gamma ={22}^{\circ}$. However, when I contacted him and pointed out the typos in Livadaru's equation 46, he reran his analysis and got similar results using the corrected formula with $b=0.11$ nm and $\gamma ={41}^{\circ}$. This makes more sense because it gives a WLC persistence length similar to the one he used when fitting the WLC model:

(vs. his WLC persistence length of $p=0.4$ nm).
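Checking that claim numerically (Livadaru's eq. 22 with Puchner's revised parameters):

```python
import math

b = 0.11e-9                # m, monomer length (Puchner's revised value)
gamma = math.radians(41)   # bond angle (Puchner's revised value)

# Effective persistence length, Livadaru's eq. 22.
l = b * math.cos(gamma / 2) / abs(math.log(math.cos(gamma)))
# l comes out near 0.37 nm, close to the 0.4 nm WLC persistence length.
```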

In any event, the two models (WLC and FRC) give similar results for low to moderate forces, with the differences kicking in as $Fb/{k}_{B}T$ moves above $l/b$. For Puchner's revised numbers, this corresponds to

assuming a temperature in the range of 300 K.
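Working out that crossover force for the revised parameters:

```python
import math

kB, T = 1.380649e-23, 300.0   # J/K and K
b = 0.11e-9
gamma = math.radians(41)
l = b * math.cos(gamma / 2) / abs(math.log(math.cos(gamma)))

# FRC and WLC diverge once F b / kB T exceeds l / b, i.e. around
# F ~ (l / b) * kB T / b, on the order of 100 pN for these numbers.
F_cross = (l / b) * kB * T / b
```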

I've written an `inverse_frc`

implementation in
crunch.py for
comparing velocity clamp experiments. I test the implementation
with frc.py by
regenerating Livadaru et al.'s figure 14.

My wife was recently reviewing some pulse oximeter notes while working a round of anesthesia. It took us a while to trace the logic through all the symbol changes and notation shifts, and my wife's math-discomfort and my bio-discomfort made us more cautious than it turns out we needed to be. There are a number of nice review articles out there that I turned up while looking for explicit derivations, but by the time I'd found them, my working notes had gotten fairly well polished themselves. So here's my contribution to the pulse-ox noise ;). Highlights include:

- Short and sweet (with pictures)
- Symbols table
- Baby steps during math manipulation

# Oxygen content

The circulatory system distributes oxygen (O₂) throughout the body. The amount of O₂ at any given point is measured by the O₂ content ([O₂]), usually given in $\frac{\text{mL O₂ at BTP}}{\text{dL blood}}$ (BTP is my acronym for body temperature and pressure). Most transported O₂ is bound to hemoglobin (Hb), but there is also free O₂ dissolved directly in the plasma and cytoplasm of suspended cells.

$$[\text{O₂}]=a\,\text{[Hb]}\,{\text{S}}_{\text{O₂}}+b\,{\text{P}}_{\text{O₂}}$$

where ${\text{S}}_{\text{O₂}}$ is the Hb's O₂ saturation and ${\text{P}}_{\text{O₂}}$ is the O₂ partial pressure. Don't worry about the coefficients $a$ and $b$ yet, we'll get back to them in a second.

The amount of dissolved O₂ is given by its partial pressure (${\text{P}}_{\text{O₂}}$). Partial pressures are generally given in mm of mercury (Hg) at standard temperature and pressure (STP). Because the partial pressure changes as blood flows through the body, an additional specifier $i$ may be added ($\text{P}{i}_{\text{O₂}}$) to clarify the measurement location.

| $i$ | Full symbol | Location descriptor |
|---|---|---|
| a | ${\text{Pa}}_{\text{O₂}}$ | arterial |
| p | ${\text{Pp}}_{\text{O₂}}$ | peripheral or pulsatile |
| t | ${\text{Pt}}_{\text{O₂}}$ | tissue |
| v | ${\text{Pv}}_{\text{O₂}}$ | venous |

O₂ is carried in the blood primarily through binding to hemoglobin monomers (Hb), with each monomer potentially binding a single O₂. Oxygen saturation (${\text{S}}_{\text{O₂}}$) is the fraction of hemoglobin monomers (Hb) that have bound an oxygen molecule (O₂).

$${\text{S}}_{\text{O₂}}=\frac{[\text{HbO₂}]}{\text{[Hb]}}$$

The ratio of concentrations, ${\text{S}}_{\text{O₂}}$, is unitless. It is often expressed as a percentage. [Hb] is often given in g/dL. As with partial pressures, an additional specifier $i$ may be added ($\text{S}{i}_{\text{O₂}}$) to clarify the measurement location (${\text{Sa}}_{\text{O₂}}$, ${\text{Sp}}_{\text{O₂}}$, …).

Now we can take a second look at our O₂ content formula. The coefficient $a$ must convert g/dL to $\frac{\text{mL O₂ at BTP}}{\text{dL blood}}$. Using the molecular weight of Hb and the volume of a mole of gas at STP:

where $\chi $ is a pure number (we're just working out the unit conversion here, not converting a particular Hb concentration). Therefore, $a=1.32\frac{\text{mL O₂ at BTP}}{\text{g Hb}}$. The powers that be seem to have used a slightly different density, since the more commonly used value is 5% higher at $1.39$. Possibly someone actually measured the density of O₂ at BTP, because BTP is not STP, and O₂ is not an ideal gas.

The coefficient $b$ must convert mm Hg at STP to $\frac{\text{mL O₂ at BTP}}{\text{dL blood}}$. Empirical experiments (?) give a value of $b=0.003\frac{\text{mL O₂ at BTP}}{\text{dL blood}\cdot \text{mm Hg at STP}}$. Now we can write out the familiar form

Reasonable levels are

| Quantity | Value |
|---|---|
| [Hb] | $14\frac{\text{g Hb}}{\text{dL blood}}$ |
| ${\text{S}}_{\text{O₂}}$ | 98% |
| ${\text{P}}_{\text{O₂}}$ | 100 mm Hg at STP |
| $1.39\frac{\text{mL O₂}}{\text{g Hb}}\text{[Hb]}{\text{S}}_{\text{O₂}}$ | $19.1\frac{\text{mL O₂ at BTP}}{\text{dL blood}}$ |
| $0.003\frac{\text{mL O₂ at BTP}}{\text{dL blood}\cdot \text{mm Hg at STP}}{\text{P}}_{\text{O₂}}$ | $0.300\frac{\text{mL O₂ at BTP}}{\text{dL blood}}$ |

Because the dissolved O₂ has such a tiny contribution (1.5% of the total in my example), it is often measured at STP rather than BTP. Sometimes it is dropped from the calculation entirely. We focus on the more important $\text{[Hb]}{\text{S}}_{\text{O₂}}$ in the next section.
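Plugging the reasonable levels above into the content formula:

```python
Hb = 14.0      # g Hb / dL blood
S_O2 = 0.98    # fractional saturation
P_O2 = 100.0   # mm Hg at STP

bound = 1.39 * Hb * S_O2    # mL O2 (BTP) / dL blood, carried on Hb
dissolved = 0.003 * P_O2    # mL O2 (BTP) / dL blood, dissolved
total = bound + dissolved   # ~19.4; dissolved is only ~1.5% of the total
```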

# Oxygen saturation

The preceding discussion used $\text{[Hb]}{\text{S}}_{\text{O₂}}$ to represent the concentration of HbO₂ complexes. This was useful while we were getting our bearings, but now we will replace that term with a more detailed model. Let us sort the Hb monomers into species:

- Hb: all hemoglobin monomers
- HbO₂: monomers complexed with O₂
- HHb: reduced Hb (not complexed with O₂)
- dysHb: dys-hemoglobin (cannot complex with O₂)
- MHb: methemoglobin
- HbCO: carboxyhemoglobin

These species are related as follows

$$\begin{array}{rl}\text{[Hb]}& =[\text{HbO₂}]+[\text{HHb}]+[\text{dysHb}]\\ \text{[dysHb]}& =[\text{MHb}]+[\text{HbCO}]+\ldots \end{array}$$

Because modern two-color pulse-oximeters don't
measure ${\text{S}}_{\text{O₂}}$ exactly, the related quantity that they
*do* measure has been given a name of its own: the *functional*
saturation (${\text{Sf}}_{\text{O₂}}$):

$${\text{Sf}}_{\text{O₂}}=\frac{[\text{HbO₂}]}{[\text{HbO₂}]+[\text{HHb}]}$$

Rephrasing our earlier saturation, we see

To avoid confusion with ${\text{Sf}}_{\text{O₂}}$, our
original ${\text{S}}_{\text{O₂}}$ is sometimes referred to as the
*fractional* saturation.
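A few lines of Python make the fractional/functional distinction concrete (a sketch; the species bookkeeping follows the list above, but the function name is mine):

```python
def saturations(hbo2, hhb, dyshb):
    """Fractional (S) and functional (Sf) O2 saturation.

    hbo2  -- [HbO2], monomers complexed with O2
    hhb   -- [HHb], reduced Hb (not complexed with O2)
    dyshb -- [dysHb], Hb that cannot complex with O2 (MHb, HbCO, ...)
    """
    s = hbo2 / (hbo2 + hhb + dyshb)   # fractional: divide by all Hb
    sf = hbo2 / (hbo2 + hhb)          # functional: dysHb is invisible
    return s, sf
```

With no dys-hemoglobin the two definitions coincide; any dysHb makes the functional saturation read higher than the fractional one.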

# The Beer-Lambert law

So far we've been labeling and defining attributes of the blood. The point of this exercise is to understand how a pulse oximeter measures them. People have known for a while that different hemoglobin complexes (HbO₂, HHb, MHb, HbCO, …) have different absorption spectra, and they have been using this difference since the 1930s to make pulse-oximeters based on two-color transmittance measurements (see Tremper 1989).

By passing different wavelengths of light through perfused tissue, we can measure the relative quantities of the different Hb species. The basis for this analysis comes from the Beer-Lambert law.

where ${I}_{0}$ is the incident intensity (entering the tissue), $I$ is the transmitted intensity (leaving the tissue), $c$ is the tissue density (concentration), $\epsilon $ is the extinction coefficient (molar absorptivity), and $L$ is the tissue thickness. Rephrasing the math as English, this means that the intensity drops off exponentially as you pass through the tissue, and more tissue (higher $c$ or $L$) or more opaque tissue (higher $\epsilon $) means you'll get less light out the far side. This is a very simple law, and the price of the simplicity is that it brushes all sorts of things under the rug. Still, it will help give us a basic idea of what is going on in a pulse-oximeter.
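As a minimal sketch of the law as just stated (using the base-$e$ convention of the equations that follow; the names are mine):

```python
import math

def transmitted(i0, c, eps, length):
    """Beer-Lambert law: I = I0 * exp(-c * eps * L).

    i0     -- incident intensity
    c      -- concentration of the absorbing substance
    eps    -- extinction coefficient
    length -- path length through the tissue
    """
    return i0 * math.exp(-c * eps * length)
```

Doubling the path length squares the attenuation factor, which is the exponential drop-off described above.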

Rather than treat the tissue as a single substance, let's use the Beer-Lambert law on a mixture of substances with concentrations ${c}_{1}$, ${c}_{2}$, … and extinction coefficients ${\epsilon}_{1}$, ${\epsilon}_{2}$, ….

We also notice that the intensities and extinction coefficients may all depend on the wavelength of light $\lambda $, so we should really write

Once a species has been isolated, a simple spectroscopy experiment can measure its extinction coefficient ${\epsilon}_{i\lambda}$ across a range of $\lambda $, and this has been done for all of our common Hb flavors. We need to play with the last equation to find a way to extract the unknown concentrations, which we can then use to calculate ${\text{Sf}}_{\text{O₂}}$ and ${\text{S}}_{\text{O₂}}$, which we can use in turn to calculate $\text{[O₂]}$.

Note that by increasing the number of LEDs (adding new $\lambda $) we increase the number of constraints on the unknown ${c}_{i}$. A traditional pulse-oximeter uses two LEDs, at 660 nm and 940 nm, to measure ${\text{Sf}}_{\text{O₂}}$ (related to [HbO₂] and [HHb]). More recent designs called pulse CO-oximeters use more wavelengths to allow measurement of quantities related to additional species (approaching the end goal of measuring ${\text{S}}_{\text{O₂}}$).
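With two LEDs and two Hb species, the system is exactly determined, so the concentrations follow from a 2×2 solve. A sketch by Cramer's rule (the function and its argument names are mine; species 1 and 2 are arbitrary labels):

```python
def solve_two_species(z660, z940, e1_660, e1_940, e2_660, e2_940):
    """Invert the pair of constraints
        Z_660 = c1*e1_660 + c2*e2_660
        Z_940 = c1*e1_940 + c2*e2_940
    for the concentrations (c1, c2), by Cramer's rule."""
    det = e1_660 * e2_940 - e2_660 * e1_940
    c1 = (z660 * e2_940 - e2_660 * z940) / det
    c2 = (e1_660 * z940 - z660 * e1_940) / det
    return c1, c2
```

Adding more LEDs than species turns this into an overdetermined system, which you would instead fit by least squares.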

Let us deal with the fact that there is a lot of stuff absorbing light that is not arterial blood (e.g. venous blood, other tissue, bone, etc). The good thing about this stuff is that it's just sitting there or moving through in a smooth fashion. Arterial blood is the only thing that's pulsing. Here's another figure from Tremper:

During a pulse, the pressure in the finger increases and non-arterial tissue is compressed, changing $L$ and ${c}_{i}$ from their trough values to peak values $L\prime $ and ${c}_{i}\prime $. Since the finger is big, the fractional change in width $\mathrm{d}L/L=(L\prime -L)/L$ is very small. Assuming the change in concentration is even smaller (since most liquids are fairly incompressible), we have

where $X={c}_{1}{\epsilon}_{1\lambda}+{c}_{2}{\epsilon}_{2\lambda}+\dots $ is just a placeholder to reduce clutter. $\mathrm{d}{I}_{\lambda}$ is the AC amplitude (the height of the wiggle on top of the detected light intensity, due to pulsatile arterial blood), while ${I}_{\lambda}$ is the DC amplitude (the height of the static base of the detected light intensity, due to everything else). This is actually a fairly sneaky step, because we can also use it to drop the DC components. Because we've assumed fixed concentrations (incompressible fluids), and there is no more DC material coming in during a pulse (by definition), the effective $L$ for the DC components does not change. Separating the DC and AC components and running through the derivative again, we have

where $Y={e}^{-({c}_{\text{DC}1}{\epsilon}_{\text{DC}1\lambda}+{c}_{\text{DC}2}{\epsilon}_{\text{DC}2\lambda}+\dots ){L}_{\text{DC}}}$ and $Z={c}_{\text{AC}1}{\epsilon}_{\text{AC}1\lambda}+{c}_{\text{AC}2}{\epsilon}_{\text{AC}2\lambda}+\dots $ are just placeholders to reduce clutter. Note that the last equation looks just like the previous one with the translation $X\to Z$. This means that if we stick to using the AC-DC intensity ratio ($\frac{\mathrm{d}{I}_{\lambda}}{{I}_{\lambda}}$) we can forget about the DC contribution completely.

If the changing-$L$-but-static-${L}_{\text{DC}}$ thing bothers you, you can imagine instead that ${L}_{\text{DC}}$ grows with $L$, but ${c}_{\text{DC}i}$ shrinks proportionally (to conserve mass). With this proportionate stretching, there is still no change in absorption for that component, so $\frac{\mathrm{d}}{\mathrm{d}L}\mathrm{exp}(-{c}_{\text{DC}i}{\epsilon}_{\text{DC}i\lambda}L)=0$ and we can still pull the DC terms out of the derivative as we just did.

Taking a ratio of these amplitudes at two different wavelengths, we get the optical density ratio ($R$)

because $\mathrm{d}L$ (the amount of finger expansion during a pulse) obviously doesn't depend on the color of light you are using. Plugging back in for $Z$,

Assuming, for now, that there are only two species of Hb—HbO₂ and HHb—we can solve for ${c}_{\text{AC}1}/{c}_{\text{AC}2}$.

So now we know [HbO₂]/[HHb] in terms of the measured quantity $R$ and the empirical values $\epsilon $.

Plugging in to our equation for ${\text{Sf}}_{\text{O₂}}$ to find the functional saturation:

As a check, we can rephrase this as

which matches Mendelson 1989, Eq. 8 with the translations:

- ${\text{Sf}}_{\text{O₂}}\to \mathrm{Sp}\text{O₂}$,
- $R\to R/\mathrm{IR}$,
- ${\epsilon}_{\text{AC}\mathrm{2,660}}\to {\epsilon}_{R}(\text{Hb})$,
- ${\epsilon}_{\text{AC}\mathrm{2,940}}\to {\epsilon}_{\mathrm{IR}}(\text{Hb})$,
- ${\epsilon}_{\text{AC}\mathrm{1,660}}\to {\epsilon}_{R}(\text{HbO₂})$, and
- ${\epsilon}_{\text{AC}\mathrm{1,940}}\to {\epsilon}_{\mathrm{IR}}(\text{HbO₂})$.
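Putting the pieces together, the ${\text{Sf}}_{\text{O₂}}(R)$ relation can be coded in a few lines (a sketch; the argument names are mine, and any extinction values you feed it should come from measured spectra):

```python
def functional_saturation(r, e_hbo2_660, e_hbo2_940, e_hhb_660, e_hhb_940):
    """Sf_O2 as a function of the optical density ratio R.

    Follows from c1/c2 = (e_hhb_660 - R*e_hhb_940) / (R*e_hbo2_940 - e_hbo2_660)
    and Sf = c1 / (c1 + c2), where species 1 is HbO2 and 2 is HHb.
    """
    num = e_hhb_660 - r * e_hhb_940
    den = e_hhb_660 - e_hbo2_660 + r * (e_hbo2_940 - e_hhb_940)
    return num / den
```

Pure HbO₂ gives $R={\epsilon}_{\text{HbO₂,660}}/{\epsilon}_{\text{HbO₂,940}}$ and a saturation of 1; pure HHb gives $R={\epsilon}_{\text{HHb,660}}/{\epsilon}_{\text{HHb,940}}$ and a saturation of 0, with the curve falling monotonically in between.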

And that is the first-order explanation of how a pulse-oximeter measures the functional saturation!

Reading extinction coefficients off the absorption figure, I get

which are comfortingly close to those given by Mendelson in Table 1. The corresponding ${\text{Sf}}_{\text{O₂}}(R)$ plot (from Tremper) is:

The theoretical plot was calculated using SfO2vR-theory.py and our ${\text{Sf}}_{\text{O₂}}$ equation. This is why it’s a good idea to use an empirical calibration curve! The concave theoretical curve is supported by Mendelson's figure 4.

# Nomenclature

- O₂
- Molecular oxygen
- [$x$]
- Concentration of $x$ in the blood
- BTP
- Body temperature and pressure
- STP
- Standard temperature and pressure
- ${\text{P}}_{\text{O₂}}$
- O₂ partial pressure
- $\text{P}{i}_{\text{O₂}}$
- O₂ partial pressure at location $i$
- ${\text{S}}_{\text{O₂}}$
- Fractional O₂ saturation
- $\text{S}{i}_{\text{O₂}}$
- O₂ fractional saturation at location $i$
- ${\text{Sf}}_{\text{O₂}}$
- Functional O₂ saturation
- Hg
- Mercury
- Hb
- Hemoglobin monomer
- HbO₂
- Hemoglobin monomers complexed with O₂
- HHb
- Reduced hemoglobin (not complexed with O₂)
- dysHb
- Dys-hemoglobin (cannot complex with O₂)
- MHb
- Methemoglobin
- HbCO
- Carboxyhemoglobin
- ${I}_{0\lambda}$
- Intensity of incident light at wavelength $\lambda $
- ${I}_{\lambda}$
- Intensity of transmitted light at wavelength $\lambda $
- ${c}_{i}$
- Concentration of light-absorbing species $i$
- ${c}_{\text{DC}i}$
- Concentration of the $i$th DC species
- ${c}_{\text{AC}i}$
- Concentration of the $i$th AC species
- ${\epsilon}_{i\lambda}$
- Extinction coefficient of species $i$ at wavelength $\lambda $
- ${\epsilon}_{\text{DC}i\lambda}$
- Extinction coefficient of the $i$th DC species at wavelength $\lambda $
- ${\epsilon}_{\text{AC}i\lambda}$
- Extinction coefficient of the $i$th AC species at wavelength $\lambda $
- $L$
- Length of tissue through which light must pass
- ${L}_{\text{DC}}$
- Diastolic finger width
- $R$
- Optical density ratio
- LED
- Light emitting diode

Over the last few days I've been trying to teach myself enough genetics to reconstruct Carrion-Vazquez's poly-I27 synthesis procedure. I'm not quite there yet, but I feel like I've made enough progress that it's worth posting my notes somewhere public in case they are useful to others.

# Overview

We buy our poly-I27 from AthenaES, who market it as I27O™.
Perusing their technical brief makes it clear that I27O™
corresponds to Carrion-Vazquez's I27^{RS}₈. In
Carrion-Vazquez' original paper they describe the synthesis of
both I27^{RS}₈ and a variant I27^{GLG}₁₂. Their
I27^{RS}₈ procedure is:

- Human cardiac muscle used to generate a cDNA library (Rief 1997)
- cDNA library amplified with PCR
  - The 5' primer contained a BamHI restriction site that permitted in-frame cloning of the monomer into the expression vector pQE30.
  - The 3' primer contained a BglII restriction site, two Cys codons located 3' to the BglII site and in-frame with the I27 domain, and two in-frame stop codons.
- The PCR product was cloned into pUC19 linearized with BamHI and SmaI.
- The 8-domain synthetic gene was constructed by iterative cloning of monomer into monomer, dimer into dimer, and tetramer into tetramer.
- The final construct contained eight direct repeats of the I27 domain, an amino-terminal His tag for purification, and two carboxyl-terminal Cys codons used for covalent attachment to the gold-covered coverslips.

They also give the full-length sequence of I27^{RS}₈:

```
Met-Arg-Gly-Ser-(His)₆-Gly-Ser-(I27-Arg-Ser)₇-I27-...-Cys-Cys
```

They point out that the Arg-Ser (RS) amino acid sequence is the BglII/BamHI hybrid site, which makes sense.

Back on the Athena site, they have a page describing their procedure (they reference the Carrion-Vazquez paper). They claim to use the restriction enzyme KpnI in addition to BamHI, BglII, and SmaI.

Carrion-Vazquez points to the following references:

- Kempe et al. 1985 (CV16), the source of the multi-step cloning technique.
- Rief et al. (CV10), for I27 subcloning.

## Rief

In their note 11, Rief et al. explain their synthesis procedure:

- λ cDNA library
- Titin fragments of interest were amplified by PCR
- cloned into pET 9d
- NH₂-terminal domain boundaries were as in Politou 1996.
- The clones were fused with an NH₂-terminal His₆ tag and a COOH-terminal Cys₂ tag for immobilization on solid surfaces.

which doesn't help me very much.

## Kempe

The Kempe article is more informative, focusing entirely on the synthesis procedure (albeit for a different gene). Their figure 2 outlines the general approach, which uses the following restriction enzymes: HindIII, BamHI, PstI, and BglII. I'll walk through their procedure in detail below.

## Genetic code

Wikipedia has a good page on the genetic code for converting between DNA/mRNA codons and amino acids. I've written up a little Python script, mRNAcode.py, to automate the conversion of various sequences, which helped me while I was writing this post. I'm sure there are tons of similar programs out there, so don't feel pressured to use mine ;).
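To give the flavor of such a conversion, here is a toy translator with only a handful of codons (this is not mRNAcode.py itself; a real table has all 64 entries):

```python
# A few standard-genetic-code entries (DNA codons), enough for the
# joints discussed below.
CODON_TABLE = {
    'ATG': 'Met', 'AGA': 'Arg', 'CGT': 'Arg', 'TCT': 'Ser',
    'TCC': 'Ser', 'GGA': 'Gly', 'TGC': 'Cys', 'TAG': 'STOP',
}

def translate(dna):
    """Translate an in-frame DNA coding sequence, codon by codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return [CODON_TABLE[c] for c in codons]
```

For example, both the BglII site AGATCT and the BglII/BamHI hybrid AGATCC translate in-frame to Arg-Ser, which is the RS joint discussed above.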

## Restriction enzymes

We'll use the following restriction enzymes:

BamHI

```
5' G|GATC C 3'
3' C CTAG|G 5'
```

BglI (N is any nucleotide)

```
5' GCCN NNN|NGGC 3'
3' CGGN|NNN NCCG 5'
```

BglII

```
5' A|GATC T 3'
3' T CTAG|A 5'
```

HindIII

```
5' A|AGCT T 3'
3' T TCGA|A 5'
```

KpnI

```
5' G GTAC|C 3'
3' C|CATG G 5'
```

PstI

```
5' C TGCA|G 3'
3' G|ACGT C 5'
```

SmaI

```
5' CCC|GGG 3'
3' GGG|CCC 5'
```

# Details

Here's my attempt to reconstruct the details of the polymer-cloning reactions, where they splice several copies of I27 into the expression plasmid.

## Kempe procedure

Inserted their poly-SP into pHK414 (I haven't been able to find any
online sources for pHK414. Kempe cites R.J. Watson et al.
*Expression of Herpes simplex virus type 1 and type 2 glyco-protein D
genes using the Escherichia coli lac promoter.* Y. Becker (Ed.),
*Recombinant DNA Research and Viruses.* Nijhoff, The Hague, 1985,
pp. 327-352.)

### Synthetic SP

```
HindIII. ,BamHI_.
| | Met Arg Pro Lys Pro Gln Gln Phe Phe Gly Leu Met |
5’ GA AGC TTC ATG CGT CCG AAG CCG CAG CAG TTC TTC GGT CTC ATG GAT CCG
CT TCG AAG TAC GCA GGC TTC GGC GTC GTC AAG AAG CCA GAG TAC CTA GGC 5’
```

### pHK414

```
_______Linker_sequence______
/ \
HindIII BamHI
,PstI. BglII.| |,SmaI. |
CTGCAG...AGATCTAAGCTTCCCGGGGATCCAAGATCC
GACGTC...TCTAGATTCGAAGGGCCCCTAGGTTCTAGG
. .
.......................................
```

### Synthesizing pSP4-1

#### pHK414 + HindIII + BamHI

They cut a hole in the plasmid…

```
HindIII BamHI.
(PstI) BglII,| |
CTGCAG...AGATCTA GATCCAAGATCC
GACGTC...TCTAGATTCGA GTTCTAGG
. .
.......................................
```

#### SP + HindIII + BamHI

… and cut matching snips off their SP gene.

```
HindIII. ,BamHI_.
| | Met Arg Pro Lys Pro Gln Gln Phe Phe Gly Leu Met |
AGC TTC ATG CGT CCG AAG CCG CAG CAG TTC TTC GGT CTC ATG
AG TAC GCA GGC TTC GGC GTC GTC AAG AAG CCA GAG TAC CTA G
```

#### pSP4-1

Mixing the snips together gives the plasmid with a single SP.

```
HindIII BamHI.
,PstI. BglII.| | MetArgProLysProGlnGlnPhePheGlyLeuMet |
CTGCAG...AGATCTAAGCTTCATGCGTCCGAAGCCGCAGCAGTTCTTCGGTCTCATGGATCCAAGATCC
GACGTC...TCTAGATTCGAAGTACGCAGGCTTCGGCGTCGTCAAGAAGCCAGAGTACCTAGGTTCTAGG
. .
......................................................................
```

Using `-SP-` to abbreviate the HindIII→Met→Met portion (less the terminal G, which is part of the BamHI match sequence):

```
,PstI. BglII. BamHI.
CTGCAG...AGATCT-SP-GGATCC
GACGTC...TCTAGA-SP-CCTAGG
. .
.........................
```

### Synthesizing pSP4-2

The single-SP plasmid, pSP4-1, is split in two parallel reactions.

#### PstI + BamHI

```
G...AGATCT-SP-G
ACGTC...TCTAGA-SP-CCTAG
```

#### PstI + BglII

```
CTGCA GATCT-SP-GGATCC
G A-SP-CCTAGG
. .
.........................
```

#### pSP4-2

Then the SP-containing fragments (shown above) are isolated and mixed together to form pSP4-2.

```
,PstI. BglII. other. BamHI.
CTGCAG...AGATCT-SP-GGATCT-SP-GGATCC
GACGTC...TCTAGA-SP-CCTAGA-SP-CCTAGG
. .
...................................
```

where the "other" sequence is the result of the BamHI/BglII splice.
Expanding the `-SP-`

abbreviation around the SP joint:

```
....SP,other_.HindIII. SP.....
Leu Met Asp Leu Ser Phe Met Arg
CTC ATG GAT CTA AGC TTC ATG CGT
AGA CGT TCG AGC CTA GGA CGT ATG
```

So the resulting poly-SP will have Asp-Leu-Ser-Phe linking amino acids.

By repeating the PstI + BamHI / PstI + BglII split-and-join, you can synthesize plasmids with any number of SP repeats.
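The joint bookkeeping that makes the doubling work can be spelled out in a few lines (a toy model; the constants are the recognition sequences listed earlier, and the variable names are mine):

```python
# Both BamHI (G|GATCC) and BglII (A|GATCT) leave the same 5'-GATC
# overhang, so a BamHI-cut end ligates to a BglII-cut end.  The hybrid
# site that results is cut by *neither* enzyme, protecting the joint
# during later rounds.
BAMHI = 'GGATCC'
BGLII = 'AGATCT'

# Kempe order: BamHI upstream half-site, then BglII downstream half-site
kempe_joint = BAMHI[0] + BGLII[1:]     # the "other" site in pSP4-2
# Carrion-Vazquez order (reversed), which codes Arg-Ser (AGA-TCC)
carrion_joint = BGLII[0] + BAMHI[1:]
```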

## I27^{RS}₈ procedure

Like Kempe, Carrion-Vazquez et al. flank the I27 gene with BglII and BamHI, but they reverse the order. Here's the output of their PCR:

```
BamHI-I27-BglII-Cys-Cys-STOP-STOP
```

From the PDB entry for I27 (1TIT), the amino acid sequence is:

```
,leader_.
MHHHHHHSSLIEVEKPLYGVEVFVGETAHFEIELSEPDVHGQWKLKGQPLTASPDCEIIEDGKKHILI
LHNCQLGMTGEVSFQAANAKSAANLKVKEL
```

To translate this into cDNA, I've scanned through the sequence of NM_003319.4 and found a close match from nucleotides 15991 through 16248.

```
15982 CTAATAAAAG TGGAAAAGCC TCTGTACGGA GTAGAGGTGT TTGTTGGTGA
16032 AACAGCCCAC TTTGAAATTG AACTTTCTGA ACCTGATGTT CACGGCCAGT
16082 GGAAGCTGAA AGGACAGCCT TTGACAGCTT CCCCTGACTG TGAAATCATT
16132 GAGGATGGAA AGAAGCATAT TCTGATCCTT CATAACTGTC AGCTGGGTAT
16182 GACAGGAGAG GTTTCCTTCC AGGCTGCTAA TGCCAAATCT GCAGCCAATC
16232 TGAAAGTGAA AGAATTG
```

This cDNA match generates an amino acid sequence starting with LIKVEK instead of the expected LIEVEK, but the LIKVEK version matches amino acids 12677-12765 in Q8WZ42 (canonical titin), and there is a natural variant listed for 12679 K→E.

Interestingly, this sequence contains a PstI site at nucleotides 16220 through 16225. None of our other restriction enzymes have sites in the I27 sequence.
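To check the site claims, here is a small scanner run over the cDNA match above (a sketch; the names are mine, and the string is the sequence listed above starting at nucleotide 15982):

```python
I27_CDNA = (
    'CTAATAAAAGTGGAAAAGCCTCTGTACGGAGTAGAGGTGTTTGTTGGTGA'
    'AACAGCCCACTTTGAAATTGAACTTTCTGAACCTGATGTTCACGGCCAGT'
    'GGAAGCTGAAAGGACAGCCTTTGACAGCTTCCCCTGACTGTGAAATCATT'
    'GAGGATGGAAAGAAGCATATTCTGATCCTTCATAACTGTCAGCTGGGTAT'
    'GACAGGAGAGGTTTCCTTCCAGGCTGCTAATGCCAAATCTGCAGCCAATC'
    'TGAAAGTGAAAGAATTG'
)

SITES = {'BamHI': 'GGATCC', 'BglII': 'AGATCT', 'HindIII': 'AAGCTT',
         'KpnI': 'GGTACC', 'PstI': 'CTGCAG', 'SmaI': 'CCCGGG'}

def find_sites(seq, offset=15982):
    """Map enzyme name -> recognition-site start positions, reported
    in NM_003319.4 numbering via `offset`."""
    hits = {}
    for name, site in SITES.items():
        positions = [i + offset for i in range(len(seq))
                     if seq.startswith(site, i)]
        if positions:
            hits[name] = positions
    return hits
```

The only hit is the PstI site starting at nucleotide 16220, matching the claim above.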

Carrion-Vazquez et al. list two vectors in their procedure, but I'm not sure about their respective roles.

### pQE30

pQE30 (sequence) is listed as the "expression vector", but I'm not sure why they would need a non-expression vector, as they don't reference cross-vector subcloning after inserting their I27 monomer into the plasmid.

From the Qiagen site, the section around the linker nucleotides 115 through 203 is:

```
,RGS-His epitope__________________. ,BamHI.
Met Arg Gly Ser His His His His His His Gly Ser Ala Cys Glu Leu
ATG AGA GGA TCG CAT CAC CAT CAC CAT CAC GGA TCC GCA TGC GAG CTC
CGT CTC TTC GAT ACG ACA ACG ACA ACG ACA TTC GAA TAC GTA TCT AGA
,SmaI__.
,KpnI_. HindIII
Gly Thr Pro Gly Arg Pro Ala Ala Lys Leu Asn STOP
GGT ACC CCG GGT CGA CCT GCA GCC AAG CTT AAT TAG CTG AG
TTG CAA AAT TTG ATC AAG TAC TAA CCT AGG CCG GCT AGT CT
```

However, there is no BglII site in this linker. In fact, there is no BglII site in the entire pQE30 plasmid, so they'd need to use a third restriction enzyme to insert their I27 (which does contain a trailing BglII site).

### pUC19

From BCCM/LMBP and GenBank, the section around the linker nucleotides 233 through 289 is:

```
,SmaI_.
HindIII. ,PstI__. ,BamHI_. ,KpnI__.
Met STOP
AA GCT TGC ATG CCT GCA GGT CGA CTC TAG AGG ATC CCC GGG TAC CGA
GCT CGA ATT C
```

However, there is no BglII site in the entire pUC19 plasmid either, so they'd need to use a third restriction enzyme to insert their I27.

### Questions

- Why do Carrion-Vazquez et al. list two different plasmids?
- What is the 3'-side restriction enzyme that Carrion-Vazquez et al. use to insert their I27 into their plasmid?
- What is the remote restriction enzyme that Carrion-Vazquez et al. use to break their opened plasmids (the Kempe PstI equivalent)?
- The BamHI and SmaI sites in pUC19 overlap, so it is unclear how you could use both to "linearize" pUC19. It would seem that either one would open the plasmid on its own, although I'm not sure you could "heal" the blunt-ended SmaI cut.
- Since the Arg-Ser joint is formed by a BglII/BamHI overlap, why are there no BglII-coded amino acids after the last I27 in the I27^{RS}₈ sequence? If there are, why do Carrion-Vazquez et al. not acknowledge them when they write [3]: "The full-length construct, I27^{RS}₈, results in the following amino acid additions: (i) the amino-terminal sequence is Met-Arg-Gly-Ser-(His)₆-Gly-Ser-I27 codons; (ii) the junction between the domains (BamHI-BglII hybrid site) is Arg-Ser; and (iii) the protein terminates in Cys-Cys." Since they don't acknowledge an I27-Arg-Ser-Cys-Cys ending, might there be more amino acids in the C-terminal addition?

### Working backward

Since I'm stuck trying to get I27 into either plasmid, let's try and work backward from

```
Met-Arg-Gly-Ser-(His)₆-Gly-Ser-(I27-Arg-Ser)₇-I27-...-Cys-Cys
```

#### BglII/BamHI joint

The BglII/BamHI overlap would produce the expected Arg-Ser joint.

```
BglII BamHI
A + GATCC = AGATCC = Arg-Ser
TCTAG G TCTAGG
```

#### Final plasmid (pI27-8)

The beginning of this sequence looks like the start of pQE30's linker, so we'll assume the final plasmid was:

```
remote ... ,RGS-His epitope__________________. ,BamHI. I27...
... Met Arg Gly Ser His His His His His His Gly Ser Leu Ile ...
??? ... ATG AGA GGA TCG CAT CAC CAT CAC CAT CAC GGA TCC CTA ATA ...
??? ... CGT CTC TTC GAT ACG ACA ACG ACA ACG ACA TTC GAA GAT TAT ...
........I27 joint_. I27 ... final I27 ,BglII. continuation of pQE30?
... Glu Leu Leu ... Leu Arg Ser Cys Cys STOPSTOP...
... GAA TTG AGA TCC CTA ... TTG AGA TCT TGC TGC TAG TAG ...
... CTT AAC TCT AGG GAT ... GAT CTC GAG GTA GTA GCT GCT ...
```

#### Penultimate plasmid (pI27-4)

```
remote ... ,RGS-His epitope__________________. ,BamHI. I27...
Met Arg Gly Ser His His His His His His Gly Ser Leu Ile ...
??? ... ATG AGA GGA TCG CAT CAC CAT CAC CAT CAC GGA TCC CTA ATA ...
??? ... CGT CTC TTC GAT ACG ACA ACG ACA ACG ACA TTC GAA GAT TAT ...
... I27 joint_. I27 ... fourth I27 ,BglII. continuation of pQE30?
... Glu Leu Leu ... Leu Arg Ser Cys Cys STOPSTOP...
... GAA TTG AGA TCC CTA ... TTG AGA TCT TGC TGC TAG TAG ...
... CTT AAC TCT AGG GAT ... GAT CTC GAG GTA GTA GCT GCT ...
```

##### pI27-4 + BamHI + remote

```
remote ,BamHI. I27...
Leu Ile ...
? GA TCC CTA ATA ...
?? A GAT TAT ...
....... I27 joint_. I27 ... fourth I27 ,BglII. continuation of pQE30?
... Glu Leu Leu ... Leu Arg Ser Cys Cys STOPSTOP...
... GAA TTG AGA TCC CTA ... TTG AGA TCT TGC TGC TAG TAG ...
... CTT AAC TCT AGG GAT ... GAT CTC GAG GTA GTA GCT GCT ...
```

##### pI27-4 + BglII + remote

```
remote ... ,RGS-His epitope__________________. ,BamHI. I27...
Met Arg Gly Ser His His His His His His Gly Ser Leu Ile ...
?? ... ATG AGA GGA TCG CAT CAC CAT CAC CAT CAC GGA TCC CTA ATA ...
? ... CGT CTC TTC GAT ACG ACA ACG ACA ACG ACA TTC GAA GAT TAT ...
....... I27 joint_. I27 ... fourth I27 ,BglII.
... Glu Leu Leu ... Leu
... GAA TTG AGA TCC CTA ... TTG A
... CTT AAC TCT AGG GAT ... GAT CTC GA
```

##### pI27-8

```
remote ... ,RGS-His epitope__________________. ,BamHI. I27...
Met Arg Gly Ser His His His His His His Gly Ser Leu Ile ...
??? ... ATG AGA GGA TCG CAT CAC CAT CAC CAT CAC GGA TCC CTA ATA ...
??? ... CGT CTC TTC GAT ACG ACA ACG ACA ACG ACA TTC GAA GAT TAT ...
....... I27 joint_. I27 ... fourth I27 ,other. I27...
... Glu Leu Leu ... Leu Gly Ser Leu Ile ...
... GAA TTG AGA TCC CTA ... TTG AGA TCC CTA ATA ...
... CTT AAC TCT AGG GAT ... GAT CTC GAA GAT TAT ...
....... I27 joint_. I27 ... fourth I27 ,BglII. continuation of pQE30?
... Glu Leu Leu ... Leu Arg Ser Cys Cys STOPSTOP...
... GAA TTG AGA TCC CTA ... TTG AGA TCT TGC TGC TAG TAG ...
... CTT AAC TCT AGG GAT ... GAT CTC GAG GTA GTA GCT GCT ...
```

#### First plasmid (pI27-1)

Continuing backward, pI27-1 must have been:

```
remote ... ,RGS-His epitope__________________. ,BamHI. I27...
... Met Arg Gly Ser His His His His His His Gly Ser Leu Ile ...
??? ... ATG AGA GGA TCG CAT CAC CAT CAC CAT CAC GGA TCC CTA ATA ...
??? ... CGT CTC TTC GAT ACG ACA ACG ACA ACG ACA TTC GAA GAT TAT ...
........I27 ,BglII. continuation of pQE30?
... Glu Leu Arg Ser Cys Cys STOPSTOP...
... GAA TTG AGA TCT TGC TGC TAG TAG ...
... CTT AAC CTC GAG GTA GTA GCT GCT ...
```

### Potential pQE30 insertion points

- KpnI (present after BamHI in both plasmids)

### Potential remote restriction enzymes

- BglI (pQE30 nucleotides 2583-2593 (GCCGGAAGGGC), Amp-resistance 3256-2396; pUC19 has two BglI sites (bad idea))

Available in a git repository.

Repository: pypid

Browsable repository: pypid

Author: W. Trevor King

I've just finished rewriting my PID temperature control package in pure-Python, and it's now clean enough to go up on PyPI. Features:

- Backend-agnostic architecture. I've written a first-order process with dead time (FOPDT) test backend and a pymodbus-based backend for our Melcor MTCA controller, but it should be easy to plug in your own custom backend.
- The general PID controller will automatically tune your backend using any of a variety of tuning rules.

The `README` is posted on the PyPI page.
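pypid's actual API lives in its README; the core update that any such controller wraps is the textbook discrete PID loop, roughly:

```python
class PID:
    """Textbook discrete PID controller (a sketch, not pypid's API)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.last_error = None

    def update(self, measurement, dt):
        """Return the control output after a new measurement."""
        error = self.setpoint - measurement
        self.integral += error * dt                 # accumulate I term
        derivative = (0.0 if self.last_error is None
                      else (error - self.last_error) / dt)
        self.last_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)
```

A backend (Melcor hardware, FOPDT simulation, …) then supplies the measurement and applies the returned output; the tuning rules amount to picking `kp`, `ki`, and `kd`.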

There are a number of open source packages dealing with aspects of single-molecule force spectroscopy. Here's a list of everything I've heard about to date (for more details on calibcant, Hooke, and sawsim, see my thesis).

Package | License | Purpose |
---|---|---|
calibcant | GPL v3+ | Cantilever thermal calibration |
fs_kit | GPL v2+ | Force spectra analysis pattern recognition |
Hooke | LGPL v3+ | Force spectra analysis and unfolding force extraction |
sawsim | GPL v3+ | Monte Carlo unfolding/refolding simulation and fitting |
refolding | Apache v2.0 | Double-pulse experiment control and analysis |

# calibcant

Calibcant is my Python module for AFM cantilever calibration via the thermal tune method. It's based on Comedi, so it needs work if you want to use it on a non-Linux system. If you're running a Linux kernel, it should be pretty easy to get it running on your system. Email me if there's any way I can help set it up for your lab.

# fs_kit

fs_kit is a package for force spectra analysis pattern
recognition. It was developed by Michael Kuhn and Maurice Hubain at
Daniel Müller's lab when they were at TU Dresden
(paper). It has an Igor interface, but the bulk
of the project is in C++ with a wxWidgets interface. fs_kit is
versioned in CVS at `bioinformatics.org`, and you can check out their
code with:

```
$ cvs -d:pserver:anonymous@bioinformatics.org:/cvsroot checkout fskit
```

The last commit was on 2005/05/16, so it's a bit crusty. I patched things up back in 2008 so it would compile again, but when I emailed Michael with the patches I got this:

```
On Thu, Oct 23, 2008 at 11:21:42PM +0200, Michael Kuhn wrote:
> Hi Trevor,
>
> I'm glad you could fix fs-kit, the project is otherwise pretty dead,
> as was the link. I found an old file which should be the tutorial,
> hopefully in the latest version. The PDF is probably lost.
>
> bw, Michael
```

So, it's a bit of a fixer-upper, but it was the first open source package in this field that I know of. I've put up a PDF version of the tutorial Michael sent me in case you're interested.

# Hooke

Hooke is a force spectroscopy data analysis package written in Python. It was initially developed by Massimo Sandal, Fabrizio Benedetti, Marco Brucale, and Alberto Gomez-Casado while at Bruno Samorì's lab at U Bologna (paper; surprisingly, there are commits by all of the authors except Samorì himself). Hooke provides the interface between your raw data and theory. It has drivers for reading most force spectroscopy file formats, and a large number of commands for manipulating and analyzing the data.

I liked Hooke so much I threw out my already-written package that had been performing a similar role and proceeded to work over Hooke to merge together the diverging command-line and GUI forks. Unfortunately, my fork has not yet been merged back in as the main branch, but I'm optimistic that it will eventually. The homepage for my branch is here.

# sawsim

While programs like Hooke can extract unfolding forces from velocity-clamp experiments, the unfolding force histograms are generally compared to simulated data to estimate the underlying kinetic parameters. Sawsim is my package for performing such simulations and fitting them to the experimental histograms (paper). The single-pull simulator is written in C, and there is a nice Python wrapper that manages the thousands of simulated pulls needed to explore the possible model parameter space. The whole package ends up being pretty fast, flexible, and convenient.
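To give the flavor of such a simulation, here is a toy single-domain version (Bell-model unfolding under a constant loading rate; this is not sawsim's actual model, which couples the pull to domain elasticity, and all the parameter names are mine):

```python
import math
import random

def unfolding_force(k0, dx, loading_rate, kbt=4.1, dt=1e-4):
    """Draw one unfolding force from a Bell-model domain.

    k0           -- zero-force unfolding rate (1/s)
    dx           -- distance to the transition state (nm)
    loading_rate -- d(force)/dt (pN/s)
    kbt          -- thermal energy (pN nm, ~4.1 at room temperature)
    dt           -- simulation timestep (s)
    """
    t = 0.0
    while True:
        force = loading_rate * t
        rate = k0 * math.exp(force * dx / kbt)  # Bell: rate grows with force
        if random.random() < rate * dt:         # did it unfold this step?
            return force
        t += dt
```

Repeating this thousands of times builds the simulated unfolding-force histogram that gets fit against the experimental one.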

# refolding

Refolding is a suite for performing and analyzing double-pulse
refolding experiments. It was initially developed by Daniel Aioanei,
also at the Samorì lab in Bologna (these guys are great!). The
experiment-driver is mostly written in Java with the analysis code in
Python. The driver is curious; it uses the NanoScope scripting
interface to drive the experiment *through* the NanoScope software by
impersonating a mouse-wielding user (like Selenium does for web
browsers). See the `RobotNanoDriver.java` code for details. There is
also support for automatic velocity clamp analysis.

The official paper for the project is by Aioanei. The earlier paper by Materassi may be related, but Aioanei doesn't cite it in his paper, and Materassi doesn't give a URL for his code.

# Hardware

Nice software doesn't do you much good if you don't have the hardware to control. There are a number of quasi-open hardware solutions for building your own AFM (and other types of scanning probe microscopes). There's a good list on opencircuits. Interesting projects include:

- Glenn Durden's STM (1992–1998)
- Jim Rice's Homebrew STM (1995)
- The Peddie School's STM Project (1997–2002)
- Jürgen Müller's home-built STMs (1999–2006)
- John D. Alexander's STM Project (2000–2003)
- The Münster Interface Physics Group's SXM Project (free except for commercial use, 2000–2005).
- Joseph Gatt's Amateur STM (2003)
- Maxim Shusteff's AFM for the instructional laboratory (2006)
- Dominik, Ivan, and Sandro's STM-DIY project (2009)

# Other software

The Gnome X Scanning Microscopy (GXSM) project provides GPL
software to perform standard SPM imaging. The list of supported
hardware is currently limited to the SignalRanger series by SoftdB,
via GXSM-specific kernel modules like `sranger-mk23-dkms`. There is
an obsolete Comedi driver for GXSM that Percy Zahl wrote back in
1999, but it has been deprecated since at least 2007.

Force spectroscopy is the process of extracting information about the unfolding (or unbinding) characteristics of a protein (or ligand-receptor pair) by measuring force vs. extension curves while gradually ripping the protein (or pair) apart. Consider this cartoon representation of the procedure

The AFM tip is pulling a protein chain away from the substrate, causing one of the protein domains to uncoil.

The procedure yields 'force curves' like this

To interpret the force curve, let us examine it piece-by-piece as the AFM tip gradually pulls away from the substrate.

- The linear 'contact' region demonstrates the Hooke's law behavior of the AFM cantilever, with force ∝ displacement.
- The high force 'bulge' linking the contact region to the sawtooth comes from the AFM tip pulling free of the surface and associated protein 'mat' (the cartoon being excessively pretty, and our sample having too high a protein concentration :p).
- The characteristic 'sawtooth' comes from several identical domains unfolding one after the other.
- After the last of the protein domains unfolds the protein snaps off of the AFM tip (or the substrate), and the deflection of the now-free cantilever ceases to depend on distance.

The clickloc.tk micro-app just opens an image file and prints out the pixel coordinates of any mouse clicks upon it. I use it to get rough numbers from journal figures. You can use scale_click.py to convert the output to units of your choice, if you use your first four clicks to mark out the coordinate system (xmin,anything), (xmax,anything), (anything,ymin), and (anything,ymax).

```
$ pdfimages article.pdf fig
$ clickloc.tk fig-000.ppm > fig-000.pixels
$ scale_click.py 0 10 5 20 fig-000.pixels > fig-000.data
```

Take a look at plotpick for grabbing points from raw datafiles (which is more accurate and easier than reverse engineering images).
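The conversion scale_click.py performs amounts to a linear map fixed by the four calibration clicks; a sketch of the idea (assuming linear, unlogged axes; the names are mine):

```python
def make_scaler(px_xmin, px_xmax, px_ymin, px_ymax,
                xmin, xmax, ymin, ymax):
    """Return a pixel -> data converter from four calibration clicks.

    px_xmin/px_xmax -- x pixel coordinates of the clicks marking xmin/xmax
    px_ymin/px_ymax -- y pixel coordinates of the clicks marking ymin/ymax
    """
    def scale(px, py):
        x = xmin + (px - px_xmin) / (px_xmax - px_xmin) * (xmax - xmin)
        y = ymin + (py - px_ymin) / (px_ymax - px_ymin) * (ymax - ymin)
        return x, y
    return scale
```

With the axis bounds `0 10 5 20` from the example above, each subsequent click interpolates linearly between the calibration pixels (image y coordinates usually grow downward, which the map handles automatically since ymin's pixel row is below ymax's).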

Available in a git repository.

Repository: thesis

Browsable repository: thesis

Author: W. Trevor King

The source for my Ph.D. thesis (in progress). A draft is compiled after each commit. I use the drexel-thesis class, which I also maintain.

Exciting features: SCons build; Asymptote/asyfig, PGF, and PyMOL figures.