I've been trying to wrap my head around factor analysis as a theory for designing and understanding test and survey results. This has turned out to be another one of those fields where the going has been a bit rough. I think the key factors in making these older topics difficult are:

  • “Everybody knows this, so we don't need to write up the details.”
  • “Hey, I can do better than Bob if I just tweak this knob…”
  • “I'll just publish this seminal paper behind a paywall…”

The resulting discussion ends up being overly complicated, and it's hard for newcomers to decide if people using similar terminology are in fact talking about the same thing.

Some of the better open sources for background have been Tucker and MacCallum's “Exploratory Factor Analysis” manuscript and Max Welling's notes. I'll use Welling's terminology for this discussion.

The basic idea of factor analysis is to model d measurable attributes as generated by k < d common factors and d unique factors. With d=4 and k=2, you get something like:

Relationships between factors and measured attributes (adapted from Tucker and MacCallum's Figure 1.2)

Corresponding to the equation (Welling's eq. 1):

(1) x = Ay + μ + ν

The independent random variables y are distributed according to a Gaussian with zero mean and unit variance, 𝒢_y[0, I] (zero mean because constant offsets are handled by μ; unit variance because scaling is handled by A). The independent random variables ν are distributed according to 𝒢_ν[0, Σ], with (Welling's eq. 2):

(2) Σ ≡ diag[σ_1^2, …, σ_d^2]
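Before digging into the fitting procedure, it may help to see the generative model in action. Here's a quick NumPy simulation (my own illustration, not from Welling or Tucker and MacCallum) that draws samples according to eq. (1) and checks that the implied covariance of x is AAᵀ + Σ:

```python
import numpy

numpy.random.seed(0)
d, k, N = 4, 2, 100000  # measured attributes, common factors, samples

A = numpy.random.randn(d, k)         # factor loadings (arbitrary example values)
mu = numpy.random.randn(d)           # constant offsets
sigma2 = numpy.random.rand(d) + 0.1  # unique-noise variances (diagonal of Sigma)

y = numpy.random.randn(N, k)                        # y ~ G_y[0, I]
nu = numpy.random.randn(N, d) * numpy.sqrt(sigma2)  # nu ~ G_nu[0, Sigma]
x = y.dot(A.T) + mu + nu                            # eq. (1): x = A y + mu + nu

# the model implies cov(x) = A A^T + Sigma, up to sampling error
empirical = numpy.cov(x, rowvar=False)
model = A.dot(A.T) + numpy.diag(sigma2)
print(numpy.abs(empirical - model).max())  # small: just sampling noise
```

The unique factors ν only contribute to the diagonal of the covariance, which is what lets the k common factors explain all of the *co*-variance between attributes.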

The matrix A (linking common factors with measured attributes x) is referred to as the factor weights or factor loadings. Because the only source of constant offset is μ, we can calculate it by averaging out the random noise (Welling's eq. 6):

(3) μ = (1/N) ∑_{n=1}^{N} x_n

where N is the number of measurements (survey responders) and x_n is the response vector for the n-th responder. How do we find A and Σ? This is the tricky bit, and there are a number of possible approaches. Welling suggests using expectation maximization (EM), and there's an excellent example of the procedure with a colorblind experimenter drawing colored balls in his [EM notes][EM] (to test my understanding, I wrote [colorball.py](./colorball.py)). To simplify calculations, Welling defines (before eq. 15):

(4) A′ ≡ [A, μ]    y′ ≡ [y^T, 1]^T

which reduce the model to

(5) x = A′y′ + ν

After some manipulation Welling works out the maximizing updates (eqns. 16 and 17):

(6) A′_new = (∑_{n=1}^{N} x_n E[y′|x_n]^T) (∑_{n=1}^{N} E[y′y′^T|x_n])^{−1}
    Σ_new = (1/N) ∑_{n=1}^{N} diag[x_n x_n^T − A′_new E[y′|x_n] x_n^T]

The expectation values used in these updates are given by (Welling's eqns. 12 and 13):

(7) E[y|x_n] = A^T (AA^T + Σ)^{−1} (x_n − μ)
    E[yy^T|x_n] = I − A^T (AA^T + Σ)^{−1} A + E[y|x_n] E[y|x_n]^T

Survey analysis
===============

Enough abstraction! Let's look at an example: [survey results][survey]:

>>> import numpy
>>> scores = numpy.genfromtxt('Factor analysis/survey.data', delimiter='\t')
>>> scores
array([[ 1.,  3.,  4.,  6.,  7.,  2.,  4.,  5.],
       [ 2.,  3.,  4.,  3.,  4.,  6.,  7.,  6.],
       [ 4.,  5.,  6.,  7.,  7.,  2.,  3.,  4.],
       [ 3.,  4.,  5.,  6.,  7.,  3.,  5.,  4.],
       [ 2.,  5.,  5.,  5.,  6.,  2.,  4.,  5.],
       [ 3.,  4.,  6.,  7.,  7.,  4.,  3.,  5.],
       [ 2.,  3.,  6.,  4.,  5.,  4.,  4.,  4.],
       [ 1.,  3.,  4.,  5.,  6.,  3.,  3.,  4.],
       [ 3.,  3.,  5.,  6.,  6.,  4.,  4.,  3.],
       [ 4.,  4.,  5.,  6.,  7.,  4.,  3.,  4.],
       [ 2.,  3.,  6.,  7.,  5.,  4.,  4.,  4.],
       [ 2.,  3.,  5.,  7.,  6.,  3.,  3.,  3.]])

scores[i,j] is the answer the i-th respondent gave for the j-th question. We're looking for underlying factors that can explain covariance between the different questions. Do the question answers (x) represent some underlying factors (y)? Let's start off by calculating μ:

>>> def print_row(row):
...     print('  '.join('{: 0.2f}'.format(x) for x in row))
>>> mu = scores.mean(axis=0)
>>> print_row(mu)
 2.42   3.58   5.08   5.75   6.08   3.42   3.92   4.25

Next we need priors for A and Σ. [MDP][] has an implementation for [Python](../Python/), and their [FANode][] uses a Gaussian random matrix for A and the diagonal of the score covariance for Σ. They also use the score covariance to avoid repeated summations over n.

>>> import mdp
>>> def print_matrix(matrix):
...     for row in matrix:
...         print_row(row)
>>> fa = mdp.nodes.FANode(output_dim=3)
>>> numpy.random.seed(1)  # for consistent doctest results
>>> responder_scores = fa(scores)  # common factors for each responder
>>> print_matrix(responder_scores)
 1.92   0.45   0.00
 0.67   1.97   1.96
 0.70   0.03   2.00
 0.29   0.03   0.60
 1.02   1.79   1.43
 0.82   0.27   0.23
 0.07   0.08   0.82
 1.38   0.27   0.48
 0.79   1.17   0.50
 1.59   0.30   0.41
 0.01   0.48   0.73
 0.46   1.34   0.18
>>> print_row(fa.mu.flat)
 2.42   3.58   5.08   5.75   6.08   3.42   3.92   4.25
>>> fa.mu.flat == mu  # MDP agrees with our earlier calculation
array([ True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)
>>> print_matrix(fa.A)  # factor weights for each question
 0.80   0.06   0.45
 0.17   0.30   0.65
 0.34   0.13   0.25
 0.13   0.73   0.64
 0.02   0.32   0.70
 0.61   0.23   0.86
 0.08   0.63   0.59
 0.09   0.67   0.13
>>> print_row(fa.sigma)  # unique noise for each question
 0.04   0.02   0.38   0.55   0.30   0.05   0.48   0.21

Because the covariance is unaffected by the rotation A → AR, the estimated weights A and responder scores y can be quite sensitive to the seed priors. The width Σ of the unique noise ν is more robust, because Σ is unaffected by rotations on A.

Related tidbits
===============

Communality

The [communality][] h_i^2 of the i-th measured attribute x_i is the fraction of variance in the measured attribute which is explained by the set of common factors. Because the common factors y have unit variance, the communality is given by:

(8) h_i^2 = (∑_{j=1}^{k} A_{ij}^2) / (∑_{j=1}^{k} A_{ij}^2 + σ_i^2)

>>> factor_variance = numpy.array([sum(row**2) for row in fa.A])
>>> h = numpy.array(
...     [var/(var+sig) for var, sig in zip(factor_variance, fa.sigma)])
>>> print_row(h)
 0.95   0.97   0.34   0.64   0.66   0.96   0.61   0.69

There may be some scaling issues in the communality due to deviations between the estimated A and Σ and the variations contained in the measured scores (why?):

>>> print_row(factor_variance + fa.sigma)
 0.89   0.56   0.57   1.51   0.89   1.21   1.23   0.69
>>> print_row(scores.var(axis=0, ddof=1))  # total variance for each question
 0.99   0.63   0.63   1.66   0.99   1.36   1.36   0.75

The proportion of total variation explained by the common factors is given by:

(9) (1/d) ∑_{i=1}^{d} h_i^2

Varimax rotation

As mentioned earlier, factor analysis generates loadings A that are unique only up to an arbitrary rotation R (as you'd expect for a k-dimensional Gaussian ball of factors y). A number of schemes have been proposed to simplify the initial loadings by rotating A to reduce off-diagonal terms. One of the more popular approaches is Henry Kaiser's varimax rotation (unfortunately, I don't have access to either his thesis or the subsequent paper). I did find (via Wikipedia) Trevor Park's notes, which have been very useful.

The idea is to iterate rotations to maximize the raw varimax criterion (Park's eq. 1):

(10) V(A) = ∑_{j=1}^{k} [ (1/d) ∑_{i=1}^{d} A_{ij}^4 − ((1/d) ∑_{i=1}^{d} A_{ij}^2)^2 ]

Rather than computing a k-dimensional rotation in one sweep, we'll iterate through 2-dimensional rotations (on successive column pairs) until convergence. For a particular column pair (p,q), the rotation matrix R* is the usual 2×2 rotation matrix:

(11) R* = ( cos(ϕ*)  −sin(ϕ*)
            sin(ϕ*)   cos(ϕ*) )

where the optimum rotation angle ϕ* is (Park's eq. 3):

(12) ϕ* = (1/4) ∠[ (1/d) ∑_{j=1}^{d} (A_{jp} + iA_{jq})^4 − ((1/d) ∑_{j=1}^{d} (A_{jp} + iA_{jq})^2)^2 ]

where i ≡ √−1.
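Putting eqs. (10) through (12) together, a pairwise-rotation varimax pass is only a few lines of NumPy. This is my own sketch of the procedure Park describes, not a drop-in for any library routine:

```python
import numpy

def raw_varimax_criterion(A):
    """V(A) from eq. (10): summed column variances of the squared loadings."""
    return numpy.sum((A**4).mean(axis=0) - ((A**2).mean(axis=0))**2)

def varimax(A, tol=1e-8, max_iter=500):
    """Rotate loadings A column-pair by column-pair until V(A) stops growing."""
    A = A.copy()
    k = A.shape[1]
    for _ in range(max_iter):
        v_old = raw_varimax_criterion(A)
        for p in range(k - 1):
            for q in range(p + 1, k):
                z = A[:, p] + 1j * A[:, q]  # treat the column pair as complex
                # optimal angle for this pair (eq. 12)
                phi = numpy.angle((z**4).mean() - ((z**2).mean())**2) / 4
                c, s = numpy.cos(phi), numpy.sin(phi)
                R = numpy.array([[c, -s], [s, c]])  # eq. (11)
                A[:, [p, q]] = A[:, [p, q]].dot(R)
        if raw_varimax_criterion(A) - v_old < tol:
            break
    return A
```

Because each R is orthogonal, the rotated loadings leave AAᵀ (and hence the modeled covariance and the communalities) unchanged.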

Nomenclature

A ij
The element from the i-th row and j-th column of a matrix A. For example, here is a 2-by-3 matrix in terms of its components:
(13) A = ( A_{11}  A_{12}  A_{13}
           A_{21}  A_{22}  A_{23} )
A T
The transpose of a matrix (or vector) A: (A^T)_{ij} = A_{ji}.
A 1
The inverse of a matrix A: A^{−1}A = I.
diag[A]
A matrix containing only the diagonal elements of A, with the off-diagonal values set to zero.
E[f(x)]
Expectation value for a function f of a random variable x. If the probability density of x is p(x), then E[f(x)] = ∫ dx p(x) f(x). For example, E[1] = ∫ dx p(x) = 1.
μ
The mean of a random variable x is given by μ=E[x].
Σ
The covariance of a random variable x is given by Σ=E[(xμ)(xμ) T]. In the factor analysis model discussed above, Σ is restricted to a diagonal matrix.
𝒢_x[μ, Σ]
A Gaussian probability density for the random variables x with a mean μ and a covariance Σ.
(14) 𝒢_x[μ, Σ] = (2π)^{−D/2} det[Σ]^{−1/2} e^{−(1/2)(x−μ)^T Σ^{−1}(x−μ)}
p(y|x)
Probability of y occurring given that x occurred. This is commonly used in Bayesian statistics.
p(x,y)
Probability of y and x occurring simultaneously (the joint density): p(x,y) = p(x|y)p(y).
∠(z)
The angle of z in the complex plane: ∠(re^{iθ}) = θ.
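As a sanity check on the nomenclature, the Gaussian density of eq. (14) transcribes directly to code. This is a hypothetical helper of my own, assuming only NumPy:

```python
import numpy

def gauss(x, mu, Sigma):
    """Evaluate the Gaussian density G_x[mu, Sigma] of eq. (14) at x."""
    x = numpy.asarray(x, dtype=float)
    mu = numpy.asarray(mu, dtype=float)
    Sigma = numpy.atleast_2d(Sigma)
    D = mu.size
    norm = (2 * numpy.pi)**(D / 2.0) * numpy.sqrt(numpy.linalg.det(Sigma))
    diff = x - mu
    # solve() avoids forming Sigma^{-1} explicitly
    return numpy.exp(-0.5 * diff.dot(numpy.linalg.solve(Sigma, diff))) / norm
```

At the mean with unit covariance this reduces to the familiar normalization, e.g. gauss([0.0], [0.0], [[1.0]]) gives 1/√(2π).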

Note: if you have trouble viewing some of the more obscure Unicode used in this post, you might want to install the STIX fonts.