In my studies of quantum information and learning theory, I find that a particular computational tool has become the center of rigorous mathematical research: sampling. This post walks through the rich theory of MCMC sampling and the results that justify Monte Carlo simulation as a reliable numerical workhorse in many fields.
In quantum information, Path-Integral Monte Carlo (PIMC) is a primary method for simulating thermal states of stoquastic quantum systems. PIMC transforms a quantum partition function into a classical Gibbs distribution by introducing an "imaginary-time" dimension, then applies Markov Chain Monte Carlo to sample from that distribution. Recently, these Markov chains were proven to mix rapidly, using the canonical path method. This rigorously justifies the use of PIMC in partition function estimation.
Suppose we hope to estimate the partition function of a \(1\)D stoquastic Hamiltonian \[ H = \sum_{i=1}^{n-1}Z_{i}Z_{i+1} + \sum_{i=1}^{n}h_iX_i. \] Under a Suzuki-Trotter decomposition, this quantum system becomes a classical one on an \(n \times \beta\) space-time lattice, with a Gibbs distribution over paths \(x \in \{0, 1 \}^{n\beta}\). We can then sample from this distribution with MCMC, typically via Glauber dynamics. However, our estimates can be wildly inaccurate unless we prove the mixing time\(^{1}\) to be small.
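To make the world-line picture concrete, here is a minimal single-site Glauber sweep over an \(n \times m\) lattice of \(\pm 1\) spins, periodic in the imaginary-time direction. The uniform couplings `J_space` and `J_time` are illustrative stand-ins, not the actual Suzuki-Trotter weights (which depend on \(\beta\) and the fields \(h_i\)):

```python
import math
import random

def glauber_sweep(spins, J_space, J_time, rng):
    """One Glauber sweep over an n x m lattice of +-1 spins.

    The lattice is periodic along imaginary time (axis 1) and open in
    space, mirroring the world-line picture; J_space and J_time are toy
    uniform couplings, not the exact Suzuki-Trotter weights.
    """
    n, m = len(spins), len(spins[0])
    for _ in range(n * m):
        i, t = rng.randrange(n), rng.randrange(m)
        # Local field from spatial neighbours (open chain) ...
        h = 0.0
        if i > 0:
            h += J_space * spins[i - 1][t]
        if i < n - 1:
            h += J_space * spins[i + 1][t]
        # ... and from imaginary-time neighbours (periodic).
        h += J_time * (spins[i][(t - 1) % m] + spins[i][(t + 1) % m])
        # Heat-bath (Glauber) rule: resample the spin from its
        # conditional distribution given its neighbours.
        p_up = 1.0 / (1.0 + math.exp(-2.0 * h))
        spins[i][t] = 1 if rng.random() < p_up else -1
    return spins
```

Each sweep resamples \(nm\) randomly chosen sites; whether a polynomial number of sweeps suffices is exactly the mixing-time question above.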
In learning theory, specifically in probabilistic inference on graphical models, sampling-based algorithms (e.g. Gibbs sampling, importance sampling) are essential for approximating joint distributions when exact inference is intractable. Consider a Markov random field over binary variables \(X_1, \ldots, X_n\) with pairwise interactions: \[ P(x) \propto \exp \left( \sum_{(i, j) \in E} \theta_{ij} x_i x_j + \sum_{i}\theta_i x_i \right) \] where we want to compute marginal probabilities \(P(X_i = 1)\). Exact inference is \(\mathsf{\# P}\)-hard, so we resort to Gibbs sampling. But how long does the sampler take to converge\(^{2}\) to the true distribution? Once again, this depends on how fast the Markov chain mixes. Both of these examples illustrate that sampling is not just a numerical tool but a computational lifeline in need of theoretical justification: do these chains converge fast enough for meaningful inference?
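As a concrete sketch (the graph, parameter values, and function names below are illustrative, not from any particular model), a single-site Gibbs sampler for this MRF repeatedly resamples each \(x_i \in \{0,1\}\) from its logistic conditional, and for small \(n\) we can check it against brute-force enumeration:

```python
import itertools
import math
import random

def gibbs_marginals(theta_node, theta_edge, n_sweeps=20000, burn_in=1000, seed=0):
    """Estimate P(X_i = 1) for the pairwise binary MRF
    P(x) ~ exp(sum theta_edge[i,j] x_i x_j + sum theta_node[i] x_i),
    x_i in {0, 1}, via single-site Gibbs sampling."""
    rng = random.Random(seed)
    n = len(theta_node)
    nbrs = {i: [] for i in range(n)}
    for (i, j), w in theta_edge.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    x = [rng.randrange(2) for _ in range(n)]
    counts = [0] * n
    for sweep in range(n_sweeps):
        for i in range(n):
            field = theta_node[i] + sum(w * x[j] for j, w in nbrs[i])
            # Conditional P(x_i = 1 | rest) is the logistic of the local field.
            x[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-field)) else 0
        if sweep >= burn_in:
            for i in range(n):
                counts[i] += x[i]
    return [c / (n_sweeps - burn_in) for c in counts]

def exact_marginals(theta_node, theta_edge):
    """Brute-force marginals by enumerating all 2^n states (tiny n only)."""
    n = len(theta_node)
    Z, marg = 0.0, [0.0] * n
    for x in itertools.product([0, 1], repeat=n):
        w = math.exp(sum(theta_node[i] * x[i] for i in range(n))
                     + sum(v * x[i] * x[j] for (i, j), v in theta_edge.items()))
        Z += w
        for i in range(n):
            marg[i] += w * x[i]
    return [m / Z for m in marg]
```

On a 4-variable chain the Gibbs estimates land close to the exact marginals within a few thousand sweeps; the point of mixing-time theory is to predict when that happens for models far too large to enumerate.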
Canonical paths are a powerful combinatorial framework for proving rapid mixing of high-dimensional Markov chains. Define a set of paths \( \Gamma = \{ \gamma_{x, y} \}_{x, y \in \Omega} \), one for each pair of distinct states in the state space \( \Omega \). The edge congestion is \[ \rho = \max_{e \in E} \frac{1}{Q(e)} \sum_{(x, y) : e \in \gamma_{x, y}} \pi(x)\pi(y) \] where \( Q(e) = \pi(u) P(u, v)\) is the ergodic flow through the edge \(e_{u \to v}\). In 1989, Jerrum and Sinclair showed that one can lower-bound the conductance\(^{3}\) \(\Phi\), and hence, via Cheeger's inequality \(\lambda_1 \geq \Phi^2/2\), the spectral gap\(^{4}\) \(\lambda_1\): \[ \Phi \geq \frac{1}{2\rho},\quad \lambda_1 \geq \frac{1}{8\rho^2}, \] yielding a mixing-time bound (in total variation) of \[ \tau(\epsilon) \leq \frac{2}{\Phi^2} \left(\ln \frac{1}{\pi_{\min}} + \ln \frac{1}{\epsilon} \right). \]
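These quantities are easy to verify on a toy chain. The sketch below brute-forces the conductance of the lazy random walk on an 8-cycle (feasible only because the state space is tiny) and estimates the spectral gap by power iteration, then checks Cheeger's inequality \(\Phi^2/2 \leq \lambda_1 \leq 2\Phi\):

```python
import math
import random

def lazy_cycle(n):
    """Transition matrix of the lazy simple random walk on an n-cycle."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = 0.5
        P[i][(i - 1) % n] = 0.25
        P[i][(i + 1) % n] = 0.25
    return P

def conductance(P, pi):
    """Phi = min over sets S with 0 < pi(S) <= 1/2 of Q(S, S^c)/pi(S).
    Brute force over all subsets: exponential, tiny chains only."""
    n = len(P)
    best = float("inf")
    for mask in range(1, 1 << n):
        S = [i for i in range(n) if mask >> i & 1]
        pS = sum(pi[i] for i in S)
        if pS == 0 or pS > 0.5:
            continue
        flow = sum(pi[i] * P[i][j]
                   for i in S for j in range(n) if not (mask >> j & 1))
        best = min(best, flow / pS)
    return best

def spectral_gap(P, iters=3000, seed=0):
    """1 - lambda_2 for a symmetric P, by power iteration on the
    complement of the top (all-ones) eigenvector."""
    rng = random.Random(seed)
    n = len(P)
    v = [rng.random() for _ in range(n)]
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]        # project out the uniform direction
        v = [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    Pv = [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    return 1.0 - sum(v[i] * Pv[i] for i in range(n))  # Rayleigh quotient
```

For the 8-cycle the worst cut is a contiguous arc of 4 states, giving \(\Phi = 1/8\), while the true gap is \(\frac{1}{2}(1 - \cos(\pi/4)) \approx 0.146\), comfortably inside the Cheeger window.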
This foundational result justified their FPRAS\(^{5}\) for approximating the permanent of dense \(0\)-\(1\) matrices, and has since been generalized. However, canonical paths assign a single path to each pair of states, which can concentrate heavy congestion on a few edges. Fractional flows relax this: each pair spreads its demand over a probability distribution of paths. Let \( f_{x, y}(\gamma)\) be the amount of flow along a path \(\gamma\) from \(x\) to \(y\), subject to the constraint \[ \sum_{\gamma : x \to y} f_{x, y}(\gamma) = \pi(x) \pi(y). \] The total load on an edge \(e\) and the fractional congestion are \[ F(e) = \sum_{x, y}\sum_{\gamma \ni e} f_{x, y}(\gamma),\quad \rho_f = \max_{e \in E} \frac{F(e)}{Q(e)}. \]
A vital insight in modern canonical path theory is that rapid mixing is still ensured when \(\rho_f\) is small, and fractional flows often achieve smaller congestion than any deterministic path assignment. Path length matters too: Diaconis and Stroock showed that with maximum path length \(l_{\max} = \max_{x, y} \left| \gamma_{x, y} \right|\), the spectral gap satisfies \[ \lambda_1 \geq \frac{1}{\rho_f \, l_{\max}}. \] This bound tightens mixing-time estimates by accounting for both congestion and how long information takes to propagate through the graph structure. When paths are short relative to the congestion (\(l_{\max} \ll \rho\)), it can be much stronger than the quadratic \(\rho^2\) bound.
In 2025, Chen et al. substantially improved on the Jerrum-Sinclair result by sharpening the analysis of the perfect matching chain. For a graph with \(n\) vertices, \(m\) edges, and maximum degree \(\Delta\), they showed that the mixing time satisfies \[ \tau(\epsilon) = \widetilde{\mathcal{O}} \left(\Delta^2 m + \Delta^2 \log\left(1/\epsilon\right) \right). \] In summary, the journey from Jerrum and Sinclair's classical framework to the modern refinement by Chen et al. illustrates the mathematical richness of sampling theory. For problems in quantum simulation and learning algorithms, where sampling is vital, this theory gives mathematical assurance that Monte Carlo is not just plausible, it is provably fast.
1. The mixing time is defined as \[\tau(\epsilon) = \min\{t : \|\mu_t - \pi\|_{TV} \leq \epsilon\}\] where \(\mu_t\) is the distribution after \(t\) steps and \(\pi\) is the stationary distribution. It measures how many steps are needed to get \(\epsilon\)-close in total variation distance.
2. Total variation distance \(\|\mu - \pi\|_{TV} = \frac{1}{2}\sum_x |\mu(x) - \pi(x)|\) measures how far a distribution \(\mu\) is from the stationary distribution \(\pi\). It's the natural metric for convergence: at most 1, and 0 only when the distributions are identical.
3. Conductance \(\Phi\) measures the bottleneck probability flow: how easily probability mass can move between large sets. Larger \(\Phi\) means fewer bottlenecks and faster mixing.
4. The spectral gap \(\lambda_1 = 1 - |\lambda_2|\) measures the separation between the largest and second-largest eigenvalues of the transition matrix. A larger gap means exponentially faster convergence.
5. A Fully Polynomial Randomized Approximation Scheme is a randomized algorithm that computes a \((1 \pm \epsilon)\)-approximation in time polynomial in both the input size and \(1/\epsilon\).
References
- Chen, Z., Liu, K. and Vuong, T. V. (2025). Faster Mixing of the Jerrum–Sinclair Chain.
- Guruswami, V. (2016). Rapidly Mixing Markov Chains: A Comparison of Techniques.
- Diaconis, P. and Stroock, D. (1991). Geometric Bounds for Eigenvalues of Markov Chains.
- Jerrum, M. and Sinclair, A. (1989). Approximating the Permanent.