In my studies of quantum information and learning theory, I find that a particular computational tool has become the center of rigorous mathematical research: sampling. This post walks through the rich theory of MCMC sampling and the results that justify Monte Carlo simulation as a reliable numerical workhorse in many fields.
In quantum information, Path-Integral Monte Carlo (PIMC) is a primary method for simulating thermal states of stoquastic quantum systems. PIMC transforms a quantum partition function into a classical Gibbs distribution by introducing an "imaginary-time" dimension, then applies Markov Chain Monte Carlo to sample from that distribution. Recently, these Markov chains were proven to mix rapidly, using the canonical path method. This rigorously justifies the use of PIMC in partition function estimation.
Suppose we hope to estimate the partition function of a \(1\)D stoquastic Hamiltonian \[ H = \sum_{i=1}^{n-1}Z_{i}Z_{i+1} + \sum_{i=1}^{n}h_iX_i. \] Under a Suzuki-Trotter decomposition, this quantum system becomes a classical one on an \(n \times \beta\) space-time lattice, with a Gibbs distribution over paths \(x \in \{0, 1 \}^{n\beta}\). We can then sample from this distribution with MCMC, typically via Glauber dynamics. However, our estimates can be wildly inaccurate unless we prove the mixing time\(^{1}\) to be small.
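To make the world-line picture concrete, here is a minimal single-site Glauber sweep over an \(n \times m\) lattice of \(\pm 1\) spins, periodic in the imaginary-time direction. The uniform couplings `J_space` and `J_time` are illustrative stand-ins, not the actual Suzuki-Trotter weights (which depend on \(\beta\) and the fields \(h_i\)):

```python
import math
import random

def glauber_sweep(spins, J_space, J_time, rng):
    """One Glauber sweep over an n x m lattice of +-1 spins.

    The lattice is periodic along imaginary time (axis 1) and open in
    space, mirroring the world-line picture; J_space and J_time are toy
    uniform couplings, not the exact Suzuki-Trotter weights.
    """
    n, m = len(spins), len(spins[0])
    for _ in range(n * m):
        i, t = rng.randrange(n), rng.randrange(m)
        # Local field from spatial neighbours (open chain) ...
        h = 0.0
        if i > 0:
            h += J_space * spins[i - 1][t]
        if i < n - 1:
            h += J_space * spins[i + 1][t]
        # ... and from imaginary-time neighbours (periodic).
        h += J_time * (spins[i][(t - 1) % m] + spins[i][(t + 1) % m])
        # Heat-bath (Glauber) rule: resample the spin from its
        # conditional distribution given its neighbours.
        p_up = 1.0 / (1.0 + math.exp(-2.0 * h))
        spins[i][t] = 1 if rng.random() < p_up else -1
    return spins
```

Each sweep resamples \(nm\) randomly chosen sites; whether a polynomial number of sweeps suffices is exactly the mixing-time question above.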
In learning theory, specifically in probabilistic inference on graphical models, sampling-based algorithms (e.g. Gibbs sampling, importance sampling) are essential for approximating joint distributions when exact inference is intractable. Consider a Markov random field over binary variables \(X_1, \ldots, X_n\) with pairwise interactions: \[ P(x) \propto \exp \left( \sum_{(i, j) \in E} \theta_{ij} x_i x_j + \sum_{i}\theta_i x_i \right) \] where we want to compute marginal probabilities \(P(X_i = 1)\). Exact inference is \(\mathsf{\# P}\)-hard, so we resort to Gibbs sampling. But how long does the sampler take to converge\(^{2}\) to the true distribution? Once again, this depends on how fast the Markov chain mixes. Both of these examples illustrate that sampling is not just a numerical tool but a computational lifeline in need of theoretical justification: do these chains converge fast enough for meaningful inference?
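As a concrete sketch (the graph, parameter values, and function names below are illustrative, not from any particular model), a single-site Gibbs sampler for this MRF repeatedly resamples each \(x_i \in \{0,1\}\) from its logistic conditional, and for small \(n\) we can check it against brute-force enumeration:

```python
import itertools
import math
import random

def gibbs_marginals(theta_node, theta_edge, n_sweeps=20000, burn_in=1000, seed=0):
    """Estimate P(X_i = 1) for the pairwise binary MRF
    P(x) ~ exp(sum theta_edge[i,j] x_i x_j + sum theta_node[i] x_i),
    x_i in {0, 1}, via single-site Gibbs sampling."""
    rng = random.Random(seed)
    n = len(theta_node)
    nbrs = {i: [] for i in range(n)}
    for (i, j), w in theta_edge.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    x = [rng.randrange(2) for _ in range(n)]
    counts = [0] * n
    for sweep in range(n_sweeps):
        for i in range(n):
            field = theta_node[i] + sum(w * x[j] for j, w in nbrs[i])
            # Conditional P(x_i = 1 | rest) is the logistic of the local field.
            x[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-field)) else 0
        if sweep >= burn_in:
            for i in range(n):
                counts[i] += x[i]
    return [c / (n_sweeps - burn_in) for c in counts]

def exact_marginals(theta_node, theta_edge):
    """Brute-force marginals by enumerating all 2^n states (tiny n only)."""
    n = len(theta_node)
    Z, marg = 0.0, [0.0] * n
    for x in itertools.product([0, 1], repeat=n):
        w = math.exp(sum(theta_node[i] * x[i] for i in range(n))
                     + sum(v * x[i] * x[j] for (i, j), v in theta_edge.items()))
        Z += w
        for i in range(n):
            marg[i] += w * x[i]
    return [m / Z for m in marg]
```

On a 4-variable chain the Gibbs estimates land close to the exact marginals within a few thousand sweeps; the point of mixing-time theory is to predict when that happens for models far too large to enumerate.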
Canonical paths are a powerful combinatorial framework for proving rapid mixing of high-dimensional Markov chains. Define a set of paths \( \Gamma = \{ \gamma_{x, y} \}_{x, y \in \Omega} \), one for each pair of distinct states in the state space \( \Omega \). The edge congestion is \[ \rho = \max_{e \in E} \frac{1}{Q(e)} \sum_{(x, y) : e \in \gamma_{x, y}} \pi(x)\pi(y) \] where \( Q(e) = \pi(u) P(u, v)\) is the ergodic flow through the edge \(e_{u \to v}\). In 1989, Jerrum and Sinclair showed that one can lower-bound the conductance\(^{3}\) \(\Phi\), and hence, via Cheeger's inequality \(\lambda_1 \geq \Phi^2/2\), the spectral gap\(^{4}\) \(\lambda_1\): \[ \Phi \geq \frac{1}{2\rho},\quad \lambda_1 \geq \frac{1}{8\rho^2}, \] yielding a mixing-time bound (in total variation) of \[ \tau(\epsilon) \leq \frac{2}{\Phi^2} \left(\ln \frac{1}{\pi_{\min}} + \ln \frac{1}{\epsilon} \right). \]
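These quantities are easy to verify on a toy chain. The sketch below brute-forces the conductance of the lazy random walk on an 8-cycle (feasible only because the state space is tiny) and estimates the spectral gap by power iteration, then checks Cheeger's inequality \(\Phi^2/2 \leq \lambda_1 \leq 2\Phi\):

```python
import math
import random

def lazy_cycle(n):
    """Transition matrix of the lazy simple random walk on an n-cycle."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = 0.5
        P[i][(i - 1) % n] = 0.25
        P[i][(i + 1) % n] = 0.25
    return P

def conductance(P, pi):
    """Phi = min over sets S with 0 < pi(S) <= 1/2 of Q(S, S^c)/pi(S).
    Brute force over all subsets: exponential, tiny chains only."""
    n = len(P)
    best = float("inf")
    for mask in range(1, 1 << n):
        S = [i for i in range(n) if mask >> i & 1]
        pS = sum(pi[i] for i in S)
        if pS == 0 or pS > 0.5:
            continue
        flow = sum(pi[i] * P[i][j]
                   for i in S for j in range(n) if not (mask >> j & 1))
        best = min(best, flow / pS)
    return best

def spectral_gap(P, iters=3000, seed=0):
    """1 - lambda_2 for a symmetric P, by power iteration on the
    complement of the top (all-ones) eigenvector."""
    rng = random.Random(seed)
    n = len(P)
    v = [rng.random() for _ in range(n)]
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]        # project out the uniform direction
        v = [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    Pv = [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    return 1.0 - sum(v[i] * Pv[i] for i in range(n))  # Rayleigh quotient
```

For the 8-cycle the worst cut is a contiguous arc of 4 states, giving \(\Phi = 1/8\), while the true gap is \(\frac{1}{2}(1 - \cos(\pi/4)) \approx 0.146\), comfortably inside the Cheeger window.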
This foundational result justified their FPRAS\(^{5}\) for approximating the permanent of dense \(0\)-\(1\) matrices, and has since been generalized. However, canonical paths assign a single path to each pair of states, which can concentrate heavy congestion on a few edges. Fractional flows relax this: each pair spreads its demand over a probability distribution of paths. Let \( f_{x, y}(\gamma)\) be the amount of flow along a path \(\gamma\) from \(x\) to \(y\), subject to the constraint \[ \sum_{\gamma : x \to y} f_{x, y}(\gamma) = \pi(x) \pi(y). \] The total load on an edge \(e\) and the fractional congestion are \[ F(e) = \sum_{x, y}\sum_{\gamma \ni e} f_{x, y}(\gamma),\quad \rho_f = \max_{e \in E} \frac{F(e)}{Q(e)}. \]
A vital insight in modern canonical path theory is that rapid mixing is still ensured when \(\rho_f\) is small, and fractional flows often achieve smaller congestion than any deterministic path assignment. Path length matters too: Diaconis and Stroock showed that with maximum path length \(l_{\max} = \max_{x, y} \left| \gamma_{x, y} \right|\), the spectral gap satisfies \[ \lambda_1 \geq \frac{1}{\rho_f \, l_{\max}}. \] This bound tightens mixing-time estimates by accounting for both congestion and how long information takes to propagate through the graph structure. When paths are short relative to the congestion (\(l_{\max} \ll \rho\)), it can be much stronger than the quadratic \(\rho^2\) bound.
In 2025, Chen et al. substantially improved on the Jerrum-Sinclair result by sharpening the analysis of the perfect matching chain. For a graph with \(n\) vertices, \(m\) edges, and maximum degree \(\Delta\), they showed that the mixing time satisfies \[ \tau(\epsilon) = \widetilde{\mathcal{O}} \left(\Delta^2 m + \Delta^2 \log\left(1/\epsilon\right) \right). \] In summary, the journey from Jerrum and Sinclair's classical framework to the modern refinement by Chen et al. illustrates the mathematical richness of sampling theory. For problems in quantum simulation and learning algorithms, where sampling is vital, this theory gives mathematical assurance that Monte Carlo is not just plausible, it is provably fast.
1. The mixing time is defined as \[\tau(\epsilon) = \min\{t : \|\mu_t - \pi\|_{TV} \leq \epsilon\}\] where \(\mu_t\) is the distribution after \(t\) steps and \(\pi\) is the stationary distribution. It measures how many steps are needed to get \(\epsilon\)-close in total variation distance.
2. Total variation distance \(\|\mu - \pi\|_{TV} = \frac{1}{2}\sum_x |\mu(x) - \pi(x)|\) measures how far a distribution \(\mu\) is from the stationary distribution \(\pi\). It's the natural metric for convergence: at most 1, and 0 only when the distributions are identical.
3. Conductance \(\Phi\) measures the bottleneck probability flow: how easily probability mass can move between large sets. Larger \(\Phi\) means fewer bottlenecks and faster mixing.
4. The spectral gap \(\lambda_1 = 1 - |\lambda_2|\) measures the separation between the largest and second-largest eigenvalues of the transition matrix. A larger gap means exponentially faster convergence.
5. A Fully Polynomial Randomized Approximation Scheme is a randomized algorithm that computes a \((1 \pm \epsilon)\)-approximation in time polynomial in both the input size and \(1/\epsilon\).
References
- Chen, Z., Liu, K. and Vuong, T. V. (2025). Faster Mixing of the Jerrum–Sinclair Chain.
- Guruswami, V. (2016). Rapidly Mixing Markov Chains: A Comparison of Techniques.
- Diaconis, P. and Stroock, D. (1991). Geometric Bounds for Eigenvalues of Markov Chains.
- Jerrum, M. and Sinclair, A. (1989). Approximating the Permanent.