I am (finally) taking a course on statistical physics, and have found that the field connects a multitude of my research interests in an oddly[1] elegant way. In this post, I want to share some overlapping ideas between statistical mechanics, (softmax) self-attention, regression, and neuroscientific theories of memory. My hope is that, after reading this, one appreciates the lack of walls between these fields in the same way that I do.
I. Hedging Your Tokens
Consider a physical system of \(N\) discrete states, with the \(j\)-th state carrying an energy \(E_j\). When this system is in thermal equilibrium with a heat bath of temperature \(T\), the probability of finding the system in state \(j\) is given by the Boltzmann distribution: \[ p_j = \frac{e^{-\beta E_j}}{Z},\quad \text{where } Z = \sum_{k=1}^N e^{-\beta E_k} \text{ and } \beta = \frac{1}{k_B T}. \] Here, \(\beta\) is the inverse temperature and \(Z\) is the partition function. \(Z\) serves as both a normalization constant and a generating function that encodes all thermodynamic properties of the system. For instance, the average energy \(\langle E \rangle\) can be computed as \[ \langle E \rangle = -\frac{\partial \log Z}{\partial \beta}. \] We are interested in a specific quantity, the Helmholtz free energy, defined as \[ F = -\frac{1}{\beta} \log Z = \langle E \rangle - TS, \] where \(S\) is the system's entropy. The free energy \(F\) captures the tradeoff between energy and entropy, and is a fundamental quantity in thermodynamics that determines the system's equilibrium properties and phase transitions. We are also interested in macroscopic observables, which are computed as thermal averages over microstates: \[ \langle O\rangle = \sum_{j=1}^N p_j O_j = \frac{1}{Z} \sum_{j=1}^N O_j e^{-\beta E_j}. \] With these definitions, we can describe (almost)[2] everything in equilibrium statistical mechanics. States compete for occupancy based on their energies, and \(\beta\) controls how sharply the measure discriminates between them. At high temperatures (low \(\beta\)), all states become equally likely: the system forgets its energy landscape and becomes maximally entropic. At low temperatures (high \(\beta\)), the system becomes more deterministic, concentrating its probability mass on the lowest-energy states and "freezing out" higher-energy states.
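These definitions fit in a few lines of code. Below is a minimal sketch in pure Python (the four-state energy spectrum and the convention \(k_B = 1\) are illustrative choices of mine, not from the text) that builds the Boltzmann distribution, checks \(\langle E \rangle = -\partial \log Z / \partial \beta\) by finite differences, and verifies \(F = \langle E \rangle - TS\):

```python
import math

def boltzmann(energies, beta):
    """Boltzmann weights e^{-beta E_j}, normalized by the partition function Z."""
    weights = [math.exp(-beta * E) for E in energies]
    Z = sum(weights)
    return [w / Z for w in weights], Z

E = [0.0, 1.0, 2.0, 5.0]   # illustrative energy spectrum
beta = 1.0                 # inverse temperature (k_B = 1)
p, Z = boltzmann(E, beta)

# <E> as a thermal average over microstates
avg_E = sum(pj * Ej for pj, Ej in zip(p, E))

# Check <E> = -d(log Z)/d(beta) by central finite difference
h = 1e-5
dlogZ = (math.log(boltzmann(E, beta + h)[1]) - math.log(boltzmann(E, beta - h)[1])) / (2 * h)
assert abs(avg_E + dlogZ) < 1e-4

# Free energy F = -(1/beta) log Z equals <E> - T S, with S the Gibbs-Shannon entropy
S = -sum(pj * math.log(pj) for pj in p)
F = -math.log(Z) / beta
assert abs(F - (avg_E - S / beta)) < 1e-9
```

The identity in the last assertion is exact: substituting \(\log p_j = -\beta E_j - \log Z\) into the entropy collapses \(\langle E \rangle - TS\) to \(-\frac{1}{\beta}\log Z\) term by term.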
Now, consider a different system: a transformer processing a token sequence. A query token arrives and needs to retrieve information from a context of \(N\) tokens, each carrying a value vector. The query has finite representational capacity: it can only output a single vector, which must be some weighted average of the stored values. This becomes an allocation problem: how should the query distribute a finite budget of weight across the context tokens to best retrieve the information most relevant to the query?
This is, clearly, a constrained optimization problem, and it is governed by two competing objectives. The query aims to concentrate on the most compatible tokens, those whose keys align well with the query, because they carry the most relevant information. However, it also wants to hedge against committing too aggressively to any single token, because the key similarities are computed from noisy representations that might not be strictly relevant. Concentration is an energy-minimization problem, while hedging is an entropy-maximization one.
Let us define the energy of the \(j\)-th context token relative to the query as \[ E_j = -q^T k_j, \] where \(q\) is the query vector and \(k_j\) is the key vector of token \(j\). Tokens with high query-key alignment will have lower energy; tokens with low alignment will have higher energy. With this definition, the free energy of a probability distribution \(p\) over context tokens is \[ F[p] = \sum_{j} p_j E_j + \frac 1\beta \sum_j p_j \log\, p_j = -\sum_j p_j q^T k_j + \frac 1\beta \sum_j p_j \log\, p_j. \] The first term rewards concentration on compatible tokens. The second term, a negative Shannon entropy, penalizes overcommitment to any single token, thus encouraging hedging. The distribution that optimally balances these objectives is the one that minimizes the free energy. We derive it here, first by setting up the Lagrangian for this constrained optimization problem: \[ \mathcal{L}(p, \lambda) = F[p] + \lambda\left(\sum_j p_j - 1\right) = -\sum_j p_j q^T k_j + \frac 1\beta \sum_j p_j \log\, p_j + \lambda\left(\sum_j p_j - 1\right). \] We then take the functional derivative with respect to \(p_j\) and set it to zero: \[ \frac{\partial \mathcal{L}}{\partial p_j} = -q^T k_j + \frac{1}{\beta} (1 + \log\, p_j) + \lambda = 0. \] Solving for \(p_j\) recovers a familiar form: \[ p_j = \frac{\exp\left(\beta q^T k_j\right)}{\sum_{\ell} \exp\left(\beta q^T k_\ell\right)}. \] This is the Boltzmann distribution! The Lagrange multiplier calculation is classical, and the exponential form is forced by the variational principle, not a choice. Any system that allocates probability over discrete options by trading off expected compatibility against distributional uncertainty will arrive at the Gibbs measure[3]. Recall that each token also carries a value vector \(v_j\).
The output of the allocation, the query's retrieved information, is the thermal expectation of these values: \[ \langle v \rangle = \sum_j p_j v_j = \frac{\sum_j v_j \exp\left(\beta q^T k_j\right)}{\sum_{\ell} \exp\left(\beta q^T k_\ell\right)} = \mathrm{softmax}\left(\beta q^T K\right) V. \] By allowing context tokens to compete for retrieval in a physically plausible way, we have recovered the softmax attention formula from a free-energy minimization principle. The temperature \(T\) (or inverse temperature \(\beta\)) controls how much the query hedges its bets across the context. At high temperatures, the query retrieves a near-uniform average of all values; at low temperatures, it concentrates on the most compatible tokens.
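The variational claim can be checked numerically. In the sketch below (the toy queries, keys, and values are my own illustrative choices), the softmax allocation is computed directly, and its free energy \(F[p]\) is confirmed to be lower than that of an alternative distribution such as the uniform one:

```python
import math

def softmax_attention(q, keys, values, beta=1.0):
    """Retrieve the thermal average of values under p_j ∝ exp(beta q·k_j)."""
    scores = [beta * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                          # stabilize the exponentials
    w = [math.exp(s - m) for s in scores]
    Z = sum(w)
    p = [wi / Z for wi in w]
    out = [sum(pj * v[d] for pj, v in zip(p, values)) for d in range(len(values[0]))]
    return p, out

def free_energy(p, q, keys, beta=1.0):
    """F[p] = sum_j p_j E_j + (1/beta) sum_j p_j log p_j, with E_j = -q·k_j."""
    E = [-sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    return sum(pj * Ej for pj, Ej in zip(p, E)) + sum(pj * math.log(pj) for pj in p) / beta

q = [1.0, 0.0]
keys = [[0.9, 0.1], [0.0, 1.0], [-0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p_star, out = softmax_attention(q, keys, values)

# The Boltzmann weights minimize F: any other normalized distribution scores higher
uniform = [1 / 3] * 3
assert free_energy(p_star, q, keys) < free_energy(uniform, q, keys)
```

Raising `beta` concentrates `p_star` on the best-aligned key; lowering it pushes the allocation toward uniform, exactly the hedging tradeoff described above.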
II. Do You Remember?
So we have established that softmax attention is a Boltzmann distribution: the attention weights emerge from a free-energy minimization principle balancing concentration against hedging. This raises some deeper questions. What kind of computational system is this? What is the attention mechanism actually doing when it retrieves a weighted combination of stored values based on their compatibility with the query?
The answer predates the transformer architecture by decades: associative memory.
An associative memory is a system that stores patterns and retrieves them by content rather than by address. Instead of saying "read memory location 0x7FFF", we present a cue. The cue is possibly noisy, possibly partial, and the system returns the stored pattern most compatible with that cue. Retrieval rests on three components: a set of stored items, a similarity (or energy) function that determines compatibility with the cue, and a retrieval rule that moves the cue toward stored patterns (or a weighted combination of them). This is content-addressable memory; the query itself determines what gets retrieved. Self-attention follows precisely the same structure. Keys and values represent stored memories. The query is a noisy cue that aims to retrieve relevant information from the context. The dot products \(q^T k_j\) measure the compatibility between the query and each stored item. The softmax normalizes these into a retrieval distribution, and the output \(\sum_j \alpha_j v_j\)[4] is the retrieved memory.
The canonical example of an associative memory in physics is the Hopfield network. In its classical form, the system consists of \(N\) binary neurons \(s_i \in \{-1, 1\}\) and \(M\) stored patterns \( \xi^\mu \in \{-1, 1\}^N \) for \(\mu = 1, \ldots, M\). The network stores patterns via a Hebbian[5] coupling matrix \[ J_{ij} = \frac{1}{N} \sum_{\mu=1}^M \xi_i^\mu \xi_j^\mu, \] and the energy of a configuration \(s\) is given by \[ E(s) = -\frac{1}{2} \sum_{i,j} J_{ij} s_i s_j = -\frac{1}{2}s^T J s. \] Stored patterns are local minima in this energy landscape. We retrieve a pattern by initializing the network with a noisy cue and letting the dynamics flip spins to reduce the energy.
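The classical network is small enough to simulate directly. A pure-Python sketch (the network size, pattern count, and noise level are illustrative choices, well inside the capacity regime discussed below) that stores random patterns via the Hebbian rule and recovers one from a corrupted cue:

```python
import random

random.seed(0)
N, M = 64, 3  # neurons, stored patterns (loading M/N ≈ 0.05, below capacity)
patterns = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(M)]

# Hebbian couplings J_ij = (1/N) sum_mu xi_i^mu xi_j^mu, with zero diagonal
J = [[0.0 if i == j else sum(p[i] * p[j] for p in patterns) / N
      for j in range(N)] for i in range(N)]

def retrieve(cue, sweeps=5):
    """Asynchronous updates s_i <- sign(sum_j J_ij s_j), which never increase E(s)."""
    s = list(cue)
    for _ in range(sweeps):
        for i in range(N):
            h = sum(J[i][j] * s[j] for j in range(N))
            s[i] = 1 if h >= 0 else -1
    return s

# Corrupt 10 of 64 spins of pattern 0 and let the dynamics relax
noisy = list(patterns[0])
for i in random.sample(range(N), 10):
    noisy[i] *= -1
recovered = retrieve(noisy)

# Overlap with the stored pattern; 1.0 means perfect retrieval
overlap = sum(a * b for a, b in zip(recovered, patterns[0])) / N
```

At this loading the cue falls well inside the stored pattern's energy basin, so the overlap ends at or very near 1.0; pushing \(M\) toward \(0.14N\) makes this retrieval increasingly unreliable, as the next paragraph describes.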
The physics of the Hopfield network is incredibly well-studied. Amit, Gutfreund, and Sompolinsky's seminal 1985 paper analyzed the network's capacity to store and retrieve patterns, showing that it can robustly store up to \(0.138N\) random patterns before retrieval degrades. The network undergoes a phase transition as the number of stored patterns increases, moving from a regime of reliable retrieval to one of spurious states and retrieval failure. The free energy landscape of the Hopfield network is rich with local minima corresponding to stored patterns, and the dynamics of retrieval can be understood as a descent in this landscape. The storage capacity scaling as \(M \sim 0.14 N\) poses an issue: one can only store about 14% as many patterns as one has neurons before retrieval fails due to interference between memories. The culprit is the quadratic interaction \(s^T J s\), which creates broad, overlapping energy basins that blur together when too many patterns are stored.
The modern Hopfield line, initiated by Krotov and Hopfield, fixes this by replacing the quadratic interaction with a higher-order one, ultimately an exponential interaction. This sharpens the energy basins dramatically, boosting the storage capacity from \(M \sim 0.14 N\) to \(M \sim \exp(\alpha N)\) for some \(\alpha > 0\). Demircigil et al. proved this rigorously in 2017, and Ramsauer et al. took the model to its logical conclusion in 2020. For stored patterns \(X = [x_1, \ldots, x_M]\) and a query state \(\xi\), the continuous Hopfield energy is \[ E(\xi) = -\frac{1}{\beta} \underbrace{\log\sum_{\mu=1}^M \exp\left(\beta \xi^T x_\mu\right)}_{\text{log-sum-exp}}. \] To minimize this energy, we compute the gradient (the factors of \(\beta\) cancel): \[ \nabla E(\xi) = -\frac{\sum_{\mu=1}^M x_\mu \exp\left(\beta \xi^T x_\mu\right)}{\sum_{\nu=1}^M \exp\left(\beta \xi^T x_\nu\right)} = -\mathrm{softmax}\left(\beta \xi^T X\right) X. \] Yet again, we have recovered softmax attention. The fixed-point update \(\xi \leftarrow \mathrm{softmax}(\beta \xi^T X)\, X\) moves the query state to a softmax-weighted combination of the stored patterns \(X\) at inverse temperature \(\beta\); with the additional quadratic term \(\frac{1}{2}\xi^T\xi\) that Ramsauer et al. include in the full energy, this update is exactly the stationarity condition \(\nabla E(\xi) = 0\). A single attention step is a single step of energy minimization in a modern Hopfield network. The log-sum-exp in the energy is the free energy of the system (the same object we derived in section I), now appearing as the energy function of an associative memory. Ramsauer et al. went on to show that the energy landscape of the Hopfield network has three regimes of interest:
- A global fixed point, where \(\xi\) is the average of all stored patterns. This corresponds to uniform attention, the high-temperature (paramagnetic) phase.
- A metastable phase, where \(\xi\) is a weighted average over a subset of patterns. This corresponds to distributed attention over a few relevant tokens.
- A single-pattern fixed point, where \(\xi\) has converged to one stored pattern. This corresponds to peaked, nearly one-hot attention, the low-temperature (ferromagnetic) phase.
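The update rule and its temperature regimes take only a few lines to demonstrate. In the sketch below (orthogonal stored patterns chosen for clarity, not drawn from the text), one low-temperature step snaps the query onto the nearest stored pattern, while a high-temperature step lands near the average of all patterns, matching the single-pattern and global-fixed-point regimes above:

```python
import math

def hopfield_update(xi, X, beta):
    """One fixed-point step xi <- softmax(beta xi^T X) X of the modern Hopfield energy."""
    scores = [beta * sum(a * b for a, b in zip(xi, x)) for x in X]
    m = max(scores)                          # stabilize the exponentials
    w = [math.exp(s - m) for s in scores]
    Z = sum(w)
    return [sum(wi / Z * x[d] for wi, x in zip(w, X)) for d in range(len(xi))]

# Three orthogonal stored patterns; a noisy cue nearest the first
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
xi = [0.8, 0.3, 0.1]

# Low temperature (large beta): one step lands, numerically, on the nearest pattern
low_T = hopfield_update(xi, X, beta=50.0)
assert max(abs(a - b) for a, b in zip(low_T, X[0])) < 1e-4

# High temperature (small beta): the update lands near the average of all patterns
high_T = hopfield_update(xi, X, beta=0.01)
avg = [1 / 3, 1 / 3, 1 / 3]
assert max(abs(a - b) for a, b in zip(high_T, avg)) < 0.01
```

Intermediate values of `beta` produce weighted mixtures of a few close patterns, the metastable regime in the middle bullet.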
Developed almost simultaneously with Hopfield networks, Kanerva's sparse distributed memory (SDM) is a second associative memory architecture that connects to attention through a completely different mathematical route. SDM was designed to solve the "best match problem": given a set of stored memories and a query, return the best match to the query quickly. SDM operates in high-dimensional binary space \(\{0, 1\}^N\). There are three main primitives:
- Patterns: the memories to be stored, with address \(p_a^\mu\) and pointer \(p_p^\mu\).
- Neurons: fixed addresses \(x_a^\tau\) in the space that store superpositions of nearby patterns.
- Query: the input pattern \(\xi\).
What makes this particularly remarkable is the chain of implications. Writing stores a pattern's pointer in every neuron whose address lies within Hamming distance \(d\) of the pattern's address; reading sums the pointers of every neuron within distance \(d\) of the query. A stored pattern's contribution to the read is therefore proportional to the number of neurons in the intersection of two Hamming circles, and Bricken and Pehlevan[6] showed that this circle intersection decays approximately exponentially with the query-pattern distance, so SDM's retrieval dynamics approximate softmax attention. We showed that softmax attention is a Boltzmann distribution: a free-energy minimization principle. Is SDM, then, implementing an approximation of the same free-energy principle, even though its original formulation involves no statistical mechanics? The exponential weighting that a Gibbs measure assigns to states by energy is the same exponential weighting that high-dimensional geometry assigns to stored patterns by distance. The approximate exponentiality of the circle intersection is a consequence of concentration of measure in high dimensions, the same mathematics that makes the Boltzmann distribution the universal equilibrium measure.
The relationship between these architectures is clarified by a result due to Keeler (1988): SDM is a generalization of the classical Hopfield network. Hopfield networks are the special case where there is no distributed read/write (\(d = 0\)), neurons coincide with the stored patterns (the neuron count \(r\) equals the pattern count \(m\)), and pattern weighting is a rescaled Hamming distance without thresholding.
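Kanerva's read/write circles are straightforward to simulate. The sketch below (the dimension, neuron count, and radius \(d\) are illustrative choices of mine, not Kanerva's recommended parameters) writes three autoassociative patterns into a pool of fixed random neuron addresses and reads one back from a corrupted cue:

```python
import random

random.seed(1)
N, n_neurons, d = 32, 2000, 12   # address length, neuron pool size, Hamming radius

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def rand_bits():
    return [random.randint(0, 1) for _ in range(N)]

neurons = [rand_bits() for _ in range(n_neurons)]      # fixed neuron addresses x_a^tau
counters = [[0] * N for _ in range(n_neurons)]         # superposed pointer storage

patterns = [rand_bits() for _ in range(3)]             # autoassociative: address == pointer
for p in patterns:
    for addr, ctr in zip(neurons, counters):
        if hamming(addr, p) <= d:                      # write into every neuron in the circle
            for i, bit in enumerate(p):
                ctr[i] += 1 if bit else -1

def read(query):
    """Sum pointers of all neurons within distance d of the query, then threshold."""
    total = [0] * N
    for addr, ctr in zip(neurons, counters):
        if hamming(addr, query) <= d:
            for i in range(N):
                total[i] += ctr[i]
    return [1 if t >= 0 else 0 for t in total]

# Corrupt 4 of 32 bits; the circle intersection still heavily favors pattern 0
noisy = list(patterns[0])
for i in random.sample(range(N), 4):
    noisy[i] ^= 1
bit_errors = hamming(read(noisy), patterns[0])
```

The neurons in the intersection of the query's and the pattern's circles all vote for the stored pointer, so `bit_errors` comes out at or near zero; it is exactly this intersection count, as a function of query-pattern distance, that Bricken and Pehlevan showed is approximately exponential.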
III. Kernels of Truth
We have now arrived at softmax attention from two directions: as a Boltzmann distribution from free-energy minimization, and as the retrieval rule of an associative memory. There is a third perspective, arguably the most elementary, that reaches the same formula from classical nonparametric statistics. This derivation was first realized by Yu et al.; my presentation follows a recent vignette by Dhruv Pai at Tilde Research, and I include it here to close the triangle. The setting is regression. We have \(n\) input-output pairs \((x_i, y_i)\) and aim to predict the output at a new query point \(x_0\). The simplest possible model is a constant predictor, \(\hat f(x_0) = \beta_0\), fit by minimizing \[ \sum_{i=1}^n (y_i - \beta_0)^2, \] giving \(\hat\beta_0 = \frac{1}{n} \sum_{i=1}^n y_i\). This is the global mean, an estimator that ignores where \(x_0\) sits relative to the data. The fix is to make the model locally constant: weight each observation by its proximity to the query (you can see where this is going). Introduce a kernel \(K_h(x_i, x_0) = K(\|x_i - x_0\| / h)\) that decays with distance, upweighting nearby points and downweighting distant ones, with the bandwidth \(h\) controlling the scale of locality. The kernel-weighted least squares problem is \[ \hat{\beta}_0 = \arg\min_{\beta}\sum_{i=1}^n K_h(x_i, x_0) (y_i - \beta)^2. \] Taking the derivative and setting it to zero gives the Nadaraya-Watson estimator: \[ \hat{f}(x_0) = \frac{\sum_{i=1}^n K_h(x_i, x_0) y_i}{\sum_{j=1}^n K_h(x_j, x_0)}. \] This is a weighted average of the observed outputs, with weights proportional to the kernel similarity between the query point and each stored input. This estimator has a long history in econometrics and nonparametric statistics. Now, we relabel[7]. Let the inputs \(x_i\) be keys \(k_i\), the outputs \(y_i\) be values \(v_i\), and the query point \(x_0\) be a query \(q\).
We choose the Gaussian kernel for its smoothness and, crucially, its exponential form, giving \[ K_h(k_i, q) = \exp\left(-\frac{\|k_i - q\|^2}{2h^2}\right), \] and assume that queries and keys are \(L^2\)-normalized (the \(QK\)-norm)[8]. Then the squared distance simplifies: \[ \|q - k_i\|^2 = \|q\|^2 + \|k_i\|^2 - 2 q^T k_i = 2(1 - q^T k_i). \] Setting the bandwidth \(h = \sqrt{\tau}\), where \(\tau\) is a temperature parameter, the kernel becomes \[ K_\tau(k_i, q) = \exp\left(-\frac{1 - q^T k_i}{\tau}\right) = \exp\left(\frac{q^T k_i}{\tau}\right) \underbrace{\exp\left(-\frac{1}{\tau}\right)}_{\text{constant}}. \] The constant term cancels in the numerator-denominator ratio, and the Nadaraya-Watson estimator becomes \[ \hat{f}(q) = \frac{\sum_{i=1}^n \exp\left(\frac{q^T k_i}{\tau}\right) v_i}{\sum_{j=1}^n \exp\left(\frac{q^T k_j}{\tau}\right)} = \mathrm{softmax}\left(\frac{q^T K}{\tau}\right) V. \] Softmax attention, once more, with \(\tau = 1/\beta\) playing the role of kernel bandwidth.
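The equivalence is exact, not approximate, and easy to verify numerically. In the sketch below (the toy keys and values are my own illustrative choices), the Gaussian-kernel Nadaraya-Watson estimate with bandwidth \(h = \sqrt{\tau}\) on \(L^2\)-normalized queries and keys matches softmax attention to machine precision:

```python
import math

def normalize(v):
    """L2-normalize a vector (the QK-norm assumption)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def nadaraya_watson(q, keys, values, h):
    """Gaussian-kernel NW: average values with weights exp(-||q - k||^2 / (2 h^2))."""
    w = [math.exp(-sum((a - b) ** 2 for a, b in zip(q, k)) / (2 * h * h)) for k in keys]
    Z = sum(w)
    return [sum(wi / Z * v[d] for wi, v in zip(w, values)) for d in range(len(values[0]))]

def softmax_attention(q, keys, values, tau):
    """Softmax attention with temperature tau = 1/beta."""
    scores = [sum(a * b for a, b in zip(q, k)) / tau for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    Z = sum(w)
    return [sum(wi / Z * v[d] for wi, v in zip(w, values)) for d in range(len(values[0]))]

tau = 0.5
q = normalize([0.3, -1.2, 0.7])
keys = [normalize(k) for k in ([1.0, 0.2, 0.1], [-0.4, 0.9, 0.3], [0.2, -0.8, 1.1])]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# With QK-norm and h = sqrt(tau), the constant exp(-1/tau) cancels in the ratio
nw = nadaraya_watson(q, keys, values, h=math.sqrt(tau))
attn = softmax_attention(q, keys, values, tau)
assert max(abs(a - b) for a, b in zip(nw, attn)) < 1e-12
```

Without the normalization step, the two estimators diverge: the \(\|k_i\|^2\) terms no longer cancel, which is precisely why the \(QK\)-norm matters in footnote 8.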
What is striking is how little the three derivations share on the surface. The statistical mechanics route minimizes a free energy functional with an entropy regularizer and recovers the Boltzmann distribution through a Lagrange multiplier argument. The associative memory route defines an energy landscape over stored patterns and derives the softmax as the gradient of a log-sum-exp energy. The regression route solves a kernel-weighted least squares problem and recovers a ratio of exponential similarity weights. The statistical mechanics picture tells us about energy, entropy, temperature, and phase transitions. The memory picture tells us about storage, retrieval, capacity, and metastability. The regression picture tells us about attention as a learned similarity-weighted estimator over in-context data. What looks, in the transformer, like a neural network primitive is the meeting point of thermodynamics, memory, and estimation. The walls between these fields are thinner than we are usually taught to believe.
Studying for this post has been both deeply rewarding and challenging for my worldview. I came in asking whether the universe computes. I am leaving less certain that this is the right question. Beneath implementation and physical substrate, the three derivations share a problem statement: how to distribute belief over possibilities under constraint. The Boltzmann distribution, softmax attention, and the Nadaraya-Watson estimator are the same solution to the same inferential problem, discovered independently by thermodynamics, machine learning, and statistics. If there is a sense in which the universe computes, it may be this: that the universe is, at every scale, solving the problem of inference, and we keep rediscovering its solution.
1. "Oddly" is a massive understatement. Some of the correspondences between statistical mechanics, neuroscience, and learning theory have genuinely changed how I think about what the universe is doing. The boundary between a physical process and a computation is much thinner than I was taught to believe, and I'm not sure it exists at all.
2. Our specific setup is known as the canonical ensemble, where the system can exchange energy with a heat bath but has a fixed number of particles and volume. In situations where the particles themselves can move in and out of the system, we would use the grand canonical ensemble, which introduces a chemical potential \(\mu\) to account for particle exchange. For the purposes of the connections I wish to draw here, the canonical ensemble suffices.
3. A Gibbs measure is the infinite-volume, rigorous generalization of the Boltzmann distribution. In our finite-volume setting, the terms are used interchangeably.
4. Here, \(\alpha_j\) represents the attention weight for the \(j\)-th key-value pair.
5. Hebbian learning is a principle that states that the synaptic strength between two neurons increases if they are activated together. In the context of the Hopfield network, the Hebbian rule encodes the stored patterns into the network's connectivity, allowing for associative retrieval based on partial or noisy cues. Famously, "neurons that fire together, wire together."
6. Bricken is also a leading researcher in Anthropic's interpretability effort. His work and research taste were instrumental in my decision to pursue a research career in mechanistic interpretability. Thanks, Trenton!
7. It is understandable to wonder if we are forcing a connection with this relabeling. Note that we are preserving the functional roles and relationships of every variable. NW regression takes no interest in the names of variables; it only requires a similarity space for conditioning with \(x_i\) and \(x_0\) and a target object to be averaged with \(y_i\). Attention follows the same structure, allowing for a relabeling.
8. \(QK\)-norm, where query and key vectors are \(L^2\)-normalized before computing dot products, was developed independently by Henry et al. (2020). Bricken and Pehlevan's SDM analysis predicted it as a useful inductive bias: SDM requires cosine similarity for its circle intersection to approximate the exponential. That the transformer community arrived at the same normalization from engineering considerations is another instance of epistemic convergence.
References
- Bahdanau, D., Cho, K. and Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. and Polosukhin, I. (2017). Attention is All You Need.
- Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities.
- Amit, D. J., Gutfreund, H. and Sompolinsky, H. (1985). Spin-glass models of neural networks.
- Krotov, D. and Hopfield, J. J. (2016). Dense Associative Memory for Pattern Recognition.
- Demircigil, M., Heusel, J., Löwe, M., Upgang, S. and Vermet, F. (2017). On a Model of Associative Memory with Huge Storage Capacity.
- Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M. and others (2021). Hopfield Networks is All You Need.
- Kanerva, P. (1988). Sparse Distributed Memory.
- Bricken, T. and Pehlevan, C. (2021). Attention Approximates Sparse Distributed Memory.
- Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T. and others (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.
- Keeler, J. D. (1988). Comparison between Kanerva's SDM and Hopfield-type neural networks.
- Nadaraya, E. A. (1964). On Estimating Regression.
- Watson, G. S. (1964). Smooth Regression Analysis.
- Pai, D. (2025). Regression is All You Need.
- Henry, A., Dachapally, P. R., Pawar, S. and Chen, Y. (2020). Query-Key Normalization for Transformers.