Equilibrium Propagation

Consider the prototypical machine learning setting where one wants to minimize an objective function \[J(\theta) = \mathbb{E}_{(x,y)} \left[ C(s(\theta,x),y) \right]\] with respect to some parameter \(\theta\). The variables \(x\) and \(y\) represent an input and its associated label, \(C\) is a cost function, and \(s(\theta,x)\) is a prediction of \(y\), computed from \(\theta\) and \(x\). The expectation \( \mathbb{E}_{(x,y)} \) is taken over the distribution of input-output pairs \( (x,y) \) for the task that we want to solve. In deep learning, the objective function is usually optimized by stochastic gradient descent (SGD). Each step of SGD consists of choosing an input-output pair \( (x,y) \) at random from a training set of examples, and updating the parameters according to \( \Delta \theta = -\eta \, \frac{\partial {\mathcal L}}{\partial \theta}(\theta,x,y) \), where \( \eta>0 \) is the learning rate and \[ {\mathcal L}(\theta,x,y) = C(s(\theta,x),y) \] is the per-example loss function. For clarity, we distinguish the loss (\({\mathcal L}\)) from the objective (\(J\)) and the cost (\(C\)).
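For reference, here is a minimal sketch of a single SGD step on the per-example loss, assuming a hypothetical linear prediction \( s(\theta,x) = \theta x \) and a squared-error cost; both choices are illustrative, not part of the framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model (illustrative assumption): linear prediction
# s(theta, x) = theta @ x and squared-error cost C(s, y) = 0.5 * ||s - y||^2.
theta = rng.normal(size=(2, 3))                 # parameters
x, y = rng.normal(size=3), rng.normal(size=2)   # one input-output pair
eta = 0.1                                       # learning rate

s = theta @ x                                   # prediction s(theta, x)
loss = 0.5 * np.sum((s - y) ** 2)               # per-example loss L(theta, x, y)
grad_theta = np.outer(s - y, x)                 # dL/dtheta for this toy model
theta -= eta * grad_theta                       # one SGD step
```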

In the equilibrium propagation (EP) framework, \( s(\theta,x) \) is the equilibrium state of a physical system, obtained by a process of physical equilibration: \[s(\theta,x) = \underset{s}{\arg\min} \, E(\theta,x,s),\] where \( E \) is the energy function, and \( s \) is the state of the system. The input \(x\) is typically supplied as a boundary condition to the system. The central idea of EP is to view the cost function \( C(s,y) \) as the energy of a new interaction between the system's state (\(s\)) and the desired output (\(y\)); this new energy term is modulated by a parameter \( \beta \in \mathbb{R} \) (called the `nudging parameter') and added to the energy function of the system to form the `total energy function': \[ F(\theta,\beta,s) = E(\theta,x,s) + \beta C(s,y) \] (the dependence of \(F\) on \(x\) and \(y\) is left implicit in the notation). In its original formulation, EP follows a contrastive scheme, proceeding as follows:
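To make the construction concrete, the sketch below instantiates \(E\), \(C\) and \(F\) for a hypothetical quadratic (Hopfield-like) energy \( E(\theta,x,s) = \tfrac{1}{2}\|s\|^2 - s^\top \theta x \), with equilibration performed by gradient descent on the state \(s\); the particular energy function and relaxation procedure are assumptions made for illustration only.

```python
import numpy as np

def energy(theta, x, s):
    """Hypothetical quadratic energy E(theta, x, s) = 0.5*||s||^2 - s^T (theta x)."""
    return 0.5 * s @ s - s @ (theta @ x)

def cost(s, y):
    """Squared-error cost C(s, y), viewed as the energy of the extra interaction."""
    return 0.5 * np.sum((s - y) ** 2)

def total_energy(theta, x, s, y, beta):
    """Total energy F(theta, beta, s) = E(theta, x, s) + beta * C(s, y)."""
    return energy(theta, x, s) + beta * cost(s, y)

def equilibrate(theta, x, y, beta, s0, lr=0.1, steps=1000):
    """Relax the state to a minimum of F by gradient descent on s (the stand-in for physical equilibration here)."""
    s = s0.copy()
    for _ in range(steps):
        grad_s = (s - theta @ x) + beta * (s - y)   # dF/ds for this toy energy
        s -= lr * grad_s
    return s
```

With \( \beta=0 \), `equilibrate` recovers the free equilibrium \( s(\theta,x) = \theta x \) for this particular energy.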

  1. Set \( \beta=0 \) and let the system settle to an energy minimum, called the `free state': \( s_\star^0 = \underset{s}{\arg\min} \, F(\theta,0,s) = s(\theta,x). \)
  2. Set \( \beta>0 \) and let the system settle to a new energy minimum, called the `nudged state': \( s_\star^\beta = \underset{s}{\arg\min} \, F(\theta,\beta,s). \)
  3. Choose a learning rate \( \eta>0 \) and update the parameters according to the following contrastive rule: \[ \Delta \theta = - \frac{\eta}{\beta} \left( \frac{\partial E}{\partial \theta} \left(\theta,x,s_\star^\beta\right) - \frac{\partial E}{\partial \theta} \left(\theta,x,s_\star^0\right) \right). \]
The central insight of EP is that the above learning rule performs one step of SGD on the loss \( {\mathcal L} \) with learning rate \( \eta \), up to \( O(\beta) \), \[ \Delta \theta = - \eta \frac{\partial {\mathcal L}}{\partial \theta}(\theta, x, y) + O(\beta). \] This result is a consequence of the equilibrium propagation formula.
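The contrastive scheme and the \( O(\beta) \) claim can be checked numerically on the same hypothetical quadratic energy: relax to the free state, relax again under a small nudge, apply the contrastive rule, and compare the result with the exact SGD update \( -\eta \, \frac{\partial {\mathcal L}}{\partial \theta} \), which is available in closed form for this toy model. This is a sketch under those illustrative assumptions, not a general implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 3))                  # parameters
x, y = rng.normal(size=3), rng.normal(size=2)    # one input-output pair
eta, beta = 0.1, 0.01                            # learning rate, nudging parameter

def equilibrate(b, lr=0.1, steps=2000):
    """Minimize F(theta, b, s) over s by gradient descent, for the toy energy
    E = 0.5*||s||^2 - s^T (theta x) and cost C = 0.5*||s - y||^2."""
    s = np.zeros(2)
    for _ in range(steps):
        s -= lr * ((s - theta @ x) + b * (s - y))  # dF/ds
    return s

def dE_dtheta(s):
    """dE/dtheta = -s x^T for the toy energy."""
    return -np.outer(s, x)

s_free  = equilibrate(0.0)     # step 1: free state (beta = 0)
s_nudge = equilibrate(beta)    # step 2: nudged state (beta > 0)

# Step 3: contrastive parameter update.
delta_theta_ep = -(eta / beta) * (dE_dtheta(s_nudge) - dE_dtheta(s_free))

# Exact SGD update for this toy model: s(theta, x) = theta x, so
# dL/dtheta = (theta x - y) x^T.
delta_theta_sgd = -eta * np.outer(theta @ x - y, x)

print(np.max(np.abs(delta_theta_ep - delta_theta_sgd)))  # discrepancy is O(beta)
```

Decreasing `beta` in this sketch shrinks the printed discrepancy roughly in proportion, which is one way to observe the \( O(\beta) \) behaviour directly.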