Equilibrium Propagation

Consider the prototypical machine learning setting where one wants to minimize an objective function \[J(\theta) = \mathbb{E}_{(x,y)} \left[ C(s(\theta,x),y) \right]\] with respect to some parameter \(\theta\). The variables \(x\) and \(y\) represent an input and its associated label, \(C\) is a cost function, and \(s(\theta,x)\) is a prediction computed from \(\theta\) and \(x\), to be compared against \(y\). The expectation \( \mathbb{E}_{(x,y)} \) is taken over the distribution of input-output pairs \( (x,y) \) for the task that we want to solve. In deep learning, the objective function is usually optimized by stochastic gradient descent (SGD). Each step of SGD consists of choosing an input-output pair \( (x,y) \) at random from a training set of examples and updating the parameters according to \( \Delta \theta = -\eta \, \frac{\partial {\mathcal L}}{\partial \theta}(\theta,x,y) \), where \( \eta>0 \) is the learning rate and \[ {\mathcal L}(\theta,x,y) = C(s(\theta,x),y) \] is the per-example loss function. For clarity, we distinguish the loss (\({\mathcal L}\)) from the objective (\(J\)) and the cost (\(C\)).
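As a concrete illustration, here is a minimal sketch of one SGD step for a toy model where the prediction is linear, \( s(\theta,x) = \theta x \), and the cost is the squared error (both choices are assumptions made for this example only, not part of the EP framework itself):

```python
import numpy as np

def per_example_loss(theta, x, y):
    """L(theta, x, y) = C(s(theta, x), y) for a toy linear prediction and squared-error cost."""
    s = theta @ x                      # prediction s(theta, x)
    return 0.5 * np.sum((s - y) ** 2)  # cost C(s, y)

def sgd_step(theta, x, y, eta=0.1):
    """One SGD step: theta <- theta - eta * dL/dtheta, evaluated on a single example (x, y)."""
    s = theta @ x
    grad = np.outer(s - y, x)  # dL/dtheta for this linear / squared-error model
    return theta - eta * grad
```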

In the equilibrium propagation (EP) framework, \( s(\theta,x) \) is the equilibrium state of a physical system, obtained by a process of physical equilibration: \[s(\theta,x) = \underset{s}{\arg\min} \, E(\theta,x,s),\] where \( E \) is the energy function, defined for any tuple \((\theta,x,s)\). The input \(x\) is typically imposed as a boundary condition on the system. The central idea of EP is to perturb (or `nudge') the system by augmenting its energy function \( E \) with an energy term \( \beta C \), where \( \beta \in \mathbb{R} \) is the `nudging factor', yielding the total energy \[ F(\theta,\beta,s) = E(\theta,x,s) + \beta \, C(s,y). \] The system then settles to a new equilibrium state that depends on the value of the nudge \( \beta \), \[ s_\star^\beta = \underset{s}{\arg\min} \, F(\theta,\beta,s). \] In particular, for \( \beta=0 \) we have \( s_\star^0 = s(\theta,x) \). The learning rule for the parameters follows a contrastive scheme: \[ \Delta \theta = - \frac{\eta}{\beta} \left( \frac{\partial E}{\partial \theta} \left(\theta,x,s_\star^\beta\right) - \frac{\partial E}{\partial \theta} \left(\theta,x,s_\star^0\right) \right). \] The main theoretical result is that the EP learning rule performs one step of SGD with learning rate \( \eta \), up to an error of order \( O(\beta) \): \[ \Delta \theta = - \eta \frac{\partial {\mathcal L}}{\partial \theta}(\theta, x, y) + O(\beta). \] This result is a consequence of the equilibrium propagation formula.
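The full procedure can be sketched end to end on a toy quadratic model (an assumption made for illustration only: the energy \( E(W,x,s) = \tfrac{1}{2}\|s\|^2 - s^\top W x \) with parameters \( \theta = W \), the squared-error cost \( C(s,y) = \tfrac{1}{2}\|s-y\|^2 \), and equilibration by plain gradient descent on the state \( s \)):

```python
import numpy as np

def relax(W, x, y, beta, s, n_steps=500, lr=0.1):
    """Settle the state s to a minimum of F = E + beta * C by gradient descent on s."""
    for _ in range(n_steps):
        dF_ds = (s - W @ x) + beta * (s - y)  # dE/ds + beta * dC/ds for this toy energy
        s = s - lr * dF_ds
    return s

def ep_step(W, x, y, beta=0.01, eta=0.1):
    """One EP parameter update: contrast dE/dW at the nudged and free equilibria."""
    s_free = relax(W, x, y, beta=0.0, s=np.zeros(W.shape[0]))  # free equilibrium s_star^0
    s_nudged = relax(W, x, y, beta=beta, s=s_free.copy())      # nudged equilibrium s_star^beta
    dE_dW_free = -np.outer(s_free, x)                          # dE/dW at the free equilibrium
    dE_dW_nudged = -np.outer(s_nudged, x)                      # dE/dW at the nudged equilibrium
    return W - (eta / beta) * (dE_dW_nudged - dE_dW_free)
```

For this toy energy, the free equilibrium is \( s_\star^0 = Wx \), the nudged equilibrium is \( s_\star^\beta = (Wx + \beta y)/(1+\beta) \), and the EP update works out to \( \Delta W = \frac{\eta}{1+\beta}\,(y - Wx)\,x^\top \), which converges to the exact SGD update \( -\eta \, \frac{\partial {\mathcal L}}{\partial W} = \eta\,(y - Wx)\,x^\top \) as \( \beta \to 0 \), in line with the \( O(\beta) \) statement above.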