Optimization

Root finding and extremization in higher dimensions

Our discussion so far has concerned the zeros of a function \(f(x)\) that maps \(\mathbb{R} \to \mathbb{R}\). In other words, the domain of \(f\) is one-dimensional. In this special case, we can be sure to find a root (though perhaps not efficiently) so long as we can bracket it in some interval \([a,b]\) such that \(f(a)\) and \(f(b)\) have opposite sign. This is true because of the intermediate value theorem: a continuous function \(f\) that connects a point \((a,f(a))\) lying on one side of the x-axis in the x-y plane and a point \((b,f(b))\) on the other side must pass through a point \((x_0,0)\) with \(a < x_0 < b\).

In higher dimensions—i.e., when we want to solve \(f(\mathbf{x}) = 0\) for \(\mathbf{x} \in \mathbb{R}^N\)—there is no intermediate value theorem and thus no way to bracket a root. As a consequence, there is nothing analogous to the bisection method, and there is no strict guarantee of convergence. Newton’s method, on the other hand, does generalize to higher dimensions, but it is not unambiguously defined if what we’re doing is root finding. (As we’ll see shortly, the scheme is uniquely defined for extremization.) The issue here is that, for a function \(f:\mathbb{R}^N \to \mathbb{R}\), the zeros do not form a point but an (\(N-1\))-dimensional surface (a curve when \(N=2\)).

It is straightforward to perform a Taylor series expansion around a guess \(\mathbf{x}^*\), now an \(N\)-component vector, as follows:

\[0 = f(\mathbf{x}_0) = f(\mathbf{x}^* + \Delta \mathbf{x}) \approx f(\mathbf{x}^*) + \sum_{\alpha=1}^N \frac{\partial f}{\partial x_\alpha} \Biggr|_{\mathbf{x}^*} \Delta x_\alpha = f(\mathbf{x}^*) + (\nabla f) \cdot \Delta \mathbf{x}\]

The summation over \(\alpha\) represents an inner product of the local gradient and the discrepancy vector \(\Delta \mathbf{x} = \mathbf{x}_0-\mathbf{x}^*\). As usual, we neglect the terms at second order and beyond. Unfortunately, this gives us only one equation for the \(N\) unknown components of \(\Delta \mathbf{x}\). This means we have a great deal of freedom in how to proceed. We can view \(\Delta \mathbf{x} \cdot \nabla f =|\Delta \mathbf{x}||\nabla f|\cos\theta = -f(\mathbf{x}^*)\) as depending on the angle \(\theta\) that \(\Delta \mathbf{x}\) makes with the local gradient. If we direct \(\Delta \mathbf{x}\) along the line of the gradient (i.e., we choose \(\theta=0\) or \(\pi\), with the sign fixed by \(f(\mathbf{x}^*)\)), then

\[\Delta \mathbf{x} = - \hat{\mathbf{g}}\,\frac{f(\mathbf{x}^*)}{|\nabla f(\mathbf{x}^*)|}, \qquad \hat{\mathbf{g}} \equiv \frac{\nabla f(\mathbf{x}^*)}{|\nabla f(\mathbf{x}^*)|}\]

This leads to an iteration rule

\[\mathbf{x}^{(n+1)} := \mathbf{x}^{(n)} - \Biggl( \hat{\mathbf{g}} \frac{f}{|\nabla f|} \Biggr)_{\mathbf{x}^{(n)}}\]
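As a minimal sketch of this rule in Python (the forward-difference gradient and the test function \(f(x,y) = x^2 + y^2 - 1\), whose zeros form the unit circle, are our own choices):

```python
import math

def grad(f, x, h=1e-6):
    """Forward-difference estimate of the gradient of f at the point x."""
    fx = f(x)
    return [(f(x[:a] + [x[a] + h] + x[a+1:]) - fx) / h for a in range(len(x))]

def newton_gradient_step(f, x):
    """One update x <- x - g_hat * f / |grad f|, stepping along the gradient."""
    g = grad(f, x)
    norm_sq = sum(ga * ga for ga in g)
    fx = f(x)
    return [xa - ga * fx / norm_sq for xa, ga in zip(x, g)]

# Test function (our own choice): f(x, y) = x^2 + y^2 - 1, zeros on the unit circle.
f = lambda x: x[0]**2 + x[1]**2 - 1.0

x = [2.0, 1.0]
for _ in range(20):
    x = newton_gradient_step(f, x)
print(f(x))  # close to zero: the iterate has landed somewhere on the circle
```

Note that the iteration stops at whichever point of the zero surface the flow happens to reach; a different initial guess lands on a different point of the circle.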

Alternatively, we could solve

\[(\nabla f)\cdot \Delta \mathbf{x} = \sum_{\alpha=1}^N (\nabla f(\mathbf{x}^*))_\alpha \Delta x_\alpha = -f(\mathbf{x}^*)\]

by setting \((\nabla f)_\alpha \Delta x_\alpha = -f(\mathbf{x}^*)/N\) for each of \(\alpha = 1, 2, \ldots, N\). In that case, the iteration rule updates each component according to the slope along the corresponding direction:

\[x^{(n+1)}_\alpha := x^{(n)}_\alpha - \frac{f(\mathbf{x}^{(n)})}{N\,(\nabla f(\mathbf{x}^{(n)}))_\alpha}\]
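A minimal sketch of this component-wise rule, again with an illustrative function of our own choosing (note that the update breaks down whenever some component \((\nabla f)_\alpha\) is close to zero, since it appears in a denominator):

```python
def component_step(f, grad_f, x):
    """Update each component via x_a <- x_a - f(x) / (N * (grad f)_a)."""
    N = len(x)
    g = grad_f(x)
    fx = f(x)
    return [xa - fx / (N * ga) for xa, ga in zip(x, g)]

# Test function (our own choice): f(x, y) = x + y - 2, with analytic gradient (1, 1).
f = lambda x: x[0] + x[1] - 2.0
grad_f = lambda x: [1.0, 1.0]

x = component_step(f, grad_f, [5.0, -1.0])
print(f(x))  # 0.0 -- a single step solves a linear f exactly
```

Each of the \(N\) terms in the inner product cancels \(f/N\), so the sum cancels \(f\) in full, exactly as the derivation requires.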

In \(N\) dimensions, extremization is an \(\mathbb{R}^N \to \mathbb{R}^N\) problem and no longer exactly equivalent to root finding. Specifically, if we’re interested in the maximum and minimum values of a function \(F\), then we need to solve the \(N\) conditions \(\nabla F(\mathbf{x}) = \mathbf{0}\) simultaneously. A linear expansion of each component gives

\[0 = \frac{\partial F(\mathbf{x}_e)}{\partial x_\alpha} \approx \frac{\partial F(\mathbf{x}^*)}{\partial x_\alpha} + \sum_{\beta=1}^N \frac{\partial^2 F(\mathbf{x}^*)}{\partial x_\alpha \partial x_\beta}\Delta x_\beta\]

Hence, the update rule can be written as

\[\mathbf{x}^{(n+1)} := \mathbf{x}^{(n)} - H^{-1} \cdot \nabla F(\mathbf{x}^{(n)})\]

where \(H\) is the Hessian matrix containing the second partial derivatives:

\[H_{ij} = \frac{\partial^2 F}{\partial x_i \partial x_j}\Biggr|_{\mathbf{x} = \mathbf{x}^{(n)}}\]
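A sketch of the Hessian update in two dimensions, using an explicit \(2\times 2\) inverse and a quadratic test function of our own choosing (for a quadratic \(F\), the linear expansion is exact, so a single Newton step lands on the minimum):

```python
def newton_min_step(grad_F, hess_F, x):
    """One update x <- x - H^{-1} grad F, with an explicit 2x2 inverse."""
    gx, gy = grad_F(x)
    (a, b), (c, d) = hess_F(x)
    det = a * d - b * c          # assumes the Hessian is invertible at x
    dx = ( d * gx - b * gy) / det
    dy = (-c * gx + a * gy) / det
    return [x[0] - dx, x[1] - dy]

# Test function (our own choice): F(x, y) = (x - 1)^2 + 10 (y + 2)^2, minimum at (1, -2).
grad_F = lambda x: [2 * (x[0] - 1), 20 * (x[1] + 2)]
hess_F = lambda x: [[2, 0], [0, 20]]

x = newton_min_step(grad_F, hess_F, [7.0, 3.0])
print(x)  # [1.0, -2.0]: a quadratic F is minimized in a single Newton step
```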

Dynamical interpretation

Once again, let’s imagine that each successive iteration of Newton’s method corresponds to a discrete step along a fictitious time axis. In the one-dimensional case, we argued that (up to some proportionality constant) the structure of the recursion equation is reminiscent of a first-order ordinary differential equation (ODE),

\[\dot{x} \sim -\operatorname{sgn}\bigl(F''(x)\bigr)\, F'(x)\]

where the direction of flow is governed by the convexity of \(F\).

In the multi-dimensional case, a similar interpretation can be made, except that multiplication by the inverse of the Hessian matrix both rescales and rotates the gradient \(\nabla F\): each eigendirection of \(H\) is scaled by the reciprocal of its eigenvalue, so directions of low curvature are weighted more heavily.

It is often the case that the Hessian is too complicated to compute or too numerically unstable to invert. In that case, we may want to throw away the convexity information and resort to steepest descent (or ascent):

\[\dot{\mathbf{x}} \sim \begin{cases} -\nabla F(\mathbf{x}) & \text{(descent, toward minima)}\\ +\nabla F(\mathbf{x}) & \text{(ascent, toward maxima)} \end{cases}\]

(Remember, this is not familiar second-order mechanics! There is no momentum associated with \(\mathbf{x}\).) In this point of view, our search is a walker moving through the phase space, coming to rest only at the stationary points of \(F\). If we are interested in either the minima or the maxima only, we can restrict the walks to downhill or uphill moves, respectively.
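Discretizing the descent flow with a simple Euler step gives ordinary gradient descent. In the sketch below, the step size \(\eta\) and the double-well test function are our own choices:

```python
def descend(grad_F, x, eta=0.05, steps=500):
    """Euler discretization of dx/dt = -grad F: repeated downhill moves."""
    for _ in range(steps):
        g = grad_F(x)
        x = [xa - eta * ga for xa, ga in zip(x, g)]
    return x

# Test function (our own choice): F(x, y) = (x^2 - 1)^2 + y^2,
# a double well with minima at (+1, 0) and (-1, 0).
grad_F = lambda x: [4 * x[0] * (x[0]**2 - 1), 2 * x[1]]

left = descend(grad_F, [-0.5, 0.8])   # flows into the basin of (-1, 0)
right = descend(grad_F, [0.5, 0.8])   # flows into the basin of (+1, 0)
```

The two starting points settle into different minima, one per basin of attraction: the walker comes to rest at whichever stationary point its downhill path reaches first.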

Warning: Both steepest descent and Newton’s method proper will return the solution in whose basin of attraction the initial guess lies. There is no reason to believe that the solution is a global minimum in the sense that \(F(\mathbf{x}_\text{min}) < F(\mathbf{x})\) for all \(\mathbf{x} \neq \mathbf{x}_\text{min}\).

An example of minimization in a multi-dimensional space

Consider a collection of \(N\) rotors. Each rotor is a 2-component vector \(\mathbf{n}_i = (n_{i,1}, n_{i,2})\) restricted to unit length, \(|\mathbf{n}_i|^2 = 1\). We can describe the overall configuration by a set of variables \(\{ \theta_i \}\) that measure the angle each rotor makes with the x-axis. If each individual rotor behaves as a magnetic dipole, then we would expect it to align preferentially with an applied field; this corresponds to a term \(-\mathbf{B}\cdot\mathbf{n}_i\) in the energy functional. If each rotor responds to the magnetic field established by all the other rotors, the energy of the system will have the form

\[E = \sum_{ij} J_{ij} \mathbf{n}_i \cdot \mathbf{n}_j = \sum_{ij} J_{ij} \cos(\theta_i - \theta_j)\]

where the coefficient \(J_{ij} = J_{ji}\) encodes the interaction strength between the rotors \(i\) and \(j\).

Now suppose that the rotors are arranged in a regular lattice and that they interact most strongly with the rotors closest to them. With the sign convention above, a plausible choice is \(J_{ij} = -1\) for all sites \(i\) and \(j\) that are neighbours, with \(J_{ij}\) vanishing otherwise. In that case, there is a ferromagnetic ground state (degenerate under global rotation), in which all the spins are aligned. That is, any configuration \(\theta_i = \text{constant}\) yields the minimum energy \(E = -N_\text{sites}\times N_\text{neighbours}\), where \(N_\text{neighbours}\) is the number of neighbours of each site and the unrestricted double sum counts each bond twice. For arbitrary \(J_{ij}\), however, the energy landscape may be extremely complicated and the search for the lowest-energy configuration nontrivial.

Steepest descent in this context goes as follows. From the current configuration, each rotor angle evolves in the direction of locally decreasing energy, \(-\nabla E\). Hence,

\[\dot{\theta}_k \sim -\frac{\partial E}{\partial \theta_k} = -\sum_{ij} J_{ij} \Bigl[ -\sin(\theta_i - \theta_j)\,\delta_{i,k} + \sin(\theta_i - \theta_j)\,\delta_{j,k} \Bigr] = -2\sum_i J_{ik} \sin(\theta_i - \theta_k)\]

The right-hand side of this equation is the local torque felt by the spin at site \(k\).
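A sketch of this relaxation for a small system. The ring geometry, the ferromagnetic coupling \(J_{ij} = -1\) between neighbours (which favours alignment in this sign convention), the starting perturbation, and the step size are all our own choices:

```python
import math

def energy(theta, J):
    """E = sum_{ij} J_ij cos(theta_i - theta_j) over all ordered pairs."""
    n = len(theta)
    return sum(J[i][j] * math.cos(theta[i] - theta[j])
               for i in range(n) for j in range(n))

def torque(theta, J, k):
    """The local torque -dE/dtheta_k = -2 sum_i J_ik sin(theta_i - theta_k)."""
    return -2 * sum(J[i][k] * math.sin(theta[i] - theta[k])
                    for i in range(len(theta)))

# Setup (our own choice): a ring of 8 rotors, J_ij = -1 between adjacent sites.
n = 8
J = [[0.0] * n for _ in range(n)]
for i in range(n):
    J[i][(i + 1) % n] = J[(i + 1) % n][i] = -1.0

theta = [0.4 * math.sin(k) for k in range(n)]  # small deterministic perturbation
e_start = energy(theta, J)
for _ in range(400):                           # Euler steps along the torque
    theta = [t + 0.05 * torque(theta, J, k) for k, t in enumerate(theta)]
e_end = energy(theta, J)
print(e_end)  # approaches the aligned-state minimum
```

Because every step moves each angle along its local torque, the energy decreases monotonically until the rotors settle into an aligned configuration.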