Optimal Control

In previous chapters we have seen the use of myopic controllers like PID or operational space control, as well as some predictive controllers like trajectory generation. Particularly for complex and nonlinear systems like robots, predictive control allows the controller to make better decisions at the current time to account for future possibilities. However, our previous predictive methods were largely restricted to one class of systems. Optimal control addresses these shortcomings in a highly general framework.

Optimal control asks us to compute a control function (either open loop or closed loop) that optimizes some performance metric regarding the control and the predicted state. For example, a driver of a car would like to reach a desired location while achieving several other goals: e.g., avoiding obstacles, not driving erratically, and maintaining a comfortable level of accelerations for human passengers. A driving style can be composed of some balanced combination of these goals. Optimal control allows a control designer to specify the dynamic model and the desired outcomes, and the algorithm will compute an optimized control. This relieves some burden by letting the designer reason at the level of what the robot should do, rather than designing how it should do it.

In this chapter, we will discuss how to specify optimal control problems and how to implement and use optimal control techniques. We begin by discussing the open loop problem, and then we will describe how an open loop optimizer can be adapted to closed loop control via the use of model predictive control.

Optimal control problem

An optimal control problem is defined by the dynamics function $f$ and a cost functional over the entire trajectory $x$ and $u$:\begin{equation}J(x,u) = \int_0^\infty L(x(t),u(t),t) dt.\label{eq:CostFunctional}\end{equation}The term functional indicates that this is a function mapping a function to a real number. The term $L(x,u,t)$ is known as the instantaneous cost (or running cost), which is accumulated over time, and should be chosen to be nonnegative and to penalize certain undesirable states, velocities, or controls. Its units are cost units per second.

The goal of optimal control is to find state and control trajectories $x$ and $u$ such that $J(x,u)$ is minimized:\begin{equation}\begin{gathered}x^\star, u^\star = \arg \min_{x,u} J(x,u) \text{ such that} \\\dot{x}(t) = f(x(t),u(t)) \text{ for all }t \quad \text{(Dynamics constraint)} \\x(0) = x_0 \quad \text{(Initial state)}\end{gathered}\label{eq:OptimalControl}\end{equation}

TODO: illustration (Figure 1)

(For somewhat technical reasons, there are problems for which no optimal trajectory exists, but rather only a sequence of trajectories approaching an optimal cost. Hence, if we prefer to be pedantic, it is often necessary to prove existence of an optimal solution first, or to relax the problem to determine only an approximate optimum.)

Cost functionals

A variety of behaviors can be specified in this framework by modifying the instantaneous cost. For example:

  • Trajectory tracking for a trajectory $x_D(t)$ can be implemented by penalizing squared error $L(x,u,t) = \|x - x_D(t)\|^2$.
  • Minimizing effort can be defined in terms of a control penalty $\|u\|^2$.
  • Minimum time to hit a target $x_{tgt}$ could be implemented as an indicator function $I[x\neq x_{tgt}]$ where $I[z]$ is 1 if $z$ is true, and 0 otherwise.
  • Obstacle avoidance and other feasibility constraints can be implemented as indicator functions as well, $\infty \cdot I[x \notin \mathcal{F}]$ where $\mathcal{F}$ is the free space.
  • Smoothed obstacle avoidance can be implemented by a repulsive barrier that decreases to 0 when the distance to the closest obstacle $d$ exceeds some minimum buffer distance $d_{min}$ and increases to infinity as the distance shrinks to 0. One common form of this barrier is $L(x,u,t) = 1/d^2 - 1/d_{min}^2$ when $d < d_{min}$ and $L(x,u,t) = 0$ otherwise.

TODO: illustration (Figure 2)

It is common to mix and match different types of cost functionals using a weighted cost functional $$J(x,u) = \sum_{i=1}^N w_i J_i(x,u)$$where each $J_i(x,u)$ is some primitive cost functional and $w_i$ scales its contribution to the final cost. By tuning these weights, a designer can encourage the optimized trajectories to emphasize some aspects of the trajectory over others.
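
As a small illustration of this idea, the sketch below assembles a weighted running cost in Python from a tracking term and an effort term; the reference trajectory x_D and the weight values are hypothetical placeholders, not quantities defined in this chapter.

    import numpy as np

    # A minimal sketch of combining primitive cost terms with weights.
    # x_D (a callable reference trajectory) and the weights are illustrative.

    def tracking_cost(x, u, t, x_D):
        """Squared tracking error ||x - x_D(t)||^2."""
        e = x - x_D(t)
        return float(np.dot(e, e))

    def effort_cost(x, u, t):
        """Control effort penalty ||u||^2."""
        return float(np.dot(u, u))

    def weighted_running_cost(x, u, t, x_D, w_track=1.0, w_effort=0.1):
        """L(x,u,t) = w_track * tracking + w_effort * effort."""
        return w_track * tracking_cost(x, u, t, x_D) + w_effort * effort_cost(x, u, t)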

Finite horizon optimal control and discounting

As stated, this problem is somewhat ill-behaved because it involves an infinite integral, which could achieve infinite cost even for relatively well-behaved trajectories. For example, if the cost were simply squared error, a trajectory that achieves 0.01 steady-state error would be rated as having the same cost as a trajectory that had an error of 1: namely, infinitely bad.

There are two general ways to make the cost functional better behaved. The first method is to truncate the problem at some maximum time $T$, leading to a finite-horizon optimal control cost functional\begin{equation}J(x,u) = \int_0^T L(x(t),u(t),t) dt + \Phi(x(T))\label{eq:FiniteHorizonOptimalControl}\end{equation}where $\Phi(x)$ is a nonnegative terminal cost that penalizes the state attained at the terminal time.

The second method is to modify the instantaneous cost functional by including a discount factor that decays to 0 relatively quickly as $t \rightarrow \infty$. This is usually expressed as the product of a time-independent term and a time-dependent discount factor term:$$L(x,u,t) = L(x,u,0) \gamma(t)$$ with $\gamma(t)$ a decaying function, such as $O(1/t^\alpha)$ or $O(\beta^t)$. It is important to choose a discount factor that drops relatively rapidly toward 0 to ensure that the cost is integrable. Discount factors of the form $O(1/t^\alpha)$ must have $\alpha > 1$ to ensure that the cost functional is finite for all bounded trajectories, and those of the form $O(\beta^t)$ must have $\beta < 1$.

State and Control Constraints

Usually, optimal control solvers require the cost functional to be smooth, and so non-differentiable terms like minimum time and obstacle avoidance must be reformulated as hard constraints, external to the cost functional. As a result, the reformulated problem becomes essentially an infinite-dimensional constrained optimization problem. Solvers may differ in whether they can handle constraints on state or constraints on control.

Analytical vs Numerical Solvers

Minimization over a space of functions is considerably more difficult than typical optimization problems: the space of functions is uncountably infinite-dimensional! There are two general ways to tackle these problems: analytical or numerical. Analytical techniques use the mathematical conditions of optimality so that the optimal control can be determined directly through calculus and algebraic manipulation. Successfully applying analysis typically requires a relatively simple dynamics and cost. Numerical techniques approximate the problem by discretizing the state, time, and/or control space and attempt to cast the problem as a finite-dimensional optimization. These are more general-purpose and can be applied to complex problems, but are more computationally expensive and usually require some parameter tuning to obtain high-quality solutions.

Continuous-Time vs Discrete-Time Formulations

Above we have stated the continuous-time optimal control problem, which is usually the most mathematically sound way to state an optimal control problem. However, for computational purposes, we will often resort to a discrete-time optimal control problem. Such problems seek an optimum represented as a finite number of states $\mathbf{x}=(x_1,\ldots,x_N)$ and controls $\mathbf{u}=(u_0,\ldots,u_{N-1})$:

\begin{equation}\begin{gathered}\mathbf{x}^\star, \mathbf{u}^\star = \arg \min_{\mathbf{x},\mathbf{u}} J(\mathbf{x},\mathbf{u}) \text{ such that} \\x_{i+1} = g(x_i,u_i) \text{ for }i=0,\ldots,N-1 \end{gathered}\label{eq:DiscreteTimeOptimalControl}\end{equation}

where the discrete-time cost functional $J(\mathbf{x},\mathbf{u})$ is\begin{equation}J(\mathbf{x},\mathbf{u}) = \sum_{i=0}^{N-1} L(x_i,u_i,i) + \Phi(x_N).\end{equation}

Any continuous-time formulation can be approximated by a discrete-time formulation by replacing the derivative term with a simulation function. If we define the time step $\Delta t = T/N$ with $N$ an integer, then we have the computational grid $0, \Delta t, 2\Delta t, \ldots, N\Delta t$. Let these grid points be denoted $t_0,t_1,\ldots,t_N$ respectively, with $t_0=0$ and $t_N=T$.

Then, the entire control trajectory is specified by a control sequence $u_0,\ldots,u_{N-1}$, with each $u_i$ active on the time range $[t_{i},t_{i+1})$. In other words, $u(t) = u_i$ with $i = \lfloor t/\Delta t \rfloor$.

Suppose now we define a simulation function, which is a method for integrating the state trajectory over time. Specifically, given a state $x_i$, a constant control $u_i$, and the fixed duration $\Delta t$, the simulation function $g$ computes an approximation $$x_{i+1} \approx g(x_i,u_i).$$If the time step is small enough, the Euler approximation is a reasonable simulation function:$$x_{i+1} \approx x_i + \Delta t f(x_i,u_i).$$If the accuracy of this method is too low, then Euler integration could be performed at a finer time sub-step up to time $\Delta t$, and/or a more accurate integration technique could be used.

Another modification is that the running cost $L$ used in the discrete-time formulation, here denoted $L_d$, should approximate the integral of the continuous-time running cost $L$, here denoted $L_c$:$$L_d(x_i,u_i,i) \approx \int_{i\Delta t}^{(i+1)\Delta t} L_c(x(t),u(t),t) dt$$and is often approximated using the Euler approximation:$$L_d(x_i,u_i,i) \approx \Delta t L_c(x(t_i),u(t_i),t_i).$$
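
As a concrete sketch, the simulation function and the discretized running cost might be wrapped as follows, assuming the continuous dynamics f(x,u) and instantaneous cost L_c(x,u,t) are available as Python callables; the helper names here are illustrative.

    import numpy as np

    def make_simulation_function(f, dt, substeps=1):
        """Return g(x,u) ~ x(dt), computed by Euler integration with optional sub-steps."""
        h = dt / substeps
        def g(x, u):
            x = np.asarray(x, dtype=float)
            for _ in range(substeps):
                x = x + h * np.asarray(f(x, u))
            return x
        return g

    def make_discrete_cost(L_c, dt):
        """Return L_d(x,u,i) ~ dt * L_c(x,u,i*dt) (Euler approximation of the integral)."""
        def L_d(x, u, i):
            return dt * L_c(x, u, i * dt)
        return L_d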

LQR control

The simplest class of optimal control problems is LTI systems with costs that are quadratic in $x$ and $u$. Through the calculus of variations, which is beyond the scope of this book, the optimal control for this problem class can be determined analytically as a closed-form function of $x$.

LTI systems with quadratic cost are specified as $\dot{x} = Ax + Bu$ and\begin{equation}L(x,u,t) = x^T Q x + u^T R u\end{equation}where $Q$ and $R$ are symmetric matrices of size $n\times n$ and $m\times m$, respectively. The magnitude of the entries of $Q$ penalizes error from the equilibrium point, and the magnitude of the entries of $R$ penalizes control effort. The overall cost functional is therefore\begin{equation}J(x,u) = \int_0^\infty \left( x(t)^T Q x(t) + u(t)^T R u(t) \right) dt.\end{equation}

Here, the optimal control can be shown to be a linear function of $x$:\begin{equation}u = -Kx\end{equation}for the gain $K = R^{-1}B^T P$ defined as a function of an unknown matrix $P$. $P$ is a symmetric $n \times n$ matrix that solves the following Riccati equation:

\begin{equation}A^TP + PA - PBR^{-1}B^TP + Q = 0. \label{eq:RiccatiEquation}\end{equation}

Numerical methods are available for solving the Riccati equation for $P$. This method is known as the Linear Quadratic Regulator (LQR) approach.
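
For instance, SciPy provides a solver for the continuous-time algebraic Riccati equation, so the LQR gain can be computed in a few lines. The double-integrator matrices below are an illustrative choice, not values from the text.

    import numpy as np
    from scipy.linalg import solve_continuous_are

    def lqr_gain(A, B, Q, R):
        """Solve A^T P + P A - P B R^{-1} B^T P + Q = 0 and return K = R^{-1} B^T P."""
        P = solve_continuous_are(A, B, Q, R)
        K = np.linalg.solve(R, B.T @ P)
        return K, P

    # Example: double integrator q'' = u (illustrative weights)
    A = np.array([[0.0, 1.0], [0.0, 0.0]])
    B = np.array([[0.0], [1.0]])
    Q = np.diag([1.0, 0.1])
    R = np.array([[0.01]])
    K, P = lqr_gain(A, B, Q, R)
    # Closed-loop control: u = -K x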

TODO: illustration (Figure 3)

As we showed in the section on LTI stability, traditional pole placement methods can be used to derive a stable controller for LTI systems. However, the significance of the LQR controller compared to traditional pole stability analysis is that the performance metric is made explicit rather than implicit. Moreover, it gives a closed form solution for any dynamic model specified by $A,B$, so if more information is gathered that yields a better estimate for $A$ and $B$, the LQR method can be applied directly to obtain the optimal gains.

Pontryagin's Minimum Principle

The "magic" of the LQR solution is obtained through a more generic principle of optimal controllers called Pontryagin's minimum principle (PMP). It defines a first order, necessary condition for a particular state/control trajectory to be an optimum. It is derived from ($\ref{eq:OptimalControl}$) via a combination of calculus of variations and the method of Lagrange multipliers. This will briefly be described here.

In (equality) constrained optimization problems, the method of Lagrange multipliers defines an auxiliary Lagrange multiplier variable $\lambda_i$ for each equality constraint. However, in optimal control problems, there are an infinite number of equality constraints $\dot{x}(t) = f(x(t),u(t))$ defined for each point in time. As a result, the Lagrange multipliers for this problem are not single variables, but rather trajectories defined over time. This trajectory of multipliers is known as a costate trajectory $\lambda(t):[0,\infty)\rightarrow \mathbb{R}^n$.

An auxiliary function, called the Hamiltonian, is defined over the system and costate at particular points in time:\begin{equation}H(\lambda,x,u,t)=\lambda^T f(x,u) + L(x,u,t).\label{eq:Hamiltonian}\end{equation}It is also possible to maintain control constraints $u\in \mathcal{U}\subseteq \mathbb{R}^m$.

Pontryagin's minimum principle is then stated as follows. An optimal state/control/costate trajectory $(x^\star,u^\star,\lambda^\star)$ satisfies for all $t \geq 0$:

  1. PMP 1: $H(\lambda^\star(t),x^\star(t),u^\star(t),t) \leq H(\lambda^\star(t),x^\star(t),u,t)$ for all $u \in \mathcal{U}$

  2. PMP 2: $\dot{x}^\star(t) = f(x^\star(t),u^\star(t))$

  3. PMP 3: $\dot{\lambda}^\star(t) = -\frac{\partial}{\partial x}H(\lambda^\star(t),x^\star(t),u^\star(t),t)$.

The derivation of these equations is outside the scope of this book. But the conditions can be applied in certain cases to obtain optimal controls, or at least to limit the range of controls that we would need to consider as candidates for optima.

Derivation of LQR from PMP

Let's now apply PMP to the LQR setting to see what insights it can give us into the LQR solution. The Hamiltonian is $$H(\lambda,x,u,t) = \lambda^T(Ax + Bu) + x^T Q x + u^T R u$$ by simple substitution, and $\mathcal{U} = \mathbb{R}^m$. Now, let us apply each of the conditions of Pontryagin's minimum principle:

  1. $H(\lambda^\star(t),x^\star(t),u^\star(t),t) \leq H(\lambda^\star(t),x^\star(t),u,t)$ implies that $\lambda^{\star T}Bu^{\star} + u^{\star T} R u^\star \leq \lambda^{\star T}Bu + u^T R u$ for all $u$. Here we have subtracted off terms that do not contain $u$.

  2. $\dot{x}^\star = f(x^\star(t),u^\star(t))$ implies that $\dot{x}^\star = Ax^\star + Bu^\star$.

  3. $\dot{\lambda}^\star(t) = -\frac{\partial}{\partial x}H(\lambda^\star(t),x^\star(t),u^\star(t),t)$ implies that $\dot{\lambda}^\star = - A^T \lambda^{\star} - 2 Qx^\star$.

Expanding condition 1, we see that for $u^\star$ to be a minimizer of the Hamiltonian, given $\lambda^\star$, $x^\star$, and $t$ fixed, we must have that $B^T \lambda^\star + 2 R u^\star = 0$, so that $u^\star = -\frac{1}{2}R^{-1} B^T \lambda^\star$.

Now replacing this into conditions 2 and 3, we have a system of ODEs:$$\begin{split}\dot{x}^\star &= A x^\star - \frac{1}{2} B R^{-1} B^T \lambda^\star \\\dot{\lambda}^\star &= - 2 Qx^\star -A^T \lambda^\star\end{split}$$

Hypothesizing that $\lambda^\star = 2 P x^\star$ for some symmetric matrix $P$, substituting into both equations, and multiplying the first equation by $P$, we obtain the system of equations

$$\begin{split}P\dot{x}^\star &= (PA - P B R^{-1} B^T P )x^\star \\2P\dot{x}^\star &= (-2Q -2A^T P)x^\star.\end{split}$$

After dividing the second equation by 2 and equating the left hand sides, we have the equation$$(PA - P B R^{-1} B^T P )x^\star = (-Q -A^T P)x^\star.$$Note that this condition must hold for every state $x^\star$ on an optimal trajectory, but we can set our trajectory's initial state to any $x \in \mathbb{R}^n$. Hence the equation must hold for all $x$, which means the matrices on the two sides must be equal; rearranging produces the Riccati equation ($\ref{eq:RiccatiEquation}$).

Bang-bang control

Another result from PMP condition (1) is that the Hamiltonian must be minimized by the control, keeping the state and costate fixed. As a result, there are two possibilities: (1a) the derivative of the Hamiltonian is 0 at $u^\star$, or (1b) the control is at the boundary of the control set, $u^\star \in \partial \mathcal{U}$.

This leads to many systems having the characteristic of bang-bang control, which means that the optimal control will jump discontinuously between extremes of the control set. As an example, consider a race car driver attempting to minimize time. The optimal control at all points in time will either maximize acceleration, maximize braking, and/or maximize/minimize angular acceleration. If a driver chose a trajectory that used any intermediate controls, such as a light acceleration or braking, it would not be optimal: he/she could save time around the track by making the control more extreme.

TODO: illustration (Figure 4)

This condition helps us reduce the number of controls that we need to consider if we are trying to prove optimality. In some cases, it can be determined that there are a finite number of possible controls satisfying condition (1). This is very helpful, because the optimal control problem becomes one of simply finding switching times between optimal controls.

Taking the Dubins car model as an example, we have the state variable $(x,y,\theta)\in SE(2)$ and control variable $(v,\phi)$ denoting velocity and steering angle: $$\begin{split}\dot{x} &= v \cos \theta \\\dot{y} &= v \sin \theta \\\dot{\theta} &= v/L \tan \phi .\end{split}$$

Here $(x,y)$ are the coordinates of a point in the middle of the rear axle, and $L$ is the distance between the rear and front axles. The velocity and steering angle are bounded, $v \in [-1,1]$ and $\phi \in [-\phi_{max},\phi_{max}]$, and the cost only measures time to reach a target state. Hence, the Hamiltonian is$$H(\lambda,x,u,t) = \lambda_1 v \cos \theta + \lambda_2 v \sin \theta + \lambda_3 v/L \tan \phi + I[x \neq x_{tgt}]$$

The latter term does not contribute to the choice of $u$, so we can ignore it. For $(v,\phi)$ to be a minimum of the Hamiltonian, with $\lambda$ and $x$ fixed, either $\lambda = 0$ and the control is irrelevant, or $\lambda \neq 0$ and $v = -\text{sign}(\lambda_1 \cos \theta + \lambda_2 \sin \theta + \lambda_3 /L \tan \phi)$. Then, since $\tan$ is a monotonic function, we have $\phi = -\text{sign}(\lambda_3 v)\phi_{max}$. As a result, the only options are the minimum, maximum, and 0 controls on each axis.

The trajectories corresponding to these extrema are driving straight forward/backward, moving forward while turning left/right, and moving backward while turning left/right. The curves traced out by these trajectories are then either straight line segments or arcs of turning rate $\pm \tan \phi_{max}/L$. To find all minimum-time paths between two points, it is then a matter of enumerating all possible arcs and straight line segments. The solutions are known as Reeds-Shepp curves.

Trajectory Optimization

Although analytical approaches like PMP can lead to elegant expressions of optimal controls, it is not always possible (or easy) to derive such expressions for general nonlinear systems. In such cases it is practical to use numerical methods like trajectory optimization. For example, if we start with an LQR system, but then add any modification such as a non-quadratic term in the cost functional, or we add state constraints or control constraints, the analytical LQR solution no longer applies. However, trajectory optimization comes to our rescue and can still solve such problems!

We have already seen a form of trajectory optimization under the discussion of kinematic path planning, and here we extend this type of formulation to dynamic systems. Since trajectories are infinite-dimensional, the main challenge of trajectory optimization is to suitably discretize the space of trajectories. A second challenge is that the discretized optimization problem is usually also fairly large, depending on the granularity of the discretization, and hence optimization may be computationally inefficient.

In any case, trajectory optimization methods begin by reformulating ($\ref{eq:OptimalControl}$) as a finite-dimensional optimization problem over some representation of the state and control trajectory, and then solving the optimization problem, usually using a gradient-based technique. They differ most significantly in the details of the representation and the optimization method.

Shooting methods

Let us assume a discrete-time system, or that the continuous-time system can be approximated sufficiently well by a discrete-time one. The first method we will describe, single shooting, optimizes over the control sequence $\mathbf{u} = (u_0,\ldots,u_{N-1})$. When we guess a control trajectory, we can then derive the corresponding state trajectory $\mathbf{x}=(x_0,x_1,\ldots,x_N)$ by integrating forward from $x_0$ as follows:

  1. Set the initial state $x_0$.

  2. For $i=0,\ldots,N-1$, set $x_{i+1} = g(x_{i},u_{i})$.

TODO: illustration (Figure 5)

With this definition, each $x_i$ is a function of $u_0,\ldots,u_{i-1}$. Hence, we can approximate the cost functional as:$$J(x,u) \approx \tilde{J}(u_0,\ldots,u_{N-1}) = \sum_{i=0}^{N-1} \Delta t\, L(x_i,u_i,t_i) + \Phi(x_N).$$Now, we can express the approximated optimal control problem as a minimization over the control sequence:\begin{equation}\arg \min_{\mathbf{u}} \tilde{J}(\mathbf{u}).\label{eq:SingleShooting}\end{equation}With control space $\mathbb{R}^m$, this is an optimization problem over $mN$ variables.
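
To make this concrete, the sketch below evaluates $\tilde{J}(\mathbf{u})$ by rolling out the dynamics and hands it to scipy.optimize.minimize. The callables g (simulation function), L (discrete running cost, including the $\Delta t$ factor), and Phi (terminal cost), along with x0, N, and m, are assumed to be defined elsewhere; the L-BFGS-B choice is illustrative.

    import numpy as np
    from scipy.optimize import minimize

    def rollout(x0, U, g):
        """Integrate the state sequence x_0..x_N from the control sequence U = [u_0..u_{N-1}]."""
        X = [np.asarray(x0, dtype=float)]
        for u in U:
            X.append(np.asarray(g(X[-1], u)))
        return X

    def make_shooting_objective(x0, g, L, Phi, N, m):
        """Return J_tilde(u_flat) for a flattened control sequence of shape (N*m,)."""
        def J_tilde(u_flat):
            U = u_flat.reshape(N, m)
            X = rollout(x0, U, g)
            return sum(L(X[i], U[i], i) for i in range(N)) + Phi(X[N])
        return J_tilde

    # Usage sketch (g, L, Phi, x0, N, m assumed to be defined elsewhere):
    # J_tilde = make_shooting_objective(x0, g, L, Phi, N, m)
    # result = minimize(J_tilde, np.zeros(N * m), method="L-BFGS-B")
    # U_opt = result.x.reshape(N, m)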

There is a tradeoff in determining the resolution $N$. With higher values of $N$, the control trajectory can obtain lower costs, but the optimization problem will have more variables, and hence become more computationally complex. Moreover, it will be more susceptible to local minimum problems.

Efficient gradients: DDP and iLQR

Standard gradient-based techniques can be used to solve the problem ($\ref{eq:SingleShooting}$). One difficulty is that taking the gradient of $\tilde{J}$ with respect to a control variable $u_i$ is not straightforward, because this choice of control affects every state variable $x_k$ with $i < k \leq N$. Hence,\begin{equation}\frac{\partial \tilde{J}}{\partial u_i} = \Delta t \frac{\partial L}{\partial u} (x_{i},u_i,t_i) + \Delta t \sum_{k=i+1}^{N-1} \frac{\partial L}{\partial x}(x_k,u_{k},t_k)\frac{\partial x_k}{\partial u_i} + \frac{\partial \Phi}{\partial x}(x_N)\frac{\partial x_N}{\partial u_i}.\label{eq:JacobianUi}\end{equation}The expressions for $\frac{\partial x_k}{\partial u_i}$ are relatively complex because each $x_k$ is defined recursively starting from $x_{i}$: $$\begin{split}x_{i+1} &= g(x_{i},u_i) \\x_{i+2} &= g(x_{i+1},u_{i+1}) \\x_{i+3} &= g(x_{i+2},u_{i+2}) \\&\vdots\end{split}$$In this list, the only equation directly affected by $u_i$ is the first. The effects on the remaining states are due to cascading effects of previous states. Hence, we see that$$\frac{\partial x_{i+1}}{\partial u_i} = \frac{\partial g}{\partial u}(x_{i},u_i)$$$$\frac{\partial x_{i+2}}{\partial u_i} = \frac{\partial x_{i+2}}{\partial x_{i+1}} \frac{\partial x_{i+1}}{\partial u_i} = \frac{\partial g}{\partial x}(x_{i+1},u_{i+1}) \frac{\partial x_{i+1}}{\partial u_i}$$And in general, for $k > i+1$,$$\frac{\partial x_k}{\partial u_i} = \frac{\partial g}{\partial x}(x_{k-1},u_{k-1}) \frac{\partial x_{k-1}}{\partial u_i}.$$This appears to be extremely computationally expensive, since each evaluation of ($\ref{eq:JacobianUi}$) requires calculating $O(N)$ derivatives, leading to an overall $O(N^2)$ algorithm for calculating the gradient with respect to the entire control sequence.

However, with a clever forward/backward formulation, the gradient can be calculated with $O(N)$ operations. Note that every expression of the form $\frac{\partial x_k}{\partial u_i}$ with $k > i$ can be written as $\frac{\partial x_k}{\partial x_{i+1}}\frac{\partial x_{i+1}}{\partial u_i}$. So, we observe that ($\ref{eq:JacobianUi}$) is equal to$$\Delta t \frac{\partial L}{\partial u}(x_{i},u_i,t_i) + \frac{\partial J}{\partial x_{i+1}} \frac{\partial x_{i+1}}{\partial u_i}$$where we define$$\frac{\partial J}{\partial x_i} = \Delta t \sum_{k=i}^{N-1} \frac{\partial L}{\partial x}(x_k,u_{k},t_k) \frac{\partial x_k}{\partial x_i} + \frac{\partial \Phi }{\partial x}(x_N)\frac{\partial x_N}{\partial x_i}.$$This entire set of vectors can be computed in a single backward pass from $i=N$ back to $i=1$. Starting with $i=N$, see that$$\frac{\partial J}{\partial x_N} = \frac{\partial \Phi }{\partial x}(x_N).$$Then, proceeding to $i=N-1$, observe $$\begin{split}\frac{\partial J}{\partial x_{N-1}} &= \Delta t \frac{\partial L}{\partial x}(x_{N-1},u_{N-1},t_{N-1}) + \frac{\partial \Phi }{\partial x}(x_N)\frac{\partial x_N}{\partial x_{N-1}} \\&= \Delta t \frac{\partial L}{\partial x}(x_{N-1},u_{N-1},t_{N-1}) + \frac{\partial J}{\partial x_N} \frac{\partial x_N}{\partial x_{N-1}}.\end{split}$$ In general, for $i<N$, we have the recursive expression$$\frac{\partial J}{\partial x_i} = \Delta t \frac{\partial L}{\partial x}(x_i,u_{i},t_i) + \frac{\partial J}{\partial x_{i+1}} \frac{\partial x_{i+1}}{\partial x_{i}}.$$The entire set of values can be computed in $O(N)$ time for all $x_1,\ldots,x_N$.
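
This backward pass can be written as a short generic routine. The sketch below assumes that Jacobians of the simulation function (g_x, g_u) and gradients of the discrete running cost and terminal cost (L_x, L_u, Phi_x) are supplied as callables; the names are illustrative, and the indexing follows the convention $x_{i+1}=g(x_i,u_i)$.

    import numpy as np

    def shooting_gradient(X, U, g_x, g_u, L_x, L_u, Phi_x):
        """O(N) gradient of the shooting objective w.r.t. the flattened control sequence.

        X: states x_0..x_N; U: controls u_0..u_{N-1}.
        g_x(x,u), g_u(x,u): Jacobians of the simulation function.
        L_x(x,u,i), L_u(x,u,i): gradients of the discrete running cost.
        Phi_x(x): gradient of the terminal cost.
        """
        N = len(U)
        grads = [None] * N
        dJdx = np.asarray(Phi_x(X[N]))            # dJ/dx_N
        for i in reversed(range(N)):
            # dJ/du_i = dL/du(x_i,u_i) + dJ/dx_{i+1} * dx_{i+1}/du_i
            grads[i] = np.asarray(L_u(X[i], U[i], i)) + dJdx @ np.asarray(g_u(X[i], U[i]))
            # dJ/dx_i = dL/dx(x_i,u_i) + dJ/dx_{i+1} * dx_{i+1}/dx_i
            dJdx = np.asarray(L_x(X[i], U[i], i)) + dJdx @ np.asarray(g_x(X[i], U[i]))
        return np.concatenate(grads)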

However, problems of this sort are usually poorly scaled, and hence standard gradient descent converges slowly. The most commonly used higher-order technique is known as Differential Dynamic Programming (DDP), which is an efficient recursive method for performing Newton's method. Given a current control trajectory, it approximates the cost function as a quadratic function of the controls, and solves for the minimum of the function. A similar approach is the Iterative LQR (iLQR) algorithm, which is very closely related to DDP but drops the 2nd derivative of the dynamics function.

The exact steps for implementing DDP and iLQR are beyond the scope of this course but are readily available from other sources. In short, these methods are able to achieve superlinear convergence rates like Newton's method. However, by exploiting the sparse structure of the optimization problem, they are much faster at finding the step direction, taking $O(Nm^3)$ time and $O(N(m^2+n^2))$ space per iteration compared to the $O(N^3m^3)$ time and $O(N^2m^2)$ space of naive Newton's method.

Example: Pendulum swing-up

As an example, we show iLQR applied to a pendulum swing-up problem. The initial state is just a bit off of the goal (upright) state of $\pi/2$. The cost function here is a nearly quadratic function $\sum_i \Delta t (|\theta_i-\pi/2|_{ang}^2 + 0.01 u_i^2) + 3|\theta_N-\pi/2|_{ang}^2$ where $|a-b|_{ang}$ denotes the angular difference. The trajectory has $N=20$ timesteps with duration $\Delta t = 0.05$ s.
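
For reference, a sketch of the model and cost used in this example is given below. The pendulum dynamics (unit mass and length, gravitational constant 9.8, no damping, with $\theta$ measured from the horizontal so that $\pi/2$ is upright) are assumptions made for illustration; the cost terms follow the description above.

    import numpy as np

    dt, N = 0.05, 20
    goal = np.pi / 2  # upright

    def f(x, u):
        """Pendulum dynamics with state x = (theta, theta_dot) and torque control u."""
        theta, dtheta = x
        return np.array([dtheta, u[0] - 9.8 * np.cos(theta)])

    def ang_diff(a, b):
        """Angular difference wrapped to [-pi, pi]."""
        return (a - b + np.pi) % (2 * np.pi) - np.pi

    def running_cost(x, u, i):
        return dt * (ang_diff(x[0], goal) ** 2 + 0.01 * u[0] ** 2)

    def terminal_cost(x):
        return 3.0 * ang_diff(x[0], goal) ** 2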

In the initial trajectory the pendulum falls down. Note the range of the $\theta$ variable is $[0,2\pi)$, so the jump is where the variable wraps around. The cost to go indicates the remainder of the $J$ function for the trajectory from time $t$ to $T$, and notice that it is quite large due to the terminal cost.

After running iLQR, the trajectory converges to a path that keeps the pendulum upright. The terminal cost is nearly 0, so the cost-to-go is low at the end of the trajectory. But a strong push ($u\approx 5.3$) was necessary for the pendulum to stop falling at the start, which incurs some cost near the start of the trajectory. Nevertheless, the total cost is quite low, less than 1.

Direct transcription

Direct transcription is related to single shooting but avoids explicitly nested calculations of gradients using the notion of constrained optimization (Appendix C). Constrained optimization is more complex from the optimization algorithm's point of view, but direct transcription is often far simpler for an implementer.

The idea is to simply write down the discrete-time optimal control problem and solve it over both $\mathbf{x}$ and $\mathbf{u}$:\begin{equation}\begin{gathered}\mathbf{x}^\star, \mathbf{u}^\star = \arg \min_{\mathbf{x},\mathbf{u}} J(\mathbf{x},\mathbf{u}) = \sum_{i=0}^{N-1} L(x_i,u_i,i) + \Phi(x_N) \text{ such that} \\x_{1} = g(x_0,u_0) \\x_{2} = g(x_1,u_1) \\\vdots \\x_{N} = g(x_{N-1},u_{N-1})\end{gathered}\label{eq:DirectTranscription}\end{equation}

The added complexity is that the dynamics equations are listed as equality constraints which the optimizer must attempt to meet in addition to minimizing the objective function. Unlike shooting, all of the $x$'s are free variables, and during optimization the $x$'s usually do not meet the dynamics constraints perfectly!

A helpful way to think of this is to consider the "ideal" predicted state $\hat{x}_i = g(x_{i-1},u_{i-1})$ and the defect vector at step $i$: $e_i = x_{i} - \hat{x}_{i}$. The optimizer will try to simultaneously minimize the value of $J$ while driving all of the defects $e_1,\ldots,e_{N}$ toward 0. As it iterates, the defects will in general be nonzero, meaning that the optimizer will wiggle around a state variable $x_i$ and its corresponding control $u_i$ to simultaneously reduce the running cost $L(x_i,u_i,i)$ while trying to meet $x_i = \hat{x}_i$ and $x_{i+1} = \hat{x}_{i+1} = g(x_i,u_i)$.

By freeing the $x$'s from the $u$'s, direct transcription can often use simpler initialization techniques than shooting. For example, if we wish for the system to reach a goal state, we can initialize the state trajectory to a linear interpolation from the start to the goal and the controls to zero. The initial stages of the optimization will then attempt to reconcile the discrepancy between this dynamically infeasible initial guess and the dynamics constraints, driving the defects toward zero while also reducing the objective.

The use of hard constraints in optimization also allows us to specify more versatile constraints. For example, we can easily state that a trajectory should end at a specific goal state, via an equality $x_N=x_G$. Shooting methods cannot in general handle simultaneous goal constraints and nonzero objective functions. Moreover, if we were to have a constrained optimizer that can accept inequality constraints, we may easily enforce state and control bounds of the form $x_L \leq x_i \leq x_U$ and $u_L \leq u_i \leq u_U$.

Computationally, direct transcription uses many more variables and constraints than shooting. Including the state variables, we now have $(n+m)N$ optimization variables and an additional $nN$ equality constraints, which has the potential to make optimization slower. Furthermore, to be efficient when $N$ is large, optimization solvers must accept sparse matrix formulations of constraint Jacobians and use sparse linear algebra routines.

In the below example, we set up a direct transcription formulation of the pendulum trajectory optimization problem. We bound the control to $|u|\leq 4$, and bound the velocity to $|\dot{\theta}|\leq 10$. Observe that the initial trajectory has a very low objective function value because we have initialized the state-space trajectory to move directly toward the goal. However, the defects are relatively high, indicating that this trajectory cannot actually be executed by the initial controls. After optimizing, the cost is higher, but the defect is now small.

Note that the example below uses Scipy's minimize() function for compatibility, since many of you will have Scipy installed in your Python environment already (if not, a simple "pip install scipy" should do the trick). Professional-grade trajectory optimizers will use advanced nonlinear solvers like SNOPT and IPOPT that can support sparse matrices. This example also uses manual bookkeeping to pack and unpack the optimization variable (here called "xu" which stacks the x and u variables) and to perform differentiation, but more advanced packages are able to use automatic differentiation to set up optimal control problems more conveniently (e.g., CasADi).
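
As a rough sketch of such a setup (not the exact code of the example), the variables can be packed into a single vector and the dynamics defects expressed as equality constraints for scipy.optimize.minimize. Here g, L, Phi, x0, N, n, and m are assumed to be defined elsewhere, and the SLSQP solver choice and packing scheme are illustrative.

    import numpy as np
    from scipy.optimize import minimize

    def pack(X, U):
        """Stack x_1..x_N and u_0..u_{N-1} into a single optimization vector."""
        return np.concatenate([np.ravel(X[1:]), np.ravel(U)])

    def unpack(xu, x0, N, n, m):
        X = np.vstack([np.asarray(x0).reshape(1, n), xu[:N * n].reshape(N, n)])
        U = xu[N * n:].reshape(N, m)
        return X, U

    def objective(xu, x0, N, n, m, L, Phi):
        X, U = unpack(xu, x0, N, n, m)
        return sum(L(X[i], U[i], i) for i in range(N)) + Phi(X[N])

    def defects(xu, x0, N, n, m, g):
        """Equality constraints: x_{i+1} - g(x_i, u_i) = 0 for i = 0..N-1."""
        X, U = unpack(xu, x0, N, n, m)
        return np.concatenate([X[i + 1] - g(X[i], U[i]) for i in range(N)])

    # Usage sketch:
    # xu0 = pack(initial_state_guess, np.zeros((N, m)))
    # res = minimize(objective, xu0, args=(x0, N, n, m, L, Phi), method="SLSQP",
    #                constraints=[{"type": "eq", "fun": defects, "args": (x0, N, n, m, g)}])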

Pseudospectral / collocation methods

As an alternative to piecewise constant controls, it is also possible to use other discretizations, such as polynomials or splines. In any case, the control is specified by a linear combination of basis functions $$u(t) = \sum_{i=1}^k c_i \beta_i(t)$$ where the $c_i \in \mathbb{R}^m$ are control coefficients, which are to be chosen by the optimization, and the basis functions $\beta_i(t)$ are fixed. For example, a set of polynomial basis functions could be $1$, $t$, $t^2$, ..., $t^{k-1}$. The difficulty with such parameterizations is that the state trajectory depends on every control coefficient, so evaluating the gradient is computationally expensive.

To address this problem, it is typical to include a state trajectory parameterization in which the state trajectory $x(t)$ is also represented as an optimization variable that is parameterized explicitly along with the control trajectory. Specifically, we suppose that $$x(t) = \sum_{i=1}^k d_i \gamma_i(t)$$ where the $d_i \in \mathbb{R}^n$ are state coefficients to be optimized, and the basis functions $\gamma_i$ are fixed functions of time.

TODO: illustration (Figure 6)

The main challenge is then to enforce dynamic consistency between the $x$ trajectory and the $u$ trajectory over the time domain. Because it is impossible to do this exactly at a continuous infinity of points, dynamic consistency must instead be enforced at a finite number of points in time, which are known as collocation points. The result is an equality-constrained, finite-dimensional optimization problem.

Specifically, given $N$ points in the time domain $t_1,\ldots,t_N$, dynamic consistency is enforced at the $j$'th time point with a constraint$$x^\prime(t_j) = f(x(t_j),u(t_j) )$$which can be rewritten in terms of the coefficients$$\sum_{i=1}^k d_i \gamma_i^\prime(t_j) = f\left(\sum_{i=1}^k d_i \gamma_i(t_j), \sum_{i=1}^k c_i \beta_i(t_j) \right).$$
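
Below is a minimal sketch of forming these collocation residuals with simple monomial bases $1, t, t^2, \ldots$ for both $x(t)$ and $u(t)$; the coefficient arrays d and c and the dynamics callable f are illustrative assumptions. A practical implementation would typically use orthogonal polynomial or spline bases with carefully chosen collocation points for better numerical conditioning.

    import numpy as np

    def eval_poly(coeffs, t):
        """Evaluate sum_i coeffs[i] * t**i, where coeffs has shape (k, dim)."""
        powers = np.array([t ** i for i in range(coeffs.shape[0])])
        return powers @ coeffs

    def eval_poly_deriv(coeffs, t):
        """Evaluate the time derivative of the polynomial above."""
        k = coeffs.shape[0]
        powers = np.array([i * t ** (i - 1) if i > 0 else 0.0 for i in range(k)])
        return powers @ coeffs

    def collocation_residuals(d, c, f, t_points):
        """Stack x'(t_j) - f(x(t_j), u(t_j)) over all collocation points t_j."""
        res = []
        for tj in t_points:
            xj, uj = eval_poly(d, tj), eval_poly(c, tj)
            res.append(eval_poly_deriv(d, tj) - np.asarray(f(xj, uj)))
        return np.concatenate(res)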

Indirect methods

Indirect methods are an alternative class of numerical methods that avoid explicit discretization of the control trajectory, and instead use Pontryagin's minimum principle to reduce the search space considerably. By PMP conditions 2 and 3, we see that the state and costate variables $(x,\lambda)$ for an optimal trajectory evolve according to the ordinary differential equations (ODEs)$$\dot{x}(t) = f(x(t),u^\star(t)),$$$$\dot{\lambda}(t) = -\frac{\partial}{\partial x}H(\lambda(t),x(t),u^\star(t),t).$$

Supposing that we could analytically or efficiently solve for the optimal control at each time $t$ knowing $x(t)$ and $\lambda(t)$,$$u^\star(t) \equiv \arg \min_{u\in \mathcal{U}} H(\lambda(t),x(t),u,t),$$we can treat $d/dt (x,\lambda)$ as an ODE. Then, if we knew $\lambda(0)$, the state and costate trajectories could be determined through standard integration techniques.

However, we do not precisely know what $\lambda(0)$ actually is. But the optimal trajectory is defined by some initial value of $\lambda$. In other words, we can reduce the search over optimal trajectories to a search over an $n$-D variable! This is the general idea behind indirect shooting methods. They proceed as follows.

Repeat until convergence:

  1. Guess a value of $\lambda_0 \in \mathbb{R}^n$.
  2. Integrate $(x,\lambda)$ according to the ODE defined by PMP 2 and 3 from the initial condition $(x_0, \lambda_0)$. In the ODE, PMP 1 is used to derive $u^\star(t)$ at each time step. Let the resulting trajectories be denoted $(x(\cdot;\lambda_0),\lambda(\cdot;\lambda_0))$
  3. Assess the value of the objective functional $J(x(\cdot;\lambda_0),u^\star(\cdot;\lambda_0))$.
  4. Compute the gradient of $J$ w.r.t. $\lambda_0$, $\nabla_{\lambda_0} J$.
  5. Take a gradient step in the direction of $-\nabla_{\lambda_0} J$.

As in direct transcription, calculating the gradient is the most cumbersome part (Step 4). Suppose that we are using Euler integration with integration time step $h$, and that we can derive $\nabla_{x(t)} u^\star(t)$ and $\nabla_{\lambda(t)} u^\star(t)$. Then $$\lambda(t+h) = \lambda(t) - h \frac{\partial}{\partial x}H(\lambda(t),x(t),u^\star(t),t) $$$$x(t+h) = x(t) + h f(x(t),u^\star(t)).$$

Applying the chain rule, we obtain recursive equations for $\nabla_{\lambda_0} u^\star(t)$, $\nabla_{\lambda_0} x(t)$, and $\nabla_{\lambda_0} \lambda(t)$:

$$\nabla_{\lambda_0} u^\star(t) = \nabla_{x(t)} u^\star(t) \nabla_{\lambda_0} x(t) + \nabla_{\lambda(t)} u^\star(t) \nabla_{\lambda_0} \lambda(t)$$$$\nabla_{\lambda_0} x(t+h) = \nabla_{\lambda_0} x(t) + h \nabla_x f(x(t),u^\star(t)) \nabla_{\lambda_0} x(t) + h \nabla_u f(x(t),u^\star(t)) \nabla_{\lambda_0} u^\star(t)$$$$\nabla_{\lambda_0} \lambda(t+h) = \nabla_{\lambda_0} \lambda(t) - h \frac{\partial^2}{\partial x^2}H(\lambda(t),x(t),u^\star(t),t) \nabla_{\lambda_0} x(t) - h \frac{\partial^2}{\partial x \partial \lambda}H(\lambda(t),x(t),u^\star(t),t) \nabla_{\lambda_0} \lambda(t) - h \frac{\partial^2}{\partial x \partial u}H(\lambda(t),x(t),u^\star(t),t) \nabla_{\lambda_0} u^\star(t)$$

Observe that evaluating this latter recursive step requires $L$ and $f$ to be twice differentiable, and we also required the derivative of the optimal control with respect to state and costate. It should also be noted that the potential landscape of $J$ as a function of $\lambda_0$ is extremely complex and fraught with local minima, and the domain of $\lambda_0$ is unbounded with optimal values that can have extremely large magnitudes. So, guessing an appropriate value of $\lambda_0$ that lies within the basin of attraction of the optimum can be difficult. These requirements can make it challenging to apply shooting methods in practice.

Local minima

A major issue with any descent-based trajectory optimization approach is local minima. As an example, suppose we initialize the iLQR example above with a guessed trajectory in which the pendulum falls freely and swings over to the left (here the trajectory is simulated from the initial state (0.9,0) with 0 controls). The final state of this trajectory is closer to the goal on the left, while the initial state is on the right, swinging through the downward position.

Here, iLQR is not able to recover the alternative solution that keeps the pendulum upright. However, it is able to find a trajectory with a final state that is close to upright, reducing the overall cost. It achieves this state by moving more aggressively through the downward state ($3\pi/2 \approx 4.7$)!

Local minima are very common in trajectory optimization due to the high dimensionality and nonlinearity of these optimization problems. Only in a few cases can we prove that the problem is convex, such as in LTI problems with convex costs and linear control constraints.

As we have seen before, random restarts are one of the most effective ways to handle local minima, but in the high dimensional spaces of trajectory optimization, a prohibitive number of restarts is needed to have a high chance of finding a global optimum. Hence, for a robot to perform trajectory optimization reliably, a considerable amount of effort needs to be spent finding good initializers or engineering the cost landscape so that it has few local minima.

Methods for handling obstacles and constraints

To meet constraints and avoid obstacles in trajectory optimization, there are several formulations, each with its own strengths and weaknesses.

  1. Formulate as unconstrained optimization, using barrier functions penalizing proximity to constraints.
  2. Formulate as inequality constraints, ensuring nonnegative distance to constraints.
  3. Formulate as pointwise inequality constraints, then solve a sequence of optimizations in which points are incrementally added (semi-infinite programming).

Barrier function methods

In the barrier function formulation, we assume that for each constraint $C$ we have a continuous (and ideally smooth) function $d_C(x)$ that measures the distance from a state $x$ to violating the constraint. To ensure smoothness when penetrating the constraint, this should be a signed distance in which $d_C(x) < 0$ in the penetrating case, $d_C(x)=0$ when $x$ is just on the boundary of the constraint, and $d_C(x) > 0$ when $x$ strictly satisfies the constraint (e.g., is collision- and contact-free). Ideally, the gradient of $d_C$ should be oriented to push $x$ away from constraint violation.

To use barrier functions, we define some barrier function $\rho(d)$ that increases as $d$ shrinks, and decreases as $d$ grows large. For example, some popular barrier functions are defined as follows:

  • Log-barrier: $\rho(d) \propto -\log d$. Hard, $C^\infty$ for $d>0$.
  • Inverse: $\rho(d) \propto 1/d - 2/d_{max}+d/d_{max}^2$ if $0 < d < d_{max}$, 0 if $d \geq d_{max}$, and $\infty$ if $d \leq 0$. Hard, $C^1$ for $d>0$, 0 for $d > d_{max}$.
  • Half-quadratic: $\rho(d) \propto \max(0,d_{max}-d)^2$. Soft, $C^1$, 0 for $d > d_{max}$.


We then can optimize the barrier-augmented objective functional $\tilde{J}(\mathbf{x},\mathbf{u}) = J(\mathbf{x},\mathbf{u}) + \alpha \sum_i \sum_C \rho(d_C(x_i))$ in which $\alpha > 0$ is a term that weighs between constraint avoidance and objective minimization.

Hard barrier functions go to infinity as $d\rightarrow 0$, which prevents constraint violations entirely at an optimal solution to the barrier-augmented optimization problem. Soft barrier functions do not, and instead require sufficiently high values of $\alpha$ to prevent constraint violations. The main disadvantage of hard barrier functions is that if at any point the trajectory in the optimizer violates a constraint, then it will be unable to make further progress. This requires carefully choosing the initial guess for optimization and also designing appropriate line search methods to avoid violating constraints. Moreover, the derivatives of hard barrier functions get less numerically stable as $d\rightarrow 0$.
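
For concreteness, here is a small sketch of the barrier functions listed above (up to the proportionality constants); the function names are illustrative.

    import numpy as np

    def log_barrier(d):
        """Hard barrier: -log(d); infinite when the constraint is violated (d <= 0)."""
        return -np.log(d) if d > 0 else np.inf

    def inverse_barrier(d, d_max):
        """Hard barrier: 1/d - 2/d_max + d/d_max^2 on (0, d_max), 0 beyond d_max."""
        if d <= 0:
            return np.inf
        if d >= d_max:
            return 0.0
        return 1.0 / d - 2.0 / d_max + d / d_max ** 2

    def half_quadratic_barrier(d, d_max):
        """Soft barrier: (d_max - d)^2 for d < d_max, 0 otherwise."""
        return max(0.0, d_max - d) ** 2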

Inequality constraint methods

Inequality constraint methods also assume the existence of a distance function $d_C(x)$, and simply use constrained optimization techniques like direct transcription to enforce constraint satisfaction in the optimizer:\begin{equation}\begin{gathered}\mathbf{x}^\star, \mathbf{u}^\star = \arg \min_{\mathbf{x},\mathbf{u}} J(\mathbf{x},\mathbf{u}) \text{ such that} \\x_{i+1} = g(x_i,u_i) \text{ for }i=0,\ldots,N-1 \\x_L \leq x_{i} \leq x_U \text{ for }i=1,\ldots,N \\u_L \leq u_{i} \leq u_U \text{ for }i=0,\ldots,N-1 \\d_C(x_i) \geq 0 \text{ for }i=1,\ldots,N \text{ and all constraints }C \end{gathered}\label{eq:DirectTranscriptionObstacles}\end{equation}Such a problem formulation can be more stable than barrier function methods, although constrained optimizers are typically more expensive than unconstrained ones.

Example: Double integrator with obstacle

The below example is a 2D double integrator problem with state $x=(p_x,p_y,v_x,v_y)$, control $u=(u_x,u_y)$, dynamics

$$\begin{bmatrix}\dot{p}_x \\\dot{p}_y \\\dot{v}_x \\\dot{v}_y \end{bmatrix} = \begin{bmatrix}v_x \\v_y \\u_x \\u_y \end{bmatrix}$$

and a cost function with goal-attraction, effort, and obstacle penalty terms. The incremental cost uses a standard quadratic objective plus a barrier term for obstacle avoidance:$$L(x,u) = (p_x-g_x)^2 + (p_y-g_y)^2 + 0.01(u_x^2+u_y^2) + \frac{1}{10\Delta t}\rho(d(x))$$and the terminal cost is $$\Phi(x) = 10(p_x-g_x)^2 + 10(p_y-g_y)^2 + v_x^2 + v_y^2 + \frac{1}{10}\rho(d(x)).$$In the below implementation, the log barrier function is used. Note that since a log barrier is a hard constraint, we require the entire initial trajectory to strictly obey the constraint.

The obstacle here is a simple circular constraint $\sqrt{(p_x - c_x)^2 + (p_y - c_y)^2} \geq r$ with $(c_x,c_y)$ the circle center and $r$ the radius. It is implemented as the signed distance function $d(x) = \sqrt{(p_x - c_x)^2 + (p_y - c_y)^2} - r$.

Without a penalty term, the optimal path would pass right through the obstacle. After optimization with the penalty term, the trajectory circumvents the obstacle to reach toward the goal.

Handling infinite-horizon problems

There are some challenges when applying trajectory optimization to infinite-horizon optimal control problems. Specifically, it is not possible to define a computational grid over the infinite domain $[0,\infty)$ for the purposes of computing the integral in $J(x,u)$. There are two general techniques to address this. The first is to simply truncate the problem at some maximum time $T$, leading to a finite-horizon optimal control problem.

The second method is to reparameterize time so that the range $[0,\infty)$ is transformed into a finite range, say $[0,1]$. If we let $s=1-e^{-t}$ then $s$ is in the range $[0,1)$. The cost functional then becomes:$$J(x,u) = \int_0^1 L(x(-\ln(1-s)),u(-\ln(1-s)),-\ln(1-s)) / (1-s) ds.$$This leads to a finite-horizon optimal control problem over the $s$ domain, with $T=1$. Hence, if a uniform grid is defined over $s \in [0,1]$, then the grid spacing in the time domain becomes progressively larger as $t$ increases.

In the reformulated problem it is necessary to express the derivative of $x$ with respect to $s$ in the new dynamics:$$\frac{d}{d s} x(t(s)) = \dot{x}(t(s)) t^\prime(s) = f(x(t(s)),u(t(s))) / (1-s).$$Care must also be taken as $s$ approaches 1, since the $1/(1-s)$ term approaches infinity, and if the instantaneous cost does not also approach 0, then the cost will become infinite. It is therefore customary to use a discount factor. With an appropriately defined discount term, the $s=1$ contribution to the cost will be dropped.


Hamilton-Jacobi-Bellman Equation

An alternative method to solve optimal control problems is to find the solution in state space rather than the time domain. In the Hamilton-Jacobi-Bellman (HJB) equation, so named as an extension of the Bellman equation for discrete optimal planning problems, a partial differential equation (PDE) across state space is formulated to determine the optimal control everywhere. (Contrast this with Pontryagin's minimum principle, which is an optimality condition only along a single trajectory.)

Derivation

We start by formulating the HJB equation in discrete time. Consider a finite-horizon optimal control problem, and define the value function $V(x,t) : \mathbb{R}^n \times \mathbb{R} \rightarrow \mathbb{R}$ as the minimum possible accumulated cost that could be obtained by any trajectory starting from initial state $x$ at time $t$. In other words, let us define the truncated cost functional\begin{equation}J_t(x,u) = \int_t^T L(x,u,s) ds + \Phi(x(T))\end{equation}which truncates the lower limit of the integral term of $J(x,u)$ to start from time $t$. (Obviously, $J(x,u)=J_0(x,u)$.) Then, the value function is the minimum of the truncated cost over all possible future controls: $V(x,t) = \min_u J_t(x,u)$. The value is also known as the cost to go, measuring the remaining cost to reach a goal. (This stands in contrast to the cost to come, which is the cost that would be accumulated to reach a state $x$ from the start.)

TODO: illustration (Figure 7)

It is apparent that at time $T$, the only term that remains is the terminal cost, so one boundary term is given: $$V(x,T) = \Phi(x).$$ Now we examine the value function going backwards in time. Suppose we know $V(x,t+\Delta t)$ for all $x$, and now we are considering time $t$. Let us also assume that at a state $x$ with control $u$, the resulting state at time $t+\Delta t$ is approximated by Euler integration, and the incremental cost is approximately constant over the interval $[t,t+\Delta t)$. Then, we have the approximation\begin{equation}V(x,t) \approx \min_{u\in U} [ \Delta t L(x,u,t) + V(x + \Delta t f(x,u),t+\Delta t)]\label{eq:DiscreteTimeHJB}\end{equation}The minimization is taken over controls to find the optimal control for the next time step. The first term of the minimized expression includes the incremental cost from the current state, time, and chosen control. The second term includes the cost contribution from the next state under the chosen control, incremented forward in time.

Note that the first order approximation of $V$ about $(x,t)$ is given by:\begin{equation}V(x+\Delta x,t+\Delta t) \approx V(x,t) + \frac{\partial V}{\partial x}(x,t) \Delta x + \dot{V}(x,t)\Delta t.\end{equation}If we take the limit of ($\ref{eq:DiscreteTimeHJB}$) as the time step $\Delta t$ approaches 0, subtract $V(x,t)$ from both sides, and divide by $\Delta t$, then we obtain the Hamilton-Jacobi-Bellman PDE:\begin{equation}0 = \dot{V}(x,t) + \min_{u \in U} [ L(x,u,t) + \frac{\partial V}{\partial x}(x,t) f(x,u)].\label{eq:HJB}\end{equation}

If these equations were to be solved either in discrete or continuous time across the $\mathbb{R}^n \times \mathbb{R}$ state space, then we have a complete description of optimal cost starting from any state. It is also possible to enforce state constraints simply by setting the value function at inadmissible states to $\infty$. Moreover, it is a relatively straightforward process to determine the optimal control given a value function:\begin{equation}u^\star(x,t) = \arg \min_{u \in U} [ \Delta t L(x,u,t) + V(x + \Delta t f(x,u),t + \Delta t)]\end{equation}for the discrete case and\begin{equation}u^\star(x,t) = \arg \min_{u \in U} [ L(x,u,t) + \frac{\partial V}{\partial x}(x,t) f(x,u)]\end{equation}for the continuous case. The main challenge here is to represent and calculate a function over an $n+1$ dimensional grid, which is prohibitively expensive for high-D state spaces. It is also potentially difficult to perform the minimization over the control in ($\ref{eq:DiscreteTimeHJB}$) and ($\ref{eq:HJB}$), since it must be performed at each point in time and space.

Reducing dimension by 1 using time-independence

It is often useful to reduce the dimensionality down to an $n$-D grid if the incremental cost is time-independent and the problem has an infinite horizon. With these assumptions, the optimal control is stationary, that is, it is dependent only on state and not time. Then, we can set up a set of recursive equations on a time-independent value function:

\begin{equation}V(x) = \min_{u \in U} [ \Delta t\, L(x,u) + V(x+\Delta t f(x,u)) ]\label{eq:DiscreteHJBStationary}\end{equation}

in the discrete time case, or, taking the limit as $\Delta t \rightarrow 0$, we get the continuous PDE

\begin{equation}0 = \min_{u \in U} [ L(x,u) + \frac{\partial V}{\partial x}(x) f(x,u) ].\end{equation}

Solution methods

It can be rather challenging to solve either the time-dependent or the stationary equations exactly, due to the dimensionality of the grids used and the recursive nature of the stationary equations. Also, some discretization of the control set $U$ is usually needed, and a finer discretization will help the method compute better estimates. Several general methods exist for solving HJB equations:

  1. Value iteration uses a guess of $V(x)$ and then iteratively improves it by optimizing ($\ref{eq:DiscreteHJBStationary}$) at each $x$ in the grid. This is also known as recursive dynamic programming and is a continuous-space variant of the value iteration algorithm discussed in Chapter 11. (A small grid-based code sketch is given after this list.)

  2. Policy iteration assigns guesses for the policy $u(x)$, and iteratively alternates between a) solving for the $V(x)$ induced by those controls, and b) improving the assigned controls using the induced $V(x)$. This is a continuous-space variant of the policy iteration algorithm discussed in Chapter 11.

  3. Linear programming uses a set of sample points $x_1,\ldots,x_N$ on a state space grid and points $u_1,\ldots,u_M$ on a control space grid, and then sets up a large linear programming problem with constraints of the form ($\ref{eq:DiscreteHJBStationary}$).

  4. The Fast Marching Method can be thought of as a one-pass value iteration, which is applicable to problems with known terminal sets and positive costs. The principle is similar to the "brush fire" method for calculating navigation functions as discussed in Chapter 11. Observe that once a value is defined for the goal states, their value no longer needs to be updated; they are called closed. We can try to determine all states for which the value is below some threshold $v$; these states will be found in a small neighborhood of the closed states, known as the frontier. Once these states' values are determined, they are added to the closed states, and a new set of frontier states is determined. Next, we increase $v$ and repeat the process, until all states are visited.
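
Below is a minimal value-iteration sketch for the stationary equation ($\ref{eq:DiscreteHJBStationary}$) on a one-dimensional state grid, using the illustrative scalar system $\dot{x} = u$ with running cost $x^2 + 0.1 u^2$; the grid ranges, control discretization, iteration count, and tolerance are all assumptions chosen for the example.

    import numpy as np

    dt = 0.05
    xs = np.linspace(-2.0, 2.0, 201)    # state grid
    us = np.linspace(-1.0, 1.0, 21)     # discretized control set
    V = np.zeros_like(xs)               # initial guess of the value function

    for _ in range(500):
        V_new = np.empty_like(V)
        for j, x in enumerate(xs):
            best = np.inf
            for u in us:
                x_next = np.clip(x + dt * u, xs[0], xs[-1])   # stay on the grid
                cost = dt * (x ** 2 + 0.1 * u ** 2) + np.interp(x_next, xs, V)
                best = min(best, cost)
            V_new[j] = best
        if np.max(np.abs(V_new - V)) < 1e-6:
            break
        V = V_new
    # V now approximates the optimal cost-to-go on the grid.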
