Algorithms for Walking, Running, Swimming, Flying, and Manipulation
© Russ Tedrake, 2024
Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Spring 2024 semester. Lecture videos are available on YouTube.
While solving the dynamic programming problem for continuous systems is very hard in general, there are a few very important special cases where the solutions are very accessible. Most of these involve variants on the case of linear dynamics and convex (e.g. positive quadratic) cost. The simplest case, called the linear quadratic regulator (LQR), is formulated as stabilizing a time-invariant linear system to the origin.
The linear quadratic regulator is likely the most important and influential result in optimal control theory to date. In this chapter we will derive the basic algorithm and a variety of useful extensions.
Consider a linear time-invariant system in state-space form, $$\dot{\bx} = {\bf A}\bx + \bB\bu,$$ with the infinite-horizon cost function given by $$J = \int_0^\infty \left[ \bx^T {\bf Q} \bx + \bu^T {\bf R} \bu \right] dt, \quad {\bf Q} = {\bf Q}^T \succeq {\bf 0}, {\bf R} = {\bf R}^T \succ 0.$$ Our goal is to find the optimal cost-to-go function $J^*(\bx)$ which satisfies the HJB: $$\forall \bx, \quad 0 = \min_\bu \left[ \bx^T {\bf Q} \bx + \bu^T {\bf R} \bu + \pd{J^*}{\bx} \left( {\bf A}\bx + \bB\bu \right) \right].$$
There is one important step here -- it is well known that for this problem the optimal cost-to-go function is quadratic. This is easy to verify. Let us choose the form: $$J^*(\bx) = \bx^T {\bf S} \bx, \quad {\bf S} = {\bf S}^T \succeq 0.$$ The gradient of this function is $$\pd{J^*}{\bx} = 2 \bx^T {\bf S}.$$
Since we have guaranteed, by construction, that the terms inside the $\min$ are quadratic and convex (because ${\bf R} \succ 0$), we can take the minimum explicitly by finding the solution where the gradient of those terms vanishes: $$\pd{}{\bu} = 2\bu^T {\bf R} + 2 \bx^T {\bf S} \bB = 0.$$ This yields the optimal policy $$\bu^* = \pi^*(\bx) = - {\bf R}^{-1} \bB^T {\bf S} \bx = - \bK \bx.$$
Inserting this back into the HJB and simplifying yields $$0 = \bx^T
\left[ {\bf Q} - {\bf S B R}^{-1}\bB^T{\bf S} + 2{\bf SA} \right]\bx.$$ All
of the terms here are symmetric except for the $2{\bf SA}$, but since
$\bx^T{\bf SA}\bx = \bx^T{\bf A}^T{\bf S}\bx$, we can write $$0 = \bx^T
\left[ {\bf Q} - {\bf S B R}^{-1}\bB^T{\bf S} + {\bf SA} + {\bf A}^T{\bf S}
\right]\bx,$$ and since this condition must hold for all $\bx$, it is
sufficient to consider the matrix equation $$0 = {\bf S} {\bf A} + {\bf A}^T
{\bf S} - {\bf S} \bB {\bf R}^{-1} \bB^T {\bf S} + {\bf Q}.$$ This extremely
important equation is a version of the algebraic Riccati equation.
Note that it is quadratic in ${\bf S}$, making its solution non-trivial, but
it is well known that the equation has a single positive-definite solution
if and only if the system is controllable. Moreover, there are good numerical
methods for finding that solution, even in high-dimensional problems. Both
the optimal policy and optimal cost-to-go function are available in Drake from (K, S) = LinearQuadraticRegulator(A, B, Q, R).
If the appearance of the quadratic form of the cost-to-go seemed mysterious, consider that the solution to the linear system $\dot\bx = (\bA - \bB\bK)\bx$ takes the form $\bx(t) = e^{(\bA-\bB\bK)t}\bx(0)$, and try inserting this back into the integral cost function. You'll see that the cost takes the form $J=\bx^T(0) {\bf S} \bx(0)$.
It is worth examining the form of the optimal policy more closely. Since the value function represents cost-to-go, the optimal actions will move down this landscape as quickly as possible. Indeed, $-{\bf S}\bx$ is in the direction of steepest descent of the value function. However, not all directions are possible to achieve in state-space. $-\bB^T {\bf S} \bx$ represents precisely the projection of the steepest descent onto the action space, and is the steepest descent achievable with the control inputs $\bu$. Finally, the pre-scaling by the matrix ${\bf R}^{-1}$ biases the direction of descent to account for relative cost weightings that we have placed on the different control inputs. Note that although this interpretation is straightforward, the slope that we are descending (in the value function, ${\bf S}$) is a function that takes into account the long-term dynamics and cost.
Now we can use LQR to reproduce our HJB example from the previous chapter:
import numpy as np
from pydrake.all import LinearQuadraticRegulator
# Define the double integrator's state space matrices.
A = np.array([[0, 1], [0, 0]])
B = np.array([[0], [1]])
Q = np.eye(2)
R = np.eye(1)
(K, S) = LinearQuadraticRegulator(A, B, Q, R)
print("K = " + str(K))
print("S = " + str(S))
As in the hand-derived example, our numerical solution returns $${\bf K} = [ 1, \sqrt{3} ], \qquad{\bf S} = \begin{bmatrix} \sqrt{3} & 1 \\ 1 & \sqrt{3} \end{bmatrix}.$$
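As a quick sanity check (a minimal sketch using only numpy, with the values printed above), we can verify that this ${\bf S}$ satisfies the algebraic Riccati equation and that ${\bf K} = {\bf R}^{-1}\bB^T{\bf S}$:

import numpy as np

# Verify the printed solution: K = R^{-1} B^T S, and the Riccati residual is zero.
A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
Q, R = np.eye(2), np.eye(1)
S = np.array([[np.sqrt(3.), 1.], [1., np.sqrt(3.)]])

K = np.linalg.solve(R, B.T @ S)
riccati_residual = S @ A + A.T @ S - S @ B @ np.linalg.solve(R, B.T) @ S + Q
print(np.allclose(K, [[1., np.sqrt(3.)]]), np.allclose(riccati_residual, 0))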
LQR is extremely relevant to us even though our primary interest is in nonlinear dynamics, because it can provide a local approximation of the optimal control solution for the nonlinear system. Given the nonlinear system $\dot{\bx} = f(\bx,\bu)$, and a stabilizable operating point, $(\bx_0,\bu_0)$, with $f(\bx_0,\bu_0) = 0$, we can define a relative coordinate system $$\bar\bx = \bx - \bx_0, \quad \bar\bu = \bu - \bu_0,$$ and observe that $$\dot{\bar\bx} = \dot{\bx} = f(\bx,\bu),$$ which we can approximate with a first-order Taylor expansion to $$\dot{\bar\bx} \approx f(\bx_0,\bu_0) + \pd{f(\bx_0,\bu_0)}{\bx} (\bx - \bx_0) + \pd{f(\bx_0,\bu_0)}{\bu} (\bu - \bu_0) = {\bf A}\bar{\bx} + \bB\bar\bu.$$
Similarly, we can define a quadratic cost function in the error coordinates, or take a (positive-definite) second-order approximation of a nonlinear cost function about the operating point (linear and constant terms in the cost function can be easily incorporated into the derivation by parameterizing a full quadratic form for $J^*$, as seen in the Linear Quadratic Tracking derivation below).
The resulting controller takes the form $\bar\bu^* = -{\bf K}\bar\bx$ or $$\bu^* = \bu_0 - {\bf K} (\bx - \bx_0).$$ For convenience, you can call controller = LinearQuadraticRegulator(system, context, Q, R) in Drake directly on most dynamical systems (including block diagrams built up of many subsystems); it will perform the linearization for you.
LQR provides a very satisfying solution to the canonical "balancing" problem that we've already described for a number of model systems. Here is the notebook with those examples, again:
I find it very compelling that the same derivation (and effectively identical code) can stabilize such a diversity of systems!
Recall that the cost-to-go for finite-horizon problems is time-dependent, and therefore the HJB sufficiency condition requires an additional term for $\pd{J^*}{t}$. $$ \forall \bx, \forall t\in[t_0,t_f],\quad 0 = \min_\bu \left[ \ell(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu) + \pd{J^*}{t} \right]. $$
Consider systems governed by an LTI state-space equation of the form $$\dot{\bx} = {\bf A}\bx + \bB\bu,$$ and a finite-horizon cost function, $J = h(\bx(t_f)) + \int_0^{t_f} \ell(\bx(t), \bu(t)) dt,$ with \begin{gather*} h(\bx) = \bx^T {\bf Q}_f \bx,\quad {\bf Q}_f = {\bf Q}_f^T \succeq {\bf 0} \\ \ell(\bx,\bu) = \bx^T {\bf Q} \bx + \bu^T {\bf R} \bu, \quad {\bf Q} = {\bf Q}^T \succeq 0, {\bf R}={\bf R}^T \succ 0 \end{gather*} Writing the HJB, we have $$ 0 = \min_\bu \left[\bx^T {\bf Q} \bx + \bu^T {\bf R}\bu + \pd{J^*}{\bx} \left({\bf A} \bx + \bB \bu \right) + \pd{J^*}{t} \right]. $$ Due to the positive definite quadratic form on $\bu$, we can find the minimum by setting the gradient to zero: \begin{gather*} \pd{}{\bu} = 2 \bu^T {\bf R} + \pd{J^*}{\bx} \bB = 0 \\ \bu^* = \pi^*(\bx,t) = - \frac{1}{2}{\bf R}^{-1} \bB^T \pd{J^*}{\bx}^T \end{gather*} In order to proceed, we need to investigate a particular form for the cost-to-go function, $J^*(\bx,t)$. Let's try a solution of the form: $$J^*(\bx,t) = \bx^T {\bf S}(t) \bx, \quad {\bf S}(t) = {\bf S}^T(t)\succ {\bf 0}.$$ In this case we have $$\pd{J^*}{\bx} = 2 \bx^T {\bf S}(t), \quad \pd{J^*}{t} = \bx^T \dot{\bf S}(t) \bx,$$ and therefore \begin{gather*} \bu^* = \pi^*(\bx,t) = - {\bf R}^{-1} \bB^T {\bf S}(t) \bx \\ 0 = \bx^T \left[ {\bf Q} - {\bf S}(t) \bB {\bf R}^{-1} \bB^T {\bf S}(t) + {\bf S}(t) {\bf A} + {\bf A}^T {\bf S}(t) + \dot{\bf S}(t) \right] \bx.\end{gather*} Therefore, ${\bf S}(t)$ must satisfy the condition (known as the continuous-time differential Riccati equation): $$-\dot{\bf S}(t) = {\bf S}(t) {\bf A} + {\bf A}^T {\bf S}(t) - {\bf S}(t) \bB {\bf R}^{-1} \bB^T {\bf S}(t) + {\bf Q},$$ and the terminal condition $${\bf S}(t_f) = {\bf Q}_f.$$ Since we were able to satisfy the HJB with the minimizing policy, we have met the sufficiency condition, and have found the optimal policy and optimal cost-to-go function.
Note that the infinite-horizon LQR solution described in the prequel is exactly the steady-state solution of this equation, defined by $\dot{\bf S}(t) = 0$. Indeed, for controllable systems this equation is stable (backwards in time), and as expected the finite-horizon solution converges to the infinite-horizon solution as the horizon goes to infinity.
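To see this convergence numerically, here is a minimal sketch (assuming scipy, with the double integrator from above) that integrates the Riccati differential equation backwards from ${\bf S}(t_f) = {\bf Q}_f$ and compares ${\bf S}(0)$ against the steady-state solution:

import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import solve_continuous_are

# Double integrator; integrate the Riccati ODE backwards in time from S(tf) = Qf.
A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
Q, R, Qf = np.eye(2), np.eye(1), np.eye(2)

def riccati_rhs(t, s):
    S = s.reshape(2, 2)
    Sdot = -(S @ A + A.T @ S - S @ B @ np.linalg.solve(R, B.T) @ S + Q)
    return Sdot.ravel()

tf = 20.
sol = solve_ivp(riccati_rhs, [tf, 0.], Qf.ravel(), rtol=1e-8, atol=1e-8)
S0 = sol.y[:, -1].reshape(2, 2)

# For a long enough horizon, S(0) matches the infinite-horizon solution.
print(np.allclose(S0, solve_continuous_are(A, B, Q, R), atol=1e-4))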
The solution ${\bf S}(t)$ will be the result of numerical integration. On weakly controllable systems (like the perching airplane), I've found that even error-controlled integration can lead to crippling numerical artifacts such as the loss of symmetry or positive definiteness. For a more robust numerical solution, we can instead integrate a factorization of ${\bf S}$ that ensures these properties are maintained. Taking ${\bf S}(t) = {\bf P}(t){\bf P}^T(t)$, and again using the fact that $\bx^T{\bf P}\dot{\bf P}^T\bx = \bx^T\dot{\bf P}{\bf P}^T \bx$, we can write the "square-root form" of the Riccati differential equation: $$-\dot{\bf P}(t) = \bA^T {\bf P}(t) - \frac{1}{2}{\bf S}(t)\bB{\bf R}^{-1}\bB^T{\bf P}(t) + \frac{1}{2}{\bf Q}{\bf P}^{-T}(t), \quad {\bf P}(t_f) = {\bf Q}_f^\frac{1}{2}.$$ This form does require that ${\bf P}(t)$ is invertible, which in turn requires that ${\bf Q}_f \succ 0.$
The derivation above holds even if the dynamics are given by $$\dot{\bx} = {\bf A}(t)\bx + {\bf B}(t)\bu.$$ Similarly, the cost matrices ${\bf Q}$ and ${\bf R}$ can also be time-varying. This is quite surprising, since time-varying linear systems form a very general class of systems. It requires essentially no assumptions on how the time-dependence enters, except perhaps that if ${\bf A}$ or $\bB$ is discontinuous in time then one would have to use the proper techniques to accurately integrate the differential equation.
One of the most powerful applications of time-varying LQR involves linearizing around a nominal trajectory of a nonlinear system and using LQR to provide a trajectory controller. This will tie in very nicely with the algorithms we develop in the chapter on trajectory optimization.
Let us assume that we have a nominal trajectory, $\bx_0(t), \bu_0(t)$ defined for $t \in [t_1, t_2]$. Similar to the time-invariant analysis, we begin by defining a local coordinate system relative to the trajectory: $$\bar\bx(t) = \bx(t) - \bx_0(t), \quad \bar\bu(t) = \bu(t) - \bu_0(t).$$ Now we have $$\dot{\bar\bx} = \dot{\bx} - \dot{\bx}_0 = f(\bx,\bu) - f(\bx_0, \bu_0),$$ which we can again approximate with a first-order Taylor expansion to $$\dot{\bar\bx} \approx f(\bx_0,\bu_0) + \pd{f(\bx_0,\bu_0)}{\bx} (\bx - \bx_0) + \pd{f(\bx_0,\bu_0)}{\bu} (\bu - \bu_0) - f(\bx_0,\bu_0) = {\bf A}(t)\bar{\bx} + \bB(t)\bar\bu.$$ This is very similar to using LQR to stabilize a fixed-point, but with some important differences. First, the linearization is time-varying. Second, our linearization is valid for any state along a feasible trajectory (not just fixed-points), because the coordinate system is moving along with the trajectory.
Similarly, we can define a quadratic cost function in the error coordinates, or take a (positive-definite) second-order approximation of a nonlinear cost function along the trajectory (linear and constant terms in the cost function can be easily incorporated into the derivation by parameterizing a full quadratic form for $J^*$, as seen in the Linear Quadratic Tracking derivation below).
The resulting controller takes the form $\bar\bu^* = -{\bf K}(t)\bar\bx$ or $$\bu^* = \bu_0(t) - {\bf K}(t) (\bx - \bx_0(t)).$$ In Drake, this is provided by the FiniteHorizonLinearQuadraticRegulator method; if you pass it a nonlinear system it will perform the linearization in the proper coordinates for you using automatic differentiation.
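Here is a rough sketch of that workflow on the double integrator (stabilizing the origin rather than a real trajectory, just to keep it self-contained). The option and result field names below follow the pydrake API as I understand it, so treat this as an outline rather than a definitive recipe:

import numpy as np
from pydrake.all import (FiniteHorizonLinearQuadraticRegulator,
                         FiniteHorizonLinearQuadraticRegulatorOptions,
                         LinearSystem)

# Double integrator as a continuous-time LinearSystem.
A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
plant = LinearSystem(A, B, np.eye(2), np.zeros((2, 1)))
context = plant.CreateDefaultContext()
# Fix the nominal input so the system can be linearized about the context.
plant.get_input_port(0).FixValue(context, np.array([0.]))

Q = np.eye(2)
R = np.eye(1)
options = FiniteHorizonLinearQuadraticRegulatorOptions()
options.Qf = 10 * np.eye(2)   # terminal cost (assumed field name)

result = FiniteHorizonLinearQuadraticRegulator(plant, context, 0., 5., Q, R,
                                               options)
# result.S and result.K are trajectories; e.g. the time-varying gain at t = 1:
print(result.K.value(1.))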
Remember that stability is a statement about what happens as time goes to infinity. In order to talk about stabilizing a trajectory, the trajectory must be defined for all $t \in [t_1, \infty)$. This can be accomplished by considering a finite-time trajectory which terminates at a stabilizable fixed-point at a time $t_2 \ge t_1$, and remains at the fixed point for $t \ge t_2$. In this case, the finite-horizon Riccati equation is initialized with the infinite-horizon LQR solution: $S(t_2) = S_\infty$, and solved backwards in time from $t_2$ to $t_1$ for the remainder of the trajectory. And now we can say that we have stabilized the trajectory!
For completeness, we consider a slightly more general form of the linear quadratic regulator. The standard LQR derivation attempts to drive the system to zero. Consider now the problem: \begin{gather*} \dot{\bx} = {\bf A}\bx + \bB\bu \\ h(\bx) = (\bx - \bx_d(t_f))^T {\bf Q}_f (\bx - \bx_d(t_f)), \quad {\bf Q}_f = {\bf Q}_f^T \succeq 0 \\ \ell(\bx,\bu,t) = (\bx - \bx_d(t))^T {\bf Q} (\bx - \bx_d(t)) + (\bu - \bu_d(t))^T {\bf R} (\bu - \bu_d(t)),\\ {\bf Q} = {\bf Q}^T \succeq 0, {\bf R}={\bf R}^T \succ 0 \end{gather*} Now, guess a solution \begin{gather*} J^*(\bx,t) = \bx^T {\bf S}_{xx}(t) \bx + 2\bx^T {\bf s}_x(t) + s_0(t), \quad {\bf S}_{xx}(t) = {\bf S}_{xx}^T(t)\succ {\bf 0}. \end{gather*} In this case, we have $$\pd{J^*}{\bx} = 2 \bx^T {\bf S}_{xx}(t) + 2{\bf s}_{x}^T(t),\quad \pd{J^*}{t} = \bx^T \dot{\bf S}_{xx}(t) \bx + 2\bx^T \dot{\bf s}_x(t) + \dot{s}_0(t).$$ Using the HJB, $$ 0 = \min_\bu \left[(\bx - \bx_d(t))^T {\bf Q} (\bx - \bx_d(t)) + (\bu - \bu_d(t))^T {\bf R} (\bu - \bu_d(t)) + \pd{J^*}{\bx} \left({\bf A} \bx + \bB \bu \right) + \pd{J^*}{t} \right], $$ we have \begin{gather*} \pd{}{\bu} = 2 (\bu - \bu_d(t))^T{\bf R} + (2\bx^T{\bf S}_{xx}(t) + 2{\bf s}_x^T(t))\bB = 0,\\ \bu^*(t) = \bu_d(t) - {\bf R}^{-1} \bB^T\left[{\bf S}_{xx}(t)\bx + {\bf s}_x(t)\right] \end{gather*} The HJB can be satisfied by integrating backwards \begin{align*} -\dot{\bf S}_{xx}(t) =& {\bf Q} - {\bf S}_{xx}(t) \bB {\bf R}^{-1} {\bf B}^T {\bf S}_{xx}(t) + {\bf S}_{xx}(t) {\bf A} + {\bf A}^T {\bf S}_{xx}(t)\\ -\dot{\bf s}_x(t) =& - {\bf Q} \bx_d(t) + \left[{\bf A}^T - {\bf S}_{xx} \bB {\bf R}^{-1} \bB^T \right]{\bf s}_x(t) + {\bf S}_{xx}(t) \bB \bu_d(t)\\ -\dot{s}_0(t) =& \bx_d(t)^T {\bf Q} \bx_d(t) - {\bf s}_x^T(t) \bB {\bf R}^{-1} \bB^T {\bf s}_x(t) + 2{\bf s}_x(t)^T \bB \bu_d(t), \end{align*} from the final conditions \begin{align*} {\bf S}_{xx}(t_f) =& {\bf Q}_f\\ {\bf s}_{x}(t_f) =& -{\bf Q}_f \bx_d(t_f) \\ s_0(t_f) =& \bx_d^T(t_f) {\bf Q}_f \bx_d(t_f). \end{align*} Notice that the solution for ${\bf S}_{xx}$ is the same as the simpler LQR derivation, and is symmetric (as we assumed). Note also that $s_0(t)$ has no effect on control (even indirectly), and so can often be ignored.
A quick observation about the quadratic form, which might be helpful in debugging. We know that $J(\bx,t)$ must be uniformly positive. This is true iff ${\bf S}_{xx}\succ 0$ and $s_0 > {\bf s}_x^T {\bf S}_{xx}^{-1} {\bf s}_x$, which comes from evaluating the function at $\bx_{min}(t)$ defined by $\left[ \pd{J^*(\bx,t)}{\bx} \right]_{\bx=\bx_{min}(t)} = 0$.
The finite-horizon LQR formulation can be used to impose a strict final boundary value condition by setting an infinite ${\bf Q}_f$. However, integrating the Riccati equation backwards from an infinite initial condition isn't very practical. To get around this, let us consider solving for ${\bf P}(t) = {\bf S}(t)^{-1}$. Using the matrix relation $\frac{d {\bf S}^{-1}}{dt} = - {\bf S}^{-1} \frac{d {\bf S}}{dt} {\bf S}^{-1}$, we have: $$-\dot{\bf P}(t) = - {\bf P}(t){\bf Q P}(t) + \bB {\bf R}^{-1} \bB^T - {\bf A P}(t) - {\bf P}(t){\bf A}^T,$$ with the final condition $${\bf P}(t_f) = 0.$$ This Riccati equation can be integrated backwards in time for a solution.
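A minimal numerical sketch of this trick for the double integrator (assuming scipy); ${\bf S}(t) = {\bf P}^{-1}(t)$ blows up as $t \rightarrow t_f$, which is exactly the strict final-value condition:

import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
Q, R = np.eye(2), np.eye(1)

def inv_riccati_rhs(t, p):
    # Pdot = P Q P - B R^{-1} B^T + A P + P A^T (the equation above, negated).
    P = p.reshape(2, 2)
    Pdot = P @ Q @ P - B @ np.linalg.solve(R, B.T) + A @ P + P @ A.T
    return Pdot.ravel()

tf = 2.
sol = solve_ivp(inv_riccati_rhs, [tf, 0.], np.zeros(4), rtol=1e-8, atol=1e-8)
P0 = sol.y[:, -1].reshape(2, 2)

S0 = np.linalg.inv(P0)             # finite away from tf
K0 = np.linalg.solve(R, B.T @ S0)  # gain at t = 0
print(K0)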
It is very interesting, and powerful, to note that, if one chooses ${\bf Q} = 0$, therefore imposing no state cost on the trajectory before the final time $t_f$, then this inverse Riccati equation becomes a linear ODE which can be solved explicitly.
Essentially all of the results above have a natural correlate for discrete-time systems. What's more, the discrete-time versions tend to be simpler to think about in the model-predictive control (MPC) setting that we'll be discussing below and in the next chapters.
Consider the discrete time dynamics: $$\bx[n+1] = {\bf A}\bx[n] + {\bf
B}\bu[n],$$ and we wish to minimize $$\min \sum_{n=0}^{N-1} \bx^T[n] {\bf
Q} \bx[n] + \bu^T[n] {\bf R} \bu[n], \qquad {\bf Q} = {\bf Q}^T \succeq 0,
{\bf R} = {\bf R}^T \succ 0.$$ The cost-to-go is given by $$J(\bx,n-1) =
\min_\bu \bx^T {\bf Q} \bx + \bu^T {\bf R} \bu + J({\bf A}\bx + {\bf
B}\bu, n).$$ If we once again take $$J(\bx,n) = \bx^T {\bf S[n]} \bx,
\quad {\bf S}[n] = {\bf S}^T[n] \succ 0,$$ then we have $$\bu^*[n] = -{\bf
K[n]}\bx[n] = -({\bf R} + {\bf B}^T {\bf S}[n] {\bf B})^{-1} {\bf B}^T
{\bf S}[n] {\bf A} \bx[n],$$ yielding $${\bf S}[n-1] = {\bf Q} + {\bf
A}^T{\bf S}[n]{\bf A} - ({\bf A}^T{\bf S}[n]{\bf B})({\bf R} + {\bf B}^T
{\bf S}[n] {\bf B})^{-1} ({\bf B}^T {\bf S}[n]{\bf A}), \quad {\bf S}[N]
= 0,$$ which is the famous Riccati difference equation. The
infinite-horizon LQR solution is given by the (positive-definite)
fixed-point of this equation: $${\bf S} = {\bf Q} + {\bf A}^T{\bf S}{\bf
A} - ({\bf A}^T{\bf S}{\bf B})({\bf R} + {\bf B}^T {\bf S} {\bf B})^{-1}
({\bf B}^T {\bf S}{\bf A}).$$ Like in the continuous time case, this
equation is so important that we have special numerical recipes for
solving this discrete-time algebraic Riccati equation (DARE). In Drake, you can obtain the discrete-time solution by calling the same LinearQuadraticRegulator method on a system that has only discrete state and a single periodic time step.
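For example, here is a small sketch (using scipy and an illustrative zero-order-hold discretization of the double integrator) that iterates the Riccati difference equation backwards from ${\bf S}[N] = 0$ and compares the fixed point against a DARE solver:

import numpy as np
from scipy.linalg import solve_discrete_are

# Zero-order-hold discretization of the double integrator with time step h.
h = 0.1
A = np.array([[1., h], [0., 1.]])
B = np.array([[0.5 * h * h], [h]])
Q, R = np.eye(2), np.eye(1)

S = np.zeros((2, 2))
for _ in range(1000):
    # S[n-1] = Q + A^T S[n] A - (A^T S[n] B)(R + B^T S[n] B)^{-1}(B^T S[n] A),
    # written here using the gain K.
    K = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
    S = Q + A.T @ S @ (A - B @ K)

print(np.allclose(S, solve_discrete_are(A, B, Q, R), atol=1e-6))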
You can explore the relationship between the discrete-time and continuous-time formulations in this notebook:
In reinforcement learning, it is popular to consider the
infinite-horizon "discounted" cost: $\min \sum_{n=0}^\infty \gamma^n
(\bx^T[n]{\bf Q}\bx[n] + \bu^T[n]{\bf R}\bu[n]).$ The optimal controller
is $$\bu^* = -\gamma(\bR + \gamma \bB^T {\bf S} \bB)^{-1} \bB^T {\bf S}
\bA \bx,$$ and the corresponding Riccati equation is $${\bf S} = {\bf Q} +
\gamma {\bf A}^T{\bf S}{\bf A} - \gamma^2({\bf A}^T{\bf S}{\bf B})({\bf
R} + \gamma {\bf B}^T {\bf S} {\bf B})^{-1} ({\bf B}^T {\bf S}{\bf A}),$$
whose solution can be found using
DiscreteAlgebraicRiccatiEquation
$(\sqrt{\gamma} {\bf A},
{\bf B}, {\bf Q}, \frac{1}{\gamma} {\bf R}).$
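Here is a quick numerical check of that $\sqrt{\gamma}$ transformation, using scipy's DARE solver in place of Drake's DiscreteAlgebraicRiccatiEquation (the discrete-time double integrator again is just an illustrative choice):

import numpy as np
from scipy.linalg import solve_discrete_are

h, gamma = 0.1, 0.9
A = np.array([[1., h], [0., 1.]])
B = np.array([[0.5 * h * h], [h]])
Q, R = np.eye(2), np.eye(1)

S = solve_discrete_are(np.sqrt(gamma) * A, B, Q, R / gamma)

# S should satisfy the discounted Riccati equation above:
residual = (Q + gamma * A.T @ S @ A
            - gamma ** 2 * A.T @ S @ B @ np.linalg.solve(
                R + gamma * B.T @ S @ B, B.T @ S @ A) - S)
print(np.allclose(residual, 0))

# ... and the corresponding discounted LQR gain:
K = gamma * np.linalg.solve(R + gamma * B.T @ S @ B, B.T @ S @ A)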
In the dynamic programming chapter, we developed general purpose tools for approximating value functions with function approximators. I think it is particularly illustrative to see how these work out for LQR. Let's consider the discrete-time, infinite-horizon, discounted case.
For LQR, we know that the optimal value function will take a quadratic form, $\bx^T {\bf S}\bx.$ Although it is quadratic in $\bx$, this form is linear in the parameters, ${\bf S}$. We can leverage some of the special tools for fitted value iteration with a linear function approximator.
I've included two versions of the value iteration update in the notebook -- one that samples over both $\bx$ and $\bu$, and one that samples only over $\bx$ and uses the LQR policy (given the current estimated cost-to-go) to determine $\bu$. The difference is not subtle when $\gamma \rightarrow 1$. Take a look!
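Here is a rough standalone sketch of that second variant (sampling only over $\bx$ and using the LQR policy for the current cost-to-go estimate), for the discounted discrete-time double integrator; the sampling region and iteration count are arbitrary choices:

import numpy as np
from scipy.linalg import solve_discrete_are

h, gamma = 0.1, 0.9
A = np.array([[1., h], [0., 1.]])
B = np.array([[0.5 * h * h], [h]])
Q, R = np.eye(2), np.eye(1)

def features(x):
    # J_hat(x) = alpha . [x1^2, 2*x1*x2, x2^2], i.e. linear in the parameters.
    return np.array([x[0] ** 2, 2 * x[0] * x[1], x[1] ** 2])

def S_from(alpha):
    return np.array([[alpha[0], alpha[1]], [alpha[1], alpha[2]]])

rng = np.random.default_rng(0)
X = rng.uniform(-1., 1., size=(500, 2))     # state samples
Phi = np.array([features(x) for x in X])
alpha = np.zeros(3)                          # initialize J_hat = 0

for _ in range(500):
    S = S_from(alpha)
    # LQR policy for the current estimated cost-to-go.
    K = gamma * np.linalg.solve(R + gamma * B.T @ S @ B, B.T @ S @ A)
    targets = np.array([x @ Q @ x + (K @ x) @ R @ (K @ x)
                        + gamma * (A @ x - B @ (K @ x)) @ S @ (A @ x - B @ (K @ x))
                        for x in X])
    # Least-squares fit of the linear-in-parameters value function.
    alpha, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

print(np.allclose(S_from(alpha),
                  solve_discrete_are(np.sqrt(gamma) * A, B, Q, R / gamma),
                  atol=1e-6))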
We must remember that there are multiple solutions to the Riccati equation -- the solution to the LQR problem is the (unique) positive definite solution. But there are non-positive definite solutions, too, which lead to unstable controllers -- these solutions do achieve zero Bellman residual. Every solution to the Riccati equation is a fixed point of the (fitted) value iteration update, but only the positive-definite solution is a stable fixed point of the algorithm.
To see this, write $\hat{J}(\bx) = \bx^{T} ({\bf S}^* + {\bf \Delta})\bx$, where ${\bf S}^*$ is any solution to the Riccati equation, and ${\bf \Delta}$ is a small deviation matrix. If we write the error dynamics through the fitted value iteration update, we can see that $${\bf \Delta}_{i+1} = (\bA - \bB\bK_i)^T {\bf \Delta}_i (\bA - \bB \bK_i),$$ where ${\bf K}_i$ is the optimal controller given ${\bf S}_i = {\bf S}^* + {\bf \Delta}_i$, so ${\bf K}_i \approx \bK^*.$ This error converges to zero if and only if $(\bA - \bB\bK^*)$ is stable, which is the case only for ${\bf S}^* \succ 0$.
Go ahead and try many different initial ${\bf S}$ in the code to verify it for yourself.
A natural extension for linear optimal control is the consideration of strict constraints on the inputs or state trajectory. Most common are linear inequality constraints, such as $\forall n, |\bu[n]| \le 1$ or $\forall n, \bx[n] \ge -2$ (any linear constraints of the form ${\bf Cx} + {\bf Du} \le {\bf e}$ can be handled with the same tools). Much is known about the solution to this problem in the discrete-time case, but its computation is significantly harder than the unconstrained case. Almost always, we give up on solving for the best control policy in closed form, and instead solve for the optimal control trajectory $\bu[\cdot]$ from a particular initial condition $\bx[0]$ over some finite horizon. Fortunately, this problem is a convex optimization and we can often solve it quickly and reliably enough to solve it at every time step, effectively turning a motion planning algorithm into a feedback controller; this idea is famously known as model-predictive control (MPC). We will provide the details in the trajectory optimization chapter.
One important case that does have closed-form solutions is LQR with linear equality constraints. This is particularly relevant in the case of stabilizing robots with kinematic constraints such as a closed kinematic chain, which appears in four-bar linkages or even for the linearization of a biped robot with two feet on the ground.
Our formulation is as follows: in addition to the linear dynamics (in continuous time or discrete time), we have a linear equality constraint of the form ${\bf F}\bx = 0,$ with ${\bf F} \in \Re^{(n-d) \times n}.$ Form a matrix ${\bf P} \in \Re^{d \times n}$ whose rows form an orthonormal basis for the nullspace of ${\bf F},$ such that $${\bf PP}^T = {\bf I}^{d \times d}, \quad {\bf PF}^T = {\bf 0}^{d \times (n-d)}.$$ Then for any $\bx$ we can write $\bx = {\bf P}^T \by + {\bf F}^T \bz,$ for some $\by \in \Re^d$ and $\bz \in \Re^{n-d}.$ However, as a result of the constraint ${\bf F}\bx = 0$, we know that $\bz = 0.$ Instead of stabilizing $\bx$ via LQR, we will stabilize $\by$, using the projected cost and dynamics given by $$\bA_\by = {\bf P}\bA_\bx{\bf P}^T, \qquad \bB_\by = {\bf P}\bB_\bx, \qquad \bQ_\by = {\bf P}\bQ_\bx{\bf P}^T,$$ where $\bA_\bx,\bB_\bx,$ and $\bQ_\bx$ are the matrices from the full-coordinates model. I've chosen to write it this way because it is the same in both discrete and continuous time. The resulting cost-to-go is given by ${\bf S}_\bx = {\bf P}^T{\bf S}_\by{\bf P}$ and the controller is given by $\bK_\bx = \bK_\by {\bf P},$ where $\bK_\by, {\bf S}_\by$ are the solutions to the reduced LQR problem in $\by.$
More generally, for a nonlinear system with nonlinear equality
constraints, we would linearize both the system dynamics and the
constraint, and then apply the derivation above.
A nice example of this is the task of balancing a Segway in the $x-z$
plane. In 3D, one could either consider the nonholonomic wheels of the
Segway, or turn it into a Ballbot. For the
purposes of simulation, one can simply construct the ground, the ball,
and the bot with appropriate collision geometry, and drop it into
MultibodyPlant
. The frictional contact between the ball
and the ground generates the expected dynamics.
Designing the LQR controller is another story. We will use a
PlanarJoint
for the floating base, giving us 4 degrees of
freedom: $[x, z, \theta_{ball}, \theta_{bot-ball}].$ If we linearize
this model and call LQR, the LQR call will fail because the system
is not controllable in the full coordinates. The first reason is
easy to see: there is nothing that the actuators can do to change $z$,
the vertical position of the ball. Perhaps the more subtle reason is
that there is a (holonomic) constraint from the ball rolling without
slip on the ground: $x = -r \theta_{ball},$ where $r$ is
the radius of the ball.
I will use one small implementation "trick". In my code, rather than
deal with finding the resting penetration of the ball, I decided to
remove the vertical degree of freedom completely, and used a prismatic
joint for $x$ and a revolute joint for $\theta_{ball}$ instead of the
PlanarJoint
for the floating base. I configured it so that
the ball was sufficiently in penetration that the normal forces from
the ground were substantial enough that the friction cone could support
the horizontal frictional forces I needed to roll. This effectively
implements the $z$ constraint by construction (completely removing that
degree of freedom).
To use LQR, we will write the constraints $x = -r\theta_{ball}$ and
the implied $\dot{x} = -r\dot\theta_{ball}$ as $${\bf Fx} =
\begin{bmatrix}1 & r & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & r & 0
\end{bmatrix} \bx = 0.$$ The nullspace matrix ${\bf P}$ can be obtained
numerically, but here it was simple enough to get from inspection,
$${\bf P} = \begin{bmatrix} \frac{r}{\sqrt{1+r^2}} &
-\frac{1}{\sqrt{1+r^2}} & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 &
0 & 0 & \frac{r}{\sqrt{1+r^2}} & -\frac{1}{\sqrt{1+r^2}} & 0 \\ 0 & 0 &
0 & 0 & 0 & 1 \end{bmatrix}.$$ We can obtain the matrices $\bA_x \in
\Re^{6 \times 6}, \bB_x \in \Re^{6 \times 1}$ by linearizing the
MultibodyPlant
dynamics. Then we will call LQR with the
projected matrices. Note that we will use the discrete-time dynamics mode of the MultibodyPlant (by passing a non-zero time_step to the constructor), because the discrete-time dynamics engine contains our most advanced and robust algorithms for simulating contact.
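Here is a small sketch of just the projection step (the radius is an arbitrary placeholder, and the linearization itself would come from the MultibodyPlant, so it is only indicated in comments):

import numpy as np
from scipy.linalg import null_space

# Rolling constraint x = -r*theta_ball, on both positions and velocities.
r = 0.25
F = np.array([[1., r, 0., 0., 0., 0.],
              [0., 0., 0., 1., r, 0.]])

# Any orthonormal basis for null(F) works, not only the one written by
# inspection in the text.
P = null_space(F).T
assert np.allclose(P @ P.T, np.eye(4))
assert np.allclose(P @ F.T, 0)

# With A_x, B_x from linearizing the plant and a full-coordinates cost Q_x:
#   A_y = P @ A_x @ P.T,  B_y = P @ B_x,  Q_y = P @ Q_x @ P.T,
# and the gain in the full coordinates is recovered as K_x = K_y @ P.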
Importantly, although we design the LQR controller using the reduced model, we can run this controller directly on the original model.
Consider a linear system given by the state-space equations, $${\bf
E}\dot\bx = \bA\bx + \bB\bu,$$ where ${\bf E}$ may or may not be
invertible. We consider this model to be in "implicit form", and will
discuss much more about the use of implicit forms, and their strong
connections to mechanical systems, in
the next chapter. Under stabilizability and detectability
conditions
One can also design the LQR gains using linear matrix inequalities (LMIs). I will defer the derivation until we cover the policy gradient view of LQR, because the LMI formulation is based on a change of variables from the basic policy evaluation criterion. If you want to look ahead, you can find that formulation here.
Solving the algebraic Riccati equation is still the preferred way of computing the LQR solution. But it is helpful to know that one could also compute it with convex optimization. In addition to deepening our understanding, this can be useful for generalizing the basic LQR solution (e.g. for robust stabilization) or to solve for the LQR gains jointly as part of a bigger optimization.
We can also obtain the solution to the discrete-time finite-horizon
(including the time-varying or tracking variants) LQR problem using
optimization -- in this case it actually reduces to a simple
least-squares problem. The presentation in this section can be viewed as
a simple implementation of the Youla
parameterization (sometimes called "Q-parameterization") from
controls. Small variations on the formulation here play an important
role in the minimax variants of LQR (which optimize a worst-case
performance), which we will discuss in the robust control chapter.
First, let us appreciate that the default parameterization is not convex. Given \begin{gather*} \min \sum_{n=0}^{N-1} \bx^T[n] {\bf Q} \bx[n] + \bu^T[n] {\bf R} \bu[n], \qquad {\bf Q} = {\bf Q}^T \succeq 0, {\bf R} = {\bf R}^T \succ 0 \\ \subjto \quad \bx[n+1] = {\bf A} \bx[n] + {\bf B}\bu[n], \\ \bx[0] = \bx_0 \end{gather*} if we wish to search over controllers of the form $$\bu[n] = {\bf K}_n \bx[n],$$ then we have \begin{align*}\bx[1] &= {\bf A}\bx_0 + {\bf B}{\bf K}_0\bx_0, \\ \bx[2] &= {\bf A}({\bf A} + {\bf BK}_0)\bx_0 + {\bf BK}_1({\bf A} + {\bf BK}_0)\bx_0 \\ \bx[n] &= \left( \prod_{i=0}^{n-1} ({\bf A} + {\bf BK}_i) \right) \bx_0 \end{align*} As you can see, the $\bx[n]$ terms in the cost function include our decision variables multiplied together -- resulting in a non-convex objective. The trick is to re-parameterize the decision variables, and write the feedback in the form: $$\bu[n] = \tilde{\bf K}_n \bx_0,$$ leading to \begin{align*}\bx[1] &= {\bf A}\bx_0 + {\bf B}\tilde{\bf K}_0\bx_0, \\ \bx[2] &= {\bf A}({\bf A} + {\bf B}\tilde{\bf K}_0)\bx_0 + {\bf B}\tilde{\bf K}_1 \bx_0 \\ \bx[n] &= \left( {\bf A}^n + \sum_{i=0}^{n-1}{\bf A}^{n-i-1}{\bf B}\tilde{\bf K}_{i} \right) \bx_0 \end{align*} Now all of the decision variables, $\tilde{\bf K}_i$, appear linearly in the solution to $\bx[n]$ and therefore (convex) quadratically in the objective.
We still have an objective function that depends on $\bx_0$, but we would like to find the optimal $\tilde{\bf K}_i$ for all $\bx_0$. To achieve this let us evaluate the optimality conditions of this least squares problem, starting by taking the gradient of the objective with respect to $\tilde{\bf K}_i$, which is: $$\bx_0 \bx_0^T \left( \tilde{\bf K}_i^T \left({\bf R} + \sum_{m=i+1}^{N-1} {\bf B}^T ({\bf A}^{m-i-1})^T {\bf Q A}^{m-i-1} {\bf B}\right) + \sum_{m=i+1}^{N-1} ({\bf A}^m)^T {\bf Q A}^{m-i-1} {\bf B}\right).$$ We can satisfy this optimality condition for all $\bx_0$ by solving the linear matrix equation: $$\tilde{\bf K}_i^T \left({\bf R} + \sum_{m=i+1}^{N-1} {\bf B}^T ({\bf A}^{m-i-1})^T {\bf Q A}^{m-i-1} {\bf B}\right) + \sum_{m=i+1}^{N-1} ({\bf A}^m)^T {\bf Q A}^{m-i-1} {\bf B} = 0.$$ We can always solve for $\tilde{\bf K}_i$ since it's multiplied by a (symmetric) positive definite matrix (it is the sum of a positive definite matrix and many positive semi-definite matrices), which is always invertible.
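To make this concrete, here is a minimal numerical sketch (an illustrative discrete-time double integrator) that forms the lifted least-squares problem in the $\tilde{\bf K}$ parameterization directly (rather than solving the per-$\tilde{\bf K}_i$ matrix equation above) and checks the optimal cost against the Riccati difference equation:

import numpy as np

h, N = 0.1, 20
A = np.array([[1., h], [0., 1.]])
B = np.array([[0.5 * h * h], [h]])
Q, R = np.eye(2), np.eye(1)
nx, nu = 2, 1

# Lifted ("batch") matrices: X = Phi x0 + Gamma U, where X stacks x[1..N-1]
# and U stacks u[0..N-1].
Phi = np.vstack([np.linalg.matrix_power(A, n) for n in range(1, N)])
Gamma = np.zeros(((N - 1) * nx, N * nu))
for n in range(1, N):
    for j in range(n):
        Gamma[(n - 1) * nx:n * nx, j * nu:(j + 1) * nu] = (
            np.linalg.matrix_power(A, n - 1 - j) @ B)

Qbar = np.kron(np.eye(N - 1), Q)
Rbar = np.kron(np.eye(N), R)

# Minimizing (Phi x0 + Gamma U)^T Qbar (.) + U^T Rbar U over U gives
# U = Ktilde_stack @ x0; the stacked gains are independent of x0.
Ktilde_stack = -np.linalg.solve(Gamma.T @ Qbar @ Gamma + Rbar,
                                Gamma.T @ Qbar @ Phi)

# Sanity check: the optimal cost should equal x0^T S[0] x0 from the Riccati
# difference equation with S[N] = 0.
S = np.zeros((2, 2))
for _ in range(N):
    K = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
    S = Q + A.T @ S @ (A - B @ K)

x0 = np.array([1., 0.])
U = Ktilde_stack @ x0
X = Phi @ x0 + Gamma @ U
cost = x0 @ Q @ x0 + X @ Qbar @ X + U @ Rbar @ U
print(np.isclose(cost, x0 @ S @ x0))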
If you need to recover the original ${\bf K}_i$ parameters, you can extract them recursively with \begin{align*} \tilde{\bf K}_0 &= {\bf K}_0, \\ \tilde{\bf K}_n &= {\bf K}_n \prod_{i=0}^{n-1} ({\bf A} + {\bf BK}_i), \qquad 0 < n \le N-1. \end{align*} But often this is not actually necessary. In some applications it's enough to know the performance cost under LQR control, or to handle the response to disturbances explicitly with the disturbance-based feedback (which I've already promised for the robust control chapter). After all, the problem formulation that we've written here, which makes no mention of disturbances, assumes the model is perfect and the controls $\tilde{\bf K}_n \bx_0$ are just as suitable for deployment as ${\bf K}_n\bx[n]$.
"System-Level Synthesis" (SLS) is the name for an important and
slightly different approach, where one optimizes the closed-loop
response directly.
Once again, the algorithms presented here are not as efficient as
solving the Riccati equation if we only want the solution to the simple
case, but they become very powerful if we want to combine the LQR
synthesis with other objectives/constraints. For instance, if we want to
add some sparsity constraints (e.g. enforcing that some elements of
$\tilde{\bf K}_i$ are zero), then we could solve the quadratic
optimization subject to linear equality constraints.
When we are designing LQR controllers based on a linearization (around some operating state or trajectory) of a nonlinear system, then it is interesting to ask: how does the LQR solution change as you change the nominal point (or trajectory)? If the nonlinear dynamics are smooth, does the solution also change smoothly?
The answer is yes -- the Riccati solution does change smoothly, and
its local variation can be obtained with a little bit of work by taking
a Taylor approximation of the $\bA$ and $\bB$ matrices.
To help you get a little more familiar with Drake, one of the exercises will step you through the process of building a Drake Diagram and designing an LQR controller for it. Systems and Diagrams are the way we write modular code describing dynamical systems, and implementing an algorithm like LQR against the System abstraction becomes very powerful.
For completeness, I've included here the derivation for continuous-time finite-horizon LQR with all of the bells and whistles.
Consider a time-varying affine (approximation of a) continuous-time dynamical system in state-space form: $$\dot{\bx} = {\bf A}(t)\bx + {\bf B}(t)\bu + {\bf c}(t),$$ and a running cost function in the general quadratic form: \begin{gather*} \ell(t, \bx,\bu) = \begin{bmatrix} \bx \\ 1 \end{bmatrix}^T {\bf Q}(t) \begin{bmatrix} \bx \\ 1 \end{bmatrix} + \begin{bmatrix} \bu \\ 1 \end{bmatrix}^T {\bf R}(t) \begin{bmatrix} \bu \\ 1 \end{bmatrix} + 2\bx^T{\bf N}(t)\bu, \\ \forall t\in[t_0, t_f], \quad \bQ(t) = \begin{bmatrix} \bQ_{xx}(t) & \bq_x(t) \\ \bq_x^T(t) & q_0(t) \end{bmatrix}, \bQ_{xx}(t) \succeq 0, \quad \bR(t) = \begin{bmatrix} \bR_{uu}(t) & {\bf r}_u(t) \\ {\bf r}_u^T(t) & r_0(t) \end{bmatrix}, \bR_{uu}(t) \succ 0.\end{gather*} Observe that our LQR "optimal tracking" derivation fits in this form, as we can always write $$(\bx - \bx_d(t))^T \bQ_t (\bx - \bx_d(t)) + (\bu - \bu_d(t))^T \bR_t (\bu - \bu_d(t)) + 2 (\bx - \bx_d(t))^T {\bf N}_t (\bu - \bu_d(t)),$$ by taking \begin{gather*} \bQ_{xx} = \bQ_t,\quad \bq_x = -\bQ_t \bx_d - {\bf N}_t\bu_d, \quad q_0 = \bx_d^T \bQ_t \bx_d + 2 \bx_d^T {\bf N}_t \bu_d, \\ \bR_{uu} = \bR_t, \quad {\bf r}_u = -\bR_t \bu_d - {\bf N}_t^T \bx_d, \quad r_0 = \bu_d^T \bR_t \bu_d, \quad {\bf N} = {\bf N}_t.\end{gather*} Of course, we can also add a quadratic final cost with $\bQ_f$. Let's search for a positive quadratic, time-varying cost-to-go function of the form: \begin{gather*} J(t, \bx)=\begin{bmatrix} \bx \\ 1 \end{bmatrix}^T {\bf S}(t) \begin{bmatrix} \bx \\ 1 \end{bmatrix}, \quad {\bf S}(t) = \begin{bmatrix} {\bf S}_{xx}(t) & {\bf s}_x(t) \\ {\bf s}_x^T(t) & s_0(t) \end{bmatrix}, {\bf S}_{xx}(t) \succ 0, \\ \frac{\partial J}{\partial \bx} = 2\bx^T{\bf S}_{xx} + 2{\bf s}_x^T, \quad \frac{\partial J}{\partial t} = \begin{bmatrix} \bx \\ 1 \end{bmatrix}^T\dot{\bf S} \begin{bmatrix} \bx \\ 1 \end{bmatrix}. \end{gather*} Writing out the HJB: \begin{gather*} \min_\bu \left[\ell(\bx,\bu) + \frac{\partial J}{\partial \bx} \left[{\bf A}(t)\bx + {\bf B}(t)\bu + {\bf c}(t) \right] + \frac{\partial J}{\partial t} \right] = 0, \end{gather*} we can find the minimizing $\bu$ with \begin{gather*} \frac{\partial}{\partial \bu} = 2\bu^T{\bf R}_{uu} + 2{\bf r}_u^T + 2\bx^T{\bf N} + (2\bx^T{\bf S}_{xx} + 2{\bf s}_x^T){\bf B} = 0 \\ \bu^* = -{\bf R}_{uu}^{-1} \begin{bmatrix} {\bf N} + {\bf S}_{xx} \bB \\ {\bf r}^T_{u} + {\bf s}^T_{x}\bB \end{bmatrix}^T \begin{bmatrix} \bx \\ 1 \end{bmatrix} = -{\bf K}(t) \begin{bmatrix} \bx \\ 1 \end{bmatrix} = -{\bf K}_x(t) \bx - {\bf k}_0(t). \end{gather*} Inserting this back into the HJB gives us the updated Riccati differential equation. Since this must hold for all $\bx$, we can collect the quadratic, linear, and offset terms and set them each individually equal to zero, yielding: \begin{align*} -\dot{\bf S}_{xx} =& \bQ_{xx} - ({\bf N} + {\bf S}_{xx} \bB){\bf R}_{uu}^{-1}({\bf N} + {\bf S}_{xx} \bB)^T + {\bf S}_{xx}{\bf A} + {\bf A}^T{\bf S}_{xx}, \\ -\dot{\bf s}_x =& \bq_x - ({\bf N} + {\bf S}_{xx} \bB){\bf R}_{uu}^{-1} ({\bf r}_u + \bB^T {\bf s}_x ) + {\bf A}^T{\bf s}_x + {\bf S}_{xx}{\bf c}, \\ -\dot{s}_0 =& q_0 + r_0 - ({\bf r}_u + \bB^T {\bf s}_x)^T{\bf R}_{uu}^{-1} ({\bf r}_u + \bB^T {\bf s}_x) + 2{\bf s}_x^T {\bf c}, \end{align*} with the final condition ${\bf S}(t_f) = {\bf Q}_f.$
In the discrete-time version, we have...
Phew! Numerical solutions to these equations can be obtained by calling the FiniteHorizonLinearQuadraticRegulator methods in Drake.