Date: November 12th, 2018
Author: Robert Baraldi
Editor: Peng Zheng
A Basic Introduction to the Kalman Filter: An Optimization View
The Kalman-Bucy (and tons of other people) filter has been well studied, to say the least. Between putting people on the moon and predicting baseball pitches, it's inspired countless PhD theses, so it's only a matter of time before you run into it. In essence, it is a recursion that mediates between a mathematical model of a physical system and data observations to accurately infer and predict the system state. In the post below, we study instances of the simple linear Kalman filter from the optimization perspective, which leads to a better interpretation of the model and a more flexible implementation of the algorithm. Note that this is adapted from Professor Aleksandr Aravkin's class notes in AMATH 515 at the University of Washington and his survey [ABL+17].
A Simple Example
First, let's set up a simple motivating example. Imagine we want to track the movement of a satellite. Every period of time \(\Delta t\), we collect position information of the satellite, \(y_k \in \mathbb{R}^3\), for \(k=1,\ldots,N.\) However, due to errors that occur in the measuring process and signal transmission, we cannot totally rely on the observations \(y_k\).
Linear Kalman System
Assume the true positions of the satellite are \(p_k\) and the latent velocities are \(v_k\);
we want to infer the state variables \(x_k = [v_k; p_k]\) from the observations \(y_k\).
There are two sets of relations: one between the previous and current state,
and one between states and observations.
We call them the transition model and the measurement model:
- transition model: \[x_k = G_k x_{k-1} + \mu_k;\]
- measurement model: \[y_k = H_k x_k + \nu_k.\]
Here we denote by \(G_k\) the transition matrices and by \(H_k\) the measurement matrices, and \(x_0\) is the given guess of the initial state. For this specific example, we have,
\[\begin{aligned} G_1 = I_6, \quad G_k &= \begin{bmatrix} I_3 & O_3 \\ \Delta t I_3 & I_3 \end{bmatrix}, &&\quad k = 2,\ldots,N,\\ H_k &= \begin{bmatrix} O_3 & I_3 \end{bmatrix}, &&\quad k = 1,\ldots,N, \end{aligned}\]
where \(I_n\) is the \(n \times n\) identity matrix and \(O_n\) is the \(n \times n\) zero matrix; note that \(H_k\) picks out the position block of \(x_k = [v_k; p_k]\), since we only observe positions.
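To make the shapes concrete, here is a minimal numpy sketch of these matrices; the time step `dt` is an arbitrary illustrative value, not something fixed by the example.

```python
import numpy as np

# Minimal sketch of the system matrices for the satellite example.
# The time step dt is an illustrative value.
dt = 0.1
I3, O3 = np.eye(3), np.zeros((3, 3))

# Transition matrix for k >= 2: velocity carries over, position integrates velocity.
G_k = np.block([[I3,      O3],
                [dt * I3, I3]])   # 6 x 6

# Measurement matrix: observe only the position block of x_k = [v_k; p_k].
H_k = np.block([[O3, I3]])        # 3 x 6
```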
Error Model
In the transition and measurement models we listed above,
\(\mu_k\) and \(\nu_k\) represent the noises and errors in the system.
For simplicity, in this example we assume they independently follow Gaussian distributions, namely,
\[\mu_k \sim \mathcal{N}(0_6, Q_k), \quad \nu_k \sim \mathcal{N}(0_3, R_k),\]
where \(0_n\) is the zero vector of length \(n\), and
\(Q_k\) and \(R_k\) are positive definite covariance matrices.
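As a sanity check, one can simulate data from this error model. The sketch below uses illustrative values for \(\Delta t\), \(N\), \(Q_k\), \(R_k\), and the initial state; none of these are specified by the example.

```python
import numpy as np

# Simulate the linear Kalman system under the Gaussian error model.
# dt, N, Q, R, and x0 below are illustrative choices.
rng = np.random.default_rng(0)
dt, N = 0.1, 50
I3, O3 = np.eye(3), np.zeros((3, 3))
G = np.block([[I3, O3], [dt * I3, I3]])   # G_k for k >= 2
H = np.block([[O3, I3]])                  # H_k
Q = 0.01 * np.eye(6)                      # transition noise covariance Q_k
R = 0.25 * np.eye(3)                      # measurement noise covariance R_k

x0 = np.concatenate([np.ones(3), np.zeros(3)])   # initial guess: unit velocity, zero position
x, states, obs = x0, [], []
for k in range(1, N + 1):
    Gk = np.eye(6) if k == 1 else G                          # G_1 = I_6
    x = Gk @ x + rng.multivariate_normal(np.zeros(6), Q)     # transition model
    y = H @ x + rng.multivariate_normal(np.zeros(3), R)      # measurement model
    states.append(x)
    obs.append(y)
```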
There are two extensions of the error model assumption:
- Heavy-tailed distributions. When the contamination of the data involves outliers, namely sparse errors of large magnitude, heavy-tailed distributions are more suitable and can improve performance compared to the Gaussian assumption.
- Singular covariance matrices. In many applications, the covariance matrices are intrinsically singular. How to build the correct model and leverage optimization techniques is studied in [JAB+18].
Maximum Likelihood
After setting up the models, we want to formulate the inference as an optimization problem so that various suitable algorithms can be applied to infer the state. To accomplish this, we build the statistical maximum likelihood formulation based on the error model. From the assumption, we know that all \(\mu_k\) and \(\nu_k\) are independent. Given the distribution specification of \(\mu_k\) and \(\nu_k\) and the observations \(y_k\), how likely the states \(x_k\) are at their current values is described by the probability,
\begin{equation} \begin{aligned} &\mathrm{Pr}[\mu_1, \ldots, \mu_N, \nu_1, \ldots, \nu_N| x_1, \ldots, x_N]\\ =& \prod_{k=1}^N \mathrm{Pr}[\mu_k|x_k] \mathrm{Pr}[\nu_k|x_k]\\ \propto& \prod_{k=1}^N \exp\left[-\frac{1}{2}(x_k - G_k x_{k-1})^\top Q_k^{-1} (x_k - G_k x_{k-1})\right] \exp\left[-\frac{1}{2}(y_k - H_k x_k)^\top R_k^{-1} (y_k - H_k x_k)\right] \end{aligned} \end{equation}
The most likely \(x_k\) are chosen by maximizing the above quantity, namely, the maximum likelihood formulation,
\[ \max_{x_1, \ldots, x_N}~~\prod_{k=1}^N \exp\left[-\frac{1}{2}(x_k - G_k x_{k-1})^\top Q_k^{-1} (x_k - G_k x_{k-1})\right] \exp\left[-\frac{1}{2}(y_k - H_k x_k)^\top R_k^{-1} (y_k - H_k x_k)\right]. \]
This is equivalent to minimizing the negative log-likelihood,
\[ \min_{x_1, \ldots, x_N} \sum_{k=1}^N \frac{1}{2}\|x_k - G_k x_{k-1}\|_{Q_k^{-1}}^2 + \sum_{k=1}^N \frac{1}{2}\|y_k - H_k x_k\|_{R_k^{-1}}^2, \]
which represents our basic optimization objective. By introducing the notation,
\[\begin{aligned} & x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} && G = \begin{bmatrix} I_6 & & & \\ -G_2 & I_6 & &\\& \ddots & \ddots & \\ & & -G_N & I_6 \end{bmatrix} &&& Q = \begin{bmatrix} Q_1 & & &\\ & Q_2 & & \\ & & \ddots & \\ & & & Q_N \end{bmatrix}\\ & y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} && H = \begin{bmatrix} H_1 & & &\\ & H_2 & & \\ & & \ddots & \\ & & & H_N \end{bmatrix} &&& R = \begin{bmatrix} R_1 & & &\\ & R_2 & & \\ & & \ddots & \\ & & & R_N \end{bmatrix} \end{aligned}\]
and \(\eta = [G_1 x_0; 0_6; \ldots; 0_6]\), we can write the objective in a more compact form,
\[ \min_{x}~~\frac{1}{2}\|Gx - \eta\|_{Q^{-1}}^2 + \frac{1}{2}\|y - H x\|_{R^{-1}}^2. \]
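The block structure above translates directly into code. Below is a sketch that assembles the stacked matrices with scipy sparse blocks and evaluates the compact objective; the function names and the use of explicit covariance inverses are illustrative only.

```python
import numpy as np
import scipy.sparse as sp

def assemble(Gs, Hs, Qs, Rs, x0, ys):
    """Build G, H, Q^{-1}, R^{-1}, eta, y from per-step matrices (k = 1..N)."""
    N, n = len(Gs), Gs[0].shape[0]
    # Block bidiagonal G: identity blocks on the diagonal, -G_k on the subdiagonal.
    G = sp.block_diag([sp.identity(n)] * N).tolil()
    for k in range(1, N):
        G[k*n:(k+1)*n, (k-1)*n:k*n] = -Gs[k]
    H = sp.block_diag(Hs)
    Qinv = sp.block_diag([np.linalg.inv(Qk) for Qk in Qs])
    Rinv = sp.block_diag([np.linalg.inv(Rk) for Rk in Rs])
    eta = np.concatenate([Gs[0] @ x0] + [np.zeros(n)] * (N - 1))
    y = np.concatenate(ys)
    return G.tocsr(), H.tocsr(), Qinv, Rinv, eta, y

def objective(x, G, H, Qinv, Rinv, eta, y):
    """Evaluate (1/2)||Gx - eta||^2_{Q^{-1}} + (1/2)||y - Hx||^2_{R^{-1}}."""
    rt, rm = G @ x - eta, y - H @ x
    return 0.5 * rt @ (Qinv @ rt) + 0.5 * rm @ (Rinv @ rm)
```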
Optimization Algorithms for Solving Kalman Filter
For this simple example, the optimization problem is a least squares problem. We can explicitly write out the optimality condition and obtain the solution in closed form,
\[ (G^\top Q^{-1} G + H^\top R^{-1} H) x = G^\top Q^{-1} \eta + H^\top R^{-1} y. \]
To solve this linear system efficiently, we need to exploit its block tridiagonal structure, which is worth a separate blog post.
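Continuing the sketches above, one can solve the normal equations with a general sparse solver; a dedicated block tridiagonal (Cholesky) sweep would be the more efficient route hinted at here. The per-step lists below reuse the illustrative matrices from the simulation sketch.

```python
import scipy.sparse.linalg as spla

# Per-step matrices from the simulation sketch: G_1 = I_6, G_k = G for k >= 2.
Gs = [np.eye(6)] + [G] * (N - 1)
Hs, Qs, Rs = [H] * N, [Q] * N, [R] * N

Gb, Hb, Qinv, Rinv, eta, yb = assemble(Gs, Hs, Qs, Rs, x0, obs)

# Normal equations: (G^T Q^{-1} G + H^T R^{-1} H) x = G^T Q^{-1} eta + H^T R^{-1} y.
# A careful implementation would exploit the block tridiagonal structure of A instead.
A = (Gb.T @ Qinv @ Gb + Hb.T @ Rinv @ Hb).tocsc()
b = Gb.T @ (Qinv @ eta) + Hb.T @ (Rinv @ yb)
x_hat = spla.spsolve(A, b)   # stacked smoothed state estimate
```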
In general, when we do not assume a Gaussian error model and prior information is incorporated, the optimization problem can be written as,
\[ \min_{x}~~\rho_t(Gx - \eta) + \rho_m(y - Hx) + r(x), \]
where \(\rho_t\) and \(\rho_m\) correspond to the error models of the transition and measurement models, and \(r\) encodes the prior information. The general optimization problem can be nonsmooth and nonconvex, so specialized optimization techniques need to be applied accordingly (a minimal proximal gradient sketch follows the list below). Popular choices are:
- primal methods:
  - gradient descent and its accelerated variants
  - proximal gradient descent and its accelerated variants
- primal-dual methods:
  - Chambolle-Pock
  - ADMM
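For intuition, here is a minimal proximal gradient sketch for the composite problem \(\min_x f(x) + r(x)\), where \(f\) collects the smooth penalty terms and \(r\) is the prior. The step size, iteration count, and the \(\ell_1\) prior used as an example are illustrative assumptions, not choices made in the post.

```python
import numpy as np

def proximal_gradient(grad_f, prox_r, x0, step, n_iter=500):
    """Proximal gradient method for min_x f(x) + r(x)."""
    x = x0.copy()
    for _ in range(n_iter):
        x = prox_r(x - step * grad_f(x), step)   # gradient step on f, prox step on r
    return x

# Example prior: r(x) = lam * ||x||_1, whose proximal operator is soft-thresholding.
def prox_l1(z, step, lam=0.1):
    return np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
```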
Further discussion of these algorithms will appear in separate blog posts under specific contexts.
Thanks for reading! Leave a comment if you have any questions or suggestions.