Math You Need to Estimate the Weather, Stocks, Robots, and More!

Particle Filtering — the Monte-Carlo method used for Bayesian Estimation

Karena X. Cai
13 min read · Feb 26, 2020


This tutorial is two-fold: the first part gives a gentle introduction to Bayesian estimation, and the second part dives into the specifics of the particle filtering algorithm, a Monte Carlo method for performing Bayesian estimation.

PART 1 — Bayesian Estimation


Have you ever stopped to wonder: how do scientists predict the weather? The weather, among other things, is a highly random process that seems like it would be impossible to predict. What mathematical tools allow scientists to do this seemingly impossible task?

Scientists could just use a mathematical model derived from the laws of physics (see below):

The mathematical model above is saying there exists some formula we can use to describe how the system will evolve at a future state given its current state. The n_t term is the noise or uncertainty associated with the system.

For weather, you could imagine this to be some physics equation that models how the state of the system will change over time. A scientist trying to predict the weather wants to estimate the state of the system described by the mathematical model above, where the state of the system x_t in this model might be the temperature or the likelihood of precipitation at a given location.

BUT, our mathematical models have some randomness (in the equation above, the n_t) that captures the uncertainty in the model. This uncertainty makes it difficult to estimate (or predict) the state of a physical system over time.

Because of this uncertainty, if scientists only used a mathematical model derived from the laws of physics to estimate (and predict) the weather, the weather forecasts would be really terrible! This is why scientists ALSO collect all the measurements (or data) they can about the weather. Weather measurements come from sensors like barometers, thermometers, humidity sensors, pressure sensors and more.

Measurements, or data, can come from a wide variety of different sensors we can use to measure the state of a system.

With this additional information, scientists can combine their mathematical model with the data they collect, to estimate the state of a physical process like the weather — and even make accurate predictions of it for up to 7 days.


In this article, we will describe one of the many mathematical frameworks (Kalman filters, Extended Kalman Filters, Unscented Kalman Filters, etc.) that can be used to estimate the state of a random process when given a 1) mathematical model of the process with some additive uncertainty and 2) some measurements (that also have uncertainty) of the system as it is evolving over time. These techniques are generally referred to as Bayesian Estimation.

All the pieces you need to succeed in Bayesian estimation!

Particle filtering, in particular, is a technique where we use Monte Carlo methods to iteratively estimate the state of the system at every time step!


Before we jump ahead to the math, let’s think about some other domains where this math might be useful. Imagine any random process that has a lot of uncertainty, and therefore seems very difficult to estimate or predict. But also imagine that you can take some measurements that give you some information about the state of your system. Can you stop to think about where else we could use Bayesian estimation? Here are just a few that I could come up with:

Some other processes where we could use particle filtering.

Bayesian filtering is the umbrella term for combining a mathematical model with measurements to estimate the state of a system. Keep in mind that particle filtering is just one of many methods for doing Bayesian filtering.


Now, let us dig into the math! Let’s say we have a nonlinear dynamics model that captures the evolution of the dynamics of a system as shown below:

Mathematical model with Gaussian uncertainty.
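The equation itself appears as an image in the original; a standard form consistent with the caption and with the symbols used later in the article (f for the dynamics, Σ for the process-noise covariance) is:

```latex
x_t = f(x_{t-1}) + n_{t-1}, \qquad n_{t-1} \sim \mathcal{N}(0, \Sigma)
```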

The nonlinear model above is an equation that predicts how we think the system will evolve over time.

Let me introduce an example that will make these equations easier to interpret and understand. Consider a car moving with constant velocity (zero acceleration) in a straight line (a 1-D example). The mathematical model for this car captures how the state of the car (its position x and its velocity v) evolves over time. Using the laws of physics, we can write the discrete mathematical model as follows:

Mathematical model for car moving at constant velocity. The noise is modeled as Gaussian white noise.
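The car model is also an image in the original; a discrete-time form matching the caption (constant velocity, with Gaussian white noise n and w on the position and velocity updates) would be:

```latex
\begin{aligned}
x_{t+1} &= x_t + v_t\,\Delta t + n_t \\
v_{t+1} &= v_t + w_t
\end{aligned}
```

where n_t and w_t are i.i.d. zero-mean Gaussian random variables.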

Note, since the car might experience some unmodeled physical perturbations like the slipperiness of the road or some wind, there is some inherent uncertainty in the process! This uncertainty is modeled as random white noise and is captured by the Gaussian random variables n and w, which are independently and identically distributed (i.i.d.).


Now that we have covered the mathematical model, we want to figure out how the measurements we take reflect the state of our system. Nonlinear measurements have the following form:

Measurement model. The measurement is written as a function of the state and the noise is normally distributed with mean zero and covariance Gamma.

Let’s see what measurements might look like in our original car example! Imagine we had a GPS that could measure the car’s position. The measurement equation is an equation that captures how the GPS position is a function of the system’s state. The measurement equation for a GPS reading can be written as:

Measurement equation for a GPS signal.

In other words, we can think of our GPS measurement as an indicator of the position of the car at a given time with some additive noise. Another example of a measurement (perhaps using vision or some other range sensor) could be the distance to an object with a known location. If the object’s position is given by px, we can write a range measurement as:

Measurement equation for a range measurement to an object with known position at px.
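Both measurement equations are images in the original; forms consistent with the captions (position plus additive noise for the GPS, and distance to a known landmark at p_x for the range sensor) would be:

```latex
y_t = x_t + n_t \quad \text{(GPS)}, \qquad
y_t = \lvert x_t - p_x \rvert + n_t \quad \text{(range)}, \qquad
n_t \sim \mathcal{N}(0, \Gamma)
```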

Again, since sensor measurements are not perfect, there is noise in our sensor measurement model as well. We use a Gaussian random variable (in this case n) to capture this randomness in our model.


Now, let us take a look at what this car example looks like graphically. In the figure below, we can see that if the model of the physical process were entirely deterministic, we would already know the exact position of the car, and we would not need particle filtering (as shown in the top image of the figure). The orange represents the location of the car at every time-step and the green represents the GPS measurement taken at each time step.

This image shows us an illustration of the car example. Since there is uncertainty in both the model of the system and the measurements, instead of being certain of where the system is located, there is a distribution of possibilities over where the car could actually be at a given time.

Since the physical process has some uncertainty, however, there is a range of values that the car’s position (and velocity) could take (shown by the ellipsoid region in the diagram). Similarly, there is uncertainty in the measurements, indicated by the lighter green region showing that the measurement can also fall within a range of positions. Since it is uncertain where the car is actually located at each time-step, we will derive an algorithm that gives our best estimate of the state at each time step, given all the measurements we have taken up to that time.

In other words, we want to find the posterior probability distribution p(x_t|y_{0:t}).

Before deriving equations that show how we get this posterior distribution, let’s look at one more graphical representation of what we are computing.

The evolution of a dynamic system. Can we estimate the state at time step t given measurements y up to time t? In other words, can we find the posterior p(x_t|y_{0:t})?

Here, the x_t is the state of the system and the y_t is the corresponding measurement taken at that time. So the question we want to ask in Bayesian filtering is: at each time t, how do we find the posterior distribution p(x_t|y_{0:t})? In other words, what is the probability distribution over the current state, given the measurements that we have collected from time 0 up to time t.


We use Bayes’ theorem to answer this question! Bayes’ rule says the following:

Bayes’ theorem tells us how to mathematically solve for a probability distribution over the system’s state at time t, p(x_t|y_{0:t}). It says this posterior probability distribution depends on two probabilities: 1) the prior distribution of the state and 2) the measurement likelihood function. Note, the normalization term in the denominator is necessary since probabilities must take values between 0 and 1. In the next sections, I will describe the intuition behind each piece on the right-hand side of the equation, and its connection to the mathematical model and the measurement model that we described earlier.
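Written out, Bayes’ theorem for this filtering problem takes the following standard form (reconstructed here, since the equation appears as an image in the original):

```latex
p(x_t \mid y_{0:t}) \;=\; \frac{p(y_t \mid x_t)\, p(x_t \mid y_{0:t-1})}{p(y_t \mid y_{0:t-1})}
```

The numerator is the measurement likelihood times the prior, and the denominator is the normalization term.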


The prior distribution is the probability distribution of the state x_t given all past measurements up to time t-1: p(x_t|y_{0:t-1}). This distribution can be derived using our mathematical model of the system. In particular, the prior distribution tells us where we think the state of the system will be if it evolves according to the mathematical model.

Let’s see what this means in the particle filtering setting, where distributions are represented by samples. Imagine we have samples of the previous posterior distribution p(x_{t-1}|y_{0:t-1}). If we take each of those samples and propagate its state forward by one time-step according to the mathematical model, the new samples represent a new distribution: the prior distribution p(x_t|y_{0:t-1}). This propagation results in a prior distribution whose mean is centered around the model’s prediction of the state, f(x_{t-1}), and whose variance is given by Σ. This can be seen in the figure below:

An illustration of the prior distribution being derived from propagating the posterior distribution forward with the mathematical model.
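As a sketch of this propagation step in code, here is the car example in Python (the time-step and noise scales are illustrative choices, not values from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants: time-step and process-noise scales.
dt, sigma_n, sigma_w = 0.1, 0.05, 0.02

def propagate(particles):
    """Push each (position, velocity) sample one step through the
    constant-velocity model, adding the process noise n and w."""
    x, v = particles[:, 0], particles[:, 1]
    x_new = x + v * dt + rng.normal(0.0, sigma_n, size=x.shape)
    v_new = v + rng.normal(0.0, sigma_w, size=v.shape)
    return np.column_stack([x_new, v_new])

# Samples from the previous posterior p(x_{t-1} | y_{0:t-1}) ...
posterior_samples = rng.normal([0.0, 1.0], [0.1, 0.05], size=(1000, 2))
# ... once propagated forward, become samples from the prior p(x_t | y_{0:t-1}).
prior_samples = propagate(posterior_samples)
```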


While the prior distribution is closely associated with the mathematical model, the likelihood function (as shown on the right) can be derived from the measurement equations that we introduced earlier.

In particular, the likelihood function says: if you are at state x_t, it gives you the likelihood of observing the measurement y_t. Let’s think back to the car example to get a more intuitive understanding of what this means. If our car is at position x_t = 5 m, then the likelihood p(y_t|x_t) tells us the probability of a given measurement y_t given that the car’s position is x_t = 5 m. We can imagine that a measurement y_t close to 5 m will have a much higher probability than a measurement far from 5 m, since the GPS is just measuring the position of the car. The mean of this distribution is given by h(x_t) and the covariance is given by Gamma, as can be seen in the figure above.

The likelihood function is a distribution that is derived from the measurement equations.
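A minimal sketch of evaluating this likelihood for a handful of particles, assuming the GPS model h(x) = x and an illustrative noise variance gamma:

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    """Density of a scalar Gaussian N(mean, var) evaluated at y."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Candidate car positions (particles) and one GPS reading; gamma is an
# illustrative choice of measurement-noise variance.
particles = np.array([4.0, 5.0, 6.0, 9.0])
y, gamma = 5.1, 0.5

# For the GPS model h(x) = x, so p(y_t | x_t) = N(y_t; x_t, gamma):
# particles near the reading get a much higher likelihood.
likelihoods = gaussian_pdf(y, particles, gamma)
```

The particle at 5.0 m ends up far more likely than the one at 9.0 m, matching the intuition above.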

PART 2 — Particle Filtering

Particle filtering is a specific Monte-Carlo method to iteratively derive the posterior distribution using the likelihood and prior distributions! Now that we are familiar with the intuition behind the likelihood and prior distribution, let’s dive into the particle filtering algorithm itself.

If we could sample directly from the posterior distribution p(x_t|y_{0:t}), given by the blue distribution below, we would already be done! We could simply use Monte-Carlo sampling to approximate the distribution p(x_t|y_{0:t}).

For those who are not familiar with Monte Carlo, let’s decompose what the formula at the bottom means in a more intuitive manner. Say we have a bunch of samples from an underlying posterior distribution: one sample at 0 m, two samples at 1 m, five samples at 2 m, and so on. The formula says that the probability p(x = 1 m) is equal to the number of samples at 1 m divided by N, where N is the total number of samples! For Monte-Carlo sampling, this approximation converges to the real distribution for a large number of samples.
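The same counting argument in a few lines of Python:

```python
import numpy as np

# The toy sample set from the text: one sample at 0 m, two at 1 m,
# five at 2 m.
samples = np.array([0.0] + [1.0] * 2 + [2.0] * 5)
N = len(samples)

# Monte-Carlo approximation: P(x = 1 m) ≈ (# samples at 1 m) / N.
p_1m = np.count_nonzero(samples == 1.0) / N  # → 2/8 = 0.25
```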

Unfortunately, we do not have the ability to directly sample from this posterior distribution. The distribution we can sample from, however, is the prior distribution. Luckily, we can use a mathematical technique called importance sampling to approximate the posterior distribution that we are actually looking for by using samples from the prior distribution!


Importance sampling is a mathematical tool we can use so that samples from the prior distribution can be used to approximate the posterior distribution.

Instead of sampling directly from the posterior, we can do importance sampling. Here is a high-level illustration of how importance sampling works for particle filtering.

Remember Bayes’ theorem? Bayes’ theorem tells us the relationship between the prior distribution and the posterior distribution. Once we sample from the prior distribution, we can reweigh the samples (particles) according to the likelihood function. After some normalization, the reweighted samples from the prior distribution represent the posterior distribution. Note, samples and particles are meant to be equivalent and are used interchangeably in the rest of this tutorial.
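Here is a small self-contained sketch of importance sampling in this setting, using an illustrative 1-D Gaussian example where the exact posterior is known (none of these numbers come from the article):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: prior N(0, 1); measurement y = 2 observed through h(x) = x
# with noise variance 1.  The exact posterior is then N(1, 0.5), which
# gives us something to check the approximation against.
prior_samples = rng.normal(0.0, 1.0, size=100_000)
y, gamma = 2.0, 1.0

# Reweigh each prior sample by the likelihood p(y | x) ...
weights = np.exp(-0.5 * (y - prior_samples) ** 2 / gamma)
# ... and normalize so the weights sum to 1.
weights /= weights.sum()

# The weighted prior samples now approximate the posterior.
posterior_mean = np.sum(weights * prior_samples)  # ≈ 1.0
```

The weighted mean lands very close to the analytic posterior mean of 1, even though we never sampled from the posterior itself.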


Now, let’s dive into the details of how to do this in practice! Suppose we have some initial Gaussian distribution approximating our estimate of the initial state of our system at time t=0. In the first step of the algorithm, we sample n particles from this initial distribution. These particles are a bunch of different x-positions representing potential locations of the car. (Aside from the first step of this iterative algorithm, these samples will actually come from a non-Gaussian posterior distribution p(x_{t-1}|y_{0:t-1}) instead of p(x_0|y_0).)

Now that we have these samples, we want to propagate them forward according to our mathematical model of the physical process. Of course, we propagate the particles forward with the noise in the mathematical model as well! Once we propagate these particles forward, we get a new distribution. This distribution represents the prior distribution p(x_{t}|y_{0:t-1}). The red x in the diagram corresponds to the most probable position of the vehicle according to the prior distribution.

Now that we have samples from the prior distribution, we want to use the information we gain from the measurement at time t=1 to reshape the weights on the samples. For the third step of the algorithm, we compute the weight of each particle using the equation given for w_{t+1}^(n). We can see that particles whose mapping h(x_{t+1}) matches the measurement well will have significantly higher weights. Of course, we must normalize the probability distribution, which we can do by normalizing the weights.

Now that we have reweighed the particles according to the likelihood function, we define our posterior distribution as follows:

Note, this new distribution is similar to the Monte-Carlo sampling except that some samples weigh more than others. For instance, if we have one sample at x = 2 m (the left-most particle of the green distribution), then the probability of x_{t+1} being at 2 m is equivalent to w_1 * (1/N), where w_1 is the weight assigned to the 2 m particle and N is the number of samples.

We can see that our initial prior estimate, made without considering the measurements, got shifted to the right after we added in our measurement information! This shows how the predicted estimate from our mathematical model is readjusted to accommodate the measurements that we are simultaneously collecting. Remember, this is an iterative process.
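Putting the pieces together, here is a hedged sketch of the iterative predict/reweigh cycle for the 1-D car with GPS readings (all constants are illustrative choices, not values from the article):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative constants: time-step, process-noise scales, GPS noise
# variance, and particle count.
dt, sigma_n, sigma_w, gamma = 0.1, 0.05, 0.02, 0.25
n_particles = 5000

def particle_filter_step(particles, weights, y):
    """One predict/update cycle: propagate with the motion model, then
    reweigh by the GPS likelihood and normalize."""
    x, v = particles[:, 0], particles[:, 1]
    # 1) propagate every particle through the noisy model
    x = x + v * dt + rng.normal(0.0, sigma_n, n_particles)
    v = v + rng.normal(0.0, sigma_w, n_particles)
    # 2) reweigh by the likelihood p(y_t | x_t); h(x) = position for a GPS
    weights = weights * np.exp(-0.5 * (y - x) ** 2 / gamma)
    weights = weights / weights.sum()  # 3) normalize
    return np.column_stack([x, v]), weights

# Initial belief: position ~ N(0, 1), velocity ~ N(1, 0.1).
particles = rng.normal([0.0, 1.0], [1.0, 0.1], size=(n_particles, 2))
weights = np.full(n_particles, 1.0 / n_particles)

# Run a few steps against synthetic GPS readings of a car starting at
# x = 0 and moving at roughly 1 m/s.
for y in [0.1, 0.2, 0.3, 0.4, 0.5]:
    particles, weights = particle_filter_step(particles, weights, y)

# Weighted-mean estimate of the car's position.
estimate = np.sum(weights * particles[:, 0])
```

Each pass through the loop is exactly the predict-then-reweigh iteration described above; the weighted mean serves as the state estimate at each step.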

However, there is a major issue that we must address before iterating, known as particle degeneration.

There is some math (which I will not go into here) showing that after some time, all particles except one will have negligible weight! So, instead of getting a great approximation of the posterior distribution, you get a distribution like the green one shown above. Luckily, there are many ways to remedy particle degeneration. One way is to do a resampling step as shown below:

In this case, we have some particles with large weights. If we resample from this distribution, we get a new set of samples where each of them has the same weight.

Resampling in this way re-represents the distribution with samples that all carry the same weight, instead of weights that vary widely (some very large, some very small).
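A minimal multinomial resampling sketch (one simple way to implement the step described above; other schemes, such as systematic resampling, also exist):

```python
import numpy as np

rng = np.random.default_rng(3)

def resample(particles, weights):
    """Multinomial resampling: draw N new particles with probability
    proportional to their weights, then reset every weight to 1/N."""
    n = len(weights)
    idx = rng.choice(n, size=n, p=weights / weights.sum())
    return particles[idx], np.full(n, 1.0 / n)

# A nearly degenerate weight vector: one particle carries almost all
# of the probability mass.
particles = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.array([0.01, 0.01, 0.97, 0.01])

new_particles, new_weights = resample(particles, weights)
```

After resampling, the heavy particle is duplicated and every particle has equal weight, which is exactly the equal-weight representation described above.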

This may not be the most effective way to avoid particle degeneration, but it is one of the simplest techniques to implement! After this resampling process, we have a set of equally-weighted samples that represent the posterior distribution p(x_t|y_{0:t}). Now, we can repeat the process to estimate the next state.


In summary, in particle filtering, we are trying to identify the best estimate of the state of an uncertain or stochastic system as it is evolving over time. We do this by taking into account the mathematical model we have of our system and the measurements we are taking of the system as it is evolving over time.

A GIF showing the particle filtering algorithm evolution as it evolves over time.

Particle filtering allows us to iteratively estimate the state by doing Importance Sampling (i.e. taking samples from the prior distribution and reweighing the particles according to the new measurements taken at the new time-step). Note, there are some nuances when it comes to predicting the model forward a few steps (like the weather), but the mathematical tools used are very similar in flavor.

Note, there are plenty more resources to help you understand this process in more depth! Here is a thorough tutorial that goes through the mathematical details:

You can also see some sample Python code [to come] where I perform this algorithm for a simple nonlinear dynamical system.



Karena X. Cai

PhD student @ Caltech. Research in designing behavioral contracts for self-driving cars. Lifelong teacher. Avid learner.