This is going to be a fairly advanced application of MPI, targeted at someone who has already had some basic exposure to parallel computing. Because of this, I am not going to go step by step; rather, I will focus on specific aspects that I feel are of interest (specifically, the use of ghost cells and message passing on a two-dimensional grid). As I have started doing in my recent blogposts, the code discussed here is only partial. It is accompanied by a fully featured solution, named Diffusion.jl, that you can find on Github.
Parallel computing has entered the “commercial world” over the last few years. It is a standard solution for ETL (Extract-Transform-Load) applications, where the problem at hand is embarrassingly parallel: each process runs independently from all the others and no network communication is needed (until, potentially, a final “reduce” step, where each local solution is gathered into a global solution).
In many scientific applications, however, information needs to be passed through the network of a cluster. These “non-embarrassingly parallel” problems are often numerical simulations that model problems ranging from astrophysics to weather modelling, biology, quantum systems and many more. In some cases, these simulations are run on tens to even millions of CPUs (Fig. 1) and the memory is distributed, not shared, among the different CPUs. Normally, the way these CPUs communicate in a supercomputer is through the Message Passing Interface (MPI) paradigm.
Anyone working in High Performance Computing should be familiar with MPI. It allows one to exploit the architecture of a cluster at a very low level. In theory, a researcher could assign to every single CPU its own computational load. They could decide exactly when and what information should be passed among CPUs, and whether this should happen synchronously or asynchronously.
And now, let’s go back to the contents of this blogpost, where we are going to see how to write the solution of a diffusion-type equation using MPI. We have already discussed the explicit scheme for a one-dimensional equation of this type. However, in this blogpost we will look at the two-dimensional solution.
The Julia code presented here is essentially a translation of the C/Fortran code explained in this excellent blogpost by Fabien Dournac.
In this blogpost I am not going to present a thorough analysis of scaling speed vs. number of processors. Mainly because I only have two CPUs that I can play with at home (Intel Core i7 processor on my MacBook Pro)… Nonetheless, I can still proudly say that the Julia code presented in this blogpost shows a significant speedup using two CPUs vs. one. Not only this: it is faster than the Fortran and C equivalent codes! (more on this later)
These are the topics that we are going to cover in this blogpost:
I am actually still a newbie with Julia, hence the choice of having a section on my “first impressions”.
The main reason why I got interested in Julia is that it promises to be a general purpose framework with performance comparable to the likes of C and Fortran, while keeping the flexibility and ease of use of a scripting language like Matlab or Python. In essence, it should be possible to write Data Science/High-Performance-Computing applications in Julia that run on a local machine, on the cloud or on institutional supercomputers.
One aspect I don’t like is the workflow, which seems suboptimal for someone like me that uses IntelliJ and PyCharm on a daily basis (the IntelliJ Julia plugin is terrible). I have tried the Juno IDE as well, it is probably the best solution at the moment but I still need to get used to it.
One aspect that demonstrates how Julia has still not reached its “maturity” is how varied and outdated the documentation of many packages is. I still haven’t found a way to write a matrix of floating point numbers to disk in a formatted way. Sure, you can write each element of the matrix to disk in a double for loop, but there should be better solutions available. It is simply that information can be hard to find and the documentation is not necessarily exhaustive.
Another aspect that stands out when first using Julia is the choice of one-based indexing for arrays. While I find this slightly annoying from a practical perspective, it is surely not a deal breaker, also considering that this is not unique to Julia (Matlab and Fortran use one-based indexing, too).
Now, to the good and most important aspect: Julia can indeed be really fast. I was impressed to see how the Julia code that I wrote for this blogpost can perform better than the equivalent Fortran and C code, despite having essentially just translated it into Julia. Have a look at the performance section if you are curious.
Open MPI is an open source Message Passing Interface library. Other famous libraries include MPICH and MVAPICH. MVAPICH, developed by the Ohio State University, seems to be the most advanced library at the moment as it can also support clusters of GPUs, something particularly useful for Deep Learning applications (there is indeed a close collaboration between NVIDIA and the MVAPICH team).
All these libraries are built on a common interface: the MPI API. So, it does not really matter whether you use one or the other library: the code you have written can stay the same.
The MPI.jl project on Github is a wrapper for MPI. Under the hood, it uses the C and Fortran installations of MPI. It works perfectly well, although it lacks some functionality that is available in those other languages.
In order to be able to run MPI in Julia you will need to separately install Open MPI on your machine. If you have a Mac, I found this guide to be very useful. Importantly, you will also need to have gcc (the GNU compiler) installed, as Open MPI requires the Fortran and C compilers. I have installed version 3.1.1 of Open MPI, as mpiexec --version on my Terminal also confirms.
Once you have Open MPI installed on your machine, you should install cmake. Again, if you have a Mac this is as easy as typing brew install cmake on your Terminal.
At this point you are ready to install the MPI package in Julia. Open up the Julia REPL and type Pkg.add("MPI"). Normally, at this point you should be able to import the package using import MPI. However, I also had to build the package through Pkg.build("MPI") before everything worked.
The diffusion equation is an example of a parabolic partial differential equation. It describes phenomena such as heat diffusion or concentration diffusion (Fick’s second law). In two spatial dimensions, the diffusion equation reads
The solution $u(x, y, t)$ represents how the temperature/concentration (depending on whether we are studying heat or concentration diffusion) varies in space and time. Indeed, the variables $x, y$ represent the spatial coordinates, while the time component is represented by the variable $t$. The quantity $D$ is the “diffusion coefficient” and determines how fast heat, for example, is going to diffuse through the physical domain. Similarly to what was discussed (in more detail) in a previous blogpost, the equation above can be discretized using a so-called “explicit scheme” of solution. I am not going to go through the details here (you can find them in said blogpost); it suffices to write down the numerical solution in the following form,
$$\begin{equation}
\frac{u_{i,k}^{j+1} - u_{i,k}^{j}}{\Delta t} =
D\left(\frac{u_{i+1,k}^j-2u_{i,k}^j+u_{i-1,k}^j}{\Delta x^2}+
\frac{u_{i,k+1}^j-2u_{i,k}^j+u_{i,k-1}^j}{\Delta y^2}\right)
\label{eq:diffusion}
\end{equation}$$
where the $i, k$ indices refer to the spatial grid while $j$ represents the time index.
Assuming all the quantities at time $j$ to be known, the only unknown is $u_{i,k}^{j+1}$ on the left hand side of eq. (\ref{eq:diffusion}). This quantity depends on the values of the solution at the previous time step $j$ in a cross-shaped stencil, as in the figure below (the red dots represent those grid points at time step $j$ that are needed in order to find the solution $u_{i,k}^{j+1}$).
Equation (\ref{eq:diffusion}) is really all that is needed in order to find the solution across the whole domain at each subsequent time step. It is rather easy to implement a code that does this sequentially, with one process (CPU). However, here we want to discuss a parallel implementation that uses multiple processes.
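For concreteness, one sequential time step of this explicit scheme can be sketched in NumPy (the actual Diffusion.jl code is written in Julia; this is just an illustration, with variable names of my own choosing):

```python
import numpy as np

def step(u, D, dt, dx, dy):
    """One explicit time step of the 2D diffusion equation on the interior nodes.

    u holds the solution at time step j; the returned array holds step j+1.
    Boundary values are left unchanged (i.e., Dirichlet boundary conditions).
    """
    unew = u.copy()
    unew[1:-1, 1:-1] = u[1:-1, 1:-1] + D * dt * (
        (u[2:, 1:-1] - 2.0 * u[1:-1, 1:-1] + u[:-2, 1:-1]) / dx**2 +
        (u[1:-1, 2:] - 2.0 * u[1:-1, 1:-1] + u[1:-1, :-2]) / dy**2
    )
    return unew
```

Note that the slices implement exactly the cross-shaped stencil of eq. (\ref{eq:diffusion}): each interior node is updated from itself and its four nearest neighbors.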
Each process will be responsible for finding the solution on a portion of the entire spatial domain. Problems like heat diffusion, which are not embarrassingly parallel, require exchanging information among the processes. To clarify this point, let’s have a look at Figure 3. It shows how processes #0 and #1 will need to communicate in order to evaluate the solution near the boundary. This is where MPI enters. In the next section, we are going to look at an efficient way of communicating.
An important notion in computational fluid dynamics is that of ghost cells. This concept is useful whenever the spatial domain is decomposed into multiple subdomains, each of which is solved by one process.
In order to understand what ghost cells are, let’s consider again the two neighboring regions depicted in Figure 3. Process #0 is responsible for finding the solution on the left hand side, whereas process #1 finds it on the right hand side of the spatial domain. However, because of the shape of the stencil (Fig. 2), near the boundary both processes will need to communicate with each other. Here is the problem: it is very inefficient to have process #0 and process #1 communicate each time they need a node from the neighboring process; it would result in an unacceptable communication overhead.
Instead, what is common practice is to surround the “real” subdomains with extra cells called ghost cells, as in Figure 4 (right). These ghost cells represent copies of the solution at the boundaries of neighboring subdomains. At each time step, the old boundary of each subdomain is passed to its neighbors. This allows the new solution at the boundary of a subdomain to be calculated with a significantly reduced communication overhead. The net effect is a speedup in the code.
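To make the idea concrete, here is a serial sketch (plain NumPy, no MPI) of two horizontally adjacent subdomains exchanging their boundary columns into each other’s ghost cells; the function name and layout are mine, not taken from Diffusion.jl:

```python
import numpy as np

def exchange_ghosts(left, right):
    """Copy each subdomain's boundary column into the neighbor's ghost column.

    Each subdomain carries one extra "ghost" column on the shared side:
    left[:, -1] and right[:, 0] are ghost cells, not real solution values.
    """
    left[:, -1] = right[:, 1]    # right's first real column -> left's ghost
    right[:, 0] = left[:, -2]    # left's last real column  -> right's ghost
    return left, right
```

In the parallel version these two assignments become a send and a receive over the network, performed once per time step, which is exactly the communication pattern discussed below.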
There are a lot of tutorials on MPI. Here, I just want to describe those commands, expressed in the language of the MPI.jl wrapper for Julia, that I have been using for the solution of the 2D diffusion problem. They are basic commands that are used in virtually every MPI implementation.
For our problem, we are going to decompose our twodimensional domain into many rectangular subdomains, similar to the Figure below
Note that the “x” and “y” axes are flipped with respect to the conventional usage, in order to associate the x-axis with the rows and the y-axis with the columns of the solution matrix.
In order to communicate between the various processes, each process needs to know what its neighbors are. There is a very convenient MPI command that does this automatically and is called MPI_Cart_create. Unfortunately, the Julia MPI wrapper does not include this “advanced” command (and it does not look trivial to add it), so instead I decided to build a function that accomplishes the same task. In order to make it more compact, I have made extensive use of the ternary operator. You can find this function below,
function neighbors(my_id::Int, nproc::Int, nx_domains::Int, ny_domains::Int)
The inputs of this function are my_id, which is the rank (or id) of the process, the number of processes nproc, the number of divisions in the $x$ direction nx_domains and the number of divisions in the $y$ direction ny_domains.
Let’s now test this function. For example, looking again at Fig. 5, we can test the output for process of rank 4 and for process of rank 11. This is what we find on a Julia REPL:
julia> neighbors(4, 12, 3, 4)
julia> neighbors(11, 12, 3, 4)
As you can see, I am using the cardinal directions “N”, “S”, “E”, “W” to denote the location of a neighbor. For example, process #4 has process #3 as a neighbor located South of its position. You can check that the above results are all correct, given that “-1” in the second example means that no neighbors have been found on the “North” and “East” sides of process #11.
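The lookup logic can be sketched in a few lines (Python here for brevity; the actual function in Diffusion.jl is Julia). The rank layout assumed below, column by column with ranks increasing northward, is inferred from the two examples above and is my reconstruction:

```python
def neighbors(my_id, nproc, nx_domains, ny_domains):
    """Return the ranks of the N/S/E/W neighbors of process my_id, or -1 if none.

    Assumed layout: rank = x + y * nx_domains, where x indexes rows
    (south to north) and y indexes columns (west to east).
    """
    x = my_id % nx_domains            # row index within a column of ranks
    y = my_id // nx_domains           # column index
    return {
        "N": my_id + 1 if x < nx_domains - 1 else -1,
        "S": my_id - 1 if x > 0 else -1,
        "E": my_id + nx_domains if y < ny_domains - 1 else -1,
        "W": my_id - nx_domains if y > 0 else -1,
    }
```

With this layout, `neighbors(4, 12, 3, 4)` gives process #3 as the South neighbor, and `neighbors(11, 12, 3, 4)` gives -1 for both North and East, matching the REPL output discussed above.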
As we have seen earlier, at each iteration every process sends its boundaries to the neighboring processes. At the same time, every process receives data from its neighbors. These data are stored as “ghost cells” by each process and are used to compute the solution near the boundary of each subdomain.
There is a very useful command in MPI called MPI_Sendrecv that allows sending and receiving messages at the same time, between two processes. Unfortunately MPI.jl does not provide this functionality; however, it is still possible to achieve the same result by using the MPI_Send and MPI_Recv functionalities separately.
This is what is done in the following updateBound! function, which updates the ghost cells at each iteration. The inputs of this function are the global 2D solution u, which includes the ghost cells, as well as all the information related to the specific process that is running the function (what its rank is, what are the coordinates of its subdomain, what are its neighbors). The function first sends its boundaries to its neighbors and then it receives their boundaries. The receive part is finalized through the MPI.Waitall! command, that ensures that all the expected messages have been received before updating the ghost cells for the specific subdomain of interest.
function updateBound!(u::Array{Float64,2}, size_total_x, size_total_y, neighbors, comm,
The domain is initialized with a constant value $u=+10$ around the boundary, which can be interpreted as having a source of constant temperature at the border. The initial condition is $u=-10$ in the interior of the domain (Fig. 6, left). As time progresses, the value $u=+10$ at the boundary diffuses towards the center of the domain. For example, at time step j=15203, the solution looks as in Fig. 6, right.
As the time $t$ increases, the solution becomes more and more homogeneous, until theoretically, for $t \rightarrow +\infty$, it becomes $u=+10$ across the entire domain.
I was very impressed when I tested the performance of the Julia implementation against Fortran and C: I found the Julia implementation to be the fastest one!
Before jumping into this comparison, let’s have a look at the MPI performance of the Julia code itself. Figure 7 shows the ratio of the runtime when running with 1 vs. 2 processes (CPUs). Ideally, you would like this number to be close to 2, i.e., running with 2 CPUs should be twice as fast as running with one. What is observed instead is that for small problem sizes (a grid of 128x128 cells), the compilation time and communication overhead have a net negative effect on the overall runtime: the speedup is smaller than one. It is only for larger problem sizes that the benefit of using multiple processes starts to be apparent.
And now, for the surprise plot: Fig. 8 demonstrates that the Julia implementation is faster than both Fortran and C, for both 256x256 and 512x512 problem sizes (the only ones I tested). Here I am only measuring the time needed in order to complete the main iteration loop. I believe this is a fair comparison, as for long running simulations this is going to represent the biggest contribution to the total runtime.
Before starting this blogpost I was fairly skeptical of Julia being able to compete against the speed of Fortran and C for scientific applications. The main reason was that I had previously translated an academic code of about 2000 lines from Fortran into Julia 0.6 and I observed a slowdown of about 3x.
But this time… I am very impressed. I have effectively just translated an existing MPI implementation written in Fortran and C, into Julia 1.0. The results shown in Fig. 8 speak for themselves: Julia appears to be the fastest by far. Note that I have not factored in the long compilation time taken by the Julia compiler, as for “real” applications that take hours to complete this would represent a negligible factor.
I should also add that my tests are surely not as exhaustive as they should be for a thorough comparison. In fact, I would be curious to see how the code performs with more than just 2 CPUs (I am limited by my home personal laptop) and with different hardware (feel free to check out Diffusion.jl!).
At any rate, this exercise has convinced me that it is worth investing more time in learning and using Julia for Data Science and scientific applications. Off to the next one!
Fabien Dournac, Version MPI du code de résolution numérique de l’équation de chaleur 2D
⁂
For this tutorial we will be using Python 3.6. The packages that we will need are NumPy (I am using version 1.13.3) and Keras (version 2.0.9). Here Keras is only used for a few useful NLP tools (Tokenizer, sequence and np_utils). At the end of the blogpost I am also going to add a brief discussion on how to implement word2vec in Tensorflow (version 1.4.0), so you may want to import that as well.
import numpy as np
The first step in our implementation is to transform a text corpus into numbers. Specifically, into one-hot encoded vectors. Recall that in word2vec we scan through a text corpus and for each training example we define a center word with its surrounding context words. Depending on the algorithm of choice (Continuous Bag-of-Words or Skip-gram), the center and context words may work as inputs and labels, respectively, or vice versa.
Typically the context words are defined as a symmetric window of predefined length, on both the left and right hand sides of the center word. For example, suppose our corpus consists of the sentence “I like playing football with my friends”. Also, let’s say that we define our window to be symmetric around the center word and of length two. Then, our one-hot encoded context and center words can be visualized as follows,
We are essentially interested in writing a few lines of code that accomplish this mapping, from text to one-hot-encoded vectors, as displayed in the figure above. In order to do so, first we need to tokenize the corpus text.
def tokenize(corpus):
The function above returns the corpus tokenized and the size $V$ of the vocabulary. The vocabulary is not sorted in alphabetical order (it is not necessary to do so) but it simply follows the order of appearance.
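A minimal stand-in for such a tokenize function (the post uses Keras’ Tokenizer under the hood; this plain-Python sketch of mine just reproduces the behavior described above, with 1-based ids assigned in order of first appearance):

```python
def tokenize(corpus):
    """Map each sentence of the corpus to a list of 1-based token ids.

    Ids are assigned in order of first appearance, not alphabetically,
    mirroring the behavior described in the text. Returns the tokenized
    corpus and the vocabulary size V.
    """
    word2id = {}
    corpus_tokenized = []
    for sentence in corpus:
        ids = []
        for word in sentence.lower().split():
            if word not in word2id:
                word2id[word] = len(word2id) + 1  # 1-based, Keras-style
            ids.append(word2id[word])
        corpus_tokenized.append(ids)
    return corpus_tokenized, len(word2id)
```

For the example sentence, `tokenize(["I like playing football with my friends"])` returns `([[1, 2, 3, 4, 5, 6, 7]], 7)`.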
At this point we can proceed with the mapping from text to one-hot encoded context and center words using the function corpus2io, which uses the auxiliary function to_categorical (copied from the Keras repository).
def to_categorical(y, num_classes=None):
And that’s it! To show that the functions defined above do accomplish the required task, we can replicate the example presented in Fig. 1. First, we define the size of the window and the corpus text. Then, we tokenize the corpus and apply the corpus2io function defined above to find out the onehot encoded arrays of context and center words:
window_size = 2
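A compact sketch of these two functions, assuming the 1-based token ids produced by the tokenizer (this is my reconstruction, not the original listing):

```python
import numpy as np

def to_categorical(y, num_classes):
    """One-hot encode an integer, or an array of integers, y."""
    return np.eye(num_classes)[np.asarray(y)]

def corpus2io(corpus_tokenized, V, window_size):
    """Yield (context, center) one-hot pairs for each position in the corpus."""
    for sentence in corpus_tokenized:
        L = len(sentence)
        for i, center in enumerate(sentence):
            context_ids = [sentence[j]
                           for j in range(max(0, i - window_size),
                                          min(L, i + window_size + 1))
                           if j != i]
            # token ids are 1-based, array indices 0-based
            yield (to_categorical([c - 1 for c in context_ids], V),
                   to_categorical(center - 1, V))
```

For the tokenized sentence `[1, 2, 3, 4, 5, 6, 7]` with a window of two, the first training example pairs the contexts of “like” and “playing” with the center word “I”, as in the figure.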
One function that is particularly important in word2vec (and in any multi-class classification problem) is the Softmax function. A simple implementation of this function is the following,
def softmax(x):
Given an array of real numbers (including negative ones), the softmax function essentially returns a probability distribution, with the sum of the entries equal to one.
Note that the implementation above is slightly more complex than the naïve implementation that would simply return np.exp(x) / np.sum(np.exp(x)). However, it has the advantage of being superior in terms of numerical stability.
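A numerically stable sketch along these lines: subtracting the maximum before exponentiating leaves the result mathematically unchanged, but prevents overflow when the inputs are large.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: shift by max(x) before exponentiating."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
```

For instance, `softmax(np.array([1000.0, 1001.0]))` returns finite probabilities, whereas the naïve version would overflow in `np.exp`.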
data = [-1, -5, 1, 5, 3]
softmax(data) = [2.1439e-03 3.9267e-05 1.5842e-02 8.6492e-01 1.1705e-01]
Now that we can build training examples and labels from a text corpus, we are ready to implement our word2vec neural network. In this section we start with the Continuous Bag-of-Words model and then we will move to the Skip-gram model.
You should remember that in the CBOW model the input is represented by the context words and the labels (ground truth) by the center words. The full set of equations that we need to solve for the CBOW model are (see the section on the multiword CBOW model in my previous blog post):
$$\begin{eqnarray} \textbf{h} = & W^T\overline{\textbf{x}}\hspace{2.8cm} \nonumber \\ \textbf{u} = & W'^T W^T\overline{\textbf{x}} \hspace{2cm} \nonumber \\ \textbf{y} = & \mathbb{S}\textrm{oftmax}\left( W'^T W^T\overline{\textbf{x}}\right) \hspace{0cm} \nonumber \\ \mathcal{L} = & -u_{j^*} + \log \sum_i \exp{(u_i)} \hspace{0cm} \nonumber \\ \frac{\partial\mathcal{L}}{\partial W'} = & (W^T\overline{\textbf{x}}) \otimes \textbf{e} \hspace{2.0cm} \nonumber\\ \frac{\partial \mathcal{L}}{\partial W} = & \overline{\textbf{x}}\otimes(W'\textbf{e}) \hspace{2.0cm} \nonumber \end{eqnarray}$$Then, we apply gradient descent to update the weights:
$$\begin{eqnarray}
W_{\textrm{new}} = W_{\textrm{old}} - \eta \frac{\partial \mathcal{L}}{\partial W} \nonumber \\
W'_{\textrm{new}} = W'_{\textrm{old}} - \eta \frac{\partial \mathcal{L}}{\partial W'} \nonumber \\
\end{eqnarray}$$
Does it seem complicated? Not really: each equation above can be coded up in a single line of Python. In fact, right below is a function that solves all the equations for the CBOW model:
def cbow(context, label, W1, W2, loss):
It needs, as input, numpy arrays of context words and labels as well as the weights and the current value of the loss function. As output, it returns the updated values of the weights and loss.
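A NumPy sketch of such a cbow step, following the equations of the previous section line by line (the learning rate `eta` and the internal variable names are my own additions, not the original listing):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / e.sum()

def cbow(context, label, W1, W2, loss, eta=0.1):
    """One CBOW training step: forward pass, loss, gradients, weight update.

    context: (C, V) one-hot context words; label: (V,) one-hot center word.
    W1: (V, N) input->hidden weights; W2: (N, V) hidden->output weights.
    """
    x_bar = np.mean(context, axis=0)         # averaged context vector
    h = W1.T @ x_bar                         # h = W^T x_bar
    u = W2.T @ h                             # u = W'^T h
    y = softmax(u)                           # predicted distribution
    loss += -u[np.argmax(label)] + np.log(np.sum(np.exp(u)))
    e = y - label                            # output error
    dW2 = np.outer(h, e)                     # dL/dW' = h (x) e
    dW1 = np.outer(x_bar, W2 @ e)            # dL/dW  = x_bar (x) (W' e)
    W1 = W1 - eta * dW1
    W2 = W2 - eta * dW2
    return W1, W2, loss
```

Repeatedly feeding the same (context, label) pair through this function makes the per-step loss shrink, which is a quick sanity check that the gradients have the right sign.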
Here is an example on how to use this function:
# user-defined parameters
Training example #0
As a further note, the output shown above represents a single epoch, i.e., a single pass through the training examples. In order to train a full model it is necessary to iterate across several epochs. We will see this in section 7.
In the Skip-gram model, the inputs are represented by the center words and the labels by the context words. The full set of equations that we need to solve is the following (see the section on the Skip-gram model in my previous blog post):
$$\begin{eqnarray}
\textbf{h} = & W^T\textbf{x} \hspace{9.5cm} \nonumber \\
\textbf{u}_c = & W'^T\textbf{h}=W'^TW^T\textbf{x} \hspace{3cm} \hspace{2cm} c=1, \dots , C \nonumber \\
\textbf{y}_c = & \ \ \mathbb{S}\textrm{oftmax}(\textbf{u})= \mathbb{S}\textrm{oftmax}(W'^TW^T\textbf{x}) \hspace{2.3cm} c=1, \dots , C \nonumber \\
\mathcal{L} = & -\sum_{c=1}^C u_{c,j^*} + \sum_{c=1}^C \log \sum_{j=1}^V \exp(u_{c,j}) \hspace{5cm} \nonumber \\
\frac{\partial\mathcal{L}}{\partial W'} = & (W^T\textbf{x}) \otimes \sum_{c=1}^C\textbf{e}_c \hspace{7.7cm} \nonumber \\
\frac{\partial \mathcal{L}}{\partial W} = & \textbf{x}\otimes\left(W'\sum_{c=1}^C\textbf{e}_c\right) \hspace{7.2cm} \nonumber
\end{eqnarray}$$
The update equation for the weights is the same as for the CBOW model,
$$\begin{eqnarray}
W_{\textrm{new}} = W_{\textrm{old}} - \eta \frac{\partial \mathcal{L}}{\partial W} \nonumber \\
W'_{\textrm{new}} = W'_{\textrm{old}} - \eta \frac{\partial \mathcal{L}}{\partial W'} \nonumber \\
\end{eqnarray}$$
Below is the Python code that solves the equations for the Skip-gram model,
def skipgram(context, x, W1, W2, loss):
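A NumPy sketch of such a skipgram step, following the equations above (again, `eta` and the internal names are mine):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / e.sum()

def skipgram(context, x, W1, W2, loss, eta=0.1):
    """One Skip-gram training step.

    x: (V,) one-hot center word (the input); context: (C, V) one-hot
    context words (the labels). W1: (V, N); W2: (N, V).
    """
    h = W1.T @ x                              # h = W^T x
    u = W2.T @ h                              # scores, identical for every c
    y = softmax(u)
    C = context.shape[0]
    loss += -np.sum(u[np.argmax(context, axis=1)]) \
            + C * np.log(np.sum(np.exp(u)))
    e_sum = C * y - context.sum(axis=0)       # sum_c e_c = sum_c (y - label_c)
    dW2 = np.outer(h, e_sum)
    dW1 = np.outer(x, W2 @ e_sum)
    W1 = W1 - eta * dW1
    W2 = W2 - eta * dW2
    return W1, W2, loss
```

Because the scores $\textbf{u}$ are the same for every context position $c$, the $C$ error vectors can be summed before computing the gradients, which is what `e_sum` does.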
The function skipgram defined above is used similarly to the cbow function, except that now the center words are no longer the “labels” but are actually the inputs, and the labels are represented by the context words. Using the same user-defined parameters as for the CBOW case, we can run the same example with the following code,
for i, (label, center) in enumerate(corpus2io(corpus_tokenized, V, window_size)):
Training example #0
I am not going to give a full discussion on how to implement the CBOW and Skip-gram models in Tensorflow. I am only going to show you the actual code.
First, we define our usual parameters. This time we make things a little easier: instead of starting from a text corpus, we predefine our own context and center words (as well as the weights),
learning_rate = 0.1
And now, we are in a position to implement our word2vec neural networks in Tensorflow. Let’s start with implementing CBOW,
with tf.name_scope("cbow"):
Now we move to the Skip-gram implementation,
with tf.name_scope("skipgram"):
As you can see, the implementation is fairly straightforward even though it did take me some time to get it right.
Although I am not going to show this here, in word2veclite I have added some unit tests that demonstrate how the Python code that we have seen in the previous sections gives virtually identical results to the Tensorflow implementation of this section.
All the code in this blogpost can be put together as a general framework to train word2vec models. As you will know by now, I have already done this “aggregation” in the Python project word2veclite.
Using word2veclite is easy. Let’s go back to our example and suppose that the text corpus consists of the sentence “I like playing football with my friends”. Let’s also assume that we want to define the context words with a window of size 1 and a hidden layer of size 2. Finally, say that we want to train the model for 600 epochs.
In order to train a CBOW model, we can simply type
from word2veclite import Word2Vec
and then we can plot loss_vs_epoch, which tells us how the loss function varies as a function of epoch number:
As you can see, the loss function keeps decreasing, approaching zero. It does look like the model is working! I am not sure about you, but when I saw this plot I was really curious to see if the weights that the model had learnt were really good at predicting the center words given the context words.
I have added a prediction step in word2veclite that we can use for this purpose. Let’s use it and check, for example, that for the context word “like” (i.e., [0, 1, 0, 0, 0, 0, 0]) the model predicts that the center word is “I” (i.e., [1, 0, 0, 0, 0, 0, 0]).
x = np.array([[0, 1, 0, 0, 0, 0, 0]])
As you can see, the prediction step uses the weights W1 and W2 that are learnt by the model. The print statement above outputs the following line:
prediction_cbow = [9.862e-01, 8.727e-03, 7.444e-09, 5.070e-03, 5.565e-13, 6.154e-10, 7.620e-14]
Let’s try a different example. We take as context words the words “I” and “playing” and we hope to see “like” as the model prediction.
x = np.array([[1, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0]])
This time the code outputs:
prediction_cbow = [1.122e-02, 9.801e-01, 8.636e-03, 1.103e-06, 1.118e-06, 1.022e-10, 2.297e-10]
Let’s now move to training a Skip-gram model with the same parameters as before. We can do this by simply changing method="cbow" into method="skipgram",
from word2veclite import Word2Vec
Once again, the model seems to have converged and learnt its weights. To check for this, we look at two more examples. For the center word “I” we expect to see “like” as the context word,
x = np.array([[1, 0, 0, 0, 0, 0, 0]])
and this is our prediction:
prediction_skipgram = [1.660e-03, 9.965e-01, 3.139e-06, 1.846e-03, 1.356e-09, 3.072e-08, 3.146e-10]
Our final example now. For the center word “like” we expect the model to predict the words “I” and “playing” with equal probability.
x = np.array([[0, 1, 0, 0, 0, 0, 0]])
prediction_skipgram = [4.985e-01, 9.816e-04, 5.002e-01, 2.042e-08, 3.867e-04, 3.767e-11, 1.956e-07]
This is it for now. I hope you enjoyed the reading!
⁂
Alright then, let’s start from what you need in order to really understand backpropagation. Aside from some basic concepts in machine learning (like knowing what a loss function is and how the gradient descent algorithm works), from a mathematical standpoint there are two main ingredients that you will need:
If you know these concepts, what follows should be fairly easy. If you don’t master them yet, you should still be able to understand the reasoning behind the backpropagation algorithm. You may also want to <a href="/2018/01/07/backpropword2vecpython/">skip directly to the Python implementation</a> and come back to this blogpost when needed.
First of all, I would like to start by defining what the backpropagation algorithm really is. If this definition does not make much sense at first, it should become much clearer later on.
Given a neural network, the parameters that the algorithm needs to learn in order to minimize the loss function are the weights of the network (N.B. I use the term weights in a loose sense, including the bias terms). These are the only variables of the model, and they are tweaked at every iteration until we get close to the minimum of the loss function.
In this context, backpropagation is an efficient algorithm that is used to find the optimal weights of a neural network: those that minimize the loss function. The standard way of finding these values is by applying the gradient descent algorithm, which implies finding out the derivatives of the loss function with respect to the weights.
For trivial problems, those in which there are only two variables, it is easy to visualize how gradient descent works. If you look at the figure above, you will see in panel (a) a three-dimensional plot of the loss function as a function of the weights $w_1$ and $w_2$. At the beginning, we don’t know what their optimal values are, i.e., we don’t know which values of $w_1$ and $w_2$ minimize the loss function. Let’s say that we start from the red point. If we know how the loss function varies as we vary the value of the weights, i.e., if we know the derivatives $\partial\mathcal{L}/\partial w_1$ and $\partial\mathcal{L}/\partial w_2$, then we can move from the red point to a point closer to the minimum of the loss function, which is represented by a blue point in the figure. How far we move from the starting point is dictated by a parameter $\eta$, usually called the learning rate.
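A toy two-variable example makes this concrete. The loss $\mathcal{L}(w_1, w_2) = w_1^2 + w_2^2$ is my own choice for illustration; its gradient is $(2w_1, 2w_2)$, so each step moves against the gradient scaled by $\eta$:

```python
def gradient_descent(w1, w2, eta=0.1, steps=100):
    """Minimize the toy loss L(w1, w2) = w1**2 + w2**2 by gradient descent.

    dL/dw1 = 2*w1 and dL/dw2 = 2*w2, so each update moves both weights
    a fraction eta of the (negative) gradient toward the minimum at (0, 0).
    """
    for _ in range(steps):
        w1 -= eta * 2 * w1
        w2 -= eta * 2 * w2
    return w1, w2
```

Starting from any red point, say (3, -4), the iterates shrink geometrically toward the minimum at the origin.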
A big part of the backpropagation algorithm requires evaluating the derivatives of the loss function with respect to the weights. It is particularly instructive to see this for shallow neural networks, which is the case of word2vec.
The objective of word2vec is to find word embeddings, given a text corpus. In other words, this is a technique for finding low-dimensional representations of words. As a consequence, when we talk about word2vec we are typically talking about Natural Language Processing (NLP) applications.
For example, a word2vec model trained with a 3-dimensional hidden layer will result in 3-dimensional word embeddings. It means that, say, the word “apartment” will be represented by a three-dimensional vector of real numbers that will be close (think of it in terms of Euclidean distance) to a similar word such as “house”. Put another way, word2vec is a technique for mapping words to numbers. Amazing eh?
There are two main models used within the context of word2vec: the Continuous Bag-of-Words (CBOW) and the Skip-gram model. We will first look at the simplest model, the Continuous Bag-of-Words (CBOW) model with a one-word window. We will then move to the multi-word window case and finally introduce the Skip-gram model.
As we move along, I will present a few small examples with a text composed of only a few words. However, keep in mind that word2vec is typically trained with billions of words.
In the CBOW model the objective is to find a target word $\hat{y}$, given a context of words (what this means will be clear soon). In the simplest case in which the word’s context is only represented by a single word, the neural network for the CBOW model looks like in the figure below,
As you can see, there is one input layer, one hidden layer and finally an output layer. The activation function for the hidden layer is the identity $a=\mathbb{1}$ (usually, although improperly, called linear activation function). The activation function for the output is a softmax, $a=\mathbb{S}\textrm{oftmax}$.
The input layer is represented by a one-hot encoded vector $\textbf{x}$ of dimension V, where V is the size of the vocabulary. The hidden layer is defined by a vector of dimension N. Finally, the output layer is a vector of dimension V.
Now, let’s talk about weights: the weights between the input and the hidden layer are represented by a matrix $W$, of dimension $V\times N$. Similarly, the weights between the hidden and the output layer are represented by a matrix $W'$, of dimension $N\times V$. For example, as in the figure, the relationship between an element $x_k$ of the input layer and an element $h_i$ of the hidden layer is represented by the weight $W_{ki}$. The connection between this node $h_i$ and an element $y_j$ of the output layer is represented by an element $W'_{ij}$.
The output vector $\textbf{y}$ will need to be compared against the expected targets $\hat{\textbf{y}}$. The closer $\textbf{y}$ is to $\hat{\textbf{y}}$, the better the performance of the neural network (and the lower the loss function).
If some of what you read so far sounds confusing, that’s because it always is! Have a look at the example below, it might clear things up!
Given the topology of the network in Figure 1, let’s write down how to find the values of the hidden layer and of the output layer, given the input data $\textbf{x}$:
$$\begin{eqnarray*} \textbf{h} = & W^T\textbf{x} \hspace{7.0cm} \\ \textbf{u}= & W'^T\textbf{h}=W'^TW^T\textbf{x} \hspace{4.6cm} \\ \textbf{y}= & \ \ \mathbb{S}\textrm{oftmax}(\textbf{u})= \mathbb{S}\textrm{oftmax}(W'^TW^T\textbf{x}) \hspace{2cm} \end{eqnarray*}$$where, following [1], $\textbf{u}$ is the value of the output before applying the $\mathbb{S}\textrm{oftmax}$ function.
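As a sanity check of these three equations, here is a minimal NumPy sketch of the forward pass. The dimensions and the random weights are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3                       # vocabulary size, hidden-layer size (made up)
W = rng.standard_normal((V, N))   # input -> hidden weights
Wp = rng.standard_normal((N, V))  # hidden -> output weights (W' in the text)

x = np.zeros(V)
x[2] = 1.0                        # one-hot encoded context word

h = W.T @ x                       # h = W^T x (just row 2 of W here)
u = Wp.T @ h                      # scores before the softmax: u = W'^T h
y = np.exp(u) / np.exp(u).sum()   # softmax output

print(y)                          # a probability vector over the vocabulary
```

Note how the one-hot input simply selects one row of $W$: this is why the rows of $W$ end up being the word embeddings.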
Now, let’s say that we are training the model against the target-context word pair ($w_t, w_c$). The target word represents the ideal prediction from our neural network model, given the context word $w_c$. It is represented by a one-hot encoded vector. Say that it has value 1 at the position $j^*$ (and the value 0 for any other position).
The loss function will need to evaluate the output layer at the position $j^*$, or $y_{j^*}$ (the ideal value being equal to 1). Since the values in the softmax can be interpreted as conditional probabilities of the target word, given the context word $w_c$, we write the loss function as
$$\begin{equation*} \mathcal{L} = -\log \mathbb{P}(w_t|w_c)=-\log y_{j^*}\ =-\log[\mathbb{S}\textrm{oftmax}(u_{j^*})]=-\log\left(\frac{\exp{u_{j^*}}}{\sum_i \exp{u_i}}\right), \end{equation*}$$where, as is standard, we have taken the negative logarithm of the probability. From the previous expression we obtain
$$\begin{equation}
\bbox[lightblue,5px,border:2px solid red]{
\mathcal{L} = -u_{j^*} + \log \sum_i \exp{(u_i)}.
}
\label{eq:loss}
\end{equation}$$
The loss function (\ref{eq:loss}) is the quantity we want to minimize, given our training example, i.e., we want to maximize the probability that our model predicts the target word, given our context word.
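A quick numerical check may help here: the negative-log-softmax loss $-u_{j^*}+\log\sum_i\exp(u_i)$ is exactly the negative log of the softmax output evaluated at the target position. A small sketch with an arbitrary score vector:

```python
import numpy as np

u = np.array([1.0, -0.5, 2.0, 0.3])   # arbitrary pre-softmax scores (made up)
j_star = 2                            # position of the target word

softmax = np.exp(u) / np.exp(u).sum()
loss = -u[j_star] + np.log(np.exp(u).sum())   # the boxed loss above

# identical to the negative log-probability of the target word
print(np.isclose(loss, -np.log(softmax[j_star])))
```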
Now that we have an expression for the loss function, eq. (\ref{eq:loss}), we want to find the values of $W$ and $W'$ that minimize it. In the machine learning lingo, we want our model to “learn” the weights.
As we have seen in Section 1, in the neural networks world this optimization problem is usually tackled using gradient descent. Figure 1 shows how, in order to apply this method and update the weight matrices $W$ and $W'$, we need to find the derivatives $\partial \mathcal{L}/\partial{W}$ and $\partial \mathcal{L}/\partial{W'}$.
I believe the easiest way to understand how to do this is to simply write down the relationship between the loss function and $W$, $W'$. Looking again at eq. (\ref{eq:loss}), it is clear that the loss function depends on the weights $W$ and $W'$ through the variable $\textbf{u}=[u_1, u_2, \dots, u_V]$, or
$$\begin{equation*} \mathcal{L} = \mathcal{L}(\mathbf{u}(W,W'))=\mathcal{L}(u_1(W,W'), u_2(W,W'),\dots, u_V(W,W'))\ . \end{equation*}$$The derivatives then simply follow from the chain rule for multivariate functions,
$$\begin{equation} \frac{\partial\mathcal{L}}{\partial W'_{ij}} = \sum_{k=1}^V\frac{\partial\mathcal{L}}{\partial u_k}\frac{\partial u_k}{\partial W'_{ij}} \label{eq:dLdWp} \end{equation}$$and
$$\begin{equation} \frac{\partial\mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^V\frac{\partial\mathcal{L}}{\partial u_k}\frac{\partial u_k}{\partial W_{ij}}\ . \label{eq:dLdW} \end{equation}$$And that’s it, this is pretty much the backpropagation algorithm! At this point we just need to specify eqs. (\ref{eq:dLdWp}) and (\ref{eq:dLdW}) for our use case.
Let’s start from eq. (\ref{eq:dLdWp}). Note that the weight $W’_{ij}$, which is an element of the matrix $W’$ and connects a node $i$ of the hidden layer to a node $j$ of the output layer, only affects the output score $u_j$ (and also $y_j$), as seen in the panel (a) of the figure below,
Hence, among all the derivatives $\partial u_k/\partial W'_{ij}$, only the one where $k=j$ will be different from zero. In other words,
$$\begin{equation}
\frac{\partial\mathcal{L}}{\partial W'_{ij}} = \frac{\partial\mathcal{L}}{\partial u_j}\frac{\partial u_j}
{\partial W'_{ij}}
\label{eq:derivative#1}
\end{equation}$$
Let’s now calculate $\partial \mathcal{L}/\partial u_j$. We have
$$\begin{equation}
\frac{\partial\mathcal{L}}{\partial u_j} = -\delta_{jj^*} + y_j := e_j
\label{eq:term#1}
\end{equation}$$
where $\delta_{jj^*}$ is a Kronecker delta: it is equal to 1 if $j=j^*$, otherwise it is equal to zero. In eq. (\ref{eq:term#1}) we have introduced the vector $\textbf{e} \in \mathbb{R}^V$, which is used to reduce the notational complexity. This vector represents the difference between the predicted output and the target (label), i.e., it is the prediction error vector.
For the second term on the right hand side of eq. (\ref{eq:derivative#1}), we have
$$\begin{equation}
\frac{\partial u_j}{\partial W'_{ij}} = h_i = \sum_{k=1}^V W_{ki}x_k
\label{eq:term#2}
\end{equation}$$
After inserting eqs. (\ref{eq:term#1}) and (\ref{eq:term#2}) into eq. (\ref{eq:derivative#1}), we obtain
$$\begin{equation}
\bbox[white,5px,border:2px dotted red]{
\frac{\partial\mathcal{L}}{\partial W'_{ij}} = (-\delta_{jj^*} + y_j) \left(\sum_{k=1}^V W_{ki}x_k\right)
}
\label{eq:backprop1}
\end{equation}$$
We can go through a similar exercise for the derivative $\partial\mathcal{L}/\partial W_{ij}$, however this time we note that after fixing the input $x_k$, the output $y_j$ at node $j$ depends on all the elements of the matrix $W$ that are connected to the input, as seen in Figure 3b. Therefore, this time we have to retain all the elements in the sum. Before going into the evaluation of $\partial u_k/\partial W_{ij}$, it is useful to write down the expression for the element $u_k$ of the vector $\textbf{u}$ as
$$\begin{equation*}
u_k = \sum_{m=1}^N\sum_{l=1}^VW'_{mk}W_{lm}x_l\ .
\end{equation*}$$
From this equation it is then easy to write down $\partial u_k/\partial W_{ij}$, since the only term that survives the differentiation is the one in which $l=i$ and $m=j$, or
$$\begin{equation}
\frac{\partial u_k}{\partial W_{ij}} = W'_{jk}x_i\ .
\label{eq:term#3}
\end{equation}$$
Finally, substituting eqs. (\ref{eq:term#1}) and (\ref{eq:term#3}) into eq. (\ref{eq:dLdW}), we get our well deserved result:
$$\begin{equation}
\bbox[white,5px,border:2px dotted red]{
\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^V (-\delta_{kk^*}+y_k)W'_{jk}x_i
}
\label{eq:backprop2}
\end{equation}$$
Now that we have eqs. (\ref{eq:backprop1}) and (\ref{eq:backprop2}), we have all the ingredients needed to go through our first “learning” iteration, that consists in applying the gradient descent algorithm. This should have the effect of getting us a little closer to the minimum of the loss function. In order to apply gradient descent, after fixing a learning rate $\eta>0$ we can update the values of the weight vectors by using the equations
$$\begin{eqnarray}
W_{\textrm{new}} & = W_{\textrm{old}} - \eta \frac{\partial \mathcal{L}}{\partial W} \nonumber \\
W'_{\textrm{new}} & = W'_{\textrm{old}} - \eta \frac{\partial \mathcal{L}}{\partial W'} \nonumber \\
\end{eqnarray}$$
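Putting eqs. (\ref{eq:backprop1}), (\ref{eq:backprop2}) and the update rule together, here is a hedged NumPy sketch of a single gradient-descent step for the one-word model. The dimensions, learning rate and word indices are made up; the error vector is $e_j = y_j - \delta_{jj^*}$ as defined above:

```python
import numpy as np

rng = np.random.default_rng(1)
V, N, eta = 6, 3, 0.1
W = rng.standard_normal((V, N)) * 0.1   # input -> hidden weights
Wp = rng.standard_normal((N, V)) * 0.1  # hidden -> output weights (W')

x = np.zeros(V)
x[0] = 1.0            # one-hot context word (made-up index)
j_star = 4            # index of the target word (made up)

def forward(W, Wp):
    h = W.T @ x
    u = Wp.T @ h
    y = np.exp(u) / np.exp(u).sum()
    return h, u, y

h, u, y = forward(W, Wp)
loss_before = -u[j_star] + np.log(np.exp(u).sum())

e = y.copy()
e[j_star] -= 1.0            # prediction error: e_j = y_j - delta_{j j*}
dWp = np.outer(h, e)        # dL/dW'_{ij} = e_j h_i          (N x V)
dW = np.outer(x, Wp @ e)    # dL/dW_{ij} = x_i sum_k e_k W'_{jk}   (V x N)

W -= eta * dW               # gradient-descent update
Wp -= eta * dWp

h, u, y = forward(W, Wp)
loss_after = -u[j_star] + np.log(np.exp(u).sum())
print(loss_after < loss_before)   # the small step reduces the loss
```

Note that, because $\textbf{x}$ is one-hot, $\partial\mathcal{L}/\partial W$ is non-zero only on the row corresponding to the context word: only one embedding gets updated per training pair.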
What we have seen so far is only one small step of the entire optimization process. In particular, up to this point we have only trained the neural network with one single training example. In order to conclude the first pass, we have to go through all our training examples. Once we do this, we will have gone through a full epoch of optimization. At that point, most likely we will need to start again iterating through all our training data, until we reach a point in which we don’t observe big changes in the loss function. At that point, we can stop and declare that our neural network has been trained!
We know at this point how the backpropagation algorithm works for the one-word word2vec model. It is time to add an extra complexity by including more context words. Figure 4 shows how the neural network now looks. The input is given by a series of one-hot encoded context words; their number $C$ depends on our choice of how many context words to use. The hidden layer becomes an average of the values obtained from each context word.
The equations of the multi-word CBOW model are a generalization of the ones for the single-word CBOW model that we have already seen,
$$\begin{eqnarray}
\textbf{h} = & \frac{1}{C} W^T \sum_{c=1}^C\textbf{x}^{(c)} = W^T\overline{\textbf{x}}\hspace{5.8cm} \nonumber \\
\textbf{u}= & W'^T\textbf{h}= \frac{1}{C}\sum_{c=1}^CW'^T W^T\textbf{x}^{(c)}=W'^T W^T\overline{\textbf{x}} \hspace{2.8cm} \nonumber \\
\textbf{y}= & \ \ \mathbb{S}\textrm{oftmax}(\textbf{u})= \mathbb{S}\textrm{oftmax}\left( W'^T W^T\overline{\textbf{x}}\right) \hspace{3.6cm} \nonumber
\end{eqnarray}$$
Note that, for convenience, in the definitions above we have defined an “average” input vector $\overline{\textbf{x}}=\sum_{c=1}^C\textbf{x}^{(c)}/C$.
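The average input vector is easy to picture with a small sketch (the context-word indices here are made up):

```python
import numpy as np

V, C = 6, 3
contexts = [0, 2, 5]        # indices of the C one-hot context words (made up)

X = np.zeros((C, V))
for c, idx in enumerate(contexts):
    X[c, idx] = 1.0         # one-hot encode each context word

x_bar = X.mean(axis=0)      # the average input vector used in the equations
print(x_bar)                # entries 1/C at the context positions, 0 elsewhere
```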
As before, in order to apply the backpropagation algorithm we need to write down the loss function and then look at its dependencies. The loss function looks just as before,
$$\begin{equation}
\mathcal{L} = -\log\mathbb{P}(w_o|w_{c,1},w_{c,2},\dots,w_{c,C})=-u_{j^*} + \log \sum_i \exp{(u_i)}.
\end{equation}$$
We write again the chain rules (identical to the previous ones),
$$\begin{equation}
\frac{\partial\mathcal{L}}{\partial W'_{ij}} = \sum_{k=1}^V\frac{\partial\mathcal{L}}{\partial u_k}\frac{\partial u_k}{\partial W'_{ij}}
\end{equation}$$
and
$$\begin{equation}
\frac{\partial\mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^V\frac{\partial\mathcal{L}}{\partial u_k}\frac{\partial u_k}{\partial W_{ij}}\ .
\end{equation}$$
The derivatives of the loss function with respect to the weights are the same as for the single-word CBOW model, provided that we replace the input vector with the average input vector. For completeness, let’s derive these equations starting with the derivative with respect to $W'_{ij}$
$$\begin{equation}
\frac{\partial\mathcal{L}}{\partial W'_{ij}} = \sum_{k=1}^V\frac{\partial\mathcal{L}}{\partial u_k}\frac{\partial u_k}{\partial W'_{ij}} = \frac{\partial\mathcal{L}}{\partial u_j}\frac{\partial u_j}{\partial W'_{ij}} = (-\delta_{jj^*} + y_j) \left(\sum_{k=1}^V W_{ki}\overline{x}_k\right)
\end{equation}$$
and then writing the derivative with respect to $W_{ij}$,
$$\begin{equation}
\frac{\partial\mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^V\frac{\partial\mathcal{L}}{\partial u_k}\frac{\partial}{\partial W_{ij}}\left(\frac{1}{C}\sum_{m=1}^N\sum_{l=1}^V W'_{mk}\sum_{c=1}^C W_{lm}x_l^{(c)}\right)=\frac{1}{C}\sum_{k=1}^V\sum_{c=1}^C(-\delta_{kk^*} + y_k)W'_{jk}x_i^{(c)} .
\end{equation}$$
Summarizing, for the multiword CBOW model we have
$$\begin{equation}
\bbox[white,5px,border:2px dotted red]{
\frac{\partial\mathcal{L}}{\partial W'_{ij}} = (-\delta_{jj^*} + y_j) \left(\sum_{k=1}^V W_{ki}\overline{x}_k\right)
}
\label{eq:backprop1_multi}
\end{equation}$$
and
$$\begin{equation}
\bbox[white,5px,border:2px dotted red]{
\frac{\partial\mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^V(-\delta_{kk^*} + y_k)W'_{jk}\overline{x}_i .
}
\label{eq:backprop2_multi}
\end{equation}$$
The Skip-gram model is essentially the inverse of the CBOW model: the input is the center word and the output is a prediction of its context words. The resulting neural network looks like in the figure below,
The equations of the Skip-gram model are the following,
$$\begin{eqnarray*}
\textbf{h} = & W^T\textbf{x} \hspace{9.4cm} \\
\textbf{u}_c= & W'^T\textbf{h}=W'^TW^T\textbf{x} \hspace{4cm} c=1, \dots, C \hspace{0.7cm}\\
\textbf{y}_c = & \ \ \mathbb{S}\textrm{oftmax}(\textbf{u}_c)= \mathbb{S}\textrm{oftmax}(W'^TW^T\textbf{x}) \hspace{2cm} c=1, \dots, C
\end{eqnarray*}$$
Note that the output vectors (as well as the vectors $\textbf{u}_c$) are all identical, $\mathbf{y}_1=\mathbf{y}_2\dots= \mathbf{y}_C$. The loss function for the Skip-gram model looks like this:
$$\begin{eqnarray*}
\mathcal{L} = -\log \mathbb{P}(w_{c,1}, w_{c,2}, \dots, w_{c,C}|w_o)=-\log \prod_{c=1}^C \mathbb{P}(w_{c,j_c^*}|w_o) \\ = -\log \prod_{c=1}^C \frac{\exp(u_{c,j_c^*})}{\sum_{j=1}^V \exp(u_{c,j})} =-\sum_{c=1}^C u_{c,j_c^*} + \sum_{c=1}^C \log \sum_{j=1}^V \exp(u_{c,j})
\end{eqnarray*}$$
For the Skipgram model, the loss function depends on $C\times V$ variables via
$$\begin{equation*}
\mathcal{L} = \mathcal{L}(\mathbf{u_1}(W,W'), \mathbf{u_2}(W,W'), \dots, \mathbf{u_C}(W,W'))=\mathcal{L}(u_{1,1}(W,W'), u_{1,2}(W,W'), \dots, u_{C,V}(W,W'))
\end{equation*}$$
Therefore in this case the chain rule reads
$$\begin{equation*}
\frac{\partial\mathcal{L}}{\partial W'_{ij}} = \sum_{k=1}^V\sum_{c=1}^C\frac{\partial\mathcal{L}}{\partial u_{c,k}}\frac{\partial u_{c,k}}{\partial W'_{ij}}
\end{equation*}$$
and
$$\begin{equation*}
\frac{\partial\mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^V\sum_{c=1}^C\frac{\partial\mathcal{L}}{\partial u_{c,k}}\frac{\partial u_{c,k}}{\partial W_{ij}}\ .
\end{equation*}$$
Let’s now calculate $\partial \mathcal{L}/\partial u_{c,j}$. We have
$$\begin{equation*}
\frac{\partial\mathcal{L}}{\partial u_{c,j}} = -\delta_{jj_c^*} + y_{c,j} := e_{c,j}
\end{equation*}$$
Similarly as for the CBOW model, we get
$$\begin{equation*}
\frac{\partial\mathcal{L}}{\partial W'_{ij}} = \sum_{k=1}^V\sum_{c=1}^C\frac{\partial\mathcal{L}}{\partial u_{c,k}}\frac{\partial u_{c,k}}{\partial W'_{ij}} = \sum_{c=1}^C\frac{\partial\mathcal{L}}{\partial u_{c,j}}\frac{\partial u_{c,j}}{\partial W'_{ij}} = \sum_{c=1}^C(-\delta_{jj_c^*} + y_{c,j}) \left(\sum_{k=1}^V W_{ki}x_k\right)
\end{equation*}$$
The derivative with respect to $W_{ij}$ is the most complicated one, but still feasible:
$$\begin{equation*}
\frac{\partial\mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^V\sum_{c=1}^C\frac{\partial\mathcal{L}}{\partial u_{c,k}}\frac{\partial}{\partial W_{ij}}\left(\sum_{m=1}^N\sum_{l=1}^V W'_{mk} W_{lm}x_l\right)=\sum_{k=1}^V\sum_{c=1}^C (-\delta_{kk_c^*} + y_{c,k})W'_{jk}x_i .
\end{equation*}$$
Summarizing, for the Skip-gram model we have
$$\begin{equation}
\bbox[white,5px,border:2px dotted red]{
\frac{\partial\mathcal{L}}{\partial W'_{ij}} = \sum_{c=1}^C(-\delta_{jj_c^*} + y_{c,j}) \left(\sum_{k=1}^V W_{ki}x_k\right)
}
\label{eq:backprop1_skip}
\end{equation}$$
and
$$\begin{equation}
\bbox[white,5px,border:2px dotted red]{
\frac{\partial\mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^V\sum_{c=1}^C (-\delta_{kk_c^*} + y_{c,k})W'_{jk}x_i .
}
\label{eq:backprop2_skip}
\end{equation}$$
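Since the $\textbf{u}_c$ are all identical, the $C$ error vectors simply add up. A hedged NumPy sketch of the Skip-gram gradients (dimensions and word indices are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
V, N, C = 6, 3, 2
W = rng.standard_normal((V, N)) * 0.1
Wp = rng.standard_normal((N, V)) * 0.1

x = np.zeros(V)
x[1] = 1.0              # one-hot center word (made-up index)
targets = [3, 5]        # j*_c for each of the C context positions (made up)

h = W.T @ x
u = Wp.T @ h            # u_c is identical for every c
y = np.exp(u) / np.exp(u).sum()

# total error: sum over contexts of (y - one_hot(j*_c))
E = C * y.copy()
for j in targets:
    E[j] -= 1.0

dWp = np.outer(h, E)        # eq. (backprop1_skip)
dW = np.outer(x, Wp @ E)    # eq. (backprop2_skip)
print(dWp.shape, dW.shape)
```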
We have seen in detail how the backpropagation algorithm works for the word2vec use case. However, it turns out that the implementations we have seen are not computationally efficient for large text corpora. The original paper [2] introduced some “tricks” useful to overcome this difficulty (Hierarchical Softmax and Negative Sampling), but I am not going to cover them here. You can find a good explanation in [1].
Despite being computationally inefficient, the implementations discussed here contain everything that is needed to train word2vec neural networks. The next step would then be to implement these equations in your favourite programming language. If you like Python, I have already implemented these equations for you: I discuss them in <a href=/2018/01/07/backpropword2vecpython/>my next blogpost</a>. Maybe I’ll see you there!
[1] X. Rong, word2vec Parameter Learning Explained, arXiv:1411.2738 (2014).
[2] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781 (2013).
⁂
Before starting, it is worth mentioning that the topic of A/B testing has seen increased interest over the past few years (so… well done for reading this blog post!). This is obvious from the figure below, showing how the popularity of the search query “AB Testing” in Google Trends has grown linearly for at least the past five years.
Truth is, every web company with a big enough basin of users does A/B testing (or, at least, it should), starting from the big giants like Google, Amazon or Facebook. The general public learnt (and quickly forgot) about A/B testing when in 2014 Facebook released a paper showing a contagion effect of emotion on its News Feed [1], which generated a wave of indignation across the web (for example, see this article from The Guardian).
For anyone interested in introducing these optimization practices (i.e., A/B testing) in their company, there are commercial tools that can help. Some of the most popular ones are Optimizely and Virtual Website Optimizer (VWO). These two tools follow two different viewpoints for doing inferential statistics: Optimizely uses a Frequentist approach, while VWO uses a Bayesian approach.
Here I am not going to digress on the differences between Frequentism and Bayesianism (personally I don’t have a strong preference for one or the other). From now on, we will simply deep dive into the A/B testing world, as seen by a Bayesian.
The main steps needed for doing Bayesian A/B testing are three:
1. Collect the data for the experiment;
2. Compare the different variants by applying Bayes’ Theorem;
3. Decide whether or not the experiment has reached a statistically significant result and can be stopped.
These three steps are visualized in the flow chart below, which gives a more complete view of the different choices that a practitioner needs to make. The meaning of these options (analytic/MCMC solution and ROPE/Expected Loss decision rule) will become clear as you keep reading this blog post.
And now, let’s discuss each of these steps individually.
This is the “engineering” step of the lot. As such, I am only going to briefly touch on it.
It is obvious that collecting data is the first thing that should be developed in the experimental pipeline. Users should be randomized in the “A” and “B” buckets (often called the “Control” and “Treatment” buckets). The client (the browser) should send events, typically click events, to a specific endpoint that should be accessible for our analysis. The endpoint could be a database hosted on the backend, or more advanced solutions that are possible with cloud services such as AWS.
If I were to start from scratch I would consider using a backend tool such as Facebook PlanOut for randomizing the users and serving the different variants to the client. I would then use a data warehouse such as Amazon Redshift for storing all the event logs.
It is important to make sure that users are always exposed to the same variant when they access the website during the experiment, otherwise the experiment itself would be invalidated. This requirement can be ensured by using cookies. Alternative solutions are possible if the users can be uniquely identified (for example if they are logged in on the website).
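One simple way to get such a “sticky” assignment, assuming each user carries a stable identifier (from a cookie or a login), is deterministic hashing. This is just a hedged sketch; the function name and the experiment label are mine:

```python
import hashlib

def assign_bucket(user_id: str, experiment: str = "exp-001") -> str:
    """Deterministically assign a user to 'A' or 'B' (50/50 split)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# the same user always lands in the same bucket, across visits and servers
print(assign_bucket("user-42") == assign_bucket("user-42"))
```

Seeding the hash with the experiment name means the same user can land in different buckets for different experiments, which avoids correlated assignments.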
After we begin collecting the data and for each click event we have at least logged the type of event (for example, if it is a click on a “sign in” or on a “register” button), the unique id of the user and the variation in which he/she was bucketed (let’s say A or B), we can start our analysis.
Every time Bayesian methods are applied, it is always useful to write down Bayes’ theorem. After all, it is so simple that it only requires a minimal amount of effort to remember it.
Assume that $H$ is some hypothesis to be tested, typically the parameters of our model (for our purposes, the conversion rates of our A and B buckets), while $\textbf{d}$ is the data collected during our experiment. Then Bayes’ theorem reads
$$\begin{equation}
\mathbb{P}(H|\textbf{d}) = \frac{\mathbb{P}(\textbf{d}|H)\,\mathbb{P}(H)}{\mathbb{P}(\textbf{d})}
\label{eq:Bayes}
\end{equation}$$
Normally, it is very easy to calculate the posterior distribution $\mathbb{P}(H|\textbf{d})$ up to its normalizing constant. In other words, it is usually easy to calculate the terms in the numerator of Bayes’ theorem. What is difficult is evaluating the evidence $\mathbb{P}(\textbf{d})$. And while we don’t always care about it, in many cases we do: it is only by knowing this normalizing constant that we can make the posterior distribution an actual probability distribution (one that integrates to one), which we can then use to calculate other quantities of interest (such as the moments of the distribution).
Unfortunately, for an arbitrary choice of the prior distribution $\mathbb{P}(H)$ it is normally only possible to calculate the posterior distribution — including its normalizing constant — through numerical calculations. In particular, it is common to use Markov Chain Monte Carlo methods. However, for specific types of models and for specific choices of the prior (so-called conjugate priors) it turns out that the posterior distribution can be calculated analytically. Luckily, it is possible to do so for the analysis of A/B experiments.
To summarize, for A/B testing we have two possible choices for finding out the posterior distribution: 1. analytic method; 2. numerical method. Let’s discuss them in some detail.
We start with the imports used throughout (adding the `matplotlib` import that the `rc` call needs):

```python
import numpy as np
import matplotlib
from scipy.stats import beta

matplotlib.rc('font', size=18)
```
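As a hedged sketch of the analytic route: with a Beta(1, 1) prior and Bernoulli (click/no-click) data, the posterior of each conversion rate is again a Beta distribution (conjugacy), obtained by simple counting. The conversion rates and sample sizes below are made up:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
clicks_a = rng.binomial(1, 0.04, size=5000)   # made-up data for bucket A
clicks_b = rng.binomial(1, 0.05, size=5000)   # made-up data for bucket B

# Beta(1, 1) prior + Bernoulli likelihood -> Beta posterior:
# Beta(1 + successes, 1 + failures)
post_a = beta(1 + clicks_a.sum(), 1 + len(clicks_a) - clicks_a.sum())
post_b = beta(1 + clicks_b.sum(), 1 + len(clicks_b) - clicks_b.sum())

print(post_a.mean(), post_b.mean())   # posterior means of the conversion rates
```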
At this point we can assume that, either analytically or numerically, we have found our posterior distribution. Okay… now what? Well, we need to apply a decision rule that will tell us whether we have a winner or not. Aren’t you curious to see how this works? Me too!
The third step in our flowchart above consists of applying a decision rule to our analysis: is our experiment conclusive? If so, who is the winner?
To answer this question, it is worth pointing out that Bayesian statistics is much less standardized than Frequentist statistics. Not just because Bayesian statistics is still fairly “young” but also very much because of its richness: Bayesian statistics gives us full access to the distribution of the parameters of interest. This means that there are potentially many different ways of making inference from our data. Nevertheless, the methods will probably become more and more standardized over time.
In terms of A/B testing, there seem to be two main approaches for decision making. The first one is based on the paper by John Kruschke at Indiana University, called “Bayesian estimation supersedes the t test” [2]. It is often cited as the BEST paper (yes, that’s called good marketing strategy ;) ). The decision rule used in this paper is based on the concept of Region Of Practical Equivalence (ROPE) that I discuss in Section 3.1.
The other possible route is the one that makes use of the concept of an Expected Loss. It has been proposed by Chris Stucchio [3] and I discuss it in Section 3.2.
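Both decision rules can be sketched with Monte Carlo draws from the two posteriors. This is a hedged illustration: the Beta parameters and thresholds are made up, and the central 95% interval is used as a simple stand-in for the HDI that Kruschke’s ROPE rule prescribes:

```python
import numpy as np

rng = np.random.default_rng(1)
# posterior samples for the two conversion rates (made-up Beta parameters)
a = rng.beta(200 + 1, 4800 + 1, size=100_000)
b = rng.beta(250 + 1, 4750 + 1, size=100_000)
lift = b - a

# ROPE-style rule: is the 95% interval of the lift entirely
# outside the region of practical equivalence (-0.005, 0.005)?
lo, hi = np.percentile(lift, [2.5, 97.5])
rope_conclusive = (lo > 0.005) or (hi < -0.005)

# Expected-loss rule: expected cost of shipping B when A is actually better
expected_loss_b = np.maximum(a - b, 0).mean()

print(rope_conclusive, expected_loss_b)
```

Under the expected-loss rule one would ship B once `expected_loss_b` drops below a threshold of caring (the `toc` parameter seen below).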
At this point we have all the ingredients that are needed to understand and analyze an A/B experiment through the package aByes.
After having downloaded and installed the package, we import aByes using the command
```python
import abyes as ab
```
Next, we generate some random data. For this example, we are going to assume that the B variant does better than the A variant by picking values from a Bernoulli distribution with mean values of 0.5 and 0.4, respectively:
```python
data = [np.random.binomial(1, 0.4, size=10000), np.random.binomial(1, 0.5, size=10000)]
```
The first element of this list is a numpy array of size 10000, each element being a sample drawn from a Bernoulli distribution with probability of success of 0.4. Similarly, the second element is another numpy array of size 10000, but this time the probability of success is 0.5.
The data list represents our experimental data for the A and B buckets. Now let’s see how we can apply the different methods previously discussed to do a Bayesian analysis of these experimental results.
```python
exp = ab.AbExp(method='analytic', decision_var='lift', rule='rope', rope=(-0.01, 0.01), alpha=0.95, plot=True)
```

```
*** abyes ***
```
```python
exp = ab.AbExp(method='analytic', decision_var='lift', rule='loss', toc=0.01, plot=True)
```

```
*** abyes ***
```
```python
exp = ab.AbExp(method='mcmc', decision_var='es', rule='rope', alpha=0.95, plot=True)
```

```
*** abyes ***
```
```python
exp = ab.AbExp(method='compare', decision_var='es', plot=True)
```
In this lengthy blog post, I have presented a detailed overview of Bayesian A/B testing: from the first step of gathering the data, to deciding whether to follow an analytic or numerical approach, to choosing the decision rule.
In the previous section, we have also seen some practical examples that make use of the Python package aByes.
Here are my suggestions for using aByes for conversion rate experiments:
Thanks for reading this post. And remember to keep using your t-tests and chi-square tests when needed! :)
[1] A. D. I. Kramer, J. E. Guillory, and J. T. Hancock, Experimental evidence of massive-scale emotional contagion through social networks, PNAS, 111, 8788 (2014).
[2] J. K. Kruschke, Bayesian Estimation Supersedes the t Test, Journal of Experimental Psychology: General, 142, 573 (2013).
[3] C. Stucchio, Bayesian A/B Testing at VWO (2015).
[4] D. Hitchcock, Lecture notes available for the course STAT J535, Introduction to Bayesian Data Analysis, University of South Carolina.
⁂
Here I will present a summary of what I have discovered during this time. The material presented is inspired by several sources, which are detailed in the “References” section at the end of this post. I will cite these sources as we proceed, together with identifying edge scenarios, test cases and example Scala code.
We will be looking at seven popular metrics: Precision, Recall, F1-measure, Average Precision, Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG).
Before starting, it is useful to write down a few definitions. Let us first assume that there are $U$ users. Each user $i$ ($i=1,\dots,U$) receives a set $R_i$ of recommendations of dimension $\rho_i$ (predictions vector). Out of all these recommendations, only a fraction of them will be actually useful (relevant) to the user. The set of all relevant items represents the “ground truth” or labels vector. Here it will be represented by the vector $G_i$, with dimension $\gamma_i$. In summary,
$$\begin{eqnarray} R_i = \{r_{i1},\dots,r_{i\rho_i}\} \\ G_i = \{g_{i1},\dots,g_{i\gamma_i}\} \nonumber \end{eqnarray}$$Whether a recommended document $r_{ij}\in R_i$ is relevant or not, i.e. whether or not it belongs to the ground truth set $G_i$, is denoted by an indicator function $\textrm{rel}_{ij}$ (“relevance”)
$$\textrm{rel}_{ij} = \left\{
\begin{eqnarray}
1 \ \ \textrm{if} \ r_{ij} \in G_i\nonumber \\
0 \ \ \textrm{otherwise}
\nonumber
\end{eqnarray}
\right.$$
Also, as usual I will denote with TP$_i$ the number of true positives for user $i$ (i.e., the number of recommended items that are indeed relevant), with FP$_i$ the number of false positives (i.e., the number of recommended items that are not relevant), and with FN$_i$ the number of false negatives (i.e., the number of relevant items that are not recommended).
At this point we have all the ingredients needed to properly define our metrics. For each metric, we will compute the results related to the test cases shown here below. These will be our “unit tests”, in that any code that implements these metrics (in the way defined here) should recover the same results.
Precision is the probability that a predicted item is relevant,
$$\begin{equation}
\textrm{P}_i@k = \frac{\textrm{TP}_i}{\textrm{TP}_i + \textrm{FP}_i} = \frac{\sum_{j=1}^{\min\{k,\rho_i\}}\textrm{rel}_{ij}}{k}
\label{eq:precision}
\end{equation}$$
Using the definition in eq. (\ref{eq:precision}), here are the precision values for the test cases above,
| User Test | Precision@1 | Precision@3 | Precision@5 |
| --- | --- | --- | --- |
| #1 | 1/1=1 | (1+1+0)/3=2/3=0.666 | (1+1+0+0+0)/5=2/5=0.400 |
| #2 | 0/1=0 | (0+1+0)/3=1/3=0.333 | (0+1+0+1+0)/5=2/5=0.400 |
| #3 | 0 | 0 | 0 |
| #4 | NaN | NaN | NaN |
| #5 | NaN | NaN | NaN |
| Mean Precision | (1+0+0)/3=1/3=0.333 | (2/3+1/3+0)/3=1/3=0.333 | (2/5+2/5+0)/3=4/15=0.267 |
```scala
private def precisionAtk(ats: Seq[Int], pred: Array[_ <: Int], label: Array[_ <: Int]) = (pred, label) match {
  // ...
}
```
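To complement the Scala version, here is a hedged Python sketch of Precision@k (the helper name and signature are mine), checked against test cases #1 and #2 from the table above:

```python
import math

def precision_at_k(rel, k):
    """rel: 0/1 relevance of the recommendations, in ranked order."""
    if not rel:
        return math.nan          # no recommendations -> undefined (NaN)
    return sum(rel[:k]) / k      # note: the denominator is k, as in eq. (precision)

rel_user1 = [1, 1, 0, 0, 0]      # test case #1 from the table
rel_user2 = [0, 1, 0, 1, 0]      # test case #2

print(precision_at_k(rel_user1, 3))  # 2/3
print(precision_at_k(rel_user2, 5))  # 2/5
```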
Recall is the probability that a relevant item (i.e., an item belonging to the labels vector) is recommended,
$$\begin{equation}
\textrm{R}_i@k = \frac{\textrm{TP}_i}{\textrm{TP}_i + \textrm{FN}_i} = \frac{\sum_{j=1}^{\min\{k,\rho_i\}}\textrm{rel}_{ij}}{\gamma_i}
\label{eq:recall}
\end{equation}$$
where as before $\gamma_i$ is the total number of relevant documents (labels).
Using the definition in eq. (\ref{eq:recall}), here are the recall values for the test cases above,
| User Test | Recall@1 | Recall@3 | Recall@5 |
| --- | --- | --- | --- |
| #1 | 1/6=0.167 | (1+1+0)/6=1/3=0.333 | (1+1+0)/6=1/3=0.333 |
| #2 | 0/3=0 | (0+1+0)/3=1/3=0.333 | (0+1+0+1+0)/3=2/3=0.666 |
| #3 | 0 | 0 | 0 |
| #4 | NaN | NaN | NaN |
| #5 | NaN | NaN | NaN |
| Mean Recall | (1/6+0+0)/3=1/18=0.056 | (1/3+1/3+0)/3=2/9=0.222 | (1/3+2/3+0)/3=1/3=0.333 |
```scala
private def recallAtk(ats: Seq[Int], pred: Array[_ <: Int], label: Array[_ <: Int]) = (pred, label) match {
  // ...
}
```
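Again, a hedged Python sketch of Recall@k (helper name mine), checked against the table:

```python
def recall_at_k(rel, n_relevant, k):
    """rel: 0/1 relevance of ranked recommendations; n_relevant: |G_i|."""
    if n_relevant == 0:
        return float("nan")          # empty ground truth -> undefined
    return sum(rel[:k]) / n_relevant

print(recall_at_k([1, 1, 0, 0, 0], 6, 3))  # 1/3, test case #1
print(recall_at_k([0, 1, 0, 1, 0], 3, 5))  # 2/3, test case #2
```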
The F1score aims at capturing in a single metric the precision and recall metrics and is usually defined as the harmonic mean between them,
$$\begin{equation}
\textrm{F1}_i@k = 2\frac{\textrm{P}_i@k \cdot \textrm{R}_i@k}{\textrm{P}_i@k + \textrm{R}_i@k}
\label{eq:F1score}
\end{equation}$$
| User Test | F1@1 | F1@3 | F1@5 |
| --- | --- | --- | --- |
| #1 | 2/6/(1+1/6)=2/7=0.286 | 4/9/(2/3+1/3)=4/9=0.444 | 4/15/(2/5+1/3)=4/11=0.364 |
| #2 | 0 | 2/9/(1/3+1/3)=1/3=0.333 | 8/15/(2/3+2/5)=1/2=0.500 |
| #3 | 0 | 0 | 0 |
| #4 | NaN | NaN | NaN |
| #5 | NaN | NaN | NaN |
| Mean F1-score | (2/7+0+0)/3=2/21=0.095 | (4/9+1/3+0)/3=7/27=0.259 | (4/11+1/2+0)/3=19/66=0.288 |
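The F1-score follows directly from the two previous metrics; a small sketch checked against test case #1 at k=3 (P=2/3, R=1/3):

```python
def f1_at_k(p, r):
    """Harmonic mean of Precision@k and Recall@k."""
    if p + r == 0:
        return 0.0               # convention when both are zero
    return 2 * p * r / (p + r)

print(f1_at_k(2/3, 1/3))         # 4/9, test case #1 at k=3
```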
This metric represents the average of the precision values for the relevant items from 1 to k,
$$\begin{equation}
\textrm{AP}_i@k = \frac{\sum_{j=1}^{\min\{k,\rho_i\}}\textrm{rel}_{ij} \textrm{P}_i@j}{\sum_{j=1}^{\min\{k,\rho_i\}}\textrm{rel}_{ij}}
\label{eq:ap}
\end{equation}$$
Compared to the Precision, the Average Precision biases the result towards the top ranks. It is designed to work for binary outcomes (in this sense, the NDCG defined at the end of this post is more general as it also takes into account nonbinary relevance).
Using the definition in eq. (\ref{eq:ap}), here are the Average Precision scores for the test cases above,
| User Test | AP@1 | AP@3 | AP@5 |
| --- | --- | --- | --- |
| #1 | 1 | (1+1+0)/2=1.000 | (1+1+0)/2=1.000 |
| #2 | 0 | (0+1/2+0)/1=1/2=0.500 | (0+1/2+0+1/2+0)/2=1/2=0.500 |
| #3 | 0 | 0 | 0 |
| #4 | NaN | NaN | NaN |
| #5 | NaN | NaN | NaN |
| MAP | (1+0+0)/3=1/3=0.333 | (1+1/2+0)/3=1/2=0.500 | (1+1/2+0)/3=1/2=0.500 |
```scala
private def mapAtk(ats: Seq[Int], pred: Array[_ <: Int], label: Array[_ <: Int]) = (pred, label) match {
  // ...
}
```
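A hedged Python sketch of Average Precision@k (helper name mine), checked against test cases #1 and #2 from the table:

```python
def average_precision_at_k(rel, k):
    """rel: 0/1 relevance of ranked recommendations."""
    if not rel:
        return float("nan")      # no recommendations -> undefined
    rel_k = rel[:k]
    hits = sum(rel_k)
    if hits == 0:
        return 0.0               # no relevant item in the top k (as in test case #3)
    # precision at each rank j where the item is relevant, averaged over the hits
    score = sum(sum(rel_k[:j + 1]) / (j + 1) for j, r in enumerate(rel_k) if r)
    return score / hits

print(average_precision_at_k([1, 1, 0, 0, 0], 3))  # 1.0, test case #1
print(average_precision_at_k([0, 1, 0, 1, 0], 5))  # 0.5, test case #2
```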
The reciprocal rank of a set of recommendations served to user $i$ is defined as the inverse position (rank) of the first relevant document among the first k ones,
$$\begin{equation}
\textrm{RR}_i@k = \frac{1}{\textrm{rank}_{i,1}}
\nonumber
\end{equation}$$
Because of this definition, this metric is mostly suited to cases in which it is the topranked result that matters the most.
| User Test | RR@1 | RR@3 | RR@5 |
| --- | --- | --- | --- |
| #1 | 1 | 1 | 1 |
| #2 | 0 | 1/2=0.500 | 1/2=0.500 |
| #3 | 0 | 0 | 0 |
| #4 | NaN | NaN | NaN |
| #5 | NaN | NaN | NaN |
| MRR | (1+0+0)/3=1/3=0.333 | (1+1/2+0)/3=1/2=0.500 | (1+1/2+0)/3=1/2=0.500 |
```scala
def mrrAtk(ats: Seq[Int], pred: Array[_ <: Int], label: Array[_ <: Int]) = (pred, label) match {
  // ...
}
```
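A hedged Python sketch of the reciprocal rank (helper name mine), checked against the table:

```python
def reciprocal_rank_at_k(rel, k):
    """Inverse rank of the first relevant item among the top k, 0 if none."""
    if not rel:
        return float("nan")
    for j, r in enumerate(rel[:k], start=1):
        if r:
            return 1.0 / j
    return 0.0

print(reciprocal_rank_at_k([0, 1, 0, 1, 0], 3))  # 0.5, test case #2
print(reciprocal_rank_at_k([1, 1, 0, 0, 0], 1))  # 1.0, test case #1
```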
The Normalized Discounted Cumulative Gain (NDCG) is similar to MAP, except that NDCG also works for nonbinary relevance. NDCG has a heavier tail at high ranks, which means that it does not discount lower ranks as much as MAP does. For this reason, some practitioners prefer MAP to NDCG (for binary outcomes).
To define NDCG, we first need to define the discounted cumulative gain (DCG)
$$\begin{equation}
\textrm{DCG}_i@k = \sum_{j=1}^{k}\frac{2^{\textrm{rel}_{ij}}-1}{\ln (j+1)}
\label{eq:dcg}
\end{equation}$$
and then the Ideal Discounted Cumulative Gain (IDCG)
$$\begin{equation}
\textrm{IDCG}_i@k = \sum_{j=1}^{k}\frac{2^{\textrm{SORT}(\textrm{rel}_i)_j}-1}{\ln (j+1)}\ .
\label{eq:idcg}
\end{equation}$$
The IDCG represents the ideal scenario in which the given recommendations are ranked as well as possible; this is the origin of the SORT function in the definition above, which sorts the relevance scores in decreasing order.
Using eqs. (\ref{eq:dcg}) and (\ref{eq:idcg}) we finally arrive at the definition of the Normalized Discounted Cumulative Gain (NDCG) as
$$\begin{equation}
\textrm{NDCG}_i@k = \frac{\textrm{DCG}_i@k}{\textrm{IDCG}_i@k}\ .
\label{eq:ndcg}
\end{equation}$$
| User Test | DCG@1 | IDCG@1 | NDCG@1 |
| --- | --- | --- | --- |
| #1 | 1/ln(2) | 1/ln(2) | DCG@1/IDCG@1=1.000 |
| #2 | 0 | 0 | 0 |
| #3 | 0 | 0 | 0 |
| #4 | NaN | NaN | NaN |
| #5 | NaN | NaN | NaN |
| Mean NDCG@1 | | | (1+0+0)/3=1/3=0.333 |
| User Test | DCG@3 | IDCG@3 | NDCG@3 |
| --- | --- | --- | --- |
| #1 | 1/ln(2)+1/ln(3) | 1/ln(2)+1/ln(3) | DCG@3/IDCG@3=1.000 |
| #2 | 1/ln(3) | 1/ln(2) | DCG@3/IDCG@3=0.631 |
| #3 | 0 | 0 | 0 |
| #4 | NaN | NaN | NaN |
| #5 | NaN | NaN | NaN |
| Mean NDCG@3 | | | (1+0.631+0)/3=0.544 |
| User Test | DCG@5 | IDCG@5 | NDCG@5 |
| --- | --- | --- | --- |
| #1 | 1/ln(2)+1/ln(3) | 1/ln(2)+1/ln(3) | DCG@5/IDCG@5=1.000 |
| #2 | 1/ln(3)+1/ln(5) | 1/ln(2)+1/ln(3) | DCG@5/IDCG@5=0.651 |
| #3 | 0 | 0 | 0 |
| #4 | NaN | NaN | NaN |
| #5 | NaN | NaN | NaN |
| Mean NDCG@5 | | | (1+0.651+0)/3=0.550 |
`private def ndcgAtk(ats: Seq[Int], pred: Array[_ <: Int], label: Array[_ <: Int]) = (pred, label) match { …`
⁂
And there it is, the solution:
As you can see, there is a time slider that allows you to visualize the graph as a function of time. The width of each edge encodes the strength of the connection between two nodes, and the size of each node represents how important that node is in a given year. You can drag the nodes around for a bit of fun, and hovering over a node shows its id.
Here is the code that is used to generate the graph (vizgraph.html)
Two more things are missing: the css file and the data file.
Let’s start with the css file (vizstyle.css)
`.node { …`
And finally the json file used to generate the graph (graph.json)
In this post I will present the solution to the same problem from a Bayesian perspective, using a mix of both theory and practice (via the $\small{\texttt{pymc3}}$ package). The frequentist and Bayesian approaches actually give very similar results, as the maximum a posteriori (MAP) value, which maximises the posterior distribution, coincides with the MLE for uniform priors. In general, despite the added complexity in the algorithm, the Bayesian results are rather intuitive to interpret.
Just as in the frequentist case, the Bayesian problem admits a solution that can be expressed in analytical form. In the first part of this blog post I will present this solution, which is based on the excellent chapter 5 of the book “Numerical Bayesian Methods Applied to Signal Processing” by Joseph O. Ruanaidh and William Fitzgerald.
In the second part of this post I will present numerical solutions based on the $\small{\texttt{pymc3}}$ package, for different types of problems including multiple changepoints.
Let us assume that we only have one changepoint and that the distributions before and after it are well modeled by stationary Gaussian distributions $\mathcal{N}(\mu,\sigma)$ with mean $\mu$ and variance $\sigma^2$.
We then model each of $i=1,\dots,N$ observed data points $\textbf{d} = \{d_1,..,d_N\}$ in the following way,
$$d_i = \left\{
\begin{array}{ll}
\mathcal{N}(\mu_1,\sigma) & \mathrm{for} \ t<\tau \ \ \ \ \ (i.) \\
\mathcal{N}(\mu_2,\sigma) & \mathrm{for} \ t\geq\tau \ \ \ \ \ (ii.)
\end{array}
\right.$$
in which $\mu_1$ and $\mu_2$ are the mean values before and after the changepoint, respectively.
We now introduce the Bayesian point of view. First of all, note that the only known quantities in the expressions i. and ii. are the observed data $d_i$ and we want to infer the unknowns $\mu_1,\mu_2,\sigma,\tau$ based on the observed data.
“Inferring the model parameters by looking at the observed data” immediately rings a bell: Bayes’ theorem. Let’s write this theorem in all its glorious form, specialized to our problem,
$$\begin{equation} \mathbb{P}(\mu_1,\mu_2,\sigma,\tau|\textbf{d}) = \frac{\mathbb{P}(\textbf{d}|\mu_1,\mu_2,\sigma,\tau)\,\mathbb{P}(\mu_1,\mu_2,\sigma,\tau)}{\mathbb{P}(\textbf{d})} \label{eq:Bayes} \end{equation}$$and discuss separately each term in eq. (\ref{eq:Bayes}):
In the absence of specific prior information on the data, we are going to use noninformative prior distributions, assuming that $\mu_1,\mu_2,\sigma$ and $\tau$ are all independent random variables,
$$\begin{eqnarray}
\mathbb{P}(\mu_1) = k_1 & \\
\mathbb{P}(\mu_2) = k_2 & \\
\mathbb{P}(\log\sigma) = k_3 & \Rightarrow \mathbb{P}(\sigma)\propto\frac{1}{\sigma} \\
\mathbb{P}(\tau) = k_4
\end{eqnarray}$$
where $k_1, k_2, k_3$ and $k_4$ are all unspecified constants. Note how the resulting priors are all improper priors, in the sense that they are not integrable (and in particular, they don’t integrate to 1). Note also how the prior for the variance is not uniform but goes like $1/\sigma$, consistent with Jeffreys’ principle for noninformative priors.
Finally… here comes the fun part! The objective of our analytic treatment is to find the posterior distribution of the changepoint $\tau$, or $\mathbb{P}(\tau|\textbf{d})$. This is obtained by marginalizing the posterior distribution $\mathbb{P}(\mu_1,\mu_2,\sigma,\tau|\textbf{d})$ with respect to the nuisance (i.e., “uninteresting”) parameters $\mu_1,\mu_2,\sigma$. In other words, we are not interested in the whole multidimensional posterior distribution, which would be impossible even to plot. Instead, we obtain a lower-dimensional representation of the posterior by looking only at one particular dimension ($\tau$) and integrating over the others.
Marginalizing with respect to $\mu_1,\mu_2,\sigma$ means nothing more than summing (for discrete distributions) or integrating (for continuous distributions) over the nuisance parameters, like so:
$$\begin{equation} \mathbb{P}(\tau|\textbf{d})\propto \int_0^\infty \textrm{d}\sigma \int_{-\infty}^\infty \textrm{d}\mu_1 \int_{-\infty}^\infty \textrm{d}\mu_2 \ \frac{\mathbb{P}(\textbf{d}|\mu_1,\mu_2,\sigma,\tau)}{\sigma} \end{equation}$$in which we have used Bayes’ theorem and the definitions of the priors that we introduced above.
Here is the process needed to obtain $\mathbb{P}(\tau|\textbf{d})$:
Regarding integration with respect to $\mu_1$ and $\mu_2$, the following identity is useful:
$$\begin{equation}
\int_{-\infty}^{\infty}\exp(-ax^2-bx-c)\,\textrm{d}x = \sqrt{\frac{\pi}{a}}\exp\left(\frac{b^2}{4a}-c\right)\ .
\label{eq:intIdentity1}
\end{equation}$$
Regarding integration with respect to $\sigma$, this other identity is useful:
$$\begin{equation}
\int_0^{\infty}x^{\alpha-1}\exp(-Qx)\,\textrm{d}x=\frac{\Gamma(\alpha)}{Q^\alpha}\ ,
\label{eq:intIdentity2}
\end{equation}$$
in which $\Gamma(\alpha)=\int_0^{\infty}x^{\alpha-1}\exp(-x)\,\textrm{d}x$ is the Gamma function.
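As a quick numerical sanity check of eq. (\ref{eq:intIdentity2}), here is a crude trapezoidal quadrature in pure Python (a sketch; in practice one would use a library integrator):

```python
import math

def gamma_integral(alpha, Q, upper=60.0, n=200_000):
    """Numerically integrate x^(alpha-1) * exp(-Q x) over [0, upper]."""
    h = upper / n
    total = 0.0
    for i in range(1, n):  # the integrand vanishes at both endpoints for alpha > 1
        x = i * h
        total += x ** (alpha - 1) * math.exp(-Q * x)
    return total * h

alpha, Q = 3.0, 2.0
numeric = gamma_integral(alpha, Q)
exact = math.gamma(alpha) / Q ** alpha  # Gamma(3)/2^3 = 2/8 = 0.25
```

The truncation at `upper=60` is harmless here because the integrand decays exponentially.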
This is based on our model definition, in which we are assuming that all of our data points from $1$ to $\tau$ can be modeled by a normal distribution $\mathcal{N}(\mu_1,\sigma)$ and the remaining ones by another normal distribution $\mathcal{N}(\mu_2,\sigma)$,
$$\begin{align}
\mathbb{P}(\textbf{d}|\mu_1,\mu_2,\sigma,\tau) & =\prod_{i=1}^\tau \mathbb{P}(d_i|\mu_1,\sigma)\prod_{j=\tau+1}^N \mathbb{P}(d_j|\mu_2,\sigma) \propto \nonumber \\ & (2\pi\sigma^2)^{-N/2}\exp{\left\{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^\tau (x_i^2+\mu_1^2-2\mu_1x_i) + \sum_{j=\tau+1}^N(x_j^2+\mu_2^2-2\mu_2x_j)\right]\right\}}
\label{eq:likelihood}
\end{align}$$
We can use the identity eq. (\ref{eq:intIdentity1}) for this one, obtaining
$$\begin{align}
\mathbb{P}(\mu_2,\sigma,\tau|\textbf{d}) & = \int_{-\infty}^{\infty} \mathbb{P}(\mu_1,\mu_2,\sigma,\tau|\textbf{d}) \,\textrm{d}\mu_1 \propto
\nonumber \\
& \frac{1}{\sigma}\left(\frac{2\pi\sigma^2}{\tau}\right)^{1/2}(2\pi\sigma^2)^{-N/2}
\exp{\left\{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^\tau x_i^2-\frac{(\sum_{i=1}^\tau x_i)^2}{\tau} + \sum_{j=\tau+1}^N(x_j-\mu_2)^2\right]\right\}}
\end{align}$$
(the extra factor $1/\sigma$ comes from the prior $\mathbb{P}(\sigma)\propto 1/\sigma$).
We use again the identity eq. (\ref{eq:intIdentity1}),
$$\begin{align}
\mathbb{P}(\sigma,\tau|\textbf{d}) & = \int_{-\infty}^{\infty} \mathbb{P}(\mu_2,\sigma,\tau|\textbf{d}) \,\textrm{d}\mu_2 \propto
\nonumber \\
& \frac{1}{\sigma}\,\frac{(2\pi\sigma^2)^{-(N-2)/2}}{\sqrt{\tau(N-\tau)}}
\exp{\left\{-\frac{1}{2\sigma^2}\left[\sum_{k=1}^N x_k^2-\frac{(\sum_{i=1}^\tau x_i)^2}{\tau} - \frac{(\sum_{j=\tau+1}^N x_j)^2}{N-\tau}\right]\right\}}
\end{align}$$
Using the expression in eq. (\ref{eq:intIdentity2}), this is the final result that we get:
$$\bbox[white,5px,border:2px solid red]{
\begin{align}
\mathbb{P}(\tau|\textbf{d}) & = \int_{0}^{\infty} \mathbb{P}(\sigma,\tau|\textbf{d}) \,\textrm{d}\sigma\propto
\nonumber \\
& \frac{1}{\sqrt{\tau(N-\tau)}}\left[\sum_{k=1}^{N} x_k^2 - \frac{(\sum_{i=1}^{\tau}x_i)^2}{\tau} - \frac{(\sum_{j=\tau+1}^{N}x_j)^2}{N-\tau}\right]^{-(N-2)/2}\ \hspace{2cm}
\label{eq:bayesian_final}
\end{align}
}$$
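Eq. (\ref{eq:bayesian_final}) is straightforward to evaluate directly. A minimal pure-Python sketch (function names are mine; for large $N$ one would work with logarithms for numerical robustness, and the toy series below is deterministic just for reproducibility):

```python
import math

def changepoint_posterior(x):
    """Evaluate the unnormalized posterior P(tau|d) of the boxed equation, then normalize."""
    N = len(x)
    total_sq = sum(v * v for v in x)
    post = {}
    for tau in range(1, N):      # changepoint strictly inside the series
        s1 = sum(x[:tau])        # sum of the first tau points
        s2 = sum(x[tau:])        # sum of the remaining N - tau points
        bracket = total_sq - s1 * s1 / tau - s2 * s2 / (N - tau)
        post[tau] = bracket ** (-(N - 2) / 2) / math.sqrt(tau * (N - tau))
    norm = sum(post.values())
    return {tau: p / norm for tau, p in post.items()}

# toy series: bounded "noise" with a mean jump of 10 after t = 20
data = [math.sin(i) + (0.0 if i < 20 else 10.0) for i in range(40)]
posterior = changepoint_posterior(data)
```

The posterior mass concentrates sharply at the true changepoint position.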
Now please stay focused, as we are going to compare the numerical results obtained with $\small{\texttt{pymc3}}$ against this very expression (\ref{eq:bayesian_final}), and we will find an impressive match!
In this section we are going to see how to use the $\small{\texttt{pymc3}}$ package to tackle our changepoint detection problem. I found $\small{\texttt{pymc3}}$ to be rather easy to use, particularly after a quick introduction to Theano. The key is understanding that Theano is a framework for symbolic math: it essentially allows you to write abstract mathematical expressions in python. Bonus point: it is possible to run Keras on top of Theano (not just on top of Tensorflow).
Before diving into the technical details of the implementation in $\small{\texttt{pymc3}}$, here are my observations regarding solving a changepoint detection problem using a naïve Markov Chain Monte Carlo approach.
And now… let’s start playing.
The main idea behind solving a multiple changepoint detection problem in $\small{\texttt{pymc3}}$ is the following: using multiple Theano switch functions to model multiple changepoints. This is quite a simple idea that shows the versatility of Theano.
Let’s work through an example from scratch to show how the logic works.
Suppose we are expecting the data to contain two changepoints $\tau_1$ and $\tau_2$ such that we model the observed data points $d_i$ in the following way,
$$d_i = \left\{ \begin{array}{ll} \mathcal{N}(\mu_1,\sigma) & \mathrm{for} \ \ t<\tau_1 \\ \mathcal{N}(\mu_2,\sigma) & \mathrm{for} \ \ \tau_1 \leq t < \tau_2 \\ \mathcal{N}(\mu_3,\sigma) & \mathrm{for} \ \ t \geq \tau_2 \end{array} \right.$$In this problem we have 6 unknown parameters: $\mu_1$, $\mu_2$, $\mu_3$, $\sigma$, $\tau_1$ and $\tau_2$. We want to use pymc3 to find posterior distributions for these parameters (so we are implicitly in a Bayesian framework).
Here is how we can do this in $\small{\texttt{pymc3}}$. First, we have to import the relevant packages (make sure you have installed $\small{\texttt{theano}}$ and $\small{\texttt{pymc3}}$),
`import pymc3 as pm …`
At this point we define some synthetic data that we are going to use as our testbed,
`np.random.seed(100) # initialize random seed …`
The code above defines two changepoints at positions 1000 and 2000, in which the mean value changes from 1000 to 1100 and then from 1100 to 800. The standard deviation is set at the value 30 for the whole dataset.
We want $\small{\texttt{pymc3}}$ to find all these parameters. Here is how we can do it:
`niter = 2000 # number of iterations for the MCMC algorithm …`
In the code above, the interesting parts are the definitions of the stochastic variables $\mu$ and $\sigma$, and the definition of the log-likelihood function, which is the same as in eq. (\ref{eq:likelihood}) after applying the logarithm,
$$\begin{equation} \log{\mathcal{L}} = \log\left\{\prod_{i=1}^N (2\pi\sigma^2)^{-1/2}\exp{\left[-\frac{(d_i - \mu)^2}{2\sigma^2}\right]}\right\} = -\sum_{i=1}^N\log\left[(2\pi)^{1/2}\sigma\right] - \sum_{i=1}^N\left[\frac{(d_i - \mu)^2}{2\sigma^2}\right] \end{equation}$$Note that the changepoint variables $\small{\texttt{tau1}}$ and $\small{\texttt{tau2}}$ are initialized as uniform random variables. However, while $\small{\texttt{tau1}}$ spans the whole time domain from 1 to N (where N is the total number of data points), $\small{\texttt{tau2}}$ only spans the values from $\small{\texttt{tau1}}$ to N. In this way we ensure that $\small{\texttt{tau2}}$ > $\small{\texttt{tau1}}$.
The other trick is what we were discussing at the beginning of this section: the stochastic variable $\small{\texttt{mu}}$ is defined after first defining a dummy stochastic variable _$\small{\texttt{mu}}$. In essence, line 16 says that for all the values $\small{\texttt{t}}$ for which $\small{\texttt{tau2}}$>t (i.e., for all the values to the left of $\small{\texttt{tau2}}$), $\small{\texttt{mu}}$ takes the value _$\small{\texttt{mu}}$. Otherwise, it takes the value $\small{\texttt{mu3}}$. In this way, by using the switch function twice we achieve the objective of having a stochastic variable that changes according to whether $t<\tau_1$, $\tau_1\leq t < \tau_2$ or $t \geq \tau_2$.
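The nested-switch logic is easy to see even without the Theano machinery; here is a plain-Python, scalar stand-in for the symbolic switch (hypothetical helper names, for illustration only):

```python
def switch(cond, a, b):
    """Scalar stand-in for the symbolic switch: a if cond else b."""
    return a if cond else b

def mu_of_t(t, tau1, tau2, mu1, mu2, mu3):
    """Piecewise mean built from two nested switches, as in the model above."""
    _mu = switch(tau1 > t, mu1, mu2)   # left of tau1 -> mu1, otherwise mu2
    return switch(tau2 > t, _mu, mu3)  # left of tau2 -> _mu, otherwise mu3
```

With the synthetic parameters above, `mu_of_t(t, 1000, 2000, 1000, 1100, 800)` returns 1000 for $t<1000$, 1100 for $1000\leq t<2000$, and 800 for $t\geq 2000$.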
Finally we can have a look at the result by plotting the trace:
`pm.traceplot(trace[500:])`
(You will see more about the $\small{\texttt{traceplot}}$ function in the next sections. And if you want to see what the plots for this example look like, you can skip straight to Section 2.4.)
In the last part of this blog post I am going to list a series of problems that are solved using the code $\small{\texttt{changepoint_bayesian.py}}$, which is written in python 3.5 and can be downloaded below,
This code is more general (but also more obscure) than the example given above. Using $\small{\texttt{changepoint_bayesian.py}}$ I will present the solution to a series of problems, ranging from the single-changepoint detection case that was discussed in the analytic solution above (Section 1) up to a three-changepoint case. The code can easily be generalized to more change points; in fact, it is pretty much ready for it.
In this problem there is a changepoint at the position $\tau=2500$, where the mean value changes from $\mu_1=1000$ to $\mu_2=1020$, the standard deviation being the same at $\sigma=50$,
$$d_i = \left\{
\begin{array}{ll}
\mathcal{N}(1000,50) & \mathrm{for} \ \ t<2500 \\
\mathcal{N}(1020,50) & \mathrm{for} \ \ t \geq 2500
\end{array}
\right.$$
To load this problem and see what the data look like, after downloading $\small{\texttt{changepoint_bayesian.py}}$ you can run the following lines in IPython:
`In [1]: run changepoint_bayesian.py …`
To find the posterior estimates for $\tau$, $\mu_1$, $\mu_2$ and $\sigma$ here is how we can proceed,
`In [4]: d.find_changepoint() …`
and the corresponding trace plot is shown below,
What is the trace plot telling us? We should first look at the right column. These are the results of the MCMC iterations. The first iterations have been filtered out; that is where you would see that the algorithm had not yet converged. In all the plots on the right column you can see that the MCMC solution does not change much, i.e. we can confidently say that the algorithm has converged to its stationary solution (the posterior distribution). This means that the algorithm is now randomly sampling from the posterior distribution.
Now, what about the left column? These are simply the histograms of the data in the right column. If, as discussed, we believe the data on the right are all representative of the posterior distribution (i.e., the MCMC algorithm has converged) then these histograms are the answers to our problems: the marginals of the posterior distribution.
With reference to the figure above, here is how we interpret the data:
When I first saw the top left plot, I was a bit skeptical about the result. It is not at all a smooth solution; I was expecting to see something “nicer”. And this is exactly why it is important to have analytical solutions to compare against!
Once we compare the result for $\mathbb{P}(\tau|\textbf{d})$ as given by $\small{\texttt{pymc3}}$ against the analytical result given in eq. (\ref{eq:bayesian_final}), this is what we get
`In [6]: d.plot_theory()`
Wow, the simulation results are pretty close to the theory! And as you can see, the MCMC solution is able to reproduce nontrivial shapes of the probability distribution function. Way to go $\small{\texttt{pymc3}}$!
In this problem, instead of the mean value we are going to change the variance from $\sigma_1=10$ to $\sigma_2=20$, keeping the same mean value $\mu=1000$, like so:
$$d_i = \left\{
\begin{array}{ll}
\mathcal{N}(1000,10) & \mathrm{for} \ \ t<2500 \\
\mathcal{N}(1000,20) & \mathrm{for} \ \ t \geq 2500
\end{array}
\right.$$
Now, load the data and plot them. In IPython, run:
`In [1]: run changepoint_bayesian.py …`
To find the posterior estimates for $\tau$, $\sigma_1$, $\sigma_2$ and $\mu$,
`In [4]: d.find_changepoint() …`
The resulting trace plot is given below. It shows that the posterior estimates are close to the true parameters.
This is the same example given in Section 2.1. We are going to have two change points $\tau_1=1000$ and $\tau_2=2000$, with the mean value changing from $\mu_1=1000$ to $\mu_2=1100$ to $\mu_3=800$, keeping the same standard deviation $\sigma=30$,
$$d_i = \left\{
\begin{array}{ll}
\mathcal{N}(1000,30) & \mathrm{for} \ \ t<1000 \\
\mathcal{N}(1100,30) & \mathrm{for} \ \ 1000 \leq t < 2000 \\
\mathcal{N}(800,30) & \mathrm{for} \ \ t \geq 2000
\end{array}
\right.$$
First, let’s load and plot the data. In IPython, run:
`In [1]: run changepoint_bayesian.py …`
To find the posterior estimates for $\tau_1$, $\tau_2$, $\mu_1$, $\mu_2$ and $\sigma$,
`In [4]: d.find_changepoint() …`
The resulting trace plot is given below. As before, the posterior estimates are close to the true parameters.
This is our final problem. Here we are going to have not one, not two but three change points! The changepoint positions are $\tau_1=1000$, $\tau_2=2000$ and $\tau_3=2500$. The mean value changes from $\mu_1=1000$ to $\mu_2=1100$, then to $\mu_3=800$ and finally to $\mu_4=1020$. The standard deviation is kept constant at the value $\sigma=30$.
$$d_i = \left\{ \begin{array}{ll} \mathcal{N}(1000,30) & \mathrm{for} \ \ t<1000 \\ \mathcal{N}(1100,30) & \mathrm{for} \ \ 1000 \leq t < 2000 \\ \mathcal{N}(800,30) & \mathrm{for} \ \ 2000 \leq t < 2500 \\ \mathcal{N}(1020,30) & \mathrm{for} \ \ t \geq 2500 \end{array} \right.$$Again, we load and plot the data first:
`In [1]: run changepoint_bayesian.py …`
We have to estimate 8 parameters in this problem! They are $\tau_1$, $\tau_2$, $\tau_3$, $\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$ and $\sigma$,
`In [4]: d.find_changepoint() …`
Once again, the trace plot shows that the MCMC algorithm has eventually converged, and the Bayesian estimates for the model parameters are close to the true values. Good job $\small{\texttt{pymc3}}$!
CPD is a generally interesting problem with lots of potential applications other than quality control, ranging from predicting anomalies in the electricity market (and, more generally, in financial markets) to detecting security attacks in a network system or even detecting electrical activity in the brain. The point is to have an algorithm that can automatically detect changes in the properties of the time series for us to make the appropriate decisions.
Whatever the application, the general framework is always the same: the underlying probability distribution function of the time series is assumed to change at one (or more) moments in time.
Figure 1 shows two examples of how this might work. Suppose we follow the history of visits on a website for 1000 days. In (a), we clearly see that at the 500th day something happens (maybe Google started ranking the website at the top of its search results). After that moment, the number of visits per day is clearly different. However, things are usually not as good as in this example… we may in fact be in the situation (b) of the figure. If I didn’t tell you that there is a changepoint at the 500th day, would you be able to find it?
The cases above are examples in which the changepoint occurs because of a change in the mean. In other situations, a changepoint can occur because of a change in the variance, as in the figure below.
The idea of a CPD algorithm is to be able to automatically detect the positions of the most likely changepoints and, ideally, determine whether there is statistical evidence for claiming them to be “real”.
I will tackle this problem from a frequentist point of view, with the plan of talking about a Bayesian approach in a future post.
As in all hypothesis testing problems, there is a potential issue of multiple testing. I have already talked about this issue. To avoid it, I will make the important assumption that the analysis involves a fixed sample, so that we perform one single hypothesis test. In other words, the analysis is retrospective, as opposed to an online (sequential) analysis in which a stream of data is continuously analyzed.
In searching for an R package on CPD, I found $\texttt{changepoint}$ to be very well written and documented. There is a good paper in the Journal of Statistical Software that describes the package in more detail and can be freely accessed here.
In this post I will give my own interpretation of how to approach the problem, although mainly following the arguments and notation of the paper above and of this review.
First of all, for simplicity I will assume that there is a single changepoint during the whole time period, defined at the discrete times $t=1,2,3,\dots,N$. This case already captures the essence of CPD.
Let’s define $\tau$ as the changepoint time that we want to test. Each data point in the time series is assumed to be drawn from some probability distribution function (for example, it could be a binomial or a normal distribution). In this sense, the time series can be considered a realization of a stochastic process. The probability distribution function is assumed to be fully described by the quantity $\theta$ (in general, a set of parameters). For example, $\theta$ could be the probability $p$ of success in a binomial distribution, or the mean and variance in a normal distribution.
At this point the test goes like this: the null hypothesis is that there is no changepoint, while the alternative hypothesis assumes that there is a changepoint at the time $t=\tau$. More formally, here is our hypothesis test:
$$\begin{eqnarray}
H_0 &:& \theta_1=\theta_2=\dots=\theta_{N1}=\theta_N \nonumber \\
H_1 &:& \theta_1=\theta_2=\dots=\theta_{\tau1}=\theta_\tau\neq\theta_{\tau+1}=\theta_{\tau+2}=\dots=\theta_{N1}=\theta_N \nonumber
\end{eqnarray}$$
The key in the expression $H_1$ above is in the inequality $\theta_\tau\neq\theta_{\tau+1}$: at some point in the time series, and precisely between $t=\tau$ and $t=\tau+1$, the underlying distribution changes.
The algorithm for detection is based on the log-likelihood ratio. Let’s first define the likelihood. The likelihood is nothing else than the probability of observing the data that we have (in the time series), assuming that the null (or the alternative) hypothesis is true. It is a measure of how good the hypothesis is: the higher the likelihood, the better the data are fit by the $H_0$ (or $H_1$) assumption.
Assuming independent random variables, under the null hypothesis $H_0$ the likelihood $\mathcal{L}(H_0)$ is given by the probability of observing the data $\mathbf{x}=x_1,\dots,x_N$ conditional on $H_0$. In other words,
$$\begin{equation}
\mathcal{L}(H_0)=p(\mathbf{x}|H_0)=\prod_{i=1}^{N}p(x_i|\theta_0)
\label{eq:L1}
\end{equation}$$
Let us define the likelihood of the alternative hypothesis,
$$\begin{equation}
\mathcal{L}(H_1)=p(\mathbf{x}|H_1)=\prod_{i=1}^{\tau}p(x_i|\theta_1)\prod_{j=\tau+1}^{N}p(x_j|\theta_2)
\label{eq:L2}
\end{equation}$$
The loglikelihood ratio $\mathcal{R}_\tau$ is then
$$\begin{equation}
\mathcal{R}_\tau=\log\left(\frac{\mathcal{L}_{H_1}}{\mathcal{L}_{H_0}}\right)=\sum_{i=1}^{\tau}\log p(x_i|\theta_1) + \sum_{j=\tau+1}^{N}\log p(x_j|\theta_2) - \sum_{k=1}^{N}\log p(x_k|\theta_0)
\label{eq:LRatio}
\end{equation}$$
Since $\tau$ is not known, the previous equation becomes a function of $\tau$. We can thus define a generalized loglikelihood ratio $G$, which is the maximum of $\mathcal{R}_\tau$ for all the possible values of $\tau$,
$$\begin{equation}
G = \max_{1\leq\tau\leq N}\mathcal{R}_\tau
\label{eq_likelihood}
\end{equation}$$
If the null hypothesis is rejected, then the maximum likelihood estimate of the changepoint is the value $\hat{\tau}$ that maximizes the generalized likelihood ratio,
$$\begin{equation}
\hat{\tau} = \underset{1\leq\tau\leq N}{\mathrm{argmax}} \ \mathcal{R}_\tau
\end{equation}$$
In general, the null hypothesis is rejected for a sufficiently large value of $G$. In other words, there is a critical value $\lambda^*$ such that $H_0$ is rejected if
$$\begin{equation}
\bbox[lightblue,5px,border:2px solid red]
{
2G=2\mathcal{R}_{\hat{\tau}}>\lambda^*
}
\label{eq:criterion}
\end{equation}$$
The factor 2 is retained to be consistent with the $\texttt{changepoint}$ package.
The problem is how to define this critical value $\lambda^*$. The package $\texttt{changepoint}$ has several ways of defining the “penalty” factor $\lambda^*$. Going through the R code, I managed to find the definition of a few of them,
where $k$ is the number of extra parameters that are added as a result of defining a changepoint. For example, it is $k=1$ if there is just a shift in the mean or a shift in the variance.
At this point we are ready to look into a few more specific examples.
Assume the variables that compose the time series are drawn from independent normal random distributions. We want to test the hypothesis that there is a change in the mean of the distribution at some discrete point in time $\tau$, while we assume that the variance $\sigma^2$ does not change. The probability density function is
$$\begin{equation} f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\textrm{e}^{-(x-\mu)^2/2\sigma^2} \nonumber \end{equation}$$Let us call $\mu_1$ the mean before the changepoint, $\mu_2$ the mean after the changepoint and $\mu_0$ the global mean.
Under the null hypothesis there is no change in mean so the likelihood looks like
$$\begin{equation} \mathcal L_{H_0} = \frac{1}{(2\pi\sigma^2)^{N/2}}\prod_{i=1}^{N}\exp\left[-\frac{(x_i-\mu_0)^2}{2\sigma^2}\right] \label{eq:L3} \end{equation}$$Under the alternative hypothesis, a changepoint occurs at the time $\tau$ and the corresponding likelihood will then be
$$\begin{equation} \mathcal L_{H_1} = \frac{1}{(2\pi\sigma^2)^{N/2}}\prod_{i=1}^{\tau}\exp\left[-\frac{(x_i-\mu_1)^2}{2\sigma^2}\right] \prod_{j=\tau+1}^{N}\exp\left[-\frac{(x_j-\mu_2)^2}{2\sigma^2}\right]\ . \label{eq:L4} \end{equation}$$Okay, nothing really new so far. Equations (\ref{eq:L3}) and (\ref{eq:L4}) are in fact just specializations of eqs. (\ref{eq:L1}) and (\ref{eq:L2}) to the case of normally distributed random variables.
Now it is time to write the loglikelihood ratio, as in eq. (\ref{eq:LRatio}), obtaining
$$\begin{equation}
\bbox[white,5px,border:2px solid red]{
\mathcal{R}_\tau=\log\left(\frac{\mathcal{L}_{H_1}}{\mathcal{L}_{H_0}}\right)=-\frac{1}{2\sigma^2}\left[\sum_{i=1}^{\tau}(x_i-\mu_1)^2+\sum_{j=\tau+1}^{N}(x_j-\mu_2)^2-\sum_{k=1}^{N}(x_k-\mu_0)^2\right]
}
\label{eq:LR1}
\end{equation}$$
after which we should calculate $G$ and then apply the criterion (\ref{eq:criterion}).
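To make this concrete, here is a pure-Python sketch of the whole procedure for a change in mean (function names are mine; $\sigma$ is assumed known, the means in eq. (\ref{eq:LR1}) are replaced by their sample estimates, and the toy series is deterministic for reproducibility):

```python
import math

def llr_mean_change(x, tau, sigma):
    """R_tau of eq. (LR1) with mu_0, mu_1, mu_2 replaced by sample means."""
    n = len(x)
    mu0 = sum(x) / n
    mu1 = sum(x[:tau]) / tau
    mu2 = sum(x[tau:]) / (n - tau)
    sse = lambda seg, mu: sum((v - mu) ** 2 for v in seg)
    return -(sse(x[:tau], mu1) + sse(x[tau:], mu2) - sse(x, mu0)) / (2 * sigma ** 2)

def best_changepoint(x, sigma):
    """tau_hat = argmax R_tau and G = max R_tau over 1 <= tau < N."""
    scores = {tau: llr_mean_change(x, tau, sigma) for tau in range(1, len(x))}
    tau_hat = max(scores, key=scores.get)
    return tau_hat, scores[tau_hat]

# toy series: bounded "noise" with a mean shift of 4 after t = 15
series = [math.sin(i) + (0.0 if i < 15 else 4.0) for i in range(30)]
tau_hat, G = best_changepoint(series, sigma=1.0)
```

$2G$ can then be compared against the chosen penalty $\lambda^*$ to accept or reject $H_0$.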
If there is a change in variance, rather than in mean, the analysis proceeds much as in the previous section. Let us call $\mu_0$ and $\sigma_0^2$ the global mean and variance, respectively. Under the null hypothesis, the likelihood is
$$\begin{equation} \mathcal L_{H_0} = \frac{1}{(2\pi\sigma_0^2)^{N/2}}\prod_{i=1}^{N}\exp\left[-\frac{(x_i-\mu_0)^2}{2\sigma_0^2}\right] \nonumber \end{equation}$$and under the alternative hypothesis,
$$\begin{equation}
\mathcal L_{H_1} = \frac{1}{(2\pi)^{N/2}\,\sigma_1^{\tau}\,\sigma_2^{N-\tau}}\prod_{i=1}^{\tau}\exp\left[-\frac{(x_i-\mu_0)^2}{2\sigma_1^2}\right]
\prod_{j=\tau+1}^{N}\exp\left[-\frac{(x_j-\mu_0)^2}{2\sigma_2^2}\right]\ .
\nonumber
\end{equation}$$
The loglikelihood ratio is somewhat more complicated than before,
$$\begin{eqnarray}
\mathcal{R}_\tau=\log\left(\frac{\mathcal{L}_{H_1}}{\mathcal{L}_{H_0}}\right)= & N\log\sigma_0-\tau\log\sigma_1-(N-\tau)\log\sigma_2 \nonumber \\ & + \sum_{k=1}^{N}\frac{(x_k-\mu_0)^2}{2\sigma_0^2}-\sum_{i=1}^{\tau}\frac{(x_i-\mu_0)^2}{2\sigma_1^2}-\sum_{j=\tau+1}^{N}\frac{(x_j-\mu_0)^2}{2\sigma_2^2}
\label{eq:Rvar1}
\end{eqnarray}$$
However, noting that $\sum_{k=1}^N(x_k-\mu_0)^2=N\sigma_0^2$, $\sum_{i=1}^\tau(x_i-\mu_0)^2=\tau\sigma_1^2$ and $\sum_{j=\tau+1}^N(x_j-\mu_0)^2=(N-\tau)\sigma_2^2$ (when the variances are replaced by their maximum-likelihood estimates), the three last terms in eq. (\ref{eq:Rvar1}) cancel out. The final expression for $\mathcal{R}_\tau$ is then
$$\begin{equation}
\bbox[white,5px,border:2px solid red]{
\mathcal{R}_\tau=N\log\sigma_0-\tau\log\sigma_1-(N-\tau)\log\sigma_2
}
\label{eq:LR2}
\end{equation}$$
The Bernoulli distribution is possibly the easiest distribution of all. It models binary events that have probability $p$ of happening and probability $1-p$ of not happening. This is why it is usually introduced in probability theory with the “coin flipping” example.
As before, we can write the likelihood under the null hypothesis
$$\begin{equation} \mathcal L_{H_0} = p_0^m(1-p_0)^{N-m} \end{equation}$$where $m$ is the total number of successes over the $N$ observations. Under the alternative hypothesis,
$$\begin{equation}
\mathcal L_{H_1} = p_1^{m_1}(1-p_1)^{\tau-m_1}p_2^{m_2}(1-p_2)^{N-\tau-m_2} .
\end{equation}$$
The log-likelihood ratio, with $m_1$ and $m_2$ the numbers of successes before and after the changepoint, becomes
$$\bbox[white,5px,border:2px solid red]{ \begin{eqnarray} \mathcal{R}_\tau=\log\left(\frac{\mathcal{L}_{H_1}}{\mathcal{L}_{H_0}}\right) = & \hspace{2cm} m_1\log(p_1) + (\tau-m_1)\log(1-p_1) + m_2\log(p_2) \nonumber \\ & + (N-\tau-m_2)\log(1-p_2) - m\log p_0 - (N-m)\log(1-p_0) \hspace{1.5cm} \label{eq:LR3} \end{eqnarray} }$$

Here we assume that the random variables follow a Poisson distribution,
$$\begin{equation}
f(x|\lambda) = \frac{\lambda^x \textrm{e}^{-\lambda}}{x!}, \ \ x\in \mathbb{N}
\nonumber
\end{equation}$$
where $x$ represents the number of events during a predefined time interval and $\lambda$ the expected number of events during the same time interval.
As before, we can write the likelihood under the null hypothesis
$$\begin{equation} \mathcal L_{H_0} = \prod_{i=1}^N\frac{\lambda_0^{x_i} \textrm{e}^{-\lambda_0}}{x_i!} \end{equation}$$and under the alternative hypothesis,
$$\begin{equation}
\mathcal L_{H_1} = \prod_{i=1}^{\tau}\frac{\lambda_1^{x_i} \textrm{e}^{-\lambda_1}}{x_i!}
\prod_{j=\tau+1}^{N}\frac{\lambda_2^{x_j} \textrm{e}^{-\lambda_2}}{x_j!}
\end{equation}$$
and the loglikelihood ratio,
$$\begin{equation}
\mathcal{R}_\tau=\log\left(\frac{\mathcal{L}_{H_1}}{\mathcal{L}_{H_0}}\right) = \sum_{i=1}^{\tau}\log\left(\frac{\lambda_1^{x_i}\mathrm{e}^{-\lambda_1}}{x_i!}\right)
+ \sum_{j=\tau+1}^{N}\log\left(\frac{\lambda_2^{x_j}\mathrm{e}^{-\lambda_2}}{x_j!}\right)
- \sum_{k=1}^{N}\log\left(\frac{\lambda_0^{x_k}\mathrm{e}^{-\lambda_0}}{x_k!}\right)
\nonumber
\end{equation}$$
which can be greatly simplified to give
$$\begin{equation}
\bbox[white,5px,border:2px solid red]{
\mathcal{R}_\tau=\log\left(\frac{\mathcal{L}_{H_1}}{\mathcal{L}_{H_0}}\right) = m_1\log\lambda_1
+ m_2\log\lambda_2 - m_0\log\lambda_0
}
\label{eq:LR4}
\end{equation}$$
where $m_1=\sum_{i=1}^{\tau}x_i$, $m_2=\sum_{j=\tau+1}^{N}x_j$ and $m_0=\sum_{k=1}^{N}x_k$.
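A pure-Python sketch of eq. (\ref{eq:LR4}), with the rates replaced by their maximum-likelihood estimates (function names are mine; the toy counts are deterministic for reproducibility):

```python
import math

def llr_poisson(x, tau):
    """R_tau = m1*log(lambda1) + m2*log(lambda2) - m0*log(lambda0)."""
    n = len(x)
    m1, m2 = sum(x[:tau]), sum(x[tau:])
    m0 = m1 + m2
    lam1, lam2, lam0 = m1 / tau, m2 / (n - tau), m0 / n

    def term(m, lam):
        # with the convention 0 * log(0) = 0 for empty segments
        return m * math.log(lam) if m else 0.0

    return term(m1, lam1) + term(m2, lam2) - term(m0, lam0)

# toy counts: the rate jumps from 2 to 8 after t = 10
counts = [2] * 10 + [8] * 10
tau_hat = max(range(1, len(counts)), key=lambda t: llr_poisson(counts, t))
```

As expected, the maximizer of $\mathcal{R}_\tau$ sits exactly at the rate change.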
We are finally in a position to test the criteria above for different potential applications. Below I list a series of problems with the “solution” underneath. The solution is based on code that I have written in Python, which you can find here,
The code is not optimized; its purpose is to show that the methods described above work well at finding single changepoints in time series. The criterion for rejecting (or not) the null is the Bayesian Information Criterion (BIC). Feel free to use other criteria.
Problem definition: A website receives a certain amount of visits per day. Lately, the marketing team has worked on improving the position of the website on Google searches (Search Engine Optimization). We want to test if this has resulted in an increase in visits on the website.
Note: these parameters are completely made up. You can change the parameters in the code as you wish, also making the test fail (not rejecting the null),
In IPython, run:
In [1]: run changepoint.py
Problem definition: Many economists and investors are interested in understanding why financial markets can show abrupt changes in volatility. Assume that such a change in volatility happens in the dataset we are provided. We need to detect when this change happens in the dataset in order to improve our investment model.
Note: these parameters are completely made up. You can change the parameters in the code as you wish, also making the test fail (not rejecting the null),
In IPython, run:
In [1]: run changepoint.py
Problem definition: An industrial chain delivers products that need to be manually inspected. Under normal circumstances, about 10% of the products are found with small imperfections that need to be taken care of. Recently, a potential bug in the software has been found, which may have caused an increase in the inspection failure rate. We want to detect when the bug was introduced and whether the failure rate has increased because of it.
Note: these parameters are completely made up. You can change the parameters in the code as you wish, also making the test fail (not rejecting the null),
In IPython, run:
In [1]: run changepoint.py
Problem definition: Under normal circumstances, a network system has a load of a few tens of hits per minute. Check whether there is a sudden change in the number of hits per minute in the data provided.
Note: these parameters are completely made up. You can change the parameters in the code as you wish, also making the test fail (not rejecting the null),
In IPython, run:
In [1]: run changepoint.py
⁂
While Python is certainly not the best choice for scientific computing in terms of performance and optimization, it is a good language for rapid prototyping and scripting (and in many cases even for complex production-level code).
For a problem of this type, Python is more than sufficient at doing the job. For more complicated problems involving multiple dimensions, more coupled equations and many extra terms, other languages are typically preferred (Fortran, C, C++,…), often with the inclusion of parallel programming using the Message Passing Interface (MPI) paradigm.
The problem that I am going to present is the one proposed in the excellent book “Numerical Solution of Partial Differential Equations” by G.D. Smith (Oxford University Press),
$$\left\{ \begin{eqnarray} \frac{\partial u}{\partial t} \ \ & = & \ \frac{\partial^2 u}{\partial x^2}, \ \ & 0\leq x \leq 1 \nonumber \\ u(t,0) & = & u(t,1)=0 & \ \ \ \ \ \ \forall t\nonumber\\ u(0,x) & = & 2x & \mathrm{if} \ \ x\leq 0.5 \nonumber\\ u(0,x) & = & 2(1-x) & \mathrm{if} \ \ x> 0.5 \nonumber \end{eqnarray} \right.$$Before I turn to the numerical implementation of a Crank-Nicolson scheme to solve this problem, let’s see what the solution looks like in the video below.
As you can see, the maximum of the function $u$ occurs at $t=0$, after which $u$ keeps decreasing. This behaviour is in line with the maximum principle for parabolic equations, which essentially states that the solution of a parabolic equation attains its maximum on the boundary (meaning the boundary in time, $t=0$, or the spatial boundaries).
The reason for this can be made intuitive by comparison with a metallic one-dimensional rod whose ends are kept at some fixed temperature, and whose initial temperature distribution is linear and peaks at the centre, as in the initial condition above. As time progresses, the two “heat sources” (or sinks) at the ends are kept at a constant low temperature. The diffusion of heat results in the rod becoming colder and colder, until its temperature equals the temperature at the boundaries.
Let’s now talk about the numerical solution of the problem above. As already discussed, the numerical solution has to solve for the following matrix equation
$$\begin{equation} A\textbf{u}_{j+1} = B\textbf{u}_{j} + \textbf{b}_{j}, \end{equation}$$where
$$\begin{equation} A = \left( \begin{matrix} 2+2r & -r & & & \\ -r & 2+2r & -r & \Huge{0} &\\ & & \ddots & & & \\ & \Huge{0} & -r & 2+2r & -r \\ & & & -r & 2+2r \\ \end{matrix} \right) , \textbf{u}_{j+1} = \left( \begin{matrix} u_{2,j+1} \\ u_{3,j+1} \\ \vdots \\ u_{N-2,j+1} \\ u_{N-1,j+1} \\ \end{matrix} \right) \\ B = \left( \begin{matrix} 2-2r & r & & & \\ r & 2-2r & r & \Huge{0} &\\ & & \ddots & & & \\ & \Huge{0} & r & 2-2r & r \\ & & & r & 2-2r \\ \end{matrix} \right), \textbf{u}_{j} = \left( \begin{matrix} u_{2,j} \\ u_{3,j} \\ \vdots \\ u_{N-2,j} \\ u_{N-1,j} \\ \end{matrix} \right), \textbf{b}_{j} = \left( \begin{matrix} r u_{1,j} \\ 0 \\ \vdots \\ 0 \\ r u_{N,j} \\ \end{matrix} \right) \end{equation}$$and $r =\Delta t/\Delta x^2$.
The Python implementation below can be broken down into the following steps:
The code should run in just a few seconds. To generate the m4v movie, I use the $\texttt{ffmpeg}$ tool that can be downloaded here.
import numpy as np
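For reference, here is a minimal sketch of how the Crank-Nicolson iteration for this problem can be set up with NumPy; the grid size, time step and number of iterations are illustrative choices of mine, not necessarily those of the full code:

```python
import numpy as np

# Crank-Nicolson sketch for u_t = u_xx on [0,1] with u(t,0) = u(t,1) = 0
N  = 101                   # number of spatial grid points
dx = 1.0 / (N - 1)
dt = 2.5e-4
r  = dt / dx**2            # D = 1 for this problem

x = np.linspace(0.0, 1.0, N)
u = np.where(x <= 0.5, 2*x, 2*(1 - x))    # triangular initial condition

# tridiagonal A and B acting on the interior points i = 2..N-1
n   = N - 2
I   = np.eye(n)
off = np.eye(n, k=1) + np.eye(n, k=-1)
A = (2 + 2*r)*I - r*off
B = (2 - 2*r)*I + r*off

for _ in range(400):                            # advance to t = 400 * dt = 0.1
    u[1:-1] = np.linalg.solve(A, B @ u[1:-1])   # b_j = 0 with zero boundaries
```

For a tridiagonal system, a banded solver (e.g. `scipy.linalg.solve_banded`) or a pre-computed LU factorization would be the more efficient choice; `np.linalg.solve` simply keeps the sketch short.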
⁂
We have seen how an explicit scheme of solution may require the time step to be constrained to unacceptably low values, in order to guarantee stability of the algorithm.
In this post we will see a different scheme of solution that does not suffer from the same constraint. So let’s go back to the basic equation that we want to solve numerically,
\begin{equation}
\frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2} \ + \mathrm{I.B.C.}
\label{eq:diffusion}
\end{equation}
In an explicit scheme of solution we use forward differencing for the time derivative $\partial u/\partial t$, while the second derivative $\partial^2 u/\partial x^2$ is evaluated at the time step $j$.
In an implicit scheme of solution, we difference eq. (\ref{eq:diffusion}) differently. We still use forward differencing for the time derivative, but the spatial derivative is now evaluated at the time step $j+1/2$, instead of the time $j$ as before, by taking an average between the time step $j$ and the time step $j+1$,
$$\begin{equation} \frac{\bbox[white,5px,border:2px solid red]{ u_{i,j+1}}-u_{i,j}}{\Delta t} = D\frac{u_{i+1,j+1/2}-2u_{i,j+1/2}+u_{i-1,j+1/2}}{\Delta x^2} \\ = \frac{D}{2}\left(\frac{\bbox[white,5px,border:2px solid red]{u_{i+1,j+1}}-2\bbox[white,5px,border:2px solid red]{u_{i,j+1}}+\bbox[white,5px,border:2px solid red]{u_{i-1,j+1}}}{\Delta x^2} + \frac{u_{i+1,j}-2u_{i,j}+u_{i-1,j}}{\Delta x^2} \right). \label{eq:CN} \end{equation}$$Note how the terms inside the red boxes are all unknown, as they are the values at the next time step (the values at the time step $j$ are instead assumed to be known).
Intuitively, this seems a more precise way of differencing than the explicit scheme, since we can now argue that the derivatives on both the left- and right-hand sides are evaluated at the time $j+1/2$. The scheme of eq. (\ref{eq:CN}) is called Crank-Nicolson, after the two mathematicians who proposed it. It is a popular way of solving parabolic equations and it was published shortly after WWII. The Crank-Nicolson scheme has the big advantage of being a stable algorithm of solution, as opposed to the explicit scheme that we have already seen.
The disadvantage is that eq. (\ref{eq:CN}) is computationally more difficult to solve, as there are now three unknowns in the equation, instead of just one as in the explicit scheme. As the figure below shows, the solution $u(i,j+1)$ depends on quantities already known from the previous iteration, just as in the explicit scheme: $u(i-1,j), u(i,j)$ and $u(i+1,j)$. However, it also depends on quantities at the current iteration: $u(i-1,j+1), u(i,j+1)$ and $u(i+1,j+1)$.
How can we find all the values of the solution $u$ at the time step $j+1$? Since it is not possible to calculate $u(i,j+1)$ directly (explicitly) but only indirectly (implicitly), the Crank-Nicolson scheme belongs to the family of schemes named implicit. The trick is to solve all the equations $i=2\dots N-1$ (the values at $i=1$ and $i=N$ are known via the boundary conditions) at once. These equations define a linear system of $N-2$ equations in $N-2$ unknowns.
After rearranging the unknown values to the left and the known values to the right, equation (\ref{eq:CN}) looks like
$$\begin{equation} -r\,u_{i-1,j+1} + (2+2r)u_{i,j+1} - r\,u_{i+1,j+1} = r \,u_{i-1,j} + (2-2r) u_{i,j} + r\,u_{i+1,j}\ , \label{eq:CN1} \end{equation}$$where $r=D\Delta t/\Delta x^2$.
For each inner grid point $i=2,3,\dots N-1$ (as noted, the points at $i=1$ and $i=N$ are supposed to be known from the boundary conditions) there will be an equation of the same form as eq. (\ref{eq:CN1}). The whole set of equations can be rewritten in matrix form as
$$\begin{equation} \left( \begin{matrix} 2+2r & -r & & & \\ -r & 2+2r & -r & \Huge{0} &\\ & & \ddots & & & \\ & \Huge{0} & -r & 2+2r & -r \\ & & & -r & 2+2r \\ \end{matrix} \right) \left( \begin{matrix} u_{2,j+1} \\ u_{3,j+1} \\ \vdots \\ u_{N-2,j+1} \\ u_{N-1,j+1} \\ \end{matrix} \right) = \\ \left( \begin{matrix} 2-2r & r & & & \\ r & 2-2r & r & \Huge{0} &\\ & & \ddots & & & \\ & \Huge{0} & r & 2-2r & r \\ & & & r & 2-2r \\ \end{matrix} \right) \left( \begin{matrix} u_{2,j} \\ u_{3,j} \\ \vdots \\ u_{N-2,j} \\ u_{N-1,j} \\ \end{matrix} \right) + \left( \begin{matrix} r u_{1,j} \\ 0 \\ \vdots \\ 0 \\ r u_{N,j} \\ \end{matrix} \right) \end{equation}$$which can be rewritten as
$$\begin{equation} A\textbf{u}_{j+1} = B\textbf{u}_{j} + \textbf{b}_{j}. \end{equation}$$The definitions of $A, \textbf{u}_{j+1}, \textbf{u}_{j}$ and $\textbf{b}_{j}$ should be obvious from the above, but for completeness let’s write them down anyway,
$$\begin{equation}
A =
\left(
\begin{matrix}
2+2r & -r & & & \\
-r & 2+2r & -r & \Huge{0} &\\
& & \ddots & & & \\
& \Huge{0} & -r & 2+2r & -r \\
& & & -r & 2+2r \\
\end{matrix}
\right)
,
\textbf{u}_{j+1} =
\left(
\begin{matrix}
u_{2,j+1} \\
u_{3,j+1} \\
\vdots \\
u_{N-2,j+1} \\
u_{N-1,j+1} \\
\end{matrix}
\right)
\\
B = \left(
\begin{matrix}
2-2r & r & & & \\
r & 2-2r & r & \Huge{0} &\\
& & \ddots & & & \\
& \Huge{0} & r & 2-2r & r \\
& & & r & 2-2r \\
\end{matrix}
\right),
\textbf{u}_{j} =
\left(
\begin{matrix}
u_{2,j} \\
u_{3,j} \\
\vdots \\
u_{N-2,j} \\
u_{N-1,j} \\
\end{matrix}
\right),
\textbf{b}_{j}
= \left(
\begin{matrix}
r u_{1,j} \\
0 \\
\vdots \\
0 \\
r u_{N,j} \\
\end{matrix}
\right)
\end{equation}$$
After multiplying both sides by the inverse $A^{-1}$, we find the general solution
$$\begin{equation} \bbox[lightblue,5px,border:2px solid red]{\textbf{u}_{j+1} = A^{-1}B\,\textbf{u}_{j} + A^{-1}\textbf{b}_{j}} \end{equation}$$which needs to be iterated for as many time steps as needed.
Now comes the difficult part: evaluating the stability of the Crank-Nicolson scheme. It can be shown that the matrix $B':=A^{-1}B$ is symmetric (and real), therefore its 2-norm $\|\cdot\|_2$ is equal to its spectral radius $r(B')=\max_k |\lambda_k|$, where $\lambda_k$ are the eigenvalues of $B'$,
$$\begin{equation} \|B'\|_2 = r(B') = \max_k \left| \frac{1-2r\sin^2(k\pi/2N)}{1+2r\sin^2(k\pi/2N)}\right| < 1 \ \ \ \forall \ r>0. \end{equation}$$The result $\|B'\|_2<1 \ \forall \ r>0$ ensures that the Crank-Nicolson scheme is unconditionally stable, in the sense that the stability condition $\|B'\|_2<1$ does not depend on the choice of the grid size, time step and diffusion coefficient (the parameters that enter the definition of $r=D\Delta t/\Delta x^2$).
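The claim that the norm stays below one for every $r>0$ can also be verified numerically. The sketch below (with an arbitrary matrix size of my choosing) builds the two tridiagonal matrices and computes the spectral norm directly:

```python
import numpy as np

def cn_norm(n, r):
    """2-norm of B' = A^{-1} B for the Crank-Nicolson matrices of size n."""
    I   = np.eye(n)
    off = np.eye(n, k=1) + np.eye(n, k=-1)   # super- and sub-diagonal of ones
    A = (2 + 2*r)*I - r*off
    B = (2 - 2*r)*I + r*off
    return np.linalg.norm(np.linalg.inv(A) @ B, 2)

for r in (0.1, 0.5, 5.0, 100.0):
    print(r, cn_norm(50, r))   # stays below 1 for every r
```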
As a final note, it is important to point out that the Crank-Nicolson scheme has a truncation error of $O(\Delta t^2,\Delta x^2)$, as opposed to the explicit scheme, which is $O(\Delta t,\Delta x^2)$.
In the next post of this series, I will be looking at a practical implementation of the Crank-Nicolson scheme using Python. That’s it for now, time to look at some computer-generated art…
⁂
As we will see in more detail below, the answer is $1-0.99^N$. For $N=5$, the event already has almost a $5\%$ chance of happening at least once in five attempts.
Fine… so what’s the problem with this? Well, here is why it can complicate things in statistical inference: suppose this 1% event is “rejecting the null hypothesis $H_0$ when the null is actually true”, in other words, committing a type-I error in hypothesis testing. Continuing with the parallel, “making $N$ attempts” would mean making $N$ hypothesis tests. That’s the problem: if we are not careful, making multiple hypothesis tests can lead us to dangerously underestimate the type-I error. Things are not that funny anymore, right?
The problem becomes particularly important now that streaming data are becoming the norm. In this case it may be very tempting to continue collecting data and perform test after test… until we reach statistical significance. Uh… that’s exactly when the analysis becomes biased and things go bad.
The problem of bias in hypothesis testing is much more general than what is in the example above. In particle physics searches, the need for reducing bias goes as far as performing blind analysis, in which the data are scrambled and altered until a final statistical analysis is performed.
Let’s now go back to our “multiple testing” problem and be a bit more precise about the implications. Suppose we formulate a hypothesis test by defining a null hypothesis $H_0$ and an alternative hypothesis $H_1$. We then set a type-I error $\alpha$, which means that if the null hypothesis $H_0$ were true, we would incorrectly reject it with probability $\alpha$.
In general, given $n$ tests the probability of rejecting the null in any of the tests can be written as
$$\begin{equation}
P(\mathrm{rejecting\ the\ null\ in \ any \ of \ the \ tests})=P(r_1\lor r_2\lor\dots\lor r_n)
\label{eq:prob}
\end{equation}$$
in which $r_j$ denotes the event “the null is rejected at the jth test”.
While it is difficult to evaluate eq. (\ref{eq:prob}) in general, the expression greatly simplifies for independent tests, as will become clear in the next section.
For two independent tests $A$ and $B$ we have that $P(A\land B)=P(A)P(B)$. The hypothesis of independent tests can thus be used to simplify the expression (\ref{eq:prob}) as
$$\begin{equation}
P(r_1\lor r_2\lor\dots\lor r_n) = 1 - P(r_1^* \land r_2^* \land\dots\land r_n^* ) = 1 - \prod_{j=1}^n P(r_j^* ),
\end{equation}$$
where $ r_j^* $ denotes the event “the null is NOT rejected at the jth test”.
What is the consequence of all this? Let’s give an example.
Suppose that we do a test where we fix the type-I error at $\alpha=5\%$. By definition, if we do one test only, we will reject the null 5% of the time when the null is actually true (making an error…). What if we make 2 tests? What are the chances of committing a type-I error then? The error will be
$$\begin{equation} P(r_1\lor r_2) = 1 - P(r_1^* \land r_2^* ) = 1-P(r_1^* )P(r_2^* ) = 1-0.95\times0.95=0.0975 \end{equation}$$What if we do $n$ tests then? Well, the effective type-I error will be
\begin{equation}
\bbox[lightblue,5px,border:2px solid red]{\mathrm{Type \ I \ error} = 1-(1-\alpha)^n} \ \ \ (\mathrm{independent \ tests})
\label{eq:typeI_independent}
\end{equation}
What can we do to prevent this from happening?
Note that in order to arrive at the correction below, we have assumed $\alpha\ll 1$ and used Taylor’s expansion $(1-x)^m\approx 1-mx$.
What we have just shown is that, for independent tests, we can account for multiple hypothesis testing by correcting the type-I error by a factor $1/n$, such that
$$\bbox[lightblue,5px,border:2px solid red]{\alpha_\mathrm{eff} = \frac{\alpha}{n} }$$
This is what goes under the name of Bonferroni correction, from the name of the Italian mathematician Carlo Emilio Bonferroni.
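A quick numerical check of eq. (\ref{eq:typeI_independent}) and of the Bonferroni correction ($\alpha$ and $n$ below are arbitrary illustrative values):

```python
# Family-wise type-I error for n = 10 independent tests at alpha = 0.05
alpha, n = 0.05, 10

fwer_uncorrected = 1 - (1 - alpha)**n     # about 0.40: far above the nominal 5%
fwer_bonferroni  = 1 - (1 - alpha/n)**n   # about 0.049: back near alpha

print(fwer_uncorrected, fwer_bonferroni)
```

Note that with the corrected per-test level $\alpha/n$, the family-wise error lands slightly below $\alpha$, consistent with the Taylor-expansion argument above.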
In this section we are going to run several t-tests on independent sets of data with the null hypothesis being true. We will see that eq. (\ref{eq:typeI_independent}) fits the real type-I error well.
We will draw samples from two Bernoulli distributions A and B, each with a probability $p=0.5$ of success. Each hypothesis test looks like
$$\begin{eqnarray}
H_0 &:& \Delta \mu = \mu_B - \mu_A = 0 \nonumber \\
H_1 &:& \Delta \mu = \mu_B - \mu_A \neq 0 \nonumber
\end{eqnarray}$$
where $\mu_A$ and $\mu_B$ are the two sample means.
By our definition, the null $H_0$ is true, as we are going to set $\mu_A=\mu_B=0.5$ (hence $\Delta \mu=0$). The figure below shows the probability of committing a type-I error as a function of the number of independent t-tests, assuming $\alpha=0.05$. Without correction, the Monte Carlo results are well fitted by eq. (\ref{eq:typeI_independent}) and show a rapid increase of the type-I error rate. Applying the Bonferroni correction does succeed in controlling the error at the nominal 5%.
Below is the Python code that I used to produce the figure above (code that you can also download). The parameter nsims that I used for the figure was 5000, but I had my machine running for a couple of hours so I decided to use 1000 as default. Give it a try, if you are curious!
import scipy as sp
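If you don’t want to run the full script, a minimal sketch of the Monte Carlo experiment could look as follows; the function name, the sample sizes, and the use of `scipy.stats.ttest_ind` are my own illustrative choices:

```python
import numpy as np
from scipy import stats

def typeI_rate(n_tests, alpha=0.05, nsims=500, nsamples=500, seed=0):
    """Fraction of simulated 'experiments' in which at least one of
    n_tests t-tests on null data (both groups Bernoulli(0.5)) rejects H0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(nsims):
        for _ in range(n_tests):
            a = rng.binomial(1, 0.5, nsamples)
            b = rng.binomial(1, 0.5, nsamples)
            _, p = stats.ttest_ind(a, b)
            if p < alpha:           # use alpha / n_tests for Bonferroni
                rejections += 1
                break               # one rejection is enough for this experiment
    return rejections / nsims

rate = typeI_rate(5)   # eq. (typeI_independent) predicts 1 - 0.95**5, about 0.23
```

Replacing `alpha` with `alpha / n_tests` in the rejection condition implements the Bonferroni correction and should bring the rate back near 5%.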
⁂
Below is a list of the topics of this blog post
Parabolic partial differential equations model important physical phenomena such as heat conduction (described by the heat equation) and diffusion (for example, Fick’s law). Under an appropriate transformation of variables the Black-Scholes equation can also be cast as a diffusion equation. I might actually dedicate a full post in the future to the numerical solution of the Black-Scholes equation, that may be a good idea…
The simplest parabolic problem is of the type
\begin{equation}
\frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2} \ + \mathrm{I.B.C.}
\label{eq:diffusion}
\end{equation}
where $D$ is a diffusion/heat coefficient (for simplicity, assumed to be independent of the time $t$ and space $x$ variables) and I.B.C. stands for “initial and boundary conditions”. The unknown $u$ represents a generic quantity of interest, which depends on the specific problem (could be temperature, concentration, price of an option,…).
The boundary conditions specify the constraints that the solution has to fulfill at the boundaries of the domain of interest. For example, let’s assume we have to solve a problem of heat conduction along a metallic rod as in the figure below. We keep the two sides of the rod at a fixed temperature $T^*$ and the initial temperature profile inside the domain is given by a function $T_0(x)$, which in general is dependent on the coordinate $x$.
This problem can be well approximated by a 1D model of heat conduction (as we assume that the length of the rod is much larger than the dimensions of its section). For simplicity, let’s assume $D=1$ in eq. (\ref{eq:diffusion}). The heat diffusion problem then requires finding a function $T(x,t)$ that satisfies the following equations
$$\left\{ \begin{eqnarray} \frac{\partial T}{\partial t} & = & \frac{\partial^2 T}{\partial x^2}, \ \ & 0\leq x \leq L \nonumber \\ T(t,0) & = & T(t,L)=T^* & \ \ \ \ \forall t \nonumber \\ T(0,x) & = & T_0(x) & \nonumber \end{eqnarray} \right.$$where this time I have explicitly written the initial and boundary conditions. This is now a well-defined problem that can be solved numerically.
The numerical solution involves writing approximate expressions for the derivatives above, via a method called finite differences, that I describe briefly below.
Derivatives, such as the quantities $\partial u/\partial t$ and $\partial^2 u/\partial x^2$ that appear in eq. ($\ref{eq:diffusion}$), are a mathematical abstraction that can only be approximated when using numerical techniques.
Going back to the very definition of the derivative of a one-dimensional function $f(x)$, the derivative $f’(x)$ of this function at the point $x$ is defined as
$$\begin{equation} f'(x) = \underset{\Delta x\rightarrow 0}\lim\frac{f(x+\Delta x)-f(x)}{\Delta x} = \underset{\Delta x\rightarrow 0}\lim\frac{\Delta f}{\Delta x} \end{equation}$$Now, while a CPU can easily calculate the differences $\Delta f$ and $\Delta x$, it is a different story to calculate the limit $\Delta x\rightarrow 0$: that is not a basic arithmetic operation. So, how do we calculate the derivative? The trick is to calculate the ratio $\Delta f/\Delta x$ for a “sufficiently small” value of $\Delta x$. We can in fact approximate the derivative in different ways, making use of Taylor’s theorem
$$\begin{eqnarray} f(x+\Delta x) = f(x) + f'(x)\Delta x + \frac{1}{2}f''(x)\Delta x^2 + \frac{1}{6}f'''(x)\Delta x^3 + \dots \label{eq:taylor1}\\ f(x-\Delta x) = f(x) - f'(x)\Delta x + \frac{1}{2}f''(x)\Delta x^2 - \frac{1}{6}f'''(x)\Delta x^3 + \dots \label{eq:taylor2} \end{eqnarray}$$Summing up the two equations above, we get
$$\begin{equation}
f(x+\Delta x) + f(x-\Delta x) = 2f(x) + f''(x)\Delta x^2 + O(\Delta x^4),
\end{equation}$$
where $O(\Delta x^4)$ indicates terms containing fourth and higher order powers of $\Delta x$ (negligible with respect to lower powers of $\Delta x$ for $\Delta x \rightarrow 0$).
This immediately gives us an expression for the second derivative $f’’(x)$,
$$\begin{equation}
f''(x) = \frac{f(x+\Delta x)-2f(x)+f(x-\Delta x)}{\Delta x^2} + O(\Delta x^2)
\label{eq:fd1}
\end{equation}$$
and for the approximation that is suited for calculations using a CPU,
$$\begin{equation}
\bbox[lightblue,5px,border:2px solid red]{f''(x) \simeq \frac{f(x+\Delta x)-2f(x)+f(x-\Delta x)}{\Delta x^2}}
\label{eq:fd2}
\end{equation}$$
Similarly, subtraction of eq. (\ref{eq:taylor2}) from eq. (\ref{eq:taylor1}) leads to the centraldifference approximation to the first derivative,
$$\begin{equation}
\bbox[lightblue,5px,border:2px solid red]{f'(x) \simeq \frac{f(x+\Delta x)-f(x-\Delta x)}{2\Delta x}}
\label{eq:fd3}
\end{equation}$$
with an error $O(\Delta x^2)$. Alternative approximations to the first derivative are the forwarddifference formula,
$$\begin{equation}
\bbox[lightblue,5px,border:2px solid red]{f'(x) \simeq \frac{f(x+\Delta x)-f(x)}{\Delta x}}
\label{eq:fd4}
\end{equation}$$
with an error $O(\Delta x)$, and the backwarddifference formula
$$\begin{equation}
\bbox[lightblue,5px,border:2px solid red]{f'(x) \simeq \frac{f(x)-f(x-\Delta x)}{\Delta x}}
\label{eq:fd5}
\end{equation}$$
also with an error $O(\Delta x)$.
Higher order (or mixed) derivatives can be calculated in a similar fashion. The important point is that the approximations (\ref{eq:fd2},\ref{eq:fd3},\ref{eq:fd4},\ref{eq:fd5}) form the building blocks for the numerical solution of ordinary and partial differential equations.
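A quick numerical sanity check of these approximations; the function ($f=\sin$) and the evaluation point are arbitrary choices of mine:

```python
import numpy as np

# Approximate f'(1) and f''(1) for f = sin, where the exact values
# are cos(1) and -sin(1), and watch how the error shrinks with dx
f, x = np.sin, 1.0

for dx in (1e-2, 1e-3):
    central  = (f(x + dx) - f(x - dx)) / (2 * dx)          # error O(dx^2)
    forward  = (f(x + dx) - f(x)) / dx                     # error O(dx)
    backward = (f(x) - f(x - dx)) / dx                     # error O(dx)
    second   = (f(x + dx) - 2 * f(x) + f(x - dx)) / dx**2  # error O(dx^2)
    print(dx,
          abs(central - np.cos(x)),
          abs(forward - np.cos(x)),
          abs(backward - np.cos(x)),
          abs(second + np.sin(x)))
```

Shrinking $\Delta x$ by a factor of 10 should shrink the central-difference error by roughly a factor of 100, but only by a factor of 10 for the forward and backward formulas, in line with the stated error orders.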
In the next section I will discuss a first algorithm that uses these approximations to solve eq. ($\ref{eq:diffusion}$) and goes under the name of explicit solution. It is a fast and intuitive algorithm, that however has the drawback of only working under some well defined conditions, otherwise the solution is unstable (i.e., the numerical solution will get NaN values pretty quickly…).
Let’s go back to eq. ($\ref{eq:diffusion}$) and think about a possible way of solving this problem numerically. Using the results of the previous section, we can discretize the derivative $\partial u/\partial t$ using any of the formulas above (central/forward/backward differencing) and the derivative $\partial^2 u/\partial x^2$ using eq. (\ref{eq:fd2}). As it turns out, for stability reasons it is better to avoid central differencing for $\partial u/\partial t$. We will use forward differencing instead (we could also choose backward differencing), and the fully discretized equation then looks like
$$\begin{equation}
\frac{u_{i,j+1}-u_{i,j}}{\Delta t} = D\frac{u_{i+1,j}-2u_{i,j}+u_{i-1,j}}{\Delta x^2},
\label{eq:fd}
\end{equation}$$
to which the initial and boundary conditions need to be added.
Note how eq. (\ref{eq:fd}) implies having already defined space and time grids. Let’s assume the spatial resolution is $\Delta x$ and the temporal resolution is $\Delta t$. Then if $u(i,j)$ is the solution at position $x$ and time $t$, $u(i+1,j+1)$ represents the solution at position $x+\Delta x$ and time $t + \Delta t$. Visually, the space-time grid can be seen in the figure below.
The initial and boundary conditions are represented by red dots in the figure. The space grid is represented by the index $i$ in eq. (\ref{eq:fd}), while the time is represented by the index $j$. At the beginning of the simulation, the space domain is discretized into $N$ grid points, so that $i=1\dots N$. The index $j$ starts from $j=1$ (the initial condition) and runs for as long as needed to capture the dynamics of interest.
Looking at the figure above, it is clear that the only unknown in eq. (\ref{eq:fd}) is $u_{i,j+1}$. Solving for this quantity, we get
$$\begin{equation}
u_{i,j+1} = r\,u_{i+1,j} + (1-2r)u_{i,j} + r\,u_{i-1,j}
\label{eq:explicit}
\end{equation}$$
where $r = D\Delta t/\Delta x^2$. As shown in Figure 3, the unknown $u_{i,j+1}$ only depends on quantities at the previous time step $j$, and only at the grid points $i-1, i$ and $i+1$.
Equation (\ref{eq:explicit}) can be recast in matrix form. Remember that the boundary conditions are given at the grid points $i=1$ and $i=N$, so the real unknowns are the values at the grid points $i=2,\dots,N-1$ and at the time step $j+1$, giving
$$\begin{equation} \left( \begin{matrix} u_{2,j+1} \\ u_{3,j+1} \\ \vdots \\ u_{N-2,j+1} \\ u_{N-1,j+1} \\ \end{matrix} \right) = \left( \begin{matrix} 1-2r & r & & & \\ r & 1-2r & r & \Huge{0} &\\ & & \ddots & & & \\ & \Huge{0} & r & 1-2r & r \\ & & & r & 1-2r \\ \end{matrix} \right) \left( \begin{matrix} u_{2,j} \\ u_{3,j} \\ \vdots \\ u_{N-2,j} \\ u_{N-1,j} \\ \end{matrix} \right) + \left( \begin{matrix} r u_{1,j} \\ 0 \\ \vdots \\ 0 \\ r u_{N,j} \\ \end{matrix} \right) \label{eq:matrix1} \end{equation}$$which can be rewritten as
$$\begin{equation} \textbf{u}_{j+1} = B\textbf{u}_{j} + \textbf{b}_{j}. \label{eq:matrix2} \end{equation}$$where $B$ is a square $(N-2)\times (N-2)$ matrix and $\textbf{b}_{j}$ is a vector of boundary conditions. The matrix $B$ is tridiagonal.
It is pretty easy to implement the solution (\ref{eq:matrix2}) in a code, but there is a potential big problem that we should be aware of when using this type of algorithm (explicit): the solution can be unstable! This means that any small and unavoidable roundoff error can degenerate to a huge error (and eventually NaNs) after a few hundreds or thousands of iterations.
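A minimal sketch of the explicit update (\ref{eq:explicit}), run once with $r\leq 1/2$ and once with $r>1/2$, illustrates the instability; the grid size, step count and initial condition are illustrative choices of mine:

```python
import numpy as np

def explicit_diffusion(r, nsteps=200, N=51):
    """March the explicit update u_{i,j+1} = r u_{i+1,j} + (1-2r) u_{i,j} + r u_{i-1,j}
    with zero boundary values."""
    x = np.linspace(0.0, 1.0, N)
    u = np.where(x <= 0.5, 2*x, 2*(1 - x))   # triangular initial condition
    for _ in range(nsteps):
        # the right-hand side is evaluated fully before the assignment,
        # so only values from the previous time step are used
        u[1:-1] = r*u[2:] + (1 - 2*r)*u[1:-1] + r*u[:-2]
    return u

stable   = explicit_diffusion(r=0.4)   # r <= 1/2: the solution decays smoothly
unstable = explicit_diffusion(r=0.6)   # r > 1/2: high-frequency modes blow up
```

For $r\leq 1/2$ the update is a convex combination of neighbouring values, so the solution can never exceed its initial maximum; for $r>1/2$ the amplitude of the fastest-oscillating grid mode grows at every step.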
How can we make sure we don’t suffer from this problem? That’s the subject of the next section.
Equation ($\ref{eq:matrix2}$) is only an approximate solution of eq. ($\ref{eq:diffusion}$). In fact, it is not even possible to solve exactly eq. ($\ref{eq:matrix2}$), let alone eq. (\ref{eq:diffusion}), because of roundoff errors. How can we guarantee that these initially small errors will not accumulate over many iterations causing a catastrophic runtime error during execution of our code?
Let’s do a direct calculation. Assume that the true solution of ($\ref{eq:matrix2}$), which as noted is already an approximation of the original problem, is given by the quantity $\textbf{u}_1$, while due to round-off errors our code finds the solution $\textbf{u}_1^*$. What is going to happen to the error $\textbf{e}_1 = \textbf{u}_1 - \textbf{u}_1^*$? Is it going to be amplified at successive iterations, or will it always be possible to bound the propagated error by some number independent of the iteration step $j$?
To answer this question, we can apply recursively eq. (\ref{eq:matrix2}) to find
$$\begin{eqnarray} \textbf{u}_j=B^j\textbf{u}_1+B^{j-1}\textbf{b}_1+B^{j-2}\textbf{b}_2+\dots+\textbf{b}_{j} \\ \textbf{u}^*_j=B^j\textbf{u}^*_1+B^{j-1}\textbf{b}_1+B^{j-2}\textbf{b}_2+\dots+\textbf{b}_{j} \end{eqnarray}$$where $B^j=B\times B\dots \times B$ ($j$ times). It follows that at the time step $j$ the difference of the solutions will be $$\textbf{e}_j=\textbf{u}_j-\textbf{u}^*_j=B^j \textbf{e}_1,$$
which implies
$$\|\textbf{e}_j\|\leq\|B^j\|\,\|\textbf{e}_1\|\ .$$
According to Lax and Richtmyer, for the numerical scheme to be stable there should exist a positive number $M$, independent of $j, \Delta x$ and $\Delta t$, such that $\|B^j\|\leq M \ \forall \ j\ $, so that
$$\begin{equation} \|\textbf{e}_j\|\leq M \|\textbf{e}_1\|\ . \label{eq:laxritchmyer} \end{equation}$$This ensures that a small error at the first time step, $\textbf{e}_1$, will not propagate catastrophically at subsequent time steps.
The necessary and sufficient condition for the Lax-Richtmyer stability condition (\ref{eq:laxritchmyer}) to be satisfied is $\|B\|\leq 1$. For the explicit method of eqs. (\ref{eq:matrix1}) and (\ref{eq:matrix2}), it can be shown that a sufficient condition to guarantee $\|B\|\leq 1$ is
$$r\leq \frac{1}{2}$$
which, remembering that $r = D\Delta t/\Delta x^2$, translates into
$$\begin{equation}
\bbox[lightblue,5px,border:2px solid red]{\Delta t \leq \frac{\Delta x^2}{2D}\ .}
\label{eq:stability}
\end{equation}$$
Unfortunately, in many practical applications it is not possible to satisfy eq. (\ref{eq:stability}) without strongly compromising the efficiency of an algorithm. For example, during the implosion of deuterium (D) and tritium (T) spherical pellets in inertial confinement fusion (ICF) experiments, the D and T ions are subject to mass diffusion (à la Fick). The variable $u$ of eq. (\ref{eq:diffusion}) is in this case the species concentration. In particular conditions the diffusion coefficient $D$ may reach very large values, on the order of $10^2$ m$^2/$s. Now, usually in ICF simulations the grid size is constrained to resolve distances on the order of a $\mu$m (i.e., $\Delta x \sim 10^{-6}$ m). Applying eq. (\ref{eq:stability}), we see that to guarantee stability with an explicit scheme the time step would need to be kept to values $\Delta t\sim 10$ fs or $10^{-14}$ s. Typical ICF codes run with time steps on the order of $10^{-12}$ s. If we were to model species diffusion with an explicit scheme, we would then slow down an ICF code by 2 orders of magnitude (!!).
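The arithmetic of this estimate is easy to check, using the values quoted above:

```python
D  = 1.0e2     # ion diffusion coefficient, m^2/s (value quoted above)
dx = 1.0e-6    # grid size, m (micron-scale resolution)

dt_max = dx**2 / (2 * D)   # explicit-scheme stability bound, eq. (stability)
print(dt_max)              # about 5e-15 s: a few femtoseconds
```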
Is it possible to avoid having a time step constrained by the condition (\ref{eq:stability})? Yes, luckily it is. We need to use a different scheme of solution, called implicit. This comes at the price of having to solve a set of equations that is numerically more demanding. We will see in the next post of the series how a popular implicit scheme (Crank-Nicolson) works.
⁂