Introduction to Optimal Transport

Liuba
Published in Analytics Vidhya
7 min read · Apr 19, 2020

Optimal Transport (OT) is a mathematical field used in many interesting and popular tasks, such as image segmentation, image restoration, computer vision, and data analysis; in general, OT can be useful for any task involving density distributions.

I started using Optimal Transport tools to characterize the functional neural connections in the brains of patients with different neurological disorders. After appropriate preprocessing, I obtained the eigenvalues of the neural graph of each patient and used them as a feature vector. Since the vectors of eigenvalues live in Euclidean spaces of different dimensions, it was difficult to compute the distance between the eigenvalue vectors of each graph. Hence, I decided to take advantage of OT theory, which provides the means to compare distributions in different dimensions (among many other wonderful use cases). When I started digging into OT, I noticed a lack of introductory information to ease my beginning in the field. After battling my way through, I decided to make this small tutorial on Optimal Transport. With this tutorial, I am also fulfilling one of my TODO items: finally creating a Medium post! I am very excited about this, so please feel free to reach out to me if anything is unclear.

It all started with Gaspard Monge (1746–1818) when he stated the problem of moving a large pile of sand (whose shape can be thought of as a probability distribution) to a target pile with a prescribed shape (or probability distribution) with minimal effort.

How to move pile X to the shape of pile Y with minimal effort? That is, given a cost function c(x, y) of moving a grain x ∈ X to the position y ∈ Y, what is the optimal displacement of all of X to Y?

As repetition always helps to understand a problem better, I will paraphrase the question: what is the optimal transformation T to move several items (discrete distributions) from one place to another? Similarly, for the continuous case, what is the optimal transformation T to move one density function to another?

OT is an optimization problem whose goal is to minimize the cost of transportation. We will see that OT is actually a linear program, which makes the problem more manageable.

Optimal Transport between histograms and discrete measures

Definition 1: A probability vector (also known as a histogram) a is a vector with nonnegative entries that sum to one.

The probability simplex Σₙ is the set of all histograms:

Σₙ = { a ∈ ℝⁿ : aᵢ ≥ 0 for all i, and Σᵢ aᵢ = 1 }.

Definition 2: A discrete measure α represents weighted points in a space X. Namely,

α = Σᵢ aᵢ δ_{xᵢ},

where a is a vector of weights and x₁, …, xₙ ∈ X are the locations. As usual, δ_x denotes the Dirac measure at x, which places a unit of mass at the location x and zero mass everywhere else.

Let us take the simplest case: two uniform probability vectors a and b of the same size n. Let (Cᵢⱼ) be the associated cost matrix of moving entry i of a to entry j of b (the entries are sometimes called bins in the context of OT). Since each bin of a and b is uniform and equals 1/n, we are not concerned about mass transportation yet. Therefore, we only seek the optimal permutation σ that minimizes the transport cost.

Namely, find:

min_σ (1/n) Σᵢ C_{i,σ(i)}, where σ ranges over all permutations of {1, …, n}.

Although the factor 1/n is not necessary in this particular optimization problem, we shall keep it for a natural generalization to mass transportation. This problem is called the assignment problem.
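To make this concrete, here is a minimal sketch of the assignment problem solved with SciPy's linear_sum_assignment (a Hungarian-style solver). The 3×3 cost matrix is a made-up example, not one from this post:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

n = 3
# Hypothetical cost matrix: C[i, j] is the cost of sending bin i of a to bin j of b.
C = np.array([[4.0, 1.0, 3.0],
              [2.0, 0.0, 5.0],
              [3.0, 2.0, 2.0]])

# linear_sum_assignment returns the optimal permutation sigma as (rows, cols).
rows, cols = linear_sum_assignment(C)

# Total transport cost, including the 1/n factor from the uniform weights 1/n.
total = C[rows, cols].sum() / n
```

For this matrix, the optimal permutation sends bin 0 to bin 1, bin 1 to bin 0, and bin 2 to bin 2, for a total of (1 + 2 + 2)/3.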

Let’s point out that the solution is not unique. Consider the following example of transporting xᵢ → yⱼ, for i, j ∈ {1, 2}.

The cost of transporting xᵢ → yⱼ is 1 for all i, j.

The cost matrix (Cᵢⱼ) is therefore the 2×2 matrix whose entries are all equal to 1.

Only two permutations are possible in this case, and both have a cost of 2. Hence, both are solutions to the assignment problem.
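We can check this degenerate example directly: with an all-ones cost matrix, SciPy happily returns one of the two equally good permutations (a small sanity-check sketch):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Every move costs 1, as in the example above.
C = np.ones((2, 2))

identity_cost = C[0, 0] + C[1, 1]  # permutation 1 -> 1, 2 -> 2
swap_cost = C[0, 1] + C[1, 0]      # permutation 1 -> 2, 2 -> 1

# The solver returns one optimal permutation; both have total cost 2.
rows, cols = linear_sum_assignment(C)
solver_cost = C[rows, cols].sum()
```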

Let us move on to a more general case: probability vectors/discrete distributions with different masses. Since we cannot transport a bin of some mass to another bin with a different mass, we have a new restriction to take care of.

Namely, if α, β are discrete measures

α = Σᵢ aᵢ δ_{xᵢ} and β = Σⱼ bⱼ δ_{yⱼ}.

We have to find a function T mapping each location xᵢ to some location yⱼ, such that the mass arriving at each yⱼ equals its weight:

bⱼ = Σ_{i : T(xᵢ) = yⱼ} aᵢ for all j.

This function is called the push forward operator, and the mass conservation condition is compactly denoted as T♯α = β.

This problem, known as the Monge problem, is similar to the assignment problem, with the addition of mass conservation. In complete form, it is written as follows:

min_T { Σᵢ c(xᵢ, T(xᵢ)) aᵢ : T♯α = β }.

Example of a push forward operator:

Transporting blue dots to the red dot while fulfilling the condition that the sum of the masses of the blue dots equals the mass of the red one.
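The situation in the figure can be sketched in a few lines of NumPy; the locations and weights below are invented for illustration:

```python
import numpy as np

# Three "blue" points with weights summing to one (a discrete measure alpha).
blue_locations = np.array([0.0, 1.0, 2.0])
blue_weights = np.array([0.25, 0.25, 0.5])

red_location = 5.0  # the single "red" target point

# A push forward map T sending every blue point to the red point.
def T(points):
    return np.full_like(points, red_location)

# Mass conservation (the push forward of alpha equals beta): the mass
# arriving at the red point equals the total mass that left the blue points.
mass_at_red = blue_weights[T(blue_locations) == red_location].sum()
```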

What if the probability distributions do not match up? Let us consider a final generalization, which accounts for this problem by allowing the mass at any point xᵢ to be dispatched across several locations. This flexibility is known as the Kantorovich relaxation.

In place of a permutation σ or a map T, a coupling matrix P ∈ ℝ^{n×m} is sought. P_{i,j} describes the amount of mass flowing from bin i to bin j. Note that, to be admissible, the matrix P must satisfy that the sum of the masses outgoing from a location equals the total mass of that location. Similarly, the sum of the masses transported to a location must equal the total mass of the target location.

Mathematically, the set of admissible couplings is:

U(a, b) = { P ∈ ℝ₊^{n×m} : P 𝟙ₘ = a and Pᵀ 𝟙ₙ = b },

where 𝟙ₖ denotes the vector of ones of length k.

The generalized optimal problem reads as

L_C(a, b) = min_{P ∈ U(a,b)} Σᵢ,ⱼ Cᵢⱼ Pᵢⱼ,

and it is known as the Kantorovich transport problem.

By allowing mass splitting, we make the problem symmetric, in the sense that if P is a coupling matrix for transporting between (a, b), then the transpose of P is a coupling matrix between (b, a).

Note that both the objective and the constraints are linear, making the Kantorovich optimal transport problem a linear program.
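Since it is a linear program, a small Kantorovich problem can be solved with a generic LP solver. The sketch below uses scipy.optimize.linprog with made-up histograms a, b and cost matrix C, flattening P row by row and encoding the two marginal constraints as equalities:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical histograms with different numbers of bins (n = 2, m = 3).
a = np.array([0.5, 0.5])
b = np.array([0.25, 0.25, 0.5])
n, m = len(a), len(b)

# Hypothetical cost matrix C[i, j].
C = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0]])

# Flatten P (row-major) into a vector of length n*m and build the
# marginal constraints: row sums of P equal a, column sums equal b.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0  # i-th row sum -> a[i]
for j in range(m):
    A_eq[n + j, j::m] = 1.0           # j-th column sum -> b[j]
b_eq = np.concatenate([a, b])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
P = res.x.reshape(n, m)  # the optimal coupling matrix
```

For these particular numbers, the optimal cost res.fun comes out to 0.75.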

Another very interesting and extremely useful result is that, if the cost matrix C encodes a distance D, i.e. Cᵢⱼ = D(xᵢ, yⱼ)ᵖ for some p ≥ 1, then the Kantorovich optimal transport cost actually defines a distance, known as the Wasserstein distance:

W_p(a, b) = ( min_{P ∈ U(a,b)} Σᵢ,ⱼ D(xᵢ, yⱼ)ᵖ Pᵢⱼ )^{1/p}.

The Wasserstein distance is becoming increasingly popular as a loss function in Artificial Neural Networks, since it leverages physical ideas (mass displacement) and geometry (the cost function). For instance, some types of Generative Adversarial Networks (GANs) use this distance.
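For measures on the real line with cost |x − y|, SciPy ships a ready-made implementation of the 1-Wasserstein distance; the support points and weights below are made up for illustration:

```python
from scipy.stats import wasserstein_distance

# Two hypothetical 1-D discrete measures: support points and their weights.
x = [0.0, 1.0, 2.0]
y = [0.0, 1.0]
a = [0.25, 0.5, 0.25]  # weights on x
b = [0.5, 0.5]         # weights on y

# 1-Wasserstein distance (p = 1, D(x, y) = |x - y|).
w = wasserstein_distance(x, y, a, b)
```

For these measures, w equals 0.5.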

One last thing before ending the introduction to discrete measures: we will state the dual optimization problem:

max_{(f, g) ∈ R(C)} ⟨f, a⟩ + ⟨g, b⟩,

with the set of admissible dual variables:

R(C) = { (f, g) ∈ ℝⁿ × ℝᵐ : fᵢ + gⱼ ≤ Cᵢⱼ for all i, j }.

The dual problem is important for calculating the gradient of the Wasserstein distance, which is useful for minimizing the loss function in a Neural Network. We will review it thoroughly in the next tutorials.

Optimal Transport between arbitrary measures

For the sake of completeness, and in case you need them as I did in some of my projects, I will leave here the optimal transport equations for arbitrary measures. The definitions are basically the same as for discrete measures.

Definition 3: An arbitrary measure α on (X, d), where d is a distance, can be assessed by integrating continuous functions f(x) against it:

∫_X f(x) dα(x).

The set of all measures for which the above integral is finite is known as the set of Radon measures M(X).

The push forward measure T♯α is defined by the change-of-variables formula:

∫_Y f(y) d(T♯α)(y) = ∫_X f(T(x)) dα(x) for all continuous f.

Monge problem for arbitrary measures:

min_T { ∫_X c(x, T(x)) dα(x) : T♯α = β }.

Admissible couplings of arbitrary measures:

U(α, β) = { π ∈ M₊(X × Y) : π(A × Y) = α(A) and π(X × B) = β(B) for all measurable A ⊂ X, B ⊂ Y }.

Below you can see an example of a coupling π between two arbitrary measures. By mentally integrating π along the horizontal axis, one can see that the integral coincides with β; analogously, α is recovered by integrating along the vertical axis.

Graphs were taken from the book Computational Optimal Transport, written by Gabriel Peyré and Marco Cuturi.

Kantorovich optimal transport problem for arbitrary measures:

min_{π ∈ U(α,β)} ∫_{X×Y} c(x, y) dπ(x, y).

If X = Y and c(x, y) = d(x, y)ᵖ, then the Wasserstein distance for arbitrary measures is defined as:

W_p(α, β) = ( min_{π ∈ U(α,β)} ∫_{X×X} d(x, y)ᵖ dπ(x, y) )^{1/p}.

Now we know what the assignment, Monge, and Kantorovich problems are, and we have defined the Wasserstein distance. In the following tutorials, I will tell you how to find solutions to these problems. In addition, we will review the gradients of the Wasserstein distance, which are necessary to find good estimators of distributions.

I hope you have enjoyed reading my first post, and I look forward to seeing you next time!


Liuba
Ph.D. Computer Science student at Rice University. Interests: human-robot interaction, autonomous driving, human behavior. Please contact me with any questions!