scipy.stats.wasserstein_distance is great as it stands, but it's unclear whether it can be used directly on a pre-computed frequency vector, without drawing samples from that vector.
I've seen this suggested in the wild. I then tried it myself, and was surprised by how different the result was from, say, pyemd or R's various emd implementations.
import numpy as np
import pyemd
from scipy import stats

a = np.array([.2, .8])
b = np.array([.8, .2])
# stats.wasserstein_distance gives zero
stats.wasserstein_distance(a, b)
# pyemd gives 0.6
pyemd.emd(a, b, np.ones((2, 2)))  # cost of moving mass between bins is constant
An example in R under identical conditions:
library(emdist)
a = matrix(c(.2,.8), nrow=1)
b = matrix(c(.8,.2), nrow=1)
emd(a,b, dist='manhattan') # is also 0.6, agreeing with pyemd
Is there something I'm missing about how this might be used? Or, do we need something different to use this directly on frequency vectors?
This might tie into how this gets incorporated if it's added to scipy.spatial.distance, since I bet it's much more common to use emd/wasserstein_distance on all pairs of n frequency vectors than on n sets of raw observations.
Sorry, updated to fix a typo. The distribution I meant to include is a frequency vector with 2 bins, with p in one bin and 1-p in the other, whose emd should be 1-2p.
I'm not familiar with those other implementations. @CharlesMasson could you comment on this?
I'm not very familiar with pyemd, but as far as I understand, when using pyemd.emd, you specify a distance matrix that gives the distances between the bins. So when you compute pyemd.emd(np.array([.2,.8]), np.array([.8,.2]), np.ones((2,2))), you say that the distance between the two bins is 1.
scipy.stats.wasserstein_distance only works in the one-dimensional case, and instead of specifying distances between bins, you specify the bin locations. So in that case, we need to generate two bin locations with a distance of 1 between them:
>>> from scipy.stats import wasserstein_distance
>>> bin_locations = [0, 1]
>>> wasserstein_distance(bin_locations, bin_locations, [.2, .8], [.8, .2])
0.6000000000000001
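As a quick check of this encoding, here is a minimal sketch that applies it to the two-bin vectors (p, 1-p) and (1-p, p) from the question, and then to all pairs of a few frequency vectors. The helper names `emd_1d` and `pairwise_emd_1d` are hypothetical, and the default of unit-spaced bin locations 0, 1, ..., k-1 is an assumption; pass real bin centers if the bins are not equally spaced.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def emd_1d(freq_u, freq_v, bin_locations=None):
    """EMD between two frequency vectors defined over the same 1-D bins."""
    freq_u = np.asarray(freq_u, dtype=float)
    freq_v = np.asarray(freq_v, dtype=float)
    if bin_locations is None:
        # Assumption: bins are equally spaced, one unit apart.
        bin_locations = np.arange(len(freq_u))
    # The bin locations are the "values"; the frequencies are their weights.
    return wasserstein_distance(bin_locations, bin_locations, freq_u, freq_v)


def pairwise_emd_1d(freqs, bin_locations=None):
    """Symmetric matrix of EMDs between every pair of rows in `freqs`."""
    freqs = np.asarray(freqs, dtype=float)
    n = len(freqs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = emd_1d(freqs[i], freqs[j], bin_locations)
    return dist


# Two bins one unit apart: the distance between (p, 1-p) and (1-p, p) is |1 - 2p|.
for p in (0.1, 0.2, 0.5):
    assert np.isclose(emd_1d([p, 1 - p], [1 - p, p]), abs(1 - 2 * p))

print(pairwise_emd_1d([[0.2, 0.8], [0.8, 0.2], [0.5, 0.5]]))
```

The pairwise helper is a plain double loop, which should be fine for a modest number of vectors; it is only meant to show the all-pairs use case raised in the question.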
Because we are in the one-dimensional case, there is a way here to encode the distance matrix as real-valued bin locations, but this might not work for more complex distance matrices (those that do not represent distances between points in a one-dimensional space). So, if the one-dimensional case is enough for you, you should be able to use either the pyemd or the scipy function. Otherwise, the scipy function might not fulfill your needs and you might have to use pyemd.
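To make the distance-matrix case concrete, here is a rough sketch that computes the EMD for an arbitrary bin-to-bin cost matrix by solving the underlying transportation problem with scipy.optimize.linprog. This is a generic linear program, not pyemd's algorithm; the function name `emd_from_distance_matrix` is hypothetical, and the sketch assumes both frequency vectors have the same total mass.

```python
import numpy as np
from scipy.optimize import linprog


def emd_from_distance_matrix(freq_u, freq_v, distance_matrix):
    """EMD between two histograms of equal total mass, given bin-to-bin costs."""
    freq_u = np.asarray(freq_u, dtype=float)
    freq_v = np.asarray(freq_v, dtype=float)
    cost = np.asarray(distance_matrix, dtype=float)
    n, m = len(freq_u), len(freq_v)

    # Unknowns: flow[i, j] >= 0, the mass moved from bin i of u to bin j of v,
    # flattened row by row into a vector of length n * m.
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0  # mass leaving bin i sums to freq_u[i]
    for j in range(m):
        a_eq[n + j, j::m] = 1.0           # mass arriving in bin j sums to freq_v[j]
    b_eq = np.concatenate([freq_u, freq_v])

    result = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq, bounds=(0, None))
    if not result.success:
        raise ValueError(result.message)
    return result.fun


# Two bins at distance 1 from each other (zero cost for mass that stays put).
two_bins = np.array([[0.0, 1.0], [1.0, 0.0]])
print(emd_from_distance_matrix([0.2, 0.8], [0.8, 0.2], two_bins))

# Three bins that are all at distance 1 from each other: this matrix cannot be
# reproduced by placing three points on a line.
three_bins = np.ones((3, 3)) - np.eye(3)
print(emd_from_distance_matrix([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], three_bins))
```

Note that the example matrices have a zero diagonal. With np.ones((2, 2)), this plain formulation would also charge for mass that stays in its own bin, so every unit of mass would cost 1 regardless of where it goes.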
Thanks @CharlesMasson