Currently, in Evaluation-Tools/Step2_AnalysisofResults.py, MSE is computed using this code. The printed result (computed in lines 150-151) is then the arithmetic mean of the MSE variable (or a subset of indices).
It seems that this computes the arithmetic mean of the Euclidean distances between the predicted and ground truth positions, whereas the definition of RMSE calls for the quadratic mean.
More closely: let x1 and y1 denote 1D NumPy arrays containing ground truth coordinates for one bodypart for every frame, and similarly let x2 and y2 denote the predicted coordinates. Let's call RMSE the value which would be computed by np.nanmean(MSE.values.flatten()) with the current code (for simplicity, we ignore the train/test set split). Let's also assume there are no NaN values. With this notation, the computation for MSE (lines 82-83) may be rewritten as follows:
MSE = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
This means MSE[i] is the Euclidean distance between the ground truth and predicted positions in the i-th frame. Now RMSE is calculated as RMSE = np.mean(MSE).
However, the correct MSE and RMSE should be:
MSE_B = (x1 - x2) ** 2 + (y1 - y2) ** 2
RMSE_B = np.sqrt(np.mean(MSE_B))
(the name MSE_B is still misleading because it's actually not a mean, it's a vector of squared distances)
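To make the difference concrete, here is a minimal sketch on synthetic data. The arrays, the random seed, and the names mean_dist and rmse are made up for illustration and are not taken from the repository.

```python
import numpy as np

# Synthetic ground truth and predictions for one bodypart (illustrative only).
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 100, size=50)      # ground truth x
y1 = rng.uniform(0, 100, size=50)      # ground truth y
x2 = x1 + rng.normal(0, 3, size=50)    # predicted x
y2 = y1 + rng.normal(0, 3, size=50)    # predicted y

# What the current code reports: arithmetic mean of per-frame Euclidean distances.
dist = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
mean_dist = np.mean(dist)

# RMSE in the sense described above: quadratic mean of the same distances.
sq_dist = (x1 - x2) ** 2 + (y1 - y2) ** 2
rmse = np.sqrt(np.mean(sq_dist))

print(f"mean distance: {mean_dist:.3f}")
print(f"RMSE:          {rmse:.3f}")
```

The two values coincide only when every frame has the same distance; otherwise the mean distance is strictly smaller.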
If this is true, it means that the reported RMSE is always lower than or equal to the actual one (because QM >= AM)! The code does not explicitly say that the printed error is supposed to be the RMSE, but myconfig.py and the Demo Guide suggest that it is.
Am I wrong about this?
Fair enough - thanks for raising this! We compute the Euclidean distance for any pair of 2D points (e.g. human vs. predicted). This matrix (indexed by image & body part) is then averaged. In the paper we write: "To compare between datasets generated by the human scorer, as well as with or between model-generated labels, we used the Euclidean distance (root mean square error, RMSE) calculated pairwise per body part. Depending on the context, this metric is either shown for a specific body part, averaged over all body parts, or averaged over a set of images."
So (at least) there it is clear that we actually take the mean of the Euclidean distances, for which "Mean Absolute Error" or "Mean absolute Euclidean distance" would have been a better name. However, the name in the paper was based on the misleading equivalence of "Euclidean distance (root mean square error, RMSE)", which for a single pair of 2D points differ by a factor of sqrt(2). We will clarify the names in a future update. We think the mean of the Euclidean distances is a better evaluation metric than the RMSE of the Euclidean distances. Note that during training the "RMSE" is not minimized, so this only affects the evaluation.
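For reference, the sqrt(2) relationship can be checked on a single pair of 2D points. This is only a sketch with made-up coordinates; gt, pred, and rmse_coords are illustrative names, and "RMSE" here is assumed to mean the root of the mean squared difference over the two coordinates.

```python
import numpy as np

# One ground truth point and one predicted point (made-up values).
gt = np.array([10.0, 20.0])
pred = np.array([13.0, 24.0])
diff = pred - gt                      # (3, 4)

# Euclidean distance between the two points.
euclid = np.linalg.norm(diff)

# RMSE over the two coordinate differences: sqrt((dx**2 + dy**2) / 2).
rmse_coords = np.sqrt(np.mean(diff ** 2))

# The ratio is sqrt(2) regardless of the chosen points (as long as they differ).
ratio = euclid / rmse_coords
print(euclid, rmse_coords, ratio)
```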
Thank you for clarifying this. I scanned the paper briefly for clarification but missed the section you cite here. Glad to hear that the paper states this correctly.
As for whether the mean of the distances (which is essentially Mean Absolute Error) or RMSE is a better evaluation metric, MAE does seem easier to interpret.
Hi guys, sorry, but it's not 100% clear to me yet: for calculating the human variability (error between two annotations), do you use
MSE = (((x1 - x2) ** 2) + ((y1 - y2) ** 2))/2
or
MAE = np.abs((x1- x2) + (y1 - y2))/2
The errors for all annotated images + labels are then summed up and divided by the number of images + labels, right?
In the Cheetah case, another distance calculation was used, computing the distance between the X and Y markers (see here).
So I am not quite sure which of these three I should use.
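As a quick sanity check on how the candidate formulas behave, here is a small sketch on one made-up annotation pair. The coordinates and the names half_sq, abs_of_sum, and euclid are illustrative only; the third quantity is the per-point Euclidean distance discussed earlier in this thread.

```python
import numpy as np

# Two annotations of the same marker; the offsets cancel: dx = -3, dy = +3.
x1, y1 = np.array([10.0]), np.array([20.0])
x2, y2 = np.array([13.0]), np.array([17.0])

# First candidate: half of the squared distance.
half_sq = ((x1 - x2) ** 2 + (y1 - y2) ** 2) / 2

# Second candidate: absolute value of the summed differences, halved.
# Opposite-signed offsets cancel here, so the error can vanish.
abs_of_sum = np.abs((x1 - x2) + (y1 - y2)) / 2

# Euclidean distance between the two annotations.
euclid = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

print(half_sq[0], abs_of_sum[0], euclid[0])
```

In this example the second formula reports no error even though the two annotations are several pixels apart, because the x and y differences have opposite signs.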
Whether it's between two annotation rounds (human vs human) or human vs model this is the code used in the Nature Neuroscience paper: