Mediapipe: Hand tracking landmarks - Z value range

Created on 27 May 2020 · 14 Comments · Source: google/mediapipe

I am failing to find any kind of documentation or example that would explain the exact definition/behavior of the estimated Z coordinates returned by the hand tracking graph.

I am able to extract the landmark data as X, Y and Z coordinates. The X and Y coordinates are clearly normalized, but the Z coordinates take values for which I have no reference: they are not normalized, they are sometimes negative and sometimes positive, and they don't appear to adhere to any coherent scale. What is clear is that they are most likely relative to each other.

Could somebody shed some light on the estimated Z coordinates, especially the scale they adhere to?

hands

Tectu

👍3

Most helpful comment

The hand model uses "scaled orthographic projection" (or, weak perspective), with some fixed average depth (Z avg).

Weak-perspective projection is an orthographic projection plus a scaling, which serves to approximate perspective projection by assuming that all points on a 3D object are at roughly the same distance from the camera.

The justification for using weak perspective is that in many cases it approximates perspective closely, in particular when the variation in depth of the object along the line of sight (delta Z) is small compared to the fixed average depth (Z avg). It also means that objects at a distance are not distorted by perspective but only scaled uniformly up or down.

The z predicted by the model is a relative depth, based on the Z avg of a "typical hand depth" (for example, holding a phone in one hand while the other hand is tracked, or being close to the phone and showing both hands).
Also, the range of z is unconstrained, but it is scaled proportionally along with x and y (via weak projection) and expressed in the same units as x & y.

There is a root landmark point (wrist) that all the other landmark depths are relative to (again normalized via weak projection w.r.t. x & y).

mcclanahoochie on 4 Jun 2020

👍5 ❤1
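
To make the weak-perspective description above concrete, here is a minimal Python sketch. It is not MediaPipe's code: the function name `weak_perspective`, the focal length `f`, and the sample points are all hypothetical, and the sign convention (smaller z means closer to the camera) is an assumption. It only illustrates that every point is scaled by one shared factor s = f / Z avg, and that depth is reported relative to a root point (the wrist) in the same scaled units as x and y, which is why it can be negative or positive and has no fixed range.

```python
def weak_perspective(points, root_index=0, f=1.0):
    """Project 3D camera-space points with scaled orthographic projection.

    points: list of (X, Y, Z) tuples in camera space (hypothetical units).
    root_index: index of the root point (the wrist in the hand model).
    f: assumed focal length; not a value taken from MediaPipe.
    """
    z_avg = sum(p[2] for p in points) / len(points)
    s = f / z_avg  # one scale factor shared by all points
    z_root = points[root_index][2]
    # x and y are scaled uniformly; z is scaled the same way, relative to the root.
    return [(s * x, s * y, s * (z - z_root)) for x, y, z in points]


# Hypothetical hand: wrist 50 units from the camera, one point 4 units closer,
# another 4 units farther away.
hand = [(0.0, 0.0, 50.0), (5.0, 10.0, 46.0), (-5.0, 10.0, 54.0)]
projected = weak_perspective(hand)
```

In this sketch the wrist always gets z = 0, the closer point gets a negative z and the farther point a positive z. Recovering metric depth from such values would require knowing the scale factor, which the landmark output discussed in this thread does not provide.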

All 14 comments

I am wondering about this also.

brianm-sra on 1 Jun 2020

Normalized X ranges from 0 to 1, with its origin at the origin of the image x-axis.
Normalized Y ranges from 0 to 1, with its origin at the origin of the image y-axis.
Normalized Z has its origin at the wrist. That is, if Z is positive, the landmark is out of the page with respect to the wrist; if Z is negative, the landmark is into the page with respect to the wrist.

mgyong on 2 Jun 2020

❤4

Thanks for responding. Can you just clarify "out of the page" and "into the page" in the case of mobile phone (Android/iOS). Does "into" mean closer to device/camera?

brianm-sra on 2 Jun 2020

@brianm-sra Take a piece of paper facing you. Into the page means moving away from your face. Out of the page means moving closer to your face.

mgyong on 2 Jun 2020

👀1

@mgyong,
Did I get it correctly that:

1. All the coordinates are normalized
2. The Z coordinate is relative to the Z coordinate of the 0-indexed output landmark (which is the "wrist")

And, if 1. is correct, could you elaborate more on the normalization formula or share a link to the code? (linking to #739)

azahreba on 2 Jun 2020

Well, they are normalized values (range [0..1]). Simply scale by the frame dimensions to determine the pixel-based location:

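// Scale the normalized coordinates by the frame size to get pixel coordinates.
int x = static_cast<int>(landmark_normal_x * image.width());
int y = static_cast<int>(landmark_normal_y * image.height());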

Tectu on 2 Jun 2020

👍3 🚀1
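
The scaling described in the comment above can be sketched in Python as follows. The helper name `to_pixels` and the 640x480 frame size are hypothetical; the clamping step is an added assumption so that a normalized value of exactly 1.0, or one slightly outside the frame, still yields a valid pixel index. This applies to x and y only and says nothing about z.

```python
def to_pixels(norm_x, norm_y, width, height):
    """Convert normalized [0, 1] image coordinates to integer pixel indices."""
    # Clamp so that 1.0 (or a value slightly outside the frame) still maps
    # to a valid index in [0, width - 1] x [0, height - 1].
    px = min(max(int(norm_x * width), 0), width - 1)
    py = min(max(int(norm_y * height), 0), height - 1)
    return px, py


# Wrist landmark values logged in this thread, on a hypothetical 640x480 frame.
wrist_px = to_pixels(0.7117575, 0.5807782, 640, 480)
```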

@mgyong You wrote that "Normalized Z gives 0 to 1", but that is NOT what I am seeing on Android as output from NormalizedLandmark.getZ(). Instead, I have seen values ranging from approximately -80.0 to +80.0 in different tests. Here are a couple of examples from a recent test with my app, which is based on the 3D version of the hand tracker graph. Note that the Z values are outside the 0 to 1 range. I am also curious how to determine how far the "paper" is from the Android phone camera.

MediaPipeHandTracker: Landmark[0]: (0.7117575, 0.5807782, 1.283884E-4)
MediaPipeHandTracker: Landmark[1]: (0.7133818, 0.5926434, 8.609375)
MediaPipeHandTracker: Landmark[2]: (0.71597075, 0.60017556, 12.796875)
MediaPipeHandTracker: Landmark[3]: (0.7186818, 0.6051307, 16.171875)
MediaPipeHandTracker: Landmark[4]: (0.7203635, 0.6082897, 18.5625)
MediaPipeHandTracker: Landmark[5]: (0.7297122, 0.60433185, 4.828125)
MediaPipeHandTracker: Landmark[6]: (0.73730975, 0.608294, 10.921875)
MediaPipeHandTracker: Landmark[7]: (0.7415061, 0.6100682, 14.1640625)
MediaPipeHandTracker: Landmark[8]: (0.74460053, 0.6116357, 15.7890625)
MediaPipeHandTracker: Landmark[9]: (0.735119, 0.60136366, 4.2734375)
MediaPipeHandTracker: Landmark[10]: (0.7428533, 0.6047355, 10.25)
MediaPipeHandTracker: Landmark[11]: (0.7464073, 0.6058627, 14.0234375)
MediaPipeHandTracker: Landmark[12]: (0.74882674, 0.60691535, 16.6875)
MediaPipeHandTracker: Landmark[13]: (0.73774654, 0.59717864, 5.4140625)
MediaPipeHandTracker: Landmark[14]: (0.7438597, 0.59961534, 10.8671875)
MediaPipeHandTracker: Landmark[15]: (0.7469474, 0.60099816, 13.1171875)
MediaPipeHandTracker: Landmark[16]: (0.7484096, 0.6021778, 14.765625)
MediaPipeHandTracker: Landmark[17]: (0.7387837, 0.5926967, 7.7421875)
MediaPipeHandTracker: Landmark[18]: (0.74255306, 0.594779, 11.65625)
MediaPipeHandTracker: Landmark[19]: (0.7440435, 0.5965584, 13.4921875)
MediaPipeHandTracker: Landmark[20]: (0.7447367, 0.59839743, 15.46875)

MediaPipeHandTracker: Landmark[0]: (0.58928907, 0.61223286, -5.173683E-4)
MediaPipeHandTracker: Landmark[1]: (0.5756706, 0.6165022, 9.328125)
MediaPipeHandTracker: Landmark[2]: (0.56418365, 0.6183474, 10.390625)
MediaPipeHandTracker: Landmark[3]: (0.5544628, 0.6195656, 11.25)
MediaPipeHandTracker: Landmark[4]: (0.54804426, 0.6200474, 12.1953125)
MediaPipeHandTracker: Landmark[5]: (0.5571172, 0.6139561, -11.5390625)
MediaPipeHandTracker: Landmark[6]: (0.5400312, 0.6170851, -12.6015625)
MediaPipeHandTracker: Landmark[7]: (0.53433615, 0.620509, -6.7421875)
MediaPipeHandTracker: Landmark[8]: (0.531901, 0.6224792, -1.4365234)
MediaPipeHandTracker: Landmark[9]: (0.556367, 0.61244273, -14.953125)
MediaPipeHandTracker: Landmark[10]: (0.537187, 0.6163406, -19.078125)
MediaPipeHandTracker: Landmark[11]: (0.53178513, 0.6196333, -13.3203125)
MediaPipeHandTracker: Landmark[12]: (0.5302227, 0.62104964, -7.3554688)
MediaPipeHandTracker: Landmark[13]: (0.5572119, 0.6120599, -15.9609375)
MediaPipeHandTracker: Landmark[14]: (0.5396851, 0.6155686, -17.671875)
MediaPipeHandTracker: Landmark[15]: (0.5339748, 0.6182261, -10.6484375)
MediaPipeHandTracker: Landmark[16]: (0.5321559, 0.61939704, -4.359375)
MediaPipeHandTracker: Landmark[17]: (0.5594273, 0.61234933, -15.6015625)
MediaPipeHandTracker: Landmark[18]: (0.54581386, 0.6152981, -15.5390625)
MediaPipeHandTracker: Landmark[19]: (0.54055375, 0.6172483, -11.0625)
MediaPipeHandTracker: Landmark[20]: (0.5383292, 0.6179763, -6.3984375)

brianm-sra on 4 Jun 2020

👀1 👍1
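
For anyone who wants to inspect logs like the ones above, here is a small Python sketch that parses the `Landmark[i]: (x, y, z)` lines and reports the z range of a frame. The regular expression and the function names are hypothetical and assume exactly the log format shown in this comment. The sample uses three lines copied from the second frame.

```python
import re

# Matches lines such as:
#   MediaPipeHandTracker: Landmark[1]: (0.7133818, 0.5926434, 8.609375)
LANDMARK_RE = re.compile(
    r"Landmark\[(\d+)\]:\s*\(([^,]+),\s*([^,]+),\s*([^)]+)\)"
)


def parse_landmarks(log_text):
    """Return {landmark_index: (x, y, z)} parsed from log output."""
    landmarks = {}
    for match in LANDMARK_RE.finditer(log_text):
        index = int(match.group(1))
        x, y, z = (float(match.group(i)) for i in (2, 3, 4))
        landmarks[index] = (x, y, z)
    return landmarks


def z_range(landmarks):
    """Return (min_z, max_z) over all parsed landmarks."""
    zs = [z for _, _, z in landmarks.values()]
    return min(zs), max(zs)


sample = """
MediaPipeHandTracker: Landmark[0]: (0.58928907, 0.61223286, -5.173683E-4)
MediaPipeHandTracker: Landmark[4]: (0.54804426, 0.6200474, 12.1953125)
MediaPipeHandTracker: Landmark[10]: (0.537187, 0.6163406, -19.078125)
"""
frame = parse_landmarks(sample)
lo, hi = z_range(frame)
```

In both logged frames the wrist (landmark 0) has a z value very close to zero while the other landmarks do not, which is consistent with z being measured relative to the wrist.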

I added some code to my Android app to keep track of the minimum and maximum Z values observed, and tested while moving my hand near and far, at different angles, etc.
At the end of the test, the minimum Z observed was -198.0 and the maximum Z observed was 168.0. Does this make sense? Are these in line with the expected minimum and maximum values of Z for the 3D hand tracking graph?
These values come from NormalizedLandmark.getZ().

brianm-sra on 5 Jun 2020
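
The kind of bookkeeping described in the comment above might look like the following Python sketch. It is not the Android code used for the test, which was not posted; the class name `ZRangeTracker` is hypothetical, and the sample values are taken from the frames logged earlier in this thread.

```python
class ZRangeTracker:
    """Keeps the smallest and largest z seen across all frames."""

    def __init__(self):
        self.min_z = float("inf")
        self.max_z = float("-inf")

    def update(self, frame_landmarks):
        """frame_landmarks: iterable of (x, y, z) tuples for one frame."""
        for _, _, z in frame_landmarks:
            self.min_z = min(self.min_z, z)
            self.max_z = max(self.max_z, z)


tracker = ZRangeTracker()
# Two partial frames built from values logged earlier in this thread.
tracker.update([(0.7117575, 0.5807782, 1.283884e-4),
                (0.7203635, 0.6082897, 18.5625)])
tracker.update([(0.537187, 0.6163406, -19.078125)])
```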

How can I tell whether the palm has been turned over or not?

chensisi0730 on 10 Jun 2020
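
This question was not answered in the thread. One common heuristic, offered here only as a hedged sketch and not as anything MediaPipe provides, avoids z entirely: look at the winding order of the wrist, index-finger base and pinky base in the image. The landmark indices 0, 5 and 17 follow the MediaPipe hand landmark numbering; the function name `palm_side` is hypothetical. Which sign corresponds to "palm facing the camera" depends on whether it is a left or right hand and on whether the image is mirrored, so the sketch only reports the sign, and a sign change between frames suggests the hand was turned over.

```python
WRIST, INDEX_MCP, PINKY_MCP = 0, 5, 17  # MediaPipe hand landmark indices


def palm_side(landmarks):
    """Return +1 or -1 according to the winding of the palm triangle.

    landmarks: mapping or sequence of (x, y, z) tuples indexed by landmark number.
    Returns 0 if the three points are collinear (hand seen edge-on).
    """
    wx, wy, _ = landmarks[WRIST]
    ix, iy, _ = landmarks[INDEX_MCP]
    px, py, _ = landmarks[PINKY_MCP]
    # z-component of the 2D cross product (wrist->index) x (wrist->pinky).
    cross = (ix - wx) * (py - wy) - (iy - wy) * (px - wx)
    return (cross > 0) - (cross < 0)


# Hypothetical frames: the same hand before and after being turned over,
# which swaps the image positions of the index and pinky bases.
before = {0: (0.5, 0.8, 0.0), 5: (0.4, 0.5, 0.0), 17: (0.6, 0.5, 0.0)}
after = {0: (0.5, 0.8, 0.0), 5: (0.6, 0.5, 0.0), 17: (0.4, 0.5, 0.0)}
turned_over = palm_side(before) != palm_side(after)
```

The result is unreliable when the hand is seen nearly edge-on, because the cross product is then close to zero and noise can flip its sign.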

@brianm-sra Did you figure out how far the "paper" is? Can anyone explain clearly how to convert z to 3D camera coordinates?

LeDuySon on 4 Aug 2020

Hi @jiuqiant, @mgyong, @Tectu, @mcclanahoochie, please help.

After reading all of the above, I still don't understand what scale the value of z depends on, or how the z coordinate changes. x and y depend on which pixel of the screen the landmark lies on. It seems certain that z should be used as the depth or distance of the landmark from the device camera, but please clarify how its value changes and what it depends on. Please help.

rajan8garg on 26 Aug 2020

Any update on this? I am also not clear on how to use the z value to get 3D coordinates. Is there a way to get the value of Z avg at least?

lbouis on 12 Sep 2020
