I am failing to find any kind of documentation or example that would explain the exact definition/behavior of the estimated Z coordinates returned by the hand tracking graph.
We're able to successfully extract the landmark data as X, Y and Z coordinates. The X and Y coordinates are clearly normalized, but the Z coordinates take values for which I have no reference: they are not normalized, they are sometimes negative and sometimes positive, and they don't appear to adhere to any coherent scale. What is clear is that they are most likely relative to each other.
Could somebody shed some light on the estimated Z coordinates, especially the scale they adhere to?
I am wondering about this also.
Normalized X gives 0 to 1, where the x-origin is the origin of the image x-coordinate.
Normalized Y gives 0 to 1, where the y-origin is the origin of the image y-coordinate.
Normalized Z, where the z-origin is relative to the wrist z-origin. I.e., if Z is positive, the z-landmark coordinate is out of the page with respect to the wrist. If Z is negative, the z-landmark coordinate is into the page with respect to the wrist.
Thanks for responding. Can you just clarify "out of the page" and "into the page" in the case of a mobile phone (Android/iOS)? Does "into" mean closer to the device/camera?
@brianm-sra Take a piece of paper facing you. Into the page means moving away from the page and away from your face. Out of the page means moving closer to your face.
@mgyong,
Did I get it correctly, that?:
And, if 1. is correct, could you elaborate more on the normalization formula or share a link to the code? (linking to #739)
Well, they are normalized values (range [0..1]). Simply scale by the frame dimensions to determine the pixel-based location:
int x = landmark_normal_x * image.width();
int y = landmark_normal_y * image.height();
@mgyong You wrote that "Normalized Z gives 0 to 1" but that is NOT what I am seeing on Android as output from NormalizedLandmark.getZ(). Instead I have seen larger and smaller values ranging from approximately -80.0 to +80.0 in different tests. Here are a couple of examples from a recent test with my app based on the 3D version of the hand tracker graph. Note that the Z values are outside the 0 to 1 range. Also, I am curious how to determine how far the "paper" is from the Android phone camera.
MediaPipeHandTracker: Landmark[0]: (0.7117575, 0.5807782, 1.283884E-4)
MediaPipeHandTracker: Landmark[1]: (0.7133818, 0.5926434, 8.609375)
MediaPipeHandTracker: Landmark[2]: (0.71597075, 0.60017556, 12.796875)
MediaPipeHandTracker: Landmark[3]: (0.7186818, 0.6051307, 16.171875)
MediaPipeHandTracker: Landmark[4]: (0.7203635, 0.6082897, 18.5625)
MediaPipeHandTracker: Landmark[5]: (0.7297122, 0.60433185, 4.828125)
MediaPipeHandTracker: Landmark[6]: (0.73730975, 0.608294, 10.921875)
MediaPipeHandTracker: Landmark[7]: (0.7415061, 0.6100682, 14.1640625)
MediaPipeHandTracker: Landmark[8]: (0.74460053, 0.6116357, 15.7890625)
MediaPipeHandTracker: Landmark[9]: (0.735119, 0.60136366, 4.2734375)
MediaPipeHandTracker: Landmark[10]: (0.7428533, 0.6047355, 10.25)
MediaPipeHandTracker: Landmark[11]: (0.7464073, 0.6058627, 14.0234375)
MediaPipeHandTracker: Landmark[12]: (0.74882674, 0.60691535, 16.6875)
MediaPipeHandTracker: Landmark[13]: (0.73774654, 0.59717864, 5.4140625)
MediaPipeHandTracker: Landmark[14]: (0.7438597, 0.59961534, 10.8671875)
MediaPipeHandTracker: Landmark[15]: (0.7469474, 0.60099816, 13.1171875)
MediaPipeHandTracker: Landmark[16]: (0.7484096, 0.6021778, 14.765625)
MediaPipeHandTracker: Landmark[17]: (0.7387837, 0.5926967, 7.7421875)
MediaPipeHandTracker: Landmark[18]: (0.74255306, 0.594779, 11.65625)
MediaPipeHandTracker: Landmark[19]: (0.7440435, 0.5965584, 13.4921875)
MediaPipeHandTracker: Landmark[20]: (0.7447367, 0.59839743, 15.46875)
MediaPipeHandTracker: Landmark[0]: (0.58928907, 0.61223286, -5.173683E-4)
MediaPipeHandTracker: Landmark[1]: (0.5756706, 0.6165022, 9.328125)
MediaPipeHandTracker: Landmark[2]: (0.56418365, 0.6183474, 10.390625)
MediaPipeHandTracker: Landmark[3]: (0.5544628, 0.6195656, 11.25)
MediaPipeHandTracker: Landmark[4]: (0.54804426, 0.6200474, 12.1953125)
MediaPipeHandTracker: Landmark[5]: (0.5571172, 0.6139561, -11.5390625)
MediaPipeHandTracker: Landmark[6]: (0.5400312, 0.6170851, -12.6015625)
MediaPipeHandTracker: Landmark[7]: (0.53433615, 0.620509, -6.7421875)
MediaPipeHandTracker: Landmark[8]: (0.531901, 0.6224792, -1.4365234)
MediaPipeHandTracker: Landmark[9]: (0.556367, 0.61244273, -14.953125)
MediaPipeHandTracker: Landmark[10]: (0.537187, 0.6163406, -19.078125)
MediaPipeHandTracker: Landmark[11]: (0.53178513, 0.6196333, -13.3203125)
MediaPipeHandTracker: Landmark[12]: (0.5302227, 0.62104964, -7.3554688)
MediaPipeHandTracker: Landmark[13]: (0.5572119, 0.6120599, -15.9609375)
MediaPipeHandTracker: Landmark[14]: (0.5396851, 0.6155686, -17.671875)
MediaPipeHandTracker: Landmark[15]: (0.5339748, 0.6182261, -10.6484375)
MediaPipeHandTracker: Landmark[16]: (0.5321559, 0.61939704, -4.359375)
MediaPipeHandTracker: Landmark[17]: (0.5594273, 0.61234933, -15.6015625)
MediaPipeHandTracker: Landmark[18]: (0.54581386, 0.6152981, -15.5390625)
MediaPipeHandTracker: Landmark[19]: (0.54055375, 0.6172483, -11.0625)
MediaPipeHandTracker: Landmark[20]: (0.5383292, 0.6179763, -6.3984375)
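For anyone who wants to inspect dumps like the ones above, here is a minimal sketch that parses the log lines and looks at the wrist z and the overall z range. The regular expression and the helper name `parse_landmarks` are my own assumptions based only on the log format shown here; they are not part of MediaPipe.

```python
import re

# Matches lines such as:
#   MediaPipeHandTracker: Landmark[1]: (0.7133818, 0.5926434, 8.609375)
# The pattern is inferred from the log output in this thread (an assumption).
LINE_RE = re.compile(
    r"Landmark\[(\d+)\]:\s*\(([^,]+),\s*([^,]+),\s*([^)]+)\)"
)


def parse_landmarks(log_text):
    """Return a list of (index, x, y, z) tuples found in the log text."""
    rows = []
    for match in LINE_RE.finditer(log_text):
        idx, x, y, z = match.groups()
        rows.append((int(idx), float(x), float(y), float(z)))
    return rows


# Three lines copied from the first dump above.
sample = """\
MediaPipeHandTracker: Landmark[0]: (0.7117575, 0.5807782, 1.283884E-4)
MediaPipeHandTracker: Landmark[1]: (0.7133818, 0.5926434, 8.609375)
MediaPipeHandTracker: Landmark[4]: (0.7203635, 0.6082897, 18.5625)
"""

rows = parse_landmarks(sample)
wrist_z = rows[0][3]                      # landmark 0 is the wrist
z_values = [z for _, _, _, z in rows]

print("wrist z:", wrist_z)
print("z range:", min(z_values), max(z_values))
```

In both dumps the wrist (landmark 0) has a z very close to zero, which is at least consistent with the other depths being expressed relative to the wrist.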
The hand model uses "scaled orthographic projection" (or, weak perspective), with some fixed average depth (Z avg).
Weak-perspective projection is an orthographic projection plus a scaling, which serves to approximate perspective projection by assuming that all points on a 3D object are at roughly the same distance from the camera.
The justification for using weak-perspective is that in many cases it approximates perspective closely, in particular when the average variation of the depth of the object (delta Z) along the line of sight is small compared to the fixed average depth (Z avg). This also means that objects at a distance are not distorted by perspective, but only scale up or down uniformly.
The z predicted by the model is relative depth, based on the Z avg of a "typical hand depth" (for example, holding a phone with one hand while the other is tracked, or being close to the phone and showing both hands).
Also, the range of z is unconstrained, but it is scaled proportionally along with x and y (via weak projection), and expressed in the same units as x & y.
There is a root landmark point (wrist) that all the other landmark depths are relative to (again normalized via weak projection w.r.t. x & y).
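To make the weak-perspective idea concrete, here is a small sketch comparing it with full perspective projection for a single point. It is only an illustration of the general technique, not MediaPipe's actual code: the function names, the focal length, the depths and the sign convention (Z increasing away from the camera) are all assumptions chosen for the example.

```python
def perspective(point, focal):
    """Full perspective: each point is divided by its own depth Z."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def weak_perspective(point, focal, z_avg):
    """Weak perspective: every point is divided by the same depth z_avg."""
    x, y, _ = point
    scale = focal / z_avg
    return (scale * x, scale * y)


def scaled_relative_depth(point, root, focal, z_avg):
    """Depth relative to a root point, scaled by the same factor as x and y."""
    return (focal / z_avg) * (point[2] - root[2])


# Hypothetical numbers: a hand about 0.5 m from the camera, with a fingertip
# 3 cm nearer to the camera than the wrist. Z grows away from the camera here.
focal = 500.0             # assumed focal length in pixels
z_avg = 0.5               # assumed fixed average depth in metres
wrist = (0.00, 0.00, 0.50)
tip = (0.05, 0.10, 0.47)

persp_xy = perspective(tip, focal)
weak_xy = weak_perspective(tip, focal, z_avg)
rel_z = scaled_relative_depth(tip, wrist, focal, z_avg)

print("perspective     :", persp_xy)
print("weak perspective:", weak_xy)
print("relative depth  :", rel_z)
```

Because delta Z (3 cm) is small compared to Z avg (50 cm), the two projections differ by only a few percent. The relative depth comes out in the same units as the projected x and y, is zero at the root by construction, and has no fixed range, which matches the description above.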
I added some code in my Android app to keep track of the minimum and maximum Z values observed, and tested while moving my hand near/far, at different angles, etc.
At the end of the test, the minimum Z observed was -198.0 and the maximum Z observed was 168.0. Does this make sense? Are these in line with the expected minimum and maximum values of Z for the 3D hand tracking graph?
These are coming from NormalizedLandmark getZ()
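The tracking described above can be sketched roughly as follows. This is not the code from the Android app; `ZRangeTracker` is a hypothetical helper, and the two "frames" are a handful of z values copied from the dumps earlier in this thread.

```python
import math


class ZRangeTracker:
    """Keeps the smallest and largest z seen across all frames so far."""

    def __init__(self):
        self.z_min = math.inf
        self.z_max = -math.inf

    def update(self, z_values):
        """Fold the z values of one frame into the running range."""
        for z in z_values:
            self.z_min = min(self.z_min, z)
            self.z_max = max(self.z_max, z)


tracker = ZRangeTracker()
tracker.update([1.283884e-4, 8.609375, 18.5625])          # from the first dump
tracker.update([-5.173683e-4, -19.078125, 12.1953125])    # from the second dump

print("min z:", tracker.z_min)
print("max z:", tracker.z_max)
```

Calling `update` once per frame with that frame's landmark z values gives the overall range at the end of a test run.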
How can I judge whether the palm is turned or not?
@brianm-sra Did you figure out how far the "paper" is? Can anyone explain clearly how to convert z to a 3D camera coordinate?
After reading everything above, I still don't know which scale the value of z depends on, or how the z coordinate changes. x and y depend on the screen, i.e. on which pixel of the screen the landmark lies. It seems certain that z should be used as the depth or distance of the landmark from the device camera, but please clarify how its value changes and what the value of the z coordinate depends on. Please help.
Any update on this? I am also not clear on how to use the z value to get 3d coordinates. Is there a way to get the value of Z avg at least?
Can you please share how you were able to get these landmark locations? I've been trying but nothing's working for me.