Hi,
I've been following the development of your hand tracking module with enthusiasm, hoping I may be able to use it in my sign language tooling. I noticed today that you have a web demo up, and gave it a go in Chrome on macOS. I wanted to share some context about sign languages and the problems that come up when applying your hand tracking model to them, particularly because the Google AI Blog describes this module as having possible uses with sign languages.
Here are some examples from Australian Sign Language (Auslan). I've held static poses taken from various common Auslan signs:
"Connect":

"Hearing":

(something fairly rude that I'm not going to transcribe):

Here I provide some American Sign Language (ASL) signs which fail:
"Culture":

"Bus":

"Category":

It seems MediaPipe's hand tracking struggles any time a finger is located close to a skin-tone object, like another finger or the face. Unless this can be resolved, using it for sign language applications is impossible, as sign languages often communicate things by touching fingers to locations such as other fingers. For example: using the non-dominant hand as a palette of options, using body parts to represent conceptual spaces like "mind", "taste", "hearing", or "emotions", or using classifier handshapes to represent abstract people, animals, or vehicles. Your model is probably sufficient to recognise manual encoding (fingerspelling) in ASL, French Sign Language, and Irish Sign Language, but that seems fairly pointless: it's a slower input mechanism than typing on a keyboard, and it doesn't provide any accessibility benefits over a touch screen keyboard. Importantly, fingerspelling is not a sign language; it's a manual encoding of a spoken language.
Above I've provided some examples of severely wrong outputs, and some that look pretty subtle. For example, look at the index finger in Auslan "Connect" and "Hearing". The index finger is only misaligned by a small amount in some cases. But if you genuinely want to build tooling for sign language user interfaces, it's important to understand that in almost any sign language, whenever two fingers touch or a finger touches part of the face, it is critically important to accurately identify which finger is touching which other finger or face part.
To understand why: A common pattern in sign languages is to use one flat open hand as a list of choices, and to choose one of them by pointing to one of the five fingers with the dominant hand's index finger. If the position of the fingers isn't well aligned, a model might misunderstand someone signing "the second item on the list" as "the first item on the list", which could be catastrophic if asking for a response to a multiple choice question.
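To make that failure mode concrete, here is a rough sketch of how a downstream tool might resolve "which finger is being pointed at" from two sets of hand landmarks. It is only an illustration, not anything MediaPipe ships: the function names, the `min_margin` threshold, and the toy coordinates are all my own assumptions, and I'm assuming the usual 21-point hand layout with fingertips at indices 4, 8, 12, 16 and 20.

```python
import math

# Fingertip indices, assuming the 21-point hand landmark layout.
FINGERTIPS = {"thumb": 4, "index": 8, "middle": 12, "ring": 16, "pinky": 20}


def pointed_finger(dominant, non_dominant, min_margin=0.02):
    """Return the non-dominant finger nearest the dominant index tip.

    Both hands are lists of 21 (x, y) points in normalized image coordinates.
    Returns None when the two closest fingertips are within `min_margin` of
    each other, i.e. when the choice is too ambiguous to trust.
    """
    pointer = dominant[FINGERTIPS["index"]]
    ranked = sorted(
        (math.dist(pointer, non_dominant[i]), name)
        for name, i in FINGERTIPS.items()
    )
    (best_d, best_name), (second_d, _) = ranked[0], ranked[1]
    if second_d - best_d < min_margin:
        return None
    return best_name


def make_hand(tips):
    """Hypothetical helper: build a 21-point hand with only fingertips set."""
    hand = [(0.0, 0.0)] * 21
    for name, point in tips.items():
        hand[FINGERTIPS[name]] = point
    return hand


# Toy data: a flat open hand used as a list of choices.
choices = make_hand({
    "thumb": (0.30, 0.50), "index": (0.40, 0.40), "middle": (0.45, 0.38),
    "ring": (0.50, 0.40), "pinky": (0.55, 0.45),
})

print(pointed_finger(make_hand({"index": (0.41, 0.40)}), choices))  # index
print(pointed_finger(make_hand({"index": (0.44, 0.40)}), choices))  # None
print(pointed_finger(make_hand({"index": (0.45, 0.39)}), choices))  # middle
```

In this toy data the index and middle fingertips sit about 0.05 apart in normalized coordinates, so a tracking error of roughly 0.04 on the pointing finger is enough to turn "index" into "middle". Refusing to answer when the margin is small avoids some wrong answers, but it doesn't fix the underlying problem: the landmarks have to be accurate exactly when fingers are touching.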
I hope you'll continue working on it. It could be a major win for accessibility if in the future we're able to use technology like this to build conversational user interfaces that natively communicate in sign languages, especially for people who may not be able to communicate in any written or spoken language.
Hi @Bluebie, first of all, I'd like to thank you for trying out the web hand demo and shedding light on the feasibility and difficulty of applying hand tracking to sign language. We couldn't be more grateful for your detailed analysis. We have actually started to look into the vision part of sign language recognition, e.g. recognizing single gestures for single words/phrases, and have become more and more aware of the extreme difficulty of this problem. We have already tried the fingerspelling problem internally and got really good accuracy (>95%), even in real time on device. But exactly as you said, although fingerspelling is an important part of ASL/BSL/ISL, it is still a long way from sign language recognition. Internally, we are collaborating with a few research teams and Googlers who are native/trained signers to investigate this problem as a long-term goal. This will definitely involve not only us computer vision researchers, but also NLP experts, linguists, and sign language specialists (since sign language has a very unique grammar). Even the vision problem alone is a very challenging task; for example, it requires a good understanding of facial expressions, gaze directions, relative positions of hands and body poses, etc. Thank you again for giving us the detailed feedback. We really appreciate it, and it motivates us to keep working on and improving the models and pipeline.
In terms of the hand tracking specifically, the web demo is actually using a lightweight version of the hand model so that it runs fast enough on WebAssembly. So I'd definitely recommend that you install and try our demo app, which should already be better than the web version. We also plan to release a significantly improved version soon that has much better tracking. Definitely stay tuned for that.
@Bluebie We have released an updated hand model in v0.7.5 for mobile. We will look into updating the web demo shortly.
Here is a visualization of how the updated hand model performs: https://twitter.com/GoogleAI/status/1265319835283537921
Are there any improvements to the accuracy of the model when the backdrop behind the hand is another (overlapping) hand or a skin-tone region like the face? That doesn't seem to be demoed in the GIF.