I think it'd be helpful to have basic emoji handling, and eventually understanding positive 😃 / negative 😠 / neutral emoji.
Can you assign it to me? 😇
@tooaverage yes that would be an awesome contribution 🎉 😃
@tmbo What is the best/leanest way to integrate this?
One pretty lean solution would be to replace the emoji with words, for example by taking the short name or keywords from http://unicode.org/emoji/charts/full-emoji-list.html
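The short-name idea can be sketched with the standard library alone, mapping each emoji code point to its Unicode name (the regex ranges below are an assumption and only cover the common emoji blocks):

```python
import re
import unicodedata

# Rough emoji ranges (an assumption; extend for full coverage):
# misc symbols/dingbats plus the supplementary emoji blocks.
EMOJI_RE = re.compile("[\u2600-\u27bf\U0001F300-\U0001FAFF]")

def emoji_to_words(text: str) -> str:
    """Replace each emoji with a :short_name:-style token built from its Unicode name."""
    def repl(match):
        try:
            name = unicodedata.name(match.group(0))
        except ValueError:
            return match.group(0)  # unnamed code point, keep as-is
        return ":" + name.lower().replace(" ", "_") + ":"
    return EMOJI_RE.sub(repl, text)

print(emoji_to_words("two beers please 🍺"))  # two beers please :beer_mug:
```

The resulting `:beer_mug:`-style tokens behave like ordinary words for any downstream tokenizer or featurizer.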
Hello @tooaverage, any update on this issue?
I trained the model using both the unicode escape (\U0001f37a) and the actual emoji (🍺) within the training data, and set synonyms to map to my named entity. No success (tried just in case).
This project seems to have built a dictionary of vectors for emoji -
is this useful?
https://raw.githubusercontent.com/uclmr/emoji2vec/master/pre-trained/
(the paper this belongs to is also great)
Is there a way I can assign these vectors to emoji tokens if they're found?
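A minimal sketch of loading such a vector file, assuming it has been exported to word2vec-style text lines (token followed by its vector components); the inline toy vectors here are hypothetical, for illustration only:

```python
def load_vectors(lines):
    """Parse word2vec-style text lines ('token v1 v2 ...') into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip header or blank lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Hypothetical two-dimensional toy vectors, for illustration only.
vecs = load_vectors(["👍 0.1 0.2", "🍺 0.3 0.4"])
print(vecs["👍"])  # [0.1, 0.2]
```

Once loaded, the vectors could then be attached to matched emoji tokens, e.g. via spaCy's `vocab.set_vector`.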
spacymoji would be helpful here: https://pypi.python.org/pypi/spacymoji/1.0.0
Any progress on this? 😊
We have not been working on this, but it would be a great contribution :wink:
Might be a little off-topic, but how are emojis currently handled by the NLU? I would like to include emojis in our current chatbot's training data.
Are they currently simply ignored while predicting the intent?
OK, I think that depends on the intent classification component used:
@tmbo In the tensorflow_embedding pipeline my guess is that the vectors for the emojis should also be learned from the training data, since that pipeline learns the word embeddings from only the training data anyway (thus able to handle OOV words).
Guesses aside, it would be interesting to know how it actually happens, though.
I tried with tensorflow_embedding but no luck so far; I was trying with actual emojis, though. I still have to try with codes/synonyms.
I think the tensorflow embedding policy also ignores them, because it only looks at words with a certain number of characters. You can change that in the `token_pattern` parameter of the `intent_featurizer_count_vectors`, though.
Yes, it depends on whether a Python string that stores an emoji falls under `token_pattern` or not.
Hey @Ghostvv, for now I am using `token_pattern` as "(?u)\b\w+\b" for words. However, I also want to add the capability to recognize emoji; what should the `token_pattern` be then?
Secondly, what would the training data look like for emojis: "👍" or "U+1F44D"?
@kirtisynap19 in this case the token pattern should correspond to a regex that also picks up emojis
@Ghostvv thanks for your response. And what goes in the training data? Is it '👍' or 'U+1F44D' or 'u"\U0001F44D"' or "\uD83D\uDC4D"?
Token pattern: (?u)(\b\w+\b|(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff]))
@kirtisynap19 I'm not sure, try both. You can print the vocabulary of CountVectorizer to check.
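Worth noting: surrogate-pair ranges like `\ud83c[\ud000-\udfff]` only make sense in UTF-16 regex engines such as JavaScript's; Python 3 strings are sequences of code points, so emoji have to be matched by their actual code point ranges. A sketch of a pattern that could be passed as `token_pattern` (the exact emoji ranges are an assumption covering the common blocks):

```python
import re

# Word tokens OR single emoji code points; no capturing groups, since
# scikit-learn's CountVectorizer applies findall() to this pattern.
# The emoji ranges are an assumption covering the common blocks.
TOKEN_PATTERN = r"(?u)\b\w+\b|[\u2600-\u27bf\U0001F300-\U0001FAFF]"

print(re.findall(TOKEN_PATTERN, "great job 👍🍺"))  # ['great', 'job', '👍', '🍺']
```

The same string can then be set as the `token_pattern` of `intent_featurizer_count_vectors` in the pipeline config.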
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, is this issue available? @lucasdutraf, @Henrike100, and I would like to work on it! :)
@mbslet Are you working on this change? If not, I would like to pick it up!