I think it'd be helpful to have basic emoji handling, and eventually understanding positive 😃 / negative 😠 / neutral emoji.
Can you assign it to me? 😇
@tooaverage yes that would be an awesome contribution 🎉 😃
@tmbo What is the best/leanest way to integrate this?
One pretty lean solution would be to replace the emoji with words, for example by taking the short name or keywords from http://unicode.org/emoji/charts/full-emoji-list.html
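The short-name idea can be sketched with the standard library alone, mapping each emoji code point to its Unicode name (the regex ranges below are an assumption and only cover the common emoji blocks):

```python
import re
import unicodedata

# Rough emoji ranges (an assumption; extend for full coverage):
# misc symbols/dingbats plus the supplementary emoji blocks.
EMOJI_RE = re.compile("[\u2600-\u27bf\U0001F300-\U0001FAFF]")

def emoji_to_words(text: str) -> str:
    """Replace each emoji with a :short_name:-style token built from its Unicode name."""
    def repl(match):
        try:
            name = unicodedata.name(match.group(0))
        except ValueError:
            return match.group(0)  # unnamed code point, keep as-is
        return ":" + name.lower().replace(" ", "_") + ":"
    return EMOJI_RE.sub(repl, text)

print(emoji_to_words("two beers please 🍺"))  # two beers please :beer_mug:
```

The resulting `:beer_mug:`-style tokens behave like ordinary words for any downstream tokenizer or featurizer.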
Hello @tooaverage, any update on this issue?
I trained the model using both the unicode escape (\U0001f37a) and the actual emoji (🍺) within the training data, and set synonyms to map to my named entity. No success (tried just in case).
This project seems to have built a dictionary of vectors for emoji -
is this useful?
https://raw.githubusercontent.com/uclmr/emoji2vec/master/pre-trained/
(the paper this belongs to is also great)
Is there a way I can assign these vectors to emoji tokens if they're found?
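A minimal sketch of loading such a vector file, assuming it has been exported to word2vec-style text lines (token followed by its vector components); the inline toy vectors here are hypothetical, for illustration only:

```python
def load_vectors(lines):
    """Parse word2vec-style text lines ('token v1 v2 ...') into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip header or blank lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Hypothetical two-dimensional toy vectors, for illustration only.
vecs = load_vectors(["👍 0.1 0.2", "🍺 0.3 0.4"])
print(vecs["👍"])  # [0.1, 0.2]
```

Once loaded, the vectors could then be attached to matched emoji tokens, e.g. via spaCy's `vocab.set_vector`.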
spacymoji would be helpful here: https://pypi.python.org/pypi/spacymoji/1.0.0
Any progress on this? 😊
We have not been working on this, but it would be a great contribution :wink:
Might be a little off-topic, but how are emojis currently handled by the NLU? I would like to include emojis in our current chatbot's training data.
Are they currently simply ignored while predicting the intent?
OK, I think that depends on the intent classification component used:
@tmbo In the tensorflow_embedding pipeline my guess is that the vectors for the emojis should also be learned from the training data, since that pipeline learns the word embeddings from only the training data anyway (thus able to handle OOV words).
Guesses aside, it would be interesting to know how it actually happens, though.
I tried with tensorflow_embedding but no luck so far; I was trying with actual emojis, though. I still have to try with codes/synonyms.
I think the tensorflow embedding policy also ignores them, because it only looks at words with a certain number of characters. You can change that in the `token_pattern` parameter of the `intent_featurizer_count_vectors`, though.
Yes, it depends on whether a Python string that stores an emoji falls under `token_pattern` or not.
Hey @Ghostvv, for now I am using `token_pattern` as "(?u)\b\w+\b" for words. However, I also want to add the capability to recognize emoji; what should the `token_pattern` be then?
Secondly, what would the training data look like for emojis: "👍" or "U+1F44D"?
@kirtisynap19 in this case the token pattern should correspond to a regex that also picks up emojis
@Ghostvv thanks for your response. And what goes in the training data? Is it '👍' or 'U+1F44D' or 'u"\U0001F44D"' or "\uD83D\uDC4D"?
Token pattern: (?u)(\b\w+\b|(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff]))
@kirtisynap19 I'm not sure, try both. You can print the vocabulary of CountVectorizer to check.
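Worth noting: surrogate-pair ranges like `\ud83c[\ud000-\udfff]` only make sense in UTF-16 regex engines such as JavaScript's; Python 3 strings are sequences of code points, so emoji have to be matched by their actual code point ranges. A sketch of a pattern that could be passed as `token_pattern` (the exact emoji ranges are an assumption covering the common blocks):

```python
import re

# Word tokens OR single emoji code points; no capturing groups, since
# scikit-learn's CountVectorizer applies findall() to this pattern.
# The emoji ranges are an assumption covering the common blocks.
TOKEN_PATTERN = r"(?u)\b\w+\b|[\u2600-\u27bf\U0001F300-\U0001FAFF]"

print(re.findall(TOKEN_PATTERN, "great job 👍🍺"))  # ['great', 'job', '👍', '🍺']
```

The same string can then be set as the `token_pattern` of `intent_featurizer_count_vectors` in the pipeline config.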
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, is this issue available? @lucasdutraf, @Henrike100, and I would like to work on it! :)
@mbslet Are you working on this change? If not, I would like to pick it up!