💫 Which lexical attributes would you like spaCy to support?

Created on 19 Oct 2017 · 12 comments · Source: explosion/spaCy

Symbols live in spacy/symbols.pyx and are used to reference attributes internally and externally without using strings. They include strings from the annotation scheme like POS, entity labels like PERSON, and lexical attributes and flags like IS_ALPHA.
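
For example, a quick sketch (assuming spaCy v2 is installed):

from spacy.symbols import IS_ALPHA, PERSON, POS

# Symbols are plain integer IDs that stand in for their string names,
# so attributes and labels can be referenced without string lookups.
print(IS_ALPHA, PERSON, POS)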

In spaCy v1.x, adding new symbols was difficult and inconvenient, because it required re-training all models to prevent mismatched integer IDs. This is not the case anymore in v2.0, since spaCy now uses hash values to generate vocabulary-independent IDs for strings. So we're free to add more symbols and lexical attributes!

💬 COMMUNITY QUESTION: Is there anything you think is currently missing from the lexical attributes that spaCy should support and allow languages to add functions for?

Lexical attributes always refer to the Lexeme, i.e. an entry in the vocabulary – not an individual token in context. Token attributes can now be added using custom attribute extensions. Built-in lexical attributes should be general-purpose flags that are useful across all languages. (If your application needs very specific flags, you can always add them yourself via Vocab.add_flag!)
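
For example, a minimal sketch of a custom flag (the COLORS lexicon and the model name are placeholders, not spaCy built-ins):

import spacy

nlp = spacy.load("en_core_web_sm")

# Register a getter for an app-specific flag; add_flag returns a flag ID
# that works with check_flag on tokens and lexemes.
COLORS = {"red", "green", "blue"}
IS_COLOR = nlp.vocab.add_flag(lambda text: text.lower() in COLORS)

doc = nlp(u"The sky is blue.")
assert doc[3].check_flag(IS_COLOR)      # "blue"
assert not doc[1].check_flag(IS_COLOR)  # "sky"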

See here for how lexical attributes can be used and overwritten by a Language subclass. See spacy/lang/lex_attrs.py for the language-independent functions used by spaCy.

The following symbols are currently available:

| Symbol | Description | Example |
| --- | --- | --- |
| IS_ALPHA | consists of alphabetic characters | token.is_alpha |
| IS_ASCII | consists of ASCII characters | token.is_ascii |
| IS_DIGIT | is a digit | token.is_digit |
| IS_LOWER | is in lowercase | token.is_lower |
| IS_PUNCT | is punctuation | token.is_punct |
| IS_SPACE | is whitespace | token.is_space |
| IS_TITLE | is titlecase | token.is_title |
| IS_UPPER | is uppercase | token.is_upper |
| LIKE_NUM | resembles a number, e.g. "four" | token.like_num |
| LIKE_URL | resembles a URL | token.like_url |
| LIKE_EMAIL | resembles an email address | token.like_email |
| IS_STOP | is a stop word | token.is_stop |
| IS_BRACKET | is a bracket | token.is_bracket |
| IS_QUOTE | is a quotation mark | token.is_quote |
| IS_LEFT_PUNCT | is a left punctuation mark | token.is_left_punct |
| IS_RIGHT_PUNCT | is a right punctuation mark | token.is_right_punct |
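
At runtime, these read off tokens directly – a quick sketch (assuming an English model is installed as en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Send four emails to test@example.com")

assert doc[0].is_title    # "Send"
assert doc[1].like_num    # "four" resembles a number
assert doc[4].like_email  # "test@example.com"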

Looking forward to hearing your ideas and feedback!

Suggestions from previous issues

  • #760: Support for numeric ranking words and fractions

All 12 comments

What about numeric entities in general?

For instance:

  • Currencies: 1.000.000$
  • Percentages: %55
  • …

Many people are interested in financial corpora – I think this would be nice to have!

@todayokay Thanks for the suggestion! The only problem here is that spaCy's tokenization policy will split those into more than one token – e.g. "55%" → ['55', '%']. So there's no way to assign a lexical attribute to "55%". This would have to be handled by custom tokenizer exceptions or rule-based match patterns, or by the entity recognizer, which comes with built-in entity labels for MONEY and PERCENT (if this doesn't work well on your texts out of the box, you can always customise it, too).
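
For illustration, a minimal sketch of the rule-based approach with the Matcher (the "PERCENTAGE" label and the example pattern are just placeholders):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Match a number-like token followed by a percent sign, covering the
# ['55', '%'] split produced by the tokenizer.
matcher.add("PERCENTAGE", None, [{"LIKE_NUM": True}, {"ORTH": "%"}])

doc = nlp(u"Revenue grew 55% last quarter.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # 55%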

However, maybe an is_currency attribute could be nice? For example:

doc = nlp(u"Net income was $9 million")
assert [t.text for t in doc] == ['Net', 'income', 'was', '$', '9', 'million']
assert doc[3].is_currency

Btw, unrelated to this issue, but if you're working with financial texts, check out this relation extraction example which I just dusted off for spaCy v2.0. It shows how to extract money/currencies and the noun phrase they are referring to – for example "$9.4 million" → "Net income".

@ines: How about is_emoji as a lexical attribute?

@nipunsadvilkar Yes, that would be great! I was actually thinking about this just the other day. One thing to consider is that we want to transition away from using the regex library (which comes with pre-defined classes for symbols, including emoji).

So we'd probably want to ship the emoji unicode ranges with spaCy (and update them as new emoji are available). Maybe a Python version of something like this?
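
For example, a minimal sketch, checking code points against a deliberately non-exhaustive set of emoji block ranges that we'd maintain ourselves:

# Non-exhaustive emoji block ranges; a real implementation would ship
# the full list and update it as new emoji are added to Unicode.
EMOJI_RANGES = [
    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
    (0x1F600, 0x1F64F),  # Emoticons
    (0x1F680, 0x1F6FF),  # Transport and Map Symbols
    (0x2600, 0x26FF),    # Miscellaneous Symbols
]

def is_emoji(text):
    return all(
        any(start <= ord(char) <= end for start, end in EMOJI_RANGES)
        for char in text
    )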

@ines: I referred to your Stack Overflow thread and thought of adding an is_emoji lexeme attribute for text-based emoticons. Is this one useful?

@ines how about like_date and like_time as lexical attributes?

We have IS_ASCII, perhaps marking more Unicode blocks would be useful?

  • IS_THAI
  • IS_CYRILLIC
  • ...

@nipunsadvilkar

I referred to your Stack Overflow thread and thought of adding an is_emoji lexeme attribute for text-based emoticons. Is this one useful?

This would be the first step to make the IS_EMOJI flag available, yes – however, in your case, you actually removed IS_SENT_SPLIT instead of replacing one of the empty flags. And instead of just checking against the text-based emoticons, we'd likely also want to incorporate the emoji unicode ranges so the flag actually returns True for real emoji.

@Shashi456

how about like_date and like_time as lexical attributes?

That's an interesting idea! However, I'm not sure we can make this work on the lexeme level, since lexical attributes can only refer to single tokens. Dates and times are usually context-sensitive and often span multiple tokens, which is why a statistical approach (predicting DATE or TIME with the entity recognizer) usually works better.
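
For instance (assuming an English model is installed as en_core_web_sm; the exact spans depend on the model):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"The meeting was moved to 15 October 2017 at 9am.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('15 October 2017', 'DATE'), ('9am', 'TIME')]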

@timClicks

We have IS_ASCII, perhaps marking more Unicode blocks would be useful? IS_THAI, IS_CYRILLIC

That'd be cool! And it also sounds like something we could easily do on the unicode-level, even if it means that we have to maintain the lists of unicode ranges ourselves. Lexical attributes like this would also allow expedited language detection, and would make it easier to handle mixed-language texts (including social media posts).

Would you be interested in looking into this and potentially submitting a PR?

@ines sorry for taking 5 days to respond. I would love to contribute, but I'm not sure how flags are defined.

I've looked, but would like some guidance. Are you able to tell me where the rules are defined?

@timClicks Sounds great and no worries! Sorry this isn't better documented currently. The most relevant file for language-independent lexical attributes is spacy/lang/lex_attrs.py. The attribute getters are pretty straightforward functions that take the lexeme text and return a value – for example:

def is_ascii(text):
    for char in text:
        if ord(char) >= 128:
            return False
    return True

I'd imagine that we could also have is_cyrillic, is_thai etc. functions in the same style. Don't worry about adding the actual flags for now if that's confusing – we can always do that later. But we'd probably replace the current placeholders FLAG19, FLAG20 and so on in attrs.pyx and attrs.pxd.
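
A rough sketch of what those could look like, in the same style (Thai is the U+0E00–U+0E7F block; Cyrillic is U+0400–U+04FF, ignoring the supplement blocks for brevity):

def is_thai(text):
    # Thai block: U+0E00–U+0E7F
    return all(0x0E00 <= ord(char) <= 0x0E7F for char in text)

def is_cyrillic(text):
    # Basic Cyrillic block: U+0400–U+04FF (supplements omitted)
    return all(0x0400 <= ord(char) <= 0x04FF for char in text)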

Once that's done, we can add the property to the Lexeme and Token, for example:

https://github.com/explosion/spaCy/blob/81564cc4e819851e9b4473027b5fa672dbe072b6/spacy/lexeme.pyx#L380-L385

https://github.com/explosion/spaCy/blob/81564cc4e819851e9b4473027b5fa672dbe072b6/spacy/tokens/token.pyx#L794-L799

I'm planning on creating a lexical attribute locally for identifying measurements – specifically something like is_depth, or more loosely defined like is_measurement, which would catch both symbolic measurements (" and ') and shorthand units (inches, ft). Would this be a helpful addition to your existing attribute collection? :)

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
