I would like to pretrain a model on an already tokenized dataset. I store my data in JSONL format as described in the docs:
{"tokens": ["my", "tokenized", "data", "."]}
{"tokens": ["one", "more", "example", "."]}
...
As I understand it, the text key is obligatory (the CLI tool raises an error when trying to access the text key). However, it seems that these lines only ever use one of the two keys per record:
def make_docs(nlp, batch, min_length, max_length):
    docs = []
    for record in batch:
        text = record["text"]
        if "tokens" in record:
            doc = Doc(nlp.vocab, words=record["tokens"])  # use tokens
        else:
            doc = nlp.make_doc(text)  # use "raw" text
        ...  # the rest of the code
Is it possible to make the text key optional when tokens are provided? It is not a big deal to extend my data with something like {"text": null, "tokens": [...]}, but that feels a bit clumsy, given that this fragment could easily be refactored to handle both cases without reading a key that is never used.
Or am I missing something and are these keys used somewhere else as well?
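For reference, a minimal sketch of that "text": null padding workaround, assuming the data lives in a file named data.jsonl (the file names are placeholders, not anything from the spaCy docs):

    import json

    # One-off script: add an unused "text" key to every record so the CLI
    # does not fail on record["text"] when only "tokens" is present.
    with open("data.jsonl") as src, open("data_with_text.jsonl", "w") as dst:
        for line in src:
            record = json.loads(line)
            record.setdefault("text", None)  # only add the key if it is missing
            dst.write(json.dumps(record) + "\n")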
In the code, the text key is effectively optional, since it is only used in the else branch. So moving text = record["text"] into the else branch should solve your issue and would make pretraining require either text or tokens, instead of text plus an optional tokens.
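A minimal sketch of that refactor, based only on the fragment quoted above (not the exact upstream source), where "text" is read only when "tokens" is absent:

    from spacy.tokens import Doc

    def make_docs(nlp, batch, min_length, max_length):
        docs = []
        for record in batch:
            if "tokens" in record:
                doc = Doc(nlp.vocab, words=record["tokens"])  # use pre-tokenized words
            else:
                doc = nlp.make_doc(record["text"])  # fall back to raw text
            ...  # the rest of the code
        return docs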
Moreover, the heads key (see these lines) isn't documented in the CLI docs at all.
Yes, that's what I was talking about. The text attribute is required even though it isn't really needed: it is only accessed in the second branch of the conditional logic. Also, good point about the heads key.
Yes, good point – moving text = record["text"] into the else branch should be fine. If you want to submit a PR for this, that'd be great! 👍