I would like to pretrain a model on an already tokenized dataset. I store my data in JSONL format as described in the docs:
{"tokens": ["my", "tokenized", "data", "."]}
{"tokens": ["one", "more", "example", "."]}
...
As I understand it, the text key is obligatory (the CLI tool raises an error when trying to access the text key). However, it seems that these lines only ever use one of the two keys per record:
def make_docs(nlp, batch, min_length, max_length):
    docs = []
    for record in batch:
        text = record["text"]
        if "tokens" in record:
            doc = Doc(nlp.vocab, words=record["tokens"])  # use tokens
        else:
            doc = nlp.make_doc(text)  # use "raw" text
        ...  # the rest of the code
Is it possible to make the text key optional when tokens are provided? It is not a big deal to extend my data with something like {"text": null, "tokens": [...]}, but that feels a bit clumsy, given that this fragment could easily be refactored to handle both cases without reading a key that is never used.
Or am I missing something and are these keys used somewhere else as well?
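For reference, a minimal sketch of that "text": null padding workaround, assuming the data lives in a file named data.jsonl (the file names are placeholders, not anything from the spaCy docs):

    import json

    # One-off script: add an unused "text" key to every record so the CLI
    # does not fail on record["text"] when only "tokens" is present.
    with open("data.jsonl") as src, open("data_with_text.jsonl", "w") as dst:
        for line in src:
            record = json.loads(line)
            record.setdefault("text", None)  # only add the key if it is missing
            dst.write(json.dumps(record) + "\n")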
In the code, the text key is effectively optional, since it is only used in the else branch. So moving text = record["text"] into the else branch should solve your issue and would make pretraining require either text or tokens, instead of text plus an optional tokens.
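A minimal sketch of that refactor, based only on the fragment quoted above (not the exact upstream source), where "text" is read only when "tokens" is absent:

    from spacy.tokens import Doc

    def make_docs(nlp, batch, min_length, max_length):
        docs = []
        for record in batch:
            if "tokens" in record:
                doc = Doc(nlp.vocab, words=record["tokens"])  # use pre-tokenized words
            else:
                doc = nlp.make_doc(record["text"])  # fall back to raw text
            ...  # the rest of the code
        return docs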
Moreover, the heads key (see these lines) isn't documented in the CLI docs at all.
Yes, that's what I was talking about. The text attribute is required even though it isn't really needed: it is only accessed in the second branch of the conditional logic. Also, good point about the heads key.
Yes, good point – moving text = record["text"] into the else branch should be fine. If you want to submit a PR for this, that'd be great! 👍