spaCy: 💫 Participating in CoNLL 2018 Universal Dependencies evaluation (Team spaCy?)

Created on 21 Feb 2018 · 41 comments · Source: explosion/spaCy

Update 06/06/2018: The best way to run the CoNLL experiments is:

git clone https://github.com/explosion/spaCy -b develop
cd spaCy
make
./dist/spacy.pex ud-train --help

The Conference on Computational Natural Language Learning (CoNLL) 2017 shared task is a great standard for evaluating parsing algorithms. Unlike previous parsing evaluations, CoNLL 2017 is end-to-end: from raw text to dependencies, across many languages. While we missed the 2017 evaluation, I'd like to participate in 2018.

To participate in CoNLL 2018, we would need to:

  • Adapt tokenizers to match UD tokenization more closely.

  • Add a pipeline component for statistical lemmatization, to improve lemmatizer coverage across languages.

  • Add a pipeline component to predict morphological tags.

  • Support joint segmentation and tagging or parsing, for languages like Chinese.

All of these are great goals, regardless of the competition! However, it's a lot of work, especially the tokenization, which really needs speakers of the various languages.

Even if we don't get everything done in time to participate in the official evaluation, it will be a great step for spaCy to publish accuracy figures using the official evaluation software and methodology. This will allow direct comparison against other systems, and make quality control across languages much easier.

What would be really awesome is if we got a few people working on this together, so we could participate as "Team spaCy". Ideally we'd have people taking ownership of some of the main languages, e.g. French, Spanish, German, Chinese, Japanese etc. It's much easier to work on a specific language that you're familiar with. The official evaluation will consider all languages equally, but I'm okay with having low accuracy on like, Ancient Greek or Dothraki.

The official testing period will run April 30 to June 26. However, we can get started right away by working with the CoNLL 2017 data.

To get started, I've made a quick script to run an experiment, which I've been testing on the English data. You can run it by building the feature/better-gold branch, and running the examples/training/conllu.py script like so:

python examples/training/conllu.py en ~/data/ud-treebanks-conll2017/UD_English/en-ud-train.conllu ~/data/ud-treebanks-conll2017/UD_English/en-ud-train.txt  ~/data/ud-treebanks-conll2017/UD_English/en-ud-dev.conllu ~/data/ud-treebanks-conll2017/UD_English/en-ud-dev.txt /tmp/dev.conllu

This will write you an output file /tmp/dev.conllu after each training epoch, which you can pass into the official CoNLL 2017 evaluation scorer. Scores currently suck, as there are various things to tweak and fix --- but at least the evaluation runs.

Labels: enhancement, help wanted

All 41 comments

Will try to help with Spanish. I don't have much experience, but spaCy is so awesome that I am eager to learn :)

Let me know if I can help with Spanish and how!

@kevinrosenberg21 Maybe have a go at running the script on the Spanish AnCora corpus for now? You can get the corpora and evaluation script from here: http://universaldependencies.org/conll17/data.html

I think we'll probably have a problem with the "multi-token" tokenization. This doesn't occur in English, so it hasn't come up yet --- but basically for tokens like "zum" in German they would want an output like this:

1-2 zum ...
1 zu
2 dem

I think we'll need to add a flag to the tokenizer_exceptions like "IS_MULTI_TOKEN". Currently I don't think we can distinguish what CoNLL considers a multi-token from other forms of non-whitespace tokenization. For instance, they don't mark "can't" as a multi-token in English, even though that's handled in our tokenizer exceptions.

Another thing someone might do is set up a docker container with the branch and all the data etc? Personally I don't dock, but I'm sure a lot of others would find it helpful.

@honnibal Oh yeah, 'cause German has those words that get unified for meaning? So a single word is actually a unification of two words? I don't think we've got that problem in Spanish. In fact, it's probably a bit easier than English in that we don't have contractions and don't tend to hyphenate words. Give me some time to take a look during the weekend/next week and I'll get back to you.

@kevinrosenberg21 Just remembered there's a stats.xml in the treebank that's pretty useful. Fusions for Spanish:

  <fusions unique="823" /><!-- del, al, convertirse, verse, darle, hacerlo, hacerse, convirtiéndose, dedicarse, quedarse, casarse, ponerse, presentarse, encontrarse, haberse -->

del and al are articles, right? And the others are verbs with person agreement?

@honnibal oh, okay, you're right, I guess those are fused words. De means of and a means to. Del is a combination of "de el", so "Es el hijo del carnicero" has the same meaning as "Es el hijo de el carnicero", and both mean "he's the butcher's son" (here de would be the possessive 's). Same thing with al, so del and al could be lemmatized to "de el" and "a el", which is grammatically incorrect but keeps the meaning. The other words are indicatives of what the verb is applied to, so hacerlo means to do it (where "it" would be the "lo" part and "hacer" the verb "to do"), so I'm not sure how those should be tokenized.

Hi @honnibal
This is an exciting idea!
As we have some experience from training the v1 Spanish models, we would love to contribute and work on the Spanish part. It's also a very good opportunity to catch up with all the new things in v2. We are open to collaborating with @paurosello and @kevinrosenberg21

Best,

Dani

Awesome, thanks @dvsrepo !

It should be easy to add stuff like del to the tokenizer exceptions. Current behaviour:

>>> nlp = spacy.blank('es')
>>> doc = nlp(u'Es el hijo del carnicero')
>>> [t.text for t in doc]
['Es', 'el', 'hijo', 'del', 'carnicero']

Getting it to split del is easy because the tokenizer exceptions match on exact strings. Verbs are harder, because they're open class. So we'll need to rely on the POS tag and morphological features, and then do some rule-based post-processing.
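
For reference, here's roughly what a special case for del could look like with the Tokenizer.add_special_case API. This is a minimal sketch; the attributes we'd actually set in the exception data may differ:

import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.blank('es')
# The ORTH values have to concatenate back to the original string, so "del"
# is split as "de" + "l", with NORM carrying the underlying forms.
nlp.tokenizer.add_special_case('del', [
    {ORTH: 'de', NORM: 'de'},
    {ORTH: 'l', NORM: 'el'},
])
doc = nlp(u'Es el hijo del carnicero')
print([t.text for t in doc])  # ['Es', 'el', 'hijo', 'de', 'l', 'carnicero']

Note that the surface split has to preserve the original text in spaCy, which is part of why the CoNLL multi-token output ("de" + "el") needs the extra handling discussed above.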

We'll also need to change the examples/training/conllu.py script so that it outputs the multi-tokens in the format they need.

I can help. Probably I duplicate existing skills. Native English, near-native German, French slightly better than high-school.

Thanks @cbrew --- I'm sure there'll be plenty to do!

Idea: Data augmentation?

I've been taking another look over the proceedings from 2017. I forgot how sad shared tasks often make me. Overall an enormous amount of effort goes into these things, but it's usually very difficult to identify any clear findings.

My main goal is still to get spaCy up-to-date with the current preferred evaluation methodology in the literature, so that we can have figures that are directly comparable with other systems. However, if we do submit to CoNLL 2018, it would be nice if we could make a small research contribution.

Here's my idea: what if we put particular effort into data augmentation? This is something that hasn't really been explored much in NLP. With a linear model using one-hot features, data augmentation is of pretty limited use. But with neural networks, it's quite important, especially for these small treebanks.

I think the spaCy community could be well placed to try this out more thoroughly than a normal research effort. It's something many people can do a little of, across a wide variety of languages.

By data augmentation, I mean rules that transform a gold parse in some way, so that we have a new sentence we still know the parse for. For instance, in English a declarative sentence can be transformed into a question by inverting the subject and auxiliary. Other rules might passivise active sentences, make lexical substitutions, delete or add optional elements such as adjectives and prepositional phrases, etc.
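
To make that concrete, here's a toy augmentation rule of the lexical-substitution kind. It's only a sketch: the tag-to-candidates lexicon is a hypothetical resource, and real rules would need to respect agreement and selectional restrictions.

import random

def substitute_lexical(words, heads, deps, tags, lexicon, p=0.3):
    # Swap individual words for other words with the same tag, keeping the
    # heads and dependency labels unchanged -- the tree is still valid.
    new_words = list(words)
    for i, (word, tag) in enumerate(zip(words, tags)):
        if tag in lexicon and random.random() < p:
            new_words[i] = random.choice(lexicon[tag])
    return new_words, heads, deps, tags

# Example: swap adjectives, producing a new sentence with a known parse.
# substitute_lexical(['it', 'was', 'white-listed'], [1, 1, 1],
#                    ['nsubj', 'root', 'attr'], ['PRP', 'VBD', 'JJ'],
#                    {'JJ': ['black-listed', 'short-listed']})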

What I don't want to do is waste a bunch of electricity in aimless hyper-parameter searches. I also don't think anyone should spend their time setting up some stupid ensemble that is sure to be 2% more accurate, 50x slower, and of no use to man or beast.

I'm hoping we can base our submission on a "stock" version of spaCy, with all modifications limited to the documented API. If we can get the system stabilised early, we'll have a nice stable configuration to test out the data augmentation.

Updates

  • The feature/better-gold branch should be merged tonight. It contains the examples/training/conllu.py script, and reimplemented Levenshtein alignment code. The Levenshtein alignment is used to project the gold labels onto the system tokenization, so that we don't have to use any gold pre-processing during training.

  • English is starting to look okay enough for now -- above the UDPipe baseline, anyway.

It might be interesting to compare two approaches:

1) directly use the tokenization in the gold labels, even when it conflicts with the tokenization that spaCy prefers and the models expect. This feels wrong, because spaCy is being fed stuff that it was never trained on, and will screw up. But how much, and does it really matter?

2) what you said, project the gold labels onto what spaCy produces. This feels much better, provided the projecting goes well, because the tokens will make sense to the model. But how will the projection go, especially if spaCy tokenizes something like 'white-list' as a unit, correctly assessing it as a VB, and the gold standard has JJ-VB? Do we always want the rightmost p-o-s? Which dependencies shall we keep?

Let me be a bit more explicit about how the training from raw text works.

Let's say we have the following input and annotations:

# Labelled examples for training are (text, annotations) pairs
text = "it was white-listed"
annotations = {
    'words': ['it', 'was', 'white-listed'],
    'heads': [1, 1, 1], # Using spaCy's convention of 0 index, self for root.
    'deps': ['nsubj', 'root', 'attr'],
    'tags': ['PRP', 'VBD', 'JJ']
}

We want a function nlp() that can produce annotations given text.

At both training and runtime, our first step will be to take the text and produce a Doc object, with the predicted tokenization:

doc = nlp.make_doc(text)

We'll then use the current weights to make predicted annotations, use the true annotations to calculate a gradient of the loss for each component, and then update the weights. Cool.

But let's say our tokenization doesn't match the input words. For instance, if the tokenizer outputs ["it", "was", "white", "-", "listed"], how should we calculate the loss?

What spaCy has always done is align the predicted and true tokens, using the Levenshtein distance. This gives us two arrays: one mapping indices in the predictions to indices in the gold, and the other vice versa:

>>> from spacy._align import align
>>> guess = ['it', 'was', 'white', '-', 'listed']
>>> true = ['it', 'was', 'white-listed']
>>> cost, guess2true, true2guess, matrix = align(guess, true)
>>> guess2true
[0, 1, -1, -1, -1]
>>> true2guess
[0, 1, -1]

Entries of -1 indicate there's no alignment. For it and was, we can now get the part-of-speech tag, head, dependency label etc from the gold annotations, allowing us to define the loss as normal. For non-aligned tokens, we consider these values missing. This means that the gradient of the loss is always zero for these unaligned tokens --- we don't supervise them. This can be a bad policy --- sometimes it's better to at least use a heuristic of some sort, to guide the model towards a plausible label. For instance, we probably don't want to tag a number as a verb, even if we don't have an explicit label.
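
In other words, the projection step amounts to something like this. This is a simplified sketch of the idea, not the actual internal code:

def project_gold(guess2true, gold_values):
    # Map gold annotations (tags, labels, ...) onto the predicted
    # tokenization. Unaligned predicted tokens get None, i.e. a missing
    # value that contributes zero gradient.
    return [gold_values[i] if i != -1 else None for i in guess2true]

# project_gold([0, 1, -1, -1, -1], ['PRP', 'VBD', 'JJ'])
# -> ['PRP', 'VBD', None, None, None]

(Heads are a bit trickier, since the gold head index also has to be mapped through the alignment.)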

Using the predicted tokenization during training helps protect us against train/test skew. The tagger and parser should have the same inputs at training time as they're going to receive at test time. For languages like English, tokenization accuracy is high enough that this doesn't matter so much. We can definitely experiment with this, especially as a sanity check. There are four combinations, for gold/predicted tokens at train/runtime. The combination (predicted at train, gold at runtime) isn't very interesting, though.

Where it really matters to train from raw text is in the sentence segmentation. If the model hasn't been trained with sentence boundary errors, it's going to get fragments at runtime unlike anything it's seen during training. The greedy transition-based model is particularly vulnerable to this --- it'll usually start off parsing the fragment as a noun, and then won't be able to recover into a more reasonable state. The need for training with predicted sentence boundaries is the main reason it's helpful to train without the gold-standard tokenization. Usually the tokenization disambiguates the sentence boundaries to an unrealistic degree. If you train with the gold tokens, you probably won't make very many sentence boundary errors.

The downside of training from the raw text is that the sequences are super long, and we're currently batching up the data on a per-input basis. This means that if we have 5000 word documents, we're going to be making one gradient update for every 5000 words of input. This corresponds to a batch size that might be much larger than we want. There are some details in the parser that make this less of an efficiency problem, but it still affects the optimization --- and neural nets are very fiddly like this...

My current solution is to include two copies of the data: one with the oracle segmentation, and one without. This seems to be helping, but I haven't run a proper evaluation yet.

You can control whether we add raw text or oracle segmentation in the training data using the function arguments here: https://github.com/explosion/spaCy/blob/feature/better-gold/examples/training/conllu.py#L251 .

I'd be glad to work on Japanese. Still need to port the UD stuff over to v2 though...

@polm Ah crap -- what did I forget?

Oh, you didn't forget anything - I said I'd port it to 2.0 then I broke my leg and was out of commission for a few months :/ I've been trying to figure out the new Tagger so I'll make a ticket about that soon.

Another update:

I've extended the tokenization alignment to allow many-to-one mappings from the system to the gold. Now if the tokenizer over-segments, the last token takes the annotations from the gold token, and the other tokens are given a special dependency label, "subtok". After parsing, we can then merge the tokens back up.

This means we now have an end-to-end joint model for word segmentation, parsing, and sentence segmentation :tada: . I think for languages like English we'll want to keep relying on rules to do most of the work. However, if we do over-segment in some cases, we can fall back to the statistical model. This is helping a lot for tokens like phone numbers, emails, URLs and punctuation strings, which are difficult to cover fully with regular expressions, but are relatively easy for the transition-based parser. Letting the parser fix the tokenization in these cases also seems to be helping for the sentence segmentation.
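
The post-parsing merge could look roughly like this. It's a sketch assuming the v2-era Span.merge() API; the actual logic in the training script may differ:

def merge_subtokens(doc, label='subtok'):
    # Collect maximal runs of tokens the parser labelled 'subtok', plus the
    # following token that carries the gold annotations, and merge each run
    # back into a single token.
    spans = []
    start = None
    for i, token in enumerate(doc):
        if token.dep_ == label and start is None:
            start = i
        elif token.dep_ != label and start is not None:
            spans.append(doc[start : i + 1])
            start = None
    for span in spans:
        span.merge()
    return doc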

Results on the English development set are already looking very nice:

           | Precision |    Recall |        F1 |
Sentences  |     81.97 |     78.32 |     80.10 |
Words      |     98.91 |     98.12 |     98.51 |
UAS        |     82.94 |     82.27 |     82.60 |
LAS        |     79.86 |     79.21 |     79.53 |

I don't know the 2017 systems' dev scores, but if we get these numbers on the test set, I'll be pretty satisfied. A UAS of 82.6 would put us in the 2nd place group, behind only the run-away winners from Stanford. The most encouraging thing is the sentence accuracy. A score of 80.1% is substantially better than the leading score of 78%, by a team from Facebook.

Every sentence segmentation mistake is necessarily at least one parsing mistake. I think parsing models are pretty much converging in quality, so having better sentence segmentation may separate us from other systems which would otherwise have similar accuracy.

Scores have improved over a few days ago due to the following changes:

  • Now using GloVe pre-trained vectors. This seems to help quite a lot, because the treebank is small. I'm looking forward to putting in FastText's new vectors, to see which works best.

  • Using multi-task objectives, predicting the POS tag and a sentence BILOU tag. These objectives are used to train the CNN to include more information. Specifically, we add a model which shares the CNN weights with the parser, and then has a single softmax layer. This model is trained to predict sentence begin / end / inside / unit tags. I haven't tested this thoroughly, but it seems to help slightly.

  • Limiting the number of sentences per training document. This keeps the training documents from being too large, solving the batch sizing issue. We simply make a new Doc object when the limit is reached (a rough sketch of the chunking follows this list). A limit of 1 corresponds to using gold-standard sentence segmentation.
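
Conceptually, the document-length limit just chunks the gold sentences before creating the Docs, along these lines (a toy sketch; the real script handles the raw-text and oracle cases separately):

def chunk_sentences(sentences, max_doc_length=3):
    # Group gold sentences into pseudo-documents of at most max_doc_length
    # sentences each; max_doc_length=1 reproduces gold sentence segmentation.
    docs = []
    for i in range(0, len(sentences), max_doc_length):
        docs.append(sentences[i : i + max_doc_length])
    return docs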

To everyone who wants to help out, here's another issue to take on. It's also well-suited for spaCy beginners and mostly requires some iteration (and knowing regular expressions).

#1642: Refactor regular expressions and get rid of "regex" dependency

The gist: The punctuation rules (especially suffixes) are too slow and unnecessarily complex. If we can rewrite and simplify them, remove the lookbehinds, replace the disjunctive expressions (|) with character classes ([]) and drop the regex dependency, this could have a big impact.
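
As a tiny illustration of the kind of rewrite meant here (hypothetical patterns, not the actual punctuation rules):

import re

# A disjunction of escaped single characters ...
slow_suffix = re.compile(r'(?:\.|,|;|!|\?)+$')
# ... is equivalent to a character class, which is simpler and faster, and
# avoids the kind of expression that pushes us towards the "regex" package.
fast_suffix = re.compile(r'[.,;!?]+$')

assert slow_suffix.search('Hello!?').group() == fast_suffix.search('Hello!?').group()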

I've already played around with this a little and I got all basic tokenizer and English tokenizer tests to pass with only 3-4 failures by removing almost all of the suffix rules (!) and rewriting some of the others. So this should definitely be doable.

Another idea: The tokenizer could also enforce stricter splitting in general and use the Matcher to merge certain tokens based on patterns. This would allow us to keep the regular expressions simpler and avoid complex lookbehinds. For example:

merge_patterns = [
    [{'ORTH': '-'}, {'ORTH': '-'}, {'ORTH': '-'}],  # "---"
    [{'ORTH': '-'}, {'ORTH': '-'}], # "--"
    [{'IS_UPPER': True}, {'ORTH': '.'}, {'IS_UPPER': True}, {'ORTH': '.'}],  # "U.S." etc.
    [{'LOWER': 'us'}, {'ORTH': '$'}]  # "US$", "us$" etc.
]

You can play with this by adding a simple component like this to the pipeline (as the first step, right after the tokenizer):

from spacy.matcher import Matcher

def merge_tokens(doc):
    matcher = Matcher(doc.vocab)
    matcher.add('MERGE_TOKENS', None, *merge_patterns)
    spans = []
    for match_id, start, end in matcher(doc):
        spans.append(doc[start : end])
    for span in spans:
        span.merge()
    return doc
nlp.add_pipe(merge_tokens, name='token_merger', first=True)

The goals are pretty straightforward:

  • All existing tests should pass (both tests/tokenizer and tests/lang).
  • It shouldn't be slower than the current solution 😛

@honnibal and @ines: this sounds cool and I'd like to spend some time helping out. Please suggest something...

Also, if there are other Sydneysiders interested, we could look into some kind of focussed hackathon via the NLP Meetup if it helps to get people in the same room.

@honnibal @ines @cbrew I would like to help for French.
I am also quite interested in the data augmentation part.

I'll have some availability to work on this from April 1st on.

@moreymat Awesome, thanks!

Sorry for the silence on this folks -- I made some great progress, and then thought "I'll just get a better experiment running so I can give more numbers". That sent me down a rabbit hole for what's felt like a month. Surprised to check back and see it's less than two weeks!

What I've been doing is fixing our infrastructure, so that we have a proper cluster to schedule these jobs on. My vision has been to use very small, pre-emptible instances on Google Compute Engine. This costs around $5 per CPU-month.

Tasks such as training CoNLL runs are naturally parallel --- one process per treebank. So we want this to be scheduled on 60 nodes on GCE or AWS, and the tasks all get done quickly.

I suppose I should've tried harder to get this running on Spark, or maybe used Kubernetes or something. What I've done instead is work on a more custom solution that will help us automate a wider variety of tasks in future.

Specifically, I've used HashiCorp's stack (Packer, Terraform, Consul, Nomad and Vault) to build a cluster I can provision, launch and operate from code, without clicking around in GCE's interface. Tonnes of frustrations along the way --- I now have many opinions about systemd, detailed thoughts about Luigi vs Airflow, new depths of contempt for Bash scripting, and a comprehensive knowledge of the arguments to chmod.

What I don't yet have is a working cluster. But I'm close! I can almost schedule tasks. Almost. I still have nerves about whether the whole thing will work. Maybe my solution will perform terribly for reasons I don't yet anticipate. I think it will be good though --- we'll see.

Anyway. About that great progress. The joint tokenization, sentence segmentation and parsing approach is already producing very nice results for languages like Chinese, Vietnamese and Japanese. At the moment the only languages we can't really run on are the Semitic languages, due to their non-concatenative morphology. Results are also a bit crap on the Romance languages.

At the moment the scores on the development set would place us around the 2nd place pack on most of the languages I've tried. So, I think we'll be able to put together a good submission. The major missing piece at the moment is the lemmatisation and morphology.

I would appreciate it if a researcher could put together a prototype model that predicts the lemmas and binary morphological features, using the input word and one or two words of context. You can write the model in whatever framework you want, e.g. PyTorch or whatever -- and don't worry about working from raw text; you can munge the data and just run your experiments off your pre-processed files etc.

Given a UD sentence like:

# sent_id = phi0975.phi001.perseus-lat1.tb.xml@227
# text = Super iuvencum stabat deiectum leo.
1       Super   super2  ADP     r--------       _       2       case    _       _
2       iuvencum        juvencus        NOUN    n-s---ma-       Case=Acc|Gender=Masc|Number=Sing        3       obl     _       _
3       stabat  sto     VERB    v3siia---       Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act      0       root    _       _
4       deiectum        deicio  VERB    v-srppma-       Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass        2       nmod    _       _
5       leo     leo2    NOUN    n-s---mn-       Case=Nom|Gender=Masc|Number=Sing        3       nsubj   _       SpaceAfter=No
6      

The model would have the following inputs and labels:

"inputs": [
        ["<S>", "Super", "iuvencum"],
        ["Super", "iuvenvum", "stabat"],
        ["stabat", "deiectum", "leo"],
        ["deiectum", "leo", "."],
        ["leo", ".", "</S>"],
    ],
    "outputs": [
        [{"lemma": "super2", "morphology": {..}}],
        [{"lemma": "juvencus", "morphology": {...}}],
        [{"lemma": "sto", "morphology': {...}}],
        [{"lemma": "deicio", "morphology": {...}}],
        [{"lemma": "leo2", "morphology": {...}}],
        [{"lemma": ".", "morphology': {...}}]
    ]

Note that for each input token such as "stabat", we have to output a list of lemmatized and morphologically analysed "words". The reason is that for languages such as Arabic and Hebrew, the tokenizer will under-segment, and we'll have multiple "fused" words in a single orthographic token. (We also get these for languages like German, but on those languages a rule-based approach is sufficient.)
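
Generating those context windows from a tokenized sentence is straightforward; something like this (a toy helper, not part of spaCy):

def context_windows(tokens, n=1):
    # Build (previous, word, next) windows like the "inputs" above,
    # padding the sentence boundaries with <S> and </S>.
    padded = ['<S>'] * n + list(tokens) + ['</S>'] * n
    return [padded[i : i + 2 * n + 1] for i in range(len(tokens))]

# context_windows(['Super', 'iuvencum', 'stabat', 'deiectum', 'leo', '.'])
# -> [['<S>', 'Super', 'iuvencum'], ['Super', 'iuvencum', 'stabat'], ...]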

There's obviously lots of literature on this problem. I think the Morfette approach is pretty appealing: https://pdfs.semanticscholar.org/b138/f3a54e9903a7295fe1441bae03a2ff1c123c.pdf . However, this doesn't solve the fused tokens problem.
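
The core Morfette idea --- treating lemmatization as predicting a string-edit tag rather than the lemma itself --- can be sketched like this. It's a toy suffix-rewrite version; the paper induces more general edit scripts:

def lemma_edit_tag(form, lemma):
    # Encode the lemma as "cut N characters from the end, then append S".
    i = 0
    while i < min(len(form), len(lemma)) and form[i].lower() == lemma[i].lower():
        i += 1
    return (len(form) - i, lemma[i:])

def apply_lemma_tag(form, tag):
    cut, suffix = tag
    return (form[:-cut] if cut else form) + suffix

# lemma_edit_tag('stabat', 'sto')          -> (4, 'o')
# apply_lemma_tag('deiectum', (5, 'cio'))  -> 'deicio'

The classifier then predicts the (cut, suffix) tag from the context window, which keeps the output space small and shared across word forms.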

Note also that fused tokens don't necessarily have to be connected subtrees. So, we need to segment the fused tokens before the parser runs, or have some other way of determining how to insert the fused tokens into the parse tree. It could be that there are only a small number of tree edit codes we can learn for the fused tokens. If so, we could output a tag that indicates the edit operation, just like Morfette does for the lemmatization. That would be neat.

I've added a few comments on my tokenizer investigations so far to https://github.com/explosion/spaCy/issues/1642

Hi @honnibal . Sorry for the silence, I will have some time to dedicate to this starting next week. Shall I look at statistical lemmatization and morphology as described above, or is there something more pressing to focus on?

@dvsrepo Yes if you could look at the lemmatization and morphology, that'd be great!

Another task that would be super useful is if someone could finish the PyTorch wrapper within Thinc, and/or work on a wrapper for Tensorflow? There's a thread about this on the PyTorch forums here: https://discuss.pytorch.org/t/help-developing-a-small-shim-to-allow-pytorch-models-to-be-used-in-spacy/14862/4 . The wrapper for the fully-connected layer is mostly finished, but we need another wrapper for RNN, as the API is quite different for that.

If we can get these wrappers done, it should be much easier for everyone to contribute, since you'll be able to work with tools you're familiar with.

Another set of tasks for CoNLL are around the training automation. I have a solution almost working with Nomad and Luigi. If anyone is familiar with those (or has more general experience in DevOps), it would be helpful to ask you a few questions :).

I've lost a lot of time to fixing Windows the last two weeks, but that's finally taken care of, so I can get back to this. Here are more updates on the planning:

I'll be extending the parser's transition system to allow it to learn to split tokens during parsing. Together with the existing functionality to learn to merge tokens and insert sentence boundaries, we'll have a nicely joint model capable of handling all languages.

We'll then handle lemmatisation and morphological analysis after parsing. This means we'll be able to use parse features inside the lemmatizer and morphological analyser if we want. This should be especially useful for pronoun features such as gender, which require wider context than can be easily learned from a sequence model.

@honnibal I have added a very rough WIP here: https://github.com/explosion/thinc/pull/61 . I don't know if it's the best place to discuss it, let me know what you think.

Tomorrow, I will create a script for generating the training/validation data for the lemmatization/morphology model from the CoNLL 2017 data (is there somewhere else we could generate training data from?).

Thanks!

This is a really good idea. Apart from up-to-date evaluation, I think UD support in spaCy is really important (see also #1460). Any idea on when we can have UD models for English and German, for example?

Hello! Shay from NLPH here! 😃
@honnibal

We care about open NLP for Hebrew. I'm also, personally, a Python developer (a data scientist, actually, so not a programmer per se). Anything I can do to help with this in the context of Hebrew, both as a single dev and in my capacity running NLPH (not many resources there, but perhaps I can mobilize more people)?

Thank you, @danielhers, for pointing me at this. 😺

Hey, sorry for not following up on this sooner, but how can I run the latest code on Japanese to see how it looks? I tried running the conllu.py script from the better-gold branch as the first post in this thread did with English, but it blows up with a tokenizer exception:

AttributeError: 'JapaneseTokenizer' object has no attribute 'add_special_case'

I see you mention merging the better-gold branch into master, but I don't see a conllu.py script there, so I'm not sure what branch/script I should be running to check this...

Hi, I'm trying to run the conllu.py script on the English dataset from the better-gold branch, and the following error comes up:
Traceback (most recent call last):
  File "examples/training/conllu.py", line 17, in <module>
    from spacy._align import align
ModuleNotFoundError: No module named 'spacy._align'

Does anyone have any idea or experienced anything like that?

@polm @alba83 You need to build the branch; unfortunately you can't just take the script. The develop branch is the one you want. I've added a Makefile, so you should be able to just do:

git clone https://github.com/explosion/spaCy -b develop
cd spaCy
make
./dist/spacy.pex ud-train --help

Hi @honnibal I've built the branch and now I would like to train and evaluate a model for French, to see where and how I can contribute.
ud-train expects a JSON config file, but I couldn't find an example. Could you provide directions on this?
Also, what word vectors can I use? The shared task page points to the FastText vectors, but in your original post on this issue you were using GloVe.

Config: I'll get this committed to the repo (or use it as defaults)

{
    "multitask_tag": false,
    "multitask_sent": false,
    "dropout": 0.2,
    "batch_size": 1000,
    "vectors": true,
    "max_doc_length": 3
}

FastText vectors have been getting me better results. If you download a FastText model, use spacy init-model -v /path/to/vectors.zip /path/to/output to convert the vectors into spaCy's format. You might find it convenient to prune the vectors down to say, 20k rows using a flag like -V 20000. This will give you slightly lower performance, but will keep your models small.

I think I was having some problems with French, due to the fused tokens -- but I don't remember in detail.

Fused tokens ("au" = "à" + "le", "du" = "de" + "le"...) are currently not processed.

I've tried to add tokenizer exceptions to handle them, using the custom attributes "begins_fused" and "inside_fused", like so:

fused_tokens = {
    "du": [
        {ORTH: "d", LEMMA: "de", NORM: "de", POS: "ADP"},
        {ORTH: "u", LEMMA: "le", NORM: "le", POS: "DET"},
    ],
}
for orth in fused_tokens:
    fused_tokens[orth][0]['_'] = {'begins_fused': True}
    for sub_tok in fused_tokens[orth][1:]:
        sub_tok['_'] = {'inside_fused': True}
_exc.update(fused_tokens)

but then I have the following error:

(.env) (spacy-conll-2018) ➜  spaCy git:(develop) ✗ spacy ud-train -v ../fr_fasttext_20k ../conll-data/release-2.2-st-train-dev-data/ud-treebanks-v2.2 ../out_dev_parses ../default_config.json UD_French-Sequoia
Train and evaluate UD_French-Sequoia using lang fr
Traceback (most recent call last):
  File "/home/mathieu/miniconda3/envs/spacy-conll-2018/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/mathieu/miniconda3/envs/spacy-conll-2018/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/mathieu/dev/spacy-conll-2018/spaCy/spacy/__main__.py", line 34, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/mathieu/dev/spacy-conll-2018/spaCy/.env/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/mathieu/dev/spacy-conll-2018/spaCy/.env/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/mathieu/dev/spacy-conll-2018/spaCy/spacy/cli/ud_train.py", line 365, in main
    nlp = load_nlp(paths.lang, config, vectors=vectors_dir)
  File "/home/mathieu/dev/spacy-conll-2018/spaCy/spacy/cli/ud_train.py", line 271, in load_nlp
    nlp = spacy.blank(lang)
  File "/home/mathieu/dev/spacy-conll-2018/spaCy/spacy/__init__.py", line 20, in blank
    return LangClass(**kwargs)
  File "/home/mathieu/dev/spacy-conll-2018/spaCy/spacy/language.py", line 157, in __init__
    make_doc = factory(self, **meta.get('tokenizer', {}))
  File "/home/mathieu/dev/spacy-conll-2018/spaCy/spacy/language.py", line 71, in create_tokenizer
    token_match=token_match)
  File "tokenizer.pyx", line 55, in spacy.tokenizer.Tokenizer.__init__
  File "tokenizer.pyx", line 342, in spacy.tokenizer.Tokenizer.add_special_case
  File "vocab.pyx", line 232, in spacy.vocab.Vocab.make_fused_token
  File "attrs.pyx", line 147, in spacy.attrs.intify_attrs
KeyError: '_'

I browsed the code base but could not find any lead, so I suspect that support for fused tokens is still incomplete (same for conllu input and output).

Another issue is that the French treebanks specify UPOS and FEATS, but (mostly) leave XPOS unspecified ("_"). So all tokens end up with reference and predicted UPOS "X".

Please let me know how I can help on these issues.

Hey @moreymat, from an all too brief look at this, the KeyError is because the _ that you're storing the custom attributes in is not in this low-level property dict IDS.

Maybe you could reserve one of the FLAG* IDs for this task?

@wejradford This seems to be the most direct course of action, but FLAG* IDs correspond to (standard) lexical attributes, whereas begins_fused and inside_fused are currently declared as custom attributes, via set_extension, in three different modules.
I don't know the code base well enough to know what these two options imply, so I cannot really make an informed guess.

Got it @moreymat. I do know that _ does a few magic things, maybe @ines or @honnibal know what the right course of action is in this case...

Have you guys considered using pyICU for tokenization? It does a great job and supports most languages reasonably well. I always use ICU for Solr/ES tokenization and normalization. Some languages do benefit from their own specialized implementation, such as Japanese, but it's a great default nonetheless.

@moreymat (cc: @wejradford)

If I read your code correctly, the problem here is how you're trying to assign a value to the _ property. token._ is a property that's resolved by the Underscore class (which then stores the custom attributes in the user hooks dictionary). But token._.foo != token['_']['foo'].

Instead, you probably want to do something like this:

token._.inside_fused = True

To register the extension attributes on the global token, you'll also need to call Token.set_extension:

from spacy.tokens import Token

Token.set_extension('inside_fused', default=False)
Token.set_extension('begins_fused', default=False)

See here for more examples. If you want to assign custom attributes by string name, you can also use the set, get and has methods:

token._.set('inside_fused', False)
token._.has('inside_fused')  # True
token._.get('inside_fused')  # False

@ines thanks for the informative answer.
I think I'm still missing something though.
Each tokenizer exception is processed by Vocab.make_fused_token() and it works fine for "basic" fused tokens (e.g. "don't" -> ["do", "n't"]) that are represented as n lines in the conllu format.
But here I want to handle the "explicit" fused tokens that end up as n+1 lines in the conllu format (e.g. "au" -> ["au", "à", "le"]), where the fused token ("au") has its own line with an interval index spanning those of its subtokens (e.g. "8-9 au", "8 à", "9 le").

From a user perspective, it seems natural to declare these fused tokens in tokenizer_exceptions.
The exceptions are passed to the Tokenizer __init__, which passes them to add_special_case, which calls the vocabulary's make_fused_token, but they all process TokenCs on which custom attributes are not defined.
Am I missing something there, or do I need to define a whole different process for these "explicit" fused tokens?

Sorry for another late reply on this but I still haven't been able to get Japanese working.

Here's my command line:

./dist/spacy.pex ud-train ~/data/release-2.2-st-train-dev-data/ud-treebanks-v2.2/ -v vecs ud-parses ud-conf.json UD_Japanese-GSD

Gives this error:

  ...
  File "/home/23/.pex/install/spacy_nightly-2.1.0a0-cp36-cp36m-linux_x86_64.whl.bcf4ee7567f49f09ed8d59b41092f9f149f3dd66/spacy_nightly-2.1.0a0-cp36-cp36m-linux_x86_64.whl/spacy/lang/ja/__init__.py", line 127, in create_tokenizer
    return JapaneseCharacterSegmenter(cls, nlp.vocab)
TypeError: __init__() takes 2 positional arguments but 3 were given

If I check __init__ for JapaneseCharacterSegmenter it doesn't match up with the call to create it. If I fix that I get this error:

...
  File "/home/23/.pex/install/spacy_nightly-2.1.0a0-cp36-cp36m-linux_x86_64.whl.bcf4ee7567f49f09ed8d59b41092f9f149f3dd66/spacy_nightly-2.1.0a0-cp36-cp36m-linux_x86_64.whl/spacy/lang/ja/__init__.py", line 108, in __call__
doc = self.tokenizer(text)
AttributeError: 'JapaneseCharacterSegmenter' object has no attribute 'tokenizer'

And indeed JapaneseCharacterSegmenter has no tokenizer member. I think the idea is it gets the generic tokenizer from Language somehow but I couldn't figure out how.

Am I missing something? It looks like the code for Japanese was checked in in an inconsistent state...

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
