Spacy: How is the en_core_web_sm model MIT licensed while it was trained on OntoNotes?

Created on 25 Apr 2019 · 7 Comments · Source: explosion/spaCy

https://spacy.io/models/en#en_core_web_sm indicates that the en_core_web_sm model is MIT licensed and was trained on OntoNotes. OntoNotes comes from the Linguistic Data Consortium (LDC) (https://catalog.ldc.upenn.edu/LDC2013T19, assuming it is OntoNotes 5.0).

From my understanding, the LDC User Agreement for Non-Members forbids commercial use. Assuming that Explosion AI is a for-profit member of LDC (~25k USD/year), does the LDC For-Profit Membership Agreement allow redistributing models trained on LDC corpora? It's not clear to me from reading the agreement. The most relevant section I could find in the agreement is:

Member may incorporate portions of the LDC Databases into its own work products, including for commercial purposes. Unless explicitly permitted herein, Member shall have no right to copy, redistribute, transmit, publish or otherwise use the LDC Databases for any other purpose.

I don't know whether redistributing a model trained on an LDC Database counts as incorporating the database into one's product (allowed) or as redistributing an LDC Database (not allowed). Do you interpret the agreement as treating redistribution of a model trained on an LDC Database as incorporating the database into one's product, and hence as allowed?

(PS: do not take this question as any form of support for LDC, which I consider to be a horrible organization grossly misusing public funding by placing speech and NLP datasets behind paywalls and under non-commercial licenses.)

meta models

Franck-Dernoncourt

Most helpful comment

Hey,

Thanks for the question. The short answer is: yes, we're commercial members, and we did check with LDC about redistribution of the trained model, and were told this fell within the expected usage. The trained model doesn't include the original corpus; it's just an artifact trained from it.

In general the licensing questions around trained models are fairly murky, especially when it comes to things like CC-BY-NC models. I do think for the corpora we've licensed commercially, it's at least a little bit clearer. After all, if it weren't possible to redistribute models trained on these things, the license wouldn't have made sense when it was originally written. The terms were developed in the 1990s, before SaaS was the popular way to provide machine learning solutions. At the time the only way to distribute something like a speech recognition system was as an application with trained weights.

honnibal on 25 Apr 2019

👍2

All 7 comments

Thanks for the fast response.

We're commercial members, and we did check with LDC about redistribution of the trained model, and were told this fell within the expected usage.

This is great! Does that apply to all LDC corpora?

The licensing questions around trained models are fairly murky, especially when it comes to things like CC-BY-NC models. I do think for the corpora we've licensed commercially, it's at least a little bit clearer. After all, if it weren't possible to redistribute models trained on these things [...]

Sorry, just to clarify: does "things" refer to "corpora we've licensed commercially" or to "CC-BY-NC models"?

Franck-Dernoncourt on 25 Apr 2019

We've licensed certain corpora from the LDC, not their whole catalogue.

"Things" there refers to "corpora licensed commercially by the LDC".

honnibal on 6 May 2019

👍1

Thanks! If redistribution of the trained models falls within the expected usage of LDC corpora, why doesn't this apply to all LDC corpora, provided that one is a commercial member of LDC?

Franck-Dernoncourt on 6 May 2019

Well, being a commercial member doesn't automatically give you their whole catalogue; you still have to license each specific corpus. We've licensed OntoNotes 5 from them and a number of other resources, but there are others we don't have.

honnibal on 6 May 2019

Got it, thanks!

On Mon, May 6, 2019, 14:06 Matthew Honnibal wrote:

Closed #3632 https://github.com/explosion/spaCy/issues/3632.
