https://spacy.io/models/en#en_core_web_sm (mirror) indicates that the en_core_web_sm model is MIT licensed and was trained on OntoNotes. OntoNotes comes from the Linguistic Data Consortium (LDC) (https://catalog.ldc.upenn.edu/LDC2013T19 (mirror), assuming it is OntoNotes 5.0).
From my understanding, the LDC User Agreement for Non-Members (mirror) forbids commercial use. Assuming that Explosion AI is a for-profit member of LDC (~25kUSD/year), does the LDC For-Profit Membership Agreement (mirror) allow redistributing models trained on LDC corpora? It's not clear to me from reading the agreement. The most relevant section I could find in the agreement is:
Member may incorporate portions of the LDC Databases into its own work products, including for commercial purposes. Unless explicitly permitted herein, Member shall have no right to copy, redistribute, transmit, publish or otherwise use the LDC Databases for any other purpose.
I don't know whether redistributing a model trained on an LDC Database counts as incorporating the database into one's product (allowed), or as redistributing an LDC Database (not allowed). Do you interpret the agreement as treating redistribution of a model trained on an LDC Database as incorporating the database into one's product, and hence as allowed?
(PS: do not take this question as any form of support for LDC, which I consider to be a horrible organization grossly misusing public funding (mirror) by placing speech and NLP datasets behind a paywall and under non-commercial licenses.)
Hey,
Thanks for the question. The short answer is: yes, we're commercial members, and we did check with LDC about redistribution of the trained model, and were told this fell within the expected usage. The trained model doesn't include the original corpus, it's just an artifact trained from it.
In general the licensing questions around trained models are fairly murky, especially when it comes to things like CC-BY-NC models. I do think for the corpora we've licensed commercially, it's at least a little bit clearer. After all, if it weren't possible to redistribute models trained on these things, the license wouldn't have made sense when it was originally written. The terms were developed in the 1990s, before SaaS was the popular way to provide machine learning solutions. At the time the only way to distribute something like a speech recognition system was as an application with trained weights.
Thanks for the fast response.
We're commercial members, and we did check with LDC about redistribution of the trained model, and were told this fell within the expected usage.
This is great! Does that apply to all LDC corpora?
The licensing questions around trained models are fairly murky, especially when it comes to things like CC-BY-NC models. I do think for the corpora we've licensed commercially, it's at least a little bit clearer. After all, if it weren't possible to redistribute models trained on these things [...]
Sorry, just to clarify: does "things" refer to "corpora we've licensed commercially" or "CC-BY-NC models"?
We've licensed certain corpora from the LDC, not their whole catalogue.
"Things" there refers to "corpora licensed commercially by the LDC".
Thanks! If the redistribution of the trained models fell within the expected usage of LDC corpora, why doesn't this apply to all LDC corpora, provided that one is a commercial member of LDC?
Well, being a commercial member doesn't automatically give you their whole catalogue; you still have to license a specific corpus. We've licensed OntoNotes 5 from them and a number of other resources, but there are others we don't have.
Got it, thanks!
On Mon, May 6, 2019, 14:06 Matthew Honnibal <[email protected] wrote:
Closed #3632 https://github.com/explosion/spaCy/issues/3632.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.