Hello, I have some questions about publishing models trained on MIMIC-III. Could you please let me know which of the following is allowed:
It's hard to give concrete answers as it depends on the exact context, but in general:
- Train a word2vec model and make the word-vector pairs publicly available? (I do understand that a pre-trained fastText model is available, but very little pre-processing was done and the embeddings are a bit messy.)
It depends on the preprocessing, as the released vocabulary may contain sensitive terms.
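As an illustration of how preprocessing shapes what gets released (a minimal sketch, not an official policy check): the `sentences` corpus below is a hypothetical stand-in for tokenized MIMIC-III notes, and the filtering rules are examples, not a complete de-identification step.

```python
import re
from gensim.models import Word2Vec

# Hypothetical stand-in for tokenized MIMIC-III notes.
sentences = [
    ["patient", "admitted", "with", "sepsis"],
    ["[**2101-1-1**]", "discharge", "summary", "patient", "stable"],
]

# MIMIC-III wraps de-identified fields in [** ... **] placeholders.
DEID_TAG = re.compile(r"^\[\*\*.*\*\*\]$")

def keep(token):
    # Drop de-id placeholders and bare numbers (dates, IDs) that survive tokenization.
    return DEID_TAG.match(token) is None and not token.isdigit()

filtered = [[t for t in sent if keep(t)] for sent in sentences]

# min_count=1 only so this toy example runs; on real data a higher threshold
# also prunes the rare tokens where residual identifiers tend to concentrate.
model = Word2Vec(filtered, vector_size=100, min_count=1, workers=1)

# Inspect the vocabulary that would ship with the released vectors.
print(sorted(model.wv.key_to_index))
model.wv.save_word2vec_format("mimic_w2v.txt")
```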
- Train a CUI2Vec model (where a CUI is a disease identifier from UMLS/SNOMED) and make it public? This model would not contain any text/words found in MIMIC-III.
This would probably be fine.
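For concreteness, a toy sketch of the CUI2Vec idea in gensim: word2vec is run over sequences of CUIs rather than words, so the released vocabulary contains only concept identifiers. The CUI sequences below are invented for illustration, and the mapping from notes to CUIs (e.g., via a concept recognizer) is not shown.

```python
from gensim.models import Word2Vec

# One "sentence" per patient visit: the CUIs extracted from that visit.
cui_sequences = [
    ["C0011849", "C0020538", "C0027051"],  # diabetes, hypertension, myocardial infarction
    ["C0020538", "C0011849"],
]

model = Word2Vec(cui_sequences, vector_size=100, window=5, min_count=1, workers=1)

# The vocabulary is CUIs only; no MIMIC-III text appears in the model.
print(model.wv.most_similar("C0011849"))
```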
- Train a BERT-like model on MIMIC-III and make it public?
This has been done in the past when the byte-pair encoding was not retrained on MIMIC. If you retrain the tokenizer and incorporate MIMIC-derived vocabulary, some checking is again needed to ensure the model has not memorized notes.
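As a rough sketch of the first (safer) setup, assuming Hugging Face `transformers`/`datasets` and a placeholder notes file: continued masked-language-model pretraining that keeps BERT's original WordPiece vocabulary untouched, so no MIMIC-derived tokens enter the released model.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# The tokenizer/vocabulary is NOT retrained on MIMIC.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# "mimic_notes.txt" is a placeholder: one note (or note chunk) per line.
ds = load_dataset("text", data_files={"train": "mimic_notes.txt"})
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128), batched=True)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="mimic-bert", per_device_train_batch_size=16, num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=ds["train"], data_collator=collator).train()
```

Even with a fixed vocabulary, the memorization check above still applies: the released weights should be probed for reproduced note fragments before publishing.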
Hi @alistairewj, thank you for the answer, it was really helpful. Just one more thing regarding the word2vec model:
- It seems that the main concern is the vocabulary, so if I were to take a word2vec model trained on Wikipedia (with a vocabulary built from Wikipedia) and then fine-tune it on MIMIC, I assume this would be fine? All the words in the vocabulary would, in this case, come from Wikipedia, and I do not see a way to reproduce notes given only the vector embeddings.
The rest is perfectly clear, thanks.
Yes, I agree, it would be fine!
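For concreteness, one way this setup could look in gensim (a sketch only: `wiki_vectors.txt` stands in for pretrained Wikipedia vectors and `mimic_sentences` for tokenized MIMIC notes):

```python
from gensim.models import KeyedVectors, Word2Vec

# Placeholder: word2vec vectors pretrained on Wikipedia, in word2vec text format.
pretrained = KeyedVectors.load_word2vec_format("wiki_vectors.txt")

# Placeholder for tokenized MIMIC-III notes.
mimic_sentences = [["patient", "admitted", "with", "fever"]]

# Restrict MIMIC text to the Wikipedia vocabulary so no new tokens are introduced.
wiki_vocab = set(pretrained.key_to_index)
filtered = [[t for t in sent if t in wiki_vocab] for sent in mimic_sentences]

model = Word2Vec(vector_size=pretrained.vector_size, min_count=1)
model.build_vocab(filtered)

# Seed every word with its Wikipedia vector, then continue training on MIMIC.
for word, idx in model.wv.key_to_index.items():
    model.wv.vectors[idx] = pretrained[word]

model.train(filtered, total_examples=len(filtered), epochs=5)
```

Because the corpus is filtered against the Wikipedia vocabulary before training, every word in the released model is a Wikipedia word, matching the reasoning above.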
One point to add is that we encourage sharing of sensitive models on PhysioNet. There is a "model" data type that can be selected during submission: https://physionet.org/about/publish/#guidelines
For example, see the following project:
_Amin-Nejad, A., Ive, J., & Velupillai, S. (2020). Transformer models trained on MIMIC-III to generate synthetic patient notes (version 1.0.0). PhysioNet. https://doi.org/10.13026/m34x-fq90._