Mimic-code: Privacy concerns regarding models trained on MIMIC-III

Created on 28 Nov 2020 · 4 comments · Source: MIT-LCP/mimic-code

Description

Hello, I have some questions about publishing models trained on MIMIC-III. Could you please let me know which of the following would be allowed:

  • Train a word2vec model and make the word-vector pairs publicly available? (I understand that a pre-trained fastText model exists, but very little pre-processing was done on it and the embeddings are a bit messy.)
  • Train a CUI2Vec model (a CUI is a disease identifier from UMLS/SNOMED) and make it public? This model would not contain any text/words found in MIMIC-III.
  • Train a BERT-like model on MIMIC-III and make it public?

All 4 comments

It's hard to give concrete answers as it depends on the exact context, but in general:

  • If the model makes use of the notes and has the potential to reproduce them verbatim (or has memorized them), we consider it sensitive and ask that it be shared on PhysioNet under the same restrictions as MIMIC. Anyone with access to MIMIC would then be able to access the model as well.
  • A reasonably high-level model trained using only the structured data may be shared publicly. I say "reasonably high level" because you can imagine a k-NN-type model that contains the entire dataset; sharing such a model would obviously violate the DUA (see the sketch below). There is a spectrum here, and we recommend asking if you're not sure.
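To make the k-NN point concrete, here is a minimal sketch (illustrative only; assumes integer class labels) of why the fitted object literally is the training data:

```python
import numpy as np

class KNNClassifier:
    """Minimal k-nearest-neighbours classifier for integer class labels."""

    def fit(self, X, y):
        # No compression or abstraction happens here: every training
        # row (i.e., every patient record) is stored verbatim.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X, k=3):
        preds = []
        for x in np.asarray(X, dtype=float):
            dists = np.linalg.norm(self.X_ - x, axis=1)
            nearest = self.y_[np.argsort(dists)[:k]]
            preds.append(np.bincount(nearest).argmax())
        return np.array(preds)

# Serializing this object (e.g., with pickle) embeds self.X_, the raw
# training rows, in the file, so publishing the "model" would
# redistribute the dataset itself.
```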
> Train a word2vec model and make the word-vector pairs publicly available? (I understand that a pre-trained fastText model exists, but very little pre-processing was done on it and the embeddings are a bit messy.)

It depends on the preprocessing, as the released vocabulary may contain sensitive terms.
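As one illustration of the kind of preprocessing that matters, a hypothetical pre-release filter (the function and vocabulary source are assumptions, not an official requirement) could keep only tokens that also occur in a public reference vocabulary:

```python
def filter_embeddings(word_vectors, public_vocab):
    """Keep only embeddings whose token appears in a public vocabulary.

    word_vectors: dict mapping token -> vector (e.g., from word2vec)
    public_vocab: set of lowercase tokens from a public corpus
    """
    # Tokens unique to MIMIC (names, identifiers, rare misspellings)
    # are dropped, so the released file exposes no MIMIC-only strings.
    return {tok: vec for tok, vec in word_vectors.items()
            if tok.lower() in public_vocab}
```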

> Train a CUI2Vec model (a CUI is a disease identifier from UMLS/SNOMED) and make it public? This model would not contain any text/words found in MIMIC-III.

This would probably be fine.

> Train a BERT-like model on MIMIC-III and make it public?

This has been done in the past in cases where the byte pair encoding was not retrained on MIMIC. If the tokenizer is retrained and the MIMIC vocabulary is incorporated, some checking is again needed to ensure the model has not memorized notes.
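As a rough sketch of what such checking might look like (the model path and probe below are hypothetical, not an official procedure), one can mask tokens from held-out notes and measure how often the model restores them verbatim:

```python
from transformers import pipeline

# Hypothetical local model path; assumes a BERT-style masked language
# model whose tokenizer uses the [MASK] token.
fill = pipeline("fill-mask", model="./mimic-bert")

def exact_match_rate(masked_sentences, original_tokens):
    """Fraction of masked positions the model restores verbatim.

    masked_sentences: sentences from held-out notes, each with one
        token replaced by the tokenizer's mask token.
    original_tokens: the tokens that were masked out.
    """
    hits = 0
    for masked, original in zip(masked_sentences, original_tokens):
        top = fill(masked, top_k=1)[0]["token_str"].strip()
        hits += (top == original)
    return hits / len(original_tokens)
```

A high exact-match rate on unique, note-specific spans (rather than common clinical boilerplate) would be a warning sign of memorization.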

Hi @alistairewj, thank you for the answer; it was really helpful. Just one more thing regarding the word2vec model:

  • It seems that the main problem is the vocabulary, so if I were to use, e.g., a word2vec model trained on Wikipedia (with the vocabulary built on Wikipedia) and then fine-tune it on MIMIC, I assume this would be fine? All the words in the vocabulary would, in this case, come from Wikipedia, and I do not see a way to reproduce notes given only vector embeddings.

The rest is perfectly clear, thanks.

Yes, I agree, it would be fine!
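For concreteness, a minimal gensim sketch of this approach (corpora and sizes are placeholders): the vocabulary is built from Wikipedia only, and the fine-tuning pass on MIMIC silently skips any token outside that vocabulary, so MIMIC-only strings never enter the released vectors.

```python
from gensim.models import Word2Vec

# Tiny placeholder corpora; real inputs would be full tokenized corpora
# with a sensible min_count.
wiki_sentences = [["patient", "care", "hospital"], ["disease", "treatment", "care"]]
mimic_sentences = [["patient", "treatment", "xyzid123"]]  # "xyzid123" is MIMIC-only

model = Word2Vec(vector_size=50, min_count=1)

# Build the vocabulary from Wikipedia only. Crucially, we never call
# build_vocab(..., update=True) on MIMIC, which would add MIMIC tokens.
model.build_vocab(wiki_sentences)
model.train(wiki_sentences, total_examples=model.corpus_count, epochs=5)

# Fine-tune on MIMIC: tokens absent from the Wikipedia vocabulary are
# ignored during training, so the word list stays Wikipedia-only.
model.train(mimic_sentences, total_examples=len(mimic_sentences), epochs=5)

assert "xyzid123" not in model.wv.key_to_index  # MIMIC-only token never enters
model.wv.save_word2vec_format("wiki_finetuned_on_mimic.vec")  # release these pairs
```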

One point to add is that we encourage sharing of sensitive models on PhysioNet. There is a "model" data type that can be selected during submission: https://physionet.org/about/publish/#guidelines

For example, see the following project:

_Amin-Nejad, A., Ive, J., & Velupillai, S. (2020). Transformer models trained on MIMIC-III to generate synthetic patient notes (version 1.0.0). PhysioNet. https://doi.org/10.13026/m34x-fq90._
