E.g. we would push encrypted files to remote for improved security.
From the Discord channel:
The basic idea would be to be able to store data and models on an untrusted cloud provider :innocent: while still being able to share with others or work on it locally by using a decryption key shared among members.
dvc enables us to share data and models on a remote server / data storage service.
Such a service may not always be trusted with our (or our clients') data and models.
This would be as follows:
The encryption key would have to be shared among project members.
dvc remote add [...] --encryption-method AES-256 --encryption-key XYZ
Example of dvc file (source):
cmd: python cmd.py input.data output.data metrics.json
deps:
- md5: da2259ee7c12ace6db43644aef2b754c
  path: cmd.py
  md5_encrypted: [...]
- md5: e309de87b02312e746ec5a500844ce77
  path: input.data
  md5_encrypted: [...]
md5: 521ac615cfc7323604059d81d052ce00 # <= this hash is about this file?
outs:
- cache: true
  md5: 70f3c9157e3b92a6d2c93eb51439f822
  md5_encrypted: [...]
  metric: false
  path: output.data
- cache: false
  md5: d7a82c3cdfd45c4ace13484a931fc526
  md5_encrypted: [...]
  metric:
    type: json
    xpath: AUC
  path: metrics.json
locked: True
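To make the proposed md5 / md5_encrypted pair concrete, here is a minimal sketch of how both checksums could be computed for one entry. It assumes the encrypted copy already exists on disk; `file_md5` and `checksum_entry` are hypothetical helper names for illustration, not existing dvc functions.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_entry(plain_path, encrypted_path):
    """Build a dict shaped like one deps/outs entry of the proposal.

    Assumption: `encrypted_path` points at the already-encrypted copy
    of `plain_path`; no encryption happens in this function.
    """
    return {
        "path": Path(plain_path).name,
        "md5": file_md5(plain_path),
        "md5_encrypted": file_md5(encrypted_path),
    }
```

With two checksums, the local cache could still be verified against md5 while the remote copy is verified against md5_encrypted without needing the key.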
@aurelien-clu
That md5 is a checksum for that dvc file. See https://dvc.org/doc/user-guide/dvc-file-format .
EDIT: sorry, didn't notice your link right away.
Also, remote configuration is done with dvc remote modify command, so something like dvc remote modify myremote encryption_method rsa.
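For illustration, a rough sketch of how such an option could be read back, assuming it ends up in the INI-style dvc config file under the remote's section. The `encryption_method` option is only the proposal from this thread and does not exist in dvc; `remote_option` and the sample config text are hypothetical.

```python
import configparser

# Assumed shape of the config after the hypothetical
# `dvc remote modify myremote encryption_method rsa`.
SAMPLE_CONFIG = """
['remote "myremote"']
url = s3://bucket/path
encryption_method = rsa
"""


def remote_option(config_text, remote, option, default=None):
    """Return one option of a named remote, or `default` if it is absent."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    section = "'remote \"{}\"'".format(remote)
    if not parser.has_section(section):
        return default
    return parser.get(section, option, fallback=default)
```

Usage would be something like `remote_option(SAMPLE_CONFIG, "myremote", "encryption_method")`.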
Alright thanks.
(sorry did not have much time to look further in the documentation)
I am looking into how to handle file encryption and decryption in python and then where to hook this into dvc code.
I feel like dvc.remote.base is a good candidate but I am not certain.
And there are other parts to update as well:
the CLI and the .dvc file, IMHO.
EDIT:
File encryption and decryption using AES (still updated though not many stars)
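A rough sketch of the kind of hook I have in mind: a thin wrapper that encrypts right before upload and decrypts right after download. None of these names are dvc's; `EncryptingRemote` is hypothetical, the remote is a plain dict, and `placeholder_transform` is NOT encryption, it only stands in so the example runs without third-party packages. A real version would plug in an AES implementation instead.

```python
from typing import Callable, Dict


class EncryptingRemote:
    """Hypothetical wrapper around a remote's upload/download path."""

    def __init__(
        self,
        store: Dict[str, bytes],
        encrypt: Callable[[bytes], bytes],
        decrypt: Callable[[bytes], bytes],
    ):
        self._store = store  # stand-in for the untrusted remote storage
        self._encrypt = encrypt
        self._decrypt = decrypt

    def upload(self, name: str, data: bytes) -> None:
        # Only the transformed bytes ever reach the remote.
        self._store[name] = self._encrypt(data)

    def download(self, name: str) -> bytes:
        # Reverse the transformation after fetching from the remote.
        return self._decrypt(self._store[name])


def placeholder_transform(data: bytes) -> bytes:
    """NOT encryption: a reversible byte flip used purely as a placeholder."""
    return bytes(b ^ 0xFF for b in data)


remote_storage: Dict[str, bytes] = {}
remote = EncryptingRemote(remote_storage, placeholder_transform, placeholder_transform)
remote.upload("input.data", b"sensitive")
```

The point is only that callers of upload/download would not need to change; where exactly this belongs in dvc.remote.base is still an open question.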
FWIW, it will be a very useful feature (at least for my workplace!) for dvc to first encrypt and then store into cloud storage. It will allow more layers of defence and control while storing sensitive artefacts. Google cloud's KMS would be easy to add into RemoteGS class (remote/gs.py) _upload, _download methods for this. This can sit as another remote or additional config on GCS remote.
@swanandgore hi! Makes total sense to me, and we will probably make the priority higher for this. Could you elaborate a bit, though, on why cloud-side encryption is not enough? Something like the sse option for the S3 remote here - https://dvc.org/doc/commands-reference/remote/modify
Why not use existing encryption solutions like git-crypt or the like for transparent encryption?
The tools already provide an AES key and a PGP user "management" which is very handy.
This would also reduce the implementation complexity a lot.
It would be especially convenient in our use case. We encrypt some non-dvc files with git-crypt and also want to encrypt those managed by dvc so that only encrypted data is stored in the storage (and cache).
@dr-duplo Great idea! Maybe you could elaborate on what the git-crypt-based implementation could look like?
Yes. Sure. I will try, but I'm not deeply familiar with the inner workings of dvc, yet:
I assume for now git-crypt is used. With it you can generate multiple AES keys used to encrypt files. Besides the default key you can also create a named one. After creating a dvc-specific key you can permit a GPG user to use the key. Other users will not be able to get the key. It is individually encrypted with the user's GPG public key. Only already permitted users can allow other users to use the key. The per-user encrypted key gets added to the repo.
To reveal the key for a user, "git-crypt unlock" has to be used. It can then be found in the ".git" directory. DVC can use the key to encrypt/decrypt the data to/from remote storage or the cache.
This should be completely transparent to the user.
DVC could generate individual keys for every remote or on the repo level.
The encryption, as git-crypt uses it, produces stable ciphertexts, which means that if the content of the file doesn't change, its encrypted version is also fixed (read more in the git-crypt docs). The question here is whether one could reuse its encryption/decryption routines.
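As far as I understand the git-crypt docs, the stable ciphertext comes from using AES-256 in CTR mode with a synthetic IV derived from the SHA-1 HMAC of the file. Here is a small sketch of just that derivation step, not of the AES part. The function name and the 12-byte nonce length are my assumptions, so check the git-crypt source before relying on them.

```python
import hashlib
import hmac

NONCE_LEN = 12  # assumption: nonce length used for the CTR-mode IV


def synthetic_nonce(hmac_key: bytes, plaintext: bytes) -> bytes:
    """Derive a deterministic nonce from the file content.

    Same key and same content always give the same nonce, which is what
    makes the resulting ciphertext stable across runs.
    """
    mac = hmac.new(hmac_key, plaintext, hashlib.sha1).digest()
    return mac[:NONCE_LEN]
```

For dvc this property matters because a checksum of the encrypted file would stay the same as long as the plaintext does.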
Possible DVC command extensions:
This is a rough draft w/o checking anything. A POC would be nice.
Want to circle back on this issue - SSE-KMS for AWS S3 back end is still important, as many shops require this for policy reasons.
@jackwellsxyz does dvc remote modify myremote sse <AES256, aws:kms, etc> work for you?
please check it here https://dvc.org/doc/command-reference/remote/modify in the list of AWS S3 specific options.
Thanks @shcheklein, I've been diving into the code a bit. That is half of the solution: I need to set "sse = aws:kms" and also pass the 'SSEKMSKeyId'=
I'd be happy to do a PR if that would help - in either case, it seems like a relatively straightforward fix (don't they all...)
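Roughly what I mean, as a sketch: build the extra upload arguments from the remote config and hand them to boto3. `ServerSideEncryption` and `SSEKMSKeyId` are the argument names boto3 accepts in `ExtraArgs`; the helper name and the idea of a separate key-id config option are my assumptions, not something dvc has today.

```python
def sse_extra_args(sse=None, kms_key_id=None):
    """Build boto3 ExtraArgs for S3 server-side encryption.

    `sse` is the value of the remote's sse option (e.g. "AES256" or
    "aws:kms"); `kms_key_id` is a hypothetical additional option.
    """
    args = {}
    if sse:
        args["ServerSideEncryption"] = sse
    if kms_key_id:
        if sse != "aws:kms":
            raise ValueError("a KMS key id only makes sense with sse = aws:kms")
        args["SSEKMSKeyId"] = kms_key_id
    return args


# Intended use (requires boto3 and AWS credentials, so left as a comment):
# boto3.client("s3").upload_file(
#     path, bucket, key, ExtraArgs=sse_extra_args("aws:kms", key_id)
# )
```

If no key id is given with aws:kms, S3 falls back to the AWS-managed key for the account, so the extra option can stay optional.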
@jackwellsxyz it looks like we are already down the path of adding all the options to the schema (I hope it's a finite number after all). It would definitely be great to have a PR for that. I would start with git grep <name_of_other_option> to see where it should be added. Thanks 🙏