This ticket is to keep track of required further improvements for Google Drive remote
implementation after https://github.com/iterative/dvc/pull/2551 will be merged into master
:
[x] Processing of auth exceptions and printing more meaningful error message on failure. https://github.com/iterative/dvc/pull/2551#discussion_r348182771
[x] Implement gc
command support for gdrive
( def remove(self, path_info)
overloading )
[x] Validate resolved remote file object to have expected title ( def resolve_remote_item_from_path(self, path_parts, create):
)
[x] Simplify a few things using new and shiny @wrap_prop
and filter_errors
param of @retry
from funcy 1.14 :)
[x] Make name -> id deterministic by choosing minimum id https://github.com/iterative/dvc/pull/2551#discussion_r350424816
[x] Reimplement caching to have 2 entry points .path_info_to_ids() and .id_to_path_info(). The idea is to remove cache dictionaries completely, but it will require introduction of helper method which accepts parent_id, title and returns the actual id ( this method introduces 1 -> 1 relation between input and resulting remote id ). https://github.com/iterative/dvc/pull/2551#discussion_r350213128
[x] Simplify resolve_remote_item_from_path
https://github.com/iterative/dvc/pull/2551#discussion_r351940507
[x] Protect create_remote_dir
with lock. https://github.com/iterative/dvc/pull/2551#discussion_r350428630
[x] Organize methods, params in ordered and unified way. https://github.com/iterative/dvc/pull/2551#discussion_r350152633
[x] Enhance retrieving of HTTP error from GDrive API exceptions https://github.com/MaxRis/dvc/commit/783d8b610db9da2bdcee1817cd90366b435e81eb#r36184990
[x] Create Iterative Google Drive account and Project to share client id and secret with final DVC users. Highest possible API usage limits quotas should be requested. Probably, it will be needed to have separate Google Project for CI. https://github.com/iterative/dvc/pull/2551#discussion_r351936393
[x] Stop creating a path to the DVC remote root if it does not exist. Since Google Drive allows multiple folders with the same name (at least on in My Drive and one in Shared With Me) in case a path like gdrive://root/storage
is used to access, collaborators see a storage
empty folder after the first dvc pull
attempt`. Instead we should just throw path does not exist.
[x] Hide /Users/ivan/Projects/test-gdrive/.env/lib/python3.7/site-packages/oauth2client/_helpers.py:255: UserWarning: Cannot access /Users/ivan/Projects/test-gdrive/.dvc/tmp/gdrive-user-credentials.json: No such file or directory
warnings.warn(_MISSING_FILE_MESSAGE.format(filename))
on the first auth.
[x] Reconsider self.no_traverse = False
. Now even to pull a single file we run the full traversal. We can at least start listing only prefixes we need (e.g. remote/0c/*
if we need to check if file 0c1234...ef
exists), we can use a parallel exists
if we need to check less than 256 files, etc.
gdrive://root
is not accessible by other people - that ID must be used to actually share data with other team members.[x] ~Do not allow empty root DVC remote path (except shared?). I think it can prevent a lot of strange issues. More on this here https://github.com/iterative/dvc/issues/3586~
[x] Add _download
progress https://github.com/iterative/dvc/issues/2865#issuecomment-560487931 -> #3722
[x] Support file streaming in dvc.api.open()
function (more details in #3408) and and update docs
[ ] Add import-url
support
[ ] Check that close()
is handles in the dvc.api.open()
context manager.
[ ] Support show URL in dvc get
for Gdrive. We can generate a HTTPS link that can be used in a browser if user is authed. The same link as you would get if you download a file from UI.
[ ] Fix credentials management for external repos (when we do dvc get
, etc). We should have a way to cache them.
[ ] Support external dependencies and outputs
[ ] Check that dvc get-url
functions properly. Need to come up with some credentials management.
[ ] Review and add more tests if needed. Starting from API tests.
[ ] Add an explanation comment about how path vs ids work, all non 1-1 stuff. Someone running into GDrive for first time will appreciate https://github.com/iterative/dvc/pull/2551#pullrequestreview-311748384. Explain how we deal with it, how caching is involved too.
[ ] Notify user on retries, keep the message on the screen if they are happening at a certain rate
[ ] Consider raising an exception if there are multiple remote root directories with the same name
[ ] Move _gdrive_*
helpers into a separate module api
to simplify testing and reading the code
[ ] When it's more or less stable make it trusted by default
[ ] Check if there is a way to generate URLs for GDrive publicly available files to download them w/o auth. Examples: https://github.com/NVlabs/stylegan/blob/master/pretrained_example.py and https://github.com/NVlabs/stylegan/blob/master/dnnlib/util.py . dvc get
ideally should work w/o asking to Auth for public objects.
[ ] Improve performance, pass fields
, consider using batch for exists call - https://developers.google.com/drive/api/v2/performance , see example here https://github.com/iterative/PyDrive2/issues/42
@MaxRis would be helpful if you include links to the discussions in the initial PR.
@shcheklein sure :+1:
also add _download
progress (atm only uploads will provide progress). Specifically,
see https://github.com/iterative/dvc/pull/2875#issuecomment-560044844 (source at https://github.com/gsuitedevs/PyDrive/blob/68cea204cdcd10bab9321580164c8f4961385a0f/pydrive/files.py#L202) which means this is non-trivial to add
@casperdcl thanks, have added this to the list
@efiop @Suor could you please review the order of priority for available things to be addressed in this ticket? Thanks!
@MaxRis maybe start from tests, then pick whatever you like.
One more GDrive related issue that can be added to the list https://github.com/iterative/dvc/issues/3015
Hi all, here is a screenshot of an issue I have with pushing to gdrive, and the subsequent creation of duplicate folders
@tierney6 Thanks for reporting! @MaxRis is currently researching the issue. Stay tuned! 🙂
(Added "Support file streaming in dvc.api.open() function (more details in #3408)" checkbox.)
@MaxRis @Suor do you remember what Validate resolved remote file object to have expected title ( def resolve_remote_item_from_path(self, path_parts, create): )
was about?
@shcheklein just to put assert
to check remote file has expected title
. The idea was to add more validation of data received from GDrive API. For example, if we resolved file by id
, it makes sense to validate the title
with expected.
@MaxRis k, so meaning that we need to put an assert in the _get_remote_item
?
yes, exactly
Woah this one is huge! Lots of things have been done too 🎉
Maybe move the remaining stuff into smaller issues and close this one? 🙂
@jorgeorpinel it's easier for me to manage it this way vs creating a long tail of small tickets. Let's keep it for now.
Put notes in docs that path like gdrive://root is not accessible by other people - that ID must be used to actually share data with other team members.
I'm addressing this in https://github.com/iterative/dvc.org/pull/1096#pullrequestreview-386926510 ✅
Also, how about
dvc config gdrive.client_id
and gdrive.client_secret
config options so you don't have to set it up per remote? Maybe also for service account params.
Also, how about
not sure about this, I don't see that much difference with other remotes and other params they have
Regarding adding _download
progress, it's super hard because of the sheer stack of libraries and functions which were not designed for any form of iteration or callbacks in mind. There are a lot of places where it may look like progress could be added but really you'd only be monitoring a pipe after the file has been downloaded.
CC @shcheklein the stack:
dvc.remote.gdrive.GDriveRemote._gdrive_download_file()
gdrive_file.GetContentFile()
== pydrive2.files.GoogleDriveFile.GetContentFile()
FetchContent()
_DownloadFromUrl()
self.http.request()
== httplib2.Http.request()
_request(self, conn : httplib2.HTTPSConnectionWithTimeout, ...)
_conn_request(self, conn, ...)
response.read()
response == conn.getresponse()
== httplib2.HTTPSConnectionWithTimeout.getresponse()
== http.client.HTTPSConnection.getresponse()
== http.client.HTTPResponse
http.client.HTTPResponse.read()
_readall_chunked()
Could monkey-patch doing:
class Custom(http.client.HTTPResponse):
def _readall_chunked(...):
with Tqdm(...) as pbar:
...
httplib2.HTTPConnectionWithTimeout.response_class = Custom
gdrive_file.GetContentFile()
httplib2.HTTPConnectionWithTimeout.response_class = http.client.HTTPResponse
but it's not pretty.
Opened upstream https://github.com/httplib2/httplib2/issues/165
Dear @shcheklein and @efiop, I'm starting working today on the "add import-url
support for GDrive"
Most helpful comment
also add
_download
progress (atm only uploads will provide progress). Specifically,https://github.com/iterative/dvc/blob/d42a04e64710d0da1949458d20e131a6e33eb495/dvc/remote/gdrive/__init__.py#L120
see https://github.com/iterative/dvc/pull/2875#issuecomment-560044844 (source at https://github.com/gsuitedevs/PyDrive/blob/68cea204cdcd10bab9321580164c8f4961385a0f/pydrive/files.py#L202) which means this is non-trivial to add