Dvc: Google Drive support further enhancements

Created on 28 Nov 2019  Â·  20Comments  Â·  Source: iterative/dvc

This ticket is to keep track of required further improvements for Google Drive remote implementation after https://github.com/iterative/dvc/pull/2551 will be merged into master:

  • [x] Processing of auth exceptions and printing more meaningful error message on failure. https://github.com/iterative/dvc/pull/2551#discussion_r348182771

  • [x] Implement gc command support for gdrive ( def remove(self, path_info) overloading )

  • [x] Validate resolved remote file object to have expected title ( def resolve_remote_item_from_path(self, path_parts, create): )

  • [x] Simplify a few things using new and shiny @wrap_prop and filter_errors param of @retry from funcy 1.14 :)

  • [x] Make name -> id deterministic by choosing minimum id https://github.com/iterative/dvc/pull/2551#discussion_r350424816

  • [x] Reimplement caching to have 2 entry points .path_info_to_ids() and .id_to_path_info(). The idea is to remove cache dictionaries completely, but it will require introduction of helper method which accepts parent_id, title and returns the actual id ( this method introduces 1 -> 1 relation between input and resulting remote id ). https://github.com/iterative/dvc/pull/2551#discussion_r350213128

  • [x] Simplify resolve_remote_item_from_path https://github.com/iterative/dvc/pull/2551#discussion_r351940507

  • [x] Protect create_remote_dir with lock. https://github.com/iterative/dvc/pull/2551#discussion_r350428630

  • [x] Organize methods, params in ordered and unified way. https://github.com/iterative/dvc/pull/2551#discussion_r350152633

  • [x] Enhance retrieving of HTTP error from GDrive API exceptions https://github.com/MaxRis/dvc/commit/783d8b610db9da2bdcee1817cd90366b435e81eb#r36184990

  • [x] Create Iterative Google Drive account and Project to share client id and secret with final DVC users. Highest possible API usage limits quotas should be requested. Probably, it will be needed to have separate Google Project for CI. https://github.com/iterative/dvc/pull/2551#discussion_r351936393

  • [x] Stop creating a path to the DVC remote root if it does not exist. Since Google Drive allows multiple folders with the same name (at least on in My Drive and one in Shared With Me) in case a path like gdrive://root/storage is used to access, collaborators see a storage empty folder after the first dvc pull attempt`. Instead we should just throw path does not exist.

  • [x] Hide /Users/ivan/Projects/test-gdrive/.env/lib/python3.7/site-packages/oauth2client/_helpers.py:255: UserWarning: Cannot access /Users/ivan/Projects/test-gdrive/.dvc/tmp/gdrive-user-credentials.json: No such file or directory warnings.warn(_MISSING_FILE_MESSAGE.format(filename)) on the first auth.

  • [x] Reconsider self.no_traverse = False. Now even to pull a single file we run the full traversal. We can at least start listing only prefixes we need (e.g. remote/0c/* if we need to check if file 0c1234...ef exists), we can use a parallel exists if we need to check less than 256 files, etc.

  • [x] Put notes in docs that path like gdrive://root is not accessible by other people - that ID must be used to actually share data with other team members.
  • [x] We should store one credentials file per remote
  • [x] ~Do not allow empty root DVC remote path (except shared?). I think it can prevent a lot of strange issues. More on this here https://github.com/iterative/dvc/issues/3586~

  • [x] Add _download progress https://github.com/iterative/dvc/issues/2865#issuecomment-560487931 -> #3722

  • [x] Support file streaming in dvc.api.open() function (more details in #3408) and and update docs

  • [ ] Add import-url support

  • [ ] Check that close() is handles in the dvc.api.open() context manager.

  • [ ] Support show URL in dvc get for Gdrive. We can generate a HTTPS link that can be used in a browser if user is authed. The same link as you would get if you download a file from UI.

  • [ ] Fix credentials management for external repos (when we do dvc get, etc). We should have a way to cache them.

  • [ ] Support external dependencies and outputs

  • [ ] Check that dvc get-url functions properly. Need to come up with some credentials management.

  • [ ] Review and add more tests if needed. Starting from API tests.

  • [ ] Add an explanation comment about how path vs ids work, all non 1-1 stuff. Someone running into GDrive for first time will appreciate https://github.com/iterative/dvc/pull/2551#pullrequestreview-311748384. Explain how we deal with it, how caching is involved too.

  • [ ] Notify user on retries, keep the message on the screen if they are happening at a certain rate

  • [ ] Consider raising an exception if there are multiple remote root directories with the same name

  • [ ] Move _gdrive_* helpers into a separate module api to simplify testing and reading the code

  • [ ] When it's more or less stable make it trusted by default

  • [ ] Check if there is a way to generate URLs for GDrive publicly available files to download them w/o auth. Examples: https://github.com/NVlabs/stylegan/blob/master/pretrained_example.py and https://github.com/NVlabs/stylegan/blob/master/dnnlib/util.py . dvc get ideally should work w/o asking to Auth for public objects.

  • [ ] Improve performance, pass fields, consider using batch for exists call - https://developers.google.com/drive/api/v2/performance , see example here https://github.com/iterative/PyDrive2/issues/42

enhancement help wanted p2-medium

Most helpful comment

also add _download progress (atm only uploads will provide progress). Specifically,

https://github.com/iterative/dvc/blob/d42a04e64710d0da1949458d20e131a6e33eb495/dvc/remote/gdrive/__init__.py#L120

see https://github.com/iterative/dvc/pull/2875#issuecomment-560044844 (source at https://github.com/gsuitedevs/PyDrive/blob/68cea204cdcd10bab9321580164c8f4961385a0f/pydrive/files.py#L202) which means this is non-trivial to add

All 20 comments

@MaxRis would be helpful if you include links to the discussions in the initial PR.

@shcheklein sure :+1:

also add _download progress (atm only uploads will provide progress). Specifically,

https://github.com/iterative/dvc/blob/d42a04e64710d0da1949458d20e131a6e33eb495/dvc/remote/gdrive/__init__.py#L120

see https://github.com/iterative/dvc/pull/2875#issuecomment-560044844 (source at https://github.com/gsuitedevs/PyDrive/blob/68cea204cdcd10bab9321580164c8f4961385a0f/pydrive/files.py#L202) which means this is non-trivial to add

@casperdcl thanks, have added this to the list

@efiop @Suor could you please review the order of priority for available things to be addressed in this ticket? Thanks!

@MaxRis maybe start from tests, then pick whatever you like.

One more GDrive related issue that can be added to the list https://github.com/iterative/dvc/issues/3015

Hi all, here is a screenshot of an issue I have with pushing to gdrive, and the subsequent creation of duplicate folders
Screenshot from 2020-01-14 10-17-39

@tierney6 Thanks for reporting! @MaxRis is currently researching the issue. Stay tuned! 🙂

(Added "Support file streaming in dvc.api.open() function (more details in #3408)" checkbox.)

@MaxRis @Suor do you remember what Validate resolved remote file object to have expected title ( def resolve_remote_item_from_path(self, path_parts, create): ) was about?

@shcheklein just to put assert to check remote file has expected title. The idea was to add more validation of data received from GDrive API. For example, if we resolved file by id, it makes sense to validate the title with expected.

@MaxRis k, so meaning that we need to put an assert in the _get_remote_item ?

yes, exactly

Woah this one is huge! Lots of things have been done too 🎉
Maybe move the remaining stuff into smaller issues and close this one? 🙂

@jorgeorpinel it's easier for me to manage it this way vs creating a long tail of small tickets. Let's keep it for now.

Put notes in docs that path like gdrive://root is not accessible by other people - that ID must be used to actually share data with other team members.

I'm addressing this in https://github.com/iterative/dvc.org/pull/1096#pullrequestreview-386926510 ✅

Also, how about

  • Add dvc config gdrive.client_id and gdrive.client_secret config options so you don't have to set it up per remote

? Maybe also for service account params.

Also, how about

not sure about this, I don't see that much difference with other remotes and other params they have

Regarding adding _download progress, it's super hard because of the sheer stack of libraries and functions which were not designed for any form of iteration or callbacks in mind. There are a lot of places where it may look like progress could be added but really you'd only be monitoring a pipe after the file has been downloaded.

CC @shcheklein the stack:

  • dvc.remote.gdrive.GDriveRemote._gdrive_download_file()

    • gdrive_file.GetContentFile()

    • == pydrive2.files.GoogleDriveFile.GetContentFile()

    • FetchContent()



      • _DownloadFromUrl()


      • self.http.request()


      • == httplib2.Http.request()


      • > The return value is [...] a string that contains the response entity body





        • _request(self, conn : httplib2.HTTPSConnectionWithTimeout, ...)



        • _conn_request(self, conn, ...)







          • response.read()




          • where response == conn.getresponse()




          • == httplib2.HTTPSConnectionWithTimeout.getresponse()




          • == http.client.HTTPSConnection.getresponse()




          • == http.client.HTTPResponse




          • http.client.HTTPResponse.read()




          • _readall_chunked()









            • can easily add progress here















Could monkey-patch doing:

class Custom(http.client.HTTPResponse):
    def _readall_chunked(...):
        with Tqdm(...) as pbar:
            ...

httplib2.HTTPConnectionWithTimeout.response_class = Custom
gdrive_file.GetContentFile()
httplib2.HTTPConnectionWithTimeout.response_class = http.client.HTTPResponse

but it's not pretty.

Opened upstream https://github.com/httplib2/httplib2/issues/165

Dear @shcheklein and @efiop, I'm starting working today on the "add import-url support for GDrive"

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ynop picture ynop  Â·  41Comments

yukw777 picture yukw777  Â·  45Comments

jorgeorpinel picture jorgeorpinel  Â·  45Comments

dmpetrov picture dmpetrov  Â·  35Comments

dmpetrov picture dmpetrov  Â·  64Comments