Dvc: Pushing artifacts via WebDAV results in a 411 Length Required response

Created on 27 Oct 2020 · 31 comments · Source: iterative/dvc

Bug Report

I am trying to connect to a remote via WebDAV. I can correctly set up the user and password along with the URL, but when I try to push the artifacts I get a 411 Length Required response. How can I solve the missing header problem?

Please provide information about your setup

DVC version: 1.9.0 (brew)

Platform: Python 3.9.0 on macOS-10.15.7-x86_64-i386-64bit
Supports: azure, gdrive, gs, http, https, s3, ssh, oss, webdav, webdavs
Cache types: reflink, hardlink, symlink
Repo: dvc, git



All 31 comments

Hi @LucaButera

Please show the full log for dvc push -v.

2020-10-27 18:30:54,485 DEBUG: Check for update is enabled.
2020-10-27 18:30:54,487 DEBUG: fetched: [(3,)]
2020-10-27 18:30:54,487 DEBUG: Checking if stage 'as' is in 'dvc.yaml'
2020-10-27 18:30:54,494 DEBUG: Assuming '/Users/lucabutera/bolt_datasets/.dvc/cache/67/560cedfa23a09b3844c3278136052f.dir' is unchanged since it is read-only
2020-10-27 18:30:54,495 DEBUG: Assuming '/Users/lucabutera/bolt_datasets/.dvc/cache/67/560cedfa23a09b3844c3278136052f.dir' is unchanged since it is read-only
2020-10-27 18:30:54,510 DEBUG: Preparing to upload data to 'https://<user>@drive.switch.ch/remote.php/dav/files/<user>/datasets'
2020-10-27 18:30:54,510 DEBUG: Preparing to collect status from https://<user>@drive.switch.ch/remote.php/dav/files/l<user>/datasets
2020-10-27 18:30:54,510 DEBUG: Collecting information from local cache...
2020-10-27 18:30:54,511 DEBUG: Assuming '/Users/lucabutera/bolt_datasets/.dvc/cache/67/560cedfa23a09b3844c3278136052f.dir' is unchanged since it is read-only
2020-10-27 18:30:54,511 DEBUG: Assuming '/Users/lucabutera/bolt_datasets/.dvc/cache/a7/1ba7ec561a112e0af205674a767b7a' is unchanged since it is read-only
2020-10-27 18:30:54,511 DEBUG: Collecting information from remote cache...
2020-10-27 18:30:54,511 DEBUG: Querying 1 hashes via object_exists
2020-10-27 18:30:55,791 DEBUG: Matched '0' indexed hashes
2020-10-27 18:30:56,302 DEBUG: Estimated remote size: 256 files
2020-10-27 18:30:56,303 DEBUG: Querying '2' hashes via traverse
2020-10-27 18:30:58,349 DEBUG: Uploading '.dvc/cache/a7/1ba7ec561a112e0af205674a767b7a' to 'https://<user>@drive.switch.ch/remote.php/dav/files/<user>/datasets/a7/1ba7ec561a112e0af205674a767b7a'
2020-10-27 18:30:59,678 ERROR: failed to upload '.dvc/cache/a7/1ba7ec561a112e0af205674a767b7a' to 'https://<user>@drive.switch.ch/remote.php/dav/files/<user>/datasets/a7/1ba7ec561a112e0af205674a767b7a' - Request to https://<user>@drive.switch.ch/remote.php/dav/files/<user>/datasets/a7/1ba7ec561a112e0af205674a767b7a failed with code 411 and message: b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>411 Length Required</title>\n</head><body>\n<h1>Length Required</h1>\n<p>A request of the requested method PUT requires a valid Content-length.<br />\n</p>\n<hr>\n<address>Apache/2.4.18 (Ubuntu) Server at a01.drive.switch.ch Port 80</address>\n</body></html>\n'
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/dvc/cache/local.py", line 32, in wrapper
    func(from_info, to_info, *args, **kwargs)
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/dvc/tree/base.py", line 356, in upload
    self._upload(  # noqa, pylint: disable=no-member
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/dvc/tree/webdav.py", line 243, in _upload
    self._client.upload_to(buff=chunks(), remote_path=to_info.path)
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/webdav3/client.py", line 66, in _wrapper
    res = fn(self, *args, **kw)
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/webdav3/client.py", line 438, in upload_to
    self.execute_request(action='upload', path=urn.quote(), data=buff)
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/webdav3/client.py", line 226, in execute_request
    raise ResponseErrorCode(url=self.get_url(path), code=response.status_code, message=response.content)
webdav3.exceptions.ResponseErrorCode: Request to https://<user>@drive.switch.ch/remote.php/dav/files/<user>/datasets/a7/1ba7ec561a112e0af205674a767b7a failed with code 411 and message: b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>411 Length Required</title>\n</head><body>\n<h1>Length Required</h1>\n<p>A request of the requested method PUT requires a valid Content-length.<br />\n</p>\n<hr>\n<address>Apache/2.4.18 (Ubuntu) Server at a01.drive.switch.ch Port 80</address>\n</body></html>\n'
------------------------------------------------------------
2020-10-27 18:30:59,680 DEBUG: failed to upload full contents of 'as', aborting .dir file upload
2020-10-27 18:30:59,680 ERROR: failed to upload '.dvc/cache/67/560cedfa23a09b3844c3278136052f.dir' to 'https://<user>@drive.switch.ch/remote.php/dav/files/<user>/datasets/67/560cedfa23a09b3844c3278136052f.dir'
2020-10-27 18:30:59,680 DEBUG: fetched: [(1925,)]
2020-10-27 18:30:59,682 ERROR: failed to push data to the cloud - 2 files failed to upload
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/dvc/command/data_sync.py", line 50, in run
    processed_files_count = self.repo.push(
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/dvc/repo/__init__.py", line 51, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/dvc/repo/push.py", line 35, in push
    return len(used_run_cache) + self.cloud.push(used, jobs, remote=remote)
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/dvc/data_cloud.py", line 65, in push
    return self.repo.cache.local.push(
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/dvc/remote/base.py", line 15, in wrapper
    return f(obj, named_cache, remote, *args, **kwargs)
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/dvc/cache/local.py", line 427, in push
    return self._process(
  File "/usr/local/Cellar/dvc/1.9.0/libexec/lib/python3.9/site-packages/dvc/cache/local.py", line 396, in _process
    raise UploadError(fails)
dvc.exceptions.UploadError: 2 files failed to upload
------------------------------------------------------------
2020-10-27 18:30:59,687 DEBUG: Analytics is disabled.

This is the full output, minus the user field, which is changed to <user> for privacy reasons.
Hope this can help.

Any info about the server? At first glance it seems like the server is not understanding the chunked upload. Might be missing something though. CC @iksnagreb

The server is Switch Drive, a cloud storage provider based on ownCloud. I would assume the WebDAV server is the same as ownCloud's, but I don't have further info.

@LucaButera Thanks! So it might be a bug in https://github.com/ezhov-evgeny/webdav-client-python-3 , need to take a closer look.

@efiop After browsing the source code it seems plausible to me. Mind that I am able to connect to the server through the macOS Finder, so it doesn't seem to be a server issue.

Sadly, the upload_to method in the webdav library does not allow passing headers, even though the underlying method that performs the request accepts custom headers. They even have an open issue regarding this subject: https://github.com/ezhov-evgeny/webdav-client-python-3/issues/80

One solution might be to emulate the upload_to method and call execute_request directly on the Client object, while hoping for the resolution of the mentioned issue in the meantime. Otherwise, one could open a PR on https://github.com/ezhov-evgeny/webdav-client-python-3 so that custom headers can be passed directly to the upload_to method.

I am willing to help: I could download both the DVC and webdav-client source code and try out these modifications myself, just to report whether adding the header fixes the issue. I just don't know how to trigger dvc push using the modified source.

Any info about the server? At first glance it seems like the server is not understanding the chunked upload. Might be missing something though. CC @iksnagreb

@efiop @LucaButera Can we try to figure out whether it is really (only) the chunked upload and not something else?

@LucaButera If you have a copy of the dvc repository and some time to try something: it should be quite easy to change the _upload method of the WebDAVTree to use the upload_file method, which iirc does not chunk the file.

https://github.com/iterative/dvc/blob/master/dvc/tree/webdav.py#L243

You would have to change the last line
self._client.upload_to(buff=chunks(), remote_path=to_info.path)
to
self._client.upload_file(local_path=from_file, remote_path=to_info.path)

If this modification lets you upload files, we can be pretty sure it is the chunking or a bug in the webdavclient upload_to method. Note that this will disable the progressbar, so it might seem as if it is hanging...
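In context, the change looks roughly like this (a sketch only; the surrounding signature is assumed from other dvc trees, not copied from the file):

def _upload(self, from_file, to_info, name=None, no_progress_bar=False, **_kwargs):
    # ... directory creation, progress bar and the chunks() generator live here ...
    # Original streaming call (sends the request with Transfer-Encoding: chunked):
    # self._client.upload_to(buff=chunks(), remote_path=to_info.path)
    # Experimental whole-file call (lets the client compute Content-Length):
    self._client.upload_file(local_path=from_file, remote_path=to_info.path)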

I assume you have no valid dvc cache at the remote yet (as uploading does not work at all)? So you cannot check whether downloading is working?

Before trying to upload the file, the parent directories should be created (e.g. datasets/a7); could you please check whether this was successful?

@efiop @iksnagreb I will try to modify the source in the afternoon and report to you.

Concerning the creation of the base folders: yes, they get created, so the connection to the server should be working.

@LucaButera, to see if the chunking upload is the issue, you could also try sending a curl request with chunking upload:

$ curl --upload-file test.txt https://<user>@drive.switch.ch/remote.php/dav/files/<user>/test.txt -vv --http1.1 --header "Transfer-Encoding: chunked"

Also, check without that header. If the files upload successfully in both cases, something's wrong with the library. If only the request without the header succeeds, chunked upload might have been forbidden on the server entirely.
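For comparison, the same request without the header (curl then sends a Content-Length itself) would be:

$ curl --upload-file test.txt https://<user>@drive.switch.ch/remote.php/dav/files/<user>/test.txt -vv --http1.1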

@skshetry I tried your suggestion, which seemed quicker. Indeed, without the header it correctly uploads the file, while with chunked upload it returns 411, as with dvc.

@iksnagreb @efiop Do I have any way to perform non-chunked upload in DVC? Or do I have no choice but to contact the provider and hope they can somehow enable chunked upload?

Do I have any way to perform non-chunked upload in DVC?

Hm, I do not think this is possible right now - at least for the WebDAV remote. It should be possible to implement an option to enable non-chunked upload; the problem I see is that this would also disable the progressbar (without chunking, we cannot count progress...), which is not obvious and might confuse users. @efiop Are there options for disabling chunking for other remotes, and if yes, how do those handle that problem?

Or I have no choice but to contact the provider and hope they can somehow enable the chunked upload?

I think selecting chunked/non-chunked upload could be a configuration option (if we can find a way to handle this conveniently); there are probably other cloud providers disallowing chunked upload as well...

@LucaButera, did you try @iksnagreb's suggestion? If that works, we could provide a config for disabling it.

If that didn't work, I am afraid there's no easier solution than to contact the provider. Nextcloud/Owncloud supports a non-standard WebDAV extension for chunked upload for these kinds of situations, but it's unlikely we are going to support it.

@iksnagreb actually, it could be intuitive to have an option on dvc push for non-chunked upload, for example by setting jobs to 0.

@skshetry I am trying it, I just downloaded dvc source and I'm trying to figure it out. Will report back soon.

@skshetry I can confirm that @iksnagreb's suggestion works: I have been able to push and pull from the WebDAV storage. Moreover, the progressbar works, but it updates less frequently, probably once per file upload.

What should I do next?

Then let's think about implementing something like dvc remote modify <remote> chunked_upload false (I think true should be the default). Maybe chunked_transfer or just chunked would be a better name, as this might apply to download as well?
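For illustration, usage of such an option (hypothetical at this point, not implemented) could look like:

$ dvc remote modify <remote> chunked false   # hypothetical option name
$ dvc push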

@iksnagreb I think chunked is a good choice. Otherwise, as I suggested previously, it could be an idea to have non-chunked behavior with --jobs 0, as I imagine having multiple jobs works only when using chunks. But I might be wrong.

Hm, I think jobs just controls how many upload processes to start in parallel; each of these could then be a chunked or non-chunked transfer. You might be right that more jobs make more sense with chunking (as it allows for more parallel transmission and reading from disk), so there is probably not much (performance) benefit from a single chunked job. But I do not know much about the jobs thing (@efiop?).

However, I think of chunking more as a choice at the communication/transmission level between server and client (where the client needs to match what the server can understand). Furthermore, chunking allowed us to implement the progressbar per file; iirc, that was the reason to use chunked upload in the first place.

@iksnagreb then I think having something like chunked false, with true as the default, should be a nice solution.

It could also be overridden by a command-line option on dvc push and dvc pull. That would allow the user to change the setting on the fly in a non-permanent way. I don't know if it has use cases, but it should be easy to implement.

Seems like adding a config option for it would greatly worsen the UI. Not having a progress bar is a very serious thing. I also don't like the idea of introducing a CLI option, because that seems out of place. Plus, it potentially breaks future scenarios in which dvc would push automatically.

I'm genuinely surprised that this problem even exists; I hope we are not simply missing some info here.

If I understand the situation correctly, if we introduce that option in any way, it will also result in people running into timeout errors for big files. This is unacceptable for dvc: we store files without chunking them (at least for now, there are some plans: https://github.com/iterative/dvc/issues/829), so webdav uploads would break for big files (files might be gigabytes or much bigger), which is our core use case. This is a dealbreaker.

As pointed out by @skshetry, this is likely a provider problem, so I would look for a solution there. I didn't look deeply into https://docs.nextcloud.com/server/15/developer_manual/client_apis/WebDAV/chunking.html , but that seems like a feature request for our webdav library and not for dvc, right? Or am I missing something?

[...] it will also result in people running into timeout errors for big files [...]

Uff, yes, I did not even think about this yet... You probably do not want to adjust the timeout config depending on your expected file size, so chunked transmission is the only solution to avoid timeouts per request.

@efiop I think you are right about the large-files issue. Tell me if I got this straight: the problem here is not chunking being enabled or not, but rather the fact that chunking is implemented in a peculiar way in this provider's WebDAV. Is this correct?

Mind that this platform is based on ownCloud and not Nextcloud. I don't know if that is relevant.

I'm also facing a similar but slightly different issue with "Nextcloud + mod_fcgi" (which is a bug in httpd2), in which files are uploaded empty.

The original issue might be due to that bug (not fixed yet) or to this bug, which was only fixed 2 years ago (OP's server is Apache 2.4.18, whereas the recent one is 2.4.46).

Sabredav's wiki has a good insight into these bugs:

Finder (on OS X) uses Transfer-Encoding: Chunked in PUT request bodies. This is a little-used HTTP feature, and therefore not implemented in a bunch of web servers. The only server I've seen so far that handles this reasonably well is Apache + mod_php. Nginx and Lighttpd respond with 411 Length Required, which is completely ignored by Finder. This was seen on Nginx 0.7.63. It was recently reported that a development release (1.3.8) no longer had this issue.

When using this with Apache + FastCGI PHP completely drops the request body, so it will seem as if the PUT request was successful, but the file will end up empty.

So, the best thing to do is either to drop "chunked" requests on PUT or to introduce a config to disable them.

Not having a progress bar is a very serious thing

@efiop, as webdavclient3 uses streaming upload, we can still support progress bars:

with open(file, "rb") as fd:
    with Tqdm.wrapattr(fd, "read", ...) as wrapped:
        self._client.upload_to(buff=wrapped, remote_path=to_info.path)

Look here for the change:
https://github.com/iterative/dvc/blob/f827d641d5c2f58944e49d2f6537a9ff09e447e1/dvc/tree/webdav.py#L224
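For what it's worth, a fuller sketch of that idea inside _upload could look like the following (the total/desc/disable arguments are assumptions based on tqdm's wrapattr API, not the final implementation):

import os

from dvc.progress import Tqdm

def _upload(self, from_file, to_info, name=None, no_progress_bar=False, **_kwargs):
    total = os.path.getsize(from_file)
    with open(from_file, "rb") as fd:
        # wrapattr intercepts fd.read() calls to advance the progress bar,
        # while webdavclient3 keeps streaming from the wrapped file object
        with Tqdm.wrapattr(
            fd, "read", total=total, desc=name, disable=no_progress_bar
        ) as wrapped:
            self._client.upload_to(buff=wrapped, remote_path=to_info.path)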

but that seems like a feature request for our WebDAV library and not for DVC, right? Or am I missing something?

The Owncloud Chunking (NG) might be too slow for our use case, as it needs a separate request for each chunk (and then a "MOVE" request that joins all the chunks, which is again expensive). So, unless we change our upload strategy to parallelize chunk upload rather than file upload, we would make it 3-4x slower, just for the sake of having a progress bar.
And, it seems it's possible to have a progress bar without it.
Not to mention that it's not a WebDAV standard; it's unsupported outside of Nextcloud and Owncloud.

it will also result in people running into timeout errors

I don't think there is any way around timeout errors, especially if we talk about PHP-based WebDAV servers (they have a set max_execution_time). The Owncloud Chunking NG exists for this very reason.

Though, we could just chunk the file ourselves on upload and then assemble it during pull. I think this is what the rclone chunker does.
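For illustration only, that idea boils down to something like this (all names hypothetical; this is not dvc or rclone code):

import os
import shutil

CHUNK_SIZE = 64 * 1024 * 1024  # arbitrary part size, well below server timeouts

def split_file(path):
    # Yield (part_name, data) pairs to be uploaded as separate objects
    base = os.path.basename(path)
    with open(path, "rb") as fd:
        for i, data in enumerate(iter(lambda: fd.read(CHUNK_SIZE), b"")):
            yield f"{base}.{i:06d}", data

def assemble_file(part_paths, out_path):
    # Concatenate downloaded parts, in name order, into the original file
    with open(out_path, "wb") as out:
        for part in sorted(part_paths):
            with open(part, "rb") as fd:
                shutil.copyfileobj(fd, out)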

For closing this issue, we could just disable chunking upload via a config or by default.

@skshetry it would be wonderful to have a simple solution like that.

On the other hand, a more reliable solution like the "assemble on pull" one also seems a nice feature in the long run.

I have never contributed to open source projects, but I am willing to help if needed, as I think DVC is a much-needed tool.

I am willing to help if needed

@LucaButera, that'd be great. See if the above snippet works. Also, make sure you test a few scenarios manually (we lack tests for webdav, though they will be added soon).

If you face any issues, please comment here or ask on #dev-talk on the Discord. Thanks.

@skshetry Ok, I'll test a few scenarios, namely:
1) Loading many small files.
2) Loading a really large file.
3) Loading a realistic folder.

Just a question: do you need me to simply test a few cases with the snippet above, or do I need to open a PR implementing the snippet and the related config needed to use it?

@LucaButera, it'd be great if you could make a PR. Thanks. Check the contributing guide for setup.

the related config needed to use it?

Maybe there's no need for the config, but we can decide that in the PR discussion.

@LucaButera @skshetry FYI: lighttpd supports PUT with Transfer-Encoding: chunked since lighttpd 1.4.44, released almost 4 years ago. lighttpd 1.4.54, released a year and a half ago, has major performance enhancements for lighttpd mod_webdav and large files.

What version of lighttpd are you having trouble with?

@gstrauss, thanks for participating and for the info. I was quoting from Sabredav's wiki, which is more than 6 years old, so it might not be up to date. And we were not using lighttpd; @LucaButera's server is Apache 2.4.18, which is ~5 years old, whereas mine is Apache 2.4.46.

But we'll bump into old web servers, so we have to err on the side of caution and just remove chunked upload (are there any disadvantages or performance hits to that?).

just remove chunked upload (are there any disadvantages or performance hits to that?)

If you already know the content length on the client side, then there should be no performance hit.

If the upload is generated content, then the content would have to be cached locally on the client first, to determine the content length when Transfer-Encoding: chunked is not being used. There can be a performance hit and additional local resource usage in doing so.

@gstrauss @skshetry So are you suggesting completely removing the option for chunked upload? Doesn't this pose an issue with the upload of large files?

@LucaButera, we stream-upload the file, so it does not affect memory usage. There should not be any issues with this approach that were not already there.
