Client: Enable content checksums

Created on 20 Jan 2016  路  13Comments  路  Source: owncloud/client

Due to #3235 and #2983 we use content checksums for detecting unnecessary uploads and renames-that-aren't. But content checksums are currently only computed for .eml files.

We should consider enabling content checksumming for everything.

Enhancement ReadyToTest

All 13 comments

:+1:

I still vote for using a safe cryptograpgical checksum so we can later use it for more advanced magic.
But if @ckamm says we can anyway later switch...

@LukasReschke @ogoffart What do you think?

@moscicki your input here?

Maybe also from @PVince81 and @DeepDiver1975 about storage backends (on the server) natively supporting certain checksums so we can align with them.

Discussion on IRC regarding checksum type:

  • Performance of checksumming will rarely matter compared to the cost of uploads and downloads, we can use SHA1 or better.
  • We should choose a content checksum type that's supported by the server as transmission checksum. Otherwise we might have to compute two checksums.
  • The content checksum could be computed while we're reading the file into memory anyway, instead of as a separate pass.
  • The transmission checksum is currently part of the header of the last chunk's PUT request and must thus be computed in advance.

Other jobs that need to be done to expand our use of checksums:

  • Does the 'transmission checksum' vs 'content checksum' distinction make sense? Currently the server only sees the 'transmission checksum' and we discard the checksum we get from the server because the journal only stores the 'content checksum'.
  • Ensure we have a content checksum for downloaded files: either use the checksum the server sends or compute it after downloading.

Please consider that some storage backends such as ours already have a preferred checksum (adler32 in our case). It may be impossible for us to change the checksum type. More precisely, we have a technical ability to change the checksum type (for new files) but we may be constrained to do so by other considerations/policies. And also, consider that currently all our files are already checksummed with adler32 so any checksum type change could only apply to new files (so we would have a mix of checksum types in the system).

The adler32 checksum is not efficient for change detection for small files. Hence using it as content checksum may be problematic for files smaller than a given size (say 1KB). Also, consider this especially if you think in the future of storing checksum blocks which are by definition rather small. So if we want to avoid to compute additional content checksum then the content checksum mechanism with adler32 should be only enabled for files above the critical size (meaning that below that size you should count with a possibility of collision and false-positive). So that would option one but it is not optimal because it would not work for very small files (which I guess defeats the purpose for *.eml files).

So I think at least for us the distinction between content and transmission checksum makes sense. What is currently called a transmission checksum is actually enforced by the server and client cannot change it (another example: MD5 for Amazon S3). What is called content checksum is actually what client wants to use locally to achieve better operation with local change detection. In some cases both can be aligned, in other cases they cannot.

If we wanted to add additional content checksum than I would suggest:

  • on PUT: also send it as additional special header (e.g. OC-Client-Content-Checksum) just in case server decided to make use of this information in some way (or just store it as additional metadata). I would only consider to add this header unless you think that it is a better implementation choice to avoid the second pass on the client.
  • on GET: if transmission checksum does not match the type you expect you should compute the content checksum yourself and store locally. You may want to consider to store also the transmission checksum in case some correlation between the client and server checksums would be needed. This checksum however, is not normally needed by the client and should be included in a way which does not increase the memory footprint of the client (if that's possible at all)

I am open to discuss it further, these are just a bunch of initial thoughts but the initial constraints I described are valid.

A wishful note: Using two differently calculated checksums against same file often increase reliability.

Hm, I got pointed to this: https://tools.ietf.org/html/rfc3230
Maybe we should be compatible?

Created owncloud/core#22711 about this.

Pull request #4532

Hello,

In 1.7.2 we have the content checksums implemented and enabled. Will this pull request affect the current transmission checksum implementation?

Thank you,

kuba

@moscicki This has no effects on transmission checksums, it's only about checksums that the client stores in its database.

Was this page helpful?
0 / 5 - 0 ratings