Node: Resumable hash operations

Created on 13 Apr 2019 · 4 comments · Source: nodejs/node

Is your feature request related to a problem? Please describe.

At present, calculating the hash of a large file needs to be done in one go, in a single process.

It would be useful to be able to stop and later resume, perhaps in another process or on another machine entirely.

This is particularly useful where the files are large (10GB+).

Use cases:

  1. Process crashes halfway through hashing a large file. It would be useful to be able to resume from where it left off rather than start again from scratch.
  2. Computer 1 has the first half of a file and Computer 2 has the second half. Transferring the file contents from one computer to the other would be expensive, so we want to hash the first half of the file on Computer 1 and finish the hash on Computer 2.
  3. Computer 1 is hashing large file. Part way through, Computer 1's resources are required for a higher-priority task. It would be useful to stop the hashing on Computer 1 and allow Computer 2 which is idle to finish it off.
  4. Hashing large files on short-lived processes e.g. AWS Lambda, where one invocation cannot complete the entire file.

As background, my personal use case is dealing with 100GB+ video files which need to be hashed with SHA1 and MD5. The files are stored in chunks which are distributed across many machines. Many of the machines are geographically distant from each other and connected to each other by slow networks. So transferring the full file to a single machine is slow and expensive.

Describe the solution you'd like

I imagine an API something like this:

// --- Process A: start hashing ---
const crypto = require('crypto');
const hash = crypto.createHash('sha1');

hash.update('data 1');
hash.update('data 2');

// Get the hash's internal state so far (proposed API)
const state = hash.getState();
const json = JSON.stringify(state);

// --- Process B: later on, perhaps on another machine entirely ---

// Reinstate the hash's state (proposed API)
const state = JSON.parse(json);
const hash = crypto.createHash('sha1');
hash.setState(state);

// Continue hashing
hash.update('data 3');
hash.update('data 4');

// Get digest of 'data 1' + 'data 2' + 'data 3' + 'data 4'
console.log(hash.digest('hex'));

I envisage it also working with the streaming interface.

The state object returned by .getState() and passed to .setState() would differ depending on the hash algorithm.
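To illustrate what algorithm-specific state might look like, here is a hypothetical shape for a SHA-1 state object. The field names are assumptions for illustration, not a proposed spec; the initial working variables shown are SHA-1's standard initialization constants.

```json
{
  "algorithm": "sha1",
  "h": ["0x67452301", "0xEFCDAB89", "0x98BADCFE", "0x10325476", "0xC3D2E1F0"],
  "buffer": "",
  "length": 0
}
```

Here `h` holds the five 32-bit working variables, `buffer` any unprocessed partial block, and `length` the total bytes hashed so far. An MD5 state would carry four working variables instead of five, which is why the state object must be algorithm-dependent.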

OpenSSL appears to have interfaces for accessing the internal state of a hash (e.g. EVP_MD_CTX_copy_ex()).

This also appears to be possible in Python: https://github.com/kislyuk/rehash

Describe alternatives you've considered

For some use cases, an alternative would be to hash the file in chunks and store an array of chunk hashes. The integrity of the file can then be determined later by comparing each chunk's hash to what's in the store.

However, in my case, the source of the file provides a hash of the entire file. To ensure the file has been transferred to me without corruption, I need to calculate the hash of the entire file.

Labels: crypto, feature request, wontfix

All 4 comments

I would be happy to assist in whatever way I can to make this happen. Unfortunately, I have never used C/C++ in my life, so I think doing the actual implementation is way beyond me, but I'd be happy to write tests, write docs, benchmark etc.

@nodejs/crypto

I am sorry, but this sounds like a very unusual use case to me: Hashing assumes a continuous stream of data. That alone doesn't mean we won't help you, but there is no standardized way of encoding the internal state: How would you transfer the EVP_MD_CTX to another process or machine? The contents of the EVP_MD_CTX structure depend on the OpenSSL version, the hashing algorithm, and possibly also on the system architecture. EVP_MD_CTX, EVP_MD_CTX_copy and EVP_MD_CTX_copy_ex are not designed for transfers across processes or systems. While the algorithms have to comply with specifications, the internal state is up to the implementation.

For some use cases, an alternative would be to hash the file in chunks and store an array of chunk hashes

You can also use Merkle trees, they are usually more efficient.

However, in my case, the source of the file provides a hash of the entire file. To ensure the file has been transferred to me without corruption, I need to calculate the hash of the entire file.

Out of curiosity, why did you split the file across systems with slow network connections in the first place?

I'm going to close out the issue because there is no reasonable way to implement this feature. Thanks anyway for the report.

