Dvc: hdfs: don't use hadoop CLI for getting the checksum

Created on 22 Jul 2019 · 7 comments · Source: iterative/dvc

Need to implement it for pyarrow https://issues.apache.org/jira/browse/ARROW-5995

enhancement p2-medium

All 7 comments

HDFS in Hadoop 3.1.1 started to provide the ability to calculate file checksums in 3 different modes: COMPOSITE_CRC, CRC32C, and the default MD5MD5CRC.
Links: https://cloud.google.com/blog/products/storage-data-transfer/new-file-checksum-feature-lets-you-validate-data-transfers-between-hdfs-and-cloud-storage
https://community.cloudera.com/t5/Community-Articles/Comparing-checksums-in-HDFS/ta-p/248617
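For illustration, here is a minimal sketch (mine, not from those links) of how a client could request the new COMPOSITE_CRC mode by overriding the dfs.checksum.combine.mode property when shelling out to the Hadoop CLI; the exact output format is an assumption:

```python
# Hedged sketch: ask the Hadoop 3.1+ CLI for a COMPOSITE_CRC file checksum
# by overriding dfs.checksum.combine.mode instead of the default MD5MD5CRC.
import subprocess

def hdfs_composite_crc_checksum(url):
    out = subprocess.check_output(
        [
            "hadoop", "fs",
            "-Ddfs.checksum.combine.mode=COMPOSITE_CRC",
            "-checksum", url,
        ],
        text=True,
    )
    # Expected shape (assumed): "<path>  COMPOSITE-CRC32C  <hex>"
    return out.split()[-1]
```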

But all 3 checksum calculation algorithms still depend on an initial CRC32 calculation over each of a file's blocks (from https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html: "HDFS stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance.").

HDFS checksums seem to be designed only to detect data corruption (a data integrity check), not to produce a unique hash of the content (unlike an MD5 checksum).

The CRC32 algorithms used to calculate these checksums are limited to a 32-bit output and can generate the same result for different inputs: https://network.informatica.com/docs/DOC-15035
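As a quick illustration of that 32-bit limitation (my own sketch, not from the linked doc): by the birthday bound, random inputs start colliding on CRC32 after roughly 2**16 samples, so a brute-force search finds a collision in seconds:

```python
# Demonstrate that CRC32's 32-bit output collides quickly: hash random
# 8-byte inputs until two different inputs share the same checksum.
import os
import zlib

seen = {}
while True:
    data = os.urandom(8)
    crc = zlib.crc32(data)
    if crc in seen and seen[crc] != data:
        print(f"collision: {seen[crc].hex()} vs {data.hex()} -> {crc:#010x}")
        break
    seen[crc] = data
```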

Getting an MD5 for an HDFS file can be done by reading the whole file (which seems like it might be too "expensive"). Example: https://github.com/rdsr/hdfs-checksum/blob/a299c917e0229d319c19e44ff899a4d8c22fdc39/src/clj/hdfs_checksum/core.clj#L25
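For example, a minimal sketch of that full-read approach using pyarrow's legacy HDFS client (host, port, and chunk size here are placeholders):

```python
# Stream an HDFS file through hashlib to get an honest MD5, at the cost
# of reading the entire file over the network.
import hashlib

import pyarrow as pa  # assumes pyarrow built with libhdfs support

def hdfs_md5(path, host="default", port=0, chunk_size=2 ** 20):
    fs = pa.hdfs.connect(host, port)  # legacy pyarrow HDFS API
    md5 = hashlib.md5()
    with fs.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```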

Or via the hadoop CLI on the server side: https://stackoverflow.com/a/47301778
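That is the same `hadoop fs -checksum` call as the sketch above, just without the -D override, so it reports the default MD5MD5CRC checksum, roughly:

```python
# Default-mode variant of the earlier sketch; the field layout is assumed,
# e.g. "<path>  MD5-of-0MD5-of-512CRC32C  <hex>".
import subprocess

def hdfs_default_checksum(url):
    path, algo, checksum = subprocess.check_output(
        ["hadoop", "fs", "-checksum", url], text=True
    ).split()
    return checksum
```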

Or by mounting HDFS with FUSE as a regular file system on the server side: https://opensciencegrid.org/docs/data/install-hadoop/#optional-fuse-client-configuration . Then a plain `md5sum FILEPATH` will work.

The CRC32-based MD5 is probably good enough for external dependencies as an ETag, so we could use it there. For external outputs it is indeed not unique enough, and we might resort to the honest MD5 calculation like we do with local files.

@efiop nice! Does that mean pyarrow should still be extended with hadoop-hdfs-getChecksum, as was planned?

@MaxRis Yes, let's implement it. It would at least be useful for external dependencies and scenarios like `dvc import-url hdfs://somewhere/something`.

Filed a request for a new libhdfs API method in their JIRA: https://issues.apache.org/jira/browse/HDFS-14804

For the record, during the discussion with @Suor and @shcheklein, there was a point that the MD5 of block CRCs might be prone to false negatives, which questions its usability even for external dependencies :( So maybe we should simply remove that functionality from dvc.

Also, the user here is definitely looking for the external outputs scenario for hdfs: https://discordapp.com/channels/485586884165107732/485596304961962003/628935403239374859
