dvc add --external fails on hdfs directory

Created on 4 Aug 2020 · 10 comments · Source: iterative/dvc

Bug Report

dvc add --external hdfs://... fails when the target is a directory.

Please provide information about your setup

dvc add --external hdfs://...

Output of dvc version:

$ dvc version
DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.8.2 on Linux-3.10.0-1127.el7.x86_64-x86_64-with-glibc2.17
Supports: hdfs, http, https
Cache types: hardlink, symlink
Repo: dvc, git

Additional Information (if any):

If applicable, please also provide the --verbose output of the command, e.g. dvc add --verbose.

2020-08-04 17:26:41,289 DEBUG: fetched: [(3,)]
Adding...SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hanmail/connex/opt/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hanmail/connex/opt/hadoop-2.6.0/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hanmail/connex/opt/hadoop-2.6.0/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Adding...
2020-08-04 17:26:44,920 DEBUG: fetched: [(4,)]
2020-08-04 17:26:44,922 ERROR: hdfs command 'hadoop fs -checksum hdfs://search-hammer-analyzer1.dakao.io:9000/output/kurt/dvc_test_source/test1/' finished with non-zero return code 1': b"checksum: `hdfs://search-hammer-analyzer1.dakao.io:9000/output/kurt/dvc_test_source/test1': Is a directory\n"
------------------------------------------------------------
Traceback (most recent call last):
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/command/add.py", line 17, in run
    self.repo.add(
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/repo/__init__.py", line 34, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/repo/add.py", line 90, in add
    stage.save()
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/stage/__init__.py", line 380, in save
    self.save_outs()
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/stage/__init__.py", line 391, in save_outs
    out.save()
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 279, in save
    if not self.changed():
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 221, in changed
    status = self.status()
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 218, in status
    return self.workspace_status()
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 206, in workspace_status
    if self.changed_checksum():
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 194, in changed_checksum
    return self.checksum != self.get_checksum()
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 180, in get_checksum
    return self.tree.get_hash(self.path_info)
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/tree/base.py", line 268, in get_hash
    hash_ = self.get_file_hash(path_info)
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/tree/hdfs.py", line 167, in get_file_hash
    stdout = self.hadoop_fs(
  File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/tree/hdfs.py", line 155, in hadoop_fs
    raise RemoteCmdError(self.scheme, cmd, p.returncode, err)
dvc.tree.base.RemoteCmdError: hdfs command 'hadoop fs -checksum hdfs://.../test1/' finished with non-zero return code 1': b"checksum: `hdfs://.../test1': Is a directory\n"
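(Note: hadoop fs -checksum is only defined for files, which is exactly what the error above says. As a hedged aside, hadoop fs also has a -test switch that can distinguish directories from files up front; the elided path below reuses the placeholder from the report:)

$ hadoop fs -test -d hdfs://.../test1 && echo "is a directory"
$ hadoop fs -test -f hdfs://.../test1 && echo "is a regular file"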
Labels: bug, p2-medium, research

All 10 comments

Hi @rkmwiaim!

hadoop fs -checksum hdfs://.../test1/

Hm, this path looks very strange, need to take a closer look at the code.

Btw, we usually consider --external a very advanced feature and don't really recommend using it unless there are very good reasons to. For example, the checksum reported by hdfs is not really a hash, so it could potentially collide with other values (not an issue for regular scenarios where you use hdfs as a remote to push/pull to). Could you tell us about your scenario?
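(As a hedged illustration of the "not really a hash" point: hadoop fs -checksum reports an MD5-of-MD5-of-CRC32 value, and the algorithm name itself encodes the bytes-per-CRC and CRCs-per-block settings, so the result depends on the block configuration as well as the file content. The path and digest below are made up:)

$ hadoop fs -checksum hdfs://.../some_file
hdfs://.../some_file  MD5-of-0MD5-of-512CRC32C  0000020000000000000000007a6b...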

Thanks for the reply, @efiop

I hid the middle of the path hdfs://.../test1/ for security reasons.

Anyway, I want to use hdfs as external data storage. The data comes from Spark and will be used as training data for a deep learning model. So the data is stored in hdfs, and a deep learning framework like TensorFlow will stream it directly.

I didn't know that the hdfs checksum is not a hash. I think I should use hdfs only as remote storage (not with --external).
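(For that push/pull approach, a minimal hedged sketch; the remote name and URL below are placeholders:)

$ dvc remote add -d hdfs_remote hdfs://namenode:9000/dvc-storage
$ dvc add path/to/local/training_data
$ dvc push    # uploads the cached data to hdfs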

@rkmwiaim Are there any symlinks in /output/kurt/dvc_test_source/test1/ or its parents?

@efiop No, there are no symlinks in /output/kurt/dvc_test_source/test1/ or its parents.

@rkmwiaim Ok, closing this issue for now, since you are going with the push/pull approach and I'm not able to reproduce the hdfs issue. Thanks for the feedback!

@efiop Thank you for your fast responses!!

Reopening, since another user is running into this as well: https://discordapp.com/channels/485586884165107732/485596304961962003/747506120867840101

Hi there!

I have the same issue with dvc run.

Information about setup

I'm trying to add a Spark table as a dependency of a pipeline stage with dvc run, e.g.:

python -m dvc run -n featurize \
    -d hdfs://my_path/my_sandbox.db/my_tablename \
    python featurize.py \
    --config=params.yaml

Output of dvc version:

WARNING: Unable to detect supported link types, as cache directory '../.dvc/cache' doesn't exist. It is usually auto-created by commands such as `dvc add/fetch/pull/run/import`, but you could create it manually to enable this check.
DVC version: 1.1.7
Python version: 3.7.2
Platform: Linux-4.1.12-61.1.18.el7uek.x86_64-x86_64-with-redhat-7.3-Maipo
Binary: False
Package: pip
Supported remotes: hdfs, http, https
Repo: dvc, git

Additional Information:

The Spark table is partitioned by date, so to the Hadoop file system it looks like a directory containing subdirectories of ORC files (see the sketch below).
Since the Spark table is a directory, not a file, hadoop fs -checksum does not work on it.
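(For illustration, a date-partitioned table on HDFS typically looks like the hedged listing below; names and dates are made up:)

$ hadoop fs -ls /my_path/my_sandbox.db/my_tablename
drwxr-xr-x   - user group          0 2020-08-23 00:00 /my_path/my_sandbox.db/my_tablename/dt=2020-08-23
drwxr-xr-x   - user group          0 2020-08-24 00:00 /my_path/my_sandbox.db/my_tablename/dt=2020-08-24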

Output of the above-mentioned dvc run command with the --verbose flag:

20/08/24 23:22:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/08/24 23:22:56 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
ERROR: hdfs command 'hadoop fs -checksum /my_path/my_sandbox.db/my_tablename' finished with non-zero return code 1': b"checksum: `/my_path/my_sandbox.db/my_tablename': Is a directory\n"

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

The same problem will occur if you just run the command:

hadoop fs -checksum /my_path/my_sandbox.db/my_tablename

The output will be:

checksum: `my_path/my_sandbox.db/my_tablename': Is a directory

So, how can I make a pipeline stage depend on a Spark table?

@maks-sh Thanks for reporting the issue! Indeed, looks like there is a bug where we are not able to see that something is an hdfs directory. I can reproduce it now, working on a fix...

Guys, 1.6.1 is out with the fix. Please give it a try and let us know how it goes :slightly_smiling_face: Thank you so much for the feedback! :pray:
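(To pick up the fix with a pip-based install, as in the setups above:)

$ pip install --upgrade dvc
$ dvc version    # should now report 1.6.1 or later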
