dvc add --external hdfs://... fails when the target is a directory.
Output of dvc version:
$ dvc version
DVC version: 1.3.1 (pip)
---------------------------------
Platform: Python 3.8.2 on Linux-3.10.0-1127.el7.x86_64-x86_64-with-glibc2.17
Supports: hdfs, http, https
Cache types: hardlink, symlink
Repo: dvc, git
Additional Information (if any):
If applicable, please also provide a --verbose output of the command, e.g. dvc add --verbose.
2020-08-04 17:26:41,289 DEBUG: fetched: [(3,)]
Adding...SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hanmail/connex/opt/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hanmail/connex/opt/hadoop-2.6.0/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hanmail/connex/opt/hadoop-2.6.0/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Adding...
2020-08-04 17:26:44,920 DEBUG: fetched: [(4,)]
2020-08-04 17:26:44,922 ERROR: hdfs command 'hadoop fs -checksum hdfs://search-hammer-analyzer1.dakao.io:9000/output/kurt/dvc_test_source/test1/' finished with non-zero return code 1': b"checksum: `hdfs://search-hammer-analyzer1.dakao.io:9000/output/kurt/dvc_test_source/test1': Is a directory\n"
------------------------------------------------------------
Traceback (most recent call last):
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/command/add.py", line 17, in run
self.repo.add(
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/repo/__init__.py", line 34, in wrapper
ret = f(repo, *args, **kwargs)
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/repo/scm_context.py", line 4, in run
result = method(repo, *args, **kw)
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/repo/add.py", line 90, in add
stage.save()
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/stage/__init__.py", line 380, in save
self.save_outs()
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/stage/__init__.py", line 391, in save_outs
out.save()
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 279, in save
if not self.changed():
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 221, in changed
status = self.status()
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 218, in status
return self.workspace_status()
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 206, in workspace_status
if self.changed_checksum():
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 194, in changed_checksum
return self.checksum != self.get_checksum()
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/output/base.py", line 180, in get_checksum
return self.tree.get_hash(self.path_info)
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/tree/base.py", line 268, in get_hash
hash_ = self.get_file_hash(path_info)
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/tree/hdfs.py", line 167, in get_file_hash
stdout = self.hadoop_fs(
File "/hanmail/.pyenv/versions/3.8.2/envs/dvc/lib/python3.8/site-packages/dvc/tree/hdfs.py", line 155, in hadoop_fs
raise RemoteCmdError(self.scheme, cmd, p.returncode, err)
dvc.tree.base.RemoteCmdError: hdfs command 'hadoop fs -checksum hdfs://.../test1/' finished with non-zero return code 1': b"checksum: `hdfs://.../test1': Is a directory\n"
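For context, the failing step boils down to DVC shelling out to hadoop fs -checksum on the target path. A minimal reproduction outside of DVC, assuming the hadoop CLI is on PATH (the path below is a placeholder, not the real cluster address):

# Placeholder path; substitute your own namenode address and directory.
DIR=hdfs://namenode:9000/path/to/test1

# `hadoop fs -test -d` exits 0 when the path is a directory.
hadoop fs -test -d "$DIR" && echo "target is a directory"

# `hadoop fs -checksum` only works on files, so on a directory it exits
# non-zero with the same "Is a directory" error shown in the traceback above.
hadoop fs -checksum "$DIR"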
Hi @rkmwiaim !
hadoop fs -checksum hdfs://.../test1/
Hm, this path looks very strange; I need to take a closer look at the code.
Btw, we usually consider --external a very advanced feature and don't really recommend using it unless there are very good reasons to. For example, the checksum reported by hdfs is not really a hash, so there is a potential for it to collide with something (not an issue for the regular scenario where you use hdfs as a remote to push/pull to). Could you tell us about your scenario?
Thanks for the reply @efiop
I hid the middle of the path hdfs://.../test1/ for security reasons.
Anyway, I want to use hdfs as external data file storage. The data comes from Spark and will be used as training data for a deep learning model. So the data is stored in hdfs, and a deep learning framework like TensorFlow will stream over that data directly.
I didn't know that the hdfs checksum is not a hash. I think I should use hdfs only as remote storage (and not use it with --external).
@rkmwiaim Are there any symlinks in /output/kurt/dvc_test_source/test1/ or its parents?
@efiop No, there are no symlinks in /output/kurt/dvc_test_source/test1/ or its parents.
@rkmwiaim Ok, closing this issue for now since you are going with the push/pull approach and I'm not able to reproduce the hdfs issue. Thanks for the feedback!
@efiop Thank you for your fast responses!!
Reopening, since another user is running into this as well https://discordapp.com/channels/485586884165107732/485596304961962003/747506120867840101
Hi there!
I have the same issue with dvc run.
I'm trying to add a Spark table as a dependency of a pipeline stage with dvc run, e.g.:
python -m dvc run -n featurize \
-d hdfs://my_path/my_sandbox.db/my_tablename \
python featurize.py \
--config=params.yaml
WARNING: Unable to detect supported link types, as cache directory '../.dvc/cache' doesn't exist. It is usually auto-created by commands such as `dvc add/fetch/pull/run/import`, but you could create it manually to enable this check.
DVC version: 1.1.7
Python version: 3.7.2
Platform: Linux-4.1.12-61.1.18.el7uek.x86_64-x86_64-with-redhat-7.3-Maipo
Binary: False
Package: pip
Supported remotes: hdfs, http, https
Repo: dvc, git
The Spark table is partitioned by date, so to the Hadoop file system it looks like a directory that in turn contains directories with ORC files.
Since the Spark table is a directory, not a file, hadoop fs -checksum cannot be run on it.
Output of the above-mentioned dvc run command with the --verbose flag:
20/08/24 23:22:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/08/24 23:22:56 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
ERROR: hdfs command 'hadoop fs -checksum /my_path/my_sandbox.db/my_tablename' finished with non-zero return code 1': b"checksum: `/my_path/my_sandbox.db/my_tablename': Is a directory\n"
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
The same problem occurs if you just run the command directly:
hadoop fs -checksum /my_path/my_sandbox.db/my_tablename
The output will be:
checksum: `my_path/my_sandbox.db/my_tablename': Is a directory
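Checksumming does work on the individual files inside the table, so any directory-level hash would presumably have to be built by walking the partition files. A rough sketch of that idea, assuming the hadoop CLI is available and using the same placeholder table path as above (paths without spaces assumed):

# Recursively list the table; plain files start with '-', directories with 'd',
# so keep only the file paths (the last field of each line).
hadoop fs -ls -R /my_path/my_sandbox.db/my_tablename \
  | awk '$1 ~ /^-/ {print $NF}' \
  | while read -r f; do
      # Per-file checksums succeed; a directory-level hash would have to be
      # derived from these individual results.
      hadoop fs -checksum "$f"
    done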
So, how can I make a pipeline stage depend on the Spark table?
@maks-sh Thanks for reporting the issue! Indeed, looks like there is a bug where we are not able to see that something is an hdfs directory. I can reproduce it now, working on a fix...
Guys, 1.6.1 is out with the fix. Please give it a try and let us know how it goes :slightly_smiling_face: Thank you so much for the feedback! :pray:
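In case it helps, a minimal way to pick up the fix and re-check, assuming dvc was installed via pip with the hdfs extra (adjust the version pin to your setup):

pip install --upgrade "dvc[hdfs]>=1.6.1"
dvc version
# Re-run the original command; the path stays elided as in the report above.
dvc add --external hdfs://.../test1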