Kerberos is the only supported security option for Hadoop, so passing Kerberos authentication will be necessary for HDFS users who want to use dvc in a remote environment.
Currently I can run dvc commands without a problem on the edge node of my hadoop cluster (after running kinit). If I try running the same dvc commands on a remote machine (with hadoop installed, #2273), I receive:
ERROR: failed to upload '.dvc/cache/64/a414dd1dd7f45f5f8e14d7607b05ce' to
'hdfs://MyUserName@MyServer/path/to/dvc/storage/a414dd1dd7f45f5f8e14d7607b05ce' -
hdfs command 'HADOOP_USER_NAME=MyUserName hadoop fs -mkdir -p
hdfs://MyUserName@MyServer/path/to/dvc/storage/64' finished with non-zero return code 1':
b'mkdir: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]\n'
This is not an issue with kerberos, as I can successfully make directories with webHDFS from the remote machine.
Does dvc currently attempt to pass kerberos credentials? (maybe something to add to #50)
It is not clear to me how dvc is communicating with the hdfs server. If someone can clarify this, I'll take a look (disclaimer: I don't have much experience in this area).
If the entirety of dvc's hdfs commands can be made via webHDFS (#1629), this should be relatively straightforward to implement.
@JoshuaPostel We are pretty much just running hadoop fs -* commands directly in dvc/remote/hdfs.py; it is pretty straightforward. Are you able to use hadoop fs commands directly from your remote machine? If so, do you run kinit before that or do some other actions to authenticate?
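For context, here is a minimal sketch (not dvc's actual code; the function name is hypothetical) of the kind of CLI call described above and visible in the error message: dvc shells out to hadoop fs with HADOOP_USER_NAME set, so authentication is entirely up to the local hadoop client on the machine running dvc.

```python
# Illustrative sketch only -- mimics the CLI call shown in the error above.
import os
import subprocess

def hdfs_mkdir(url, user):
    # e.g. url = "hdfs://MyUserName@MyServer/path/to/dvc/storage/64"
    env = dict(os.environ, HADOOP_USER_NAME=user)
    result = subprocess.run(
        ["hadoop", "fs", "-mkdir", "-p", url],
        env=env,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # "SIMPLE authentication is not enabled" would surface here if the
        # local hadoop client cannot present Kerberos credentials.
        raise RuntimeError(result.stderr)
```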
I will attempt to clarify with the following example.
Say I have two machines, MachineA and MachineB, and I run user@MachineB:~$ dvc remote add foo hdfs://MyUserName@MachineA:/path. My misunderstanding: I expected running user@MachineB:~$ dvc push -r foo to push data to MachineA over a network, but it appears to me that running user@MachineB:~$ dvc push -r foo only communicates with MachineB.
If there is a way to have user@MachineB:~$ dvc push -r foo talk to MachineA, please let me know (and sorry for all the noise).
It would help to clarify which dvc remote methods can communicate with a remote machine (for example ssh) and which cannot (hdfs). Routing dvc's hdfs calls through webHDFS (#1629) would allow for developing and running projects on machines other than the hadoop cluster's edge node.
@JoshuaPostel hdfs does communicate with your remote machine. E.g. hadoop fs -ls hdfs://MyUserName@MachineA:/path accesses MachineA. So when you specify the remote like that, dvc does push to the remote machine MachineA, but it runs the hadoop command itself on your host machine MachineB. Does hadoop fs -ls hdfs://MyUserName@MachineA:/path work on your MachineB?
Yes, hadoop fs -ls hdfs://MyUserName@MachineA:/path runs on MachineB and successfully contacts MachineA. Big thanks for the suggestion.
I am still receiving the SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS] error. I will continue to debug this to determine whether it is due to the hadoop cluster, my credentials, my local hadoop installation, or how dvc passes the credentials. This may take some time (I need to reach out to an external team), but I will close the issue as it is likely local.
My apologies for my misunderstandings and thank you very much for all the support and suggestions. dvc is a great tool and I am looking forward to making it work in my team's environment :)
If running that command works, then dvc should also work. Maybe there are some env vars that we are not handling correctly. :thinking: Need to investigate how kerberos works with hadoop. Let's keep this open for now.
Btw, I'm actively working on pyarrow integration (should replace our lame CLI hadoop calls) and it is going very well so far. Added bonus is that it has pretty straightforward kerberos arguments in https://arrow.apache.org/docs/python/generated/pyarrow.hdfs.connect.html#pyarrow.hdfs.connect , so maybe we'll be able to fix it easily :slightly_smiling_face: Stay tuned!
Wasn't able to implement it myself, as I am not a kerberos user and won't be able to properly test it. To implement it, one would need to do pretty much the same as in https://github.com/iterative/dvc/pull/2424/files , but now instead of setting gss_auth when using paramiko, one would need to set kerb_ticket (according to https://arrow.apache.org/docs/python/generated/pyarrow.hdfs.connect.html) in https://github.com/iterative/dvc/blob/0.81.3/dvc/remote/hdfs.py#L52. And that would be it, pretty much 🙂
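For illustration, a minimal sketch of the suggested change, assuming the ticket cache path comes from the remote config (the wiring and names here are hypothetical, not the actual dvc patch); it simply forwards that path to pyarrow.hdfs.connect:

```python
# Sketch of the suggested approach, analogous to how gss_auth is wired up
# for the ssh remote in #2424. Not the actual dvc implementation.
import pyarrow.hdfs

def connect(host, port, user, kerb_ticket=None):
    # kerb_ticket is the path to a Kerberos ticket cache, e.g. the file
    # created by `kinit` (often /tmp/krb5cc_<uid>).
    return pyarrow.hdfs.connect(
        host=host,
        port=port,
        user=user,
        kerb_ticket=kerb_ticket,
    )
```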
I'm trying to implement this on a Cloudera Hadoop Cluster.
As we've discovered with @anderl80, pyarrow already handles kerberos gracefully. So no action needed here. Closing.
@efiop @anderl80 Just for everyone to understand, how does this affect users that want to use kerberos authentication for hdfs if pyarrow handles this gracefully? Do we have to set anything to enable it?
@benelot No, no need to set it up. If hadoop works from the CLI with kerberos, then dvc should work too; just make sure to configure url/user/etc correctly, and the kerb ticket will be picked up automatically by pyarrow.
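To make that concrete, here is a quick sanity check (host, port, and user below are placeholders for your cluster's values): after kinit, pyarrow should connect without any Kerberos-specific arguments, and dvc goes through the same pyarrow connection.

```python
# Quick sanity check, assuming `kinit` has already been run on this machine
# and a valid ticket is in the default ticket cache.
import pyarrow.hdfs

fs = pyarrow.hdfs.connect(host="MachineA", port=8020, user="MyUserName")
print(fs.ls("/path"))  # should list the HDFS directory without auth errors
```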