Kerberos is the only supported security option for Hadoop, so passing Kerberos authentication will be necessary for HDFS users who want to use dvc in a remote environment.
Currently I can run dvc commands without a problem on the edge node of my hadoop cluster (after running kinit). If I try running the same dvc commands on a remote machine (with hadoop installed, #2273), I receive:
ERROR: failed to upload '.dvc/cache/64/a414dd1dd7f45f5f8e14d7607b05ce' to
'hdfs://MyUserName@MyServer/path/to/dvc/storage/a414dd1dd7f45f5f8e14d7607b05ce' -
hdfs command 'HADOOP_USER_NAME=MyUserName hadoop fs -mkdir -p
hdfs://MyUserName@MyServer/path/to/dvc/storage/64' finished with non-zero return code 1':
b'mkdir: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]\n'
This is not an issue with kerberos, as I can successfully make directories with webHDFS from the remote machine.
Does dvc currently attempt to pass kerberos credentials? (maybe something to add to #50)
It is not clear to me how dvc is communicating with the hdfs server. If someone can clarify this, I'll take a look (disclaimer: I don't have much experience in this area).
If the entirety of dvc's hdfs commands can be made via webHDFS (#1629), this should be relatively straightforward to implement.
@JoshuaPostel We are pretty much just running hadoop fs -* commands directly in dvc/remote/hdfs.py; it is pretty straightforward. Are you able to use hadoop fs commands directly from your remote machine? If so, do you run kinit before that or do some other actions to authenticate?
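For context, here is a minimal sketch (not dvc's actual code; the function name is hypothetical) of the kind of CLI call described above and visible in the error message: dvc shells out to hadoop fs with HADOOP_USER_NAME set, so authentication is entirely up to the local hadoop client on the machine running dvc.

```python
# Illustrative sketch only -- mimics the CLI call shown in the error above.
import os
import subprocess

def hdfs_mkdir(url, user):
    # e.g. url = "hdfs://MyUserName@MyServer/path/to/dvc/storage/64"
    env = dict(os.environ, HADOOP_USER_NAME=user)
    result = subprocess.run(
        ["hadoop", "fs", "-mkdir", "-p", url],
        env=env,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # "SIMPLE authentication is not enabled" would surface here if the
        # local hadoop client cannot present Kerberos credentials.
        raise RuntimeError(result.stderr)
```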
I will attempt to clarify with the following example.
Say I have two machines, MachineA and MachineB, and I run user@MachineB:~$ dvc remote add foo hdfs://MyUserName@MachineA:/path. My misunderstanding: I expected running user@MachineB:~$ dvc push -r foo to push data to MachineA over a network, but it appears to me that running user@MachineB:~$ dvc push -r foo only communicates with MachineB.
If there is a way to have user@MachineB:~$ dvc push -r foo talk to MachineA, please let me know (and sorry for all the noise).
It would help to clarify which dvc remote methods can communicate with a remote machine (for example ssh) and which cannot (hdfs). Routing dvc's hdfs calls through webHDFS (#1629) would allow for developing and running projects on machines other than the hadoop cluster's edge node.
@JoshuaPostel hdfs does communicate with your remote machine. E.g. hadoop fs -ls hdfs://MyUserName@MachineA:/path accesses MachineA. So when you specify the remote like that, dvc does push to the remote machine MachineA, but it runs the hadoop command itself on your host machine MachineB. Does hadoop fs -ls hdfs://MyUserName@MachineA:/path work on your MachineB?
Yes, hadoop fs -ls hdfs://MyUserName@MachineA:/path runs on MachineB and successfully contacts MachineA. Big thanks for the suggestion.
I am still receiving the SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS] error. I will continue to debug this to determine whether it is due to the hadoop cluster, my credentials, my local hadoop installation, or how dvc passes the credentials. This may take some time (I need to reach out to an external team), but I will close the issue as it is likely local.
My apologies for my misunderstandings and thank you very much for all the support and suggestions. dvc is a great tool and I am looking forward to making it work in my team's environment :)
If running that command works, then dvc should also work. Maybe there are some env vars that we are not handling correctly. :thinking: Need to investigate how kerberos works with hadoop. Let's keep this open for now.
Btw, I'm actively working on pyarrow integration (should replace our lame CLI hadoop calls) and it is going very well so far. Added bonus is that it has pretty straightforward kerberos arguments in https://arrow.apache.org/docs/python/generated/pyarrow.hdfs.connect.html#pyarrow.hdfs.connect , so maybe we'll be able to fix it easily :slightly_smiling_face: Stay tuned!
Wasn't able to implement it myself, as I am not a kerberos user and won't be able to properly test it. To implement it, one would need to do pretty much the same as in https://github.com/iterative/dvc/pull/2424/files , but now instead of setting gss_auth when using paramiko, one would need to set kerb_ticket (according to https://arrow.apache.org/docs/python/generated/pyarrow.hdfs.connect.html) in https://github.com/iterative/dvc/blob/0.81.3/dvc/remote/hdfs.py#L52. And that would be it, pretty much 🙂
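For illustration, a minimal sketch of the suggested change, assuming the ticket cache path comes from the remote config (the wiring and names here are hypothetical, not the actual dvc patch); it simply forwards that path to pyarrow.hdfs.connect:

```python
# Sketch of the suggested approach, analogous to how gss_auth is wired up
# for the ssh remote in #2424. Not the actual dvc implementation.
import pyarrow.hdfs

def connect(host, port, user, kerb_ticket=None):
    # kerb_ticket is the path to a Kerberos ticket cache, e.g. the file
    # created by `kinit` (often /tmp/krb5cc_<uid>).
    return pyarrow.hdfs.connect(
        host=host,
        port=port,
        user=user,
        kerb_ticket=kerb_ticket,
    )
```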
I'm trying to implement this on a Cloudera Hadoop Cluster.
As we've discovered with @anderl80, pyarrow already handles kerberos gracefully. So no action needed here. Closing.
@efiop @anderl80 Just for everyone to understand, how does this affect users that want to use kerberos authentication for hdfs if pyarrow handles this gracefully? Do we have to set anything to enable it?
@benelot No, no need to set it up. If hadoop works from the CLI with kerberos, then dvc should work too; just make sure to configure url/user/etc correctly, and the kerb ticket will be picked up automatically by pyarrow.
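To make that concrete, here is a quick sanity check (host, port, and user below are placeholders for your cluster's values): after kinit, pyarrow should connect without any Kerberos-specific arguments, and dvc goes through the same pyarrow connection.

```python
# Quick sanity check, assuming `kinit` has already been run on this machine
# and a valid ticket is in the default ticket cache.
import pyarrow.hdfs

fs = pyarrow.hdfs.connect(host="MachineA", port=8020, user="MyUserName")
print(fs.ls("/path"))  # should list the HDFS directory without auth errors
```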