Go-ipfs: `ipfs object diff` assumes links will have different names for the comparison

Created on 23 Jul 2018 · 9Comments · Source: ipfs/go-ipfs

See https://github.com/ipfs/go-ipfs/issues/5135#issuecomment-407152979.

It would also be interesting to see how this can be integrated with some UnixFS command that could show not only which IPLD node is different but how it differs at the UnixFS layer.

kinbug topiUnixFS topimerkledag

Source

schomatis

All 9 comments

Note: this command will soon be changed to use CoreAPI - https://github.com/ipfs/go-ipfs/pull/4643

magik6k on 25 Jul 2018

Thanks for the heads-up, I'm mostly interested in the logic inside dagutils.Diff, I don't mind calling it from a different API.

schomatis on 25 Jul 2018

Hi, When I try to use “ipfs object diff” command, I found that the diff command could not find the exact difference between two hash of object if the link.Name is empty, relative code in dagutils/diff.go. In the following steps, I create a test_data and test_data_appended which append one line. I expect “ipfs object diff” could say that the only difference is the last object of test_data_append.
Reproduce steps(useless output has been removed, test in origin/master and feat/coreapi-commands):

$ ipfs version
ipfs version 0.4.17
$ dd if=/dev/random of=test_data bs=1024k count=1
$ ipfs add test_data
$ ipfs object get <hash of test_data>
2018/07/30 14:43:14 proto: duplicate proto type registered: basictracer_go.wire.TracerState
{"Links":[{"Name":"","Hash":"QmbaVqMRkcuoc28EJiiHQ5X1E9c1EQSmZ7h6jpKM4MWAas","Size":262158},{"Name":"","Hash":"QmTLj6JAu9EqjAT8wzjhcgktA9SFaeRhikV8fPfirEGC9H","Size":262158},{"Name":"","Hash":"QmRteFHwCk7ViWoiNSvAXGHApYJZZuiShTqNkbEXgR13kh","Size":262158},{"Name":"","Hash":"QmV5o9vMJVvg6rGJZTukpz2BFzrz3xfaD2uriSA6qt9oMS","Size":262158}],"Data":"\u0008\u0002\u0018��@ ��\u0010 ��\u0010 ��\u0010 ��\u0010"}
$ echo 1 >> test_data_appended
$ ipfs add test_data_appended
$ ipfs object get <hash of test_data_appended>
{"Links":[{"Name":"","Hash":"QmbaVqMRkcuoc28EJiiHQ5X1E9c1EQSmZ7h6jpKM4MWAas","Size":262158},{"Name":"","Hash":"QmTLj6JAu9EqjAT8wzjhcgktA9SFaeRhikV8fPfirEGC9H","Size":262158},{"Name":"","Hash":"QmRteFHwCk7ViWoiNSvAXGHApYJZZuiShTqNkbEXgR13kh","Size":262158},{"Name":"","Hash":"QmV5o9vMJVvg6rGJZTukpz2BFzrz3xfaD2uriSA6qt9oMS","Size":262158},{"Name":"","Hash":"QmdytmR4wULMd3SLo6ePF4s3WcRHWcpnJZ7bHhoj3QB13v","Size":10}],"Data":"\u0008\u0002\u0018��@ ��\u0010 ��\u0010 ��\u0010 ��\u0010 \u0002"}
$ ipfs object diff -v <hash of test_data> <hash of test_data_appended>
QmRN6hWdkdv9kPeyGetpzusv3Nn9nQQ9NgRPX5aZk4KmiZ QmemtUpKDXqWacFtsVEKSn669Y3ZoR99sHT6rtZEPFyFbd
2018/07/30 14:44:52 proto: duplicate proto type registered: basictracer_go.wire.TracerState
Changed "" from QmTLj6JAu9EqjAT8wzjhcgktA9SFaeRhikV8fPfirEGC9H to QmbaVqMRkcuoc28EJiiHQ5X1E9c1EQSmZ7h6jpKM4MWAas.
Changed "" from QmRteFHwCk7ViWoiNSvAXGHApYJZZuiShTqNkbEXgR13kh to QmbaVqMRkcuoc28EJiiHQ5X1E9c1EQSmZ7h6jpKM4MWAas.
Changed "" from QmV5o9vMJVvg6rGJZTukpz2BFzrz3xfaD2uriSA6qt9oMS to QmbaVqMRkcuoc28EJiiHQ5X1E9c1EQSmZ7h6jpKM4MWAas.

Not sure if I use this command correctly. Thanks in advance.

bjzhang on 30 Jul 2018

👍1

Hey @bjzhang, thanks for taking the lead on this one, good catch!

Yes, it seems the Diff method relies on link names to compare IPLD nodes, but as you have seen in the importer package, a file DAG is made up of nodes with empty link names, so another more general method should be designed that doesn't assume that each link should have its own different name. It can still separate links by name (to speed up the comparison), but within each name group it should cross check every CID of the two nodes to see if it finds a match. (This will probably also need to be fixed in the complement ApplyChange function.)

I'll get on this so you can focus on #5135, but ping me if the fix isn't moving fast enough for your needs.

schomatis on 30 Jul 2018

@schomatis You are welcome. It is not a big blocker for me, IIUC I could compare the object manually by "ipfs object get". I could help you test it when your patches are available.

bjzhang on 31 Jul 2018

👍1

So, on second thought, this function won't be useful for the purposes of #5135, we may need a specialized UnixFS function. The algorithm of this function relies on the link names to find where the difference might be (and only then checks the CIDs), the problem is that we can't rely on link names as a UnixFS file doesn't use them, but we also can't rely just on CIDs, as any difference in a node will propagate back to the root.

What we want here is an algorithm that does go down when the CIDs differ (in contrast to this function), to actually check which child is causing the parent's CID to be different from the expected value. That could work when two DAGs are identical except for one node, but I'm not sure how it would play out in more complex scenarios. Let me rethink this a bit more.

schomatis on 31 Jul 2018

@whyrusleeping I'm trying to give some closure to this issue, I wanted to check this command as I thought it might helpful as a debug tool, but I was wrong, we're looking for something different. However, @bjzhang finding is valid, the Diff function relies on link names being different (pretty much because all of the methods in the merkledag package seem to work on that assumption), is that a bug that should be fixed? Or is that something that should be clarified in the function documentation?

schomatis on 31 Jul 2018

@bjzhang In the meanwhile you can use @achingbrain's little beauty unixfs (see https://github.com/ipfs/go-ipfs/issues/5024#issuecomment-390186589 on how to install it), that command will show you the attributes of the UnixFS node. The problem with the ipfs object get command is that it only shows the DAG basic attributes (link, name, hash, data), but what you're probably interested in is the UnixFS attributes (encoded inside DAG's data) like the Type and Blocksizes (the error you're getting in #5135 now is an invalid UnixFS type, which you won't see it with ipfs object get).

schomatis on 31 Jul 2018

👍1

@schomatis the diff algorithm could definitely be better. The only real reason for the name requirement was that it produced nice looking diffs for the usecase I was using it for (diffing unixfs structures).

Agree that it makes sense to recursively diff a bit more aggressively