Js-ipfs: The `dag` API - One API to manipulate all the IPLD Format objects

Created on 1 Nov 2016  ·  15 comments  ·  Source: ipfs/js-ipfs

We need to come up with an API to manipulate IPLD Format objects.

Currently, go-ipfs master ships with a dag API that offers get and put methods; it doesn't yet expose a dag resolve API.

For reference, here are the help texts:

» ipfs dag --help
USAGE
  ipfs dag - Interact with ipld dag objects.

SYNOPSIS
  ipfs dag

DESCRIPTION

  'ipfs dag' is used for creating and manipulating dag objects.

  This subcommand is currently an experimental feature, but it is intended
  to deprecate and replace the existing 'ipfs object' command moving forward.


SUBCOMMANDS
  ipfs dag get <cid>         - Get a dag node from ipfs.
  ipfs dag put <object data> - Add a dag node to ipfs.

  Use 'ipfs dag <subcmd> --help' for more information about each command.
» ipfs dag get --help
USAGE
  ipfs dag get <cid> - Get a dag node from ipfs.

SYNOPSIS
  ipfs dag get [--] <cid>

ARGUMENTS

  <cid> - The cid of the object to get

DESCRIPTION

  'ipfs dag get' fetches a dag node from ipfs and prints it out in the specified format.
» ipfs dag put --help
USAGE
  ipfs dag put <object data> - Add a dag node to ipfs.

SYNOPSIS
  ipfs dag put [--format=<format> | -f] [--input-enc=<input-enc>] [--] <object data>

ARGUMENTS

  <object data> - The object to put

OPTIONS

  -f,        --format string - Format that the object will be added as. Default: cbor.
  --input-enc         string - Format that the input object will be. Default: json.

DESCRIPTION

  'ipfs dag put' accepts input from a file or stdin and parses it
  into an object of the specified format.

Meanwhile, in js-ipfs, we have pretty much a straight copy of this API, defined as an interface at: https://github.com/ipfs/interface-ipfs-core/tree/master/API/dag and a resolve API exposed by the IPLD Resolver that goes (as simply) as follows:

.resolve(cid, path, callback)

Note: this function is capable of resolving through different formats.

We need to complete the dag API definition, taking into account the following issues.

Current shortcomings

It is impossible to ensure that the right type is returned when using a non-strict IPLD Format

To help understand this issue, let's define a strict IPLD Format as something like dag-pb, eth-block, git-block and other Merkle data structures that have been predefined and whose format follows a fixed structure. Non-strict IPLD Formats are (so far, we have one main case) data structures like dag-cbor, which have no definition when it comes to the keys and the value types of their data.

When resolving through non-strict IPLD Formats, the entity that requests a .resolve can't tell the type of the value that is going to be returned. While this problem can be somewhat mitigated in languages that support type inference (or that have no type system at all), it is unavoidable when we have to pass a node through a transport like HTTP. Let's illustrate the issue:

// Imagine we have an object stored in cbor that looks like
{
  name: 'fancy-music.mp3',
  data: new Buffer(<bytes of fancy-music.mp3>)
}

// This object can be serialized and deserialized as many times as we want,
// since cbor has a 1:1 mapping with JSON. However, if an http client
// requests this object, it will have to be JSON.stringify'ed in order to
// pass through the wire, and so the previous object will be converted to:

"{ \"name\": \"fancy-music.mp3\", \"data\": { \"type\": \"Buffer\", \"data\": [<array of bytes>] } }"

// Now, if we do JSON.parse, we get:

{
  name: 'fancy-music.mp3',
  data: {
    type: 'Buffer',
    data: [<array of bytes>]
  }
}

Now the client would have to know that, in the context of this application, the data field is a Buffer and cast it manually; but this has to be application specific, which makes it especially hard to work with.

Another case, which is what happens today, is that go-ipfs base64-encodes any buffer it has to send, converting it to a string, so the object returned from a go-ipfs http-api would in fact look on the wire like:

"{ \"name\": \"fancy-music.mp3\", \"data\": \"base64encodedArrayofBytes\" }"

This is OK for dag-pb, because we can easily cast, since we always know that data in dag-pb needs to be a Buffer.

The user needs to know which type is going to be returned when doing an ipld.resolve across multiple IPLD Formats

In a similar way, each time a CID/path gets resolved and a change of IPLD Format is performed, the receiver needs to know beforehand what the data type of the returned object is going to be.

Proposed solutions

1:1 JSON mapping. In order to support the weird casting, every IPLD Format would be required to have a 1:1 mapping with JSON (toJSON and fromJSON methods), which is non trivial (even if we reduced the scope to 'every object needs to be created first in its native serialized format and then converted to JSON').

Last mile resolve. Another suggestion would be to do a last mile resolve, where what gets passed on the wire is the last IPLD node serialized (in a block) and the client deserializes that node and resolves any remainderPath, being able to capture the right type for that object.

Boxing of values. We can also consider the boxing of values, where every value is passed around as a byte array inside a 'box', and that box also has a label saying its type, so that the consumer can properly cast it to the right type.

Notes

We still haven't had the chance to have a long discussion about this dag API; this issue's purpose is to collect ideas and get feedback on the proposals.

@whyrusleeping here is the brain dump including notes from our chat yesterday, please add anything that I missed :)

Labels: exploration, kind/enhancement

All 15 comments

(1) on the Buffer problem

This object can be serialized and deserialized as many times as we want,
since cbor has a 1:1 mapping with JSON. However, if an http client
requests this object, it will have to be JSON.stringify'ed in order to
pass through the wire, and so the previous object will be converted to:

  • It could be passed as raw cbor, base64-encoded, instead of JSON.
  • It could be passed as EJSON (a JSON flavor that signals buffers with {"$binary": "<b64-data>"})
  • We could standardize to use EJSON (or some similar variant) instead of pure JSON. I wouldn't be opposed to this because JSON's lack of binary buffer is a horrendous omission that keeps wasting my time.

(2) type returned on ipld.resolve

In a similar way, each time a CID/path gets resolved and a change of IPLD Format is performed, the receiver needs to know beforehand what the data type of the returned object is going to be.

Sorry, I don't see why-- can you give a concrete example with the CID/path case?

1:1 JSON mapping

Last mile resolve. Another suggestion would be to do a last mile resolve, where what gets passed on the wire is the last IPLD node serialized (in a block) and the client deserializes that node and resolves any remainderPath, being able to capture the right type for that object.

Oh-- maybe i get it now. You mean you need the type because you're converting to JSON on the wire?

Yeah, if you convert to JSON on the wire, there HAS to be a 1:1 well-defined mapping. I would encourage maybe the use of EJSON or something that is super clear instead of JSON.

However-- you may be able to circumvent the problem in MANY use cases (probably not all). Just return the blocks-- all of them -- as serialized, and let the libraries de-serialize them. So the HTTP API should return raw blocks, and js-ipfs-api should deserialize them. (this won't cover all use cases, but most).

Another thing is that we have been discussing a proper RPC API over a socket for some time (over proper multiaddr sockets). This will be useful in go-ipfs to prevent the need of an HTTP endpoint, and being able to use a unix domain socket. Separately, this can be used with websockets to expose a high throughput, low request overhead API to js-ipfs-api -- it would avoid all the horrible HTTP requests. (it's what Facebook had to do to make chat work, way back in the day. that work helped motivate websockets.)

Boxing of values. We can also consider the boxing of values, where every value is passed around as a byte array inside a 'box', and that box also has a label saying its type, so that the consumer can properly cast it to the right type.

Yeah-- if considering this, why not just send the raw blocks?

@diasdavid i'm sorry for taking so long.

Thank you @jbenet

It could be passed as raw cbor, base64 encoded. instead of json.

Same as the last mile resolve solution

It could be passed as EJSON (a JSON flavor that signals buffers with {"$binary": "<b64-data>"})
We could standardize to use EJSON (or some similar variant) instead of pure JSON. I wouldn't be opposed to this because JSON's lack of binary buffer is a horrendous omission that keeps wasting my time.

For all the readers, tl;dr: over HTTP, either data goes base64-encoded (as text) or it goes in pure JSON (no Buffers).

So yeah, we can standardize that our HTTP API returns and expects EJSON and it should be parsed that way. //cc @hsanjuan

Oh-- maybe i get it now. You mean you need the type because you're converting to JSON on the wire?

JSON 1:1 mapping solution is different from last mile resolve.

Last Mile Resolve Issue (or the need to also signal the last CID)

If I have something like:

a(type: dag-cbor) -> b(type: dag-cbor) -> c(type: eth-block)

And if I try to fetch /<cid of a>/b/c with a last mile resolve (sending the last block as a base64-encoded block), the http-api client would need to know that the returned block is an eth-block and no longer a dag-cbor like the CID I had in the beginning.

This can be solved with returning a JSON object like:

{
  cid: <last cid>,
  block: <base64 encoded>
}

Another thought: I now realise that we really should standardise on EJSON for our endpoints, independently of the DAG API; it just makes things nicer to understand.

  • :+1: for EJSON (or something like it, that's sane. i think there are a couple other similar variants, so we should look at them before picking one and doing the work of changing all the things)

Can we have a decision on this thread? Or, is there anything against moving forward with the last mile resolve?

@whyrusleeping @jbenet @nicola
?

READ THIS - The Formalization of the DAG API

I had a very good chat yesterday with @whyrusleeping and found some new ways to simplify this a lot (not 100%, there will still be some costs, but for 90% of the cases or more it will be just right), and also, he convinced me to merge dag get and dag resolve.

So, the DAG API will have only two commands/function calls (not including the workbench stuff, but that is another endeavor), and these are:

ipfs dag get <[cid, cid+path]>

The dag get call will be able to fetch nodes and resolve paths, the options for this command are:

Options

  • --local-resolve - It will try to resolve everything it can within the scope of the first object, returning the last value it could resolve + the remainderPath

Implementation details:

The http-api endpoint needs to reply with a header that specifies the content-type, e.g.:

  • dag-cbor, dag-pb, eth-block and so forth, if it is returning a full node
  • json, cbor, ejson, octet-stream, string, number, etc, if it is returning a value within the node. This means that the local resolver of each format needs to tell the type of the value it just resolved.

ipfs dag put <node>

The dag put call will store a single node at a time and return its respective CID

Options

  • --format - The multicodec of the format (dag-pb, dag-cbor, etc)
  • --hash-alg - The hash algorithm to be used
  • --input-enc - The input encoding of how the object is being passed. This is just a requirement to make sense of a blob of characters from a CLI context.

Shortcoming: There is no clear way to pass a git object or an ethereum block this way, unless we had conversions from and to JSON for all of them. Solution: Just add the git, ethereum and other IPLD compatible objects using ipfs block put. Important to note: the shortcoming is just a CLI thing; within a language runtime it will be fine doing dag put of ethereum nodes, since we have the classes for them.

Other clarifications

@diasdavid that sounds awesome!
Is there a plan to also merge Get and Resolve in the IPLD resolver?

@nicola absolutely :)

Reopening this issue as I keep referencing it

To update on the state of things:

  • In go-ipfs you can add raw git blocks via dag put with --input-enc set to raw or zlib; the same should be the case for other formats
  • ipfs dag get currently only supports json output
  • My proposal for connecting js-ipfs-api with the existing api:

    • Add cbor encoding to ipfs dag get

      • Instead of the EJSON proposed above; this has several advantages (almost no size overhead for binary data, should be faster cpu-wise too)
      • In the CLI this encoding can be selected with the --encoding flag; json is still the default

    • api.dag.get will be implemented using cbor serializations of dag nodes
    • We keep info about binary/text buffers
    • Most formats shouldn't have any problem serializing to/from it

      • No 'strict' format should have problems with it
      • I can't think of any non-strict format that would be problematic
      • Maybe things like java object serialization; for cases like this we can use cbor type tags or add type metadata into the serialization.

    • api.dag.put will serialize js dagNodes into cbor and send them with --input-enc=cbor.

      • This is easy on the js side as the cbor ipld utils can be used for serialization
      • This is easy on the go side as the ipld plugins can already parse json objects into native formats.
@vmx ^^

Had a discussion with @diasdavid about this at DWeb and I came up with an alternative proposal.

Firstly, I don't think it's possible to come up with a universal JSON representation for a variety of codecs in many different formats. I wrote a bit more about this here.

I think the best solution to these API problems is to stop trying to encode things into JSON.

At the HTTP API layer, just return a binary representation of the node. Encode the remaining metadata, the remainderPath and the cid of the last resolved node, into the HTTP headers.

At the CLI layer you do the same thing CURL does, return just the node body by default and add an option to dump the other metadata before it prints. Add another option to "prettify" the output, which would run the node through the language's codec implementation and then print the in-language representation.

Different codecs are going to take advantage of the features of the underlying format (JSON, CBOR, protobuf, msgpack, etc), and codec implementations are going to want to take advantage of the language's features in the same way. Trying to find a JSON representation of all these things is potentially endless.

I think this can be closed.

This approach (canonical JSON representation) has been abandoned.

We now have the IPLD Data Model which provides unified type representation for all the different codecs. If you want to pass this information over a protocol, like in the HTTP API, just send the encoded block data. Both sides of the exchange have agreement about type representations in their language based on the data model.
