Beaker: Feature Request: Linked Dats/files

Created on 20 Dec 2017 · 19 Comments · Source: beakerbrowser/beaker

I've been thinking about this quite a bit (especially in the context of torrents): we need a feature that lets Dats declare other Dats, or files from them, that they need in order to function when we store them offline in our libraries.

For example, almost all Rotonde pages use the same JavaScript file ("dat://2714774d6c464dd12d5f8533e28ffafd79eec23ab20990b5ac14de940680a6fe/rotonde.js").

There should be a way to tell the browser that, when the user adds this Dat site to the library, it should prompt them to also store another Dat or specific files from it, complete with version support to protect the host site from any breaking changes.

discussion feature request

HughIsaacs2

Most helpful comment

Thanks for looping me in, Tara; I do have thoughts, but no answers. All I will note right now is that having a deterministic way to tell which assets constitute a document/app/site seems generally desirable, and it might be best to try to find more generic solutions rather than solving this problem in a way that is specific to beaker/dats.

I am not very knowledgeable about this, but here are some related specs and efforts I came across, that in some way list the resources required for offline consumption:

The manifest of an epub lists all resources it consists of.
The Web Publications group is discussing (e.g. here, here) the same question of how to list a publication's constituent resources, possibly building upon web application manifests.
Different but related are formats that don't only list the identifiers of assets, but also include their contents. The introduction of the web packages draft lists some such formats.

Instead of adding an explicit manifest with resources, I would also consider the option of extracting the required resources from the document itself; e.g. all the srcs of imgs and scripts are required dependencies of an html document. See this discussion in the wpub group for a similar view. Although more complex to define and implement, a big bonus is that this way, content creators need not do anything special to make their documents usable offline, so it could work with existing websites.

Treora on 23 Dec 2017

❤2

All 19 comments

webdesserts on 20 Dec 2017

👍1

We'll keep this on our minds. I think subdats may end up being the solution for this but we'll see.

pfrazee on 20 Dec 2017

What are subdats?

Also I was thinking something like an additional resources list within dat.json

HughIsaacs2 on 22 Dec 2017

Resource listing, for example in a manifest file, is actually a bit
controversial (I know @treora has thoughts about this).

I am not very knowledgeable about this, but here are some related specs and efforts I came across, that in some way list the resources required for offline consumption:

The manifest of an epub lists all resources it consists of.
The Web Publications group is discussing (e.g. here, here) the same question of how to list a publication's constituent resources, possibly building upon web application manifests.
Different but related are formats that don't only list the identifiers of assets, but also include their contents. The introduction of the web packages draft lists some such formats.

Treora on 23 Dec 2017

❤2

I would also consider the option of extracting the required resources from the document itself; e.g. all the srcs of imgs and scripts are required dependencies of an html document.

I think the biggest problem with this is that these tags can be changed at runtime. For example, Rotonde right now injects its list of dependencies (img, script, and style tags) after you load the initial script tag on the portal's index.html. Also these are generally direct links to single files. For dependencies that function more as a database (a folder of json files), you would need to declare a dependency on a folderset.
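
To make the trade-off concrete, here is a minimal Python sketch of the static extraction idea quoted above. It only reads `src` attributes of `img` and `script` tags that are present in the markup as delivered; the class and function names are hypothetical and not part of Beaker or Dat. As the comment above points out, anything a script injects at runtime is invisible to this kind of scan.

```python
from html.parser import HTMLParser

# Tags whose src attribute is treated as a required resource (assumption
# for this sketch, following the "srcs of imgs and scripts" idea).
RESOURCE_TAGS = {"img", "script"}


class ResourceCollector(HTMLParser):
    """Collects src values of img/script tags found in the static markup."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        if tag not in RESOURCE_TAGS:
            return
        for name, value in attrs:
            if name == "src" and value:
                self.resources.append(value)


def static_dependencies(html):
    """Return the resource references visible without running any script."""
    collector = ResourceCollector()
    collector.feed(html)
    collector.close()
    return collector.resources


page = (
    '<html><body><img src="logo.png">'
    '<script src="dat://2714774d6c464dd12d5f8533e28ffafd79eec23ab20990b5ac14de940680a6fe/rotonde.js"></script>'
    "<script>injectMoreTags()</script>"
    "</body></html>"
)
found = static_dependencies(page)
```

Here `found` holds the image and the external script, but nothing that the inline script (a made-up `injectMoreTags()` call) would add later, which is exactly the gap described above.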

webdesserts on 23 Dec 2017

I think the biggest problem with this is that these tags can be
changed at runtime. For example, Rotonde right now injects its list of
dependencies (img, script, and style tags) after you load the
initial script tag on the portal's index.html.

You would indeed need to explicitly declare that these will (possibly)
be required. The questions are (Q1) how, and (Q2) whether the
'statically' depended assets also have to be declared in that way. Two
answer sets that seem natural to me:

(Q1) declare assets in a separate manifest file; (Q2) yes, everything goes in there.
(Q1) add a link tag for each dynamic dependency; (Q2) no, these links will be extracted just like the src of an img.

Also these are generally direct links to single files. For
dependencies that function more as a database (a folder of json
files), you would need to declare a dependency on a folderset.

Another good point. One thought on this: if you would solve this using a
syntax for folders, e.g. putting dat:1234ab/mydata/* in your manifest
file (or better even without the asterisk), you would also be able to
put it in a link tag. While http does not have a concept of folders,
perhaps dat urls do?
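
A rough Python sketch of how a folder dependency could be checked, under the assumption (not settled anywhere in this thread) that a declared entry ending in "/" means "everything under this folder" and any other entry means exactly one file. The function name and the example entries are hypothetical.

```python
def is_covered(url, declared):
    """Return True if url is a declared file or lies inside a declared folder.

    Assumption for this sketch: an entry ending in "/" denotes a folder and
    covers everything beneath it; any other entry denotes exactly one file.
    """
    for entry in declared:
        if entry.endswith("/"):
            if url.startswith(entry):
                return True
        elif url == entry:
            return True
    return False


# Hypothetical declarations: one folder of JSON data, one single file.
declared = ["dat://1234ab/mydata/", "dat://1234ab/app.js"]

inside_folder = is_covered("dat://1234ab/mydata/posts/1.json", declared)
single_file = is_covered("dat://1234ab/app.js", declared)
undeclared = is_covered("dat://1234ab/other.js", declared)
```

Requiring the trailing slash keeps a folder entry such as `mydata/` from accidentally matching a sibling like `mydata-private/`.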

Treora on 23 Dec 2017

Subdats, and eventually a dat-cdn, who knows :-)

millette on 31 Dec 2017

Most of my observations are already captured in https://github.com/beakerbrowser/beaker/issues/752. I'll just add some observations.

HTML elements are not the ideal place for this because any policy we'd want to create regarding "save to library" would operate at the site level, not the page level. So, we need a specific file that can tell us the policy information. JSON is much easier to parse, in that case, than HTML is, and you might not always have HTML in a site, but you will have the dat.json manifest.

The 'subdat' concept is an idea that gets brought up a lot. In unix terms, it's basically a symlink from one dat to another. In git terms, it's like a submodule. It's a way to map an archive to a subfolder of another archive. Eg:

/foo -> dat://ffff..ff/
/foo/index.html -> dat://ffff..ff/index.html

Subdats are interesting because they could solve a lot of problems at once -- one such problem being this question of caching dependent dats. We could do a policy where subdats are saved along with their parent dats.
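
As an illustration only, the mapping above can be sketched in Python as a lookup in a mount table. The `resolve` function and the `mounts` dictionary are hypothetical names for this sketch; they do not describe how Dat or Beaker implements (or would implement) subdats.

```python
def resolve(path, mounts):
    """Map a path in the parent archive to a URL in a mounted archive.

    Returns None when the path is not under any mount point, meaning the
    parent archive itself serves it.
    """
    # Try the longest mount point first so nested mounts take precedence.
    for mount_point in sorted(mounts, key=len, reverse=True):
        if path == mount_point or path.startswith(mount_point + "/"):
            remainder = path[len(mount_point):]
            return mounts[mount_point].rstrip("/") + (remainder or "/")
    return None


# The example mapping from the comment above: /foo -> dat://ffff..ff/
mounts = {"/foo": "dat://ffff..ff/"}

root = resolve("/foo", mounts)
page = resolve("/foo/index.html", mounts)
local = resolve("/foobar/index.html", mounts)
```

With this table, `/foo` and `/foo/index.html` resolve into the mounted archive, while `/foobar/index.html` stays in the parent. A "save subdats with their parent" policy could then walk the values of `mounts` to find which archives to store.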

I've been hesitant to 👍 subdats so far because they also add complexity to the core rules of dat, but I think there's a good chance we'll end up implementing them eventually. I just want to give us time to think about it.

pfrazee on 1 Jan 2018

👍1

We should consider the Web Packaging standard in our discussions about this

https://github.com/WICG/webpackage

taravancil on 9 Jan 2018

Not sure if it was mentioned, but this sounds like a perfect extension for the existing dat.json manifest.

Maybe something as simple as

{
  "title": "Application Title",
  "dependencies": [
    { "url": "dat://4483a2..66/" },
    { "url": "dat://4483a2..66/" }
  ]
}

This could have potential for performance improvements by pre-fetching the dat metadata when the initial metadata is being downloaded.

Plus this is a dat-specific extension that could work for dats that weren't necessarily made to work with HTML or even a browser.
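
A small Python sketch of how a client might read such a field, assuming each entry in the proposed "dependencies" list is an object with a "url" key. The "dependencies" field is only the suggestion made in this comment, not an existing part of dat.json, and `dependency_urls` is a hypothetical helper.

```python
import json


def dependency_urls(manifest_text):
    """Return the unique dat:// URLs declared under "dependencies"."""
    manifest = json.loads(manifest_text)
    urls = []
    for entry in manifest.get("dependencies", []):
        url = entry.get("url") if isinstance(entry, dict) else None
        # Skip malformed entries and duplicates; keep declaration order.
        if isinstance(url, str) and url.startswith("dat://") and url not in urls:
            urls.append(url)
    return urls


example = """
{
  "title": "Application Title",
  "dependencies": [
    { "url": "dat://4483a2..66/" },
    { "url": "dat://4483a2..66/" }
  ]
}
"""

to_prefetch = dependency_urls(example)
```

The resulting list is what a client could hand to a prefetch step or a "save to library" prompt. A manifest without the field simply yields an empty list.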

RangerMauve on 1 May 2018

I was under the impression that the Dat protocol also uses content addressability via Merkle trees under the hood (does it not?), but it seems that, unlike IPFS, it is scoped to an individual archive.

Are there technical reasons (other than the implementation effort it would take) why Dat could not make content addressable across the whole Dat protocol? It seems like it would resolve the issue and likely improve overall network performance.

In general, I think supporting links at the protocol level, say dat ln foo dat://ffff..ff/index.html, would be a much better option than storing that elsewhere, as all other dat clients would get support for this out of the box.

Gozala on 5 Jun 2018

The concepts page in the docs and the security and privacy page have a pretty good overview of why it is the way it is.

One of the main advantages of this is privacy. With IPFS, where _everything_ is content addressed, it's easy to globally see who has a given file. With Dat, you only know if somebody is looking for a specific dat. And if you don't know the URL, you don't know what's in it or who has it. If you're looking for a specific piece of content, it's impossible to know which dats contain it.

RangerMauve on 5 Jun 2018

👍1

Just returning to say that dat.json now has a links object.

https://github.com/datprotocol/dat.json

It's likely that'll be used for this feature.

This opens dat.json up to the possibility of using the subresource, prefetch, dns-prefetch, preconnect, prerender and preload features in browsers, so those are options now.

I vote for "subresource". It was a non-standard addition to Chrome (removed in Chrome 50), and while the term doesn't fit the HTTP web use case, I think it fits the Dat web well. Plus, many developers are already familiar with it, and its use in Dat sites wouldn't be far off from its original intent in Chrome (the only problem I can think of right now is confusion with the subresource-integrity feature).

EDIT: Also, we should lock this feature down to specific files included in Dats, not entire Dats, as I can definitely see this becoming a hard drive space problem in the future. We have to avoid the situation where someone new to all of this loads terabytes of files onto many computers just because they wanted to use X number of Dat-based CDNs.

HughIsaacs2 on 5 Jun 2018

@pfrazee sorry for necroposting, but I'm curious whether closing this issue means the idea has faded off the radar, or whether it has become irrelevant due to other developments. Might you have a pointer to discussions/publications reflecting the current state of play, if there are any?

You said above “I think subdats may end up being the solution for this but we'll see.”. And indeed, with the one-way mounts now having been introduced in Hyperdrive 10, I suppose one could mount all external resources’ drives and only use relative paths to point at them (though I guess you would have to mount their whole drives..). Does this solve the issue in your view?

Treora on 28 May 2020

PS: Also related seems to be this recent discussion in https://github.com/datproject/comm-comm/issues/134 about a format-agnostic approach to linked dats: “a generic seeding service should not need any data structure specific code to know how to seed the data.” (source)

Treora on 28 May 2020

@Treora I do think mounts are our answer for Beaker. Ultimately for commanding any remote to cohost data, I think the API will be based on hypercores, so then the client commanding the remote needs to be data-structure aware

pfrazee on 29 May 2020

@Treora thank you for linking the comm-comm issue and the source link.

If you want to discuss further I'll answer here https://github.com/playproject-io/datdot-research/issues/17#issuecomment-643041293
I think there are multiple approaches with different pros/cons, and I think a standard is needed, not only for key rotation/replacement/revocation but also for dependencies; having a custom solution per app/protocol/data structure is bad.
Also, it's different when people control domains and want to change the content for one, compared to providing proof that they have the write key to any given archive.
Yes, the latter can always be proven by challenging somebody to add a specific message, but why not avoid that by having a proper standard?

There are many reasons why feeds need to be linked: parent to dependant to dependencies, dependencies to dependant, domain to content, feed to author, related feeds amongst each other. I think it would be bad to have everyone (app/protocol/data structure) make those things up instead of following a general standard.