Previously Deno cached HTTP resources in a directory structure that was similar to the URL
It looked like this
$DENO_DIR/deps/https/deno.land/std/examples/curl.ts
This was a very simple and nice system as it's very obvious how URLs map to the cache. VS code plugin made assumptions about this layout.
However we discovered there are some URLs that don't easily map to this structure - in particular URLs with query strings, or URLs that don't have a file extension but serve TS/JS. (See #4030)
So now we use a cache structure which is flat and hashes URL
$DENO_DIR/deps/https/deno.land/11d2ae31f7817c87b53893788205d431da511568dcd7bb9c363c6e52fb21a7fd
The VS Code plugin is very important - so let's figure out how to resolve this problem.
cc @axetroy @bartlomieju
What if there was a manifest in $DENO_DIR/deps
that contained information like:
{
"https://deno.land/foo/bar.ts": "https/deno.land/11d2ae31f7817c87b53893788205d431da511568dcd7bb9c363c6e52fb21a7fd"
}
Potentially more data could be cached in the key, like if there were redirects from the requested URL to the final one.
Using manifest file was my initial thought, but we decided to skip it because URLs are consistently hashed. Adding such file, should not be a big problem though.
Waiting for @axetroy to let us know what are requirements for extension to make it work
a single manifest seem not a good solution.
Manifests will become bloated easily and caught performance
Or a file with a manifest.json maybe good.
My idea is to keep the url.path
part and only hash filename
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β href β
ββββββββββββ¬βββ¬ββββββββββββββββββββββ¬βββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββ¬ββββββββ€
β protocol β β auth β host β path β hash β
β β β βββββββββββββββββββ¬βββββββΌβββββββββββ¬βββββββββββββββββ€ β
β β β β hostname β port β pathname β search β β
β β β β β β βββ¬βββββββββββββββ€ β
β β β β β β β β query β β
" https: // user : pass @ sub.example.com : 8080 /p/a/t/h ? query=string #hash "
β β β β β hostname β port β β β β
β β β β βββββββββββββββββββ΄βββββββ€ β β β
β protocol β β username β password β host β β β β
ββββββββββββ΄βββΌβββββββββββ΄βββββββββββΌβββββββββββββββββββββββββ€ β β β
β origin β β origin β pathname β search β hash β
βββββββββββββββ΄ββββββββββββββββββββββ΄βββββββββββββββββββββββββ΄βββββββββββ΄βββββββββββββββββ΄ββββββββ€
β href β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
eg.
https://deno.land/std/examples/curl.ts
$DENO_DIR/deps/https/deno.land/std/examples/[hash1]
https://deno.land/std/examples/curl.ts?foo=bar
$DENO_DIR/deps/https/deno.land/std/examples/[hash2]
I'm not sure if this will include special characters
if lose directory structure but it is preserved in the module's source code.
// mod.ts
import "./folder1/folder2/mod.ts"
Generated cache files, they are in the same directory. It feels weird
BTW.
deno info
can analyze file dependencies, but it does not help cache module dependencies
eg.
$ deno info https://deno.land/std/path/mod.ts
local: $DENO_DIR/deps/https/deno.land/72ee5916977ca9d8801c801f642353d811373786e51e3d7574cca966634b4f97
type: TypeScript
compiled: $DENO_DIR/gen/https/deno.land/std/path/mod.ts.js
map: $DENO_DIR/gen/https/deno.land/std/path/mod.ts.js.map
deps:
https://deno.land/std/path/mod.ts
βββ¬ https://deno.land/std/path/win32.ts
β βββ https://deno.land/std/path/constants.ts
β βββ¬ https://deno.land/std/path/utils.ts
β β βββ https://deno.land/std/path/constants.ts
β βββ¬ https://deno.land/std/testing/asserts.ts
β βββ https://deno.land/std/fmt/colors.ts
β βββ https://deno.land/std/testing/diff.ts
β βββ¬ https://deno.land/std/testing/format.ts
β βββ https://deno.land/std/testing/asserts.ts
βββ¬ https://deno.land/std/path/posix.ts
β βββ https://deno.land/std/path/constants.ts
β βββ https://deno.land/std/path/utils.ts
βββ https://deno.land/std/path/constants.ts
βββ¬ https://deno.land/std/path/glob.ts
β βββ https://deno.land/std/path/constants.ts
β βββ https://deno.land/std/path/globrex.ts
β βββ https://deno.land/std/path/mod.ts
β βββ https://deno.land/std/testing/asserts.ts
βββ https://deno.land/std/path/globrex.t
You can find https://deno.land/std/path/mod.ts
which modules it depends on, but you don't know what the cache file$DENO_DIR/deps/https/deno.land/72ee5916977ca9d8801c801f642353d811373786e51e3d7574cca966634b4f97
depends on
And, using deno info xxx
to analyze modules is not good.
Because for Typescript Plugin, all operations are synchronous (no asynchronous).
Means that every time the analysis module runs the command deno info
It will block the entire Typescript Service
Therefore, it is good to maintain a regular directory structure.
We can infer the path of the local cache file through the HTTP module
https://deno.land/std/path/mod.ts
>>>>>>$DENO_DIR/deps/https/deno.land/72ee5916977ca9d8801c801f642353d811373786e51e3d7574cca966634b4f97
Remote URLs can also be inferred from locally cached paths
$DENO_DIR/deps/https/deno.land/72ee5916977ca9d8801c801f642353d811373786e51e3d7574cca966634b4f97
>>>>>>https://deno.land/std/path/mod.ts
@axetroy I secondary benefit of the hashed files was that it would fix the collision between importing 'example.com/mymodule' and 'example.com/mymodule/mod.ts', because before this would not work because 'directory could not be created because a file with name mymodule already exists'.
I also think creating huge manifest files will create some perf issues, but I think currently they are the best idea due to all of the benefits that the hashed file names provide. What if there was one manifest per origin?
I'm not sure it would be that horrible versus the current situation. We already write out headers and other stuff, and we don't expire stuff in the cache, so if we are intelligent about it, we wouldn't even touch the file unless we are actually writing out something to the cache. Also Deno shouldn't use it internally as an index or anything, it simply should write stuff out. If someone deletes only a part of a cache but doesn't delete the manifest file, that is there problem.
eg.
https://deno.land/std/examples/curl.ts
$DENO_DIR/deps/https/deno.land/std/examples/[hash1]
https://deno.land/std/examples/curl.ts?foo=bar
$DENO_DIR/deps/https/deno.land/std/examples/[hash2]
This will not work, because of the collisions @lucacasonato mentioned.
You can find https://deno.land/std/path/mod.ts which modules it depends on, but you don't know what the cache file > $DENO_DIR/deps/https/deno.land/72ee5916977ca9d8801c801f642353d811373786e51e3d7574cca966634b4f97 depends on
I'm not sure I understand this argument - contents of the file are not changed; it's just a filename in DENO_DIR
that is changed compared to original request.
Also DENO_DIR
is implementation detail of Deno and shouldn't really be played with by users.
I'm not sure it would be that horrible versus the current situation. We already write out headers and other stuff, and we don't expire stuff in the cache, so if we are intelligent about it, we wouldn't even touch the file unless we are actually writing out something to the cache. Also Deno shouldn't use it internally as an index or anything, it simply should write stuff out. If someone deletes only a part of a cache but doesn't delete the manifest file, that is there problem.
Agreed; providing a "manifest" file (which is basically mapping between path
part of URL and filename in DENO_DIR
) for each origin/host is very simple; if that would unblock the extension I can open a PR today.
I don't think they are collisions in the @axetroy proposition the hash is calculate with the part of the last path : after the last / => filename.
@lucacasonato exemple :
https://example.com/mymodule
mymodule is filename and not the directory =>
$DENO_DIR/deps/https/example.com/[hash1]
https://example.com/mymodule/mod.ts
mod.ts is filename =>
$DENO_DIR/deps/https/example.com/mymodule/[hash2]
https://example.com/mymodule/
mymodule is the directory the filename is ''=>
$DENO_DIR/deps/https/example.com/mymodule/[hash3]
https://example.com/?mymodule/
?mymodule/ is the filename because they are ?=>
$DENO_DIR/deps/https/example.com/[hash4]
the part after # is ignored because is a fragment of page
Sorry for my bad english ;)
@bardiarastin
Also DENO_DIR is implementation detail of Deno and shouldn't really be played with by users.
Deno users should not care about $ DENO_DIR
However, developers of third-party tools may need, eg, development of an editor extension
/cache module update tool
/ dependency analysis
, etc.
Agreed; providing a "manifest" file (which is basically mapping between path part of URL and filename in DENO_DIR) for each origin/host is very simple; if that would unblock the extension I can open a PR today.
This is probably the simplest solution and does not break the current cache layout.
But manifest
may cause performance issues because it requires frequent file reading and serialization and deserialization
Sorry i didn't describe it clearing why the new cache layout broke vscode extension
Take the module https://deno.land/std/path/mod.ts
as an example
Remote File: https://deno.land/std/path/mod.ts
Cache File: $DENO_DIR/deps/https/deno.land/std/path/mod.ts
if we open the cache file in editor you will see
// Copyright the Browserify authors. MIT License.
// Ported from https://github.com/browserify/path-browserify/
import * as _win32 from "./win32.ts";
import * as _posix from "./posix.ts";
import { isWindows } from "./constants.ts";
const path = isWindows ? _win32 : _posix;
export const win32 = _win32;
export const posix = _posix;
export const resolve = path.resolve;
export const normalize = path.normalize;
export const isAbsolute = path.isAbsolute;
export const join = path.join;
export const relative = path.relative;
export const toNamespacedPath = path.toNamespacedPath;
export const dirname = path.dirname;
export const basename = path.basename;
export const extname = path.extname;
export const format = path.format;
export const parse = path.parse;
export const sep = path.sep;
export const delimiter = path.delimiter;
export { EOL, SEP, SEP_PATTERN, isWindows } from "./constants.ts";
export * from "./interface.ts";
export * from "./glob.ts";
export * from "./globrex.ts";
Q: So how does the extension find the . /win32.ts
module?
A:
we can get the current cache file's url(https://deno.land/std/path/mod.ts
) from filepath($DENO_DIR/deps/https/deno.land/std/path/mod.ts
)
so. The module./win32.ts
url will be https://deno.land/std/path/win32.ts
. and cache file $DENO_DIR/deps/https/deno.land/std/path/win32.ts
Everything is alright.
Remote File: https://deno.land/std/path/mod.ts
Cache File: $DENO_DIR/deps/https/deno.land/[hash4mod]
Then in file $DENO_DIR/deps/https/deno.land/[hash4mod]
, we cannot find module ./win32.ts
./win32.ts
module's cache file should be $DENO_DIR/deps/https/deno.land/[hash4win32]
hash4win32 = SHA256('/std/path/win32.ts')
As long as we can calculate the hash4win32
, then we can find the ./win32.ts
module
Unfortunately, we donβt know what the current URL.
So we need a mapping to get url: $DENO_DIR/deps/https/deno.land/[hash4mod]
> https://deno.land/std/path/mod.ts
Here we get /std/path/mod.ts
and then get /std/path/win32.ts
Finally, we can calculate hash4win32
and resolve module.
@xpyxel
Deno may not need it because deno run xxx.ts
only reads the file once
But in the Typescript plugin, it does need to be read frequently
Every time the file changesrecompilation in editor, needs to read-> serialize.
Yeah I realized that after I commented, sorry about that lol @axetroy
Hi, this is the third time that I'm being mentioned instead of someone else on this project but for the third time THANK YOU FOR ALL THE HARD WORK ON DENO β€οΈ
@IllusionPerdu thanks for the example, but the problem is you used URLs which all could be valid filenames. Take for example jsdom
; the very first import statement reads:
import { dew } from "/npm:[email protected]/lib/api.dew.js";
. AFAIK colon cannot be used in filename on Windows, so this solution will not work.
@axetroy thanks for explanation, now I see what's the problem clearly. I don't have a good solution off top of my head right now besides the manifest/mapping file. I understand your concerns about serialization performance; however maybe we should start with this solution as it's pretty simple and see if it's really too slow?
@bartlomieju
I'm not sure if it's really slow, but we can try, at least it won't break the current cache layout.
And, can you print what the manifest/mapping file looks like if use this solution?
@axetroy I think we can do it per origin (so per directory in cache)
// $DENO_DIR/deps/deno.land/manifest.json
{
"/std/testing/asserts.ts": "<SHA1 hash>",
"/[email protected]/fmt/strings.ts": "<SHA1 hash>",
"/x/some?other=path": "<SHA1 hash>",
...
}
Would that work?
@bartlomieju it works and look good for me.
Good, I'll get to implementing it shortly
work fine with manifest.json
You would have to add a command in Deno to check the consistency between the cache files and the manifest because for some reason someone can manually delete files in the cache or edit the manifest.
For security, the manifest could by the same designation define an imprint of the files.
// $DENO_DIR/deps/deno.land/manifest.json
{
"/std/testing/asserts.ts": {file:"", crc:" "},
"/[email protected]/fmt/strings.ts": {file:"", crc:" "},
"/x/some?other=path": {file:"", crc:" "},
...
}
I'm worried about the speed of manifest.json when it's filled with many entries. Reading and writing to the cache becomes O(N) where it was previously O(1).
Would a distributed manifest be a problem. We already write out the headers, so write out in some sort of meta data file the original URL? That would put the work on the plugin to go and collect all the meta-data files to create a map, and what the FS events on the DENO_DIR for any new files and add them to the cache. For Deno it would be still be O(1).
@kitsonk By distributed manifest you mean having a manifest per directory? I guess that would work... but I'm still wondering if we can't avoid the manifest entirely.
@axetroy I'm a unclear why the VS Code plugin cannot call "deno info" ? We can modify "deno info" to fit your needs. It would be useful if the DENO_DIR was a private structure - that would allow us to do refactors more easily.
@ry Yes, I agree to privatize $DENO_DIR
This way, even if the cache layout changes again, third-party tools will not need to change
If we no longer depend on $DENO_DIR
Then deno info
needs to provide the mapping function of url <---> filepath
eg.
$ deno info $DENO_DIR/https/example.com/[hash]
# output
https://example.com/demo/mod.ts
$ deno info https://example.com/demo/mod.ts
# output
$DENO_DIR/https/example.com/[hash]
@ry
I'm a unclear why the VS Code plugin cannot call "deno info" ?
Commands can be called, but synchronously
This will cause the module parsing to be very slow, as it needs to frequently call deno info xxx
@axetroy how about we add an option to βdeno infoβ so it will not download nor compile but only look in the cache? We can also add an option for JSON output.
Could you provide a link to where in the plugin βdeno infoβ would be called?
@ry
@axetroy how about we add an option to βdeno infoβ so it will not download nor compile but only look in the cache? We can also add an option for JSON output.
SGTM
In fact, I have already implemented the solution of mianfest.json
here is how extension found Deno module.
But I rethought a while andI think deno info xx
is feasible
@bartlomieju
It seems that @ry is disagree with the mianfest.json
solution, and wanna privatizes $DENO_DIR
So, If we use the deno info
solution and make APIs similar to the following as much as possible
# locate from filepath
deno info $DENO_DIR/deps/https/deno.land/[hash] --cache-only --output=json
# locate from URL
deno info https://deno.land/std/path/mod.ts --cache-only --output=json
Stdout output:
{
"filepath": "$DENO_DIR/deps/https/deno.land/[hash]",
"url": "https://deno.land/std/path/mod.ts",
"type": "TypeScript"
}
If you use this approach, in TypeScript, you really need to call deno info
frequently.
So I hope there are flags to make the json as short as possible
eg. module's dependency tree is optional. There are some modules(https://dev.jspm.io/jsdom) with a lot of dependencies and it is not necessary.
@axetroy yes, I discussed this issue with @ry
There's one thing I am not sure about:
# Map URL to filepath
deno info https://deno.land/std/path/mod.ts --cache-only --output=json
Above example is self-explanatory and easy to implement; what about the one below:
# Map filepath to URL
deno info $DENO_DIR/deps/https/deno.land/[hash] --cache-only --output=json
Since we want to privatize DENO_DIR
then why do you need "reverse lookup"? Was that mistakenly put there, or you actually need the same functionality for "hashed" filenames?
That's also ref for #4140
Since we want to privatize
DENO_DIR
then why do you need "reverse lookup"? Was that mistakenly put there, or you actually need the same functionality for "hashed" filenames?
https://deno.land/std/path/mod.ts
to $DENO_DIR/deps/https/deno.land/[hash]
just need to calculate SHA256
of /std/path/mod.ts
and it's easy.
But what I really need is map $DENO_DIR/deps/https/deno.land/[hash]
to https://deno.land/std/path/mod.ts
Because in actual use case, I open the $DENO_DIR/deps/https/deno.land/[hash]
file in the editor. I have to figure out what its URL is in order to find where its dependent modules are.
If without additional information, it seems that Deno cannot infer what its URL is from $DENO_DIR/deps/https/deno.land/[hash]
This might be stupid, but why do we actually hash the file names in the first place? To make sure we have a valid set of characters for the filesystem right? Not for security. Couldn't we just use base64 encoding or a fast terrible encryption algorithm? The name doesn't need to be one way hashed because it needs to be secure - it just needs to be rewritten to valid filesystem chars.
base64 is reversible so that would fix the issue right?
This might be stupid, but why do we actually hash the file names in the first place? To make sure we have a valid set of characters for the filesystem right? Not for security.
That's right
Couldn't we just use base64 encoding or a fast terrible encryption algorithm? The name doesn't need to be one way hashed because it needs to be secure - it just needs to be rewritten to valid filesystem chars.
base64 is reversible so that would fix the issue right?
Now that I think about it, it might actually work, but it would rather be Base32 - that way we'd handle both case-sensitive and case-insensitive file systems.
EDIT: There might be the issue of filename length though.
You're right - filename length could be an issue (base32 will increase file length by an average of 60%). A solution could be to split the base32 encoded strings into 255 char segments (this would fit 99% of filesystems), and then have a folder structure like this: $DENO_DIR/deps/https/segment1/segment2/segment3
. Short file names would just be $DENO_DIR/deps/https/segment1
.
Stitching these back together into a single string or splitting them apart is trivial.
Just to point out, base32 or base64 do seem a bit of overkill... they are designed to encode arbitrary data. URLs can only contain a certain set of characters. Though I am not aware of an encoding that is file safe that only deals with encoding valid URL characters.
The following text:
raw.githubusercontent.com/microsoft/TypeScript/v3.7.2/lib/lib.dom.d.ts
Encodes in base-32 as:
Also, from an encoding perspective, domains are case-insensitive, while that path and query are case sensitive.
Why not try Punycode?
The example string you used above:
raw.githubusercontent.com/microsoft/TypeScript/v3.7.2/lib/lib.dom.d.ts
would result in the following in punycode:
a8d609a3d662b370256d531e451b7d1781d78e7a
compared to the Base32 example you provided, it is 73 characters less.
@xpyxel how were you encoding it via punycode?
Punycode usually is for taking non-ASCII UNICODE chars and encoding them in a URL. When I do a punycode encoding of example https://π±.com/
would become https://xn--s38h.com/
. For the example string, I get exactly what is in the example string, because they are valid encodings.
@kitsonk seems it was just a bug in the website i was using to test different encodings.
Why not use something like https://github.com/mit-pdos/noria (Rust)
or even SQLite instead of a JSON table to perform the reverse lookup?
That should be more performant than constantly parsing, stringifying, and writing JSON to disk.
Still, a descriptive filename should be considered a priority. Even with a language server, opening files which don't have the expected filename is not ideal, especially in other editors. At the very least, it should be:
$DENO_DIR/deps/https/deno.land/11d2ae31f7817c87b53893788205d431da511568dcd7bb9c363c6e52fb21a7fd/curl.ts
instead of
$DENO_DIR/deps/https/deno.land/11d2ae31f7817c87b53893788205d431da511568dcd7bb9c363c6e52fb21a7fd
There are only a few characters which are invalid in filenames:
/<>:"/\|?*`
I would think a better approach would be to replace those characters and then always add a file extension for http/https imports. So curl.ts
becomes curl.ts.ts
which allows imports that don't have an extension (it is added) and it doesn't conflict with a url that just ends with curl
(it becomes curl.ts
on disk).
@brandonkal I think it is unavoidable once you consider query strings would also make unique resources in the cache.
I also think you are missing &
in your list of characters too.
There are only a few characters which are invalid in filenames:
You also have to consider:
~
and '
are among them. On windows I would avoid % and ! and probably some others too.So, although I do understand where the desire to have recognizable file names comes from, it is not as straightforward as you make it sound -- there are many more gotchas.
As a compromise I could see a solution where we strip clusters of undesirable characters from the URL and replace them by an underscore, strip the ?query&etc=42
part, and then append a hash of the entire URL to it. So e.g /some...thing...//WUT* //odd.ts?blah
becomes /some_thing_/wut_/_/odd.ts-1cb1ab6c0
.
I'm not sure why query strings changes things unless the old behaviour was to strip them off before creating a file path. If you replace ?
with -1-
for instance (numbering each invalid character) it doesn't seem to be an issue.
As far as I am aware, '&' is legal on unix and Windows but it does create issues with shell interpretation so it would be good to add that to the list.
Good points. Something like
"$DENO_DIR/[id-in-DB or hash]/[stripped-file-basename].[appended-extension-from-content-type]"
would be a nice compromise for readable filenames in editors and search.
@axetroy In https://github.com/denoland/deno/commit/0099c28db35a63678e04b31434d38db52a11c3a5 I changed the cache so that in the download cache you have $HASH.metadata.json
. In there you will find the original URL. This is a stop gap solution - ultimately I want to improve deno info
- but maybe this will unblock you in 0.35.0
.
@ry I have built Deno v0.35.0 locally and it works find.
waiting v0.35.0 release, then extensions will be released soon
[email protected] have been release and it works again with Deno v0.35.0, it is up to @ry to decide whether to close the issue
If implement the deno info
API in the future or refactor a new cache layout, @ me again
@axetroy Awesome - glad to hear it's unbroken. I will close this issue but keep working on the "deno info" improvements.
Most helpful comment
[email protected] have been release and it works again with Deno v0.35.0, it is up to @ry to decide whether to close the issue
If implement the
deno info
API in the future or refactor a new cache layout, @ me again