Beaker: Pagination on readdir

Created on 20 Apr 2018  ·  20 Comments  ·  Source: beakerbrowser/beaker

If readdir had support for pagination via cursors that would be wonderful.

Labels: PRs welcome!, enhancement

All 20 comments

Ok I'll look into this.

Polite bump. Thanks for considering! This becomes an issue when a folder has 50,000 files :)

I think this will be easier to implement once Hyperdb support lands. Hyperdrive doesn't have anything for streaming keys yet.

Yeah, for now it'd just be implemented in the Web API layer by throwing away results, which I'm okay with since eventually we should have the actual capability in the read API.
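A minimal sketch of what "implemented in the Web API layer by throwing away results" could look like: read the full listing, sort it, and return only the slice after a cursor. The names `readdirPage`, `pageAfterCursor`, `DirPage`, and the `after`/`limit` options are hypothetical and not part of Beaker's API; `DirReader` stands in for any object whose `readdir(path)` resolves to an array of names.

```typescript
// Hypothetical stand-in for an archive object; only readdir(path) is assumed.
interface DirReader {
  readdir(path: string): Promise<string[]>;
}

interface DirPage {
  entries: string[];
  // Name to pass as `after` for the next page, or null when nothing is left.
  nextCursor: string | null;
}

// Pure helper: sort the names, drop everything up to and including the
// cursor, and keep at most `limit` entries.
function pageAfterCursor(names: string[], after: string | null, limit: number): DirPage {
  if (limit < 1) throw new RangeError("limit must be at least 1");
  const sorted = names.slice().sort();
  const rest = after === null ? sorted : sorted.filter((n) => n > after);
  const entries = rest.slice(0, limit);
  const nextCursor = rest.length > limit ? entries[entries.length - 1] : null;
  return { entries: entries, nextCursor: nextCursor };
}

// Wrapper: still reads the whole directory, then throws away what is not needed.
async function readdirPage(
  archive: DirReader,
  path: string,
  opts: { after?: string | null; limit: number },
): Promise<DirPage> {
  const names = await archive.readdir(path);
  return pageAfterCursor(names, opts.after === undefined ? null : opts.after, opts.limit);
}
```

This keeps the cost of a full `readdir()` on every call, so it only saves work for the caller, not for the archive.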

In the future implementation, will the results be sorted by filename or by something else, such as the dat version at which the file was written?

Not sure, what would you prefer?

In the absence of a considerable performance overhead, I'd prefer directory entries to be sorted by the filename. This will give us the ability to use an archive as a database with indexes, for simple use-cases.

For example, if one wanted to get a user's posts in fritter sorted by createdAt/monotonic-timestamp-36 without having to wait for IndexedDB's indexing, one could just use the pagination feature here, provided the files are sorted by filename.
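To illustrate why filename ordering would be enough for this use case: if each post's filename is a fixed-width base-36 timestamp, sorting the names as strings also sorts the posts by creation time. The sketch below is an assumption for illustration only; `timestampToName`, `nameToTimestamp`, the 9-character width, and the `.json` extension are hypothetical and not Fritter's actual naming scheme.

```typescript
// 36^9 milliseconds spans a few thousand years, so 9 digits is plenty.
const NAME_WIDTH = 9;

// Encode a millisecond timestamp as a zero-padded base-36 filename.
function timestampToName(ms: number): string {
  if (ms < 0 || Math.floor(ms) !== ms) {
    throw new RangeError("expected a non-negative integer");
  }
  let digits = ms.toString(36);
  while (digits.length < NAME_WIDTH) digits = "0" + digits;
  return digits + ".json";
}

// Decode a filename produced by timestampToName back into milliseconds.
function nameToTimestamp(fileName: string): number {
  return parseInt(fileName.replace(/\.json$/, ""), 36);
}
```

The zero padding is what makes this work: without a fixed width, `"z"` (35) would sort after `"10"` (36).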

> For example, if one wanted to get a user's posts in fritter sorted by createdAt/monotonic-timestamp-36 without having to wait for IndexedDB's indexing

That's kind of what I'm doing with my dat-object-store library.

dat://da03b54ff070571e65e41766544e0924ca1212b212d881542fd1abcebb9593bb/dat-object-store/README.md#example

I'm creating folders for the individual values, then using readdir() to find all their names, and doing a filter on them if there's a range.

With this change I'd like to be able to do the range filter as part of the readdir() call. This would help for cases where the folder has a lot of subfolders to filter against.
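For reference, the range filter described here currently has to run over the full listing after `readdir()` returns. A sketch of that client-side step follows; `filterRange`, `NameRange`, and the `gte`/`lt` bounds are hypothetical names chosen for illustration, not dat-object-store's API.

```typescript
interface NameRange {
  gte?: string; // inclusive lower bound
  lt?: string; // exclusive upper bound
}

// Keep only the names inside the range, returned in sorted order.
function filterRange(names: string[], range: NameRange): string[] {
  return names
    .filter((n) => range.gte === undefined || n >= range.gte)
    .filter((n) => range.lt === undefined || n < range.lt)
    .sort();
}
```

If `readdir()` accepted such bounds itself, the caller would no longer need to receive every subfolder name just to discard most of them.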

@RangerMauve I'd be interested to look at that. I can't seem to find the README file or anything besides test.js:

https://datbase.org/dat://da03b54ff070571e65e41766544e0924ca1212b212d881542fd1abcebb9593bb/contents/dat-object-store

Update: test/ folder -> test.js

Weird, not sure why datbase wouldn't be showing everything.

Try the HTTPS link: https://modules-rangermauve.hashbase.io/dat-object-store/README.md

@RangerMauve Oh wow, were you calling that module "Dat object store" before I started using that term in the spec?

@RangerMauve That's neat! I was hoping for an efficient sorted index, though, and it seems that won't happen until efficient pagination support lands (if it does), or unless someone builds b-trees on top of DatArchive or something xD.

But really, nice job on what you did there!

@pfrazee I think so. 😅 According to my history I started working on dat-object-store in early September.

@hossameldeen Thanks! My goal was to speed up lookup by relying on the tree structures in hyperdrive. Having proper b-trees (if stored in a single file) in a hyperdrive would be super inefficient because you'd need to duplicate the file each time. I think that once we get a way to query a subset of keys in a folder, the method I took will be a lot more viable.

In the meantime it pretty much worked for what I wanted to do with it. Now I just need to find time to finish litter which makes use of dat-object-store for indexing posts and stuff.

@RangerMauve great minds think alike 😄! Though I'm considering a rename to something like "JSON folders" or "JSON Schema Folders" to help make it more intuitive. We'll see.

@RangerMauve other than better filename queries, what else would your ideal index-storage toolset look like? Indexing is going to be a huge part of building these applications and it would make sense for us to spend some time on the builtin toolset to accommodate it. How well has your dat-object-store performed?

(Also sorry for stealing your name)

@pfrazee No worries about the name, it's pretty obvious. 😂

I haven't really done benchmarking yet. I've got some basic unit tests running via webrun, but I haven't made anything real with it yet.

Once I get around to finishing litter I'll likely know more. If you have ideas on how I could test performance, that'd be good too. My guess is that inserts are going to be slow-ish, but searches are going to be fast, especially when you're filtering by known values, since I just need to do a readdir.

You can try running the unit tests with npx @rangermauve/webrun dat://da03b54ff070571e65e41766544e0924ca1212b212d881542fd1abcebb9593bb/dat-object-store/test.js

When I was designing the library, I was thinking about the fritter use case. You want to be able to find some number of posts for your timeline (filter all posts by createdAt timestamp, with limits), find replies for a post (filter by threadRoot, and time), or look at a person's profile timeline (filter by author and time).

I also wanted to support stuff like "tags" where you'd have an array of values that would be indexed so I could say "find all posts tagged cats", which is why I unravel arrays.
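One way to picture "unraveling arrays" is to emit one index folder per (field, value) pair, with each array element treated as its own value. This is only a sketch of the idea under assumptions: `indexPaths`, the `/index/...` layout, and the `Indexable` type are hypothetical and may differ from how dat-object-store actually lays out its folders.

```typescript
type IndexableValue = string | number | boolean;
type Indexable = { [field: string]: IndexableValue | IndexableValue[] };

// Return one folder path per (field, value) pair.
// Array fields contribute one path per element.
function indexPaths(id: string, doc: Indexable): string[] {
  const paths: string[] = [];
  for (const field of Object.keys(doc)) {
    const raw = doc[field];
    const values: IndexableValue[] = Array.isArray(raw) ? raw : [raw];
    for (const value of values) {
      paths.push("/index/" + field + "/" + encodeURIComponent(String(value)) + "/" + id);
    }
  }
  return paths;
}
```

With a layout like this, "find all posts tagged cats" would become a single `readdir()` on `/index/tags/cats`.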

The big thing missing from your dat-object-store proposal, IMO, is how indexes would work, but I'm not sure if that should even be a built-in feature.

We're getting totally off topic for the issue, sorry about that everybody. But not sorry enough to relocate.

> The big thing missing from your dat-object-store proposal, IMO, is how indexes would work, but I'm not sure if that should even be a built-in feature.

Yeah, it's definitely in the "not yet decided" zone.

In citizen, I tested performance just by training it onto my existing profile. Not a very scientific approach, but it worked. The architecture I chose was for the index to be entirely in-memory and just serialize to a JSON file. The thinking was: yes, it's no good for anybody trying to remotely access the index with regularity, but it's easy to write and easy to make perform well. I figured the index would only track values up to a certain limit, and that's how it'd avoid getting too big for memory. As you'd expect, once the index was loaded, it was very fast.
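The architecture described above (an in-memory index with a size cap, serialized to a single JSON file) can be sketched as follows. This is not citizen's code: `MemoryIndex`, `IndexEntry`, and the choice to evict the smallest keys once the cap is reached are all assumptions made for illustration.

```typescript
interface IndexEntry {
  key: string;
  value: string;
}

class MemoryIndex {
  private entries: IndexEntry[] = [];
  private maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  // Insert while keeping entries sorted by key; once past the cap,
  // drop the smallest keys (assumed eviction policy).
  add(key: string, value: string): void {
    this.entries.push({ key: key, value: value });
    this.entries.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
    const overflow = this.entries.length - this.maxEntries;
    if (overflow > 0) this.entries.splice(0, overflow);
  }

  // Return the values whose keys fall in [gte, lt).
  range(gte: string, lt: string): string[] {
    return this.entries
      .filter((e) => e.key >= gte && e.key < lt)
      .map((e) => e.value);
  }

  // The string that would be written to the JSON file.
  serialize(): string {
    return JSON.stringify(this.entries);
  }

  // Rebuild an index from the contents of the JSON file.
  static deserialize(json: string, maxEntries: number): MemoryIndex {
    const index = new MemoryIndex(maxEntries);
    const saved = JSON.parse(json) as IndexEntry[];
    for (const e of saved) index.add(e.key, e.value);
    return index;
  }
}
```

The trade-off matches the comment: lookups after loading touch only memory, but every change rewrites the whole serialized file.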

That makes sense. I was trying to also reduce the number of changing files so that you wouldn't have a bunch of duplicated content in the content hypercore.

On a side note it still concerns me a little that there's no compression or diffing going on when changing a file, but I guess that if it's good enough for git, it should be good enough for Dat.

> On a side note it still concerns me a little that there's no compression or diffing going on when changing a file, but I guess that if it's good enough for git, it should be good enough for Dat.

Agree. It'll get better at some point. Maf has some ideas.

Experience sharing: that's how I've done it:

async getRowsUuids(datArchive: DatArchive, tableName: string, opts?: { reverse?: boolean, start?: number, count?: number }): Promise<{ entries: Array<string>, totalCount: number }>

(Assume tableName is directory path)

As outlined in hossameldeen/transparent-salaries#8, the points on this API:

(1) Consistency with DatArchive.history (and Array.slice, ignoring negatives). I find count easier to use than end, but probably end would be better.

(2) In the layer below DatArchive, is totalCount already there or will it require extra work? I'm not fixed on it if it's not there, I just added it because I already had it with the current implementation, and it helped with not showing "Load more salaries" when I've reached the end, to free the user from an extra click.

Anyway, just thought about sharing a usage :-)
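A sketch of how a function with the `getRowsUuids` signature above could be backed by the existing `readdir()`. This is a guess at the shape, not the commenter's implementation: `sliceRows`, `RowsOpts`, and `RowsPage` are hypothetical, and `DirLister` stands in for `DatArchive`.

```typescript
// Hypothetical stand-in for DatArchive; only readdir(path) is assumed.
interface DirLister {
  readdir(path: string): Promise<string[]>;
}

interface RowsOpts {
  reverse?: boolean;
  start?: number;
  count?: number;
}

interface RowsPage {
  entries: string[];
  totalCount: number;
}

// Pure helper: sort, optionally reverse, then take `count` names from `start`.
function sliceRows(names: string[], opts: RowsOpts = {}): RowsPage {
  const sorted = names.slice().sort();
  if (opts.reverse) sorted.reverse();
  const start = Math.max(0, opts.start === undefined ? 0 : opts.start);
  const end = opts.count === undefined ? undefined : start + Math.max(0, opts.count);
  return { entries: sorted.slice(start, end), totalCount: sorted.length };
}

// Reads the whole directory (tableName is the directory path), then slices.
async function getRowsUuids(
  archive: DirLister,
  tableName: string,
  opts?: RowsOpts,
): Promise<RowsPage> {
  return sliceRows(await archive.readdir(tableName), opts);
}
```

Because the full listing is already in memory here, `totalCount` comes for free; a caller can hide a "Load more" button once `start + entries.length >= totalCount`.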
