To reduce the space / network bandwidth used by data files:
+1
These features would be great, since text files often compress by up to ~70%, which matters when dealing with big DBs.
A better UX for setting and updating a site size limit would be a good companion to these features (i.e. what happens when a site surpasses its size limit).
Can I suggest automatic minification of html/css/js as well? This would further reduce space usage and could be done at the publishing stage of the website. It won't make any difference if the files are already minified, but I doubt all users bother to minify manually.
The main problem is not storing the js/html files, but the databases. Removing whitespace from json files could help, but support for compressed databases would be a better solution.
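For reference, stripping the whitespace is just a matter of re-serializing the json; a minimal sketch (file name is only an example):

```python
import json

# Re-serializing without optional whitespace is the simplest "compression":
# it only removes formatting, the data itself is unchanged.
with open("data.json") as f:          # hypothetical data file
    data = json.load(f)

compact = json.dumps(data, separators=(",", ":"), sort_keys=True)

with open("data.json", "w") as f:
    f.write(compact)
```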
I almost wonder if it would be more efficient to store everything in one big database (data and all files for websites) rather than having SQLite constantly loading up JSONs and writing to disk. Then you could just compress the entire database. Many databases also have BSON (a binary JSON format) that would store the JSON data very efficiently (e.g. a boolean stored as 1 bit vs. a "true"/"false" string).
It wouldn't help with network transmission at all, though, and the disadvantage is reduced human readability, but it is an option.
Unfortunately, as far as I know, SQLite does not have compressed database support, and it would probably be slow to select large files from the database every time a user requests them.
Currently every database's data is stored twice: as json files and as an SQLite cache. In theory it's possible to store the data only in the database, but then modify/sign/send/validate would use much more CPU and disk I/O, because to calculate the md5 hash of the data file you have to select all of its current data from the database. I have not made any benchmark for this yet, though.
Space usage of the current ZeroTalk files (208 users, 130 topics, 600 comments, 550 upvotes):
The sqlite db is larger than the json files, probably because of the indexes.
zipvfs is an extended version of SQLite that allows for compression.
Sadly it's not free: "You should only be able to see this software if you have a license. If you do not have a valid license you should delete the source code in this folder at once."
I have found an open-source Python implementation, but it isn't any sort of library, more just a gist that works but isn't easy to integrate into anything.
I did some experiments on compression speed with a content.json file listing 12,000 files:
Original: 1874.51KB

| Method  | Compress | Size     | Decompress |
|---------|----------|----------|------------|
| Zlib    | 0.074s   | 558.68KB | 0.010s     |
| Gzip    | 0.130s   | 547.09KB | 0.011s     |
| Deflate | 0.125s   | 547.04KB | 0.009s     |
| Bz2     | 0.291s   | 482.88KB | 0.089s     |
| Bro     | 7.856s   | 438.01KB | 0.012s     |
Google's new Brotli offers a nice compression ratio, but the compression time cost is huge.
Looks like gzip offers the best speed/compression trade-off. (Maybe support both .gz and .bz2?)
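For anyone who wants to repeat the measurement, here is a rough sketch using only standard-library codecs (Brotli and zstd need the third-party `brotli` / `zstandard` modules, and exact numbers will of course differ by machine):

```python
import bz2
import gzip
import lzma
import time
import zlib

# Rough reproduction of the measurements above; any large json file will do.
with open("content.json", "rb") as f:
    raw = f.read()

codecs = {
    "Zlib": (zlib.compress, zlib.decompress),
    "Gzip": (gzip.compress, gzip.decompress),
    "Bz2":  (bz2.compress, bz2.decompress),
    "Lzma": (lzma.compress, lzma.decompress),
}

print("Original %.2fKB" % (len(raw) / 1024.0))
for name, (compress, decompress) in codecs.items():
    t = time.time()
    packed = compress(raw)
    c_time = time.time() - t

    t = time.time()
    decompress(packed)
    d_time = time.time() - t

    print("%-7s %.3fs %.2fKB %.3fs" % (name, c_time, len(packed) / 1024.0, d_time))
```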
The idea:
The html, css, js, etc. files would remain uncompressed (both in storage and in transfer), because they are re-transferred very rarely and I don't think it's worth the extra (de)compression CPU time on every transfer.
Edit: Added LZMA/Brotli compression levels:
Original: 1874.51KB

| Method  | Compress | Size     | Decompress |
|---------|----------|----------|------------|
| Zlib    | 0.069s   | 558.68KB | 0.011s     |
| Gzip    | 0.132s   | 547.09KB | 0.011s     |
| Deflate | 0.131s   | 547.04KB | 0.009s     |
| Bz2     | 0.299s   | 482.88KB | 0.098s     |
| Lzma    | 1.372s   | 455.77KB | 0.057s     |
| Bro/1   | 0.035s   | 506.78KB | 0.011s     |
| Bro/2   | 0.046s   | 504.26KB | 0.010s     |
| Bro/3   | 0.052s   | 502.25KB | 0.011s     |
| Bro/4   | 0.059s   | 507.89KB | 0.010s     |
| Bro/5   | 0.160s   | 527.18KB | 0.012s     |
| Bro/6   | 0.221s   | 527.73KB | 0.012s     |
| Bro/7   | 0.269s   | 527.06KB | 0.012s     |
| Bro/8   | 0.305s   | 526.27KB | 0.012s     |
| Bro/9   | 0.373s   | 525.81KB | 0.012s     |
| Bro/10  | 8.111s   | 438.01KB | 0.012s     |
| Bro/11  | 8.075s   | 438.01KB | 0.012s     |
Brotli level 1 looks good and fast. (Cons: one more binary dependency, and the Python module is not on pip yet.)
Any news? Also, maybe the test should be updated because they changed a lot of things in Brotli
It's not planned yet.
Ok
I would be happy to be able to put .gz files in a ZeroSite and have the server serve them with Content-Encoding: gzip when the client requests the file without .gz and sends Accept-Encoding: gzip; otherwise the local daemon should decompress before delivering to the browser. The latter part is really optional, since there probably aren't any browsers that work with ZeroNet that don't support gzip.
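Not ZeroNet's actual file-serving code, just a sketch of the negotiation I have in mind (the function name and paths are made up):

```python
import gzip
import os


def serve(path, accept_encoding):
    """Return (extra_headers, body), preferring a pre-compressed .gz sibling
    file when the client accepts gzip. Purely illustrative."""
    gz_path = path + ".gz"
    if os.path.exists(gz_path):
        with open(gz_path, "rb") as f:
            body = f.read()
        if "gzip" in accept_encoding:
            # Client can decode it: send the stored .gz bytes untouched.
            return {"Content-Encoding": "gzip"}, body
        # Old client: fall back to decompressing locally before delivery.
        return {}, gzip.decompress(body)
    with open(path, "rb") as f:
        return {}, f.read()
```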
Added Facebook's zstd:
Original: 1874.51KB

| Method  | Compress | Size     | Decompress |
|---------|----------|----------|------------|
| Zlib    | 0.071s   | 558.68KB | 0.010s     |
| Gzip    | 0.131s   | 547.09KB | 0.012s     |
| Deflate | 0.129s   | 547.04KB | 0.010s     |
| Bz2     | 0.299s   | 482.88KB | 0.108s     |
| Lzma    | 1.343s   | 455.77KB | 0.058s     |
| Bro/1   | 0.031s   | 506.78KB | 0.011s     |
| Bro/2   | 0.042s   | 504.26KB | 0.010s     |
| Bro/3   | 0.049s   | 502.25KB | 0.011s     |
| Bro/4   | 0.058s   | 507.89KB | 0.011s     |
| Bro/5   | 0.159s   | 527.18KB | 0.012s     |
| Bro/6   | 0.236s   | 527.73KB | 0.012s     |
| Bro/7   | 0.298s   | 527.06KB | 0.012s     |
| Bro/8   | 0.324s   | 526.27KB | 0.012s     |
| Bro/9   | 0.401s   | 525.81KB | 0.012s     |
| Bro/10  | 8.307s   | 438.01KB | 0.013s     |
| Bro/11  | 8.755s   | 438.01KB | 0.013s     |
| Zstd/1  | 0.017s   | 502.58KB | 0.010s     |
| Zstd/3  | 0.058s   | 521.81KB | 0.024s     |
| Zstd/5  | 0.152s   | 523.55KB | 0.038s     |
| Zstd/7  | 0.276s   | 518.76KB | 0.050s     |
| Zstd/9  | 0.470s   | 507.92KB | 0.063s     |
| Zstd/11 | 0.695s   | 507.03KB | 0.076s     |
| Zstd/13 | 0.957s   | 505.98KB | 0.089s     |
| Zstd/15 | 1.352s   | 504.79KB | 0.101s     |
| Zstd/17 | 1.838s   | 478.77KB | 0.111s     |
| Zstd/19 | 2.939s   | 465.31KB | 0.121s     |
| Zstd/21 | 4.155s   | 458.96KB | 0.132s     |
Zstd's dictionary training function could be interesting for us (it allows a shared dictionary across multiple compressed files).
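A minimal sketch of how the dictionary training could be used with the third-party `zstandard` bindings (file names and dictionary size are just examples):

```python
import zstandard  # third-party "zstandard" bindings, not in the stdlib

# Train a shared dictionary on a set of small, similar json files, then reuse
# it for both compression and decompression of each file.
samples = []
for path in ("user1/data.json", "user2/data.json", "user3/data.json"):
    with open(path, "rb") as f:
        samples.append(f.read())

dict_data = zstandard.train_dictionary(16 * 1024, samples)

compressor = zstandard.ZstdCompressor(dict_data=dict_data, write_content_size=True)
decompressor = zstandard.ZstdDecompressor(dict_data=dict_data)

packed = compressor.compress(samples[0])
assert decompressor.decompress(packed) == samples[0]
```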
@HelloZeroNet any plans to implement this?
I ask because if you say this is not on the roadmap for the next months, I'll probably implement it at the site level.
Thanks
There is no roadmap, so I can't tell if it's on it or not. It's planned, but I'm not sure yet whether it will also support database files or only static ones.
I have added .tar.gz, .tar.bz2, .zip support in latest rev: https://github.com/HelloZeroNet/ZeroNet/commit/2854e202e17926f136c053646f3530e1e1c9956d
example site: http://127.0.0.1:43110/1AsRLpuRxr3pb9p3TKoMXPSWHzh6i7fMGi/en.tar.bz2/index.html (please update your client before visit)
en dir uncompressed: 6.1MB, zipped: 1.5MB, tar.gz: 512KB, bz2: 247KB (!)
No database files support yet, but it's also planned.
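For reference, reading a single file out of such an archive with the standard library looks roughly like this (the member name inside the archive is an assumption):

```python
import tarfile

# Read one file out of the archive the way the example URL implies
# (en.tar.bz2/index.html), without unpacking everything to disk.
with tarfile.open("en.tar.bz2", "r:bz2") as archive:
    member = archive.extractfile("index.html")  # member path may differ
    html = member.read()

print(len(html), "bytes")
```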
@HelloZeroNet thanks for that. I have just implemented compression (zlib) of one column of the site database as well. 😄
For my use case I'm looking forward to the DB compression, because that's what will be updated daily (unlike the site's static files, which are downloaded only once). In my compression implementation I've left all the columns I query on uncompressed. I wonder how compressing them would impact query time.
Also, given that I'm compressing just one column (the book description), I'm aware this should not affect the diff syncing of the json files between nodes. I wonder how that will be affected once the whole json is compressed and only one row is updated/added to it.
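For context, the single-column compression described above could look roughly like this (a sketch only, with made-up field names, not the actual implementation):

```python
import base64
import json
import zlib

# Compress only the bulky column; columns used in WHERE clauses stay plain
# so queries remain fast. The record layout here is hypothetical.
book = {"title": "Some book", "description": "a long description ... " * 100}

book["description"] = base64.b64encode(
    zlib.compress(book["description"].encode("utf-8"))
).decode("ascii")

payload = json.dumps(book)  # still valid json, only the big field is opaque

# Reading it back:
restored = zlib.decompress(base64.b64decode(book["description"])).decode("utf-8")
```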
Thanks
Latest results from gzipped database (ZeroTalk user files):
Reading all files from archive
| Format  | Intel i5 | Chip  |
|---------|----------|-------|
| Raw     | 0.51s    | 1.74s |
| Tar.gz  | 0.47s    | 6.99s |
| Tar.bz2 | 4.5s     | 82.3s |
| Zip     | 0.38s    | 4.03s |
Update: Dropping the signatures from archived content.json files reduces the size of the tar.gz file to 2.5MB.
Plans for a tar.gz-packed database. Problems to solve:
- If an archived user starts posting again, the archived content will disappear, because ZeroNet deletes all of a user's rows before inserting the new ones. (The same problem occurs if multiple archives contain data for the same user.)
- Optional files in archived user directories will be deleted.
- If you archive a user file with an active topic, the topic will disappear.
- On archiving we have to update every user's data.json, which could take a lot of time.
Why not use a direct-access archive format like ZIP, instead of one that requires unpacking the entire archive before accessing any file? It might even be possible to generate a header and send an individual file directly to the browser without recompressing it, since ZIP and gzip use compatible compression algorithms, though I'm not sure if any of the libraries support this out of the box.
Zip is also supported, but in this case it does not make any difference, because when the zip/tar.gz files get updated we have to unpack every file to insert it into the db.
Another idea:
Instead of storing many files in a .zip / .tar.gz, merge the archived ones into one file. For example:

    {
      "users/112GGMvUJbBTCtQu8UUSYpo8UjLdo1B73n/content.json": {
        "cert_auth_type": "web",
        "cert_user_id": "[email protected]",
        "modified": 1484261339
      },
      "users/112GGMvUJbBTCtQu8UUSYpo8UjLdo1B73n/data.json": {
        ...
      }
    }
This would speed up opening, reading and parsing. My first benchmarks parsing all json files this way (I removed the signatures from the content.json files to reduce size):
| Format         | Size  | Intel i5 | Chip    |
|----------------|-------|----------|---------|
| Raw            | 6.7MB | 0.48s    | 3.37s   |
| Merged Raw     | 6.7MB | 0.10s    | 1.46s   |
| Tar.gz         | 2.6MB | 0.58s    | 8.64s   |
| Merged Tar.gz  | 2.4MB | 0.20s    | 2.49s   |
| Tar.bz2        | 1.9MB | 4.38s    | 77.3s   |
| Merged Tar.bz2 | 1.8MB | 0.58s    | 7.2s    |
| Zip            | 4.1MB | 0.48s    | 5.60s   |
| Merged Zip     | 2.4MB | 0.13s    | 1.84s   |
| Merged Bro     | 1.3MB | ~Tar.gz  | ~Tar.gz |
So it significantly reduces the size of the .zip file (4.1MB -> 2.4MB) and also speeds up the parsing process by 2-10x.
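A merge like the one described above could be produced with something along these lines (directory layout and output file name are assumptions):

```python
import json
import os

# Walk the per-user directories and merge every json file into one mapping
# keyed by its path relative to the data dir, as in the example above.
merged = {}
for root, _dirs, files in os.walk("data/users"):
    for name in files:
        if not name.endswith(".json"):
            continue
        path = os.path.join(root, name)
        rel = os.path.relpath(path, "data").replace(os.sep, "/")
        with open(path) as f:
            merged[rel] = json.load(f)

with open("merged.json", "w") as f:
    json.dump(merged, f, separators=(",", ":"))
```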
I fail to understand the use case for archiving user files. What would it be for?
In my mind the whole purpose of allowing user content is to have it available for other users to benefit from it.
Take a forum, for example: if a question is answered, it should always be available (potentially through search).
In sites with user content, usually the user content itself is the main value provided by the site. Imagine if stack overflow archived user answers.
Also, the downside of the feature is that, from what I understood, there will be no way for the user to know that the content they may be looking for exists but is archived.
I'm looking forward to the database compression feature ☺️
Without this, sites will grow larger and larger over the years, dramatically increasing the initial sync time and the space required by the site.
This will allow the site owner to create checkpoints by merging all user-created content into one file and defining it as an optional file, so if someone is not interested in old content, he/she only has to store and distribute the latest files.
IMHO this may lead to site owners using the feature incorrectly and archiving useful content. Sites with old content communicate the maturity of the network.
Given that we are talking about text content, the archive feature may be the easiest solution for the problem, but maybe not the right one. The right solution would be to improve the architecture of the network so that it becomes highly effective at compressing and transferring content at speed (e.g. transferring only the diff of the json files to all nodes instead of only the nodes that are online). Allowing users to 'remove' content (by making it optional) will only mask the underlying issue, which is optimising the network.
It is better to bet on making the network more efficient at transferring content than to implement features that will remove content from the network. Even if the site owner thinks this is a good idea for his site, the users that took their time to create the content will dislike the fact that the site owner took it 'down'. And they may abandon not just the site, but the network altogether.
Text is highly compressible and in years to come connections will continue to get faster and hard drives bigger. Even these days, downloading big files is a common thing for users. The upside of saving some 50MB by allowing the site owner to make content optional is lower than the downside of having user generated content (which took time and energy) removed from the site.
The current solution for large sites is deleting. Archiving will keep that content accessible, so from the user's perspective it's much better, and I don't see why it would make anyone leave the network.
Keeping all data on every computer will not work (think about mobile phones).
Compressing the data makes it 2-4 times smaller, so I would not call it a long-term solution.
Other problems with sites without archiving:
You talk about large sites as a problem that needs a solution. Have you heard of any site that had to delete content and start again because it was too successful and accumulated too much user content?
If you are referring to ZeroTalk, I think 8MB is only a problem if the content is useless. If that's the case, instead of allowing archival of the content we could work on a feature that would increase the quality of the content (e.g. up/down voting), so that users prefer to store and help distribute large sites instead of small ones.
I agree with merging multiple small json files into a big one. That definitely makes sense.
So maybe the archive feature can become an 'optimize site' feature, where ZeroNet compresses and merges the json files in the most efficient way for that site.
I just think that making it easy to turn text content into optional content will result in a net loss for the network. I would resort to this as a last-resort feature to deal with big sites.
Also, maybe my definition of big is different. For me big (when talking about text) means 200MB+. Which, if we're talking about user generated content, is a nice problem to have.
You can call it optimize if you want. It's really up to the site owner which data he/she decides to remove from the default downloaded set. It can be based on date/language/votes/etc.
Downloading and verifying the ZeroTalk content (4700 files in 8MB, and I already deleted 2500 files to keep it under 10MB) can already take up to 10 minutes on mobile phones, which I think is already too much.
This time could be improved with protocol modifications (e.g. pipelining), but the verification and writing to storage are still going to be problematic (around 50ms/user).
But from what I got, these 10 minutes would decrease if you merge the files. And that should probably be enough.
I'm all for removing bad content from the network (e.g. spam, trolling, etc.), but creating an archive feature as you propose may make it too easy to archive useful content, and despite the site owner's best intentions the network will be worse off for the lack of content. (We need to keep in mind that downloading optional files will be an advanced-user skill, given that it already requires an understanding of how the network works, which cannot be expected of new users.)
Also, an initial wait time to download the site is a good price to pay given the benefits of having the site available on your phone offline (and all other benefits of having a site on zeronet instead of the internet).
We cannot compete in speed with normal internet sites, and we should not. We should invest in features that exploit what makes zeronet sites different from internet sites.
I think downloading optional files is not an advanced feature at all. It can be a button like "Download earlier topics", "Download downvoted comments" or "Download unanswered questions".
Yeah, that makes sense.
I understand the use case for the feature now. Sounds like a good idea. ;)
I must say, though, that this sounds very much like it overlaps with merger sites. Or maybe I'm using merger sites incorrectly.
There is some overlap with merger sites, but this is more of a solution for storage and transfer of large amounts of data.
"with small files increase,computer disk might work slowly", will it happen?
@HelloZeroNet, is this yet a thing?
With .zip/tar.gz support it is partially implemented; the next step is #1053.
Are zipped database files supported now?
No, but #1053 will add this feature as checkpoints are basically compressed databases
It's up to the site whether it puts up-to-date or outdated data in it, but I will look into the possibility of adding simple json.gz support.
@antilibrary json.gz support added in Rev2180: https://github.com/HelloZeroNet/ZeroNet/commit/b503d59c49da148346aa9893d7287b8e9ccb46d2
Also a new API command (fileNeed) that allows you to start downloading optional files.
Example site that shows both of the new features: http://127.0.0.1:43110/1JokLn39tLeXbc7voPv5yuiZvzUnduKpL9
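For site developers, reading and writing such a file from Python is straightforward; a minimal sketch (data and file name are just examples, ZeroNet's own handling may differ):

```python
import gzip
import json

# Round-trip a gzipped json data file; this only shows the on-disk format.
data = {"books": [{"id": 1, "title": "Example"}]}

with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    json.dump(data, f, separators=(",", ":"))

with gzip.open("data.json.gz", "rt", encoding="utf-8") as f:
    assert json.load(f) == data
```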