To reduce the space / network bandwidth used by data files:
+1
These features would be great, since text files often compress by up to ~70%, which matters when dealing with big DBs.
A better UX for setting and updating a site size limit would be a good companion to these features (i.e. what happens when a site surpasses its size limit).
Can I suggest automatic minification of html/css/js as well? This would further reduce space usage and could be done at the publishing stage of the website. It won't make any difference if the files are already minified, but I doubt all users bother to minify manually.
The main problem is not storing the js/html files, but the databases. Removing whitespace from json files could help, but support for compressed databases would be a better solution.
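For reference, stripping the whitespace is just a matter of re-serializing the json; a minimal sketch (file name is only an example):

```python
import json

# Re-serializing without optional whitespace is the simplest "compression":
# it only removes formatting, the data itself is unchanged.
with open("data.json") as f:          # hypothetical data file
    data = json.load(f)

compact = json.dumps(data, separators=(",", ":"), sort_keys=True)

with open("data.json", "w") as f:
    f.write(compact)
```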
I almost wonder if it would be more efficient to store everything in one big database (data and all files for websites) rather than having SQLite constantly loading up JSONs and writing to disk. Then you could just compress the entire database. Many databases also have BSON (a binary JSON format) that would store the JSON data very efficiently (e.g. a boolean stored as 1 bit vs. a "true"/"false" string).
It wouldn't help with network transmission at all, though, and the disadvantage is reduced human readability, but it is an option.
Unfortunately, as far as I know, SQLite does not have compressed database support, and it would probably be slow to select large files from the database every time a user requests them.
Currently every database's data is stored twice: as json files and as an SQLite cache. In theory it's possible to store the data only in the database, but then modify/sign/send/validate would use much more CPU and disk I/O, because to calculate the md5 hash of the data file you have to select all of its current data from the database. I have not made any benchmark for this yet, though.
Space usage of the current ZeroTalk files (208 users, 130 topics, 600 comments, 550 upvotes):
The sqlite db is larger than the json files, probably because of the indexes.
zipvfs is an extended version of SQLite that allows for compression.
Sadly it's not free: "You should only be able to see this software if you have a license. If you do not have a valid license you should delete the source code in this folder at once."
I have found an open-source Python implementation, but it isn't any sort of library, more just a gist that works but isn't easy to integrate into anything.
I did some experiments on compression speed with a content.json file listing 12,000 files:
Original: 1874.51KB

| Method  | Compress | Size     | Decompress |
|---------|----------|----------|------------|
| Zlib    | 0.074s   | 558.68KB | 0.010s     |
| Gzip    | 0.130s   | 547.09KB | 0.011s     |
| Deflate | 0.125s   | 547.04KB | 0.009s     |
| Bz2     | 0.291s   | 482.88KB | 0.089s     |
| Bro     | 7.856s   | 438.01KB | 0.012s     |
Google's new Brotli offers a nice compression ratio, but the compression time cost is huge.
Looks like gzip offers the best speed/compression trade-off. (Maybe support both .gz and .bz2?)
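For anyone who wants to repeat the measurement, here is a rough sketch using only standard-library codecs (Brotli and zstd need the third-party `brotli` / `zstandard` modules, and exact numbers will of course differ by machine):

```python
import bz2
import gzip
import lzma
import time
import zlib

# Rough reproduction of the measurements above; any large json file will do.
with open("content.json", "rb") as f:
    raw = f.read()

codecs = {
    "Zlib": (zlib.compress, zlib.decompress),
    "Gzip": (gzip.compress, gzip.decompress),
    "Bz2":  (bz2.compress, bz2.decompress),
    "Lzma": (lzma.compress, lzma.decompress),
}

print("Original %.2fKB" % (len(raw) / 1024.0))
for name, (compress, decompress) in codecs.items():
    t = time.time()
    packed = compress(raw)
    c_time = time.time() - t

    t = time.time()
    decompress(packed)
    d_time = time.time() - t

    print("%-7s %.3fs %.2fKB %.3fs" % (name, c_time, len(packed) / 1024.0, d_time))
```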
The idea:
The html, css, js, etc. files would remain uncompressed (both in storage and in transfer), because they are re-transferred very rarely and I don't think it's worth the extra (de)compression CPU time on every transfer.
Edit: Added LZMA/Brotli compression levels:
Original: 1874.51KB

| Method  | Compress | Size     | Decompress |
|---------|----------|----------|------------|
| Zlib    | 0.069s   | 558.68KB | 0.011s     |
| Gzip    | 0.132s   | 547.09KB | 0.011s     |
| Deflate | 0.131s   | 547.04KB | 0.009s     |
| Bz2     | 0.299s   | 482.88KB | 0.098s     |
| Lzma    | 1.372s   | 455.77KB | 0.057s     |
| Bro/1   | 0.035s   | 506.78KB | 0.011s     |
| Bro/2   | 0.046s   | 504.26KB | 0.010s     |
| Bro/3   | 0.052s   | 502.25KB | 0.011s     |
| Bro/4   | 0.059s   | 507.89KB | 0.010s     |
| Bro/5   | 0.160s   | 527.18KB | 0.012s     |
| Bro/6   | 0.221s   | 527.73KB | 0.012s     |
| Bro/7   | 0.269s   | 527.06KB | 0.012s     |
| Bro/8   | 0.305s   | 526.27KB | 0.012s     |
| Bro/9   | 0.373s   | 525.81KB | 0.012s     |
| Bro/10  | 8.111s   | 438.01KB | 0.012s     |
| Bro/11  | 8.075s   | 438.01KB | 0.012s     |
Brotli level 1 looks good and fast. (Cons: one more binary dependency, and the Python module is not on pip yet.)
Any news? Also, maybe the test should be updated because they changed a lot of things in Brotli
It's not planned yet.
Ok
I would be happy to be able to put .gz files in a ZeroSite and have the server serve them with Content-Encoding: gzip when the client requests the file without .gz and sends Accept-Encoding: gzip; otherwise the local daemon should decompress before delivering to the browser. The latter part is really optional, since there probably aren't any browsers that work with ZeroNet that don't support gzip.
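Not ZeroNet's actual file-serving code, just a sketch of the negotiation I have in mind (the function name and paths are made up):

```python
import gzip
import os


def serve(path, accept_encoding):
    """Return (extra_headers, body), preferring a pre-compressed .gz sibling
    file when the client accepts gzip. Purely illustrative."""
    gz_path = path + ".gz"
    if os.path.exists(gz_path):
        with open(gz_path, "rb") as f:
            body = f.read()
        if "gzip" in accept_encoding:
            # Client can decode it: send the stored .gz bytes untouched.
            return {"Content-Encoding": "gzip"}, body
        # Old client: fall back to decompressing locally before delivery.
        return {}, gzip.decompress(body)
    with open(path, "rb") as f:
        return {}, f.read()
```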
Added Facebook's zstd:
Original: 1874.51KB

| Method  | Compress | Size     | Decompress |
|---------|----------|----------|------------|
| Zlib    | 0.071s   | 558.68KB | 0.010s     |
| Gzip    | 0.131s   | 547.09KB | 0.012s     |
| Deflate | 0.129s   | 547.04KB | 0.010s     |
| Bz2     | 0.299s   | 482.88KB | 0.108s     |
| Lzma    | 1.343s   | 455.77KB | 0.058s     |
| Bro/1   | 0.031s   | 506.78KB | 0.011s     |
| Bro/2   | 0.042s   | 504.26KB | 0.010s     |
| Bro/3   | 0.049s   | 502.25KB | 0.011s     |
| Bro/4   | 0.058s   | 507.89KB | 0.011s     |
| Bro/5   | 0.159s   | 527.18KB | 0.012s     |
| Bro/6   | 0.236s   | 527.73KB | 0.012s     |
| Bro/7   | 0.298s   | 527.06KB | 0.012s     |
| Bro/8   | 0.324s   | 526.27KB | 0.012s     |
| Bro/9   | 0.401s   | 525.81KB | 0.012s     |
| Bro/10  | 8.307s   | 438.01KB | 0.013s     |
| Bro/11  | 8.755s   | 438.01KB | 0.013s     |
| Zstd/1  | 0.017s   | 502.58KB | 0.010s     |
| Zstd/3  | 0.058s   | 521.81KB | 0.024s     |
| Zstd/5  | 0.152s   | 523.55KB | 0.038s     |
| Zstd/7  | 0.276s   | 518.76KB | 0.050s     |
| Zstd/9  | 0.470s   | 507.92KB | 0.063s     |
| Zstd/11 | 0.695s   | 507.03KB | 0.076s     |
| Zstd/13 | 0.957s   | 505.98KB | 0.089s     |
| Zstd/15 | 1.352s   | 504.79KB | 0.101s     |
| Zstd/17 | 1.838s   | 478.77KB | 0.111s     |
| Zstd/19 | 2.939s   | 465.31KB | 0.121s     |
| Zstd/21 | 4.155s   | 458.96KB | 0.132s     |
Zstd's dictionary training function could be interesting for us (it allows a shared dictionary across multiple compressed files).
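A minimal sketch of how the dictionary training could be used with the third-party `zstandard` bindings (file names and dictionary size are just examples):

```python
import zstandard  # third-party "zstandard" bindings, not in the stdlib

# Train a shared dictionary on a set of small, similar json files, then reuse
# it for both compression and decompression of each file.
samples = []
for path in ("user1/data.json", "user2/data.json", "user3/data.json"):
    with open(path, "rb") as f:
        samples.append(f.read())

dict_data = zstandard.train_dictionary(16 * 1024, samples)

compressor = zstandard.ZstdCompressor(dict_data=dict_data, write_content_size=True)
decompressor = zstandard.ZstdDecompressor(dict_data=dict_data)

packed = compressor.compress(samples[0])
assert decompressor.decompress(packed) == samples[0]
```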
@HelloZeroNet any plans to implement this?
I ask because if you say this is not on the roadmap for the next months, I'll probably implement it at the site level.
Thanks
There is no roadmap, so I can't tell if it's on it or not. It's planned, but I'm not sure yet whether it will also support database files or only static ones.
I have added .tar.gz, .tar.bz2, .zip support in latest rev: https://github.com/HelloZeroNet/ZeroNet/commit/2854e202e17926f136c053646f3530e1e1c9956d
example site: http://127.0.0.1:43110/1AsRLpuRxr3pb9p3TKoMXPSWHzh6i7fMGi/en.tar.bz2/index.html (please update your client before visit)
en dir uncompressed: 6.1MB, zipped: 1.5MB, tar.gz: 512KB, bz2: 247KB (!)
No database files support yet, but it's also planned.
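For reference, reading a single file out of such an archive with the standard library looks roughly like this (the member name inside the archive is an assumption):

```python
import tarfile

# Read one file out of the archive the way the example URL implies
# (en.tar.bz2/index.html), without unpacking everything to disk.
with tarfile.open("en.tar.bz2", "r:bz2") as archive:
    member = archive.extractfile("index.html")  # member path may differ
    html = member.read()

print(len(html), "bytes")
```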
@HelloZeroNet thanks for that. I have just implemented compression (zlib) of one column of the site database as well. 😄
For my use case I'm looking forward to the DB compression, because that's what will be updated daily (unlike the site's static files, which are downloaded only once). In my compression implementation I've left all the columns I query on uncompressed. I wonder how compressing them would impact query time.
Also, given that I'm compressing just one column (the book description), I'm aware this should not affect the diff syncing of the json files between nodes. I wonder how that will be affected once the whole json is compressed and only one row is updated/added to it.
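For context, the single-column compression described above could look roughly like this (a sketch only, with made-up field names, not the actual implementation):

```python
import base64
import json
import zlib

# Compress only the bulky column; columns used in WHERE clauses stay plain
# so queries remain fast. The record layout here is hypothetical.
book = {"title": "Some book", "description": "a long description ... " * 100}

book["description"] = base64.b64encode(
    zlib.compress(book["description"].encode("utf-8"))
).decode("ascii")

payload = json.dumps(book)  # still valid json, only the big field is opaque

# Reading it back:
restored = zlib.decompress(base64.b64decode(book["description"])).decode("utf-8")
```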
Thanks
Latest results from gzipped database (ZeroTalk user files):
Reading all files from archive
| Format  | Intel i5 | Chip  |
|---------|----------|-------|
| Raw     | 0.51s    | 1.74s |
| Tar.gz  | 0.47s    | 6.99s |
| Tar.bz2 | 4.5s     | 82.3s |
| Zip     | 0.38s    | 4.03s |
Update: Dropping the signatures from archived content.json files reduces the size of the tar.gz file to 2.5MB.
Plans for a tar.gz-packed database. Problems to solve:
- If an archived user starts posting again, the archived content will disappear, because ZeroNet deletes all of a user's rows before inserting the new ones. (The same problem occurs if multiple archives contain data for the same user.)
- Optional files in archived user directories will be deleted.
- If you archive a user file with an active topic, the topic will disappear.
- On archiving we have to update every user's data.json, which could take a lot of time.
Why not use a direct-access archive format like ZIP, instead of one that requires unpacking the entire archive before accessing any file? It might even be possible to generate a header and send an individual file directly to the browser without recompressing it, since ZIP and gzip use compatible compression algorithms, though I'm not sure if any of the libraries support this out of the box.
Zip is also supported, but in this case it does not make any difference, because when the zip/tar.gz files get updated we have to unpack every file to insert it into the db.
Another idea:
Instead of storing many files in a .zip / .tar.gz, merge the archived ones into one file. For example:

    {
      "users/112GGMvUJbBTCtQu8UUSYpo8UjLdo1B73n/content.json": {
        "cert_auth_type": "web",
        "cert_user_id": "[email protected]",
        "modified": 1484261339
      },
      "users/112GGMvUJbBTCtQu8UUSYpo8UjLdo1B73n/data.json": {
        ...
      }
    }
This would speed up opening, reading and parsing. My first benchmarks parsing all json files this way (I removed the signatures from the content.json files to reduce size):
| Format         | Size  | Intel i5 | Chip    |
|----------------|-------|----------|---------|
| Raw            | 6.7MB | 0.48s    | 3.37s   |
| Merged Raw     | 6.7MB | 0.10s    | 1.46s   |
| Tar.gz         | 2.6MB | 0.58s    | 8.64s   |
| Merged Tar.gz  | 2.4MB | 0.20s    | 2.49s   |
| Tar.bz2        | 1.9MB | 4.38s    | 77.3s   |
| Merged Tar.bz2 | 1.8MB | 0.58s    | 7.2s    |
| Zip            | 4.1MB | 0.48s    | 5.60s   |
| Merged Zip     | 2.4MB | 0.13s    | 1.84s   |
| Merged Bro     | 1.3MB | ~Tar.gz  | ~Tar.gz |
So it significantly reduces the size of the .zip file (4.1MB -> 2.4MB) and also speeds up the parsing process by 2-10x.
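A merge like the one described above could be produced with something along these lines (directory layout and output file name are assumptions):

```python
import json
import os

# Walk the per-user directories and merge every json file into one mapping
# keyed by its path relative to the data dir, as in the example above.
merged = {}
for root, _dirs, files in os.walk("data/users"):
    for name in files:
        if not name.endswith(".json"):
            continue
        path = os.path.join(root, name)
        rel = os.path.relpath(path, "data").replace(os.sep, "/")
        with open(path) as f:
            merged[rel] = json.load(f)

with open("merged.json", "w") as f:
    json.dump(merged, f, separators=(",", ":"))
```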
I fail to understand the use case for archiving user files. What would it be for?
In my mind the whole purpose of allowing user content is to have it available for other users to benefit from it.
Take a forum, for example: if a question is answered, it should always be available (potentially through search).
In sites with user content, usually the user content itself is the main value provided by the site. Imagine if stack overflow archived user answers.
Also, the downside of the feature is that, from what I understood, there will be no way for the user to know that the content they may be looking for exists but is archived.
I'm looking forward to the database compression feature ☺️
Without this, sites will grow larger and larger over the years, dramatically increasing the initial sync time and the space required by the site.
This will allow the site owner to create checkpoints by merging all user-created content into one file and defining it as an optional file, so if someone is not interested in old content, he/she only has to store and distribute the latest files.
IMHO this may lead to site owners using the feature incorrectly and archiving useful content. Sites with old content communicate the maturity of the network.
Given that we are talking about text content, the archive feature may be the easiest solution for the problem, but maybe not the right one. The right solution would be to improve the architecture of the network so that it becomes highly effective at compressing and transferring content at speed (e.g. transferring only the diff of the json files to all nodes instead of only the nodes that are online). Allowing users to 'remove' content (by making it optional) will only mask the underlying issue, which is optimising the network.
It is better to bet on making the network more efficient at transferring content than to implement features that will remove content from the network. Even if the site owner thinks this is a good idea for his site, the users that took their time to create the content will dislike the fact that the site owner took it 'down'. And they may abandon not just the site, but the network altogether.
Text is highly compressible and in years to come connections will continue to get faster and hard drives bigger. Even these days, downloading big files is a common thing for users. The upside of saving some 50MB by allowing the site owner to make content optional is lower than the downside of having user generated content (which took time and energy) removed from the site.
The current solution for large sites is deleting. Archiving will keep that content accessible, so from the user's perspective it's much better, and I don't see why it would make anyone leave the network.
Keeping all data on every computer will not work (think about mobile phones).
Compressing the data makes it 2-4 times smaller, so I would not call it a long-term solution.
Other problems with sites without archiving:
You talk about large sites as a problem that needs a solution. Have you heard of any site that had to delete content and start again because it was too successful and accumulated too much user content?
If you are referring to ZeroTalk, I think 8MB is only a problem if the content is useless. If that's the case, instead of allowing archival of the content we could work on a feature that would increase the quality of the content (e.g. up/down voting), so that users prefer to store and help distribute large sites instead of small ones.
I agree with merging multiple small json files into a big one. That definitely makes sense.
So maybe the archive feature can become an 'optimize site' feature, where ZeroNet compresses and merges the json files in the most efficient way for that site.
I just think that making it easy to turn text content into optional content will result in a net loss for the network. I would resort to this as a last-resort feature to deal with big sites.
Also, maybe my definition of big is different. For me big (when talking about text) means 200MB+. Which, if we're talking about user generated content, is a nice problem to have.
You can call it optimize if you want. It's really up to the site owner which data he/she decides to remove from the default downloaded set. It can be based on date/language/votes/etc.
Downloading and verifying the ZeroTalk content (4700 files in 8MB, and I already deleted 2500 files to keep it under 10MB) can already take up to 10 minutes on mobile phones, which I think is already too much.
This time could be improved with protocol modifications (e.g. pipelining), but the verification and writing to storage are still going to be problematic (around 50ms/user).
But from what I got, these 10 minutes would decrease if you merge the files. And that should probably be enough.
I'm all for removing bad content from the network (e.g. spam, trolling, etc.), but creating an archive feature as you propose may make it too easy to archive useful content, and despite the site owner's best intentions the network will be worse off for the lack of content. (We need to keep in mind that downloading optional files will be an advanced-user skill, given that it already requires an understanding of how the network works, which cannot be expected of new users.)
Also, an initial wait time to download the site is a good price to pay given the benefits of having the site available on your phone offline (and all other benefits of having a site on zeronet instead of the internet).
We cannot compete in speed with normal internet sites, and we should not. We should invest in features that exploit what makes zeronet sites different from internet sites.
I think downloading optional files is not an advanced feature at all. It can be a button like "Download earlier topics", "Download downvoted comments" or "Download unanswered questions".
Yeah, that makes sense.
I understand the use case for the feature now. Sounds like a good idea. ;)
I must say, though, that this sounds very much like it overlaps with merger sites. Or maybe I'm using merger sites incorrectly.
There is some overlap with merger sites, but this is more of a solution for storage and transfer of large amounts of data.
"with small files increase,computer disk might work slowly", will it happen?
@HelloZeroNet, is this yet a thing?
With .zip/tar.gz support it is partially implemented; the next step is #1053.
Are zipped database files supported now?
No, but #1053 will add this feature as checkpoints are basically compressed databases
It's up to the site whether it puts up-to-date or outdated data in it, but I will look into the possibility of adding simple json.gz support.
@antilibrary json.gz support added in Rev2180: https://github.com/HelloZeroNet/ZeroNet/commit/b503d59c49da148346aa9893d7287b8e9ccb46d2
Also a new API command (fileNeed) that allows you to start downloading optional files.
Example site that shows both of the new features: http://127.0.0.1:43110/1JokLn39tLeXbc7voPv5yuiZvzUnduKpL9
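For site developers, reading and writing such a file from Python is straightforward; a minimal sketch (data and file name are just examples, ZeroNet's own handling may differ):

```python
import gzip
import json

# Round-trip a gzipped json data file; this only shows the on-disk format.
data = {"books": [{"id": 1, "title": "Example"}]}

with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    json.dump(data, f, separators=(",", ":"))

with gzip.open("data.json.gz", "rt", encoding="utf-8") as f:
    assert json.load(f) == data
```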