Server: Requesting delta-sync in longterm [$325.00]

Created on 15 Jul 2016 · 53 Comments · Source: nextcloud/server

Hi all,

Delta sync would be great for my huge TrueCrypt/VeraCrypt container files (~30-100 GB). Without delta sync I have to stay away from this product.

Delta sync would also give you a feature that distinguishes you from ownCloud.

Couldn't there be an optional (maybe extension-, folder- or file-based) mechanism to perform delta sync ("optional" because I agree that delta sync does not make sense for all kinds of files/folders)? Maybe even using something existing like rsync?



Labels: 1. to develop, bounty, enhancement

All 53 comments

I have looked into zsync (client-side rsync). If we ever implement such a feature, it will most likely use that, since you have to offload all the computation to the client side or else you will kill the server.

I don't know about TrueCrypt/VeraCrypt, but most container formats (and encrypted ones are even worse) don't lend themselves particularly well to delta sync, since a small change often results in a lot of changed bytes.

I can only speak for TrueCrypt (but I suppose VeraCrypt does the same): if you change 1 MB within the container, the file as a whole changes by only a bit more than 1 MB (I am very sure about this). Not even a password change alters the whole file, only small parts; see https://news.ycombinator.com/item?id=6523286 and http://crypto.stackexchange.com/questions/18479/how-does-truecrypt-change-password-without-the-need-for-a-complete-re-encryption
So delta sync + TrueCrypt (VeraCrypt) is a really perfect combination.
Although I can see that this feature will not be desired by many people, there are some cases, like mine, where it would be great. Maybe there are other cases that I/we cannot think of but that exist. Not sure, but VM images might profit from delta sync as well, for example. Also, for some files you might uncompress -> compare -> delta_sync -> compress_server_side_again [OK, this might be too costly an action, I do not know. This would work for e.g. *.pptx etc. as well]
Client-side computation seems reasonable to me.

Thanks for considering the feature in any future release.

That is an interesting feature for virtual images. Are there any experiences with encrypted containers and diff sync tools?

One annoying thing about the sync-process is that you must transfer the files through the client. You can't just place them from a hard disk or use a faster transfer mechanism. Therefore many ask for a diff-sync feature but the ability to compare files (based on a hash-sum) would already help a lot of these people and it's much easier to implement.
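The hash-comparison idea above could look roughly like the following sketch. It assumes the server can report a digest for its copy of a file; the function names and the choice of SHA-256 are illustrative assumptions, not part of any Nextcloud API.

```python
import hashlib
import io


def file_digest(stream, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def needs_upload(local_stream, remote_digest):
    """Hypothetical check: skip the transfer when both sides already match."""
    return file_digest(local_stream) != remote_digest


# A file placed on the server by other means (e.g. from a hard disk)
# has the same digest as the local copy, so nothing needs to be sent.
payload = b"container data" * 1000
server_side_digest = file_digest(io.BytesIO(payload))
print(needs_upload(io.BytesIO(payload), server_side_digest))         # False
print(needs_upload(io.BytesIO(payload + b"x"), server_side_digest))  # True
```

This only avoids re-sending identical files; a file that differs in a single byte would still be transferred in full.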

I don't mind at all to offload all the computation to the client side, as long as such feature is made available!!! We, actually, consider this a very important feature for business!

To get an idea: on a 0-10 scale, what would the cost (not monetary) of developing this be?

Tks

Dropbox and OneDrive have delta sync.
Seafile has delta sync, but it causes broken files.
I hope you will look at rsync. I do need delta sync.

@rullzer How would the client have the previous file for calculating the delta?

@cowai you either need to keep a copy of your last sync around (using file-system specific things like shadow copy seems out of the question for the broad range of platforms with sync clients) or you have to do block-level syncing instead, like "syncthing" does it.
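As a rough illustration of the block-level alternative mentioned above, the client could keep only a list of per-block hashes from the last sync instead of a full copy of the file. This is a minimal sketch under assumptions: the block size, the use of SHA-256 and the function names are chosen for illustration and do not reproduce Syncthing's actual protocol.

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # assumed block size, chosen only for illustration


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash every fixed-size block of a byte string."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(new_data, old_hashes, block_size=BLOCK_SIZE):
    """Return indices of blocks whose hash differs from the stored list."""
    new_hashes = block_hashes(new_data, block_size)
    return [
        index
        for index, block_hash in enumerate(new_hashes)
        if index >= len(old_hashes) or block_hash != old_hashes[index]
    ]


# The client keeps only the hash list from the last sync, not a full copy.
old_file = bytes(10 * BLOCK_SIZE)
stored_hashes = block_hashes(old_file)

# Overwrite a few bytes inside block 3, as an in-place container write would.
new_file = bytearray(old_file)
new_file[3 * BLOCK_SIZE + 10:3 * BLOCK_SIZE + 20] = b"0123456789"

print(changed_blocks(bytes(new_file), stored_hashes))  # [3]
```

Fixed block boundaries work well for in-place writes; if bytes are inserted or removed, every later block shifts and looks changed, which is the case rolling-checksum approaches such as rsync are designed to handle.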

I think block-level syncing like syncthing is probably the easiest implementation in code, and perhaps the cheapest to write. I'm seriously interested in this, and I know some companies that are too (Quickbooks files man...ugly stuff). Like @Bigpet said, you'd need a copy of the file before changes onhand, or put some hooks into writes that go into that specific directory, but the latter sounds very messy and dangerous. I wish I knew how to write code better because I would 100% do this..I'm definitely a Kindergarten koder compared to a lot of people that put stuff on github. Thought I'd voice that there's interest on my end, and on the end of local companies I know.

Are there concrete plans when delta syncing will be available? May I hope to see this implemented in Nextcloud 13 already?

There is some progress on owncloud:

owncloud/core#16162

@gschenck Please feel free to try out the latest code, the core implementation should be complete now.

@ahmedammar any plans on submitting the PR against NC as well further down the road?

@jkaberg once the work is complete and merged in oC I can have a look, assuming the code-base isn't too different at the core ...

@ahmedammar can you give us an update about the feature? (If possible a probable ETA?)

@maverick74 no ETA for nextcloud, if someone is willing to open a bounty for it I could look into it more urgently, otherwise, for reference:
owncloud/client#6131
owncloud/core#29404

It's not much and I'm not even sure I did this right since I never did this before, but I don't mind chipping in to help this get done.
Bountysource

The bounty is already at 115 dollars now. It should not be terribly hard to get this merged in the Nc client and server, I think, but it won't make it for 13 😄

I won't be looking into this until oC actually merges first, since that saves me any duplicated effort. Unless this bounty gets so big that I can ignore oC altogether :)

FWIW i guess there are some news at https://github.com/owncloud/core/pull/29404

@maverick74 So It can be merged... @rullzer @jospoortvliet

"FWIW i guess there are some news at owncloud/core#29404"

That's the server side. The client side is still on a development branch and subject to testing (https://github.com/owncloud/client/labels/Delta-sync). Until this is finished, it doesn't make a lot of sense to merge anything, so for now you can only help by testing it.

I think nextcloud should hurry up, delta sync will be released in the next owncloud update:
https://owncloud.com/owncloud-implements-delta-sync-technology/

@petrk94 yeah, it could in theory be merged - but ownCloud notes it'll be in testing until 2019, let's see. @ahmedammar can make a PR for the server - the client will get it as we sync upstream actively still.

I'm wondering why I get so many thumbs down; I just want to keep the thread updated :/

If I'm understanding stuff correctly it sounds like NextCloud won't be having this feature any time soon, correct?

ownCloud currently uses client version 2.4. Version 2.5 is in beta tests now (https://github.com/owncloud/client/issues/6483) and the delta sync feature was announced for version 2.6. Now with a bit of guessing, between major releases there are often 6 months or more, so I wouldn't expect a working client before the end of this year.

From Nextcloud side, they took over their own development to realize the new client-side encryption which is currently in beta status. This feature is one of the main priorities at the moment, and it will probably take some time to ship this feature and get it really stable. After that, they could implement delta sync but I won't expect it before mid-2019. This is no official statement, priorities can change ...

Apparently the client-side is already merged ( https://github.com/owncloud/client/pull/6297 )

But they're still hunting for bugs until 2.6.0.

It would be nice to have it as an experimental Opt-in feature over here, however :)

We might merge it during the course of our 2.6 development, I suppose - but we have a huge amount of things we want to work on, not sure how high the prio is on this one. Help is welcome - if somebody feels like creating a PR for our client that backports this feature, that'd be cool of course!

ownCloud currently has delta-sync for testing in the server and the daily build of the client.

server: https://github.com/owncloud/core/pull/29404
client: https://github.com/owncloud/client/pull/6771

announcement: https://github.com/owncloud/core/pull/29404#issuecomment-474783452

Nice. We might work with its author @ahmedammar to get it into Nextcloud in the future as well. As I said, it isn't high on our priority list, as we still have a lot of stabilization to do for the Drive and E2EE features and have a lot of plans around UI and server integration work. But I believe you can donate to the feature to help motivate @ahmedammar :smile:

@jospoortvliet, is donating to the bounty/@ahmedammar the only way to get this higher in your priority list?

Delta sync is a killer criterion for lots of users... I don't understand why this is not higher in priority, because I know companies that moved to Dropbox because it's the only file cloud service offering it. It seems that ownCloud will take second place after Dropbox ;-) I hope Nextcloud will catch up too...

Chipped in on the bounty because this is high-priority for me, and I would much rather stay with NC than convert (back) to OC. I'd be happy to help test as well.

+1 for me. Delta sync is hugely important. I can only hypothesize that the reason it's low on your priority list is that you are chasing cool new features vs what everyone can benefit from and maybe the voice of this need just isn't being heard (he who shouts loudest?) I need to sync VM images and huge PST files daily.

Is it possible to use Nextcloud server 15 with the ownCloud client 2.6.0+ (featuring delta sync)? I migrated from ownCloud to Nextcloud and would rather not risk migrating back. This feature seems important. What drives the prioritization of Drive and E2EE ahead of delta sync?

I'll have a stab at this in the next few days. Looking at ownCloud: they use zsync, with a lot of code to integrate it with the ownCloud APIs and this .zsync metadata file, meh.

I'd rather go full rsync: it should be possible for the server to shell out to an rsync daemon or client and connect its stdin and stdout through an HTTP tunnel to the Nextcloud clients.

FWIW, Windows has (had?) Remote Differential Compression built-in and there was some technical documentation on it that might have been useful, but I cannot find it anymore.

@iskradelta reinventing the wheel sounds like a great plan!

@ahmedammar, you're not making it easier... 😏

@ahmedammar reinventing the wheel? That's the opposite of my plan. "Reinventing the wheel" would mean reimplementing the rsync algorithm or another differential algorithm, and then either inventing a new protocol or fitting the existing wire protocol on top of your API. The plan is to do the opposite: tunnel the existing rsync network wire protocol over a connection that the Nextcloud server already has to the client, namely the websocket connection, instead of the HTTP tunnels I wrote about above, since most people can't configure those correctly.

A prototype is already working for me, on the nextcloud-server part, it took one evening of "coding".

The zsync implementation is self-mutilation ("oh, rsync can't be done over HTTP, let's modify the network protocol to do rsync over HTTP"), but you can tunnel anything over HTTP or websockets, and the ownCloud implementation of it is buggy and too large to maintain.

@iskradelta will your implementation scale? If you have many users doing rsync you will make it do way more work than with zsync if I am not mistaken?

@ariselseng rsync is only CPU-intensive on the sender side. The sender side can be the client or the server, depending on whether the user is uploading or downloading. There is a limit to how many users can be syncing their tree (initial download) at the same time: that limit is the CPU available to the server, if the bandwidth limit is not hit before that. It is only reached when files in a user's tree have a changed timestamp or size, so once synced, many users can keep "syncing" without causing high CPU load.
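For readers unfamiliar with the checksumming discussed here: rsync-style algorithms rely on a weak checksum that can be "rolled" forward one byte at a time instead of being recomputed for every offset. The following is a simplified sketch of that idea, not rsync's actual code; real implementations pair the weak checksum with a stronger hash and differ in details.

```python
MOD = 1 << 16


def weak_checksum(block):
    """Compute the two 16-bit sums (a, b) for a block from scratch."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a, b, outgoing, incoming, block_len):
    """Slide the window one byte to the right in constant time."""
    a = (a - outgoing + incoming) % MOD
    b = (b - block_len * outgoing + a) % MOD
    return a, b


# Check that rolling gives the same result as recomputing at every offset.
data = bytes(range(256)) * 4
block_len = 64
a, b = weak_checksum(data[:block_len])
for start in range(1, len(data) - block_len + 1):
    a, b = roll(a, b, data[start - 1], data[start + block_len - 1], block_len)
    assert (a, b) == weak_checksum(data[start:start + block_len])
print("rolling checksum matches recomputation at every offset")
```

Because each step costs a constant amount of work, a window can be slid over a large file in a single pass, which is what lets matching blocks be found at arbitrary byte offsets.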

If this ever becomes a problem, there is a solution: consider caching to avoid the expensive checksumming. But I don't like it, since it means we assume that syncing is always an initial sync, i.e. that users don't have any of their data on their phones/clients. The zsync pre-calculated metadata file is really only a benefit when all the users are downloading the same tree (files); zsync makes sense when it's used for public data like ISO images.
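One possible reading of the caching idea, combined with the timestamp/size check mentioned earlier, is to recompute a checksum only when a file's size or modification time has changed. The sketch below is an assumption about how such a cache might look; the in-memory dictionary and function names are hypothetical and not taken from rsync, zsync or Nextcloud.

```python
import hashlib
import os
import tempfile

_cache = {}  # path -> (size, mtime_ns, digest); hypothetical in-memory cache


def cached_digest(path):
    """Return (digest, recomputed); rehash only when size or mtime changed."""
    stat = os.stat(path)
    key = (stat.st_size, stat.st_mtime_ns)
    entry = _cache.get(path)
    if entry is not None and entry[:2] == key:
        return entry[2], False
    with open(path, "rb") as handle:
        # Reading the whole file at once keeps the sketch short.
        digest = hashlib.sha256(handle.read()).hexdigest()
    _cache[path] = key + (digest,)
    return digest, True


def demo():
    """Hash a temporary file three times and report when work was redone."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "container.bin")
        with open(path, "wb") as handle:
            handle.write(b"a" * 4096)
        flags = [cached_digest(path)[1], cached_digest(path)[1]]
        with open(path, "ab") as handle:
            handle.write(b"b")  # size changes, so the cache entry is stale
        flags.append(cached_digest(path)[1])
        return flags


print(demo())  # [True, False, True]
```

The trade-off is the usual one for such quick checks: a change that preserves both size and modification time would go unnoticed.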

There is a reason even Dropbox is using librsync. It's the best tool, the best.

Good luck.

@iskradelta I look forward to try out your experiment ;-)

wrt others asking about priorities - we prioritize things that benefit more users or that are paid for by customers. While everyone here cares deeply about delta sync, 99% of the users don't handle very big files in which small parts are regularly changed - the only scenarios I can think of are VMs and encrypted filesystems, both of which are never used by the vast majority of computer users. The Drive and E2E features have big benefits for normal users, meanwhile, so we focus there. And finishing those is taking more than long enough; I hope you don't mind that we don't take on another huge task until we have both of those done. Our team can actually barely handle the support load for customers, and that's the main reason we are not making much progress. We have been trying to hire more people for 3 years already :(

Just to let everyone know: I deleted a post violating the code of conduct. If you want proof drop me a line (by email).

@Ornias1993 you're still invited to add your technical comments regarding this feature request.

We do not tolerate personal attacks, racism, sexism or any other form of discrimination. Disagreement is inevitable, from time to time, but respect for the views of others will go a long way to winning respect for your own view.

Just keep that in mind.

@kesselb
It seems being a douche is okay, as long as being a douche is project-related and not personal. That's not a moderation policy I can accept, and thus I will not assist any further.

@kesselb, dropping you a line.

@realies Feel free to write me an email (I had that in mind with "drop me a line" and have now added "by email" to clarify that). Actually, the comment in question was similar to https://github.com/nextcloud/server/issues/417#issuecomment-544148632 but used language violating the code of conduct.

"the only scenarios I can think of are VMs and encrypted filesystems, both of which are never used by the vast majority of computer users."

The scenario I deal with is big Outlook files. I would think that is used by more users.

Yes, this, or Thunderbird mail as well, or large zip files which you update, or large picture compositions like PSDs which you update, etc. Anything large which receives data updates.
PDFs also, docx, or mp3/FLAC files for which you update the ID3 tags...
VeraCrypt containers, ISOs...

There are endless cases.

In a broader way, any update on any file is a delta sync.
Except for when the updated file has all its bits changed, only then, you can consider this update as an 'overwrite'.
In effect delta sync would be used for almost 99% of the updates made to files. I think.

"There are endless cases."

Read the Nextcloud Case Studies. They have tens of thousands of users from universities and the like, where 99% of the users hardly have more than one file besides the intro file in their account and use it accordingly.

So even if you would say that delta sync could save a multiple of the traffic at scale even with only small to medium files, this is just not a relevant use case for Nextcloud, since they don't target such an active user base.

Therefore your point is invalid and @jospoortvliet's reasoning should get more credit. In fact, Nextcloud should probably drop this item completely from their roadmap to focus more on what is important for their users and strengthen their USP. (After all, there are other solutions which have delta sync, even with block-based approaches, which can be used if you have a use case which requires that.)

I think it was explained before but:

  • small files (under 5 or 10 MB) don't benefit from delta sync - the overhead is not worth it
  • files that are compressed and/or encrypted usually change everywhere when a small modification is made, so they don't benefit either

So almost all common file types, including office documents (yes they are compressed), images, music and large PSD files etc do not benefit from it. A metadata change to a large movie might (not always, depends on the file format) and sometimes to large images, too. But how often do you do that? Once a month? It is really almost exclusively nice for VM images and encrypted container formats. And yes, they matter, but aren't the most important in the world for most of our users, sorry.
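The claim about compressed files can be checked with a small experiment. The sketch below compares how many fixed-size blocks differ after a small in-place edit, once on raw data and once on the zlib-compressed form of the same data. The sample data, block size and helper name are assumptions for illustration; it covers deflate-style compression only and says nothing about disk-encryption containers, which an earlier comment in this thread reports as changing only locally.

```python
import zlib

BLOCK_SIZE = 4096  # assumed comparison granularity


def count_changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Count fixed-size blocks that differ between two byte strings."""
    length = max(len(old), len(new))
    return sum(
        old[offset:offset + block_size] != new[offset:offset + block_size]
        for offset in range(0, length, block_size)
    )


# Compressible, text-like sample data (about 1.2 MB).
original = b"".join(
    b"record %06d: the quick brown fox jumps over the lazy dog\n" % i
    for i in range(20000)
)

# Overwrite 100 bytes near the start without changing the length.
modified = bytearray(original)
modified[100:200] = bytes(range(100, 200))
modified = bytes(modified)

raw_changed = count_changed_blocks(original, modified)
compressed_changed = count_changed_blocks(
    zlib.compress(original), zlib.compress(modified)
)

print("raw blocks changed:", raw_changed)
print("compressed blocks changed:", compressed_changed)
```

On the raw data the edit touches a single block, while the compressed streams diverge from the point of the edit onward, so a block-based delta of the compressed file has far less to reuse.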

Look, customers use Nextcloud in many ways. SIEMENS, for example, uses it only with HUGE files (minimum 30 GB, typically 50-100 GB). Some media companies use it with PSD files of hundreds of MB. If we could make those cases much more efficient with delta sync, we would look into it, but it wouldn't make a difference, so we don't.

There is little point in discussing this further. We have a lot of work to do and until we have a larger team and have finished other tasks, we won't get to this. If somebody else wants to do it - please, go ahead, pull requests are welcome. If somebody wants to pay for it, get in contact with sales.
