Restic: Storage Backend: Amazon Cloud Drive

Created on 4 Jul 2015 · 56 comments · Source: restic/restic

The "Unlimited Everything" plan of Amazon Cloud Drive is a quite affordable backup storage option. Amazon Cloud Drive has its own RESTful API.

backend discussion

All 56 comments

Just thought the same thing.

I may look at this once #21 is in place - uploading without compression seems like a waste of bandwidth.

That depends on your use case ;)

There’s some proof of concept code I found: http://sprunge.us/fdQF — it requires that an oauth token is in /tmp/token.json, but seems to work for me.

Motivated people could turn that into a clean backend for ACD :).

+1 from me :)
I compiled the backend @stapelberg found, and it indeed appears to work; it reuses code from rclone.

+1 too
I'm trying to use restic + acd_cli (a FUSE Python client for ACD), but it's very unreliable for now: some operations do not behave as expected (file truncate, rename) and restic randomly panics as a result.
A working ACD storage backend would do wonders.

I'm currently reworking the interface to the backends, this includes a radical simplification. This is basically done, but not yet merged. For the plan, see #383, the PR is #395.

Afterwards it will be much easier to implement new backends.

Before implementing many new backends, I'd like to have a list of rules that services we write backends for must fulfill; this may include requiring that a test instance of the service be available that we can run the integration tests against.

Do you by chance know whether there is a test service for ACD we can use for tests?

As far as I know, there is no test instance for ACD. There's no mention of such a thing here: https://developer.amazon.com/public/apis/experience/cloud-drive/content/restful-api

But the https://github.com/ncw/rclone project has already implemented an ACD backend in Go. It seems fairly reusable, as demonstrated by the proof of concept shared by @stapelberg.

Actually, looking at the revised interface, it would be reasonably easy to write a full wrapper for rclone filesystems. Maybe that way separate implementations aren't needed?

I don't know what @fd0's vision for restic's future is, but it would seem logical to focus the project on the backup intelligence instead of re-implementing a ton of remote filesystems one by one. Besides, both projects' licenses are compatible.
It would also solve the worries about how to test all those backends.

@klauspost, was your idea to create a wrapper around rclone/fs/fs.go? Is it doable without being tightly coupled to the internal logic of rclone?

Each backend implements the fs.Fs interface. Each file is represented as an fs.Object.

It should be fairly easy to create a restic backend that uses an rclone filessystem+folder, provided it is already set up in the rclone configuration.
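A hedged sketch of what such a wrapper could look like. The Object and Fs interfaces below are simplified stand-ins for rclone's fs.Object and fs.Fs (the real ones in rclone/fs/fs.go carry many more methods), and the in-memory filesystem exists only to make the example self-contained:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// Simplified stand-ins for rclone's fs.Object / fs.Fs interfaces.
type Object interface {
	Open() (io.ReadCloser, error)
	Remove() error
}

type Fs interface {
	NewObject(remote string) (Object, error)
	Put(in io.Reader, remote string) (Object, error)
}

// rcloneBackend adapts an rclone-style filesystem to the small set of
// operations a restic backend needs: save and load blobs by name.
type rcloneBackend struct {
	fs Fs
}

func (b *rcloneBackend) Save(name string, data []byte) error {
	_, err := b.fs.Put(bytes.NewReader(data), name)
	return err
}

func (b *rcloneBackend) Load(name string) ([]byte, error) {
	obj, err := b.fs.NewObject(name)
	if err != nil {
		return nil, err
	}
	rc, err := obj.Open()
	if err != nil {
		return nil, err
	}
	defer rc.Close()
	return io.ReadAll(rc)
}

// memFs is an in-memory Fs used only to make this sketch runnable.
type memFs struct{ data map[string][]byte }

type memObject struct {
	fs   *memFs
	name string
}

func (o *memObject) Open() (io.ReadCloser, error) {
	return io.NopCloser(bytes.NewReader(o.fs.data[o.name])), nil
}

func (o *memObject) Remove() error {
	delete(o.fs.data, o.name)
	return nil
}

func (f *memFs) NewObject(remote string) (Object, error) {
	if _, ok := f.data[remote]; !ok {
		return nil, fmt.Errorf("object not found: %s", remote)
	}
	return &memObject{fs: f, name: remote}, nil
}

func (f *memFs) Put(in io.Reader, remote string) (Object, error) {
	buf, err := io.ReadAll(in)
	if err != nil {
		return nil, err
	}
	f.data[remote] = buf
	return &memObject{fs: f, name: remote}, nil
}

func main() {
	backend := &rcloneBackend{fs: &memFs{data: map[string][]byte{}}}
	if err := backend.Save("snapshots/abc", []byte("hello")); err != nil {
		panic(err)
	}
	out, err := backend.Load("snapshots/abc")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // hello
}
```

The point is that restic would only program against the small adapter, so the coupling to rclone stays confined to one file.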

Hm, interesting idea, I have to think about it. Not having to implement all the backends ourselves looks like a good idea; on the other hand (at least at the moment) I must admit that I don't like the thought of a tight coupling between restic and rclone, as this introduces a dependency that we can't control...

I envision that restic should be easy to configure and use with a variety of suitable backends. This includes (in my opinion) only one place for configuration, e.g. of the backends. Maybe that's possible with rclone or at least part of their code. The interface looks suitable to be used with restic.

I pledged a $5 bounty for this feature.

Some thoughts:

  • Amazon Cloud Drive is using AWS S3 / CloudFront as its backend. The GET requests are always redirected with a Location header to Cloudfront. So you could use the Range header to request only a portion of a pack file as it is required by the new backend API. See: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/RangeGETs.html
  • You could start with integrating go-acd. That's the library which is used by rclone.
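The Range-based partial read described in the first bullet can be sketched in Go; the URL here is a placeholder, not a real ACD or CloudFront endpoint:

```go
package main

import (
	"fmt"
	"net/http"
)

// newRangeRequest builds a GET request for bytes [offset, offset+length)
// of a remote object. Since CloudFront honors the Range header, only the
// requested slice of the pack file would be transferred.
func newRangeRequest(url string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))
	return req, nil
}

func main() {
	req, err := newRangeRequest("https://example.com/pack/1234", 512, 1024)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Range")) // bytes=512-1535
}
```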

in case the priority of this FR depends on the popular vote: +1

👍

👍

Yes please 👍

How about adding some more bounties to this feature?

See: https://www.bountysource.com/issues/23684796-storage-backend-amazon-cloud-drive

Hey, thanks for your interest in restic in general and this backend in particular.

Just to give you a heads-up on what the blocker is here: I'm not sure how to handle third-party web services. For the local and sftp backends we have extensive tests in place that are run for every push/PR as part of the CI tests on Travis. This is also true for the s3 backend, where we're using a local Minio s3 server instance; thanks to this I've found several bugs in the minio client library we're using in the s3 backend.

How can we run CI tests for backend implementations that require a third-party service? Is there e.g. a test service for ACD we could use? Or maybe just take well-tested code from other projects such as rclone?

One solution might be to register an account with Amazon, whitelist it for the Cloud Drive API and then use that for the CI tests? The downside is that such a test depends on Cloud Drive being available, but I guess we can wait for an hour or so occasionally before merging a PR? :)

That's the only solution I can imagine right now that allows us to run the tests against a live service (and that's desirable, in my opinion).

When we add more backends for other services the following will happen:

  • The number of dependent services for running the CI tests grows
  • For each backend we will need a test account to use in the tests
  • This test account must allow parallel connections, as the tests (e.g. for a PR) are run in parallel

Did I forget anything?

Your list looks good. There are of course more effects, but I’m not sure whether they are in scope for the question you’re trying to answer:

  • More backends means changes that touch the backends API become more involved (need to update more code, test more code).
  • People who want to run the tests locally need to create test accounts or use their own account.
  • More backends make restic more appealing to more people :).

This test account must allow parallel connections, as the tests (e.g. for a PR) are run in parallel

I think a simple way to take care of this requirement is to use different directories for each test invocation. Sending requests in parallel is usually not an issue with these services, and the different directories make sure the tests don’t clash.

I think a simple way to take care of this requirement is to use different directories for each test invocation. Sending requests in parallel is usually not an issue with these services, and the different directories make sure the tests don’t clash.

What I meant was more of a question how many parallel connections a service accepts. For most web-based services this won't be limited (at least concerning the number of connections we require), but this may not be the case for other, more obscure services.

When a service limits the number of connections so aggressively that our testing is impacted, we could ask the service owner for an exception or rate-limit on our end as well. As a last resort, we could disable the tests for the backend in question or remove that backend altogether.

But, I suggest we cross that bridge when we get there :).

I tried to use ACD with restic (through FUSE) and the system is still unreliable (same errors as kisscool). I tried rclone to check their backend and there weren't any problems.

But I do not like rclone (snapshots...).

Conclusion: +1 for a native ACD backend!

Is there a problem with trying to write this? The issue seems to imply it's basically done, but delayed due to other architecture goals. This was a while ago, though, so... should I write this backend from scratch and pull request, or is it still being done internally?

Writing the backend is not the hard part; figuring out a good user interface for configuring access to the service is what's still missing here. What's the workflow for acquiring an oauth token for ACD? What does a user need to do to access ACD via the API?

rclone does this using a local webserver, and the backend for duplicity provided a link to a solution hosted by the developer. The former is cumbersome for headless, and the latter makes this into a service that you'd want to continue to provide for free, although the resources would cost quite a bit. That makes the former the only viable option, in my opinion.

Are you opposed to a webserver, in the style of rclone, to accomplish this? Perhaps restic could ask you which interface to bind on, or give you the opportunity to use a different copy of the same program on a non-headless machine (this is what rclone does) and simply copy credentials from one host to the other (JSON printed in the console, or on the page).

I'll have to check how rclone automates this process for testing, if it does that at all.

Yeah, testing is another story. How do we run the CI tests for these backends? Please don't get me wrong, I'd love to add cloud-based backends, but we need a clear strategy for this.

For configuration, I can also imagine having a CLI-based process, where restic prints instructions to the user. I'm not super familiar with the process at the moment, is that even possible?

What do you think?

Idea 1 - Just use rclone

What about simply integrating with rclone? Kind of a "unix philosophy" idea: you continue to be great at doing backups, they continue to be great at cloud copy, and users get the best of both worlds.

I'd have to see what their API affords, if it exists, but since we'd be leaving configuration solely to rclone (users would use rclone to add their cloud accounts, then would configure restic to "use rclone") you'd add far less complexity to restic.

Idea 2 - How the tests might look if we went webserver route

I'll preface this by saying I've not read the docs for ACD, but I have implemented some basic oauth stuff in the past.

If we implemented a webserver, we could simply test the following thing:

  • Does the webserver respond on the right ports/addresses
  • When I GET the / route, does it contain what I'd expect it to (a redirect, the text we put there, etc)
  • If I mock the response that would come from amazon (more thought needed on how to do this correctly) how does it behave?
  • Still needs a lot more thought: static "testing" credentials of some sort, to test the API. I'll have to look at how you test other services first.

and a few others, but the point is, we mock what we have to, and test the rest. I don't know if that'd test _everything_ but it'd test as much as we possibly could. I've not written extensive tests for my projects in the past, so please correct me if this isn't testing the right things haha

Sorry for glossing over your reply, was typing mine a bit before I saw yours!

For configuration, I can also imagine having a CLI-based process, where restic prints instructions to the user.

I imagine the process would probably be something like this, and this... again, is just rclone. I only reference it over and over because it's the only program that does _exactly_ what we're talking about quite well.

To start:
[screenshot: Using config]

Then, set up ACD:
[screenshot: ACD]

Using auto config (non-headless):

It very quickly opened a local webserver in my browser, which immediately redirected to Amazon for login, and once I signed in, it communicated with the CLI app, producing:

This, in my browser:
[screenshot: browser]

and this, in my command line
(60% of the credential is off to the right side of my screen, outside the screenshot; I don't think this is a security risk, haha)
[screenshot: code]

Then I save it, and it's now usable in the program.

If I select the other option, then I am simply directed to do the following:
[screenshot: non-headless]

and that just does the same process on the machine I downloaded rclone on (opens a browser, gives me an authorization key), but instead gives it to me in the CLI to copy and paste to the headless machine.

Thanks for describing the process in such great detail, that is already very similar to what I had in mind.

I'm wondering: why is the webserver needed at all? This process works for a "workstation" type of machine, but not on a server (where there is no browser). The workflow used by rclone is described here: http://rclone.org/remote_setup/

I don't know why we need a webserver for this, but I haven't implemented an oauth-based login workflow yet.

We'll also need a config file to store the token configured for the remote in, that's also not yet done.

From my understanding, which is limited, the oauth data is provided to the user via GET parameters in a redirect.

Have a look at the URL in my screenshot of the browser. That was put there by Amazon. After I hit sign-in on my amazon cloud drive, it redirected me, immediately, to that 127.0.0.1 URL.

Perhaps that is the only way to get this data. This is likely the case, because rclone implemented a webserver instead of picking another, simpler solution. When I implemented oauth before, this seemed to be the implication.

If I am correct, then it follows that you must run your own webserver, with a page that redirects to Amazon and a page that handles the redirect back from Amazon, and this must be accessed through a web browser.


As for the config file, I think all we need is a file in a default location (~/.restic.conf) that can be overridden via a flag or environment variable. I think this is a bit dirty, but it's the only viable solution that is transparent to the average user yet powerful for those who wish to do it "their way".

That sounds plausible. Let me think about a strategy here, this may take some time.

We'll need to:

  • have a config file to store the authentication tokens
  • have "instances" of backends, e.g. something called my_amazon_account which is an ACD backend configured with a login token, so users can run restic --repo my_amazon_account:/foo/bar/dir ...
  • have a workflow to create these login tokens
  • register a client id and secret for use with restic, and hide it in the source code (similar to what rclone does)
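The my_amazon_account:/foo/bar/dir syntax from the list above could be parsed with a simple split; a purely hypothetical sketch of that syntax:

```go
package main

import (
	"fmt"
	"strings"
)

// splitRepo splits a repository spec like "my_amazon_account:/foo/bar/dir"
// into a backend instance name and a path within that backend.
func splitRepo(spec string) (instance, path string, err error) {
	parts := strings.SplitN(spec, ":", 2)
	if len(parts) != 2 || parts[0] == "" {
		return "", "", fmt.Errorf("invalid repository spec %q", spec)
	}
	return parts[0], parts[1], nil
}

func main() {
	inst, path, err := splitRepo("my_amazon_account:/foo/bar/dir")
	if err != nil {
		panic(err)
	}
	fmt.Println(inst, path) // my_amazon_account /foo/bar/dir
}
```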

Anything else I'm missing here?

I think you got the big stuff outlined there.

Would you want to move all current backends into a single abstraction that supports this, or would this whole system become a "cloud" backend in the current sense of a backend (which itself is configured through special restic commands)?

Each instance of a cloud backend (google drive, onedrive, amazon cloud drive, S3?) has the following components:

  • Optional: authorization mechanism
  • Optional: arbitrary persistent state, typically related to authorization, but should support other things. This would be written by restic to a file for each instance of the backend.
  • A protocol for communication (ie: the interface/API)

and maybe some other stuff I'm missing

The current abstraction, from my quick read, only relates to the last thing. I think this is a pretty smart way of handling backends, if you're looking to revamp it a bit.

The other option is to simply, as I said, implement a "cloud" backend which does all of these things and rolls all the different providers together under its umbrella.

  • Have a workflow to refresh auth tokens (if they expire and the provider supplies a refresh token, which should be stored next to the auth token in your point 1)

Here's some background in regards to embedding a client secret in open source applications: http://stackoverflow.com/a/28109307

As far as I understand the problem: You're not allowed to embed a client secret in an open source application. rclone employs some obfuscation to hide what they're embedding.

I doubt that embedding a static client id/secret in restic's source code is a good idea. On the other hand, having the user register an application themselves is complicated.

This article describes how to do oauth2 with Go: https://jacobmartins.com/2016/02/29/getting-started-with-oauth2-in-go/

I doubt that embedding a static client id/secret in restic's source code is a good idea.

There is no real solution, it is a broken concept to assume that any client can keep a secret.

However, if you consider what the client secret contains, it is not that important. The only real thing it allows is for Amazon (and others) to be able to identify a specific client, nothing more. It does not grant any special access - your tokens are used for that.

Sure, a publicly available "client secret" can let other applications identify themselves as restic, but other than the risk that "restic" will be banned (or more likely rate limited) as a client, there is not much risk in exposing the client "secret". It will never put any user data in jeopardy.

The problem here is that somebody needs to register the clientID, for example me. If I'm using my normal Amazon account (or even worse, my Google account), and "violate" the TOS for the service by publishing the client secret, they can terminate my account. That's not something I'm going to risk.

Another problem is that once the client secret changes (or is revoked), we're stuck with older versions of restic e.g. in Debian stable which are unable to communicate with the service because of a hardcoded (and now invalid) client secret. This is the case even if access to the service is restored shortly after, but the client secret has changed.

I've thought about possible solutions and found only two:

  • Live with the risk and just put the client secret into the source
  • Build restic in a way that users need to register their own client ID and client secret, via a nice UI that minimizes the hassle

Currently, I'm in favor of the second option, we need a UI for the oauth token thing anyway. What do you think?

If I'm using my normal Amazon account [...]

I know that Nick has had some correspondence with Amazon, since rclone was being rate limited due to its many users. It is however my impression (from memory) that they were quite forthcoming and encouraged open source development, and have made exemptions for his client. So I guess my advice would be to contact them and see how things go from there. In the overall picture I don't think they would mind the business coming from restic users.

Interesting idea, do you have any hint on who to contact at Amazon?

For Microsoft OneDrive he said that he did not contact anyone: https://github.com/ncw/rclone/issues/372

I know that @breunigs had bad luck with his amazon cloud drive duplicity backend — they wouldn’t give him any rate limit exemptions AFAIK.

I have only read the last few comments, so please forgive me if this info is not needed:

  • rclone implements the web server on top of it offering remote setup where you copy the URL. Having a local webserver is just more convenient
  • if you want to whitelist any redirect target in Amazon, it has to be on a https machine – linking to http URLs is not okay. Only exception is localhost. So, for remote setup you can either redirect the user to a blank page and hope they realize what they have to do, or host some page with instructions. I added https://breunig.xyz/duplicity/copy.html for duplicity, since it doesn't have https infrastructure yet. Amazon will add all details in the query string, so you can get away with making this a static page
  • You need an Amazon Developer account. You can use your existing credentials to log in I believe, but you can also create a new one
  • There is a process where you register your app and then at some later stage you can create a security profile for said app. This process is very confusing, because of horrible UX, but it should work without human interaction from Amazon. (Note: by App they usually refer to "mobile apps", but not always. Click around a bit)
  • What the limits are is unclear; Amazon doesn't say. It's clear there are multiple stages: per user, per API endpoint, per credentials
  • If you want production limits, you send an email with "details" to [email protected]. Use a big player mail server, or they will tag you as spam and it takes a month or two until some poor soul went through all their spam.

Also, a final word of advice: read through rclone's workarounds for Amazon Drive. The API contains a lot of undocumented "eventual consistency" gotchas. It even goes out of its way to cache an outdated response it gave you, so that you need to wait even longer if you were too hasty to begin with. This is on top of it reporting errors when there are none, one just needs to wait.

HTH,
Stefan

Thanks for the information!

Just throwing something out there:

What if we remove all backends (but local and REST) from restic and stick them into restic/rest-server?

This would allow restic to focus on doing backups properly, while filesystem implementations are done in the rest-server.
This also leaves restic with just one backend API to maintain.

This doesn't solve the testing problem, but it will certainly help keep the restic source clean/focused, and it makes API changes inside restic easier.

Thanks for the suggestion. Unfortunately I don't like it at all, in my opinion this approach (adding an intermediate layer including a new transport via HTTP) will lead to even more problems.

The backend API interface was stable for a long time, then changed recently, and will be stable again. The interface is already rather small.

We should try to get backends into restic (including proper CI tests) as soon as possible, that's IMHO the only way to make sure they work.

In case of the Amazon ACD backend, we need to answer the outstanding questions first.

The Amazon Developer Guide for Amazon Drive (as it's called these days) states that:

What Not To Build

[...]

  • Don’t build apps that encrypt customer data

I feel that Amazon Drive is not the right platform for securely storing encrypted backups.

Interesting. This must be a new addition, as it definitely was not the case when ACD support was added to Arq.

Seems ACD is not a real storage option after all.

Indeed an addition within the last year, wasn't listed one year ago: http://web.archive.org/web/20160322034250/https://developer.amazon.com/public/apis/experience/cloud-drive/content/developer-guide

Amazon has since clarified this in https://forums.developer.amazon.com/questions/54909/impact-of-dont-encrypt-customer-data-part-of-drive.html:

What if the customer choses to encrypt their data?
They can do that, and that is fine.

So, restic and other apps should be good.

I think their intention is to protect users from having their data encrypted without a way to recover it.

Steffen

One other motivation which I find plausible is to increase interoperability — if each application encrypts their files, the user’s ability to switch between applications is severely hampered.

I asked Arq Backup support. They encrypt everything, and said that their app had been approved by Amazon, and to not worry.

I'm not sure what Amazon is trying to say, but it seems they are now evaluating each case as it comes in.

Not sure if anybody is aware of the recent ACD drama with acd_cli and rclone, but a TL;DR of the situation is that they have had their ACD API access revoked due to TOS violations. Their efforts to regain API access are apparently being hampered by the fact that Amazon has stopped accepting new third-party apps for ACD. I assume this latter revelation stops any Restic ACD support in its tracks, unless the project had already obtained ACD API access.

acd_cli API access was revoked due to a security issue with their oauth app, not a TOS violation. The problem has been fixed and Amazon re-instated their key. Although this is off topic from this project.

New ACD API access is currently closed.

Thanks for posting this here, I wasn't aware of it. I had reservations implementing ACD, and it seems that Amazon indeed did not like secrets in the code of an Open Source program: rclone was banned for it: https://forum.rclone.org/t/rclone-has-been-banned-from-amazon-drive/2314

On the other hand, acd_cli implemented an OAUTH auth service (not sure what the correct nomenclature here is). This handles authorization for all users, and there apparently was a bug that allowed people to access/modify other people's files.

Since Amazon isn't accepting new clients anyway I'm closing this issue for now. Thanks!
