Rclone: copy large objects as-is when syncing swift-to-swift

Created on 20 Jul 2016 · 3 comments · Source: rclone/rclone

When syncing a container between two Swift endpoints with rclone, a large object (SLO or DLO) in the source is downloaded in full and then uploaded to the destination as a single object (re-segmented only if necessary). This is problematic for two reasons. First, there are valid use cases where a Swift user builds a large object out of segments of arbitrary sizes, and reassembly in the destination breaks that segmentation scheme. Second, if the segments live in the same container as the manifest object, both the reassembled object and the segments are migrated, doubling the data in the destination.

It would be preferable if rclone, when syncing between two Swift containers, recognized a large object, migrated the individual segments from the source to the destination, and then re-created the manifest in the destination.

Labels: Swift, bug, doc fix


All 3 comments

This would be difficult to achieve with the architecture of rclone at the moment. The two remotes (source swift and destination swift) know nothing about each other and have no way to communicate something like chunk sizes.

You could sync the _segments container and remake the manifest I suppose.

Why is this causing you a problem?

For reference: http://docs.openstack.org/developer/swift/overview_large_objects.html

In Swift, there are two ways to create a large object. The first, a static large object (SLO), uses a manifest: a JSON file listing the paths to the individual chunks along with each chunk's size, its byte range within the larger object, and its MD5 checksum. The second, a dynamic large object (DLO), has no explicit manifest; instead, a zero-byte object carries a metadata header (X-Object-Manifest) specifying a container and path prefix for the chunks, and lexicographical ordering of the chunk names determines their order within the large object.
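
The distinction above can be sketched as a small header check. This is not rclone code, just a minimal illustration of how the two flavors are distinguishable from the headers a HEAD request returns (SLO manifests carry `X-Static-Large-Object: true`; DLO manifests carry `X-Object-Manifest`):

```python
def classify_large_object(headers: dict) -> str:
    """Classify a Swift object from its response headers (case-insensitive)."""
    headers = {k.lower(): v for k, v in headers.items()}
    # SLO manifest objects carry X-Static-Large-Object: true.
    if headers.get("x-static-large-object", "").lower() == "true":
        return "SLO"
    # DLO manifest objects carry X-Object-Manifest: <container>/<prefix>.
    if "x-object-manifest" in headers:
        return "DLO"
    return "regular"

print(classify_large_object({"X-Static-Large-Object": "True"}))
print(classify_large_object({"X-Object-Manifest": "backups_segments/video/"}))
print(classify_large_object({"Content-Length": "1024"}))
```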

In both cases, the chunks can be of any size (not necessarily the cluster's maximum object size, which a GET on cluster.example.com/info reports as swift.max_file_size). In addition, the paths to the chunks include the container name, so the chunks can live anywhere, even in the same container as the manifest. Not all clients use the same segmenting scheme: not everyone puts chunks in a separate *_segments container, and some clients store the chunks alongside the object itself.

The issue (for us) with how rclone currently handles large objects arises when copying data from one Swift cluster to another where a user in the source cluster has used (or written) an application that segments uploaded files into chunks of varying sizes, possibly stored in the same container. The application (and the user) expects the large object to still be a large object in the destination cluster, looking exactly as it was uploaded (i.e. the chunks are all there, the same size, and in the same place). Also, if we are migrating a container that holds large objects whose chunks are stored in that same container, the current implementation migrates both the (reassembled) object _and_ the chunks, which doubles the amount of data that gets copied.

A possible solution would be to do a HEAD on the object before downloading it for copy and check whether it is a large object (indicated by the X-Static-Large-Object or X-Object-Manifest header). If it is, download the individual segments (and upload them to the destination cluster), then recreate the original manifest in the destination cluster (for an SLO you can download the JSON manifest by using a query parameter; see http://docs.openstack.org/developer/swift/overview_large_objects.html#retrieving-a-large-object).
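
The manifest-recreation step could look roughly like the sketch below. It assumes the segments themselves have already been copied to identical paths in the destination, and only converts the JSON that `GET <object>?multipart-manifest=get` returns for an SLO (entries keyed by "name", "hash", "bytes") into the body that `PUT <object>?multipart-manifest=put` expects (entries keyed by "path", "etag", "size_bytes"). The actual HTTP calls are left out:

```python
import json

def rebuild_slo_manifest(get_manifest_json: str) -> str:
    """Convert an SLO manifest from its GET form to its PUT form."""
    segments = json.loads(get_manifest_json)
    put_body = [
        {
            "path": seg["name"],         # e.g. "/my_segments/video/000001"
            "etag": seg["hash"],         # MD5 of the segment
            "size_bytes": seg["bytes"],  # segment size in bytes
        }
        for seg in segments
    ]
    return json.dumps(put_body)

src = json.dumps([
    {"name": "/my_segments/video/000001", "hash": "abc123", "bytes": 1048576},
    {"name": "/my_segments/video/000002", "hash": "def456", "bytes": 524288},
])
print(rebuild_slo_manifest(src))
```

If the destination uses a different container name for the segments, the "path" field would also need rewriting, which is exactly where an rclone implementation would have to map source containers to destination containers.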

Does this all make sense? We have a huge use case for using rclone to perform cluster migrations and this functionality would help our customers' users not be disrupted by migrations.

We're encountering this bug also, and the large object segment duplication behavior is particularly problematic for us. Has anyone worked on (or around....) this issue and has any code or process to share?
