Server: Improve distribution of previews and avatars to multiple object store buckets

Created on 29 Jul 2020 · 4 comments · Source: nextcloud/server

Problem

  • right now all previews (and avatars) are stored in the app data folder
  • this folder is in oc_storages inside the root storage
  • in a multibucket setup each storage is put into one bucket
  • the root storage is always put into the first bucket (see https://github.com/nextcloud/server/blob/9e884567680e359365a74f2b1039ce4e919b8400/lib/private/legacy/OC_Util.php#L159)
  • this results in a quite uneven distribution of objects across the buckets: the previews of all files end up in a single bucket, which limits the usability of the object store, since there are recommended upper limits on the number of objects per bucket (see the example config below)
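
For context, this is roughly what a multibucket primary storage setup looks like in config.php (the exact arguments depend on the object store backend; all names and credentials below are placeholders):

```php
<?php
// config/config.php (excerpt): illustrative multibucket object store setup.
// Every storage is mapped to one of the buckets, but the root storage (and
// with it appdata, previews and avatars) always ends up in the first bucket.
$CONFIG = [
    'objectstore_multibucket' => [
        'class' => '\\OC\\Files\\ObjectStore\\S3',
        'arguments' => [
            'bucket' => 'nextcloud_',  // bucket name prefix
            'num_buckets' => 64,       // storages are spread over this many buckets
            'autocreate' => true,
            'key' => 'PLACEHOLDER_KEY',
            'secret' => 'PLACEHOLDER_SECRET',
            'hostname' => 's3.example.com',
            'use_ssl' => true,
            'use_path_style' => true,
        ],
    ],
];
```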

Ideas

  • first there was the idea to move the previews back to the user storage, since the user's home storage (and thus the correct storage and bucket) can be looked up from the fileId. @icewind1991 highlighted that this was already the case in the past and we moved away from it, because it resulted in a lot of complexity for sharing of files that are on an external storage. See #1741
  • @icewind1991 brought up another idea that is now possible due to #19214 - we could add a config check to the app data wrapper for the preview folder that changes the bucket depending on the preview subfolder. This would avoid moving the previews out of the app data (and thus the issues with sharing and external storages) while still distributing the previews across buckets. This would most likely happen in https://github.com/nextcloud/server/blob/cb057829f72c70e819f456edfadbb29d72dba832/lib/private/Preview/Storage/Root.php#L70 (see the sketch below)
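
A rough sketch of how such a check could look; the class, method, and config switch below are hypothetical and only illustrate that the wrapper, which already knows the preview subfolder, is the place where a different bucket could be picked instead of always using the root storage's bucket:

```php
<?php
// Hypothetical sketch of a bucket check in the appdata wrapper for previews
// (the layer around lib/private/Preview/Storage/Root.php). None of these names
// exist in the server code base.
class DistributedPreviewRoot {
    public function __construct(
        private bool $distributePreviews, // hypothetical config switch
        private int $numBuckets,          // number of preview buckets
        private string $bucketPrefix      // e.g. 'nextcloud-previews-'
    ) {
    }

    public function getBucketForSubFolder(string $subFolder): string {
        if (!$this->distributePreviews || $this->numBuckets < 2) {
            // old behaviour: previews stay in the first bucket
            return $this->bucketPrefix . '0';
        }
        // new behaviour: derive a stable bucket index from the preview subfolder
        $index = hexdec(substr(md5($subFolder), 0, 7)) % $this->numBuckets;
        return $this->bucketPrefix . $index;
    }
}
```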

Things to keep in mind

  • have a migration path - we could just drop all previews and regenerate them, but this could cause problems on bigger instances right after the migration due to the massive amount of previews that would need to be regenerated.
  • the migration path should be transparent and step by step (i.e. per user or even per preview/previewed file):

    • having a flag in the user config/filecache that indicates if the new or old approach is used

    • right after the upgrade, everything uses the old approach

    • having a background job (that only runs on CLI or in very small batches) that picks up unmigrated previews and marks them as "migration in progress" (see the sketch after this list)

    • the job then copies the previews, changes the flag and then deletes the old previews

    • having a way to see the status of this migration somewhere in the admin panel or the CLI

  • maybe also move directly to an explicit way of recording which preview is in which bucket, to make it easier to extend the bucket count later and to be able to move new files into new buckets (see #22039)
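
A minimal sketch of such a background job, assuming the per-user flag and the copy/delete helpers exist; none of the interfaces or names below are real server code, they only label the "lock, copy, flip flag, delete old" flow from the list above:

```php
<?php
// Hypothetical sketch of the step-by-step preview migration job.

interface PreviewMigrationQueue {
    /** @return string[] user ids that still use the old preview layout */
    public function nextBatch(int $limit): array;
}

interface PreviewMover {
    public function copyToNewBuckets(string $userId): void;
    public function deleteOldPreviews(string $userId): void;
}

interface MigrationState {
    public function tryLock(string $userId): bool;      // mark "migration in progress"
    public function markMigrated(string $userId): void; // flip the per-user flag
    public function unlock(string $userId): void;
}

class MigratePreviewBuckets {
    private const BATCH_SIZE = 50; // very small batches so web cron stays cheap

    public function __construct(
        private PreviewMigrationQueue $queue,
        private PreviewMover $mover,
        private MigrationState $state
    ) {
    }

    public function run(): void {
        foreach ($this->queue->nextBatch(self::BATCH_SIZE) as $userId) {
            // 1. Lock the user so no second job picks it up concurrently.
            if (!$this->state->tryLock($userId)) {
                continue;
            }
            // 2. Copy the previews into the per-bucket layout; the old ones stay readable.
            $this->mover->copyToNewBuckets($userId);
            // 3. Flip the flag so reads and writes use the new layout from now on.
            $this->state->markMigrated($userId);
            // 4. Only now delete the old previews and release the lock.
            $this->mover->deleteOldPreviews($userId);
            $this->state->unlock($userId);
        }
    }
}
```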

@kesselb @rullzer @icewind1991 Feedback welcome

Labels: 1. to develop, enhancement, object storage, high

All 4 comments

Hi @MorrisJobke as I can see, the idea is to split the preview folder into a smaller number of folders than the default (one per fileId) using the md5-first-7-letters approach.
Up to that point there is no big problem; the issue it solves is the load of directory listings (mostly on local filesystem environments).
My concern on this ticket is related to the bucket configuration and the limits on the buckets.
The idea is to have a configuration (like the current user bucket preference) that saves the previews randomly across the buckets, using the first seven letters of the md5. Even though maybe an "on the fly" approach can be used to avoid some database inner joins.

A math approach can be:

  1. First, check the multibucket config and get the number of buckets
  2. Take the first 7 letters of the md5 and convert them from hexadecimal to decimal
  3. Divide the decimal result by the number of buckets and take the remainder of the division
  4. Dynamically set the bucket number for uploading and downloading the preview, based on that calculation (a worked example follows below)
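
A worked example of that calculation (the fileId and the bucket count are made up, the rest follows directly from the steps above):

```php
<?php
// Worked example of the four steps above.
$numBuckets = 16;                       // 1. from the multibucket config
$hash = md5('12345');                   //    md5 of the fileId: 827ccb0eea8a706c4c34a16891f84e7b
$prefix = substr($hash, 0, 7);          // 2. first 7 letters: 827ccb0
$decimal = hexdec($prefix);             //    converted to decimal: 136826032
$bucketIndex = $decimal % $numBuckets;  // 3. remainder of the division: 0
echo "preview of file 12345 goes to bucket $bucketIndex\n";  // 4. use that bucket
```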

This will indeed spread the preview storage across the buckets.
Something I would mention is that this only works with a static number of buckets. If the bucket count is increased, it becomes mandatory to save the calculated bucket per id in an additional table.
Regards

> Hi @MorrisJobke as I can see, the idea is to split the preview folder into a smaller number of folders than the default (one per fileId) using the md5-first-7-letters approach.

This is already implemented. And since this is done as a layer, we can reuse that layer to also do the distribution across multiple buckets.

> A math approach can be:
>
>   1. First, check the multibucket config and get the number of buckets
>   2. Take the first 7 letters of the md5 and convert them from hexadecimal to decimal
>   3. Divide the decimal result by the number of buckets and take the remainder of the division
>   4. Dynamically set the bucket number for uploading and downloading the preview, based on that calculation

Yep - that would also be our naive approach. We may join the efforts here with the ideas of #22039 and make this a bit more permanent. Otherwise changing the logic or changing the number of buckets would lead to wrongly calculated bucket numbers. Storing the result of the formula together with the preview makes it possible to change the number of buckets or the formula itself later on. But maybe this is also something we could do afterwards, once we have a solution for the first problem.

> Something I would mention is that this only works with a static number of buckets. If the bucket count is increased, it becomes mandatory to save the calculated bucket per id in an additional table.

There we plan to use the filecache_extended table most likely.
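
To illustrate why the explicit assignment helps: a small sketch where the stored mapping is just an in-memory array standing in for a persisted column (e.g. in filecache_extended); nothing below is actual server code:

```php
<?php
// Sketch: look up a stored bucket assignment first and only fall back to the
// formula for previews that do not have one yet. The array stands in for a
// persisted mapping; nothing here is actual server code.
class PreviewBucketResolver {
    /** @var array<int, int> fileId => bucket index (the "persisted" assignments) */
    private array $assignments = [];

    public function __construct(private int $numBuckets) {
    }

    public function resolve(int $fileId): int {
        // Previously assigned previews keep their bucket, even if the bucket
        // count or the formula changes later on.
        if (isset($this->assignments[$fileId])) {
            return $this->assignments[$fileId];
        }
        // New previews get a bucket from the formula and the result is stored.
        $bucket = hexdec(substr(md5((string)$fileId), 0, 7)) % $this->numBuckets;
        $this->assignments[$fileId] = $bucket;
        return $bucket;
    }
}

// Increasing the bucket count later only affects previews resolved afterwards.
$resolver = new PreviewBucketResolver(16);
echo $resolver->resolve(42) . "\n";
```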

Thanks for the feedback.

The first proof of concept is in #22063 - it introduces the additional storages and already stores new previews in them.

Implemented in #22063. And there is a migration tool that migrates pre-Nextcloud 19 previews to the new preview folder structure introduced in #19214 (this also works for non-multibucket setups): #22135. If there are already previews in the folder structure of #19214, there is no migration path yet.
