Cms: Cache Remote Images OFF is not respected and causes issues with large S3 buckets during indexing.

Created on 3 Sep 2020 · 6 Comments · Source: craftcms/cms

Description

We are trying to index 125k images from our S3 bucket into Craft. The images are ~10 MB TIFFs; we will transform them with an AWS Lambda function rather than Craft's PHP transforms. While trying to index these assets we ran into several issues:

1- ./craft index-assets/all crashes at around 15k assets because of a memory issue (we increased memory_limit in php.ini, but it still happens).
2- ./craft index-assets/one photos/folder/subfolder indexes the right number of photos (~500 per folder). It is slow, but it avoids the memory issue.
3- We hit MySQL execution timeouts.
4- We can't really cancel the earlier index-assets run we started from the control panel, and PHP seemed to keep running in the background, so we had to restart php-fpm.

Also, maxCachedCloudImageSize is set to 0 in our general config.

While trying to diagnose the problem, we found this in backend/vendor/craftcms/cms/src/services/AssetIndexer.php:561:

if (!is_array($dimensions)) {
    $tempPath = AssetsHelper::tempFilePath(pathinfo($filename, PATHINFO_EXTENSION));
    $volume->saveFileLocally($indexEntry->uri, $tempPath);
    $dimensions = Image::imageSize($tempPath);
}

No matter what, this downloads the file locally to read its dimensions. That does not respect our setting of not caching remote images. Our best guess is that this is why our /storage/runtime/temp was filling up with the 10 MB TIFFs.

We understand that 125k is not really a "normal" case, but even with all caching OFF we still see files going to /storage/runtime/temp, and that shouldn't be the case.

Main issues:
1- Too much memory usage in the loop inside index-assets; processing in chunks would probably fix that.
2- /storage/runtime/temp is still used to get file dimensions even while "no-caching" is ON.

Steps to reproduce

  1. Have a large S3 bucket.
  2. Connect Craft CMS to it with an S3 asset source.
  3. Try to index all assets (./craft index-assets/all --cache-remote-images=0).
  4. Watch /storage/runtime/temp grow and PHP memory usage explode.
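
To confirm step 4, the size of the temp directory can be sampled before and after an indexing run. A minimal sketch, assuming it is run from the project root; the helper name dir_size_kb is made up for illustration and is not part of Craft.

```shell
# Hypothetical helper: print the disk usage of a directory in kilobytes.
# du -sk reports "SIZE<TAB>PATH"; cut keeps only the size column.
dir_size_kb() {
  du -sk "$1" | cut -f1
}
```

For example, calling dir_size_kb storage/runtime/temp before and after ./craft index-assets/all shows how much the indexer left behind.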

Additional info

  • Craft version: 3.5.7
  • PHP version: 7.3.21
  • Database driver & version: MySQL 10.1.45
  • Plugins & versions:
    Amazon S3: 1.2.11
bug

All 6 comments

This has been a huge problem for us as well, on a couple of projects. There's this nice little comment in the code. Unfortunately it just doesn't work; it's always transferring the complete files when indexing assets on S3 volumes.
https://github.com/craftcms/cms/blob/bc77a2d4bcee8dcce04c4b05df120e6955cfe4a7/src/services/AssetIndexer.php#L550

Related: https://github.com/craftcms/aws-s3/pull/23, https://github.com/craftcms/aws-s3/pull/95

@carlcs We found a small workaround, since we (really) need to process those 125k images.

We batch the indexing by subfolder (./craft index-assets/one {volume}/{subfolder}/{subfolder}/ --cache-remote-images=0 --delete-missing-assets) in a .sh script and, at the same time, run a recurring clear-caches/temp-files:

while true  
do  
  ./craft clear-caches/temp-files  
  sleep 300  
done

Until this issue is fixed, we can work through our indexing process step by step.
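
A sequential variant of this workaround could look like the sketch below. It is not the exact script from this thread: instead of clearing temp files every 300 seconds in parallel, it clears them after each subfolder. The function name index_in_batches, the CRAFT_BIN variable and the folder list file are assumptions for illustration; the craft commands and flags are the ones quoted above.

```shell
#!/usr/bin/env bash
# Sketch: index one subfolder at a time, clearing temp files between batches.
# CRAFT_BIN is an assumed override; it defaults to ./craft in the project root.
CRAFT_BIN="${CRAFT_BIN:-./craft}"

# Reads one "{volume}/{subfolder}" path per line from stdin.
index_in_batches() {
  local path failed=0
  while IFS= read -r path; do
    [ -z "$path" ] && continue   # skip blank lines
    # stdin is redirected so the command cannot consume the folder list.
    if ! "$CRAFT_BIN" index-assets/one "$path" \
        --cache-remote-images=0 --delete-missing-assets < /dev/null; then
      echo "indexing failed: $path" >&2
      failed=1
    fi
    # Remove the temp copies left by this batch before starting the next one.
    "$CRAFT_BIN" clear-caches/temp-files < /dev/null
  done
  return "$failed"
}
```

With a hypothetical folders.txt listing one path per line, it would be run as index_in_batches < folders.txt. Clearing after each batch keeps the temp directory bounded by the size of one subfolder, at the cost of not cleaning up while a single large folder is still being indexed.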

@jesuismaxime Thanks! I just wish we could stop wasting bandwidth, time and environment.

Seeing as how @boboldehampsink had the closest encounter with why the related fixes were implemented in the AWS Volume, maybe he can shed some light or provide some background into why this was changed recently.

The AWS S3 plugin doesn't use "realtime" streaming from S3 because it would break downloading assets from the CP: those downloads use stream + fseek, which aren't compatible with it. Instead, it downloads the files to memory and then streams them from memory.

It could switch to realtime streaming, but that would mean refactoring how assets are downloaded.

Just a quick thought: the main issue seems to be that Craft CMS needs the dimensions of each asset. Does Craft CMS really need the width and height for each asset? S3 doesn't provide this information without "downloading" the asset. It would be great to bypass it in the indexing process, especially with "no-caching" ON, or to make width/height optional globally.
