We are trying to index 125k images from our S3 bucket into Craft. The images are ~10 MB TIFFs, and we transform them with an AWS Lambda function rather than Craft's PHP transforms. While trying to index these assets we ran into a few issues:
1- ./craft index-assets/all crashes around 15k assets because of a memory issue (we increased memory_limit in php.ini and it still happens; see the note right after this list).
2- ./craft index-assets/one photos/folder/subfolder indexes the right number of photos (~500 per folder). It is slow, but it does fix the memory issue.
3- We hit MySQL execution timeouts.
4- We can't really cancel the old index-assets job we started in the control panel, and it seems our PHP process kept running in the background, so we had to restart php-fpm.
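A hedged aside on item 1: the memory limit can also be raised per invocation instead of globally in php.ini, which at least rules out the CLI not picking up the ini file. Something like:

# Override memory_limit for this one CLI run only (-1 = unlimited).
# Just a sanity check; the indexing command itself is unchanged.
php -d memory_limit=-1 ./craft index-assets/all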
Also, maxCachedCloudImageSize is set to 0 in our general config.
While trying to diagnose the problem we found this in backend/vendor/craftcms/cms/src/services/AssetIndexer.php:561:
if (!is_array($dimensions)) {
    $tempPath = AssetsHelper::tempFilePath(pathinfo($filename, PATHINFO_EXTENSION));
    $volume->saveFileLocally($indexEntry->uri, $tempPath);
    $dimensions = Image::imageSize($tempPath);
}
No matter what, this fetches the file locally and tries to read its dimensions. That does not reflect our setting of not caching remote images. Our best guess is that this is why our /storage/runtime/temp was filling up with the 10 MB TIFFs.
We understand that 125k is not really a "normal" case, but even with all caching off we still see files going to /storage/runtime/temp, and that shouldn't be the case.
Main issues:
1- Too much memory usage in the index-assets loop; processing the assets in chunks would probably fix that.
2- /storage/runtime/temp is still used to get file dimensions while "no-caching" is on.
This has been a huge problem for us as well, on a couple of projects. There's this nice little comment in the code. Unfortunately it just doesn't work; it always transfers the complete files when indexing assets on S3 volumes.
https://github.com/craftcms/cms/blob/bc77a2d4bcee8dcce04c4b05df120e6955cfe4a7/src/services/AssetIndexer.php#L550
Related: https://github.com/craftcms/aws-s3/pull/23, https://github.com/craftcms/aws-s3/pull/95
@carlcs We found a little workaround since we (really) need to process those 125k.
We batch the index by subfolders (./craft index-assets/one {volume}/{subfolder}/{subfolder}/ --cache-remote-images=0 --delete-missing-assets) in a .sh (see the sketch after the loop below), and at the same time we run a recurring clear-caches/temp-files:
while true
do
    ./craft clear-caches/temp-files
    sleep 300
done
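The batching side is roughly the following sketch; the "photos" volume handle and the subfolders.txt list are placeholders for our actual structure:

#!/bin/bash
# Sketch of our batching script (volume handle and subfolders.txt are placeholders).
# Each subfolder holds ~500 photos, so each index-assets/one call stays small enough
# to avoid the memory and MySQL timeout issues we hit with index-assets/all.
VOLUME="photos"
# subfolders.txt is a plain list of folder/subfolder paths, one per line.
while read -r SUBFOLDER; do
    ./craft index-assets/one "$VOLUME/$SUBFOLDER/" --cache-remote-images=0 --delete-missing-assets
done < subfolders.txt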
Until this issue is fixed, that lets us proceed with our indexing process step by step.
@jesuismaxime Thanks! I just wish we could stop wasting bandwidth, time, and the environment.
Seeing as @boboldehampsink had the closest encounter with why the related fixes were implemented in the AWS volume, maybe he can shed some light or provide some background on why this was changed recently.
The AWS S3 plugin doesn't use "realtime" streaming from S3 because that would break downloading assets from the CP: downloads use stream + fseek, which aren't compatible with it. Instead, the plugin downloads files to memory and then streams them from memory.
It could switch to realtime streaming, but that means asset downloading has to be refactored.
Just a quick thought: the main issue seems to be that Craft CMS needs the dimensions of the asset. Does Craft really need the width and height of each asset? S3 doesn't provide this information without "downloading" the asset, so it would be great to be able to skip it during indexing, especially with "no-caching" on, or to make width/height optional globally.
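To illustrate (bucket and key below are hypothetical): a HEAD request against S3 only returns object metadata such as size and content type, never pixel dimensions, so the only way for Craft to fill in width/height is to pull the file down and inspect it.

# Hypothetical bucket/key, just to show what S3 exposes without a download:
aws s3api head-object --bucket example-bucket --key photos/folder/subfolder/image.tif
# The response contains ContentLength, ContentType, ETag, LastModified, etc.
# There is no width/height field, so dimensions require fetching the object itself.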