Silverstripe-framework: HTTPStreamResponse potentially storing entire file in memory

Created on 28 Jul 2019  Â·  12Comments  Â·  Source: silverstripe/silverstripe-framework

On v4.

Relating to this issue raised https://stackoverflow.com/questions/57085714/silverstripe-4-large-asset-downloads-exhausts-memory-limit-in-httpstreamrespon

Looking at the HTTPStreamResponse, in getBody():

public function getBody()
    {
        $body = $this->getSavedBody();
        if (isset($body)) {
            return $body;
        }
        // Consume stream into string
        $body = $this->consumeStream(function ($stream) {
            $body = stream_get_contents($stream);
            // If this stream isn't seekable, we'll need to save the body
            // in case of subsequent requests.
            if (!$this->isSeekable()) {
                $this->setBody($body); // <---------------
            }
            return $body;
        });
        return $body;
    }

The setBody() call as a result of non-seekable streams will end up storing the entire body object in memory.

I am thinking this might be a problem when the class is used to stream a non-video/audio file, such as serving a download stream for an archive/document.

What would be the downside for not storing anything in body?

affectv4 efforhard impacmedium typbug

Most helpful comment

I've prepped a quick patch: https://github.com/silverstripe/silverstripe-framework/pull/9153 . I didn't have time to add tests - I'm boarding a flight in 10 minutes.

All 12 comments

The problem here isn't necessarily that line - the entire contents of the stream is already loaded into memory with stream_get_contents. This seems like a bit of a problem with the architecture of this reponse though - the stream response shouldn't read the whole stream into memory to response, otherwise what's the point!

Yes it does look like its missing the conventional “loop and read N bytes, flush buffer” thing that i normally see when outputting streams

Or are we supposed to handle that outside this class?

@cwchong It looks to me like getBody() isn’t intended to be called during the normal response output - outputBody() should be called instead, which does appear to stream the contents instead of loading them into a string.

Is it core code that’s triggering the call to getBody()?

Actually I just tested it and it looks like ChangeDetectionMiddleware is triggering that call 😞.

I think the best way to solve this would be to grab the MD5 hash of the file in FlysystemAssetStore::createResponseFor() (assuming that’s possible & performant) and

$response->addHeader('ETag', sprintf('"%s"', md5($fileHash)));

Edit:

It looks like the only way to actually calculate the hash is to pass the stream piece-by-piece like this: https://github.com/emgag/flysystem-hash/blob/master/src/HashPlugin.php. That doesn’t look very performant to me, especially if it’s streaming from a remote filesystem or something... perhaps we’d be better off skipping etag generation for HTTPStreamResponse?

perhaps we’d be better off skipping etag generation for HTTPStreamResponse

I completely agree. It looks like ChangeDetectionMiddleware doesn't override an existing etag if set, so we should say that HTTPStreamResponses should manage their own etags.

I've prepped a quick patch: https://github.com/silverstripe/silverstripe-framework/pull/9153 . I didn't have time to add tests - I'm boarding a flight in 10 minutes.

The FlysystemAssetStore has shit load of specialised logic dedicated to calculate the MD5 hash of files. In 99% of cases, it will already have that hash cache somewhere, so it would be very performant for the AssetStore to attach the ETag header.

On second thought we're not using MD5 for our hash ... would it be problematic if we use SHA1 to generate those ETag headers for files?

Maybe when the md5 gets created, at that point the sha1 can get created as well and cached and used when needed, so they are both created at the same time in one place.

A big chunk of that hashing logic was put in place to avoid having to recompute hashes all the time. It feels a bit painful computing an another hash value just for ETag if the Sha1 hash could serve the same function.

It doesn't look like the HTTP specs says anything about how ETag are generated so my guess is that SHA1 would work just as well as MD5 https://en.wikipedia.org/wiki/HTTP_ETag

@maxime-rainville Nice! Do we _always_ calculate the SHA1 hash for every file that’s uploaded? I just want to make sure you can’t successfully upload a massive file, then fail to download it if the SHA1 isn’t already in the Sha1FileHashingService cache and it crashes trying to calculate it or something. If we _always_ calculate it, it’ll fail on upload anyway so we shouldn’t see this bug again

If I’m understanding correctly, we could just add the following to FlysystemAssetStore::createResponseFor() and ChangeDetectionMiddleware would simply skip over etag generation as the header is already set:

/** @var FileHashingService $hasher */
$hasher = Injector::inst()->get(FileHashingService::class);
$etag = $hasher->computeFromFile($fileID, $flysystem);

$response = HTTPStreamResponse::create($stream, $size)
    ->addHeader('Content-Type', $mime)
    ->addHeader('ETag', $etag);

We compute the Sha1 hash of each file, so if you uploaded a really big files your request might take a few seconds to come back the first time because it needs to compute that hash. However, that hash is cache, so on subsequent request it doesn't recompute it.

Also the hash is computed by reading a stream. So we shouldn't need to load the entire file in memory at once to do it. https://github.com/silverstripe/silverstripe-assets/blob/fc17276f93c6a5110fbe91e0796d4f0051c1c482/src/Storage/Sha1FileHashingService.php#L49-L57

Your example code should work just fine.

Was this page helpful?
0 / 5 - 0 ratings