One of the features I liked on hosted Parse was the Clean Up Files button in the settings. It deleted every file stored in S3, for example, that was no longer referenced by a PFFile. I especially liked it because it allowed us to save on unused/unneeded resources.
Maybe a REST call using the master key would be enough initially? In the future, possibly with integration into parse-dashboard?
I know it's lower priority compared to the features/fixes currently being developed, but it would be great to have.
This would actually be pretty difficult, and would need to be built for each specific files adapter. Right now, there's no way to list the existing files through the adapter.
+1 , agree with the need.
Is it possible to clean up the unused files stored in GridStore now?
+1, it's a very useful feature.
+1, it would be nice.
+1
+1 very much needed
+1
Just asking: how many of you have ever actually needed a file after deleting the pointers to it?
I feel the most common use of files is "if I delete the pointer, I don't need the file anymore". If this is the case, why not make it the default in parse-server?
I mean that when any object is deleted, all of its files are processed and adapter.deleteFile() is called for each one. This could be opt-in / out in the ParseServer constructor, and is way easier than a complete "clean up" feature. A minimal sketch of this idea follows below.
Given how tricky the full task is, it would also be cool if parse-server kept a Files table with url and usage_count, to simplify all the rest.
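To make that concrete, here is a minimal sketch of the opt-in deletion as a Cloud Code afterDelete trigger. The 'Photo' class is hypothetical, and it assumes a JS SDK version where Parse.File exposes destroy(); this is not an existing parse-server feature:

// Hypothetical sketch: when a Photo object is deleted, delete every
// file it referenced through any of its fields.
Parse.Cloud.afterDelete('Photo', async request => {
  const object = request.object;
  for (const key of Object.keys(object.attributes)) {
    const value = object.get(key);
    if (value instanceof Parse.File) {
      try {
        // Master key required; removes the file through the files adapter.
        await value.destroy({ useMasterKey: true });
      } catch (e) {
        console.error(`Could not delete file for field "${key}":`, e);
      }
    }
  }
});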
@natario1
Just to answer your question: I want to keep the files because of the intermediate state.
e.g.
Consider a mobile game using a Parse backend to store zip packages used in the game.
While we delete/replace them with new packages in the Parse dashboard, there will be a period where users who still have the old package URL locally in their app/game will run into issues until the new configs/URLs are loaded.
But I also want to keep the size of my database as small as possible, so at the right time I will delete the old packages.
@abdulwasayabbasi makes sense, thank you. Just wondering how frequent that is.
Your use case would not be affected by an "auto delete" feature, since you are just updating the file field. To take advantage of it, you would have to create a new object with the new package file and delete the older object when you feel safe, so the old file gets auto-deleted.
I made my own "clean file" script. Maybe it will help someone!
https://gist.github.com/Lokiitzz/6afbf0573665d3170ffb1e83565a0fef
Be careful :)
Why not a PR to Parse Server? :)
The code won't work on the server, as it loads all objects into memory.
Yes, you're right. I didn't check that before.
For features like this, I'd love to see command line tools rather than just another endpoint that requires maintenance.
Why not pass an "auto-delete-files" flag to the server on startup, so that when an individual file pointer is deleted or replaced, the file itself is deleted? This would help the 50% of people who only use PFFiles for profile pictures (files that aren't needed after deletion or replacement), while leaving the other 50% who want fine-grained control unaffected because they didn't pass the flag. Would this be a valid solution?
I also have this problem. I deleted a lot of rows in my MongoDB database with Parse Dashboard, including the references to many images. Now I am unable to find them and clean them up. Is there any other (manual) way?
I expected Parse Dashboard to clean up PFFiles before removing the references to them.
Any progress on this?
Not yet. This is not a feature that is being actively worked on, but a pull request or a separate project could take care of it.
Depending on how you look at it, this is either an undocumented "feature" or a huge bug. Either way it has huge and expensive consequences that should at the very least be well documented.
undocumented "feature" or a huge bug
What do you mean by that?
This is neither documented nor a bug; it's simply not implemented. Neither listing the missing files nor deleting an existing file is supported through the file adapters.
Because a file could be referenced by multiple Objects, we don't keep a reference count on them.
The steps are relatively easy to describe; however, they are not trivial to implement.
If you are using MongoDB, then as a workaround you can write a simple script to delete unreferenced file chunks directly from the database; a rough sketch follows below.
db_cleanup_script.js.zip
Make sure mongo is installed and running before running this script.
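The attached script isn't reproduced here, but a minimal sketch of that kind of cleanup, assuming the default GridFS bucket collections (fs.files / fs.chunks), might look like this; back up the database before running anything similar:

// Run with the mongo shell against the Parse database. Removes GridFS
// chunks whose parent fs.files document no longer exists.
const fileIds = db.fs.files.distinct('_id');
const result = db.fs.chunks.deleteMany({ files_id: { $nin: fileIds } });
print(`Removed ${result.deletedCount} orphaned chunks`);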
+1
If memory were handled this way, it would be a memory leak. I guess you could call this a file leak.
A cleanup script is a workable solution to a one-time problem, but this is not a one-time problem. To use a cleanup script in production, you have to set up, maintain, and monitor the infrastructure to run it on a schedule. Then you need to monitor the impact of running the script and adjust the schedule, scale the servers, and/or throttle the script to meet your needs. This all depends heavily on your exact use case and can change as your product and users change, which means constant monitoring. If you have the team to solve this problem, chances are you would not be using Parse in the first place.
The logical solution here is to keep a reference count and delete the file when the counter gets to 0. This is code that could be written once and used in all but the most extreme cases.
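As a sketch of what that might look like (the FileRef class and its fields are hypothetical, not an existing parse-server API):

// Hypothetical "FileRef" class tracking how many objects use each file.
// Application code would call this whenever a reference is removed.
async function releaseFile(file) {
  const query = new Parse.Query('FileRef');
  query.equalTo('name', file.name());
  const ref = await query.first({ useMasterKey: true });
  if (!ref) return;
  ref.increment('usageCount', -1);
  await ref.save(null, { useMasterKey: true });
  // Once nothing references the file anymore, delete it and its counter.
  if (ref.get('usageCount') <= 0) {
    await file.destroy({ useMasterKey: true });
    await ref.destroy({ useMasterKey: true });
  }
}

As the next comments point out, miscounting in either direction would destroy a file that is still in use, so the counting would have to be atomic and cover every way a reference can be created.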
Originally, on parse.com, it was a cleanup script, which seems to have been efficient enough to work. Files can be passed around in different objects, stored in arrays, or embedded into objects. There's nothing that guarantees a user won't reference a file by its URL; I did that for a project, just using the File as an upload mechanism and then passing the URLs around.
The logical solution here is to keep a reference count and delete the file when the counter gets to 0.
The script-based solution is as valid as a ref-count-based solution. That being said, 'over-releasing' a file, or failing to count a usage when the file is referenced by another object, would destroy the file.
You seem to have a good understanding of the problem; why not try to tackle it?
This repo started as a simple file-listing tool; maybe there's something to look for here.
Also, given the cost of unused files on S3, https://aws.amazon.com/s3/pricing/ ($0.0023/GB/month), this seems to be negligible.
Can anyone solve this issue? It's been more than a year.
Yes, anyone can solve it, including you :)
Depends on the size of a single file.
In case we want to programmatically delete a file, one option I can see so far is to make a request to the endpoint defined in FilesRouter L27, as Parse.File doesn't expose a delete method.
For creating a file we have:
return CoreManager.getRESTController().request('POST', 'files/'+name, data);
So I tried sending this to delete a file:
return CoreManager.getRESTController().request('DELETE', 'files/'+name);
But I got an error from middleware.js trying to create a buffer: new Buffer(base64, 'base64');
What is the proper way to make such a request to delete a file, or is there any other way to do this programmatically?
Single file deletion is not implemented yet, and I believe it's not required by the files adapters either. We could start adding that.
Single file deletion is not implemented yet
Do you mean it's just not implemented in Parse.File?
I can see it's required in FilesAdapter, and FilesRouter defines this endpoint as well.
So I suppose we can do single file deletion as long as our custom FilesAdapter implements this method, right? Btw, I'm using AzureStorageAdapter, and I can see it has this method implemented.
Could this error be related to the request format?
Deletion works with the file URL without the app ID, i.e.:
curl -X DELETE -H "X-Parse...... http://domain/parse/files/appid/file
is not working but
curl -X DELETE -H "X-Parse...... http://domain/parse/files/file
is working :/
Edit: Oh, someone has already found it: https://github.com/parse-community/parse-server/issues/1411
+1, it would be very useful.
Any solution yet?
+1 (to keep this alive)
Feel free to open a pull request for a reference implementation, but I'll be closing this issue, as cleaning up dereferenced files is an off-process job that may take a very long time to complete. It's not something that I, as a maintainer, want to actively work on (as stated many times), but I'll gladly review a pull request if any change to parse-server is needed for that feature.
As mentioned previously, all the work can be done externally, without needing changes to this project.
GDPR requirements, for anyone running Parse with users in Europe who have uploaded personal data, mean that without this feature anyone using Parse without a way to mitigate this could have an expensive problem.
@jeacott1 we provide a way to delete existing files on demand, through the REST API and the files adapters, so a conscientious user could delete the existing picture upon replacement.
Also, we're open to pull requests; I believe I don't need to say it again, as it was basically the message posted before yours.
If you believe this project can't help you achieve GDPR compliance, then you have two options: either fix it or stop using it. Trolling isn't one.
Thanks.
Ah, ok, I missed that. I didn't think there was a way to delete via the REST API; I thought it just removed the reference. Just trying to understand how best to do this.
curl -X DELETE \
-H "X-Parse-Application-Id:[AppId]" \
-H "X-Parse-Master-Key:[MasterKey]" \
http://[ParseServer Url]/files/5b6cd3a71873be9c79aedeb53ff71f05_fav.png
The above is the REST API call for deleting files; the PHP API supports file deletion as well:
I tested it with Digital Ocean Spaces and it works like a charm:
try {
    $result = $testFile->delete(true);
    echo $result;
} catch (Exception $e) {
    echo 'Caught exception: ', $e->getMessage(), "\n";
}
What do you think of this approach @mtrezza?
FilesController.js
async cleanUpFiles(database) {
  if (!this.adapter.getFiles) {
    return;
  }
  const files = await this.adapter.getFiles(this.config);
  if (files.length === 0) {
    return;
  }
  // Build a map of className -> names of fields of type File.
  const schema = await database.loadSchema();
  const allClasses = await schema.getAllClasses();
  const classQueries = {};
  for (const clazz of allClasses) {
    for (const fieldName of Object.keys(clazz.fields)) {
      if (clazz.fields[fieldName].type === 'File') {
        const fieldNames = classQueries[clazz.className] || [];
        fieldNames.push(fieldName);
        classQueries[clazz.className] = fieldNames;
      }
    }
  }
  if (Object.keys(classQueries).length === 0) {
    return;
  }
  for (const file of files) {
    try {
      // For each class, build one OR query across all of its File fields.
      const orQueries = [];
      for (const className of Object.keys(classQueries)) {
        const queries = classQueries[className].map(key => {
          const query = new Parse.Query(className);
          query.equalTo(key, file);
          return query;
        });
        const orQuery = Parse.Query.or(...queries);
        orQuery.select('objectId');
        orQueries.push(orQuery);
      }
      const results = await Promise.all(
        orQueries.map(query => query.first({ useMasterKey: true }))
      );
      // Keep the file if any class still references it.
      if (results.some(obj => obj)) {
        continue;
      }
      await file.destroy({ useMasterKey: true });
    } catch (e) {
      // Ignore errors for this file and continue with the next one.
    }
  }
}
And then getFiles needs to be added to the adapter. For GridFS:
async getFiles(config) {
  const bucket = await this._getBucket();
  // List every stored file in the GridFS bucket and wrap each one in a
  // Parse.File carrying its public URL.
  const fileDocs = await bucket.find().toArray();
  return fileDocs.map(({ filename }) => {
    const file = new Parse.File(filename);
    file._url = this.getFileLocation(config, filename);
    return file;
  });
}
And then it is attached to a route in FilesRouter.js.
Conceptually, this looks up the schema for all classes and figures out which fields are of type File. Next, for each file, it queries those fields in the respective classes, and if there's no reference, it removes the file.
It takes about 2-3 minutes per 1000 files. Tested on my servers, and it works well. It could be faster, but I was conscious of query limits causing files to be removed by accident; I wanted to be 100% sure a file is unreferenced prior to deletion.
Related: #546, #6780
It is a good start, but there are cases in which the files are not stored in a field of type File. Sometimes people store references to files in arrays and objects. I've also seen people just uploading the files and never referencing them in any other object. So I'm afraid of having this kind of script running automatically.
cases in which the files are not stored in a field of type File.
Hmmm, interesting. What do you think of:
Requiring the locations of the files in the POST request to delete files, e.g.:
{
  '_User': [
    'photos' // if the schema says photos is an array, change to containedIn (see sketch below)
  ],
  'Photos': [
    'photos.thumbnail'
  ]
}
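For the array case, the lookup could pick the query operator from the schema, along these lines (a sketch; the function and parameter names are illustrative):

// Build the reference-check query for one field, honoring its schema type.
function queryForFileField(className, fieldName, fieldType, file) {
  const query = new Parse.Query(className);
  if (fieldType === 'Array') {
    // Matches objects whose array field contains the file.
    query.containedIn(fieldName, [file]);
  } else {
    query.equalTo(fieldName, file);
  }
  return query;
}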
Or perhaps add a callback in Parse.Cloud, or something similar, for whether a file should be deleted once it's been flagged for "cleanup".
The only other solution I can think of is to query every object and loop through its fields to check for the file, which would be quite intensive.
Either way, warnings about the caveats will have to be shown in the dashboard / docs prior to running the function.
Actually, the current approach of searching only the File fields is already very intensive, depending on the size of the collections and how many files the app has. This is probably a script not to run in the parse-server process, but via a CLI instead.
Sometimes people store references to files in arrays and objects.
I think if we can get to a PR that covers the most common case, which is storing a file in a field of type File, we would already make many people happy. Maybe other creative ways of storing files can be addressed in a follow-up PR.
I've also seen people just uploading the files and never referencing them in any other object
Are these files still needed, or should they be cleaned up?
I'm afraid of having this kind of script running automatically.
I agree. Such a script should not run automatically (at least not without control over schedule and batch size), because these mass queries can have a significant performance impact / cost implication on external resources.
Other thoughts:
I think if we can get to a PR that covers the most common case, which is storing a file in a field of type File, we would already make many people happy. Maybe other creative ways of storing files can be addressed in a follow-up PR.
I agree; the risks / caveats should be explicitly stated, so people who store files in more complex structures understand not to use the cleanup, or the risks associated with running /cleanupfiles.
I'm afraid of having this kind of script running automatically.
I'd gather it would be a button in the dashboard (as on parse.com) that would be run once a month or so. I wouldn't propose running it unless the developer directly triggers it.
- How does this script scale, e.g for a S3 bucket with 10 million files and a MongoDB collections with 5 millions docs?
Honestly, I wouldn't imagine it would scale well, especially with configurations that have multiple File fields in their schemas, as it queries files and classes one by one. I'd previously written it to use containedIn, but again I was worried about query limits not returning all the associated objects. I imagine it would take a while and would have to be a background task (e.g. "we're now cleaning up your files").
- Do the queries in the script need any indices for efficiency?
I would imagine that would speed up the cleanup. Maybe we could recommend creating indexes on File fields if you're using the cleanup?
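For example, with MongoDB that could be as simple as the following (collection and field names are hypothetical):

// In the mongo shell: index the File field that the cleanup queries
// filter on, so each per-file lookup avoids a full collection scan.
db.Photos.createIndex({ photo: 1 });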
Would running all the individual object queries in parallel speed it up? Also, is it worth removing await from the destroy call, so the script can keep looping through the files?
- How is this script supposed to be invoked, e.g. via API trigger in a dedicated server instance?
Via an API trigger:
router.post(
  '/files/cleanupfiles',
  Middlewares.handleParseHeaders,
  Middlewares.enforceMasterKeyAccess,
  this.cleanupHandler
);
I'd not go with an API route. This process should not run in the same process as Parse Server; it may make the app unresponsive in the case of an app with a large number of files / objects.
I agree with a first simple version, but we do need to make sure there is a big alert for developers before firing the script. If via the dashboard, it should be something like what we currently have in place for deleting all rows in a class.
The caveat here is not only that files won't be deleted in a more complex structure, but that a lot of files may actually be deleted by accident in a more complex structure.
We need to keep in mind that the files feature is not only meant for referenced files. It is a file repository, and those files may never be referenced by any object. We are building a feature that is conceptually the same as a feature to automatically delete all objects of a class that are not referenced by any other object. It is a valid feature, but we need to make sure that developers know what they are doing.
Also, let's first agree on the API and how this feature will work; I may have some code to share.
A lot of ideas can be seen in this project: https://github.com/parse-server-modules/parse-files-utils
It is an old project, but it has some code in place to search for all files in all objects of an app.
@mtrezza I believe we should reopen this issue, right? What is the new procedure?
@davimacedo Yes, thanks. The procedure is to re-open and remove the up-for-grabs label when someone is actively working on it.
@davimacedo
I'd not go with an API route. This process should not run in the same process as Parse Server.
My first thought was that this script should not even be part of Parse Server, but an external tool. But then I thought we could make it part of Parse Server for convenience and advise developers to spin up a new, dedicated instance of Parse Server that does not take any app requests for this purpose. Like a LiveQuery server.
If via the dashboard, it should be something like what we currently have in place for deleting all rows in a class.
Yes, it should definitely be more than a simple "Are you sure? Yes/No" dialog, with detailed info about the risks.
It is a file repository, and those files may never be referenced by any object. We are building a feature that is conceptually the same as a feature to automatically delete all objects of a class that are not referenced by any other object.
Do you have any example use cases in mind for unreferenced files in a storage bucket, so we can get a better feel for how many deployments would be affected? I can only think of files like logs that are stored for manual retrieval, or maybe files that are processed automatically by a script on the storage provider's side. All rare use cases, I think.
I think the current script is more of a proof of concept. It is not scalable and would almost certainly crash or block the DB for an unacceptable amount of time on any serious-sized production system.
I think the current script is more of a proof of concept. It is not scalable and would almost certainly crash or block the DB for an unacceptable amount of time on any serious-sized production system.
That's why I'd not go with the script in the API. It would only be a matter of time before people start complaining about the script not working. The same happened with the push notification system; it took a long time to get a scalable process, because previously a single parse server instance was trying to handle all pushes.
For this to be scalable in the API, we'd need to take an approach similar to the one in push notifications: break the files into small sets, put those sets on a queue, and run multiple processes consuming the sets and processing them one by one. Even then, we are talking about something that will be complex to write and also to deploy.
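A rough sketch of that shape, where the queue client and topic name are placeholders rather than an existing parse-server API:

// Coordinator: slice the adapter's file listing into small batches and
// enqueue them. Separate worker processes would consume the batches and
// run the reference check for each file independently.
const BATCH_SIZE = 100;

async function enqueueFileCleanup(adapter, config, queue) {
  const files = await adapter.getFiles(config);
  for (let i = 0; i < files.length; i += BATCH_SIZE) {
    const batch = files.slice(i, i + BATCH_SIZE).map(file => file.name());
    await queue.publish('file-cleanup', batch);
  }
}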
Good points. @dblythy, can you find anything reusable in the files utils repo mentioned before?
I had a quick look through it, and it seems to use a similar search algorithm to the one I wrote (look up the schema and search for 'File' fields). I can take a more detailed look at it, and at how the push notification approach works, and work towards a similar cleanup feature.