One of the features I liked on hosted Parse was the Clean Up Files button in the settings. It deleted every file stored in S3, for example, that was no longer referenced by a PFFile. I especially liked it because it allowed us to save on unused/unneeded resources.
Maybe a REST call using the master key would be enough initially? In the future, possibly with integration into parse-dashboard?
I know it's lower priority compared to the features/fixes currently being developed, but it would be great to have.
This would actually be pretty difficult, and would need to be built for each specific files adapter. Right now, there's no way to list the existing files through the adapter.
+1 , agree with the need.
Is it possible to clean up the unused files stored in GridStore now?
+1, it's a very useful feature.
+1, it would be nice.
+1
+1 very much needed
+1
Just asking: how many of you have ever actually needed a file after deleting the pointers to it?
I feel the most common use of files is "if I delete the pointer, I don't need the file anymore". If this is the case, why not make it the default in parse-server?
I mean that when any object is deleted, all of its files are processed and adapter.deleteFile() is called for each one. This could be opt-in / out in the ParseServer constructor, and is way easier than a complete "clean up" feature. A minimal sketch of this idea follows below.
Given how tricky the full task is, it would also be cool if parse-server kept a Files table with url and usage_count, to simplify all the rest.
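To make that concrete, here is a minimal sketch of the opt-in deletion as a Cloud Code afterDelete trigger. The 'Photo' class is hypothetical, and it assumes a JS SDK version where Parse.File exposes destroy(); this is not an existing parse-server feature:

// Hypothetical sketch: when a Photo object is deleted, delete every
// file it referenced through any of its fields.
Parse.Cloud.afterDelete('Photo', async request => {
  const object = request.object;
  for (const key of Object.keys(object.attributes)) {
    const value = object.get(key);
    if (value instanceof Parse.File) {
      try {
        // Master key required; removes the file through the files adapter.
        await value.destroy({ useMasterKey: true });
      } catch (e) {
        console.error(`Could not delete file for field "${key}":`, e);
      }
    }
  }
});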
@natario1
Just to answer your question: I want to keep the files because of the intermediate state.
e.g.
Consider a mobile game using a Parse backend to store zip packages used in the game.
While we delete/replace them with new packages in the Parse dashboard, there will be a period where users who still have the old package URL locally in their app/game will run into issues until the new configs/URLs are loaded.
But I also want to keep the size of my database as small as possible, so at the right time I will delete the old packages.
@abdulwasayabbasi makes sense, thank you. Just wondering how frequent that is.
Your use case would not be affected by an "auto delete" feature, since you are just updating the file field. To take advantage of it, you would have to create a new object with the new package file and delete the older object when you feel safe, so the old file gets auto-deleted.
I made my own "clean file" script. Maybe it will help someone!
https://gist.github.com/Lokiitzz/6afbf0573665d3170ffb1e83565a0fef
Be careful :)
Why not a PR to Parse Server? :)
The code won't work on the server, as it loads all objects into memory.
Yes, you're right. I didn't check that before.
For features like this, I'd love to see command line tools rather than just another endpoint that requires maintenance.
Why not pass an "auto-delete-files" flag to the server on startup, so that when an individual file pointer is deleted or replaced, the file itself is deleted? This would help the 50% of people who only use PFFiles for profile pictures (files that aren't needed after deletion or replacement), while leaving the other 50% who want fine-grained control unaffected because they didn't pass the flag. Would this be a valid solution?
I also have this problem. I deleted a lot of rows in my MongoDB database with Parse Dashboard, including the references to many images. Now I am unable to find them and clean them up. Is there any other (manual) way?
I expected Parse Dashboard to clean up PFFiles before removing the references to them.
Any progress on this?
Not yet. This is not a feature that is being actively worked on, but a pull request or a separate project could take care of it.
Depending on how you look at it, this is either an undocumented "feature" or a huge bug. Either way it has huge and expensive consequences that should at the very least be well documented.
undocumented "feature" or a huge bug
What do you mean by that?
This is neither documented nor a bug; it's simply not implemented. Neither listing the missing files nor deleting an existing file is supported through the file adapters.
Because a file could be referenced by multiple Objects, we don't keep a reference count on them.
The steps are relatively easy to describe; however, they are not trivial to implement.
If you are using MongoDB, then as a workaround you can write a simple script to delete unreferenced file chunks directly from the database; a rough sketch follows below.
db_cleanup_script.js.zip
Make sure mongo is installed and running before running this script.
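The attached script isn't reproduced here, but a minimal sketch of that kind of cleanup, assuming the default GridFS bucket collections (fs.files / fs.chunks), might look like this; back up the database before running anything similar:

// Run with the mongo shell against the Parse database. Removes GridFS
// chunks whose parent fs.files document no longer exists.
const fileIds = db.fs.files.distinct('_id');
const result = db.fs.chunks.deleteMany({ files_id: { $nin: fileIds } });
print(`Removed ${result.deletedCount} orphaned chunks`);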
+1
If memory were handled this way, it would be a memory leak. I guess you could call this a file leak.
A cleanup script is a workable solution to a one-time problem, but this is not a one-time problem. To use a cleanup script in production, you have to set up, maintain, and monitor the infrastructure to run it on a schedule. Then you need to monitor the impact of running the script and adjust the schedule, scale the servers, and/or throttle the script to meet your needs. This all depends heavily on your exact use case and can change as your product and users change, which means constant monitoring. If you have the team to solve this problem, chances are you would not be using Parse in the first place.
The logical solution here is to keep a reference count and delete the file when the counter gets to 0. This is code that could be written once and used in all but the most extreme cases.
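As a sketch of what that might look like (the FileRef class and its fields are hypothetical, not an existing parse-server API):

// Hypothetical "FileRef" class tracking how many objects use each file.
// Application code would call this whenever a reference is removed.
async function releaseFile(file) {
  const query = new Parse.Query('FileRef');
  query.equalTo('name', file.name());
  const ref = await query.first({ useMasterKey: true });
  if (!ref) return;
  ref.increment('usageCount', -1);
  await ref.save(null, { useMasterKey: true });
  // Once nothing references the file anymore, delete it and its counter.
  if (ref.get('usageCount') <= 0) {
    await file.destroy({ useMasterKey: true });
    await ref.destroy({ useMasterKey: true });
  }
}

As the next comments point out, miscounting in either direction would destroy a file that is still in use, so the counting would have to be atomic and cover every way a reference can be created.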
Originally, on parse.com, it was a cleanup script, which seems to have been efficient enough to work. Files can be passed around in different objects, stored in arrays, or embedded into objects. There's nothing that guarantees a user won't reference a file by its URL; I did that for a project, just using the File as an upload mechanism and then passing the URLs around.
The logical solution here is to keep a reference count and delete the file when the counter gets to 0.
The script-based solution is as valid as a ref-count-based solution. That being said, 'over-releasing' a file, or failing to count a usage when the file is referenced by another object, would destroy the file.
You seem to have a good understanding of the problem; why not try to tackle it?
This repo started as a simple file-listing tool; maybe there's something to look for here.
Also, given the cost of unused files on S3, https://aws.amazon.com/s3/pricing/ ($0.0023/GB/month), this seems to be negligible.
Can anyone solve this issue? It's been more than a year.
Yes, anyone can solve it, including you :)
Depends on the size of a single file.
In case we want to programmatically delete a file, one option I can see so far is to make a request to the endpoint defined in FilesRouter L27, as Parse.File doesn't expose a delete method.
For creating a file we have:
return CoreManager.getRESTController().request('POST', 'files/'+name, data);
So I tried sending this to delete a file:
return CoreManager.getRESTController().request('DELETE', 'files/'+name);
But I got an error from middleware.js trying to create a buffer: new Buffer(base64, 'base64');
What is the proper way to make such a request to delete a file, or is there any other way to do this programmatically?
Single file deletion is not implemented yet, and I believe it's not required by the files adapters either. We could start adding that.
Single file deletion is not implemented yet
Do you mean it's just not implemented in Parse.File?
I can see it's required in FilesAdapter, and FilesRouter defines this endpoint as well.
So I suppose we can do single file deletion as long as our custom FilesAdapter implements this method, right? Btw, I'm using AzureStorageAdapter, and I can see it has this method implemented.
Could this error be related to the request format?
Deletion works with the file URL without the app ID, i.e.:
curl -X DELETE -H "X-Parse...... http://domain/parse/files/appid/file
is not working but
curl -X DELETE -H "X-Parse...... http://domain/parse/files/file
is working :/
Edit: Oh, someone has already found it: https://github.com/parse-community/parse-server/issues/1411
+1, it would be very useful.
Any solution yet?
+1 (to keep this alive)
Feel free to open a pull request for a reference implementation, but I'll be closing this issue, as cleaning up dereferenced files is an off-process job that may take a very long time to complete. It's not something that I, as a maintainer, want to actively work on (as stated many times), but I'll gladly review a pull request if any change to parse-server is needed for that feature.
As mentioned previously, all the work can be done externally, without needing changes to this project.
GDPR requirements, for anyone running Parse with users in Europe who have uploaded personal data, mean that without this feature anyone using Parse without a way to mitigate this could have an expensive problem.
@jeacott1 we provide a way to delete existing files on demand, through the REST API and the files adapters, so a conscientious user could delete the existing picture upon replacement.
Also, we're open to pull requests; I believe I don't need to say it again, as it was basically the message posted before yours.
If you believe this project can't help you achieve GDPR compliance, then you have two options: either fix it or stop using it. Trolling isn't one.
Thanks.
Ah, ok, I missed that. I didn't think there was a way to delete via the REST API; I thought it just removed the reference. Just trying to understand how best to do this.
curl -X DELETE \
-H "X-Parse-Application-Id:[AppId]" \
-H "X-Parse-Master-Key:[MasterKey]" \
http://[ParseServer Url]/files/5b6cd3a71873be9c79aedeb53ff71f05_fav.png
The above is the REST API call for deleting files; the PHP API supports file deletion as well:
I tested it with Digital Ocean Spaces and it works like a charm:
try {
    $result = $testFile->delete(true);
    echo $result;
} catch (Exception $e) {
    echo 'Caught exception: ', $e->getMessage(), "\n";
}
What do you think of this approach @mtrezza?
FilesController.js
async cleanUpFiles(database) {
  if (!this.adapter.getFiles) {
    return;
  }
  const files = await this.adapter.getFiles(this.config);
  if (files.length === 0) {
    return;
  }
  // Build a map of className -> names of fields of type File.
  const schema = await database.loadSchema();
  const allClasses = await schema.getAllClasses();
  const classQueries = {};
  for (const clazz of allClasses) {
    for (const fieldName of Object.keys(clazz.fields)) {
      if (clazz.fields[fieldName].type === 'File') {
        const fieldNames = classQueries[clazz.className] || [];
        fieldNames.push(fieldName);
        classQueries[clazz.className] = fieldNames;
      }
    }
  }
  if (Object.keys(classQueries).length === 0) {
    return;
  }
  for (const file of files) {
    try {
      // For each class, build one OR query across all of its File fields.
      const orQueries = [];
      for (const className of Object.keys(classQueries)) {
        const queries = classQueries[className].map(key => {
          const query = new Parse.Query(className);
          query.equalTo(key, file);
          return query;
        });
        const orQuery = Parse.Query.or(...queries);
        orQuery.select('objectId');
        orQueries.push(orQuery);
      }
      const results = await Promise.all(
        orQueries.map(query => query.first({ useMasterKey: true }))
      );
      // Keep the file if any class still references it.
      if (results.some(obj => obj)) {
        continue;
      }
      await file.destroy({ useMasterKey: true });
    } catch (e) {
      // Ignore errors for this file and continue with the next one.
    }
  }
}
And then getFiles needs to be added to the adapter. For GridFS:
async getFiles(config) {
  const bucket = await this._getBucket();
  // List every stored file in the GridFS bucket and wrap each one in a
  // Parse.File carrying its public URL.
  const fileDocs = await bucket.find().toArray();
  return fileDocs.map(({ filename }) => {
    const file = new Parse.File(filename);
    file._url = this.getFileLocation(config, filename);
    return file;
  });
}
And then it is attached to a route in FilesRouter.js.
Conceptually, this looks up the schema for all classes and figures out which fields are of type File. Next, for each file, it queries those fields in the respective classes, and if there's no reference, it removes the file.
It takes about 2-3 minutes per 1000 files. Tested on my servers, and it works well. It could be faster, but I was conscious of query limits causing files to be removed by accident; I wanted to be 100% sure a file is unreferenced prior to deletion.
Related: #546, #6780
It is a good start, but there are cases in which the files are not stored in a field of type File. Sometimes people store references to files in arrays and objects. I've also seen people just uploading the files and never referencing them in any other object. So I'm afraid of having this kind of script running automatically.
cases in which the files are not stored in a field of type File.
Hmmm, interesting. What do you think of:
Requiring the locations of the files in the POST request to delete files, e.g.:
{
  '_User': [
    'photos' // if the schema says photos is an array, change to containedIn (see sketch below)
  ],
  'Photos': [
    'photos.thumbnail'
  ]
}
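For the array case, the lookup could pick the query operator from the schema, along these lines (a sketch; the function and parameter names are illustrative):

// Build the reference-check query for one field, honoring its schema type.
function queryForFileField(className, fieldName, fieldType, file) {
  const query = new Parse.Query(className);
  if (fieldType === 'Array') {
    // Matches objects whose array field contains the file.
    query.containedIn(fieldName, [file]);
  } else {
    query.equalTo(fieldName, file);
  }
  return query;
}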
Or perhaps add a callback in Parse.Cloud, or something similar, for whether a file should be deleted once it's been flagged for "cleanup".
The only other solution I can think of is to query every object and loop through its fields to check for the file, which would be quite intensive.
Either way, warnings about the caveats will have to be shown in the dashboard / docs prior to running the function.
Actually, the current approach of searching only the File fields is already very intensive, depending on the size of the collections and how many files the app has. This is probably a script not to run in the parse-server process, but via a CLI instead.
Sometimes people store references to files in arrays and objects.
I think if we can get to a PR that covers the most common case, which is storing a file in a field of type File, we would already make many people happy. Maybe other creative ways of storing files can be addressed in a follow-up PR.
I've also seen people just uploading the files and never referencing them in any other object
Are these files still needed, or should they be cleaned up?
I'm afraid of having this kind of script running automatically.
I agree. Such a script should not run automatically (at least not without control over schedule and batch size), because these mass queries can have a significant performance impact / cost implication on external resources.
Other thoughts:
I think if we can get to a PR that covers the most common case, which is storing a file in a field of type File, we would already make many people happy. Maybe other creative ways of storing files can be addressed in a follow-up PR.
I agree; the risks / caveats should be explicitly stated, so people who store files in more complex structures understand not to use the cleanup, or the risks associated with running /cleanupfiles.
I'm afraid of having this kind of script running automatically.
I'd gather it would be a button in the dashboard (as on parse.com) that would be run once a month or so. I wouldn't propose running it unless the developer directly triggers it.
- How does this script scale, e.g for a S3 bucket with 10 million files and a MongoDB collections with 5 millions docs?
Honestly, I wouldn't imagine it would scale well, especially with configurations that have multiple File fields in their schemas, as it queries files and classes one by one. I'd previously written it to use containedIn, but again I was worried about query limits not returning all the associated objects. I imagine it would take a while and would have to be a background task (e.g. "we're now cleaning up your files").
- Do the queries in the script need any indices for efficiency?
I would imagine that would speed up the cleanup. Maybe we could recommend creating indexes on File fields if you're using the cleanup?
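For example, with MongoDB that could be as simple as the following (collection and field names are hypothetical):

// In the mongo shell: index the File field that the cleanup queries
// filter on, so each per-file lookup avoids a full collection scan.
db.Photos.createIndex({ photo: 1 });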
Would running all the individual object queries in parallel speed it up? Also, is it worth removing await from the destroy call, so the script can keep looping through the files?
- How is this script supposed to be invoked, e.g. via API trigger in a dedicated server instance?
Via an API trigger:
router.post(
  '/files/cleanupfiles',
  Middlewares.handleParseHeaders,
  Middlewares.enforceMasterKeyAccess,
  this.cleanupHandler
);
I'd not go with an API route. This process should not run in the same process as Parse Server; it may make the app unresponsive in the case of an app with a large number of files / objects.
I agree with a first simple version, but we do need to make sure there is a big alert for developers before firing the script. If via the dashboard, it should be something like what we currently have in place for deleting all rows in a class.
The caveat here is not only that files won't be deleted in a more complex structure, but that a lot of files may actually be deleted by accident in a more complex structure.
We need to keep in mind that the files feature is not only meant for referenced files. It is a file repository, and those files may never be referenced by any object. We are building a feature that is conceptually the same as a feature to automatically delete all objects of a class that are not referenced by any other object. It is a valid feature, but we need to make sure that developers know what they are doing.
Also, let's first agree on the API and how this feature will work; I may have some code to share.
A lot of ideas can be seen in this project: https://github.com/parse-server-modules/parse-files-utils
It is an old project, but it has some code in place to search for all files in all objects of an app.
@mtrezza I believe we should reopen this issue, right? What is the new procedure?
@davimacedo Yes, thanks. The procedure is to re-open and remove the up-for-grabs label when someone is actively working on it.
@davimacedo
I'd not go with an API route. This process should not run in the same process as Parse Server.
My first thought was that this script should not even be part of Parse Server, but an external tool. But then I thought we could make it part of Parse Server for convenience and advise developers to spin up a new, dedicated instance of Parse Server that does not take any app requests for this purpose. Like a LiveQuery server.
If via the dashboard, it should be something like what we currently have in place for deleting all rows in a class.
Yes, it should definitely be more than a simple "Are you sure? Yes/No" dialog, with detailed info about the risks.
It is a file repository, and those files may never be referenced by any object. We are building a feature that is conceptually the same as a feature to automatically delete all objects of a class that are not referenced by any other object.
Do you have any example use cases in mind for unreferenced files in a storage bucket, so we can get a better feel for how many deployments would be affected? I can only think of files like logs that are stored for manual retrieval, or maybe files that are processed automatically by a script on the storage provider's side. All rare use cases, I think.
I think the current script is more of a proof of concept. It is not scalable and would almost certainly crash or block the DB for an unacceptable amount of time on any serious-sized production system.
I think the current script is more of a proof of concept. It is not scalable and would almost certainly crash or block the DB for an unacceptable amount of time on any serious-sized production system.
That's why I'd not go with the script in the API. It would only be a matter of time before people start complaining about the script not working. The same happened with the push notification system; it took a long time to get a scalable process, because previously a single parse server instance was trying to handle all pushes.
For this to be scalable in the API, we'd need to take an approach similar to the one in push notifications: break the files into small sets, put those sets on a queue, and run multiple processes consuming the sets and processing them one by one. Even then, we are talking about something that will be complex to write and also to deploy.
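A rough sketch of that shape, where the queue client and topic name are placeholders rather than an existing parse-server API:

// Coordinator: slice the adapter's file listing into small batches and
// enqueue them. Separate worker processes would consume the batches and
// run the reference check for each file independently.
const BATCH_SIZE = 100;

async function enqueueFileCleanup(adapter, config, queue) {
  const files = await adapter.getFiles(config);
  for (let i = 0; i < files.length; i += BATCH_SIZE) {
    const batch = files.slice(i, i + BATCH_SIZE).map(file => file.name());
    await queue.publish('file-cleanup', batch);
  }
}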
Good points. @dblythy, can you find anything reusable in the files utils repo mentioned before?
I had a quick look through it, and it seems to use a similar search algorithm to the one I wrote (look up the schema and search for 'File' fields). I can take a more detailed look at it, and at how the push notification approach works, and work towards a similar cleanup feature.