What is the most efficient way to delete orphan blobs from a Blobstore?
App functionality & scope:
- A (logged-in) user wants to create a post containing some normal datastore fields (e.g. name, surname, comments) and blobs (images).
- In addition, the blobs are uploaded asynchronously before the rest of the data is sent via a POST.
- This leaves a good chance of having orphans as, for example, a user may upload images but not complete the form for one reason or another. Uploading the blobs together with the rest of the data, rather than asynchronously beforehand, would reduce the problem, but it would still exist on a smaller scale.
Possible, yet inefficient solutions:
- Whenever a post is completed (i.e. the rest of the data is sent), you add the blob keys to a table of “used blobs”. Then, you can run a cron job every so often and compare all of the blobs with the table of “used blobs”. Blobs that were uploaded more than an hour ago and are still “not used” are deleted.
- My understanding is that running through a list of potentially hundreds of thousands of blob keys and comparing it with another table of hundreds of thousands of “used blob keys” is very inefficient.
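For reference, the comparison step of that cron job could look roughly like the sketch below. It is plain Python with no App Engine calls: the function name `find_orphans` and the shape of its inputs (pairs of blob key and upload time, plus a list of used keys) are assumptions made for illustration. Loading the used keys into a set keeps the comparison itself to a single pass, although both collections still have to be read from storage first.

```python
from datetime import datetime, timedelta


def find_orphans(blobs, used_keys, now, min_age=timedelta(hours=1)):
    """Return keys of blobs older than min_age that no post references.

    blobs: iterable of (blob_key, uploaded_at) pairs (hypothetical shape).
    used_keys: iterable of blob keys recorded when a post was saved.
    """
    used = set(used_keys)  # set membership tests are O(1) on average
    cutoff = now - min_age
    return [key for key, uploaded_at in blobs
            if uploaded_at <= cutoff and key not in used]


now = datetime(2024, 1, 1, 12, 0)
blobs = [
    ("a", now - timedelta(hours=2)),     # old and used by a post: kept
    ("b", now - timedelta(hours=2)),     # old and unused: orphan
    ("c", now - timedelta(minutes=10)),  # unused but too recent: kept for now
]
orphans = find_orphans(blobs, used_keys=["a"], now=now)
```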
Is there any better way of doing this? I’ve searched for similar posts yet I couldn’t find any mentioning efficient solutions.
Thanks in advance!
Thanks for the comments. However, although I understand those solutions, I find them too inefficient. Querying thousands of entries for those that are flagged as “unused” is not ideal.
I believe I have come up with a better way and would like to hear your thoughts on it:
When a blob is saved, a deferred task is immediately created to delete that blob in an hour’s time. If the post is created and saved within that hour, the deferred task is deleted, so the blob is kept.
I believe this saves you from having to query thousands of entries every single hour.
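To make the idea concrete, here is a minimal in-memory sketch of the flow. It does not use the real Blobstore or task queue APIs: `DeleteScheduler`, its method names, and the dict standing in for the blobstore are all hypothetical, and time is passed in as plain seconds so the behaviour is easy to follow.

```python
class DeleteScheduler:
    """In-memory stand-in for a queue of cancellable delete tasks."""

    def __init__(self, blobstore, delay_seconds=3600):
        self.blobstore = blobstore  # hypothetical: dict of blob_key -> data
        self.delay = delay_seconds
        self.pending = {}           # blob_key -> time at which to delete it

    def on_upload(self, blob_key, now):
        # Schedule the deletion as soon as the blob is stored.
        self.pending[blob_key] = now + self.delay

    def on_post_saved(self, blob_keys):
        # The post claimed these blobs, so cancel their pending deletions.
        for key in blob_keys:
            self.pending.pop(key, None)

    def run_due(self, now):
        # Delete every blob whose deadline has passed without being claimed.
        due = [key for key, eta in self.pending.items() if eta <= now]
        for key in due:
            self.blobstore.pop(key, None)
            del self.pending[key]
        return due


store = {"img1": b"x", "img2": b"y"}
scheduler = DeleteScheduler(store)
scheduler.on_upload("img1", now=0)
scheduler.on_upload("img2", now=0)
scheduler.on_post_saved(["img1"])      # the post only used img1
early = scheduler.run_due(now=1800)    # nothing is due after 30 minutes
late = scheduler.run_due(now=3600)     # img2 is deleted after an hour
```

One thing to watch in a real deployment is the case where the post is saved at about the same moment the delete task fires. A variant that might be safer is to leave the task in place and have it check whether the blob has been claimed before deleting it.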
What are your thoughts on this solution?