I’m building a Google App Engine datastore that, once it goes live, will probably hold over 10 million records with fairly constant queries, insertions and deletions.
Is this much data going to be a problem? I’m not worried about the cost ($$$), just the performance of the database. The queries will be based on two fields that are both StringProperty and will return fewer than 100 records.
The database has two ‘tables’. The one that will get most of the queries has records of around 100 bytes. The larger table won’t get as many queries (maybe 1/10th as many as the small table), and its records are around 30K each.
Are deletions an expensive operation? Is it better not to delete old records, and instead just mark them as deleted and maybe delete them in bulk in a cron job?
I am aware of the distributed nature of Google App Engine and replication and those issues won’t be a problem.
10 million records is not a large amount for the datastore, so you don’t have to worry, as long as your queries can take advantage of indexes. For instance, if you have to walk a larger data set 100 records at a time, then instead of asking to start from a certain position (offset) in the data set, you can remember the last ORDER BY field value at the end of each page and ask for the elements that come after it (WHERE field > ‘…’, assuming ascending order).
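To make the paging idea concrete, here is a rough sketch in plain Python. It is not datastore code: a sorted list stands in for an indexed field, and `fetch_page` and `walk` are names I made up for the illustration. It also assumes the ordering field is unique; with duplicates you would need a tie-breaker such as the entity key. In GQL the per-page query would look something like `SELECT * FROM Record WHERE field > :1 ORDER BY field LIMIT 100`, where `Record` and `field` are placeholders.

```python
import bisect


def fetch_page(sorted_values, last_seen=None, page_size=100):
    """Return up to page_size values strictly greater than last_seen.

    Stand-in for an indexed query: WHERE field > last_seen ORDER BY field.
    """
    if last_seen is None:
        start = 0
    else:
        # Jump straight to the first value after last_seen instead of
        # counting rows from the beginning (which is what an offset does).
        start = bisect.bisect_right(sorted_values, last_seen)
    return sorted_values[start:start + page_size]


def walk(sorted_values, page_size=100):
    """Yield successive pages, remembering only the last value seen."""
    last_seen = None
    while True:
        page = fetch_page(sorted_values, last_seen, page_size)
        if not page:
            break
        yield page
        last_seen = page[-1]
```

The point is that each page starts from a remembered value rather than a row count, so the cost of fetching a page does not grow with how far into the data set you are.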
You can use task queues instead of cron jobs to do the deletions; it all depends on how quickly you need to get back to the user. Datastore operations tend to be slow, but if there is just one record to delete, doing it within the request may be acceptable. However, if you have to perform multiple operations it can get really slow, so it is better to run those kinds of tasks in a task queue and keep the application responsive.
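The shape of that split can be sketched as below. This is only an illustration in plain Python, not App Engine code: a dict stands in for the datastore, a `deque` stands in for the task queue, and `request_delete` and `run_delete_worker` are hypothetical names. On App Engine the worker would be a task handler rather than a function you call directly.

```python
from collections import deque

store = {}         # stand-in for the datastore: key -> record
pending = deque()  # stand-in for a task queue of keys awaiting deletion


def request_delete(key):
    """Fast path, run inside the user's request: enqueue and return."""
    pending.append(key)


def run_delete_worker(batch_size=100):
    """Slow path, run in the background: drain the queue in batches.

    Returns the number of records actually removed.
    """
    deleted = 0
    while pending:
        count = min(batch_size, len(pending))
        batch = [pending.popleft() for _ in range(count)]
        for key in batch:
            # Tolerate keys that were already removed by an earlier run.
            if store.pop(key, None) is not None:
                deleted += 1
    return deleted
```

The user-facing request only pays for the enqueue; the slow deletions happen later, in batches, where nobody is waiting on them.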
Datastore records can’t exceed 1 MB. 30 KB is a big record size, but it shouldn’t cause any problems. Remember that only short strings (500 characters or less) can be indexed.