I have an application where there is a list of items that my users will page through. I have handled paging through an index field (I needed it for other things either way, so I figured why not).
My issue is that I want to implement a “goto” feature, where the user can skip directly to an item instead of paging through them using the provided navigation buttons (next and previous). For instance, they can enter 1000 in the “goto” box and have the 1000th item displayed. There is a disconnect between the nth item and its index: the index is guaranteed to be in order but is not guaranteed to be sequential, so I can’t just filter by the index. I thought about using the offset parameter of fetch, but I remember that when I first started programming on App Engine I was told not to use it due to performance issues.
Would offset be the best way to go here, or is there a better way? Also, are the costs associated with it simply that it would take longer to get the results, or will it count towards my datastore reads/small operations?
EDIT: I don’t mean this in a bad way but in order to stave off the people who will tell me to use cursors… 🙂 I handle paging in a way that is more useful to me than if I would use cursors. Thank you in advance for your concern. Additionally, I thought I’d spell out what I’m trying to do a bit in code:
q = Item.all()
#orders it by highest index first which is how client handles items
q = q.order('-index')
#count is determined automatically but is at least 25 and not greater than 300
q = q.fetch(limit=count, offset=i)
EDIT 2: Based on the comments I decided to try storing my items in memcache, and do all of my filtering, ordering, offsets, etc. in memory. Items are grouped by Category, each of which could hold up to 1500 items, and I store each Category in memcache under its own key. The only issue I could think of is that each Item can, in the worst case, be 2 KB in size. It’s not so likely that a Category will have anywhere near 1500 Items in it, or that an Item will reach the worst-case size, but if both happen, the Category will exceed the 1 MB memcache limit. Any suggestions on how to handle that? Also, there could be around 10 Categories; will this much storage in memcache cause it to flush more often? And finally, would it be worth it to use offset when I fetch the Entities, or is memcache a better solution (Items will be accessed quite frequently, usually in small groups of 25-30)?
EDIT 3: I now have a sequential way of referencing items. Each item has an id, which uniquely identifies it across categories; an index, which orders items within a category non-sequentially; and a num, which is sequential but isn’t stored with the item (every time I pull the items out of memcache I order them by index, then iterate through the list, assigning each item a num equal to its position in the list). I guess that’s a convoluted way of saying:
for i in range(0, len(items)):
    items[i]['num'] = i
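A slightly fuller sketch of the same idea, including the “goto” lookup it enables. It assumes each item is a plain dict with an 'index' key and that the list is ordered highest index first, as in the query above; the function names number_items and goto are hypothetical.

```python
def number_items(items):
    """Return items sorted by descending 'index', each tagged with a 0-based 'num'."""
    ordered = sorted(items, key=lambda item: item['index'], reverse=True)
    for num, item in enumerate(ordered):
        item['num'] = num
    return ordered


def goto(ordered, n, count=25):
    """Return up to `count` items starting at the n-th item (1-based)."""
    start = max(n - 1, 0)
    # Slicing never raises, so an out-of-range n simply yields an empty list.
    return ordered[start:start + count]
```

Once the list is numbered, jumping to the nth item is a slice in memory and needs no datastore offset.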
EDIT 4: Item Model:
from google.appengine.ext import db

class Item(db.Model):
    item_id = db.IntegerProperty()
    index = db.IntegerProperty()
    # I used StringProperty instead of ReferenceProperty because I'm a cheapo with memory
    category = db.StringProperty()
I kept num separate from the Model because of the cost associated with updating it to stay sequential on adds and removes. Therefore, I use index to maintain the (non-sequential) order of the items, and every time the list of dicts representing the items for a specific category is pulled out of the datastore, I run through them and add a sequential “num” to each item. num is really only for the client (read: browser), since my UI is entirely dynamic (all AJAX; no page reloads whatsoever) and I cache every item that is sent to the browser in JavaScript. Server-side I don’t necessarily need a sequential order for the items; certain functions on the client side need it, and the server would do just fine with a non-sequential index.
The main crux of my question seems to have turned into whether I should keep this model, i.e. storing all items for a category in memcache, or go back to retrieving the items directly from the datastore. Items will be requested a lot (I don’t have an exact amount or even an estimate of how many times per second, but it should be many items requested per second). I know that there’s no way to precisely determine how long the items will stay in memcache before getting kicked out, but can I assume it won’t be happening every few minutes? If it won’t, I feel like the best way to go is with memcache, but I could be missing something. Oh, and hopefully this will be the last edit before I steal all of SO’s disk space 😉
EDIT 5 So much for no more edits… This is a chart of my calculations for time complexity when using memcache and the datastore or just the datastore (left out time complexity for the datastore because I’m not sure exactly what it is. It’s too late to go read the BigTable paper again to try and figure it out so I’ll just assume it’s the same for operations on a hashtable). These are all best cases. For the memcache solution, the worst case you would need to add N datastore reads (since all the items in the category must be read into the memcache). This chart is leaving anything extra not having to do with storing or retrieving the data (ie sorts, filters) out of the equation for both memcache and datastore solutions. For the memcache only solution, num is not stored in the datastore. For the datastore only solution it is, which is why there is the extra cost associated with an Add or Remove (updating the num for each item).
n DS = number of DataStore operations
w = write
r = read
N = number of items in category (for Add and Remove this is the number before
the operation is performed)
c = count of items to read
o = offset
+-------------------+--------------------+--------------------------+
| Operation         | Memcache           | Datastore                |
+-------------------+--------------------+--------------------------+
| Reads             | O(o + c)           | c DS r                   |
| Reads with offset | O(o + c)           | o + c DS r               |
| Adds              | 1 DS w + O(N)      | 1 + N DS w & N - 1 DS r  |
| Removes           | 1 DS rw + O(o + N) | N - o DS wr              |
| Edits             | 1 DS rw + O(o)     | 1 DS rw                  |
+-------------------+--------------------+--------------------------+
So the question is: does the worse time complexity of the memcache solution outweigh the potentially larger number of DS operations that come with the datastore solution? Or could memcache eviction cause more DS operations in the memcache solution than in the datastore solution (because each time the items are evicted from memcache we have to do N DS r to repopulate it)? This all assumes reads will happen much more frequently than writes, which will be the case in this application once initial data loading is done.
Updated for Edit 4.
Your Item model looks reasonable; the biggest issue is how to manage the sequential index. I’m still hesitant to rely on memcache in the way you describe, because cache eviction dramatically slows your read operations (which are common and user-facing) unless you have the datastore properly backing up the state of your data.
So, feel free to continue storing items in memcache. However, on inserts or deletes, make sure to update num in the datastore as well. (If you already have the entire set of Items in memcache, no read ops are required. Just update all the items in memcache and write them to the datastore simultaneously.)
The worst case scenario is still as I described it before your 4th edit. Inserting an element is 1 read + 1 write. Removing an element is N reads + N writes, where N is the number of items in the category. Looking up an item is just 1 read. Each of these scenarios assumes memcache is empty.
If you were using an offset, each insert would be 1 write. Removing an element would be 1 write. But reading an element is N reads, where N is the sequential index of the item you are retrieving. If you’re using memcache, but aren’t backing up the value of num in the datastore, you’ll also fall into this scenario.
In most cases, reads are far more common than writes, so maintaining num in the datastore is far more efficient.
An addendum:
Cloud SQL is another option if your data size isn’t too large. SQL in general is much better at sequential queries like the one you are trying to do, at the cost of scaling poorly with large data sets.
The per-use pricing is relatively cheap if you suspect you’ll have minimal usage.
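To make the comparison between maintaining num in the datastore and using an offset more concrete, here is a rough back-of-the-envelope sketch. It only plugs the per-operation figures quoted in this answer into a sample workload, assuming memcache is empty; the function names and the example workload are made up for illustration, and actual billing may differ.

```python
def ops_with_stored_num(inserts, removes, lookups, n):
    """Datastore ops when num is kept in the datastore.

    Uses the figures above: insert = 1 read + 1 write,
    remove = n reads + n writes, lookup = 1 read.
    """
    reads = inserts + removes * n + lookups
    writes = inserts + removes * n
    return reads + writes


def ops_with_offset(inserts, removes, lookups, avg_position):
    """Datastore ops when using offset.

    Uses the figures above: insert = 1 write, remove = 1 write,
    lookup = as many reads as the position of the requested item.
    """
    reads = lookups * avg_position
    writes = inserts + removes
    return reads + writes


# Hypothetical workload: a 1500-item category, few writes, many lookups
# that land in the middle of the list on average.
stored = ops_with_stored_num(inserts=10, removes=10, lookups=10000, n=1500)
offset = ops_with_offset(inserts=10, removes=10, lookups=10000, avg_position=750)
print(stored, offset)  # 40020 7500020
```

With a read-heavy workload like this one the stored num comes out far ahead; the balance would only shift if removes were nearly as frequent as lookups.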