I’m building a Lifestreaming app that will involve pulling down lots of feeds for lots of users, and performing data-mining, and machine learning algorithms on the results. GAE’s load balanced and scalable hosting sounds like a good fit for a system that could eventually be moving around a LOT of data, but it’s lack of cron jobs is a nuisance. Would I be better off using Django on a co-loc and dealing with my own DB scaling?
I’m building a Lifestreaming app that will involve pulling down lots of feeds for
Share
While I can not answer your question directly, my experience of building Microupdater (a news aggregator collecting a few hundred feeds on AppEngine) may give you a little insight.
Fetching feeds. Fetching lots of feeds by cron jobs (it was the only solution until SDK 1.2.5) is not efficient and scalable, which has lower limit on job frequency (say 1 min, so you could only fetch at most 60 feeds hourly). And with latest SDK 1.2.5, there is XMPP API, which I have not implemented yet. The best promising approach would be PubSubHubbub, of which you offer an callback url and HubBub will notify you new entries in real-time. And there is an demo implementation on AppEngine, which you can play around.
Parsing feeds. You may already know that parsing feeds is cpu-intensive. I use Universal Feed Parser by Mark Pilgrim, when parsing a large feed (say a public google reader topic), AppEngine may fail to process all entries. My dashboard have a lot of these CPU-limit warnings. But it may result in my incapability to optimize the code yet.
Totally said, AppEngine is not yet an ideal platform for lifestream app, but that may change in future.