I need to implement “inverse document frequency” (IDF) on Google App Engine, and I’m looking for suggestions to improve its efficiency. My basic routine is as follows.
When parsing a webpage, I save each (phrase, domain) pair to the datastore, like this:
for (String phrase : phrase_collection) {
    dataStore.put(phrase, domain);
}
When computing the IDF later, I fetch the number of occurrences of each phrase from the datastore, like this:
for (String phrase : phrase_collection) {
    long count = dataStore.get(phrase).size();
}
However, this is too slow and often hits the 30-second timeout. I also have some additional challenges in this scenario:
- Multi-language input (webpages). The phrases are therefore also in different languages, which makes them hard to cache.
- Parsing webpages and ranking phrases also takes a long time. The whole process is: charset_detect -> language_detect -> parse according to the detected language -> ranking.
Always On is enabled in GAE.
I’m looking forward to any suggestions! Thanks in advance!
You’re doing an individual get (and put) for each phrase. This is naturally going to be very slow, because each one is a separate round trip to the datastore. Instead, you should use the variants of put and get that accept an iterable of entities or keys, so that they are all executed in a single batch call.
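The batching idea can be sketched roughly as follows. This is only an illustration, not the App Engine API itself: `FakeStore`, `putAll`, `countAll`, and `idfForAll` are hypothetical names, and the store is an in-memory stand-in that just counts simulated round trips. In real code, the batch calls would be the `DatastoreService` overloads `put(Iterable<Entity>)` and `get(Iterable<Key>)`. The IDF formula used here, log(N / (1 + df)), is one common smoothed variant and is an assumption, since the question does not say which form it uses.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BatchIdfSketch {

    /** Hypothetical in-memory stand-in for the datastore; counts simulated round trips. */
    static class FakeStore {
        private final Map<String, Set<String>> domainsByPhrase = new HashMap<>();
        int roundTrips = 0;

        /** Batch put: records every phrase for one domain in a single simulated call. */
        void putAll(Iterable<String> phrases, String domain) {
            roundTrips++;
            for (String phrase : phrases) {
                domainsByPhrase.computeIfAbsent(phrase, k -> new HashSet<>()).add(domain);
            }
        }

        /** Batch get: returns the domain count of every phrase in a single simulated call. */
        Map<String, Integer> countAll(Iterable<String> phrases) {
            roundTrips++;
            Map<String, Integer> counts = new HashMap<>();
            for (String phrase : phrases) {
                Set<String> domains = domainsByPhrase.get(phrase);
                counts.put(phrase, domains == null ? 0 : domains.size());
            }
            return counts;
        }
    }

    /** Smoothed IDF (assumed variant): log(totalDocs / (1 + docsWithPhrase)). */
    static double idf(long totalDocs, long docsWithPhrase) {
        return Math.log((double) totalDocs / (1 + docsWithPhrase));
    }

    /** Computes IDF for all phrases with one batch lookup instead of one get per phrase. */
    static Map<String, Double> idfForAll(FakeStore store, List<String> phrases, long totalDocs) {
        Map<String, Integer> counts = store.countAll(phrases);
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            result.put(entry.getKey(), idf(totalDocs, entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.putAll(Arrays.asList("app engine", "datastore"), "a.example");
        store.putAll(Arrays.asList("datastore"), "b.example");
        Map<String, Double> idf = idfForAll(store, Arrays.asList("app engine", "datastore"), 10);
        System.out.println(idf + " using " + store.roundTrips + " simulated round trips");
    }
}
```

The point of the sketch is that the number of calls to the store depends on the number of pages processed, not on the number of phrases per page.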
You should also do this work ‘offline’ – as Stefan suggests, using backends or task queues. Task queues would likely be a better match here.
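One way to prepare the work for task queues is to split the phrases into fixed-size chunks, so that each task handles one chunk and stays well within its deadline. The helper below is a minimal sketch under that assumption: `TaskChunkSketch` and `chunk` are hypothetical names, and the step of actually enqueueing each chunk through the Task Queue API is left out because it needs the App Engine SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TaskChunkSketch {

    /** Splits the phrases into consecutive chunks of at most chunkSize items each. */
    static List<List<String>> chunk(List<String> phrases, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < phrases.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, phrases.size());
            // Copy the sublist so each chunk is independent of the original list.
            chunks.add(new ArrayList<>(phrases.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("a", "b", "c", "d", "e");
        // In App Engine, each chunk would become the payload of one task.
        for (List<String> taskPayload : chunk(phrases, 2)) {
            System.out.println("would enqueue task for: " + taskPayload);
        }
    }
}
```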