I need to scrape a simple webpage which has the following text:
Value=29
Time=128769
The values change frequently.
I want to extract the Value (29 in this case) and store it in a database. I want to scrape this page every 6 hours. I am not interested in displaying the value anywhere; I am only interested in the cron job. I hope that makes sense.
Please advise me if I can accomplish this using Google’s App Engine.
Thank you!
Sure! E.g., in Python, urlfetch.fetch (with the URL as argument) to get the contents, then a simple re.search(r'Value=(\d+)', contents).group(1) (if the contents are as simple as you’re showing) to get the value, and a db.put to store it. Do you want the Python details spelled out, or do you prefer Java?

Edit: urllib / urllib2 would also be feasible (GAE does support them now).
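As a minimal sketch of the extraction step on its own (plain Python, no App Engine calls), assuming the page text looks like the sample in the question. The name extract_value is hypothetical, and unlike the one-liner it returns None instead of raising when no Value= is present:

```python
import re

# Matches "Value=" followed by one or more digits, anywhere in the text.
VALUE_RE = re.compile(r'Value=(\d+)')

def extract_value(contents):
    """Return the integer after 'Value=', or None if no such text is found."""
    match = VALUE_RE.search(contents)
    if match is None:
        return None
    return int(match.group(1))

# Sample text copied from the question.
page = "Value=29\nTime=128769\n"
print(extract_value(page))  # prints 29
```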
So cron.yaml should hold an entry that names the URL to hit and a schedule of every 6 hours, and app.yaml a handler entry that maps that same URL to the script that does the work. You may have other entries in either or both, of course, but this is the subset needed to “refresh the value”.

A possible refvalue.py would define a class Value as the datastore model, plus a request handler that fetches the page, extracts the number, and stores it. Depending on what else your web app is doing, you’ll probably want to move the class Value to a separate file (e.g. models.py with other models), which of course you’ll then have to import (from this .py file and from others which do something interesting with all of your saved values).

Here I’ve taken some possible anomalies into account (no Value= found on the target page) but not others (the target page’s server does not respond or gives an error); it’s hard to know exactly what anomalies you need to consider and what you want to do if they occur. What I’m doing here is very simply recording None as the value at the anomaly’s time, but you may want to do more… or less. I’ll leave that up to you!-)
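One way the anomaly handling could be extended to cover a failing server is sketched below. This is not the App Engine code itself: fetch and store are hypothetical stand-ins for the real fetch and datastore calls, the function name refresh_value and the example URL are made up for illustration, and the choice to record None on a fetch error is just the same policy applied to one more case.

```python
import re
from datetime import datetime, timezone

def refresh_value(fetch, store, url):
    """Fetch url, pull out the number after 'Value=', and store it.

    fetch(url) -> str and store(value, when) are hypothetical stand-ins
    for the real fetch and datastore calls; they are not App Engine APIs.
    """
    try:
        contents = fetch(url)
    except Exception:
        # Server did not respond or returned an error: treat it as an anomaly.
        contents = ""
    match = re.search(r'Value=(\d+)', contents)
    # None records "no usable value" at this point in time.
    value = int(match.group(1)) if match else None
    store(value, datetime.now(timezone.utc))
    return value

# Usage with in-memory fakes instead of a real page and database.
saved = []

def fake_store(value, when):
    saved.append((value, when))

def failing_fetch(url):
    raise IOError("server unreachable")

refresh_value(lambda url: "Value=29\nTime=128769\n", fake_store, "http://example.com/page")
refresh_value(failing_fetch, fake_store, "http://example.com/page")
print([value for value, _ in saved])  # prints [29, None]
```

Passing the fetch and store steps in as arguments keeps the logic testable outside App Engine; in the real handler they would be replaced by the actual fetch and datastore calls.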