I am new to Python and haven't found anything on this, which suggests it is probably dead easy.
The page I am scraping is fairly simple, but it completely updates every 2 minutes. I have managed to scrape all the data, but the issue is that even though the program runs every 2 minutes (I have tried both taskeng.exe and looping in the script), the HTML it pulls from the website seems to refresh only every 12 minutes. For clarity, the website I am scraping shows a timestamp when it updates. My program pulls that stamp (along with other data) and writes it to a CSV file. But it pulls the same data for 12 minutes, and then suddenly the new data arrives. So the output looks like:
16:30, Data1, Data2, Data3
16:30, Data1, Data2, Data3
...
16:30, Data1, Data2, Data3
16:42, Data1, Data2, Data3
16:42, Data1, Data2, Data3
whereas it should be:
16:30, Data1, Data2, Data3
16:32, Data1, Data2, Data3
16:34, Data1, Data2, Data3
16:36, Data1, Data2, Data3
16:38, Data1, Data2, Data3
16:40, Data1, Data2, Data3
16:42, Data1, Data2, Data3
I think this has to do with a cache on my side. How can I force my HTTP requests to fetch a completely fresh copy, or force Python not to store the page in a cache?
I am using BeautifulSoup and Mechanize. My code for the HTTP request is below:
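from mechanize import Browser
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4; the original import is not shown

mech = Browser()
url = "http://myurl.com"
page = mech.open(url)  # fetch the page
html = page.read()     # raw HTML of the response
soup = BeautifulSoup(html)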
If it helps to post all my code, I can do that. Thanks in advance for any advice.
You could use a simpler tool like requests. But if you really want to stick with mechanize, you can also skip the Browser() stuff (which is probably introducing cookies into your requests). Check the mechanize docs for more details.
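If the stale pages come from a cache somewhere between you and the server, one thing worth trying is to ask for a fresh copy explicitly. The following is only a sketch under that assumption, using the standard library's urllib.request rather than mechanize or requests; the helper names and the "_" query parameter are made up for illustration, and the server or an intermediate proxy may still ignore these hints. The same headers dictionary could be passed to requests.get(url, headers=...) instead.

```python
import time
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Request headers asking caches to revalidate instead of serving a stored copy.
NO_CACHE_HEADERS = {
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",  # older HTTP/1.0 equivalent
}


def cache_busting_url(url, now=None):
    """Append a throwaway timestamp parameter so every request URL is unique."""
    stamp = int(time.time() if now is None else now)
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("_", str(stamp)))  # "_" is an arbitrary, hypothetical name
    return urlunsplit(parts._replace(query=urlencode(query)))


def build_request(url, now=None):
    """Build (but do not send) a request carrying the no-cache headers."""
    return urllib.request.Request(cache_busting_url(url, now),
                                  headers=NO_CACHE_HEADERS)


def fetch_html(url):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()
```

Calling fetch_html("http://myurl.com") in place of mech.open(url).read() and passing the result to BeautifulSoup would leave the rest of the script unchanged. If the timestamp still only moves every 12 minutes, the caching is probably happening on the server side and cannot be bypassed from the client.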