I have a web scraping script that gets new data once every minute, but over the course of a couple of days, the script ends up using 200mb or more of memory, and I found out it’s because mechanize is keeping an infinite browser history for the .back() function to use.
I have looked in the docstrings, and I found the clear_history() function of the browser class, and I invoke that each time I refresh, but I still get 2-3mb higher memory usage on each page refresh.

Edit: Hmm, it seems it kept doing the same thing after I called clear_history, up until I got to about 30mb worth of memory usage, then it dropped back down to 10mb or so (which is the base amount of memory my program starts up with)… any way to force this behavior on a more regular basis?
How do I keep mechanize from storing all of this info? I don’t need to keep any of it. I’d like to keep my python script below 15mb memory usage.
You can pass an argument `history=whatever` when you instantiate the `Browser`; the default value is `None`, which means the browser actually instantiates the `History` class (to allow `back` and `reload`). The simplest approach is a minimal `NoHistory` class that stores nothing (it will give an attribute error exception if you ever do call `back` or `reload`). A cleaner approach would implement other methods in `NoHistory` to give clearer exceptions on erroneous use of the browser’s `back` or `reload`, but this simple one should suffice otherwise.

Note that this is an elegant (though not well documented ;-) use of the dependency injection design pattern: in a (bleah) “monkeypatching” world, the client code would be expected to overwrite `b._history` after the browser is instantiated, but with dependency injection you just pass in the “history” object you want to use. I’ve often maintained that Dependency Injection may be the most important DP that wasn’t in the “gang of 4” book! -)
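
The `NoHistory` idea described above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the browser only ever calls `add` and `clear` on its history object during normal page loads, and the no-op `close` is an extra precaution in case the browser closes its history when it is itself closed. The `make_browser` helper is a hypothetical name used only for illustration.

```python
class NoHistory(object):
    """Stand-in history object that never stores any request or response."""

    def add(self, *args, **kwargs):
        # Called when a page is visited; discard everything.
        pass

    def clear(self):
        # Nothing is ever stored, so there is nothing to clear.
        pass

    def close(self):
        # Assumption: included so closing the browser does not fail.
        pass


def make_browser():
    # Hypothetical helper: imports mechanize lazily and injects NoHistory
    # through the `history` argument instead of patching b._history later.
    import mechanize
    return mechanize.Browser(history=NoHistory())
```

With this in place, `b = make_browser()` gives a browser that keeps no page history; calling `b.back()` should then fail with an attribute error, since `NoHistory` deliberately does not implement `back`.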