Currently I have a spider written in Java (using HtmlUnit) that logs into a supplier website and crawls it.
It keeps the session (cookie) and even lets me enable or disable JavaScript.
I also use HTMLParser (Java) to parse the HTML and extract the relevant information.
Does Python have something similar?
Python has urllib2 (urllib.request in Python 3) for fetching pages; it supports HTTP password authentication and, together with cookielib (http.cookiejar in Python 3), cookies.
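As a rough sketch of the session-keeping part, here is one way it could look with the Python 3 standard library. The helper names build_session and login, and the form field names "username" and "password", are made up for illustration; a real site will have its own login URL and field names.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session():
    """Return (opener, jar). The opener stores cookies in jar and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def login(opener, login_url, username, password):
    """POST a login form and return the response body as bytes.

    The field names below are hypothetical; check the site's actual form.
    """
    form = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("utf-8")
    # Passing a data argument makes urllib send a POST request.
    with opener.open(login_url, form) as response:
        return response.read()
```

After a successful login, later calls to opener.open(...) on the same opener send the session cookie back automatically, so the rest of the crawl can reuse it.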
There is also an HTMLParser module for extracting data from HTML, but some people prefer the more full-featured BeautifulSoup.
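For the parsing side, a minimal sketch with the standard-library parser (html.parser in Python 3) might look like this. LinkExtractor and extract_links are illustrative names, and collecting href values is just an example of "relevant information".

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the list of href values found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

The same pattern (override handle_starttag, handle_data, and so on) works for pulling out other elements; BeautifulSoup offers a higher-level search API for the same job.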