So my brother wanted me to write a web crawler in Python (I'm self-taught), and I know C++, Java, and a bit of HTML. I’m using version 2.7 and reading the Python library documentation, but I have a few problems:
1. The concepts of httplib.HTTPConnection and request are new to me, and I don’t understand whether it downloads an HTML script, like a cookie, or an instance. If you do both of those, do you get the source of a web page? And what are some terms I would need to know to modify the page and return the modified page?
Just for background, I need to download a page and replace any img with ones I have.
And it would be nice if you guys could tell me your opinion of 2.7 and 3.1.
Use Python 2.7; it has more third-party libraries at the moment. (Edit: see below.) I recommend using the stdlib module urllib2, which lets you comfortably fetch web resources.
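A minimal sketch of fetching a page's source. It uses urllib.request, the Python 3 successor to urllib2 (on 2.7 the corresponding call is urllib2.urlopen). The function name fetch_html, the 10-second timeout, and the UTF-8 fallback are assumptions for illustration:

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download `url` and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when present.
        # Falling back to UTF-8 is an assumption; not every site declares one.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://example.com/") should return the page source as one string, which you can then hand to a parser.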
For parsing the code, have a look at BeautifulSoup.
BTW: what exactly do you want to do?
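For the task described in the question (replacing every img with your own), a rough sketch with BeautifulSoup could look like the following. It assumes the current bs4 package (BeautifulSoup 4) and the built-in "html.parser" backend. The function name replace_images and the file name my_image.png are made up for illustration:

```python
from bs4 import BeautifulSoup


def replace_images(html, new_src):
    """Return `html` with the src of every <img> tag set to `new_src`."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        # Tags behave like dicts for attributes; this overwrites
        # (or adds) the src attribute and leaves the others alone.
        img["src"] = new_src
    # str(soup) serializes the modified tree back to HTML text.
    return str(soup)


page = '<p>Hello <img src="original.png" alt="logo"></p>'
modified = replace_images(page, "my_image.png")
```

The parser may normalize the markup slightly when it writes it back out, so the result is equivalent HTML rather than a byte-for-byte copy with only the src changed.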
Edit: It’s 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can.
python-requests is a very nice high-level library which is easier to use than urllib2.
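As a hedged illustration of the difference, the same download with requests might look like this. The function name fetch_with_requests, the optional session parameter, and the 10-second timeout are assumptions for the sketch, not part of the library:

```python
import requests


def fetch_with_requests(url, session=None, timeout=10):
    """Download `url` with requests and return the body as text."""
    # Both the requests module and a requests.Session expose .get(),
    # so either can be used here; passing a Session reuses connections.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    # .text is the body decoded with the encoding requests detected.
    return response.text
```

Compared with the urllib version, decoding and error checking are each a single attribute access or method call on the response object.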