I’d like to build a web app to help other students at my university create their schedules. To do that, I need to crawl the master schedule (one huge HTML page), as well as the link to a detailed description for each course, and store the results in a database, preferably using Python. I also need to log in to access the data.
- How would that work?
- What tools/libraries can/should I use?
- Are there good tutorials on that?
- How do I best deal with binary data (e.g. a pretty PDF)?
- Are there already good solutions for that?
- requests for downloading the pages.
- lxml for scraping the data.

If you want to use a powerful scraping framework, there’s Scrapy. It has good documentation too, though it may be a little overkill depending on your task.
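To give a rough idea of how requests and lxml fit together, here is a minimal sketch. Everything site-specific in it is an assumption: the URLs, the `username`/`password` form field names, the `tr class="course"` markup, and the helper names (`parse_courses`, `save_courses`, `download_binary`, `crawl`) are all made up for illustration, so you'd need to adapt them to what your university's pages actually look like. It also assumes a plain form login; single sign-on or JavaScript-driven logins would need a different approach.

```python
import sqlite3
from contextlib import closing
from urllib.parse import urljoin

import requests
from lxml import html

# Hypothetical URLs; replace them with the real ones for your site.
BASE_URL = "https://example.edu/"
LOGIN_URL = urljoin(BASE_URL, "login")
SCHEDULE_URL = urljoin(BASE_URL, "schedule")


def parse_courses(page_source, base_url=BASE_URL):
    """Return (title, absolute detail URL) pairs from the schedule HTML.

    Assumes each course is a link inside a table row with class "course".
    """
    tree = html.fromstring(page_source)
    courses = []
    for link in tree.xpath('//tr[@class="course"]//a[@href]'):
        title = link.text_content().strip()
        courses.append((title, urljoin(base_url, link.get("href"))))
    return courses


def save_courses(conn, courses):
    """Store the pairs in SQLite, skipping URLs that are already present."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS courses (title TEXT, url TEXT PRIMARY KEY)"
    )
    conn.executemany("INSERT OR IGNORE INTO courses VALUES (?, ?)", courses)
    conn.commit()


def download_binary(session, url, path):
    """Stream a binary file such as a PDF to disk in chunks."""
    with session.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(path, "wb") as fh:
            for chunk in response.iter_content(chunk_size=8192):
                fh.write(chunk)


def crawl(username, password, db_path="courses.db"):
    """Log in, fetch the master schedule, and store the course links."""
    # A Session keeps the login cookies for all later requests.
    with requests.Session() as session:
        login = session.post(
            LOGIN_URL,
            data={"username": username, "password": password},  # assumed field names
            timeout=30,
        )
        login.raise_for_status()
        page = session.get(SCHEDULE_URL, timeout=30)
        page.raise_for_status()
        with closing(sqlite3.connect(db_path)) as conn:
            save_courses(conn, parse_courses(page.content))
```

For the binary data, the same logged-in session can be passed to `download_binary`, which writes the raw bytes to a file instead of trying to parse them. Keeping the parsing in its own function also means you can test it against a saved copy of the page without hitting the server each time.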