I am trying to find or build a web scraper that can go through and find every state and national park in the US, along with its GPS coordinates and land area. I have looked into frameworks such as Scrapy, and I also see that there are sites built specifically around Wikipedia data, such as http://wiki.dbpedia.org/About. Does either approach have a specific advantage, or would one of them work better for loading the information into an online database?
Let’s suppose you want to parse pages like this Wikipedia page. The following code should work.
I tested it, and it produces the following output:
I think that’s a start. If some page fails, check whether its layout differs from the one the code expects.
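As a rough illustration of the kind of parsing described above, here is a minimal sketch using only the Python standard library. It assumes the page carries a Wikipedia-style table with class "infobox" whose rows pair a `th` label with a `td` value. The names `InfoboxParser`, `parse_infobox`, and `SAMPLE_HTML` are hypothetical, and the sample park data is made up; real pages may use a different layout.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect label -> value pairs from rows of a table with class "infobox"."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._depth = 0    # nesting depth of <table> tags inside the infobox
        self._cell = None  # "th" or "td" while inside such a cell
        self._label = []
        self._value = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and (self._depth or "infobox" in classes):
            self._depth += 1
        elif self._depth and tag == "tr":
            # Start a fresh row.
            self._label, self._value = [], []
        elif self._depth and tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if not self._depth:
            return
        if tag == "table":
            self._depth -= 1
        elif tag in ("th", "td"):
            self._cell = None
        elif tag == "tr":
            # Normalize whitespace and keep only rows with both parts.
            label = " ".join("".join(self._label).split())
            value = " ".join("".join(self._value).split())
            if label and value:
                self.fields[label] = value

    def handle_data(self, data):
        if self._cell == "th":
            self._label.append(data)
        elif self._cell == "td":
            self._value.append(data)


def parse_infobox(html):
    """Return the infobox rows of an HTML page as a dict."""
    parser = InfoboxParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Made-up sample standing in for a downloaded park page (assumption).
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th>Location</th><td>Example County, Example State</td></tr>
  <tr><th>Coordinates</th><td><span class="geo">44.60; -110.50</span></td></tr>
  <tr><th>Area</th><td>1,234 acres (4.99 km<sup>2</sup>)</td></tr>
</table>
"""

fields = parse_infobox(SAMPLE_HTML)
print(fields["Coordinates"])
print(fields["Area"])
```

The values come back as plain strings, so turning the area or the coordinates into numbers would still need a separate cleanup step.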
Of course, you will also have to find a way of obtaining all the links you want to parse.
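One possible way to gather those links, sketched under the assumption that you start from a list page whose entries link to the individual park articles. The class `WikiLinkCollector`, the list URL, and the sample HTML are all hypothetical; the filter simply keeps `/wiki/` links without a colon, which skips namespaces such as `File:` or `Category:`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class WikiLinkCollector(HTMLParser):
    """Collect absolute URLs of article links found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep article links only; a colon marks namespaces like "File:".
        if href.startswith("/wiki/") and ":" not in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # preserve order, drop duplicates
                self.links.append(url)


# Made-up list page standing in for a downloaded index (assumption).
LIST_URL = "https://en.wikipedia.org/wiki/List_of_example_parks"
LIST_HTML = """
<ul>
  <li><a href="/wiki/Example_Park_A">Example Park A</a><a href="#cite_note-1">[1]</a></li>
  <li><a href="/wiki/Example_Park_B">Example Park B</a></li>
  <li><a href="/wiki/File:Example.jpg">photo</a></li>
  <li><a href="/wiki/Example_Park_A">Example Park A again</a></li>
</ul>
"""

collector = WikiLinkCollector(LIST_URL)
collector.feed(LIST_HTML)
park_links = collector.links
print(park_links)
```

A real list page will also contain article links that are not parks (states, agencies, and so on), so the result would likely need further filtering.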
One important thing: do you know if it is permitted to scrape Wikipedia? I have no idea, but you should check whether it is before doing it…
;)
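One signal you can check programmatically is the site’s robots.txt file. The sketch below uses `urllib.robotparser` from the standard library on a made-up set of rules; the user agent name `my-park-bot`, the rules, and the example.org URLs are assumptions for illustration, not the rules of any real site. It is only a partial answer, since a site’s terms of use may impose further limits.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules standing in for a downloaded robots.txt (assumption).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""

robots = RobotFileParser()
# parse() takes the file's lines, so no network access is needed here.
robots.parse(SAMPLE_ROBOTS_TXT.splitlines())

USER_AGENT = "my-park-bot"  # hypothetical crawler name

article_allowed = robots.can_fetch(USER_AGENT, "https://example.org/wiki/Example_Park")
script_allowed = robots.can_fetch(USER_AGENT, "https://example.org/w/index.php")

print(article_allowed)
print(script_allowed)
```

For a live site you would call `set_url()` with the robots.txt address and then `read()` instead of `parse()`.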