I’m curious about website scraping (how it’s done, what tools are used, and so on). Specifically, I’d like to write a script that scrapes the site Hype Machine.
I’m a 4th-year Software Engineering undergraduate, but we don’t really cover any web programming, so my understanding of JavaScript, RESTful APIs, and all things web is pretty limited. We’re mainly focused on theory and client-side applications.
Any help or directions greatly appreciated.
The first thing to look for is whether the site already offers some sort of structured data, or whether you need to parse the HTML yourself. It looks like Hype Machine has an RSS feed of the latest songs. If that’s the data you’re after, that would be a good place to start.
You can use a scripting language to download the feed and parse it. I use Python, but you could pick a different scripting language if you like. The Python docs cover both steps: downloading a URL (urllib.request) and parsing XML (xml.etree.ElementTree).
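To make that concrete, here is a minimal sketch of the download-and-parse step using only the standard library. The feed URL and the User-Agent string are placeholders I made up, not the real Hype Machine values, and I'm assuming the feed is plain RSS 2.0 with item elements that carry a title and a link.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder address (an assumption): substitute the real feed URL from the site.
FEED_URL = "https://example.com/feed/latest.rss"


def download(url, timeout=10):
    """Fetch a URL and return the raw response body as bytes."""
    request = urllib.request.Request(
        url,
        # Hypothetical User-Agent; identifying your script is simply good manners.
        headers={"User-Agent": "my-feed-reader/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def parse_items(rss_bytes):
    """Return one {"title", "link"} dict per <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_bytes)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is missing.
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items
```

You would then call something like parse_items(download(FEED_URL)) and loop over the result. If the real feed uses XML namespaces or Atom instead of RSS, the element names will need adjusting.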
Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you’ll get the new data the second it becomes available, you’ll put a lot of load on the site, and there’s a good chance they’ll block you. Try not to run your script more often than you need to.
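One way to keep the load down is to enforce a minimum gap between fetches. This is only a rough sketch: the 15-minute interval is an arbitrary assumption, and the function names (seconds_to_wait, poll) are mine. The fetch argument is whatever function you use to download the feed.

```python
import time

# Assumed interval: pick whatever your use case actually needs.
MIN_INTERVAL_SECONDS = 15 * 60


def seconds_to_wait(last_fetch, now, min_interval=MIN_INTERVAL_SECONDS):
    """Seconds to sleep before the next fetch is allowed (0 if already due)."""
    if last_fetch is None:
        return 0
    return max(0, min_interval - (now - last_fetch))


def poll(fetch, rounds, min_interval=MIN_INTERVAL_SECONDS,
         clock=time.monotonic, sleep=time.sleep):
    """Call fetch() `rounds` times, never more often than once per min_interval."""
    results = []
    last_fetch = None
    for _ in range(rounds):
        # Wait out whatever remains of the interval since the previous fetch.
        sleep(seconds_to_wait(last_fetch, clock(), min_interval))
        last_fetch = clock()
        results.append(fetch())
    return results
```

The clock and sleep parameters are there so you can try the loop with fake timing instead of actually waiting. For a one-off script, a cron job that runs a few times a day does the same job with less code.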