How would I go about writing a crawler for the Google Chrome extensions website? https://chrome.google.com/webstore/category/extensions
I am doing a little security research on Chrome extensions, looking at about 100 extensions per category. The problem I am having right now is writing a crawler to at least grab the UIDs. The website appears to be populated by JavaScript: if I grab the HTML using Python, I get nothing useful, because the core content I need (i.e. the DOM with all the extension elements) is loaded at a later stage. Any ideas?
Yes, the webpage doesn’t contain the data – it is downloaded separately. A URL like this one is used:
Note that this has to be a POST request (without any POST data); other requests will be rejected for security reasons. You have to remove ")]}'" at the beginning of the file and "[]\n" in various other places; then you should get proper JSON that can be parsed via json.loads. The data isn’t very structured but should be good enough for crawling.
Note that the pv parameter looks like it might change soon (this Unix time corresponds to a date four days ago). You can use the Network tab of Chrome’s Developer Tools to see the current request parameters. The category parameter is the identifier of the category: it’s the URL part after https://chrome.google.com/webstore/category/ in Web Store links.
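The steps above could be sketched in Python roughly as follows. This is a minimal sketch, not a tested crawler. The names `fetch_raw`, `parse_response` and `extract_ids` are made up for illustration, and the `url` argument stands for whatever request URL the Network tab currently shows. Removing "[]\n" is done only as a fallback when the first parse fails, which is one interpretation of the cleanup step. The ID extraction assumes that extension IDs appear in the data as standalone 32-character strings made of the letters a to p.

```python
import json
import re
import urllib.request

# Prefix that has to be removed before the body is valid JSON.
JSON_PREFIX = ")]}'"

# Assumption: extension IDs are standalone strings of 32 letters in the range a-p.
EXTENSION_ID = re.compile(r"[a-p]{32}")


def fetch_raw(url):
    """Send a POST request with an empty body and return the response text."""
    # data=b"" makes urllib issue a POST without any POST data.
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def parse_response(raw):
    """Strip the prefix and parse the rest as JSON."""
    text = raw.lstrip()
    if text.startswith(JSON_PREFIX):
        text = text[len(JSON_PREFIX):]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fallback: drop stray "[]\n" fragments and try once more.
        return json.loads(text.replace("[]\n", ""))


def extract_ids(data):
    """Walk the nested structure and collect strings that look like extension IDs."""
    found = set()
    stack = [data]
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            if EXTENSION_ID.fullmatch(item):
                found.add(item)
        elif isinstance(item, list):
            stack.extend(item)
        elif isinstance(item, dict):
            stack.extend(item.values())
    return found
```

With a URL copied from Developer Tools, `extract_ids(parse_response(fetch_raw(url)))` would then give the set of IDs found in one response. Since the data is not very structured, walking every nested value avoids depending on the exact position of the ID inside each entry.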