I am downloading HTML pages that have data defined in them in the following way:
... <script type= "text/javascript"> window.blog.data = {"activity":{"type":"read"}}; </script> ...
I would like to extract the JSON object defined in ‘window.blog.data’.
Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can’t seem to find a method that will return the exact object without parsing.)
Thanks
Edit:
Would it be possible, and more correct, to do this with a Python headless browser (e.g., Ghost.py)?
BeautifulSoup is an HTML parser; you also need a JavaScript parser here. Note that some JavaScript object literals are not valid JSON (though in your example the literal is also a valid JSON object).
In simple cases you could:
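1. extract the <script>'s text using an HTML parser
2. assume that window.blog... is a single line or that there is no ';' inside the object, and extract the JavaScript object literal using simple string manipulations or a regex
3. treat the extracted literal as JSON and parse it (this assumes the literal is valid JSON)
Example: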
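A minimal sketch of these steps, using only the standard library: html.parser stands in for BeautifulSoup here so that nothing needs to be installed. The names ScriptTextCollector and extract_blog_data are hypothetical, and the regex assumes the whole assignment sits on one line and ends with ';'.

```python
import json
import re
from html.parser import HTMLParser


class ScriptTextCollector(HTMLParser):
    """Collect the text content of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data


def extract_blog_data(html):
    # Step 1: pull the script text out with an HTML parser.
    collector = ScriptTextCollector()
    collector.feed(html)
    collector.close()
    # Step 2 (assumption): the assignment is on a single line that ends with ';'.
    pattern = re.compile(r"^\s*window\.blog\.data\s*=\s*(.*?);\s*$", re.MULTILINE)
    for text in collector.scripts:
        match = pattern.search(text)
        if match:
            # Step 3 (assumption): the object literal is valid JSON.
            return json.loads(match.group(1))
    raise ValueError("window.blog.data assignment not found")


html = ('<html><body><script type= "text/javascript"> '
        'window.blog.data = {"activity":{"type":"read"}}; </script></body></html>')
data = extract_blog_data(html)
print(data["activity"]["type"])
```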
If the assumptions are incorrect then the code fails.
To relax the second assumption, a JavaScript parser such as slimit (suggested by @approximatenumber) could be used instead of a regex.
There is no need to treat the object literal (obj) as a JSON object. To get the necessary information, obj can be visited recursively like other AST nodes. This would allow supporting arbitrary JavaScript code (that can be parsed by slimit).
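If installing a JavaScript parser is not an option and the literal is known to be valid JSON, the second assumption can also be relaxed with the standard library alone. This is a different technique from the slimit approach described above, not a substitute for real JavaScript parsing: json.JSONDecoder.raw_decode parses one JSON value starting at a given index and ignores whatever follows, so the object may span several lines and contain ';' inside strings. The function name extract_json_after and the sample text are hypothetical.

```python
import json


def extract_json_after(text, marker="window.blog.data"):
    """Parse the JSON value assigned to `marker` in `text` (hypothetical helper)."""
    # Find the '=' that follows the marker; str.index raises ValueError if absent.
    start = text.index(marker) + len(marker)
    start = text.index("=", start) + 1
    # raw_decode does not skip leading whitespace, so skip it here.
    while start < len(text) and text[start].isspace():
        start += 1
    # Assumption: the literal is valid JSON; otherwise JSONDecodeError is raised.
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value


script_text = '''
window.blog.data = {
    "activity": {"type": "read; then write"}
};
window.blog.other = 1;
'''
data = extract_json_after(script_text)
print(data)
```

The same function can be applied either to the extracted script text or to the whole page, since it only searches for the marker.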