I am scraping some websites using BeautifulSoup and Requests. There is one page that I am examining that has its data inside of a <script language="JavaScript" type="text/javascript"> tag. It looks like this:
<script language="JavaScript" type="text/javascript">
var page_data = {
"default_sku" : "SKU12345",
"get_together" : {
"imageLargeURL" : "http://null.null/pictures/large.jpg",
"URL" : "http://null.null/index.tmpl",
"name" : "Paints",
"description" : "Here is a description and it works pretty well",
"canFavorite" : 1,
"id" : 1234,
"type" : 2,
"category" : "faded",
"imageThumbnailURL" : "http://null.null/small9.jpg"
......
Is there a way that I can create a Python dictionary or JSON object out of the page_data variable within this script tag? That would be much nicer than trying to obtain values with BeautifulSoup.
If you use BeautifulSoup to get the contents of the <script> tag, the json module can do the rest with a bit of string magic.

A .partition() and .rpartition() combo splits the text on the first { and on the last } in the JavaScript text block, which should be your object definition. By adding the braces back to the text we can feed it to json.loads() and get a Python structure from it.

This works because JSON is basically the JavaScript literal syntax for objects, arrays, strings, numbers, booleans and nulls.
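The partition/rpartition step described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper name `parse_js_object` and the toy input (including its `tags` key) are made up for illustration, and it assumes the object literal is valid JSON (double-quoted keys, no trailing commas, no comments) and that the last `}` in the text closes it.

```python
import json


def parse_js_object(script_text):
    """Return the first JavaScript object literal in script_text as a dict.

    Hypothetical helper: assumes the literal is valid JSON and that the
    last "}" in the text is the one that closes it.
    """
    # Keep everything after the first "{" ...
    after_open = script_text.partition("{")[2]
    # ... and, of that, everything before the last "}".
    body = after_open.rpartition("}")[0]
    # Both separators were dropped, so restore the braces before parsing.
    return json.loads("{" + body + "}")


# script_text would normally be the text of the <script> tag as obtained
# with BeautifulSoup; a made-up literal string stands in for it here.
script_text = 'var page_data = {"id": 1234, "tags": {"type": 2}};'
page_data = parse_js_object(script_text)
print(page_data["tags"]["type"])
```

If the script used single-quoted strings, unquoted keys, or comments, `json.loads()` would raise an error, since those are legal JavaScript but not legal JSON.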
Demonstration:
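A hedged run of the same steps on the sample from the question. The question's snippet is cut off at "......", so the closing `}` and `};` below are an assumption added only to make the text complete; whatever the real page contains after that point is not reproduced here.

```python
import json

# The question's sample. The part elided by "......" is left out, and the
# closing "}" / "};" are assumed so that the object literal is complete.
text = '''\
var page_data = {
   "default_sku" : "SKU12345",
   "get_together" : {
      "imageLargeURL" : "http://null.null/pictures/large.jpg",
      "URL" : "http://null.null/index.tmpl",
      "name" : "Paints",
      "description" : "Here is a description and it works pretty well",
      "canFavorite" : 1,
      "id" : 1234,
      "type" : 2,
      "category" : "faded",
      "imageThumbnailURL" : "http://null.null/small9.jpg"
   }
};
'''

# Cut on the first "{" and the last "}", then put the braces back.
json_text = "{%s}" % text.partition("{")[2].rpartition("}")[0]
value = json.loads(json_text)

print(value["default_sku"])             # SKU12345
print(value["get_together"]["name"])    # Paints
print(value["get_together"]["id"] + 1)  # 1235, the number is parsed as an int
```

The result is an ordinary nested dict, so values can be read by key instead of being picked out of the markup.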