I have files like this to parse (from web scraping) with Python:
some HTML and JS here...
SomeValue =
{
'calendar': [
{ 's0Date': new Date(2010, 9, 12),
'values': [
{ 's1Date': new Date(2010, 9, 17), 'price': 9900 },
{ 's1Date': new Date(2010, 9, 18), 'price': 9900 },
{ 's1Date': new Date(2010, 9, 19), 'price': 9900 },
{ 's1Date': new Date(2010, 9, 20), 'price': 9900 },
{ 's1Date': new Date(2010, 9, 21), 'price': 9900 },
{ 's1Date': new Date(2010, 9, 22), 'price': 9900 },
{ 's1Date': new Date(2010, 9, 23), 'price': 9900 }]
},
'data': [{
index: 0,
serviceClass: 'Economy',
prices: [9900, 320.43, 253.27],
eTicketing: true,
segments: [{
indexSegment: 0,
stopsCount: 1,
flights: [{
index: 0,
... and a lot of nested data and again HTML and JS...
I need to parse it and extract all the JSON-like data. Currently I strip out all the '\n' and '\t' characters with a regex and convert the result to a Python dictionary with eval(). I really don't like this solution, eval() especially. I have looked at BeautifulSoup and lxml, but didn't find anything there that would help parse this.
Can you suggest something better than regex and eval() for this task?
Page example: http://codepaste.ru/3830/
Use the json module to handle the JSON data, and use BeautifulSoup or lxml to handle parsing the HTML page.

If you want specific help, you'll need to provide specific data, e.g. the class of the tags in which this data is enclosed. You could soup.findAll the script tags, for instance, then strip some lines to get to the JSON, then feed that into json.loads.