I'm trying to write a script that gets Google's AJAX search results (for example: http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=filetype:pdf) and downloads every file. Right now I'm stuck trying to convert the response to a Python dictionary so it's easier to move through.
import subprocess
import ast
subprocess.call("curl -G -d 'q=filetype:pdf&v=1.0' http://ajax.googleapis.com/ajax/services/search/web > output",stderr=subprocess.STDOUT,shell=True)
file = open('output','r')
contents = file.read()
output_dict = ast.literal_eval(contents)
print output_dict
When I run it, I get:
$ python script.py
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2643 0 2643 0 0 15926 0 --:--:-- --:--:-- --:--:-- 26696
Traceback (most recent call last):
File "script.py", line 7, in <module>
output_dict = ast.literal_eval(contents)
File "/usr/lib/python2.7/ast.py", line 80, in literal_eval
return _convert(node_or_string)
File "/usr/lib/python2.7/ast.py", line 63, in _convert
in zip(node.keys, node.values))
File "/usr/lib/python2.7/ast.py", line 62, in <genexpr>
return dict((_convert(k), _convert(v)) for k, v
File "/usr/lib/python2.7/ast.py", line 79, in _convert
raise ValueError('malformed string')
ValueError: malformed string
The file looks like:
{"responseData": {"results":[{"GsearchResultClass":"GwebSearch",
"unescapedUrl":"http://www.foundationdb.com/AlphaLicenseAgreement.pdf",
"url":"http://www.foundationdb.com/AlphaLicenseAgreement.pdf",
"visibleUrl":"www.foundationdb.com",
"cacheUrl":"http://www.google.com/search?q\u003dcache:W7zhFlfbm6UJ:www.foundationdb.com",
"title":"FoundationDB Alpha Software Evaluation License Agreement",
"titleNoFormatting":"FoundationDB Alpha Software Evaluation License Agreement",
"content":"FOUNDATIONDB. ALPHA SOFTWARE EVALUATION LICENSE AGREEMENT. PLEASE READ CAREFULLY THE TERMS OF THIS ALPHA SOFTWARE \u003cb\u003e...\u003c/b\u003e",
"fileFormat":"PDF/Adobe Acrobat"
},
{"GsearchResultClass":"GwebSearch",
"unescapedUrl":"https://subreg.cz/registration_agreement.pdf",
"url":"https://subreg.cz/registration_agreement.pdf",
"visibleUrl":"subreg.cz",
"cacheUrl":"http://www.google.com/search?q\u003dcache:ODtRmQsiHD0J:subreg.cz",
"title":"Registration Agreement",
"titleNoFormatting":"Registration Agreement",
"content":"Registration Agreement. In order to complete the registration process you must read and agree to be bound by all terms and conditions herein. TERMS AND \u003cb\u003e...\u003c/b\u003e",
"fileFormat":"PDF/Adobe Acrobat"
},
{"GsearchResultClass":"GwebSearch",
"unescapedUrl":"http://supportdetails.com/export.pdf",
"url":"http://supportdetails.com/export.pdf",
"visibleUrl":"supportdetails.com",
"cacheUrl":"http://www.google.com/search?q\u003dcache:h0LvxrTTKzIJ:supportdetails.com",
"title":"Export PDF - Support Details",
"titleNoFormatting":"Export PDF - Support Details",
"content":"",
"fileFormat":"PDF/Adobe Acrobat"
},
{"GsearchResultClass":"GwebSearch",
"unescapedUrl":"http://www.fws.gov/le/pdf/travelpetbird.pdf",
"url":"http://www.fws.gov/le/pdf/travelpetbird.pdf",
"visibleUrl":"www.fws.gov",
"cacheUrl":"",
"title":"pet bird",
"titleNoFormatting":"pet bird",
"content":"U.S. Fish \u0026amp; Wildlife Service. Traveling Abroad with. Your Pet Bird. The Wild Bird Conservation Act (Act), a significant step in international conservation efforts to \u003cb\u003e...\u003c/b\u003e",
"fileFormat":"PDF/Adobe Acrobat"
}],
"cursor":{"resultCount":"72,800,000",
"pages":[{"start":"0","label":1},
{"start":"4","label":2},
{"start":"8","label":3},
{"start":"12","label":4},
{"start":"16","label":5},
{"start":"20","label":6},
{"start":"24","label":7},
{"start":"28","label":8}],
"estimatedResultCount":"72800000",
"currentPageIndex":0,
"moreResultsUrl":"http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8\u0026source\u003duds\u0026start\u003d0\u0026hl\u003den\u0026q\u003dfiletype:pdf","searchResultTime":"0.04"
}
},
"responseDetails": null,
"responseStatus": 200
}
God that took forever to format
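Google returns JSON, so use the json module instead of the ast module you are using now. JSON is not Python literal syntax: the null in "responseDetails": null, for instance, is not a name that ast.literal_eval accepts, which is why it raises ValueError: malformed string.
You may also want to study the urllib2 module to load the URL response instead of relying on curl.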
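As a rough sketch of that suggestion, the snippet below parses a trimmed copy of the response shown in the question with json.loads and collects the unescapedUrl of each result. The helper names fetch_text and extract_urls are made up for illustration, and the sample string is shortened from the question's output. fetch_text is defined but not called, since it needs network access and the endpoint may not respond the same way for you. The import falls back to urllib.request, which is where urlopen lives on Python 3.

```python
import json

try:
    from urllib2 import urlopen  # Python 2, as used in the question
except ImportError:
    from urllib.request import urlopen  # Python 3 location of urlopen


def fetch_text(url):
    """Hypothetical helper: return the body at `url` as text (needs network)."""
    response = urlopen(url)
    try:
        return response.read().decode("utf-8")
    finally:
        response.close()


def extract_urls(json_text):
    """Hypothetical helper: parse the search response and list result URLs."""
    data = json.loads(json_text)  # JSON null becomes Python None
    return [result["unescapedUrl"] for result in data["responseData"]["results"]]


# Trimmed copy of the response from the question, used instead of a live request.
sample = '''{"responseData": {"results": [
  {"unescapedUrl": "http://www.foundationdb.com/AlphaLicenseAgreement.pdf"},
  {"unescapedUrl": "https://subreg.cz/registration_agreement.pdf"}]},
 "responseDetails": null, "responseStatus": 200}'''

urls = extract_urls(sample)
print(urls)
```

With a live response you would pass the result of fetch_text(...) to extract_urls instead of the sample string, then loop over the returned URLs to download each file.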