Hi guys!
I'm still discovering Twisted, and I've written the script below to parse the contents of an HTML table into Excel. It works well! My question is: how can I do the same for a single web page (http://bandscore.ielts.org/), sending many POST requests so that I can fetch all the results, parse them with BeautifulSoup, and write them to Excel?
Parsing the source and writing it to Excel is fine, but I don't know how to make a POST request with Twisted in order to implement that.
This is the script I use to parse (with Twisted) a lot of different pages
(I want to write the same script, but with many different POST payloads sent to the same page rather than many different pages):
from twisted.web import client
from twisted.internet import reactor, defer
from bs4 import BeautifulSoup
import time
import xlwt
start = time.time()
wb = xlwt.Workbook(encoding='utf-8')
ws = wb.add_sheet("BULATS_IA_PARSED")
x = 0  # current spreadsheet row, incremented in finish()
Countries_List = ['Afghanistan','Armenia','Brazil','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']
urls = ["http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % Countries for Countries in Countries_List]
def finish(results):
    global x
    for result in results:
        print 'GOT PAGE', len(result), 'bytes'
        soup = BeautifulSoup(result)
        tableau = soup.findAll('table')
        try:
            rows = tableau[3].findAll('tr')
            print("Fetching")
            for tr in rows:
                cols = tr.findAll('td')
                y = 0
                x = x + 1
                for td in cols:
                    texte_bu = td.text
                    texte_bu = texte_bu.encode('utf-8')
                    #print("Writing...")
                    #print texte_bu
                    ws.write(x,y,td.text)
                    y = y + 1
        except(IndexError):
            print("No IA for this country")
            pass
    reactor.stop()
waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish)
reactor.run()
wb.save("IALOL.xls")
print "Elapsed Time: %s" % (time.time() - start)
Thank you very much in advance for your help!
You have two options. Keep using getPage and tell it to POST instead of GET, or use Agent.
The API documentation for getPage directs you to the API documentation for HTTPClientFactory to discover additional supported options.
The latter API documentation explicitly covers method and implies (but does a bad job of explaining) postdata. So, to make a POST with getPage, pass method='POST' and supply the encoded request body as postdata.
There is a howto-style document for Agent (linked from the overall web howto documentation index). This gives examples of sending a request with a body (i.e., see the FileBodyProducer example).
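For the Agent route, here is a minimal sketch of sending many POST requests to one URL and gathering the response bodies. It is written for Python 3 and a recent Twisted, not the Python 2 style of the script in the question. The helper names build_postdata and post_all are made up for illustration, and the form field names the site expects are not known here, so any payload you pass is an assumption you would need to check against the page's form.

```python
from io import BytesIO
from urllib.parse import urlencode


def build_postdata(fields):
    """Encode a dict of form fields as a URL-encoded request body (bytes)."""
    return urlencode(fields).encode("utf-8")


def post_all(url, payloads):
    """POST every payload to the same url (bytes) concurrently.

    Returns a Deferred that fires with the list of response bodies,
    in the same order as payloads.
    """
    # Twisted is imported here so build_postdata stays usable without it.
    from twisted.internet import defer, reactor
    from twisted.web.client import Agent, FileBodyProducer, readBody
    from twisted.web.http_headers import Headers

    agent = Agent(reactor)

    def post(fields):
        # Fresh headers and body producer for each request.
        headers = Headers(
            {b"Content-Type": [b"application/x-www-form-urlencoded"]}
        )
        body = FileBodyProducer(BytesIO(build_postdata(fields)))
        d = agent.request(b"POST", url, headers, body)
        d.addCallback(readBody)  # read the full response body as bytes
        return d

    return defer.gatherResults([post(fields) for fields in payloads])
```

Because post_all returns the same kind of Deferred as defer.gatherResults(waiting) in the question, a callback like finish can be attached to it with addCallback before calling reactor.run(). The url argument must be bytes, for example b"http://bandscore.ielts.org/".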