Sorry for the stupid question; I just started using Python (but I love it).
The problem:
I want to scrape data from the Center for Documentation of Violations in Syria. Currently I'm using the scraper below to collect the data. The problem is that I can access only one row instead of scraping all rows from the table.
The preferred output should have these columns:
name status sex province area dateofdeath causeofdeath
import urllib2
from BeautifulSoup import BeautifulSoup
f = open('syriawar.tsv', 'w')
f.write("Row" + "\t" + "Data" + "\n")
for x in range(0, 249):
    syria = "file" + "\t" + str(x)
    print "fetching data ... " + syria
    url = 'http://vdc-sy.org/index.php/en/martyrs/' + str(x) + '/c29ydGJ5PWEua2lsbGVkX2RhdGV8c29ydGRpcj1ERVNDfGFwcHJvdmVkPXZpc2libGV8c2hvdz0xfGV4dHJhZGlzcGxheT0wfA=='
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    sentence = soup.findAll('tr')[3].text
    words = sentence
    Data = str(words)
    f.write(str(x) + "\t" + Data + "\n")
f.close()
You need another layer of iteration. Your script calls findAll('tr')[3], which picks out a single row on each page. Instead, call findAll('tr') to get all the rows, drop the rows that are headers or empty, then loop through the remaining rows and call .text on each one to get its text. Write each row to the file from within that inner loop.
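The inner loop described above can be sketched roughly as follows. This is only an illustration under some assumptions: it uses the newer bs4 package on Python 3 (where findAll is spelled find_all) rather than the BeautifulSoup 3 import from the question, and SAMPLE_HTML, extract_rows and the placeholder cell values are made up here so the snippet runs without fetching anything. The real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one fetched page; the real table layout may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Status</th><th>Sex</th></tr>
  <tr><td></td><td></td><td></td></tr>
  <tr><td>name-1</td><td>status-1</td><td>sex-1</td></tr>
  <tr><td>name-2</td><td>status-2</td><td>sex-2</td></tr>
</table>
"""

def extract_rows(html):
    """Return one list of cell strings per non-header, non-empty table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):      # inner layer: every row on the page
        cells = tr.find_all("td")       # header rows use <th>, so they yield no cells
        values = [td.get_text(strip=True) for td in cells]
        if any(values):                 # skip rows whose cells are all empty
            rows.append(values)
    return rows

for row in extract_rows(SAMPLE_HTML):
    print("\t".join(row))
```

In the scraper from the question, the call to extract_rows would sit inside the existing `for x in range(0, 249)` loop, once per fetched page.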
Here is the fixed script. Note that the utf-8 codec had to be used because the page text contains non-ASCII Unicode characters, and calling str() on such text fails in Python 2. You should verify that this is getting everything you want. The empty tags were causing Beautiful Soup some problems.
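To illustrate the encoding point, here is a minimal sketch of writing rows to the TSV file with an explicit utf-8 encoding. It assumes Python 3, where open() accepts an encoding argument. The write_tsv helper, the COLUMNS list (taken from the preferred output in the question) and the sample row are hypothetical names and placeholder data, not part of the original script.

```python
import os
import tempfile

# Column names taken from the preferred output in the question.
COLUMNS = ["name", "status", "sex", "province", "area", "dateofdeath", "causeofdeath"]

def write_tsv(path, rows):
    """Write a header line plus one tab-separated line per row, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("\t".join(COLUMNS) + "\n")
        for row in rows:
            f.write("\t".join(row) + "\n")

# Hypothetical row with one non-ASCII (Arabic) cell, standing in for scraped text.
sample_rows = [
    ["name-1", "status-1", "sex-1", "\u062f\u0645\u0634\u0642", "area-1", "date-1", "cause-1"],
]
out_path = os.path.join(tempfile.mkdtemp(), "syriawar.tsv")
write_tsv(out_path, sample_rows)
```

Passing the encoding when the file is opened means each cell can be written as text, with no str() conversion that could trip over non-ASCII characters.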
Another spiffy way to do this is to use Scrapemark. It works great for tables and lists.