I have code that uses the BeautifulSoup library for parsing, but it is very slow. The code is written in such a way that threads cannot be used.
Can anyone help me with this?
I am using BeautifulSoup for parsing and then saving the results into a DB. If I comment out the save statement, it still takes a long time, so the database is not the problem.
    def parse(self,text):
        soup = BeautifulSoup(text)
        arr = soup.findAll('tbody')
        for i in range(0,len(arr)-1):
            data=Data()
            soup2 = BeautifulSoup(str(arr[i]))
            arr2 = soup2.findAll('td')
            c=0
            for j in arr2:
                if str(j).find("<a href=") > 0:
                    data.sourceURL = self.getAttributeValue(str(j),'<a href="')
                else:
                    if c == 2:
                        data.Hits=j.renderContents()
                    #and few others...
                c = c+1
            data.save()
Any suggestions?
Note: I already asked this question here, but it was closed due to incomplete information.
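Don’t re-parse each tbody by converting it to a string and passing it to BeautifulSoup again, as in soup2 = BeautifulSoup(str(arr[i])). Just call arr2 = arr[i].findAll('td') instead.

This will also be slow: converting each cell to a string with str(j) and searching it for "<a href=". Assuming that getAttributeValue gives you the href attribute, find the a tag on the cell and read its href attribute directly instead.

In general, you shouldn’t need to convert the BeautifulSoup object back into a string if all you want to do is parse it and extract values. Since the find and findAll methods give you back searchable objects, you can keep searching by invoking the find/findAll/etc. methods on the results.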
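Putting the advice above together, here is a minimal sketch of what the loop might look like when the document is parsed only once. It rests on several assumptions: it uses the bs4 package, where find_all is the current spelling of findAll and get_text stands in for renderContents; the sample HTML is made up, since the real page is not shown in the question; and a plain dict replaces the Data model so the example runs on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page layout is not shown in the question.
HTML = """
<table>
  <tbody><tr>
    <td><a href="http://example.com/a">A</a></td><td>x</td><td>10</td>
  </tr></tbody>
  <tbody><tr>
    <td><a href="http://example.com/b">B</a></td><td>y</td><td>20</td>
  </tr></tbody>
</table>
"""

def parse(text):
    """Parse the document once, then search the resulting tags directly."""
    soup = BeautifulSoup(text, "html.parser")
    rows = []
    # Note: the original loop skips the last tbody (range(0, len(arr)-1));
    # slice with [:-1] here if that is intended.
    for tbody in soup.find_all("tbody"):
        # A dict stands in for the question's Data() object (assumption).
        row = {"sourceURL": None, "Hits": None}
        for c, td in enumerate(tbody.find_all("td")):
            # Look up the link as a tag instead of searching str(td).
            link = td.find("a", href=True)
            if link is not None:
                row["sourceURL"] = link["href"]
            elif c == 2:
                row["Hits"] = td.get_text(strip=True)
        rows.append(row)
    return rows

rows = parse(HTML)
print(rows)
```

The point is that BeautifulSoup is called exactly once per page. Every later lookup works on tags that are already parsed, so no cell is ever turned back into a string and parsed again.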