I am using BeautifulSoup to scrape an URL and I had the following code, to find the td tag whose class is 'empformbody':
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})
Now in the above code we can use findAll to get tags and information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If possible, please provide me example code.
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it’ll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.
Once you’ve parsed your document into an lxml tree, you can use the
.xpath()method to search for elements.There is also a dedicated
lxml.html()module with additional functionality.Note that in the above example I passed the
responseobject directly tolxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with therequestslibrary, you want to setstream=Trueand pass in theresponse.rawobject after enabling transparent transport decompression:Of possible interest to you is the CSS Selector support; the
CSSSelectorclass translates CSS statements into XPath expressions, making your search fortd.empformbodythat much easier:Coming full circle: BeautifulSoup itself does have very complete CSS selector support: