I have a simple Python script that uses BeautifulSoup to find a section of the HTML tree. For example, to find everything inside the <div id="doctext"> tags, the script does this:
html_section = str(soup.find("div", id="doctext"))
I would like to be able to make the arguments to find() vary, however, according to strings given in an input file. For example, a user could feed the script a URL followed by a string like "div", id="doctext", and the script would adjust the find accordingly. Imagine that the input file looks like this:
http://www.example.com | "div", id="doctext"
The script splits the line to get the URL, which works fine, but I want it to also grab the arguments. For example:
vars = line.split(' | ')
html = urllib2.urlopen(vars[0]).read()
soup = BeautifulSoup(html)
args = vars[1].split()
html_section = str(soup.find(*args))
This doesn’t work—and probably doesn’t make sense as I’ve been trying multiple ways to do this. How do I get the string provided by the input file and prepare it into the right syntax for the soup.find() function?
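You could parse line by splitting off the URL at the ' | ' separator and then splitting the remaining text on commas: the first item is the tag name, and each following key=value item becomes an entry in a dict of keyword arguments. For an argument string such as div, id=doctext, this yields the name div and the dict {'id': 'doctext'}.
Then you could call soup.find with the name as the positional argument and the dict unpacked as keyword arguments, as in soup.find(name, **params).
WARNING: Note that if doctext (or some other keyword argument) contains a comma, then splitting on commas will split the parameters in the wrong place. This problem might arise if you are searching for some text content that contains a comma, so a more robust format is worth considering.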
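The comma-splitting approach and its failure case can be sketched as follows. This is only an illustration, assuming the input line drops the quotes and looks like http://www.example.com | div, id=doctext; the helper name parse_find_args and the variable names are hypothetical.

```python
def parse_find_args(line):
    """Split 'URL | name, key=value, ...' into (url, name, params)."""
    url, arg_string = line.split(' | ', 1)
    parts = [part.strip() for part in arg_string.split(',')]
    name = parts[0]
    params = {}
    for part in parts[1:]:
        # Split on the first '=' only, so values may contain '='.
        key, value = part.split('=', 1)
        params[key.strip()] = value.strip()
    return url.strip(), name, params


url, name, params = parse_find_args('http://www.example.com | div, id=doctext')
print(name)    # div
print(params)  # {'id': 'doctext'}

# The failure mode: a comma inside a value is treated as a separator,
# so 'world' is parsed as a separate (and invalid) key=value item.
try:
    parse_find_args('http://www.example.com | div, text=Hello, world')
except ValueError as error:
    print('bad split:', error)
```

With a parsed soup object, the call would then be soup.find(name, **params).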
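To avoid the problem described above, you might consider using the JSON format for the arguments. If the part of line after the separator is valid JSON, then you could parse it with json.loads, which yields ordinary Python lists, dicts, and strings, so commas inside quoted values are preserved.
Then you could call soup.find with the positional arguments unpacked with * and the keyword arguments unpacked with **. An added advantage is that you can supply any number of soup.find's positional arguments (for name, attrs, recursive, and text), not just the name.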
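A minimal sketch of the JSON approach follows. The line layout is an assumption made for illustration: after the separator comes a two-element JSON array holding a list of positional arguments and an object of keyword arguments. The helper name parse_json_args is hypothetical as well.

```python
import json


def parse_json_args(line):
    """Split 'URL | [[positional...], {keyword...}]' into its parts."""
    url, arg_string = line.split('|', 1)
    # Assumed layout: a two-element JSON array holding a list of
    # positional arguments followed by an object of keyword arguments.
    args, params = json.loads(arg_string)
    return url.strip(), args, params


line = 'http://www.example.com | [["div"], {"id": "doctext"}]'
url, args, params = parse_json_args(line)
print(args)    # ['div']
print(params)  # {'id': 'doctext'}

# A comma inside a quoted value no longer breaks the parse.
tricky = 'http://www.example.com | [["p"], {"text": "Hello, world"}]'
print(parse_json_args(tricky)[2])  # {'text': 'Hello, world'}
```

With a parsed soup object, the call would then be soup.find(*args, **params). Keeping positional and keyword arguments in separate JSON containers means a dict intended for attrs can still be passed positionally.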