Assume the following is a subset of an HTML document. The document contains several tables with this same structure; only the value in <a name="1"> changes (“2”, “3”, “4”, etc.), along with the text in each table.
<table align="center" width="550">
<tr>
<td valign="top" width="300"><b>Product:</b></img></td>
<td>
<a name="1"></a>1) Text Editor
<p>An application for the editing of text files.</p>
<br>
<b>Application Name: Notepad</b>
<br>
<b>Type: Writing</b>
<br><br></td>
</tr>
</table>
I want to find the “a” tag whose “name” attribute equals a particular number (in this case, 1)
and then get the text that follows it: “1) Text Editor”.
I know that if I parse the whole document with BeautifulSoup I can use something like findAll("table") to get all the tables, but I do not know how to get from there to that value. I may be able to do something like findAll("a"), but how would I require the “name” attribute to equal a given value (1 in this case)? Even if I could do that, I wouldn't be able to get to “1) Text Editor”, since the “a” tag itself is empty, and I also couldn't get to things like the “<b>Application Name: Notepad</b>” part.
What is the best way, using Python and BeautifulSoup (or something better, if it exists), to get the “1) Text Editor”, “Application Name” and “Type” parts of the table, given that a <a name="1"></a> precedes them? Sample syntax would be great.
You can specify attributes with `findAll` … and then get the next node … and the next `<b>` element … and so on.
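The steps above could look roughly like the following. This is a sketch, not the original answer's code: it assumes the `bs4` package (where `find`/`find_all` are the current spellings of `findAll`), the standard-library `html.parser`, and a hypothetical helper named `product_info`. It also assumes every table follows the layout in the question, with the title as the text node right after the empty `<a>` tag.

```python
from bs4 import BeautifulSoup

# Sample table from the question (the stray </img> is omitted).
html = """
<table align="center" width="550">
<tr>
<td valign="top" width="300"><b>Product:</b></td>
<td>
<a name="1"></a>1) Text Editor
<p>An application for the editing of text files.</p>
<br>
<b>Application Name: Notepad</b>
<br>
<b>Type: Writing</b>
<br><br></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def product_info(soup, number):
    """Hypothetical helper: return (title, application line, type line)."""
    # The HTML attribute called "name" has to go through attrs, because
    # find()'s own first parameter (the tag name) is also called `name`.
    anchor = soup.find("a", attrs={"name": str(number)})
    if anchor is None:
        return None

    # The <a> tag is empty; the title is the text node that follows it.
    title = anchor.next_sibling.strip()

    # Walk forward in document order to the next two <b> elements.
    app_b = anchor.find_next("b")
    type_b = app_b.find_next("b")
    return title, app_b.get_text(), type_b.get_text()


print(product_info(soup, 1))
```

Because `find_next` searches forward through the whole document, a table that is missing one of its `<b>` lines would pick up a `<b>` from a later table. If that can happen, `anchor.find_next_siblings("b")` keeps the search inside the same `<td>`.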
By the way, the `attrs` argument is only necessary because `name` is a special argument to `findAll()`. If it had been some other attribute, you could have used e.g. …
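To illustrate that difference, here is a small sketch assuming the `bs4` package and a shortened version of the question's markup. Filtering on `width` is only an example of an ordinary attribute, not something taken from the original answer.

```python
from bs4 import BeautifulSoup

# Shortened, assumed version of the markup from the question.
html = (
    '<table><tr>'
    '<td valign="top" width="300"><b>Product:</b></td>'
    '<td><a name="1"></a>1) Text Editor</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# An ordinary attribute can be passed directly as a keyword argument.
cells = soup.find_all("td", width="300")

# "name" clashes with find_all()'s first parameter (the tag name),
# so that attribute has to be passed through the attrs dictionary.
anchors = soup.find_all("a", attrs={"name": "1"})

print(cells[0].get_text())
print(len(anchors))
```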