I am currently using BeautifulSoup to extract HTML elements and attributes.
I would also like to know the nesting level of each extracted element.
For example:
Sample HTML:
<html>
<head>
<title>Element Attributes Test</title>
</head>
<body>
<div id="abc">
<ol id="def">
<li class="testItem"> <a href="http://testpage.html"></a>
</li>
<li class="testItem"> <table id="testTable">
<tr>
<td>
<div id="testDiv">
</div>
</td>
</tr>
</table>
</li>
</ol>
</div>
</body>
</html>
I would like to get the path to each element, as shown in the Path column below:
----------------------------------
Element | Attribute | Path
----------------------------------
html | None | document
----------------------------------
head | None | html
----------------------------------
title | None | html.head
----------------------------------
body | None | html
----------------------------------
div | id="abc" | html.body
-----------------------------------
ol | id="def" | html.body.div
-----------------------------------
li | class=".."| html.body.div.ol
-----------------------------------
a | href=".." | html.body.div.ol.li
-----------------------------------
li | class=".."| html.body.div.ol
-----------------------------------
table | id="..." | html.body.div.ol.li
-----------------------------------
tr | None | html.body.div.ol.li.table
-----------------------------------
I am able to extract each element and its attributes, but I cannot find a suitable way to get the path to a given element.
How can I extract this path using BeautifulSoup?
Are there any other libraries I could use for this?
Thanks in advance.
One approach is to build the path bottom-up: for each HTML element, walk up through its parents and join their tag names.
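Here is a minimal sketch of that bottom-up walk, assuming BeautifulSoup 4 with the built-in `html.parser` backend. `element_path` is a hypothetical helper name, and the markup is a condensed copy of the sample HTML from the question with the anchor tag closed. It relies on `Tag.parents`, which yields the ancestors from the nearest one up to the top-level soup object (whose name is `[document]`).

```python
from bs4 import BeautifulSoup

# Condensed version of the sample HTML from the question (assumption: <a> is closed).
html = """
<html><head><title>Element Attributes Test</title></head>
<body><div id="abc"><ol id="def">
<li class="testItem"><a href="http://testpage.html"></a></li>
<li class="testItem"><table id="testTable"><tr><td>
<div id="testDiv"></div>
</td></tr></table></li>
</ol></div></body></html>
"""


def element_path(tag):
    """Return the dotted path of ancestor tag names leading to `tag`."""
    # tag.parents goes bottom-up: nearest parent first, soup object last.
    # The soup object itself is named "[document]", so it is skipped here.
    names = [parent.name for parent in tag.parents if parent.name != "[document]"]
    # Reverse to get a top-down path; the root element has no tag ancestors.
    return ".".join(reversed(names)) or "document"


soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag, in document order.
rows = [(tag.name, tag.attrs or None, element_path(tag)) for tag in soup.find_all(True)]

for name, attrs, path in rows:
    print(f"{name} | {attrs} | {path}")
```

If you also need the nesting level as a number, `len(list(tag.parents)) - 1` should give it, with `html` at level 0. Note that a different parser backend may change the tree (for example by inserting elements the source omits), so the paths can differ between parsers.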