I am trying to process some files that are named .xls and can be opened in Excel, but they are actually web archive files. There are some nested tables, and I want to work first with only the non-nested tables. I thought I could catch the non-nested tables by looking only for those tables whose parent element is the body tag, but table.getparent().tag == 'body' is not true for any of my tables. Even for the table snippet below, the parent element of that particular table is a div tag:
<html>
<head>
<META http-equiv=3DContent-Type content=3D'text/html; charset=utf-8'><script type=3Dtext/javascript src=3DShow.js>/* Do Not Remove This Comment */</script></head>
<body>
<table class=3Dreport id=3DID0EI>
<tr>
<th>
I checked, and the body tag is closed, as is the table tag. Calling
table.getparent()
returns
<Element div at 9f05f10>
Note: I am getting my tables by reading in the document as a string and following these general steps:
from lxml import html

myTree = html.fromstring(someString)
tables = myTree.cssselect('table')
XPath to the rescue
There is probably some fancy XPath (that some SO smarty will post) to do it, but this should be super fast (and easy to read).
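The code that originally went with this answer is not shown here, so what follows is only a minimal sketch of the idea, not necessarily the answerer's exact code. Instead of testing the parent's tag (which, per the question, comes back as div), it keeps every table that has no table ancestor. The sample markup and the helper name `non_nested_tables` are made up for illustration.

```python
from lxml import html

# Hypothetical sample: "outer" contains "inner"; "second" stands alone.
SAMPLE = """
<html><body>
<table id="outer"><tr><td>
  <table id="inner"><tr><td>x</td></tr></table>
</td></tr></table>
<table id="second"><tr><td>y</td></tr></table>
</body></html>
"""

def non_nested_tables(tree):
    """Return the tables in `tree` that have no <table> ancestor."""
    return [t for t in tree.iter('table') if not t.xpath('ancestor::table')]

tree = html.fromstring(SAMPLE)
top_level_ids = [t.get('id') for t in non_nested_tables(tree)]
print(top_level_ids)  # ['outer', 'second']

# The "fancy" single-expression variant of the same filter.
one_shot = tree.xpath('//table[not(ancestor::table)]')
```

Both variants ignore what the immediate parent is called, so they should keep working whether the parser hands back a body or a div around the tables.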
Update
CSS version, same idea.