I have loaded HTML into PyQt and would like to create a list of all the content on the page.
I then need to be able to get the position of the text, using .geometry().
I would like a list of objects, where the following would be possible:
    for i in list_of_content_in_html:
        print i.toPlainText(), i.geometry()  # prints the text and the position
In case I am unclear, by “contents” I mean, for the HTML below,
‘c’, ‘r1 c1’, ‘r1 c2’, ‘row2 c2’ and ‘more contents’: basically, the text the web user sees in the browser.
c
<table border="1">
<tr>
<td>r1 c1</td>
<td>r1 c2</td>
</tr>
<tr>
<td></td>
<td>row2 c2</td>
</tr>
</table>
more contents
This doesn’t seem to be possible using QtWebKit with pages like this one, which nest objects but don’t use <p>...</p> for the other text that sits outside of the table. As a result, c and more contents don’t go into separate QWebElements; they are only to be found in the BODY-level block. As a solution, one could run that page through a parser.
Simply traversing the children of currentFrame().documentElement() confirms this: the loose text appears only as part of the BODY element’s text, never as an element of its own.
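The traversal described above can be sketched roughly as follows. This is a minimal sketch rather than the original listing: the function name walk is made up for illustration, and it relies only on the QWebElement methods tagName(), geometry(), toPlainText(), firstChild(), nextSibling() and isNull(), so it accepts any object that behaves the same way.

```python
def walk(element, depth=0):
    """Yield (depth, tag name, geometry, text) for an element and its descendants.

    `element` is assumed to behave like a QWebElement. Only elements are
    visited, so text that sits directly inside BODY never gets a row of
    its own; it only shows up inside BODY's toPlainText().
    """
    yield depth, element.tagName(), element.geometry(), element.toPlainText()
    child = element.firstChild()
    while not child.isNull():
        # Recurse into the child, then move on to its next sibling.
        yield from walk(child, depth + 1)
        child = child.nextSibling()
```

With a loaded page, one would presumably call it as walk(webpage.currentFrame().documentElement()) and print each row indented by its depth.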
A different approach
The desired course of action depends on what you want to achieve. You can get all the strings from the QWebPage using webpage.currentFrame().documentElement().toPlainText(), but that just returns the whole page as a single string, with no positioning information for the individual tags. Browsing the QWebElement tree gives you the desired information, but it has the drawbacks I mentioned above.
If you really want to know the position of all text, the only accurate way to do this (other than rendering the page and using OCR) is to break the text into characters and save their individual bounding boxes. Here’s how I did it:
First I parsed the page with BeautifulSoup4 and enclosed every non-space text character X in a <span class="Nd92KSx3u2">X</span>. Then I ran a PyQt script (actually a PySide script) which loads the altered page and prints out the characters with their bounding boxes after looking them up using findAllElements('span[class="Nd92KSx3u2"]').
parser.py:
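As a rough illustration of the wrapping step only, here is a minimal sketch that uses the standard library's html.parser in place of BeautifulSoup4, so it is not the parser.py referred to above. The names CharWrapper and wrap_characters are made up for this example, it assumes Python 3, and it makes no attempt to skip the contents of script or style elements or to preserve comments and doctypes.

```python
from html import escape
from html.parser import HTMLParser

MARKER = "Nd92KSx3u2"  # marker class name taken from the text above
SPAN = '<span class="%s">%%s</span>' % MARKER


class CharWrapper(HTMLParser):
    """Re-emit HTML with every non-space text character wrapped in a span."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; separately,
        # so each one can be wrapped as a single visible character.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # copy the tag verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        for ch in data:
            # Whitespace is passed through; everything else gets a span.
            self.out.append(ch if ch.isspace() else SPAN % escape(ch))

    def handle_entityref(self, name):
        self.out.append(SPAN % ("&%s;" % name))

    def handle_charref(self, name):
        self.out.append(SPAN % ("&#%s;" % name))


def wrap_characters(html):
    """Return `html` with each non-space text character wrapped in a span."""
    parser = CharWrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The altered page can then be loaded in QtWebKit and the spans looked up by their class, as described above.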
charpos.py:
input.html (slightly altered to show more problems with simple string dumping):
and the test run:
Looking at the bounding boxes, it is (in this simple case without changes in font size and things like subscripts) quite easy to glue them back into words if you wish.
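The gluing step could look something like the sketch below. It assumes the characters arrive in document order as (char, x, y, width, height) tuples, that all characters on a line share the same y and height (the simple case described above), and that a gap of more than max_gap pixels means a space; the function name glue_words and the max_gap default are illustrative choices, not values from the test run.

```python
def glue_words(chars, max_gap=2):
    """Group (char, x, y, width, height) tuples into word-level tuples.

    A character is appended to the current word when it lies on the same
    row (same y and height) and starts within max_gap pixels of the
    current word's right edge; otherwise it starts a new word.
    """
    words = []
    for ch, x, y, w, h in chars:
        if words:
            text, wx, wy, ww, wh = words[-1]
            if y == wy and h == wh and 0 <= x - (wx + ww) <= max_gap:
                # Extend the word's box to cover the new character.
                words[-1] = (text + ch, wx, wy, x + w - wx, wh)
                continue
        words.append((ch, x, y, w, h))
    return words
```

Each returned tuple has the same shape as the input, with the text and width covering the whole word. Pages with mixed font sizes, subscripts or overlapping glyph boxes would need a looser row test than strict equality.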