I have loaded HTML into pyqt and would like to create a list of


Editorial Team

Asked: May 30, 2026 (2026-05-30T03:28:41+00:00)


I have loaded HTML into pyqt and would like to create a list of all the content on the page.

I then need to be able to get the position of the text, using .geometry()

I would like a list of objects, where the following would be possible:

for i in list_of_content_in_html:
    print i.toPlainText(), i.geometry() #prints the text, and the position.

In case I am unclear: by “contents” I mean the text the web user sees in the browser. In the HTML below, the contents are
‘c’, ‘r1 c1’, ‘r1 c2’, ‘row2 c2’ and ‘more contents’.

c
<table border="1">
<tr>
<td>r1 c1</td>
<td>r1 c2</td>
</tr>
<tr>
<td></td>
<td>row2 c2</td>
</tr>
</table>
more contents


1 Answer

Editorial Team · Answer 1 · 2026-05-30T03:28:42+00:00

This doesn’t seem to be possible using QtWebKit alone for pages like this one, which nest elements but don’t wrap the text outside the table in <p>...</p>. As a result, c and more contents don’t get QWebElements of their own; they appear only in the BODY-level block. One solution is to run the page through a parser first. Simply traversing the children of the current frame’s documentElement yields the following elements:

# position in element tree, bounding box, tag, text:
(0, 0) [0, 0, 75, 165] HTML - u'c\nr1 c1\tr1 c2\nrow2 c2\nmore contents'
(1, 1) [8, 8, 67, 157] BODY - u'c\nr1 c1\tr1 c2\nrow2 c2\nmore contents'
(2, 0) [8, 27, 75, 119] TABLE - u'r1 c1\tr1 c2\nrow2 c2'
(3, 0) [9, 28, 74, 118] TBODY - u'r1 c1\tr1 c2\nrow2 c2'
(4, 0) [9, 30, 74, 72] TR - u'r1 c1\tr1 c2'
(5, 0) [11, 30, 32, 72] TD - u'r1 c1'
(5, 1) [34, 30, 72, 72] TD - u'r1 c2'
(4, 1) [9, 74, 74, 116] TR - u'row2 c2'
(5, 1) [34, 74, 72, 116] TD - u'row2 c2'

Code for this:

import sys
from PySide.QtCore import *
from PySide.QtGui import *
from PySide.QtWebKit import *

class WebPage(QObject):
    finished = Signal()
    def __init__(self, data, parent=None):
        super(WebPage, self).__init__(parent)
        self.output = []
        self.data = data
        self.page = QWebPage()
        self.page.loadFinished.connect(self.process)

    def start(self):
        self.page.mainFrame().setHtml(self.data)

    @Slot(bool)
    def process(self, something=False):
        self.page.setViewportSize(self.page.mainFrame().contentsSize())
        frame = self.page.currentFrame()
        elem = frame.documentElement()
        self.gather_info(elem)
        self.finished.emit()

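    def gather_info(self, elem, i=0):
        # Depth-first walk: i is the depth in the element tree,
        # cnt is the index among the siblings at that depth.
        if i > 200:  # guard against runaway recursion
            return
        cnt = 0
        while cnt < 100:  # visit at most 100 siblings per level
            s = elem.toPlainText()
            rect = elem.geometry()
            name = elem.tagName()
            # bounding box as [left, top, right, bottom]
            dim = [rect.x(), rect.y(),
                   rect.x() + rect.width(), rect.y() + rect.height()]
            if s:  # skip elements without visible text
                self.output.append(dict(pos=(i, cnt), dim=dim, tag=name, text=s))
            child = elem.firstChild()
            if not child.isNull():
                self.gather_info(child, i + 1)
            elem = elem.nextSibling()
            if elem.isNull():
                break
            cnt += 1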

webpage = None

def print_strings():
    for s in webpage.output:
        print s['pos'], s['dim'], s['tag'], '-',  repr(s['text'])

if __name__ == '__main__':
    app = QApplication(sys.argv)
    data = open(sys.argv[1]).read()
    webpage = WebPage(data)
    webpage.finished.connect(print_strings)
    webpage.start()


A different approach

The right course of action depends on what you want to achieve. You can get all the strings from the QWebPage using webpage.currentFrame().documentElement().toPlainText(), but that returns the whole page as a single string, with no positioning information for the individual tags. Walking the QWebElement tree gives you the positions, but it has the drawbacks mentioned above.
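
To see that drawback outside of Qt, here is a minimal sketch that lists each text run together with its innermost enclosing tag. It uses Python 3's standard html.parser instead of QtWebKit, and the names TextNodeLister and list_text_nodes are made up for this illustration. The loose text has no enclosing tag of its own in the source, which is why a browser can only attach it to BODY:

```python
from html.parser import HTMLParser

HTML = """c
<table border="1">
<tr>
<td>r1 c1</td>
<td>r1 c2</td>
</tr>
<tr>
<td></td>
<td>row2 c2</td>
</tr>
</table>
more contents
"""


class TextNodeLister(HTMLParser):
    """Collects (innermost enclosing tag or None, stripped text) pairs."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # collected (parent_tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close everything up to and including the matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs between tags
            parent = self.stack[-1] if self.stack else None
            self.nodes.append((parent, text))


def list_text_nodes(markup):
    parser = TextNodeLister()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end
    return parser.nodes


nodes = list_text_nodes(HTML)
for parent, text in nodes:
    print(parent, repr(text))
```

Only the three table cells report td as their parent; c and more contents report None, so there is no element whose geometry() describes just that text.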

If you really want to know the position of all the text, the only accurate way to do this (other than rendering the page and using OCR) is to break the text into characters and save their individual bounding boxes. Here’s how I did it:

First I parsed the page with BeautifulSoup4 and wrapped every non-whitespace text character X in <span class="Nd92KSx3u2">X</span>. Then I ran a PyQt script (actually a PySide script) that loads the altered page, looks the characters up using findAllElements('span[class="Nd92KSx3u2"]') and prints them with their bounding boxes.

parser.py:

import sys, re
from bs4 import BeautifulSoup, element
magical_class = "Nd92KSx3u2"
restricted_tags="title script object embed".split()
re_my_span = re.compile(r'&lt;span class="%s"&gt;(.+?)&lt;/span&gt;' % magical_class)

def no_nl(s): return str(s).replace("\r", "").replace("\n", " ")

if len(sys.argv) != 3:
    print "Usage: %s <input_html_file> <output_html_file>" % sys.argv[0]
    sys.exit(1)

def process(elem):
    for x in elem.children:
        if isinstance(x, element.Comment): continue
        if isinstance(x, element.Tag):
            if x.name in restricted_tags:
                continue
        if isinstance(x, element.NavigableString):
            if not len(no_nl(x.string).strip()):
                continue  # it's just empty space
            print '[', no_nl(x.string).strip(), ']',  # debug output of found strings
            s = ""
            for c in x.string:
                if c in (' ', '\r', '\n', '\t'): s += c
                else: s += '<span class="%s">%s</span>' % (magical_class, c)
            x.replace_with(s)
            continue
        process(x)

soup = BeautifulSoup(open(sys.argv[1]))
process(soup)
output = re_my_span.sub(r'<span class="%s">\1</span>' % magical_class, str(soup))
with open(sys.argv[2], 'w') as f:
    f.write(output)
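
The least obvious part of parser.py is the re_my_span substitution. The spans are inserted as plain text, so they come out escaped (&lt;span ...&gt;) when the document is serialised, and the regular expression turns them back into real tags. The sketch below replays that round trip on a single string. It is Python 3, it does not use BeautifulSoup, and html.escape is only my stand-in for what a serialiser does to a text node; the helper names are made up for the illustration:

```python
import re
from html import escape

MAGICAL_CLASS = "Nd92KSx3u2"
# Matches a span that was inserted as text and therefore came out escaped.
SPAN_RE = re.compile(r'&lt;span class="%s"&gt;(.+?)&lt;/span&gt;' % MAGICAL_CLASS)


def wrap_chars(text):
    """Wrap every non-whitespace character in a marker span."""
    out = []
    for c in text:
        if c in " \r\n\t":
            out.append(c)
        else:
            out.append('<span class="%s">%s</span>' % (MAGICAL_CLASS, c))
    return "".join(out)


def serialise_text_node(text):
    """Stand-in (assumption) for a serialiser escaping &, < and > in text."""
    return escape(text, quote=False)


def restore_spans(markup):
    """Turn the escaped marker spans back into real tags."""
    return SPAN_RE.sub(r'<span class="%s">\1</span>' % MAGICAL_CLASS, markup)


wrapped = wrap_chars("a & b")
escaped = serialise_text_node(wrapped)
restored = restore_spans(escaped)
print(escaped)
print(restored)
```

Note that the ampersand stays escaped as &amp; inside its span after the substitution, which is what the browser needs to render it as a single character.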

charpos.py:

import sys
from PySide.QtCore import *
from PySide.QtGui import *
from PySide.QtWebKit import *
magical_class = "Nd92KSx3u2"

class WebPage(QObject):
    def __init__(self, data, parent=None):
        super(WebPage, self).__init__(parent)
        self.output = []
        self.data = data
        self.page = QWebPage()
        self.page.loadFinished.connect(self.process)

    def start(self):
        self.page.mainFrame().setHtml(self.data)

    @Slot(bool)
    def process(self, something=False):
        self.page.setViewportSize(self.page.mainFrame().contentsSize())
        frame = self.page.currentFrame()
        elements = frame.findAllElements('span[class="%s"]' % magical_class)
        for e in elements:
            s = e.toPlainText()
            rect = e.geometry()
            dim = [rect.x(), rect.y(), 
                rect.x() + rect.width(), rect.y() + rect.height()]
            if s and rect.width() > 0 and rect.height() > 0: print dim, s

if __name__ == '__main__':
    app = QApplication(sys.argv)
    data = open(sys.argv[1]).read()
    webpage = WebPage(data)
    webpage.start()

input.html (slightly altered to show more problems with simple string dumping):

a<span>b<span>c</span></span>
<table border="1">
<tr><td>r1 <font>c1</font>  </td><td>r1 c2</td></tr>
<tr><td></td><td>row2 &amp; c2</td></tr>
</table>
more <b>contents</b>

and the test run:

$ python parser.py input.html temp.html
[ a ] [ b ] [ c ] [ r1 ] [ c1 ] [ r1 c2 ] [ row2 & c2 ] [ more ] [ contents ]
$ python charpos.py temp.html
[8, 8, 17, 26] a
[17, 8, 26, 26] b
[26, 8, 34, 26] c
[13, 48, 18, 66] r
[18, 48, 27, 66] 1
[13, 67, 21, 85] c
[21, 67, 30, 85] 1
[36, 48, 41, 66] r
[41, 48, 50, 66] 1
[36, 67, 44, 85] c
[44, 67, 53, 85] 2
[36, 92, 41, 110] r
[41, 92, 50, 110] o
[50, 92, 61, 110] w
[61, 92, 70, 110] 2
[36, 111, 47, 129] &
[51, 111, 59, 129] c
[59, 111, 68, 129] 2
[8, 135, 21, 153] m
[21, 135, 30, 153] o
[30, 135, 35, 153] r
[35, 135, 44, 153] e
[8, 154, 17, 173] c
[17, 154, 27, 173] o
[27, 154, 37, 173] n
[37, 154, 42, 173] t
[42, 154, 51, 173] e
[51, 154, 61, 173] n
[61, 154, 66, 173] t
[66, 154, 75, 173] s

Looking at the bounding boxes, it is (in this simple case without changes in font size and things like subscripts) quite easy to glue them back into words if you wish.
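
A minimal sketch of that gluing step, assuming the characters arrive in document order as ([left, top, right, bottom], char) pairs like the output above. The function name glue_words and the max_gap threshold of 1 pixel are my own choices for this example. A character joins the current word only if it starts on the same top coordinate and begins where the previous box ended:

```python
def glue_words(chars, max_gap=1):
    """Merge per-character boxes into words with a combined bounding box."""
    words = []
    for (x0, y0, x1, y1), ch in chars:
        last = words[-1] if words else None
        same_line = last is not None and last["box"][1] == y0
        if same_line and 0 <= x0 - last["box"][2] <= max_gap:
            # Adjacent on the same line: extend the current word.
            last["text"] += ch
            last["box"][2] = x1
            last["box"][3] = max(last["box"][3], y1)
        else:
            # New line or a visible gap: start a new word.
            words.append({"text": ch, "box": [x0, y0, x1, y1]})
    return words


# A few boxes copied from the test run above.
chars = [
    ([8, 8, 17, 26], "a"), ([17, 8, 26, 26], "b"), ([26, 8, 34, 26], "c"),
    ([36, 92, 41, 110], "r"), ([41, 92, 50, 110], "o"),
    ([50, 92, 61, 110], "w"), ([61, 92, 70, 110], "2"),
    ([36, 111, 47, 129], "&"),
    ([51, 111, 59, 129], "c"), ([59, 111, 68, 129], "2"),
]

words = glue_words(chars)
for w in words:
    print(w["box"], w["text"])
# [8, 8, 34, 26] abc
# [36, 92, 70, 110] row2
# [36, 111, 47, 129] &
# [51, 111, 68, 129] c2
```

Exact comparison of the top coordinate is enough here because every character on a line has the same font; with mixed font sizes or subscripts you would need a tolerance on the vertical position as well.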
