The Archive Base

Editorial Team
Asked: May 30, 2026 (2026-05-30T03:28:41+00:00)

I have loaded HTML into PyQt and would like to create a list of all the content on the page.

I then need to be able to get the position of each piece of text, using .geometry().

I would like a list of objects, where the following would be possible:

for i in list_of_content_in_html:
    print i.toPlainText(), i.geometry() #prints the text, and the position.

In case I am unclear, by “contents” I mean, in the HTML below,
‘c’, ‘r1 c1’, ‘r1 c2’, ‘row2 c2’ and ‘more contents’: the text the web user sees in the browser, basically.

c
<table border="1">
<tr>
<td>r1 c1</td>
<td>r1 c2</td>
</tr>
<tr>
<td></td>
<td>row2 c2</td>
</tr>
</table>
more contents
  1. Editorial Team
    Added an answer on May 30, 2026 at 3:28 am

    This doesn’t seem to be possible with QtWebKit alone for pages like this one, which nest elements but leave the text outside the table unwrapped (no <p>...</p> or any other tag around it). As a result, c and more contents don’t get QWebElements of their own; they appear only in the text of the BODY-level block. One workaround is to run the page through a parser first. Simply traversing the children of the current frame’s documentElement() yields the following elements:

    # position in element tree, bounding box, tag, text:
    (0, 0) [0, 0, 75, 165] HTML - u'c\nr1 c1\tr1 c2\nrow2 c2\nmore contents'
    (1, 1) [8, 8, 67, 157] BODY - u'c\nr1 c1\tr1 c2\nrow2 c2\nmore contents'
    (2, 0) [8, 27, 75, 119] TABLE - u'r1 c1\tr1 c2\nrow2 c2'
    (3, 0) [9, 28, 74, 118] TBODY - u'r1 c1\tr1 c2\nrow2 c2'
    (4, 0) [9, 30, 74, 72] TR - u'r1 c1\tr1 c2'
    (5, 0) [11, 30, 32, 72] TD - u'r1 c1'
    (5, 1) [34, 30, 72, 72] TD - u'r1 c2'
    (4, 1) [9, 74, 74, 116] TR - u'row2 c2'
    (5, 1) [34, 74, 72, 116] TD - u'row2 c2'
    

    Code for this:

    import sys
    from PySide.QtCore import *
    from PySide.QtGui import *
    from PySide.QtWebKit import *
    
    class WebPage(QObject):
        finished = Signal()
        def __init__(self, data, parent=None):
            super(WebPage, self).__init__(parent)
            self.output = []  # one dict per element that has text
            self.data = data
            self.page = QWebPage()
            self.page.loadFinished.connect(self.process)
    
        def start(self):
            self.page.mainFrame().setHtml(self.data)
    
        @Slot(bool)
        def process(self, ok=False):
            # Size the viewport to the page contents before reading geometry.
            self.page.setViewportSize(self.page.mainFrame().contentsSize())
            frame = self.page.currentFrame()
            elem = frame.documentElement()
            self.gather_info(elem)
            self.finished.emit()
    
        def gather_info(self, elem, i=0):
            # i is the depth in the element tree, cnt the index among siblings.
            if i > 200: return  # guard against very deep nesting
            cnt = 0
            while cnt < 100:  # look at no more than 100 siblings per level
                s = elem.toPlainText()
                rect = elem.geometry()
                name = elem.tagName()
                # bounding box as [left, top, right, bottom]
                dim = [rect.x(), rect.y(),
                    rect.x() + rect.width(), rect.y() + rect.height()]
                if s: self.output.append(dict(pos=(i, cnt), dim=dim, tag=name, text=s))
                child = elem.firstChild()
                if not child.isNull():
                    self.gather_info(child, i+1)
                elem = elem.nextSibling()
                if elem.isNull():
                    break
                cnt += 1
    
    webpage = None
    
    def print_strings():
        for s in webpage.output:
            # single-argument print() works the same in Python 2 and 3
            print("%s %s %s - %r" % (s['pos'], s['dim'], s['tag'], s['text']))
    
    if __name__ == '__main__':
        app = QApplication(sys.argv)
        with open(sys.argv[1]) as f:
            data = f.read()
        webpage = WebPage(data)
        webpage.finished.connect(print_strings)
        webpage.start()
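
The point about c and more contents can be checked directly on records shaped like the ones gather_info() collects. The following is only a sketch: the helper name text_without_own_element is made up for illustration, it assumes every record deeper than BODY is a descendant of BODY, and it compares whole lines, so a line that also appears verbatim in a nested element would be missed.

```python
def text_without_own_element(records, container_tag="BODY"):
    """Return lines of the container's text that no deeper element accounts for.

    `records` are dicts with 'pos', 'tag' and 'text' keys, like the entries
    gather_info() appends to self.output (hypothetical helper, not Qt API).
    """
    container = next(r for r in records if r["tag"] == container_tag)
    depth = container["pos"][0]
    covered = set()
    for r in records:
        if r["pos"][0] > depth:  # assumed to be a descendant of the container
            covered.update(r["text"].split("\n"))
    return [line for line in container["text"].split("\n") if line not in covered]


# Two of the records from the listing above.
records = [
    dict(pos=(1, 1), tag="BODY",
         text="c\nr1 c1\tr1 c2\nrow2 c2\nmore contents"),
    dict(pos=(2, 0), tag="TABLE", text="r1 c1\tr1 c2\nrow2 c2"),
]
print(text_without_own_element(records))  # ['c', 'more contents']
```

The two strings it reports are exactly the ones that have no QWebElement, and therefore no geometry(), of their own.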
    



    A different approach

    The right course of action depends on what you want to achieve. You can get all the strings from the QWebPage using webpage.currentFrame().documentElement().toPlainText(), but that returns the whole page as one string, with no positioning information for any of the tags. Walking the QWebElement tree gives you positions, but it has the drawbacks mentioned above.

    If you really want to know the position of all text, the only accurate way to do this (other than rendering the page and using OCR) is to break the text into characters and save their individual bounding boxes. Here’s how I did it:

    First I parsed the page with BeautifulSoup4 and enclosed every non-space text character X in a <span class="Nd92KSx3u2">X</span>. Then I ran a PyQt script (actually a PySide script) that loads the altered page, looks the characters up with findAllElements('span[class="Nd92KSx3u2"]'), and prints them with their bounding boxes.

    parser.py:

    import sys, re
    from bs4 import BeautifulSoup, element
    magical_class = "Nd92KSx3u2"
    restricted_tags="title script object embed".split()
    # replace_with() below inserts plain strings, so the serialized soup holds
    # the new spans with escaped angle brackets; this pattern finds them again.
    re_my_span = re.compile(r'&lt;span class="%s"&gt;(.+?)&lt;/span&gt;' % magical_class)
    
    def no_nl(s): return s.replace("\r", "").replace("\n", " ")
    
    if len(sys.argv) != 3:
        print("Usage: %s <input_html_file> <output_html_file>" % sys.argv[0])
        sys.exit(1)
    
    def process(elem):
        for x in elem.children:
            if isinstance(x, element.Comment): continue
            if isinstance(x, element.Tag):
                if x.name in restricted_tags:
                    continue
            if isinstance(x, element.NavigableString):
                if not len(no_nl(x.string).strip()):
                    continue  # it's just empty space
                # debug output of found strings
                sys.stdout.write('[ %s ] ' % no_nl(x.string).strip())
                s = ""
                for c in x.string:
                    # keep whitespace as is, wrap every other character
                    if c in (' ', '\r', '\n', '\t'): s += c
                    else: s += '<span class="%s">%s</span>' % (magical_class, c)
                x.replace_with(s)
                continue
            process(x)
    
    with open(sys.argv[1]) as f:
        soup = BeautifulSoup(f)
    process(soup)
    sys.stdout.write("\n")
    # turn the escaped spans back into real markup
    output = re_my_span.sub(r'<span class="%s">\1</span>' % magical_class, str(soup))
    with open(sys.argv[2], 'w') as f:
        f.write(output)
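
The re_my_span substitution is the least obvious step in parser.py, so here it is in isolation. This is a sketch using only the standard library: the sample string is hand-written to look like what str(soup) returns after replace_with() has inserted the span text as a plain string (BeautifulSoup escapes <, > and & in text on output), and unescape_spans is a name introduced here for illustration.

```python
import re

MAGICAL_CLASS = "Nd92KSx3u2"
# Same pattern as re_my_span in parser.py: spans whose angle brackets were escaped.
RE_MY_SPAN = re.compile(r'&lt;span class="%s"&gt;(.+?)&lt;/span&gt;' % MAGICAL_CLASS)


def unescape_spans(serialized):
    """Turn the escaped marker spans back into real <span> elements."""
    return RE_MY_SPAN.sub(r'<span class="%s">\1</span>' % MAGICAL_CLASS, serialized)


# Hand-written stand-in for the serialized soup of a cell containing "r &".
serialized = ('<td>&lt;span class="Nd92KSx3u2"&gt;r&lt;/span&gt; '
              '&lt;span class="Nd92KSx3u2"&gt;&amp;&lt;/span&gt;</td>')
print(unescape_spans(serialized))
```

Only the brackets of the marker spans are restored. The wrapped character itself stays escaped (&amp; above), which is what the browser needs in order to render it as a single character.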
    

    charpos.py:

    import sys
    from PySide.QtCore import *
    from PySide.QtGui import *
    from PySide.QtWebKit import *
    magical_class = "Nd92KSx3u2"  # must match the class used in parser.py
    
    class WebPage(QObject):
        def __init__(self, data, parent=None):
            super(WebPage, self).__init__(parent)
            self.output = []
            self.data = data
            self.page = QWebPage()
            self.page.loadFinished.connect(self.process)
    
        def start(self):
            self.page.mainFrame().setHtml(self.data)
    
        @Slot(bool)
        def process(self, ok=False):
            # Size the viewport to the page contents before reading geometry.
            self.page.setViewportSize(self.page.mainFrame().contentsSize())
            frame = self.page.currentFrame()
            # every wrapped character is now an element of its own
            elements = frame.findAllElements('span[class="%s"]' % magical_class)
            for e in elements:
                s = e.toPlainText()
                rect = e.geometry()
                # bounding box as [left, top, right, bottom]
                dim = [rect.x(), rect.y(),
                    rect.x() + rect.width(), rect.y() + rect.height()]
                # skip characters that were not laid out (empty box)
                if s and rect.width() > 0 and rect.height() > 0:
                    print("%s %s" % (dim, s))
    
    if __name__ == '__main__':
        app = QApplication(sys.argv)
        with open(sys.argv[1]) as f:
            data = f.read()
        webpage = WebPage(data)
        webpage.start()
    

    input.html (slightly altered to show more problems with simple string dumping):

    a<span>b<span>c</span></span>
    <table border="1">
    <tr><td>r1 <font>c1</font>  </td><td>r1 c2</td></tr>
    <tr><td></td><td>row2 &amp; c2</td></tr>
    </table>
    more <b>contents</b>
    

    and the test run:

    $ python parser.py input.html temp.html
    [ a ] [ b ] [ c ] [ r1 ] [ c1 ] [ r1 c2 ] [ row2 & c2 ] [ more ] [ contents ]
    $ python charpos.py temp.html
    [8, 8, 17, 26] a
    [17, 8, 26, 26] b
    [26, 8, 34, 26] c
    [13, 48, 18, 66] r
    [18, 48, 27, 66] 1
    [13, 67, 21, 85] c
    [21, 67, 30, 85] 1
    [36, 48, 41, 66] r
    [41, 48, 50, 66] 1
    [36, 67, 44, 85] c
    [44, 67, 53, 85] 2
    [36, 92, 41, 110] r
    [41, 92, 50, 110] o
    [50, 92, 61, 110] w
    [61, 92, 70, 110] 2
    [36, 111, 47, 129] &
    [51, 111, 59, 129] c
    [59, 111, 68, 129] 2
    [8, 135, 21, 153] m
    [21, 135, 30, 153] o
    [30, 135, 35, 153] r
    [35, 135, 44, 153] e
    [8, 154, 17, 173] c
    [17, 154, 27, 173] o
    [27, 154, 37, 173] n
    [37, 154, 42, 173] t
    [42, 154, 51, 173] e
    [51, 154, 61, 173] n
    [61, 154, 66, 173] t
    [66, 154, 75, 173] s
    

    Looking at the bounding boxes, it is (in this simple case without changes in font size and things like subscripts) quite easy to glue them back into words if you wish.
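
As a rough illustration of that gluing step, here is a minimal sketch. It assumes the simple case described above: characters arrive in document order, characters on the same line share the same top coordinate, and a horizontal gap of more than max_gap pixels means a space or a new cell. The function name glue_words and the max_gap threshold are made up for this example.

```python
def glue_words(chars, max_gap=1):
    """Merge per-character boxes into words.

    `chars` is a list of ([x1, y1, x2, y2], character) pairs in document
    order, like the lines printed by charpos.py. Returns a list of
    (word, [x1, y1, x2, y2]) pairs.
    """
    words = []
    for box, ch in chars:
        x1, y1, x2, y2 = box
        if words:
            text, (wx1, wy1, wx2, wy2) = words[-1]
            gap = x1 - wx2
            # same line and (almost) touching the previous character: extend
            if y1 == wy1 and 0 <= gap <= max_gap:
                words[-1] = (text + ch, [wx1, wy1, x2, max(wy2, y2)])
                continue
        words.append((ch, [x1, y1, x2, y2]))
    return words


# The "row2 & c2" cell from the test run above.
chars = [
    ([36, 92, 41, 110], "r"), ([41, 92, 50, 110], "o"),
    ([50, 92, 61, 110], "w"), ([61, 92, 70, 110], "2"),
    ([36, 111, 47, 129], "&"),
    ([51, 111, 59, 129], "c"), ([59, 111, 68, 129], "2"),
]
for word, box in glue_words(chars):
    print(word, box)
```

With mixed font sizes or subscripts the same-top test would have to be relaxed, for example to overlapping vertical ranges.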

© 2021 The Archive Base. All Rights Reserved