The Archive Base

Editorial Team
Asked: June 14, 2026

I have an HTML document and I would like to find the HTML element

I have an HTML document and I would like to find the HTML element which is the closest wrapper to the largest cluster of mentions of a given word.

With the following HTML:

<body>
    <p>
        Hello <b>foo</b>, I like foo, because foo is the best.
    </p>
    <div>
        <blockquote>
            <p><strong>Foo</strong> said: foo foo!</p>
            <p>Smurfs ate the last foo and turned blue. Foo!</p>
            <p>Foo foo.</p>
        </blockquote>
    </div>
</body>

I would like to have a function

find_largest_cluster_wrapper(html, word='foo')

…which would parse the DOM tree and return the <blockquote> element, because it contains the highest density of foo mentions and is their closest wrapper.

The first <p> contains foo 3 times and its <b> only once; the inner <p>s contain foo 3 times, twice, and twice again, and the <strong> only once. The <blockquote>, however, contains foo 7 times. So does the <div>, but it is not the closest wrapper. The <body> element has the highest number of mentions, but its cluster is too sparse.
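To make the counts above easy to check, here is a minimal sketch that tallies the mentions under every element. It assumes a well-formed copy of the sample (the first <p> closed with </p>) so that the standard library's xml.etree.ElementTree can parse it; a real page would need an HTML parser such as beautifulsoup4 or lxml. The names SAMPLE, count_mentions, and mentions_per_element are illustrative only.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample markup (assumption: first <p> is closed),
# so the standard-library XML parser can read it.
SAMPLE = """<body>
    <p>
        Hello <b>foo</b>, I like foo, because foo is the best.
    </p>
    <div>
        <blockquote>
            <p><strong>Foo</strong> said: foo foo!</p>
            <p>Smurfs ate the last foo and turned blue. Foo!</p>
            <p>Foo foo.</p>
        </blockquote>
    </div>
</body>"""


def count_mentions(element, word):
    """Case-insensitive substring count over all text under `element`."""
    text = "".join(element.itertext()).lower()
    return text.count(word.lower())


def mentions_per_element(markup, word):
    """Return (tag, count) pairs for every element, in document order."""
    root = ET.fromstring(markup)
    return [(el.tag, count_mentions(el, word)) for el in root.iter()]


for tag, count in mentions_per_element(SAMPLE, "foo"):
    print(tag, count)
```

The raw counts only grow toward the root, which is why a plain "most mentions" rule always ends up at <body> or <html>.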

A straightforward implementation without clustering would always give me <html> or <body> or something similar, because such elements always have the largest number of the requested mentions and are probably the closest wrapper to all of them. However, I need something that picks the largest cluster, as I am interested only in the part of the web page with the highest density of the word.

I am not too concerned about the parsing part; it could be handled well by beautifulsoup4 or other libraries. I am wondering about an efficient algorithm to do the clustering. I googled for a while and I think the clustering package in scipy could be helpful, but I have no idea how to use it. Could anyone recommend the best solution and point me in the right direction? Examples would be totally awesome.


Well, it would be difficult to answer such a question in general, because the conditions are, as you pointed out, vague. So, more specifically:

Typically, the document will probably contain only one such cluster. My intention is to find that cluster and get its wrapper so I can manipulate it. The word could also be mentioned elsewhere on the page, but I am looking for a notable cluster of such words. If there are two or more notable clusters, I have to use an external bias to decide (examine headers, the title of the page, etc.). What does it mean that a cluster is notable? It means precisely what I just described: that there are no “serious” competitors. Whether a competitor is serious could be expressed as a number (ratio), e.g. if there is a cluster of 10 and a cluster of 2, the difference would be 80%. I could say that a cluster whose difference from its competitors is larger than 50% is the notable one. That means that for a cluster of 5 and another of 5, the function would return None (could not decide).
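The "notable cluster" rule can be sketched roughly as follows. This is only one reading of the description: the difference is taken as (largest - runner-up) / largest, which gives 80% for 10 versus 2. The function name pick_notable, the min_difference parameter, and the cluster labels are hypothetical.

```python
def pick_notable(cluster_sizes, min_difference=0.5):
    """Return the label of the notable cluster, or None if undecided.

    `cluster_sizes` maps a (hypothetical) cluster label to its mention count.
    """
    ranked = sorted(cluster_sizes.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None  # nothing to choose from
    if len(ranked) == 1:
        return ranked[0][0]  # a single cluster has no competitor
    (best_label, best), (_, runner_up) = ranked[0], ranked[1]
    # Relative gap between the largest cluster and its closest competitor.
    difference = (best - runner_up) / best
    return best_label if difference > min_difference else None


print(pick_notable({"blockquote": 10, "sidebar": 2}))  # gap of 80%
print(pick_notable({"blockquote": 5, "sidebar": 5}))   # gap of 0%
```

With these inputs the first call returns "blockquote" and the second returns None, matching the 10-versus-2 and 5-versus-5 cases described above.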

1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 4:00 am

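    So here is an approach: score every node recursively and keep the node with the highest score.

    |fitness(node, word) = (count of word in all text under node
    |                       + sum(fitness(child, word) for child in children))
    |                      / count of all elements in the node's subtree, node included
    |for a leaf this reduces to the count of word in the node's text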
    

    Here is an implementation using lxml:

    import lxml.html
    
    html = """<html><body>
        <p>
            Hello <b>foo</b>, I like foo, because foo is the best.
        </p>
        <div>
            <blockquote>
                <p><strong>Foo</strong> said: foo foo!</p>
                <p>Smurfs ate the last foo and turned blue. Foo!</p>
                <p>Foo foo.</p>
            </blockquote>
        </div>
    </body></html>"""
    
    root = lxml.html.fromstring(html)
    
    def suitability(node, word):
        """Return the element with the highest score for `word`, or None."""
        mx = [0.0, None]  # [best score so far, best node so far]
        _suitability(node, word, mx)
        return mx[1]
    
    def _suitability(node, word, mx):
        """Score `node`; return (score, number of elements in its subtree)."""
        sparsity = 1  # counts the node itself
        # text_content() includes the text of all descendants
        result = float(node.text_content().lower().count(word))
        for child in node:
            res, spars = _suitability(child, word, mx)
            result += res
            sparsity += spars
        result /= sparsity
        # Strict comparison: on a tie the node visited first is kept
        if mx[0] < result:
            mx[0] = result
            mx[1] = node
        return result, sparsity
    
    print(suitability(root, 'foo'))
    

    This returns the blockquote element as the fittest. By adjusting the score function you can change which kind of cluster is preferred.
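As one illustration of adjusting the score function, the sketch below mirrors the recursive score from the answer but swaps lxml for the standard library's xml.etree.ElementTree (so it needs a well-formed copy of the sample) and adds a hypothetical exponent parameter on the sparsity term. The names best_wrapper and exponent are assumptions made for this example, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample markup (assumption: first <p> is closed).
SAMPLE = """<body>
    <p>
        Hello <b>foo</b>, I like foo, because foo is the best.
    </p>
    <div>
        <blockquote>
            <p><strong>Foo</strong> said: foo foo!</p>
            <p>Smurfs ate the last foo and turned blue. Foo!</p>
            <p>Foo foo.</p>
        </blockquote>
    </div>
</body>"""


def best_wrapper(root, word, exponent=1.0):
    """Return the element with the highest score, or None if word is absent.

    exponent=1.0 reproduces the division by subtree size used in the answer;
    smaller values penalise large, sparse subtrees less.
    """
    best = {"score": 0.0, "node": None}

    def visit(node):
        # Mentions in all text under the node, plus the children's scores.
        total = float("".join(node.itertext()).lower().count(word.lower()))
        size = 1  # number of elements in the subtree, this node included
        for child in node:
            child_score, child_size = visit(child)
            total += child_score
            size += child_size
        score = total / size ** exponent
        if score > best["score"]:
            best["score"], best["node"] = score, node
        return score, size

    visit(root)
    return best["node"]


root = ET.fromstring(SAMPLE)
print(best_wrapper(root, "foo").tag)                # blockquote
print(best_wrapper(root, "foo", exponent=0.0).tag)  # body
```

With exponent=1.0 the blockquote scores (7 + 2 + 2 + 2) / 5 = 2.6 and wins; with exponent=0.0 the sparsity penalty disappears and the outermost element wins, which is the behaviour the question wanted to avoid.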
