I have an HTML document and I would like to find the HTML element which is the closest wrapper to the largest cluster of mentions of a given word.
With the following HTML:
<body>
<p>
Hello <b>foo</b>, I like foo, because foo is the best.
</p>
<div>
<blockquote>
<p><strong>Foo</strong> said: foo foo!</p>
<p>Smurfs ate the last foo and turned blue. Foo!</p>
<p>Foo foo.</p>
</blockquote>
</div>
</body>
I would like to have a function
find_largest_cluster_wrapper(html, word='foo')
…which would parse the DOM tree and return the <blockquote> element, because it contains the densest cluster of foo mentions and is their closest wrapper.
The first <p> contains foo 3 times and its <b> only once; the inner <p>s contain foo 3 times, twice, and twice again, and the <strong> only once. The <blockquote>, however, contains foo 7 times. So does the <div>, but it is not the closest wrapper. The <body> element has the highest number of mentions (10), but it is too sparse a cluster.
A straightforward implementation without clustering would always give me <html> or <body> or something similar, because such elements always contain the largest number of mentions and are usually the closest wrapper to all of them. However, I need something that picks the largest cluster, as I am interested only in the part of the web page with the highest density of the word.
I am not very curious about the parsing part; it can be solved well by beautifulsoup4 or other libraries. I am wondering about an efficient algorithm to do the clustering. I googled for a while and I think the clustering package in scipy could be helpful, but I have no idea how to use it. Could anyone recommend the best solution and point me in the right direction? Examples would be totally awesome.
Well, it would be difficult to answer such a question in general, because the conditions are, as you pointed out, vague. So, more specifically:
Typically, the document will contain only one such cluster. My intention is to find that cluster and get its wrapper so I can manipulate it. The word could also be mentioned elsewhere on the page, but I am looking for a notable cluster of such words. If there are two or more notable clusters, I have to use an external bias to decide (examine headers, the title of the page, etc.). What does it mean for a cluster to be notable? It means precisely what I just described: there are no “serious” competitors. Whether a competitor is serious could be expressed as a number (a ratio), e.g. if there is a cluster of 10 and a cluster of 2, the difference would be 80%. I could say that a cluster whose difference from the runner-up is larger than 50% is the notable one. That means that for a cluster of 5 and another of 5, the function would return None (could not decide).
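The "notable cluster" rule above can be sketched in a few lines. This is only an illustration: the function name `pick_notable`, the `min_difference` parameter, and the choice to measure the difference relative to the largest cluster (which reproduces the 80% figure for 10 versus 2) are assumptions, not an established API.

```python
def pick_notable(cluster_sizes, min_difference=0.5):
    """Return the index of the notable cluster, or None if undecidable.

    Hypothetical helper: a cluster is notable when its relative lead over
    the runner-up, (best - second) / best, is larger than min_difference.
    """
    if not cluster_sizes:
        return None
    # Indices ordered from the largest cluster to the smallest.
    ranked = sorted(range(len(cluster_sizes)),
                    key=lambda i: cluster_sizes[i], reverse=True)
    best = ranked[0]
    if cluster_sizes[best] <= 0:
        return None
    if len(ranked) == 1:
        return best  # no competitors at all
    runner_up = ranked[1]
    difference = (cluster_sizes[best] - cluster_sizes[runner_up]) / cluster_sizes[best]
    return best if difference > min_difference else None


print(pick_notable([10, 2]))  # lead of 80%, so the first cluster is notable
print(pick_notable([5, 5]))   # lead of 0%, so no decision
```

Because the comparison is strict, a lead of exactly 50% (say 10 versus 5) also counts as undecided; use `>=` if that case should qualify.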
So here is an approach:
Here it is:
It gives us the blockquote element as the fittest. By adjusting the score function you can change the parameters of the desired cluster.
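To make the idea of an adjustable score function concrete, here is a minimal sketch of one possible scorer. It is an assumption for illustration, not necessarily the code referred to above: every element gets +1 for each mention of the word and loses `penalty` for every other word it contains, and ties go to the deepest element (the closest wrapper). The names `Node`, `TreeBuilder`, `score` and the default `penalty=0.5` are hypothetical, and the penalty is tuned to the sample document. Only the standard library is used, so beautifulsoup4 is not required.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never have children


class Node:
    """One element of a minimal DOM tree (hypothetical helper class)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.depth = 0 if parent is None else parent.depth + 1
        self.words = 0    # words in this element and its descendants
        self.matches = 0  # mentions of the target word among those words


class TreeBuilder(HTMLParser):
    """Builds Node objects and tallies word counts up the ancestor chain."""

    def __init__(self, word):
        super().__init__()
        self.word = word.lower()
        self.root = Node("[document]")
        self.current = self.root
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        tokens = re.findall(r"\w+", data.lower())
        hits = tokens.count(self.word)
        node = self.current
        while node is not None:  # credit the text to every ancestor
            node.words += len(tokens)
            node.matches += hits
            node = node.parent


def score(node, penalty=0.5):
    """Assumed score: reward mentions, penalize every non-matching word."""
    return node.matches - penalty * (node.words - node.matches)


def find_largest_cluster_wrapper(html, word="foo", penalty=0.5):
    builder = TreeBuilder(word)
    builder.feed(html)
    builder.close()
    candidates = [n for n in builder.nodes if n.matches > 0]
    if not candidates:
        return None
    # Highest score wins; on a tie the deeper (closer) wrapper is preferred.
    return max(candidates, key=lambda n: (score(n, penalty), n.depth))


SAMPLE_HTML = """<body>
<p>
Hello <b>foo</b>, I like foo, because foo is the best.
</p>
<div>
<blockquote>
<p><strong>Foo</strong> said: foo foo!</p>
<p>Smurfs ate the last foo and turned blue. Foo!</p>
<p>Foo foo.</p>
</blockquote>
</div>
</body>"""

print(find_largest_cluster_wrapper(SAMPLE_HTML).tag)
```

On the sample document from the question (with the first paragraph closed), the blockquote and the div both score 3.0 and the body scores 2.5, so the deeper blockquote is returned. The margin is narrow: a penalty much below 0.43 lets body win, and one above roughly 0.57 lets a single dense paragraph win, which is exactly the knob to adjust for a different notion of cluster. Text inside script or style elements is counted like any other text in this sketch.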