Given 2 html sources, I want to first extract the main content out of

Question


Asked: June 1, 2026 (2026-06-01T14:02:34+00:00)


Given two HTML sources, I first want to extract the main content from each using something like this. Are there any better libraries? I am specifically looking for Python/JavaScript ones.

Once I have the two extracted contents, I want to return a score between 0 and 1 denoting how similar they are. For example, news articles on the same topic from CNN and BBC would have a high similarity score, and so would webpages for the same product on Amazon.com and Walmart.com. How can I do this? Are there existing libraries that do this already, and which ones are good? Basically I am looking for a combination of automatic summarization, keyword extraction, named-entity recognition and sentiment-analysis.


1 Answer

Editorial Team · Answer 1 · 2026-06-01T14:02:35+00:00

There are several questions embedded in yours. For each one I will either point you to a library or suggest algorithms that can solve the task (search for them and you will find many Python implementations).

Point 1. To extract the main content out of HTML (http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html), and for other NLP-related tasks, check out NLTK. It's written in Python. Also check out the BeautifulSoup library, which is awesome (http://www.crummy.com/software/BeautifulSoup/).
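To make the extraction step concrete, here is a minimal sketch that pulls visible text out of a page. It deliberately uses only the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; with BeautifulSoup, `BeautifulSoup(html, "html.parser").get_text()` would play a similar role. The names `SKIPPED_TAGS`, `TextExtractor` and `extract_text`, the list of skipped tags, and the sample page are all assumptions made for illustration, and real pages will need smarter boilerplate removal than this.

```python
from html.parser import HTMLParser

# Assumption: content inside these tags is never part of the main text.
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, not taken from any real site.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><nav>Home | About</nav><p>Storm hits the coast.</p>"
    "<script>track()</script></body></html>"
)
print(extract_text(page))
```

The navigation bar, stylesheet and script are dropped, leaving only the paragraph text for the similarity step.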

Point 2. When you say:

Once I have the two extracted contents, I want to return a score between 0 and 1 denoting how similar they are….

For this, I suggest clustering your document set with an unsupervised clustering technique. Since your problem falls under distance-metric-based clustering, it should be straightforward to cluster similar documents and then score each one by its similarity to the cluster centroid. Try either K-Means or Adaptive Resonance Theory; with the latter you don't need to define the number of clusters in advance. Alternatively, as larsman points out in the comments, you can simply use TF-IDF (http://www.miislita.com/term-vector/term-vector-3.html).
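As a rough illustration of the TF-IDF route, the sketch below weights each term by TF-IDF and scores two texts by the cosine of the angle between their vectors, which lands between 0 and 1 because all weights are non-negative. This is not taken from the linked tutorial: the smoothed IDF formula, the simple regex tokenizer, the optional `background` corpus parameter and all function names are assumptions chosen to keep the example standard-library only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    # Assumption: smoothed IDF, so terms shared by every document keep a
    # non-zero weight. Unsmoothed log(N / df) would zero them out entirely
    # when only two documents are compared.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Clamp to guard against floating-point results slightly above 1.
    return min(1.0, dot / (norm_a * norm_b))


def similarity(text_a, text_b, background=()):
    """Score two texts in [0, 1]; background documents only refine the IDF."""
    vectors = tfidf_vectors([text_a, text_b, *background])
    return cosine(vectors[0], vectors[1])


# Hypothetical example texts.
cnn = "Storm hits the coast as thousands evacuate"
bbc = "Thousands evacuate as storm nears the coast"
shop = "Wireless headphones with noise cancelling and long battery life"
print(similarity(cnn, bbc), similarity(cnn, shop))
```

With only two documents the IDF part carries little information, so passing a larger `background` collection of unrelated pages should make the term weights more meaningful.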

Point 3. When you say:

Basically I am looking for a combination of automatic summarization, keyword extraction, named-entity recognition and sentiment-analysis

For Automatic Summarization use Non-Negative Matrix Factorization

For Keyword extraction use NLTK

For Named-Entity Recognition use NLTK

For Sentiment Analysis use NLTK
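For the keyword-extraction item, one simple baseline is plain frequency counting after removing stopwords, which is the same counting idea that NLTK's `FreqDist` provides. The sketch below is a standard-library stand-in rather than NLTK itself; the tiny `STOPWORDS` set, the minimum word length and the `top_keywords` name are assumptions for illustration, and NLTK ships a much fuller stopword list.

```python
import re
from collections import Counter

# Assumption: a deliberately tiny stopword list, just enough for the demo.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "for", "in", "is", "it",
    "of", "on", "or", "that", "the", "to", "was", "with",
}


def top_keywords(text, k=5):
    """Return up to k of the most frequent non-stopword terms in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(k)]


# Hypothetical sample text.
article = "The storm hit the coast. The storm moved inland, and the coast flooded."
print(top_keywords(article, k=2))
```

Keywords extracted this way from both pages could also feed the similarity score, for example by comparing only the top terms of each document.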
