I am working on a simple client-server project. Client is written in Java, it sends key words to C++ server written under Linux and recives a list of URLs with best ranks ( depending on number of occurrences of key words ). Server’s job is to go through some URLs in search of key words and return best-fitting URLs. And now the problem is that I have to parse HTML sites to find occurrences of key words, plus I need to extract links from visited page to search on them as well. And my question is what library can I use to do that? Remember only C++ linux libraries are suitable for me. There were some similar topics, so I tried to go through most of them, but some of libraries parse only html files and I don’t want to download every site I visit, but parse it on the fly and just store it’s rank and url. Some of them look a bit complicated to me – for instance firstly parsing HTML to XML or something else and then finally work on the results with C++. Is there something simple and sufficient to do what I need it to do? Any advise will be appreciated.
Share
I don’t think regular expressions are appropriate for HTML parsing. I’m using libxml2, and I enjoy it very much – easy to use, portable and lightning fast.