I’m writing a simple web spider.
The idea is to get a page programmatically using QNetworkAccessManager, QNetworkReply and QNetworkRequest, and the basic mechanism works fine.
The problem is that, for some pages, the result I get programmatically differs from what I get by visiting the page “manually” with a browser.
I always get syntactically correct HTML pages, but they look to me like some sort of “spider protection” responses.
The pages I’m referring to AREN’T POST pages; the tests I’m doing use very simple URLs, sometimes with parameters (e.g. http://www.sample.com/index.php?param=something), sometimes even plain page.html URLs.
The pseudocode is as follows:
QNetworkRequest req;
req.setUrl(QUrl(myurl));
// req.setRawHeader(...);  // I did try this one, with no success
QNetworkAccessManager *man = new QNetworkAccessManager(this);
QNetworkReply *rep = man->get(req);  // get() returns a pointer to the reply
// finished and error slot connection code here
.
.
.
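void replyFinished()
{
    // The reply that emitted finished() is the sender of this slot
    QNetworkReply *rep = qobject_cast<QNetworkReply *>(sender());
    if (rep->error() == QNetworkReply::NoError)
    {
        // read data from QNetworkReply here
        QByteArray bytes = rep->readAll();
        QString stringa(bytes);
        qDebug() << stringa;
    }
    rep->deleteLater();  // the caller owns the reply and must release it
}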
In the replyFinished() slot I print the data from the network reply, and sometimes it doesn’t match what the browser shows with a simple “View Source” after visiting the URL by hand.
Sometimes I get a custom “Not found” page, sometimes stranger pages with login forms or other unexpected content.
Maybe it’s some kind of spider protection? Can anyone help?
There are three main methods of protection against web spiders:
As far as the first two options go, you should use a TCP/IP sniffer such as SmartSniff to check whether the data sent by the browser is identical to the data sent by your program. If it is identical, you are probably hitting some sort of JavaScript barrier. In that case, you might try a JavaScript-enabled browsing engine such as QWebPage. I don’t know whether it executes its JavaScript when it isn’t connected to a QWebView, though; perhaps a hidden view might be necessary.
If I find myself in a situation where I need to impersonate a browser to some remote service, I usually just write a Firefox plugin (using JavaScript); that usually eliminates all of the above problems 😉
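To make the sniffer comparison less error-prone, the two captures can be diffed mechanically. The following is only a minimal sketch in plain C++ with no Qt dependency: it assumes the request headers of each capture have been copied out as "Name: value" text lines, and the names Headers, parseHeaders and differingHeaders are hypothetical helpers made up for this illustration, not part of Qt or of any sniffer.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Request headers as name -> value. Names are stored lower-cased because
// HTTP header names are case-insensitive.
using Headers = std::map<std::string, std::string>;

// Lower-case an ASCII header name so "User-Agent" and "user-agent" match.
std::string lowerCased(std::string name)
{
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return name;
}

// Parse captured "Name: value" lines (assumed capture format) into Headers.
// Lines without a colon, such as a typical request line, are skipped.
Headers parseHeaders(const std::vector<std::string> &lines)
{
    Headers result;
    for (const std::string &line : lines) {
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos)
            continue;
        // Trim blanks after the colon and trailing blanks or a stray '\r'.
        const std::size_t first = line.find_first_not_of(" \t", colon + 1);
        const std::size_t last = line.find_last_not_of(" \t\r");
        std::string value;
        if (first != std::string::npos && last >= first)
            value = line.substr(first, last - first + 1);
        result[lowerCased(line.substr(0, colon))] = value;
    }
    return result;
}

// Names of headers the browser sent that the program either did not send
// or sent with a different value, in alphabetical order.
std::vector<std::string> differingHeaders(const Headers &browser, const Headers &program)
{
    std::vector<std::string> names;
    for (const auto &entry : browser) {
        const auto it = program.find(entry.first);
        if (it == program.end() || it->second != entry.second)
            names.push_back(entry.first);
    }
    return names;
}
```

Each name returned by differingHeaders(parseHeaders(browserLines), parseHeaders(programLines)) is a candidate to replicate in the spider with QNetworkRequest::setRawHeader(name, value). If the list comes back empty and the pages still differ, the JavaScript explanation becomes more likely.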