I’m writing a simple web spider. The idea is to get a page programmatically

Question

Asked: June 10, 2026

I’m writing a simple web spider.
The idea is to get a page programmatically using QNetworkAccessManager, QNetworkReply and QNetworkRequest. The fetching itself works fine.

The problem is that, for some pages, the result I get programmatically does not match what I see when I visit the page manually with a browser.
I always get syntactically correct HTML pages, but they look to me like some sort of “spider protection” responses.
The pages I’m referring to AREN’T POST pages; I’m testing with very simple URLs, sometimes with parameters (e.g. http://www.sample.com/index.php?param=something), sometimes even plain page.html URLs.

The pseudocode is as follows:

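QNetworkRequest req;
req.setUrl(QUrl(myurl));
// req.setRawHeader(...);  // I did try this one, with no success
QNetworkAccessManager man;  // must stay alive until the reply has finished
QNetworkReply *rep = man.get(req);  // get() returns a pointer to the reply
// finished and error slot connection code here

.
.
.

void replyFinished()
{
    QNetworkReply *rep = qobject_cast<QNetworkReply *>(sender());
    if (!rep)
        return;  // slot was not triggered by a QNetworkReply
    if (rep->error() == QNetworkReply::NoError)
    {
        // read data from QNetworkReply here
        QByteArray bytes = rep->readAll();
        QString stringa(bytes);
        qDebug() << stringa;
    }
    rep->deleteLater();  // the caller is responsible for deleting the reply
}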

In the replyFinished() slot I print the data from the network reply, and sometimes the output does not match what the browser’s “View Source” shows when I visit the URL by hand.

Sometimes I get a custom “Not found” page, sometimes stranger pages with login forms or other unexpected content.
Maybe it’s some kind of spider protection? Can anyone help?

1 Answer

Editorial Team · Answer 1 · 2026-06-10T19:51:41+00:00

There are four main ways a website can protect itself from web spiders:

Web browser identification: the website uses the request headers to tell browsers apart from web crawlers. You write that you set raw headers; are you sure that you send the same headers and values your browser does?
Session data/cookies: closely related to the previous point. Login forms suggest that the website expects some information that a browser would normally send.
JavaScript code that writes the actual HTML into the document. Are you comparing against the page source in your web browser (View → Source), or against the layout shown by a tool like Firebug, which displays the document after scripts have run?
JavaScript redirects: the browser downloads a page that uses JavaScript to redirect you to the page with the actual content.
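For the last two items, one quick check is to scan the HTML your spider received for signs that the real content arrives through scripts or a client-side redirect. The following is only a rough sketch in standard C++: the function name findScriptHints and the marker list are assumptions made for illustration, and a plain substring search will miss obfuscated scripts and can flag harmless ones.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Lower-case a copy of the input so the marker search is case-insensitive.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns the markers found in `html` that hint at a client-side redirect or
// at script-generated content. The marker list is illustrative, not exhaustive.
std::vector<std::string> findScriptHints(const std::string& html) {
    static const std::vector<std::string> markers = {
        "http-equiv=\"refresh\"",  // meta refresh redirect
        "window.location",         // JavaScript redirect
        "location.href",           // JavaScript redirect
        "document.write",          // HTML printed by a script
    };
    const std::string lowered = toLower(html);
    std::vector<std::string> found;
    for (const std::string& marker : markers) {
        if (lowered.find(marker) != std::string::npos) {
            found.push_back(marker);
        }
    }
    return found;
}
```

If the returned list is non-empty for a page that looks wrong, the mismatch is more likely caused by missing script execution than by headers or cookies.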

For the first two options, use a TCP/IP sniffer such as SmartSniff to check whether the data sent by the browser matches the data sent by your program. If it matches, you are probably hitting some sort of JavaScript barrier. In that case, you might try a JavaScript-enabled browsing engine such as QWebPage. I don’t know whether it executes JavaScript when it is not attached to a QWebView, though; a hidden view might be necessary.
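Once both requests have been captured, the comparison can be done mechanically. Here is a minimal sketch in standard C++, assuming you have already copied the header names and values from the sniffer output into two maps. The names Headers and headerDifferences are hypothetical, and the sample values in the usage note are made up.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <vector>

using Headers = std::map<std::string, std::string>;

// HTTP header names are case-insensitive, so lower-case them before comparing.
std::string lowerName(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return name;
}

// Returns a copy of `in` with every header name lower-cased.
Headers normalise(const Headers& in) {
    Headers out;
    for (const auto& kv : in) {
        out[lowerName(kv.first)] = kv.second;
    }
    return out;
}

// Lists headers the browser sent that the program either omitted or sent
// with a different value. Entries come out sorted by lower-cased header name.
std::vector<std::string> headerDifferences(const Headers& browser,
                                           const Headers& program) {
    const Headers b = normalise(browser);
    const Headers p = normalise(program);
    std::vector<std::string> report;
    for (const auto& kv : b) {
        const auto it = p.find(kv.first);
        if (it == p.end()) {
            report.push_back("missing: " + kv.first);
        } else if (it->second != kv.second) {
            report.push_back("differs: " + kv.first);
        }
    }
    return report;
}
```

For example, if the browser sent User-Agent, Accept and Cookie while the program sent only a matching user-agent and a different Accept value, the report would flag accept as differing and cookie as missing. Those are the headers to add through setRawHeader before suspecting JavaScript.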

When I need to impersonate a browser to some remote service, I usually just write a Firefox plugin (in JavaScript); that usually eliminates all of the above problems 😉
