I have a web crawler and the whole web to crawl. what should be

Asked: June 17, 2026

I have a web crawler and the whole web to crawl.
What should my strategy be? What kind of classification algorithms should I use?

To be clear, I have a web crawler; I didn't mean crawling the web manually.

1 Answer

Editorial Team · Answer 1 · 2026-06-17T09:37:37+00:00

You can try to classify each page you crawl as a restaurant page or not (a binary classifier) using supervised learning.

You can use the Bag of Words model for it: use the words as “features”, where a word's presence (and its number of occurrences) determines the value of the feature.
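As a rough illustration of the Bag of Words idea, here is a minimal sketch using only the Python standard library. The function names (`bag_of_words`, `vectorize`) and the tokenizing regular expression are assumptions made for this example, not part of any particular library.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a page's text to {word: number of occurrences}."""
    # Assumption: lowercase the text and treat runs of letters/digits as words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

def vectorize(bags):
    """Turn a list of bags into a shared vocabulary and count vectors."""
    # Iterating a Counter yields its keys, so this collects every word seen.
    vocab = sorted(set().union(*bags))
    # A Counter returns 0 for words a page does not contain.
    vectors = [[bag[word] for word in vocab] for bag in bags]
    return vocab, vectors

pages = ["Our menu: pizza, pasta, pizza", "Download the driver"]
vocab, vectors = vectorize([bag_of_words(p) for p in pages])
```

Each page becomes one row of counts over the shared vocabulary, which is the input a classifier would train on.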

You will first need to manually label a set of pages, marking each one as a restaurant page or not. This labeled data is called your training set.

Note that the bag of words model tends to have a huge feature space, so you are going to need a classifier that is not sensitive to uninformative features.

You can later use cross-validation to estimate how good your model is.
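A minimal sketch of k-fold cross-validation, again in plain Python, might look like the following. The helper names `k_fold_splits` and `cross_val_accuracy`, and the convention that `train_fn` returns a prediction function, are assumptions for this example; in practice a machine learning library would usually provide this.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Assumes k <= n_samples so that no fold is empty.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx

def cross_val_accuracy(train_fn, X, y, k=5):
    """Average accuracy over k folds.

    train_fn(X_train, y_train) is a hypothetical callback that must
    return a function mapping one feature vector to a predicted label.
    """
    scores = []
    for train_idx, test_idx in k_fold_splits(len(X), k):
        predict = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)
```

Each labeled page is used for testing exactly once, and the model is never scored on pages it was trained on.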

Here are some suggestions I found useful when classifying data using the bag of words model:

SVM tends to be very useful and yields very good results with the Bag of Words model. I did not see a significant difference between the performance of a linear kernel and a Gaussian kernel.
Use stemming and filter out stop words; you don’t need the noise they generate.
Use bi-grams; they are very informative and, at least for me, tend to increase the accuracy of the classifier significantly.
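The stop-word and bi-gram suggestions can be sketched as below. The tiny `STOP_WORDS` set and the function names are assumptions for illustration only; a real stop-word list is much longer. Stemming is left out because it normally comes from a library stemmer rather than a few lines of code.

```python
import re
from collections import Counter

# Assumption: a deliberately tiny stop-word list, just for the example.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "for", "on", "with"}

def tokens(text):
    """Lowercase, split into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def unigram_bigram_features(text):
    """Count single words plus pairs of adjacent words (bi-grams)."""
    toks = tokens(text)
    # Bi-grams are built after stop-word removal, so a pair may span
    # a stop word that was dropped between the two words.
    bigrams = [f"{a} {b}" for a, b in zip(toks, toks[1:])]
    return Counter(toks + bigrams)

features = unigram_bigram_features("Book a table for dinner in the garden")
```

Here `features` holds the four remaining words plus the pairs "book table", "table dinner", and "dinner garden", each counted once.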
