I have a web crawler and the whole web to crawl.
What should my strategy be? What kind of classification algorithms should I use?
To be clear, I have a web crawler; I didn't mean crawling the web manually.
You can classify each page you crawl as either a restaurant page or not (a binary classifier), using supervised learning.
You can use the Bag of Words model for it: the words are the "features", and whether a word occurs on the page (and how many times) determines the value of its feature.
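To make the feature idea concrete, here is a minimal sketch using only the Python standard library. The function name `bag_of_words`, the tokenizing regex, and the sample page text are all made up for illustration; a real crawler would first strip the HTML and would probably want a better tokenizer.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a page's text to {word: count}; word order is discarded."""
    # Assumption: lowercase ASCII words are good enough for a first pass.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical page text, already stripped of HTML.
page = "Menu: pizza, pasta, pizza. Book a table!"
features = bag_of_words(page)

# Each distinct word is a feature; its count is the feature value.
for word, count in sorted(features.items()):
    print(word, count)
```

A word that never appears on the page simply has the value 0, which is why the feature vectors are very sparse.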
You will also need to manually label a set of pages first, deciding for each one whether it is a restaurant page or not. The labeled data you generate is called your training set.
Note that the bag of words model tends to have a huge feature space, so you are going to need a classifier that is not sensitive to non-informative features.
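As one possible illustration (the choice of algorithm here is my assumption, not something stated above), multinomial Naive Bayes is often used with bag of words features, since words that are equally common in both classes add roughly the same amount to both scores. The sketch below trains on a tiny hypothetical training set; the names `train_naive_bayes` and `predict` and all the sample texts are invented for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(labeled_pages):
    """labeled_pages: list of (text, label); label is True for restaurant pages.
    Assumes both labels occur at least once."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled_pages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return True if the page looks more like a restaurant page."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training are skipped
            # Add-one (Laplace) smoothing avoids log(0).
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return scores[True] > scores[False]

# Hypothetical manually labeled training set.
training_set = [
    ("our menu pizza pasta book a table tonight", True),
    ("dinner menu reservations chef tasting table", True),
    ("python tutorial install package code", False),
    ("news election results weather forecast", False),
]
model = train_naive_bayes(training_set)
is_restaurant = predict(model, "book a table and see the dinner menu")
```

A real training set would need far more than four pages, and other classifiers may well suit your data better; this is only meant to show the shape of the train and predict steps.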
You can later use cross-validation to estimate how good your model is.
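A rough sketch of k-fold cross-validation, again in plain Python: split the labeled pages into k folds, train on k-1 of them, test on the remaining one, and average the accuracy. The helper `k_fold_accuracy`, the toy keyword classifier standing in for a real model, and the sample data are all hypothetical.

```python
def k_fold_accuracy(examples, train, predict, k=5):
    """Average accuracy over k folds.
    Assumes examples are already shuffled and len(examples) >= k."""
    accuracies = []
    for i in range(k):
        # Fold i is every k-th example starting at index i.
        test_fold = examples[i::k]
        train_folds = [ex for j, ex in enumerate(examples) if j % k != i]
        model = train(train_folds)
        correct = sum(predict(model, text) == label for text, label in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / k

# Toy stand-in classifier: remember words seen only on positive pages.
def train_keyword_model(labeled_pages):
    pos, neg = set(), set()
    for text, label in labeled_pages:
        (pos if label else neg).update(text.split())
    return pos - neg

def predict_keyword_model(model, text):
    return any(word in model for word in text.split())

# Hypothetical labeled pages (True = restaurant page).
examples = [
    ("menu pizza table", True),
    ("python code tutorial", False),
    ("menu pasta table", True),
    ("news weather code", False),
    ("dinner menu chef", True),
    ("election news results", False),
]
score = k_fold_accuracy(examples, train_keyword_model, predict_keyword_model, k=3)
print(f"mean accuracy over 3 folds: {score:.2f}")
```

The important point is that each page is only ever scored by a model that did not see it during training, so the average is a fairer estimate than accuracy on the training set itself.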
Here are some suggestions I found useful when classifying data using the bag of words model: