We want to do binary classification of web pages from across the web, e.g. e-commerce vs. non-e-commerce.
Currently, we are using the Mahout library with the Naive Bayes algorithm. We build the training data from URLs that have already been classified, and derive the feature set from the same URLs.
What is the best way, in terms of accuracy, to perform this task?
I need help with the choice of algorithm and libraries (usable with Java), or any better ideas for this kind of classification.
Thanks in advance.
The question is quite general, so I can only offer general advice.
The ways to improve the quality of your classification are (in order of importance):