The problem statement is roughly this:
Given a website, we have to classify it into one of two predefined classes (say, whether it's an e-commerce website or not).
We have already tried Naive Bayes Algorithms for this with multiple pre-processing techniques (stop word removal, stemming etc.) and proper features.
We want to increase the accuracy to 90% or close to it, which we are not getting from this approach.
The issue is that when we evaluate the accuracy manually, we look for a few identifiers on the web page (e.g. a Checkout button, Shop/Shopping, PayPal and many more), and our algorithms sometimes miss them.
We were thinking: if we are so sure of these identifiers, why don’t we create a rule-based classifier that classifies a page according to a set of rules (written in some priority order)?
For example, if a page contains shop/shopping and has a checkout button, then it’s an e-commerce page.
There would be many similar rules, applied in some priority order.
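To make the idea concrete, here is a minimal sketch of a priority-ordered rule list in Python. The rule names, the keywords, and the helper functions `has` and `classify` are hypothetical illustrations, not a tested rule set; real rules would have to come from your own labelled pages.

```python
import re

def has(text, pattern):
    """True if any of the '|'-separated keywords occurs as a whole word."""
    return re.search(r"\b(?:%s)\b" % pattern, text, re.IGNORECASE) is not None

# Hypothetical rules, highest priority first: (rule name, predicate, label).
RULES = [
    ("shop + checkout",
     lambda t: has(t, "shop|shopping") and has(t, "checkout"),
     "ecommerce"),
    ("payment provider",
     lambda t: has(t, "paypal"),
     "ecommerce"),
]

def classify(text, default="unknown"):
    """Return (label, name of the first rule that fired), or (default, None)."""
    for name, predicate, label in RULES:
        if predicate(text):
            return label, name
    return default, None

print(classify("Shop now and proceed to Checkout"))
print(classify("A blog about hiking"))
```

Returning the name of the rule that fired makes misclassifications easy to trace, and the `default` case is a natural place to fall back to the existing Naive Bayes model instead of forcing a decision.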
Depending on which rules fire, we will also visit other pages of the website (currently we visit only the home page, which is another reason we are not getting very high accuracy).
What potential issues will we face with a rule-based approach? Or would it actually be better for our use case?
Would it be a good idea to create those rules with more sophisticated algorithms (e.g. FOIL, AQ etc.)?
A Decision Tree algorithm can take your data and return a rule set for prediction of unlabeled instances.
In fact, a decision tree is really just a recursive descent partitioner composed of a set of rules: each rule sits at a node in the tree, and applying that rule to an unlabeled data instance sends the instance down either the left fork or the right fork.
Many decision tree implementations explicitly generate a rule set, but this isn’t necessary, because the rules (both what each rule is and its position in the decision flow) are easy to see just by looking at the tree that represents the trained decision tree classifier.
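As a rough illustration of reading rules off a tree, the sketch below walks every root-to-leaf path and prints it as an IF/THEN rule. The tree is hand-built and hypothetical (it is not learned from data), and the tuple layout `(test, false_branch, true_branch)` is just an assumption made for this example.

```python
# Internal node: (test_name, false_branch, true_branch); leaf: a class label string.
TREE = ("has_checkout_button",
        ("mentions_paypal", "not_ecommerce", "ecommerce"),
        "ecommerce")

def extract_rules(node, path=()):
    """Return a list of (conditions, label), one entry per root-to-leaf path."""
    if isinstance(node, str):          # leaf: the path so far is a complete rule
        return [(path, node)]
    test, false_branch, true_branch = node
    return (extract_rules(false_branch, path + ((test, False),))
            + extract_rules(true_branch, path + ((test, True),)))

for conditions, label in extract_rules(TREE):
    clause = " AND ".join(("" if value else "NOT ") + test
                          for test, value in conditions)
    print("IF", clause, "THEN", label)
```

Each printed rule corresponds to exactly one leaf, so the rule set and the tree carry the same information.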
In particular, each rule is just a Boolean test for a particular value in a particular feature (data column or field).
For instance, suppose one of the features in each data row describes the type of Application Cache; further suppose that this feature has three possible values: memcache, redis, and custom. Then a rule might be Application Cache | memcache, that is, does this data instance have an Application Cache based on memcache?
The rules extracted from a decision tree are Boolean: each is either true or false. By convention, False is represented by the left edge (the link to the child node below and to the left of the parent node), and True is represented by the right edge.
Hence, a new (unlabeled) data row begins at the root node and is sent down either the right or left side depending on whether the rule at the root node is answered True or False. The next rule is then applied (at the next level down in the tree hierarchy), and so on until the data instance reaches the lowest level (a node with no rule, i.e. a leaf node).
Once the data point has been filtered down to a leaf node, it is in essence classified, because each leaf node has a distribution of training data instances associated with it (e.g., 25% Good | 75% Bad, if Good and Bad are the class labels). This empirical distribution (which in the ideal case consists of data instances having just one class label) determines the unknown data instance’s estimated class label.
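The traversal just described can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the `Node` class, the feature name `app_cache`, and the leaf counts are all made up for the example, and the left/right convention (False goes left, True goes right) follows the description above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Node:
    feature: Optional[str] = None          # feature tested here (None at a leaf)
    value: Optional[str] = None            # rule: row[feature] == value ?
    left: Optional["Node"] = None          # followed when the rule is False
    right: Optional["Node"] = None         # followed when the rule is True
    counts: Optional[Dict[str, int]] = None  # leaf only: training label counts

def classify(node, row):
    """Walk from the root to a leaf; return (majority label, label distribution)."""
    while node.counts is None:
        rule_is_true = row.get(node.feature) == node.value
        node = node.right if rule_is_true else node.left
    total = sum(node.counts.values())
    label = max(node.counts, key=node.counts.get)
    return label, {k: v / total for k, v in node.counts.items()}

# Hypothetical one-rule tree: "Application Cache | memcache"
tree = Node("app_cache", "memcache",
            left=Node(counts={"Good": 1, "Bad": 3}),
            right=Node(counts={"Good": 4, "Bad": 0}))

print(classify(tree, {"app_cache": "redis"}))
print(classify(tree, {"app_cache": "memcache"}))
```

Returning the distribution alongside the label lets you treat impure leaves (such as the 25% / 75% one here) with less confidence than pure ones.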
The free and open-source library Orange has a decision tree module (implementations of specific ML techniques are referred to as “widgets” in Orange) which seems to be a solid implementation of C4.5, probably the most widely used and perhaps the best decision tree algorithm.
An O’Reilly site has a tutorial on decision tree construction and use, including source code for a working decision tree module in Python.