I would like to train and use a Bayesian classifier for the following situation:
- Semi-structured data – basically an XML schema
- Information is contained in multiple plain text fields
- Some fields / parts of the schema may be repeated an arbitrary number of times
The classification itself is fairly simple – basically I need a probability of the document being in a specific category.
Design constraints:
- Solution must either be open source or be available under another royalty-free license
- It must be possible to save / load classifiers for future use
- It must be possible to embed this library in a larger Java-based application (i.e. it must work as a Java/JVM library)
Are there any libraries / tools that would fit these requirements?
I’m not sure whether you already have your classifier ready, but I’ve used Apache’s UIMA framework for a couple of drawer projects. UIMA is “just” a framework, but it does come with some logic. Some heavy-duty googling came up with an example Bayesian classifier using UIMA.
UIMA has mechanisms for modifying your configuration at runtime, but I’m also a bit unclear as to what you mean by “save and load classifiers”. Does this mean that you have an array of binary classifiers you would like to load (and unload) at runtime, or do you have different models that you would like to load/unload?
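If "save and load" simply means persisting a trained model between runs, here is a rough sketch of what that could look like in plain Java. To be clear, this is not UIMA code and not taken from any library: the class name TinyNaiveBayes and all of its methods are made up for illustration. It assumes each document has already been parsed from XML into a map of field name to a list of text values (so a repeated field is just a longer list), uses a simple multinomial naive Bayes with add-one smoothing, and relies on standard Java serialization for persistence.

```java
import java.io.*;
import java.util.*;

/** Hypothetical minimal multinomial naive Bayes; not part of UIMA or any library. */
final class TinyNaiveBayes implements Serializable {
    private static final long serialVersionUID = 1L;

    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Flattens field -> values (a repeated field has several values) into "field:token" features. */
    static List<String> features(Map<String, List<String>> fields) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            for (String value : e.getValue()) {
                for (String tok : value.toLowerCase().split("\\W+")) {
                    if (!tok.isEmpty()) out.add(e.getKey() + ":" + tok);
                }
            }
        }
        return out;
    }

    /** Adds one labelled document to the counts. */
    void train(String label, Map<String, List<String>> fields) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String f : features(fields)) {
            counts.merge(f, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(f);
        }
    }

    /** Posterior P(label | document), normalised over all labels seen in training. */
    double probability(String label, Map<String, List<String>> fields) {
        List<String> feats = features(fields);
        Map<String, Double> logScores = new HashMap<>();
        double max = Double.NEGATIVE_INFINITY;
        for (String l : docsPerLabel.keySet()) {
            double score = Math.log(docsPerLabel.get(l) / (double) totalDocs);
            Map<String, Integer> counts = tokenCounts.get(l);
            // Add-one smoothing; the extra +1 reserves mass for tokens never seen in training.
            double denom = tokensPerLabel.getOrDefault(l, 0) + vocabulary.size() + 1;
            for (String f : feats) {
                score += Math.log((counts.getOrDefault(f, 0) + 1) / denom);
            }
            logScores.put(l, score);
            max = Math.max(max, score);
        }
        Double target = logScores.get(label);
        if (target == null) return 0.0; // label never seen in training
        double sum = 0.0;
        for (double s : logScores.values()) sum += Math.exp(s - max); // subtract max for stability
        return Math.exp(target - max) / sum;
    }

    /** Writes the whole model to disk using standard Java serialization. */
    void save(File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(this);
        }
    }

    /** Reads a model previously written by save(). */
    static TinyNaiveBayes load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (TinyNaiveBayes) in.readObject();
        }
    }
}
```

You would call train(...) once per labelled document, save(...) when training is done, and later load(...) plus probability("yourCategory", fields) inside the larger application. Prefixing each token with its field name keeps the same word in different fields as separate evidence; whether that helps depends on your data, so treat it as one possible design rather than a recommendation.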
The answers to your other questions are: