I am successfully splitting sentences into words with a StringTokenizer.
Is there a tool which is able to split compound words like Projektüberwachung into their parts Projekt and überwachung or even some longer ones?
The reason for splitting the compound words is that I want to do text extraction. I want to convert phrases like Projektplanung und -überwachung into the two parts Projektplanung and Projektüberwachung. Splitting the compound word is my first step.
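To make the second step concrete, here is a minimal sketch of the expansion, assuming some compound splitter is already available. The class name EllipsisExpander, the method expand, and the splitter parameter are hypothetical names made up for this illustration. The splitter is passed in as a plain function, so any tool that returns the parts of a word could be plugged in. The sketch only handles a leading hyphen ("-überwachung") and the conjunctions "und" and "oder".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of any library.
final class EllipsisExpander {
    private static final Set<String> CONJUNCTIONS = Set.of("und", "oder");

    /**
     * Expands elliptical compounds of the form "Xplanung und -y" into
     * full words. The splitter is an assumed function that returns the
     * parts of a compound, or a single-element list if it cannot split.
     */
    static List<String> expand(String phrase, Function<String, List<String>> splitter) {
        List<String> result = new ArrayList<>();
        String head = null; // everything before the last part of the latest compound
        for (String raw : phrase.trim().split("\\s+")) {
            String token = raw.replaceAll("[,;.]+$", ""); // drop trailing punctuation
            if (token.isEmpty() || CONJUNCTIONS.contains(token.toLowerCase(Locale.GERMAN))) {
                continue;
            }
            if (token.startsWith("-") && token.length() > 1 && head != null) {
                // Elliptical form: prepend the remembered head.
                result.add(head + token.substring(1));
                continue;
            }
            List<String> parts = splitter.apply(token);
            if (parts.size() > 1) {
                String last = parts.get(parts.size() - 1);
                // Cut the head from the original token so that connecting
                // letters (such as a linking "s") are preserved.
                if (token.toLowerCase(Locale.GERMAN).endsWith(last.toLowerCase(Locale.GERMAN))) {
                    head = token.substring(0, token.length() - last.length());
                }
            }
            result.add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        // Stub splitter that knows a single word, for demonstration only.
        Function<String, List<String>> stub = w ->
                w.equals("Projektplanung") ? List.of("Projekt", "planung") : List.of(w);
        System.out.println(expand("Projektplanung und -\u00fcberwachung", stub));
    }
}
```

The head is taken from the original token rather than by joining the returned parts, because a splitter may drop connecting letters between parts.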
JWordSplitter
I happened to see this in Synaptic this morning. Here is the description from the site:
“jWordSplitter is a small Java library that splits compound words into their parts. This is especially useful for languages like German where an infinite number of new words can be formed by just appending nouns (“Donaudampfschifffahrtskapitän”).”
Usage is as simple as this:
Unfortunately, there is no pre-built library in the download section, but it is easy to build. Here is a short description of how to do this in three simple steps.
Checkout the sources via SVN:
svn co https://jwordsplitter.svn.sourceforge.net/svnroot/jwordsplitter/trunk jwordsplitter
Open the Maven project, e.g. in NetBeans
Build the library which includes the dictionary (jwordsplitter-3.2.jar, 300kB)
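If you only want to see the general idea of dictionary-based splitting before building the library, here is a rough sketch. It is not jWordSplitter's code or API. The class NaiveCompoundSplitter, its tiny dictionary, and the minimum part length are assumptions made for illustration. It tries the longest known prefix first and backtracks when the rest of the word cannot be split. It does not handle connecting letters such as the linking "s", which a real library has to deal with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of dictionary-based compound splitting.
final class NaiveCompoundSplitter {
    private static final int MIN_PART_LENGTH = 3; // assumed lower bound for a part
    private final Set<String> dictionary;         // known words, lower-case

    NaiveCompoundSplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    /** Returns the parts of the word, or the word itself if no split is found. */
    List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        if (splitFrom(word, 0, parts) && parts.size() > 1) {
            return parts;
        }
        return List.of(word);
    }

    // Tries to cover word[start..] with dictionary words, longest prefix first.
    private boolean splitFrom(String word, int start, List<String> parts) {
        if (start == word.length()) {
            return true; // the whole word has been covered
        }
        for (int end = word.length(); end >= start + MIN_PART_LENGTH; end--) {
            String candidate = word.substring(start, end);
            if (!dictionary.contains(candidate.toLowerCase(Locale.GERMAN))) {
                continue;
            }
            parts.add(candidate);
            if (splitFrom(word, end, parts)) {
                return true;
            }
            parts.remove(parts.size() - 1); // backtrack and try a shorter prefix
        }
        return false;
    }

    public static void main(String[] args) {
        NaiveCompoundSplitter splitter = new NaiveCompoundSplitter(
                Set.of("projekt", "planung", "\u00fcberwachung"));
        System.out.println(splitter.split("Projekt\u00fcberwachung"));
    }
}
```

The quality of such a splitter depends almost entirely on the dictionary, which is presumably why the jar built above ships with one.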