I want an algorithm which would create all possible phrases in a block of text. For example, in the text:
"My username is click upvote. I have 4k rep on stackoverflow"
It would create the following combinations:
"My username"
"My username is"
"username is click"
"is click"
"is click upvote"
"click upvote"
"I have"
"I have 4k"
"have 4k"
...
You get the idea. The point is to get all possible combinations of ‘phrases’ (runs of consecutive words) out of a sentence. Any thoughts on how best to implement this?
Basically, you first need to separate the block of text into sentences. That’s tricky enough, even in English, since you need to look out for periods, question marks, exclamation marks and any other sentence terminators.
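As a rough illustration of that sentence-splitting step, here is a minimal Java sketch that breaks the text after a period, question mark or exclamation mark followed by whitespace. The class and method names are made up for this example, and the rule is deliberately naive: abbreviations such as "e.g." or "Dr." would be split wrongly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class SentenceSplitSketch {
    // Splits text wherever a '.', '!' or '?' is followed by whitespace.
    // The terminator stays attached to the sentence it ends.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        for (String part : text.trim().split("(?<=[.!?])\\s+")) {
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }
}
```

The JDK also ships `java.text.BreakIterator.getSentenceInstance()`, which applies locale-aware boundary rules and may be a better starting point than a hand-written regex.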
Then you process one sentence at a time after removing all punctuation (commas, semi-colons, colons, and so on).
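A minimal sketch of the punctuation step, assuming it is acceptable to treat every character that is not a letter, digit or whitespace as a word separator. The names are illustrative only, and the rule is crude: an apostrophe splits "don't" into "don" and "t", so real text will likely need extra rules.

```java
// Hypothetical helper, not part of any library.
class WordSplitSketch {
    // Turns one sentence into an array of words with punctuation removed.
    static String[] words(String sentence) {
        // Replace anything that is not a letter, digit or whitespace with a space.
        String cleaned = sentence.replaceAll("[^\\p{L}\\p{N}\\s]", " ").trim();
        if (cleaned.isEmpty()) {
            return new String[0];
        }
        // Collapse runs of whitespace while splitting.
        return cleaned.split("\\s+");
    }
}
```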
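Then, when you’re left with an array of words, it becomes simpler: for each starting word, take every run of consecutive words that begins there and extends at least one word further, and store each run as a phrase.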
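A minimal sketch of that step in Java, assuming the words are already in a `String[]`: an outer loop picks the starting word, and an inner loop extends the phrase one word at a time, storing it after each extension. The class and method names are placeholders chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class PhraseSketch {
    // Returns every run of two or more consecutive words, in order of start position.
    static List<String> phrases(String[] words) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < words.length - 1; start++) {
            StringBuilder phrase = new StringBuilder(words[start]);
            for (int end = start + 1; end < words.length; end++) {
                // Extend the current phrase by one word and record it.
                phrase.append(' ').append(words[end]);
                result.add(phrase.toString());
            }
        }
        return result;
    }
}
```

A sentence of n words yields n(n-1)/2 phrases this way, so five words give ten phrases.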
That’s it, pretty simple (after initial massaging of the text block, which may not be as simple as you think).
This will give you all phrases of two or more words in every sentence.
The separation into sentences, separation into words, removal of punctuation and so on will be the hardest bit, but I’ve already given you some simple initial rules to follow. Add further rules whenever a block of text breaks the algorithm.
Update:
As requested, here’s some Java code which gives the phrases, together with what it outputs:
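A minimal sketch of such a program, assuming the sentence has already been reduced to words separated by single spaces. The class name `PhraseDemo` and the hard-coded sentence are illustrative choices.

```java
class PhraseDemo {
    // First sentence from the question, with the trailing period already removed.
    static final String TEXT = "My username is click upvote";

    public static void main(String[] args) {
        String[] words = TEXT.split(" ");
        // i is the index of the first word of the phrase.
        for (int i = 0; i < words.length - 1; i++) {
            String phrase = words[i];
            // j is the index of the last word of the phrase.
            for (int j = i + 1; j < words.length; j++) {
                phrase = phrase + " " + words[j];
                System.out.println(phrase);
            }
        }
    }
}
```

For that sentence it prints:

```
My username
My username is
My username is click
My username is click upvote
username is
username is click
username is click upvote
is click
is click upvote
click upvote
```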
Now, keep in mind this is pretty basic Java (some might say it’s C written in a Java dialect :-). It’s just meant to illustrate how to output word groupings from a sentence as you asked for.
It does not do all the fancy sentence detection and punctuation removal I mentioned in the original answer.