I would like to create an algorithm that identifies people who write on a forum under several different nicknames.
The goal is to discover people who register a new account to flame the forum anonymously instead of posting under their main account.
Basically, I was thinking about stemming the words they use and comparing users according to the similarity of these words.

As shown in the picture, user3 and user4 use the same words, which suggests there is probably one person behind both accounts.
It's clear that there are a lot of common words that are used by all users, so I should focus on "user-specific" words.
The input (corresponding to the picture above) is:
<word1, user1>
<word2, user1>
<word2, user2>
<word3, user2>
<word4, user2>
<word5, user3>
<word5, user4>
... etc. The order doesn't matter.
The output should be:
user1
user2
user3 = user4
I am doing this in Java, but I want this question to be language-independent.
Any ideas on how to do it?
1) How should I store the words and users? Which data structures?
2) How do I get rid of the common words everybody uses? I have to somehow separate them from the user-specific words. Maybe I could simply leave them in and hope they wash out, but I am afraid they will hide the significant differences in the user-specific words.
3) How do I recognize the same user? Should I somehow count the words shared between each pair of users?
Thank you in advance for any advice.
I recommend a language modelling approach. You can train a language model (unigram, bigram, parsimonious, …) on the words of each user account. That gives you a mapping from words to probabilities, i.e. numbers between 0 and 1 (inclusive) expressing how likely it is that a user uses each of the words you encountered in the complete training set. Language models can be stored as arrays of pairs, hash tables or sparse vectors. There are plenty of libraries on the web for fitting LMs.
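As a rough illustration of the simplest case, here is a minimal sketch of a maximum-likelihood unigram model per user, stored in hash tables. The class name `UnigramModels` and the method `train` are made up for this example and do not come from any library. It assumes the input arrives as {word, user} pairs as in the question, and it does no smoothing and no special handling of common words.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only; not part of any library.
class UnigramModels {

    // Builds one maximum-likelihood unigram model per user from {word, user} pairs.
    // Result: user -> (word -> relative frequency of that word among the user's words).
    static Map<String, Map<String, Double>> train(List<String[]> pairs) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        Map<String, Integer> totals = new HashMap<>();

        // Count how often each user used each word, and how many words each user wrote.
        for (String[] pair : pairs) {
            String word = pair[0];
            String user = pair[1];
            counts.computeIfAbsent(user, u -> new HashMap<>()).merge(word, 1, Integer::sum);
            totals.merge(user, 1, Integer::sum);
        }

        // Turn the raw counts into probabilities that sum to 1 for each user.
        Map<String, Map<String, Double>> models = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> userEntry : counts.entrySet()) {
            double total = totals.get(userEntry.getKey());
            Map<String, Double> probabilities = new HashMap<>();
            for (Map.Entry<String, Integer> wordEntry : userEntry.getValue().entrySet()) {
                probabilities.put(wordEntry.getKey(), wordEntry.getValue() / total);
            }
            models.put(userEntry.getKey(), probabilities);
        }
        return models;
    }
}
```

With the example input from the question, user1's model assigns 0.5 each to word1 and word2, while user3 and user4 both assign probability 1 to word5. Words a user never wrote are simply absent from that user's map, which is what makes the representation sparse.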
Such a mapping can be considered a high-dimensional vector, in the same way documents are treated as vectors in the vector space model of information retrieval. You can then compare these vectors by using KL-divergence or any of the popular distance metrics: Euclidean distance, cosine distance, etc. A strong similarity (small distance) between two accounts' vectors might then indicate that they belong to one and the same person.
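To make the comparison step concrete, here is a minimal sketch of two of the measures mentioned above, working on sparse word-to-probability maps. The class `ModelDistance`, its method names, and the parameters `vocabSize` and `lambda` are assumptions made for this example. In particular, the KL-divergence variant mixes the second model with a uniform distribution so that words missing from it do not produce a division by zero; that smoothing is one possible choice, not the only one.

```java
import java.util.Map;

// Hypothetical helper class for illustration only; not part of any library.
class ModelDistance {

    // Cosine similarity between two sparse word -> probability vectors.
    // Returns 1.0 for vectors pointing in the same direction, 0.0 if they share no words.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty model is similar to nothing
        }
        return dot / Math.sqrt(normA * normB);
    }

    // KL divergence D(p || q), where q is mixed with a uniform distribution over
    // vocabSize words (weight lambda) so unseen words get a non-zero probability.
    // Assumes every word in p belongs to that vocabulary. Not symmetric in p and q.
    static double klDivergence(Map<String, Double> p, Map<String, Double> q,
                               int vocabSize, double lambda) {
        double kl = 0.0;
        for (Map.Entry<String, Double> e : p.entrySet()) {
            double pw = e.getValue();
            if (pw == 0.0) {
                continue; // terms with p(w) = 0 contribute nothing
            }
            double qw = (1.0 - lambda) * q.getOrDefault(e.getKey(), 0.0) + lambda / vocabSize;
            kl += pw * Math.log(pw / qw);
        }
        return kl;
    }
}
```

To get the grouping asked for in the question, you could compute one of these measures for every pair of accounts and flag the pairs whose similarity exceeds (or whose divergence falls below) a threshold. That threshold would have to be tuned on your own data.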