I’m doing some personal research into text analysis, and have come up with close to 70 metrics (pronoun usage frequency, reading level, vowel frequency, use of bullet points, etc.) to “score” a piece of text.
Ideally, separate pieces of text from the same author would have similar scores. The ultimate goal is to index a large number of authors, and use the scores to guess who wrote a separate, anonymous piece of text.
I’d like the scores to be normalized to a range of 0 to 100 and to represent a percentage of how “similar” two pieces of text are in writing style. Questions like “How to decide on weights?” and “How to calculate scores?” describe the math behind scoring metrics and how to normalize them, but assume every metric is weighted the same.
My question is this: how do I determine the proper weight to use when scoring each metric, to ensure that the cumulative score per-user most accurately describes the writing from that specific user?
Also, weights can be assigned per-user. If syllables per word is the metric that best identifies Alice as the author of a piece, while the frequency of two-letter words is the best one for Bob, I’d like Alice’s heaviest weight to be on syllables per word, and Bob’s to be on frequency of two-letter words.
If you want to do it with weighted scores, have a look at http://en.wikipedia.org/wiki/Principal_component_analysis. You could plot the values of the first (largest) couple of principal components for different authors and see whether the authors form clusters. You can also plot the smallest few principal components and see if anything stands out. If something does, it is probably a glitch or a mistake, because the smallest components tend to pick out exceptions to general rules.
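To make the first-component idea concrete, here is a minimal sketch in plain Python. The metric values and the helper names (`standardize`, `covariance`, `first_component`, `project`) are made up for illustration, and power iteration is just one simple way to get the largest component; with 70 metrics you would more likely reach for a numerical library.

```python
import math
import random

def standardize(rows):
    """Scale each metric to zero mean and unit variance (constant metrics become 0)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Covariance matrix of rows that are already centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]

def first_component(cov, iterations=200, seed=0):
    """Power iteration: multiply by the covariance matrix and renormalize."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in cov]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(rows, component):
    """Score of each text along the given component."""
    return [sum(x * c for x, c in zip(r, component)) for r in rows]

# Hypothetical data: one row per text, three metrics per row.
texts = [
    [1.20, 0.30, 4.0], [1.30, 0.32, 4.2], [1.25, 0.31, 3.9],  # Alice
    [1.80, 0.20, 6.0], [1.90, 0.18, 6.3], [1.85, 0.21, 5.8],  # Bob
]
z = standardize(texts)
pc1 = first_component(covariance(z))
scores = project(z, pc1)
print(scores[:3], scores[3:])  # Alice's scores vs. Bob's scores
```

If the two groups of scores do not overlap, the first component is already separating the authors. Note that PCA never sees the author labels, so a clean split is a lucky outcome rather than something it optimizes for.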
Another option is http://en.wikipedia.org/wiki/Linear_discriminant_analysis
I suppose you could get per-author weights by training one classifier per author: weights for the classification Alice vs. not-Alice, weights for the classification Bob vs. not-Bob, and so on.
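A rough sketch of the one-vs-rest idea follows. It is not full linear discriminant analysis: it uses a simplified "diagonal" Fisher criterion that treats each metric on its own and ignores correlations between metrics. The function name `one_vs_rest_weights` and the data are hypothetical, and the metrics are assumed to be standardized already so that weights are comparable across metrics.

```python
from statistics import mean, pvariance

def one_vs_rest_weights(samples, target, eps=1e-9):
    """samples: list of (author, metrics) pairs. Returns one weight per metric.

    Each weight is (difference of class means) / (sum of class variances),
    so a metric gets a heavy weight when it separates `target` from everyone
    else and is also stable within each group.
    """
    own = [m for author, m in samples if author == target]
    rest = [m for author, m in samples if author != target]
    weights = []
    for j in range(len(own[0])):
        own_j = [m[j] for m in own]
        rest_j = [m[j] for m in rest]
        spread = pvariance(own_j) + pvariance(rest_j) + eps  # eps avoids division by zero
        weights.append((mean(own_j) - mean(rest_j)) / spread)
    return weights

# Hypothetical standardized metrics: [syllables per word, two-letter word frequency]
samples = [
    ("alice", [1.9, 0.1]), ("alice", [2.1, -0.2]), ("alice", [2.0, 0.3]),
    ("bob", [0.1, 1.8]), ("bob", [-0.2, 2.2]), ("bob", [0.2, 2.0]),
    ("carol", [0.0, 0.0]), ("carol", [-0.1, 0.2]), ("carol", [0.3, -0.3]),
]
alice_weights = one_vs_rest_weights(samples, "alice")
bob_weights = one_vs_rest_weights(samples, "bob")
print(alice_weights, bob_weights)
```

With this toy data, Alice's heaviest weight lands on syllables per word and Bob's on two-letter word frequency, which is the per-author behaviour the question asks for.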
Another way of trying to identify authors is to build a http://en.wikipedia.org/wiki/Language_model for each author.
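As a toy illustration of the language-model approach, the sketch below trains a character-bigram model with add-one smoothing for each author and ranks authors by how probable the anonymous text is under each model. The corpus, the function names (`train`, `avg_log_prob`) and the choice of character bigrams are all assumptions made for the example; a real model would use far more text and probably longer n-grams.

```python
import math
from collections import Counter

def train(text):
    """Count character bigrams and how often each character starts one."""
    text = text.lower()
    return Counter(zip(text, text[1:])), Counter(text[:-1])

def avg_log_prob(model, text, alphabet_size):
    """Mean log-probability per bigram, using add-one (Laplace) smoothing."""
    pairs, firsts = model
    text = text.lower()
    bigrams = list(zip(text, text[1:]))
    total = sum(
        math.log((pairs[(a, b)] + 1) / (firsts[a] + alphabet_size))
        for a, b in bigrams
    )
    return total / len(bigrams)

# Hypothetical training text per author.
corpus = {
    "alice": "the cat sat on the mat. the cat ate the rat. " * 5,
    "bob": "zzz buzz fizz jazz pizzazz quiz. " * 5,
}
anonymous = "the rat sat on the cat"

# Use one shared alphabet so the smoothed probabilities are comparable.
alphabet_size = len(set("".join(corpus.values()) + anonymous))
models = {author: train(text) for author, text in corpus.items()}
ranking = sorted(
    models,
    key=lambda author: avg_log_prob(models[author], anonymous, alphabet_size),
    reverse=True,
)
best_guess = ranking[0]
print(best_guess)
```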
It occurs to me that if you are prepared to claim that your different measures are independent, you can then combine them with http://en.wikipedia.org/wiki/Naive_Bayes_classifier. The log of the final Bayes factor will then be the sum of the logs of the individual Bayes factors, which gives you your sum of weighted scores.
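Here is a small sketch of that sum, assuming in addition that each metric is roughly normally distributed within a group (that Gaussian assumption is mine, not part of the naive Bayes idea itself). The data and the names `log_normal_pdf` and `log_bayes_factors` are hypothetical.

```python
import math
from statistics import mean, stdev

def log_normal_pdf(x, mu, sigma):
    """Log density of a normal distribution, computed directly to avoid underflow."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_bayes_factors(x, own, rest, min_sigma=1e-3):
    """Per-metric log Bayes factor for 'written by this author' vs. 'written by someone else'."""
    factors = []
    for j, xj in enumerate(x):
        own_j = [m[j] for m in own]
        rest_j = [m[j] for m in rest]
        s_own = max(stdev(own_j), min_sigma)    # floor guards against zero spread
        s_rest = max(stdev(rest_j), min_sigma)
        factors.append(
            log_normal_pdf(xj, mean(own_j), s_own)
            - log_normal_pdf(xj, mean(rest_j), s_rest)
        )
    return factors

# Hypothetical metrics: [syllables per word, two-letter word frequency]
alice_texts = [[1.9, 0.1], [2.1, -0.2], [2.0, 0.3]]
other_texts = [[0.1, 1.8], [-0.2, 2.2], [0.2, 2.0],
               [0.0, 0.0], [-0.1, 0.2], [0.3, -0.3]]
anonymous = [2.05, 0.0]

factors = log_bayes_factors(anonymous, alice_texts, other_texts)
total = sum(factors)  # independence is what lets the per-metric logs simply add
print(factors, total)
```

A positive total favours Alice as the author. Each term in the sum plays the role of a weighted score for one metric, and a metric that discriminates well for this author contributes a term of larger magnitude. To turn the total into posterior log-odds you would still add the prior log-odds for the author.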