I need a frequency-sorted dictionary for a compression program, under a permissive or GPLv3-compatible license, but I haven’t the slightest clue where to get one (every list I found had a missing or bad copyright notice). Does anyone have recommendations as to where to get one? I’ve looked for a while, and my only option seems to be building my own from e-books, which I doubt would be of good quality: it would not be representative of English as a whole, much less of modern English, which is my target.
PS: About 50,000 to 200,000 words is a good target. Huge files are not a good idea.
What you want is a unigram distribution built over a large quantity of representative English text. A ‘unigram distribution’ is the formal term for what you’re calling a ‘dictionary with frequencies’.
Google published a giant collection of ngrams under a permissive license.
See http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html.
Or http://books.google.com/ngrams/datasets.
If you don’t need all those obscure words, then just chop the distribution to what you want.
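As a rough sketch of that chopping step, the Python below sums counts per word and keeps only the most frequent entries. It assumes tab-separated rows of the form word, year, match count, volume count, which is my reading of the Google Books 1-gram files; check the dataset page before relying on it. The name `top_words` and the sample rows are made up for illustration.

```python
from collections import Counter

def top_words(lines, limit):
    """Sum counts per word across rows and keep the `limit` most frequent."""
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows
        # Assumed layout: word, year, match_count, volume_count
        word, count = fields[0].lower(), fields[2]
        # Keep purely alphabetic tokens; this drops numbers and punctuation
        if word.isalpha() and count.isdigit():
            totals[word] += int(count)
    return totals.most_common(limit)

# Invented sample rows, only to show the shape of the input
sample = [
    "the\t2007\t500\t40\n",
    "the\t2008\t600\t45\n",
    "The\t2008\t100\t20\n",
    "zyzzyva\t2008\t1\t1\n",
    "of\t2008\t300\t44\n",
]
print(top_words(sample, 2))  # [('the', 1200), ('of', 300)]
```

For a 50,000 to 200,000 word target you would pass that number as `limit` and feed in the opened data file instead of the sample list.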
As for licensing, even the FSF says that the GPL is inapplicable to dictionaries, because they aren’t ‘source’. So the CC license here works perfectly fine for incorporating the data into whatever you are building.
If you don’t care about having entirely representative data, then download the Wikipedia dumps and the Ruby tool for extracting text, and build your own unigram distribution.
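Building the distribution yourself is mostly counting. Here is a minimal sketch in Python, assuming the text has already been extracted to plain strings; the tokenizer regex, the function names, and the sample sentences are my own choices for illustration, not part of any particular tool.

```python
import re
from collections import Counter

# Lowercase ASCII words, allowing internal straight apostrophes ("didn't")
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def unigram_counts(texts):
    """Count how often each word occurs across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

def frequency_sorted(counts):
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Invented sample input standing in for extracted article text
docs = ["The cat sat on the mat.", "The dog didn't sit."]
for word, n in frequency_sorted(unigram_counts(docs))[:3]:
    print(word, n)
```

Because `unigram_counts` accepts any iterable, you can stream a large dump through it one article at a time rather than loading everything into memory.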
Whatever you choose, you’ll be working with a lot of data if you want useful results.