Can anyone show a simple implementation or usage example of a tf-idf algorithm in Smalltalk for natural language processing?
I’ve found an implementation in a package called NaturalSmalltalk, but it seems too complicated for my needs. I am looking for something as simple as this Python implementation.
I’ve noticed there is another tf-idf in Hapax, but it seems to be aimed at analyzing the vocabularies of software systems, and I didn’t find any examples of how to use it.
I am the author of the original Hapax package for VisualWorks. Hapax is a general-purpose information retrieval package and should be able to work with any kind of text file. It just so happens that I used it to analyze source code files.
The class that you are looking for is TermDocumentMatrix. There should be two methods, globalWeighting: and localWeighting:, to which you pass instances of InverseDocumentFrequency and either LogTermFrequency or TermFrequency, depending on your needs. Typically, when people refer to tf-idf, they mean it to include logarithmic term frequencies.
There should be tests demonstrating the TDM class using a small example corpus. If the tests have not been ported to Squeak, please let me know so I can provide you with an example.
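To make the local/global weighting split concrete, here is a minimal sketch in Python, the language the question points to for a simple implementation. It is not Hapax code and does not use the Hapax API: the function name tf_idf, its log_tf flag, and the example corpus are made up for illustration. The logarithmic local weight is assumed to be 1 + ln(count), which is one common variant; Hapax's LogTermFrequency may define it differently.

```python
import math
from collections import Counter


def tf_idf(documents, log_tf=True):
    """Return one {term: weight} dict per document.

    Hypothetical helper for illustration only, not part of Hapax.
    Local weight:  1 + ln(count) if log_tf, else the raw count.
    Global weight: ln(N / df), where df is the number of documents
    that contain the term.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Global weighting (inverse document frequency).
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        doc_weights = {}
        for term, count in counts.items():
            # Local weighting (assumed variant of log term frequency).
            local = 1.0 + math.log(count) if log_tf else float(count)
            doc_weights[term] = local * idf[term]
        weights.append(doc_weights)
    return weights


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
weights = tf_idf(corpus)
```

Each entry of weights maps the terms of one document to their scores. Terms that occur in fewer documents get a larger global weight, and passing log_tf=False switches the local weight to raw counts, which roughly mirrors the choice between LogTermFrequency and TermFrequency described above.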