I have a document collection stored with Perl's Text::DocumentCollection, and I want to calculate the cosine similarity between any two documents in it using Text::Document.
I think this can probably be done using EnumerateV and callbacks, but I’m having trouble figuring out the specifics. (This SO question is helpful, but I’m still stuck.)
To be specific, suppose the collection is stored in test.db as follows:
#!/usr/bin/perl
use strict;
use warnings;
use Text::DocumentCollection;
use Text::Document;

# Open the collection backed by the file test.db.
my $c = Text::DocumentCollection->new( file => 'test.db' );
my $text = 'Stack Overflow is a programming | Q & A site that’s free. Free to ask | questions, free to answer questions|, free to read, free to index';
# Each '|'-separated chunk becomes one document.
my @strings = split /\|/, $text;
my $i = 0;
foreach my $string (@strings) {
    my $doc = Text::Document->new();
    $doc->AddContent($string);
    $c->Add( ++$i, $doc );    # keys are 1, 2, 3, ...
}
Now suppose I need to read in test.db and calculate cosine similarity for all combinations of documents. (I don’t have access to the documents created in the code above other than through the stored database file.)
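For reference, the quantity being asked for can be sketched in plain Perl, independent of the module. This assumes each document is reduced to a hash of term counts. The tokenizer `term_counts` and the helper `cosine` are hypothetical names, not part of Text::Document, whose own tokenization may differ.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: lowercase the text and count each word.
sub term_counts {
    my ($text) = @_;
    my %count;
    $count{$_}++ for grep { length } split /\W+/, lc $text;
    return \%count;
}

# Cosine similarity of two term-count hashes:
# dot(x, y) / (|x| * |y|), or 0 if either vector is empty.
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    for my $term (keys %$x) {
        $nx  += $x->{$term} ** 2;
        $dot += $x->{$term} * $y->{$term} if exists $y->{$term};
    }
    $ny += $_ ** 2 for values %$y;
    return 0 unless $nx && $ny;
    return $dot / sqrt($nx * $ny);
}

# Three of four terms are shared, so this prints 0.750.
my $sim = cosine(term_counts('free to ask questions'),
                 term_counts('free to answer questions'));
printf "%.3f\n", $sim;
```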
I think the answer is to construct a subroutine that is passed as the callback to EnumerateV. I’m guessing that the subroutine also calls EnumerateV itself, but I haven’t been able to figure it out.
You might want to start with something like this:
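A minimal sketch, assuming the stored collection can be reopened with `NewFromDB`, and that `EnumerateV` calls the callback as `( $collection, $key, $doc, $rock )` and concatenates the lists it returns, as I read the module documentation. Rather than nesting a second `EnumerateV` inside the callback (I am not sure nested enumeration is safe), it makes one pass to collect the key/document pairs and then compares them in an ordinary double loop. The helper names `collect` and `pairwise_cosine` are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::DocumentCollection;
use Text::Document;

# Callback for EnumerateV: return one [key, document] pair per document.
# Assumed call signature: ( $collection, $key, $doc, $rock ).
sub collect {
    my ($collection, $key, $doc, $rock) = @_;
    return [ $key, $doc ];
}

# Compare every unordered pair once; returns [key1, key2, similarity] rows.
sub pairwise_cosine {
    my @docs = sort { $a->[0] cmp $b->[0] } @_;
    my @rows;
    for my $i (0 .. $#docs - 1) {
        for my $j ($i + 1 .. $#docs) {
            my $sim = $docs[$i][1]->CosineSimilarity($docs[$j][1]);
            push @rows, [ $docs[$i][0], $docs[$j][0], $sim ];
        }
    }
    return @rows;
}

# Read the stored collection (if present) and print every pair.
my $file = 'test.db';
if (-e $file) {
    my $c    = Text::DocumentCollection->NewFromDB( file => $file );
    my @docs = $c->EnumerateV( \&collect, undef );
    printf "%s vs %s: %.4f\n", @$_ for pairwise_cosine(@docs);
}
else {
    warn "$file not found\n";
}
```

Because each unordered pair is visited only once, four documents give six rows. If you also want the (2, 1) ordering or the self-similarity of each document, adjust the loop bounds accordingly.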