I am doing some linguistic research that depends on being able to query a corpus of 100 million sentences. The information I need from that corpus is along these lines: how many sentences have “john” as the first word, “went” as the second word and “hospital” as the fifth word, and so on. So I just need the count and don’t need to actually retrieve the sentences.
The idea I had was to split these sentences into words and store them in a database, where the columns would be the word positions (word-1, word-2, word-3, etc.) and the rows would be the sentences. So it looks like:
Word1 Word2 Word3 Word4 Word5 ….
Congress approved a new bill
John went to school
…..
My purpose would then be fulfilled by running something like SELECT COUNT(*) FROM sentences WHERE Word1 = 'John' AND Word4 = 'school' (with "sentences" standing for whatever the table ends up being called). But I am wondering: can this be better achieved using Lucene (or some other tool)?
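To make the intended query concrete, here is a minimal in-memory sketch in Java of what the count means. It is only an illustration under stated assumptions: the class name PositionalCount and the method count are hypothetical, positions are 1-based, matching is case-insensitive, and the lookup is a plain linear scan rather than a database or Lucene query, so it is not meant to scale to 100 million sentences as written.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
public class PositionalCount {

    // Counts the sentences whose word at every requested 1-based position
    // equals the required word (ignoring case). A sentence that is shorter
    // than a requested position never matches.
    public static long count(List<String[]> sentences, Map<Integer, String> constraints) {
        long matches = 0;
        for (String[] words : sentences) {
            boolean ok = true;
            for (Map.Entry<Integer, String> c : constraints.entrySet()) {
                int idx = c.getKey() - 1; // convert 1-based position to array index
                if (idx < 0 || idx >= words.length
                        || !words[idx].equalsIgnoreCase(c.getValue())) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // The two sample rows from the table above, split on spaces.
        List<String[]> corpus = List.of(
                "Congress approved a new bill".split(" "),
                "John went to school".split(" "));

        // Word1 = John and Word4 = school
        System.out.println(count(corpus, Map.of(1, "john", 4, "school"))); // prints 1
    }
}
```

Whatever storage ends up being used (SQL table, Lucene index, or something else), it should return the same number as this scan for the same constraints, which makes the sketch usable as a correctness check on a small sample.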
The program I am writing (in Java) will be running tens of thousands of such queries on that 100-million-sentence corpus, so lookup speed is important.
Thanks for any advice,
Anas
For example:
translates into:
(I chose arbitrary word codes), which leads to storing a row
(padded until the number of words you allocate per sentence is exhausted).
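As a rough illustration of the word-code idea, here is a minimal Java sketch. Everything specific in it is an assumption rather than part of the suggestion above: the class name WordCodeTable, codes handed out in order of first appearance starting at 1, 0 used as the padding value for unused positions, and words lower-cased before lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: each sentence becomes a fixed-width row of integer word codes.
public class WordCodeTable {
    static final int PAD = 0;            // assumed filler for positions past the sentence end

    private final int maxWords;          // number of word columns allocated per sentence
    private final Map<String, Integer> codes = new HashMap<>();
    private final List<int[]> rows = new ArrayList<>();

    public WordCodeTable(int maxWords) {
        this.maxWords = maxWords;
    }

    // Returns the code for a word, assigning the next free code the first time it is seen.
    public int codeOf(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        Integer code = codes.get(key);
        if (code == null) {
            code = codes.size() + 1;     // codes start at 1 so that 0 stays free for padding
            codes.put(key, code);
        }
        return code;
    }

    // Encodes a sentence as one row; words beyond maxWords are dropped.
    public int[] add(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        int[] row = new int[maxWords];   // int arrays start out filled with 0, i.e. PAD
        for (int i = 0; i < Math.min(words.length, maxWords); i++) {
            row[i] = codeOf(words[i]);
        }
        rows.add(row);
        return row;
    }

    // Counts rows matching every (1-based position -> word) constraint.
    public long count(Map<Integer, String> constraints) {
        int[] wanted = new int[maxWords]; // PAD here means "no constraint at this position"
        for (Map.Entry<Integer, String> c : constraints.entrySet()) {
            Integer code = codes.get(c.getValue().toLowerCase(Locale.ROOT));
            int idx = c.getKey() - 1;
            if (code == null || idx < 0 || idx >= maxWords) {
                return 0;                 // unknown word or out-of-range position: no match
            }
            wanted[idx] = code;
        }
        long matches = 0;
        for (int[] row : rows) {
            boolean ok = true;
            for (int i = 0; i < maxWords && ok; i++) {
                ok = wanted[i] == PAD || wanted[i] == row[i];
            }
            if (ok) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        WordCodeTable table = new WordCodeTable(6);
        table.add("Congress approved a new bill"); // stored as 1 2 3 4 5 0
        table.add("John went to school");          // stored as 6 7 8 9 0 0
        System.out.println(table.count(Map.of(1, "john", 4, "school"))); // prints 1
    }
}
```

Comparing integers instead of strings keeps rows small and fixed-width. The same layout could be carried over to integer columns in a database table, with an index on the positions that are queried most often.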
This is a somewhat sparse matrix, so maybe this question will help.