I am doing some linguistic research that depends on being able to query a


Asked: May 13, 2026


I am doing some linguistic research that depends on being able to query a corpus of 100 million sentences. The information I need from that corpus is along these lines: how many sentences have "john" as the first word, "went" as the second word and "hospital" as the fifth word, and so on. I only need the count; I don't need to actually retrieve the sentences.

The idea I had was to split these sentences into words and store them in a database, where the columns would be the word positions (word-1, word-2, word-3, etc.) and the rows would be the sentences. So it looks like:

Word1     Word2     Word3  Word4   Word5  ...
Congress  approved  a      new     bill
John      went      to     school
...

My purpose would then be fulfilled by running something like SELECT COUNT(*) ... WHERE Word1 = 'John' AND Word4 = 'school'. But I am wondering: can this be better achieved using Lucene (or some other tool)?

The program I am writing (in Java) will be doing tens of thousands of such queries on that 100-million-sentence corpus, so lookup speed is important.

Thanks for any advice,

Anas


1 Answer

Editorial Team · Answer 1 · 2026-05-13T17:16:32+00:00

I suggest you read "Search Engine versus DBMS". From what I gather, you need a database rather than a full-text search library.
In any case, I suggest you preprocess your text and replace every word/token with a number using a dictionary. This turns every sentence into an array of word codes. I would then store each word position in a separate database column, which makes the counts simpler and quicker.
For example:

A boy and a girl drank milk

translates into:

120 530 14 120 619 447 253

(I chose arbitrary word codes), so you would store the row

120 530 14 120 619 447 253 0 0 0 0 0 0 0 ….

(padded with zeros up to the number of words you allocate per sentence).
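As a rough illustration of the dictionary step, here is a minimal Java sketch. The class name `SentenceEncoder`, its methods, and the fixed width of 10 are all hypothetical choices for this example. It also assumes whitespace tokenization, case-insensitive matching (so "A" and "a" share a code, as in the example above), and that words beyond the allocated width are simply dropped. Codes are handed out sequentially starting at 1, with 0 reserved for padding, so the numbers differ from the arbitrary ones used above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps each distinct word to an integer code (0 = padding). */
final class SentenceEncoder {
    private final Map<String, Integer> codes = new HashMap<>();
    private final int maxWords;

    SentenceEncoder(int maxWords) {
        this.maxWords = maxWords;
    }

    /** Returns the code for a word, assigning the next unused code on first sight. */
    int codeFor(String word) {
        String key = word.toLowerCase(Locale.ROOT); // assumption: case-insensitive
        Integer code = codes.get(key);
        if (code == null) {
            code = codes.size() + 1; // codes start at 1; 0 is kept for padding
            codes.put(key, code);
        }
        return code;
    }

    /** Encodes a sentence as a fixed-width row, zero-padded on the right. */
    int[] encode(String sentence) {
        int[] row = new int[maxWords]; // Java zero-initialises int arrays
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return row;
        }
        String[] words = trimmed.split("\\s+");
        // Assumption: words past maxWords are dropped rather than rejected.
        for (int i = 0; i < words.length && i < maxWords; i++) {
            row[i] = codeFor(words[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        SentenceEncoder enc = new SentenceEncoder(10);
        // Prints [1, 2, 3, 1, 4, 5, 6, 0, 0, 0]
        System.out.println(Arrays.toString(enc.encode("A boy and a girl drank milk")));
    }
}
```

Each resulting array maps directly onto one database row, one integer column per word position. For a real corpus the dictionary itself would also need to be persisted so that query words can be translated to the same codes.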

This is a somewhat sparse matrix, so maybe this question will help.
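To make the kind of count this layout is meant to serve concrete, here is a small in-memory Java sketch. It is only a stand-in for the database query, not a recommendation to hold 100 million rows in memory. The class `PositionalCount` and its `count` method are hypothetical names, positions are 0-based, and the rows reuse the arbitrary codes from the example above (120 = "a", 530 = "boy", and so on). It uses `List.of` and `Map.of`, so it assumes Java 9 or later.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts encoded sentences matching position constraints. */
final class PositionalCount {

    /**
     * Counts rows in which row[position] == code for every (position, code)
     * entry in the constraints map. Positions are 0-based; a row shorter than
     * a constrained position never matches.
     */
    static long count(List<int[]> rows, Map<Integer, Integer> constraints) {
        long matches = 0;
        for (int[] row : rows) {
            boolean ok = true;
            for (Map.Entry<Integer, Integer> c : constraints.entrySet()) {
                int pos = c.getKey();
                int code = c.getValue();
                if (pos >= row.length || row[pos] != code) {
                    ok = false;
                    break; // one failed constraint rules the row out
                }
            }
            if (ok) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<int[]> rows = List.of(
                new int[]{120, 530, 14, 120, 619, 447, 253, 0},
                new int[]{120, 619, 14, 120, 530, 447, 253, 0},
                new int[]{88, 530, 0, 0, 0, 0, 0, 0});
        // First word has code 120 and fourth word has code 120: prints 2
        System.out.println(count(rows, Map.of(0, 120, 3, 120)));
    }
}
```

This is a linear scan, so every query touches every row. In a database the same constraints become equality tests on integer columns, which can be indexed.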
