Given document-D1: containing words (w1,w2,w3) and document D2 and words (w2,w3..) and document Dn

Asked: May 13, 2026

Given document D1 containing words (w1, w2, w3),
document D2 containing words (w2, w3, ...),
and document Dn containing words (w1, w2, wn):

Can I structure my data in Bigtable to answer questions like:
which words occur most frequently with w1,
or which words occur most frequently with w1 and w2?

What I am trying to achieve is to find the third word Wx (a suggestion) that occurs most frequently in documents together with the given words W1 and W2.

I know the solution in SQL, but is it possible with Google Bigtable?

I know I would have to build my indices myself. The question is how I should structure them to avoid index explosion.

thanks
almir

1 Answer

Editorial Team · Answer 1 · 2026-05-13T08:56:10+00:00

Using list properties and merge-join is the best way to answer set-membership questions on Google App Engine, as described in the presentation "Building Scalable, Complex Apps on App Engine".

You could set up your model as follows:

from google.appengine.ext import db

class Document(db.Model):
    word = db.StringListProperty()  # the words contained in the document
    name = db.StringProperty()

...

doc.word = ["google", "app", "engine"]
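The word list assigned above has to come from somewhere. As a minimal sketch, assuming documents start out as raw text and that a simple lower-cased alphanumeric tokenizer is good enough, the list could be built like this. The helper name `distinct_words` is hypothetical and not part of App Engine. Deduplicating keeps the list property, and so the number of index rows per entity, as small as possible.

```python
import re


def distinct_words(text):
    """Return the sorted, lower-cased distinct words found in text.

    Hypothetical helper: the tokenizer (runs of a-z and 0-9) is an
    assumption, not something the datastore requires.
    """
    return sorted(set(re.findall(r"[a-z0-9]+", text.lower())))


words = distinct_words("Google App Engine: the App Engine datastore")
```

The resulting list could then be assigned to `doc.word` before the entity is saved.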

Then it would be easy to query for co-occurrence. For example, which documents have the words google and engine?

results = db.GqlQuery(
    "SELECT * FROM Document "  # the kind name matches the model class
    "WHERE word = 'google' "
    "AND word = 'engine'")

docs = [d.name for d in results]
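The query returns the matching documents, but the question asks for the word Wx that most often accompanies the given words. One way to get there, sketched here in plain Python rather than as a datastore feature, is to count the remaining words across the matched documents. The function name `suggest_words` and the sample data are hypothetical; on App Engine the input could be something like `(d.word for d in results)`.

```python
from collections import Counter


def suggest_words(word_lists, given, top_n=3):
    """Rank words that co-occur with all of the given words.

    word_lists: iterable of word lists, one per document.
    given: the words that must all be present (e.g. W1 and W2).
    Returns up to top_n (word, document_count) pairs, most frequent first.
    """
    given = set(given)
    counts = Counter()
    for words in word_lists:
        unique = set(words)
        # Only documents containing every given word contribute.
        if given <= unique:
            counts.update(unique - given)
    return counts.most_common(top_n)


# Hypothetical sample data standing in for the `word` lists of entities.
sample_docs = [
    ["google", "app", "engine"],
    ["google", "engine", "app", "bigtable"],
    ["google", "engine", "app", "bigtable", "datastore"],
    ["app", "bigtable"],
]
suggestions = suggest_words(sample_docs, ["google", "engine"])
```

Note that this counts in application memory, so every matching entity has to be read and unpacked first.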

There are some limitations, though. From the presentation:

Index writes are done in parallel on Bigtable
- Fast, e.g., update a list property of 1000 items with 1000 row writes simultaneously!
- Scales linearly with number of items
- Limited to 5000 indexed properties per entity

But queries must unpackage all result entities
- When list size > ~100, reads are too expensive!
- Slow in wall-clock time
- Costs too much CPU

You could also create a separate model for words and store only their keys in the list property, but depending on the size of your documents even that might not be feasible.
