I am a complete beginner and am trying to implement a simple search engine in Python.
I have the tokenizer working well using functions from NLTK, but I am now confused about how to store the tokenizer's results. I need to keep them for later indexing.
What’s the common way to do this? What kind of database should I use?
Introduction to Information Retrieval by Manning, Raghavan and Schütze devotes several chapters to index construction and storage; so does Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto.
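For a simple hobby/study project, though, SQLite will suffice for index storage. You need one table that holds (term, document-id, frequency) triples, which gives you the term frequency (tf), and one that stores (term, df) pairs, where df is the number of documents containing the term. Both should have an index on the term column; together with the total number of documents, that’s enough to compute tf-idf.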
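As a rough illustration of that two-table layout, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`postings`, `doc_freq`), the column names, the toy `docs` corpus and the `tf_idf` helper are all made up for this example, and the weighting used (raw tf times log(N/df)) is just one of several common tf-idf variants. The primary keys give SQLite an index on the term column in both tables.

```python
import math
import sqlite3
from collections import Counter

# Hypothetical toy corpus: document id -> tokens (e.g. the output of your NLTK tokenizer).
docs = {
    1: ["the", "quick", "brown", "fox"],
    2: ["the", "lazy", "dog"],
    3: ["the", "quick", "dog", "dog"],
}

conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the index on disk
conn.executescript("""
    CREATE TABLE postings (
        term   TEXT    NOT NULL,
        doc_id INTEGER NOT NULL,
        freq   INTEGER NOT NULL,
        PRIMARY KEY (term, doc_id)   -- also serves as the index on term
    );
    CREATE TABLE doc_freq (
        term TEXT PRIMARY KEY,       -- indexed by term
        df   INTEGER NOT NULL
    );
""")

# Store one (term, document-id, frequency) triple per distinct term in each document.
for doc_id, tokens in docs.items():
    counts = Counter(tokens)
    conn.executemany(
        "INSERT INTO postings (term, doc_id, freq) VALUES (?, ?, ?)",
        [(term, doc_id, n) for term, n in counts.items()],
    )

# Derive the (term, df) pairs: df is the number of documents containing the term.
conn.execute("""
    INSERT INTO doc_freq (term, df)
    SELECT term, COUNT(*) FROM postings GROUP BY term
""")
conn.commit()


def tf_idf(conn, term, n_docs):
    """Return {doc_id: tf * log(n_docs / df)} for a single term."""
    row = conn.execute(
        "SELECT df FROM doc_freq WHERE term = ?", (term,)
    ).fetchone()
    if row is None:  # term does not occur in any document
        return {}
    idf = math.log(n_docs / row[0])
    rows = conn.execute(
        "SELECT doc_id, freq FROM postings WHERE term = ?", (term,)
    )
    return {doc_id: freq * idf for doc_id, freq in rows}


scores = tf_idf(conn, "dog", len(docs))
```

Here `scores` maps each document containing "dog" to its weight, and a term that appears in every document (such as "the") ends up with a weight of zero. For a real collection you would probably insert in batches inside a single transaction and update `doc_freq` incrementally rather than rebuilding it.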