I am working with text files and want to implement a search algorithm in Java. I have a text file I need to search.
If I want to find one word, I can put all the text into a HashMap and store each word's occurrences. But is there an algorithm I can use to search for two strings (or maybe more)? Should I hash the strings in pairs?
It depends a lot on the size of the text file. There are usually several cases you should consider:
Lots of queries on very short documents (web pages, essay-length texts, etc.) whose text is distributed like natural language. A simple algorithm with a quadratic worst case, O(n*m) for a text of length n and a query of length m, is fine. For a query of length m, take a window of length m and slide it over the text, comparing at each position and moving the window until you find a match. This algorithm does not care about words, so it treats the whole search as one big string (including spaces). This is probably what most browsers do. KMP or Boyer-Moore is not worth the effort here, since the quadratic case is very rare on natural-language text.
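The sliding-window idea above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation; the class name `NaiveSearch` and the method `indexOf` are made up for the example.

```java
/** Naive sliding-window substring search (illustrative names, not a library API). */
final class NaiveSearch {

    /** Returns the index of the first occurrence of query in text, or -1 if there is none. */
    static int indexOf(String text, String query) {
        int n = text.length();
        int m = query.length();
        // Try every window start at which the query still fits into the text.
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            // Compare the window with the query character by character.
            while (j < m && text.charAt(start + j) == query.charAt(j)) {
                j++;
            }
            if (j == m) {
                return start; // the whole query matched at this position
            }
        }
        return -1;
    }
}
```

Because the query is treated as one string, a two-word query such as `NaiveSearch.indexOf(text, "quick brown")` works without any special handling, as long as the words are adjacent in the text.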
Lots of queries on one large document. Preprocess your document and store it in preprocessed form. Common storage options are suffix trees and inverted lists (inverted indexes). If you have multiple documents, you can build one document from them by concatenating them and storing the document boundaries separately. This is the way to go for document databases where the collection is almost constant.
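As a rough sketch of the inverted-list option, the following maps each word to the set of documents that contain it and answers a multi-word query by intersecting those sets. The class `InvertedIndex`, its methods, the integer document ids, and the simple `\W+` tokenization are all assumptions made for this example; a real index would also store positions if phrase queries are needed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Minimal word-to-documents inverted index (illustrative, not a library API). */
final class InvertedIndex {
    // word -> sorted set of ids of the documents containing that word
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Preprocessing step: record every word of the document under its id. */
    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents that contain all of the given words. */
    Set<Integer> containingAll(String... words) {
        Set<Integer> result = null;
        for (String word : words) {
            Set<Integer> docs = postings.getOrDefault(
                    word.toLowerCase(Locale.ROOT), Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // copy so the index is not modified
            } else {
                result.retainAll(docs);         // intersect with the next word's list
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

With this layout, searching for two or more words does not require hashing word pairs: each word is looked up on its own and the posting sets are intersected, for example `index.containingAll("quick", "dog")`.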
If you have several documents with high redundancy and your collection changes often, use KMP or Boyer-Moore. For example, if you want to find certain sequences in DNA data and you often get new sequences to find as well as new DNA from experiments, the quadratic worst case of the naive algorithm would kill your running time.
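For reference, a compact sketch of KMP (Knuth-Morris-Pratt) might look like the following. The class and method names are hypothetical; the point is that the failure table lets the search continue after a mismatch without re-reading text characters, which gives linear time even on highly repetitive input such as DNA.

```java
/** Knuth-Morris-Pratt substring search (illustrative names, not a library API). */
final class KmpSearch {

    /** fail[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix. */
    static int[] failureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0; // length of the currently matched prefix
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1]; // fall back to a shorter prefix
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int indexOf(String text, String pattern) {
        if (pattern.isEmpty()) {
            return 0;
        }
        int[] fail = failureTable(pattern);
        int k = 0; // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1]; // reuse the partial match instead of restarting
            }
            if (text.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            if (k == pattern.length()) {
                return i - k + 1; // start index of the match
            }
        }
        return -1;
    }
}
```

The failure table depends only on the pattern, so it can be built once per query and reused across every document or DNA sample that is scanned.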
There are probably many more cases that call for different algorithms and data structures, so you should figure out which one is best in your situation.