I have written a C++ program to query a 100 GB dictionary. I have split the dictionary into n files of equal size, all placed in the same directory. The dictionary is fully indexed, i.e., once a query comes in I know which split file to open and where to seek. My question is: for better performance, which split is better?
(a) Small number of large files or (b) Large number of small files?
Also, what would be an ideal split?
I don’t think there’s a direct answer to that question; only experimenting can tell you. The cost of opening a file for reading should be roughly constant regardless of its size, while the cost of reading the contents of course depends on how much of the file you read.
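Since experimenting is the only reliable way to find out, here is a minimal sketch of how one lookup (open, seek, read, close) might be timed for a given split. The names `TimedRead` and `timed_lookup` are made up for illustration, and the sketch assumes the index already gives you a byte offset and a length for the entry.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <utility>

// Hypothetical helper type: the bytes read and the elapsed wall-clock time.
struct TimedRead {
    std::string data;
    double microseconds;
};

// Opens `path`, seeks to `offset`, reads up to `length` bytes and closes the
// file again, timing the whole sequence. Assumes the offset and length come
// from your own index.
TimedRead timed_lookup(const std::string& path, std::streamoff offset,
                       std::size_t length) {
    assert(offset >= 0);
    const auto start = std::chrono::steady_clock::now();

    std::string buffer(length, '\0');
    std::ifstream in(path, std::ios::binary);
    if (in) {
        in.seekg(offset);
        in.read(&buffer[0], static_cast<std::streamsize>(length));
        // Keep only the bytes that were actually read.
        buffer.resize(static_cast<std::size_t>(in.gcount()));
    } else {
        buffer.clear();  // file could not be opened
    }
    in.close();

    const auto stop = std::chrono::steady_clock::now();
    const double us =
        std::chrono::duration<double, std::micro>(stop - start).count();
    return {std::move(buffer), us};
}
```

Running this over a realistic mix of queries for each candidate split size, and repeating it with both a cold and a warm OS file cache, should give numbers that can actually be compared.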
There are other hints, though.
I will assume that when you get a query, you open the file, read it completely or until you find the word, then close the file and return the result. In that case there are several possible enhancements. Maybe you already have them, maybe not, but here goes:
You might need to cache your files, or your search queries, for better performance.
Check when a file is loaded into memory.
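The hint about caching search queries could be sketched as a small least-recently-used cache that maps a word to its result. This is only an illustration: `QueryCache` is a made-up name, and the sketch assumes both words and results fit in a `std::string`.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical LRU cache for query results (word -> result).
class QueryCache {
public:
    explicit QueryCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Returns true and copies the cached result if `word` is present,
    // marking it as the most recently used entry.
    bool get(const std::string& word, std::string& result) {
        auto it = index_.find(word);
        if (it == index_.end()) return false;
        entries_.splice(entries_.begin(), entries_, it->second);
        result = it->second->second;
        return true;
    }

    // Inserts or updates an entry, evicting the least recently used one
    // when the cache is full.
    void put(const std::string& word, const std::string& result) {
        auto it = index_.find(word);
        if (it != index_.end()) {
            it->second->second = result;
            entries_.splice(entries_.begin(), entries_, it->second);
            return;
        }
        if (entries_.size() == capacity_) {
            index_.erase(entries_.back().first);
            entries_.pop_back();
        }
        entries_.emplace_front(word, result);
        index_[word] = entries_.begin();
    }

    std::size_t size() const { return entries_.size(); }

private:
    using Entry = std::pair<std::string, std::string>;
    std::size_t capacity_;
    std::list<Entry> entries_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

The idea would be to call `get` before touching the disk and `put` after a successful lookup. The same structure could hold open file handles instead of results if opening files turns out to be the expensive part.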
A totally different approach would be to use a database with an index. That way you don’t have to deal with file-opening problems at all.