import collections
import os

def makecounter():
    return collections.defaultdict(int)

class RankedIndex(object):
    def __init__(self):
        self._inverted_index = collections.defaultdict(list)
        self._documents = []
        self._inverted_index = collections.defaultdict(makecounter)

    def index_dir(self, base_path):
        num_files_indexed = 0
        allfiles = os.listdir(base_path)
        self._documents = os.listdir(base_path)
        num_files_indexed = len(allfiles)
        docnumber = 0
        self._inverted_index = collections.defaultdict(list)
        docnumlist = []
        for file in allfiles:
            self.documents = [base_path+file] #list of all text files
            f = open(base_path+file, 'r')
            lines = f.read()
            tokens = self.tokenize(lines)
            docnumber = docnumber + 1
            for term in tokens:
                if term not in sorted(self._inverted_index.keys()):
                    self._inverted_index[term] = [docnumber]
                    self._inverted_index[term][docnumber] +=1
                else:
                    if docnumber not in self._inverted_index.get(term):
                        docnumlist = self._inverted_index.get(term)
                        docnumlist = docnumlist.append(docnumber)
            f.close()
        print '\n \n'
        print 'Dictionary contents: \n'
        for term in sorted(self._inverted_index):
            print term, '->', self._inverted_index.get(term)
        return num_files_indexed
        return 0

I get an IndexError when executing this code: list index out of range.
The above code generates a dictionary index that stores the ‘term’ as a key and the document numbers in which the term occurs as a list.
For example, if the term ‘cat’ occurs in documents 1.txt, 5.txt and 7.txt, the dictionary will have:
cat <- [1,5,7]
Now I have to modify it to add term frequency. So if the word cat occurs twice in document 1, three times in document 5 and once in document 7, the expected result is:
term <- [[docnumber, term freq], [docnumber, term freq]]   (a list of lists in a dict!)
cat <- [[1,2],[5,3],[7,1]]
I played around with the code, but nothing works. I have no clue how to modify this data structure to achieve the above.
Thanks in advance.
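First, use a factory. Start with a module-level makecounter() function that returns collections.defaultdict(int), and later, in __init__, use self._inverted_index = collections.defaultdict(makecounter) (and do not reassign it to collections.defaultdict(list) in index_dir, or the counting step below will fail on an empty list).

As the whole body of the for term in tokens: loop, use the single statement self._inverted_index[term][docnumber] += 1.

This leaves in each self._inverted_index[term] a dict such as {1: 2, 5: 3, 7: 1} in your example case. Since you want instead a list of lists in each self._inverted_index[term], then just after the end of the looping add a conversion step that replaces each inner dict with a list of [docnumber, term freq] pairs sorted by document number.

Once made (this way or any other; I’m just showing a simple way to construct it!), this data structure will then actually be as awkward to use as you needlessly made it difficult to construct, of course (the dict of dicts is much more useful and easy to use as well as to construct), but, hey, one man’s meat &c. ;-)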
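
The steps above can be sketched as follows, detached from the class and from the file handling. Only makecounter and the += 1 loop body come from the code under discussion; build_index, to_list_of_lists, and the sample docs are hypothetical names and data used purely for illustration, and the conversion shown is just one simple way to do it.

```python
import collections


def makecounter():
    # Factory: every new term gets its own {docnumber: count} counter.
    return collections.defaultdict(int)


def build_index(docs):
    # docs is assumed to be a list of token lists, one per document;
    # document numbers start at 1, as in the question.
    inverted_index = collections.defaultdict(makecounter)
    for docnumber, tokens in enumerate(docs, 1):
        for term in tokens:
            # Missing terms and missing docnumbers are created on demand.
            inverted_index[term][docnumber] += 1
    return inverted_index


def to_list_of_lists(inverted_index):
    # Turn each inner dict into [[docnumber, term_freq], ...],
    # sorted by document number.
    return dict(
        (term, [[d, counts[d]] for d in sorted(counts)])
        for term, counts in inverted_index.items()
    )


# Hypothetical sample data standing in for tokenized files.
docs = [
    ["cat", "dog", "cat"],  # document 1: "cat" occurs twice
    ["dog"],                # document 2
    ["cat"],                # document 3: "cat" occurs once
]
index = build_index(docs)
postings = to_list_of_lists(index)
```

With this sample data, postings["cat"] is [[1, 2], [3, 1]]. The dict-of-dicts form answers "how often does cat occur in document 1?" directly as index["cat"][1], whereas the list-of-lists form has to be scanned for the matching document number.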