Okay, so, say I have a text file (not necessarily containing every possible symbol) and I’d like to calculate the frequency of each symbol and, after calculating the frequency, I then need to access each symbol and its frequency from most frequent to least frequent. The symbols are not necessarily ASCII characters, they could be arbitrary byte sequences, albeit all of the same length.
I was considering doing something like this (in pseudocode):
function add_to_heap (symbol)
freq = heap.find(symbol).frequency
if (freq.exists? == true)
freq++
else
symbol.freq = 1
heap.insert(symbol)
MaxBinaryHeap heap
while somefile != EOF
symbol = read_byte(somefile)
heap.add_to_heap(symbol)
heap.sort_by_frequency()
while heap.root != empty
root = heap.extract_root()
do_stuff(root)
I was wondering: is there a better, simpler way to calculate and store how many times each symbol occurs in a file?
You can always use a HashMap isntead of the Heap. Like this you’ll be performing operations that are in O(1) for each symbol found instead of O(log n) wheres n is the number of items currently on the heap.
However, if te number of distinct symbols is bounded by a reasonable number (1 Byte is ideal, 2 Byte should be still fine), you can just use an array of that size and again have O(1) but with a significantly lower constant cost.