As a hobby project, I had like to implement a Morse code encoder and decoder in C++ and Python (both). I was wondering the right data structure I should use for that. Not only is this question related to this specific project, but in general, when one has to make predefined text replacements, what is the best and the fastest way of doing it?
I would avoid re-inventing any data structure if possible (and I think it is). Please note that this is purely a learning exercise and I have always wondered what would be the best way of doing this. I can store the code and the corresponding character in a dictionary perhaps, then iterate over the text and make replacements. Is this the best way of doing this or can I do better?
There’s no simple optimal structure – for any given fixed mapping, there might be fiendish bit-twiddling optimisations for that precise mapping, that are better or worse on different architectures and different inputs. A map/dictionary should be pretty good all the time, and the code is pretty simple.
My official advice is to stick with that. Lookup performance is very unlikely to be an issue for code like this, since most likely you can easily encode/decode faster than you can input/output.
As it’s a learning exercise, and you want to try out different possibilities: for the text -> morse you could use an array rather than a map/dictionary. Perhaps surprisingly, this is difficult to do in C++ and be completely portable. The following assumes that all uppercase letters have
charvalues greater thanA, which is not guaranteed by the standard:If you’re willing to assume that your code will only ever run on machines whose basic character set has the letters in a continuous run (true of ASCII, but not EBCDIC), you can tidy it up a bit:
To look up the character stored in the variable
c:The Python version can assume ASCII (I think), and you’d have to throw in some use of
ord.To cope with anything other than upper case letters, you need either a bigger array (to contain an entry for every possible char value), or else multiple arrays with bounds checks, special case code for punctuation and so on.