There are key self-contained algorithms, particularly cryptographic ones such as AES, RSA and SHA1, for which you can find many free implementations on the internet.
Some are written as clean, portable C.
Some are written to be fast, often with macros and explicit loop unrolling.
As far as I can tell, none of them try to be especially small, so I'm resigned to writing my own: specifically AES128 decryption and SHA1 for ARM Thumb-2. (I've verified this by compiling every implementation I could find for my target machine with GCC using -Os, -mthumb and similar options.)
What patterns and tricks can I use to do so?
Are there compilers or tools that can roll up code?
It depends on what kind of space you are trying to optimise: code or data. There are essentially three variants of AES128 commonly in use, each differing in the amount of precomputed lookup table space.
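To illustrate the table-space end of that trade-off, here is a minimal sketch that stores no S-box tables in flash at all and builds both the forward and inverse S-boxes in RAM at start-up. It walks the nonzero elements of GF(2^8) using 3 as a generator. The function name `aes_build_sboxes` and the choice to build both tables in one pass are my own assumptions for illustration, not taken from any particular library, and whether this actually saves space depends on how your target weighs RAM against flash.

```c
#include <stdint.h>

/* Rotate an 8-bit value left by s bits (0 < s < 8). */
static uint8_t rotl8(uint8_t x, unsigned s)
{
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/*
 * Hypothetical helper (name is illustrative): fill the AES S-box and its
 * inverse at run time, so no 256-byte constant tables are needed in ROM.
 */
void aes_build_sboxes(uint8_t sbox[256], uint8_t inv_sbox[256])
{
    uint8_t p = 1, q = 1;   /* loop invariant: p * q == 1 in GF(2^8) */

    do {
        /* p = p * 3; 3 generates every nonzero field element */
        p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1B : 0));

        /* q = q / 3, which equals multiplication by 0xF6 */
        q ^= (uint8_t)(q << 1);
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        if (q & 0x80)
            q ^= 0x09;

        /* affine transformation applied to the inverse q of p */
        uint8_t s = (uint8_t)(q ^ rotl8(q, 1) ^ rotl8(q, 2)
                                ^ rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63);

        sbox[p] = s;
        inv_sbox[s] = p;    /* decryption only needs this table */
    } while (p != 1);

    /* 0 has no multiplicative inverse, so it is handled separately */
    sbox[0] = 0x63;
    inv_sbox[0x63] = 0;
}
```

Call it once before any AES work with two 256-byte buffers. Note that AES128 decryption still needs the forward S-box for the key schedule, unless the expanded key is precomputed elsewhere.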