I use this regex on some input:
[^a-zA-Z0-9@#]
However, this ends up removing lots of HTML special characters within the input, such as
#227;, #1606;, #1588; (I had to remove the & prefix so that it wouldn't show up as the actual value).
Is there a way that I can convert them to their values so that they will satisfy the regex? I also have no idea why the text decided to be so big.
Given that your text appears to have numeric-coded, not named, entities, you can first convert your byte string that includes xml entity defs (ampersand, hash, digits, semicolon) to unicode:
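A minimal sketch of that conversion, assuming Python 3 (where `chr` plays the role of Python 2's `unichr`). The helper name `decode_numeric_entities` and the sample input `s` are made up for illustration; the result is bound to `u`:

```python
import re

def decode_numeric_entities(text):
    """Replace decimal references such as &#227; with the character they encode."""
    # Group 1 captures the digits between "&#" and ";".
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)

# Hypothetical sample input built from the code points mentioned in the question.
s = 'a&#227;b&#1606;c&#1588;d'
u = decode_numeric_entities(s)
print(u)
```

This only handles decimal references; hexadecimal ones (`&#xE3;`) and out-of-range numbers would need extra handling.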
If your terminal emulator can display arbitrary unicode glyphs, a print u will then show the decoded characters.
In any case, you can now, if you wish, use your original RE and you won’t accidentally “catch” the entities, only ASCII letters, digits, and the couple of punctuation characters you listed. (I’m not sure that’s what you really want; why only ASCII letters and not accented ones, for example? But if it is what you want, it will work.)
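To make the difference concrete, here is a small sketch that applies the RE from the question before and after decoding. It assumes the RE is used with `sub` to delete every matching character, which the question does not state explicitly; the variable names and sample strings are illustrative only:

```python
import re

# The character class from the question: anything that is NOT an ASCII
# letter, a digit, '@' or '#'.
pattern = re.compile(r'[^a-zA-Z0-9@#]')

raw = 'a&#227;b'        # entity still encoded
decoded = 'a\u00e3b'    # same text after decoding (a, a-tilde, b)

stripped_raw = pattern.sub('', raw)
stripped_decoded = pattern.sub('', decoded)

print(stripped_raw)      # a#227b  (only '&' and ';' are removed, leaving debris)
print(stripped_decoded)  # ab      (the accented letter is removed as one character)
```

On the raw input the entity is mangled into `#227`; on the decoded input the non-ASCII character is either kept or dropped as a whole, depending on what the character class allows.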
If you do have named entities in addition to the numeric-coded ones, you can also apply the
htmlentitydefs standard library module recommended in another answer (it only deals with named entities which map to Latin-1 code points, however).
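A sketch of handling both kinds of reference in one pass, assuming Python 3, where the `htmlentitydefs` module is available as `html.entities`. The helper name `decode_entities` and the sample string are hypothetical; unknown names are deliberately left untouched:

```python
import re
from html.entities import name2codepoint

def decode_entities(text):
    """Decode decimal (&#227;) and known named (&eacute;) entity references."""
    def replace(match):
        ref = match.group(1)
        if ref.startswith('#'):
            # Numeric reference: the digits after '#' are the code point.
            return chr(int(ref[1:]))
        if ref in name2codepoint:
            # Named reference found in the standard table.
            return chr(name2codepoint[ref])
        # Unknown name: leave the original text as it was.
        return match.group(0)
    return re.sub(r'&(#\d+|\w+);', replace, text)

print(decode_entities('caf&eacute; &#227; &nosuch;'))
```

In Python 3.4 and later, the standard library's `html.unescape` performs this kind of decoding (including hexadecimal references) without a hand-written RE, so it may be the simpler choice where it is available.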