I have a text file containing Arabic characters and some other characters (punctuation marks, numbers, English characters, … ).
How can I tell sed to remove all the characters in the file except the Arabic ones? In short, we typically tell sed to remove or replace some specific characters and print the rest, but now I am looking for a way to tell sed to print only my desired characters and remove everything else.
With GNU sed, you should be able to specify characters by their hex code. You can use those in a character class:
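A minimal sketch of such a character class, not necessarily the exact command originally posted. It assumes GNU sed (for the `\xHH` escapes), forces `LC_ALL=C` so that sed works on raw bytes, and starts the range at `0x01` on the assumption that a text file contains no NUL bytes. The function name `strip_ascii` is only a label for this example. Note that spaces are ASCII too, so they are removed along with everything else.

```shell
# Delete every ASCII byte (0x01-0x7F) on each line; needs GNU sed for \xHH.
# LC_ALL=C makes sed treat the input as raw bytes rather than decoded characters.
strip_ascii() {
  LC_ALL=C sed 's/[\x01-\x7F]//g'
}

# Example: only the non-ASCII bytes of the line survive.
printf '%s\n' 'abc سلام 123' | strip_ascii
```

sed never sees the trailing newline of each line in its pattern space, so line breaks are preserved.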
You should also be able to achieve the same effect with the `tr` command.
Both methods assume UTF-8 encoding of your input file. In UTF-8, every byte of a multi-byte character has its highest bit set, so you can simply strip everything that is a standard 7-bit ASCII character.
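The `tr` variant could look roughly like the sketch below. It is an illustration under assumptions, not the original command: the helper name `strip_ascii_tr` is made up, and the byte ranges deliberately skip the newline (`\012`) so that the line structure of the file survives, since `tr` would otherwise delete newlines as well.

```shell
# Delete all ASCII bytes except newline, using octal ranges.
# \000-\011 covers NUL..TAB, \013-\177 covers everything after newline up to DEL.
strip_ascii_tr() {
  LC_ALL=C tr -d '\000-\011\013-\177'
}

# Example: the Arabic word and the line break remain.
printf '%s\n' 'abc سلام 123' | strip_ascii_tr
```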
To remove everything except a well-defined set of characters, use a negated character class:
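As a hedged illustration of a negated class, the sketch below keeps every byte with the high bit set plus the plain space, and deletes the rest. It assumes GNU sed and `LC_ALL=C`, and the function name `keep_non_ascii_and_spaces` is hypothetical. Be aware that this keeps all non-ASCII characters, not only Arabic ones; restricting the class to the Arabic block (U+0600 to U+06FF) would need character-aware matching, which is not shown here.

```shell
# Delete every byte that is NOT in 0x80-0xFF and is NOT a space.
# The listed characters are the ones that survive.
keep_non_ascii_and_spaces() {
  LC_ALL=C sed 's/[^\x80-\xFF ]//g'
}

# Example: the prefix and the punctuation are dropped, the space between words stays.
printf '%s\n' 'x1:سلام دنیا!' | keep_non_ascii_and_spaces
```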
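Using a pattern like `[^…]\+` might improve the performance of the regex, since each substitution then removes a whole run of unwanted characters instead of a single one.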
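To make the `\+` idea concrete, here is a small sketch applied to the ASCII-stripping class. It assumes GNU sed, where `\+` means "one or more" in basic regular expressions, and the name `strip_ascii_runs` is made up for this example. The output is the same as matching one character at a time; whether it is actually faster has not been measured here.

```shell
# Same effect as deleting ASCII bytes one by one, but each match
# consumes a whole run of consecutive ASCII bytes.
strip_ascii_runs() {
  LC_ALL=C sed 's/[\x01-\x7F]\+//g'
}

# Example: both ASCII runs around the Arabic word are removed.
printf '%s\n' 'abc سلام 123' | strip_ascii_runs
```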