I'm making some changes to the Linux locale files in /usr/share/i18n/locales (like pt_BR) to change the default format of dates, times, numbers, etc. But since Unicode characters are written as strings in the <U9999> format, the text is very hard to read.
Here is a snippet of it:
LC_TIME
abday "<U0044><U006F><U006D>";"<U0053><U0065><U0067>";/
"<U0054><U0065><U0072>";"<U0051><U0075><U0061>";/
"<U0051><U0075><U0069>";"<U0053><U0065><U0078>";/
"<U0053><U00E1><U0062>"
So, how can I write a simple script (bash, python, perl, whatever) that converts this text, replacing the <Uxxxx> codes with the characters they stand for? (Yes, they are all code points below 255, and most are even below 127, i.e. plain ASCII.)
If I receive several answers, I'll accept the most elegant and/or the most thoroughly explained one (for example, one that explains the options and flags used in the commands).
As an example, the above text would be converted to:
LC_TIME
abday "Dom";"Seg";/
"Ter";"Qua";/
"Qui";"Sex";/
"Sáb"
Bonus points for another script that does the opposite: convert all characters of a given string to the <Uxxxx> format.
Thanks!
Using Fields
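Pieced together from the parts explained below, the one-liner looks roughly like this. Treat it as a sketch: it assumes gawk (strtonum() is a gawk extension), and the wrapper name decode_u is only there for illustration.

```shell
# Sketch assembled from the explanation below; assumes gawk, because
# strtonum() is a gawk extension. decode_u is a hypothetical wrapper name.
decode_u() {
  gawk -F'<U0+|>' '
    {
      for (i = 1; i <= NF; i++)                  # walk every field
        if ($i ~ "^[0-9A-F]+$")                  # field holds only hex digits
          $i = sprintf("%c", strtonum("0x" $i))  # hex code -> character
    }
    1                                            # always print the line
  ' OFS="" "$@"
}
```

Usage would be something like decode_u /usr/share/i18n/locales/pt_BR, or piping text into decode_u. Two caveats follow from the field separator: a code that does not start with a zero (say <U2019>) is not split off, and any run of pure hex digits that happens to sit between two separators gets converted as well. Depending on the locale, %c may also emit a single byte rather than a UTF-8 sequence for values above 127.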
Explanation
1. -F'<U0+|>': This is the magic that makes this script so short. We tell awk that the field separator is either <U0+ or a simple >. The benefit of doing this is that awk will auto-strip these characters for us, so we don’t have to do it manually with gsub() when it comes time to do the strtonum() conversion.
2. for(i=1;i<=NF;i++): iterate over each field.
3. if($i ~ "^[0-9A-F]+$"): check whether the current field is composed only of hex digits. Remember that, due to #1 above, something like <U006F> will be seen as 6F at this point.
4. $i=sprintf("%c", strtonum("0x"$i)): replace the hex digits with the corresponding character. We must prefix the field $i with "0x" so awk knows it's a hex value.
5. }1: shortcut for a mandatory print, i.e. always print each line.
6. OFS="": set the Output Field Separator to the null string. If we don’t do this, we will get spaces in the output everywhere there was a <U0+ or >.
Using match() [requires gawk]
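The same conversion can be done without touching the field separator. What follows is a rough sketch rather than a definitive version: it relies on gawk's three-argument match(), which stores the parenthesised group in an array, so each <Uxxxx> can be located and replaced in turn. The names decode_u_match, out, rest and m are only illustrative.

```shell
# Sketch only; assumes gawk. The array argument to match() and strtonum()
# are both gawk extensions. decode_u_match is a hypothetical wrapper name.
decode_u_match() {
  gawk '
    {
      out = ""
      rest = $0
      # Find the next <Uxxxx>; m[1] receives the hex digits inside it.
      while (match(rest, /<U([0-9A-Fa-f]+)>/, m)) {
        # Keep the text before the match, then append the decoded character.
        out = out substr(rest, 1, RSTART - 1) sprintf("%c", strtonum("0x" m[1]))
        # Continue scanning after the matched code.
        rest = substr(rest, RSTART + RLENGTH)
      }
      print out rest
    }
  ' "$@"
}
```

Because only complete <Uxxxx> tokens are matched, stray > characters and bare hex words elsewhere on the line are left alone.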