I have a huge XML file; a sample of its data is as follows:
<vendor name="aglaia">
<vendorOUI oui="000B91" description="Aglaia Gesellschaft für Bildverarbeitung ud Kommunikation m" />
</vendor>
<vendor name="ag">
<vendorOUI oui="0024A9" description="Ag Leader Technology" />
</vendor>
As can be seen, the text “Gesellschaft für Bildverarbeitung” is not UTF-8 compliant, and because of it I am getting errors from the XML validator, such as:
Import failed: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
So the question is: how can I convert the XML file to a UTF-8 compliant format in a Linux environment? Or is there a way in bash to ensure, while creating the XML in the first place, that all variables/strings are stored in a UTF-8 compliant format?
Use the character set conversion tool:
See gnu-page
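Assuming the conversion tool meant here is iconv, a minimal sketch of the conversion could look like the following. The source encoding (ISO-8859-1) is a guess based on the error message and the “ü” in the sample, and all file names are hypothetical; check the real encoding of your file before relying on it.

```shell
#!/bin/sh
# Sketch only: convert a file from an assumed source encoding to UTF-8.
# ASSUMPTIONS: the input is ISO-8859-1 (Latin-1); all file names are hypothetical.
workdir=$(mktemp -d)
src="$workdir/vendors-latin1.xml"
dst="$workdir/vendors-utf8.xml"

# Build a small test input; \374 is the octal escape for byte 0xFC ("ü" in Latin-1).
printf '<vendorOUI oui="000B91" description="Aglaia Gesellschaft f\374r Bildverarbeitung" />\n' > "$src"

# -f names the encoding to convert from, -t the encoding to convert to.
iconv -f ISO-8859-1 -t UTF-8 "$src" > "$dst"

# Passing the result through iconv again is a cheap validity check:
# iconv exits non-zero when its input is not valid in the -f encoding.
if iconv -f UTF-8 -t UTF-8 "$dst" > /dev/null 2>&1; then
    echo "valid UTF-8: $dst"
else
    echo "still not valid UTF-8: $dst" >&2
fi
```

For a real file, only the iconv line is needed, with your own input and output names. If the source encoding is something other than Latin-1, the -f value has to change accordingly.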
…and in the file http://standards.ieee.org/develop/regauth/oui/oui.txt “aglaia” (as in your example above) is reported as:
It seems like “ü” is the character that gets mangled.
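To check which lines actually carry suspicious bytes, something like the sketch below may help. It uses a temporary sample file as a hypothetical stand-in for oui.txt and assumes a reasonably recent grep that treats every byte as a character in the C locale.

```shell
#!/bin/sh
# Sketch only: list the line numbers that contain bytes outside printable ASCII.
# ASSUMPTION: $sample is a hypothetical stand-in for the real oui.txt.
sample=$(mktemp)
printf 'Ag Leader Technology\nAglaia Gesellschaft f\374r Bildverarbeitung\n' > "$sample"

# In the C locale each byte is one character, so [^ -~] matches any byte
# outside the range space..tilde (this also flags tabs and control characters).
# -a forces text mode, -n prefixes each match with its line number.
suspect_lines=$(LC_ALL=C grep -an '[^ -~]' "$sample" | LC_ALL=C cut -d: -f1)

echo "lines with non-ASCII bytes: $suspect_lines"
```

A line showing up here is not necessarily broken: valid UTF-8 text also contains bytes above 0x7F. The output only narrows down where to look.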
Update
When downloading “oui.txt” using wget, I see the character “ü” in the file. If you don’t, something is broken in your download. Consider using one of these instead:
wget --header='Accept-Charset: utf-8'
curl -o oui.txt
If none of the above works, just open the link in your favorite browser and do a “save as”. In that case, comment out the wget line in the script below.
I had success with the following script (update BEGIN & END to get a valid XML file)
Hope this helps!