I have an XML file containing text in several languages. I want to extract the text in just one language and store it in a separate file. How can I do this?
Here are the first lines of my file:
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4b">
<header creationtool="ORESAligner" creationtoolversion="1.0" datatype="plaintext" segtype="paragraph" adminlang="en-us" srclang="EN" o-tmf="ORES"/>
<body>
<tu tuid="55_100:6">
<prop type="session">55</prop>
<prop type="committee">3</prop>
<tuv xml:lang="EN">
<seg>RESOLUTION 55/100</seg>
</tuv>
<tuv xml:lang="AR">
<seg>القرار 55/100</seg>
</tuv>
<tuv xml:lang="ZH">
<seg>第55/100号决议</seg>
</tuv>
<tuv xml:lang="FR">
<seg>RÉSOLUTION 55/100</seg>
</tuv>
<tuv xml:lang="RU">
<seg>РЕЗОЛЮЦИЯ 55/100</seg>
</tuv>
<tuv xml:lang="ES">
<seg>RESOLUCIÓN 55/100</seg>
</tuv>
</tu>
</body>
</tmx>
Now, say I want just the English text. The desired output would be:
RESOLUTION 55/100
How should I use this script? I am new to working with XML files and don't know how to use this XPath expression. As far as I know, xmlstarlet is able to modify XML files, but I don't know how.
Extract English Nodes with XmlStarlet
You could use xmlstarlet to query your XML using XPath, and return just the nodes with an English-language attribute. For example:
Store Node Values in a File with Language Extension
If you want to store those values in some language-based file, then you could dump the values of each node found into a file with a language-based extension (e.g. “EN” for English).
With this example, the contents of all matching nodes will be written to /tmp/foo.EN for further processing. You can certainly adjust the shell redirection to suit any additional requirements.