I have a file that contains corrupt XML. There are garbage characters at the end of each line that I want to get rid of, because they prevent me from using Python's XML parser. Example:
<request><pair><name>q</name><value><![CDATA[LOL]]></value></pair><pair><name>start</name><value>1</value></pair></request>�J I�i�Y�Y��'z�3�u�J�5��}���#Q/k;!�ˑ�9Q){_������ŐF
<request><pair><name>q</name><value><![CDATA[LOL2]]></value></pair><pair><name>start</name><value>1</value></pair></request>4/lIT�l��'�c�Oֲ�{�;��_?��(>͏Y�mP��
How can I remove the garbage characters after </request>? In other words, how can I remove everything between </request> and the next <request>?
Please note that everything from <request> to </request> is on a single line, so
Code:
awk '/<request>/ , /<\/request>/' test.txt
does not work.
My goal is to extract the value where the name is “q” (LOL and LOL2 in this case). If that can be done easily, I don't need to remove the junk characters at all.
Thank you for your time.
You can extract the data using lxml and XPath expressions.
I tried this on your XML sample, and my output is:
'LOL LOL2'
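If lxml is not available, the same result can probably be reached with the standard library alone. The following is only a rough sketch, not the lxml approach mentioned above: it assumes each record sits on one line, starts at <request>, and ends at the first </request>, and that anything after that is junk. The function name extract_q_values is made up for this example.

```python
import xml.etree.ElementTree as ET

START_TAG = b"<request>"
END_TAG = b"</request>"


def extract_q_values(lines):
    """Return the <value> text of every <pair> whose <name> is 'q'.

    `lines` is any iterable of bytes objects, e.g. a file opened in
    binary mode. Working on bytes avoids decode errors caused by the
    trailing garbage.
    """
    values = []
    for raw in lines:
        start = raw.find(START_TAG)
        end = raw.find(END_TAG)
        if start == -1 or end == -1 or end < start:
            continue  # no complete record on this line, skip it
        # Keep only <request>...</request>; drop the junk after it.
        record = raw[start:end + len(END_TAG)]
        root = ET.fromstring(record)
        for pair in root.iter("pair"):
            if pair.findtext("name") == "q":
                values.append(pair.findtext("value"))
    return values
```

Passing a file opened with open("test.txt", "rb") to extract_q_values and joining the result with spaces should give the same 'LOL LOL2' for the sample in the question. The file is read in binary mode on purpose, since the junk bytes are unlikely to be valid UTF-8.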