I have a file which contains corrupt XML, There are some garbage characters at

Question

Asked: May 27, 2026 (2026-05-27T23:17:55+00:00)

I have a file that contains corrupt XML. There are some garbage characters at the end of each line that I want to get rid of. These garbage characters prevent me from using Python's XML parser. Example:

<request><pair><name>q</name><value><![CDATA[LOL]]></value></pair><pair><name>start</name><value>1</value></pair></request>�J I�i�Y�Y��'z�3�u�J�5��}���#Q/k;!�ˑ�9Q){_������ŐF
<request><pair><name>q</name><value><![CDATA[LOL2]]></value></pair><pair><name>start</name><value>1</value></pair></request>4/lIT�l��'�c�Oֲ�{�;��_?��(>͏Y�mP��

How can I remove the garbage characters after </request>? In other words, how do I remove the string between </request> and the next <request>?

Please note that everything from <request> to </request> is on a single line, so the following does not work:

awk '/<request>/ , /<\/request>/' test.txt

My goal is to extract the value when the name is "q" (LOL and LOL2 in this case). If that can be done easily, I don't need to remove the junk characters at all.

Thank you for your time.

1 Answer

Editorial Team · Answer 1 · 2026-05-27T23:17:56+00:00

You can extract the data using lxml and XPath expressions:

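from lxml import etree

source_xml = "path to your xml file"
# etree.parse() expects well-formed XML; the XPath below assumes the
# <request> elements sit under a <document> root element.
et = etree.parse(source_xml)
# Select the text of every <value> whose sibling <name> is 'q'.
value = et.xpath("//document/request/pair[name='q']/value/text()")
print(" ".join(value))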

I tried this using your given XML sample, and my output is 'LOL LOL2'.
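If the trailing garbage prevents the file from being parsed at all, one possible workaround is to cut out each <request>...</request> span first and parse only that. The following is a minimal sketch using just the standard library, not a definitive solution. It assumes the garbage never happens to contain a literal <request> or </request> tag, and the names REQUEST_RE and extract_q_values are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches one complete <request>...</request> element. Whatever follows the
# closing tag (the garbage bytes) is simply never matched.
# Assumption: the garbage itself contains no literal <request> tags.
REQUEST_RE = re.compile(rb"<request>.*?</request>", re.DOTALL)


def extract_q_values(data):
    """Return the text of every <value> whose sibling <name> is 'q'.

    `data` is the raw file content as bytes, so undecodable garbage
    does not raise a decoding error before we get to look at it.
    """
    values = []
    for match in REQUEST_RE.finditer(data):
        # Each match is a small, well-formed XML document on its own.
        request = ET.fromstring(match.group(0))
        for pair in request.findall("pair"):
            if pair.findtext("name") == "q":
                values.append(pair.findtext("value"))
    return values
```

To use it on the file from the question, read test.txt in binary mode (open("test.txt", "rb")) and pass the result of read() to extract_q_values. Working on bytes rather than text is a deliberate choice here, since the junk after </request> is unlikely to be valid in any single encoding.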
