I have this partial XML
string = '''
<x:root>
<x:tag1 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue" />
<x:tag2 x:anyAttrib="anyValue" x:anyAttrib="anyValue" x:anyAttrib="anyValue">
someValue
</x:tag2>
<x:tag3> someValue
'''
Now I would like to “stupidly” repair it.
I have thought of a way- regexing all of the start elements and ending element –> checking which element is missing and just add it. with out getting into too much of details of course.
what I’ve come with so far is (and this does not work):
import re
starts = re.compile('(?<=<)x:\w+(?=>)|(?<=<)x:\w+(?! .+ />)')
print(start.findall(string))
what I expect is a list of x:root , x:tag2 , x:tag3
I’ve been googling and trying alot but could not find an answer.
They only thing I get from this expression is x:root , x:tag1 , x:tag3.
Please help
Thanks
Thanks for alexis for helping me.
The correct expression is:
Using this expression, you’ll be able to extract both cases:
first
<tag>second
<tag attrib1="value" attrib2="value" attribN="value"/>I tried to use some built-in python parsers with no luck, including Beautifulsoup which unfortunately did not fix the XML exactly the way I expected it to.
Have a good one! 🙂