I have some sgml files that are roughly standardized. However, there can be data

Asked: May 18, 2026

I have some SGML files that are roughly standardized. However, a file can contain data inside a tag that I do not know exists until I open the file and read it myself. For example, the files have addresses, and generally an address has a street, a city, a state, a zip and a phone. Each element of the address is indicated with a tag:

 <ADDRESS>
 <STREET>One Main Street
 <CITY>Gotham City
 <ZIP>99999 0123
 <PHONE>555-123-5467
 </ADDRESS>

But, for example, I have discovered that there are also tags for COUNTRY, STREET1 and STREET2. I have over 200K files to process, and I want to know if it is possible to pull out all of the elements of the addresses without having to know in advance which tags exist.

What I have done so far is:

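# assuming fromstring comes from lxml.html, which provides text_content()
from lxml.html import fromstring

h = fromstring(my_data_in_a_string)
for each in h.cssselect('mail_address'):
    print(each.text_content())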

But what I get is problematic, because I can't identify where one element ends and the next begins:

One Main StreetGotham City99999 0123555-123-5467

1 Answer

Editorial Team · Answer 1 · 2026-05-18T03:15:59+00:00

To get all the tags, including ones you did not know about, iterate over the whole document.

Suppose your XML structure is like this:

<ADDRESS>
 <STREET>One Main Street</STREET>
 <CITY>Gotham City</CITY>
 <ZIP>99999 0123</ZIP>
 <PHONE>555-123-5467</PHONE>
 </ADDRESS>

We parse it:

>>> from lxml import etree
>>> f = etree.parse('foo.xml')  # path to the XML file
>>> root = f.getroot()          # get the root element
>>> for tags in root.iter():    # iterate over the root and all its descendants
...     print(tags.tag)         # print each tag name
... 
ADDRESS
STREET
CITY
ZIP
PHONE

Now suppose your XML has extra tags as well, tags you are not aware of. Since we iterate over every element in the XML, the above code returns those tags too.

<ADDRESS>
 <STREET>One Main Street</STREET>
 <STREET1>One Second Street</STREET1>
 <CITY>Gotham City</CITY>
 <ZIP>99999 0123</ZIP>
 <PHONE>555-123-5467</PHONE>
 <COUNTRY>USA</COUNTRY>
</ADDRESS>

The above code returns:

ADDRESS
STREET
STREET1
CITY
ZIP
PHONE
COUNTRY
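
With many files, the same iteration can be used to tally which tags actually occur before writing any extraction logic. The following is only a minimal sketch, assuming well-formed XML like the examples above. It uses the standard library's xml.etree.ElementTree (its iter() behaves like the lxml call shown here) so that it runs without extra installs; the function name count_tags and the sample strings are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_tags(documents):
    """Count how often each tag name appears across several XML strings."""
    counts = Counter()
    for doc in documents:
        root = ET.fromstring(doc)
        # iter() yields the root element and every descendant
        counts.update(el.tag for el in root.iter())
    return counts


# Hypothetical sample data standing in for the contents of real files
samples = [
    "<ADDRESS><STREET>One Main Street</STREET><CITY>Gotham City</CITY></ADDRESS>",
    "<ADDRESS><STREET1>One Second Street</STREET1><COUNTRY>USA</COUNTRY></ADDRESS>",
]

tag_counts = count_tags(samples)
print(sorted(tag_counts))
```

Sorting the keys of the counter gives the full list of distinct tags seen so far, including ones such as STREET1 or COUNTRY that only appear in some files.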

Now, if we want the text of each tag, the procedure is the same. Just print tags.text instead (the first line of output is blank because the ADDRESS element itself contains only whitespace):

>>> for tags in root.iter():
...     print(tags.text)
... 

One Main Street
One Second Street
Gotham City
99999 0123
555-123-5467
USA
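
To keep each value attached to its tag, which avoids the run-together output described in the question, the tag and text can be collected as pairs. This is a sketch rather than a definitive implementation: it assumes well-formed XML with closing tags, uses the standard library's xml.etree.ElementTree in place of lxml so it runs without extra installs, and address_fields is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def address_fields(address):
    """Return (tag, text) pairs for every direct child of an address element."""
    fields = []
    for child in address:
        # child.text can be None for an empty element
        text = (child.text or "").strip()
        fields.append((child.tag, text))
    return fields


# Sample data copied from the example above
xml_data = """<ADDRESS>
 <STREET>One Main Street</STREET>
 <STREET1>One Second Street</STREET1>
 <CITY>Gotham City</CITY>
 <ZIP>99999 0123</ZIP>
 <PHONE>555-123-5467</PHONE>
 <COUNTRY>USA</COUNTRY>
</ADDRESS>"""

root = ET.fromstring(xml_data)
fields = address_fields(root)
for tag, text in fields:
    print(tag, "=", text)
```

Pairs are used instead of a dictionary so that a tag appearing more than once inside one address is not silently overwritten.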
