I have some SGML files that are roughly standardized. However, a file can contain tags that I don't know exist until I open it and read it myself. For example, the files have addresses, and an address generally has a street, a city, a state, a zip, and a phone. Each element of the address is marked with a tag:
<ADDRESS>
<STREET>One Main Street
<CITY>Gotham City
<ZIP>99999 0123
<PHONE>555-123-5467
</ADDRESS>
But I have discovered, for example, that there are also tags for Country, STREET1, and STREET2. I have over 200K files to process, and I want to know whether it is possible to pull out all of the elements of the addresses without knowing in advance which tags exist.
What I have done so far is:
from lxml.html import fromstring

h = fromstring(my_data_in_a_string)
for each in h.cssselect('mail_address'):
    each.text_content()
But the result is problematic because I can’t tell where one element ends and the next begins:
One Main StreetGotham City99999 0123555-123-5467
To get all the tags, we iterate through the document like this:
Suppose your XML structure is like this:
We parse it:
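As a minimal sketch of the steps above, the following parses a sample record and walks every element to collect its tag name. It uses the standard library's `xml.etree.ElementTree` rather than lxml (lxml elements offer the same `iter()`, `tag`, and `text`), and it assumes that every tag is explicitly closed, which may not hold for raw SGML. The `sample` string is a hypothetical record modeled on the address in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; assumes every tag is explicitly closed.
sample = """<ADDRESS>
  <STREET>One Main Street</STREET>
  <CITY>Gotham City</CITY>
  <ZIP>99999 0123</ZIP>
  <PHONE>555-123-5467</PHONE>
</ADDRESS>"""

root = ET.fromstring(sample)

# iter() yields the root and then every descendant in document order,
# so no tag name has to be known in advance.
tags = [element.tag for element in root.iter()]
print(tags)
# ['ADDRESS', 'STREET', 'CITY', 'ZIP', 'PHONE']
```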
Now suppose your XML has extra tags as well, tags you are not aware of. Since we are iterating through the whole XML tree, the above code will return those tags as well.
The above code returns:
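For a record that carries extra tags, the result simply includes them. The sketch below illustrates this by tallying tag names across a couple of hypothetical records, which is one way to discover unknown tags over a large batch of files. The record contents and the use of `collections.Counter` are assumptions for illustration, and the input is again assumed to be well-formed.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical records; the second one carries tags the first does not.
records = [
    "<ADDRESS><STREET>One Main Street</STREET><CITY>Gotham City</CITY></ADDRESS>",
    "<ADDRESS><STREET1>Two Side Street</STREET1><STREET2>Suite 5</STREET2>"
    "<CITY>Metropolis</CITY><COUNTRY>US</COUNTRY></ADDRESS>",
]

tag_counts = Counter()
for record in records:
    root = ET.fromstring(record)
    # Count every tag below the root, whether or not we expected it.
    tag_counts.update(el.tag for el in root.iter() if el is not root)

print(sorted(tag_counts))
# ['CITY', 'COUNTRY', 'STREET', 'STREET1', 'STREET2']
```

Running a tally like this over all files first would show which tags exist before any extraction logic is written.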
Now if we want to get the text of the tags, the procedure is the same. Just print tag.text like this:
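A minimal sketch of that step, under the same assumptions as before (standard-library `xml.etree.ElementTree`, well-formed input, and a hypothetical `sample` record): each element's tag is paired with its own text, so the values stay separated instead of running together as they do with `text_content()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; assumes every tag is explicitly closed.
sample = """<ADDRESS>
  <STREET>One Main Street</STREET>
  <CITY>Gotham City</CITY>
  <ZIP>99999 0123</ZIP>
  <PHONE>555-123-5467</PHONE>
</ADDRESS>"""

root = ET.fromstring(sample)

fields = []
for tag in root.iter():
    # tag.text is None for empty elements and whitespace for containers.
    text = (tag.text or "").strip()
    if text:  # skip container elements such as ADDRESS
        fields.append((tag.tag, text))

for name, value in fields:
    print(f"{name}: {value}")
# STREET: One Main Street
# CITY: Gotham City
# ZIP: 99999 0123
# PHONE: 555-123-5467
```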