I am trying to build a parser and save the results as an xml

Question

Editorial Team

Asked: May 14, 2026 (2026-05-14T23:15:58+00:00)

I am trying to build a parser and save the results as an XML file, but I am running into problems.

Could you please have a look at my code?

Traceback: TypeError: expected string or buffer

import urllib2, re
from xml.dom.minidom import Document
from BeautifulSoup import BeautifulSoup as bs
osc = open('OSCTEST.html','r')
oscread = osc.read()
soup=bs(oscread)
doc = Document()
root = doc.createElement('root')
doc.appendChild(root)
countries = doc.createElement('countries')
root.appendChild(countries)
findtags1 = re.compile ('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>', re.DOTALL |  re.IGNORECASE).findall(soup)
findtags2 = re.compile ('<span class="content_text">(.*?)</span>', re.DOTALL |  re.IGNORECASE).findall(soup)
for header in findtags1:
    title_elem = doc.createElement('title')
    countries.appendChild(title_elem)
    header_elem = doc.createTextNode(header)
    title_elem.appendChild(header_elem)
for item in findtags2:
    art_elem = doc.createElement('artikel')
    countries.appendChild(art_elem)
    s = item.replace('<P>','')
    t = s.replace('</P>','')
    text_elem = doc.createTextNode(t)
    art_elem.appendChild(text_elem)

print doc.toprettyxml()

1 Answer

Editorial Team · Answer 1 · 2026-05-14T23:15:58+00:00

It’s good that you’re trying to use BeautifulSoup to parse the HTML, but this won’t work:

re.compile('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>',
           re.DOTALL | re.IGNORECASE).findall(soup)

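You’re trying to parse a BeautifulSoup object using a regular expression. The findall method of a compiled pattern expects a string, and soup is not one, which is why you get "TypeError: expected string or buffer". Instead you should be using the findAll method on the soup, like this: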

regex = re.compile('^title metadata_title content_perceived_text', re.IGNORECASE)
for tag in soup.findAll('h1', attrs = { 'class' : regex }):
    print tag.contents

If you do actually want to parse the document as text with a regular expression, then don’t use BeautifulSoup; just read the document into a string (oscread in your code) and run the regular expression on that. But I’d suggest you take the time to learn how BeautifulSoup works, as this is the preferred way to do it. See the documentation for more details.
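For completeness, here is a minimal sketch of the regular-expression route, applied to a plain string rather than to the soup. It is written for Python 3 and uses only the standard library. The sample HTML, the `build_xml` helper, and the `[^>]*>` added to the heading pattern are assumptions made for illustration; they are not taken from your OSCTEST.html.

```python
import re
from xml.dom.minidom import Document

# Hypothetical sample standing in for the contents of OSCTEST.html.
html = (
    '<h1 class="title metadata_title content_perceived_text">Germany</h1>'
    '<span class="content_text"><P>First article.</P></span>'
    '<span class="content_text"><P>Second article.</P></span>'
)

# Both patterns are run against a plain string, so findall() accepts the input.
# The "[^>]*>" part skips the rest of the opening <h1 ...> tag (an assumption
# about the markup).
title_re = re.compile(
    r'<h1 class="title metadata_title content_perceived_text[^>]*>(.*?)</h1>',
    re.DOTALL | re.IGNORECASE)
text_re = re.compile(
    r'<span class="content_text">(.*?)</span>', re.DOTALL | re.IGNORECASE)


def build_xml(source):
    """Build <root><countries>...</countries></root> from an HTML string."""
    doc = Document()
    root = doc.createElement('root')
    doc.appendChild(root)
    countries = doc.createElement('countries')
    root.appendChild(countries)

    # One <title> element per matching heading.
    for header in title_re.findall(source):
        title_elem = doc.createElement('title')
        title_elem.appendChild(doc.createTextNode(header.strip()))
        countries.appendChild(title_elem)

    # One <artikel> element per content span, with <P> and </P> removed.
    for item in text_re.findall(source):
        art_elem = doc.createElement('artikel')
        text = re.sub(r'</?p>', '', item, flags=re.IGNORECASE)
        art_elem.appendChild(doc.createTextNode(text.strip()))
        countries.appendChild(art_elem)

    return doc


doc = build_xml(html)
print(doc.toprettyxml())
```

To save the result as a file, the string returned by doc.toprettyxml() can be written out with an ordinary open(..., 'w') call. The regular expressions here depend on the exact attribute text in the page, so this approach is more fragile than searching the soup with findAll.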
