I’m working on a web app that pulls in a list of tweets through a Python script. When I pull in a tweet that contains an em dash, I’m unable to parse the XML file.
My script is:
#! /usr/bin/python
import cgi
from peewee import *
from sql_connect import *
import sql_connect
import sys
xmlString = ""
# Create XML string
xmlString += "<TweetList>"
tweets = Tweet_Info.select()
for tweet in tweets:
    xmlString += "<Tweet>"
    xmlString += "<UserName>"
    xmlString += tweet.user
    xmlString += "</UserName>"
    xmlString += "<UserImage>"
    xmlString += tweet.user_image_url
    xmlString += "</UserImage>"
    xmlString += "<Text>"
    xmlString += tweet.text
    xmlString += "</Text>"
    xmlString += "</Tweet>"
xmlString += "</TweetList>"
# Print beginning xml stuff
print "Content-Type: text/xml"
print
print '<?xml version="1.0" encoding="UTF-8"?>'
print xmlString
The error it gives when I load the python script in the browser is:
XML Parsing Error: no element found
Location: http://localhost/cgi-bin/GetTweets2.py
Line Number 2, Column 1:
I feel like the solution to this is probably fairly simple. I’ve tried using a variety of different encoding types for the XML, but with no success. Is there a specific encoding type that I should use? Or is there a simple way of filtering out a special character that I’m missing?
If you’re going to be generating XML, it’s a much better idea to do it the Right Way: create a data structure that holds the data you want to serialize, and then convert that to XML using built-in Python functionality. This approach also has the advantage that you won’t have to worry so much about encoding errors and weird input. (Think about what would happen in your current script if a tweet contained the text </Text>.)
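As a rough illustration of that idea, here is a minimal sketch, assuming Python 3 and the standard library's xml.etree.ElementTree module. The function name build_tweet_xml and the plain tuples standing in for your Tweet_Info rows are hypothetical; adapt them to your own model.

```python
import xml.etree.ElementTree as ET


def build_tweet_xml(tweets):
    """Build a <TweetList> document and return it as UTF-8 encoded bytes.

    `tweets` is assumed to be an iterable of (user, image_url, text)
    tuples; this is a stand-in for the rows of the real Tweet_Info model.
    """
    root = ET.Element("TweetList")
    for user, image_url, text in tweets:
        tweet_el = ET.SubElement(root, "Tweet")
        # Assigning to .text lets the library escape &, < and > for us.
        ET.SubElement(tweet_el, "UserName").text = user
        ET.SubElement(tweet_el, "UserImage").text = image_url
        ET.SubElement(tweet_el, "Text").text = text
    # Encoding to UTF-8 here means non-ASCII characters such as an
    # em dash (U+2014) are emitted as valid UTF-8 bytes.
    return ET.tostring(root, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical sample data, including an em dash and a stray closing tag.
    sample = [("alice", "http://example.com/a.png", "before \u2014 after </Text>")]
    print(build_tweet_xml(sample).decode("utf-8"))
```

In a CGI script you would presumably still print the Content-Type header and a blank line first, then write the returned bytes to the output stream (in Python 3, sys.stdout.buffer.write accepts bytes). Because the tree is serialized by the library, a tweet containing </Text> or an ampersand ends up escaped instead of breaking the document.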