I have a Python WSGI script that attempts to extract an RSS item that is posted to it and store the RSS in a SQLite3 database. I am using flup as the WSGIServer.
To obtain the posted content: postData = environ['wsgi.input'].read(int(environ['CONTENT_LENGTH']))
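For reference, a slightly more defensive way to read the body is sketched below. It assumes Python 3 (the original runs on Python 2.5), and `read_post_body` and `fake_environ` are hypothetical names used only for illustration. The WSGI spec allows `CONTENT_LENGTH` to be empty or absent, so the sketch treats those cases as an empty body rather than letting `int()` raise.

```python
import io


def read_post_body(environ):
    """Return the raw request body as bytes (hypothetical helper)."""
    try:
        # CONTENT_LENGTH may be missing or an empty string.
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    if length <= 0:
        return b""
    # Read only the declared number of bytes from the input stream.
    return environ["wsgi.input"].read(length)


# Stand-in environ for illustration; a real one comes from the WSGI server.
fake_environ = {"CONTENT_LENGTH": "5", "wsgi.input": io.BytesIO(b"hello world")}
body = read_post_body(fake_environ)
```

The returned value is raw bytes, so any BOM and encoding handling still has to happen before the data is treated as text.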
To attempt to store in the db:
from pysqlite2 import dbapi2 as sqlite
ldb = sqlite.connect('/var/vhost/mysite.com/db/rssharvested.db')
lcursor = ldb.cursor()
lcursor.execute('INSERT into rss(data) VALUES(?)', (postData,))
This results in only the first few characters of the RSS being stored in the record: ÿþ<. I believe the initial characters are the BOM of the RSS.
I have tried every permutation I could think of, including first encoding the RSS as UTF-8 and then attempting to store it, but the results were the same. I could not decode it because some characters could not be represented as Unicode.
Running Python 2.5.2, SQLite 3.5.7.
Thanks in advance for any insight into this problem.
Here is a sample of the initial data contained in postData as modified by the repr function, written to a file and viewed with less:
'\xef\xbb\xbf
Thanks for all the replies! Very helpful.
The sample I submitted didn't make it through the Stack Overflow HTML filters. I will try again, converting the less-than and greater-than signs to entities (the preview indicates this works).
\xef\xbb\xbf<?xml version='1.0' encoding='utf-16'?><rss xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns:xsd='http://www.w3.org/2001/XMLSchema'><channel><item d3p1:size='0' xsi:type='tFileItem' xmlns:d3p1='http://htinc.com/opensearch-ex/1.0/'>
Before the SQL insertion, you should convert the string to a Unicode-compatible string. If that raises a UnicodeError exception, encode the string with string.encode('utf-8').
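A minimal sketch of that idea, with some loudly stated assumptions: it targets Python 3 and the standard-library `sqlite3` module rather than the Python 2.5 / `pysqlite2` setup in the question, it uses an in-memory database for illustration, and `store_rss` is a hypothetical helper name. It tries to decode the posted bytes as UTF-8 (dropping a leading UTF-8 BOM) and, if that fails, falls back to storing the raw bytes as a BLOB instead of a text value.

```python
import sqlite3


def store_rss(conn, post_data):
    """Insert posted bytes as text if they decode cleanly, else as a BLOB."""
    try:
        # "utf-8-sig" strips a leading UTF-8 BOM (\xef\xbb\xbf) if present.
        value = post_data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Fallback: keep the raw bytes; sqlite3 stores bytes as a BLOB.
        value = post_data
    conn.execute("INSERT INTO rss(data) VALUES (?)", (value,))
    conn.commit()
    return value


# In-memory database and table created here only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rss (data)")
sample = b"\xef\xbb\xbf<?xml version='1.0'?><rss></rss>"
stored = store_rss(conn, sample)
row = conn.execute("SELECT data FROM rss").fetchone()
```

Storing undecodable input as a BLOB keeps every byte intact, at the cost of having to decode it later when the real encoding is known.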
Or you can autodetect the encoding and decode the data according to its detected encoding scheme. See: Auto detect encoding
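One simple form of autodetection is to look at the byte order mark, sketched below under a few assumptions: Python 3, a hypothetical helper name `decode_with_bom`, and a fallback to UTF-8 when no BOM is found. It trusts the BOM rather than the `encoding` attribute of the XML declaration, which may or may not be right for a given feed, and it is only a heuristic, not a full encoding detector.

```python
import codecs

# Check longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data, default="utf-8"):
    """Decode bytes using the encoding implied by a leading BOM, if any."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Drop the BOM itself before decoding the remaining bytes.
            return data[len(bom):].decode(encoding)
    # No BOM found: assume the default encoding (an assumption of this sketch).
    return data.decode(default)


utf8_text = decode_with_bom(b"\xef\xbb\xbf<rss/>")
utf16_text = decode_with_bom(codecs.BOM_UTF16_LE + "<rss/>".encode("utf-16-le"))
```

Both calls yield the same text without the BOM, which can then be passed to the INSERT as a Unicode string.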