Requirements:
I have a Python project which parses data feeds from multiple sources in varying formats (Atom, valid XML, invalid XML, CSV, almost-garbage, etc…) and inserts the resulting data into a database. The catch is that the information required to parse each of the feeds must also be stored in the database.
Current solution:
My previous solution was to store small Python scripts which are eval'd on the raw data and return a data object for the parsed data. I'd really like to get away from this method, as it obviously opens up a nasty security hole.
Ideal solution:
What I’m looking for is what I would describe as a template-driven feed parser for Python, so that I can write a template file for each of the feed formats, and this template file would be used to make sense of the various data formats.
I’ve had limited success finding something like this in the past, and was hoping someone may have a good suggestion.
Instead of evaling scripts, maybe you should consider making a package of them?

Parsing CSV is one thing: the format is simple and regular. Parsing XML requires a completely different approach. Considering you don't want to write every single parser from scratch, why not write a bunch of small modules, each exposing an identical API, and use those? I believe using Python itself (not some templating DSL) is ideal for this sort of thing.
For example, this is the approach I've seen in a small torrent-fetching script I use:
Main program:
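(The original listing was not preserved; here is a minimal sketch of what such a main program might look like. The `parsers` package name and the per-module `parse()` function are my assumptions, not the actual API of that script.)

```python
import importlib


def parse_feed(parser_name, raw_data):
    """Run the parser module named in the database row on raw feed data.

    Assumes a parsers/ package in which every module exposes the same
    API: a parse(raw_data) function returning the parsed records.
    """
    # Dynamically import parsers.<parser_name>, e.g. parsers.csv
    module = importlib.import_module("parsers." + parser_name)
    return module.parse(raw_data)
```

The database then only needs to store the parser's module name per feed, which is harmless data, instead of executable code.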
parsers/csv.py:

If you don't particularly like dynamically loaded modules, you may consider writing, for example, a single module with several parser classes (probably derived from some "abstract parser" base class).
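For illustration, a sketch of that class-based layout (the `FeedParser` and `CSVFeedParser` names are mine, chosen for the example):

```python
import csv
import io
from abc import ABC, abstractmethod


class FeedParser(ABC):
    """Abstract base: every concrete parser shares this API."""

    @abstractmethod
    def parse(self, raw):
        """Parse raw feed text into a list of record dicts."""


class CSVFeedParser(FeedParser):
    """Parses a CSV feed whose first row is the header."""

    def parse(self, raw):
        return list(csv.DictReader(io.StringIO(raw)))
```

Dispatch then becomes a simple dict mapping the parser name stored in the database to the appropriate class.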