How can I correctly parse an XML stylesheet processing instruction? As I understand, the value of an XML processing instruction such as:
<?xml-stylesheet type="application/xsl" src="style.xsl" version="1.0"?>
is:
type="application/xsl" src="style.xsl" version="1.0"
How can I parse that into a list of key-value pairs? I’ve searched around for some examples of how to do this but haven’t been able to find any.
The key word here is correctly. I don’t want to just write a simple regex that may fail in certain situations; I want to make sure I parse this fully in accordance with how an XML stylesheet instruction is supposed to be parsed.
The grammar of the XML stylesheet PI is given in the spec, so if you want to do it right, it’s simply a matter of writing a parser for that grammar. Since the language is in fact regular, it can be parsed correctly with a regular expression. The biggest complication is likely to be that, since the XML spec does not require character references or the predefined entity references to be recognized within a processing instruction, you are likely to be responsible for handling those yourself.
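To illustrate the last point, here is a minimal Python sketch of decoding character references and the five predefined entity references in a pseudo-attribute value. The name decode_references is hypothetical, and the sketch assumes the value has already been checked against the grammar, so every "&" it sees begins a legal reference.

```python
import re

# The five predefined entities of XML.
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# A decimal character reference, a hexadecimal one, or a predefined entity reference.
REFERENCE = re.compile(r"&(?:#([0-9]+)|#x([0-9a-fA-F]+)|(amp|lt|gt|quot|apos));")


def decode_references(value: str) -> str:
    """Replace each reference in an already-validated value with its character."""

    def replace(match: re.Match) -> str:
        decimal, hexadecimal, name = match.groups()
        if decimal is not None:
            return chr(int(decimal))
        if hexadecimal is not None:
            return chr(int(hexadecimal, 16))
        return PREDEFINED[name]

    # A single pass, so the output of one replacement is never decoded again.
    return REFERENCE.sub(replace, value)
```

Doing the substitution in one pass matters: "&amp;lt;" should become "&lt;" and not "<". The sketch does not check that a numeric reference denotes a legal XML character; chr() raises ValueError for code points beyond U+10FFFF.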
As to exactly how you should do it, that depends on what environment you’re working in. As an example, here is an XQuery function that does the job and returns a list of elements created from the pseudo-attributes in the processing instruction; if the PI doesn’t match the grammar given in the spec, it returns a single element named error.
This function hands off the real work of parsing the pseudo-attributes to a separate recursive function which parses off one attribute-value pair on each call:
As the comments indicate (and as you can see), both of these benefit from knowing in advance that the PI is in fact legal. So we can parse off the pseudo-attribute name by stripping whitespace from whatever precedes the first “=” in the string, and so on.
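The same idea can be sketched in Python. This is not the XQuery code from this answer, only an illustration of the approach: the name parse_pseudo_atts is hypothetical, and the function assumes its input is the data of a PI that is already known to be legal. It returns raw values, with any character or entity references left undecoded.

```python
XML_WHITESPACE = " \t\r\n"


def parse_pseudo_atts(data: str) -> list[tuple[str, str]]:
    """Split already-validated PI data into (name, raw value) pairs.

    Each call parses off one pseudo-attribute and recurses on the rest.
    """
    data = data.strip(XML_WHITESPACE)
    if not data:
        return []
    # Legal input guarantees that the first "=" ends the name,
    # because a name cannot contain "=".
    name, _, rest = data.partition("=")
    rest = rest.lstrip(XML_WHITESPACE)
    # Legal input guarantees a quote character here, and the value
    # cannot contain its own delimiter.
    quote = rest[0]
    end = rest.index(quote, 1)
    value = rest[1:end]
    return [(name.rstrip(XML_WHITESPACE), value)] + parse_pseudo_atts(rest[end + 1:])
```

Called on the PI value from the question, this yields the pairs ("type", "application/xsl"), ("src", "style.xsl") and ("version", "1.0"). On input that has not been validated first it may raise an exception or return nonsense, which is why the check has to come before it.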
The guarantee of correctness is given by a separate check-sspi function, which systematically constructs a regular expression in a way that makes it easy to compare the function with the grammar in the spec, to check that the function is correct.
For the test string
the top-level parse-sspi function returns
These functions could be somewhat more compact if we just did the parsing with a single Perl-style regular expression. Some people might find such a compact form more natural and easier to follow; others will prefer a less succinct formulation like the one given here.
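For readers not working in XQuery, the technique of building the checking expression piece by piece, one named fragment per grammar production, might look like this in Python. It is a sketch under stated assumptions: the productions follow my reading of the grammar in the stylesheet PI spec, NAME is an ASCII-only simplification of the XML Name production, and the function checks the PI data (the text after the target and its following whitespace) rather than the whole instruction.

```python
import re

# S ::= (#x20 | #x9 | #xD | #xA)+
S = r"[ \t\r\n]+"
# Assumption: ASCII-only approximation of the XML Name production.
NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"
# CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
CHAR_REF = r"&#[0-9]+;|&#x[0-9a-fA-F]+;"
# PredefEntityRef ::= '&amp;' | '&lt;' | '&gt;' | '&quot;' | '&apos;'
PREDEF_ENTITY_REF = r"&amp;|&lt;|&gt;|&quot;|&apos;"
REFERENCE = rf"(?:{CHAR_REF}|{PREDEF_ENTITY_REF})"
# PseudoAttValue: a double-quoted or single-quoted string of ordinary
# characters and references.
PSEUDO_ATT_VALUE = (
    rf'(?:"(?:[^"<&]|{REFERENCE})*"'
    rf"|'(?:[^'<&]|{REFERENCE})*')"
)
# PseudoAtt ::= Name S? '=' S? PseudoAttValue
PSEUDO_ATT = rf"{NAME}(?:{S})?=(?:{S})?{PSEUDO_ATT_VALUE}"
# PI data: zero or more pseudo-attributes separated by whitespace,
# with optional trailing whitespace.
PI_DATA = rf"(?:{PSEUDO_ATT}(?:{S}{PSEUDO_ATT})*)?(?:{S})?"


def check_sspi(data: str) -> bool:
    """Return True if the PI data matches the grammar sketched above."""
    # The grammar excludes any value containing "?>", which would end the PI.
    return "?>" not in data and re.fullmatch(PI_DATA, data) is not None
```

Because each constant corresponds to one production, the expression can be compared with the grammar line by line, which is harder to do with one long hand-written pattern.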