How can I correctly parse an XML stylesheet processing instruction?

Asked: June 11, 2026 (2026-06-11T05:00:40+00:00)

How can I correctly parse an XML stylesheet processing instruction? As I understand, the value of an XML processing instruction such as:

<?xml-stylesheet type="application/xsl" src="style.xsl" version="1.0"?>

is:

type="application/xsl" src="style.xsl" version="1.0"

How can I parse that into a list of key-value pairs? I’ve searched around for some examples of how to do this but haven’t been able to find any.

The key word here is correctly. I don’t want to just write a simple regex that may fail in certain situations; I want to parse this exactly the way an XML stylesheet instruction is supposed to be parsed.

1 Answer

Editorial Team · Answer 1 · 2026-06-11T05:00:41+00:00

The grammar of the XML stylesheet PI is given in the spec, so if you want to do it right, it’s simply a matter of writing a parser for that grammar. Since the language is in fact regular, it can be parsed correctly with a regular expression. The biggest complication is likely to be that, since the XML spec does not require character references or the predefined entity references to be recognized within a processing instruction, you are likely to be responsible for handling those yourself.
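To illustrate that last point, here is a minimal sketch in Python (not XQuery) of expanding references in a pseudo-attribute value yourself. It assumes the value has already been checked against the grammar; the name decode_pseudo_att_value is hypothetical and not part of any library.

```python
import re

# The five predefined entities of XML 1.0.
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# A decimal character reference, a hexadecimal one,
# or a predefined entity reference.
_REF = re.compile(r"&(?:#([0-9]+)|#x([0-9a-fA-F]+)|(amp|lt|gt|quot|apos));")


def decode_pseudo_att_value(value: str) -> str:
    """Expand character references and predefined entity references.

    Hypothetical helper: assumes `value` was already validated, so any
    '&' in it starts one of the reference forms matched by _REF.
    """

    def expand(m: re.Match) -> str:
        dec, hexa, name = m.groups()
        if dec is not None:
            return chr(int(dec))
        if hexa is not None:
            return chr(int(hexa, 16))
        return PREDEFINED[name]

    # One pass only, so "&amp;lt;" becomes "&lt;" and is not expanded again.
    return _REF.sub(expand, value)
```

The single pass matters: expanding the result a second time would turn an escaped ampersand followed by "lt;" into "<", which is not what the author of the PI wrote.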

As to exactly how you should do it, that depends on what environment you’re working in. As an example, here is an XQuery function that does the job and returns a list of elements created from the pseudo-attributes in the processing instruction; if the PI doesn’t match the grammar given in the spec, it returns a single element named error.

declare function bmt:parse-sspi($s as xs:string) 
  as element()* {

  if (bmt:check-sspi($s)) then
     let $s1 := substring-after($s,"<?xml-stylesheet"),
         $s2 := substring-before($s1,"?>")
     return bmt:parse-pseudoatts($s2) 
  else <error/>
};

This function hands off the real work of parsing the pseudo-attributes to a separate recursive function which parses off one attribute-value pair on each call:

declare function bmt:parse-pseudoatts($s as xs:string) 
  as element()* {

  (: We know that $s is a syntactically legal sequence
     of pseudo-attribute value specifications. So we
     can get by with simpler patterns than we would
     otherwise need.
     :)

  let $s1 := replace($s,"^\s+","")
  return if ($s1 = "") then () else
         let $s2 := substring-before($s, '='),
             $Name := normalize-space($s2),
             $s3 := substring-after($s, '='),
             $s4 := replace($s3,"^\s+",""),
             $Val := if (starts-with($s4,'"')) then
                        substring-before(
                          substring($s4,2),
                          '"')
                     else if (starts-with($s4,"'")) then
                        substring-before(
                          substring($s4,2),
                          "'")
                     else <ERROR/>,
             $sRest := if (starts-with($s4,'"')) then
                        substring-after(
                          substring($s4,2),
                          '"')
                     else if (starts-with($s4,"'")) then
                        substring-after(
                          substring($s4,2),
                          "'")
                     else ""

  return (element {$Name} { $Val }, 
          bmt:parse-pseudoatts($sRest))
};

As the comments indicate (and as you can see), both of these benefit from knowing in advance that the PI is in fact legal. So we can parse off the pseudo-attribute name by stripping whitespace from whatever precedes the first “=” in the string, and so on.

Correctness is guaranteed by a separate check-sspi function, which builds its regular expression piece by piece, one variable per grammar production, so that the function is easy to compare against the grammar in the spec.

declare function bmt:check-sspi($s as xs:string) 
  as xs:boolean {

  let $pio := "<\?",
      $kw := "xml-stylesheet",
      $pic := "\?>",
      $S := "\s+",
      $optS := "\s*",
      $Name := "\i\c*",
      $CharRef := "&amp;#[0-9]+;|&amp;#x[0-9a-fA-F]+;",
      $PredefinedEntityRef := concat("&amp;amp;",
                                     "|&amp;lt;",
                                     "|&amp;gt;",
                                     "|&amp;quot;",
                                     "|&amp;apos;"),
      $dq := '"',
      $sq := "'",
      $dqstring := concat($dq,
                          "(",
                          "[^", $dq, "&lt;&amp;]",
                          "|",
                          $CharRef,
                          "|",
                          $PredefinedEntityRef,
                          ")*",
                          $dq),
      $sqstring := concat($sq,
                          "(",
                          "[^",$sq,"&lt;&amp;]",
                          "|",
                          $CharRef,
                          "|",
                          $PredefinedEntityRef,
                          ")*",
                          $sq),
      $psAttVal := concat("(",$dqstring,"|",$sqstring,")"),
      $pseudoAtt := concat("(", 
                           $Name, 
                           $optS, "=", $optS, 
                           $psAttVal,
                           ")"),
      $sspi := concat($pio,
                      $kw,
                      "(", $S, $pseudoAtt, ")*",
                      $optS,
                      $pic),
      $sspi2 := concat("^", $sspi, "$")
      return matches($s, $sspi2)
};

For the test string

<?xml-stylesheet  foo="bar"
      href="http://www.w3.org/2008/09/xsd.xsl"
      type='text/xsl'
?>

the top-level parse-sspi function returns

<foo>bar</foo>
<href>http://www.w3.org/2008/09/xsd.xsl</href>
<type>text/xsl</type>

These functions could be somewhat more compact if we just did the parsing with a single Perl-style regular expression. Some people might find such a compact form more natural and easier to follow; others will prefer a less succinct formulation like the one given here.
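For comparison, here is a rough sketch of the more compact regex-driven form, written in Python. It follows the same validate-then-extract idea, but the names (parse_sspi, SSPI, PAIR) are hypothetical, and the NAME pattern is a deliberate ASCII-only simplification of the XML Name production.

```python
import re

S = r"[ \t\r\n]+"
OPT_S = r"[ \t\r\n]*"
# Assumption: ASCII-only names. The real Name production admits many
# more characters.
NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"
REF = r"&#[0-9]+;|&#x[0-9a-fA-F]+;|&(?:amp|lt|gt|quot|apos);"
DQ = rf'"(?:[^"<&]|{REF})*"'
SQ = rf"'(?:[^'<&]|{REF})*'"
PSEUDO_ATT = rf"(?P<name>{NAME}){OPT_S}={OPT_S}(?P<value>{DQ}|{SQ})"

# Whole-PI pattern, used only to validate the input.
SSPI = re.compile(rf"<\?xml-stylesheet(?:{S}{PSEUDO_ATT})*{OPT_S}\?>")
# Single pseudo-attribute pattern, used to extract the pairs.
PAIR = re.compile(PSEUDO_ATT)


def parse_sspi(pi: str):
    """Return a list of (name, value) pairs, or None if `pi` is not legal."""
    if SSPI.fullmatch(pi) is None:
        return None
    body = pi[len("<?xml-stylesheet"):-len("?>")]
    # Strip the surrounding quotes; references in values stay undecoded.
    return [(m.group("name"), m.group("value")[1:-1])
            for m in PAIR.finditer(body)]
```

Applied to the test string above, this should yield the pairs foo, href and type with the same values as the XQuery version. As there, the extraction step is only safe because the whole string has been validated first.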
