I’m building a tool that performs XPath 1.0 queries on XHTML documents. The requirement to use a namespace prefix in the query is killing me. The query looks like this:
html/body/div[@class='contents']/div[@class='body']/
div[@class='pgdbbyauthor']/h2[a[@name][starts-with(.,'Quick')]]/
following-sibling::ul[1]/li/a
(all on one line)
…which is bad enough, but because it’s XPath 1.0, I need to use an explicit namespace prefix on each QName, so it looks like this:
ns1:html/ns1:body/ns1:div[@class='contents']/ns1:div[@class='body']/
ns1:div[@class='pgdbbyauthor']/ns1:h2[ns1:a[@name][starts-with(.,'Quick')]]/
following-sibling::ns1:ul[1]/ns1:li/ns1:a
To set up the query, I do something like this:
var xpathDoc = new XPathDocument(new StringReader(theText));
var nav = xpathDoc.CreateNavigator();
var xmlns = new XmlNamespaceManager(nav.NameTable);
foreach (string prefix in xmlNamespaces.Keys)
    xmlns.AddNamespace(prefix, xmlNamespaces[prefix]);
XPathNodeIterator selection = nav.Select(xpathExpression, xmlns);
But what I want is for the xpathExpression to use the implicit default namespace.
Is there a way for me to transform the unadorned xpath expression, after it’s been written, to inject a namespace prefix for each element name in the query?
I’m thinking that for anything between two slashes, I could inject a prefix there, except of course for axis names like “parent::” and “preceding-sibling::”, and for wildcards. That’s what I mean by “finagle a default namespace”.
Is this hack gonna work?
Addendum
Here’s what I mean. Suppose I have an XPath expression, and before passing it to nav.Select(), I transform it. Something like this:
string FixupWithDefaultNamespace(string expr)
{
    string s = expr;
    s = Regex.Replace(s, "^(?!::)([^/:]+)(?=/)", "ns1:$1"); // beginning
    s = Regex.Replace(s, "/([^/:]+)(?=/)", "/ns1:$1"); // stanza
    s = Regex.Replace(s, "::([A-Za-z][^/:*]*)(?=/)", "::ns1:$1"); // axis specifier
    s = Regex.Replace(s, "\\[([A-Za-z][^/:*\\(]*)(?=[\\[\\]])", "[ns1:$1"); // predicate
    s = Regex.Replace(s, "/([A-Za-z][^/:]*)(?<!::)$", "/ns1:$1"); // end (negative lookbehind)
    s = Regex.Replace(s, "^([A-Za-z][^/:]*)$", "ns1:$1"); // edge case
    s = Regex.Replace(s, "([-A-Za-z]+)\\(([^/:\\.,\\)]+)(?=[,\\)])", "$1(ns1:$2"); // xpath functions
    return s;
}
This actually works for the simple cases I tried. To use the example from above: if the input is the first XPath expression, the output I get is the second one, with all the ns1 prefixes. The real question is, is it hopeless to expect this Regex.Replace approach to work as the XPath expressions get more complicated?
If you know there is only one namespace (i.e. the XHTML namespace) and it is defined as a default namespace, then you can cheat by processing the document with an XmlTextReader that is not namespace-aware, i.e. one whose Namespaces property is set to false before the document is loaded:
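A minimal sketch of that approach. The class name, the Load helper and the sample XHTML string are assumptions made up for illustration; only XmlTextReader.Namespaces and XPathDocument come from the framework.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

static class NamespaceBlindXPath
{
    // Sample input (an assumption): XHTML that declares a default namespace.
    const string Xhtml =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
        "<head><title>Test</title></head>" +
        "<body><h1>Example</h1></body></html>";

    // Hypothetical helper: loads the markup with namespace support switched off.
    public static XPathNavigator Load(string xml)
    {
        using (var sr = new StringReader(xml))
        using (var xtr = new XmlTextReader(sr))
        {
            // xmlns declarations are now read as ordinary attributes,
            // so every element ends up in no namespace.
            xtr.Namespaces = false;
            return new XPathDocument(xtr).CreateNavigator();
        }
    }

    static void Main()
    {
        XPathNavigator nav = Load(Xhtml);
        // The unprefixed path can now select the XHTML elements.
        Console.WriteLine(nav.SelectSingleNode("html/body/h1").Value);
    }
}
```

Note that this only helps when nothing in the document depends on namespaces; prefixed names such as xml:lang are then just names that happen to contain a colon.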
That works for me: with a sample document whose body contains an “h1” with the text “Example”, it outputs “Example”, so the path “html/body/h1” finds that “h1” element.
Another option is to run the input through an XSLT stylesheet that strips the namespaces first, and then run the unprefixed queries against the transformation result.
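A sketch of the stylesheet route, assuming XslCompiledTransform is acceptable and the document is small enough to hold in memory. The class name, the Strip helper and the embedded stylesheet are illustrative, not taken from any existing code.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

static class NamespaceStripper
{
    // Rebuilds every element and attribute under its local name, in no namespace.
    // Text nodes are copied by XSLT's built-in template rule.
    const string StripXslt = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:template match='*'>
    <xsl:element name='{local-name()}'>
      <xsl:apply-templates select='@* | node()'/>
    </xsl:element>
  </xsl:template>
  <xsl:template match='@*'>
    <xsl:attribute name='{local-name()}'>
      <xsl:value-of select='.'/>
    </xsl:attribute>
  </xsl:template>
  <xsl:template match='comment() | processing-instruction()'>
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>";

    // Hypothetical helper: returns a navigator over a namespace-free copy of the input.
    public static XPathNavigator Strip(string xml)
    {
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(StripXslt)))
            xslt.Load(styleReader);

        var input = new XPathDocument(new StringReader(xml));
        var output = new XmlDocument();
        // AppendChild() hands back an XmlWriter that builds nodes inside 'output'.
        using (XmlWriter writer = output.CreateNavigator().AppendChild())
            xslt.Transform(input, null, writer);

        return output.CreateNavigator();
    }
}
```

With this in place, something like NamespaceStripper.Strip(theText).Select("html/body/h1") can use unprefixed names. One caveat: prefixed attributes such as xml:lang lose their prefix too, so they can collide with an unprefixed attribute of the same local name.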
And of course you could consider not relying on the Microsoft XPath 1.0 implementation at all and moving to a third-party XPath 2.0 or XQuery 1.0 implementation such as Saxon or XQSharp. Then you can define a default element namespace for your XPath or XQuery expressions and use paths without prefixes to select elements in the XHTML namespace.
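As for the rewriting idea in the question: a chain of Regex.Replace calls has no notion of tokens, so it can, for instance, rewrite slash-separated text inside a string literal such as [.='a/b/c']. If you do want to rewrite the expression, splitting it into tokens first and applying the XPath 1.0 lexical disambiguation rules tends to hold up better. The following is only a rough sketch under that assumption. XPathPrefixer and AddPrefix are made-up names, and it does not validate the expression.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class XPathPrefixer
{
    // One alternative per token kind: whitespace, string literals, numbers,
    // multi-character operators, names (optionally $variable or prefix:name),
    // and finally any single character.
    static readonly Regex Token = new Regex(
        @"\s+|'[^']*'|""[^""]*""|\d+(?:\.\d*)?|\.\d+|::|//|!=|<=|>=|\.\." +
        @"|\$?[\p{L}_][\w.-]*(?::(?:[\p{L}_][\w.-]*|\*))?|.",
        RegexOptions.Singleline);

    // After these tokens an operand is expected, so a following name is a name test.
    static readonly HashSet<string> OperandFollows = new HashSet<string> {
        "@", "::", "(", "[", ",", "/", "//", "|", "+", "-", "=", "!=", "<", "<=", ">", ">=" };

    static readonly HashSet<string> OperatorNames =
        new HashSet<string> { "and", "or", "div", "mod" };

    public static string AddPrefix(string xpath, string prefix)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(xpath)) tokens.Add(m.Value);

        var sb = new StringBuilder();
        bool expectOperand = true; // true at the start of the expression
        string prev = null;        // previous non-whitespace token
        string axis = null;        // most recently seen axis name

        for (int i = 0; i < tokens.Count; i++)
        {
            string t = tokens[i];
            if (char.IsWhiteSpace(t[0])) { sb.Append(t); continue; }

            bool isName = char.IsLetter(t[0]) || t[0] == '_';
            if (isName && !expectOperand && OperatorNames.Contains(t))
            {
                expectOperand = true; // "and", "or", "div", "mod" used as operators
            }
            else if (isName)
            {
                string next = NextToken(tokens, i);
                bool attributeTest = prev == "@" ||
                    (prev == "::" && (axis == "attribute" || axis == "namespace"));
                if (next == "::")
                    axis = t; // axis name, never prefixed
                else if (next != "(" && !attributeTest && !t.Contains(":"))
                    sb.Append(prefix).Append(':'); // unprefixed element name test
                expectOperand = false;
            }
            else if (t == "*")
            {
                // Wildcard name test if an operand was expected, otherwise multiplication.
                expectOperand = !expectOperand;
            }
            else
            {
                expectOperand = OperandFollows.Contains(t);
            }
            sb.Append(t);
            prev = t;
        }
        return sb.ToString();
    }

    // Returns the next non-whitespace token after position i, or null at the end.
    static string NextToken(List<string> tokens, int i)
    {
        for (int j = i + 1; j < tokens.Count; j++)
            if (!char.IsWhiteSpace(tokens[j][0])) return tokens[j];
        return null;
    }
}
```

Calling XPathPrefixer.AddPrefix(xpathExpression, "ns1") on the unprefixed query from the question should give the ns1-prefixed form shown there, while attribute names, function names, axis names, wildcards and string literals are left alone.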