Editorial Team
Asked: May 22, 2026

I want to parse a Feedburner feed with HtmlUnit.
The feed is this one: http://feeds.feedburner.com/alcoanewsreleases

From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case.

Groovy code snippet:

def page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases")
def elements = page.getByXPath("//item")

Sample of the XML feed:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss1full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">

[...SNIP...]

<item rdf:about="http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2011&amp;pageID=20110518006002en">
    <title>Chris L. Ayers Named President, Alcoa Global Primary Products</title>
    <dc:date>2011-05-18</dc:date>
    <link>http://feedproxy.google.com/~r/alcoanewsreleases/~3/PawvdhpJrkc/news_detail.asp</link>
    <description>NEW YORK--(BUSINESS WIRE)--Alcoa (NYSE:AA) announced today that Chris L. Ayers has been named President of Alcoa’s Global Primary Products (GPP) business, effective May 18, 2011. Ayers, previously Chief Operating Officer of GPP, succeeds John Thuestad, who will be handling special projects for the Company. Ayers joined Alcoa in February 2010 as Chief Operating Officer of Alcoa Cast, Forged and Extruded Products, a new position. He was elected a Vice President of Alcoa in April 2010 and Executive</description>
    <feedburner:origLink xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2010&amp;pageID=20100104006194en</feedburner:origLink>
</item>

[...SNIP...]

</rdf:RDF>

I suspect this is a namespace issue, because the document declares four namespaces:

  • xmlns="http://purl.org/rss/1.0/" (the default namespace)
  • xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  • xmlns:dc="http://purl.org/dc/elements/1.1/"
  • xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"

I have also tried this with Nokogiri (another XML parser, which I use for Ruby scripts).
With Nokogiri I could just use the XPath //xmlns:item, which works and returns all item nodes from the feed.

I have tried the same XPath with HtmlUnit but it does not work.

So I think I can phrase my question as:
How can I select a node from the default namespace with HtmlUnit?

Any ideas?

1 Answer

    Editorial Team added an answer on May 22, 2026 at 3:50 pm

    > From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case.

    In XPath, //item means “select all elements whose local name is item and that are in no namespace”. In RSS 1.0, the item elements must be in a namespace. So //item should never match them with a conforming XML parser and XPath engine.

    What’s confusing is that in XML, <item> means “an element named item that is in the default namespace, i.e. whatever default namespace is in scope at this place in the document”, whereas in XPath, item means an element in no namespace. (Or, you could say, it means an element in the default namespace, but the default namespace for XPath expressions is no namespace unless you have a way to tell XPath otherwise. XPath 1.0 itself offers no such way: an unprefixed name test always matches names in no namespace.)
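    The difference can be reproduced outside HtmlUnit. The sketch below is an illustration only: it uses the JDK's own XPath 1.0 engine (javax.xml.xpath) rather than HtmlUnit, and the two tiny documents and the class name UnprefixedNameDemo are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class UnprefixedNameDemo {
    // Hypothetical miniature documents, not the real Feedburner feed.
    static final String NAMESPACED =
        "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
        + " xmlns='http://purl.org/rss/1.0/'><item/></rdf:RDF>";
    static final String NO_NAMESPACE = "<feed><item/></feed>";

    /** Counts the nodes that an XPath 1.0 expression selects in the given XML. */
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // item is in the RSS namespace here, so the unprefixed test selects nothing.
        System.out.println(count(NAMESPACED, "//item"));   // prints 0
        // item is in no namespace here, so the same expression matches.
        System.out.println(count(NO_NAMESPACE, "//item")); // prints 1
    }
}
```

    The same expression behaves differently only because the item elements live in different namespaces; nothing about the XPath itself changed.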

    The other confusing thing to beginners is that the namespace prefix mappings in the source XML document are not considered significant by the XPath processor. When the XML document is parsed, a data structure is built that remembers the name and namespace of every element (and other nodes). The namespace prefixes used, including the empty prefix of the default namespace, are considered mere syntactic convenience. More on this below…
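    To make that concrete, here is a small sketch (plain JDK DOM, not HtmlUnit) that parses two made-up documents: one binds the RSS namespace as the default namespace, the other binds it to an arbitrary prefix r. The class name and both documents are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class PrefixIndependenceDemo {
    static final String RSS_NS = "http://purl.org/rss/1.0/";

    // Two hypothetical feeds that bind the same namespace in different ways.
    static final String DEFAULT_NS_FEED = "<feed xmlns='" + RSS_NS + "'><item/></feed>";
    static final String PREFIXED_FEED = "<r:feed xmlns:r='" + RSS_NS + "'><r:item/></r:feed>";

    /** Returns "{namespace URI}local-name" of the first element whose local name is item. */
    static String expandedItemName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "*" matches any namespace, so the lookup does not presuppose the result.
            Node item = doc.getElementsByTagNameNS("*", "item").item(0);
            return "{" + item.getNamespaceURI() + "}" + item.getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Both lines print {http://purl.org/rss/1.0/}item
        System.out.println(expandedItemName(DEFAULT_NS_FEED));
        System.out.println(expandedItemName(PREFIXED_FEED));
    }
}
```

    After parsing, both item elements carry the same namespace URI and local name; the prefix that appeared in the source text plays no part in identifying them.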

    > With Nokogiri I could just use the XPath //xmlns:item which works and returns all nodes from the feed.

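    Syntactically that is an ordinary prefixed XPath name test; what is not standard is where the prefix comes from. Nokogiri registers the namespaces declared on the root element for you and binds the default one to the prefix xmlns. That is a Nokogiri convenience rather than part of XPath (a very convenient one, but the choice of xmlns as a prefix is really counter-intuitive).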

    > So I think I can phrase my question as: How can I select a node from the default namespace with HtmlUnit?

    Let’s phrase it as: How can I select the RSS item elements with HtmlUnit? I phrase it that way because the RSS spec (actually in general any conforming XML vocabulary spec) does not require that its elements will be in the default namespace. That happens to be true in the sample you received, but the service provider could change that tomorrow and still be perfectly conformant to RSS. Tomorrow, the service provider could use the “rss” namespace prefix for that namespace; or any other arbitrary prefix. What RSS does specify is what namespace its elements will be in: the namespace whose URI is http://purl.org/rss/1.0/.

    It’s kind of like asking, “How do I write a function (in JavaScript, C, Java, etc.) that can tell me the value of the variable a?” Usually a function has no idea what variable name was used for what in the caller. All it knows are the values of its arguments. If you call sqrt(4), you’ll get the same answer as with a = 4; sqrt(a) or rumpelstiltzkin = 4; sqrt(rumpelstiltzkin). Clearly, the name of the variable argument has no direct effect on the result of the function call. It just needs to be the name of a variable that holds the right value. If a compiler complained because you wrote b = 4; return sqrt(b) instead of using a, you’d think that compiler was nuts. It’s not supposed to care about variable names as long as you use valid identifiers.

    In the same way, when processing RSS, we’re not supposed to care about what namespace prefix is used, as long as it’s a prefix that identifies the right namespace. It could be no prefix (which identifies the default namespace).

    In XPath 2.0, you can wildcard the namespace. This is very handy if you know you’re not going to need namespaces for disambiguation. In that case you can select //*:item. However, I don’t think HtmlUnit supports XPath 2.0. Also, XPath 2.0 environments such as XSLT 2.0 let you specify a default namespace for XPath expressions, but that won’t help you in HtmlUnit.

    So you have a couple of choices:

    • Use an XPath expression that ignores namespaces, such as //*[local-name() = 'item'].

    or

    • The robust way: Register a namespace prefix for http://purl.org/rss/1.0/ and use it in your XPath expression: //rss:item. The question then becomes, how do you register a namespace prefix in HtmlUnit and pass it to the XPath processor? I took a quick look in the docs and didn’t find any facility for doing that.
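    Both choices can be sketched with the JDK's standard XPath API. This is not HtmlUnit code, and whether HtmlUnit exposes an equivalent hook is exactly the open question above; the sample feed, the class name RssItemSelector and the prefix rss are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RssItemSelector {
    static final String RSS_NS = "http://purl.org/rss/1.0/";

    // Hypothetical two-item feed that mirrors the namespace layout of the sample.
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
        + " xmlns='" + RSS_NS + "'>"
        + "<item><title>first</title></item>"
        + "<item><title>second</title></item>"
        + "</rdf:RDF>";

    /** Binds the prefix "rss" (our choice, not the feed's) to the RSS 1.0 namespace. */
    static final NamespaceContext RSS_CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "rss".equals(prefix) ? RSS_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return RSS_NS.equals(uri) ? "rss" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return RSS_NS.equals(uri)
                    ? Collections.singleton("rss").iterator()
                    : Collections.<String>emptyIterator();
        }
    };

    /** Counts the nodes selected by the expression, with the rss prefix registered. */
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(RSS_CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(SAMPLE, "//*[local-name() = 'item']")); // first choice, prints 2
        System.out.println(count(SAMPLE, "//rss:item"));                  // second choice, prints 2
    }
}
```

    The registered prefix only has to map to the right URI, so //rss:item keeps working if the feed later switches from a default namespace to an explicit prefix. The local-name() form also keeps working, but it would match item elements from any other namespace as well.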

    Caveat: I should add that the above applies to conforming XPath processors. I have no idea what XPath processor HtmlUnit uses. There are some XPath processors out there that ignore the specs and make the world more confusing for everybody.

    I saw that someone used the following syntax for elements in the default namespace in HtmlUnit:

    //:item

    But I wouldn’t recommend that, for three reasons:

    1. It’s not valid XPath, so you can’t expect it to work with other programs.

    2. It will only work on RSS feeds that declare the RSS namespace to be the default namespace. RSS feeds that use a namespace prefix will cause the above to fail.

    3. It will hold you back from learning how XML namespaces really work, and it will help preserve the status quo of tools that don’t adequately support namespaces.

    HtmlUnit is primarily designed for HTML, so incomplete handling of XML is understandable. But claiming to support XPath and then not providing ways to declare namespace prefixes is a bug. HtmlUnit uses an XPath package that seems to be part of Xalan-J. That package has ways to provide namespace mappings to XPath, but I don’t know if HtmlUnit exposes that functionality.
