The Archive Base

Editorial Team
Asked: June 13, 2026

I am new to XPath and it seems a bit tricky to me. Sometimes I find it does not work the way I think it should.

When I scrape data from a website using XPath and Nokogiri, I find it difficult if the website has a complex structure. I use FirePath to get the XPath of an element but sometimes it does not seem to work. I have to remove extra tags added by the browser, like tbody.

I really want to know if there are some good tutorials and examples of XPath and Nokogiri. I could not find much after a Google search.


1 Answer

  1. Editorial Team
    Added an answer on June 13, 2026 at 1:19 pm

    The biggest trick to finding an element, or group of elements, using Nokogiri or any XML/HTML parser, is to start with a short accessor to get into the general vicinity of what you’re looking for, then iteratively add to it, fine-tuning as you go, until you have what you want.

    The second trick is to remember to use // to start your XPath, not /, unless you’re absolutely sure you want to start at the root of the document. // is like a '**/*' wildcard at the command-line in Linux. It searches everywhere.

    Also, don’t trust the XPath or CSS accessor provided by a browser. They do all sorts of fixups to the HTML source, including tbody, like you saw. Instead, use Ruby’s OpenURI or curl or wget to retrieve the raw source, and look at it with an editor like vi or vim, or use less or cat it to the screen. There’s no chance of having any changes to the file that way.

    Finally, it’s often easier/faster to break the search into chunks with XPath, then let Ruby iterate through things, than to try to come up with a complex XPath that’s harder to maintain or more fragile.

    Nokogiri itself is pretty easy. The majority of things you’ll want to do are simple combinations of two different methods: search and at. Both take either a CSS or XPath selector. search, along with its sibling methods xpath and css, returns a NodeSet, which is basically an array of nodes that you can iterate over. at, along with its sibling methods at_css and at_xpath, returns the first node that matches the CSS or XPath accessor, or nil if nothing matches. In all those methods, the ...xpath variants accept an XPath, and the ...css ones take a CSS accessor.

    Once you have a node, generally you’ll want to do one of two things to it: extract an attribute or get its text/content. You can easily get an attribute’s value using ['attribute_to_get'] and the text using text.

    Using those methods we can search for all the links in a page and return their text and related href, using something like:

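    require 'awesome_print'
    require 'nokogiri'
    require 'open-uri'

    # URI.open comes from open-uri; a bare open() no longer fetches URLs on Ruby 3.0+.
    doc = Nokogiri::HTML(URI.open('http://www.example.com'))
    # Collect [href, text] for every link and pretty-print the first five.
    ap doc.search('a').map { |a| [a['href'], a.text] }[0, 5]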
    

    Which outputs something like the following (the exact links depend on the page’s content when it is fetched):

    [
        [0] [
            [0] "/",
            [1] ""
        ],
        [1] [
            [0] "/domains/",
            [1] "Domains"
        ],
        [2] [
            [0] "/numbers/",
            [1] "Numbers"
        ],
        [3] [
            [0] "/protocols/",
            [1] "Protocols"
        ],
        [4] [
            [0] "/about/",
            [1] "About IANA"
        ]
    ]
    
