Is there something like Ruby’s nokogiri for Node.js?
I mean a user-friendly HTML parser.
I’ve seen some parsers on the Node.js modules page, but I can’t find anything polished and up to date.
If you want to build a DOM, you can use jsdom.
There’s also cheerio, which has a jQuery-like interface and is a lot faster than older versions of jsdom, although these days the two are similar in performance.
You might wanna have a look at htmlparser2, which is a streaming parser and, according to its own benchmark, seems to be faster than the others. It doesn’t build a DOM by default, but it can produce one, since it’s bundled with a handler that creates a DOM. This is the parser that cheerio uses.
parse5 also looks like a good solution. It’s fairly active (11 days since the last commit as of this update), WHATWG-compliant, and is used in jsdom, Angular, and Polymer.
If the website you’re trying to scrape is dynamic then you should be using a headless browser like phantomjs. Also have a look at casperjs, if you’re considering phantomjs. And you can control casperjs from node with SpookyJS.
Besides phantomjs there’s zombiejs. Unlike phantomjs, which cannot be embedded in nodejs, zombiejs is just a node module.
There’s a nettuts+ tutorial for the latter solutions.