I need to extract the text (only plain text) from an arbitrary web page

Question

Asked: May 30, 2026 (2026-05-30T12:31:11+00:00)

I need to extract only the plain text from an arbitrary web page. (I get around the cross-domain restriction with a simple PHP proxy on my server.)
I fetch the page as usual:

$.get(url, function(data) {
  process(data);
});

and in my process() function I have the content of the page.
I want to use a particular div (here ‘#my-div’) in that page or, if it is not present, fall back to the whole body.

I would like to do something like this:

function process(content) {
  if ($(content).find('#my-div'))
    $('#output').text($(content).find('#my-div').text());
  else
    $('#output').text($(content).find('body').text());
}

But I always get an empty result when “finding” ‘body’. Any suggestions?

1 Answer

Editorial Team · Answer 1 · 2026-05-30T12:31:12+00:00

Some issues…

function process(content) {
   // The if() will always be true, because a jQuery object is always returned
  if ($(content).find('#my-div'))
    $('#output') = $(content).find('#my-div').text();   // invalid assignment
  else
    $('#output') = $(content).find('body').text();      // invalid assignment
}

Fixed…

function process(content) {
  var nodes = $(content);  // cache the elements
  if (nodes.find('#my-div').length)
    $('#output').text(nodes.find('#my-div').text());  
  else
    $('#output').text(nodes.find('body').text());     
}
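
The .length test matters because an empty jQuery result is still an object, and every object is truthy in JavaScript. A minimal jQuery-free sketch of that point, where emptyResult is a hypothetical stand-in for what find() returns when nothing matches:

```javascript
// Hypothetical stand-in for an empty jQuery result:
// an array-like object whose length is 0.
const emptyResult = { length: 0 };

// Any object is truthy, so a bare if (result) cannot detect "no match".
const naiveCheck = Boolean(emptyResult);

// The length is 0 when nothing matched, so this check is falsy as intended.
const lengthCheck = Boolean(emptyResult.length);
```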

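In theory the fixed version should work, but passing an entire HTML document to the $ function is unreliable. jQuery hands the string to the browser to parse, and you’ll find that some browsers strip out some of the elements, like <head> and <body>, leaving their children at the top level of the result. That is why find('body') can come back empty.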

You’ll ultimately need to test for each of these situations, something like this…

function process(content) {
  var nodes = $(content);  // cache the elements
  var my_div = nodes.find('#my-div');  // try to get nested #my-div

  if( !my_div.length ) {
      my_div = nodes.filter('#my-div'); // try to get #my-div at top level

      if( !my_div.length ) {
          my_div = nodes.find('body');  // try to get nested body

          if( !my_div.length ) {
              my_div = nodes;  // assume the body content is at the top level
          }
      }
  }
  $('#output').text(my_div.text());   
}
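
As a possible alternative, not part of the approach above, the browser's DOMParser can parse the string as a complete document, so <body> should survive and the nested checks may be unnecessary. This is only a sketch, assuming a browser that supports parsing 'text/html' with DOMParser; pickText is a hypothetical helper name, while '#my-div' and '#output' come from the question.

```javascript
// Hypothetical helper: return the text of the preferred element,
// falling back to <body>. Works on any Document-like object.
function pickText(doc, selector) {
  const target = doc.querySelector(selector) || doc.body;
  return target ? target.textContent : '';
}

// Browser-only: DOMParser builds a full document, so <body> is kept.
// Assumes jQuery is loaded and an #output element exists, as in the question.
function process(content) {
  const doc = new DOMParser().parseFromString(content, 'text/html');
  $('#output').text(pickText(doc, '#my-div'));
}
```

Note that textContent on the body also includes the contents of any inline <script> and <style> elements, so those may need to be removed first if only visible text is wanted.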
