I am trying to obtain the text content of an Internet Explorer web browser

Editorial Team

Asked: June 1, 2026 (2026-06-01T15:43:54+00:00)

I am trying to obtain the text content of an Internet Explorer web browser window.

I am following these steps:

1. Obtain a pointer to IHTMLDocument2.
2. From the IHTMLDocument2, obtain the body as an IHTMLElement.
~~3. On the body, call get_innerText.~~

Edit

3. Obtain all the children of the body and recurse over each IHTMLElement.
4. If an element is not visible, or its tag is script, ignore that element and all of its children.

My problem is that, along with the text that is visible on the page, I also get content that is styled with style="display: none".
For google.com, I also get JavaScript along with the text.

I have tried a recursive approach, but I do not know how to deal with scenarios like this:

<div>
Hello World 1
<div style="display: none">Hello world 2</div>
</div>

In this scenario I won't be able to get "Hello World 1".

Can anyone please help me with the best way to obtain the text from an IHTMLDocument2*?
I am using C++ Win32, with no MFC or ATL.

Thanks,
Ashish.

1 Answer

Editorial Team · Answer 1 · 2026-06-01T15:43:55+00:00

If you iterate backwards over the document.body.all elements, you always visit the elements from the inside out, so you don't need to recurse yourself; the DOM does that for you. For example (the code is in Delphi):

procedure Test();
var
  document, el: OleVariant;
  i: Integer;
begin
  document := CreateComObject(CLASS_HTMLDocument) as IDispatch;
  document.open;
  document.write('<div>Hello World 1<div style="display: none">Hello world 2<div>This DIV is also invisible</div></div></div>');
  document.close;
  for i := document.body.all.length - 1 downto 0 do // iterate backwards
  begin
    el := document.body.all.item(i);
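    // filter the elements: drop hidden ones together with their subtrees
    // (el.style.display reflects the inline style attribute only)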
    if (el.style.display = 'none') then
    begin
      el.removeNode(true);
    end;
  end;
  ShowMessage(document.body.innerText);
end;

A Side Comment:
As for your scenario with the recursive approach:

<div>Hello World 1<div style="display: none">Hello world 2</div></div>

If, for example, our element is the first DIV, el.getAdjacentText('afterBegin') returns "Hello World 1". So we could probably iterate forward over the elements and collect getAdjacentText('afterBegin') for each one, but this is a bit more difficult because we need to test the parents of each element for el.currentStyle.display.
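
Since the question asks about C++, here is a minimal, portable sketch of the forward walk described above. It does not call MSHTML at all: Node, makeText, makeElement, visibleText and sampleTree are hypothetical stand-ins invented for illustration, and the hidden flag models whatever visibility test you use (for example el.currentStyle.display). The point it tries to show is that "Hello World 1" is a text node that is a direct child of the outer DIV, so pruning a hidden or script element removes only that subtree and leaves the sibling text intact. In real MSHTML code the same walk would presumably go through IHTMLDOMNode (get_childNodes, get_nodeType, get_nodeValue) rather than this struct; treat that mapping as an assumption to verify.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node; this is NOT an MSHTML type.
struct Node {
    enum class Kind { Element, Text };
    Kind kind;
    std::string tag;             // lowercase tag name (elements only)
    std::string text;            // character data (text nodes only)
    bool hidden;                 // models "display: none" on an element
    std::vector<Node> children;  // child nodes in document order
};

// Illustrative helpers for building a small tree by hand.
Node makeText(std::string s) {
    return Node{Node::Kind::Text, "", std::move(s), false, {}};
}

Node makeElement(std::string tag, bool hidden, std::vector<Node> kids) {
    return Node{Node::Kind::Element, std::move(tag), "", hidden, std::move(kids)};
}

// Depth-first walk in document order. A hidden element or a <script>
// element is skipped together with its whole subtree, so there is no
// need to re-check the ancestors of every node afterwards.
void collectVisibleText(const Node& n, std::string& out) {
    if (n.kind == Node::Kind::Text) {
        out += n.text;
        return;
    }
    if (n.hidden || n.tag == "script") {
        return;  // prune this element and everything below it
    }
    for (const Node& child : n.children) {
        collectVisibleText(child, out);
    }
}

std::string visibleText(const Node& root) {
    std::string out;
    collectVisibleText(root, out);
    return out;
}

// The scenario from the question:
// <div>Hello World 1<div style="display: none">Hello world 2</div></div>
Node sampleTree() {
    return makeElement("div", false, {
        makeText("Hello World 1"),
        makeElement("div", true, { makeText("Hello world 2") }),
    });
}
```

Calling visibleText(sampleTree()) yields "Hello World 1". Because the walk is top-down, the hidden check happens once per element on the way in, which sidesteps the "test the parents of each element" problem that the flat forward iteration runs into.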
