I am trying to obtain the text content of an Internet Explorer web browser window.
I am following these steps:
- obtain a pointer to IHTMLDocument2
- from the IHTMLDocument2 I obtain the body as an IHTMLElement
- on the body I call get_innerText
Edit
- I obtain all the children of the body and try to do a recursive call on all the IHTMLElements
- if I get any element which is not visible, or an element whose tag is script, I ignore that element and all its children.
My problem is:
- along with the text which is visible on the page, I also get content which has style="display: none"
- for google.com, I also get JavaScript along with the text.
I have tried a recursive approach, but I am clueless as to how to deal with scenarios like this:
<div>
Hello World 1
<div style="display: none">Hello world 2</div>
</div>
In this scenario I won't be able to get “Hello World 1”.
Can anyone please help me out with the best way to obtain the text from an IHTMLDocument2*?
I am using C++ Win32, no MFC, ATL.
Thanks,
Ashish.
If you iterate backwards on the document.body.all elements, you will always walk the elements inside out. So you don't need to recurse yourself; the DOM will do that for you. E.g. (the code is in Delphi):
A side comment:
As for your scenario with the recursive approach:
If e.g. our element is the first DIV, el.getAdjacentText('afterBegin') will return "Hello World 1". So we can probably iterate forward on the elements and collect the getAdjacentText('afterBegin') values, but this is a bit more difficult because we need to test the parents of each element for el.currentStyle.display.
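Since you are in C++, here is a rough, portable sketch of the recursive variant. It is a swapped-in technique, not the getAdjacentText approach: it walks child nodes (text nodes included) instead of child elements, and it never enters a hidden or script subtree, so the ancestor test happens implicitly. Node, Text, Elem, Trim, IsSkipped, CollectVisibleText and ExampleFromQuestion are all hypothetical names on a hand-built mock tree, not MSHTML types. Against the real DOM I would expect the matching calls to be IHTMLDOMNode::get_nodeType / get_childNodes / get_nodeValue and IHTMLElement2::get_currentStyle with IHTMLCurrentStyle::get_display, but treat that mapping as an assumption to verify.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node. This is NOT an MSHTML type.
struct Node {
    std::string tag;             // empty means "text node"
    std::string text;            // content of a text node
    std::string display;         // computed display value of an element
    std::vector<Node> children;  // child nodes in document order
};

// Hypothetical helpers for building a small tree by hand.
Node Text(std::string s) {
    return Node{"", std::move(s), "", {}};
}

Node Elem(std::string tag, std::string display, std::vector<Node> kids) {
    assert(!tag.empty() && "element nodes need a tag name");
    return Node{std::move(tag), "", std::move(display), std::move(kids)};
}

// Strip leading and trailing whitespace from a text node's content.
std::string Trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// An element is skipped, together with its whole subtree, when it is a
// script/style block or is hidden. The mock uses lowercase tag names; a
// real DOM may report them in another case, so compare case-insensitively.
bool IsSkipped(const Node& n) {
    return n.tag == "script" || n.tag == "style" || n.display == "none";
}

// Append the visible text below n to out, separating pieces with a space.
void CollectVisibleText(const Node& n, std::string& out) {
    if (n.tag.empty()) {  // text node: keep its trimmed content
        const std::string t = Trim(n.text);
        if (!t.empty()) {
            if (!out.empty()) out += ' ';
            out += t;
        }
        return;
    }
    if (IsSkipped(n)) return;  // hidden or script: ignore the subtree
    for (const Node& child : n.children) {
        CollectVisibleText(child, out);
    }
}

// The markup from the question, built by hand.
Node ExampleFromQuestion() {
    return Elem("div", "block", {
        Text("\nHello World 1\n"),
        Elem("div", "none", {Text("Hello world 2")}),
    });
}
```

Calling CollectVisibleText(ExampleFromQuestion(), out) with an empty std::string out leaves only Hello World 1 in out. The outer DIV's own text survives because it is a separate text node, while the hidden inner DIV is dropped as a whole.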