I have a few thousand HTML pages to read. They come from a forum that started in 2004. My basic idea is to walk through the pages by changing the page number in a Python script. All I need looks like this:
lot of other tag from beginning
<div id="posts">
lot of stuff between
</div>
lot of other tag till ending
I use BeautifulSoup's findAll to read the content in between, which works perfectly 99% of the time, I think. Suddenly, one page is giving me trouble. Its structure looks like this:
lot of other tag from beginning
<div id="posts">
first part
</div>
second part
</div>
lot of other tag till ending
As you can see, there is an unmatched </div> that has no <div> before it. BeautifulSoup takes the second-to-last </div> as the end of <div id="posts"> and stops there, ignoring the useful second part between the unmatched </div> and the real end of <div id="posts">.
I believe this is a rare condition, since I finished another thread of 1960 pages without hitting this problem. It occurred in an old thread. Does anyone have any idea? Is there a tool that fixes this? It is quite frustrating.
Thanks in advance
Oh dear.
Easiest way would be to fix the page so all end tags have a start tag….
Basically the markup is not correct. Browsers have all sorts of ifs and buts to cope with this and other fun ones, left over from the bad old days when HTML wasn’t valid XML.
It’s doable in code, though it’s a lot of work, because basically you have to “guess” where the missing start tag should be.
In this specific case, where would you logically insert a start div? Or could you afford to rip out the orphaned end tag? You have to guess the intent… Painful, very painful.
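To make the guessing a bit more concrete, here is a minimal sketch that only locates the point where a </div> shows up with no <div> open. It uses the standard library's html.parser rather than BeautifulSoup; DivChecker and find_unmatched_div_ends are names made up for this example. Note that it reports where the imbalance becomes visible, which is not necessarily the tag that was written by mistake, so deciding what to do with it is still a judgment call.

```python
from html.parser import HTMLParser


class DivChecker(HTMLParser):
    """Record where a </div> appears while no <div> is open (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <div> tags currently open
        self.unmatched = []   # (line, column) of each stray </div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag != "div":
            return
        if self.depth == 0:
            # Nothing is open, so this end tag has no partner.
            self.unmatched.append(self.getpos())
        else:
            self.depth -= 1


def find_unmatched_div_ends(html):
    """Return the (line, column) positions of stray </div> tags."""
    checker = DivChecker()
    checker.feed(html)
    checker.close()
    return checker.unmatched


# The broken structure from the question, reduced to the relevant part.
sample = '<div id="posts">\nfirst part\n</div>\nsecond part\n</div>\n'
stray = find_unmatched_div_ends(sample)
print(stray)  # the stray tag is the second </div>, on line 5
```

On a full page with wrapper divs around the posts block, the stray tag reported would be a later one, closer to the end of the page, which is exactly why the repair ends up being a guess.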
Quite likely to make a mess of your logic. Me, I’d throw an error on this page and move on to the next, then get it fixed.
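A rough sketch of that "flag it and move on" approach, assuming the pages are already downloaded as strings. It uses the standard library's html.parser, and the names DivBalance, divs_balanced and split_pages are invented for this example. It only counts div tags, so it will not catch other kinds of broken markup.

```python
from html.parser import HTMLParser


class DivBalance(HTMLParser):
    """Count <div> and </div> tags to see whether they pair up (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.went_negative = False  # True once a </div> arrives with nothing open

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1
            if self.depth < 0:
                self.went_negative = True


def divs_balanced(html):
    """Return True if every </div> has a <div> before it and none are left open."""
    counter = DivBalance()
    counter.feed(html)
    counter.close()
    return counter.depth == 0 and not counter.went_negative


def split_pages(pages):
    """Split a {page_number: html} dict into lists of good and suspect page numbers."""
    good, suspect = [], []
    for number, html in sorted(pages.items()):
        (good if divs_balanced(html) else suspect).append(number)
    return good, suspect


# Two made-up pages: page 1 is fine, page 2 has the extra </div> from the question.
pages = {
    1: '<div id="posts">lot of stuff between</div>',
    2: '<div id="posts">first part</div>second part</div>',
}
good, suspect = split_pages(pages)
print("parse normally:", good)
print("fix by hand:", suspect)
```

The suspect list is then short enough to look at by hand, while the normal findAll logic keeps running on everything else.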