Here is the HTML:
<table>
<tr>
<td class="break">mono</td>
</tr>
<tr>
<td>c1</td>
<td>c2</td>
<td>c3</td>
</tr>
<tr>
<td>c11</td>
<td>c22</td>
<td>c33</td>
</tr>
<tr>
<td class="break">dono</td>
</tr>
<tr>
<td>d1</td>
<td>d2</td>
<td>d3</td>
</tr>
<tr>
<td>d11</td>
<td>d22</td>
<td>d33</td>
</tr>
</table>
Now I want output like this in a CSV file:
mono c1 c2 c3
mono c11 c22 c33
dono d1 d2 d3
dono d11 d22 d33
But I am getting output like this:
mono
c1 c2 c3
c11 c22 c33
dono
d1 d2 d3
d11 d22 d33
Here is my code:
import codecs
from bs4 import BeautifulSoup

with codecs.open('dump.csv', "w", encoding="utf-8") as csvfile:
    f = open("input.html", "r")
    soup = BeautifulSoup(f)
    t = soup.findAll('table')
    for table in t:
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                csvfile.write(str(td.find(text=True)))
                csvfile.write(",")
            csvfile.write("\n")
Please help me resolve this issue. Thanks.
Edit:
To explain in more detail: the text of each section row (mono, dono, etc.) needs to be prepended to the rows below it.
The rule is that, until I encounter a new “break” class, the text inside the current “break” cell should be prepended to every tr below it.
Since your new question is effectively an entirely different question from the original, here’s an entirely different answer:
I’m assuming that a row will either be exactly 1 “break” column, or 1 or more regular columns. If those assumptions aren’t true, the code can be modified.
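A minimal sketch of this approach, under the assumptions just stated. The function name rows_with_headers and the SAMPLE string are made up for illustration, and passing "html.parser" is my choice of parser, not something from your code. It collects the lines in a list, so you can write them to dump.csv however you like.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (first section shortened to two rows each).
SAMPLE = """
<table>
<tr><td class="break">mono</td></tr>
<tr><td>c1</td><td>c2</td><td>c3</td></tr>
<tr><td>c11</td><td>c22</td><td>c33</td></tr>
<tr><td class="break">dono</td></tr>
<tr><td>d1</td><td>d2</td><td>d3</td></tr>
<tr><td>d11</td><td>d22</td><td>d33</td></tr>
</table>
"""

def rows_with_headers(html):
    """Return CSV lines, prefixing each data row with the latest "break" text."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    header = ""  # assumption: data rows before any "break" row get an empty header
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            cols = tr.find_all("td")
            # Raises IndexError on a row with no <td>; see the discussion of edge cases.
            if "break" in cols[0].get("class", []):
                # A "break" row only updates the current header; nothing is emitted.
                header = cols[0].get_text(strip=True)
            else:
                # Generator expression inside join: one field per column.
                fields = ",".join(td.get_text(strip=True) for td in cols)
                lines.append(header + "," + fields)
    return lines

for line in rows_with_headers(SAMPLE):
    print(line)
```

To produce the file, open dump.csv as in your original code and write each returned line followed by "\n".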
Also, if the generator expression in the join function confuses you, the same thing can be rewritten as an explicit loop: print the header; then for each column, print that column; then print a newline.

Since you asked for an explanation of 'break' in cols[0].get('class', []), I’ll break it down.

cols is a list of the BS4 Tag objects for every td node in the current tr node.
cols[0] is the first one.
cols[0].get('class', []) treats the Tag object as a dictionary, as described in the docs, and calls the familiar get(key, defaultvalue) method on it.
In BS4, looking up class (or any other multi-valued Tag attribute) by name always returns a list. While BS3 would return 'foo bar' for <td class='foo bar'>, BS4 will return ['foo', 'bar']. (How a repeated attribute such as <td class='foo' class='bar'> is handled depends on the parser.)
So cols[0].get('class', []) will be ['break'] for the <td class='break'> case, and [] for all of the other cases in your sample input.

As mentioned above, I’m assuming that a row will either be exactly 1 “break” column, or 1 or more regular columns. You can see where I’m making use of those assumptions in the code. But if any of those assumptions are broken, you haven’t told us enough to know what you want to do in those cases.
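To make the breakdown of cols[0].get('class', []) concrete, here is a small check you can run yourself. The markup string and the name class_lists are made up for illustration, and "html.parser" is again my assumption about which parser you use.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table: a "break" cell, a plain cell, and a two-class cell.
markup = (
    '<table><tr>'
    '<td class="break">mono</td>'
    '<td>c1</td>'
    '<td class="foo bar">x</td>'
    '</tr></table>'
)
soup = BeautifulSoup(markup, "html.parser")
cols = soup.find_all("td")

# get('class', []) returns the list of classes, or the default [] when there is no class.
class_lists = [td.get("class", []) for td in cols]
print(class_lists)  # [['break'], [], ['foo', 'bar']]

# The membership test used to detect a header cell.
print("break" in cols[0].get("class", []))  # True
```

Because the default is an empty list, the membership test is simply False for cells without a class attribute and never raises.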
If you have any rows with no columns, obviously the cols[0] will raise an IndexError. But you have to decide what to do in that case. Should it do nothing? Print just the header? Change to a state where nothing gets printed until we see a header row? Whatever you decide, it should be easy to code.

If you have any rows with a header followed by normal columns, the normal columns will be ignored. If you have any headers that aren’t the first column in a row, they will be treated like normal values. If you have multiple headers in the same row, all but the first will be ignored. And so on. In each case, this may or may not be what you want. But you have to decide what you want before you can write the code.
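As one illustration of how such a decision turns into code, here is a sketch that uses explicit loops instead of join and picks two policies purely as assumptions: rows with no columns are skipped, and data rows seen before any header are ignored. The name write_rows and the test markup are hypothetical, and you may well want different policies.

```python
import io
from bs4 import BeautifulSoup

def write_rows(html, out):
    """Write one CSV line per data row to the file-like object `out`."""
    soup = BeautifulSoup(html, "html.parser")
    header = None  # no "break" row seen yet
    for tr in soup.find_all("tr"):
        cols = tr.find_all("td")
        if not cols:
            continue  # assumed policy: skip rows with no columns
        if "break" in cols[0].get("class", []):
            header = cols[0].get_text(strip=True)
            continue  # header rows produce no output of their own
        if header is None:
            continue  # assumed policy: ignore data rows before the first header
        # Explicit loop: the header, then each column, then a newline.
        out.write(header)
        for td in cols:
            out.write(",")
            out.write(td.get_text(strip=True))
        out.write("\n")

# Hypothetical input: an empty row, a data row with no header yet, then a normal section.
buf = io.StringIO()
write_rows(
    '<table><tr></tr><tr><td>x</td></tr>'
    '<tr><td class="break">mono</td></tr>'
    '<tr><td>c1</td><td>c2</td></tr></table>',
    buf,
)
print(buf.getvalue())
```

Passing an open file instead of the StringIO buffer writes the same lines to disk.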