I need to extract the first table (not just the contents of the first table tag) from a section of an HTML document. A table may be spread over multiple pages, so it may be split across several table tags, and the section may contain more than one table. My logic is: if there is a text node between two table tags, they are different tables; if there is no text node between them, they are parts of one table. How can I implement this?
I don't use XPath to find the first table because I first need to identify the appropriate section by checking each text node against a regular expression.
html='<body>
<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
</tr>
</table>
<table border="1">
<tr>
<td>row 3, cell 1</td>
<td>row 3, cell 2</td>
</tr>
<tr>
<td>row 4, cell 1</td>
<td>row 4, cell 2</td>
</tr>
</table>
<p>text </p> <!-- Split by text: the table below is a different table -->
<table border="1">
<tr>
<td>row 5, cell 1</td>
<td>row 5, cell 2</td>
</tr>
<tr>
<td>row 6, cell 1</td>
<td>row 6, cell 2</td>
</tr>
</table>
</body>'
This is my current code, which only picks up the first table tag rather than the first logical table (rows 1-4 in my sample). I use the table_parser gem to extract the table.
require 'nokogiri'
require 'table_parser'
doc = Nokogiri::HTML(html)
table = Array.new
i = 0
doc.traverse do |node|
  if node.name == 'table' && i == 0
    table = TableParser::Parser::extract_table(node, node.path)
    i += 1
  end
end
puts table
It sounds like you want to merge consecutive tables: