I'm converting some Python scripts that use regex to extract contents from an HTML

Question


Asked: June 9, 2026


I'm converting some Python scripts that use regex to extract content from HTML output over to libxml2, but since I'm just starting with it, a little help would be appreciated.

How can I extract the values for “Working Dir”, “Packages/Updates”, and “Java Data Model” from the example below using lxml?

<tr>
  <script>writeTD("row");</script>
  <td class="oddrow"><nobr>Working Dir</nobr></td>
  <script>writeTD("rowdata-l");</script>
  <td class="oddrowdata-l">/serves/test_servers</td>
</tr> 
<script>swapRows();</script>
<tr>
  <script>writeTD("row");</script>
  <td class="evenrow"><nobr>Packages/Updates</nobr></td>
  <script>writeTD("rowdata-l");</script>
  <td class="evenrowdata-l"><a href="updates.dsp">View</a></td>
</tr> 
<script>swapRows();</script>
<tr>
  <script>writeTD("row");</script>
  <td class="oddrow"><nobr>Java Data Model</nobr></td>
  <script>writeTD("rowdata-l");</script>
  <td class="oddrowdata-l">64-bit</td>
</tr>
</tbody></table>
</td>
</tr>
</tbody></table>

Thanks in advance.


1 Answer

Editorial Team · Answer 1 · 2026-06-09T01:57:54+00:00

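With the HTML you posted stored in a string named content,

import lxml.html as LH

# content is assumed to hold the HTML from the question as a string
doc = LH.fromstring(content)
# Lazily yield the text of every <td>, in document order
tds = (td.text_content() for td in doc.xpath('//td'))
# Consume the cells two at a time: (label, value)
for td, val in zip(*[tds]*2):
    if td in ("Working Dir", "Java Data Model"):
        print(td, val)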

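yields (under Python 2, where print with parentheses prints a tuple; Python 3 prints the two strings separated by a space)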

('Working Dir', '/serves/test_servers')
('Java Data Model', '64-bit')

This line does most of the work:

tds = (td.text_content() for td in doc.xpath('//td'))

It uses the xpath() method to search for all <td> tags. It uses the text_content() method to extract the associated text.
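To see why text_content() is used rather than the plain .text attribute, here is a minimal sketch built around the "Packages/Updates" cell from the question. The surrounding <table> and <tr> wrapper is an assumption added so that the snippet parses on its own.

```python
import lxml.html as LH

# One cell from the question, wrapped in a minimal table (the wrapper is an
# assumption for this sketch). The cell's text sits inside a nested <a>.
doc = LH.fromstring(
    '<table><tr>'
    '<td class="evenrowdata-l"><a href="updates.dsp">View</a></td>'
    '</tr></table>'
)
cell = doc.xpath('//td')[0]

# .text only covers text that appears before the first child element.
print(cell.text)
# text_content() gathers the text of the element and all of its descendants.
print(cell.text_content())
```

Here cell.text is None, because the <td> has no text of its own before the <a>, while cell.text_content() returns "View".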

zip(*[tds]*2) is the grouper idiom to iterate over tds in pairs:

for td, val in zip(*[tds]*2):
    print(td,val)
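The idiom can be tried without any HTML at all. The sketch below uses a plain iterator as a stand-in for tds; the values are copied from the question.

```python
# Stand-in for the generator of cell texts (values copied from the question).
cells = iter(["Working Dir", "/serves/test_servers", "Java Data Model", "64-bit"])

# [cells] * 2 is a list holding the SAME iterator twice, so zip() pulls two
# consecutive items from it for every tuple it builds.
pairs = list(zip(*[cells] * 2))
print(pairs)
```

This only works because the same iterator object is passed twice. With a list instead of an iterator, zip would pair every item with itself. If the number of items is odd, the last unpaired item is silently dropped.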

Note that this assumes that <td> labels and values follow each other alternately.
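If that assumption might not hold, one alternative (not part of the code above) is to look up each value by its label with an XPath following-sibling step. The sketch below is only an illustration: value_for is a hypothetical helper name, and content is a trimmed copy of the question's HTML with the <script> tags left out and an opening <table><tbody> added so that it is self-contained.

```python
import lxml.html as LH

# Trimmed copy of the question's rows (assumption: <script> tags omitted and
# an opening <table><tbody> added so the fragment is complete).
content = """
<table><tbody>
<tr>
  <td class="oddrow"><nobr>Working Dir</nobr></td>
  <td class="oddrowdata-l">/serves/test_servers</td>
</tr>
<tr>
  <td class="evenrow"><nobr>Packages/Updates</nobr></td>
  <td class="evenrowdata-l"><a href="updates.dsp">View</a></td>
</tr>
<tr>
  <td class="oddrow"><nobr>Java Data Model</nobr></td>
  <td class="oddrowdata-l">64-bit</td>
</tr>
</tbody></table>
"""

doc = LH.fromstring(content)


def value_for(label):
    """Hypothetical helper: text of the first <td> that follows the label's <td>."""
    cells = doc.xpath(
        '//td[normalize-space(.) = $label]/following-sibling::td[1]',
        label=label,  # lxml passes keyword arguments in as XPath variables
    )
    return cells[0].text_content().strip() if cells else None


labels = ("Working Dir", "Packages/Updates", "Java Data Model")
results = {label: value_for(label) for label in labels}
print(results)
```

Because following-sibling::td only matches <td> elements, other sibling elements between the label and the value are skipped, and a missing label gives None instead of shifting every later pair.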
