I am trying to parse some html and I want to retrieve the actual


Asked: June 16, 2026 (2026-06-16T13:47:03+00:00)


I am trying to parse some html and I want to retrieve the actual html between the tags, but instead my code is giving me what I believe is the location of the elements.

Here is my code so far:

import urllib.request, http.cookiejar
from lxml import etree
import io
site = "http://somewebsite.com"


cj = http.cookiejar.CookieJar()
request = urllib.request.Request(site)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0')
html = etree.HTML(opener.open(request).read())

xpath = "//li[1]//cite[1]"
filtered_html = html.xpath(xpath)
print(filtered_html)

Here is a piece of the html:

<div class="f kv">
<cite>
www.
<b>hello</b>
online.com/
</cite>
<span class="vshid">
</div>

Currently my code returns:

[<Element cite at 0x36a65e8>, <Element cite at 0x36a6510>, <Element cite at 0x36a64c8>]

How do I extract the actual html code between the cite tags? If I add “/text()” to the end of my xpath it gets me closer, but it leaves out what is in the b tags. My ultimate goal is for my code to give me “www.helloonline.com/”.

Thank you


1 Answer

Editorial Team · Answer 1 · 2026-06-16T13:47:04+00:00


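xpath() returns a list of elements, which is why printing it shows <Element cite at 0x...> objects rather than text. Take an element from that list and run a relative .//text() query on it to get every text node beneath it, including the text inside nested tags such as <b>:

cite = filtered_html[0]  # xpath() returns a list; take the first matching <cite>
text = cite.xpath('.//text()')  # the leading '.' keeps the search inside this element
print(''.join(t.strip() for t in text))  # prints "www.helloonline.com/"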
