I’m not sure if this is possible and lxml documentation is not very good to me.
Can I for example use something like:
import lxml.html as lx
x = lx.parse('http://web.info/page.html')
y = x.xpath('\\something\interesting'[2])
or similar, so that I don’t download whole page?
If not with lxml is there some Python module that can do this?
No:
lxmlhas to parse the whole page before it can be guaranteed to find an individual bit of it, and to parse it the whole page, it obviously has to download the whole page. (But see also unutbu’s answer for a potential partial downloading/parsing approach.)And although I believe one can make HTTP requests for part of a file (I think via the
rangeheader?), that’s not guaranteed to be supported on the server side.It’s a shame that HTTP doesn’t include a method for sending an XPath query to the server along with the page request, and have the results of running that query on the page sent back.