I'm using Jsoup to parse some HTML and extract a PDF URL.
The PDF is shown in an <embed> tag like:
<html>
<body marginwidth="0" marginheight="0" style="background-color: rgb(38,38,38)">
<embed width="100%" height="100%" name="plugin" src="http://www.domain.com/apdf_id.pdf?tp=&arnumber=1253069&isnumber=28038" type="application/pdf">
</body>
</html>
How can I get the PDF URL from that page, so that I can download it to my local machine?
Just select the <embed type="application/pdf"> element and get its src attribute as an absolute URL.
You could also select specifically the <embed name="plugin"> instead.
Then you can use java.net.URL to obtain it as an InputStream.
Finally, just write it to an arbitrary OutputStream such as FileOutputStream the usual way.
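The steps above can be sketched roughly as follows. This is a minimal example, not a definitive implementation: the class name PdfEmbedDownloader and its method names are made up for illustration, it assumes Jsoup is on the classpath, and it uses Files.copy in place of a hand-written FileOutputStream loop.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; the class and method names are illustrative only.
class PdfEmbedDownloader {

    // Returns the absolute URL of the first <embed type="application/pdf">,
    // or null if the document has no such element.
    static String findPdfUrl(Document doc) {
        Element embed = doc.select("embed[type=application/pdf]").first();
        // Alternative: doc.select("embed[name=plugin]").first()
        return embed == null ? null : embed.absUrl("src");
    }

    // Convenience overload for HTML that is already in memory.
    // baseUri is used to resolve a relative src attribute.
    static String findPdfUrl(String html, String baseUri) {
        return findPdfUrl(Jsoup.parse(html, baseUri));
    }

    // Opens the URL as an InputStream and copies it to the target file.
    // Assumes the URL is syntactically valid; URI.create throws
    // IllegalArgumentException otherwise.
    static void download(String url, Path target) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: PdfEmbedDownloader <page-url> <output-file>");
            return;
        }
        // Fetching the page over HTTP sets the document's base URI automatically.
        Document doc = Jsoup.connect(args[0]).get();
        String pdfUrl = findPdfUrl(doc);
        if (pdfUrl == null || pdfUrl.isEmpty()) {
            System.err.println("No PDF <embed> with a resolvable src found.");
            return;
        }
        download(pdfUrl, Paths.get(args[1]));
    }
}
```

Note that absUrl returns an empty string when the src cannot be resolved to an absolute URL, which is why main checks for both null and empty before downloading.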