I’m attempting to use wget with the -p option to download specific documents and

Question

Asked: May 14, 2026

I’m attempting to use wget with the -p option to download specific documents and the images linked in the HTML.

The problem is that the site hosting the HTML puts some non-HTML information before the HTML. As a result, wget does not seem to interpret the document as HTML and does not search for images.

Is there a way to have wget strip the first X lines and/or force searching for images?

Example URL:

http://www.sec.gov/Archives/edgar/data/13239/000119312510070346/ds4.htm

First Lines of Content:

<DOCUMENT>
<TYPE>S-4
<SEQUENCE>1
<FILENAME>ds4.htm
<DESCRIPTION>FORM S-4
<TEXT>
<HTML><HEAD>
<TITLE>Form S-4</TITLE>

Last Lines of Content:

</BODY></HTML>
</TEXT>
</DOCUMENT>

EDIT: Solutions in PHP are certainly accepted.

1 Answer

Editorial Team · Answer 1 · 2026-05-14T03:37:35+00:00

Wget is actually detecting the img tags. The issue is that the website in question has a robots.txt that disallows /Archives. Wget honors that rule and does not retrieve the additional documents.

However, you can use the downloaded document as input to wget to retrieve related documents:

wget -l 1 --base=url --force-html -i file
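
The two steps can be wrapped in a small shell function. This is only a minimal sketch: the helper names `saved_name` and `fetch_with_links` are made up for illustration, and it assumes the document URL ends in a file name, as the example URL does. Note that `-i` with `--force-html` follows every link found in the file, not only the images.

```shell
#!/bin/sh
# Hypothetical helper: print the part of a URL after the last slash,
# which is used below as the local file name.
saved_name() {
    printf '%s\n' "${1##*/}"
}

# Hypothetical helper: download one document, then fetch what it links to.
# $1 = URL of the document
fetch_with_links() {
    url=$1
    file=$(saved_name "$url")
    # Step 1: fetch the document itself; -O sets the output file name.
    wget -O "$file" "$url" || return 1
    # Step 2: read the saved file as HTML, resolve relative links against
    # the original URL, and download each link found in it.
    wget --force-html --base="$url" -i "$file"
}
```

With the URL from the question, this would be called as `fetch_with_links 'http://www.sec.gov/Archives/edgar/data/13239/000119312510070346/ds4.htm'`.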
