The Archive Base

Editorial Team
Asked: May 27, 2026

I have implemented a web crawler that crawls and retrieves content from .edu TLD.


I have implemented a web crawler that crawls and retrieves content from the .edu TLD. The HTML of each page is inserted into MySQL tables as the page's source code. The script can run for hours on a decent internet connection when a large number of seed URLs are fed to the crawler. My problem is that the script halts after crawling a number of links, without giving any errors. I have used exception handling to deal with the "MySQL server has gone away" error, have already eliminated a lot of other problems, and have added if conditions that echo errors when they are encountered. However, I am not getting any errors. The script halts whether I run it in the browser, in Eclipse PDT, or from the CLI, though it is worth noting that the number of links crawled differs somewhat between the three. I have altered max_execution_time and other directives in php.ini, but this is not helping in any way.

I have coded the script so that it resumes the crawling from where it halted, but I want the script to continue without halting so that I don’t have to monitor whether the script is running or not.

Should I make changes to my Apache httpd.conf file? If so, what should those settings be?

The descriptions of my web crawler in these related questions may help:

  • Errors regarding Web Crawler in PHP
  • Solving "MySQL server has gone away" errors

This is the code that retrieves the HTML from a URL. It is taken from simple_html_dom.

function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $defaultBRText);
    // For sourceforge users: uncomment the next line and comment the retrieve_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    // $contents = retrieve_url_contents($url);
    if (empty($contents))
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}

Here is the error log for the following links:

  • http://www.nust.edu.pk/
  • http://www.harvard.edu/
  • http://berkeley.edu/
  • http://www.columbia.edu/
  • http://www.princeton.edu/main
  • http://www.stanford.edu/

And the crawler stopped after crawling this link:

  • http://itunes.columbia.edu/m/

[01-Jan-2012 22:54:39] PHP Warning: file_get_contents() [streams.crypto]: this stream does not
support SSL/crypto in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:54:39] PHP Warning:
file_get_contents(http://lms.nust.edu.pk) [function.file-get-contents]:
failed to open stream: Cannot connect to HTTPS server through proxy in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:54:41] PHP Warning:
file_get_contents(http://www.nust.edu.pk/#) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

… (same error repeated twice) …

[01-Jan-2012 22:55:58] PHP Warning:
file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#ipo) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

[01-Jan-2012 22:55:58] PHP Warning:
file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#tto) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

[01-Jan-2012 22:55:59] PHP Warning:
file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#ilo) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

[01-Jan-2012 22:55:59] PHP Warning:
file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#mco) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

[01-Jan-2012 22:56:05] PHP Warning:
file_get_contents(http://www.nust.edu.pk/#) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

… (same error repeated 18 times) …

[01-Jan-2012 22:57:33] PHP Warning:
file_get_contents(http://www.nust.edu.pk/#ctl00_SiteMapPath1_SkipLink)
[function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

[01-Jan-2012 22:57:33] PHP Notice: Undefined variable: parts in
D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 22:57:55] PHP Warning:
file_get_contents(http://www.harvard.edu/#skip) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

[01-Jan-2012 22:58:21] PHP Warning:
file_get_contents(http://www.harvard.edu/admissions-aid#undergrad) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

[01-Jan-2012 22:58:22] PHP Warning:
file_get_contents(http://www.harvard.edu/admissions-aid#grad) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

[01-Jan-2012 22:58:24] PHP Warning:
file_get_contents(http://www.harvard.edu/admissions-aid#continue) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

[01-Jan-2012 22:58:25] PHP Warning:
file_get_contents(http://www.harvard.edu/admissions-aid#summer) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

[01-Jan-2012 23:00:04] PHP Warning:
file_get_contents(http://www.harvard.edu/#) [function.file-get-contents]:
failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line
72

… (same error repeated 1 time) …

[01-Jan-2012 23:00:11] PHP Notice: Undefined variable: parts in
D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:00:41] PHP Warning: file_get_contents() [streams.crypto]: this stream does not
support SSL/crypto in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:00:41] PHP Warning:
file_get_contents(http://directory.berkeley.edu) [function.file-get-contents]:
failed to open stream: Cannot connect to HTTPS server through proxy in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:00:47] PHP Notice: Undefined variable: parts in
D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:01:53] PHP Warning: file_get_contents() [streams.crypto]: this stream does not
support SSL/crypto in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:53] PHP Warning:
file_get_contents(http://students.berkeley.edu/uga/) [function.file-get-contents]:
failed to open stream: Cannot connect to HTTPS server through proxy in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:57] PHP Warning: file_get_contents() [streams.crypto]: this stream does not
support SSL/crypto in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:57] PHP Warning:
file_get_contents(http://publicservice.berkeley.edu/) [function.file-get-contents]:
failed to open stream: Cannot connect to HTTPS server through proxy in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:00] PHP Warning: file_get_contents() [streams.crypto]: this stream does not
support SSL/crypto in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:00] PHP Warning:
file_get_contents(http://students.berkeley.edu/osl/leadprogs.asp) [function.file-get-contents]:
failed to open stream: Cannot connect to HTTPS server through proxy in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:17] PHP Notice: Undefined variable: parts in
D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:02:25] PHP Warning: file_get_contents() [streams.crypto]: this stream does not
support SSL/crypto in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:25] PHP Warning:
file_get_contents(http://bearfacts.berkeley.edu/bearfacts) [function.file-get-contents]:
failed to open stream: Cannot connect to HTTPS server through proxy in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:28] PHP Warning: file_get_contents() [streams.crypto]: this stream does not
support SSL/crypto in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:28] PHP Warning:
file_get_contents(http://career.berkeley.edu/) [function.file-get-contents]:
failed to open stream: Cannot connect to HTTPS server through proxy in
D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

And this is the error log from php-cgi.exe:

Problem signature:
  Problem Event Name: APPCRASH
  Application Name: php-cgi.exe
  Application Version: 5.3.8.0
  Application Timestamp: 4e537939
  Fault Module Name: php5ts.dll
  Fault Module Version: 5.3.8.0
  Fault Module Timestamp: 4e537a04
  Exception Code: c0000005
  Exception Offset: 0000c793
  OS Version: 6.1.7601.2.1.0.256.48
  Locale ID: 1033
  Additional Information 1: 0a9e
  Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
  Additional Information 3: 0a9e
  Additional Information 4: 0a9e372d3b4ad19135b953a78882e789

Please help me in this regard.


1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 9:44 pm

    You should check the call stack of the PHP process (if running as CGI or CLI) or of the Apache httpd process (if running as mod_php).

    Then you will see in which module or procedure execution halted.
    You can also check the active TCP/IP connections made by your script; there may be an ongoing I/O operation that caused your script to halt.
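    If a stalled network read turns out to be the cause, one way to narrow it down from inside the script is to put a timeout on every fetch and to log each URL before requesting it, so the last line of the log names the URL in progress when the script stopped. The following is only a minimal sketch under those assumptions: build_fetch_context, fetch_with_progress_log, the log file path, and the 15-second default are hypothetical names and values, not part of simple_html_dom or of the crawler in the question.

```php
<?php
// Hypothetical helpers for illustration; not part of simple_html_dom.

// Build a stream context whose HTTP reads give up after $timeoutSeconds.
// Without it, file_get_contents() falls back to default_socket_timeout.
function build_fetch_context($timeoutSeconds)
{
    return stream_context_create(array(
        'http' => array(
            'timeout' => $timeoutSeconds, // read timeout, in seconds
        ),
    ));
}

// Fetch $url, writing one log line before the request and one after it.
// If the process hangs or crashes mid-request, the log ends with a
// "fetching" line that has no matching "ok" or "failed" line.
function fetch_with_progress_log($url, $logFile, $timeoutSeconds = 15)
{
    $stamp = gmdate('Y-m-d H:i:s');
    file_put_contents($logFile, "$stamp fetching $url\n", FILE_APPEND | LOCK_EX);

    $contents = file_get_contents($url, false, build_fetch_context($timeoutSeconds));

    $status = ($contents === false) ? 'failed' : 'ok';
    $stamp  = gmdate('Y-m-d H:i:s');
    file_put_contents($logFile, "$stamp $status $url\n", FILE_APPEND | LOCK_EX);

    return $contents;
}
```

    Since the file_get_html signature quoted in the question accepts a $context as its third argument, the same context could presumably be passed straight through, for example file_get_html($url, false, build_fetch_context(15)). Note that a timeout only helps with stalled I/O; it would not prevent a crash inside php5ts.dll like the one in the problem signature, although the progress log should still show which URL was being processed at that moment.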

    I hope this helps.

