Here is a function that validates .edu TLD and checks that the url does


Asked: May 27, 2026 (2026-05-27T23:26:48+00:00)


Here is a function that validates the .edu TLD and checks that the URL does not point to a .pdf or .doc document.

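public function validateEduDomain($url) {
    // Accept only http(s) URLs on a .edu host that do not end in .pdf or .doc.
    // The alternation is grouped as (pdf|doc) so that "\." and "$" apply to both extensions.
    if( preg_match('/^https?:\/\/[A-Za-z]+[A-Za-z0-9\.-]+\.edu/i', $url) && !preg_match('/\.(pdf|doc)$/i', $url) )  {
        return TRUE;
    }
    return FALSE;
}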

Now I am encountering links that point to .jpg, .rtf, and other files, which simple_html_dom tries to parse and return content for. I want to skip all such links, but I cannot list every possible extension in advance. How can I do that?


1 Answer

Editorial Team · Answer 1 · 2026-05-27T23:26:48+00:00

Trying to filter URLs by guessing what is behind them will always fail in some cases. Assuming you are using curl to download, check instead whether the Content-Type header of the response is one of the types you accept:

<?php

require "simple_html_dom.php";

$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it

$urls = array(
  "google.com",
  "https://www.google.com/logos/2012/newyearsday-2012-hp.jpg",
  "http://cran.r-project.org/doc/manuals/R-intro.pdf",
);
$acceptable_types = array("text/html", "application/xhtml+xml");

foreach ($urls as $url) {
  curl_setopt($curl, CURLOPT_URL, $url);
  $contents = curl_exec($curl);
  if ($contents === false) {
    // the transfer itself failed, so there is nothing to parse
    echo "rejecting {$url} (" . curl_error($curl) . ")\n";
    continue;
  }

  // The header may carry parameters, e.g. "text/html; charset=utf-8",
  // so keep only the media type and normalise its case.
  list($response_type) = explode(";", (string) curl_getinfo($curl, CURLINFO_CONTENT_TYPE));
  $response_type = strtolower(trim($response_type));

  if (in_array($response_type, $acceptable_types)) {
    echo "accepting {$url}\n";
    // create a simple_html_dom object from string
    $obj = str_get_html($contents);
  } else {
    echo "rejecting {$url} ({$response_type})\n";
  }
}

Running the above results in:

accepting google.com
rejecting https://www.google.com/logos/2012/newyearsday-2012-hp.jpg (image/jpeg)
rejecting http://cran.r-project.org/doc/manuals/R-intro.pdf (application/pdf)
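The loop above downloads each document in full before looking at its type. If bandwidth matters, one possible variation is to request only the headers first and fetch the body afterwards for the URLs that pass. The following is a minimal sketch, assuming PHP 7.1 or later; the function names normalizeMediaType and looksLikeHtml are hypothetical helpers introduced here for illustration, and the redirect and timeout limits are arbitrary choices. Some servers answer HEAD requests incorrectly or not at all, so treat a rejection from this check as a hint rather than proof.

```php
<?php

/**
 * Reduce a raw Content-Type header value to its bare media type,
 * e.g. "Text/HTML; charset=utf-8" becomes "text/html".
 * (Hypothetical helper, not part of cURL or simple_html_dom.)
 */
function normalizeMediaType(?string $header): string
{
    if ($header === null) {
        return '';
    }
    // Drop parameters such as "; charset=utf-8", then normalise case and whitespace.
    $parts = explode(';', $header, 2);
    return strtolower(trim($parts[0]));
}

/**
 * Ask the server for headers only and report whether the resource
 * advertises one of the acceptable media types.
 * (Hypothetical helper; the limits below are assumptions.)
 */
function looksLikeHtml(string $url, array $acceptableTypes = ['text/html', 'application/xhtml+xml']): bool
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);          // issue a HEAD request, skip the body
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // do not print anything
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // judge the final target, not the redirect
    curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);

    $ok = curl_exec($curl) !== false;
    // CURLINFO_CONTENT_TYPE is null when the server sent no usable header.
    $type = $ok ? curl_getinfo($curl, CURLINFO_CONTENT_TYPE) : null;

    return $ok && in_array(normalizeMediaType($type ?: null), $acceptableTypes, true);
}
```

A crawler could call looksLikeHtml($url) before handing the URL to simple_html_dom, and fall back to the full download shown earlier when the HEAD request fails.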
