I’ve got a list of websites for each US Congress member that I’m programmatically

Question


Asked: May 13, 2026 (2026-05-13T07:11:27+00:00)


I’ve got a list of websites for each US Congress member that I’m programmatically crawling to scrape addresses. The sites vary widely in their underlying markup. That wasn’t a problem at first, but then I noticed that hundreds of sites were not giving the expected results from the script I had written.

After taking some more time to evaluate potential causes, I found that calling strip_tags() on the results of file_get_contents() was, in many cases, erasing most of the page source! It was removing not only the HTML but also the non-HTML text that I wanted to scrape!

So I removed the call to strip_tags(), substituted a call that removes all non-alphanumeric characters, and gave the process another run. It turned up some additional results, but many were still missing. This time it was because my regular expressions weren’t matching the desired patterns. After looking at the returned text, I realized that remnants of HTML attributes were interspersed throughout it, breaking my patterns.

Is there a way around this? Is it the result of malformed HTML? Can I do anything about it?


1 Answer

Editorial Team · Answer 1 · 2026-05-13T07:11:28+00:00

There’s a warning in the PHP manual that reads:

"Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected."
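
To see that warning in action, here is a minimal sketch. The input string is a made-up fragment with an unterminated tag, not markup taken from any of the sites in question, and the exact result may differ between PHP versions.

```php
<?php
// Hypothetical fragment: the "<b" is never closed with ">".
$broken = 'Office: 123 Example Ave <b Washington, DC 20515';

// strip_tags() does not validate the markup. Once it sees "<b" it treats
// everything that follows as part of a tag until it finds a ">".
$stripped = strip_tags($broken);

// The city, state and ZIP code are expected to be missing from the result,
// even though they were plain text in the source.
var_dump($stripped);
```

A single broken tag early in a page is enough to lose most of the text after it, which matches the symptom described in the question.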

Since you are scraping many different sites, and you can’t account for the validity of their HTML, this will always be a problem. Unfortunately, regexps aren’t going to do it for you either, as regexps simply aren’t cut out to be document parsers.

I would use something like PHP Simple HTML DOM Parser, or even the built-in DOMDocument->loadHTML() method.
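
As a rough illustration of the DOMDocument route, the sketch below pulls the visible text out of a page without strip_tags() or regexps over raw markup. The function name extractVisibleText() and the sample HTML are assumptions made for this example; in practice the HTML would come from file_get_contents().

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body><div id="contact"><p>123 Example Ave<br>'
      . 'Washington, DC 20515</p></div><script>var x = 1;</script></body></html>';

// Illustrative helper (not part of PHP): returns the visible text of a page.
function extractVisibleText(string $html): string
{
    if ($html === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Drop script and style elements so their contents do not leak into the text.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Turn <br> into a newline so adjacent lines do not run together.
    foreach (iterator_to_array($xpath->query('//br')) as $br) {
        $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $body->textContent));
}

echo extractVisibleText($html), "\n";
```

Because the parser builds a tree first, attribute values never end up in the extracted text, so address patterns can be matched against clean text afterwards.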

You could keep a small database that recorded each page you wanted to scrape, and where the information was found in the structure of that page. Each time you scraped it, you could do a quick check to see if the structure had changed, in which case you could update your database with the new path location for your DOM parser, and get it on the next scrape.
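
A minimal sketch of that check follows, assuming the stored location is an XPath expression. The $record array stands in for a database row, and scrapeWithStoredPath(), the URL and the sample markup are all hypothetical names and data chosen for illustration.

```php
<?php
// Hypothetical record; in practice this would be one database row per site.
$record = [
    'url'   => 'https://example.gov/contact',
    'xpath' => '//div[@id="office"]',
];

// Illustrative helper: returns the text at the stored path,
// or null when the path no longer matches anything.
function scrapeWithStoredPath(string $html, string $xpathExpr): ?string
{
    if ($html === '') {
        return null;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpathExpr);
    if ($nodes === false || $nodes->length === 0) {
        return null; // structure changed, or the stored path is wrong
    }

    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}

// Sample pages standing in for two scrapes of the same site.
$before = '<html><body><div id="office">123 Example Ave, Washington, DC 20515</div></body></html>';
$after  = '<html><body><div id="contact-info">123 Example Ave, Washington, DC 20515</div></body></html>';

$first  = scrapeWithStoredPath($before, $record['xpath']);
$second = scrapeWithStoredPath($after, $record['xpath']);

if ($second === null) {
    // Flag the record for review so the stored path can be updated.
    echo "Structure changed for {$record['url']}\n";
}
```

A null result is the signal to revisit that site and store a new path before the next scrape.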
