I have a crawler that downloads webpages, scrapes specific content and then stores that

Question


Asked: May 27, 2026


I have a crawler that downloads webpages, scrapes specific content and then stores that content in a MySQL database. Later that content is displayed on a webpage when it’s searched for (a standard search-engine type of setup).

The content generally arrives in one of three states: declared as UTF-8, declared as ISO-8859-1, or with no encoding specified at all. My database tables use the cp1252 West European (latin1) encoding. Up until now, I’ve simply used a regular expression to strip every character that is not alphanumeric, whitespace or punctuation before storing the content in MySQL. For the most part, this has eliminated all character encoding problems, and content is displayed properly when it is recalled and output as HTML. Here is the code I use:

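function clean_string( $string )
{
    // The connection is created in global scope, so it must be imported
    // explicitly; otherwise $mysqli is undefined inside the function.
    global $mysqli;

    // Strip leading and trailing whitespace.
    $string = trim( $string );

    // Drop every byte that is not an ASCII letter, digit, whitespace or punctuation.
    $string = preg_replace( '/[^a-zA-Z0-9\s\p{P}]/', '', $string );

    // Escape the result for use inside an SQL statement.
    $string = $mysqli->real_escape_string( $string );

    return $string;
}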

I now need to start capturing “special” characters like trademark, copyright, and registered symbols, and am having trouble. No matter what I try, I end up with weird characters when I redisplay the content in HTML.

From what I’ve read, it sounds like I should use UTF-8 for my database encoding. How do I ensure all my data is converted properly before storing it to the database? Remember that my original content comes from all over the web in various encoding formats. Are there other steps I’m overlooking that may be giving me problems?


1 Answer

Editorial Team · Answer 1 · 2026-05-27T01:17:08+00:00

You should convert your database encoding to UTF-8.
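As a rough sketch of what that conversion could look like from PHP: the helper below only builds the MySQL statement, and its name (utf8mb4_convert_sql), the table name (pages) and the connection details are all made up for illustration. It assumes the existing rows really are valid latin1, and it targets utf8mb4 because MySQL's older "utf8" character set stores at most three bytes per character.

```php
<?php
// Hypothetical helper (not part of the original answer): build the statement
// that converts one table, including its existing data, to utf8mb4.
function utf8mb4_convert_sql(string $table): string
{
    // Quote the identifier; a backtick inside the name is doubled.
    $quoted = '`' . str_replace('`', '``', $table) . '`';

    return "ALTER TABLE $quoted CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci";
}

// Possible usage with an existing mysqli connection (placeholder credentials):
// $mysqli = new mysqli('localhost', 'user', 'pass', 'crawler');
// $mysqli->set_charset('utf8mb4');              // the connection must use UTF-8 as well
// $mysqli->query(utf8mb4_convert_sql('pages')); // 'pages' is an assumed table name
```

The connection charset matters as much as the table charset: if the connection stays on latin1, UTF-8 bytes sent by PHP are reinterpreted on the way in. The HTML page that displays the results should declare UTF-8 too.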

About the content: for every page you crawl, detect the page’s encoding (from the HTTP Content-Type header or the meta charset tag) and convert from that encoding to UTF-8 like this:

$string = iconv("THIS STRING'S ENCODING", "UTF-8", $string);

Where THIS STRING'S ENCODING is the one you just detected as described above. Note the argument order: iconv() takes the source encoding first and the target encoding second.

PHP manual on iconv: https://www.php.net/manual/en/function.iconv.php
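To make the detect-then-convert step concrete, here is a minimal sketch. The function names (detect_charset, to_utf8) and the fallback behaviour are assumptions for illustration, not something the answer prescribes. It checks the Content-Type header first, then a meta tag, and treats undeclared content as UTF-8 only if the bytes are valid UTF-8. Pages labelled ISO-8859-1 are decoded as Windows-1252, the way browsers do it, because the trademark sign (byte 0x99) exists only in Windows-1252.

```php
<?php
// Hypothetical helper: work out which charset a downloaded page claims to use.
function detect_charset(string $html, ?string $contentType): string
{
    $charset = null;

    // 1. HTTP header, e.g. "text/html; charset=ISO-8859-1"
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        $charset = strtoupper($m[1]);
    // 2. <meta charset="..."> or <meta http-equiv="Content-Type" content="...charset=...">
    } elseif (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        $charset = strtoupper($m[1]);
    // 3. Nothing declared: accept as UTF-8 only if the bytes are valid UTF-8.
    } elseif (preg_match('//u', $html) === 1) {
        $charset = 'UTF-8';
    }

    // Assumption: decode ISO-8859-1 (and undeclared non-UTF-8 input) as Windows-1252.
    if ($charset === null || $charset === 'ISO-8859-1' || $charset === 'LATIN1') {
        $charset = 'WINDOWS-1252';
    }

    return $charset;
}

// Hypothetical helper: return the page body as UTF-8.
function to_utf8(string $html, ?string $contentType = null): string
{
    $charset = detect_charset($html, $contentType);
    if ($charset === 'UTF-8') {
        return $html;
    }

    // iconv(from, to, string); returns false for an unknown or invalid encoding.
    $converted = @iconv($charset, 'UTF-8', $html);

    // On failure the input is returned unchanged; a real crawler may prefer to log it.
    return $converted === false ? $html : $converted;
}
```

For example, to_utf8("Acme\x99", 'text/html; charset=ISO-8859-1') yields "Acme" followed by the UTF-8 bytes of the trademark sign, which can then be stored in a utf8mb4 column without the filtering regular expression.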
