I have a problem with PHP, cURL and UTF-8 Greek characters.
I am trying to retrieve some text from a website (a blog, specifically), but when I read the retrieved text it is corrupted. It shows up as something like Î ÏκοÏÏÏ ÏÎ¿Ï ÏÏÏÏον. English characters, on the other hand, display correctly.
The website’s charset is ‘UTF-8’ and so is the charset in my script.
I use the following settings for CURL.
$ch = curl_init();
$useragent='Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2';
$header = array('Accept-Charset: UTF-8');
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 2);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 3);
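One way to check what the server itself declares is to read the Content-Type of the response once the transfer has run. The following is only a sketch: charsetFromContentType is a hypothetical helper name, and the commented call assumes curl_exec($ch) has already been executed on the handle configured above.

```php
<?php
// Hypothetical helper (not part of cURL): extract the charset from a
// Content-Type header value such as "text/html; charset=utf-8".
function charsetFromContentType(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null; // The server did not declare a charset.
}

// With a live handle, after curl_exec($ch):
//   $charset = charsetFromContentType(curl_getinfo($ch, CURLINFO_CONTENT_TYPE));

var_dump(charsetFromContentType('text/html; charset=utf-8')); // string(5) "UTF-8"
var_dump(charsetFromContentType('text/html'));                // NULL
```

If this returns null, the page declares its charset only inside the markup (or not at all), which matters for how the HTML parser later interprets the bytes.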
I use an XPath query, $res = $xp->query("..."), to locate the text.
Then I extract the text like this:
foreach($res as $text_result)
$texter=trim($text_result->nodeValue);
I checked the encoding of the returned text with mb_detect_encoding, and it is reported as ‘UTF-8’.
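A "UTF-8" result from detection does not rule out corruption, because text that was decoded with the wrong charset and then re-encoded is still valid UTF-8. The sketch below simulates that with a hypothetical sample word; the assumption that the bytes were misread as ISO-8859-1 is mine, although it is consistent with the garbled output shown above.

```php
<?php
// Hypothetical sample: a short Greek word stored as UTF-8 bytes.
$original = 'Καλημέρα';

// Simulate the bug: treat the UTF-8 bytes as ISO-8859-1 and re-encode to UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// The garbled string is still well-formed UTF-8, so detection cannot flag it.
var_dump(mb_check_encoding($garbled, 'UTF-8')); // bool(true)
var_dump($garbled === $original);               // bool(false)

// Undoing the wrong conversion recovers the original text.
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $original);              // bool(true)
```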
The script runs correctly with most of the websites, but it fails with two of them.
I can’t figure out what the problem may be.
Does anyone have an idea?
Thank you all in advance.
UPDATE
I have fixed the error by adding this:
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
However, when I insert the text into the database, it is still corrupted. The same script works fine on my PC (EasyPHP).
I have a free hosting account at 000webhost.
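To illustrate the loadHTML change from the update: when the markup carries no charset declaration, DOMDocument does not assume UTF-8, and the prepended XML declaration acts as an encoding hint. This is a minimal sketch with a hypothetical fragment and a hypothetical helper name (firstParagraph); the exact fallback charset depends on the libxml2 version bundled with PHP, so results may differ between hosts.

```php
<?php
// Hypothetical page fragment with no charset declaration in the markup.
$html = '<html><body><p id="t">Καλημέρα</p></body></html>';

// Hypothetical helper: parse the markup and return the text of <p id="t">.
function firstParagraph(string $markup): string
{
    libxml_use_internal_errors(true); // Keep parser warnings out of the output.
    $doc = new DOMDocument();
    $doc->loadHTML($markup);
    libxml_clear_errors();

    $xp = new DOMXPath($doc);
    return trim($xp->query('//p[@id="t"]')->item(0)->nodeValue);
}

// Without a hint the bytes are read with a single-byte fallback charset.
$plain = firstParagraph($html);

// With the hint from the update, the parser reads the input as UTF-8.
$hinted = firstParagraph('<?xml encoding="UTF-8">' . $html);

var_dump($plain === 'Καλημέρα');
var_dump($hinted === 'Καλημέρα');
```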
I found the solution.
I had to convert the HTML entities encoding, by:
Solution was given here: solution
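The snippet itself is not shown above, so the following is only a guess at the kind of conversion meant, not the exact code from the linked solution. Older answers often call mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'), which is deprecated as of PHP 8.2; this sketch uses mb_encode_numericentity instead. The fragment and variable names are hypothetical.

```php
<?php
// Hypothetical fragment standing in for the cURL response body.
$html = '<p id="t">Καλημέρα</p>';

// Rewrite every non-ASCII character as a numeric entity (for example &#922;),
// leaving the ASCII markup untouched.
$convmap  = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$entities = mb_encode_numericentity($html, $convmap, 'UTF-8');

// The input is now pure ASCII, so the parser's charset guess no longer matters.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($entities);
libxml_clear_errors();

// nodeValue is returned as UTF-8 with the entities decoded.
$text = trim($doc->getElementsByTagName('p')->item(0)->nodeValue);
var_dump($text === 'Καλημέρα');
```

Because the entity-encoded string contains only ASCII bytes, it passes through the parser unchanged regardless of server configuration, which may explain why this approach behaves the same on different hosts.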