I scrape some sites that occasionally have UTF-8 characters in the title, but that


Asked: May 30, 2026 (2026-05-30T12:23:36+00:00)


I scrape some sites that occasionally have UTF-8 characters in the title but don’t specify UTF-8 as the charset (qq.com is an example). When I look at the website in my browser, the data I want to copy (i.e. the title) looks correct (Japanese or Chinese, I’m not sure which). I can copy the title, paste it into the terminal, and it looks exactly the same. I can even write it to the DB, and when I retrieve it from the DB it still looks the same, and correct.

However, when I use cURL, the data that gets printed is wrong. It makes no difference whether I run cURL from the command line or from PHP: when the output is printed to the terminal it’s clearly incorrect, and it remains that way when I store it to the DB (remember: the terminal can display these characters properly). I’ve tried every combination of the following:

Setting CURLOPT_BINARYTRANSFER to true
mb_convert_encoding($html, 'UTF-8')
utf8_encode($html)
utf8_decode($html)

None of these display the characters as expected. This is very frustrating since I can get the right characters so easily just by visiting the site, but cURL can’t. I’ve read a lot of suggestions such as this one: How to get web-page-title with CURL in PHP from web-sites of different CHARSET?

The solution in general seems to be “convert the data to UTF-8.” To be honest, I don’t actually know what that means. Don’t the above functions convert the data to UTF-8? Why isn’t it already UTF-8? What is it, and why does it display properly in some circumstances, but not for cURL?


1 Answer

Editorial Team · Answer 1 · 2026-05-30T12:23:37+00:00


Have you tried this?

$html = iconv("gb2312","utf-8",$html);

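The gb2312 charset name was taken from the qq.com headers. If those headers are right, the page is not UTF-8 at all but GB2312, so it has to be converted from that encoding, and iconv() takes the source charset as its first argument.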
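The one-liner above hardcodes gb2312, which only fits sites that use that charset. A rough sketch of a more general approach is to read the charset from the Content-Type header (or a meta tag) and pass it to iconv(). The function names detect_charset, to_utf8 and fetch_as_utf8 are made up for this example, and the regular expressions are simple guesses, not a full HTML parser.

```php
<?php
// Hypothetical helper: find a charset name in the Content-Type header value
// or, failing that, in a <meta> tag of the raw HTML. Returns null if none found.
function detect_charset(string $contentType, string $html): ?string
{
    // Matches e.g. "text/html; charset=GB2312"
    if (preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // Matches e.g. <meta charset="gb2312"> or the older http-equiv form
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: convert $html from $charset to UTF-8 with iconv().
function to_utf8(string $html, ?string $charset): string
{
    if ($charset === null || $charset === 'UTF-8' || $charset === 'UTF8') {
        return $html; // nothing known to convert from, or already UTF-8
    }
    // @ hides iconv's own notice; a failure is reported as an exception instead
    $converted = @iconv($charset, 'UTF-8', $html);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $charset to UTF-8");
    }
    return $converted;
}

// Hypothetical wrapper: fetch a page with cURL and return it as UTF-8.
function fetch_as_utf8(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    // Content-Type of the final response, e.g. "text/html; charset=GB2312"
    $contentType = (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return to_utf8($html, detect_charset($contentType, $html));
}
```

Usage would be something like `$html = fetch_as_utf8($url);` before extracting the title. As for why the functions in the question did not help: utf8_encode() and utf8_decode() only convert between ISO-8859-1 and UTF-8, and mb_convert_encoding($html, 'UTF-8') without a third argument assumes the input is already in PHP's internal encoding, so none of them ever treats the bytes as GB2312. If a page labeled gb2312 fails to convert, GBK or GB18030 (both supersets of GB2312) may be worth trying as the source charset.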
