The Archive Base

Editorial Team
Asked: June 14, 2026

I am downloading HTML files (raw HTML without any !DOCTYPE…) from a government website

I am downloading HTML files (raw HTML without any !DOCTYPE…) from a government website and then extracting paragraphs to put them into a MySQL database.

I am using DOMDocument, so I am going

$doc = new DOMDocument();
$doc->loadHTMLFile( "../notifs/notif$notif_no.htm" );

The problem comes because certain characters get transformed into something strange: e.g. (one type of) apostrophe becomes ¢€™.

If I then try and save this para to a text field in a table either it is refused by MySQL or it is recorded as these strange characters… depending on the encoding of the text field.

Also, if I go $doc->saveHTMLFile( "test.htm" ); it actually prints out the strange characters, not the apostrophe.

I know this has something to do with encoding, but several days’ googling and much looking at questions on SE have not led to the solution. Firefox tells me that the downloaded HTML files are in utf-8 encoding. I tried changing the php.ini file so the default_charset is “utf-8”. No joy.

I am more an application programmer than a website person so I am quite new to encoding. I have tried cracking this one myself but just don’t really understand what’s going on or what to do.
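For readers wondering where a sequence like this comes from: a plausible explanation, assuming the files really are UTF-8 and that the parser falls back to a single-byte encoding (ISO-8859-1 or Windows-1252) when the HTML declares no charset, is that each of the three bytes of the curly apostrophe is read as a separate character. A minimal sketch that reproduces the effect with mbstring (the variable names are illustrative only):

```php
<?php
// U+2019 RIGHT SINGLE QUOTATION MARK is three bytes in UTF-8.
$apostrophe = "\xE2\x80\x99";

// Treat those bytes as Windows-1252 and convert them "to" UTF-8,
// which is what a wrong encoding guess amounts to.
$garbled = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');

echo bin2hex($apostrophe), "\n";          // e28099
echo $garbled, "\n";                      // three characters instead of one
echo mb_strlen($garbled, 'UTF-8'), "\n";  // 3
```

If this is what is happening, the fix is to tell the parser the real encoding rather than to repair the text afterwards.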

later

have found that by putting

$file = file_get_contents("../notifs/notif$notif_no.htm");
$doc->loadHTML('<?xml encoding="UTF-8">' . $file );

then saveHTMLFile() outputs with a correct apostrophe… as does my echo of the SQL INSERT INTO … (…) VALUES (…) string. However the text in the MySQL text field obstinately refuses to cooperate. (naturally have tried multiple different collations). Meanwhile, mb_detect_encoding ( $clean_string ) prints “UTF-8” and mb_check_encoding ( $clean_string ) returns TRUE.
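The question does not say how the script connects to MySQL, so the following is only a guess at one common cause: even with a UTF-8 column, text can be mangled if the client connection itself uses a different character set. A minimal sketch, assuming PDO with the MySQL driver; the host, database, table and column names (`notifs`, `paragraphs`, `body`) and both helper functions are hypothetical:

```php
<?php
// Build a PDO DSN that asks MySQL to treat the connection as utf8mb4.
// Host and database name are placeholders, not values from the question.
function buildMysqlDsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Insert one paragraph with a prepared statement, so the text travels as
// data instead of being spliced into the SQL string.
// The table and column names are hypothetical.
function insertParagraph(PDO $pdo, string $text): void
{
    $stmt = $pdo->prepare('INSERT INTO paragraphs (body) VALUES (:body)');
    $stmt->execute([':body' => $text]);
}

// Usage (needs a real server and credentials, so it is left commented out):
// $pdo = new PDO(buildMysqlDsn('localhost', 'notifs'), $user, $password,
//                [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// insertParagraph($pdo, $clean_string);
```

With mysqli the equivalent step would be `$mysqli->set_charset('utf8mb4')`. The column itself also needs a UTF-8 character set for this to help.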

Another puzzling thing, though: if I do

$doc->loadHTML('<?xml encoding="latin1">' . $file );

this same partial success stays the same, right down to the “UTF-8” detected encoding. hmmmm

later

$doc = new DOMDocument();
$file = file_get_contents("../notifs/notif$notif_no.htm");
# without this following line adding an explicit encoding for the DOMDocument nothing worked!
$doc->loadHTML('<?xml encoding="UTF-8">' . $file );

and then, when you’ve extracted some text and cleaned it up a bit, calling it $clean_string

# convert difficult UTF-8 characters into HTML special sequences ("&rsquo;", etc.) 
$clean_string = mb_convert_encoding($clean_string, "HTML-ENTITIES", "UTF-8"); 

After this $clean_string contains sequences like “… wine&rsquo;s worth drinking”… but I, for one, can still be quite confused, because if you simply go

echo ">>> clean string $clean_string<br>";

… the “&rsquo;” sequence will of course be displayed by the browser as ’ (a right single quote).

This is probably absolutely obvious to most PHPers… but if you want to display an accurate picture of what you have in $clean_string you have to go

$decoded_clean_string = htmlspecialchars( $clean_string, ENT_QUOTES );
echo ">>> decoded string: $decoded_clean_string<br>";
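The effect of that last step can be checked outside a browser. A small sketch, using a hard-coded sample string in place of the real `$clean_string`; `$escaped` and `$decoded` are illustrative names:

```php
<?php
// Sample text standing in for the extracted, entity-encoded paragraph.
$clean_string = 'wine&rsquo;s worth drinking';

// Escaping the ampersand makes a browser show the entity literally
// instead of rendering it as a curly apostrophe.
$escaped = htmlspecialchars($clean_string, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n"; // wine&amp;rsquo;s worth drinking

// Going the other way turns the entity back into the real UTF-8 character.
$decoded = html_entity_decode($clean_string, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo bin2hex(mb_substr($decoded, 4, 1, 'UTF-8')), "\n"; // e28099
```

So `htmlspecialchars()` here is a display aid for debugging; it is not something to apply before storing the text.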

1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 8:45 am
    $doc = new DOMDocument();
    $file = file_get_contents("../notifs/notif$notif_no.htm");
    $file = mb_convert_encoding($file, "UTF-8");
    $doc->loadHTML( $file );
    

    Worth a shot?

    or

    $file = mb_convert_encoding($file, 'HTML-ENTITIES', 'UTF-8');
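One caveat on the second suggestion: to the best of my knowledge, the `HTML-ENTITIES` pseudo-encoding is deprecated as of PHP 8.2. A rough sketch of the same idea using `mb_encode_numericentity()` instead, assuming the input is valid UTF-8; the helper names `toNumericEntities` and `paragraphTexts` and the sample markup are made up for illustration:

```php
<?php
// Turn every non-ASCII character into a numeric entity (e.g. &#8217;),
// so the HTML parser no longer has to guess the byte encoding.
function toNumericEntities(string $utf8Html): string
{
    // Map: code points 0x80..0x10FFFF, offset 0, keep all bits.
    return mb_encode_numericentity($utf8Html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

// Parse the HTML and return the text of each <p> element as UTF-8 strings.
function paragraphTexts(string $utf8Html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML(toNumericEntities($utf8Html));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $texts[] = trim($p->textContent);
    }
    return $texts;
}

$sample = "<p>The wine\u{2019}s worth drinking</p><p>Second paragraph</p>";
print_r(paragraphTexts($sample));
```

The strings that come back from the DOM are UTF-8, so whether to re-encode them as entities before storing them is a separate decision.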
    