I’m trying to move some fish species profiles from a bespoke CMS that uses the latin1 charset to a customised WordPress database (custom post type, with numerous meta fields) that uses UTF-8.
On top of that, the old CMS uses some odd bbCode bits.
Basically, I’m looking for a function which will do this:
- Take information from my old database with latin1_swedish_ci collation (and latin1 charset)
- Convert all of the non-standard characters (we have characters from languages including, but not limited to, Croatian, Czech, Spanish, French and German) to HTML entities such as &aacute; (numeric ones like &#134; are fine too).
- Convert all of the bbCode (see below) to HTML
- Convert ' and " to HTML entities
- Return the information with utf-8 charset to my new database
The bbCode tags and their HTML replacements are:
$search = array( '[i]', '[/i]', '[b]', '[/b]', '[pl]', '[/pl]' );
$replace = array( '<i>', '</i>', '<strong>', '</strong>', '', '' );
The function that I’ve tried so far is:
$search = array( '[i]', '[/i]', '[b]', '[/b]', '[pl]', '[/pl]' );
$replace = array( '<i>', '</i>', '<strong>', '</strong>', '', '' );
function _convert($content) {
    if(!mb_check_encoding($content, 'UTF-8')
        OR !($content === mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8' ), 'UTF-8', 'UTF-32'))) {
        $content = mb_convert_encoding($content, 'UTF-8');
        if (mb_check_encoding($content, 'UTF-8')) {
            return $content;
        } else {
            echo "<p>Couldn't convert to UTF-8.</p>";
        }
    }
}
function _clean($content) {
    $content = _convert( $content );
    /* edited out because otherwise all HTML appears as &lt;html&gt; rather than <html> */
    //$content = htmlentities( $content, ENT_QUOTES, "UTF-8" );
    $content = str_replace( $search, $replace, $content );
    return $content;
}
However, this stops some fields from being imported into the new database and doesn’t replace the bbCode.
If I use the following code, it mostly works:
$var = str_replace( $search, $replace, htmlentities( $row["var"], ENT_QUOTES, "UTF-8" ) );
However, certain fields containing what I think are Czech/Croatian characters don’t appear at all.
Does anyone have any suggestions for how I can, in the order listed above, successfully convert the information from the “old format” to the new?
I would say that if you want to convert all your non-ASCII characters to entities, you won’t need to do any latin1 to UTF-8 conversion whatsoever. Let’s say you run htmlentities on your data, passing the source charset (ISO-8859-1) rather than UTF-8; then the non-ASCII characters will be replaced with their corresponding entity codes. (htmlspecialchars on its own only escapes &, <, > and quotes.)
Basically, after this step, there shouldn’t be any characters left that need conversion to UTF-8. Also, if you wanted to convert your latin1-encoded string into UTF-8, I strongly suspect utf8_encode will do just fine.
PS. When it comes to converting bbCode into HTML, I would recommend using regular expressions instead. For example, you could do it all in a line like this:
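A minimal sketch of what such a single-call regex replacement might look like, combined with the htmlentities step described above. The function names bbcode_to_html and legacy_to_html are hypothetical, the tag list is taken from the question, and the input is assumed to be raw latin1 bytes:

```php
<?php
// Hypothetical helper: converts the bbCode tags listed in the question.
function bbcode_to_html(string $text): string
{
    // One preg_replace call with parallel pattern/replacement arrays.
    // (/?) captures the optional slash so opening and closing tags share a rule.
    return preg_replace(
        ['~\[(/?)i\]~', '~\[(/?)b\]~', '~\[/?pl\]~'],
        ['<${1}i>',     '<${1}strong>', ''],
        $text
    );
}

// Hypothetical helper: escapes first, then converts bbCode, so the
// generated <i>/<strong> tags are not themselves turned into entities.
function legacy_to_html(string $latin1): string
{
    // Tell htmlentities the input is latin1, not UTF-8.
    $escaped = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');
    return bbcode_to_html($escaped);
}

// "\xE9" is the latin1 byte for e-acute.
echo legacy_to_html("[b]Danio[/b] [i]r\xE9rio[/i] [pl]'x'[/pl]"), "\n";
```

After the entity step the result is plain ASCII, so it can be stored in a UTF-8 column unchanged. Note that any HTML already present in the old content would be escaped as well. If you go the conversion route instead, utf8_encode is deprecated as of PHP 8.2; mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1') performs the same latin1 to UTF-8 conversion.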