When I copy and paste from Word, I end up with a lot of unsafe characters. Instead of finding and replacing each character individually, I thought it would be useful to write a quick PHP script to do this.
When I hit submit with the sample HTML below, each of the characters I would like to replace has been replaced with a �. What am I doing wrong?
Am I right in thinking that if I use htmlentities() or htmlspecialchars(), this will also replace the HTML markup?
Sample HTML block
<p>Nam ’velit metus, vulputate – eget sodales ut, dignissim “vehicula nisi”. Lor’em ipsum dolor sit amet, consectetur adipiscing elit. Nunc pharetra luctus mi, sollicitudin ultrices lacus iaculis sed. Nam aliquam, tortor id sodales scelerisque, est mauri’s adipiscing nunc, a tincidunt tortor elit eget quam. Fusce sagittis arcu ut urna egestas luctus. Aliquam erat volutpat. Suspendisse ut turpis mi. Nulla facilisi. Ut congue porta urna nec semper. Aenean feugiat ante vitae – dui accumsan placerat. Suspendisse aliquet, libero non tempor– dignissim, arcu nibh luctus magna, eu pellentesq’ue libero eros nec magna. Phasellus non ullamcorper nisi. Aenean sagittis elit ac lorem imperdiet ac consequat sem commodo. Aenean in elit at lectus blandit varius nec in erat. Mauris elementum, turpis eu eleifend pora, quam purus tempor justo, et feugiat tellus mi sed erat.</p>
<ul>
<li><strong>’Pellentesque’</strong> nec leo cursus ipsum rhoncus volutpat nec eget mi.</li>
<li><strong>N–am</strong> quis lectus enim, ac euismod urna.</li>
<li><strong>Donec</strong> varius massa augue, at feugiat tortor.</li>
<li><strong>“Duis”</strong> non massa eget elit euismod pulvinar.</li>
<li><strong>Duis</strong> bibendum sodales lorem, vel commodo metus volutpat a.</li>
<li><strong>Nu–nc</strong> pulvinar lacus in nisl dignissim euismod.</li>
<li><strong>“Nulla”</strong> tincidunt nulla adipiscing ante aliquet mattis</li>
</ul>
<?php
/**
*
* @param string $unformatted
* @return string
*/
function format($unformatted) {
$html = strtolower(trim($unformatted));
//replace accented characters, foreign languages
$search = array('à','á','â','ã','ä','ç','è','é','ê','ë','ì','í','î','ï','ñ','ò','ó','ô','õ','ö','ù','ú','û','ü','ý','ÿ','À','Á','Â','Ã','Ä','Ç','È','É','Ê','Ë','Ì','Í','Î','Ï','Ñ','Ò','Ó','Ô','Õ','Ö','Ù','Ú','Û','Ü','Ý');
$replace = array('a','a','a','a','a','c','e','e','e','e','i','i','i','i','n','o','o','o','o','o','u','u','u','u','y','y','A','A','A','A','A','C','E','E','E','E','I','I','I','I','N','O','O','O','O','O','U','U','U','U','Y');
$html = str_replace($search, $replace, $html);
//replace common characters
$search = array('/(\s\&\s)/i', '/(\s\£\s)/i', '/(\s\$\s)/i');
$replace = array('&', '£', '$');
$html= preg_replace($search, $replace, $html);
//replace MS office crap
$search = array("‘", "’", "”", "“", "–", "…");
$replace = array("'", "'", '"', '"', "-", "...");
$html= str_replace($search, $replace, $html);
return $html;
}
if(isset($_POST['clean'])){
$html = format($_POST['html']);
}
?>
<!doctype html>
<html>
<head>
<meta charset="utf-8" />
<title>HTML Tidy</title>
<style type="text/css">
body {
color: #262626;
background: #f4f4f4;
font: normal 12px/18px Verdana, sans-serif;
height: 100%;
}
#container {
width: 760px;
margin: 40px auto 0 auto;
padding: 10px 60px;
border: solid 1px #cbcbcb;
background: #fafafa;
-moz-box-shadow: 0px 0px 10px #cbcbcb;
-webkit-box-shadow: 0px 0px 10px #cbcbcb;
}
</style>
</head>
<body>
<div id="container" class="content">
<h1>HTML Tidy</h1>
<form action="" method="post">
<textarea name="html" id="html" rows="20" cols="90"><?php if(isset($html)){ echo $html; } ?></textarea>
<input type="submit" name="clean" value="Clean" />
</form>
</div>
</body>
</html>
Properties of file: [screenshot not included]

Page headers: [screenshot not included]

htmlspecialchars() does exactly what needs to be done about unsafe characters, which are < > & ' " and nothing else. Your problem seems to be that your PHP file is not saved in the encoding you’re using for your web page. In 2012 we can safely say you really should always use UTF-8 and nothing else. (Unless you are using UTF-16, of course.)
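As a minimal sketch of what that escaping looks like, assuming the input is already UTF-8: the sample string below is hypothetical, and the curly quotes are written as byte escapes so the example does not depend on how this file is saved. Note that the single quote is only escaped when ENT_QUOTES is passed (or on PHP 8.1+, where it is the default), and that the markup itself is escaped too, which answers the last part of the question.

```php
<?php
// Hypothetical input: "\xE2\x80\x9C" and "\xE2\x80\x9D" are the UTF-8 bytes
// of the curly double quotes U+201C and U+201D.
$pasted = "<strong>\xE2\x80\x9CDuis\xE2\x80\x9D</strong> & 'more'";

// ENT_QUOTES escapes both quote styles; 'UTF-8' tells PHP how to read the bytes.
$safe = htmlspecialchars($pasted, ENT_QUOTES, 'UTF-8');

// Only < > & ' " are rewritten; the curly quotes pass through untouched.
echo $safe, PHP_EOL;
```

The curly quotes are not "unsafe" in this sense, so they are left alone; they only need a page that is really served and saved as UTF-8.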
What happens then is a mess, involving PHP treating one multibyte character as multiple characters, replacing just a part of it and rendering it invalid. But even that isn’t unsafe. It’s just ugly and unreasoned.
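One plausible way this plays out can be sketched as follows. It assumes the script was saved in a single-byte encoding such as Windows-1252 or ISO-8859-1 while the form submits UTF-8; the post does not confirm the exact encoding, so treat the byte values as an illustration. It also assumes the mbstring extension is available for the validity check.

```php
<?php
// 'â' as it is stored in a Windows-1252 / ISO-8859-1 source file: one byte.
$searchFromFile = "\xE2";

// The right single quote U+2019 as submitted by a UTF-8 page: three bytes.
$utf8Quote = "\xE2\x80\x99";
$input = "Lor" . $utf8Quote . "em";

// str_replace() compares bytes, so it matches only the first byte of the quote.
$broken = str_replace($searchFromFile, 'a', $input);

// The two leftover bytes are no longer a valid UTF-8 sequence,
// which a UTF-8 page renders as U+FFFD replacement characters.
var_dump(mb_check_encoding($input, 'UTF-8'));
var_dump(mb_check_encoding($broken, 'UTF-8'));
```

The first check reports valid UTF-8 and the second does not, because `$broken` now holds the bytes `Lora\x80\x99em`.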
The answer by @webarto does indeed solve the problem you are trying to solve, but it’s the wrong problem in the first place.
In the screenshot you posted, you should choose Other and select UTF-8, then find where the default encoding is set and set it to UTF-8, and use only UTF-8 from now on. Please.
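If some text still arrives in a legacy encoding after the switch, a small conversion step is one option. The sketch below is not part of the original answer: `to_utf8()` is a hypothetical helper, it assumes the mbstring extension is loaded, and it assumes that anything that is not valid UTF-8 is Windows-1252, which may not hold for your data.

```php
<?php
/**
 * Hypothetical helper: return $text as UTF-8.
 * Assumption: input that is not valid UTF-8 is Windows-1252.
 */
function to_utf8($text) {
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// "\x92" is the right single quote in Windows-1252.
$legacy = "Lor\x92em";
$clean = to_utf8($legacy);

// Show the resulting bytes so the conversion is visible regardless of terminal encoding.
echo bin2hex($clean), PHP_EOL;
```

Once everything is UTF-8, the curly quotes and dashes can be stored and displayed as they are, and only htmlspecialchars() is needed at output time.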