A PHP function I am writing pulls a small bit of HTML data from another webpage using file_get_contents(), then parses out a piece of text and tries to store it in a database. The problem is, the data it gets must be encoded with a different charset or something (I’m not positive how to check this) because it often adds ï»¿ (at seemingly random places in the string, not always at the beginning or end) and every once in a while adds a new line where I don’t want one. The ï»¿ is annoying, but when the newline is added it causes the JavaScript function to fail. The JavaScript function is printed from a PHP script as follows:
print <<<END
setUpSend("${a}", "${b}", "${c}", "${d}");
END;
And when the newline is entered, the function no longer works (I suppose because of the newline), and viewing the source shows something like this:
print <<<END
setUpSend("a information", "b information
", "c information", "d information");
END;
I did some research and found that this ï»¿ is the UTF-8 BOM (Byte Order Mark), and it is suggested to parse the information as XML, not as a string. I found that there are some PHP libraries to do this (http://php.net/manual/en/book.xml.php) but was thinking there might be an easier way, like a simple PHP function that will convert it automatically or strip unwanted characters.
Also, sometimes the information can contain quotes, but since that would mess up the js function as well, I tried to use PHP’s addslashes function and it just doesn’t add any slashes, not working at all. If I manually write the same exact string in php however, and use addslashes on that, it adds the slashes normally, so it makes me think that somehow php can’t understand the encoding of this text I am getting. Something weird is going on but I’m lost on how to fix it.
I’d be more than open to any suggestions as I’ve looked up a lot of stuff but can’t figure out a good way to solve this.
The ï»¿ might be a UTF-8 encoded BOM. You can normally safely remove it if you know the source encoding is UTF-8. That’s a simple string operation:
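A minimal sketch of that removal, assuming the data really is UTF-8 so the BOM is the byte sequence EF BB BF. The function names strip_utf8_bom and remove_all_utf8_boms are made up for illustration. The second variant is only there because the question mentions the characters showing up in the middle of the string as well:

```php
<?php
// Hypothetical helper: removes a single UTF-8 BOM from the start of the string.
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF"; // the three bytes that display as ï»¿ in ISO-8859-1
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

// Hypothetical helper: removes every occurrence of the BOM bytes,
// for data that was stitched together from several BOM-prefixed pieces.
function remove_all_utf8_boms(string $text): string
{
    return str_replace("\xEF\xBB\xBF", '', $text);
}
```

Both work on raw bytes, so they do not depend on the mbstring extension.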
However, it looks like you should make your code aware of the input encoding. HTML data can arrive in different encodings, so it’s probably worth normalizing the encoding upfront (e.g. converting every non-UTF-8 charset to UTF-8) and then making your own functions deal properly with UTF-8 encoded data.
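One way such a normalization step could look, as a sketch only: it assumes the mbstring extension is available and that you already know the source charset (for example from the response headers). The name to_utf8 is hypothetical:

```php
<?php
// Hypothetical helper: converts $data from $sourceCharset to UTF-8.
// Assumes the mbstring extension is loaded and $sourceCharset is a
// charset name that mbstring recognizes.
function to_utf8(string $data, string $sourceCharset): string
{
    // Nothing to do if the source already claims to be UTF-8.
    if (strcasecmp($sourceCharset, 'UTF-8') === 0) {
        return $data;
    }
    return mb_convert_encoding($data, 'UTF-8', $sourceCharset);
}
```

After this step the rest of the code only ever has to handle one encoding.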
You can obtain the response headers after you have retrieved the data with file_get_contents. They are stored in $http_response_header. The following example demonstrates this (see HEAD first with PHP Streams for the parse_http_response_header function):

You only need to check whether that header line exists and whether a charset has been specified. See the Content-Type header specification in RFC 2616 for how it is written:

If there is no media-type given (type and sub-type), you can (but don’t have to) try to guess it. As you’re dealing with HTML, this is normally text/html.

If no charset parameter is given, take the default charset for that type (text). In HTTP this is ISO-8859-1 (ref).

To properly parse the parameter(s), please see Section 3.6:
To properly parse the parameter string I leave as an exercise.
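For a rough idea of what that could look like, here is a simplified sketch rather than a full RFC 2616 parser. It assumes the header lines are passed in as an array of strings in the shape of $http_response_header. The name charset_from_headers is hypothetical, and it does not handle quoted parameter values that themselves contain a semicolon:

```php
<?php
// Hypothetical helper: returns the charset parameter of the last
// Content-Type header, or $default if none was sent.
function charset_from_headers(array $headers, string $default = 'ISO-8859-1'): string
{
    $charset = $default;
    foreach ($headers as $line) {
        // A status line starts a new response (e.g. after a redirect),
        // so forget whatever an earlier response declared.
        if (stripos($line, 'HTTP/') === 0) {
            $charset = $default;
            continue;
        }
        // Header names are case-insensitive.
        if (stripos($line, 'Content-Type:') !== 0) {
            continue;
        }
        $value = substr($line, strlen('Content-Type:'));
        $parts = explode(';', $value);
        array_shift($parts); // drop the media type, keep the parameters
        foreach ($parts as $param) {
            $pair = explode('=', $param, 2);
            if (count($pair) === 2 && strcasecmp(trim($pair[0]), 'charset') === 0) {
                // Strip surrounding whitespace, then optional quotes.
                $charset = trim(trim($pair[1]), "\"'");
            }
        }
    }
    return $charset;
}
```

After a call to file_get_contents($url), this could be used as charset_from_headers($http_response_header), and the result fed into whatever conversion to UTF-8 you choose.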