A PHP function I am writing pulls a small bit of HTML data from

Question

Asked: May 28, 2026 (2026-05-28T00:03:31+00:00)

A PHP function I am writing pulls a small bit of HTML data from another webpage using file_get_contents(), then parses out a piece of text and tries to store it in a database. The problem is that the data must be encoded with a different charset or something (I’m not positive how to check this), because it often adds ï»¿ at seemingly random places in the string (not always at the beginning or end), and every once in a while it adds a newline where I don’t want one. The ï»¿ is annoying, but when the newline is added it causes the JavaScript function to fail. The JavaScript function is printed from a PHP script as follows:

print <<<END
    setUpSend("${a}", "${b}", "${c}", "${d}");
END;

And when the newline is entered, the function no longer works (I suppose because of the newline), and viewing the source shows something like this:

print <<<END
        setUpSend("a information", "b information
", "c information", "d information");
END;

I did some research and found that this ï»¿ is the UTF-8 BOM (Byte Order Mark), and it is suggested to parse the information as XML, not as a string. I found that there are some PHP libraries to do this (http://php.net/manual/en/book.xml.php), but was thinking there might be an easier way, like a simple PHP function that will convert it automatically or strip unwanted characters.

Also, sometimes the information can contain quotes, but since that would mess up the js function as well, I tried to use PHP’s addslashes function and it just doesn’t add any slashes, not working at all. If I manually write the same exact string in php however, and use addslashes on that, it adds the slashes normally, so it makes me think that somehow php can’t understand the encoding of this text I am getting. Something weird is going on but I’m lost on how to fix it.

I’d be more than open to any suggestions as I’ve looked up a lot of stuff but can’t figure out a good way to solve this.

1 Answer

Editorial Team · Answer 1 · 2026-05-28T00:03:32+00:00

The ï»¿ is probably a UTF-8 encoded BOM. You can normally remove it safely if you know the source encoding is UTF-8.

That’s a simple string operation:

$withOutUTF8BOM = remove_UTF8BOM($withOrWithOutUTF8BOM);


/**
 * Remove the UTF-8 BOM from the beginning of a string (if it exists).
 *
 * @param string $str
 * @return string
 */
function remove_UTF8BOM($str)
{
    $UTF8BOM = "\xEF\xBB\xBF";
    // Only strip when the BOM bytes sit at offset 0.
    if (0 === strpos($str, $UTF8BOM)) {
        $str = (string) substr($str, strlen($UTF8BOM));
    }
    return $str;
}
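Since the question says the ï»¿ also shows up in the middle of the string, stripping only a leading BOM may not be enough. A minimal sketch, assuming the input really is UTF-8 (the function name strip_all_utf8_boms is made up for this example), removes the three BOM bytes wherever they occur:

```php
<?php
/**
 * Remove every UTF-8 BOM byte sequence from a string, not only a leading one.
 * Assumption: $str is UTF-8 encoded. The function name is hypothetical.
 */
function strip_all_utf8_boms(string $str): string
{
    // In valid UTF-8 the bytes EF BB BF can only be the encoding of U+FEFF,
    // so a plain byte-wise replace does not cut into other characters.
    return str_replace("\xEF\xBB\xBF", '', $str);
}
```

Run this before storing the text or printing it, so that later string handling never sees the stray bytes.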

However, it looks like you should make your code aware of the input encoding. HTML data can arrive in different encodings, so it’s probably worth normalizing the encoding upfront (e.g. converting all non-UTF-8 charsets to UTF-8) and then making your own functions deal properly with UTF-8 encoded data.
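As a rough illustration of that normalization step, the sketch below converts a fetched body to UTF-8 once the declared charset is known. The helper name to_utf8 is an assumption of this example, and it relies on the iconv extension being available:

```php
<?php
/**
 * Convert $body to UTF-8 from the charset the server declared.
 * Hypothetical helper; assumes the iconv extension is loaded.
 */
function to_utf8(string $body, string $declaredCharset): string
{
    if (strcasecmp($declaredCharset, 'UTF-8') === 0) {
        return $body; // already the target encoding (not validated here)
    }
    // iconv() returns false (and raises a warning) for unknown charsets
    // or for bytes that are illegal in the source encoding.
    $converted = @iconv($declaredCharset, 'UTF-8', $body);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $declaredCharset to UTF-8");
    }
    return $converted;
}
```

For example, to_utf8("caf\xE9", 'ISO-8859-1') turns the single Latin-1 byte for é into the two-byte UTF-8 sequence.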

Regarding this part of the question: "The problem is, the data it gets must be encoded with a different charset or something (I’m not positive how to check this)"

You can obtain the response headers after you have retrieved the data with file_get_contents(). They are stored in $http_response_header. The following example demonstrates this (see "HEAD first with PHP Streams" for the parse_http_response_header function, which is a custom helper, not a PHP built-in):

$url = 'http://example.com/';

$body = file_get_contents($url);

$responses = parse_http_response_header($http_response_header);

$contentType = $responses[0]['fields']['CONTENT-TYPE']; // CONTENT-TYPE

echo "Content-Type: $contentType\n";  # Content-Type: text/html; charset=UTF-8
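If you would rather not pull in that helper, the raw header lines can also be scanned directly. This is only a sketch under the assumption that $http_response_header holds one header line per array element, with a status line starting each response when redirects were followed; the function name last_content_type is made up for this example:

```php
<?php
/**
 * Return the Content-Type value of the final response, or null if absent.
 * Hypothetical helper operating on an array shaped like $http_response_header.
 */
function last_content_type(array $headerLines): ?string
{
    $contentType = null;
    foreach ($headerLines as $line) {
        if (strncmp($line, 'HTTP/', 5) === 0) {
            // A new status line starts a new response (e.g. after a redirect).
            $contentType = null;
        } elseif (stripos($line, 'Content-Type:') === 0) {
            // Header names are case-insensitive, hence stripos().
            $contentType = trim(substr($line, strlen('Content-Type:')));
        }
    }
    return $contentType;
}
```

After a file_get_contents($url) call you would pass $http_response_header to it and check the result for null before looking for a charset.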

You only need to check whether that header line exists and whether a charset has been specified. See the Content-Type header specification (RFC 2616) for how it is written. Separating the media type from its parameters is a single split:

list($typeAndSubType, $parameter) = explode(';', $contentType, 2) + array(NULL, NULL);

If there is no media-type given (type and sub-type), you can (but need not) try to guess it. As you’re dealing with HTML, this is normally text/html. The grammar from the specification:

   Content-Type   = "Content-Type" ":" media-type

   media-type     = type "/" subtype *( ";" parameter )
   type           = token
   subtype        = token

If no charset parameter is given, take the default charset for that type (text). In HTTP this is ISO-8859-1 (RFC 2616, Section 3.7.1).

To properly parse the parameter(s), see Section 3.6 of RFC 2616:

   parameter               = attribute "=" value
   attribute               = token
   value                   = token | quoted-string

Properly parsing the parameter string is left as an exercise.
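One possible starting point for that exercise is sketched below. It is deliberately naive: it splits on ';' and '=', so it does not handle quoted-string values that themselves contain ';' or escaped quotes. The function name charset_from_content_type and the default argument are assumptions of this example:

```php
<?php
/**
 * Extract the charset parameter from a Content-Type value.
 * Falls back to $default (ISO-8859-1, the HTTP default for text types).
 * Hypothetical helper; simplified parsing, see the note above.
 */
function charset_from_content_type(?string $contentType, string $default = 'ISO-8859-1'): string
{
    if ($contentType === null) {
        return $default;
    }
    $parts = explode(';', $contentType);
    array_shift($parts); // drop the media-type, keep only the parameters
    foreach ($parts as $part) {
        $pair = explode('=', $part, 2);
        if (count($pair) !== 2) {
            continue; // not of the form attribute=value
        }
        $attribute = strtolower(trim($pair[0]));
        $value = trim(trim($pair[1]), '"'); // strip optional surrounding quotes
        if ($attribute === 'charset' && $value !== '') {
            return $value;
        }
    }
    return $default;
}
```

The returned name can then be compared against UTF-8 to decide whether the body needs converting before you strip the BOM and store the text.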
