I have a problem with accented letters.
For example:
I have a tag that contains: “il mio prodotto é molto bello”. However, the output is: “il mio prodotto ”
Whenever the XML contains an accented letter, the data is cut off at that point. My XML starts with:
<?xml version="1.0" encoding="utf-8"?>
Here is my parser code:
<?php
class Content_Handler {
    function Content_Handler(){}
    function start_element($parser, $name, $attrs) {
        global $desc, $names, $link;
        if ($name == "PRODUCT"){
            $zupid = ($attrs["ZUPID"]);
            echo "$zupid<br>";
        }
        if ($name == "DESCRIPTION") { $desc = true;}
        if ($name == "NAME") { $names = true;}
        if ($name == "DEEPLINK") { $link = true;}
    }
    function end_element($parser, $name) {
        if ($name == "PRODUCT") {
            print "<br />";
        }
    }
    function characters($parser, $chars) {
        global $desc, $names, $link;
        if ($desc) { echo $chars."<br>"; $desc = false;}
        if ($names) { echo $chars."<br>"; $names = false;}
        if ($link) { echo $chars."<br>"; $link = false;}
    }
}
$handler = new Content_Handler();
$cat_parser = xml_parser_create("UTF-8");
xml_parser_set_option($cat_parser, XML_OPTION_TARGET_ENCODING, "ISO-8859-1");
xml_set_object($cat_parser, $handler);
xml_set_element_handler($cat_parser, "start_element", "end_element");
xml_set_character_data_handler($cat_parser, "characters");
$file = "my.xml";
if ($file_stream = fopen($file, "r")) {
    while ($data = fread($file_stream, 4096)) {
        $this_chunk_parsed = xml_parse($cat_parser, $data, feof($file_stream));
        if (!$this_chunk_parsed) {
            $error_code = xml_get_error_code($cat_parser);
            $error_text = xml_error_string($error_code);
            $error_line = xml_get_current_line_number($cat_parser);
            $output_text = "Parsing problem at line $error_line: $error_text";
            die($output_text);
        }
    }
} else {
    die("Can't open XML file.");
}
xml_parser_free($cat_parser);
?>
This is a common mistake with SAX parsing in seemingly any language (see previous answers on Java and C!).
When you are parsing SAX events, the characters handler is not guaranteed to receive the entire contents of the element between the start and end tag in one call. It can be called many times for a single element, and when you are dealing with accented characters it usually is.
The full character content can only be determined by concatenating the values received between a start tag and its end tag.
So for your text “il mio prodotto é molto bello”, characters will probably be called three times, with ‘il mio prodotto ’, ‘é’ and ‘ molto bello’. Your handler prints the first piece and then clears its flag, which is why the output stops just before the accented letter. You need to concatenate the pieces, not treat each one as the complete value.
Your ‘characters’ function should only append $chars to a buffer, with that buffer being used (printed) in end_element and reset in start_element.
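A minimal sketch of that buffering approach, reusing the handler names from the question. The `$buffer` and `$wanted` properties and the inline sample document are my own illustrative assumptions, not part of the original code. It also assumes the script is saved as UTF-8 and leaves out the ISO-8859-1 target encoding option, so the output stays UTF-8:

```php
<?php
// Sketch: collect character data across calls and print it once per element.
class Content_Handler {
    private $buffer = "";   // text collected for the current element (hypothetical name)
    private $wanted = array("DESCRIPTION", "NAME", "DEEPLINK");

    function start_element($parser, $name, $attrs) {
        if ($name == "PRODUCT" && isset($attrs["ZUPID"])) {
            echo $attrs["ZUPID"] . "<br>";
        }
        // A new element begins: discard whatever was collected before it.
        $this->buffer = "";
    }

    function characters($parser, $chars) {
        // May be called several times for one element, so append and never echo here.
        $this->buffer .= $chars;
    }

    function end_element($parser, $name) {
        // Only at the end tag is the element's text known to be complete.
        if (in_array($name, $this->wanted)) {
            echo trim($this->buffer) . "<br>";
        }
        if ($name == "PRODUCT") {
            echo "<br />";
        }
        $this->buffer = "";
    }
}

// Hypothetical sample document standing in for my.xml.
$xml = '<?xml version="1.0" encoding="utf-8"?>'
     . '<PRODUCTS><PRODUCT ZUPID="42">'
     . '<NAME>il mio prodotto é molto bello</NAME>'
     . '</PRODUCT></PRODUCTS>';

$handler = new Content_Handler();
$cat_parser = xml_parser_create("UTF-8");
// Callable arrays bind the handlers to the object directly.
xml_set_element_handler(
    $cat_parser,
    array($handler, "start_element"),
    array($handler, "end_element")
);
xml_set_character_data_handler($cat_parser, array($handler, "characters"));

if (!xml_parse($cat_parser, $xml, true)) {
    $error_text = xml_error_string(xml_get_error_code($cat_parser));
    $error_line = xml_get_current_line_number($cat_parser);
    die("Parsing problem at line $error_line: $error_text");
}
```

Because the text is only printed in end_element, the whole sentence comes out in one piece no matter how many times characters is called. The same handlers should work unchanged with the chunked fread loop from the question. Note that resetting the buffer on every start tag suits flat elements like these; elements that mix text and child tags would need a stack of buffers instead.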