I am reading some lines from a file in the following format:
Identifier String Number String Number String Number String Number
Identifier String Number String Number String Number
Identifier String Number String Number
Identifier String Number String Number String Number String Number String Number
I believe some of the lines in the file I was given are very long, so the following code:
<?php
$fp = gzopen($filename, "r");
while ($source = gzgets($fp, 4096)) {
    // Strip carriage returns, then surrounding whitespace
    $trans = array("\x0D" => "");
    $source = strtr($source, $trans);
    $source = trim($source);
    // Split the line into columns on single spaces
    $source = explode(' ', $source);
    foreach ($source as $value) {
        $value = trim($value);
        // Clean and insert into appropriate column
    }
}
?>
produces parsing errors: the values land in the wrong columns. Where I expect a String I get a Number, and where I expect a Number I get an Identifier. After hours of debugging, I figured out that a buffer size of 4096 is too small for the really long lines, so gzgets() returns only part of a line and the rest arrives on the next iteration, which throws off the inner foreach loop. I tried a much larger buffer value:
while($source = gzgets($fp, 409600)) {
but then the parsing still breaks in some other odd case. How can I handle this? Any suggestions?
Tasks of this type are simple to solve with a finite-state machine (FSM). You define several states, one of which is "the current character is a line terminator (\r or \n)", and then you are free to read the input in chunks of any size.
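A minimal sketch of that idea, assuming fields never contain spaces and that records end in \r, \n or \r\n. The names parseChunks, gzChunks and the two state constants are made up for illustration. Because the state is carried across chunk boundaries, it no longer matters where a read happens to cut a line:

```php
<?php
// Hypothetical states for the sketch (not from the original post).
const STATE_BETWEEN_FIELDS = 0; // last character was a separator or line terminator
const STATE_IN_FIELD = 1;       // currently accumulating a field

// Feed chunks of any size; returns one array of fields per line.
function parseChunks(iterable $chunks): array
{
    $records = [];
    $fields = [];
    $field = '';
    $state = STATE_BETWEEN_FIELDS;

    foreach ($chunks as $chunk) {
        $length = strlen($chunk);
        for ($i = 0; $i < $length; $i++) {
            $char = $chunk[$i];
            if ($char === "\r" || $char === "\n") {
                // Line terminator: close the open field, then the record.
                if ($state === STATE_IN_FIELD) {
                    $fields[] = $field;
                    $field = '';
                }
                if ($fields !== []) {
                    $records[] = $fields;
                    $fields = [];
                }
                $state = STATE_BETWEEN_FIELDS;
            } elseif ($char === ' ') {
                // Field separator: close the open field, if any.
                if ($state === STATE_IN_FIELD) {
                    $fields[] = $field;
                    $field = '';
                }
                $state = STATE_BETWEEN_FIELDS;
            } else {
                // Ordinary character: extend the current field.
                $field .= $char;
                $state = STATE_IN_FIELD;
            }
        }
    }

    // Flush a final record that has no trailing line terminator.
    if ($state === STATE_IN_FIELD) {
        $fields[] = $field;
    }
    if ($fields !== []) {
        $records[] = $fields;
    }
    return $records;
}

// Hypothetical helper: yields raw chunks from a gzip file.
function gzChunks(string $filename, int $size = 4096): Generator
{
    $fp = gzopen($filename, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $filename");
    }
    while (!gzeof($fp)) {
        $chunk = gzread($fp, $size);
        if ($chunk === false) {
            break;
        }
        yield $chunk;
    }
    gzclose($fp);
}
```

It could then be called as parseChunks(gzChunks($filename)). For a very large file you would probably insert each record where the sketch appends to $records, rather than collecting everything in memory.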