I am trying to compare two CSV files in PHP by importing them into multidimensional arrays and using the array_diff function to find the differences.
The methodology I am using is:
1) Fetch every record of the expected CSV and dump it into arr1
2) Fetch every record of the actual CSV and dump it into arr2
3) Sort arr1 using array_multisort
4) Sort arr2 using array_multisort
5) Compare each record using the array_diff function (e.g. arr1[0][1] vs arr2[0][1])
My objective is to compare the files with a PHP script in the least possible time. The approach above is the fastest I have found. (I initially tried loading the CSV contents into MySQL and comparing them with database queries, but for some unknown reason the queries ran so slowly that the script timed out and crashed my Apache server.)
I have CSV files of up to 300 MB, though a typical file has about 70k records with 20 columns and is around 10 MB.
Here is the code for the steps described above:
$header='';
$file_handle = fopen($fileExp, "r");
$k=0;
while ($data=fgetcsv($file_handle,0,$_POST['dl1'])) {
    if(count($data)==1 && $data[0]=='')
        continue;
    else
    {
        $urarr1[$k]='';
        for($i=0;$i<count($data);$i++)
        {
            if(in_array($i,$exclude_cols,true))
                $rarr1[$k][$i]='NTBT';
            else
                $rarr1[$k][$i]=trim($data[$i]);
        }
        $k++;
    }
}
fclose($file_handle);
echo '<br>Exp Record count: '.count($rarr1);
$header.='<br>Exp Record count: '.count($rarr1);
$hrow=$rarr1[0]; //fetch header row and then unset it
unset($rarr1[0]);
array_multisort($rarr1); //need to sort on all 20 columns asc
$rarr1=array_values($rarr1); //re-number the array
//writing the sorted o/p to file...debugging purposes
$fp = fopen($_POST['op'].'/file1.csv', 'w');
foreach ($rarr1 as $fields) {
    fputcsv($fp, $fields);
}
fclose($fp);
//Repeat for actual .csv
$file_handle = fopen($fileAct, "r");
$k=0;
while ($data=fgetcsv($file_handle,0,$_POST['dl2'])) {
    if(count($data)==1 && $data[0]=='')
        continue;
    else
    {
        for($i=0;$i<count($data);$i++)
        {
            if(in_array($i,$exclude_cols,true))
                $rarr2[$k][$i]='NTBT';
            else
                $rarr2[$k][$i]=trim($data[$i]);
        }
        $k++;
    }
}
fclose($file_handle);
unset($file_handle);
echo '<br>Act Record count: '.count($rarr2);
$header.='<br>Act Record count: '.count($rarr2);
unset($rarr2[0]);
array_multisort($rarr2);
$rarr2=array_values($rarr2);
$fp = fopen($_POST['op'].'/file2.csv', 'w');
foreach ($rarr2 as $fields) {
    fputcsv($fp, $fields);
}
fclose($fp);
//Comparison logic
$header.= '<br>';
$header.= '<table>';
$header.= '<th>RECORD_ID</th>';
for($i=0;$i<count($hrow);$i++)
{
    $header.= '<th>'.$hrow[$i].'_EXP</th>';
    $header.= '<th>'.$hrow[$i].'_ACT</th>';
}
$r=array();
for($i=0;$i<count($rarr1);$i++)
{
    if(array_diff($rarr1[$i],$rarr2[$i]) || array_diff($rarr2[$i],$rarr1[$i]))
    {
        //column indexes whose values differ in either direction
        $r[$i]=array_unique(array_merge(
            array_keys(array_diff($rarr1[$i],$rarr2[$i])),
            array_keys(array_diff($rarr2[$i],$rarr1[$i]))
        ));
        foreach($r[$i] as $key=>$v)
        {
            if(in_array($v,$calc_cols))
            {
                //calculated columns: ignore differences below 0.2
                if(abs($rarr1[$i][$v]-$rarr2[$i][$v])<0.2)
                {
                    unset($r[$i][$key]);
                }
            }
            elseif(is_numeric($rarr1[$i][$v]) && is_numeric($rarr2[$i][$v]) && !in_array($v,$calc_cols) && ($rarr1[$i][$v]-$rarr2[$i][$v])==0)
            {
                //numerically equal values written differently (e.g. 0.01 vs .01)
                unset($r[$i][$key]);
            }
        }
        if(empty($r[$i]))
            unset($r[$i]);
        if(isset($r[$i]))
        {
            $header.= '<tr>';
            $header.= '<td>'.$i.'</td>';
            for($j=0;$j<count($rarr1[$i]);$j++)
            {
                if(in_array($j,$r[$i]))
                {
                    $header.= '<td style="color:orange">'.$rarr1[$i][$j].'</td>';
                    $header.= '<td style="color:orange">'.$rarr2[$i][$j].'</td>';
                }
                else
                {
                    $header.= '<td >'.$rarr1[$i][$j].'</td>';
                    $header.= '<td >'.$rarr2[$i][$j].'</td>';
                }
            }
            $header.= '</tr>';
        }
    }
}
$header.= '</table>';
//print_r($r);
echo '<br>';
// if(!isset($r))
// $r[0]=0;
echo 'Differences :'.count($r) ;
$header.= '<br>';
$header.= 'Differences :'.count($r) ;
$time_end = microtime(true);
$execution_time = ($time_end - $time_start)/60; //dividing with 60 will give the execution time in minutes other wise seconds
echo '<br><b>Total Execution Time:</b> '.$execution_time.' Mins'; //execution time of the script
Although this initially worked on most files, I later found that for some files array_multisort sorts arr1 and arr2 differently even though the contents appear to be the same. I am not sure whether this is caused by a data type mismatch; I tried type casting too, and it still sorts identical arrays differently.
Can someone suggest what might be wrong in the code above? Also, given the requirements mentioned above, is there a more convenient way to achieve this in PHP, perhaps a PHP library that compares CSV files?
EDIT: Sample data as requested. This is just a snapshot; the actual files have many more columns and rows. As stated above, the CSV file sizes go well beyond 10 MB. File 1 and File 2:
236|INPQR|31-AUG-12|200 |INR| 664|AAAAAA,PPPPP |0 |38972944.8 |0 |0 |38972944.8
236|INPQR|31-AUG-12|200 |INR| 6653|AAAAAA,PPPPP |0 |0 |0 |0 |0
236|INPQR|31-AUG-12|200 |USD| 6655|AAAAAA,PPPPP |0 |0 |0 |0 |0
236|INPQR|31-AUG-12|200 |USD| 664|AAAAAA,PPPPP |0 |63919609.97 |0 |0 |63919609.97
225|INPZQ|31-AUG-12|200 |USD| 6653|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |USD| 6655|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |USD| 6652|AAAAAA,PPPPP |0 |38972944.8 |0 |0 |38972944.8
225|INPZQ|31-AUG-12|200 |INR| 6652|AAAAAA,PPPPP |0 |63919609.97 |0 |0 |63919609.97
225|INPZQ|31-AUG-12|200 |INR| 6654|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |INR| 6654|AAAAAA,PPPPP
236|INPQR|31-AUG-12|200 |USD| 664|AAAAAA,PPPPP |0 |63919609.97 |0 |0 |63919609.97
225|INPZQ|31-AUG-12|200 |USD| 6653|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |USD| 6655|AAAAAA,PPPPP
236|INPQR|31-AUG-12|200 |INR| 664|AAAAAA,PPPPP |0 |38972944.8 |0 |0 |38972944.8
236|INPQT|31-AUG-12|200 |INR| 6653|AAAAAA,PPPPP |0 |0 |0 |0 |0
236|INPQR|31-AUG-12|200 |USD| 6655|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |USD| 6652|AAAAAA,PPPPP |0 |38972944.8 |0 |0 |38972944.8
225|INPZQ|31-AUG-12|200 |INR| 6652|AAAAAA,PPPPP |0 |63919609.97 |0 |0 |63919609.97
225|INPZQ|31-AUG-12|200 |USD| 6654|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |INR| 6654|AAAAAA,PPPPP
UPDATE: The two CSV files may use different date formats, and each may represent numbers differently. For example, 1.csv could have 12-jan-2013 and 0.01 in its first row, while 2.csv would have 01/12/2013 and .01.
Hence I don't think hashing would work.
There are many different ways to compare two CSV files. My approach checks for rows that differ between the two files, and it takes into account that you want to exclude certain columns from the comparison.
I did not use sorting, because I check whether a row exists in the other file, not whether it is at the same position. The reason is simple: if one row doesn't match and sorts to the beginning of the file, every row after it will be reported as different.
Example:
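The original example is not reproduced in this text, so what follows is only a small illustration of that effect with made-up rows (the `$expected` and `$actual` arrays are hypothetical). One extra row near the top shifts every later row by one position, so a position-by-position comparison flags all of them, while a membership check flags only the extra row.

```php
<?php
// Hypothetical data: $actual has one extra row ("a0") that sorts to the top.
$expected = ['a1', 'b2', 'c3', 'd4'];
$actual   = ['a0', 'a1', 'b2', 'c3', 'd4'];

sort($expected);
sort($actual);

// Position-based comparison: row i of one file against row i of the other.
$positionMismatches = 0;
foreach ($expected as $i => $row) {
    if (!isset($actual[$i]) || $actual[$i] !== $row) {
        $positionMismatches++;
    }
}

// Membership-based comparison: is each row present anywhere in the other file?
$lookup = array_flip($expected);
$missingFromExpected = [];
foreach ($actual as $row) {
    if (!isset($lookup[$row])) {
        $missingFromExpected[] = $row;
    }
}

echo "Position mismatches: $positionMismatches\n";
echo 'Rows only in actual: ' . implode(', ', $missingFromExpected) . "\n";
```

Here every row of `$expected` is counted as a positional mismatch, although only one row actually differs between the two sets.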
In the code below you will see comments explaining why I did something. I also tested the code on several large CSV files (~45 MB and 100,000 rows) and got the number of different rows in less than 10 seconds per check.
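The answer's own code is not included in this text. As a stand-in, here is a minimal sketch of the idea described above, not the original implementation: each row is reduced to a key built from its trimmed cells (leaving out excluded columns), and the keys of one file are looked up in a hash map built from the other. The function names `loadRowKeys` and `rowsMissingFrom`, the `|` delimiter, and the sample values are all assumptions made for illustration.

```php
<?php
/**
 * Read a CSV file and return a map of row key => record number.
 * Hypothetical helper: the key is the trimmed cells joined with "\x1F",
 * with the column indexes listed in $excludeCols left out.
 */
function loadRowKeys(string $path, string $delimiter, array $excludeCols = [], bool $skipHeader = true): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $skip = array_flip($excludeCols);
    $keys = [];
    $recordNo = 0;
    // Enclosure and escape are passed explicitly to avoid relying on defaults.
    while (($data = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== false) {
        if ($data === [null]) {
            continue; // fgetcsv() returns [null] for a blank line
        }
        $recordNo++;
        if ($skipHeader && $recordNo === 1) {
            continue; // first non-blank record is treated as the header
        }
        $cells = [];
        foreach ($data as $i => $cell) {
            if (!isset($skip[$i])) {
                $cells[] = trim($cell);
            }
        }
        $keys[implode("\x1F", $cells)] = $recordNo;
    }
    fclose($handle);
    return $keys;
}

/** Return the entries of $a (key => record number) whose key does not occur in $b. */
function rowsMissingFrom(array $a, array $b): array
{
    return array_diff_key($a, $b);
}

// Usage with two small temporary files (made-up data, "|" as the delimiter).
$exp = tempnam(sys_get_temp_dir(), 'exp');
$act = tempnam(sys_get_temp_dir(), 'act');
file_put_contents($exp, "ID|CODE|AMOUNT\n236|INPQR|0 \n225|INPZQ|38972944.8\n");
file_put_contents($act, "ID|CODE|AMOUNT\n225|INPZQ|38972944.8\n236|INPQT|0\n");

$expKeys = loadRowKeys($exp, '|');
$actKeys = loadRowKeys($act, '|');

$onlyInExpected = rowsMissingFrom($expKeys, $actKeys);
$onlyInActual   = rowsMissingFrom($actKeys, $expKeys);

echo 'Only in expected: ' . count($onlyInExpected) . "\n";
echo 'Only in actual: ' . count($onlyInActual) . "\n";

unlink($exp);
unlink($act);
```

Because the lookup is by array key, no sorting is needed and each file is read once. This sketch collapses duplicate rows into a single key and compares cells as exact strings after trimming; differing date or number formats would have to be normalized where the key is built, which is not shown here.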
results: