I’ve got two files,file a around 5mb, and file b around 66 mb. I

Asked: 2026-05-16T17:11:33+00:00

I’ve got two files: file a, around 5 MB, and file b, around 66 MB. I need to find out whether any of the lines in file a occur inside file b and, if so, write them to file c.

This is the way I’m currently handling it:

ini_set("memory_limit","1000M");
set_time_limit(0);
$small_list=file("a.csv");
$big_list=file_get_contents("b.csv");
$new_list="c.csv";
$fh = fopen($new_list, 'a');
foreach($small_list as $one_line)
{
    if(stristr($big_list, $one_line) != FALSE)
    {
        fwrite($fh, $one_line);
        echo "record found: " . $one_line ."<br>";
    }
}

The issue is that it’s been running (successfully) for over an hour and it’s only about 3,000 lines into the 160,000 in the smaller file. Any ideas?

1 Answer

Editorial Team · Answer 1 · 2026-05-16T17:11:34+00:00

Build two arrays that use line hashes as keys:

Read a.csv line by line and store $a_hash[md5($line)] = array($offset, $length)
Read b.csv line by line and store $b_hash[md5($line)] = true

Because the hashes are the array keys, duplicate lines collapse into a single entry automatically.

Then, for every hash that is a key in both $a_hash and $b_hash, read the line back from a.csv (using the offset and length you stored in $a_hash) to recover the actual line text. If you’re paranoid about hash collisions, store the offset and length in $b_hash as well and verify the match with stristr.

This will run a lot faster, because each lookup is an array key check rather than a scan of the whole of b.csv, and it avoids holding the entire 66 MB file in memory as a single string.
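A minimal sketch of the steps above, assuming whole-line matches are what is wanted. The function name write_common_lines, the rtrim() normalization of line endings, and the skipping of blank lines are assumptions made for illustration; array_intersect_key() is used to pick out the hashes present in both arrays.

```php
<?php
// Sketch only: index both files by line hash, then read the matching
// lines back from a.csv using the stored offsets.
// write_common_lines() is a hypothetical name, not part of the answer.
function write_common_lines(string $aPath, string $bPath, string $cPath): int
{
    $a = fopen($aPath, 'rb');
    $b = fopen($bPath, 'rb');
    $c = fopen($cPath, 'wb');
    if (!$a || !$b || !$c) {
        throw new RuntimeException('Could not open one of the files');
    }

    // $a_hash[md5(line)] = [offset, length] of that line inside a.csv
    $a_hash = [];
    $offset = 0;
    while (($line = fgets($a)) !== false) {
        // Assumption: compare lines without their trailing line ending.
        $text = rtrim($line, "\r\n");
        if ($text !== '') {
            $a_hash[md5($text)] = [$offset, strlen($text)];
        }
        $offset += strlen($line);
    }

    // $b_hash[md5(line)] = true
    $b_hash = [];
    while (($line = fgets($b)) !== false) {
        $b_hash[md5(rtrim($line, "\r\n"))] = true;
    }
    fclose($b);

    // Keys present in both arrays mark lines that occur in both files.
    $written = 0;
    foreach (array_intersect_key($a_hash, $b_hash) as [$start, $length]) {
        fseek($a, $start);
        fwrite($c, fread($a, $length) . "\n");
        $written++;
    }
    fclose($a);
    fclose($c);

    return $written;
}
```

Calling write_common_lines('a.csv', 'b.csv', 'c.csv') would return the number of distinct lines written to c.csv. Only the offsets are kept in memory, so the line text itself is read from disk once per match.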

If you want to reduce the memory requirement further and don’t mind checking for duplicates yourself, then:

Read a.csv line by line and store $a_hash[md5($line)] = false
Read b.csv line by line, hash each line, and check whether the hash exists in $a_hash.
If $a_hash[md5($line)] == false, write the line to c.csv and set $a_hash[md5($line)] = true

Some example code for the second suggestion:

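$a_file = fopen('a.csv','r');
$b_file = fopen('b.csv','r');
$c_file = fopen('c.csv','w+');

if(!$a_file || !$b_file || !$c_file) {
    echo "Broken!<br>";
    exit;
}

// Index every line of the small file by its hash; false means "not written yet".
$a_hash = array();

// fgets() returns false at end of file, so test its return value
// rather than feof(), which would hash one bogus empty read.
while(($line = fgets($a_file)) !== false) {
    // Strip the line ending so a last line without a newline still matches.
    $a_hash[md5(rtrim($line, "\r\n"))] = false;
}
fclose($a_file);

// Stream the big file and write each matching line to c.csv once.
while(($line = fgets($b_file)) !== false) {
    $line = rtrim($line, "\r\n");
    $hash = md5($line);
    if(isset($a_hash[$hash]) && !$a_hash[$hash]) {
        echo 'record found: ' . $line . '<br>';
        fwrite($c_file, $line . "\n");
        $a_hash[$hash] = true;
    }
}

fclose($b_file);
fclose($c_file);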
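One difference from the code in the question: stristr compares case-insensitively, while hashing the raw line only matches exact text. If case-insensitive matching is needed, one option is to normalize each line before hashing. The helper below is a sketch; line_key is a hypothetical name, and strtolower() should only be relied on for ASCII letters here.

```php
<?php
// Hypothetical helper: builds the array key for a line so that lines
// differing only in ASCII letter case or in their trailing line ending
// produce the same hash.
function line_key(string $line): string
{
    return md5(strtolower(rtrim($line, "\r\n")));
}

// Usage inside the two loops (sketch):
//   $a_hash[line_key($line)] = false;   // while reading a.csv
//   $hash = line_key($line);            // while reading b.csv
```

The line written to c.csv would then be whichever casing appears in b.csv, since that is the text being read at the time of the match.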
