The Archive Base Latest Questions

Editorial Team
Asked: May 25, 2026


I have a question regarding a search algorithm. I currently have 2 files in plain text, each with at least 10 million lines. For now, each line is a string, and I want to find every string in the first file that also appears in the second file. Is there a good way to do this efficiently? Any suggestions, whether an algorithm or a special language feature, are appreciated.


1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 6:06 am

    If you don’t know anything about the structure of the files (such as whether or not they’re sorted), there are several approaches you could take. Which one fits best depends on your constraints on time, memory, and disk space.

    If you have gobs of free RAM available, one option is to build a hash table in memory. Load all of the strings from the first file into the hash table, then read the strings from the second file one at a time. For each string, check whether it’s in the hash table; if so, report a match. This approach uses O(m) memory (where m is the number of strings in the first file and n is the number in the second) and takes Ω(m + n) time, possibly more depending on how the hash function behaves and how long the strings are. This is also (almost certainly) the easiest and most direct way to solve the problem.
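As a minimal sketch of the hash-table approach, the Python below uses a built-in `set` as the hash table. The function name `common_lines` and the sample data are illustrative assumptions, not part of the original answer; any iterable of lines (including an open file object) should work as input.

```python
from typing import Iterable, Iterator


def common_lines(first: Iterable[str], second: Iterable[str]) -> Iterator[str]:
    """Yield each line of `second` that also occurs in `first`."""
    # Load every string from the first source into a hash set: O(m) memory.
    seen = {line.rstrip("\n") for line in first}
    # Stream the second source one line at a time and probe the set.
    for line in second:
        key = line.rstrip("\n")
        if key in seen:
            yield key


# Small in-memory example; file objects can be passed the same way.
matches = list(common_lines(["apple\n", "pear\n", "kiwi\n"],
                            ["kiwi\n", "plum\n", "apple\n"]))
```

Note that a line repeated in the second source is reported once per occurrence; collect the results into a `set` if each match should appear only once.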

    If you don’t have much RAM available but time isn’t much of a constraint, you can use a modified version of this first algorithm. Choose some number of lines, b, to load from the first file, and load just those strings into a hash table. Then scan the entire second file to find any matches. After that, evict all of the lines from the hash table, load the next b lines from the first file, and repeat. This has runtime Ω(mn/b), where b is the block size, since there are about m/b passes and each pass is a complete linear scan of all n lines in the second file. Alternatively, if you know that one file is much smaller than the other, you might want to consider just loading that whole file into memory and then scanning the other file.
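The blocked variant can be sketched as follows. This is an illustration under assumptions: `common_lines_chunked`, `open_second`, and `block_size` are hypothetical names, and `open_second` is a callable that returns a fresh iterator over the second source each time, since that source has to be rescanned once per block.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def common_lines_chunked(
    first: Iterable[str],
    open_second: Callable[[], Iterable[str]],
    block_size: int,
) -> Iterator[str]:
    """Yield matches while holding at most `block_size` lines of `first` in memory."""
    first_iter = iter(first)
    while True:
        # Load the next block of lines from the first source into a hash set.
        block = {line.rstrip("\n") for line in islice(first_iter, block_size)}
        if not block:
            break  # the first source is exhausted
        # Rescan the whole second source against this block only.
        for line in open_second():
            key = line.rstrip("\n")
            if key in block:
                yield key


first_lines = ["a\n", "b\n", "c\n", "d\n", "e\n"]
second_lines = ["e\n", "x\n", "b\n"]
chunked_matches = sorted(
    common_lines_chunked(first_lines, lambda: second_lines, block_size=2)
)
```

With real files, `open_second` could be something like `lambda: open(path)` for a path of your choosing. A larger `block_size` means fewer passes over the second file at the cost of more memory.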

    If you don’t have much RAM available but do have the ability to use up more disk space, one option might be to use an external sorting algorithm to sort each of the two files (or, at least, to build an index listing the lines of each file in sorted order). Once you have the files in sorted order, you can scan across them in parallel, finding all matches. This uses the more general algorithm for finding all common elements of two sorted ranges, which works like this:

    1. Keep track of two indices, one into the first list and one into the second list, that both start at zero.
    2. While both lists have items left:
      1. If the items at the corresponding indices match, report a match and increase both indices.
      2. Otherwise, if the item in the first list is smaller than the item in the second list, increase the index into the first list.
      3. Otherwise, increase the index into the second list.
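The two-index scan can be sketched in Python as below. It assumes both inputs are already sorted in the same order, and it advances both indices after a match so the loop always makes progress (an assumption about the intended behavior). The name `sorted_intersection` is illustrative.

```python
from typing import Iterator, Sequence


def sorted_intersection(a: Sequence[str], b: Sequence[str]) -> Iterator[str]:
    """Yield the items common to two sorted sequences using a parallel scan."""
    i = j = 0  # one index into each list, both starting at zero
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            yield a[i]  # report the match, then move past it in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # the first list is behind; advance it
        else:
            j += 1  # the second list is behind; advance it


common = list(sorted_intersection(["a", "c", "e", "g"],
                                  ["b", "c", "d", "g", "h"]))
```

For files too large to hold in memory, the same logic applies to two line iterators over the externally sorted files instead of indexable sequences.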

    This algorithm would take roughly O(n log n) comparisons to sort the two files and would then make a total of O(n) comparisons to find the common items across the lists. However, since string comparisons do not necessarily run in O(1) time (in fact, they often take much longer), the actual runtime might be much greater. If we assume that each file consists of n strings of length L, then the runtime for sorting would be O(Ln log n), since each comparison takes O(L) time. Similarly, the comparison step might take O(Ln) time, because each string comparison could take up to O(L) time as well. As a possible optimization, you may want to consider computing a small hash code (say, 16 or 32 bits) for each string and ordering the strings by hash code first, comparing the full strings only when the hash codes are equal. Assuming that the hash code gives good uniformity, this can dramatically cut down on the time required to compare strings, since most strings that aren’t identical will have different hash codes and can be told apart in O(1) time.
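One way to sketch the hash-code optimization is to sort and merge on `(hash, string)` pairs, so that the integer hash decides most comparisons and the full string is examined only on a tie. The choice of `zlib.crc32` as the 32-bit hash and the helper names `keyed` and `intersect_keyed` are assumptions made for illustration; the original answer does not name a hash function.

```python
import zlib
from typing import List, Tuple

Keyed = Tuple[int, str]


def keyed(lines: List[str]) -> List[Keyed]:
    """Pair each string with a 32-bit CRC and sort by (crc, string)."""
    return sorted((zlib.crc32(s.encode("utf-8")), s) for s in lines)


def intersect_keyed(a: List[Keyed], b: List[Keyed]) -> List[str]:
    """Merge two (crc, string) lists; tuples compare the integer first."""
    out: List[str] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i][1])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


shared = sorted(intersect_keyed(keyed(["delta", "alpha", "charlie"]),
                                keyed(["bravo", "charlie", "alpha"])))
```

Both files must be keyed with the same hash function, and the merged output comes out in hash order rather than alphabetical order, which is why the example sorts the result before using it.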

    Finally, if each line of the file is reasonably long (say, at least 8 characters), one option might be to compute a 64-bit or larger hash value for each line of each file. You could then use any of the above techniques (holding everything in a hash table, using external sorting, etc.) to see whether any hash codes appear in both files. Assuming that you have enough bits in your hash code, the number of collisions should be low, and you should be able to find matches quickly and with far less memory usage. Because two different lines can still share a hash value, check each reported match against the actual strings if you need exact results.
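A rough sketch of the fixed-width hashing idea follows. It assumes BLAKE2b truncated to 8 bytes as the 64-bit hash (any well-distributed hash would do), and `hash64` and `candidate_matches` are hypothetical names. The results are only candidates, since a hash collision could produce a false positive.

```python
import hashlib
from typing import Iterable, List, Set


def hash64(s: str) -> int:
    """Return a 64-bit hash of `s` (BLAKE2b with an 8-byte digest)."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def candidate_matches(first: Iterable[str], second: Iterable[str]) -> List[str]:
    """Return lines of `second` whose hash also occurs among the hashes of `first`."""
    # Only the 8-byte hashes of the first source are retained, not the lines.
    first_hashes: Set[int] = {hash64(s) for s in first}
    return [s for s in second if hash64(s) in first_hashes]


candidates = candidate_matches(["alpha", "bravo", "charlie"],
                               ["charlie", "delta", "alpha"])
```

In Python the memory saving from a `set` of integers is modest; storing the hashes in a compact structure such as a sorted `array` of unsigned 64-bit values would be closer to the saving the answer has in mind.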

    Hope this helps!

    Woohoo! This is my 1000th answer on Stack Overflow! 🙂
