The Archive Base

Editorial Team
Asked: June 17, 2026

I would like to write a python script that addresses the following problem:

I have two tab-separated files. One has just one column containing a variety of words. The other has one column that contains similar words, as well as columns of other information. However, in the first file, some lines contain multiple words separated by " /// ". The other file has a similar problem, but the separator is " | ".

File #1

RED
BLUE /// GREEN
YELLOW /// PINK /// PURPLE
ORANGE
BROWN /// BLACK

File #2 (Which contains additional columns of other measurements)

RED|PINK 
ORANGE
BROWN|BLACK|GREEN|PURPLE
YELLOW|MAGENTA

I want to parse through each file and match the words that are the same, and then append the columns of additional measurements too. But I want to ignore the /// in the first file, and the | in the second, so that each word will be compared to the other list on its own. The output file should have just one column of any words that appear in both lists, and then the appended additional information from file 2. Any help??


Additional info / update:

Here are 8 lines of File #1. I used color names above to keep it simple, but this is what the words really are. These are the “symbols”:

ANKRD38  
ANKRD57  
ANKRD57
ANXA8 /// ANXA8L1 /// ANXA8L2  
AOF1  
AOF2  
AP1GBP1  
APOBEC3F /// APOBEC3G  

Here is one line of file #2. What I need to do is take each symbol from file1 and see if it matches any one of the “synonyms” found in column 5 of file2 (here the synonyms are A1B|ABG|GAB|HYST2477). If a symbol from file1 matches ANY of the synonyms in column 5 of file2, then I need to append the additional information (the other columns in file2) onto the symbol from file1 and create one big output file.

9606  '\t'    1 '\t'    A1BG  '\t'   -   '\t'       A1B|ABG|GAB|HYST2477'\t'    HGNC:5|MIM:138670|Ensembl:ENSG00000121410|HPRD:00726    '\t' 19   '\t'  19q13.4'\t' alpha-1-B glycoprotein '\t' protein-coding '\t' A1BG'\t'    alpha-1-B glycoprotein'\t'  O '\t'  alpha-1B-glycoprotein '\t'  20120726

File2 is 22,000 KB, file 1 is much smaller. I have thought of creating a dict much like has been suggested, but I keep getting held up with the different separators in each of the files. Thank you all for questions and help thus far.

1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 11:39 am

    EDIT

    After your comments below, I think this is what you want to do. I’ve left the original post below in case anything in that was useful to you.

    So, I think you want to do the following. Firstly, this code will read every separate symbol from file1 into a set. This is a useful structure because it automatically removes any duplicates and is very fast for lookups. It’s like a dictionary but with only keys, no values. If you don’t want to remove duplicates, we’ll need to change things slightly.

    file1_data = set()
    with open("file1.txt", "r") as fd:
        for line in fd:
            file1_data.update(i.strip() for i in line.split("///") if i.strip())
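
To see what that set ends up holding, here is a small sketch that runs the same splitting logic over a few lines copied from the File #1 excerpt in the question. It works on an in-memory list instead of a real file, and the names sample_lines and symbols are just for illustration.

```python
# A few lines copied from the question's File #1 excerpt (trailing spaces kept).
sample_lines = [
    "ANKRD57  \n",
    "ANKRD57\n",
    "ANXA8 /// ANXA8L1 /// ANXA8L2  \n",
    "APOBEC3F /// APOBEC3G  \n",
]

symbols = set()
for line in sample_lines:
    # Split on the "///" separator, trim whitespace, and skip empty pieces.
    symbols.update(part.strip() for part in line.split("///") if part.strip())

# The repeated ANKRD57 is stored only once.
print(sorted(symbols))
# ['ANKRD57', 'ANXA8', 'ANXA8L1', 'ANXA8L2', 'APOBEC3F', 'APOBEC3G']
```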
    

    Then you want to run through file2 looking for matches:

    with open("file2.txt", "r") as in_fd:
        with open("output.txt", "w") as out_fd:
            for line in in_fd:
                # Drop the trailing newline so it cannot end up inside a column.
                items = line.rstrip("\n").split("\t")
                if len(items) < 5:
                    # This is so we don't crash if we find a line that's too short
                    continue
                synonyms = set(i.strip() for i in items[4].split("|"))
                overlap = synonyms & file1_data
                if overlap:
                    # Build string of columns from file2, stripping out 5th column.
                    output_str = "\t".join(items[:4] + items[5:])
                    for item in overlap:
                        # Write one complete, newline-terminated line per match.
                        out_fd.write(item + "\t" + output_str + "\n")
    

    So what this does is open file2 and an output file. It goes through each line in file2, and first checks it has enough columns to at least have a column 5 – if not, it ignores that line (you might want to print an error).

    Then it splits column 5 by | and builds a set from that list (called synonyms). The set is useful because we can find the intersection of this with the previous set of all the synonyms from file1 very fast – this intersection is stored in overlap.
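
As a rough illustration of that intersection step, the sketch below uses the column 5 value from the sample line in the question. The contents of file1_symbols are made up for the example (I'm assuming A1B and GAB appear in file1, which the question does not show).

```python
# Hypothetical file1 symbols; only A1B and GAB also occur in column 5 below.
file1_symbols = {"A1B", "GAB", "ANKRD38"}

# Column 5 of the sample file2 line from the question.
column5 = "A1B|ABG|GAB|HYST2477"

# Split on "|" and build a set so we can intersect it with file1's set.
synonyms = {s.strip() for s in column5.split("|")}
overlap = synonyms & file1_symbols

print(sorted(overlap))  # ['A1B', 'GAB']
```

An empty overlap is falsy, which is why a plain `if overlap:` is enough to skip lines with no match.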

    What we do then is check if there was any overlap – if not, we ignore this line because no synonym was found in file1. This check is mostly for speed, so we don’t bother building the output string if we’re not going to use it for this line.

    If there was an overlap, we build a string holding all the columns we’re going to append to the symbol. We only need to build it once per line, even if there are multiple matches, because it comes entirely from the line in file2 and so is the same for each match. This is faster than rebuilding it for every match.

    Then, for each symbol from file1 that matched, we write a line to the output: the symbol, then a tab, then the rest of the line from file2. Because we split by tabs, we have to put them back in with "\t".join(...). This assumes I am correct that you want to remove column 5. If you do not want to remove it, it’s even easier, because you can just use the line from file2 after stripping off the newline at the end.
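
If you do want to keep column 5, a minimal sketch of that variant might look like the following. The function name matched_lines and the shortened demo line are my own inventions for illustration, and it takes any iterable of lines so it can be tried without real files.

```python
def matched_lines(file1_symbols, file2_lines):
    """Yield 'symbol<TAB>full file2 line' for every symbol found in column 5."""
    for line in file2_lines:
        line = line.rstrip("\n")  # strip the newline once, up front
        items = line.split("\t")
        if len(items) < 5:
            continue  # too short to have a column 5
        synonyms = {s.strip() for s in items[4].split("|")}
        # sorted() only makes the output order predictable.
        for symbol in sorted(synonyms & file1_symbols):
            yield symbol + "\t" + line + "\n"


# Shortened, hypothetical version of the sample file2 line, plus one short line.
demo_file2 = [
    "9606\t1\tA1BG\t-\tA1B|ABG|GAB|HYST2477\t19\n",
    "too\tshort\n",
]
result = list(matched_lines({"GAB", "AOF1"}, demo_file2))
print(result)
```

With real files you would pass the open file2 handle as file2_lines and write each yielded string to the output file.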

    Hopefully that’s closer to what you need?

    ORIGINAL POST

    You don’t give any indication of the size of the files, but I’m going to assume they’re small enough to fit into memory – if not, your problem becomes slightly trickier.

    So, the first step is probably to open file #2 and read in the data. You can do it with code something like this:

    file2_data = {}
    with open("file2.txt", "r") as fd:
        for line in fd:
            items = line.split("\t")
            file2_data[frozenset(i.strip() for i in items[0].split("|"))] = items[1:]
    

    This will create file2_data as a dictionary which maps the set of words in the first column of each line (stored as a frozenset, since dictionary keys must be hashable) to a list of the remaining items on that line. You should also consider whether words can repeat and how you wish to handle that, as I mentioned in my earlier comment.
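
For a single row, the resulting dictionary entry can be sketched like this. The row reuses the colors from the question, but the two measurement values are made up. It also shows why the key is a frozenset and not a plain set.

```python
# One file2-style row; the measurement columns (0.7, 12) are hypothetical.
row = "BROWN|BLACK|GREEN|PURPLE\t0.7\t12\n"

items = row.rstrip("\n").split("\t")
key = frozenset(w.strip() for w in items[0].split("|"))
file2_data = {key: items[1:]}

# A plain set is mutable and therefore unhashable, so it cannot be a dict key.
try:
    {set(key): items[1:]}
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

print(file2_data[key])   # ['0.7', '12']
print(set_is_hashable)   # False
```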

    After this, you can then read the first file and attach the data to each word in that file:

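    with open("file1.txt", "r") as fd:
        with open("output.txt", "w") as fd_out:
            for line in fd:
                # Ignore empty pieces so blank lines never produce a match.
                words = set(i.strip() for i in line.split("///") if i.strip())
                # .items() works on Python 3 (on Python 2, .iteritems() avoids a copy).
                for file2_words, file2_cols in file2_data.items():
                    overlap = file2_words & words
                    if overlap:
                        # Make sure each output row ends with exactly one newline.
                        row = "///".join(overlap) + "\t" + "\t".join(file2_cols)
                        fd_out.write(row.rstrip("\n") + "\n")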
    

    What you should end up with is each row in output.txt being one where the list of words in the two files had at least one word in common and the first item is the words in common separated by ///. The other columns in that output file will be the other columns from the matched row in file #2.

    If that’s not what you want, you’ll need to be a little more specific.

    As an aside, there are probably more efficient ways to do this than the O(N^2) approach I outlined above (i.e. it runs across one entire file as many times as there are rows in the other), but that requires more detailed information on how you want to match the lines.

    For example, you could construct a dictionary mapping a word to a list of the rows in which that word occurs – this makes it a lot faster to check for matching rows than the complete scan performed above. This is rendered slightly fiddly by the fact you seem to want the overlaps between the rows, however, so I thought the simple approach outlined above would be sufficient without more specifics.
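
A minimal sketch of that word-to-rows dictionary, using the color example from the question, could look like this. The single measurement column in file2_rows is made up, and the output here is one (word, row number) pair per match, not the per-row overlaps, so treat it as a starting point only.

```python
from collections import defaultdict

# Rows modelled on the question's color example; the second column is hypothetical.
file2_rows = [
    "RED|PINK\t1.0",
    "ORANGE\t2.0",
    "BROWN|BLACK|GREEN|PURPLE\t3.0",
    "YELLOW|MAGENTA\t4.0",
]
file1_rows = ["RED", "BLUE /// GREEN", "YELLOW /// PINK /// PURPLE", "ORANGE", "BROWN /// BLACK"]

# Map each word to the list of file2 row numbers in which it occurs.
index = defaultdict(list)
for row_number, row in enumerate(file2_rows):
    for word in row.split("\t")[0].split("|"):
        index[word.strip()].append(row_number)

# Each file1 word is now a single dictionary lookup, not a scan of file2.
matches = []
for row in file1_rows:
    for word in (w.strip() for w in row.split("///")):
        for row_number in index.get(word, []):
            matches.append((word, row_number))

print(matches)
```

Building the index takes one pass over file2, after which the cost of matching depends on the number of words in file1 and not on the size of file2.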
