I am trying to write a python code to match things from two lists in python.
One tab-delimited file looks like this:
COPB2
KLMND7
BLCA8
while the other file2 has a long list of similar looking “names”, if you will. There should be some identical matches in the file, which I have succeeded in identifying and writing out to a new file. The problem is when there are additional characters at the end of one of the “names”. For example, COPB2 from above should match COPB2A in file2, but it does not. Similarly KLMND7 should match KLMND79. Should I use regular expressions? Make them into strings? Any ideas are helpful, thank you!
What I have worked on so far, after the first response seen below:
with open(in_file1, "r") as names:
for line in names:
file1_list = [i.strip() for i in line.split()]
file1_str = str(file1_list)
with open(in_file2, "r") as symbols:
for line in symbols:
items = line.split("\t")
items = str(items)
matches = items.startswith(file1_str)
print matches
This code returns False when I know there should be some matches.
string.startswith()No need for regex, if it’s only trailing charactersHere is a working piece of code:
It may be quite slow for large files. If it’s unacceptable, I could try to think about how to optimize it.