I have the following issue that I’m trying to solve in bash. I have two different files (file1, file2), each containing a list of records like the following:
HWI-1KL104:145:C18ANACXX:5:1101:1168:2164 4 * 0 0 * * 0 0 GTGCCTGAACTGGATGCATNGACAATGGGGAACATTACATATATAATACAAGGGAAACTCAAACGTTTCCNNNNNCAAGTATTTGACAGNNNNNNNNNNNN @B@DDFFFHHHHHIHIJIJ#3AFGHHJJJJIIJJIJIIIJJJJJJJGIIJIJJJIJIJJJJIJJI=@EED#####,,5=;ADDFEEDDD############
The string shown above is A SINGLE LINE, meaning that if I do:
grep "HWI-1KL104:145:C18ANACXX:5:1101:1168:2164" file1
my output is the string above. HWI-1KL104:145:C18ANACXX:5:1101:1168:2164 is the ID of my line.
You have to imagine millions of lines like this (~8GB of txt file), each with a different ID.
What I have to do is:
- search for those IDs present in file1 that are also present in file2
- save the matched lines in file2 into a new file containing ONLY the ID + the following information:
HWI-1KL104:145:C18ANACXX:5:1101:1196:2120
CCCCTTCTCCAGGGGACCANGTATGTTTCTCTTATGGTCCTCCTTGTTTACTAGCTTCTCTGGCAGTGAGATTGTAGGCTGGTAATCCTTTACTCNNTNNN CCCFFFFFHHHHHJJJJJJ#4CDEEDCDDDDDC######
so, discarding the part represented by 4 * 0 0 * * 0 0 (which is fixed in length but not in content, meaning it could be 3 * 1 0 * * 0 1 and so on).
So my file1 is a sort of “reference” list of the IDs that I want to find in file2 and save.
It is quite difficult for me to explain. I hope you understand what I would like to do.
I think that a grep should work, but I don’t know how to grep just part of a line and compare it to another file.
You could use a for loop.
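A minimal sketch of such a loop follows. It is not the original script. It is written with `while read` rather than a literal `for`, since that reads file1 line by line without loading every ID at once. It assumes the ID is the first whitespace-separated field and that the two strings to keep are fields 10 and 11, as in the sample line. The name matched.txt and the sample records are made up for illustration.

```shell
#!/usr/bin/env bash
# Tiny made-up sample inputs so the sketch runs on its own.
printf '%s\n' 'ID:1 4 * 0 0 * * 0 0 ACGT IIII' 'ID:2 4 * 0 0 * * 0 0 TTGA JJJJ' > file1
printf '%s\n' 'ID:2 3 * 1 0 * * 0 1 GGCC HHHH' 'ID:3 4 * 0 0 * * 0 0 CCAA FFFF' > file2

: > matched.txt   # start with an empty output file (hypothetical name)

# Assumption: the ID is the first whitespace-separated field of each line.
while read -r id _; do
    # echo "working on $id"   # uncomment for more verbosity
    # Assumption: fields 10 and 11 hold the two strings to keep.
    awk -v id="$id" '$1 == id { print $1, $10, $11 }' file2 >> matched.txt
done < file1

cat matched.txt
```

Each ID triggers a full scan of file2, so the run time grows with the number of IDs multiplied by the size of file2.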
Above is the same script, now expanded so that it sends its output to a file itself: rather than redirecting the script's output to a file, you can just execute the script and let it handle where the output goes.
Sure, you can run it on large files; it may just take a while to get going and some time to finish. This method works and is easy to use, but it may not be as fast as some of the more complex methods suggested.
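For comparison, here is a sketch of one common faster pattern, which may or may not be among the other methods referred to: a single awk pass that first loads the IDs from file1 into an array and then filters file2 against it. It makes the same assumptions as before (ID in field 1, strings to keep in fields 10 and 11); matched_fast.txt and the sample records are made up. It keeps every file1 ID in memory, so with millions of IDs it may need a fair amount of RAM.

```shell
#!/usr/bin/env bash
# Tiny made-up sample inputs so the sketch runs on its own.
printf '%s\n' 'ID:1 4 * 0 0 * * 0 0 ACGT IIII' 'ID:2 4 * 0 0 * * 0 0 TTGA JJJJ' > file1
printf '%s\n' 'ID:2 3 * 1 0 * * 0 1 GGCC HHHH' 'ID:3 4 * 0 0 * * 0 0 CCAA FFFF' > file2

# NR == FNR is true only while awk reads the first file (file1):
#   remember each ID as an array key and skip to the next line.
# For file2, print the ID plus fields 10 and 11 when the ID was seen in file1.
awk 'NR == FNR { ids[$1]; next }
     $1 in ids { print $1, $10, $11 }' file1 file2 > matched_fast.txt

cat matched_fast.txt
```

Each file is read only once here, which is where the speed-up over the loop comes from.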
You could enable the “working on id” line to get more verbosity.
Additional notes:
You could dig further into the initial grep like this:
This now returns file name|whole string, then looks for the pattern and returns everything after it. You can customise it by adding more awk statements to the end of the line.
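The original command is not shown, so the following is only a guess at the kind of pipeline described. It assumes `grep -H` (file name prefix, available in GNU and BSD grep) followed by an awk stage that cuts everything up to and including the pattern. The ID and the sample records are made up.

```shell
#!/usr/bin/env bash
# Made-up sample input so the sketch runs on its own.
printf '%s\n' 'ID:2 3 * 1 0 * * 0 1 GGCC HHHH' 'ID:3 4 * 0 0 * * 0 0 CCAA FFFF' > file2

id='ID:2'   # hypothetical ID to look for

# grep -H prefixes each hit with "file2:"; -F treats the ID as a fixed string
# (a substring match, so ID:2 would also hit ID:20).
# awk then locates the ID in the line and keeps only what follows it.
rest=$(grep -H -F -- "$id" file2 |
       awk -v id="$id" '{ pos = index($0, id); print substr($0, pos + length(id) + 1) }')
printf '%s\n' "$rest"

# A further awk stage on the end narrows the result to the last two fields.
fields=$(printf '%s\n' "$rest" | awk '{ print $(NF-1), $NF }')
printf '%s\n' "$fields"
```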