I have idsfile.csv, which is a comma-separated file of ids (with no newline characters in it),
and I would like to grab only the lines from a second file, datafile.txt,
that contain one of those ids (surrounded by tabs).
Sample idsfile.csv:
000001,000002,000005,000007,000008,000009,000011,000021,000029,000040,...
Sample datafile.txt:
titl e1 000001 description1
title2 000003 descr iption2
ti tle3 000021 des cripti on3
title4 000023 description4
If I was doing this without having to read in the ids from a file I would try:
grep -Ev '/\t000001\t|\t000002\t|\t000003\t/' datafile.txt > output.txt
but I am unsure how to read in the comma separated values in a way that I could then use them in the regular expression.
Does anyone know how I might assemble this as a one-line command, please? Perhaps with textscan?
Edit: Actually, if I changed idsfile.csv to have one id on each line (with a tab before and after), would a line similar to this work, or, as I expect, is the syntax quite wrong:
grep -Evf idsfile.csv datafile.txt > output.txt
The single line of data in idsfile.csv is hostile to this workflow – you will have to transform it into a series of lines. The Unix toolset is based around lines!
So, we need to transliterate the commas into newlines:
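One way this step might look, as a sketch: `tr` rewrites every comma as a newline, so each id ends up on its own line. The output name `idsfile.txt`, the scratch directory, and the shortened id list are assumptions made for illustration.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: a single line of comma-separated ids, no trailing newline.
printf '000001,000002,000005,000021' > idsfile.csv

# Turn every comma into a newline, giving one id per line.
tr ',' '\n' < idsfile.csv > idsfile.txt

cat idsfile.txt
```

If the real file ends with a trailing comma, the result would contain an empty line, and an empty pattern matches every line, so it may be worth checking for that first.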
A POSIX-compliant ‘grep’ will also recognize:
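A sketch of the kind of invocation this likely refers to: `-F` treats each pattern as a fixed string and `-f` reads the patterns from a file, one per line. The file `idsfile.txt` (one id per line) and the tab-separated sample rows are assumptions based on the question.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Assumed input: one id per line.
printf '000001\n000002\n000005\n000021\n' > idsfile.txt

# Assumed data: tab-separated fields, modelled on the question's sample.
printf 'titl e1\t000001\tdescription1\n'   >  datafile.txt
printf 'title2\t000003\tdescr iption2\n'   >> datafile.txt
printf 'ti tle3\t000021\tdes cripti on3\n' >> datafile.txt
printf 'title4\t000023\tdescription4\n'    >> datafile.txt

# Keep only the lines that contain any of the ids as a fixed string.
grep -F -f idsfile.txt datafile.txt > output.txt

cat output.txt
```

Note that there is no `-v` here: `-v` would invert the match and keep the lines that do not contain an id.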
You might even be able to get away with:
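A sketch of such a single pipeline, assuming your grep accepts `-` as the file name for `-f` (GNU grep does; other implementations may differ). The sample files are made up for illustration.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample inputs modelled on the question.
printf '000001,000002,000005,000021' > idsfile.csv
printf 'titl e1\t000001\tdescription1\n'   >  datafile.txt
printf 'title2\t000003\tdescr iption2\n'   >> datafile.txt
printf 'ti tle3\t000021\tdes cripti on3\n' >> datafile.txt
printf 'title4\t000023\tdescription4\n'    >> datafile.txt

# Split the ids onto separate lines and feed them to grep as the pattern list.
tr ',' '\n' < idsfile.csv | grep -F -f - datafile.txt > output.txt

cat output.txt
```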
This tells ‘grep’ to read the list of names to search for from its standard input.
Finally, if you’re using GNU grep, you could add ‘-w’ to search for words – it will require the pattern to be surrounded by non-alphanumeric characters (spaces in the examples). The ‘-w’ option means that if a line in datafile.txt contains ‘000021’ only as part of a longer string of digits, grep will not select that line (without the ‘-w’, it would be selected).
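To illustrate the difference, here is a small sketch comparing the two forms. The id `0000210` and both data rows are invented for the demonstration, and `-w` is assumed to be available (it is in GNU grep).

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One id to search for.
printf '000021\n' > idsfile.txt

# Hypothetical data: the second row only contains the id inside a longer number.
printf 'title3\t000021\tdescription3\n'  >  datafile.txt
printf 'title5\t0000210\tdescription5\n' >> datafile.txt

# Plain fixed-string match: both rows contain the substring 000021.
grep -F -f idsfile.txt datafile.txt > without_w.txt

# Whole-word match: only the row where 000021 stands alone between tabs.
grep -F -w -f idsfile.txt datafile.txt > with_w.txt

echo "without -w:"; cat without_w.txt
echo "with -w:";    cat with_w.txt
```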