For some time I have been playing with grep to retrieve data from files, and I noticed something funny.
It might be my ignorance but here is what happens…
Suppose I have a file ABC. The data is:
a
abc
ab
bac
bb
ac
Now I ran this grep command:
grep a* ABC
I found the output to contain lines starting with b as well as a. Why is this happening?
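You used ‘a*’ as your search pattern… the ‘*’ means ZERO or MORE of the previous character, so a line starting with ‘b’ still matches: it has ZERO or more ‘a’s in it. In fact ‘a*’ can match the empty string, so every line in the file matches.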
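To see this for yourself, here is a small sketch that rebuilds the file from the question in a scratch directory (the `workdir` variable and the use of `mktemp` are just my assumptions for the demo) and compares ‘a*’ against a plain ‘a’:

```shell
# Recreate the sample file from the question in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' a abc ab bac bb ac > "$workdir/ABC"

# 'a*' can match the empty string, so grep counts every line as a match.
all_lines=$(grep -c 'a*' "$workdir/ABC")
echo "lines matched by 'a*': $all_lines"

# A bare 'a' needs at least one a somewhere on the line, so 'bb' drops out.
with_a=$(grep 'a' "$workdir/ABC")
echo "$with_a"

rm -rf "$workdir"
```

The first grep reports all six lines; the second leaves out only ‘bb’.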
On a semi-related note, I’d recommend quoting the ‘a*’ bit. If you have ANY files in the current directory whose names start with a, you’ll be VERY surprised to see what you’re really searching for, since the shell (bash, zsh, csh, sh, dash, wtfsh…) performs wildcard expansion automatically BEFORE the command is executed.
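A quick way to watch the expansion happen is to put `echo` in front of the command. This is only a sketch: the file name `apricot` is made up for the demo, and it all runs in a scratch directory.

```shell
# Scratch directory holding ABC plus a hypothetical file whose name starts with 'a'.
workdir=$(mktemp -d)
printf '%s\n' a abc ab bac bb ac > "$workdir/ABC"
printf '%s\n' apple > "$workdir/apricot"

# Unquoted: the shell replaces a* with the matching file name before grep
# would ever run, so the search pattern silently becomes "apricot".
unquoted=$(cd "$workdir" && echo grep a* ABC)

# Quoted: the pattern reaches grep untouched.
quoted=$(cd "$workdir" && echo grep 'a*' ABC)

echo "$unquoted"
echo "$quoted"

rm -rf "$workdir"
```

With two or more matching files it gets worse: the first name becomes the pattern and the rest become extra files to search.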
If you want to search for lines which START with ‘a’, then you’ll need to anchor the search pattern with a leading ^ character, so your pattern becomes ‘^a*’. But again, the * means ZERO or more, so ‘^a*’ still matches every line, which is not useful in this situation where you only have one letter… use ‘^a’ instead.
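Here is a rough comparison of the two anchored patterns on the file from the question (again rebuilt in a scratch directory, which is my own setup, not part of the original question):

```shell
workdir=$(mktemp -d)
printf '%s\n' a abc ab bac bb ac > "$workdir/ABC"

# '^a' selects only the lines whose first character is a.
anchored=$(grep '^a' "$workdir/ABC")
echo "$anchored"

# '^a*' allows zero a's at the start of the line, so all six lines still match.
anchored_star=$(grep -c '^a*' "$workdir/ABC")
echo "lines matched by '^a*': $anchored_star"

rm -rf "$workdir"
```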
As a contrived example, if you wanted to find all the lines containing a ‘c’, including those containing the letters ‘bc’, then you could use ‘b*c’ as the search pattern… meaning ZERO or more b’s, followed by a c.
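To show just how contrived that is, this sketch runs ‘b*c’ and a plain ‘c’ over the file from the question (scratch directory and variable names are mine). Since the b’s are optional, both patterns pick the same lines:

```shell
workdir=$(mktemp -d)
printf '%s\n' a abc ab bac bb ac > "$workdir/ABC"

# Zero or more b's followed by a c.
with_bstar_c=$(grep 'b*c' "$workdir/ABC")

# Just a c anywhere on the line.
with_c=$(grep 'c' "$workdir/ABC")

echo "$with_bstar_c"

rm -rf "$workdir"
```

The difference only shows up in WHAT part of the line is matched (‘bc’ in abc, a lone ‘c’ in bac and ac), which matters for things like sed substitutions but not for which lines grep prints.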
The power of the regex search pattern is immense, and takes some time to grok. Peruse the man pages for grep(1), regex(7), pcre(3), pcresyntax(3), pcrepattern(3).
Once you get the hang of them, regexes are useful in sed, grep, perl, vim (and probably emacs too)… uh, it’s late (early?) and nothing more comes to mind, but they’re VERY powerful.
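As a bonus, ‘*’ means ZERO or more, ‘+’ means ONE or more, and ‘?’ means ZERO or ONE. One catch: plain grep uses basic regular expressions, where ‘+’ and ‘?’ are ordinary characters, so you need grep -E (extended regular expressions) to get those meanings.
So to search for lines with two or more consecutive a’s, use grep -E ‘aa+’, which is one a followed by one or more a’s.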
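A small sketch of ‘+’ and ‘?’ in action. The test lines below are invented for the demo (the file in the question has no doubled a’s), and note the `-E` flag: under POSIX basic regular expressions, which plain grep uses, ‘+’ and ‘?’ are just literal characters.

```shell
workdir=$(mktemp -d)
# Hypothetical sample lines, not from the original question.
printf '%s\n' a aa baaad b > "$workdir/sample"

# Extended regex: one a followed by one or more a's, i.e. two or more in a row.
two_plus=$(grep -E 'aa+' "$workdir/sample")
echo "$two_plus"

# The same idea in basic regex, using only '*': two a's, then zero or more.
two_plus_basic=$(grep 'aaa*' "$workdir/sample")

# '?' makes the second a optional: the whole line is either "a" or "aa".
one_or_two=$(grep -E '^aa?$' "$workdir/sample")
echo "$one_or_two"

rm -rf "$workdir"
```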
I ramble…. (regex(7)!)