I’m making a shell script to find bigrams, which works, sort of.
    # tokenise words
    tr -sc 'a-zA-Z0-9.' '\012' < "$1" > out1
    # create 2nd list offset by 1 word
    tail -n +2 out1 > out2
    # paste lists together
    paste out1 out2
    # clean up
    rm out1 out2
The only problem is that it also pairs the last word of one sentence with the first word of the next.
E.g. for the two sentences ‘hello world.’ and ‘foo bar.’ I’ll get a line with ‘world. foo’. Would it be possible to filter these out with grep or something?
I know I can find all bigrams containing a full stop with grep '[.]', but that also matches the legitimate bigrams.
Just replace the paste line with this:
This will filter out any lines that contain a period which is not the last character of a line.
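As a rough sketch of a filter matching that description: the `grep -v '\..'` pattern below is an assumption, not necessarily the exact command meant above, and the function name `bigrams` and the use of `mktemp` are only there to make the example self-contained. The pattern matches a period followed by any character, so `-v` drops every line where a period is not the last character.

```shell
#!/bin/sh
# bigrams FILE: print word pairs that do not cross a sentence boundary.
# (Hypothetical wrapper around the script from the question.)
bigrams() {
    out1=$(mktemp) && out2=$(mktemp) || return 1
    # tokenise words, one per line, keeping trailing full stops
    tr -sc 'a-zA-Z0-9.' '\012' < "$1" > "$out1"
    # second list, offset by one word
    tail -n +2 "$out1" > "$out2"
    # pair the lists, then drop lines where a period is followed by anything
    # (assumed filter: also drops the dangling last word, which has no partner)
    paste "$out1" "$out2" | grep -v '\..'
    # clean up
    rm -f "$out1" "$out2"
}
```

With the two example sentences as input, the ‘world. foo’ line is dropped while ‘hello world.’ and ‘foo bar.’ are kept. Note that a token with an internal period, such as a decimal number, would also be filtered out.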