I have text files from which I need to remove stop words. The stop words are stored in a separate text file, which I load into my Perl script and store in an array called “stops”.
Currently I am loading a different set of text files, storing them in a separate array, and then doing a pattern match to see whether any of the words are stop words.
I can print the stop words and see which ones occur in the files, but how do I remove them and write a new text file that contains no stop words?
For example, stop words:
the
a
to
of
and
into
Text File:
“The girl was driving and crashed into a man”
Resulting file:
girl was driving crashed man
I load the file in:
$dirtoget = "/Users/j/temp/";
opendir( IMD, $dirtoget ) || die("Cannot open directory");
@thefiles = readdir(IMD);
foreach $f (@thefiles) {
if ( $f =~ m/\.txt$/ ) {
open( FILE, "/Users/j/temp/$f" ) or die "Cannot open FILE";
while (<FILE>) {
@file = <FILE>;
Here is the pattern matching loop:
foreach $word(split) {
foreach $x (@stop) {
if ($x =~ m/\b\Q$word\E\b/) {
$word='';
print $word,"\n";
This sets $word to the empty string.
Or I could do:
$word = '' if exists $stops{$word};
I’m just not sure how to write an output file that no longer contains the matching words.
Is it stupid to store the words which don’t match in an array and output them to a file?
Overwriting the files in place is possible, but a hassle. The Unix way of doing this is to just print the non-stop words to standard output (which print does by default), redirect that to a file, then proceed with the file withoutstopwords.txt. This also allows the use of the program in a pipeline.
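A minimal sketch of that approach, not a definitive implementation. It assumes one stop word per line in the stop-word file, case-insensitive matching, and that punctuation attached to a word (such as quotes) should be ignored when comparing. The sub names make_stop_lookup and remove_stopwords are made up for illustration. It uses a hash lookup rather than a nested loop over the stop list, so each word is checked once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a lookup hash from a list of stop words.
# Keys are lowercased so matching is case-insensitive (an assumption).
sub make_stop_lookup {
    my @words = @_;
    my %stops = map { lc($_) => 1 } @words;
    return \%stops;
}

# Return the line with every stop word dropped.
# Leading and trailing punctuation is stripped only for the comparison;
# words that are kept are printed exactly as they appeared.
sub remove_stopwords {
    my ( $line, $stops ) = @_;
    my @kept;
    for my $word ( split ' ', $line ) {
        ( my $key = lc $word ) =~ s/^\W+|\W+$//g;
        push @kept, $word unless $stops->{$key};
    }
    return join ' ', @kept;
}

# First argument: the stop-word file. Remaining arguments (or STDIN)
# are the text to filter. Output goes to standard output.
sub main {
    my $stop_file = shift @ARGV;
    open my $fh, '<', $stop_file or die "Cannot open $stop_file: $!";
    my @words = <$fh>;
    close $fh;
    s/^\s+|\s+$//g for @words;    # trim newlines and stray whitespace
    my $stops = make_stop_lookup( grep { length } @words );

    while ( my $line = <> ) {
        print remove_stopwords( $line, $stops ), "\n";
    }
}

# Run as a filter only when a stop-word file is given.
main() if @ARGV;
```

Saved under a hypothetical name such as remove_stopwords.pl, it could be run as "perl remove_stopwords.pl stopwords.txt input.txt > withoutstopwords.txt", where stopwords.txt and input.txt stand for your own files. Note that a stop word with punctuation attached is dropped together with that punctuation, which may or may not be what you want.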