It involves converting each .htm file to .txt (first output file),
then running a parser (the Stanford grammar parser) on that text (second output file),
for all the files in a directory.
MY QUESTION: I would like to process all the files in the directory without listing them manually, and to find a way to run the parser without having to type the command into the Terminal for each file.
Here is my code:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::FormatText;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new->parse_file("chpt15Intro.htm");
use HTML::FormatText;
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 1000);
#print $formatter->format($tree); is replaced by push
push (my @files, $formatter->format($tree));
foreach my $files (@files) {
    $files =~ s/^\s+//mg;
    open MYFILE, ">ch15Intro.txt";
    select MYFILE;
    print $files;
}
In the Terminal, after converting the HTML file, I run:
script parsedch15Intro.txt ./lexparser.csh ch15Intro.txt
to save the output of the parser. This step still needs automation.
I’m a beginner so thanks for any advice.
I take it from your question that you want to apply this script to all the HTML files in a certain folder and output text versions of them.
A simple solution is to replace the hardcoded file names with variables and loop the script over @ARGV, i.e. the arguments to the script. I also cleaned up some of the code along the way. To use it, pass the HTML files as arguments on the command line.
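
A minimal sketch of such a loop, under a few assumptions, could look like the following. The helper names (txt_name, parsed_name, html_to_text, run_parser) are made up for illustration. The parser step assumes that ./lexparser.csh sits in the current directory and writes its parse to standard output, so the script captures that output into a "parsed..." file instead of going through the script command used in the question; if the parser is not found or not executable, that step is skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use HTML::FormatText;
use HTML::TreeBuilder;

# Assumption: the parser wrapper from the question lives in the current directory.
my $parser = './lexparser.csh';

# Every command-line argument is treated as one HTML file to convert.
for my $html_file (@ARGV) {
    my $txt_file = txt_name($html_file);
    if ($txt_file eq $html_file) {
        warn "Skipping $html_file: not a .htm/.html file\n";
        next;
    }

    # Convert first, so a failed conversion never leaves an empty output file.
    my $text = html_to_text($html_file);
    open my $out, '>', $txt_file or die "Cannot write $txt_file: $!";
    print {$out} $text;
    close $out or die "Cannot close $txt_file: $!";

    # Optional second step: only attempted if the parser script is executable.
    run_parser($parser, $txt_file) if -x $parser;
}

# Hypothetical helper: "chpt15Intro.htm" -> "chpt15Intro.txt".
sub txt_name {
    my ($html_file) = @_;
    (my $txt_file = $html_file) =~ s/\.html?$/.txt/i;
    return $txt_file;
}

# Hypothetical helper: "dir/ch15Intro.txt" -> "dir/parsedch15Intro.txt".
sub parsed_name {
    my ($txt_file) = @_;
    my ($name, $dir) = fileparse($txt_file);
    return $dir . "parsed$name";
}

# Hypothetical helper: format one HTML file as plain text, with the same
# margins and leading-whitespace cleanup as the code in the question.
sub html_to_text {
    my ($html_file) = @_;
    -r $html_file or die "Cannot read $html_file\n";
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($html_file);
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 1000);
    my $text = $formatter->format($tree);
    $tree->delete;              # free the parse tree before the next file
    $text =~ s/^\s+//mg;
    return $text;
}

# Hypothetical helper: run the parser on one text file and save whatever it
# prints to standard output. The list form of open bypasses the shell.
sub run_parser {
    my ($cmd, $txt_file) = @_;
    my $parsed_file = parsed_name($txt_file);
    open my $in,  '-|', $cmd, $txt_file or die "Cannot run $cmd: $!";
    open my $out, '>',  $parsed_file    or die "Cannot write $parsed_file: $!";
    print {$out} $_ while <$in>;
    close $out or die "Cannot close $parsed_file: $!";
    close $in  or warn "$cmd exited with status $? for $txt_file\n";
    return $parsed_file;
}
```

Assuming the script is saved as htm2txt.pl (a made-up name), running perl htm2txt.pl *.htm in the directory lets the shell expand *.htm into the list of files, so none of them has to be typed by hand.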