It involves rewriting .htm to .txt (output file) then using a parser (stanford grammar

Question

Editorial Team

Asked: May 22, 2026 (2026-05-22T21:17:51+00:00)

It involves rewriting .htm to .txt (output file), then using a parser (stanford grammar parser) on the result (output file), for all the files in a directory.

MY QUESTION: I would like to get all the files in the directory, without doing it manually, and find a way to run the parser, without having to type it into the Terminal for each file.

Here is my code:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::FormatText;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new->parse_file("chpt15Intro.htm");

use HTML::FormatText;

my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 1000);
   #print $formatter->format($tree); is replaced by push
push (my @files, $formatter->format($tree));
foreach my $files (@files) {
    $files =~ s/^\s+//mg;
    open MYFILE, ">ch15Intro.txt"; 
    select MYFILE; 
    print $files;
}

In the Terminal, after getting the html file converted, I write:

script parsedch15Intro.txt ./lexparser.csh ch15Intro.txt

to save the output of the parser. This step still needs automation.

I’m a beginner so thanks for any advice.

1 Answer

Editorial Team · Answer 1 · 2026-05-22T21:17:52+00:00

I take it from your question that you want to apply this script to all the HTML files in a certain folder and write out a text version of each.

A simple solution is to replace the hardcoded file names with variables and loop the script over @ARGV, i.e. the arguments to the script, like so:

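#!/usr/bin/perl
use strict;
use warnings;
use HTML::FormatText;
use HTML::TreeBuilder;

for my $file (@ARGV) {
    # Only handle .htm/.html files; $1 is the name without the extension.
    next unless ($file =~ /^(.+)\.html?$/i);
    my $outfile = $1 . ".txt";
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file); # credit to Phil for this one
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 1000);
    my $text = $formatter->format($tree);
    $text =~ s/^\s+//mg; # strip leading whitespace on every line
    open my $fh, '>', $outfile or die "Cannot write $outfile: $!";
    print $fh $text;
    close $fh;
    $tree->delete; # free the parse tree before the next file
}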

As you see, I cleaned up some of it. Use like so:

> script.pl *.htm
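
The parser step from the question can be looped in the same way. Here is a minimal sketch, not a tested solution: it assumes ./lexparser.csh sits in the current directory and prints the parse to standard output, and the names parse_file and parse_all.pl are made up for illustration. It runs the parser on each .txt file given on the command line and saves the result as parsed<name>, mirroring the parsedch15Intro.txt naming in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: run $parser on one text file and save whatever the
# parser prints on standard output to "parsed<name>" in the same directory.
sub parse_file {
    my ($parser, $txt) = @_;
    my ($name, $dir) = fileparse($txt);
    my $out = $dir . "parsed" . $name;

    # List-form pipe open: no shell is involved, so odd file names are safe.
    open my $in, '-|', $parser, $txt or die "Cannot run $parser on $txt: $!";
    open my $fh, '>', $out or die "Cannot write $out: $!";
    print {$fh} $_ while <$in>;
    close $fh or die "Cannot close $out: $!";
    close $in or warn "$parser reported a problem for $txt (status $?)\n";
    return $out;
}

my $parser = './lexparser.csh'; # assumed location, taken from the question

for my $txt (@ARGV) {
    next unless $txt =~ /\.txt$/i;
    next if (fileparse($txt))[0] =~ /^parsed/; # skip output of earlier runs
    my $out = parse_file($parser, $txt);
    print "$txt -> $out\n";
}
```

Use like so, after the conversion step:

> perl parse_all.pl *.txt

Note that this only saves what the parser writes to standard output. The script command from the question records everything shown in the Terminal, so any progress messages printed on standard error will appear on screen here instead of in the saved file.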
